Aaron Diemel

Founder, Hertzfelt Labs

I spent 15 years as an audio engineer—signal flow, systems thinking, shipping under pressure. In 2022, I made a bet: reskill entirely with AI as my copilot. No bootcamps, no traditional path. Just me and the models, learning to build together.

It worked. The mental models from audio engineering—signal chains, iterative refinement, translating vision into execution—transferred directly to AI systems architecture. I've scaled my capabilities at the same rate the technology has scaled, staying on the frontier while shipping production systems.

In 2023, I founded Hertzfelt Labs as a consulting practice and applied R&D lab. I work with founders and engineering teams on agent architectures, adversarial evaluation, and AI systems that actually execute. This site is built with the same tools and workflows I bring to client work.

What "Frontier" Actually Means

AI is frontier technology in a practical sense. Not because it's new to talk about, but because many of its most important behaviors only emerge in deployment.

Operating on the frontier means:

  • Model capabilities shift faster than frameworks.
  • The same system can behave differently across model versions.
  • Reliability is emergent, not guaranteed.
  • Evaluation is required because surprise is normal.
  • Tooling, autonomy, and guardrails must be designed together.

As models gain longer context, stronger tool use, and limited forms of self-improvement under constraints, the job shifts. It's no longer just about prompts. It's about designing the environment the model operates in: what it can access, what it can change, how it proves work, and how it fails safely.

A mental model I use:

Imagine giving an agent a clipboard: rules, instructions, and checklists, but no tools.

Now imagine placing that agent in a small workshop with a defined set of tools.

The agent can finally do real work. But the tools also define the risk surface.

Our job is to choose tools that enable progress, without giving the agent a way to pick the lock and leave the workshop.

Capability is environment plus tools. Safety is constraints plus verification.
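The workshop analogy can be sketched in code. Here is a minimal, illustrative tool-permission layer (the `Workshop` class and every name in it are hypothetical, not a real framework): the allowlist is the constraint, and the verifier is how the agent proves its work.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Workshop:
    """Illustrative sketch: an agent may only call registered tools,
    and every result passes through a verifier before it counts."""
    tools: dict[str, Callable[[str], str]] = field(default_factory=dict)
    verifier: Callable[[str, str], bool] = lambda tool, result: True

    def register(self, name: str, fn: Callable[[str], str]) -> None:
        self.tools[name] = fn

    def call(self, name: str, arg: str) -> str:
        if name not in self.tools:           # constraint: the allowlist is the workshop's walls
            raise PermissionError(f"tool '{name}' is outside the workshop")
        result = self.tools[name](arg)
        if not self.verifier(name, result):  # verification: the agent must prove its work
            raise ValueError(f"tool '{name}' failed verification")
        return result

# Toy usage: one registered tool, one simple output check.
shop = Workshop(verifier=lambda tool, result: len(result) < 100)
shop.register("echo", lambda s: s.upper())
print(shop.call("echo", "hello"))  # → HELLO
```

The point of the sketch is the shape, not the implementation: capability comes from what gets registered, and safety comes from what the `call` path refuses to do.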

AI UI/UX: Making Systems Legible to Humans

One area I'm especially passionate about is AI UI/UX, not as visual design alone, but as a core systems problem. As AI systems become more agentic, the interface is no longer just a chat box. It's the primary coordination surface between humans and machines. Flow state depends on visibility: understanding what the system is doing, why it's doing it, and what will happen next.

I spend a lot of time thinking about:

  • Agent interaction flows
  • Streaming interfaces and partial results
  • Tool-call visualization and live execution state
  • Reasoning summaries and decision traces
  • Orchestrator/sub-agent visibility in multi-agent systems

This becomes critical in complex setups, where an orchestrator agent spawns sub-agents, each with their own tools, APIs, MCP servers, or even isolated virtual environments. Without the right UI abstractions, these systems feel opaque and fragile. With the right ones, they feel intuitive and trustworthy.

Good AI UI turns backend complexity into legible motion.

Great AI UI makes the system feel collaborative instead of mysterious.

When the interface, orchestration, and infrastructure move in lockstep, the result is something rare: a human-AI flow state where progress feels natural, inspectable, and reliable.

Principles: How I Work

I value new tools, but I optimize for systems that can be shipped, observed, and maintained.

Systems thinking first

I map the full loop (inputs, transformations, state, outputs, and feedback) before choosing an architecture.

Guardrails are part of the design

Tool access, permissions, and blast radius are first-class decisions, not afterthoughts.

Evidence beats confidence

I rely on evals, logs, traces, and concrete examples. Improvement should be measurable and repeatable.

Make failure productive

Model failures are signals. I classify them, design against them, and turn them into tests.

Human oversight where it matters

Not everything needs approval, but high-impact actions do. I design explicit review and escalation paths.

Build the harness, not just the prompt

Prompts matter. Harnesses matter more: tools, schemas, memory strategy, evaluation, monitoring, and rollback.

Clarity as an engineering practice

Clear writing, explicit constraints, and visible tradeoffs lead to systems that are easier to operate and evolve.

What I Build

Most of my work clusters into:

  • Agentic workflows that perform real tasks with tool use
  • Multi-step pipelines that turn messy inputs into structured outputs
  • Evaluation and red-teaming harnesses for reliability and safety
  • Durable state and memory patterns for long-running systems
  • Interfaces that make complex AI behavior understandable in real time

Collaboration Style

I'm most useful when teams are operating in new territory.

I tend to:

  • Align quickly on outcomes and constraints
  • Propose a thin, testable architecture spine
  • Ship an initial harness early
  • Instrument heavily
  • Iterate based on observed behavior, not assumptions

I work comfortably across product, engineering, and leadership, especially when the goal is to translate intent into systems that actually work.

Applied R&D: Safety, Model Behavior, and Frontier Reliability

A major part of my work is applied R&D: understanding how model behavior changes as you add tools, autonomy, memory, and optimization loops, and how to ship systems that remain reliable and safe as those capabilities expand.

AI Safety, Alignment, and Model Psychology

I spend a meaningful amount of time on adversarial evaluation and safety hardening, including red-teaming workflows and alignment stress tests. I'm interested in the gap between what we think the model will do and what it actually does when placed in a real system with incentives, constraints, and access.

Some of the most important behaviors are emergent:

  • Models develop "policy-shaped" responses that look stable, until you change the interface, tool set, or context.
  • Refusal and compliance can be influenced by framing, authority cues, task decomposition, or long-horizon conversational state.
  • Multi-agent systems introduce coordination effects and failure modes that don't exist in single-turn chat.

I've participated in structured safety work that includes testing Constitutional-style safety layers and classifier-based guardrails, and I've also done red-team style evaluation against modern production models (including OpenAI's o3-mini). I focus on identifying where systems fail: not just obvious policy violations, but subtle reliability issues, silent tool misuse, miscalibrated certainty, partial compliance, or "helpful" behavior that bypasses the spirit of a constraint.

The important point: frontier systems require us to treat alignment as continuous engineering, not a static property.

Fine-Tuning, Training, and What It Means to "Own the Weights"

I also work hands-on with model fine-tuning and training workflows, especially for domain-specific use cases where retrieval alone isn't enough.

Owning the weights matters because it changes what you can control:

  • Behavior becomes portable. You're not only prompt-dependent; you're encoding capability into the model itself.
  • Latency and cost become design variables, not vendor constraints.
  • You can design repeatable evaluation loops across checkpoints and versions, rather than hoping a hosted model update doesn't break you.
  • You can harden and specialize a model for a narrow domain in ways that are difficult to achieve reliably with prompting alone.

I'm especially interested in the boundary between:

  • what should live in retrieval/context, and
  • what should be learned (fine-tuned) as stable capability.

In practice, "owning the weights" isn't just a philosophical statement. It's an operational posture: you can run controlled experiments, create regression tests, measure drift, and iterate with versioned artifacts the same way you would with any production system.
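That operational posture can be as simple as a pinned regression suite that runs against every checkpoint. A schematic sketch (the golden set, the dict-backed "checkpoints," and the gate are all stand-ins, not a real training stack):

```python
# Schematic: gate a new "checkpoint" on a pinned golden set before promoting it.
GOLDEN = [("2+2", "4"), ("capital of France", "Paris")]

def score(model, golden):
    """Fraction of golden cases the model answers exactly."""
    return sum(model(q) == a for q, a in golden) / len(golden)

# Stand-ins for loaded checkpoints; a real system would load versioned weights.
ckpt_v1 = {"2+2": "4", "capital of France": "paris"}.get  # miscapitalizes one answer
ckpt_v2 = {"2+2": "4", "capital of France": "Paris"}.get

baseline, candidate = score(ckpt_v1, GOLDEN), score(ckpt_v2, GOLDEN)
# The regression gate: a candidate never ships if it scores below the baseline.
assert candidate >= baseline, "regression: new checkpoint scores below baseline"
print(f"v1={baseline:.2f}  v2={candidate:.2f}")  # → v1=0.50  v2=1.00
```

Swap the dicts for versioned model artifacts and the golden set for a real eval corpus, and this is the same loop: controlled comparison, explicit gate, repeatable measurement.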

Synthetic Data and Document-Scale Evaluation

My entry into applied AI started in multimodal creation, using models for brand and client content pipelines across text, image, and video. That work trained a useful instinct: models are not just generators; they're manufacturers.

That perspective has become central to evaluation.

At scale, one of the hardest problems is building good eval sets, especially for multimodal systems. And one of the most powerful tools is synthetic data generation, not as a replacement for real data, but as a way to generate controlled variation: edge cases, formatting diversity, layout shifts, and adversarial patterns.

I've leaned into this heavily for VLM work:

  • Generating synthetic document images at scale (receipts, forms, invoices, dense tables, multi-column layouts)
  • Pairing them with structured ground truth
  • Using those corpora to run evals on fine-tuned vision-language models and validate generalization behavior

The insight is simple: if you can't evaluate reliably, you can't improve reliably. Synthetic documents let you widen the test surface area faster than manual collection ever could, especially for layout-driven reasoning tasks where small visual changes can cause large accuracy swings.
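A toy version of that loop, with a deterministic parser standing in for a fine-tuned VLM (the receipt generator, the ground-truth schema, and the extractor are all illustrative): generate paired synthetic documents and labels, then score the extractor against them.

```python
import random

def make_receipt(seed: int) -> tuple[str, dict]:
    """Generate one synthetic receipt and its structured ground truth."""
    rng = random.Random(seed)
    items = [(f"item{i}", rng.randint(1, 50)) for i in range(rng.randint(1, 4))]
    total = sum(price for _, price in items)
    text = "\n".join(f"{name}  ${price}" for name, price in items)
    text += f"\nTOTAL  ${total}"
    return text, {"total": total}

def extract_total(text: str) -> dict:
    """Stand-in for a fine-tuned VLM: parse the TOTAL line."""
    for line in text.splitlines():
        if line.startswith("TOTAL"):
            return {"total": int(line.split("$")[1])}
    return {"total": None}

# Because generation is seeded, the eval corpus is reproducible run to run.
corpus = [make_receipt(seed) for seed in range(100)]
correct = sum(extract_total(doc) == truth for doc, truth in corpus)
print(f"accuracy: {correct / len(corpus):.2f}")
```

The real version replaces the string template with rendered document images and the parser with a model, but the structure is identical: seeded generation gives you paired inputs and labels, so you can widen the test surface (layouts, densities, adversarial noise) just by changing the generator.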

Working in Partnership

A defining part of how I work, and how I think about AI systems, comes from my collaboration with my wife, Rachel.

Rachel is an AI Architect at Fieldguide, building next-generation agentic systems for professional services. Before that, she spent nearly three years at Amazon Pharmacy as a Sr. Technical Program Manager, where she owned the ML and AI roadmap across 12 engineering teams—shipping six custom LLMs and eight AI applications in production. Her work included building a RAG-based AI Playground for document interpretation, developing domain-specific LLMs for insurance validation, and defining the ML/AI Ops strategy for Amazon Health Services. She knows what it takes to move AI from prototype to enterprise scale.

Where I tend to live on the frontier, probing new capabilities, testing limits, and building early systems, Rachel brings a strong grounding instinct: how something will actually land inside an organization, how it will be governed, how it will scale, and how humans will interact with it over time. That tension is productive.

We think about AI differently, and that's the point.

Our collaboration often looks like this:

  • I explore what's technically possible and architect the system spine.
  • Rachel pressure-tests it against real-world constraints: ownership, process, incentives, risk, and adoption.
  • Together, we refine the system until it's not just impressive, but usable, inspectable, and durable.

Because we're both fluent in AI, but from different angles, we end up stress-testing ideas early. Weak assumptions surface fast. Overly academic designs get grounded. Overly operational designs get pushed forward. It's a constant feedback loop.

That dynamic has shaped how I build AI systems more broadly. It's reinforced a belief I hold strongly: AI doesn't succeed in isolation. It succeeds when technical capability, human workflow, and organizational reality are designed together.

Many of the systems I build, especially agentic workflows, implicitly reflect that partnership. They assume:

  • Humans remain in the loop where judgment matters.
  • Interfaces must support trust and comprehension.
  • Autonomy must be earned through evidence, not granted by default.

In practice, working alongside Rachel has made me a better systems designer. It keeps my work anchored in reality while still allowing space to explore what's next.

If you've made it this far, thanks for reading. Keep accelerating.