Field Note · Technical report

Building effective AI agents

AGENT ENGINEERING 2026·06·26

The gap

A language model that performs well in a demonstration will often behave differently once it is deployed and left to run on its own, and the change is not a loss of capability. The conditions changed. A demonstration is one attended run on a short chain of steps, while production is many unattended runs on long chains, and across that shift the property that decides whether the system works is no longer how good the model is in isolation but whether the surrounding system returns the same correct result on the two-hundredth run as on the first. Capability and reliability are different axes, and it is reliability that ships.

Reliability of this kind cannot be obtained from the model, because the model is where the variance originates. A larger model raises the ceiling on the tasks an agent can attempt but leaves the run-to-run variance roughly where it was, so reliability has to be engineered into the layer the model cannot see: the code and the prompt that surround it. The design question is how that layer should be divided between the two.

The principle

The division we have converged on, across the agents we run in production, expresses as much of the system's behavior as possible in the language the model reads, and reserves code for the small set of outcomes that can tolerate no variance at all. We call the line between the two the prompt–code boundary, and where it falls is a conclusion rather than a preference, following from three measured properties of capable models. First, general methods that scale with computation outperform handcrafted ones by a large margin, so a prompt is a control surface that improves on its own as the model improves, while control flow written in code is handcrafted knowledge the next model outgrows. Second, in a non-deterministic system, code is the higher-variance surface: a small change to control flow can cascade into large and unpredictable changes in behavior, whereas a change in prose is a change to a single inspectable artifact. This inverts the usual intuition that code is the safe, deterministic choice. Third, attention is a finite budget, so the surface on which behavior is expressed must itself be economical. The full derivation, traced through every subsystem, is set out in the technical report.

Prose is the steering wheel, and code is the chassis.

The boundary

Delegating behavior to the prompt is frequently read as granting the model unbounded discretion, and the code side of the boundary is what forecloses that reading. Reserving code does not mean removing it; it means placing it in exactly one position, around the outcomes that can tolerate no variance. Every decision the system makes is partitioned by a single question: how much variance can the outcome tolerate? The outcomes that can tolerate none (the irreversible, the financial, the cross-tenant, and those that can fabricate trust) are guaranteed in code and fail closed, and the model is not trusted to produce them. Everything advisory, interpretive, or recoverable is delegated to the model and steered by prose. The result is bounded autonomy rather than free will, and drawing the boundary explicitly, then defending it, is the whole of the framework.

| The model decides | Code guarantees |
| --- | --- |
| Which tools to call, and in what order | A scoped identifier is real-shaped before it touches data |
| How to interpret the data and what to recommend | A citation renders only when it maps to a real retrieved result |
| What to assume when the user is silent | Identity is bound to the authenticated session, denied on failure |
| How to recover from an error and how to phrase the answer | Every state-changing action passes a human gate and a schema check |

Every subsystem of a production agent is this same boundary drawn in a different place, which is why getting it right is most of the work.
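The partition question can be sketched in a few lines. This is a minimal illustration rather than the report's implementation: the `Outcome` fields mirror the four zero-variance categories named above, while `owner_of`, `require_scoped_id`, and the `tnt_` identifier format are hypothetical names chosen for the example.

```python
import re
from dataclasses import dataclass
from enum import Enum


class Owner(Enum):
    MODEL = "model"  # advisory, interpretive, or recoverable: steered by prose
    CODE = "code"    # zero-variance: guaranteed in code, fails closed


@dataclass(frozen=True)
class Outcome:
    name: str
    irreversible: bool = False
    financial: bool = False
    cross_tenant: bool = False
    fabricates_trust: bool = False


def owner_of(outcome: Outcome) -> Owner:
    """Apply the single partition question: can this outcome tolerate variance?"""
    zero_variance = (
        outcome.irreversible
        or outcome.financial
        or outcome.cross_tenant
        or outcome.fabricates_trust
    )
    return Owner.CODE if zero_variance else Owner.MODEL


# Hypothetical identifier format, assumed only for this sketch.
_TENANT_ID = re.compile(r"tnt_[0-9a-f]{12}")


def require_scoped_id(raw: object) -> str:
    """Fail closed: anything that is not exactly a well-formed ID is rejected."""
    if not isinstance(raw, str) or _TENANT_ID.fullmatch(raw) is None:
        raise PermissionError("malformed scoped identifier")
    return raw
```

Under these assumptions, `owner_of(Outcome("issue refund", financial=True))` returns `Owner.CODE`, while an outcome with no flags set stays with the model. The guard raises instead of returning a default, so a malformed identifier never reaches the data layer.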

The harness

The most consequential structural decision concerns the loop that drives a turn. A turn is a loop in which the system assembles a prompt, the model responds, the tools it requested are executed, their results are returned to it, and the loop repeats until the model signals that it is done. The reliable arrangement separates loop ownership from decision ownership, so the surrounding system owns the turn, assembling context, dispatching, persisting, and enforcing limits, while the model owns the loop, deciding which tool to call and when to stop. The only loop the application itself runs iterates the model's output events; it iterates what the model produces and does not direct what the model decides.
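A minimal sketch of that separation follows, assuming a hypothetical model interface in which the model is a generator that yields events and receives each tool result back through `send()`. Real SDK streams differ; `run_turn`, `ToolCall`, `Done`, and `scripted_model` are illustrative names, not an existing API.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Generator, List, Tuple, Union


@dataclass
class ToolCall:
    name: str
    args: Dict[str, Any]


@dataclass
class Done:
    answer: str


Event = Union[ToolCall, Done]
# Assumed interface: yields events, receives tool results via send().
ModelStream = Generator[Event, Any, None]


def run_turn(
    model: Callable[[List[Tuple]], ModelStream],
    tools: Dict[str, Callable[..., Any]],
    user_message: str,
    max_tool_calls: int = 8,
) -> Tuple[str, List[Tuple]]:
    """The runner owns the turn: context, dispatch, persistence, limits."""
    transcript: List[Tuple] = [("user", user_message)]  # assemble context
    stream = model(transcript)
    tool_calls = 0
    result: Any = None
    while True:
        try:
            # Iterate what the model produces; never choose its next step.
            event = stream.send(result)
        except StopIteration:
            raise RuntimeError("model stream ended without a Done event") from None
        if isinstance(event, Done):  # the model decides when to stop
            transcript.append(("assistant", event.answer))
            return event.answer, transcript
        tool_calls += 1
        if tool_calls > max_tool_calls:  # the runner enforces the limit
            stream.close()
            raise RuntimeError("tool-call limit exceeded")
        result = tools[event.name](**event.args)  # dispatch
        transcript.append(("tool", event.name, result))  # persist


def scripted_model(transcript: List[Tuple]) -> ModelStream:
    """Stand-in for a real model: picks a tool, reads the result, then stops."""
    total = yield ToolCall("add", {"a": 2, "b": 3})
    yield Done(f"The sum is {total}.")
```

Calling `run_turn(scripted_model, {"add": lambda a, b: a + b}, "add 2 and 3")` returns the answer `"The sum is 5."` together with the persisted transcript. Nothing in `run_turn` inspects the conversation to pick the next tool; it only dispatches what the stream asks for and enforces the call limit.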

A second property follows: there is exactly one seam at which the model meets the world. Every tool call routes through a single proxy, and that proxy, rather than checks scattered through the codebase, is where permission, human approval, and response-size limits are enforced, so the gate holds independently of the model's cooperation. A denied call returns a structured error indistinguishable, from the model's vantage, from any other failure, which is the difference between a guarantee and a request.
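The single seam might look like the sketch below. It is illustrative only: `ToolProxy`, its parameters, and the error codes are assumptions made for the example, and returning a distinct `response_too_large` code for oversized results is one possible reading of the size limit rather than something the text specifies.

```python
import json
from typing import Any, Callable, Dict, Optional, Set


def deny_all(name: str, args: dict) -> bool:
    """Default approver: with no human decision available, fail closed."""
    return False


class ToolProxy:
    """The one place where permission, approval, and size limits are enforced."""

    def __init__(
        self,
        tools: Dict[str, Callable[..., Any]],
        allowed: Set[str],
        needs_approval: Optional[Set[str]] = None,
        approve: Callable[[str, dict], bool] = deny_all,
        max_response_chars: int = 2000,
    ) -> None:
        self.tools = tools
        self.allowed = allowed
        self.needs_approval = needs_approval or set()
        self.approve = approve
        self.max_response_chars = max_response_chars

    @staticmethod
    def _failed(name: str) -> dict:
        # Denials and ordinary failures share one shape on purpose.
        return {"ok": False, "error": "tool_call_failed", "tool": name}

    def call(self, name: str, args: dict) -> dict:
        if name not in self.tools or name not in self.allowed:
            return self._failed(name)  # permission check
        if name in self.needs_approval and not self.approve(name, args):
            return self._failed(name)  # human approval check
        try:
            result = self.tools[name](**args)
        except Exception:
            return self._failed(name)  # structured error, not a stack trace
        if len(json.dumps(result, default=str)) > self.max_response_chars:
            return {"ok": False, "error": "response_too_large", "tool": name}
        return {"ok": True, "result": result}
```

Because a denied call and a crashed call return the same payload, the model has nothing to argue with or route around; the check sits in `call` and does not depend on what the prompt says.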

Figure. The harness: a stateless web tier, the runner that owns the turn, the model that owns the loop, and a single tool proxy that is the one place the model touches the world.

The consequences

The rest of the system follows from the same partition. Tools are contracts whose description and schema are part of the prompt rather than inert plumbing, returning structured errors the model can act on rather than stack traces, and standardized on the Model Context Protocol so that one tool serves every model and the steering travels with the tool to hosts that never read the system prompt. Retrieval and memory stop being prefixes stuffed into every request and become tools the model invokes on demand, so only what is small and always relevant is injected and everything large or situational is fetched, with the citations the model proposes verified in code before they reach the user.
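The citation check at the end of that paragraph can be sketched as follows. The `[cite:ID]` marker syntax, the link rendering, and the `render_citations` name are all assumptions made for illustration; the only property taken from the text is that a citation renders only when it maps to a result that was actually retrieved.

```python
import re
from typing import Dict

# Hypothetical marker syntax the model is asked to emit, e.g. "[cite:doc1]".
_CITATION = re.compile(r"\s*\[cite:([A-Za-z0-9_-]+)\]")


def render_citations(answer: str, retrieved: Dict[str, str]) -> str:
    """Render markers that map to a retrieved result; drop every other marker.

    `retrieved` maps result IDs returned by the retrieval tool to their URLs.
    """

    def replace(match: "re.Match[str]") -> str:
        doc_id = match.group(1)
        url = retrieved.get(doc_id)
        if url is None:
            return ""  # fail closed: an unverified citation never renders
        return f" [{doc_id}]({url})"

    return _CITATION.sub(replace, answer)
```

For example, with `retrieved = {"doc1": "https://example.com/doc1"}`, the text `"Rates rose [cite:doc1] and fell [cite:ghost]."` becomes `"Rates rose [doc1](https://example.com/doc1) and fell."`. The invented `ghost` source is removed by code, regardless of how confidently the model proposed it.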

Consequential actions pass a human gate that pauses the tool call rather than the agent, holding the side effect behind a promise that resolves on a human decision and auto-denies on a timeout, so a model that ignores its instructions still cannot execute a gated write. Because a turn can run for minutes, the interface streams the model's reasoning and every tool call as it happens, and the prompt is split into a stable cacheable prefix and a volatile tail so the cache, rather than the budget, does most of the paying. Observability traces every turn while treating the reported cost as an estimate to reconcile against the real bill, since a widely used tool can over-report it several-fold. Evaluation is layered, from deterministic contracts that can be unit-tested, through a model-graded judge that is informational rather than blocking, to live probes against the real upstreams. Unfinished work ships dark behind flags that default to off, and delegation to multiple agents is reserved for breadth-first reading, where the reading fans out in parallel while the writes that cannot tolerate conflict stay on a single thread.
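The human gate described at the start of that paragraph can be sketched with an `asyncio` future standing in for the promise. `HumanGate`, `decide`, and `run` are hypothetical names, and the error payload is an assumed shape; the sketch pauses only the gated call, resolves on a human decision, and auto-denies on timeout.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict


class HumanGate:
    """Holds a side effect behind a future that a human decision resolves."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s
        self._pending: Dict[str, "asyncio.Future[bool]"] = {}

    def decide(self, call_id: str, approved: bool) -> None:
        """Called by the approval UI; resolves the pending future if any."""
        future = self._pending.get(call_id)
        if future is not None and not future.done():
            future.set_result(approved)

    async def run(
        self, call_id: str, action: Callable[[], Awaitable[Any]]
    ) -> dict:
        future = asyncio.get_running_loop().create_future()
        self._pending[call_id] = future
        try:
            # Only this tool call waits; other tasks keep running.
            approved = await asyncio.wait_for(future, timeout=self.timeout_s)
        except asyncio.TimeoutError:
            approved = False  # auto-deny on timeout
        finally:
            self._pending.pop(call_id, None)
        if not approved:
            return {"ok": False, "error": "tool_call_failed"}
        # The side effect runs only after an explicit approval.
        return {"ok": True, "result": await action()}
```

The write is passed in as a callable and is never invoked on the denial or timeout paths, so the guarantee does not depend on the model following its instructions.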

Each of these is the boundary drawn once more, which is why the difficulty of building an agent is concentrated in drawing it well rather than in any single pattern. The full technical report derives the boundary from first principles and traces it through every subsystem, with the diagrams and the production evidence behind each.

Contact

X-Arc is an applied AI research lab. The patterns analyzed here are drawn from the agent systems we run in production, recorded in vendor-neutral form so that they transfer beyond any single stack, and none of the specifics are proprietary. If the work bears on a system you are building, we would be glad to compare notes.
