X-ARC

Reliability is built around the model, not chosen with it.

The wall

An AI agent that performs well in a demonstration will often behave differently once it is deployed and left to run on its own. The model has not become worse; the conditions have. The work is now unattended, repeated, and chained across many steps. This is the gap between a capable model and a reliable system, and it is where most deployments fail.

Capability and reliability are not the same axis. Capability is how good the model is at a task in isolation, whereas reliability is whether it produces the same correct result on the two-hundredth unattended run as on the first. In our own production agents, the failures that have cost us were rarely a matter of the model being insufficiently capable; they were instances of the same capable model behaving one way on one run and differently on the next, with nothing changed between them.

That gap has a name we have used for a while: the reliability wall. A capable model is unreliable by default, and the wall is not crossed by selecting a better one. It is crossed by building reliability into the layer the model itself cannot see.

Determinism is not a setting

The control most people reach for first is temperature, on the assumption that setting it to zero makes the model deterministic and returns the same output for a given prompt. That assumption now fails on two counts. On the reasoning models that most agents run on, the control is largely gone: temperature is fixed, ignored, or, in the case of Gemini 3, a setting the vendor explicitly discourages changing from its default. And on the models that still accept it, setting it to zero does not make the output reproducible.

The popular explanation, that floating-point arithmetic is non-associative and parallel GPU reductions reorder operations, describes the substrate rather than the cause: re-running the same matrix multiply on the same hardware yields bitwise-identical results, so the non-associativity alone changes nothing. The cause that actually bites is batch non-invariance. Each request is computed inside whatever batch happens to be on the server at that instant, and the order in which reductions are performed inside normalization, matmul, and attention shifts with the size of that batch, by an amount that is small but sufficient to change a token. Server load, which you neither control nor observe, changes your output.
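The distinction between substrate and cause can be seen in a few lines of plain Python. This is a toy sketch, not a model of any real GPU kernel: the `chunk_size` parameter is our stand-in for the way a kernel might split a reduction differently depending on batch size, and the input values are chosen to exaggerate an effect that is normally confined to the low bits.

```python
def ordered_sum(values):
    """Add floats strictly left to right, one rounding step per addition."""
    total = 0.0
    for v in values:
        total += v
    return total


def chunked_sum(values, chunk_size):
    """Reduce fixed-size chunks first, then add the partial sums.

    chunk_size is a toy proxy for batch-dependent tiling (an assumption
    made for illustration, not how any particular kernel works).
    """
    partials = [
        ordered_sum(values[i:i + chunk_size])
        for i in range(0, len(values), chunk_size)
    ]
    return ordered_sum(partials)


# Non-associativity in its smallest form: same numbers, different grouping.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)

# Values picked so that the grouping visibly changes the answer.
values = [1e16, 1.0, -1e16, 1.0]

# Same inputs, same reduction order: identical on every run.
repeated = {chunked_sum(values, 2) for _ in range(5)}

# Same inputs, different reduction order: different answers.
one_chunk = chunked_sum(values, 4)
two_chunks = chunked_sum(values, 2)
```

Repeating the computation with a fixed chunk size always yields one value, which is the sense in which non-associativity alone changes nothing. Changing the chunk size changes the result, which is the sense in which an order that depends on batch size can change a token.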

In mixture-of-experts models there is a second path to the same place. Experts have a fixed capacity per batch, so the other requests batched alongside yours can change which expert handles your tokens, and therefore what the model returns.
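A toy router makes the mechanism concrete. The sketch below assumes a greedy, first-come assignment in batch order with a hard per-expert capacity; real mixture-of-experts routers differ in gating, capacity factor, and overflow policy, and every name here is hypothetical.

```python
def route_with_capacity(preferences, capacity):
    """Assign each token to its most-preferred expert that still has room.

    preferences: list of (token_id, [expert ids, best first]) in batch order.
    capacity: maximum number of tokens one expert accepts per batch.
    Returns {token_id: expert_id}, with None for a token no expert could take.
    """
    load = {}
    assignment = {}
    for token_id, ranked_experts in preferences:
        assignment[token_id] = None
        for expert in ranked_experts:
            if load.get(expert, 0) < capacity:
                load[expert] = load.get(expert, 0) + 1
                assignment[token_id] = expert
                break
    return assignment


# The same token, with the same preferences, in two different batches.
mine = ("my_token", [0, 1])
quiet_batch = [("other_a", [1, 0]), mine]
busy_batch = [("other_a", [0, 1]), ("other_b", [0, 1]), mine]

quiet = route_with_capacity(quiet_batch, capacity=2)
busy = route_with_capacity(busy_batch, capacity=2)
```

In the quiet batch the token reaches its preferred expert. In the busy batch that expert is already full, so the token is handled by its second choice, although nothing about the token or its request has changed.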

None of this is beyond repair. Batch-invariant kernels now exist, and with them enabled the output becomes bitwise-reproducible, at a measurable cost in throughput. Almost no one enables them, and no hosted frontier API offers reproducibility by default; a seed parameter is best-effort, and determinism is explicitly not guaranteed. In practice, therefore, on every API an agent is actually built on, the model underneath will not reliably repeat itself.
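Because repeatability is not promised, it is something to measure rather than assume. The sketch below counts how many distinct outputs a fixed prompt produces over repeated calls. The `generate` argument is a hypothetical prompt-in, text-out callable that you would wrap around whichever API you use; it is not any vendor's client.

```python
import hashlib
from collections import Counter


def repeatability(generate, prompt, runs=20):
    """Call generate(prompt) `runs` times and summarize how often outputs repeat.

    generate is a hypothetical callable (str -> str) supplied by the caller.
    Returns the number of distinct outputs and the share of runs that
    produced the most common one.
    """
    digests = Counter(
        hashlib.sha256(generate(prompt).encode("utf-8")).hexdigest()
        for _ in range(runs)
    )
    modal_count = digests.most_common(1)[0][1]
    return {
        "distinct_outputs": len(digests),
        "modal_share": modal_count / runs,
    }
```

A result of one distinct output over a small sample does not establish determinism; it only fails to show variance at the load the server happened to be under during the test.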

If reliability has to come from somewhere, and the component underneath will not repeat itself, it cannot come from the model. It has to be built around it.

Why a small wobble becomes a coin flip

A single output that varies is tolerable. An agent, however, is not a single output but a chain of them, each conditioned on the one before it.

The familiar version of the problem is arithmetic: a step that succeeds 95% of the time, repeated ten times in sequence, succeeds roughly 60% of the time, because 0.95 to the tenth power is about 0.599. That figure is an idealization rather than a measurement, since it assumes the steps are independent, and they are not: errors correlate through shared failure modes, models frequently recover mid-task, and failed steps can be retried.
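The idealized arithmetic, and the effect of one of the caveats, can be written out directly. This sketch keeps the independence assumption throughout, and the retry variant additionally assumes that every failed step is detected and that each retry succeeds with the same probability as the first attempt; both are simplifications.

```python
def chain_success(step_success, steps):
    """Probability that every step in an independent chain succeeds."""
    return step_success ** steps


def chain_success_with_retries(step_success, steps, attempts):
    """Same chain, but each step may be tried up to `attempts` times.

    Assumes failures are detected and attempts are independent.
    """
    per_step = 1.0 - (1.0 - step_success) ** attempts
    return per_step ** steps


# Ten steps at 95% each, with no recovery.
baseline = chain_success(0.95, 10)

# The same ten steps when each failed step gets one retry.
with_one_retry = chain_success_with_retries(0.95, 10, attempts=2)
```

Under these assumptions a single detected-and-retried failure per step moves the ten-step chain from roughly three in five to better than nineteen in twenty. The retry only helps if the failure is noticed, which is why verification matters more than the headline per-step rate.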

The more accurate account is also the harder one to design around. Models have become markedly more capable at long tasks, and the length of work a model can complete continues to climb, but capability and reliability do not follow the same curve. A model that can complete an hour of work at even odds is not a model that can be left to run unattended for an hour, because reliability at a given horizon lags capability at that horizon, and reliability is what is actually shipped.

The consequence is the same in either framing: the longer the autonomous chain, the wider the gap between what the model can do and what it does dependably. The response is to shorten the horizon, or to build the chain so that a single bad step cannot carry through to the end.

What we changed

What follows are four moves, all of them running in our stack today. None of them is a better prompt.

Narrow the chain. Scoping a mission so that its horizon is short and its dependent steps few leaves less chain in which error can compound, which makes this the oldest discipline we have and still the highest-leverage one.

Make the loop a state machine, not a vibe. Our per-mission loop runs verification as a phase the agent cannot skip. It has separate stages for research, planning, writing the acceptance criteria, doing the work, and validating against those criteria. The criteria are written before any work lands, so the agent does not get to grade its run against its own definition of done.
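One way to make the phases unskippable is to let code, rather than the agent, own the transitions. The sketch below is a minimal illustration under our own naming; `Phase`, `Mission`, and their methods are hypothetical and are not the production loop.

```python
from enum import Enum, auto


class Phase(Enum):
    RESEARCH = auto()
    PLAN = auto()
    CRITERIA = auto()
    WORK = auto()
    VALIDATE = auto()
    DONE = auto()


# The only legal transitions: each phase has exactly one successor.
_NEXT = {
    Phase.RESEARCH: Phase.PLAN,
    Phase.PLAN: Phase.CRITERIA,
    Phase.CRITERIA: Phase.WORK,
    Phase.WORK: Phase.VALIDATE,
    Phase.VALIDATE: Phase.DONE,
}


class Mission:
    def __init__(self):
        self.phase = Phase.RESEARCH
        self.criteria = None
        self.validated = False

    def set_criteria(self, criteria):
        # Criteria can only be written before any work has started.
        if self.phase is not Phase.CRITERIA:
            raise RuntimeError("criteria can only be written in the CRITERIA phase")
        self.criteria = tuple(criteria)

    def record_validation(self, passed):
        if self.phase is not Phase.VALIDATE:
            raise RuntimeError("validation can only be recorded in the VALIDATE phase")
        self.validated = bool(passed)

    def advance(self):
        if self.phase is Phase.DONE:
            raise RuntimeError("mission is already done")
        if self.phase is Phase.CRITERIA and not self.criteria:
            raise RuntimeError("cannot start work without acceptance criteria")
        if self.phase is Phase.VALIDATE and not self.validated:
            raise RuntimeError("cannot finish without a passing validation")
        self.phase = _NEXT[self.phase]
        return self.phase
```

The design choice is that the agent can request a transition but cannot define one: there is no path from WORK to DONE that bypasses VALIDATE, and no way to rewrite the criteria once work has begun.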

Verify in a fresh context. A model judging its own output carries the bias that produced it, and when the worker and the judge share a failure mode, self-review adds confidence without adding information. The judge is therefore a separate pass with a clean context that sees the input, the output, and the criteria, and nothing else. We hold our own systems to the same rule: our memory layer does not trust itself. It re-checks what it believes against the live source before updating, and when the agreement is weak it defers to a human rather than rewriting silently.
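The constraint that the judge sees three things and nothing else can be enforced by the shape of the data passed to it. The following is a sketch under stated assumptions: `call_model` is a hypothetical single-turn callable (prompt in, text out) with no shared history, and the PASS/FAIL reply format is an illustrative convention of ours.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class JudgeRequest:
    """Everything the judge is allowed to see. The worker's reasoning,
    scratchpad, and conversation history have no field here by design."""
    task_input: str
    output: str
    criteria: Sequence[str]


def build_judge_prompt(req: JudgeRequest) -> str:
    lines = [
        "You are reviewing work you did not produce.",
        "",
        "INPUT:",
        req.task_input,
        "",
        "OUTPUT:",
        req.output,
        "",
        "CRITERIA:",
    ]
    lines += [f"{i}. {c}" for i, c in enumerate(req.criteria, start=1)]
    lines += ["", "Answer PASS or FAIL for each criterion, one per line."]
    return "\n".join(lines)


def judge(req: JudgeRequest, call_model: Callable[[str], str]) -> bool:
    """Return True only if every criterion received an explicit PASS.

    call_model is hypothetical: a fresh, single-turn call with no memory.
    A missing or malformed verdict counts as a failure.
    """
    reply = call_model(build_judge_prompt(req))
    verdicts = [ln.strip().upper() for ln in reply.splitlines() if ln.strip()]
    if len(verdicts) != len(req.criteria):
        return False
    return all(v.endswith("PASS") for v in verdicts)
```

The judge fails closed: a reply with too few verdicts, or one it cannot parse as a pass, is treated as a rejection rather than given the benefit of the doubt.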

Make failure recoverable, not catastrophic. Determinism cannot be guaranteed, but a bad run can be made survivable. Our memory heartbeat is idempotent at the entry. When the API went unreachable for roughly seven hours, the heartbeat failed eight cycles in succession, lost no state, and returned to current on the first cycle after the connection was restored, because the work of each failed tick was simply re-attempted on the next one. Per-task cost ceilings ensure that a runaway loop is capped rather than left to run up an open-ended bill.
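Both ideas are small in code. The sketch below is illustrative rather than a description of our implementation: `Store`, `heartbeat`, `fetch_pending`, and `CostCeiling` are hypothetical names, and we assume that idempotence is achieved by keying each write on an entry id so that replaying the same write leaves the same state.

```python
class Store:
    """In-memory stand-in for a memory layer, keyed by entry id (assumed)."""

    def __init__(self):
        self.entries = {}

    def upsert(self, entry_id, value):
        # Writing the same id and value twice leaves the same state.
        self.entries[entry_id] = value


def heartbeat(store, fetch_pending):
    """Run one tick. Returns True if the tick completed, False if it failed.

    fetch_pending is a hypothetical callable returning [(entry_id, value)]
    or raising ConnectionError when the upstream API is unreachable.
    """
    try:
        pending = fetch_pending()
    except ConnectionError:
        # Nothing was written, so nothing is lost; the next tick retries.
        return False
    for entry_id, value in pending:
        store.upsert(entry_id, value)
    return True


class CostCeiling:
    """Hard per-task budget: refuses a charge that would exceed the limit."""

    def __init__(self, limit):
        self.limit = limit
        self.spent = 0.0

    def charge(self, amount):
        if self.spent + amount > self.limit:
            raise RuntimeError("per-task cost ceiling reached")
        self.spent += amount
```

A failed tick changes nothing, and a repeated successful tick changes nothing further, so the number of failures in a row does not affect the state the system recovers to. The ceiling is checked before the spend rather than after, so a runaway loop stops at the limit instead of one call past it.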

The line

The common response is to wait for the next model to make the agent reliable, but a better model raises the ceiling without removing the variance. On every hosted API an agent runs on, the model underneath remains non-deterministic in practice, down to which batch a given request landed in.

The reliability wall is not a model you are waiting on. It is a layer you have not built yet.

Contact

If something on this page is relevant to work you are running, write to us. The form is on the landing page. We reply within two working days.
