Closing the verification gap with LLM judges and a pre-test phase.
The verification gap
The first thing an instance learns to skip is the test pass. Time pressure, plausible-looking output, an obvious next task: the test gets deferred until it is the thing standing between shipped and actually shipped. We made the test step the part of the loop the instance cannot negotiate around.
Two arrangements
Two things had to land at once. The instance has to write the tests before it writes the work, not after. And an independent LLM has to judge the work against those tests in a clean context.
The first arrangement lives inside the runtime. RPTIV (the per-mission loop in CCX) runs Research, Plan, Test, Implement, Validate as five separate phases owned by specialised sub-agents. The Test phase writes the acceptance criteria before any code lands. Skipping it is not a path through the loop.
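To make "skipping it is not a path through the loop" concrete, here is a minimal sketch of a fixed-order phase runner. It is not the CCX implementation. The names `PHASES`, `Handler`, and `run_mission` are hypothetical, and each sub-agent is reduced to a plain callable for illustration.

```python
from typing import Callable, Dict

# Phase order taken from the text; the identifiers themselves are assumptions.
PHASES = ["research", "plan", "test", "implement", "validate"]

# Hypothetical handler type: receives the mission state so far,
# returns the output of its own phase.
Handler = Callable[[Dict[str, object]], object]


def run_mission(handlers: Dict[str, Handler]) -> Dict[str, object]:
    """Run all five phases in fixed order; refuse to start if one is missing."""
    missing = [p for p in PHASES if p not in handlers]
    if missing:
        raise ValueError(f"missing phase handlers: {missing}")

    state: Dict[str, object] = {}
    for phase in PHASES:
        state[phase] = handlers[phase](state)
        # An empty Test phase stops the mission before Implement can run.
        if phase == "test" and not state["test"]:
            raise RuntimeError("Test phase produced no acceptance criteria")
    return state
```

Because the order lives in the runner and not in the instance's prompt, there is no argument the instance can make that reorders or drops a phase.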
The second arrangement is a separate LLM call with no shared context. The judge sees the input, the output, and the criteria. Nothing else. The instance that produced the work does not get to judge its own work. This is the rule the lab now uses for every quality gate: input plus output, fresh context, definition of good baked in.
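The rule can be sketched as a single function whose only inputs are the three things the judge is allowed to see. This is an illustration under assumptions, not the lab's gate code: `call_llm` is a hypothetical callable standing in for whatever client sends one prompt to a model in a new session, and the JSON reply format is our own choice for the example.

```python
import json
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Verdict:
    passed: bool
    reasons: List[str]


def build_judge_prompt(task_input: str, output: str, criteria: List[str]) -> str:
    """Assemble the judge's entire context: input, output, criteria. Nothing else."""
    numbered = "\n".join(f"{i}. {c}" for i, c in enumerate(criteria, start=1))
    return (
        "You are judging work you did not produce.\n"
        f"INPUT:\n{task_input}\n\n"
        f"OUTPUT:\n{output}\n\n"
        f"CRITERIA:\n{numbered}\n\n"
        'Reply with JSON only: {"passed": true|false, "reasons": ["..."]}'
    )


def judge(
    task_input: str,
    output: str,
    criteria: List[str],
    call_llm: Callable[[str], str],  # hypothetical: one prompt in, one reply out
) -> Verdict:
    raw = call_llm(build_judge_prompt(task_input, output, criteria))
    try:
        data = json.loads(raw)
        # Only a literal JSON true counts as a pass.
        passed = data["passed"] is True
        reasons = [str(r) for r in data.get("reasons", [])]
        return Verdict(passed, reasons)
    except (ValueError, KeyError, TypeError, AttributeError):
        # Fail closed: an unreadable verdict is treated as a rejection.
        return Verdict(False, ["judge reply was not valid JSON"])
```

The signature is the point. There is no parameter through which the producing instance's conversation history or reasoning could reach the judge.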
A hardening cycle
A client deployment entered a hardening cycle with a 34% hallucination rate on its main task. The work that brought it down was not new training. It was not better prompts. It was the loop above, applied to every change for nine days.
XML state-machine prompts replaced ad-hoc system prompts (PR #193). The surface the instance could disagree with shrank. The Fat Tool architecture (PR #205) moved validations that used to live in the instance's head into the tool layer. A wave of LLM-judge gates (#226, #229, #236) routed every output through a clean-context check before the instance could commit.
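As an illustration of the Fat Tool idea only, the sketch below shows a tool that checks its own arguments and not one that trusts the caller to have checked them. Everything in it is hypothetical: `create_ticket`, `ToolResult`, and the priority rule are invented for the example and are not taken from the client deployment.

```python
from dataclasses import dataclass
from typing import Dict, List

ALLOWED_PRIORITIES = {"low", "normal", "high"}  # hypothetical business rule


@dataclass
class ToolResult:
    ok: bool
    errors: List[str]
    data: Dict[str, str]


def create_ticket(title: str, priority: str) -> ToolResult:
    """Hypothetical 'fat' tool: validates its own arguments before acting."""
    errors: List[str] = []
    if not title.strip():
        errors.append("title must not be empty")
    if priority not in ALLOWED_PRIORITIES:
        errors.append(f"priority must be one of {sorted(ALLOWED_PRIORITIES)}")
    if errors:
        # Nothing is created; the caller gets structured reasons to retry with.
        return ToolResult(ok=False, errors=errors, data={})
    return ToolResult(
        ok=True,
        errors=[],
        data={"title": title.strip(), "priority": priority},
    )
```

A rule that lives in the tool runs on every call. A rule that lives in the prompt runs only when the instance remembers it.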
By the end of the nine-day window the hallucination rate had fallen from 34% to ~10%. The work that landed during that window is now the template for every quality cycle the lab runs.
The non-choice
Earlier in the cycle, two approaches were on the table. Deterministic-rules QA. Self-eval via a double-API loop. We picked neither. The final answer was an LLM judge with fresh context and access to the original sources. Deterministic rules do not catch the things the model is wrong about. Self-eval shares the bias that produced the output. A fresh judge avoids both problems: with the sources in hand it can catch what fixed rules miss, and it does not inherit the bias of the instance that wrote the output.
End-to-end
The hardest test class to maintain is end-to-end. The cost of writing E2E tests by hand has always been higher than the cost of an outage. The lab now generates E2E tests with a Playwright plus Gemini 3.1 Pro pass: the instance observes the user flow, drafts the assertions, runs them, and curates the failures. The QA owner on the client team was promoted off the back of the win this pattern produced.
The pattern is the same as the inner loop. Let the model write the test. Let a fresh model judge the test. Ship only when both have signed off.
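The "runs them, and curates the failures" step might look like the sketch below. It is a simplification under stated assumptions: drafted assertions are represented as role and accessible-name pairs (the `DraftedAssertion` type is hypothetical), the model call that drafts them is left out, and `page` is any object exposing `get_by_role(role, name=...)` whose result has `is_visible()`, which is the shape of a Playwright `Page` and `Locator` in the Python API.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass(frozen=True)
class DraftedAssertion:
    """Hypothetical shape for one model-drafted check: an element should be visible."""
    role: str
    name: str


def run_assertions(page: Any, drafted: List[DraftedAssertion]) -> List[DraftedAssertion]:
    """Run each drafted assertion against the page and return the ones that fail."""
    failures: List[DraftedAssertion] = []
    for assertion in drafted:
        locator = page.get_by_role(assertion.role, name=assertion.name)
        if not locator.is_visible():
            failures.append(assertion)
    return failures
```

With Playwright's sync API, `page` would come from `browser.new_page()` inside a `sync_playwright()` block, after `page.goto(...)` has loaded the flow under test. `is_visible()` returns immediately without waiting, so a real suite would more likely use Playwright's auto-waiting `expect(...)` assertions; the plain boolean keeps this sketch easy to run against a stub.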
What does not change
None of this removes the human. The operator owns the agreed target. The runtime is what makes sure the loop runs against that target every time, not against the instance's own definition of good. When the operator changes the target, the Test phase regenerates the criteria. The instance does not get to keep yesterday's success as today's bar.
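One way to keep yesterday's criteria from surviving a changed target is to key them on a fingerprint of the target text. The sketch below assumes the target is a plain string and that `generate` is a hypothetical callable standing in for the Test phase; `CriteriaStore` and `target_fingerprint` are names invented for the example.

```python
import hashlib
from typing import Callable, List, Optional


def target_fingerprint(target: str) -> str:
    """Stable hash of the operator's target text."""
    return hashlib.sha256(target.encode("utf-8")).hexdigest()


class CriteriaStore:
    """Holds acceptance criteria for exactly one target at a time."""

    def __init__(self, generate: Callable[[str], List[str]]) -> None:
        self._generate = generate  # hypothetical stand-in for the Test phase
        self._fingerprint: Optional[str] = None
        self._criteria: List[str] = []

    def criteria_for(self, target: str) -> List[str]:
        fingerprint = target_fingerprint(target)
        if fingerprint != self._fingerprint:
            # The target changed (or this is the first call): regenerate.
            self._criteria = self._generate(target)
            self._fingerprint = fingerprint
        # Return a copy so callers cannot edit the stored criteria in place.
        return list(self._criteria)
```

The instance never holds the criteria directly. It asks the store, and the store answers for the current target only.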
Contact
If something on this page is relevant to work you are running, write to us. The form is on the landing page. We reply within two working days.