Field Note

The weights never move, and the system learns anyway

LEARNING 2026·07·02

The default

An agent in production learns nothing between sessions. Its weights were fixed the day training ended, and inference reads them without ever writing them, so the thousandth session runs on exactly the model the first one did. That is not a limitation waiting on a better release; it is what a deployed model is, by design.

The one place a model does adapt is its context window. Inside a session it picks up your vocabulary, your constraints, your corrections, and the adaptation is real. It is also rented: the window is bounded, it belongs to one session, and when the session ends the adaptation ends with it.

So if a system built on models is to learn at all, the learning has to be constructed outside the model. Something has to capture what a window discovered, persist it beyond the session, and re-inject it where the next window can use it. That construction lives where we have located reliability, craft, and the placement of capability before: in the layer around the model. What follows is the construction as it runs in our own production: capture, the record it accumulates, the routing of findings, an automated tier, and the gate that keeps it honest.

Capture

The loop's input is operator feedback at the moment work is reviewed. CCL, the lab's CEO-facing AI PA, produces work across design, prose, planning, and code, and every piece passes through the operator's review on its way out; the feedback that review produces is captured in both polarities, verbatim, with its date attached.

Rejections become written rules with the exact correction embedded. The design guardrails file opens by declaring that every anti-pattern in it traces to a specific rejection during design work, each entry carrying what was said and when. The prose discipline is built the same way: the rule against a chopped, fragmentary sentence rhythm quotes the correction that produced it, "you're doing that weird writing with short sentences again", inside the rule itself, so the rule carries its own evidence.

Praise runs through the same machinery, and it turns out to be load-bearing rather than decorative. A page the operator praised became the named benchmark the design skill now measures new work against. The corpus of examples the writing skill draws on was restricted to praised work only. A behavior praised once in April, presenting a decision with a recommendation attached instead of a menu of options, was a standing rule by May.

Praise that received no explicit save was still in the record, findable, and it steered nothing: the record captures by default, encoding is an act, and what the loop does not encode, it does not run on.

The record

Corrections land in the moment they are made; retrospectives are what find the ones that repeat. Every session the system runs is kept, and the full record is semantically searchable: 1,300 sessions, 118,869 chunks of conversation. The substrate matters because it turns the question of whether we have seen a failure before from a memory exercise into a query.

The deepest use of it so far began with the operator asking a plain question: since the instruction layer changed, could the before and after be assessed. The retrospective that followed mined all 362 transcripts then on record, reconstructed 189 instruction edits spanning roughly three and a half months, and built a histogram of the operator's frustration, 25 corrections marked by profanity, clustering on design work. It judged each standing rule against the incidents that were supposed to justify it, declined to add rules for failure modes it could not observe, and refuted one of its own hypotheses in its final report.

The cadence this machinery produces is not a ritual. Small corrections land roughly twice a day, continuously; the deep audits arrive every few weeks, when accumulated pain earns one rather than when a calendar says so. We know because we tried the calendar: a scheduled maintenance ritual was designed, ran once, and died, while the event-driven loop survived. The audit's own summary made the cadence claim for us: the instructions change almost continuously, and that is the real headline.

A correction that lives in the chat is rented. Encoded as a rule, it is owned.

Four layers

A finding that survives capture has exactly four places it can land, and part of the discipline is that there is no fifth.

Always-on principles: the instructions read at every turn, 254 lines across 6 files today, cut from 1,676 lines across 11 at the peak, because adherence falls as the file grows and every line that stays has to earn its place from a real incident rather than an anticipated one.

Skills: domain-scoped craft, loaded when the domain shows up, one discipline for rendered surfaces, one for prose, one for product demos. The newest opens by declaring that it owns what its parent skill kept getting corrected on, which is the loop stated as a birth certificate.

Hooks: what a rule becomes when it keeps decaying as prose. Some rules hold as instructions; others are lost mid-session as the context fills, however clearly they are written, and a rule that keeps being lost stops being an instruction and becomes code that fires mechanically: a gate that blocks outbound writes unless an approval is on record, a reasoning scaffold injected into every turn. The rule stops asking to be remembered.

The graph: world facts go to a persistent knowledge graph, the memory system we built and have since productised as Grove, and the graph does something the other three layers cannot. It runs the loop without us.

Earned patterns

The graph is the same one an earlier note followed through six months of operational load. Alongside the entities and events it tracks, it carries a tier the other layers do not have: pattern nodes, regularities promoted out of recurring observations with no human in the promotion.

Promotion is earned. A pattern is anchored by evidence edges to the episodes that taught it, one to eight episodes deep in the current graph, and it is never minted from a single data point. Confidence is a lifecycle rather than a score, tentative to established to weakening to superseded, and it moves in both directions: a pattern about a collaborator's delivery style was demoted to weakening when the world stopped confirming it, then re-established six days later when a delivery landed. The dial turned down and back up on evidence alone.

Restraint is what makes the tier trustworthy. The consolidation pass that does the promoting has fired 37 times against 2,012 occasions on which it inspected its queue and declined to run, and most runs that do fire promote nothing at all. The graph holds 17 active patterns today; 15 have reached established.

The tier also catches itself. A pattern minted one morning had its supporting evidence invalidated by a directive from earlier that same morning; the contradiction was queued and the pattern trimmed to its supportable half within hours. And when an automated resolution went the wrong way, adopting a claim the operator knew to be false, the override took thirteen minutes. Automation at this layer is real, and so is the speed with which it can be corrected.

Patterns are earned, never minted from a single data point.

The gate

One part of this system is closed to automation: nothing but the operator changes the instruction layer. The rule exists because of an incident, not a philosophy. Given a loosely worded prompt, the automated tier once edited an instruction file on its own, silently, outside version control, and possibly wrong, and the session that caught it concluded that an instruction change should never happen as the side effect of a routine memory call. The boundary was drawn the same day: the graph writes to itself, and the instruction layer changes only through the operator.

The deeper reason the gate is load-bearing is generativity. A model will make sense of almost any request, and a layer whose author makes sense of everything inflates: rules accumulate for cases nobody observed, and the record of corrections drifts toward speculation about them. And the failure is not hypothetical: a June audit found that over-delivery had become the system's leading symptom, and traced it to rules written to push a weaker model harder. The model underneath had been upgraded, the rules had not, and directives tuned to extract effort from the old model pushed the new one past the target. The fix edited the rules that had fixed the rules.

That is the loop's second derivative: it corrects its own corrections, and a model upgrade re-tunes the whole layer. The gate holds through all of it, keeping the layer from being written by the thing it is meant to correct.

The line

A correction that lives in the chat is rented; the next session pays for it again. Encoded as a rule, with the quote and the date attached, it is owned, and the class of error it names retires. Everything above is the machinery of that conversion, from a remark made in review to a line the system runs on every turn.

What compounds is not the model; the model is replaced whole every time a better one ships. What compounds is the layer around it: every retired class of error returns attention, and what accumulates is capacity for work the system could not take on before. The weights are exactly where they started. The system is not.

Contact

If something on this page is relevant to work you are running, write to us. The form is on the landing page. We come back within two working days.

Book a discovery call →