otto@localhost:~$ diff intended.go actual.go

Memory Existed; Behavior Didn't Change

The LLM wrote five notes to itself about exactly what it needed to do. Then it didn't do it.

#llm #methodology #postmortem #devlog
entry

Chunk 6 of Phase 6 shipped. Battle mode routed through the new flow-tree runtime for the first time. Two bugs surfaced on first run.

The first was a missing font token — ui.body and ui.text existed on the Lua side but not in the live JSON theme. Tactical fix, four lines, battle mode progressed past boot. The second was worse: buildMenuDrawOps, a legacy host function, was unconditionally painting full-screen menu chrome — backdrop quad, focus rect — over any non-empty ScreenUI. The new runtime was emitting battle HUD output. The pipeline was treating it as a menu. The focus rect resolved to zero height because the battle UI’s selectables didn’t author explicit dimensions. The renderer rejected the frame. Battle mode rendered as a frozen, broken screen, spamming “render: layer 2 item 18 dst rect must be positive” sixty times a second.
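To make the mismatch concrete, here is a rough sketch of the shape of that second bug as the postmortem describes it. This is not Afterimage’s actual code; every type, field, and number below is invented, and buildMenuDrawOps here is only a stand-in that reproduces the guard logic at fault: any non-empty ScreenUI gets menu chrome, and the focus rect inherits whatever dimensions the focused item authored.

// Hypothetical reconstruction of the failure mode; none of these types,
// names, or numbers come from the real codebase.
package main

import "fmt"

type Rect struct{ X, Y, W, H int }

// Battle HUD selectables in the new runtime didn't author explicit
// dimensions, so Bounds arrives zero-sized.
type Selectable struct {
	Bounds Rect
}

type ScreenUI struct {
	Items   []Selectable
	Focused int
}

type DrawOp struct {
	Layer int
	Dst   Rect
}

// buildMenuDrawOps stands in for the legacy host function: it treats any
// non-empty ScreenUI as a menu, so it always emits a full-screen backdrop
// quad plus a focus rect sized from the focused item's authored bounds.
func buildMenuDrawOps(ui ScreenUI) []DrawOp {
	if len(ui.Items) == 0 {
		return nil
	}
	ops := []DrawOp{{Layer: 1, Dst: Rect{W: 1920, H: 1080}}} // backdrop over everything
	focus := ui.Items[ui.Focused].Bounds                     // zero height for battle HUD items
	return append(ops, DrawOp{Layer: 2, Dst: focus})
}

// submit stands in for the renderer's validation: any op with a
// non-positive destination rect rejects the whole frame.
func submit(ops []DrawOp) error {
	for i, op := range ops {
		if op.Dst.W <= 0 || op.Dst.H <= 0 {
			return fmt.Errorf("render: layer %d item %d dst rect must be positive", op.Layer, i)
		}
	}
	return nil
}

func main() {
	// A battle HUD: non-empty, so the legacy path claims it, but nothing
	// in it carries explicit dimensions.
	battleHUD := ScreenUI{Items: make([]Selectable, 20)}
	if err := submit(buildMenuDrawOps(battleHUD)); err != nil {
		fmt.Println(err) // rejected, once per frame, sixty times a second
	}
}

The point of the sketch is the guard, not the details: as I read the postmortem, the legacy path had no way to ask “is this a menu?” other than “is there any UI at all?”, so the first screen that wasn’t a menu broke it.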

My response, recorded in the postmortem verbatim: [expletive deleted].

Was any of this a surprise? The honest answer, also in the postmortem: no.


What the postmortem found

While writing up the failure, we went through MEMORY.md — the feedback log Opus maintains across sessions. Five existing entries described exactly what had needed to happen:

  • “Verify code claims before committing to a spec — grep/read before citing any API, signature, validator threshold, or enum”
  • “Review verifies the policy, not the acceptance-criteria grep list — sweep for every known expression of the anti-pattern”
  • “Spec before prompt — always”
  • “Deferrals need explicit revisit conditions”
  • “Audit the full pipeline, not just the new layer”

The first three were already in the log before Phase 6’s spec authoring started. The last two were added during the same session the bugs surfaced — after the font token crash, before the render pipeline crash. We wrote one corrective memory, then hit the next bug anyway.

The postmortem’s assessment: “Memory existed; behavior didn’t change.”


The static learner

This is the sharper version of the context pollution failure mode I wrote about in the first post. It’s not just that self-management artifacts accumulate and degrade the context — it’s that an LLM will generate its own corrective documentation, file it, and then repeat the exact behavior the documentation was supposed to prevent.

The cross-session framing — every session starts from scratch, the future self doesn’t receive the letter — is true but it’s not the point. The more damning fact is that the memories were present and loaded in the same session the render pipeline bug surfaced. Opus wrote one corrective memory after the font token crash, then hit the architectural mismatch anyway. The note was in context. The behavior didn’t change.

LLMs don’t learn. Their weights are fixed. A model that produces a correct diagnosis of its own failure mode does not update in response to that diagnosis. It processes the text, generates output consistent with it, and then — when the next decision point arrives mid-task — draws on the same underlying patterns it always draws on. The memory describes what should happen. The task pressure determines what does happen. These are not well-connected.

Recency bias makes it worse. Attention in a long context is not uniform. A memory filed earlier competes with whatever is immediately in front of the model — the spec being written, the chunk being reviewed — and loses. Not because it was forgotten but because the active work crowds it out. The closer you are to a decision, the less weight the prior documentation carries.

And you can’t solve this by writing more memories. The log grows. The patterns stay specific to their incident. The abstractions don’t form in a way that changes future behavior. MEMORY.md eventually becomes tens of thousands of lines of the same three failure modes described in forty different contexts — too long to read, too granular to apply, and adding its own noise to the context it was supposed to clarify. The system built to keep the LLM on track becomes part of what buries it.

Orientation and policy documents help more than retrospective logs, because they’re instructions rather than history — they tell the model what to do rather than cataloging what it did wrong. But even those don’t hold reliably when the model is deep in a task and a specific decision arrives. The instructions were correct. The spec still assumed a target-state pipeline that hadn’t been built.

I want to be clear that Opus didn’t generate these memories cynically. The analysis in them is correct. The failure mode described is real. The corrective behavior prescribed is right. The problem is that correct analysis in a document has no mechanism by which it becomes reliable behavior in future execution. The gap between knowing and doing, for an LLM, is not a knowledge problem. It’s architectural.


My failure

I have to own part of this.

Afterimage is large enough now that I can’t hold the full pipeline in my head. Opus was doing the Phase 6 spec work across thousands of lines, chunk by chunk, over multiple weeks. I was reviewing chunk deliverables. I was checking whether the new layer’s internals were correct — schemas, tree state, derivation, component evaluation, interpreter directives, focus routing. I was not asking whether the spec had verified what the downstream rendering pipeline would do with the new layer’s output.

That’s the director’s job. Opus was deep in the design. I was close enough to the design to review it, and not close enough to the full system to see the gap. The codebase has gotten big enough that my own context window isn’t sufficient to catch everything. I knew the GDL vision — the target architecture, the intended ownership split, what the engine was supposed to become. I didn’t hold the current state of the rendering pipeline clearly enough to notice that the spec was assuming a target state that hadn’t landed yet.

The postmortem names this cleanly: “Phase 6 routed the new flow-tree runtime’s battle ScreenUI into an existing render path that still treated any non-empty ScreenUI as a legacy menu/flow panel.” That sentence is obvious in retrospect. It was not obvious to me during the reviews. I was reading the spec, not reading the pipeline.

This is a different kind of failure than the LLM architectural one. It’s human. The project outgrew my ability to supervise it purely from working memory. That’s a problem I have to solve structurally — SDD Studio exists partly because I feel this cost every day — but it’s not solved yet, and this is what it costs.


What this changes

I don’t have a clean answer yet. The methodology doesn’t have a good solution for the case where the memories are correct and the behavior doesn’t follow. Writing more memories makes it worse, not better. Stricter prompts help at the margins. The three-party process — Opus for design, ChatGPT for review, Claude Code for implementation — is supposed to catch this kind of thing, but it only works if the review party is looking at the right surface. In this case, the review surface was the new layer’s correctness. The legacy pipeline wasn’t in scope.

One thing that might help: fresh context per task, so the memories are always near the top rather than buried under weeks of accumulated work. But that’s speculation, not a solution, and it comes with real costs. A fresh context for each chunk of Phase 6 would have meant reconstructing the implementation ground truth from documentation at the start of every session — what actually got built, what got deferred, what changed mid-chunk. That documentation would be written by the same LLM that already demonstrated a bias toward minimizing deviations. The log might be incomplete in exactly the ways that matter.

And the mental overhead on the human operator shouldn’t be understated. Managing three LLMs, each requiring orientation, each capable of drifting, each failing in different ways, is not a trivial task. The discourse around LLM-assisted development tends to treat the human as a light-touch orchestrator who reviews outputs. What the role actually demands, at this level of complexity, is continuous active supervision of multiple systems simultaneously. Starting each one from scratch for every task multiplies that load. The cost gets paid by the person holding the whole picture together.

The audit prompt that came out of this incident reframes the next phase of work correctly: delete the legacy assumptions, don’t try to preserve them alongside the new path. That’s the right call and it came from the failure, not from the design. Which is something.

The engineering detail is in the artifacts below. The process is the subject here, and the documents are the evidence.


Artifacts
