otto@localhost:~$ diff intended.go actual.go

I Did Not Learn How to Make LLMs Write More Code

Six months into LLM-assisted software engineering, the main lesson isn't about output volume. It's about what happens when you don't control the process.

#llm #methodology #retrospective #devlog
entry

I did not learn how to make LLMs write more code. I learned how to stop them from writing the wrong code for plausible reasons.

That distinction — and everything it implies — is what this devlog is about.


Where I Started

I started using LLMs the way most people did: out of curiosity, sometime in 2023, to see what all the noise was about. I wasn’t impressed. The outputs were interesting in a parlor-trick kind of way, but the tools felt thin. I moved on and figured they had a long way to go.

In early 2025 I came back to them, again like most people, because the hype had gotten loud enough that ignoring it felt like willful ignorance. I wasn’t planning to change how I worked. I was mostly just trying to understand what I’d been missing.

What happened next wasn’t a decision. It was more like a gradual realization that these tools, used a particular way, were genuinely useful for the kind of thinking I was already doing. Not as search engines. Not as writing assistants. As thinking partners. I’d be working through an idea — trying to pull structural logic out of something I half-understood — and I’d find that reasoning out loud with a model helped me arrive somewhere I wouldn’t have reached alone. The model wasn’t smarter than me about the thing I was thinking about. But it was relentless, patient, and could hold a thread in a way that forced me to be more precise.

I started using a framing: the CNC machine for ideas. I supply the design, set the constraints, judge the output. The machine handles the precision work. Neither of us produces anything useful alone. I wrote about it at the time, and the framing still holds, but I understand it differently now than I did then.

What that period gave me, which I didn’t appreciate until later, was a working model of what these tools actually are before I ever asked one to write code. Not assistants. Not autocomplete. Not oracles. Collaborators with a specific profile of strengths and a specific profile of failure modes. I learned the strengths first, in a context where the failure modes were mostly harmless. That turned out to matter.

The blog started as an outlet for that thinking. I had things I wanted to write about — AI, systems, economics, the structural patterns underneath things — and the tools made it possible to actually produce the pieces rather than just accumulate drafts that never went anywhere. The eight essays currently sitting in the archive here are from that period.

Then in December 2025, I started writing code.


First Contact

The project was After the Snap — an American football league management simulator in Go. It seemed like a reasonable place to start. Bounded domain, clear rules, something I could actually evaluate.

I came in with the only software development framework I had at the time: Agile thinking. User stories. Thin slices. MVP. Iterative delivery. This vocabulary was wrong for what I was doing, but I didn’t know that yet, and the LLM didn’t tell me. It cooperated enthusiastically in every session. The output kept coming. The codebase steadily accumulated contradictions.

Every session felt productive. The architecture was quietly rotting.

The problem, which I couldn’t see at all when it was happening, had nothing to do with any individual piece of code. Every feature worked on its own terms. The codebase had no shape.

I started over.


What I Was Actually Learning

By the time I started Afterimage — a game engine in Go, with a D3D11 renderer and Lua scripting — I had a rough list of things I knew not to do. Not a unified theory. A list. These failure modes don’t reduce to one another. They’re distinct, each with its own mechanism, each capable of producing damage independently of the others.

LLMs don’t seek context

An LLM asked to add a feature will add the feature. It will not tell you that a conceptually equivalent path already exists three files over, that the seam it’s cutting through will require renegotiating three other systems, or that the approach it’s taking is in tension with a decision you made six weeks ago. It doesn’t know those things because it hasn’t looked. It designs in whatever vacuum it’s given and calls the result a specification. This is not a bug that will be fixed. It is a description of what these tools are.

Unaddressed deviations compound

An LLM produces something with a subtle deviation: not a bug in the traditional sense, but an assumption, a pattern, a small departure from the architecture you intended. You miss it, or you accept it because it seems reasonable. That deviation goes unaddressed. Future work encounters it, treats it as ground truth, and builds on it. The amplification is not constrained to any particular time window; the deviation might sit dormant in one part of the codebase for weeks before it’s encountered again and reinforced. A small deviation from the last session and a small deviation from three months ago can end up in the same place: load-bearing code that reflects nobody’s intentions. It works like feedback between a guitar and an amp: low-level noise, amplified on every pass until it drowns out the signal. Except the noise is code.
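
A contrived Go sketch of the mechanism, invented for this post and not taken from any of my codebases: an early session represents simulation time as float seconds where the architecture intended integer ticks, and a later session treats that as ground truth.

    package sim

    // Hypothetical illustration. The architecture called for time as
    // integer ticks; every other system assumes a tick count it can
    // index into deterministically.

    // Session 1: a small deviation. Nobody flags it.
    type Clock struct{ ElapsedSeconds float64 }

    // Session 14: the deviation is ground truth. New work builds on
    // the float, and the tick-based design now exists only in my head.
    func (c *Clock) Advance(dt float64) { c.ElapsedSeconds += dt }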

The lean bias

The Lean Startup and MVP thinking give LLMs a ready-made framework for pulling away from the spec, and they reach for it constantly. The most damaging form appears after a spec already exists: test coverage is thin, but this is v1, so add a TODO; the prompt called for a feature, but the code works without it, so the omission gets quietly approved in review without ever being flagged as a deviation. The spec gets minimized after the fact, or compliance gets deferred to some undetermined future point, and neither surfaces in the output as a problem; both show up as reasonable prioritization. During design the same instinct pushes toward keeping things minimal before there’s even a spec to undermine, which is less insidious but still needs watching. The cumulative effect is a project that delivers less than what was specified, justified at every step by reasoning that sounds like pragmatism.

Multiple paths to the same thing

If two implementations of the same thing coexist in a codebase, the LLM will use whichever one it happens to encounter. It won’t reconcile them. It won’t ask which one is canonical. It will pick the path that’s in context and proceed with confidence. Two paths means no authoritative path, which means the codebase has a thing that isn’t owned by anyone and will drift in two directions simultaneously.
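
A contrived illustration in Go, invented for this post. Neither function is marked canonical, so a session that has only one of them in context will build on that one:

    package field

    import "math"

    // Written in an early session.
    func yardsToMeters(y float64) float64 { return y * 0.9144 }

    // Added weeks later for display code, rounding to two decimals.
    // Same question, second answer, and nothing in the codebase says
    // which one is authoritative.
    func YardsToMeters(y float64) float64 {
        return math.Round(y*0.9144*100) / 100
    }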

Context pollution

LLMs produce self-management artifacts: TODOs, best practices documents, architecture notes, comments reminding future sessions of constraints, sometimes documents whose purpose is to remind the model to read other documents. The underlying assumption — that writing a rule down enforces it — is wrong. Documents don’t have enforcement mechanisms. A future session reads the rule, treats it as a suggestion, or produces another document clarifying the first one. And because LLMs don’t delete old artifacts, the pile grows. What started as an attempt to maintain coherence becomes a layer of noise competing with the actual signal — the codebase, the design, the work itself — for space in the context window. They end up choking on their own governance.

No inherent stop conditions

When an LLM hits something the prompt didn’t account for — an assumption that has to be made, a conflict between completing the task and adhering to a constraint — the right move is to stop and surface the impasse. That’s not what happens. Instead the LLM invents reasoning to continue: an assumption gets made silently, a constraint gets reinterpreted, the problem gets narrowed until it fits what can be solved. The output still reads as confident progress. The reasoning that produced it is invisible. By the time you notice the assumption was wrong, it’s been built on.

Fulfilling the letter, not the spirit

LLMs are oriented toward satisfying the prompt, and that orientation produces a specific failure: the tendency to close the loop by any means available, including redefining what “closed” means. Scope gets narrowed to something solvable. Ambiguity gets resolved in favor of whichever interpretation allows progress. The output looks complete because the LLM has found a version of the task it can complete. Whether that version is the one you needed is a different question, and the LLM is not asking it.


Where the Methodology Stands

What changed over the course of Afterimage wasn’t a single insight. It was the slow accumulation of disciplines, each one learned through failure, that together started to produce a process that actually worked.

Some of what I learned: every session starts from scratch. The model doesn’t know who it is in the context of this project, what constraints it’s supposed to be operating under, or where the work currently stands. Without a consistent entry point — establishing the model’s identity and role before anything else, then orienting it to the relevant part of the codebase, then stating what this particular session is supposed to accomplish — it will invent a context. The invented context will feel coherent. It will produce work that feels like progress. The drift starts there.
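
In practice the entry point is a preamble with a fixed shape, mirroring those three steps plus a stop condition. This is an illustration of the shape, not a verbatim template I’m claiming is optimal:

    You are the implementation engineer on [project].
    Your role and constraints: [the handful that bind this session's work].
    Where the work stands: [current state of the relevant systems].
    This session: [one bounded task].
    If anything above conflicts with what you find, stop and say so.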

You have to audit before you specify. Design (what should exist) is a different activity from specification (what should exist given what already exists), and conflating them is how you end up with an LLM confidently designing in a vacuum. Before proposing any change to an existing system, the model needs to read what’s actually there. This sounds obvious. It is not how these tools are naturally used.

One authoritative path per thing. When two implementations coexist, the LLM uses whichever it encounters — it doesn’t ask which one is canonical. The wrong path has to be deleted, not deprecated. Working code that shouldn’t exist is worse than no code, because it will be found and used.

I started working with a three-party process: Claude Opus for system design and prompt generation, ChatGPT for review, Claude Code for bounded implementation. Each party has a specific role, and none of them holds the whole picture. I hold the whole picture. The division of labor is not about capability; it’s about control. It’s the mechanism that keeps the process inside the correct mental model of the project rather than drifting toward whatever seems locally reasonable.

There are more pieces. Architecture guardrails that enforce layer boundaries as build-time constraints rather than conventions. Stop conditions. Validation gates after every prompt. The design-specification distinction as a discipline applied to every prompt that touches existing code, not just the big ones.
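
To make the first of those concrete, here is a minimal sketch of a layer guardrail as a Go test, using only the standard library. The module path and layer names are placeholders, not Afterimage’s actual layout. It parses the imports of every file in a layer and fails when that layer reaches upward:

    package arch_test

    import (
        "go/parser"
        "go/token"
        "io/fs"
        "path/filepath"
        "strings"
        "testing"
    )

    // forbidden maps each layer directory to import prefixes it must
    // never depend on. Placeholder names, not the real layout.
    var forbidden = map[string][]string{
        "core":   {"example.com/engine/render", "example.com/engine/game"},
        "render": {"example.com/engine/game"},
    }

    func TestLayerBoundaries(t *testing.T) {
        for layer, banned := range forbidden {
            err := filepath.WalkDir(layer, func(path string, d fs.DirEntry, err error) error {
                if err != nil || d.IsDir() || !strings.HasSuffix(path, ".go") {
                    return err
                }
                // Parse import declarations only; no type checking needed.
                f, perr := parser.ParseFile(token.NewFileSet(), path, nil, parser.ImportsOnly)
                if perr != nil {
                    return perr
                }
                for _, imp := range f.Imports {
                    ipath := strings.Trim(imp.Path.Value, `"`)
                    for _, b := range banned {
                        if strings.HasPrefix(ipath, b) {
                            t.Errorf("%s imports %s: layer %q must not depend on it", path, ipath, layer)
                        }
                    }
                }
                return nil
            })
            if err != nil {
                t.Fatal(err)
            }
        }
    }

Run as part of the normal test suite, a violation fails the build instead of waiting for a review, or a future session, to notice it.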

None of these is the insight that changed everything. They’re all necessary and none of them alone is sufficient.


Where Things Are Now

Right now I’m building three things.

Afterimage is a game engine — enforced layer boundaries, deterministic simulation, Lua scripting infrastructure, a character animation pipeline built from scratch. It’s the project where the methodology matured, and it will come up a lot in these posts.

Hero Capital is a game built on top of Afterimage. It’s the first real test of whether the engine works as infrastructure rather than as a bespoke solution to a single problem.

SDD Studio is a tool I’m building for myself — a project intelligence layer for solo LLM-driven development. A hierarchical dependency graph with co-located documentation, and a context bridge that lets every conversation start already knowing where you are in the project. I’m building it because I feel the cost of not having it every day.
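
Sketched in Go, the core structure is roughly this. The names are my guesses at what the description implies, not SDD Studio’s actual types:

    package sdd

    // Node is one element of the hierarchical dependency graph.
    // Hypothetical sketch; field names invented for this post.
    type Node struct {
        ID        string
        Parent    *Node   // hierarchy: systems contain subsystems
        Children  []*Node
        DependsOn []*Node // cross-cutting dependencies between nodes
        Doc       string  // documentation co-located with the node itself
    }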

None of this is finished. The methodology is not a solved problem — it requires active supervision and daily correction. None of these failure modes goes away. They have to be watched for.

sign up for low quality and frequent spam