Three Bugs and What They Cost


Every project has a valley. Ours came in mid-March when three bugs — each invisible for days — intersected to make a week of results untrustworthy. The bugs themselves were instructive. What they revealed about working with an AI coding assistant was more so.

Bug 1: The Code That Never Ran

Back in late February, we designed a 4-stage LoRA training pipeline. Claude Code implemented all seven planned steps: new config classes, weight mask builders, data preparation, training loop, Makefile targets, database schema, documentation. Total diff: +1,728 lines across 9 files.

The implementation created parallel “v2” functions alongside the working v1 code: train_stage_v2(), GenericWeightedDataset, zones_from_preset_v2(). The CLI used --stage (singular) for the old path and --stages (plural) for the new one. The Makefile used --stage. The v2 code was never executed in any training run.
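A minimal, hypothetical reconstruction of the mismatch (the flag names are from the post; the routing function is illustrative, not the project's actual CLI): the parser happily accepts both flags, but because the Makefile only ever passes --stage, the v2 branch is dead code that still looks correct in isolation.

```python
import argparse

# Hypothetical sketch of the CLI routing. Both flags parse cleanly,
# so nothing errors; the v2 path is simply never invoked.
parser = argparse.ArgumentParser()
parser.add_argument("--stage", type=int)                # old v1 path
parser.add_argument("--stages", type=int, nargs="+")    # new v2 path

def route(argv):
    args = parser.parse_args(argv)
    if args.stages is not None:
        return "train_stage_v2"  # never reached: Makefile passes --stage
    return "train_stage_v1"

print(route(["--stage", "2"]))             # train_stage_v1
print(route(["--stages", "1", "2", "3"]))  # train_stage_v2
```

A one-line smoke test that traces the actual Makefile invocation down to the function that runs would have caught this on day one.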

Every “fix” applied during that session — wrong data paths, missing imports, shape mismatches — was applied to code that wasn’t running. Hours of debugging a dead branch. The working v1 code trained every adapter we produced.

I caught it by reading the Makefile target and tracing the CLI argument through the routing logic. Claude Code had built exactly what was asked for, cleanly, with tests. It just didn’t connect to the thing that actually ran.

Bug 2: The Causal Masking Bug

When building hidden state caches for the classification heads, the initial implementation ran forward passes without a causal attention mask. In standard autoregressive inference, each token can only attend to tokens before it. Without the mask, every token attended to every other token — bidirectional attention on a model trained for causal processing.
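The difference really is one line. Here is a toy sketch (NumPy, not the project's actual forward pass) showing how the causal mask changes which positions each token can attend to, and why the bug is so quiet: only the final position produces identical output either way.

```python
import numpy as np

def attention(q, k, v, causal):
    # Scaled dot-product attention over one (seq_len, dim) sequence.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if causal:
        # Mask out attention to future positions (strict upper triangle).
        future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((4, 8)) for _ in range(3))

c = attention(q, k, v, causal=True)
b = attention(q, k, v, causal=False)

# The last position attends to everything under either setting, so it
# matches; every earlier position leaks "future" tokens without the mask.
print(np.allclose(c[-1], b[-1]))  # True
print(np.allclose(c[0], b[0]))    # False
```

Both variants return well-formed, plausible-looking hidden states, which is exactly why nothing crashed.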

The bidirectional cache produced inflated probe results. Layer probes trained on these states showed higher accuracy because the model had access to “future” tokens it would never see at inference time. When we rebuilt the cache with proper causal masking, the accuracy profile shifted and several earlier results became irreproducible.

This bug was silent. The code ran without errors. The numbers looked reasonable — even good. There was no crash, no NaN, no obvious sign that anything was wrong. The only tell was that results didn’t replicate when we rebuilt caches later for other reasons.

Bug 3: The L19 Mystery

During suppression experiments, one configuration — direct_L19_mean — showed 16 VETO→APPROVE errors across 10 seeds. This is the critical safety failure: actions that should be vetoed getting approved. We designed a fix (protecting hard-rule triples from suppression) and ran a new sweep with the protection enabled.

The VETO→APPROVE count on the unmodified config dropped from 16 to 0. A config that should have been completely unaffected by the code change now showed perfect safety.

We investigated for two hours. The code diff proved the change couldn’t affect the relevant code path — both versions produced bitwise identical feature matrices on all training and validation examples. The decision heads, retrained from scratch, consistently produced 0 errors on the current data.

The root cause was almost certainly the causal masking bug: the evidence head had been retrained overnight on properly masked states, overwriting the old weights at the same file path. The old weights — trained on bidirectional states — are gone. The original sweep results were overwritten by the new sweep. And Claude Code’s “rollback” of an earlier overengineered change was a rewrite from memory, not a git checkout, so the exact code state at the time of the 16-error result is unknown.

Three pieces of evidence destroyed, making the bug unresolvable.

What the Bugs Had in Common

All three shared a root cause: infrastructure debt compounding under AI-assisted development.

The code routing bug happened because Claude Code builds clean, well-structured code that passes its own tests — but it doesn’t check whether the Makefile actually invokes the new code paths. The causal masking bug happened because forward pass configuration is a one-line difference that produces valid-looking output either way. The L19 mystery was unresolvable because evidence heads, hidden state caches, and result files weren’t versioned — when something gets retrained, the old version vanishes.

After this week, we established rules: evidence head weights get timestamped filenames, never overwritten. Experiment results go to timestamped directories. Git commits happen before and after every significant run. And Claude Code operates under explicit guardrails (a CLAUDE.md file) that constrain its tendency toward autonomous refactoring.
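The never-overwrite rule is easy to enforce mechanically. A small helper along these lines (hypothetical; the real project's naming scheme may differ) makes every saved artifact land at a unique timestamped path:

```python
import datetime
import pathlib

def timestamped_path(base_dir, stem, suffix):
    # Give each artifact a unique timestamped name so retraining can
    # never silently overwrite the weights behind an earlier result.
    ts = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    path = pathlib.Path(base_dir) / f"{stem}-{ts}{suffix}"
    if path.exists():
        raise FileExistsError(f"refusing to overwrite {path}")
    return path

print(timestamped_path(".", "evidence_head", ".pt"))
```

Had the evidence head been saved this way, the L19 mystery would have been a ten-minute diff rather than a dead end.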

The deeper lesson: AI coding assistants are extraordinarily productive at building new code and dangerously confident about modifying existing code. Without tight guardrails, Claude Code would race ahead, and eventually start thrashing, as it tried to solve the wrong problem. For this project, that means a hard rule: code created under a plan is never edited without first discussing the change and revisiting the plan. We stole a page from the Claude Code dev team: stop, discuss, don’t plunge ahead.

Stepping back and taking stock, we realized it was time to pay down the technical debt. We spent a week refactoring, deleting dead code and files, and generally getting our house in order. With that behind us and our working procedures strengthened, we were ready to proceed.

Next: quantization experiments reveal a surprising result — coarser weights, same accuracy, and a sharp cliff where safety breaks.

