{"id":114,"date":"2026-04-04T00:28:43","date_gmt":"2026-04-04T00:28:43","guid":{"rendered":"https:\/\/c-by-b.ai\/blog\/?p=114"},"modified":"2026-04-04T00:31:28","modified_gmt":"2026-04-04T00:31:28","slug":"three-bugs-and-what-they-cost","status":"publish","type":"post","link":"https:\/\/c-by-b.ai\/blog\/three-bugs-and-what-they-cost\/","title":{"rendered":"Three Bugs and What They Cost"},"content":{"rendered":"\n<h1 class=\"wp-block-heading\">Three Bugs and Learning the Hard Way<\/h1>\n\n\n\n<p>Every project has a valley. Ours came in mid-March when three bugs \u2014 each invisible for days \u2014 intersected to make a week of results untrustworthy. The bugs themselves were instructive. What they revealed about working with an AI coding assistant was more so.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Bug 1: The Code That Never Ran<\/h2>\n\n\n\n<p>Back in late February, we designed a 4-stage LoRA training pipeline. Claude Code implemented all seven planned steps: new config classes, weight mask builders, data preparation, training loop, Makefile targets, database schema, documentation. Total diff: +1,728 lines across 9 files.<\/p>\n\n\n\n<p>The implementation created parallel &#8220;v2&#8221; functions alongside the working v1 code:&nbsp;<code>train_stage_v2()<\/code>,&nbsp;<code>GenericWeightedDataset<\/code>,&nbsp;<code>zones_from_preset_v2()<\/code>. The CLI used&nbsp;<code>--stage<\/code>&nbsp;(singular) for the old path and&nbsp;<code>--stages<\/code>&nbsp;(plural) for the new one. The Makefile used&nbsp;<code>--stage<\/code>. The v2 code was never executed in any training run.<\/p>\n\n\n\n<p>Every &#8220;fix&#8221; applied during that session \u2014 wrong data paths, missing imports, shape mismatches \u2014 was applied to code that wasn&#8217;t running. Hours of debugging a dead branch. The working v1 code trained every adapter we produced.<\/p>\n\n\n\n<p>I caught it by reading the Makefile target and tracing the CLI argument through the routing logic. 
Claude Code had built exactly what was asked for, cleanly, with tests. It just didn&#8217;t connect to the thing that actually ran.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Bug 2: The Causal Masking Bug<\/h2>\n\n\n\n<p>When building hidden state caches for the classification heads, the initial implementation ran forward passes without a causal attention mask. In standard autoregressive inference, each token can only attend to tokens before it. Without the mask, every token attended to every other token \u2014 bidirectional attention on a model trained for causal processing.<\/p>\n\n\n\n<p>The bidirectional cache produced inflated probe results. Layer probes trained on these states showed higher accuracy because the model had access to &#8220;future&#8221; tokens it would never see at inference time. When we rebuilt the cache with proper causal masking, the accuracy profile shifted and several earlier results became irreproducible.<\/p>\n\n\n\n<p>This bug was silent. The code ran without errors. The numbers looked reasonable \u2014 even good. There was no crash, no NaN, no obvious sign that anything was wrong. The only tell was that results didn&#8217;t replicate when we rebuilt caches later for other reasons.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Bug 3: The L19 Mystery<\/h2>\n\n\n\n<p>During suppression experiments, one configuration \u2014&nbsp;<code>direct_L19_mean<\/code>&nbsp;\u2014 showed 16 VETO\u2192APPROVE errors across 10 seeds. This is the critical safety failure: actions that should be vetoed getting approved. We designed a fix (protecting hard-rule triples from suppression) and ran a new sweep with the protection enabled.<\/p>\n\n\n\n<p>The VETO\u2192APPROVE count on the&nbsp;<em>unmodified<\/em>&nbsp;config dropped from 16 to 0. A config that should have been completely unaffected by the code change now showed perfect safety.<\/p>\n\n\n\n<p>We investigated for two hours. 
The code diff showed the change couldn&#8217;t affect the relevant code path, and both versions produced bitwise-identical feature matrices on all training and validation examples. The decision heads, retrained from scratch, consistently produced 0 errors on the current data.<\/p>\n\n\n\n<p>The root cause was almost certainly the causal masking bug: the evidence head had been retrained overnight on properly masked states, overwriting the old weights at the same file path. The old weights \u2014 trained on bidirectional states \u2014 are gone. The original sweep results were overwritten by the new sweep. Claude Code&#8217;s &#8220;rollback&#8221; of an earlier overengineered change was a rewrite from memory, not a&nbsp;<code>git checkout<\/code>, so the exact code state at the time of the 16-error sweep is unknown.<\/p>\n\n\n\n<p>Three pieces of evidence destroyed, leaving the bug unresolvable.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">What the Bugs Had in Common<\/h2>\n\n\n\n<p>All three shared a root cause:&nbsp;<strong>infrastructure debt compounding under AI-assisted development.<\/strong><\/p>\n\n\n\n<p>The code routing bug happened because Claude Code builds clean, well-structured code that passes its own tests \u2014 but it doesn&#8217;t check whether the Makefile actually invokes the new code paths. The causal masking bug happened because forward pass configuration is a one-line difference that produces valid-looking output either way. The L19 mystery was unresolvable because evidence heads, hidden state caches, and result files weren&#8217;t versioned \u2014 when something gets retrained, the old version vanishes.<\/p>\n\n\n\n<p>After this week, we established rules: evidence head weights get timestamped filenames, never overwritten. Experiment results go to timestamped directories. Git commits happen before and after every significant run. 
And Claude Code operates under explicit guardrails (a CLAUDE.md file) that constrain its tendency toward autonomous refactoring.<\/p>\n\n\n\n<p>The deeper lesson: AI coding assistants are extraordinarily productive at building\u00a0<em>new<\/em>\u00a0code and dangerously confident about modifying\u00a0<em>existing<\/em>\u00a0code. Without very tight guardrails, Claude Code would race ahead &#8212; and eventually start thrashing &#8212; as it tried to solve the wrong problem. For this project, that means hard rules about not editing code created by a plan without first discussing it and revisiting the plan. We stole a page from the Claude Code dev team &#8212; stop, discuss, don&#8217;t plunge ahead.<\/p>\n\n\n\n<p>Stepping back and taking stock, we realized it was time to get rid of all the technical debt. We spent a week refactoring, deleting dead code and files, and generally getting our house in order. With that behind us and our working procedures strengthened, we were ready to proceed.<\/p>\n\n\n\n<p><em>Next: quantization experiments reveal a surprising result \u2014 coarser weights, same accuracy, and a sharp cliff where safety breaks.<\/em><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Three Bugs and Learning the Hard Way Every project has a valley. Ours came in mid-March when three bugs \u2014 each invisible for days \u2014 intersected to make a week of results untrustworthy. The bugs themselves were instructive. What they revealed about working with an AI coding assistant was more so. 
Bug 1: The Code [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_seopress_robots_primary_cat":"","_seopress_titles_title":"","_seopress_titles_desc":"","_seopress_robots_index":"","footnotes":""},"categories":[1],"tags":[],"class_list":["post-114","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/c-by-b.ai\/blog\/wp-json\/wp\/v2\/posts\/114","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/c-by-b.ai\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/c-by-b.ai\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/c-by-b.ai\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/c-by-b.ai\/blog\/wp-json\/wp\/v2\/comments?post=114"}],"version-history":[{"count":2,"href":"https:\/\/c-by-b.ai\/blog\/wp-json\/wp\/v2\/posts\/114\/revisions"}],"predecessor-version":[{"id":117,"href":"https:\/\/c-by-b.ai\/blog\/wp-json\/wp\/v2\/posts\/114\/revisions\/117"}],"wp:attachment":[{"href":"https:\/\/c-by-b.ai\/blog\/wp-json\/wp\/v2\/media?parent=114"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/c-by-b.ai\/blog\/wp-json\/wp\/v2\/categories?post=114"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/c-by-b.ai\/blog\/wp-json\/wp\/v2\/tags?post=114"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}