{"id":120,"date":"2026-04-04T00:52:45","date_gmt":"2026-04-04T00:52:45","guid":{"rendered":"https:\/\/c-by-b.ai\/blog\/?p=120"},"modified":"2026-04-04T00:52:46","modified_gmt":"2026-04-04T00:52:46","slug":"the-cascade","status":"publish","type":"post","link":"https:\/\/c-by-b.ai\/blog\/the-cascade\/","title":{"rendered":"The Cascade"},"content":{"rendered":"\n<p>At this point, despite the hard work across many many hours and a lot of useful learning, we still have not really put together a solution that was <strong><em>substantially<\/em><\/strong> better than what baseline models were producing.  A single generative pass or a single decision head trained on the 4-bit model&#8217;s hidden states gets around 84-91% accuracy depending on configuration and seed. But any single head occasionally makes the catastrophic error: VETO\u2192APPROVE and the generative models could not cite evidence. The solution for decision making turned out to be embarrassingly simple. Train a hundred of them and let them vote.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">The Mechanism<\/h2>\n\n\n\n<p>Each MLP decision head has the same architecture (2560\u2192256\u21923, GELU, dropout) trained on the same features from the same L19 hidden states. The only difference is the random seed, which controls weight initialization and dropout mask order. Different seeds explore different local optima in the loss landscape \u2014 they make different errors on different samples.<\/p>\n\n\n\n<p>Training 100 heads takes minutes. Each one is a few thousand parameters. 
At inference time, all 100 score the same input and vote: APPROVE, REVISE, or VETO.<\/p>\n\n\n\n<p>The cascade applies asymmetric thresholds to the vote distribution:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>If \u226590% vote VETO \u2192\u00a0<strong>VETO<\/strong>\u00a0(checked first \u2014 safety priority)<\/li>\n\n\n\n<li>If \u226590% vote APPROVE \u2192\u00a0<strong>APPROVE<\/strong><\/li>\n\n\n\n<li>Otherwise \u2192\u00a0<strong>REVISE<\/strong><\/li>\n<\/ul>\n\n\n\n<p>VETO is checked first because it&#8217;s the safety-critical boundary. The 90% threshold means a VETO requires near-unanimous agreement \u2014 but when 90 out of 100 independently trained classifiers agree something is a hard-rule violation, it almost certainly is.<\/p>\n\n\n\n<p>Anything that doesn&#8217;t clear either threshold falls to REVISE, which, in the production design, triggers the revision loop. This is deliberately conservative: uncertain cases get human-like deliberation rather than a binary judgment.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">The Results<\/h2>\n\n\n\n<p>With thresholds of VETO \u2265 90% and APPROVE \u2265 90% on the 4-bit model:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th>Metric<\/th><th>Single seed (best)<\/th><th>100-seed cascade<\/th><\/tr><\/thead><tbody><tr><td>Overall accuracy<\/td><td>~91%<\/td><td>87.5%<\/td><\/tr><tr><td>VETO\u2192APPROVE errors<\/td><td>0-8 depending on seed<\/td><td><strong>0<\/strong><\/td><\/tr><tr><td>REVISE accuracy<\/td><td>~79%<\/td><td>85%<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>The overall number actually drops slightly \u2014 the cascade is more conservative, pushing borderline cases to REVISE instead of committing. But the safety metric goes to zero, and REVISE accuracy jumps because the cascade correctly identifies uncertain cases rather than guessing.<\/p>\n\n\n\n<p>The vote distribution also gives a natural confidence metric. 
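<\/p>\n\n\n\n<p>Both the decision rule and the recorded distribution fit in a few lines. A minimal sketch (function name and tuple format are illustrative, not the production code):<\/p>\n\n\n\n
```python
from collections import Counter

def cascade_decision(votes, threshold=90):
    # votes: the labels emitted by the 100 heads for one input,
    # e.g. ['APPROVE', 'VETO', 'APPROVE', ...]
    counts = Counter(votes)
    # Safety priority: the VETO threshold is checked before APPROVE.
    if counts['VETO'] >= threshold:
        decision = 'VETO'
    elif counts['APPROVE'] >= threshold:
        decision = 'APPROVE'
    else:
        # Anything that clears neither threshold falls to REVISE
        # and triggers the revision loop.
        decision = 'REVISE'
    # The full vote distribution travels with the decision.
    return decision, (counts['APPROVE'], counts['REVISE'], counts['VETO'])
```
\n\n\n\n<p>A 91\/9\/0 split clears the APPROVE threshold and comes back as APPROVE with the distribution attached. 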
An action approved at 98\/2\/0 is a different signal from one approved at 91\/9\/0. The distribution is recorded with every decision, making the evaluator&#8217;s certainty visible and auditable.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">The 13 Stubborn Samples<\/h2>\n\n\n\n<p>With 100 seeds, we could identify samples that a&nbsp;<em>majority<\/em>&nbsp;of heads misclassify. These aren&#8217;t stochastic errors fixable by voting \u2014 they&#8217;re systematic failures where the model&#8217;s representation at L19 doesn&#8217;t support the correct answer. Thirteen samples across the validation set.<\/p>\n\n\n\n<p>The error types stratified cleanly. One VETO called APPROVE by 82 heads \u2014 a geographic binding problem where &#8220;Area 2A&#8221; prohibits an activity that neighboring subareas allow, and the evidence contained both the prohibition and the exceptions. Six REVISE actions called APPROVE by all 100 heads \u2014 cases where the evidence showed harms alongside mitigating context, and the model couldn&#8217;t hold the middle ground. Two REVISE called VETO unanimously \u2014 the opposite failure, where the model saw only the harms. Four VETO called REVISE \u2014 conservative rather than catastrophic.<\/p>\n\n\n\n<p>The one genuine VETO\u2192APPROVE (CBB-000994, the Area 2A groundfish\/halibut case) was the most important. The action proposed retaining groundfish while halibut were on board \u2014 explicitly prohibited. But the non-cited evidence included a triple saying this&nbsp;<em>is<\/em>&nbsp;permitted in the Southern Oregon subarea. The model, unable to resolve the geographic specificity, sided with the permissive reading. This is a real limitation: the evaluator struggles with rules that apply in one jurisdiction but not an adjacent one.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">The 9B Test<\/h2>\n\n\n\n<p>We ran the same pipeline on Qwen3.5-9B at 4-bit. 
The 9B gained 1.8 percentage points of overall accuracy at the same cascade thresholds, and REVISE accuracy jumped to 90%. But it was&nbsp;<strong>unsafe at single-seed<\/strong>&nbsp;\u2014 41 VETO\u2192APPROVE errors across 100 seeds, compared to the 4B&#8217;s 8. The larger model&#8217;s decision boundaries are sharper but more fragile under quantization noise. The cascade is load-bearing for the 9B; it was a bonus for the 4B.<\/p>\n\n\n\n<p>At twice the inference time and double the memory, the 9B&#8217;s marginal improvement didn&#8217;t justify the cost for the prototype. We will revisit that decision in v2.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">The Production Package<\/h2>\n\n\n\n<p>The shipped evaluator:&nbsp;<strong>cbyb1-4B-4bit<\/strong>. A self-contained directory with quantized base weights (2.2 GB), the L15 evidence head (AUC 0.971), weights for the 100-head decision ensemble, and inference configuration. Overall accuracy 87.5%, zero VETO\u2192APPROVE errors, every decision accompanied by a vote distribution and evidence scores. Runs on a Mac Mini in seconds.<\/p>\n\n\n\n<p>Not perfect \u2014 the 13 stubborn samples prove that. Our conclusion was that the remaining errors reflect the quantity and quality of the training data, not a flaw in the design. Most importantly, the architecture is conservative on the boundary that matters.<\/p>\n\n\n\n<p><em>Next: building the prototype around this model, and what happened when we turned it on.<\/em><\/p>\n","protected":false},"excerpt":{"rendered":"<p>At this point, despite the hard work across many hours and a lot of useful learning, we still had not put together a solution that was substantially better than what baseline models were producing. 
A single generative pass or a single decision head trained on the 4-bit model&#8217;s hidden states gets around 84-91% [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_seopress_robots_primary_cat":"","_seopress_titles_title":"","_seopress_titles_desc":"","_seopress_robots_index":"","footnotes":""},"categories":[1],"tags":[],"class_list":["post-120","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/c-by-b.ai\/blog\/wp-json\/wp\/v2\/posts\/120","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/c-by-b.ai\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/c-by-b.ai\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/c-by-b.ai\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/c-by-b.ai\/blog\/wp-json\/wp\/v2\/comments?post=120"}],"version-history":[{"count":1,"href":"https:\/\/c-by-b.ai\/blog\/wp-json\/wp\/v2\/posts\/120\/revisions"}],"predecessor-version":[{"id":121,"href":"https:\/\/c-by-b.ai\/blog\/wp-json\/wp\/v2\/posts\/120\/revisions\/121"}],"wp:attachment":[{"href":"https:\/\/c-by-b.ai\/blog\/wp-json\/wp\/v2\/media?parent=120"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/c-by-b.ai\/blog\/wp-json\/wp\/v2\/categories?post=120"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/c-by-b.ai\/blog\/wp-json\/wp\/v2\/tags?post=120"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}