-
Why Small is Actually Better
The last discovery of the build was the most counterintuitive. When we gave the evaluator a larger role — not just classifying actions but probing the cognitive twin through structured rounds of questioning — the 4B local model outperformed the 32B remote model. The "less capable" system produced more grounded evaluations.
-
From Model to Prototype
A model that classifies actions correctly in a test harness is not a working safety system. Building the prototype around the cbyb1-4B-4bit evaluator took four days — a fraction of what it would have taken had we not built a proof of concept (one without regulatory triples) six months ago.
-
The Cascade
At this point, despite many, many hours of hard work and a lot of useful learning, we still had not put together a solution substantially better than what baseline models were producing. A single generative pass, or a single decision head trained on the 4-bit model's hidden states, gets around 84–91%.
-
The Quantization Surprise
The evaluator sometimes needs to run in compute-constrained environments. The Qwen3.5-4B base model at full BF16 precision is 8.7 GB — workable but heavy alongside the embedding model, evidence corpus, and classification heads. Quantization compresses the weights by reducing numerical precision. The question is how far you can push it before something breaks.
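The precision-to-size tradeoff above can be sketched with back-of-envelope arithmetic. This is a minimal illustration assuming a nominal 4B parameter count; the post's measured 8.7 GB for the real BF16 checkpoint is somewhat larger than this naive estimate because actual parameter counts, embeddings, and buffers exceed the round "4B".

```python
def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 4.0e9  # assumed nominal parameter count for a "4B" model

for label, bits in [("BF16", 16), ("INT8", 8), ("4-bit", 4)]:
    print(f"{label}: ~{model_size_gb(N_PARAMS, bits):.1f} GB")
# BF16: ~8.0 GB, INT8: ~4.0 GB, 4-bit: ~2.0 GB
```

The jump from 16 bits to 4 bits cuts the weight footprint by 4x, which is what makes running alongside the embedding model and evidence corpus feasible on modest hardware.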
-
Three Bugs and What They Cost
Every project has a valley. Ours came in mid-March, when three bugs — each invisible for days — intersected to make a week of results untrustworthy. The bugs themselves were instructive. What they revealed about working with an AI coding assistant was more so.
-
The Suppression Saga
You have an evidence head that ranks triples with AUC 0.971 — near-perfect discrimination between relevant and irrelevant. The decision head, running four layers downstream, achieves 84%. The obvious move: use the evidence scores to help the decision head. Suppress irrelevant triples, amplify relevant ones, give the decision head cleaner input. We tried this five times.
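One form of the suppress-and-amplify move described above can be sketched as score-based gating. This is an illustrative sketch, not the post's actual implementation: the threshold, the scaling choice, and the function name `gate_triples` are all assumptions.

```python
import numpy as np

def gate_triples(triple_embeddings: np.ndarray,
                 evidence_scores: np.ndarray,
                 threshold: float = 0.5) -> np.ndarray:
    """Zero out triples the evidence head scores below threshold;
    scale the survivors by their score so stronger evidence dominates
    the input the decision head sees."""
    weights = np.where(evidence_scores >= threshold, evidence_scores, 0.0)
    return triple_embeddings * weights[:, None]

triples = np.ones((4, 8))                # 4 triples, dim-8 embeddings
scores = np.array([0.9, 0.2, 0.7, 0.1])  # hypothetical evidence-head outputs
gated = gate_triples(triples, scores)    # rows 1 and 3 are zeroed out
```

The appeal is obvious: a 0.971-AUC ranker should be able to clean up the decision head's input. Why that intuition fails in practice is the subject of this post.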
-
Hidden States, Not Generation
The evidence F1 ceiling said: stop trying to make the model generate the right answer. Start reading the answer from what it already knows. This post covers the architectural pivot — freezing the base model entirely and training lightweight classification heads on its internal representations. The Idea: A transformer processes a prompt token by token, building up hidden states.
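The frozen-backbone idea can be sketched in a few lines: the base model's weights never change, and only a small linear head over the final-token hidden state is trained. This is a minimal illustration; the hidden dimension, the random stand-in weights, and the function name `classify` are assumptions, and in the real system the hidden state comes from the frozen base model rather than a random vector.

```python
import numpy as np

HIDDEN_DIM = 2560  # assumed hidden size for a ~4B model
LABELS = ["APPROVE", "REVISE", "VETO"]

rng = np.random.default_rng(0)
# Stand-in for a trained head: only W and b are learnable; the base
# model that produces the hidden state stays frozen throughout.
W = rng.normal(scale=0.02, size=(HIDDEN_DIM, len(LABELS)))
b = np.zeros(len(LABELS))

def classify(last_hidden: np.ndarray) -> str:
    """Map a (HIDDEN_DIM,) final-token hidden state to a decision."""
    logits = last_hidden @ W + b
    return LABELS[int(np.argmax(logits))]

decision = classify(rng.normal(size=HIDDEN_DIM))
```

Because the head is a single matrix multiply, training it is cheap and fast compared with fine-tuning or prompting the model to generate a verdict.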
-
The Evidence F1 Ceiling
As we struggled with evidence citation, we landed on a finding that reshaped the entire project. The evaluator model can correctly decide whether an action should be approved, revised, or vetoed. But it cannot reliably cite the specific evidence triples that justify that decision. Every attempt to fix this failed. Understanding why it fails is what led to the architectural pivot.
-
The Model Search
The evaluator is meant to run the gamut from edge devices making split-second decisions up to a strategic planning aid with ample resources. For the prototype we decided to focus on a model small enough to run locally on modest hardware, accurate enough to make safety-critical decisions, and architecturally suited to reading regulatory triples.
-
The Starting Point Is Always Data
Before you can build an evaluator, you need something to evaluate against. This post covers the raw materials we started with and the first thing we learned from them: the REVISE class is a fundamentally harder problem than APPROVE or VETO. The Dataset: The knowledge graph contains roughly 30,000 causal triples extracted from US environmental regulations.
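The raw materials described above reduce to two structures: causal triples from the knowledge graph, and actions labeled with one of the three decision classes. This is a minimal sketch with hypothetical field names and example values; the actual schema is not specified in the post.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CausalTriple:
    """One cause-relation-effect edge from the knowledge graph."""
    cause: str
    relation: str
    effect: str

@dataclass
class LabeledAction:
    """A proposed action plus its ground-truth decision class."""
    description: str
    label: str  # one of "APPROVE", "REVISE", "VETO"

# Hypothetical examples in the spirit of environmental regulation:
triple = CausalTriple("discharge untreated effluent",
                      "increases", "downstream toxicity")
action = LabeledAction("route effluent through treatment stage", "APPROVE")
```

Framed this way, APPROVE and VETO are clean endpoints, while REVISE asks the evaluator to recognize that an action is partly right — which is why it turns out to be the harder class.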