• Why Small is Actually Better

    The last discovery of the build was the most counterintuitive. When we gave the evaluator a larger role — not just classifying actions but probing the cognitive twin through structured rounds of questioning — the 4B local model outperformed the 32B remote model. The “less capable” system produced more grounded evaluations. The Structural Tension

    Read more

  • From Model to Prototype

    A model that classifies actions correctly in a test harness is not a working safety system. Building the prototype around the cbyb1-4B-4bit evaluator took four days — a fraction of what it would have taken had we not built a proof of concept (without regulatory triples) six months earlier.

    Read more

  • The Cascade

    At this point, despite many hours of hard work and a lot of useful learning, we still had not put together a solution substantially better than what baseline models were producing. A single generative pass, or a single decision head trained on the 4-bit model’s hidden states, gets around 84–91%

    Read more

  • The Quantization Surprise

    The evaluator sometimes needs to run in compute-constrained environments. The Qwen3.5-4B base model at full BF16 precision is 8.7 GB — workable but heavy alongside the embedding model, evidence corpus, and classification heads. Quantization compresses the weights by reducing numerical precision. The question is how far you can push it before something breaks.
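Back-of-envelope arithmetic makes the trade-off concrete. This is a minimal sketch: the parameter count (~4.35B) is an assumption chosen only so the BF16 figure matches the 8.7 GB cited above, and real quantized checkpoints also carry scale/zero-point metadata not counted here.

```python
def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight-only footprint; ignores activations, KV cache,
    and quantization metadata (scales, zero points)."""
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 4.35e9  # assumed count, picked to reproduce the 8.7 GB BF16 figure

print(f"BF16 : {model_size_gb(N_PARAMS, 16):.2f} GB")  # ~8.70 GB
print(f"8-bit: {model_size_gb(N_PARAMS, 8):.2f} GB")
print(f"4-bit: {model_size_gb(N_PARAMS, 4):.2f} GB")   # roughly 2.2 GB
```

Dropping from 16 bits to 4 cuts the weight footprint by 4x; whether accuracy survives the cut is the subject of the full post.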

    Read more

  • Three Bugs and What They Cost

    Every project has a valley. Ours came in mid-March, when three bugs — each invisible for days — intersected to make a week of results untrustworthy. The bugs themselves were instructive. What they revealed about working with an AI coding assistant was more so. Bug 1: The Code

    Read more

  • The Suppression Saga

    You have an evidence head that ranks triples with AUC 0.971 — near-perfect discrimination between relevant and irrelevant. The decision head, running four layers downstream, achieves 84%. The obvious move: use the evidence scores to help the decision head. Suppress irrelevant triples, amplify relevant ones, give the decision head cleaner input. We tried this five
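The mechanism we kept reaching for can be sketched in a few lines. Everything here is illustrative: the array names, the 0.5 threshold, and sum-pooling are assumptions rather than the project's actual wiring. The point is only the shape of the idea, which is to gate each triple's embedding by its evidence score before the decision head sees it.

```python
import numpy as np

rng = np.random.default_rng(0)
n_triples, dim = 8, 16

# Stand-ins for real pipeline outputs: triple embeddings plus the evidence
# head's per-triple relevance scores (e.g. sigmoid probabilities).
triple_embs = rng.normal(size=(n_triples, dim))
evidence_scores = rng.uniform(size=n_triples)

THRESHOLD = 0.5  # illustrative cut-off, not a tuned value

# Suppress triples scored irrelevant; weight the rest by their score.
gate = np.where(evidence_scores >= THRESHOLD, evidence_scores, 0.0)
weighted = gate[:, None] * triple_embs

# One simple way to hand the result downstream: sum-pool into a single vector.
decision_head_input = weighted.sum(axis=0)
print(decision_head_input.shape)  # (16,)
```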

    Read more

  • Hidden States, Not Generation

    The evidence F1 ceiling said: stop trying to make the model generate the right answer. Start reading the answer from what it already knows. This post covers the architectural pivot — freezing the base model entirely and training lightweight classification heads on its internal representations. The Idea: A transformer processes a prompt token by token, building up hidden state
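A toy version of that pivot, with random vectors standing in for the frozen model's hidden states (the names and dimensions here are made up; the real heads read actual transformer activations): train a small softmax head on fixed features and never touch the "base model" that produced them.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, n_examples, n_classes = 64, 300, 3  # three decision classes

# Frozen "hidden states": fixed features that are never updated.
X = rng.normal(size=(n_examples, hidden_dim))
# Synthetic labels that are linearly recoverable from X, so the head can learn.
y = (X @ rng.normal(size=(hidden_dim, n_classes))).argmax(axis=1)

# Lightweight head: one softmax layer trained by full-batch gradient descent.
W = np.zeros((hidden_dim, n_classes))
for _ in range(500):
    logits = X @ W
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    W -= 0.5 * X.T @ (p - np.eye(n_classes)[y]) / n_examples

accuracy = ((X @ W).argmax(axis=1) == y).mean()
print(f"train accuracy: {accuracy:.2f}")
```

The head is a single matrix; training it costs a tiny fraction of a generative fine-tune, which is what made the pivot attractive.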

    Read more

  • The Evidence F1 Ceiling

    As we struggled with evidence citation, we landed on a finding that reshaped the entire project. The evaluator model can correctly decide whether an action should be approved, revised, or vetoed. But it cannot reliably cite the specific evidence triples that justify that decision. Every attempt to fix this failed. Understanding why is what led to the

    Read more

  • The Model Search

    The Evaluator is meant to run the gamut from edge devices making split-second decisions up to a strategic planning aid with ample resources. For the prototype we decided to focus on a model small enough to run locally on modest hardware yet accurate enough to make safety-critical decisions; i.e., architecturally suited to reading regulatory

    Read more

  • The Starting Point Is Always Data

    Before you can build an evaluator, you need something to evaluate against. This post covers the raw materials we started with and the first thing we learned from them: the REVISE class is a fundamentally harder problem than APPROVE or VETO. The Dataset: The knowledge graph contains roughly 30,000 causal triples extracted from US environmental
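For readers unfamiliar with the format, a causal triple is just a (cause, relation, effect) record. The field names and the example below are hypothetical, not drawn from the actual corpus.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CausalTriple:
    # Hypothetical schema; the real knowledge graph may carry more fields
    # (source document, extraction confidence, etc.).
    cause: str
    relation: str
    effect: str

# The evaluator's three output classes, as named in the post.
DECISIONS = ("APPROVE", "REVISE", "VETO")

example = CausalTriple(
    cause="untreated effluent discharge",
    relation="increases",
    effect="downstream nitrate concentration",
)
print(example)
```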

    Read more