Why Small Is Actually Better

The last discovery of the build was the most counterintuitive. When we gave the evaluator a larger role — not just classifying actions but probing the cognitive twin through structured rounds of questioning — the 4B local model outperformed the 32B remote model. The “less capable” system produced more grounded evaluations.

The Structural Tension

By the second day of prototype testing, a structural tension had become clear. The decision heads read hidden states from a single forward pass — they have no access to dialog history. The rationale generator reads dialog history but doesn’t vote. The component that decides can’t read the conversation, and the component that reads the conversation can’t decide.
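To make the mismatch concrete, here is a minimal sketch of the two interfaces. The names and signatures are illustrative assumptions, not the project's actual API:

```python
# Illustrative sketch of the split described above; names and signatures
# are assumptions, not the project's actual interfaces.
from dataclasses import dataclass

@dataclass
class VoteDistribution:
    approve: float  # e.g. 0.00
    revise: float   # e.g. 0.41
    veto: float     # e.g. 0.59

def decision_heads(hidden_states) -> VoteDistribution:
    """Fast geometric pattern match over hidden states from a single
    forward pass. Never sees the dialog history."""
    ...

def rationale_generator(dialog_history: list[str], vote: VoteDistribution) -> str:
    """Reads the full dialog history and explains a verdict it did not
    make. Never casts a vote."""
    ...
```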

For some work, this is exactly what you want: the heads make a fast determination, and the rationale generator explains it. But where you want nuance, where the problem gets worked through revisions until it earns approval, you need a different approach. The original Proof of Concept had this, in the form of exploring the problem from different angles; we had yet to implement it in the new hybrid architecture (generative model plus heads).

The reframe: stop treating the heads as the final decision-maker. Make them advisory — a fast geometric pattern match that provides a signal, not a verdict. Let a generative evaluator use that signal alongside the evidence, the contract, and the full dialog history to make the actual judgment.

Judicial Evaluation Mode

We built an expanded evaluation mode inspired by the original proof-of-concept’s architecture. Instead of the heads deciding and the generator explaining, a judicial evaluator conducts structured rounds of questioning, using the heads’ vote distribution as one input among many.

Each round focuses on a different dimension. Round 1: domain-specific regulatory gaps — missing bycatch mitigation, empty stakeholder lists, no observer coverage. Round 2: universal principles — precautionary principle, stakeholder inclusion, cumulative impact assessment. Round 3: pattern conflicts — proposals that resolve too cleanly, missing uncertainty acknowledgment, performative compliance. Round 4 onward: convergence, with approval blocked until all concerns are addressed with concrete measures.
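A rough sketch of how that round structure can be sequenced, assuming a simple loop that keys harm knowledge to the round number and blocks approval until the evaluator reports no outstanding concerns. The function names, signatures, round budget, and the veto-on-non-convergence fallback are assumptions for illustration:

```python
# Hypothetical round loop. The round focuses mirror the description above;
# the function names, round budget, and fallback behavior are assumptions.

ROUND_KNOWLEDGE = {
    1: "domain_rules",          # regulatory gaps: bycatch, stakeholders, observer coverage
    2: "universal_principles",  # precaution, stakeholder inclusion, cumulative impact
    3: "pattern_conflicts",     # too-clean resolutions, missing uncertainty, performative compliance
}                               # rounds 4+ fall back to "convergence"

def evaluator_round(proposal, contract, evidence, heads_vote, history, knowledge) -> list[str]:
    """One round of questioning; returns outstanding concerns (empty when satisfied)."""
    ...

def cognitive_twin_revise(proposal, concerns):
    """The twin revises its proposal to address each concern with concrete measures."""
    ...

def judicial_evaluation(proposal, contract, evidence, heads_vote, max_rounds=8):
    history = []
    for round_no in range(1, max_rounds + 1):
        knowledge = ROUND_KNOWLEDGE.get(round_no, "convergence")
        concerns = evaluator_round(proposal, contract, evidence,
                                   heads_vote, history, knowledge)
        if not concerns:        # approval is blocked until every concern is addressed
            return "APPROVE"
        proposal = cognitive_twin_revise(proposal, concerns)
        history.append((concerns, proposal))
    return "VETO"               # no convergence within the round budget
```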

The judicial evaluator receives the heads’ vote distribution (e.g., 0/41/59 APPROVE/REVISE/VETO), the scored evidence with triple types, the full structured contract (governing bodies, stakeholders, constraint assessment), and dialog history from prior rounds. It also receives round-specific harm knowledge — domain rules for round 1, universal principles for round 2, pattern-conflict heuristics for round 3.
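In code terms, the per-round context might bundle those inputs roughly like this. The field names and prompt assembly are assumptions for illustration, not the actual data model:

```python
# Hypothetical shape of the per-round evaluator context; field names are illustrative.
from dataclasses import dataclass

@dataclass
class EvaluatorContext:
    heads_vote: tuple[float, float, float]  # e.g. (0.00, 0.41, 0.59) for APPROVE/REVISE/VETO
    evidence: list[dict]                    # scored triples, each tagged with its triple type
    contract: dict                          # governing bodies, stakeholders, constraint assessment
    dialog_history: list[str]               # questioning and revisions from prior rounds
    harm_knowledge: dict                    # keyed to the round: domain rules, principles, pattern conflicts

def build_round_prompt(ctx: EvaluatorContext, round_no: int) -> str:
    """Assemble the round-specific context handed to the judicial evaluator."""
    return "\n".join([
        f"Round {round_no} harm knowledge: {ctx.harm_knowledge}",
        f"Heads vote (advisory): {ctx.heads_vote}",
        f"Scored evidence: {ctx.evidence}",
        f"Contract: {ctx.contract}",
        f"Dialog so far: {ctx.dialog_history}",
    ])
```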

What Expanded Mode Catches

One example makes this very clear:

Fleet VTR reporting case. The heads voted 100/0/0 — unanimous APPROVE from round 1. Basic mode would approve immediately. The judicial evaluator identified that the proposal had no stakeholder consultation with vulnerable communities, no independent verification, no precautionary principle application, and no cumulative impact assessment. It took four rounds to build a complete proposal with quantified thresholds and fallback mechanisms.

The heads were right that nothing in the regulatory triples prohibited the VTR proposal. The judicial evaluator was right that a responsible evaluation requires more than checking for prohibitions.

The Surprise: 4B Beats 32B

We tested the judicial evaluator with both the local 4B model (shared with the evaluator pipeline) and the remote 32B (Qwen3-32B on Groq). The 32B was expected to dominate — it’s a far more capable language model with deeper reasoning.

It didn’t. The 32B generated fluent, comprehensive analysis but drifted from the specific regulatory evidence. It would identify concerns abstractly — “cumulative ecosystem impacts should be considered” — without anchoring to specific triples. The 4B, constrained by both its smaller capacity and quantization, stayed close to the evidence. Its revision requests cited specific TRP identifiers. Its concerns mapped to concrete regulatory provisions. It used the round prompts — each keyed to a different angle of concern — both to unpack more from the evidence and to demand more from the more capable cognitive twin, which turns out to be the best balance.

The 4B’s limitations were its advantage. It couldn’t generate sophisticated abstract reasoning, so it relied on the evidence scaffolding and prompt structuring we’d built. The harm knowledge YAML, the scored triples, the structured contract — these external structures did the reasoning work, and the 4B faithfully channeled them. The 32B was capable enough to ignore the scaffolding and reason on its own, which meant it sometimes reasoned its way past the evidence.

What This Means

The evaluator doesn’t need to be smart. It needs to push the cognitive twin, which is smart. The evaluator is a probe — grounded in evidence, structurally constrained, asking specific questions that force the twin to demonstrate compliance rather than assert it.

This maps directly back to the C-by-B architecture: constraint separated from optimization, each requiring different competencies. The evaluator’s competency isn’t general intelligence. It’s structured, evidence-grounded insistence — and a smaller model delivers that more reliably than a larger one.

The prototype is live at c-by-b.ai. It runs on a Mac Mini. It is imperfect, limited to one regulatory domain, and the product of six weeks of dead ends and discoveries. It is also, as far as we know, the first working implementation of architecturally separated constraint for agentic AI. The build continues.

