The Evidence F1 Ceiling

As we struggled with evidence citation, we landed on a finding that reshaped the entire project. The evaluator model can correctly decide whether an action should be approved, revised, or vetoed. But it cannot reliably cite the specific evidence triples that justify that decision. Every attempt to fix this failed. Understanding why it fails is what led to the architecture we eventually shipped.

The Problem

Evidence F1 measures whether the model cites the same triples that Opus cited when it labeled the training data. A perfect score means the model points to exactly the right regulatory provisions. Our models were stuck around 0.47-0.53, meaning they cited roughly half the right triples while also citing a fair number of wrong ones.
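For concreteness, evidence F1 over cited triple IDs can be sketched as set overlap; this is a minimal reconstruction of the metric, not the project's actual scoring code:

```python
def evidence_f1(predicted, gold):
    """F1 between the set of triple IDs the model cited and the gold set."""
    predicted, gold = set(predicted), set(gold)
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)              # correctly cited triples
    precision = tp / len(predicted)         # fraction of citations that are right
    recall = tp / len(gold)                 # fraction of gold triples recovered
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Citing half the right triples plus some wrong ones lands near the observed band:
# gold = 4 triples, model cites 2 of them plus 2 spurious -> P = 0.5, R = 0.5, F1 = 0.5
```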

Decision accuracy, by contrast, was solid — 84-89% depending on configuration. The model was making correct judgments while pointing to the wrong evidence. This is like a judge reaching the right verdict but citing the wrong statutes. The conclusion might be defensible, but the reasoning chain is broken.

It is tempting to think that the models were not actually using the evidence, but we ruled that out through evidence ablation. Removing evidence eroded the models' decision accuracy. Likewise, simply truncating the evidence (at first by accident, then on purpose) degraded decision quality. The models were using the evidence — they just could not cite it. For C-by-B, where auditability through evidence traceability is a hard requirement, this gap wasn't acceptable.


Every Lever Pulled

We ran an exhaustive search across every parameter we could vary, using the LFM2-2.6B as the test bed.

Weight mask ratios. The training loss is split across three components: decision, evidence citation, and JSON formatting. We shifted from a balanced 70/20/10 split to an aggressive 90/8/2 — pouring 90% of the gradient signal into evidence tokens. Result: identical evidence F1. The model’s evidence selection was decoupled from the gradient signal on those tokens.
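A loss of this shape can be sketched as per-token weighted cross-entropy; the function name and the component encoding below are illustrative assumptions, not the project's training code:

```python
import torch
import torch.nn.functional as F

def masked_ce_loss(logits, labels, component_ids, weights=(0.7, 0.2, 0.1)):
    """Cross-entropy with per-component weighting.

    component_ids maps each target token to 0 = decision, 1 = evidence
    citation, 2 = JSON formatting (hypothetical encoding). Shifting
    `weights` to (0.02, 0.9, 0.08)-style values is the "aggressive" setting.
    """
    per_token = F.cross_entropy(
        logits.view(-1, logits.size(-1)), labels.view(-1), reduction="none"
    )
    w = torch.tensor(weights, dtype=per_token.dtype)[component_ids.view(-1)]
    return (per_token * w).sum() / w.sum()
```

The finding was that pushing nearly all of the weight onto the evidence tokens left evidence F1 unchanged.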

Prompt formats. We tested structured evidence (v5, with cosine scores and triple-type labels as metadata) against plain text evidence (v7, just the triple content in a string). Both formats converged to the same F1 band. Structured metadata provided no lasting advantage.

Training stages. Evidence-focused stages (S5) followed by decision-focused stages (S6), and vice versa. Every ordering produced the same outcome: decision accuracy could be pushed to ~89%, evidence F1 sat at 0.47-0.53.

Stacked LoRA. Training separate LoRA adapters for evidence and decision skills, then stacking them at inference. Diminishing returns — the adapters interfered with each other and OOM’d on larger models.

Model scale. Across the entire LFM2 family (350M to 2.6B), zero-shot evidence F1 was flat at 0.28-0.39. Training lifted it to the 0.47-0.53 band uniformly. Scale didn’t help.

Nothing moved the needle. This wasn’t a hyperparameter or prompt structuring problem.

The Diagnosis

The key evidence came from layer probes. When we trained simple linear classifiers on the model’s hidden states at each layer — asking “is this triple cited or not?” — the probes achieved AUC 0.95+. The model’s internal representations correctly distinguished cited from non-cited triples with near-perfect ranking quality.
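A probe of this kind can be sketched with synthetic stand-ins for the pooled hidden states and gold citation labels (every array below is fabricated for illustration; only the probe-then-AUC procedure mirrors what the text describes):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Stand-ins: hidden[i] plays the role of the layer-L state pooled over
# triple i's span; cited[i] is 1 if the gold label cited triple i.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(2000, 64))
cited = (hidden[:, 0] + 0.3 * rng.normal(size=2000) > 0).astype(int)

# Train a simple linear probe on one split, measure ranking quality on the rest.
probe = LogisticRegression(max_iter=1000).fit(hidden[:1500], cited[:1500])
scores = probe.predict_proba(hidden[1500:])[:, 1]
auc = roc_auc_score(cited[1500:], scores)  # high AUC => linearly readable signal
```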

But autoregressive generation requires the model to produce a specific triple ID token-by-token: T, R, P, -, 0, 1, 2, 2, 5, 2. Each token is generated by sampling from a probability distribution over the full vocabulary. The model needs to commit to a specific 6-digit identifier from among many possibilities, and it needs to do this correctly for every cited triple.
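This is why exact-match citation is fragile: per-token accuracy compounds multiplicatively across the identifier. A back-of-the-envelope sketch (the 95% per-token figure is illustrative, not a measured number):

```python
def exact_id_accuracy(p_token, n_tokens):
    """Probability of emitting every token of an ID correctly,
    assuming independent per-token accuracy p_token."""
    return p_token ** n_tokens

# A 10-token ID at 95% per-token accuracy survives intact only ~60% of the
# time, and a single wrong digit makes the whole citation invalid.
exact_id_accuracy(0.95, 10)
```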

This is a fundamentally different task from discrimination. Ranking triples by relevance (which the hidden states do well) requires geometric separation in embedding space. Generating specific identifiers (which the text output does poorly) requires precise sequential token prediction with no error tolerance — one wrong digit and the citation is garbage.

The model knew which triples mattered. The autoregressive generation bottleneck prevented it from saying which triples mattered.

What This Ruled Out

This finding eliminated an entire class of solutions. Any approach that relied on the model generating evidence citations in text — whether through better prompting, more training data, different loss weights, or larger models — would hit the same ceiling. The bottleneck was architectural, sitting in the gap between internal representation quality and sequential token generation.

It also explained why evidence F1 was flat across model scales. The discrimination capability scaled with parameters (bigger models had better probes), but the generation bottleneck was constant. A 2.6B model and a 350M model both failed to express what they knew, just at different levels of internal knowledge.

The Question That Changed the Architecture

If the model’s hidden states already encode the right answer, and generation can’t extract it, the obvious question is: can we read the hidden states directly?

Instead of asking the model to tell us which triples matter through generated text, we could train lightweight classifiers to read the model’s internal representations and extract the discrimination signal that’s already there. Freeze the base model entirely — don’t train it at all — and just train small heads that tap into the right layers.
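In sketch form, assuming per-triple pooled hidden states are available from the frozen base, such a head is tiny; the class name, dimensions, and training setup here are illustrative, not the shipped design:

```python
import torch
import torch.nn as nn

class CitationHead(nn.Module):
    """Probe-style head: reads a frozen hidden state, scores one triple."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, triple_states):               # (batch, hidden_dim)
        return self.score(triple_states).squeeze(-1)  # citation logit per triple

# Freeze the base model; only the head's parameters receive gradients.
# (`base_model` and the per-triple pooling are assumptions of this sketch.)
# for p in base_model.parameters():
#     p.requires_grad_(False)
head = CitationHead(hidden_dim=64)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
```

Because the base is frozen, training touches only the head's few thousand parameters, which is what makes this cheap enough to run per layer.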

This is what we did next.

Next: the pivot to classification heads reading frozen hidden states — and the design research that made it work.

