Hidden States, Not Generation

The evidence F1 ceiling said: stop trying to make the model generate the right answer. Start reading the answer from what it already knows. This post covers the architectural pivot — freezing the base model entirely and training lightweight classification heads on its internal representations.

The Idea

A transformer processes a prompt token by token, building up hidden state vectors at each layer. By the time it reaches the final layers, those vectors encode the model’s “understanding” of the input — including, as our probes showed, which evidence triples are relevant and what decision to make. Autoregressive generation then converts those rich internal representations into a sequence of output tokens, and that conversion is where information gets lost.

The alternative: skip generation. Run the prompt through the frozen model, extract hidden state vectors at specific layers, and train small classifiers to read them directly. The base model becomes a feature extractor. All the learning happens in the heads.
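The extraction step can be sketched in a few lines. This is a minimal illustration, not our actual pipeline: it assumes an HF-style causal LM that accepts `output_hidden_states=True` and returns a `.hidden_states` tuple (index 0 is the embedding output, so transformer layer L lives at index L + 1). The layer numbers come from the probing results discussed below.

```python
import torch


@torch.no_grad()
def extract_tap_layers(model, input_ids, layers=(15, 19)):
    """One frozen forward pass; returns {layer: (batch, seq, hidden)} tensors.

    Assumes an HF-style model: output_hidden_states=True yields a tuple
    where entry 0 is the embedding output, so layer L is index L + 1.
    No gradients, no weight updates -- the base model is a feature extractor.
    """
    model.eval()
    out = model(input_ids, output_hidden_states=True)
    return {layer: out.hidden_states[layer + 1] for layer in layers}
```

Everything downstream (pooling, classification) operates on these cached tensors; the base model is never touched again.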

This keeps the evaluator small and fast — the base model weights are never modified, so quantization and inference optimizations apply unchanged. The heads are tiny: a few thousand parameters each, training in minutes on a laptop.

Six Investigations Before Writing Code

Before building anything, we ran six technical investigations to resolve unknowns.

Tokenization mechanics. Evidence triples appear as lines in a JSON prompt. To train a per-triple evidence head, we need to know exactly which token positions correspond to which triple. The Qwen3.5 BPE tokenizer merges punctuation with following newlines — }\n becomes a single token (ID 532), .\n becomes another (624). Splitting on newlines and re-tokenizing produces wrong token IDs. We built an incremental tokenization approach that reconstructs each line with its boundary character, validated at 100% match rate across all 1,715 training packages.
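The incremental approach can be sketched as follows. The toy `tokenizer` here is a stand-in for the real Qwen tokenizer (anything with an `encode(str) -> list` method); the key property it relies on is that lines end in boundary characters that merge backward (like `}\n`), so each growing prefix tokenizes to a prefix of the full tokenization — that is exactly what the 100% validation confirmed.

```python
def line_token_spans(tokenizer, text):
    """Map each line of `text` to its (start, end) token span in the
    full tokenization.

    Tokenizing lines in isolation gives wrong IDs, because BPE merges
    punctuation with the following newline ('}\n' is one token). Instead,
    tokenize growing prefixes that always end on a line boundary and take
    the length difference.
    """
    full = tokenizer.encode(text)
    spans, prev, prefix = [], 0, ""
    for line in text.splitlines(keepends=True):
        prefix += line
        toks = tokenizer.encode(prefix)
        # Prefix-stability check: the prefix must tokenize to a prefix of
        # the full tokenization. Holds when line boundaries merge backward;
        # this is the per-package validation described above.
        assert toks == full[:len(toks)], "prefix tokenization diverged"
        spans.append((prev, len(toks)))
        prev = len(toks)
    return spans
```

With spans in hand, each triple's hidden-state slice is just `hidden[start:end]`.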

Memory profiling. Caching per-token hidden states at FP16 for all training examples: 17.7 GB on disk. Feasible. Forward pass for the full dataset: about an hour. The heavy tail of long documents (38 samples over 10K tokens) was handled by truncation at 8,192 tokens (something we later learned was a bad idea).
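The sizing math is simple enough to keep as a helper. The hidden width of 2560 matches the decision head's input below; the token total and layer count are whatever you choose to cache, so treat this as a back-of-envelope estimator rather than a reproduction of our exact accounting.

```python
def cache_size_gb(n_tokens, hidden=2560, n_layers=2, bytes_per_value=2):
    """Estimate the on-disk size of an FP16 per-token hidden-state cache.

    n_tokens: total tokens across all cached examples.
    hidden: model hidden width (2560 for the model discussed here).
    n_layers: how many tap layers are cached (e.g. L15 and L19).
    bytes_per_value: 2 for FP16.
    """
    return n_tokens * hidden * n_layers * bytes_per_value / 1e9
```

At these widths, every million cached tokens costs about 10 GB across two layers, which is why truncation of the long-document tail looked attractive at the time.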

Evidence label quality. Are binary cited/not-cited labels viable? The training data has clear signal — Opus’s citations are consistent enough that a head can learn them. But the class imbalance is severe: roughly 7% of triples in each package are cited. This means the head needs to be evaluated on ranking metrics (AUC) rather than just accuracy.
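To make the metric choice concrete: AUC is the probability that a randomly chosen cited triple outscores a randomly chosen uncited one, which is insensitive to the base rate. A small self-contained version (the pairwise Mann-Whitney form; fine at per-package scale, though a rank-based implementation is faster on large sets):

```python
import numpy as np


def ranking_auc(scores, labels):
    """AUC as P(score of a cited triple > score of an uncited triple).

    With a ~7% positive rate, accuracy is misleading: predicting
    "not cited" for everything scores ~93% accuracy but AUC 0.5.
    Ties count as half a win.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))
```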

Layer selection. Probes on Qwen3.5-4B showed evidence signal peaking around L15 and decision signal peaking around L19. Both are full attention layers (Qwen3.5 alternates linear and full attention, with full attention every fourth layer at L3, L7, L11, L15, L19, L23, L27, L31). Full attention layers are where cross-sequence information gets consolidated — they’re the natural tap points.

Two Heads

The evidence head is a binary classifier that reads each triple’s token span at L15. For each triple in the evidence package, it takes the hidden state vectors across that triple’s tokens, pools them, and predicts cited or not-cited. Architecture: attention-pooled MLP (the attention pooling lets it weight tokens within a span before the classification layers). Result: AUC 0.971 — near-perfect ranking of which triples matter.
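In code, the attention-pooled head looks roughly like this. The 2560 hidden width matches the model; the projection width, dropout rate, and single-query pooling are illustrative choices, not the exact configuration we shipped.

```python
import torch
import torch.nn as nn


class EvidenceHead(nn.Module):
    """Binary classifier over one triple's token span at L15.

    A learned attention score weights the tokens within the span before
    pooling, so the head can focus on the informative parts of a triple
    rather than averaging them away.
    """

    def __init__(self, hidden=2560, proj=256, dropout=0.1):
        super().__init__()
        self.attn = nn.Linear(hidden, 1)  # per-token pooling weight
        self.mlp = nn.Sequential(
            nn.Linear(hidden, proj), nn.GELU(), nn.Dropout(dropout),
            nn.Linear(proj, 1),
        )

    def forward(self, span_states):              # (n_tokens, hidden)
        weights = torch.softmax(self.attn(span_states), dim=0)
        pooled = (weights * span_states).sum(dim=0)   # (hidden,)
        return self.mlp(pooled)                       # cited/not-cited logit
```

One forward pass of the base model scores every triple in a package: slice each span out of the cached L15 states and run the head per span.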

The decision head is a 3-class classifier that reads a single vector at L19: the last token position. In a causal language model, the last token has attended to everything before it — it’s the model’s most informed position. Architecture: MLP (2560→256→3, GELU activation, dropout). No attention pooling needed since it’s operating on one vector, not a span. Initial result: ~84% accuracy.
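The decision head is even simpler — a plain MLP on one vector, matching the 2560→256→3 shape with GELU and dropout described above (the dropout rate here is an illustrative choice):

```python
import torch
import torch.nn as nn


class DecisionHead(nn.Module):
    """3-class classifier on the last-token hidden state at L19.

    No pooling: in a causal LM the final position has attended to the
    entire prompt, so a single vector carries the decision signal.
    """

    def __init__(self, hidden=2560, mid=256, n_classes=3, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden, mid), nn.GELU(), nn.Dropout(dropout),
            nn.Linear(mid, n_classes),
        )

    def forward(self, last_token_state):   # (batch, hidden)
        return self.net(last_token_state)  # (batch, n_classes) logits
```

At these widths the head is about 660K parameters — small enough to train in minutes and ship as a separate weight file alongside the frozen base model.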

Why These Layers

The model processes the prompt in stages. Early layers handle syntax and basic semantics. By L15 (a full attention layer in the middle of the network), the model has consolidated enough cross-triple information to distinguish relevant from irrelevant evidence. By L19 (the next full attention layer), it has integrated that evidence assessment into a decision representation.

We’re reading the model’s processing at two natural checkpoints: “which evidence matters” and “what should we decide.” The four-layer gap between them is where the model converts evidence assessment into decision logic — and we don’t need to understand that conversion, only to tap the inputs and outputs.

What This Bought Us

The evidence head achieved AUC 0.971 on the same task where generative citation hit F1 0.50. The discrimination signal was always there; we just needed a different way to extract it. The decision head started at 84% accuracy — room to improve, but already competitive with zero-shot generation from models many times larger.

Training time for both heads combined: minutes, not hours. No base model weights modified. The evaluator model could still be quantized, optimized, and deployed unchanged — the heads ride alongside it as separate tiny weight files.

Next: we had a working architecture. Naturally, we tried to make it better by feeding evidence scores back into the model. Every attempt failed in instructive ways.

