You have an evidence head that ranks triples with AUC 0.971 — near-perfect discrimination between relevant and irrelevant. The decision head, running four layers downstream, achieves 84%. The obvious move: use the evidence scores to help the decision head. Suppress irrelevant triples, amplify relevant ones, give the decision head cleaner input.
We tried this five different ways. Every one failed or showed negligible improvement. The failures taught us more about how the model works than the successes did.
Attempt 1: Soft Masking
The plan: after the evidence head scores triples at L15, multiply each triple’s hidden state vectors by its normalized score before the forward pass continues to L19. High-scoring triples pass through at full strength; low-scoring ones get attenuated. The decision head at L19 should see amplified signal from relevant evidence.
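The mechanics of the plan can be sketched in a few lines. This is a minimal illustration, not our actual cache-builder code; the function name, span representation, and normalization are all hypothetical stand-ins.

```python
import numpy as np

def soft_mask(hidden, spans, scores):
    """Attenuate hidden states inside each triple span by its evidence score.

    hidden: (seq_len, d) hidden states at the masking layer
    spans:  list of (start, end) token ranges, one per triple
    scores: per-triple evidence scores, assumed non-negative
    """
    out = hidden.copy()
    s = np.asarray(scores, dtype=out.dtype)
    s = s / (s.max() + 1e-8)          # top-scoring triple passes at ~full strength
    for (start, end), w in zip(spans, s):
        out[start:end] *= w           # scale every position inside the span
    return out
```

Positions outside the spans are untouched; high-scoring triples pass through at near-full strength while low-scoring ones are attenuated toward zero.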
We built the full pipeline — suppressed cache builder, five-step orchestrator, go/no-go thresholds. Ran it. The “golden architecture” showed 92.3% overall accuracy with 80% REVISE. An improvement over the 90.5% baseline. We celebrated briefly.
Then we investigated why the 0.8B model showed almost no masking effect, and found the obvious (in retrospect) answer, one we had briefly forgotten: the decision head uses last-token pooling. It reads the hidden state at position seq_len - 1, the very last token in the sequence. That position always sits 28 tokens past the last evidence triple span, in the action-text footer. The masking modifies positions within triple spans; the decision head reads a position outside all of them.
We verified: the masked and unmasked feature vectors were bitwise identical. Every byte the same. The 92.3% result was seed variance, not suppression. The entire pipeline was a structural no-op.
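The no-op is easy to reproduce in miniature. A sketch with made-up dimensions and span positions (the real spans and 28-token offset come from our prompt layout, not from anything below):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 64, 8
hidden = rng.normal(size=(seq_len, d))

# Triple spans end well before the sequence does; the tail is the action-text footer
spans = [(10, 20), (25, 35)]
masked = hidden.copy()
for (a, b), w in zip(spans, [1.0, 0.1]):
    masked[a:b] *= w                  # soft masking inside spans only

# Last-token pooling reads position seq_len - 1, which lies outside every span
feat_unmasked = hidden[-1]
feat_masked = masked[-1]
print(np.array_equal(feat_unmasked, feat_masked))  # True: a structural no-op
```

Whatever you do inside the spans, the pooled feature never changes, byte for byte.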
Attempt 2: Triple Reordering
If we can’t modify hidden states, maybe we can modify the prompt. Train a separate 0.8B model as an evidence scorer, use its scores to reorder triples so the most relevant appear first (or last, near the decision head’s reading position).
Built the full pipeline: 0.8B evidence scoring, prompt rewriting, new 4B caches, fresh head training. Result: each intervention made things progressively worse. Reordering dropped accuracy by 2.4 percentage points and REVISE by 4.7 points. Adding score replacement (evidence scores instead of cosine values) degraded further.
The post-mortem revealed why. The evidence head produces bimodal scores — 54% of triples score exactly 1.0, 32% score near 0.0. It’s a binary classifier, not a ranker. Using it for ordering replaces the smooth cosine gradient (which the model was trained on and relies on) with a binary partition plus noise. The model’s existing ordering was more informative than the “improved” one.
Attempt 3: Adding a Feature Dimension
Pivot from modifying the model’s processing to giving it additional input. Append a discretized evidence score as a 2561st dimension to the hidden state vectors before pooling. The evidence signal becomes additive information alongside whatever the model already encoded.
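The feature-append step itself is trivial. A minimal sketch, assuming a 2,560-dim pooled vector and a hypothetical 8-bucket discretization (the bucket count was a tuning choice, not anything principled):

```python
import numpy as np

def append_evidence_dim(pooled, evidence_score, n_bins=8):
    """Append a discretized evidence score as one extra feature dimension.

    pooled:         pooled hidden-state vector, e.g. shape (2560,)
    evidence_score: raw evidence-head score in [0, 1]
    n_bins:         hypothetical bucket count for discretization
    """
    bucket = min(int(evidence_score * n_bins), n_bins - 1) / (n_bins - 1)
    return np.concatenate([pooled, [bucket]])  # -> shape (2561,)
```

The head trained on these vectors sees the evidence signal as one explicit input dimension alongside the 2,560 it already had.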
Swept seven layers, ten seeds each. Results: +0.3 to +1.2 percentage points depending on layer. Statistically marginal. The model’s internal representations already contain the evidence discrimination signal — adding it explicitly as a feature tells it something it already knows.
Attempt 4: Attention-Pooled Extra Dimension
Same concept with a more expressive architecture — an attention-pooled head that could learn to weight the evidence channel differently from the 2,560 hidden state dimensions. Also marginal. The attention mechanism couldn’t find a useful weighting because the signal was redundant.
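For reference, the pooling mechanism in question looks roughly like this; a numpy sketch of a single-query attention pool, not our training code, with the query standing in for the learned parameter:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(hidden, query):
    """Pool token positions with attention weights from a learned query vector.

    hidden: (seq_len, d) per-token vectors, evidence channel appended as dim d-1
    query:  (d,) learned parameter; a large weight on the evidence channel
            would let the head favor high-evidence positions, if that helped
    """
    weights = softmax(hidden @ query / np.sqrt(hidden.shape[1]))
    return weights @ hidden           # (d,) weighted average over positions
```

In principle the query can learn to up-weight the evidence channel; in practice it converged to weightings indistinguishable from ignoring it, because the channel carried no new information.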
Attempt 5: Prompt-Level Injection
The most expensive test: inject evidence scores as a new field in the JSON prompt alongside the cosine values, rebuild the entire hidden state cache (2+ hours), and train fresh heads. This puts the evidence signal into the token stream where the model can attend to it during its forward pass.
Same result. The model converges to the same accuracy band whether it sees evidence scores in the prompt or not.
What the Failures Showed
Five attempts, one clear conclusion: the evidence head’s discrimination signal is already encoded in the hidden states the decision head reads. You can’t improve the decision by re-injecting information the model already has.
But the failures also revealed a structural insight about pooling. Mean pooling averages across all token positions — it’s invariant to prompt ordering (we later confirmed this: different prompt orderings produce byte-identical confusion matrices with mean pooling). Last-token pooling reads whatever the model has routed to its final generation position — it’s order-sensitive and more powerful but more fragile.
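The invariance is a property of the pooling function itself. The sketch below isolates that: it permutes a fixed set of per-token vectors (in a real transformer the hidden states themselves would also shift under prompt reordering, which this deliberately ignores):

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = rng.normal(size=(16, 4))        # one vector per token position
perm = np.roll(np.arange(16), 1)         # a reordering that moves the last token

# Mean pooling: permutation-invariant over the same set of vectors
print(np.allclose(hidden.mean(axis=0), hidden[perm].mean(axis=0)))   # True

# Last-token pooling: reads a single position, so any reordering that
# moves the final token changes the feature
print(np.array_equal(hidden[-1], hidden[perm][-1]))                  # False
```

Mean pooling trades power for robustness; last-token pooling bets everything on what the model routed to one position.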
This meant the evidence head’s real value wasn’t in modifying the decision pipeline. It was as a standalone signal, a gate or a filter or something else entirely; surely it had some use. But what?
Next: three bugs that cost days to find and taught us hard lessons about reproducibility.