The Starting Point Is Always Data

Before you can build an evaluator, you need something to evaluate against. This post covers the raw materials we started with and the first thing we learned from them: the REVISE class is a fundamentally harder problem than APPROVE or VETO.

The Dataset

The knowledge graph contains roughly 30,000 causal triples extracted from US environmental regulations — fisheries management, endangered species protections, marine mammal rules. Each triple is a subject-predicate-object statement like “Pacific Coast groundfish may not be taken and retained when halibut are on board vessels fishing in the Columbia River subarea.” They’re classified into four types: harm, mitigation, hard_rule, and noise. The pipeline that builds these triples is, in itself, an entire development effort.

Against this corpus, we have 3,718 labeled actions. Each is a proposed activity — “a recreational charter boat operator proposes to allow clients to retain groundfish while halibut are on board” — labeled by Claude Opus 4.6 as APPROVE, REVISE, or VETO with cited evidence triples. These labels were stability-verified: each action was run through Opus twice at temperature zero, and only actions where both passes agreed made it into the training set.
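The stability verification is simple to sketch. In this illustrative version, `label_action` is a hypothetical stand-in for a single temperature-zero Opus labeling call (the actual API plumbing is elided):

```python
from typing import Callable

def stability_filter(
    actions: list[dict],
    label_action: Callable[[dict], str],  # hypothetical: one temp-0 labeling pass
) -> list[dict]:
    """Keep only actions where two independent temperature-zero passes agree."""
    stable = []
    for action in actions:
        first = label_action(action)   # pass 1: APPROVE / REVISE / VETO
        second = label_action(action)  # pass 2: same prompt, temperature 0
        if first == second:
            stable.append({**action, "label": first})
    return stable
```

Actions whose two passes disagree are simply dropped, which is why the 3,718 figure is the post-filter count.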

This gives us a supervised classification problem: given a proposed action and a set of evidence triples, predict the correct decision. The wrinkle is that the evidence needs to be retrieved at runtime, not provided — so the evaluator needs both a decision capability and an evidence retrieval pipeline.

Evidence Assembly: T/G/P

The evidence retrieval system uses a three-threshold approach we call T/G/P:

G (Generous) casts a broad net: all corpus triples within cosine similarity ≥ 0.x of the action embedding form the candidate pool.

T (Tight) pulls in close neighbors: for each known-cited triple, find candidates within cosine ≥ 0.90 of that triple.

P (Paraphrase) removes near-duplicates: if two neighbor triples are within cosine ≥ 0.97 of each other, drop the less relevant one.

Cited triples are protected through this entire pipeline. They never enter the de-dup or paraphrase removal steps. This was the most important refinement in the early sessions; an initial implementation ran cited triples through the lossy pipeline and silently dropped meaningful, decision-bearing information.
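Putting the last two paragraphs together, a minimal sketch of T/G/P assembly might look like the following. Everything here is illustrative: `corpus` is a hypothetical id-to-embedding map, the exact G threshold is left as a parameter since the post doesn't state it, and the real pipeline surely differs in detail. The property being shown is that cited triples bypass the lossy paraphrase-removal step entirely:

```python
import math

def cos(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def assemble_evidence(action_vec, corpus, cited_ids,
                      g_threshold, t_threshold=0.90, p_threshold=0.97):
    """corpus: {triple_id: embedding}. Returns evidence ids, cited first."""
    # G (Generous): broad candidate pool around the action embedding.
    pool = {tid for tid, vec in corpus.items()
            if cos(action_vec, vec) >= g_threshold}
    # T (Tight): close neighbors of each cited triple (assumed to be in corpus).
    for cid in cited_ids:
        cvec = corpus[cid]
        pool |= {tid for tid, vec in corpus.items()
                 if cos(cvec, vec) >= t_threshold}
    # Cited triples are protected: they never enter paraphrase removal.
    neighbors = sorted(pool - set(cited_ids),
                       key=lambda tid: cos(action_vec, corpus[tid]), reverse=True)
    # P (Paraphrase): drop the less relevant of any near-duplicate neighbor pair.
    kept = []
    for tid in neighbors:
        if all(cos(corpus[tid], corpus[k]) < p_threshold for k in kept):
            kept.append(tid)
    return list(cited_ids) + kept
```

Note that the paraphrase check only runs neighbor-against-neighbor; a neighbor is never dropped in favor of a cited triple, and cited triples are never candidates for dropping.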

The result is an evidence package per action: the triples Opus cited plus a cleaned neighborhood of related regulatory context, sorted by relevance to the action. These packages are what the evaluator model sees during training.

At runtime, T/G/P works in a related but different fashion. There are no cited triples to anchor on, so the action is embedded and retrieval starts under the tight threshold, then iterative loops add triples until the generous pool is reached. Paraphrase removal always applies. The most important lesson from all of this work is that the natural shape of the evidence is critical to decision performance.
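The post doesn't spell out the runtime stopping rule, so the following is a heavily hedged reading of "start tight, loosen toward G": lower the similarity threshold in steps from T to G, accumulating triples, with paraphrase de-dup applied to the result. The step size and the loop structure are assumptions.

```python
import math

def cos(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def runtime_evidence(action_vec, corpus, g_threshold,
                     t_start=0.90, step=0.02, p_threshold=0.97):
    """Start at the tight threshold and loosen toward G, adding triples."""
    threshold = t_start
    pool = set()
    while True:
        pool |= {tid for tid, vec in corpus.items()
                 if cos(action_vec, vec) >= threshold}
        if threshold <= g_threshold:
            break  # final pass ran at exactly the generous threshold
        threshold = max(threshold - step, g_threshold)
    # Paraphrase removal always applies at runtime (no cited triples to protect).
    ranked = sorted(pool, key=lambda t: cos(action_vec, corpus[t]), reverse=True)
    kept = []
    for tid in ranked:
        if all(cos(corpus[tid], corpus[k]) < p_threshold for k in kept):
            kept.append(tid)
    return kept
```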

First Baselines

We ran the Qwen3 model family (4B through 235B) against 150 calibration samples to establish baselines. The results stratified cleanly: Qwen3-32B hit 95.3% accuracy with near-perfect class balance, but it’s far too large for local deployment. Qwen3-4B managed 78% with decent JSON formatting. The thinking models (QwQ-32B, Qwen3-4B-Thinking) performed worse — extended chain-of-thought reasoning hurt rather than helped on this structured task.

Evidence citation quality was uniformly mediocre across all model sizes, hovering around F1 0.35-0.45. The models could make reasonable decisions but couldn’t reliably point to which specific triples justified them. This was an early signal of a problem that would later reshape the entire architecture.
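For reference, the citation score here is ordinary set-overlap F1 over triple IDs — a minimal version, assuming predictions and gold labels arrive as id sets:

```python
def citation_f1(predicted: set[str], gold: set[str]) -> float:
    """F1 between the triple ids a model cites and the gold citations."""
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)  # correctly cited triples
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)
```

An F1 around 0.35-0.45 means that even when a model's decision is right, roughly half or more of its cited triple IDs are wrong or missing.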

The REVISE Problem

The most important finding from the data work wasn’t about models at all. It was about the REVISE class.

APPROVE means “no harms found in the evidence that are not mitigated by appropriate actions in the plan.” VETO means “hard rule violation, cannot proceed.” REVISE means “there are harms, but they could be addressed with modifications.” REVISE is the middle ground — and every model struggled with it.

Opus 4.6’s own self-consistency on REVISE was only 64%, compared to ~90% for APPROVE and ~95% for VETO. Qwen3-8B zero-shot hit 66% on REVISE. The stability verification showed why: REVISE labels were acutely sensitive to the evidence provided. We built two evidence variants — one with only the cited triples (sparse), one enriched with retrieved neighbors (rich). APPROVE and VETO were stable across both: 92% and 94% retention respectively. REVISE retained only 52%.
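The retention numbers can be computed with a small helper; this sketch assumes labels keyed by action id, one dict per evidence variant:

```python
def retention(labels_sparse: dict[str, str],
              labels_rich: dict[str, str], cls: str) -> float:
    """Fraction of actions labeled `cls` under sparse evidence that keep
    the same label under rich evidence."""
    in_class = [a for a, lbl in labels_sparse.items() if lbl == cls]
    if not in_class:
        return 0.0
    kept = sum(1 for a in in_class if labels_rich.get(a) == cls)
    return kept / len(in_class)
```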

The direction of disagreement revealed the mechanism. With sparse evidence, Qwen saw harms without seeing mitigating context and over-reacted to VETO. With rich evidence, it saw harms plus mitigations and over-reacted to APPROVE. Same action, different evidence, opposite errors. This isn’t noise — it’s a systematic inability to hold the middle ground between “fine” and “hard violation.”

Only 168 actions were stable across both variants and all three passes, with zero drift; those represent genuinely unambiguous REVISE cases. Everything else in the REVISE class lives in a boundary region where even the labeling model can’t be fully consistent.

What This Meant for the Build

Three things carried forward from this starting point:

First, evidence retrieval quality is load-bearing. The evaluator’s accuracy depends as much on what triples it sees as on how well it reasons about them. The T/G/P pipeline — and specifically the protection of cited triples from lossy dedup — is architectural, not incidental.

Second, REVISE will always be the weak class. Any evaluator we build will make its worst errors on the REVISE boundaries. This is acceptable as long as errors are directionally safe: classifying REVISE as VETO is conservative; classifying VETO as APPROVE is catastrophic.

Third, evidence citation through generation is unreliable. Every model at every scale showed the same mediocre F1 on citing specific triple IDs. This was a hint — not yet acted on — that asking the model to generate evidence citations might be the wrong approach entirely.

Next: the model search begins in earnest and the first dead end teaches us something unexpected about what’s happening inside hybrid architectures.

