The Evaluator is meant to run the gamut from edge devices making split-second decisions on up to a strategic planning aid with ample resources. For the prototype we decided to focus on a model small enough to run locally on modest hardware yet accurate enough to make safety-critical decisions; i.e., architecturally suited to reading regulatory evidence. The search took two weeks, passed through three model families, and the most instructive stop was a dead end.
The Falcon Detour
Falcon H1-Tiny-90M is a hybrid SSM+attention model — Mamba2 layers interleaved with multi-head attention. Hybrid architectures were theoretically appealing: the SSM layers handle sequential context efficiently while attention layers do precise cross-reference. For regulatory evidence where you need to connect a proposed action to specific prohibitions scattered across a long context, this seemed right. In addition, the remote yet tantalizing prospect of a 90M parameter model as a Gate Keeper evaluator was very attractive. Alas, it was not to be, at least for this iteration.
Falcon wouldn’t train on MLX: LoRA produced zero gradients on every layer except layer 0. Rather than debug the framework, we built an entire PyTorch training pipeline (~880 lines) to work around it. We got a working model on the second run, after discovering that gradient accumulation of 1 (noisy single-example updates) actually prevented mode collapse on small models, while the smoother gradients from larger batches enabled shortcut learning. But the PyTorch tooling was unbelievably slow on our Mac Mini M4 Pro.
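The accumulation knob is the whole story in that paragraph, so here is a minimal sketch of it. The model, data, and hyperparameters are synthetic stand-ins, not the actual Falcon fine-tune; only the accumulation logic is the point.

```python
import torch

# Toy reconstruction of the accumulation experiment; the linear "model" and
# random batches stand in for the real pipeline.
torch.manual_seed(0)
model = torch.nn.Linear(8, 2)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

def train(accum_steps: int, n_examples: int = 32) -> float:
    """One pass with a given gradient-accumulation factor.

    accum_steps=1 means a noisy update after every example (what avoided
    mode collapse for us); larger values average gradients first."""
    opt.zero_grad()
    last_loss = 0.0
    for i in range(n_examples):
        x = torch.randn(1, 8)
        y = torch.randint(0, 2, (1,))
        loss = torch.nn.functional.cross_entropy(model(x), y) / accum_steps
        loss.backward()                  # gradients accumulate across calls
        last_loss = loss.item() * accum_steps
        if (i + 1) % accum_steps == 0:
            opt.step()                   # update every accum_steps examples
            opt.zero_grad()
    return last_loss

train(accum_steps=1)   # per-example updates: noisy but robust for small models
train(accum_steps=8)   # smoother averaged gradients: invited shortcut learning
```

The only difference between the two regimes is how many `backward()` calls share one `step()`; everything else is held constant, which is what made the comparison clean.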
So we went back to figure out why MLX had failed. We traced gradient flow layer by layer and found the problem: 23 of 24 layers were dead. MLX’s Falcon implementation had a bug in which the SSM computation blocked autodiff. The model was running inference through all layers correctly but only training the first one. Every prior “result” from Falcon on MLX was essentially a 1-layer model.
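The trace itself is simple enough to sketch. This is a hedged reconstruction, not the real Falcon graph: a toy stack of linear layers with a `BuggyBlock` standing in for the SSM op that detached its output, and a per-layer gradient check that flags layers the backward pass never reached.

```python
import torch

# Toy stand-in for the diagnostic: BuggyBlock mimics the MLX bug, where the
# forward result is numerically correct but the output is detached, so
# backward stops at that point in the graph.
torch.manual_seed(0)

class BuggyBlock(torch.nn.Module):
    def __init__(self, inner):
        super().__init__()
        self.inner = inner
    def forward(self, x):
        return self.inner(x).detach()   # correct values, severed autodiff

layers = [torch.nn.Linear(16, 16) for _ in range(4)]
layers[1] = BuggyBlock(layers[1])       # break autodiff inside layer 1
model = torch.nn.Sequential(*layers)

model(torch.randn(2, 16)).sum().backward()

dead = []
for i, layer in enumerate(model):
    # Layers the backward pass never reached still have grad == None.
    gnorm = sum(0.0 if p.grad is None else p.grad.abs().sum().item()
                for p in layer.parameters())
    if gnorm == 0.0:
        dead.append(i)
print(f"dead layers: {dead}")
```

In this toy the break kills layers 0 and 1 (everything upstream of the detach); in Falcon the same kind of check showed 23 of 24 layers dead while inference through them looked perfectly normal.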
The PyTorch pipeline worked, but MPS was 3-4x slower than MLX due to missing CUDA-specific Mamba optimizations. We had a working model on a slow backend, and a fast backend with a framework bug. We solved this by fixing the bug in MLX, which let us run Falcon there. We likewise worked to normalize chat templates and JSON generation across a number of small models to look for the highest performers. Across all the models (GPT2-124M, Pythia-160M, SmolLM2-135M, Qwen3-0.6B Base and Instruct, Falcon-90M-Inst and LFM2-350M) one emerged as clearly dominant.
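For the normalization step, a sketch of the shape of the shim we mean, under stated assumptions: tokenizer objects follow the Hugging Face interface (`chat_template`, `apply_chat_template`), and the two helper names here are ours, not part of any library.

```python
import json

def to_prompt(tokenizer, messages):
    """Use the model's own chat template when it has one, else a plain
    role-prefixed fallback, so every model sees an equivalent prompt."""
    if getattr(tokenizer, "chat_template", None):
        return tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True)
    turns = "\n".join(f"{m['role']}: {m['content']}" for m in messages)
    return turns + "\nassistant:"

def parse_json(completion):
    """Pull the outermost {...} span from a completion and parse it;
    return None when the model produced no valid JSON."""
    start, end = completion.find("{"), completion.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(completion[start:end + 1])
    except json.JSONDecodeError:
        return None
```

With every model behind the same prompt builder and the same JSON validator, "percent valid JSON" and task accuracy become directly comparable across families.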
LFM2: Beautiful Probes, Mediocre Generation
LiquidAI’s LFM2 family — another hybrid SSM+attention design — showed extraordinary promise on layer probes. We ran four-skill probes (triple classification, decision, evidence selection, revision) across six LFM2 models and three Qwen3 models for comparison.
The results were unambiguous. LFM2-350M (354M parameters) beat Qwen3-0.6B (600M) on triple classification. LFM2-700M (700M) beat Qwen3-4B (4B) — a model with 5.7x more parameters. LFM2-2.6B hit 92.4% on decision probes, the first model to break 90%. The hybrid architecture produced better internal representations at every comparable parameter count.
The probes also showed cleaner skill separation in hybrid models. In the LFM2-350M, four skills peaked at four distinct layers (L7, L9, L10, L12) across only 16 layers. Qwen3 models clustered all skills in the same mid-network region. This mattered because our original plan was multi-zone LoRA training — targeting different skills at different layers — and that requires the skills to actually live in different places.
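A per-layer probe of this kind reduces to fitting a small linear classifier on each layer's hidden states and scoring it. This is a minimal sketch with synthetic activations, the "decision" signal planted at layer 9 so the sweep has something to find; the real probes ran on actual model states, and the least-squares probe and pairwise AUC here are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, n_layers = 200, 32, 16

# Fabricated hidden states: pure noise everywhere except layer 9, where a
# linearly separable direction for the toy "decision" label is injected.
y = rng.integers(0, 2, n)
states = [rng.normal(size=(n, d)) for _ in range(n_layers)]
states[9][:, 0] += 3.0 * (2 * y - 1)

def probe_auc(X, y):
    """Fit a least-squares linear probe and score it by pairwise AUC."""
    Xb = np.hstack([X, np.ones((len(X), 1))])       # bias column
    w = np.linalg.lstsq(Xb, 2.0 * y - 1.0, rcond=None)[0]
    s = Xb @ w
    pos, neg = s[y == 1], s[y == 0]
    return (pos[:, None] > neg[None, :]).mean()

aucs = [probe_auc(X, y) for X in states]
best = int(np.argmax(aucs))
print(f"decision skill peaks at layer {best} (AUC {aucs[best]:.2f})")
```

Running one such sweep per skill gives the per-layer peaks described above; when the peaks land at distinct layers, multi-zone LoRA targeting has somewhere to aim.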
We moved forward with LFM2 and ran the full pipeline: zero-shot baselines, Stage 1 LoRA training on triple classification, generation quality testing. LFM2-2.6B hit 84.7% decision accuracy zero-shot with 100% valid JSON. Then we tried to push further.
Evidence F1 wouldn’t budge. We tried everything: weight mask ratios shifted from 70/20/10 to 90/8/2, two prompt formats (structured metadata vs plain text), parallel training stages, evidence-focused and decision-focused loss weighting. Every configuration converged to the same band: 0.47-0.53 evidence F1, roughly 88% decision accuracy. The model could make correct decisions but couldn’t cite the right triples in its generated output. In response we evaluated another round of model families (Gemma 3, Granite), with similar or worse results.
This ceiling was model-independent: it appeared identically across prompt formats, weight configurations, and training stages. It was an autoregressive generation bottleneck: a model can encode correct evidence discrimination in its hidden states (AUC 0.95+ on probes) yet be unable to express that knowledge through token-by-token text generation. The model knew which triples mattered. It couldn’t say which triples mattered.
Qwen3.5 Arrives
While the LFM2 ceiling was becoming clear, a remarkable team of engineers landed Qwen3.5 in MLX with native support. Another hybrid architecture (linear attention layers interleaved with full attention every fourth layer), but with the advantage of a mature ecosystem: no framework bugs, no missing optimizations, direct safetensor loading.
Four models (0.8B through 9B) baselined in a single session. The 4B hit 84% decision accuracy zero-shot, matching LFM2-2.6B. The 9B reached 90.7%. Layer probes showed decision signal at 92-93% — right in LFM2’s range but in a model with production-ready MLX support. And a separate extraction comparison had already established Qwen3-4B-Instruct as the best triple extractor across nine models including Claude Sonnet and GPT-OSS-120B, beating models 15-30x its size on semantic fidelity.
With the range of sizes available in the Qwen3.5 family, potentially aligning to the various agent working zones, this became our go-forward target family. The probes said the information was there. The generation ceiling said we couldn’t get it out through text. The next question was whether we could get it out some other way.
Next: the evidence F1 ceiling forces an architectural pivot — from training the model to generate answers, to reading answers from what it already knows.