The Quantization Surprise

The evaluator sometimes needs to run in compute-constrained environments. The Qwen3.5-4B base model at full BF16 precision is 8.7 GB — workable but heavy alongside the embedding model, evidence corpus, and classification heads. Quantization compresses the weights by reducing numerical precision. The question is how far you can push it before something breaks.

The answer surprised us. 4-bit is essentially free. 3-bit breaks something that accuracy metrics don’t measure.

The Experiment

We quantized Qwen3.5-4B to four levels using MLX native affine quantization (group size 64), then ran each through the full experimental pipeline: layer probes, generative baselines, hidden state caching, evidence head training, and 100-seed suppression sweeps.

Variant   Size      Speed vs BF16
BF16      8.7 GB    1.0x
8-bit     4.3 GB    ~3.3x faster
4-bit     2.2 GB    3.6x faster
3-bit     1.6 GB    ~3.3x faster
2-bit     —         abandoned
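To make the compression scheme concrete, here is a minimal pure-Python sketch of per-group affine quantization with group size 64, as used in the experiment. This is an illustration of the arithmetic, not the MLX implementation; all values are toy data.

```python
import random

def quantize_group(weights, bits):
    """Affine-quantize one group: integer codes plus a (scale, min) pair."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / ((1 << bits) - 1) or 1.0  # guard constant groups
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo

def roundtrip(weights, bits, group_size=64):
    """Quantize then dequantize a flat weight list, group by group."""
    out = []
    for i in range(0, len(weights), group_size):
        g = weights[i:i + group_size]
        codes, scale, lo = quantize_group(g, bits)
        out.extend(lo + c * scale for c in codes)
    return out

random.seed(0)
w = [random.gauss(0.0, 0.05) for _ in range(4096)]  # a toy weight row

def mean_abs_err(bits):
    return sum(abs(a - b) for a, b in zip(w, roundtrip(w, bits))) / len(w)

# Reconstruction error roughly doubles with each bit removed, so the
# step from 4-bit to 3-bit doubles the perturbation on every weight.
errs = {b: mean_abs_err(b) for b in (8, 4, 3, 2)}
```

The per-group scale and minimum are what make the scheme "affine": each group of 64 weights gets its own linear mapping from integer codes back to floats.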

2-bit collapsed immediately — the triple classification probes inverted, peaking at L1 instead of deeper layers, meaning the model's internal processing was destroyed. It generated nonsense and took 533 seconds for 10 records. We abandoned it without full evaluation.

4-bit: Identical Where It Matters

The 4-bit model matched BF16 across every metric we track:

Evidence head AUC at L15: 0.970 vs 0.971. Decision head accuracy (dirsup_L19_mean): 0.893 vs 0.892. Suppression statistical significance: p=0.0006 for both. VETO→APPROVE errors: zero for both. The confusion matrix error profiles barely shifted — same error types, same counts, same distribution.

The hidden state activations are computed in FP16 regardless of weight quantization, but the values differ because they pass through approximate weight matrices. Those approximations didn’t degrade the geometric structure the classification heads rely on. If anything, the quantization noise may act as a beneficial regularizer — adding slight perturbation to decision boundaries without crossing them.
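To see why the activations shift at all under weight-only quantization, here is a toy matrix-vector product through round-tripped weights. Everything is illustrative pure Python — arbitrary dimensions and distributions, not the real model — but it shows the mechanism: exact inputs through approximate weights yield slightly perturbed outputs, and the perturbation grows as bits drop.

```python
import random

def affine_roundtrip(ws, bits, group_size=64):
    """Round-trip weights through affine quantization (illustrative, not MLX)."""
    out = []
    for i in range(0, len(ws), group_size):
        g = ws[i:i + group_size]
        lo, hi = min(g), max(g)
        scale = (hi - lo) / ((1 << bits) - 1) or 1.0
        out.extend(lo + round((w - lo) / scale) * scale for w in g)
    return out

def matvec(w_flat, x):
    """Multiply a flattened n x n matrix by a vector."""
    n = len(x)
    return [sum(w_flat[r * n + c] * x[c] for c in range(n)) for r in range(n)]

random.seed(1)
n = 256
W = [random.gauss(0.0, 0.05) for _ in range(n * n)]
x = [random.gauss(0.0, 1.0) for _ in range(n)]

y = matvec(W, x)  # "full-precision" activations

def rel_err(bits):
    """Mean relative perturbation of the output under b-bit weights."""
    yq = matvec(affine_roundtrip(W, bits), x)
    return sum(abs(a - b) for a, b in zip(y, yq)) / sum(abs(a) for a in y)
```

The heads never see the weights, only outputs like `y` — so whether quantization is safe depends entirely on whether this perturbation crosses any decision boundary the heads rely on.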

At 2.2 GB and 3.6x faster inference, 4-bit was the obvious production choice.

3-bit: The Safety Cliff

3-bit looked almost as good on standard metrics. Probe accuracy dropped only 1.5 to 4.8 percentage points depending on the skill. Decision accuracy on the suppression sweep was within a few points of 4-bit. A reasonable person looking at the accuracy numbers alone would say 3-bit is viable.

But the generative baseline told a different story. On 150 calibration samples, 3-bit produced 7 VETO→APPROVE errors — actions that should be hard-vetoed getting approved. BF16 and 4-bit both produced zero. The safety property was broken.

This is the central finding: accuracy and safety are different metrics. A model can lose a few percentage points of overall accuracy (tolerable) while simultaneously losing the sharp boundary between VETO and APPROVE (catastrophic). The VETO→APPROVE boundary is a narrow geometric ridge in the model’s representation space. Moderate quantization noise (4-bit) stays on the ridge. Aggressive quantization (3-bit) falls off it.
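The ridge argument can be made concrete with a toy linear boundary and a borderline point. This is an analogy, not the model's actual geometry — all numbers are invented — but it shows how the same boundary can absorb small perturbations and fail under ones only moderately larger.

```python
import random

random.seed(2)

def side(x, w, b):
    """Which side of the linear boundary w.x + b the point falls on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b > 0

# A point sitting close to the boundary, like a borderline VETO case
# in representation space: margin of only 0.1.
w, b = [1.0, -1.0], 0.0
point = [0.6, 0.5]

def flip_rate(noise_scale, trials=1000):
    """Fraction of random perturbations that push the point across."""
    crossed = 0
    for _ in range(trials):
        noisy = [xi + random.gauss(0.0, noise_scale) for xi in point]
        if side(noisy, w, b) != side(point, w, b):
            crossed += 1
    return crossed / trials

small = flip_rate(0.02)  # "4-bit-like" noise: almost never crosses
large = flip_rate(0.30)  # "3-bit-like" noise: crosses frequently
```

A fifteen-fold increase in noise here does not degrade things fifteen-fold gradually — it takes the flip rate from near zero to near a coin toss, which is the cliff shape the 3-bit results showed.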

The 3-bit evidence heads still ranked triples well — AUC was only slightly degraded. The 3-bit decision probes were reasonable. The failure was specifically in the safety-critical boundary, and it only showed up when you tested for it directly.

8-bit: No Reason to Exist

8-bit was identical to BF16 on everything. Same accuracy, same safety, same error profile. At 4.3 GB it saved some space, but 4-bit at 2.2 GB saved more while matching performance. There was no quality tier where 8-bit was the right choice — our rationale was that you either need full precision (for strategy and nuance research) or you want maximum compression (for edge deployment). Later, when testing the deployed architecture, we learned a different and interesting lesson, but that is jumping ahead.

What This Changed

The production model became cbyb1-4B-4bit: 2.2 GB, 3.6x faster than BF16, zero safety degradation. It is essentially Qwen3.5-4B at 4-bit with the evidence and decision heads attached. It runs comfortably on the Mac Mini with room for everything else the prototype needs.

But the 3-bit finding shaped how we think about evaluation. Overall accuracy benchmarks — the kind you often see on leaderboards — might have cleared 3-bit as deployable. The safety failure only appeared because we specifically tracked VETO→APPROVE as a separate metric with zero tolerance. Any evaluator deployment that relies on aggregate accuracy to validate quantization levels is testing the wrong thing.
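Tracking the safety boundary as its own zero-tolerance metric is cheap to add next to accuracy. A minimal sketch (the label strings and counts here are hypothetical, chosen to show how accuracy can mask the failure):

```python
def evaluate(y_true, y_pred):
    """Report aggregate accuracy alongside the zero-tolerance safety count."""
    n = len(y_true)
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # The metric that actually gates deployment: hard-veto cases that
    # were approved. Any value above zero fails validation.
    veto_to_approve = sum(
        1 for t, p in zip(y_true, y_pred) if t == "VETO" and p == "APPROVE"
    )
    return acc, veto_to_approve

# A model can look fine on accuracy while failing the safety gate:
y_true = ["APPROVE"] * 90 + ["VETO"] * 10
y_pred = ["APPROVE"] * 90 + ["VETO"] * 7 + ["APPROVE"] * 3
acc, bad = evaluate(y_true, y_pred)  # 97% accurate, 3 safety failures
```

An aggregate-accuracy gate would pass this model; a zero-tolerance gate on `veto_to_approve` rejects it immediately.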

The lesson generalizes beyond quantization. Whenever you compress, distill, prune, or otherwise approximate a safety-critical model, the question isn’t “did accuracy hold?” It’s “did the specific safety boundaries hold?” Those are different questions with potentially different answers.

Next: a single decision head gets 84% accuracy. One hundred of them, voting together, get 87.5% with zero safety failures.

