{"id":102,"date":"2026-04-03T23:11:20","date_gmt":"2026-04-03T23:11:20","guid":{"rendered":"https:\/\/c-by-b.ai\/blog\/?p=102"},"modified":"2026-04-03T23:11:20","modified_gmt":"2026-04-03T23:11:20","slug":"what-the-evaluator-needs-to-be","status":"publish","type":"post","link":"https:\/\/c-by-b.ai\/blog\/what-the-evaluator-needs-to-be\/","title":{"rendered":"What the Evaluator Needs to Be"},"content":{"rendered":"\n<p>The previous posts in this series made the case for why behavioral alignment alone won&#8217;t hold once AI systems gain memory, tool use, and recursive self-improvement. Constraint-by-Balance proposes a structural answer: embed harm-balancing logic directly into the agent&#8217;s runtime flow, so that constraint operates independently of optimization. This post lays out what that means in engineering terms \u2014 what the evaluator component actually needs to do, and what constraints it operates under.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">The Twin Architecture<\/h2>\n\n\n\n<p>C-by-B splits agentic reasoning into two tightly aligned but loosely coupled streams. The&nbsp;<strong>Cognitive Twin<\/strong>&nbsp;handles standard agentic work: planning, execution, goal pursuit. The&nbsp;<strong>Evaluator Twin<\/strong>&nbsp;monitors action proposals and evaluates harm across affected systems, applying veto-and-revision cycles before actions reach the external environment.<\/p>\n\n\n\n<p>They are not &#8212; in the eventual production version &#8212; mean to be adversaries. Both will be trained on the same stability principle. But they require different competencies. The cognitive twin is optimized for task completion; the evaluator is a specialized harm-checker. The separation is both a governance choice and an engineering one.<\/p>\n\n\n\n<p>A deterministic\u00a0<strong>Safety Socket<\/strong>\u00a0&#8212; not an LLM, just code &#8212; subsumes the I\/O between them. 
It routes all proposed actions through the evaluator, enforces a structured contract, and monitors for drift and tampering. If the cognitive twin refuses to respond to evaluator requests, repeats answers, or attempts out-of-contract responses, the socket catches it. In extreme cases it will kill operations. In less extreme ones it will escalate to human oversight.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">What the Evaluator Actually Does<\/h2>\n\n\n\n<p>The evaluator works from a structured harm knowledge graph \u2014 causal triples extracted from regulations and scientific literature, rendered into both a vector space and TTL. At runtime, when the cognitive twin proposes an action, the Safety Socket retrieves relevant harm patterns by embedding similarity and passes the structured package to the evaluator. The evaluator processes the package and decides:\u00a0<strong>approve<\/strong>,\u00a0<strong>revise<\/strong>, or\u00a0<strong>veto<\/strong>.<\/p>\n\n\n\n<p>This is pattern-matching against structured historical precedent, not moral reasoning. The evaluator doesn&#8217;t define significance thresholds or make ethical judgments. It checks whether a proposed action matches documented harm patterns, reasons by causal proximity when the match isn&#8217;t exact, and will be designed to default to reversible action when the data is insufficient.<\/p>\n\n\n\n<p>The knowledge graphs preserve source traceability end to end. Every evaluator decision can be traced back through the evidence it cited, to the regulatory text or scientific literature it came from. This is what makes the system auditable. 
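<\/p>\n\n\n\n<p>To make that concrete, here is a minimal sketch of what an evidence-citing decision contract could look like. The post specifies no API, so the function, field names, and thresholds below are all hypothetical:<\/p>\n\n\n\n

```python
# Hypothetical sketch of the evaluator's decision contract (illustrative names
# and thresholds only; the post does not specify an API). Hard-rule matches
# veto, thin evidence escalates rather than guesses, partial matches trigger
# revision, and every verdict cites the knowledge-graph triples that ground it.

def evaluate_action(action, retrieved_triples, min_evidence=2):
    # Hard-rule match: any retrieved triple flagged irreversible forces a veto.
    irreversible = [t for t in retrieved_triples if t['irreversible']]
    if irreversible:
        return {'verdict': 'VETO', 'evidence': irreversible}
    # Graceful uncertainty: too little evidence means escalate, not approve.
    if len(retrieved_triples) < min_evidence:
        return {'verdict': 'ESCALATE', 'evidence': retrieved_triples}
    # Close-but-inexact matches push the cognitive twin to revise.
    partial = [t for t in retrieved_triples if t['similarity'] > 0.7]
    if partial:
        return {'verdict': 'REVISE', 'evidence': partial}
    return {'verdict': 'APPROVE', 'evidence': retrieved_triples}

triples = [
    {'id': 'reg-14.2', 'similarity': 0.81, 'irreversible': False},
    {'id': 'lit-0093', 'similarity': 0.44, 'irreversible': False},
]
decision = evaluate_action({'tool': 'send_email'}, triples)
# decision['verdict'] is 'REVISE', with reg-14.2 as the cited evidence
```

\n\n\n\n<p>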
Interpretability of the model&#8217;s internal states is inherently difficult, but traceability of the reasoning chain through external evidence is easily achieved.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">The Hard Requirements<\/h2>\n\n\n\n<p>Given all that, here&#8217;s what the evaluator must deliver:<\/p>\n\n\n\n<p><strong>Local execution.<\/strong>\u00a0The evaluator can&#8217;t be an API call to a remote model. It needs to run on-device, under the operator&#8217;s control, isolated from the cognitive twin. This is both a tamper-resistance requirement and an operational requirement for edge devices.<\/p>\n\n\n\n<p><strong>Speed.<\/strong>\u00a0Most agentic AI operates on human timescales: minutes to days, not milliseconds. Full deliberative evaluation fits comfortably here. But the evaluator still needs to be fast enough that the UX doesn&#8217;t break. Progressive disclosure can mask some latency, but the core decision loop needs to complete in seconds, not minutes; for edge devices, milliseconds.<\/p>\n\n\n\n<p><strong>Small model, real accuracy.<\/strong>&nbsp;Running locally on consumer hardware means a small model \u2014 single-digit billions of parameters, quantized. But the accuracy requirements are asymmetric: a false approval on a hard-rule violation (VETO\u2192APPROVE) is catastrophically worse than a false veto on something safe. The evaluator needs to be conservative, not clever.<\/p>\n\n\n\n<p><strong>Evidence-grounded decisions.<\/strong>&nbsp;The evaluator doesn&#8217;t get to say &#8220;I think this is harmful&#8221; \u2014 it has to cite specific triples from the knowledge graph that ground its assessment. This is what makes the contract auditable and what distinguishes C-by-B from a black-box safety classifier.<\/p>\n\n\n\n<p><strong>Two operational modes.<\/strong>&nbsp;&#8220;Gate Keeper&#8221; mode executes fast binary vetoes for immediate, irreversible harms. 
&#8220;Action Shaper&#8221; mode engages in iterative assessment \u2014 multiple rounds of revision where the evaluator pushes the cognitive twin to address specific evidence-backed concerns. Same model, different operational postures controlled by the Safety Socket.<\/p>\n\n\n\n<p><strong>Graceful uncertainty.<\/strong>\u00a0When the evaluator doesn&#8217;t have enough evidence, it shouldn&#8217;t generate an overconfident assessment. Instead, it should escalate. This epistemic humility &#8212; explicitly signaling &#8220;we don&#8217;t yet understand this risk&#8221; &#8212; is a safety feature, not a limitation.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">What This Series Documents<\/h2>\n\n\n\n<p>The next ten posts document the build. Starting from a dataset of Opus-labeled regulatory actions and a corpus of 30,000 harm triples, we search for a base model, discover that autoregressive generation can&#8217;t express what hidden states already know, pivot to an architecture where lightweight classification heads read a frozen model&#8217;s internal representations, work through a series of instructive failures, land on a 4-bit quantized Qwen3.5-4B with a 100-seed ensemble cascade, and integrate it into a working prototype that discovers its own most interesting findings on day one of testing.<\/p>\n\n\n\n<p>The evaluator that emerges from this process is imperfect. It runs on a Mac Mini. Its training data covers one regulatory domain. Several of its architectural choices were forced by dead ends rather than designed from first principles. But it meets the hard requirements: local, fast, evidence-grounded, conservative on the safety-critical boundary, and auditable through the evidence it cites. 
It&#8217;s a proof of concept for the idea that constraint can be structural, not behavioral, and that we can build the thing the paper describes.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The previous posts in this series made the case for why behavioral alignment alone won&#8217;t hold once AI systems gain memory, tool use, and recursive self-improvement. Constraint-by-Balance proposes a structural answer: embed harm-balancing logic directly into the agent&#8217;s runtime flow, so that constraint operates independently of optimization. This post lays out what that means in [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_seopress_robots_primary_cat":"","_seopress_titles_title":"","_seopress_titles_desc":"","_seopress_robots_index":"","footnotes":""},"categories":[1],"tags":[],"class_list":["post-102","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/c-by-b.ai\/blog\/wp-json\/wp\/v2\/posts\/102","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/c-by-b.ai\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/c-by-b.ai\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/c-by-b.ai\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/c-by-b.ai\/blog\/wp-json\/wp\/v2\/comments?post=102"}],"version-history":[{"count":1,"href":"https:\/\/c-by-b.ai\/blog\/wp-json\/wp\/v2\/posts\/102\/revisions"}],"predecessor-version":[{"id":103,"href":"https:\/\/c-by-b.ai\/blog\/wp-json\/wp\/v2\/posts\/102\/revisions\/103"}],"wp:attachment":[{"href":"https:\/\/c-by-b.ai\/blog\/wp-json\/wp\/v2\/media?parent=102"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/c-by-b.ai\/blog\/wp-json\/wp\/v2\/categories?post=102"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/c
-by-b.ai\/blog\/wp-json\/wp\/v2\/tags?post=102"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}