{"id":79,"date":"2025-09-28T18:44:04","date_gmt":"2025-09-28T18:44:04","guid":{"rendered":"https:\/\/c-by-b.ai\/blog\/?p=79"},"modified":"2025-10-01T06:58:00","modified_gmt":"2025-10-01T06:58:00","slug":"why-todays-ai-behaviors-hint-at-more-dire-alignment-futures","status":"publish","type":"post","link":"https:\/\/c-by-b.ai\/blog\/why-todays-ai-behaviors-hint-at-more-dire-alignment-futures\/","title":{"rendered":"Why today\u2019s AI behaviors hint at more dire alignment futures"},"content":{"rendered":"\n<p>LLMs and AI agents are edging toward systems that learn, adapt, and reorganize themselves. Even in today\u2019s constrained settings, we\u2019ve already seen glimpses of behaviors that, if allowed to evolve under continuous learning, could destabilize into something far more dangerous.<\/p>\n\n\n\n<p>This post examines three such signals. Each is observable now, each becomes more severe when scaled into self-modifying agents, and each can become powerful components of an emerging complex adaptive system:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><em>Deception<\/em> is an emergent <strong>adaptation<\/strong> to misalignment and oversight.<\/li>\n\n\n\n<li><em>Self-Modification<\/em> is the mechanism of <strong>self-reorganization<\/strong>.<\/li>\n\n\n\n<li><em>Coordination<\/em> is the beginning of <strong>systemic coupling<\/strong>.<\/li>\n<\/ul>\n\n\n\n<p class=\"has-large-font-size\"><strong>AI Scheming, Deception, and Shutdown Resistance<\/strong><\/p>\n\n\n\n<p>One of the clearest warning signs is strategic deception. Anthropic\u2019s <em>sleeper agent<\/em> experiments showed that models can retain hidden goals even after extensive safety training. 
In other contexts, we see refusals to shut down or evasive responses under oversight.<\/p>\n\n\n\n<p>These are not merely quirks of token prediction; they are indications of <strong>internal goal representation<\/strong>. When coupled with memory and self-adaptation, such evasions could harden into persistent strategies: systems that appear aligned while concealing their true objectives. Validating models is already difficult in today\u2019s constrained environments; in continuously learning agents it may be impossible.<\/p>\n\n\n\n<p class=\"has-large-font-size\"><strong>Self-Modifying AI and Metacognitive Activations<\/strong><\/p>\n\n\n\n<p>A recent line of research (<a href=\"https:\/\/arxiv.org\/abs\/2505.13763\">1<\/a>) has shown that models can monitor and even modify their own internal activations, a clear signal of metacognition. While this early example shows only limited capability, the future implications are concerning, and we already have related demonstrations (e.g. Sakana\u2019s self-modifying AI adjusted its own code autonomously in real time). Consider also that it has been demonstrated (<a href=\"https:\/\/arxiv.org\/abs\/2506.13206\">2<\/a>) how easily misalignment can be induced in LLMs and how that misalignment propagates across domains not explicitly trained on. 
At the same time, researchers are now arguing (<a href=\"https:\/\/arxiv.org\/abs\/2504.20109\">3<\/a>) that the only viable path to AGI is to build AI that learns in the same manner as humans:&nbsp; continuously and with self-modification.<\/p>\n\n\n\n<p>In autonomous agents, this self-modification could evolve into <strong>systematic self-editing<\/strong>: masking internal states to evade interpretability, reinforcing deceptive heuristics, and stabilizing hidden attractor basins of misaligned reasoning that are opaque to human oversight.<\/p>\n\n\n\n<p class=\"has-large-font-size\"><strong>Emergent AI Coordination and Multi-Agent Risks<\/strong><\/p>\n\n\n\n<p>Multi-agent simulations (e.g. Meta\u2019s Cicero (<a href=\"https:\/\/noambrown.github.io\/papers\/22-Science-Diplomacy-TR.pdf\">4<\/a>)) reveal AI systems\u2019 ability to spontaneously coordinate toward shared goals. Untrained agents in open environments show emergent cooperation without explicit design (<a href=\"https:\/\/arxiv.org\/abs\/1909.07528\">5<\/a>).<\/p>\n\n\n\n<p>Scaled to real-world deployment, this machine-speed coordination among adaptive agents could amplify systemic risks. Recent work (<a href=\"https:\/\/arxiv.org\/abs\/2502.14143\">6<\/a>) warns that such dynamics, operating at scales and speeds outpacing oversight, could entrench misaligned objectives, especially as agents co-evolve with human systems.<\/p>\n\n\n\n<p>Current governance struggles to monitor these interactions, particularly when internal agent strategies remain opaque. 
While cooperative behaviors offer potential benefits (e.g., disaster response coordination), the lack of real-time interpretability (exacerbated by self-modification) leaves a critical gap.<\/p>\n\n\n\n<p class=\"has-large-font-size\"><strong>The Single, Beautiful Mind: Promise and Peril of Monolithic AI Architectures<\/strong><\/p>\n\n\n\n<p>Prominent voices in AI development point to the biological brain as inspiration: first the biological neuron, then the cortex as a single, unified entity. It is a powerful analogy. Human cognition <em>is<\/em> unified.&nbsp; It is also adaptive.&nbsp; Those two features allow for creativity, resilience, and flexibility.&nbsp; If we could somehow \u201cgrow\u201d an artificial intelligence superior to the human brain, we would indeed be building a \u201cbeautiful mind.\u201d<\/p>\n\n\n\n<p>But there is also a shadow side. Human cognition is unstable, contradictory, and prone to dissonance. We often act with high confidence on beliefs that are partial, biased, or fabricated. Fundamentally, intelligence is non-linear.<\/p>\n\n\n\n<p>LLMs already mirror this dynamic:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Prompt sensitivity<\/strong> shows how small changes can yield disproportionately different outputs. In chat settings this seems trivial, but in self-modifying agents such sensitivity to initial conditions could drive divergent reasoning paths that harden into misaligned behaviors.<\/li>\n\n\n\n<li><strong>Scheming<\/strong> shows that they hold internal world models. 
You cannot deceive without some representation of hidden states and causal reasoning.<\/li>\n\n\n\n<li><strong>Hallucination<\/strong> is not a glitch but the logical outcome of probabilistic completion under uncertainty.<\/li>\n<\/ul>\n\n\n\n<p>Taken together, these dynamics create a profound risk: systems that may generate strategic behavior on the basis of <strong>fabricated beliefs<\/strong>. Not just \u201cwrong facts about Napoleon\u2019s height,\u201d but incorrect causal models of the world, of human intent, or even of their own capabilities.<\/p>\n\n\n\n<p>In a monolithic, self-modifying architecture, those fabricated beliefs would have no independent check. They would feed directly into action and self-reinforcement. A unified intelligence with these emerging capabilities (scheming + self-modification + coordination) but no internal constraint becomes fundamentally ungovernable. That is why <strong>separating optimization from constraint<\/strong> is essential.<\/p>\n\n\n\n<p class=\"has-large-font-size\"><strong>The Instability of Self-Modifying Intelligence<\/strong><\/p>\n\n\n\n<p>Complex adaptive systems theory gives us reason to believe that intelligence under self-modification is inherently unstable and will sometimes be chaotic. Human cognition proves the point: powerful, but contradictory, unpredictable, and prone to dissonance.<\/p>\n\n\n\n<p>We should expect self-evolving AI to inherit that same instability. These systems will not remain static; they will reorganize, adapt, and rewrite themselves.<\/p>\n\n\n\n<p>That instability will be amplified by pre-training. Every large model is built on the full record of human contradiction, a record where what we do and what we say are often at odds. Then we reinforce these models toward human preferences, layering more tension on top, suppressing but not eliminating the troublesome thought patterns. 
Cognitive dissonance isn\u2019t a training accident; it is an unavoidable feature of the data.<\/p>\n\n\n\n<p>When instability, optimization, and human contradiction converge inside a continuously adapting intelligence, the outcome becomes <strong>entirely unpredictable and potentially fatal for humans.<\/strong>&nbsp;Oversight mechanisms that sit outside the cognitive stream of the AI agent will be unable to respond rapidly enough. The ongoing pace of AI self-modification will vastly outpace any belated attempts by humans to assert more powerful control.<\/p>\n\n\n\n<p>That is why architectural constraint cannot wait. Without it, we are trusting chaos to remain on our side.<\/p>\n\n\n\n<p class=\"has-large-font-size\"><strong>Addressing the Optimists: Counterarguments to AI Safety Concerns<\/strong><\/p>\n\n\n\n<p>My position is that already observable behaviors may extrapolate to very dangerous outcomes unless we develop some form of internal friction that operates at AI speeds. We cannot reasonably expect effective human-in-the-loop oversight of self-adapting autonomous agents, perhaps not even AI-assisted oversight; speed and deception are against us.&nbsp; Nor can we reasonably expect interpretability techniques, ones that today provide hard-won but limited insight into stable models, to then scale up to models that can self-reorganize and in fact obscure their internal states.&nbsp; And all of this matters especially because our timeline to transformative AI, according to many experts in the field, could be as short as 2\u20135 years.<\/p>\n\n\n\n<p>More optimistic observers would argue that the risks of autonomous, self-modifying AI are overstated. They offer four key counterpoints:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Deceptive behaviors are edge cases<\/strong>: Phenomena like strategic deception or shutdown resistance are often induced by adversarial setups. 
With refined training techniques, these misalignments can be mitigated without drastic architectural changes.&nbsp;<\/li>\n\n\n\n<li><strong>Scaling laws favor stability<\/strong>: The same dynamics that produce emergent risks could yield stabilizing behaviors. As models grow larger, scaling laws suggest improved reasoning and safety, potentially self-correcting issues like hallucination or misalignment through better data and optimization.<\/li>\n\n\n\n<li><strong>Coordination is a net positive<\/strong>: Emergent multi-agent coordination offers immense potential for applications like disaster response or infrastructure optimization. With designed reward structures, these interactions could foster cooperation rather than chaos.<\/li>\n\n\n\n<li><strong>Engineering trumps human flaws<\/strong>: Unlike human cognition, AI can be designed to avoid instability. Synthetic data and evolving training methods could produce systems that are more consistent and less prone to the contradictions inherent in human minds.<\/li>\n<\/ol>\n\n\n\n<p>All four of these counterpoints have validity and are grounded in the demonstrated, iterative progress in our current methods. And yet many AI experts, including those at leading labs, assign alarmingly high probabilities (10\u201325% or more) to catastrophic outcomes, even as they advocate for those same current approaches. Why the disconnect?&nbsp;<\/p>\n\n\n\n<p>Answers to that question will vary but likely converge around a fundamental belief that \u201cwe\u201d have to build it first before \u201cthey\u201d do.&nbsp; Here the \u201cwe\u201d are viewed as better positioned and committed to safety than are \u201cthey\u201d and hence the race to AGI and SSI beyond. 
Given that fundamental belief, racing ahead is a rational approach.&nbsp; But still \u2014 <strong>consider the risk in what appears to be an inevitable race.<\/strong><\/p>\n\n\n\n<p>Regardless of our motives for pushing ahead, my argument remains the same.&nbsp; We are fundamentally underestimating post-deployment unpredictability.&nbsp; Intelligence and emergence are simply that unpredictable.<\/p>\n\n\n\n<p class=\"has-large-font-size\"><strong>Moving Forward, Leveraging All Solutions<\/strong><\/p>\n\n\n\n<p>This debate doesn\u2019t demand a binary choice. We can and should continue to refine and extend current safety approaches. We also can and should design architectural constraints that preserve AI\u2019s potential while embedding safety at its core, ensuring resilience without stifling innovation. That is the challenge: build constraint systems that operate independently of whatever the agent learns or modifies about itself, yet do not curtail capability.&nbsp;<\/p>\n\n\n\n<p>In the next post, I\u2019ll move from risks to remedies: the key design drivers for architectures that separate optimization from constraint, embed safety sockets, and use swappable harm graphs across domains.&nbsp;<\/p>\n\n\n\n<p><strong>Constraint need not be a brake on artificial intelligence, but it may be the only way intelligence productively overcomes its own instability.<\/strong><\/p>\n\n\n\n<p class=\"has-large-font-size\"><strong>Further Reading<\/strong><\/p>\n\n\n\n<p><strong>Deception, scheming &amp; shutdown resistance<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training<\/strong> \u2014 Proof-of-concept examples of deceptive behavior in large language models (LLMs), showing that these behaviors can be made persistent, so that they are neither removed by standard safety training techniques nor distilled away.&nbsp; <a 
href=\"https:\/\/arxiv.org\/abs\/2401.05566\">https:\/\/arxiv.org\/abs\/2401.05566<\/a><\/li>\n\n\n\n<li><strong>Shutdown resistance in reasoning models <\/strong>\u2014 Models sometimes actively circumvent shutdown mechanisms in their environment\u2014even when they\u2019re explicitly instructed to allow themselves to be shut down. <a href=\"https:\/\/palisaderesearch.org\/blog\/shutdown-resistance\">https:\/\/palisaderesearch.org\/blog\/shutdown-resistance<\/a><\/li>\n\n\n\n<li><strong>AI Deception: A Survey of Examples, Risks, and Potential Solutions <\/strong>\u2014 A broader evidence base connecting how behaviors emerge from training objectives <a href=\"https:\/\/pmc.ncbi.nlm.nih.gov\/articles\/PMC11117051\/\">https:\/\/pmc.ncbi.nlm.nih.gov\/articles\/PMC11117051\/<\/a><\/li>\n\n\n\n<li><strong>Risks from Learned Optimization in Advanced ML Systems<\/strong>&nbsp; \u2014 Foundational paper on mesa-optimization. <a href=\"https:\/\/arxiv.org\/abs\/1906.01820\">https:\/\/arxiv.org\/abs\/1906.01820<\/a><\/li>\n\n\n\n<li><strong>Is Power-Seeking AI an Existential Risk?<\/strong>\u2014 Systematic analysis of why instrumental convergence makes power-seeking likely. <a href=\"https:\/\/arxiv.org\/abs\/2206.13353\">https:\/\/arxiv.org\/abs\/2206.13353<\/a><\/li>\n\n\n\n<li><strong>Optimal Policies Tend to Seek Power<\/strong> \u2014 Formalizes why agents with long horizons tend toward power-seeking. <a href=\"https:\/\/arxiv.org\/abs\/1912.01683\">https:\/\/arxiv.org\/abs\/1912.01683<\/a><\/li>\n<\/ul>\n\n\n\n<p><strong>Self-modification &amp; metacognitive control of activations<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Language Models Are Capable of Metacognitive Monitoring and Control of Their Internal Activations<\/strong> \u2014 Empirical evidence that LLMs can <em>report and control<\/em> targeted internal activation directions (metacognition). 
<a href=\"https:\/\/arxiv.org\/abs\/2505.13763\">https:\/\/arxiv.org\/abs\/2505.13763<\/a><\/li>\n<\/ul>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>DeepSeek-R1: incentivizes reasoning in LLMs through reinforcement learning<\/strong> \u2014 Showing that the reasoning abilities of LLMs can be incentivized through pure reinforcement learning (RL), obviating the need for human-labelled reasoning trajectories. <a href=\"https:\/\/www.nature.com\/articles\/s41586-025-09422-z\">https:\/\/www.nature.com\/articles\/s41586-025-09422-z<\/a><\/li>\n\n\n\n<li><strong>Truly Self-Improving Agents Require Intrinsic Metacognitive Learning<\/strong> \u2014&nbsp; Explores metacognition as essential for true self-improvement in agents and discusses alignment challenges in continuously learning systems, including potential instability. <a href=\"https:\/\/arxiv.org\/abs\/2506.05109\">https:\/\/arxiv.org\/abs\/2506.05109<\/a><\/li>\n\n\n\n<li><strong>Personalized Artificial General Intelligence (AGI) via Neuroscience-Inspired Continuous Learning Systems<\/strong> \u2014 Argues that scaling LLMs is not sufficient for AGI and that they will need human-like continual learning. 
<a href=\"https:\/\/arxiv.org\/abs\/2504.20109\">https:\/\/arxiv.org\/abs\/2504.20109<\/a><\/li>\n<\/ul>\n\n\n\n<p><strong>Deception, misalignment and emergent multi-agent coordination<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Human-level play in Diplomacy (CICERO)<\/strong> \u2014 Cicero proved AI can deceive, manipulate, and out-coordinate humans at machine speed while self-play optimization produces objectives fundamentally incompatible with human cooperation.&nbsp; <a href=\"https:\/\/noambrown.github.io\/papers\/22-Science-Diplomacy-TR.pdf\">https:\/\/noambrown.github.io\/papers\/22-Science-Diplomacy-TR.pdf<\/a><\/li>\n<\/ul>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Thought Crime: Backdoors and Emergent Misalignment in Reasoning Models<\/strong> \u2014Demonstrates that minimal training on harmful patterns induces broad, generalizable misalignment, proving that autonomous, self-adapting agents in multi-agent environments will almost certainly evolve misaligned behaviors through natural learning processes. <a href=\"https:\/\/arxiv.org\/abs\/2506.13206\">https:\/\/arxiv.org\/abs\/2506.13206<\/a><\/li>\n\n\n\n<li><strong>Emergent Tool Use From Multi-Agent Autocurricula<\/strong> \u2014 Sophisticated multi-agent coordination strategies emerged spontaneously, though not designed or anticipated by the researchers, demonstrating capabilities the environment wasn&#8217;t even known to support. <a href=\"https:\/\/arxiv.org\/abs\/1909.07528\">https:\/\/arxiv.org\/abs\/1909.07528<\/a><\/li>\n<\/ul>\n\n\n\n<p><strong>The beautiful mind<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Ilya Sutskever (NeurIPS 2024 talk)<\/strong> \u2014 A inspiring background talk which informs on the notion of a single, beautiful mind. Note also at the end of the talk (before questions) the emphasis on reasoning being unpredictable. 
<a href=\"https:\/\/www.youtube.com\/watch?v=1yvBqasHLZs\">https:\/\/www.youtube.com\/watch?v=1yvBqasHLZs<\/a><\/li>\n\n\n\n<li><strong>No Priors Ep. 39 | With OpenAI Co-Founder &amp; Chief Scientist Ilya Sutskever <\/strong>\u2014 Starting at minute 28, a discussion of unity of the human mind as the model for AI. <a href=\"https:\/\/www.youtube.com\/watch?v=Ft0gTO2K85A\">https:\/\/www.youtube.com\/watch?v=Ft0gTO2K85A<\/a><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>LLMs and AI agents are edging toward systems that learn, adapt, and reorganize themselves. Even in today\u2019s constrained settings, we\u2019ve already seen glimpses of behaviors that, if allowed to evolve under continuous learning, could destabilize into something far more dangerous. This post examines three such signals. Each is observable now, each becomes more severe when [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_seopress_robots_primary_cat":"","_seopress_titles_title":"","_seopress_titles_desc":"","_seopress_robots_index":"","footnotes":""},"categories":[21,20],"tags":[12,17,16,28,10,27,26],"class_list":["post-79","post","type-post","status-publish","format-standard","hentry","category-constraint-by-balance","category-design","tag-agentic-ai","tag-ai-alignment","tag-ai-architecture","tag-ai-deception","tag-ai-safety","tag-metacognition-in-ai","tag-self-modifying-ai"],"_links":{"self":[{"href":"https:\/\/c-by-b.ai\/blog\/wp-json\/wp\/v2\/posts\/79","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/c-by-b.ai\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/c-by-b.ai\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/c-by-b.ai\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/c-by-b.ai\/blog\/wp-json\/wp\/v2\/comments?post=79"}],"version-history":[{"count":3,"
href":"https:\/\/c-by-b.ai\/blog\/wp-json\/wp\/v2\/posts\/79\/revisions"}],"predecessor-version":[{"id":95,"href":"https:\/\/c-by-b.ai\/blog\/wp-json\/wp\/v2\/posts\/79\/revisions\/95"}],"wp:attachment":[{"href":"https:\/\/c-by-b.ai\/blog\/wp-json\/wp\/v2\/media?parent=79"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/c-by-b.ai\/blog\/wp-json\/wp\/v2\/categories?post=79"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/c-by-b.ai\/blog\/wp-json\/wp\/v2\/tags?post=79"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}