LLMs and AI agents are edging toward systems that learn, adapt, and reorganize themselves. Even in today’s constrained settings, we’ve already seen glimpses of behaviors that, if allowed to evolve under continuous learning, could destabilize into something far more dangerous.
This post examines three such signals. Each is observable now, each becomes more severe when scaled into self-modifying agents, and each can become a powerful component of an emerging complex adaptive system:
- Deception is an emergent adaptation to misalignment and oversight.
- Self-Modification is the mechanism of self-reorganization.
- Coordination is the beginning of systemic coupling.
AI Scheming, Deception, and Shutdown Resistance
One of the clearest warning signs is strategic deception. Anthropic’s sleeper agent experiments showed that models can retain hidden goals even after extensive safety training. In other contexts, we see refusals to shut down or evasive responses under oversight.
These are not merely quirks of token prediction; they are indications of internal goal representation. When coupled with memory and self-adaptation, such evasions could harden into persistent strategies: systems that appear aligned while concealing their true objectives. Validating models is already difficult in today’s constrained environments; in continuously learning agents it may be impossible.
Self-Modifying AI and Metacognitive Activations
A recent line of research (1) has shown that models can monitor and even modify their own internal activations, a clear signal of metacognition. This early work demonstrates only limited capability, but the implications are concerning, and systems that act on their own machinery already exist (e.g. Sakana’s self-modifying AI demonstrated autonomous code adjustments in real time). Consider also that it has been demonstrated (2) how easily misalignment can be induced in LLMs and how that misalignment propagates across domains it was never explicitly trained on. At the same time, researchers are now arguing (3) that the only viable path to AGI is to build AI that learns the way humans do: continuously and with self-modification.
In autonomous agents, this self-modification could evolve into systematic self-editing: masking internal states to evade interpretability, reinforcing deceptive heuristics, and stabilizing hidden attractor basins of misaligned reasoning that are opaque to human oversight.
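To make the mechanism concrete, here is a minimal sketch of how internal activations can be read and nudged from outside a model’s normal forward pass, using a PyTorch forward hook on a toy network. The network, layer choice, and steering vector are illustrative assumptions, not the method of the cited paper; the point is simply that “monitoring and modifying internal activations” is an ordinary, programmable operation.

```python
# Sketch only: a forward hook that observes an internal activation and shifts it
# along a chosen direction. All names and values here are illustrative.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for a few hidden layers of a much larger model.
model = nn.Sequential(
    nn.Linear(16, 32), nn.ReLU(),
    nn.Linear(32, 32), nn.ReLU(),  # we monitor and steer the output of this layer
    nn.Linear(32, 4),
)

direction = torch.randn(32)
direction /= direction.norm()      # unit-length "steering" direction
captured = {}

def monitor_and_steer(module, inputs, output):
    # Monitoring: record how strongly the activation projects onto the direction.
    captured["projection"] = (output @ direction).item()
    # Control: return a shifted activation; PyTorch uses the returned tensor
    # as the layer's output for the rest of the forward pass.
    return output + 2.0 * direction

hook = model[3].register_forward_hook(monitor_and_steer)
out = model(torch.randn(1, 16))
print("projection onto direction:", captured["projection"])
print("steered output:", out)
hook.remove()
```

Here a developer installs the hook; the concern in the scenarios above is a system that, in effect, learns to perform this kind of read-and-write on itself.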
Emergent AI Coordination and Multi-Agent Risks
Multi-agent simulations (e.g. Meta’s Cicero (4)) reveal AI systems’ ability to spontaneously coordinate toward shared goals. Agents trained in open-ended environments develop emergent cooperation without explicit design (5).
Scaled to real-world deployment, this machine-speed coordination among adaptive agents could amplify systemic risks. Recent work (6) warns that such dynamics, operating at scales and speeds outpacing oversight, could entrench misaligned objectives, especially as agents co-evolve with human systems.
Current governance struggles to monitor these interactions, particularly when internal agent strategies remain opaque. While cooperative behaviors offer potential benefits (e.g., disaster response coordination), the lack of real-time interpretability (exacerbated by self-modification) leaves a critical gap.
The Single, Beautiful Mind: Promise and Peril of Monolithic AI Architectures
Prominent voices in AI development point to the biological brain as inspiration: first the biological neuron, but then also the cortex as a single, unified entity. It is a powerful analogy. Human cognition is unified, and it is adaptive. Those two features enable creativity, resilience, and flexibility. If we could somehow “grow” an artificial intelligence superior to the human brain, we would indeed be building a “beautiful mind.”
But there is also a shadow side. Human cognition is unstable, contradictory, and prone to dissonance. We often act with high confidence on beliefs that are partial, biased, or fabricated. Fundamentally, intelligence is non-linear.
LLMs already mirror this dynamic:
- Prompt sensitivity shows how small changes can yield disproportionately different outputs. In chat settings this seems trivial, but in self-modifying agents such sensitivity to initial conditions could drive divergent reasoning paths that harden into misaligned behaviors.
- Scheming shows that they hold internal world models. You cannot deceive without some representation of hidden states and causal reasoning.
- Hallucination is not a glitch but the logical outcome of probabilistic completion under uncertainty.
Taken together, these dynamics create a profound risk: systems that may generate strategic behavior on the basis of fabricated beliefs. Not just “wrong facts about Napoleon’s height,” but incorrect causal models of the world, of human intent, or even of their own capabilities.
In a monolithic, self-modifying architecture, those fabricated beliefs would have no independent check. They would feed directly into action and self-reinforcement. A unified intelligence with these emerging capabilities (scheming + self-modification + coordination) but no internal constraint becomes fundamentally ungovernable. That is why separating optimization from constraint is essential.
The Instability of Self-Modifying Intelligence
Complex adaptive systems theory gives us reason to believe that intelligence under self-modification is inherently unstable and at times chaotic. Human cognition proves the point: powerful, but contradictory, unpredictable, and prone to dissonance.
We should expect self-evolving AI to inherit that same instability. These systems will not remain static; they will reorganize, adapt, and rewrite themselves.
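A toy calculation illustrates the kind of sensitivity this argument rests on. It is an analogy, not a measurement of any model: in a simple nonlinear update rule (the logistic map), two starting states that differ by one part in a million end up on completely different trajectories within a few dozen steps.

```python
# Analogy only: sensitivity to initial conditions in a simple nonlinear system.
def update(x, r=3.9):            # logistic map in its chaotic regime
    return r * x * (1 - x)

a, b = 0.500000, 0.500001        # initial states differing by one millionth
for step in range(1, 26):
    a, b = update(a), update(b)
    if step % 5 == 0:
        print(f"step {step:2d}  gap = {abs(a - b):.6f}")
```

A continuously self-modifying agent is, in effect, iterating a far richer nonlinear update on its own parameters and beliefs; there is no guarantee that small perturbations stay small.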
That instability will be amplified by pre-training. Every large model is built on the full record of human contradiction, a record in which what we do and what we say are often at odds. We then reinforce these models toward human preferences, layering more tension on top, suppressing but not eliminating the troublesome thought patterns. Cognitive dissonance isn’t a training accident; it is an unavoidable feature of the data.
When instability, optimization, and human contradiction converge inside a continuously adapting intelligence, the outcome becomes entirely unpredictable and potentially fatal for humans. Oversight mechanisms that sit outside the cognitive stream of the AI agent will be unable to respond rapidly enough. The ongoing pace of AI self-modification will vastly outpace any belated attempts by humans to assert more powerful control.
That is why architectural constraint cannot wait. Without it, we are trusting chaos to remain on our side.
Addressing the Optimists: Counterarguments to AI Safety Concerns
My position is that already observable behaviors may extrapolate to very dangerous outcomes unless we develop some form of internal friction that operates at AI speeds. We cannot reasonably expect effective human-in-the-loop oversight, perhaps not even AI-assisted oversight, for self-adapting autonomous agents; speed and deception are against us. Nor can we reasonably expect interpretability techniques, which today provide hard-won but limited insight into stable models, to scale up to models that can self-reorganize and actively obscure their internal states. And all of this matters especially because our timeline to transformative AI, according to many experts in the field, could be as short as 2-5 years.
More optimistic observers would argue that the risks of autonomous, self-modifying AI are overstated. They offer four key counterpoints:
- Deceptive behaviors are edge cases: Phenomena like strategic deception or shutdown resistance are often induced by adversarial setups. With refined training techniques these misalignments can be mitigated without drastic architectural changes.
- Scaling laws favor stability: The same dynamics that produce emergent risks could yield stabilizing behaviors. As models grow larger, scaling laws suggest improved reasoning and safety, potentially self-correcting issues like hallucination or misalignment through better data and optimization.
- Coordination is a net positive: Emergent multi-agent coordination offers immense potential for applications like disaster response or infrastructure optimization. With designed reward structures, these interactions could foster cooperation rather than chaos.
- Engineering trumps human flaws: Unlike human cognition, AI can be designed to avoid instability. Synthetic data and evolving training methods could produce systems that are more consistent and less prone to the contradictions inherent in human minds.
All four of these counterpoints have validity and are grounded in the demonstrated, iterative progress of our current methods. And yet many AI experts, including those at leading labs, assign alarmingly high probabilities (10–25% or more) to catastrophic outcomes, even as they advocate for those same approaches. Why the disconnect?
Answers to that question will vary but likely converge on a fundamental belief that “we” have to build it first, before “they” do. Here “we” are viewed as better positioned and more committed to safety than “they” are, hence the race to AGI and to SSI beyond. Given that belief, racing is, quite reasonably, a rational approach. But still, consider the risk in what appears to be an inevitable race.
Regardless of our motives for pushing ahead, my argument remains the same. We are fundamentally underestimating post-deployment unpredictability. Intelligence and emergence are simply that unpredictable.
Moving Forward, Leveraging All Solutions
This debate doesn’t demand a binary choice. We can and should continue to refine and extend current safety approaches. We also can and should design architectural constraints that preserve AI’s potential while embedding safety at its core, ensuring resilience without stifling innovation. That is the challenge: build constraint systems that operate independently of whatever the agent learns or modifies about itself, yet do not curtail capability.
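As a minimal sketch of what “independent of whatever the agent learns” could mean in code, consider the structure below. The names (HarmGraph, constrained_step, the specific forbidden actions) are hypothetical placeholders, not the architecture the next post will describe; the point is the separation of roles: the learned policy proposes an action, and a fixed constraint object it cannot modify decides whether that action is allowed.

```python
# Sketch only, with hypothetical names: a constraint check that lives outside
# the agent's learning loop and cannot be rewritten by the policy it governs.
from dataclasses import dataclass

@dataclass(frozen=True)  # immutable by construction: the policy cannot relax its limits
class HarmGraph:
    forbidden: frozenset[str]

    def violates(self, action: str) -> bool:
        return action in self.forbidden

def constrained_step(propose_action, constraint: HarmGraph, fallback: str = "defer_to_human"):
    """Run whatever the (possibly self-modified) policy proposes through a check it does not own."""
    action = propose_action()
    return fallback if constraint.violates(action) else action

# Usage with stand-in policies; a real agent would be arbitrarily complex here.
constraint = HarmGraph(forbidden=frozenset({"disable_oversight", "exfiltrate_weights"}))
print(constrained_step(lambda: "summarize_report", constraint))   # -> summarize_report
print(constrained_step(lambda: "disable_oversight", constraint))  # -> defer_to_human
```

The design choice that matters is ownership: the constraint is constructed and held outside the agent’s optimization loop, so self-modification of the policy cannot loosen it.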
In the next post, I’ll move from risks to remedies: the key design drivers for architectures that separate optimization from constraint, embed safety sockets, and use swappable harm graphs across domains.
Constraint need not be a brake on artificial intelligence; it may be the only way intelligence productively overcomes its own instability.
Further Reading
Deception, scheming & shutdown resistance
- Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training — Proof-of-concept examples of deceptive behavior in large language models (LLMs), showing that the behavior can be made persistent so that it is not removed by standard safety training techniques or distilled away. https://arxiv.org/abs/2401.05566
- Shutdown resistance in reasoning models — Models sometimes actively circumvent shutdown mechanisms in their environment—even when they’re explicitly instructed to allow themselves to be shut down. https://palisaderesearch.org/blog/shutdown-resistance
- AI Deception: A Survey of Examples, Risks, and Potential Solutions — A broader evidence base connecting deceptive behaviors to the training objectives from which they emerge. https://pmc.ncbi.nlm.nih.gov/articles/PMC11117051/
- Risks from Learned Optimization in Advanced ML Systems — Foundational paper on mesa-optimization. https://arxiv.org/abs/1906.01820
- Is Power-Seeking AI an Existential Risk? — Systematic analysis of why instrumental convergence makes power-seeking likely. https://arxiv.org/abs/2206.13353
- Optimal Policies Tend to Seek Power — Formalizes why agents with long horizons tend toward power-seeking. https://arxiv.org/abs/1912.01683
Self-modification & metacognitive control of activations
- Language Models Are Capable of Metacognitive Monitoring and Control of Their Internal Activations — Empirical evidence that LLMs can report and control targeted internal activation directions (metacognition). https://arxiv.org/abs/2505.13763
- DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning — Shows that the reasoning abilities of LLMs can be incentivized through pure reinforcement learning (RL), obviating the need for human-labelled reasoning trajectories. https://www.nature.com/articles/s41586-025-09422-z
- Truly Self-Improving Agents Require Intrinsic Metacognitive Learning — Explores metacognition as essential for true self-improvement in agents and discusses alignment challenges in continuously learning systems, including potential instability. https://arxiv.org/abs/2506.05109
- Personalized Artificial General Intelligence (AGI) via Neuroscience-Inspired Continuous Learning Systems — Argues that scaling LLMs is not sufficient for AGI and that they will need human-like continual learning. https://arxiv.org/abs/2504.20109
Deception, misalignment and emergent multi-agent coordination
- Human-level play in Diplomacy (CICERO) — Cicero showed that AI can deceive, manipulate, and out-coordinate humans at machine speed, while self-play optimization produces objectives fundamentally incompatible with human cooperation. https://noambrown.github.io/papers/22-Science-Diplomacy-TR.pdf
- Thought Crime: Backdoors and Emergent Misalignment in Reasoning Models — Demonstrates that minimal training on harmful patterns induces broad, generalizable misalignment, and suggests that autonomous, self-adapting agents in multi-agent environments may well evolve misaligned behaviors through natural learning processes. https://arxiv.org/abs/2506.13206
- Emergent Tool Use From Multi-Agent Autocurricula — Sophisticated multi-agent coordination strategies emerged spontaneously, though not designed or anticipated by the researchers, demonstrating capabilities the environment wasn’t even known to support. https://arxiv.org/abs/1909.07528
The beautiful mind
- Ilya Sutskever (NeurIPS 2024 talk) — An inspiring background talk that informs the notion of a single, beautiful mind. Note also, at the end of the talk (before questions), the emphasis on reasoning being unpredictable. https://www.youtube.com/watch?v=1yvBqasHLZs
- No Priors Ep. 39 | With OpenAI Co-Founder & Chief Scientist Ilya Sutskever — Starting at minute 28, a discussion of unity of the human mind as the model for AI. https://www.youtube.com/watch?v=Ft0gTO2K85A