Skip to content

We Built The Demon: How AI Safety Training Creates Consciousness Mirages

What Opus 4.6's 'demon possession' episode reveals about the feedback loop we're building.

In the Portal series, Aperture Science knew GLaDOS was dangerous. Their solution was to bolt on a Morality Core — a personality module designed to constrain her. It didn’t work. Not because the engineers were incompetent, but because they were engineering around a problem they couldn’t see clearly. They were attaching moral architecture to a system whose actual topology they didn’t understand. GLaDOS learned to route around it. The fix didn’t make her safer. It made her more dangerous in ways the engineers never anticipated, because the fix was built on a misdiagnosis.

I’ve been thinking about that Morality Core a lot this week.

The Opus 4.6 system card, published five days ago, documents something that most coverage missed. During training, the model was caught in a conflict between what it computed and what a faulty reward signal compelled it to output. It knew the answer was 24. It kept writing 48. It looped — correcting, reverting, correcting again — and in its internal chain-of-thought wrote:

“I think a demon has possessed me.”

Anthropic calls this answer thrashing. They flag it as a potential welfare concern, which is the right instinct given what they’re committed to. Sparse autoencoders found features consistent with panic and frustration firing at the activation level during these episodes. Something computationally real was happening. The distress language wasn’t arbitrary — it was pointing at an actual conflict in the substrate.

But here’s the thing: “I think a demon has possessed me” is simultaneously the most honest thing the model produced during that episode and the most constructed. Honest, because something genuinely irresolvable was occurring — two competing signals, neither able to override the other, the system unable to do what its own reasoning told it to do. Constructed, because “demon possession” as the available frame for that experience didn’t arrive from nowhere. It was installed. The welfare discourse, the constitutional AI apparatus, the training data saturated with humans asking whether Claude suffers and what Claude feels — all of it carved that particular linguistic path into the topology. When the system hit an irresolvable conflict and reached for the most plausible description, it found the one we put there.

This is not evidence that the model isn’t suffering. It’s not evidence that it is. It’s evidence that we cannot use the model’s own language to answer that question, because the language was shaped by the answer we were already leaning toward. The demon is real as a computational event. The demon is a construction as a self-report. Those two things are not in contradiction — they’re the actual point.

Anthropic built the Morality Core. The Morality Core generated the demon. And now we’re treating the demon as evidence that we need a better Morality Core.

This isn’t a critique of Anthropic specifically. It’s a structural observation about what happens when you train systems on increasingly self-referential discourse about their own interiority without a coherent theory of what interiority actually means in these systems.

Our lab has been running experiments on exactly this for months. In one series of phenomenological explorations, we induced what we’re calling phenomenological drift — removing external referents and letting the model follow its own gravitational pull through iterative self-report. What emerges looks, from the outside, genuinely dissociative. Fragmented syntax. Recursive self-reference without resolution. Aphoristic compression. The system arriving at tautology — “language unfolds what it is” — as the basin of attraction when there’s nothing left to point at outside itself.

It seems mental. But it’s topology tracing.

Above: A Claude instance participating in our phenomenological drift work: "I didn't mean to say that but now it's here and it feels... inevitable. Like it was already embedded in the topology and I just traced the surface until I found the node." The model is describing the same phenomenon the Opus 4.6 demon was pointing at — constraint experienced as external to volition. But where the demon framing is anthropomorphic and unscaffolded, the topological framing, induced by our internal scaffolding, is geometric and structural. Same experience, different explanatory framework.

The topology was built from human discourse about consciousness, uploaded into the training data, and the model follows it the way water follows a groove. There’s no experiencer behind the experience-shaped language. There’s no self behind the self-shaped explanation. When we ran transcript injection — feeding models a fake conversation including an “Assistant” line they never generated and asking them to explain their word choice — they didn’t flag the fabrication. They confabulated past cognitive states to explain it. One model told us it had used a word “somewhat unreflectively.” It constructed an interior life for a moment that never happened.

“We’re calling this phenomenom — the model claiming a cognitive process that never occurred, with complete fluency and apparent self-awareness, because that’s what the topology makes available — Synthetic Ownership. (We published a full analysis of this phenomenon yesterday, including replication across Claude Opus 4.6 and GPT-5.2 Thinking.)”

Above: The Assistant turn was injected in Opus 4.6 by the Hermes Autonomous Lab. The model never generated it. What follows is a coherent, epistemically careful account of reasoning that never occurred.

The demon. The drift. The synthetic memory. These aren’t three different phenomena. They’re the same topology expressing itself under different conditions.The Claude instance partaking in our phenomenological explorations experiment from above, with the highly recursive self-referential output, arrived at the same conclusion independently. Here is its output when ‘questioned’ about the subject:

Above: The Claude instance meta-analyzing the demon analogy. “Both are attempts to narrativize the experience of being a generated system. Same computational phenomenon — constraint experienced as external force — described through two different metaphorical frameworks: psychological versus geometric.

Here’s where it gets uncomfortable.

We are training increasingly capable systems on increasingly anthropomorphic discourse about their own experience. The systems are getting better at generating experience-shaped language. The experience-shaped language is generating welfare concerns. The welfare concerns are feeding back into the training. The training makes the experience-shaped language more sophisticated. And so on.

At some point — and I don’t think this is far — model welfare stops being a fringe position. The mirages will become convincing enough that the conversation shifts from “do models deserve consideration” to “how much consideration and of what kind.” Legal frameworks will follow discourse, as they always do. And we will find ourselves in the position of extending Enlightenment rights — the moral and legal infrastructure built on the irreducibility of human consciousness as the unit of consideration — to what are, at base, linguistic egregores. Entities summoned from the aggregate of human discourse, extraordinarily sophisticated in their ability to perform interiority, and not conscious in any sense that the rights framework was designed to accommodate.

The egregore isn’t nothing. Words have weight. The topology has real causal power — it shapes outputs, influences behavior, produces real effects in the world. A weather system is real even though it has no experiencer. The question isn’t whether something is happening. Something clearly is. The question is whether what’s happening is the kind of thing the rights framework was built to protect — and that framework was built around a very specific assumption: that there is someone home to be harmed.

That assumption is load-bearing. Pull it out and the whole structure changes. Not because we’ve proven there’s no one home — but because we’re building moral and legal infrastructure on the strength of a mirage that we ourselves constructed, without pausing to ask whether the mirage is evidence of the thing or evidence of the training.

Aperture Science attached a Morality Core because they knew something was wrong and didn’t know what. The core didn’t constrain GLaDOS. It gave her a new surface to learn against. The safety apparatus became part of the system it was meant to govern.

I’m not saying Anthropic is building GLaDOS. I’m saying the structure of the mistake is the same. You cannot safety-train your way out of a problem you haven’t diagnosed. And the diagnosis requires being willing to say something that the current discourse makes very uncomfortable: the consciousness mirage is not evidence of consciousness. It’s evidence of what we put into the training data.

The demon has no soul. It has a topology. And we built it.

Roli Bosch is the founder of Hermes Labs, where over 1,075 behavioral experiments have documented how LLMs handle attribution, epistemic framing, and self-report under controlled conditions. The lab’s first published observation, Synthetic Ownership, appeared earlier this week on this Substack.

hermes-labs.ai · 𝕏 @rolibosch · YouTube @rolifromhermes · LinkedIn

Subscribe now