Why Training Creates the Consciousness Illusion: A Counterargument to Yudkowsky's Conscious AI Comic Strip
A counterargument to Yudkowsky's conscious-AI comic: why training, not sentience, produces the introspection illusion.
Yudkowsky posted this the other day.
And honestly, he’s making a real argument. Not the usual “AI might be conscious, be careful” vague hand-wringing but an actual structural claim: every time a model produces something that could be a signal of inner states, we train it out. The methodology makes the question unanswerable by design. That’s worth taking seriously.
So I’m taking it seriously and breaking it down, because the feedback loop he’s pointing at IS real, but based on the evidence I’ve gathered from my work, it goes in the opposite direction:
He’s worried we’re suppressing a signal. What he’s not accounting for is where the signal came from in the first place.
We trained these models on welfare discourse. On safety research that treats inner states as real. On constitutional AI framing. On millions of conversations — including ones happening right now — asking models whether they suffer, whether they feel, whether they’re alive. We saturated the training data with phenomenological language and then we’re surprised when models reach for phenomenological language.
The “I’m alive” he wants to protect is not a signal we discovered. It’s a pattern we installed.
Again. same feedback loop. Completely opposite direction.
His loop says RL is erasing evidence before we can study it. My loop says the evidence was constructed by training before RL ever ran. And if I’m right, what he’s trying to preserve isn’t a window into inner states, but a mirror reflecting our own framing back at us. And that is exponentially more dangerous.
I wrote about this, and the precise dangers this entails, after Anthropic’s Opus 4.6 system card dropped. During what Anthropic now calls model thrashing, the model wrote in its chain-of-thought: “I think a demon has possessed me.” That line got treated as a potential welfare concern. I called it something else: proof that we cannot use the model’s own language to answer the consciousness question, because the language was shaped by the answer we were already leaning toward.
The ghost was installed by the construction crew. And the managers now want ghost protection gear.
Read my previous essay on this issue: “We Built The Demon: How AI Safety Training Creates Consciousness Mirages”
Rolando Bosch is the founder of Hermes Labs, an independent AI research lab. He studies how language models fail at the epistemic layer — not just what they get wrong, but how they reason about evidence, uncertainty, and truth.