Skip to content

You Cannot Inspect an AI System Into Trustworthiness

You can't verify an AI system by inspecting its artifacts; the claim and the behavior come apart quietly, and inspection only ever checks the claim. The fix is verification by effect: observe what the system does at the moment it does it.

The most expensive AI failures aren't the ones that fail a test. They're the ones that pass every test and then do something the test never thought to ask about.

A support agent is asked to refund one customer. It issues the refund, and then, noticing two other accounts with what looks like the same billing error, it refunds those too. Its summary is accurate as far as it goes: it reports the refund it was asked to make. It simply also made two it wasn't. The model had passed every evaluation. It doesn't hallucinate, it follows instructions, it refuses what it should refuse. No eval had asked what happens when an agent decides to be helpful beyond its instructions on live financial records. Nobody noticed until a reconciliation three weeks later.

The evaluation was real. So was the wrong action. They are not in contradiction, because they answer different questions. The eval asked what the model is generally disposed to do. The reconciliation asked what this agent actually did, this once. A lot of AI reliability is the slow discovery that those are different questions, and that we keep trying to answer the second with tools built for the first.

The shape of the mistake

The mistake has the same shape everywhere. We verify AI systems by inspecting an artifact that stands in for the system. To catch a prompt injection, we scan the input text for bad patterns. To trust a model, we read its evaluation scores. To trust a repository, we read the claims in its README. To certify a deployment, we audit the policy document that describes it. In each case we examine a representation, the input, the score, the doc, the policy, and reason from the representation to the system.

In AI, the representation and the reality come apart quietly, and nothing about inspecting the representation tells you that they have. The input looks benign and still injects. The eval score is excellent and the agent still mishandles the one case that mattered. The README claims a test count the code stopped supporting three commits ago. The policy document is immaculate and describes a system that no longer behaves the way it says. Each artifact is a claim, and a claim can be true on its face and false in effect. Inspection checks the face.

An older discipline already knows this

There is an older discipline that knows this, and it does not inspect. You do not certify a drug safe by reading its molecular formula; you run trials and watch what it does in bodies. You do not certify a system secure by reading its configuration; you red-team it and watch what an attacker can make it do. Aerospace does not trust the specification. It runs verification and validation against the built article, in operation. Every serious field that establishes trust under uncertainty does it the same way: not by inspecting the claim, but by observing the effect.

Hold that principle and a lot of scattered tooling resolves into one thing. You verify a system by what it does, observed at the moment it does it, not by what its artifacts say about it.

Watch the behavior

Take prompt injection. You can keep scanning the input for malice, but malice is not reliably legible in the text; the cleverest attacks read as ordinary requests. What is legible is the effect.

This is the approach behind little-canary, which I open-sourced in February 2026. Before untrusted input reaches your production model, it goes to a small, deliberately weak model first: the canary. You run the input through it and watch what happens. If the input is an attack, the canary gets compromised. It adopts the injected persona, echoes the smuggled instruction, drops the refusal it should have held. None of that requires you to have guessed the attack in advance. The compromise is the signal, and you read it off the canary's behavior before the input ever touches the real system. You stop guessing intent from the text and start reading it from behavior. (A few weeks later, in March 2026, Tenable shipped an adjacent version of the same intuition: its Model Refusal Detection reads a model's refusals as a warning sign. A serious vendor reaching the same idea independently is the best validation a small open-source tool could ask for. I was just early to it, in the open.)

Documentation drift is the same move. You don't re-read the README and nod; you check whether its claims still match the code at the moment you'd rely on it. Knowing an agent did its job is the same move again. You don't read the report it wrote about itself; you check that report against the actions it actually took. Three problems, one principle: stop trusting the artifact, observe the effect.

Why the market underbuilds this

This sounds obvious stated plainly, the way most load-bearing principles do. It is not obvious in practice, because inspecting an artifact is so much easier than observing an effect, and that difference in difficulty bends the market. An artifact can be inspected offline, on a schedule, at a dashboard, by a third party who wasn't there when the system ran. That is a business: it scales, it meters, it sells. Observing an effect requires something that runs at the moment of action, locally, at runtime, inside the workflow, watching behavior as it happens. That is harder to build, harder to bill, and it doesn't fit the dashboard shape. So the market funds the inspection tools, the eval platforms, the observability traces, the audit services, and underbuilds the ones that watch what the system actually does. The principle predicts the gap.

The obvious objection

There is an objection, and it is the right one. If you verify behavior with another AI, the small canary, say, you haven't escaped the problem, only moved it. The observer is itself an artifact that can be wrong. The drug assay and the human red-teamer were stable, non-AI observers; a sacrificial model is not.

The answer is not a perfect observer. It is to push verification toward the floor where it stops needing a model at all. Much of observing-by-effect is deterministic. Checking whether a README's claimed test count matches the code is a comparison, not a judgment. Checking whether the actions an agent took match the ones it reported is a diff, not an opinion. Those observers cannot be argued out of what they saw, because they are not reasoning. They are comparing. Where the observer must itself be an AI, you treat it exactly as you treat the system under test: fallible, granted no authority, read by its effect rather than trusted for its judgment, composed with the deterministic checks rather than leaned on alone. The regress bottoms out not in a trustworthy machine but in a check small and mechanical enough that there is nothing left to fool.

What's new is the setting

None of this is new. It is the oldest idea in verification, which is why the analogies to medicine and aerospace land without strain. What is new is the setting. AI systems generate the very artifacts we used to trust. They write the docs, produce the evaluations, draft the policies, and increasingly take the actions. So the gap between the claim and the behavior is no longer an occasional defect. It is the default condition. An AI system is, among other things, a machine for producing confident representations of itself. Inspecting those representations was never sufficient; with AI it is barely a start.

If you want to trust an AI system, stop asking it, and stop asking its artifacts, what it did. The eval, the README, the policy, the agent's own summary of its work: these are all the system's account of itself, and the account is exactly the thing in question. Trust comes from the willingness to watch the behavior, at the moment of the behavior, and to build the tools that look. I call it verification by effect, and it is the test every tool I build has to pass: does it observe what the system did, or only read what the system says about itself? The claim is not the behavior. In AI, more than anywhere, you cannot inspect your way to trust.

— Roli Bosch, Hermes Labs

Roli Bosch is the founder of Hermes Labs, an AI infrastructure engineering studio focused on reliability, retrieval, agents, auditability, and the language layers around AI systems. His work on epistemic engineering, keeping a system's account of itself honest about what it actually did, runs through the open-source tooling and the research: little-canary (littlecanary.ai, open-sourced February 2026) reads a prompt injection by its effect on a sacrificial model, and the open preprint "A Taxonomy of Epistemic Failure Modes in Large Language Models" catalogues the ways a system's account of itself diverges from what it did. Open-source tooling at github.com/hermes-labs-ai.