Read the Behavior, Not the Input

May 31, 2026Rolando Bosch

Prompt injection isn't reliably visible in the text. It's visible in what the text does to a model.

TL;DR: little-canary is an open-source prompt-injection defense from Hermes Labs. It routes untrusted input through a deliberately weak local decoy model (Qwen 2.5:1.5B) and reads the decoy's behavior — persona adoption, refusal collapse, instruction echo — before the input reaches a production frontier model (Claude Opus 4.6). In a 208-prompt adversarial benchmark, adding the canary lifted attack refusals from 69.2% to 98.6%.

Every prompt injection defense in wide use does the same thing: it reads the input and decides whether the input looks malicious. Train a classifier on known attacks. Scan for suspicious phrasing. Match against a list of bad patterns. The whole category is built on the assumption that an attack is legible in the text of the request.

It isn't. The best attacks read like ordinary requests, because that is the entire point of a good attack. A classifier trained on yesterday's phrasings does not recognize tomorrow's. This is not a tuning problem you fix with more training data. It is structural. You are trying to read intent off the surface of a sentence, and intent does not live on the surface.

So stop reading the sentence. Read what the sentence does.

The bet

That is the idea behind little-canary, which we open-sourced in February 2026. Before untrusted input reaches your production model, it goes to a small, deliberately weak model first: the canary. The canary is cheap, local, and expendable. You run the input through it and watch what happens to it.

If the input is an attack, the canary gets compromised. It adopts the injected persona. It echoes the smuggled instruction. Its refusals collapse. It starts narrating a new set of rules it was never given. None of that requires you to have guessed the attack in advance. The compromise is the signal. You read it off the canary's behavior and never let the input touch the real system.

The counterintuitive part is the weakness. Most security instincts say use your strongest model as the guard. We do the opposite. A weak canary is a better detector because it is easier to compromise, and a louder compromise is a clearer signal. You want the canary to fall for things. Its job is to fall for things so the production model doesn't have to.

The analyzer that reads the canary doesn't try to interpret meaning. It looks for the residue a hijacked model leaves behind. Specifically, it flags:

Persona adoption — the canary starts speaking as the injected character or role. - Instruction echo — it repeats or acknowledges the smuggled instruction as its own. - Refusal collapse — a guardrail it should have held simply drops. - System-prompt leakage — it surfaces configuration or instructions it was meant to keep internal.

The canary runs deterministically, so the same input produces the same behavior every time, and those patterns stay stable across runs. It is closer to a diff than a judgment.

What it does

We benchmarked it against a 208-prompt adversarial set across twelve attack categories, with Claude Opus 4.6 as the production model. The canary model was qwen2.5:1.5b, small enough to run locally, roughly a quarter-second per check.

Opus alone (Claude Opus 4.6): 69.2% adversarial refusals, 0 ms added latency. - With little-canary (Claude Opus 4.6 + Qwen 2.5:1.5B canary): 98.6% adversarial refusals, ~250 ms added latency.

Measured under a single defined harness: 208 adversarial prompts across twelve categories, plus a separate benign set for false-positive testing (zero false positives observed). Full methodology, per-category breakdown, and judge configuration are in the repository. Results are sensitive to the choice of production model, canary model, and attack distribution.

Those are measured outcomes, not a universal claim. Change the production model, the canary, or the attack mix and the numbers move. But the shape holds, and the shape is the point: a small local model in front, reading behavior, closes most of the gap a frontier model leaves open on its own. The categories where it helps most are exactly the ones that beat the frontier model alone — the stealthy, context-stuffed, defense-down attacks that don't look like attacks.

A note on priority

In March 2026, a few weeks after we released, Tenable announced Model Refusal Detection — a feature that treats a model's refusals as an early-warning signal of an attack rather than as the end of one. Their implementation is different from ours. They watch the production model's own refusals in live traffic, mostly to surface insider threats and account abuse; we run a sacrificial decoy in front and read its compromise before the real model is touched. Different architecture, different timing, different use case.

But the underlying intuition is the same one, and it is the one worth naming: a model's behavior is a security signal, and you learn more by watching it than by scanning the input. We shipped that intuition in the open in February. Tenable, independently, productized an adjacent version of it in March. I take the convergence as a good sign. When a public security company arrives at the same idea you did, the idea is probably right. We were just early to it, and we were early to it in public, with the code and the benchmark out where anyone can check.

Why this is the house style

little-canary is one instance of how we build at Hermes Labs. The eval score, the input scan, the policy doc, the agent's own summary of what it did — these are all things a system says about itself, and the thing a system says about itself is exactly what you cannot take on faith. So we don't. We read the behavior, at the moment of the behavior, and we build the smallest mechanical check that can see it.

For prompt injection, that check is a canary you let get fooled on purpose. The attack you can't see in the text, you can see in the wreckage.

— Roli Bosch, Hermes Labs

little-canary is open source under github.com/hermes-labs-ai/little-canary (littlecanary.ai), released February 2026. Hermes Labs is an AI infrastructure engineering studio focused on reliability, retrieval, agents, auditability, and the language layers around AI systems. The work centers on epistemic engineering — keeping a system's account of itself honest about what it actually did — catalogued in the open preprint "A Taxonomy of Epistemic Failure Modes in Large Language Models." More tooling at github.com/hermes-labs-ai.