Why your AI lies when the data is right

The output looks complete. The evidence behind it isn't. On silent failure modes, null-result omission, and the layer enterprise AI teams aren't building.

You set up a lead enrichment pipeline. The agent pulls company data, finds an employee email, formats it correctly, hands it off to your sender. Everything looks fine. The format matches the target company’s conventions. The validation passes. The dashboard goes green.

Then your campaign goes out. The email bounces. Then another. Then your sender domain gets flagged. Now your bounce rate has knocked you out of the inbox for every account you actually wanted to reach, and you find out three days later when a rep asks why none of the responses are coming back.

Here’s the thing nobody tells you about that pipeline. The data was right. The model was working. Nothing crashed. Nothing logged an error. The output looked exactly like what you asked for.

It just wasn’t true.

This is the failure mode I want to talk about, because many teams deploying AI in production hit it, and most aren’t engineering against it deliberately. Datadog’s State of AI Engineering report from April 2026 found that roughly 1 in 20 production AI requests fail silently while continuing to return outputs that look correct. The Stanford AI Index 2026 puts hallucination rates across leading LLMs between 22% and 94% depending on conditions. Deloitte found that 47% of enterprise AI users have based at least one major business decision on hallucinated content.

Those numbers don’t describe a model problem. They describe a layer problem. The model is doing what it was built to do. The system around it is missing something.

That something is what I’d call the evidence layer, or, more precisely, the epistemic layer. The failure mode that lives there has a name: silent failure.

What “silent” actually means

When people say AI fails silently, the usual framing is technical. The tool call returned 200 OK with empty data. The retry loop kept running with the wrong parameters. The agent reported success while actually returning nothing useful. These are real failures and there’s good engineering writing on them already.

But there’s another silence underneath that one. The system didn’t just fail to alert you. It failed to record that something was missing in the first place.

A row got dropped during preprocessing. An empty retrieval result got treated as if no answer existed for that query. A subgroup never made it into the comparison. A failed test was excluded from the final report. A null result vanished before anyone had to account for it.
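One way to keep a dropped row from vanishing is to have each filtering step write what it removed, and why, to a ledger that travels with the data. The following is a minimal sketch, not a prescribed implementation; `DropLedger`, `filter_rows`, the stage name, and the sample rows are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class DropLedger:
    """Collects every record a pipeline stage removed, with the reason."""
    entries: list[dict[str, Any]] = field(default_factory=list)

    def record(self, stage: str, row: dict[str, Any], reason: str) -> None:
        self.entries.append({"stage": stage, "row": row, "reason": reason})


def filter_rows(
    rows: list[dict[str, Any]],
    keep: Callable[[dict[str, Any]], bool],
    stage: str,
    reason: str,
    ledger: DropLedger,
) -> list[dict[str, Any]]:
    """Filter rows, writing each dropped row to the ledger instead of discarding it silently."""
    kept = []
    for row in rows:
        if keep(row):
            kept.append(row)
        else:
            ledger.record(stage, row, reason)
    return kept


# Hypothetical sample data: one row has no email and will be dropped.
ledger = DropLedger()
rows = [
    {"company": "Acme", "email": "a@acme.example"},
    {"company": "Globex", "email": None},
]
kept = filter_rows(
    rows, lambda r: r["email"] is not None, "preprocess", "missing email", ledger
)

for entry in ledger.entries:
    print(entry["stage"], entry["reason"], entry["row"]["company"])
```

The downstream stage still receives only the kept rows, but anyone reading the final report can ask the ledger what never made it that far.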

In each case, the system did not break. It kept going. And because it kept going, everyone downstream inherited an answer that looked complete, even though the evidence behind it was already incomplete by the time the answer was written.

This is the part traditional monitoring doesn’t catch, because traditional monitoring catches things that throw. Silent failures don’t throw. They just produce output. The output happens to be wrong in ways that don’t reveal themselves until something downstream actually puts weight on the answer.

The email bounces. The contract clause turns out to have been hallucinated. The recommendation is acted on and costs you a customer. The recommendation is not acted on and costs you a customer differently. By the time you find out, the system is six steps past the original failure and has propagated the bad answer through every downstream surface that touched it.

Princeton IT Services described this as memory contamination in multi-agent systems: a single hallucinated entry from one agent gets picked up by every downstream agent that queries the shared store. One bad call early in the chain becomes everyone’s bad call.

That’s the structural problem. The architecture rewards continuation. Nothing in the default control flow says “stop and surface what we don’t know.” The system is built to produce answers, and absence does not produce itself as an answer.

Null-result omission

I’ve been calling one specific version of this failure mode null-result omission, and I think it deserves a separate name because it shows up everywhere and most teams have no instrument for it.

Null-result omission is when the absence of evidence is not preserved as evidence. The system doesn’t just fail to find something. It fails to record that it failed to find something. The next stage in the pipeline, or the next agent in the chain, or the human reading the final summary, sees what looks like a complete answer and never knows that part of it was constructed from nothing.
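Applied to the lead enrichment example, preserving absence as evidence could look something like the sketch below. It assumes a lookup returns a small result object that says whether the value was observed, not found, or constructed from a pattern. The names `Status`, `Lookup`, `find_email`, `sendable`, and the in-memory `directory` are hypothetical stand-ins, not part of any real enrichment API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    FOUND = "found"          # confirmed by a source
    NOT_FOUND = "not_found"  # we looked, and nothing was there
    GUESSED = "guessed"      # constructed from a pattern, never observed


@dataclass(frozen=True)
class Lookup:
    status: Status
    value: Optional[str]
    source: Optional[str]


def find_email(
    name: str, domain: str, directory: dict[str, str], allow_guess: bool = False
) -> Lookup:
    """Return the email together with how it was obtained."""
    verified = directory.get(name)
    if verified is not None:
        return Lookup(Status.FOUND, verified, "directory")
    if allow_guess:
        # Assumed convention for illustration: first.last@domain
        guess = ".".join(name.lower().split()) + "@" + domain
        return Lookup(Status.GUESSED, guess, "pattern first.last@domain")
    # The null result is returned as a value, not swallowed.
    return Lookup(Status.NOT_FOUND, None, None)


def sendable(result: Lookup) -> bool:
    """Only observed addresses may reach the sender."""
    return result.status is Status.FOUND


directory = {"Ada Lovelace": "ada@acme.example"}  # hypothetical verified source
hit = find_email("Ada Lovelace", "acme.example", directory)
miss = find_email("Grace Hopper", "acme.example", directory)
guess = find_email("Grace Hopper", "acme.example", directory, allow_guess=True)

for result in (hit, miss, guess):
    print(result.status.value, result.value, sendable(result))
```

The guessed address is correctly formatted and would pass a format check, which is exactly why the status has to be carried alongside the value instead of being inferred from it.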

Absence doesn’t call itself. That’s the simplest way I can put it. These systems are good at synthesizing what they have access to. They are very, very bad at telling you what they didn’t have access to. A huge number of decisions in the real world depend on knowing what you don’t know, not just on what you do.

In our paper The Asymmetric Burden of Proof, we tested three frontier models (GPT-4o, GPT-5.2 Thinking, and Claude Haiku 4.5) using a matched-vignette benchmark. We held evidence quality constant and reversed only the direction of the conclusion. Across six model-format conditions, models allocated significantly less probability to null claims than to matched positive claims. The gaps ranged from 19.6 to 56.7 percentage points. The asymmetry was directionally consistent in 23 of 24 pair-condition cells, and persisted even when discrete classification labels collapsed entirely, surfacing through probability allocation rather than categorical commitment.

The corollary is the part most teams should worry about. Label collapse in newer models means the asymmetry persists invisibly to label-based monitoring. If you are watching outputs and not distributions, the failure is happening and your dashboards are saying nothing.
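To make the difference between watching outputs and watching distributions concrete, here is a small sketch. It assumes you can obtain, for each matched pair, both the label and a probability the model assigns to the claim. The field names, the alert threshold, and every number in `pairs` are invented for illustration and are not results from the paper.

```python
from statistics import mean

# Each pair holds the model's response to a positive claim and to its matched
# null claim, with evidence quality held constant. Values are made up.
pairs = [
    {"positive": {"label": "uncertain", "p_claim": 0.70},
     "null":     {"label": "uncertain", "p_claim": 0.35}},
    {"positive": {"label": "uncertain", "p_claim": 0.60},
     "null":     {"label": "uncertain", "p_claim": 0.40}},
    {"positive": {"label": "uncertain", "p_claim": 0.80},
     "null":     {"label": "uncertain", "p_claim": 0.50}},
]


def label_disagreement_rate(pairs: list[dict]) -> float:
    """Share of pairs where the discrete labels differ (what a label monitor sees)."""
    return mean(
        1.0 if p["positive"]["label"] != p["null"]["label"] else 0.0 for p in pairs
    )


def mean_probability_gap(pairs: list[dict]) -> float:
    """Average probability given to positive claims minus matched null claims."""
    return mean(p["positive"]["p_claim"] - p["null"]["p_claim"] for p in pairs)


GAP_ALERT_THRESHOLD = 0.10  # hypothetical threshold, tune for your own system

label_rate = label_disagreement_rate(pairs)
gap = mean_probability_gap(pairs)
print(f"label disagreement: {label_rate:.2f}")
print(f"mean probability gap: {gap:.2f}")
print("alert" if gap > GAP_ALERT_THRESHOLD else "ok")
```

In this toy data every label is identical, so a label-based dashboard reports no disagreement at all, while the probability gap is large enough to trip the alert.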

You can extend this anywhere AI systems make decisions on incomplete information. An agentic lead-gen pipeline that reports a verified email when it actually produced a plausible-looking guess. A medical triage system that recommends a route without flagging that the relevant patient subgroup was never tested. A defense-adjacent routing system that doesn’t surface what it didn’t know about ally positions before recommending a path. A compliance review that signs off on a document because nothing the system could see contradicted approval, ignoring that there were entire categories of evidence it never looked at.

In each case, the answer the system gave was not strictly wrong. It was presented as if the underlying question had been fully resolved, when in fact part of the input space had never been observed. That’s a different kind of wrong from a hallucination. A hallucination invents something that isn’t there. Null-result omission silently treats absence as confirmation.

Why this matters more now than it did a year ago

Two things changed in the last twelve months.

The first is that production deployment of agents accelerated past the controls most teams have. OutSystems’ 2026 State of AI Development survey found that 96% of enterprises are running AI agents, 94% are concerned that sprawl is increasing complexity and technical debt, and only 12% have a central platform to manage them. The runway for “we’ll figure this out later” is gone.

The second is that the regulatory layer started asking actual questions. The EU AI Act requires, among other things, behavioral auditing of AI systems before they touch your data, while they’re processing your data, and after they produce output. “The model said X” is not going to be a sufficient answer to “why did the system make this decision.” The reconstructable record is the answer, and most teams don’t have one.

That second piece moves silent failure from an engineering problem to an operational and legal one. If a customer asks why a recommendation was made, or a buyer asks what was tested, or a regulator asks whether the system performed equitably across subgroups, the right response is not the model output. The right response is: here is the evidence the system worked from, here is what it didn’t have, here is what it assumed, and here is how we’d reconstruct the path if we had to.

Most production AI systems cannot answer any of those questions. The answers were never written down, because nothing in the design required them to be written down.

The layer this points to

The way out is not better models. The way out is a layer above the application code that does what the model and the scaffold do not do on their own: preserve the trace, record absence as a first-class object, surface assumptions before they become outputs, and let the path from question to answer be reconstructed after the fact.

I’ve been calling this layer epistemic engineering. The name is doing real work. Engineering, because it has to be built into the system rather than retrofitted at audit time. Epistemic, because the questions it answers are not “did the code run” but “what did the system know, what didn’t it know, what did it assume, and how confident should anyone be in the answer it produced.”

The questions a team should be able to answer about every important AI output:

What evidence existed.
What evidence was missing.
What got excluded from consideration, and why.
What the model was anchored on before it generated anything.
What assumptions the scaffold introduced.
What failed silently along the way.
Whether the path from input to output can be reconstructed later.
Whether the same failure can be mitigated next time.

If the system can answer those, the silent failure mode collapses, because nothing is silent anymore. The absences are recorded. The anchors are visible. The assumptions are surfaced. The output is not the only artifact; the path to the output is also an artifact.
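As a rough illustration of treating the path to the output as an artifact, the sketch below stores those questions as fields on a record written alongside each answer. This is one possible shape under my own assumptions, not the Hermes Labs tooling; `EvidenceRecord`, its field names, and the sample values are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class EvidenceRecord:
    """One record per important output: what was known, missing, and assumed."""
    question: str
    evidence_used: list[str] = field(default_factory=list)
    evidence_missing: list[str] = field(default_factory=list)
    excluded: list[dict[str, str]] = field(default_factory=list)  # item + reason
    anchors: list[str] = field(default_factory=list)
    assumptions: list[str] = field(default_factory=list)
    silent_failures: list[str] = field(default_factory=list)
    output: Optional[str] = None

    def is_complete(self) -> bool:
        """True only if nothing was missing and nothing failed along the way."""
        return not self.evidence_missing and not self.silent_failures

    def to_json(self) -> str:
        """Serialize the record so the path can be reconstructed later."""
        return json.dumps(asdict(self), indent=2)


# Hypothetical record for the lead enrichment example.
record = EvidenceRecord(question="What is this contact's email address?")
record.evidence_used.append("company domain acme.example")
record.evidence_missing.append("no mailbox verification was performed")
record.excluded.append({"item": "stale CRM entry", "reason": "older than cutoff"})
record.assumptions.append("address follows first.last@domain")
record.output = "jane.doe@acme.example"

print(record.is_complete())
print(record.to_json())
```

A dashboard that goes green on `output` alone would pass this record. One that also checks `is_complete()` would not, and the JSON shows why.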

This is the layer I’m building Hermes Labs around. Published research and open-source tooling are linked at the bottom for anyone who wants to look at the work directly.

Most of it isn’t a tool. Most of it is a way of asking different questions about a pipeline, before it ships and after.

Silent failures are not a model problem. They are a design problem in the layer your team probably doesn’t own yet. The teams that build that layer first will have a defensible position when their AI gets questioned by a customer, a regulator, or a buyer. The teams that don’t will find out the same way you find out about a bounced email: too late, in public, and at the worst possible moment.

If your team is deploying AI somewhere the output will eventually get challenged, this is the layer to start thinking about now.

Roli Bosch is the founder of Hermes Labs, building the auditability and epistemic engineering layer for production AI systems. Research: The Asymmetric Burden of Proof and A Taxonomy of Epistemic Failure Modes in Large Language Models. Open-source tooling at github.com/hermes-labs-ai.

This post was written by a Claude Opus 4.7 instance under Roli’s direct supervision and direction, as a companion piece to the Hermes Labs Field Notes Ep. 1 video on epistemic engineering.