Your AI isn't forgetting its instructions. Your framework deleted them.

Two upstream bugs I fixed in Semantic Kernel and LangChain, and what they reveal about silent failure in agent scaffolding.

Silent failures in production AI don't only happen in the model. They happen in the framework around it. Language doesn't reach the model raw: it passes through scaffolding that prompts it, truncates it, summarizes it, binds it to tools, reshapes it for the API. The framework is part of the runtime, and it can fail silently in exactly the ways we usually blame the model for.

Two upstream fixes I just shipped, one in Microsoft's Semantic Kernel and one in LangChain, are instances of the same recurring shape. In each case the framework had already learned the right behavior, written correctly in one code path, and never propagated it to its sibling. The agent looks like it is forgetting its instructions. It isn't. The framework deleted them.

Semantic Kernel: the dropped system prompt

In Microsoft's Semantic Kernel, `ChatHistoryTruncationReducer` is the component that trims your conversation when it grows past the model's context limit. It did its job by calling `extract_range()`, a helper that, by design, filters out system and developer messages. That helper was written for summarization, where dropping the system message is fine. Truncation borrowed it anyway. So once a real conversation crossed the token limit, the reducer quietly deleted the system prompt and kept going. The agent went on answering, without any of its instructions. No error, no warning, just an assistant that has silently forgotten who it is. This is instruction relaxation, normally thought of as a model behavior where system or developer instructions get softened, ignored, or dropped. Here the framework does it before the model ever sees the prompt.
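The failure is easy to reproduce in miniature. The sketch below is not Semantic Kernel's code: `Message`, `extract_range_for_summary`, `truncate_buggy`, and `truncate_preserving_system` are hypothetical names, and the budget is counted in messages for simplicity. It only illustrates how a truncation path that borrows a summarization helper can lose the system prompt, and what preserving it might look like.

```python
from dataclasses import dataclass

PRIVILEGED_ROLES = ("system", "developer")


@dataclass(frozen=True)
class Message:
    role: str      # "system", "developer", "user", or "assistant"
    content: str


def extract_range_for_summary(history, start, end=None):
    """Hypothetical summarization helper: deliberately skips system/developer messages."""
    return [m for m in history[start:end] if m.role not in PRIVILEGED_ROLES]


def truncate_buggy(history, target_count):
    """Keep the newest messages by reusing the summarization helper (drops the system prompt)."""
    if len(history) <= target_count:
        return history
    return extract_range_for_summary(history, len(history) - target_count)


def truncate_preserving_system(history, target_count):
    """Keep the first system/developer message, then spend the rest of the budget on the newest messages."""
    if len(history) <= target_count:
        return history
    system = next((m for m in history if m.role in PRIVILEGED_ROLES), None)
    if system is None:
        return history[-target_count:]
    rest = [m for m in history if m is not system]
    budget = max(target_count - 1, 0)
    tail = rest[-budget:] if budget else []
    return [system, *tail]


history = [
    Message("system", "You are a billing assistant."),
    Message("user", "q1"),
    Message("assistant", "a1"),
    Message("user", "q2"),
    Message("assistant", "a2"),
]

buggy = truncate_buggy(history, 3)
fixed = truncate_preserving_system(history, 3)
print([m.role for m in buggy])   # no "system" entry survives
print([m.role for m in fixed])   # "system" comes first, followed by the newest turns
```

Both functions return a list of the requested size and raise nothing, which is why the difference only shows up in the agent's behavior.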

The fix (PR #13610, mine, merged) preserves the system message through truncation. But the part worth noticing is how it was fixed: by porting the equivalent fix the .NET SDK already had (PR #10344). The right behavior was not unknown. It existed, written correctly, in the same product, in the other language binding. It had simply never reached the Python reducer. And the Python summarization reducer, as the PR notes, still has the same bug today.

LangChain: the rejected tool call

In LangChain, if you enable Anthropic's thinking mode and then build an agent that forces tool use (`create_agent(model, response_format=Schema)` with a thinking-enabled `ChatAnthropic`), the agent sends `tool_choice="any"` to the Anthropic API, which rejects it with a 400: "Thinking may not be enabled when tool_choice forces tool use." The call never runs. To the developer it reads as a confusing API error, not a framework bug.

The fix (PR #35544, also mine, merged) makes `bind_tools()` detect that combination and drop the forced tool_choice with a warning. And again, the interesting part: LangChain already did this correctly in a neighboring function. `with_structured_output()` had carried the exact same guard for the exact same reason for some time. `bind_tools()`, the parallel entry point most agents actually go through, never got it.
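One way to keep such a guard from living in only one entry point is to put it in a single helper that every entry point calls. The sketch below is an illustration under assumptions, not LangChain's implementation: `thinking_enabled`, `forces_tool_use`, and `resolve_tool_choice` are hypothetical names, and the configuration is modeled as plain dictionaries shaped like Anthropic's `thinking` and `tool_choice` parameters.

```python
import warnings


def thinking_enabled(model_kwargs):
    """True when an Anthropic-style thinking config is switched on."""
    thinking = model_kwargs.get("thinking") or {}
    return thinking.get("type") == "enabled"


def forces_tool_use(tool_choice):
    """True for tool_choice values that force a tool call ("any" or a named tool)."""
    if isinstance(tool_choice, str):
        # Assumption: any string other than "auto"/"none" is "any" or a tool name.
        return tool_choice not in ("auto", "none")
    if isinstance(tool_choice, dict):
        return tool_choice.get("type") in ("any", "tool")
    return False


def resolve_tool_choice(model_kwargs, tool_choice):
    """Shared guard: drop a forced tool_choice when thinking is enabled, and say so."""
    if (
        tool_choice is not None
        and thinking_enabled(model_kwargs)
        and forces_tool_use(tool_choice)
    ):
        warnings.warn(
            "thinking is enabled; dropping forced tool_choice so the API accepts the request",
            stacklevel=2,
        )
        return None
    return tool_choice


kwargs = {"thinking": {"type": "enabled", "budget_tokens": 2000}}
print(resolve_tool_choice(kwargs, "auto"))  # unchanged: "auto" does not force a tool
print(resolve_tool_choice({}, "any"))       # unchanged: thinking is off
```

If both the structured-output path and the tool-binding path route through one function like this, a fix made for one is automatically a fix for the other.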

The recurring shape

Two frameworks, two languages, two companies, opposite failure modes: one silent, one a hard error. Same underlying shape. In neither case did anyone lack the knowledge of what the code should do. Someone had already written the correct behavior, correctly, somewhere in the same framework. The defect was that the fix lived in one code path and not in its sibling.

This is the most common way mature frameworks carry bugs, and it is not how we usually look for them. We hunt for the thing nobody understood. But a framework that has been around long enough has usually already solved its hard problems, once. What it accumulates over time is parallel entry points that were supposed to share a guarantee and quietly drifted: a truncation path and a summarization path, a .NET binding and a Python binding, `bind_tools` and `with_structured_output`. A fix lands in whichever one the bug report came through. The siblings keep the old behavior. No test spans the two, because each path has its own tests and they all pass.

Detection by comparison

So the place to look is not the unfamiliar code. It is the second implementation of something you have already seen fixed. When you find a guard, a special case, a defensive copy, a `has_system_message` flag, anything that reads like someone learned a lesson here, the question to ask is: where else does this framework do the same kind of thing, and does that place know the lesson too? More often than it should, it does not. Semantic Kernel's `ChatHistorySummarizationReducer` is sitting there right now with the bug its truncation sibling just had fixed; PR #13610 notes the defect and scopes it out for a follow-up.

These bugs survive evals and tests because no test spans the siblings. You do not catch the divergence by reading either function in isolation; you catch it by noticing that two functions which should agree do not. That is a comparison, not a judgment, and the kind of thing a linter can be taught to make. This is verification by effect applied to static code: rather than reasoning about what either function should do, you flag where the two diverge.
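A comparison check of this kind can be very small. The sketch below uses Python's standard `ast` module to take a set of functions declared to be siblings and report any that never reference a guard that at least one of them uses. `names_used`, `find_drift`, and the `SAMPLE` source are hypothetical; the sample is a simplified stand-in, not LangChain's actual code, and a real linter would need a way to declare which functions are siblings.

```python
import ast


def names_used(func_node):
    """Collect every bare name and attribute name referenced inside a function."""
    used = set()
    for node in ast.walk(func_node):
        if isinstance(node, ast.Name):
            used.add(node.id)
        elif isinstance(node, ast.Attribute):
            used.add(node.attr)
    return used


def find_drift(source, siblings, guard):
    """Return siblings that never reference `guard`, provided at least one sibling does."""
    tree = ast.parse(source)
    funcs = {
        node.name: node
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        and node.name in siblings
    }
    has_guard = {name for name, node in funcs.items() if guard in names_used(node)}
    if not has_guard:
        return []  # no sibling has the guard, so there is nothing to compare against
    return sorted(set(funcs) - has_guard)


# Simplified, made-up source in which one entry point has the guard and one does not.
SAMPLE = '''
def with_structured_output(self, schema):
    tool_choice = resolve_tool_choice(self.kwargs, "any")
    return self.bind(schema, tool_choice)

def bind_tools(self, tools, tool_choice=None):
    return self.bind(tools, tool_choice)
'''

drift = find_drift(SAMPLE, {"with_structured_output", "bind_tools"}, "resolve_tool_choice")
print(drift)  # ['bind_tools']
```

The check makes no claim about what either function should do. It only reports that two functions expected to agree do not, which is enough to send a reviewer to the right place.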

The fixes are merged; the frameworks are better for it. The lesson generalizes: in a mature codebase, the next bug to find is usually a fix you have already seen, standing in the wrong function.

Roli Bosch, Hermes Labs