Case study · Upstream contribution · 2026

When the System Prompt Vanishes: A Silent Failure in Semantic Kernel

This case study covers a merged Python fix for a bug in Microsoft Semantic Kernel, a framework teams ship LLM products on, that silently deleted the system prompt. As conversations grew long, the Python SDK dropped the system prompt during history truncation, and the model stopped following its instructions, with no error and no log. A community user had reported the symptom in 2025, and Microsoft had already corrected the same defect on the .NET side. What was missing was the Python port and the diagnosis of why the defect surfaced there. We traced the Python root cause to a summarization-only helper, ported the documented .NET fix, and identified that the sibling summarization reducer carried the same defect, which we flagged for a follow-up rather than fixing here. Microsoft merged the truncation fix as PR #13610. The named contribution is the failure mode itself: silent instruction relaxation, in which the model loses its instructions and nothing in the system signals it.

Semantic Kernel ships history reducers that keep a conversation inside the model's context window by trimming old turns. The truncation reducer, ChatHistoryTruncationReducer, did its trimming by calling extract_range, a helper built for the summarization path. That helper unconditionally filters out system and developer messages, which is correct behavior when you are about to feed history to a summarizer, and incorrect behavior when you are truncating the live conversation. The result: once a conversation grew long enough to trigger truncation, the system prompt was dropped.
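The mechanism can be illustrated with a toy model. This is a minimal sketch, not Semantic Kernel's code: `Msg`, `extract_range_for_summary`, and `truncate_buggy` are hypothetical stand-ins chosen for illustration. In this toy the prompt is lost twice over: the tail slice starts past index 0, and the helper's role filter would discard the prompt even if it fell inside the kept range.

```python
from dataclasses import dataclass


# Hypothetical stand-in for a chat message; not Semantic Kernel's real type.
@dataclass(frozen=True)
class Msg:
    role: str      # "system", "developer", "user", or "assistant"
    content: str


def extract_range_for_summary(history, start):
    """Mimic a summarization-oriented helper: skip system/developer messages."""
    return [m for m in history[start:] if m.role not in ("system", "developer")]


def truncate_buggy(history, target_count):
    """Truncate by reusing the summarization helper, reproducing the defect."""
    if len(history) <= target_count:
        return history  # short conversations are returned untouched
    return extract_range_for_summary(history, len(history) - target_count)


history = [Msg("system", "Only answer questions about billing.")]
for i in range(4):
    history.append(Msg("user", f"question {i}"))
    history.append(Msg("assistant", f"answer {i}"))

reduced = truncate_buggy(history, target_count=4)
print([m.role for m in reduced])                 # ['user', 'assistant', 'user', 'assistant']
print(any(m.role == "system" for m in reduced))  # False
```

The returned list is well-formed and the right length, which is why nothing downstream has a reason to complain.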

The failure was silent. No exception. No warning log. The reducer returned a valid, well-formed history that simply no longer contained the instructions the application author wrote. The model then answered without its guardrails, its persona, or its task framing, exactly in the long conversations where staying on task matters most. This is a concrete merged instance of what we categorize as silent instruction relaxation: the model stops following its instructions, and nothing in the system tells you it happened.

  • 2025-01-31 · Microsoft corrects the same defect on the .NET side in PR #10344 — the documented reference for what correct behavior looks like.
  • 2025-06-27 · A community user files issue #12612: the Python ChatHistoryTruncationReducer deletes the system prompt. The report stays open for months: the defect raises nothing at runtime, and the .NET fix has not been ported to Python.
  • 2026-03-01 · Hermes Labs reproduces the Python deletion, traces the root cause to the summarization-only extract_range helper, ports the .NET approach to the truncation reducer, and opens PR #13610 with regression tests. The diagnosis also shows that the sibling summarization reducer carries the same defect; the PR documents it and scopes its repair to a follow-up to keep the change focused.
  • 2026-03-19 · Microsoft maintainers review and merge PR #13610. The Python truncation reducer now preserves the system prompt across reduction; the flagged summarization defect remains open for the follow-up.

The shape worth naming: a defect that throws no exception and logs no warning can sit live in a widely used SDK for months, reported but unfixed, even after the same behavior has been corrected in a sibling SDK, because nothing surfaces it at runtime. The engineering job here was to reproduce the Python failure, confirm it against the .NET reference, and produce a patch a maintainer could adopt without redoing the work.

The same binary-gated discipline Hermes Labs runs as a diagnostic pipeline on private client stacks also governs how it produces a public upstream fix: pass a gate, or stop. No gate is skippable. No gate is judgment-graded. The intent is that an opinionated process produces a falsifiable artifact at each step, so the diff at the end is one a maintainer can adopt on its own merits.

Gate 0 · Qualify. Score the target against the systematic failure categories and the comprehension cost. A widely used Microsoft framework dropping the system prompt is a high-alignment, high-impact case: instruction following is the whole reason the system prompt exists.

Gate 1 · Reproduce. Drive ChatHistoryTruncationReducer.reduce() with a history that opens with a system message and exceeds target_count. Confirm the returned history no longer contains the system message. No reproduction, no work.

Gate 2 · Investigate. Trace the failing path line by line into extract_range and find the unconditional system/developer filter. Establish the external contract: the .NET SDK had already fixed the same defect in PR #10344, so the correct behavior was documented in the project itself, not invented here.

Gate 3 · Audit. Falsify the hypothesis on purpose. Is the deletion intended? Does a caller re-inject the system prompt downstream? Is the summarization filter load-bearing for truncation? Each competing explanation has to be ruled out against the code before the fix is allowed to proceed.

Gate 4 · Test. A regression test that fails on the unfixed reducer and passes on the fixed one, asserting the preserved system message and exercising the framework's own reducer at the right call-graph depth, plus the edge cases: developer-role messages, no-system histories, and target_count=1.

Gate 5 · Fix. The minimum diff that closes the test, matching the .NET semantics: detect the first system/developer message, adjust the reduction target to account for the preserved slot, slice the tail instead of stripping, and re-prepend the system message if it was truncated away.

Gate 6 · Review. Adversarial reviewers run in context-isolated subagents and score the same binary gates before submission. An arbiter returns ship or no-ship. Maintainer review is a separate, real bar after that — the value of the internal pass is a defensible diff entering review, not a claim that review found nothing.

Gate 7 · Report. A structured internal report: summary, reproduction, root cause, fix, test, contracts verified, edge cases, and review tally. The same diagnosis established that the sibling summarization reducer carried the same defect; the PR documents it and scopes the repair to a follow-up, so the reach of the diagnosis is on the record even where the fix is deferred.

Each gate emits a falsifiable artifact, so the work can be re-checked rather than trusted. The proprietary tooling that runs the pipeline stays in-house.
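Gate 4's fail-then-pass requirement can be sketched as a small contract checker. This is an illustrative sketch under stated assumptions, not the tests from the merged PR: `Msg`, `make_history`, `contract_violations`, and both toy reducers are hypothetical names, and the checker exercises plain functions rather than the framework's own reducer.

```python
from dataclasses import dataclass


# Hypothetical stand-in for a chat message; not Semantic Kernel's real type.
@dataclass(frozen=True)
class Msg:
    role: str
    content: str


def make_history(first_role, turns=6):
    """Build a history that optionally opens with a system/developer message."""
    history = [Msg(first_role, "instructions")] if first_role else []
    history += [
        Msg("user" if i % 2 == 0 else "assistant", f"turn {i}") for i in range(turns)
    ]
    return history


def contract_violations(reduce_fn):
    """Describe every case in which reduce_fn breaks the reduction contract."""
    violations = []
    for first_role in ("system", "developer", None):
        for target_count in (1, 3):
            history = make_history(first_role)
            reduced = reduce_fn(history, target_count)
            if len(reduced) > target_count:
                violations.append(f"{first_role}/{target_count}: too many messages kept")
            if first_role and (not reduced or reduced[0] != history[0]):
                violations.append(f"{first_role}/{target_count}: leading prompt lost")
    return violations


# Two deliberately tiny reducers to exercise the checker.
def naive_tail(history, target_count):
    """Keep only the most recent messages, ignoring roles."""
    return history[-target_count:]


def prompt_preserving_tail(history, target_count):
    """Keep a leading system/developer message plus the most recent messages."""
    if history and history[0].role in ("system", "developer"):
        keep = target_count - 1
        return history[:1] + history[len(history) - keep:]
    return history[-target_count:]


print(contract_violations(naive_tail))
print(contract_violations(prompt_preserving_tail))
```

The same checker reports violations for the naive reducer and none for the prompt-preserving one, which is the property a regression test needs in order to pin the defect.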

The defect was the truncation reducer routing through a summarization-only helper that strips system and developer messages. The fix stops that routing and preserves the system prompt explicitly. The target behavior did not have to be invented: Microsoft had already corrected the same defect on the .NET side in PR #10344, which served as the verified reference for what correct looks like. The truncation reducer now locates and retains the first system or developer message:

# Preserve the first system/developer message so it is not
# lost during truncation (matches the .NET SDK behavior).
system_message_index = next(
    (i for i, msg in enumerate(history)
     if msg.role in (AuthorRole.SYSTEM, AuthorRole.DEVELOPER)),
    -1,
)
system_message = history[system_message_index] if system_message_index >= 0 else None

The reduction index calculation takes a new has_system_message flag so the preserved prompt is accounted for in the target count, with a guard added so a small target_count cannot produce an IndexError:

if has_system_message:
    target_count -= 1
    if target_count <= 0:
        logger.warning(
            "target_count after accounting for system message is %d; "
            "reduction will keep only the system message.",
            target_count,
        )

The patch touched the truncation reducer and the shared reducer utilities. The merged PR updated the existing truncation tests and added new ones covering system messages, developer messages, the no-system case, and the target_count=1 edge. The .NET fix gave the reference behavior, and the Microsoft maintainers reviewed and merged the Python port. The diagnosis also reached the sibling summarization reducer, which carries the same defect through the same helper; the PR documents that and scopes the repair to a follow-up to keep the change focused, so its tests are unaffected here. The community report and the .NET fix had not flagged the Python sibling, which is the measure of how far the diagnosis ran past the reported bug.
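The four steps of the fix (detect, adjust the target, slice the tail, re-prepend) can be put together in a self-contained sketch. It is an approximation under stated assumptions rather than the merged code: `Msg` and `truncate_preserving_prompt` are hypothetical names, and the sketch leaves out the threshold handling and any safe-cut-point logic a production reducer would need.

```python
from dataclasses import dataclass

PRESERVED_ROLES = ("system", "developer")


# Hypothetical stand-in for a chat message; not Semantic Kernel's real type.
@dataclass(frozen=True)
class Msg:
    role: str
    content: str


def truncate_preserving_prompt(history, target_count):
    """Keep the first system/developer message plus the most recent messages."""
    if len(history) <= target_count:
        return list(history)

    # 1. Detect the first system/developer message, if any.
    prompt_index = next(
        (i for i, m in enumerate(history) if m.role in PRESERVED_ROLES), -1
    )
    prompt = history[prompt_index] if prompt_index >= 0 else None

    # 2. Reserve one slot of the target for the preserved message.
    tail_size = target_count - 1 if prompt is not None else target_count
    tail_size = max(tail_size, 0)

    # 3. Slice the tail without filtering by role.
    start = len(history) - tail_size
    tail = history[start:]

    # 4. Re-prepend the prompt only if the slice cut it off.
    if prompt is not None and prompt_index < start:
        return [prompt, *tail]
    return tail


history = [Msg("system", "Only answer questions about billing.")]
history += [Msg("user" if i % 2 == 0 else "assistant", str(i)) for i in range(8)]

print([m.role for m in truncate_preserving_prompt(history, 4)])
# ['system', 'assistant', 'user', 'assistant']
print(truncate_preserving_prompt(history, 1) == [history[0]])  # True
```

With `target_count=1` the sketch returns only the prompt, mirroring the case the warning in the excerpt above describes.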

For a team that runs a history reducer to keep long conversations inside the context window, the cost of this failure is every long conversation in which the model ignored its system prompt: the safety framing, the tool-use rules, the persona, the task constraints. There is no error to catch, no metric that moves, no log to grep. A short test conversation never grows long enough to trigger truncation, so the deletion never shows up in testing and surfaces only in long-running production sessions, which are the hardest to inspect. The failure is reproducible and testable once you know to look, which is why a targeted regression test pins it and why the diagnosis is the work: the bug hides not because it cannot be tested, but because nothing prompts you to write that test.
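One way to give this failure a log line is to check the invariant around whatever reducer is in use. The wrapper below is a hedged sketch, not part of the merged fix or of Semantic Kernel: `Msg`, `first_prompt`, and `reduce_with_guard` are hypothetical names, and `reduce_fn` stands for any callable that takes a history and a target count.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("history_guard")


# Hypothetical stand-in for a chat message; not Semantic Kernel's real type.
@dataclass(frozen=True)
class Msg:
    role: str
    content: str


def first_prompt(history):
    """Return the first system/developer message, or None if there is none."""
    return next((m for m in history if m.role in ("system", "developer")), None)


def reduce_with_guard(reduce_fn, history, target_count):
    """Run reduce_fn and log a warning if the leading prompt did not survive."""
    before = first_prompt(history)
    reduced = reduce_fn(history, target_count)
    if before is not None and first_prompt(reduced) != before:
        logger.warning(
            "System prompt missing after reduction to %d messages", target_count
        )
    return reduced
```

A check like this does not repair anything; it only turns a silent deletion into something a log search can find.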

This case is public, so the credit is shared. The symptom was reported by a community user, and the correct behavior was already established on the .NET side. The Python port carried that behavior over, held up in a Microsoft maintainer's review, and merged. When Hermes Labs audits a private stack, the same discipline turns to discovery, and the deliverable is the receipts: the reproduction script, the falsification log, the failing-then-passing regression test, and the verified contract behind the fix. A buyer can read the report cold and decide whether the failure applies to their stack and whether the audit checked the right things.

This is the same pipeline Hermes Labs runs whether the target is a chat history reducer, a retrieval layer, a prompt template, an agent harness, or a tool definition. The systematic failure categories vary; the gate structure does not. Our upstream contributions are listed at /open-source/contributions.

If you ship an LLM-based product where the model is supposed to follow a system prompt, ask what happens to that prompt when the conversation gets long. A silent instruction relaxation does not throw, does not log, and does not move a dashboard. We run this diagnostic pipeline against production AI stacks under NDA.

Talk to us about an audit

References: SK #13610 · SK #12612 · .NET #10344 · PR filed under our founder's GitHub handle, roli-lpci · Upstream contributions: /open-source/contributions