Case study · Upstream contribution · 2026

Fixing a Silent 400: Forced tool_choice Under Claude Thinking in LangChain

A merged upstream fix for a silent, runtime-only failure. It passed every static check a team would normally trust: the binding raised nothing, the agent graph type-checked, the unit tests were green, the linter was clean. Only a live API call exposed it, as a 400 returned whenever forced tool_choice met a Claude extended-thinking model. A community reporter filed the issue and root-caused it. Hermes Labs engineered the guard one layer down, at the bind_tools() binding, so any caller that forces a tool is covered, proving which tool_choice values to drop versus keep, holding the guard inert when thinking is off, and adding regression tests. LangChain merged it as PR #35544.

§ 01 The failure

Forced tool_choice meets thinking

Anthropic's API has a hard rule: when extended thinking is enabled, you may not force tool use. A request that sets thinking={"type": "enabled", ...} and also sends a forcing tool_choice, meaning "any" or a specific named tool, is rejected with a 400: "Thinking may not be enabled when tool_choice forces tool use." The non-forcing value tool_choice="auto" is permitted with thinking and works fine, which is exactly what makes the failure narrow and easy to misattribute.
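The rule can be written as a small predicate over the request payload. The sketch below is a local model of the constraint as described above, not Anthropic's implementation. The function name violates_thinking_constraint is hypothetical, and treating a missing tool_choice as non-forcing is an assumption.

```python
from typing import Any, Optional

# Per the rule above, "any" and a named tool force tool use; "auto" does not.
FORCING_TYPES = ("any", "tool")


def violates_thinking_constraint(
    thinking: Optional[dict[str, Any]],
    tool_choice: Optional[dict[str, Any]],
) -> bool:
    """Return True if this payload combination matches the rejected case."""
    thinking_on = thinking is not None and thinking.get("type") == "enabled"
    # Assumption: an absent tool_choice behaves like the non-forcing default.
    forcing = tool_choice is not None and tool_choice.get("type") in FORCING_TYPES
    return thinking_on and forcing
```

Only the combination of enabled thinking and a forcing choice returns True; every other pairing, including thinking with "auto", is allowed.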

In langchain-anthropic, ChatAnthropic.bind_tools() normalized tool_choice and passed it straight through to the API, with no awareness of whether thinking was active on the model. Any higher-level construct that forces a tool, the most common being create_agent with a response_format schema, binds the structured-output tool with tool_choice="any". Combine that one forcing path with a thinking-enabled Claude model and the request never reached a useful response. It died at the API boundary with a 400. Agents using tool_choice="auto", the default, were unaffected.
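To make the pass-through concrete, here is a minimal sketch of what normalizing tool_choice might look like. It is modeled on the description above and is not copied from langchain-anthropic; the helper name normalize_tool_choice and the exact string-to-dict mapping are assumptions for illustration.

```python
from typing import Any, Optional, Union


def normalize_tool_choice(
    tool_choice: Optional[Union[str, dict[str, Any]]],
) -> Optional[dict[str, Any]]:
    """Hypothetical normalization: turn the caller's value into a dict payload."""
    if not tool_choice:
        return None  # nothing requested; the provider default applies
    if isinstance(tool_choice, dict):
        return tool_choice  # already in payload form
    if tool_choice in ("any", "auto"):
        return {"type": tool_choice}
    # Any other string is treated as the name of a specific tool to force.
    return {"type": "tool", "name": tool_choice}
```

Nothing in this step looks at the model's thinking setting, which is the gap described above: a forcing value leaves normalization in exactly the form the API will reject.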

The failure was silent everywhere it would normally be caught. The binding itself raised nothing. The model was configured correctly. The agent graph type-checked. The unit tests stayed green and the linter stayed clean, because none of those touch a live endpoint. The constraint is enforced by Anthropic at request time, so the only thing that surfaces it is an actual API call. When it did surface, it surfaced far from the line that caused it: a request-level rejection naming tool_choice rather than the binding that set it. For a team adding extended thinking to an existing forced-tool agent, the agent simply stopped working, every static gate said the code was fine, and the stack trace pointed at the wrong layer.

§ 02 Timeline

Public record

2026-03-03 · A community user files issue #35539 with a self-contained repro and an accurate root cause: create_agent with a response_format hardcodes tool_choice="any" for the structured-output tool, which Claude rejects under extended thinking. The diagnosis and the repro are the reporter's.
2026-03 · Hermes Labs engineers the fix one layer down, at the ChatAnthropic.bind_tools() binding, so any caller that forces a tool, not just create_agent, is covered. The work confirms the reporter's cause against the code, proves which tool_choice values must be dropped versus preserved, verifies the guard is inert when thinking is off, and adds regression tests. Hermes opens PR #35544.
2026-03-05 · LangChain maintainers merge PR #35544 to master, closing #35539.

§ 03 The engineering discipline behind the fix

Binary at every gate

The same binary-gated discipline Hermes Labs runs as a diagnostic pipeline on private client stacks also governs how it produces a public upstream fix: pass a gate, or stop. No gate is skippable. No gate is judgment-graded. The intent is that an opinionated process produces a falsifiable artifact at each step, so the diff at the end is one a maintainer can adopt without re-deriving the reasoning. The gates below describe that discipline as it bears on this fix.

Gate 0 · Qualify. The target is a top-tier framework, the failure sits squarely in the model-binding layer, and comprehension cost is bounded to one provider package. Above threshold.

Gate 1 · Reproduce. The reporter's example reproduced the 400 exactly: a thinking-enabled ChatAnthropic plus a forced-tool structured-output path, run from a known commit. The repro confirmed the symptom; it ran clean under every static check and only the live call returned the 400. No reproduction, no work.

Gate 2 · Investigate. Confirm the reporter's root cause against the code and locate the precise point to guard. Tracing the failing call path line by line shows that the forced tool_choice="any" originates in the structured-output-to-tool conversion and flows untouched through bind_tools() to the request, where the API rejects it. The binding is therefore the one place where a single guard covers every forcing caller. The same trace surfaced the in-repo contract for the fix: the structured-output path already handles this exact constraint in _get_llm_for_structured_output_when_thinking_is_enabled, which omits tool_choice for the same API reason. That precedent set the shape of the fix, so the guard would look like code already in the repo rather than a novel pattern.

Gate 3 · Audit. Falsify the hypothesis on purpose. Confirm that the rejection is a documented Anthropic constraint and not a transient or a model-version artifact. Confirm tool_choice="auto" is permitted with thinking and returns a valid response, so the guard must drop only the forcing variants ("any" and a named tool), never "auto". This is also the check that disproves any "thinking plus tools is simply broken" reading: only the forced path fails. Confirm the guard must be inert when thinking is absent, None, or explicitly disabled, so existing tool-forcing agents see no behavior change.

Gate 4 · Test. Tests that fail on unfixed code and pass on fixed code, exercising the real bind_tools() behavior rather than a mock at the patched line: forced string, forced dict, the "any" and "tool" dict types, the "auto" passthrough, and the thinking-disabled, None, and absent cases.
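The case matrix above can be sketched as a table-driven check. This is a self-contained illustration, not the merged test file: apply_guard is a hypothetical stand-in for the guard step, the warning text is abbreviated, and the tool name "Answer" and budget_tokens value are placeholders.

```python
import warnings

ENABLED = {"type": "enabled", "budget_tokens": 2048}  # placeholder budget


def apply_guard(thinking, kwargs):
    """Hypothetical stand-in for the guard; mutates and returns kwargs."""
    choice = kwargs.get("tool_choice")
    if (
        thinking is not None
        and thinking.get("type") == "enabled"
        and choice is not None
        and choice.get("type") in ("any", "tool")
    ):
        warnings.warn("forced tool_choice dropped: thinking is enabled", stacklevel=2)
        del kwargs["tool_choice"]
    return kwargs


CASES = [
    # (label, thinking, tool_choice, expect_dropped)
    ("any under thinking", ENABLED, {"type": "any"}, True),
    ("named tool under thinking", ENABLED, {"type": "tool", "name": "Answer"}, True),
    ("auto under thinking", ENABLED, {"type": "auto"}, False),
    ("thinking disabled", {"type": "disabled"}, {"type": "any"}, False),
    ("thinking None", None, {"type": "any"}, False),
]


def run_cases():
    """Check each case for both the drop and the warning; return what was dropped."""
    results = {}
    for label, thinking, choice, expect_dropped in CASES:
        with warnings.catch_warnings(record=True) as caught:
            warnings.simplefilter("always")
            out = apply_guard(thinking, {"tool_choice": dict(choice)})
        dropped = "tool_choice" not in out
        assert dropped == expect_dropped, label
        assert (len(caught) == 1) == expect_dropped, label
        results[label] = dropped
    return results
```

Replacing apply_guard with a function that returns kwargs unchanged makes the first two cases fail, which is the fail-on-unfixed property the gate asks for.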

Gate 5 · Fix. The minimum diff that closes the test: a guard added after tool_choice normalization in bind_tools(). Every changed line gets a one-sentence justification. The shape was chosen to mirror the existing structured-output guard so reviewers could verify it against precedent already in the repo.

Gate 6 · Review. Adversarial reviewers run in context-isolated subagents and score the same binary gates before submission. An arbiter returns ship or no-ship. Maintainer review is a separate, real bar after that. The value of the internal pass is a defensible diff entering review, not a claim that review found nothing.

Gate 7 · Report. A structured internal report: bug summary, reproduction, root cause, fix, test, contracts verified, version compatibility, review tally, iteration history, falsification attempts, competing hypotheses, lessons.

The internal artifact set for this fix is on file. Sharing the relevant slices is a service deliverable on request.

§ 04 The fix

A guard at the binding boundary, proven by test

The cause sits at a specific seam: bind_tools() passed a forcing tool_choice through to a thinking-enabled request with no guard. The fix is the minimum change that closes that seam. After tool_choice normalization, if thinking is enabled and the choice forces tool use, drop the choice and warn. The precedent surfaced in Gate 2 set the shape of the fix: the structured-output path already omits tool_choice under thinking for the same API reason, so reviewers could check the guard against code already in the repo rather than against a novel pattern.

# Runs after tool_choice has been normalized to a dict.
if (
    self.thinking is not None
    and self.thinking.get("type") == "enabled"
    and "tool_choice" in kwargs
    # Only the forcing variants are dropped; "auto" passes through.
    and kwargs["tool_choice"].get("type") in ("any", "tool")
):
    warnings.warn(
        "tool_choice is forced but thinking is enabled. The Anthropic "
        "API does not support forced tool use with thinking. "
        "Dropping tool_choice to avoid an API error. ...",
        stacklevel=2,
    )
    # Tool use is no longer forced on this request.
    del kwargs["tool_choice"]

The production change is roughly twenty lines in chat_models.py. The warnings module was already imported there, so the only added import was import warnings in the test file. The change is paired with eight regression test cases covering the forced variants, the "auto" passthrough, and the thinking-disabled, None, and absent cases. The warning matters as much as the guard: dropping tool_choice means tool calls are no longer guaranteed on that request, so the caller is told explicitly rather than left to discover it.

Credit to the reporter for the accurate root cause and clean repro, and to the LangChain maintainers for the structured-output guard that gave the fix its shape. Hermes's contribution was the engineering: generalizing the guard to the binding boundary so every forcing caller is covered, proving which tool_choice values had to be dropped versus preserved, showing the change was inert for every unaffected agent, and landing it through review.
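Because the guard drops tool_choice rather than failing, a caller that depends on a tool call has to handle the case where the model answers without one. The sketch below shows one way to check for that. The helper name first_call_for is hypothetical, and the input shape (a list of dicts with "name" and "args" keys) is an assumption modeled on how tool calls are commonly represented on a response message.

```python
from typing import Any, Optional


def first_call_for(
    tool_calls: list[dict[str, Any]],
    tool_name: str,
) -> Optional[dict[str, Any]]:
    """Return the args of the first call to tool_name, or None if there is none."""
    for call in tool_calls:
        if call.get("name") == tool_name:
            return call.get("args", {})
    # With tool use no longer forced, the model may reply in plain text instead.
    return None
```

A caller might treat a None result as the signal to retry, fall back to parsing the text reply, or surface an error, depending on how strict the structured output needs to be.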

§ 05 What this kind of audit delivers

The class of failure, and the receipts

For a team running Claude agents, this is the failure shape that is most expensive to diagnose in-house: a provider constraint, enforced at the API boundary, triggered by a value set two layers up in a framework you did not write. Nothing static catches it. The types are sound, the unit tests pass, the linter is quiet; the only signal is a live request that comes back 400. The error message is accurate but easy to misread, because it names tool_choice and not the binding that forced it. Teams burn hours bisecting their own agent graph before the cause turns out to be a one-line interaction between two correct-looking settings, one of which can only be observed against the real endpoint.

This case is public, so the credit is shared. The reporter supplied the diagnosis. The engineering held up in a maintainer's review and merged. When Hermes Labs audits a private stack, the same discipline turns to discovery, and the deliverable is the receipts: the reproduction, the falsification log that rules out competing causes, the failing-then-passing regression tests, and an explicit list of what was checked, namely that auto stays untouched, that thinking-off agents see no change, and that the change is inert everywhere the failure does not apply. A buyer can read it cold and decide whether the failure applies to their stack and whether the audit checked the right things, without re-running the investigation.

This is the same pipeline Hermes Labs runs whether the target is a model binding, a prompt template, an agent harness, a retrieval layer, or a tool definition. The systematic failure categories vary; the gate structure does not.

If you run Claude agents in production, the failures that cost the most are the ones that look like your code but live in a framework binding or a provider constraint. We run this diagnostic pipeline against production AI stacks under NDA.

Talk to us about an audit →

References: #35539 · #35544 · PR filed under our founder's GitHub handle, roli-lpci · Upstream contributions: /open-source/contributions