LLM Fallback Gateway: Steer Document
Date: 2026-04-09
Author: Srikanth Donthi (CPO/CTO) + Claude Code session 35
Status: Design Complete — Not Yet Implemented
Companion spec: llm-fallback-gateway-feature.md
1. Problem Statement
When the Anthropic API becomes unavailable — for any reason — the entire Curaway agent stack degrades to deterministic templates. The patient keeps seeing replies, but those replies are canned, the Clinical Context Agent stops extracting ICD codes, the chat extractor stops capturing medications, and the matching engine loses the LLM-driven re-ranking. The conversation continues, but the intelligence of the platform is silently turned off.
This already bit us. Earlier in this session sequence, an Anthropic account spending limit silently capped LLM calls and the symptoms were diagnosed as a code bug — the wild-goose-chase that produced multiple PRs trying to fix things that weren't broken before someone realized the API itself was the blocker. With an automatic fallback, the platform would have switched to GPT-4o mini and kept working until the limit was raised. The patient experience would have been indistinguishable, ops would have seen the fallback events, and we would have saved several hours.
Today's reality after a grep audit:
| Touchpoint | File | Failure behavior today |
|---|---|---|
| Clinical Context Agent: extract entities | app/agents/clinical_context.py:111 | Catches exception, stores raw text, queues for QStash retry. No alternate LLM. |
| Clinical Context Agent: map ICD/SNOMED codes | app/agents/clinical_context.py:180 | Catches exception, returns empty coded_entities list. Patient's clinical record is silently unstructured. |
| Clinical Context Agent: generate FHIR resources | app/agents/clinical_context.py:227 | Catches exception, returns empty fhir_resources list. Matching engine has nothing to score on. |
| Conversation LLM | app/agents/llm_conversation.py:generate_response | Catches exception, returns canned _fallback_response() template. Patient sees generic text. |
| Chat Extractor | app/services/chat_extractor.py | Catches exception, returns empty extraction. Medications, allergies, demographics from this turn are lost. |
| Match Agent: analyze clinical picture | app/agents/match_agent.py:87 | Catches exception, returns minimal fallback dict. No LLM-driven specialty determination. |
| Match Agent: rerank edge cases | app/agents/match_agent.py:144 | Catches exception, no rerank. Top 5 providers stay in deterministic order. |
| Intake Agent (legacy) | app/agents/intake_agent.py | Catches exception, returns canned text. |
Eight Claude touchpoints, eight near-identical try/except → deterministic template patterns. Adding a new agent today means writing a ninth fallback by hand, and forgetting one means a new silent failure surface.
2. Decision: Centralized LLM Gateway with Claude → GPT-4o mini Fallback
Decision: Build a single app/services/llm_gateway.py that wraps
all LLM calls. The gateway tries Claude first, GPT-4o mini second,
raises on both. Every existing agent is refactored to call into the
gateway instead of constructing a ChatAnthropic client directly. The
gateway is gated behind a Flagsmith flag (llm_fallback_enabled,
default true) for instant rollback.
Why a centralized gateway over per-agent fallbacks
The wild-goose-chase happened because every agent had its own quiet failure path. Centralizing failure handling gives us:
- One place to instrument. Every fallback fires the same Langfuse span shape and the same llm.fallback_fired event. No "did agent X log this?" ambiguity.
- One place to swap providers. When we eventually replace GPT-4o mini with MedGemma, Gemini, or a self-hosted model, it's a one-line change in the gateway. Today that change would touch eight call sites across five files and risk leaving stragglers.
- One place to enforce policy. Voice rules, output validation, timeout budgets, retry counts, and circuit-breaker thresholds all live in the gateway. Adding a new policy means changing one file, not eight call sites.
- One single entry point for new agents. Whoever writes the next agent calls await llm_gateway.invoke(...) instead of importing ChatAnthropic directly. The fallback is inherited automatically; they cannot accidentally write a non-resilient agent.
The refactor cost is real (touches 5 agent files in addition to the new gateway module), but it's a one-time payment. Every future LLM stack change pays it back.
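The try-primary, try-fallback, raise-on-both flow described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the planned implementation: the provider callables, GatewayResult, and LLMUnavailableError are hypothetical names, and the sketch treats every primary exception as fallback-eligible, which is simpler than the trigger matrix in §3.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable

# Hypothetical provider shape: takes the assembled prompt, returns reply text.
Provider = Callable[[str], Awaitable[str]]


class LLMUnavailableError(Exception):
    """Raised when the primary fails and no fallback reply is available."""


@dataclass
class GatewayResult:
    text: str
    model: str  # "primary" or "fallback" in this sketch
    fallback_fired: bool


async def invoke(
    prompt: str,
    primary: Provider,
    fallback: Provider,
    *,
    fallback_enabled: bool = True,   # stands in for the llm_fallback_enabled flag
    primary_timeout_s: float = 8.0,  # the per-call budget from the trigger matrix
) -> GatewayResult:
    """Try the primary provider, then the fallback, then raise."""
    try:
        text = await asyncio.wait_for(primary(prompt), timeout=primary_timeout_s)
        return GatewayResult(text, "primary", fallback_fired=False)
    except Exception as primary_error:
        if not fallback_enabled:
            # Flag off: surface the failure so the agent uses its template.
            raise LLMUnavailableError("primary failed, fallback disabled") from primary_error
        try:
            text = await fallback(prompt)
        except Exception as fallback_error:
            raise LLMUnavailableError("primary and fallback both failed") from fallback_error
        return GatewayResult(text, "fallback", fallback_fired=True)
```

Passing the providers in as callables keeps the sketch free of any SDK dependency; a real gateway would presumably construct the Claude and GPT clients itself.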
3. The Trigger Matrix
When does the gateway escalate from Claude to GPT-4o mini? Six failure classes, each with an explicit decision recorded here:
| # | Trigger | Behavior | Reason |
|---|---|---|---|
| 1 | Claude returns 5xx (server error) | Retry on GPT-4o mini immediately, no wait | Server-side outage; retrying Claude won't help in the patient's chat-turn budget |
| 2 | Claude returns 429 (rate limit) | Retry on GPT-4o mini immediately, no wait | Patient is in conversation; we don't have time to back off and retry Claude |
| 3 | Claude returns 401/403 (auth/billing) | Retry on GPT-4o mini immediately AND emit llm.config.error event so ops sees it | This is the exact wild-goose-chase pattern. Retry keeps the patient working; the event gives ops the signal to fix the underlying config |
| 4 | Claude timeout (no response in budget) | Retry on GPT-4o mini after Claude's per-call budget elapses (default 8 seconds, configurable) | 8s = the chat-turn target. Past that, the patient is waiting on a spinner |
| 5 | Claude returns malformed JSON that breaks the parser | Do NOT retry on GPT. Fall back to deterministic template, log the bad output | Different model = different malformation pattern. JSON failures are a Claude prompt bug; GPT will just produce a different malformed JSON. The right fix is prompt iteration, not retry. |
| 6 | Patient explicitly asked for "the better model" (Sonnet via clinical reasoning) | Sonnet failure → fall back to GPT-4o mini, not GPT-4o | Tier preservation doubles the failure surface. Patient is better served by some coherent response than by a second retry. Cheaper too. |
The six classes cover everything we've actually seen in practice. The
gateway distinguishes them via standard httpx.HTTPStatusError codes
plus a JSON-parse-error sentinel.
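One way to read triggers 1 through 5 of the matrix is as a pure classification function. The sketch below assumes the caller has already reduced the failure to a status code, a timeout flag, or a JSON-parse flag. Action and classify_failure are hypothetical names, and the RAISE branch for anything outside the matrix is an assumption, since the document does not say what happens there.

```python
from enum import Enum
from typing import Optional, Tuple


class Action(Enum):
    FALLBACK = "fallback"                      # retry on the fallback model
    FALLBACK_AND_ALERT = "fallback_and_alert"  # retry and emit llm.config.error
    TEMPLATE = "template"                      # no retry, deterministic template
    RAISE = "raise"                            # assumption: not covered by the matrix


def classify_failure(
    status_code: Optional[int] = None,
    *,
    timed_out: bool = False,
    json_parse_error: bool = False,
) -> Tuple[str, Action]:
    """Map a primary-model failure to (reason, action) per the trigger matrix."""
    if json_parse_error:
        # Trigger 5: a prompt bug, so a different model is not the fix.
        return "malformed_json", Action.TEMPLATE
    if timed_out:
        # Trigger 4: the per-call budget elapsed.
        return "timeout", Action.FALLBACK
    if status_code in (401, 403):
        # Trigger 3: auth or billing; keep the patient working and tell ops.
        return str(status_code), Action.FALLBACK_AND_ALERT
    if status_code == 429:
        # Trigger 2: rate limit.
        return "429", Action.FALLBACK
    if status_code is not None and 500 <= status_code <= 599:
        # Trigger 1: server-side outage.
        return "5xx", Action.FALLBACK
    return "unclassified", Action.RAISE
```

The reason strings match the primary_failure_reason values used in the observability payloads, so the same value can be logged without translation.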
4. Voice Rules + Output Validation
GPT-4o mini will produce text that flows through the same voice rules + medical advice filters as Claude:
- app/services/output_validator.py: regex scan for forbidden patterns (diagnosis, treatment recommendations, outcome promises)
- config/voice_rules.yaml: 30+ forbidden phrases ("I'd love to help", "in good hands", "you'll be fine", etc.)
- tests/test_no_medical_advice.py (PR #84): imperative clinical verbs in patient-facing strings
- tests/test_voice_compliance.py: voice rule scan in source files
- The CoT prompt structure (PR #76, Session 34): JSON-only response schema with structured reasoning_steps
GPT-4o mini's default phrasing leans more saccharine than Claude's ("I'd love to help you with that" is a known GPT-ism that's in our forbidden list). Two mitigations:
- GPT receives the same assembled prompt as Claude, including the forbidden phrases already injected via {forbidden_phrases_block} by prompt_loader.py. The gateway adds a short GPT-specific preamble ("You are substituting for Claude...follow all voice rules already in this prompt") but does NOT re-inject the forbidden list. (Updated post-prompt-abstraction, Session 35.)
- Output validation runs unchanged on every reply regardless of provider. If GPT trips a forbidden phrase, the validator flags it the same way it would flag a Claude reply, and we degrade to the deterministic template for that turn (not a third retry on a third model).
In practice we expect GPT to trip voice rules more often than Claude because GPT was never tuned on our prompts. The first week of production data will tell us how often. If it's >5% of GPT calls, we iterate the GPT system prompt.
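The validate-then-degrade step can be illustrated with a small sketch. It is not the code in app/services/output_validator.py. The function names are hypothetical, and the phrase list contains only the three examples quoted above rather than the full contents of config/voice_rules.yaml.

```python
import re
from typing import List, Sequence, Tuple

# Three phrases quoted in this document; the real list lives in
# config/voice_rules.yaml and is longer.
SAMPLE_FORBIDDEN: Sequence[str] = (
    "I'd love to help",
    "in good hands",
    "you'll be fine",
)


def find_forbidden(reply: str, phrases: Sequence[str] = SAMPLE_FORBIDDEN) -> List[str]:
    """Return every forbidden phrase found in the reply, ignoring case."""
    return [
        phrase
        for phrase in phrases
        if re.search(re.escape(phrase), reply, flags=re.IGNORECASE)
    ]


def validate_or_template(
    reply: str,
    template: str,
    phrases: Sequence[str] = SAMPLE_FORBIDDEN,
) -> Tuple[str, List[str]]:
    """Return (text_to_send, violations); a violation swaps in the template."""
    hits = find_forbidden(reply, phrases)
    return (template if hits else reply), hits
```

Returning the list of violations alongside the text lets the caller log which phrase tripped, which is the data needed to measure the >5% threshold mentioned above.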
5. Cost & Quality Tradeoffs
Cost: GPT-4o mini at $0.15/$0.60 per MTok is genuinely cheaper
than Claude Haiku 4.5 at $1/$5. So fallback is cheaper than primary
in pure cost terms. The dashboard's "Claude Code (dev)" derived
estimate (Anthropic total − Langfuse-traced) will see a small reduction
when fallback fires, and OpenAI's line in the spend dashboard will
move off $0 for the first time.
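To make the price gap concrete, the per-call arithmetic can be written out as below. The per-MTok prices are the ones quoted above; the token counts in the usage note are hypothetical and only illustrate the ratio.

```python
from typing import Dict, Tuple

# (input, output) USD per million tokens, as quoted in this section.
PRICES_PER_MTOK: Dict[str, Tuple[float, float]] = {
    "claude-haiku-4-5": (1.00, 5.00),
    "gpt-4o-mini": (0.15, 0.60),
}


def call_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one call: tokens times the per-MTok price, scaled to USD."""
    input_price, output_price = PRICES_PER_MTOK[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical call size, chosen only for illustration.
haiku_cost = call_cost_usd("claude-haiku-4-5", input_tokens=2000, output_tokens=500)
mini_cost = call_cost_usd("gpt-4o-mini", input_tokens=2000, output_tokens=500)
```

For that hypothetical 2,000-in / 500-out call, the result is about $0.0045 on Haiku 4.5 versus about $0.0006 on GPT-4o mini, so a fallback call costs roughly 7.5 times less than the primary call it replaces.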
Quality: GPT-4o mini is noticeably worse at the brand voice instructions (per CLAUDE.md Use Case 1: "★★★ Good but less nuanced brand voice"). For patient-facing conversation, fallback replies will sound slightly different. Acceptable tradeoff because the alternative is a deterministic template that's much more obviously canned.
JSON shape stability: GPT-4o mini follows tight JSON schemas
slightly worse than Haiku, especially nested arrays. The Clinical
Context Agent's coded_entities schema and the CoT reasoning_steps
shape have been hardened for Haiku output. Mitigation: the parser
hardening from PR #76 (the balanced-bracket JSON extractor) handles
preamble + JSON regardless of which model produced it. We'll watch
for new failure modes during the rollout window.
Voice rule compliance: addressed in §4 above. Expected to trip
more often on GPT; mitigated by the forbidden phrases already in the
assembled prompt (from prompt_loader.py) plus the GPT-specific
preamble and the runtime output validator.
6. Silent Fallback (UX Decision)
The fallback is silent. Patients do not see a "replying via backup model" badge. Reasoning:
- The standard pattern across SaaS is silent fallback. Most services with multi-model support don't surface the model switch to end users; the value is delivered, the implementation is hidden.
- A visible badge raises questions we don't have good answers for. "Why is it on backup? Is my data still safe? Will it understand me?" None of those concerns are real, but answering them in-line during a conversation about a knee replacement is the wrong moment.
- The voice tone difference is subtle. Patients are unlikely to notice; surfacing it would draw attention to something they wouldn't otherwise care about.
Operations sees the fallback via llm.fallback_fired events in
Langfuse and the events table. The patient does not. If we ever decide
that transparency is worth the friction, it's a one-line UI change to
add the badge later.
7. Observability
Three layers, no real-time alerts for the MVP.
7.1 Langfuse trace per fallback attempt
Every fallback creates a child span on the parent agent trace, tagged
llm.fallback=true with the failure reason. The LangChain callback
handler we already use for Claude tracing supports this natively — we
just need to call it for the GPT call too. Per-call metadata:
{
"primary_model": "claude-haiku-4-5",
"primary_failure_reason": "5xx" | "429" | "401" | "403" | "timeout",
"primary_failure_status": 503,
"primary_failure_message": "...",
"fallback_model": "gpt-4o-mini",
"fallback_success": true,
"fallback_latency_ms": 1240,
"fallback_cost_usd": 0.0008,
}
7.2 Events table row
Every fallback fires an llm.fallback_fired event into the existing
events table:
{
"event_type": "llm.fallback_fired",
"tenant_id": ...,
"patient_id": ...,
"payload": {
"case_id": ...,
"agent": "clinical_context.map_codes",
"primary_model": "claude-haiku-4-5",
"primary_failure_reason": "429",
"fallback_model": "gpt-4o-mini",
"fallback_success": true,
"fallback_latency_ms": 1240,
}
}
Easy to query for "how many fallbacks last week" via Metabase or a direct SQL hit. After the first week of production data, we know the real fallback rate per agent and can decide whether to tune the trigger thresholds.
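As a rough illustration of the per-agent fallback count, the sketch below aggregates event rows that have already been loaded as dictionaries in the shape shown above. It makes no assumption about the events table schema beyond the event_type and payload.agent fields; the function name is hypothetical, and in practice the same count would more likely come from Metabase or SQL.

```python
from collections import Counter
from typing import Any, Dict, Iterable, Mapping


def fallback_counts_by_agent(events: Iterable[Mapping[str, Any]]) -> Dict[str, int]:
    """Count llm.fallback_fired events per payload.agent value."""
    counts: Counter = Counter()
    for event in events:
        if event.get("event_type") != "llm.fallback_fired":
            continue  # ignore unrelated event types in the same table
        counts[event["payload"]["agent"]] += 1
    return dict(counts)


# Example rows, trimmed to the fields this function reads.
sample_events = [
    {"event_type": "llm.fallback_fired", "payload": {"agent": "clinical_context.map_codes"}},
    {"event_type": "llm.fallback_fired", "payload": {"agent": "clinical_context.map_codes"}},
    {"event_type": "llm.fallback_fired", "payload": {"agent": "chat_extractor"}},
    {"event_type": "llm.config.error", "payload": {"agent": "chat_extractor"}},
]
per_agent = fallback_counts_by_agent(sample_events)
```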
7.3 Spend dashboard reflection
The OpenAI line in the spend dashboard already handles this — the
$0 honest note added in PR #87 will auto-update the moment OpenAI
starts seeing real traffic. No dashboard work needed; the fallback
spend is captured automatically.
7.4 Real-time alerts (deferred)
For the MVP, no PagerDuty / email / Slack notification when fallbacks
exceed a threshold. Stub the integration interface so the alert
layer can be wired later (one function call, one config flag), but
do not implement an actual sender. Plan: add a _dispatch_alert()
no-op function in the gateway that future PRs can route to PagerDuty
or Slack via a third-party SDK. Today it logs a warning and does
nothing else.
8. Feature Flag
Flag name: llm_fallback_enabled
Default: true
Behavior when disabled: The gateway throws on Claude failure —
agents catch the exception and degrade to their existing deterministic
template behavior. Functionally identical to today.
A second flag, llm_fallback_provider, defaults to gpt-4o-mini and
controls which fallback the gateway uses. Setting it to a different
value (e.g., gpt-4o, gemini-1.5-flash) requires the corresponding
SDK and API key to be configured. This is the "single config change
for model switch" that the decision called for.
9. Out of Scope
- Wiring more than one fallback in a chain. No "Claude → GPT-4o mini → Gemini → self-hosted". One fallback only. Adding a second fallback doubles the test matrix and the operational complexity for marginal value.
- Streaming fallback. Today, llm_conversation.generate_response_streaming streams Claude tokens via Redis SSE. The fallback path is non-streaming: if Claude streaming fails, we fall back to GPT and serve the full response in one shot. Streaming over GPT is a future optimization.
- Per-tenant fallback configuration. Every tenant uses the same fallback model. Tenant-specific overrides are post-Series A.
- Tier matching. Sonnet failures fall back to GPT-4o mini, not GPT-4o (per Decision F).
- Real-time alerts. Stub only. Implementation deferred.
- Wiring fallback into the embedding service (embedding_service.py). Embeddings don't use Claude; they use Voyage. Voyage failure doesn't trigger the LLM fallback gateway.
- Wiring fallback into the voice service (voice_service.py). Whisper failure doesn't trigger the gateway either; voice falls back to the Web Speech API (browser-side), which is the existing default.
10. Success Criteria
Measured over the first two weeks of production after rollout:
- Zero patient-visible LLM-availability incidents. Any Claude API outage of any duration produces a non-zero llm.fallback_fired event count and zero "Claude API errored, conversation degraded" symptoms.
- Fallback rate <2% of total LLM calls under normal conditions (baseline). Spikes during Anthropic incidents are expected and not a regression.
- Voice rule violation rate on GPT replies <5% of all fallback invocations. If higher, we iterate the GPT system prompt before loosening the validator.
- End-to-end chat-turn latency does not increase by more than 200ms p95 vs baseline. The fallback only fires on failure, so the median is unaffected; the p95 captures the case where Claude timed out and we retried.
- No new voice_compliance or no_medical_advice CI test failures triggered by GPT-specific phrasing.
11. Rollback
Three rollback layers, each progressively safer:
1. llm_fallback_enabled = false in Flagsmith. Disables the fallback path within one cache TTL (60s). All agents revert to deterministic-template-on-failure. No code redeploy needed.
2. Per-agent flag override (if a specific agent's GPT fallback produces bad output, e.g., the chat extractor's structured schema doesn't survive). Pattern: llm_fallback_enabled_chat_extractor etc. Implemented as Flagsmith identifier checks inside the gateway.
3. llm_fallback_provider = none in Flagsmith. Equivalent to #1 but more explicit about the intent ("we want no fallback at all right now").
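The three layers could be resolved in one place inside the gateway, along the lines of the sketch below. It takes the flag values as a plain mapping, which is an assumption made to keep the example self-contained: the real gateway would read them through the Flagsmith client, and resolve_fallback_provider is a hypothetical name.

```python
from typing import Any, Mapping, Optional


def resolve_fallback_provider(flags: Mapping[str, Any], agent: str) -> Optional[str]:
    """Return the fallback model for this agent, or None when fallback is off.

    `flags` is a plain snapshot of flag values; a missing flag takes the
    default stated in this document.
    """
    # Layer 1: global kill switch (default true).
    if not flags.get("llm_fallback_enabled", True):
        return None
    # Layer 2: per-agent override, e.g. llm_fallback_enabled_chat_extractor.
    if not flags.get(f"llm_fallback_enabled_{agent}", True):
        return None
    # Layer 3: provider flag; "none" means no fallback at all.
    provider = flags.get("llm_fallback_provider", "gpt-4o-mini")
    if provider is None or provider == "none":
        return None
    return str(provider)
```

A None result tells the gateway to raise on Claude failure, which leaves each agent on its existing deterministic-template path.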
12. Connection to Other Specs
- Medical advice remediation (PR #84, draft): Claude and GPT calls receive the same forbidden-phrase prompt block, injected by prompt_loader.py into the assembled prompt that the gateway passes to both (see §4). PR #84's tests/test_no_medical_advice.py CI guard applies to GPT replies the same way it applies to Claude.
- Chain-of-thought prompts (PR #76): the parser hardening (the balanced-bracket JSON extractor) is what makes GPT fallback feasible for the Clinical Context Agent. Without #76, GPT's slightly different output shape would break the JSON parser.
- Conversation flow gates v2 (PR #70): gates_v2.intake_complete doesn't care which model produced the extracted medications. If fallback fires during a chat extractor call, the resulting metadata is identical (extra_metadata.medications = [...]) and the gate fires correctly.
- Case record porting (PR #85, #86): irrelevant, because porting doesn't call any LLM. The gateway has no interaction with this feature.
- Spend dashboard (PR #78–#87): OpenAI fetcher already handles the fallback spend automatically. No dashboard work needed.
13. References
- Companion feature spec: llm-fallback-gateway-feature.md
- Existing failure paths:
  - app/agents/clinical_context.py:111, 180, 227
  - app/agents/llm_conversation.py:generate_response
  - app/services/chat_extractor.py
  - app/agents/match_agent.py:87, 144
- CLAUDE.md drift to fix in implementation PR: §11.2.1 says "Fallback: GPT-4o mini" but no fallback exists today. This spec makes that line true.
- Wild-goose-chase precedent: the Anthropic spending limit incident earlier in this session sequence, which is the explicit motivation for this PR.