
LLM Fallback Gateway — Steer Document

Date: 2026-04-09
Author: Srikanth Donthi (CPO/CTO) + Claude Code session 35
Status: Design Complete (Not Yet Implemented)
Companion spec: llm-fallback-gateway-feature.md


1. Problem Statement

When the Anthropic API becomes unavailable — for any reason — the entire Curaway agent stack degrades to deterministic templates. The patient keeps seeing replies, but those replies are canned, the Clinical Context Agent stops extracting ICD codes, the chat extractor stops capturing medications, and the matching engine loses the LLM-driven re-ranking. The conversation continues, but the intelligence of the platform is silently turned off.

This already bit us. Earlier in this session sequence, an Anthropic account spending limit silently capped LLM calls, and the symptoms were misdiagnosed as a code bug. The resulting wild-goose-chase produced multiple PRs that tried to fix things that weren't broken before someone realized the API itself was the blocker. With an automatic fallback, the platform would have switched to GPT-4o mini and kept working until the limit was raised. The patient experience would have been close to indistinguishable, ops would have seen the fallback events, and we would have saved several hours.

Today's reality after a grep audit:

| Touchpoint | File | Failure behavior today |
|---|---|---|
| Clinical Context Agent: extract entities | app/agents/clinical_context.py:111 | Catches exception, stores raw text, queues for QStash retry. No alternate LLM. |
| Clinical Context Agent: map ICD/SNOMED codes | app/agents/clinical_context.py:180 | Catches exception, returns empty coded_entities list. Patient's clinical record is silently unstructured. |
| Clinical Context Agent: generate FHIR resources | app/agents/clinical_context.py:227 | Catches exception, returns empty fhir_resources list. Matching engine has nothing to score on. |
| Conversation LLM | app/agents/llm_conversation.py:generate_response | Catches exception, returns canned _fallback_response() template. Patient sees generic text. |
| Chat Extractor | app/services/chat_extractor.py | Catches exception, returns empty extraction. Medications, allergies, demographics from this turn are lost. |
| Match Agent: analyze clinical picture | app/agents/match_agent.py:87 | Catches exception, returns minimal fallback dict. No LLM-driven specialty determination. |
| Match Agent: rerank edge cases | app/agents/match_agent.py:144 | Catches exception, no rerank. Top 5 providers stay in deterministic order. |
| Intake Agent (legacy) | app/agents/intake_agent.py | Catches exception, returns canned text. |

Eight Claude touchpoints, eight near-identical try/except → deterministic template patterns. Adding a new agent today means writing a ninth fallback by hand, and forgetting one means a new silent failure surface.


2. Decision: Centralized LLM Gateway with Claude → GPT-4o mini Fallback

Decision: Build a single app/services/llm_gateway.py that wraps all LLM calls. The gateway tries Claude first and GPT-4o mini second, and raises if both fail. Every existing agent is refactored to call the gateway instead of constructing a ChatAnthropic client directly. The fallback path is gated behind a Flagsmith flag (llm_fallback_enabled, default true) for instant rollback.
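The decision above can be sketched roughly as follows. This is a minimal illustration, not the planned implementation: the `Provider` callable type, `GatewayResult`, `LLMUnavailableError`, and the `fallback_enabled` parameter (standing in for the llm_fallback_enabled Flagsmith flag) are all hypothetical names. It treats every primary failure the same way, so it does not model the malformed-JSON exception from the trigger matrix in §3.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable

# Hypothetical provider signature: takes a prompt, returns the reply text.
Provider = Callable[[str], Awaitable[str]]


class LLMUnavailableError(Exception):
    """Raised when the primary failed and no fallback produced a reply."""


@dataclass
class GatewayResult:
    text: str
    source: str          # "primary" or "fallback"
    fallback_fired: bool


async def invoke(
    prompt: str,
    primary: Provider,
    fallback: Provider,
    *,
    fallback_enabled: bool = True,   # stands in for the llm_fallback_enabled flag
    primary_timeout_s: float = 8.0,  # per-call budget named in the trigger matrix
) -> GatewayResult:
    try:
        # Enforce the primary's time budget; a timeout is handled like any failure.
        text = await asyncio.wait_for(primary(prompt), timeout=primary_timeout_s)
        return GatewayResult(text, "primary", fallback_fired=False)
    except Exception as primary_error:
        if not fallback_enabled:
            # Flag off: surface the failure so the agent uses its own template.
            raise LLMUnavailableError("primary failed, fallback disabled") from primary_error
        try:
            text = await fallback(prompt)
        except Exception as fallback_error:
            raise LLMUnavailableError("primary and fallback both failed") from fallback_error
        return GatewayResult(text, "fallback", fallback_fired=True)
```

An agent would then await `invoke(...)` and catch only `LLMUnavailableError` before degrading to its deterministic template, instead of wrapping its own provider client in a try/except.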

Why a centralized gateway over per-agent fallbacks

The wild-goose-chase happened because every agent had its own quiet failure path. Centralizing failure handling gives us:

  1. One place to instrument. Every fallback fires the same Langfuse span shape and the same llm.fallback_fired event. No "did agent X log this?" ambiguity.
  2. One place to swap providers. When we eventually replace GPT-4o mini with MedGemma, Gemini, or a self-hosted model, it's a one-line change in the gateway. Today that change would touch all eight call sites across five files and risk leaving stragglers.
  3. One place to enforce policy. Voice rules, output validation, timeout budgets, retry counts, and circuit-breaker thresholds all live in the gateway. Adding a new policy means changing one file, not eight call sites.
  4. One single entry point for new agents. Whoever writes the next agent calls await llm_gateway.invoke(...) instead of importing ChatAnthropic directly. The fallback is inherited automatically; they cannot accidentally write a non-resilient agent.

The refactor cost is real (touches 5 agent files in addition to the new gateway module), but it's a one-time payment. Every future LLM stack change pays it back.


3. The Trigger Matrix

When does the gateway escalate from Claude to GPT-4o mini? Six failure classes, each with an explicit decision recorded here:

| # | Trigger | Behavior | Reason |
|---|---|---|---|
| 1 | Claude returns 5xx (server error) | Retry on GPT-4o mini immediately, no wait | Server-side outage; retrying Claude won't help in the patient's chat-turn budget |
| 2 | Claude returns 429 (rate limit) | Retry on GPT-4o mini immediately, no wait | Patient is in conversation; we don't have time to back off and retry Claude |
| 3 | Claude returns 401/403 (auth/billing) | Retry on GPT-4o mini immediately AND emit llm.config.error event so ops sees it | This is the exact wild-goose-chase pattern. Retry keeps the patient working; the event gives ops the signal to fix the underlying config |
| 4 | Claude timeout (no response in budget) | Retry on GPT-4o mini after Claude's per-call budget elapses (default 8 seconds, configurable) | 8s = the chat-turn target. Past that, the patient is waiting on a spinner |
| 5 | Claude returns malformed JSON that breaks the parser | Do NOT retry on GPT. Fall back to deterministic template, log the bad output | Different model = different malformation pattern. JSON failures are a Claude prompt bug; GPT will just produce a different malformed JSON. The right fix is prompt iteration, not retry. |
| 6 | Patient explicitly asked for "the better model" (Sonnet via clinical reasoning) | Sonnet failure → fall back to GPT-4o mini, not GPT-4o | Tier preservation doubles the failure surface. Patient is better served by some coherent response than by a second retry. Cheaper too. |

The six classes cover everything we've actually seen in practice. The gateway distinguishes them via standard httpx.HTTPStatusError codes plus a JSON-parse-error sentinel.
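One way to encode rows 1 through 5 of the matrix is a small pure function that maps an observed failure to an action. The `Action` enum, the `classify_failure` name, and its parameters are assumptions made for illustration; row 6 is about which fallback model to pick rather than how to classify a failure, so it is not represented here. What happens for a status code outside the matrix is also an assumption (this sketch re-raises).

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    FALLBACK = "fallback"                          # retry once on GPT-4o mini
    FALLBACK_AND_CONFIG_EVENT = "fallback+event"   # also emit llm.config.error
    TEMPLATE = "deterministic_template"            # no retry on GPT
    RAISE = "raise"                                # assumption: not in the matrix


def classify_failure(
    status_code: Optional[int] = None,
    *,
    timed_out: bool = False,
    json_parse_error: bool = False,
) -> Action:
    if json_parse_error:
        return Action.TEMPLATE                     # row 5
    if timed_out:
        return Action.FALLBACK                     # row 4
    if status_code in (401, 403):
        return Action.FALLBACK_AND_CONFIG_EVENT    # row 3
    if status_code == 429:
        return Action.FALLBACK                     # row 2
    if status_code is not None and 500 <= status_code <= 599:
        return Action.FALLBACK                     # row 1
    return Action.RAISE                            # anything the matrix does not cover
```

Keeping the classification free of I/O makes each row of the matrix testable with a one-line assertion.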


4. Voice Rules + Output Validation

GPT-4o mini will produce text that flows through the same voice rules + medical advice filters as Claude:

  • app/services/output_validator.py — regex scan for forbidden patterns (diagnosis, treatment recommendations, outcome promises)
  • config/voice_rules.yaml — 30+ forbidden phrases ("I'd love to help", "in good hands", "you'll be fine", etc.)
  • tests/test_no_medical_advice.py (PR #84) — imperative clinical verbs in patient-facing strings
  • tests/test_voice_compliance.py — voice rule scan in source files
  • The CoT prompt structure (PR #76, Session 34) — JSON-only response schema with structured reasoning_steps

GPT-4o mini's default phrasing leans more saccharine than Claude's ("I'd love to help you with that" is a known GPT-ism that's in our forbidden list). Two mitigations:

  1. GPT receives the same assembled prompt as Claude — including the forbidden phrases already injected via {forbidden_phrases_block} by prompt_loader.py. The gateway adds a short GPT-specific preamble ("You are substituting for Claude...follow all voice rules already in this prompt") but does NOT re-inject the forbidden list. (Updated post-prompt-abstraction, Session 35.)
  2. Output validation runs unchanged on every reply regardless of provider. If GPT trips a forbidden phrase, the validator flags it the same way it would flag a Claude reply — and we degrade to the deterministic template for that turn (not a third retry on a third model).
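The second mitigation (validate, then degrade to the template rather than try a third model) could look roughly like this. It is a simplified sketch: the real app/services/output_validator.py is described as a regex scan, whereas this uses a case-insensitive substring check, and `find_violations`, `finalize_reply`, and the three-phrase list are illustrative stand-ins for the contents of config/voice_rules.yaml.

```python
from typing import Iterable, List, Tuple

# Three of the forbidden phrases quoted in this section; the real list is longer.
FORBIDDEN_PHRASES = ["I'd love to help", "in good hands", "you'll be fine"]


def find_violations(reply: str, forbidden_phrases: Iterable[str]) -> List[str]:
    """Return every forbidden phrase that appears in the reply (case-insensitive)."""
    lowered = reply.lower()
    return [phrase for phrase in forbidden_phrases if phrase.lower() in lowered]


def finalize_reply(
    reply: str,
    forbidden_phrases: Iterable[str],
    template: str,
) -> Tuple[str, List[str]]:
    """Return (text to send, violations). Any violation degrades to the template."""
    violations = find_violations(reply, forbidden_phrases)
    if violations:
        return template, violations
    return reply, []
```

The same function is applied regardless of which provider produced `reply`, which is the point of running validation after the gateway rather than inside a provider-specific path.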

In practice we expect GPT to trip voice rules more often than Claude because GPT was never tuned on our prompts. The first week of production data will tell us how often. If it's >5% of GPT calls, we iterate the GPT system prompt.


5. Cost & Quality Tradeoffs

Cost: GPT-4o mini at $0.15/$0.60 per MTok is genuinely cheaper than Claude Haiku 4.5 at $1/$5. So fallback is cheaper than primary in pure cost terms. The dashboard's "Claude Code (dev)" derived estimate (Anthropic total − Langfuse-traced) will see a small reduction when fallback fires, and OpenAI's line in the spend dashboard will move off $0 for the first time.
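To make the price gap concrete, the per-call cost can be computed from the quoted per-MTok prices. The token counts in the usage note (2,000 input, 500 output) are an assumed example, not a measured Curaway average, and `call_cost_usd` is a hypothetical helper.

```python
# (input, output) price in USD per million tokens, as quoted above.
PRICES_USD_PER_MTOK = {
    "claude-haiku-4-5": (1.00, 5.00),
    "gpt-4o-mini": (0.15, 0.60),
}


def call_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one call: tokens times the per-token price for each direction."""
    input_price, output_price = PRICES_USD_PER_MTOK[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000
```

For the assumed 2,000-in / 500-out call, this works out to about $0.0045 on Haiku 4.5 and about $0.0006 on GPT-4o mini, roughly a 7.5x difference.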

Quality: GPT-4o mini is noticeably worse at the brand voice instructions (per CLAUDE.md Use Case 1: "★★★ Good but less nuanced brand voice"). For patient-facing conversation, fallback replies will sound slightly different. Acceptable tradeoff because the alternative is a deterministic template that's much more obviously canned.

JSON shape stability: GPT-4o mini follows tight JSON schemas slightly worse than Haiku, especially nested arrays. The Clinical Context Agent's coded_entities schema and the CoT reasoning_steps shape have been hardened for Haiku output. Mitigation: the parser hardening from PR #76 (the balanced-bracket JSON extractor) handles preamble + JSON regardless of which model produced it. We'll watch for new failure modes during the rollout window.
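For readers unfamiliar with the balanced-bracket approach, here is a minimal sketch of the general technique. It is not the PR #76 code, and `extract_first_json_object` is a hypothetical name. It scans for the first `{`, tracks brace depth while ignoring braces inside string literals, and tries to parse each balanced candidate.

```python
import json
from typing import Any, Optional


def extract_first_json_object(text: str) -> Optional[Any]:
    """Return the first parseable {...} object embedded in text, or None."""
    start = text.find("{")
    while start != -1:
        depth = 0
        in_string = False
        escaped = False
        for i in range(start, len(text)):
            ch = text[i]
            if in_string:
                # Inside a string literal: only watch for escapes and the closing quote.
                if escaped:
                    escaped = False
                elif ch == "\\":
                    escaped = True
                elif ch == '"':
                    in_string = False
            elif ch == '"':
                in_string = True
            elif ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:
                    try:
                        return json.loads(text[start:i + 1])
                    except json.JSONDecodeError:
                        break  # balanced but not valid JSON; try the next "{"
        start = text.find("{", start + 1)
    return None
```

Because the extractor only looks at the text, it behaves the same whether the preamble came from Haiku or GPT-4o mini; it does not check that the parsed object matches the expected schema.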

Voice rule compliance: addressed in §4 above. Expected to trip more often on GPT; mitigated by the forbidden phrases already in the assembled prompt (from prompt_loader.py) plus the GPT-specific preamble and the runtime output validator.


6. Silent Fallback (UX Decision)

The fallback is silent. Patients do not see a "replying via backup model" badge. Reasoning:

  1. The standard pattern across SaaS is silent fallback. Most services with multi-model support don't surface the model switch to end users; the value is delivered, the implementation is hidden.
  2. A visible badge raises questions we don't have good answers for. "Why is it on backup? Is my data still safe? Will it understand me?" None of those concerns are real, but answering them in-line during a conversation about a knee replacement is the wrong moment.
  3. The voice tone difference is subtle. Patients are unlikely to notice; surfacing it would draw attention to something they wouldn't otherwise care about.

Operations sees the fallback via llm.fallback_fired events in Langfuse and the events table. The patient does not. If we ever decide that transparency is worth the friction, it's a one-line UI change to add the badge later.


7. Observability

Three layers, no real-time alerts for the MVP.

7.1 Langfuse trace per fallback attempt

Every fallback creates a child span on the parent agent trace, tagged llm.fallback=true with the failure reason. The LangChain callback handler we already use for Claude tracing supports this natively — we just need to call it for the GPT call too. Per-call metadata:

{
  "primary_model": "claude-haiku-4-5",
  "primary_failure_reason": "5xx" | "429" | "401" | "403" | "timeout",
  "primary_failure_status": 503,
  "primary_failure_message": "...",
  "fallback_model": "gpt-4o-mini",
  "fallback_success": true,
  "fallback_latency_ms": 1240,
  "fallback_cost_usd": 0.0008,
}

7.2 Events table row

Every fallback fires an llm.fallback_fired event into the existing events table:

{
  "event_type": "llm.fallback_fired",
  "tenant_id": ...,
  "patient_id": ...,
  "payload": {
    "case_id": ...,
    "agent": "clinical_context.map_codes",
    "primary_model": "claude-haiku-4-5",
    "primary_failure_reason": "429",
    "fallback_model": "gpt-4o-mini",
    "fallback_success": true,
    "fallback_latency_ms": 1240,
  }
}

Easy to query for "how many fallbacks last week" via Metabase or a direct SQL hit. After the first week of production data, we know the real fallback rate per agent and can decide whether to tune the trigger thresholds.
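As an illustration of the per-agent fallback rate question, the following sketch counts llm.fallback_fired rows by agent. It assumes rows arrive as dicts shaped like the example above, with `payload` either already parsed or stored as a JSON string; the actual events table columns and access layer may differ, and `fallback_counts_by_agent` is a hypothetical helper.

```python
import json
from collections import Counter
from typing import Any, Iterable, Mapping


def fallback_counts_by_agent(rows: Iterable[Mapping[str, Any]]) -> Counter:
    """Count llm.fallback_fired events per payload.agent value."""
    counts: Counter = Counter()
    for row in rows:
        if row["event_type"] != "llm.fallback_fired":
            continue
        payload = row["payload"]
        if isinstance(payload, str):
            # Assumption: payload may be stored as serialized JSON.
            payload = json.loads(payload)
        counts[payload["agent"]] += 1
    return counts
```

Restricting the input to the last seven days is left to whatever query produces `rows`.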

7.3 Spend dashboard reflection

The OpenAI line in the spend dashboard already handles this — the $0 honest note added in PR #87 will auto-update the moment OpenAI starts seeing real traffic. No dashboard work needed; the fallback spend is captured automatically.

7.4 Real-time alerts (deferred)

For the MVP, no PagerDuty / email / Slack notification when fallbacks exceed a threshold. Stub the integration interface so the alert layer can be wired later (one function call, one config flag), but do not implement an actual sender. Plan: add a _dispatch_alert() no-op function in the gateway that future PRs can route to PagerDuty or Slack via a third-party SDK. Today it logs a warning and does nothing else.


8. Feature Flag

Flag name: llm_fallback_enabled
Default: true
Behavior when disabled: The gateway raises on Claude failure; agents catch the exception and degrade to their existing deterministic template behavior. Functionally identical to today.

A second flag, llm_fallback_provider, defaults to gpt-4o-mini and controls which fallback the gateway uses. Setting it to a different value (e.g., gpt-4o, gemini-1.5-flash) requires the corresponding SDK and API key to be configured. This is the "single config change for model switch" that the decision called for.


9. Out of Scope

  • Wiring more than one fallback in a chain. No "Claude → GPT-4o mini → Gemini → self-hosted". One fallback only. Adding a second fallback doubles the test matrix and the operational complexity for marginal value.
  • Streaming fallback. Today, llm_conversation.generate_response_streaming streams Claude tokens via Redis SSE. The fallback path is non-streaming — if Claude streaming fails, we fall back to GPT and serve the full response in one shot. Streaming over GPT is a future optimization.
  • Per-tenant fallback configuration. Every tenant uses the same fallback model. Tenant-specific overrides are post-Series A.
  • Tier matching. Sonnet failures fall back to GPT-4o mini, not GPT-4o (per Decision F, recorded as trigger 6 in §3).
  • Real-time alerts. Stub only. Implementation deferred.
  • Wiring fallback into the embedding service (embedding_service.py). Embeddings don't use Claude; they use Voyage. Voyage failure doesn't trigger the LLM fallback gateway.
  • Wiring fallback into the voice service (voice_service.py). Whisper failure doesn't trigger the gateway either — voice falls back to the Web Speech API (browser-side) which is the existing default.

10. Success Criteria

Measured over the first two weeks of production after rollout:

  • Zero patient-visible LLM-availability incidents. Any Claude API outage of any duration produces a non-zero llm.fallback_fired event count and zero "Claude API errored, conversation degraded" symptoms.
  • Fallback rate <2% of total LLM calls under normal conditions (baseline). Spikes during Anthropic incidents are expected and not a regression.
  • Voice rule violation rate on GPT replies <5% of all fallback invocations. If higher, we iterate the GPT system prompt before loosening the validator.
  • End-to-end chat-turn latency does not increase by more than 200ms p95 vs baseline. The fallback only fires on failure, so the median is unaffected; the p95 captures the case where Claude timed out and we retried.
  • No new voice_compliance or no_medical_advice CI test failures triggered by GPT-specific phrasing.
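The three numeric criteria above can be checked mechanically once the two-week counts are available. This is a sketch under assumptions: `RolloutStats` and `check_success_criteria` are hypothetical names, and the inputs are expected to come from the events table and latency metrics however those are actually exported.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class RolloutStats:
    total_llm_calls: int
    fallback_calls: int
    fallback_voice_violations: int
    p95_latency_ms: float
    baseline_p95_latency_ms: float


def check_success_criteria(stats: RolloutStats) -> Dict[str, bool]:
    """Evaluate the numeric thresholds; True means the criterion is met."""
    fallback_rate = (
        stats.fallback_calls / stats.total_llm_calls if stats.total_llm_calls else 0.0
    )
    violation_rate = (
        stats.fallback_voice_violations / stats.fallback_calls
        if stats.fallback_calls
        else 0.0
    )
    p95_increase_ms = stats.p95_latency_ms - stats.baseline_p95_latency_ms
    return {
        "fallback_rate_under_2pct": fallback_rate < 0.02,
        "voice_violation_rate_under_5pct": violation_rate < 0.05,
        "p95_increase_within_200ms": p95_increase_ms <= 200,
    }
```

The fallback-rate check is only meaningful for periods without an Anthropic incident, since spikes during incidents are expected.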

11. Rollback

Three rollback layers, each progressively safer:

  1. llm_fallback_enabled = false in Flagsmith. Disables the fallback path within one cache TTL (60s). All agents revert to deterministic-template-on-failure. No code redeploy needed.
  2. Per-agent flag override (if a specific agent's GPT fallback produces bad output — e.g., the chat extractor's structured schema doesn't survive). Pattern: llm_fallback_enabled_chat_extractor etc. Implemented as Flagsmith identifier checks inside the gateway.
  3. llm_fallback_provider = none in Flagsmith. Equivalent to #1 but more explicit about the intent ("we want no fallback at all right now").
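The three layers can be read as a single resolution order. In the sketch below, a plain dict stands in for the Flagsmith lookup (no Flagsmith SDK calls are shown), and `resolve_fallback_model` is a hypothetical helper; the flag names and the "none" value are taken from this document.

```python
from typing import Any, Mapping, Optional


def resolve_fallback_model(flags: Mapping[str, Any], agent: str) -> Optional[str]:
    """Return the fallback model to use for this agent, or None for no fallback."""
    # Layer 1: global kill switch.
    if not flags.get("llm_fallback_enabled", True):
        return None
    # Layer 2: per-agent override, e.g. llm_fallback_enabled_chat_extractor.
    if not flags.get(f"llm_fallback_enabled_{agent}", True):
        return None
    # Layer 3: explicit "none" provider disables fallback as well.
    provider = flags.get("llm_fallback_provider", "gpt-4o-mini")
    if provider == "none":
        return None
    return str(provider)
```

A `None` result means the gateway raises on Claude failure and the calling agent degrades to its deterministic template, matching the flag-disabled behavior described in §8.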

12. Connection to Other Specs

  • Medical advice remediation (PR #84, draft): Claude and GPT calls receive the same forbidden-phrase prompt block, which prompt_loader.py injects into the assembled prompt (see §4; the gateway itself does not re-inject it). PR #84's tests/test_no_medical_advice.py CI guard applies to GPT replies the same way it applies to Claude.
  • Chain-of-thought prompts (PR #76): the parser hardening (the balanced-bracket JSON extractor) is what makes GPT fallback feasible for the Clinical Context Agent. Without #76, GPT's slightly different output shape would break the JSON parser.
  • Conversation flow gates v2 (PR #70): gates_v2.intake_complete doesn't care which model produced the extracted medications. If fallback fires during a chat extractor call, the resulting metadata is identical (extra_metadata.medications = [...]) and the gate fires correctly.
  • Case record porting (PR #85, #86): irrelevant — porting doesn't call any LLM. The gateway has no interaction with this feature.
  • Spend dashboard (PR #78–#87): OpenAI fetcher already handles the fallback spend automatically. No dashboard work needed.

13. References

  • Companion feature spec: llm-fallback-gateway-feature.md
  • Existing failure paths:
      • app/agents/clinical_context.py:111, 180, 227
      • app/agents/llm_conversation.py:generate_response
      • app/services/chat_extractor.py
      • app/agents/match_agent.py:87, 144
  • CLAUDE.md drift to fix in implementation PR: §11.2.1 says "Fallback: GPT-4o mini" but no fallback exists today. This spec makes that line true.
  • Wild-goose-chase precedent: the Anthropic spending limit incident earlier in this session sequence — the explicit motivation for this PR.