LLM Fallback Gateway — Feature Spec

Status: Implemented in Session 39; this document now describes the implemented design plus follow-up work and known edge cases.

Companion steer: ai-steer/llm-fallback-gateway-steer.md

This is the implementation plan. Read the steer first — it covers the six trigger decisions, why a centralized gateway over per-agent fallbacks, and the cost/quality/voice tradeoffs.


Overview

A new app/services/llm_gateway.py wraps every LLM call in the agent stack. Agents stop constructing ChatAnthropic clients directly and instead call llm_gateway.invoke(...). The gateway tries Claude first and GPT-4o mini second, and raises LLMGatewayError if both fail. Fallback fires on 5xx, 429, auth/billing errors (401/402/403), and timeouts, but NOT on JSON parse failure (per the steer doc, malformed JSON is a prompt bug, not an availability problem).

Fallback is gated by a Flagsmith flag (llm_fallback_enabled, default true). The fallback model is controlled by a second flag (llm_fallback_provider, default gpt-4o-mini).


File-by-file change list

Backend — new

| File | Purpose |
| --- | --- |
| NEW app/services/llm_gateway.py | The gateway. ~250 lines: invoke(), _try_claude(), _try_provider(), _classify_failure(), _emit_fallback_event(), _dispatch_alert() stub, model registry, retry policy. |

Backend — modified (8 agent/service files)

| File | Change |
| --- | --- |
| app/agents/clinical_context.py | 4 LLM calls (extract + retry + ICD map + FHIR gen) refactored to llm_gateway.invoke(). Uses cached_system_message() for prompts. CoT parser hardening from PR #76 preserved. |
| app/agents/llm_conversation.py | generate_response and generate_response_streaming route through the gateway. Streaming path falls back to non-streaming GPT-4o mini on Claude failure. Uses cached_system_message() for system message. |
| app/services/chat_extractor.py | Single Haiku call refactored to gateway. Schema parser unchanged. Uses cached_system_message(). |
| app/agents/match_agent.py | Both analyze_clinical_picture and rerank_edge_cases route through the gateway. Dedup short-circuit (PR #76) preserved. Uses cached_system_message(). |
| app/agents/intake_agent.py | Legacy intake LLM call routed through the gateway. Uses cached_system_message(). |
| app/agents/orchestrator.py | Intent classification LLM call (CLASSIFY_INTENT_PROMPT) routed through gateway. Uses cached_system_message(). |
| app/agents/explanation_agent.py | Match explanation LLM call routed through gateway. Uses cached_system_message(). |
| app/services/message_classifier.py | Guardrail input classification LLM call routed through gateway. |

Note (Session 35 → 38 changes): All agent files now use cached_system_message() from prompt_loader.py instead of SystemMessage(content=...). The gateway must preserve this pattern — cached_system_message() enables Anthropic prompt caching (10% input cost on cache hits).

Optional: app/agents/lab_analyzer.py has a comorbidity_llm_shadow mode (feature-flagged, off by default) that calls Claude Sonnet for shadow comparison. Route through gateway if shadow mode is on. Low priority — only fires when flag is explicitly enabled.

Backend — minor edits

| File | Change |
| --- | --- |
| app/config.py | No startup validation tied to Flagsmith. openai_api_key already exists; fallback availability is checked at runtime inside llm_gateway.py because llm_fallback_enabled is a runtime flag, not a static settings value. |
| config/feature_flags.yaml | Add two flags: llm_fallback_enabled (default true), llm_fallback_provider (default "gpt-4o-mini"). |
| app/services/decision_recorder.py | New event type: llm.fallback_fired. |
| app/services/output_validator.py | No change; already provider-agnostic. |
| CLAUDE.md | Truth-up the §11.2.1 "Fallback: GPT-4o mini" line, which is now actually true. |

Tests

| File | Change |
| --- | --- |
| NEW tests/test_llm_gateway.py | Unit tests for the gateway. ~15 cases covering each trigger class + retry budget + flag-off behavior + alert dispatch + Langfuse tracing. Mocks both providers. |
| NEW tests/test_llm_gateway_fallback_fixtures.py | Fixture-based tests where the gateway is fed a prepared "Claude returns 503" / "Claude times out" / "Claude returns malformed JSON" sequence and the test asserts the right downstream behavior (fallback / no fallback / deterministic template). |
| tests/test_clinical_context_agent.py | Update mocks: instead of patching ChatAnthropic, patch llm_gateway.invoke. The 3 existing test cases should still pass with no semantic changes. |
| tests/test_session13_agents.py | Same: patch llm_gateway.invoke instead of ChatAnthropic. |
| tests/test_chat_extractor.py | Same. |
| tests/test_session14_mcp_orchestrator.py | Same. |
| tests/test_cot_clinical_prompts.py | The PR #76 tests already mock ChatAnthropic; update to mock llm_gateway.invoke. 17 tests, ~30 lines of fixture changes. |

PART A — app/services/llm_gateway.py

The new gateway module. Single entry point for every LLM call.

A1. Public surface

from typing import Any, Literal, TypedDict
from langchain_core.messages import BaseMessage

class LLMGatewayResult(TypedDict):
    content: str                      # the LLM's text response
    model_used: str                   # "claude-haiku-4-5" | "gpt-4o-mini" | ...
    provider: Literal["claude", "gpt"]
    fallback_fired: bool              # True if fallback was used
    primary_failure_reason: str | None
    latency_ms: int
    estimated_cost_usd: float

async def invoke(
    *,
    agent_name: str,                  # "clinical_context.map_codes" | "llm_conversation" | ...
    messages: list[BaseMessage],      # langchain-style messages
    max_tokens: int = 1024,
    temperature: float = 0,
    expects_json: bool = False,       # if True, JSON parse failure = degrade, NOT retry
    case_id: str | None = None,       # for events table
    patient_id: str | None = None,    # for events table
    tenant_id: str | None = None,     # for events table
    langfuse_handler: object | None = None,
) -> LLMGatewayResult:
    """The single entry point. Tries Claude first, GPT-4o mini second.
    Raises LLMGatewayError if both fail."""

A2. Trigger classification

_classify_failure(exception) returns one of:

  • "5xx": httpx.HTTPStatusError with status 500–599
  • "429": httpx.HTTPStatusError with status 429
  • "401": auth/billing failures (401, 403, 402)
  • "timeout": httpx.TimeoutException or asyncio timeout
  • "json_parse": JSON decoder error from a model that returned 200 OK
  • "unknown": fallthrough; treated like 5xx for retry purposes
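
The classification rules above can be sketched as follows. This is a minimal illustration under stated assumptions, not the gateway's actual code: it duck-types the status code (reading exc.status_code or exc.response.status_code) rather than importing httpx, so it runs on the standard library alone, and the function name classify_failure and the helper _has_base_named are hypothetical names introduced here.

```python
import asyncio
import json


def _has_base_named(exc: BaseException, name: str) -> bool:
    """Hypothetical helper: True if any class in the exception's MRO has this name."""
    return any(cls.__name__ == name for cls in type(exc).__mro__)


def classify_failure(exc: BaseException) -> str:
    """Map an exception to one of the trigger classes listed above."""
    # Timeouts: asyncio / builtin timeouts, plus anything derived from a class
    # called TimeoutException (assumed here to cover httpx.TimeoutException).
    if isinstance(exc, (asyncio.TimeoutError, TimeoutError)) or _has_base_named(
        exc, "TimeoutException"
    ):
        return "timeout"
    if isinstance(exc, json.JSONDecodeError):
        return "json_parse"

    # Assumption: HTTP errors expose the status either directly or via .response.
    status = getattr(exc, "status_code", None)
    if status is None:
        status = getattr(getattr(exc, "response", None), "status_code", None)

    if isinstance(status, int):
        if 500 <= status <= 599:
            return "5xx"
        if status == 429:
            return "429"
        if status in (401, 402, 403):
            return "401"
    return "unknown"
```

Anything that matches none of the rules, such as a 404, falls through to "unknown", which the spec treats like 5xx for retry purposes.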

A3. Retry logic

async def invoke(...) -> LLMGatewayResult:
    flag_on = is_feature_enabled("llm_fallback_enabled")
    fallback_provider = get_feature_value("llm_fallback_provider") or "gpt-4o-mini"

    primary_start = time.monotonic()
    try:
        result = await _try_claude(messages, max_tokens, temperature, ...)
        return _success_result(result, "claude", primary_start, fallback_fired=False)
    except Exception as primary_exc:
        reason = _classify_failure(primary_exc)

        # JSON parse failures: do NOT fall back. Surface the error so
        # the caller can degrade to deterministic template.
        if reason == "json_parse" and expects_json:
            await _emit_fallback_event(
                agent_name, "claude", reason, fallback_attempted=False, ...
            )
            raise LLMGatewayError(f"Claude returned malformed JSON: {primary_exc}") from primary_exc

        # Auth/billing failures: emit a config-error event so ops sees
        # it AND retry on the fallback (the wild-goose-chase fix).
        if reason == "401":
            await _emit_config_error_event(agent_name, primary_exc)

        # Flag off: re-raise without trying fallback.
        if not flag_on:
            raise LLMGatewayError(f"Claude failed ({reason}) and fallback disabled") from primary_exc

        # Try the fallback provider.
        fallback_start = time.monotonic()
        try:
            result = await _try_provider(
                fallback_provider, messages, max_tokens, temperature, ...
            )
            await _emit_fallback_event(
                agent_name, "claude", reason,
                fallback_provider=fallback_provider,
                fallback_success=True,
                fallback_latency_ms=int((time.monotonic() - fallback_start) * 1000),
                ...
            )
            return _success_result(result, "gpt", primary_start, fallback_fired=True, primary_failure_reason=reason)
        except Exception as fallback_exc:
            await _emit_fallback_event(
                agent_name, "claude", reason,
                fallback_provider=fallback_provider,
                fallback_success=False,
                fallback_error=str(fallback_exc),
                ...
            )
            await _dispatch_alert(
                "llm_total_failure",
                f"Both Claude and {fallback_provider} failed for agent {agent_name}",
            )
            raise LLMGatewayError(
                f"Both Claude and {fallback_provider} failed: "
                f"primary={primary_exc}, fallback={fallback_exc}"
            ) from fallback_exc

A4. Provider-specific clients

_try_claude builds a ChatAnthropic client with the model from get_model_for_task(agent_name) (the existing model registry). The fallback path builds a ChatOpenAI client with the model named by the llm_fallback_provider flag.

Both clients receive the same langfuse_handler for tracing continuity. Both calls share the same prompt — prompts arrive pre-assembled via prompt_loader.py with locale-specific examples and forbidden phrases already injected by _get_forbidden_phrases_block() (see prompt-abstraction-steer.md §5).

A5. GPT system prompt augmentation

Updated post-prompt-abstraction (Session 35):

The original design injected a _GPT_FORBIDDEN_PROLOGUE block with forbidden phrases into the GPT system prompt. This is no longer needed — the assembled prompt already contains the forbidden phrases via {forbidden_phrases_block} substitution in the conversation prompt. Injecting them again would duplicate the NEVER list.

Instead, the gateway adds a short GPT-specific preamble that tells GPT-4o mini to follow the existing voice rules in the prompt:

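from langchain_core.messages import BaseMessage, SystemMessage

_GPT_PREAMBLE = (
    "You are substituting for Claude on the Curaway platform. "
    "Follow all voice rules, formatting instructions, and safety "
    "guardrails already in this system prompt. Do not add preambles "
    "like 'I'd be happy to help' or 'Of course!'. Respond as if "
    "you are the Curaway clinical intake coordinator."
)

def _flatten_content(content: str | list) -> str:
    """Collapse content blocks into a plain string for GPT.

    cached_system_message() produces a list of blocks carrying
    Anthropic-specific cache_control markers; only the text is kept
    (see edge case 3 in Part E2).
    """
    if isinstance(content, str):
        return content
    return "".join(
        block if isinstance(block, str) else block.get("text", "")
        for block in content
    )

def _augment_for_gpt(messages: list[BaseMessage]) -> list[BaseMessage]:
    """Add GPT-specific preamble. Does NOT re-inject forbidden phrases
    (already in the assembled prompt from prompt_loader)."""
    if messages and isinstance(messages[0], SystemMessage):
        messages = list(messages)  # copy so the caller's list is untouched
        messages[0] = SystemMessage(
            content=_GPT_PREAMBLE + "\n\n" + _flatten_content(messages[0].content)
        )
    else:
        messages = [SystemMessage(content=_GPT_PREAMBLE)] + list(messages)
    return messages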

A6. Shared formatting rules (DEFERRED — follow-up issue)

Deferred from gateway MVP. The formatting rules in conversation_v2.yaml already exist in the assembled prompt. Both Claude and GPT receive the same prompt (including formatting rules) via prompt_loader.py. Extracting into a shared YAML fragment (config/prompts/shared/formatting.yaml) improves maintainability but is not required for the gateway to function.

Tracked as a follow-up — not blocking gateway implementation.

The forbidden-phrase enforcement remains a three-layer system:

  1. Prompt layer: {forbidden_phrases_block} in the system prompt (same for Claude and GPT, injected by _get_forbidden_phrases_block())
  2. Runtime validator: response_policy.py checks every response regardless of provider
  3. CI scanner: test_voice_compliance.py catches regressions in source files

A7. Cost estimation

_PRICE_TABLE = {
    "claude-haiku-4-5":  (1.00, 5.00),    # $/MTok in / out
    "claude-sonnet-4-5": (3.00, 15.00),
    "gpt-4o-mini":       (0.15, 0.60),
    "gpt-4o":            (2.50, 10.00),
}

def _estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    p_in, p_out = _PRICE_TABLE.get(model, (0, 0))
    return (input_tokens * p_in + output_tokens * p_out) / 1_000_000

Used for the Langfuse trace metadata and the events table payload. Same numbers as the spend report's Anthropic price table; it could be extracted into a shared constants module if it grows.

A8. Alert dispatch stub

async def _dispatch_alert(severity: str, message: str) -> None:
    """Stub for future PagerDuty/Slack/email integration.

    Today: logs a warning and does nothing else. The integration
    interface is in place so adding a real sender is one function
    body change — no caller-site changes required.
    """
    logger.warning("ALERT [%s]: %s", severity, message)
    # Future: route to PagerDuty / Slack / email based on severity
    # via app/services/alert_dispatcher.py (does not exist yet)

PART B — Per-agent refactoring

B1. Pattern

For each of the 8 agent/service files listed above, replace the existing ChatAnthropic construction + ainvoke call with a single llm_gateway.invoke call. Example for clinical_context.py:map_to_medical_codes:

Before (current — post-prompt-abstraction, Session 35):

try:
    llm = _get_llm()
    messages = [
        cached_system_message(prompt),     # ← prompt caching via prompt_loader
        HumanMessage(content=json.dumps(state["extracted_entities"])),
    ]
    response = await llm.ainvoke(messages)
    parsed = _parse_json_response(response.content)
    coded = parsed.get("coded_entities", []) if isinstance(parsed, dict) else parsed
    return {"coded_entities": coded}
except Exception as e:
    return {"coded_entities": [], "errors": [...]}

After (gateway):

try:
    from app.services.llm_gateway import invoke as llm_invoke, LLMGatewayError
    messages = [
        cached_system_message(prompt),     # ← preserved: prompt caching still active
        HumanMessage(content=json.dumps(state["extracted_entities"])),
    ]
    result = await llm_invoke(
        agent_name="clinical_context.map_codes",
        messages=messages,
        max_tokens=2048,
        expects_json=True,
        patient_id=state["patient_id"],
        tenant_id=state["tenant_id"],
        langfuse_handler=...,
    )
    parsed = _parse_json_response(result["content"])
    coded = parsed.get("coded_entities", []) if isinstance(parsed, dict) else parsed
    return {"coded_entities": coded}
except LLMGatewayError as e:
    # Both providers failed → fall back to deterministic template
    return {"coded_entities": [], "errors": [f"llm_unavailable: {e}"]}

Key: cached_system_message() is preserved in the gateway path. The gateway receives the pre-built messages list (with cache_control markers) and passes it to whichever provider it calls.

The deterministic template fallback is preserved — it just becomes the third layer of resilience instead of the first. The gateway handles Claude → GPT; the agent handles GPT-also-failed → template.

B2. Streaming consideration

llm_conversation.generate_response_streaming uses llm.astream() to push tokens to the Redis SSE channel. Streaming ownership remains in llm_conversation.py, not in llm_gateway.py. The current boundary is:

  • llm_conversation.py owns Claude streaming, SSE token publishing, and the decision to abandon streaming when the stream fails
  • llm_gateway.invoke() owns request-level fallback for the subsequent non-streaming retry path

In practice, if Claude streaming fails before or during the stream, llm_conversation.py catches that exception and then calls llm_gateway.invoke() to obtain a non-streaming GPT fallback response in one shot. The frontend's SSE consumer handles both cases (it already accepts either streamed tokens OR a single completion event with the full text).
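
The boundary described in B2 can be sketched as below. This is an illustration only: stream_with_fallback, stream_tokens, publish, fallback_invoke, and the event dictionaries are hypothetical stand-ins for the real streaming call, the Redis SSE publisher, and llm_gateway.invoke(). Note that edge case 1 in Part E2 takes a stricter position on streams that have already delivered tokens; this sketch follows the B2 wording only.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable


async def stream_with_fallback(
    stream_tokens: Callable[[], AsyncIterator[str]],
    publish: Callable[[dict], Awaitable[None]],
    fallback_invoke: Callable[[], Awaitable[dict]],
) -> str:
    """Stream tokens from the primary model; on failure, send one full completion."""
    try:
        parts = []
        async for token in stream_tokens():
            parts.append(token)
            await publish({"event": "token", "text": token})
        return "".join(parts)
    except Exception:
        # The caller owns the decision to abandon streaming. The non-streaming
        # retry is delegated to the gateway (represented by fallback_invoke).
        result = await fallback_invoke()
        await publish({"event": "completion", "text": result["content"]})
        return result["content"]
```

The consumer sees either a series of token events or a single completion event, which matches the two cases the frontend is said to accept.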

B3. Match Agent dedup interaction

The Match Agent's analyze_clinical_picture has a dedup short-circuit (PR #76) that uses an existing CoT-enriched summary if one is provided. The gateway is called only when the dedup misses. No interaction issues.

B4. Test mock updates

Existing tests patch ChatAnthropic directly. They need to patch llm_gateway.invoke instead. Pattern:

Before:

with patch("app.agents.clinical_context.ChatAnthropic") as MockLLM:
    mock_instance = AsyncMock()
    mock_instance.ainvoke = AsyncMock(return_value=fake_response)
    MockLLM.return_value = mock_instance

After:

with patch("app.services.llm_gateway.invoke") as mock_invoke:
    mock_invoke.return_value = {
        "content": fake_response.content,
        "model_used": "claude-haiku-4-5",
        "provider": "claude",
        "fallback_fired": False,
        "primary_failure_reason": None,
        "latency_ms": 100,
        "estimated_cost_usd": 0.001,
    }

Patch the gateway's exported invoke function, not a local function-scoped alias. In the current implementation many callers do from app.services.llm_gateway import invoke as llm_invoke inside the function body, so app.agents.*.llm_invoke is not a stable module-level symbol to patch.

Same coverage, marginally cleaner because we no longer mock the LLM client constructor + the ainvoke method separately.
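
The patching advice above can be demonstrated with a self-contained sketch. The module demo_gateway and the function map_codes are hypothetical stand-ins for app.services.llm_gateway and an agent function; the point is only that a function-scoped import resolves the name at call time, so patching the gateway module's attribute is what takes effect. It relies on patch() substituting an AsyncMock for async targets, which is the documented behavior since Python 3.8.

```python
import asyncio
import sys
import types
from unittest.mock import patch

# Hypothetical stand-in for app.services.llm_gateway.
demo_gateway = types.ModuleType("demo_gateway")


async def _real_invoke(**kwargs):
    raise RuntimeError("real provider call; must not run in unit tests")


demo_gateway.invoke = _real_invoke
sys.modules["demo_gateway"] = demo_gateway


async def map_codes() -> str:
    # Function-scoped import, as in the agent files: the name is looked up
    # on the gateway module each time the function runs.
    from demo_gateway import invoke as llm_invoke

    result = await llm_invoke(agent_name="demo", messages=["hi"])
    return result["content"]


def run_patched() -> str:
    # Patch the attribute on the gateway module itself.
    with patch("demo_gateway.invoke") as mock_invoke:
        mock_invoke.return_value = {"content": "ok"}
        return asyncio.run(map_codes())
```

Calling run_patched() goes through the mock instead of _real_invoke, because the import inside map_codes picks up whatever demo_gateway.invoke is bound to at that moment.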


PART C — Tests

C1. tests/test_llm_gateway.py (~15 unit tests)

# Provider success cases
def test_claude_succeeds_returns_claude_result()
def test_gpt_fallback_succeeds_when_claude_fails_with_5xx()
def test_gpt_fallback_succeeds_when_claude_fails_with_429()
def test_gpt_fallback_succeeds_when_claude_fails_with_401()
def test_gpt_fallback_succeeds_when_claude_times_out()

# Failure cases
def test_json_parse_failure_does_not_fall_back()
def test_both_providers_failing_raises_LLMGatewayError()
def test_alert_dispatched_when_both_fail()

# Flag behavior
def test_flag_off_re_raises_claude_failure()
def test_provider_flag_switches_fallback_target()

# Observability
def test_fallback_event_emitted_on_fallback_fired()
def test_config_error_event_emitted_on_401()
def test_langfuse_handler_passed_to_both_providers()

# Voice rules / GPT augmentation
def test_gpt_preamble_prepended_on_fallback()
def test_gpt_prompt_does_not_duplicate_forbidden_phrases()
def test_gpt_system_prompt_preserves_original_instructions()

C2. tests/test_llm_gateway_fallback_fixtures.py (5 e2e fixtures)

Each test uses a synthetic Claude failure sequence and asserts the end-to-end agent behavior. Shows what each agent does when Claude goes down.

async def test_clinical_context_map_codes_fallback_returns_valid_entities()
async def test_chat_extractor_fallback_extracts_medications()
async def test_llm_conversation_fallback_returns_coherent_reply()
async def test_match_agent_analyze_fallback_returns_valid_clinical_summary()
async def test_intake_agent_fallback_returns_canned_template_when_both_fail()

C3. Existing test fixture updates

5 existing test files mock ChatAnthropic directly. Each needs ~5 lines updated to mock llm_gateway.invoke instead. Total ~25 lines of test fixture changes. Functionally identical coverage.


PART D — CLAUDE.md drift fix

Update §11.2.1 (Clinical Context Agent) — change:

Model: Claude Haiku 4.5 (~$0.01 per report). Fallback: GPT-4o mini.

to:

Model: Claude Haiku 4.5 (~$0.01 per report). Fallback: GPT-4o mini via llm_gateway.invoke() (see ADR-0015 / docs/specs/llm-fallback-gateway-feature.md).

Also update §11.7 (Fallback Philosophy) — the table currently says:

| Clinical Context | Store raw text, queue for retry. Patient not blocked. |

Add a note:

Now wrapped by the LLM gateway (PR #X). Claude failure → GPT-4o mini via the gateway. Only if BOTH providers fail does the agent fall through to the legacy "store raw text + queue" behavior.

Same treatment for the other agent rows (Intake, Match, Explanation, Orchestrator) — each gets a one-line note that the gateway is the new primary failure boundary.


PART E — Rollout

  1. Implementation landed (Session 39) — gateway service, feature flags, and agent/service call-site migrations are present in the codebase.
  2. Verification window: after deployment changes, watch llm.fallback_fired events in the events table and confirm expected fallback rate remains <2% under normal conditions.
  3. Incident validation — when an Anthropic incident happens (rate limit, 5xx, billing cap), confirm the events table shows the fallback firing and the patient experience continues.
  4. Voice rule audit on GPT replies: sample GPT fallback responses and confirm they pass output_validator. If the violation rate exceeds 5%, iterate on the GPT system prompt augmentation.

  5. Spec maintenance — keep this document updated with any implementation deltas so it remains the design source of truth, rather than a pre-implementation plan.

PART E2 — Edge Cases

| # | Edge Case | Scenario | Handling | Severity |
| --- | --- | --- | --- | --- |
| 1 | Claude fails mid-stream | Streaming response partially sent (tokens pushed to SSE), then Claude errors | Gateway catches the exception. stream_end event sent to SSE with error: true. Frontend shows what was received + "Response interrupted — please try again." Do NOT attempt GPT fallback for partial streams (patient already saw partial content, switching models mid-sentence is worse). | Critical |
| 2 | Concurrent fallbacks under Claude outage | Document upload triggers 4 LLM calls (extract + ICD + FHIR + match). Claude is down → all 4 fall back to GPT simultaneously. GPT may rate-limit. | Gateway tracks active fallback count via a module-level counter. If >10 concurrent GPT calls, queue with asyncio.Semaphore(10). Log warning at >5. This prevents GPT rate-limit cascade. | High |
| 3 | cached_system_message content-block format with GPT | cached_system_message() returns SystemMessage(content=[{"type": "text", "text": "...", "cache_control": {...}}]). LangChain ChatOpenAI may not accept content-block format; it expects content: str. | _augment_for_gpt() MUST flatten content blocks: extract .text from each block, concatenate, return SystemMessage(content=flat_string). The cache_control key is Anthropic-specific and must be stripped for GPT. Add a test: test_gpt_handles_cached_system_message_format(). | Critical |
| 4 | Timeout budget varies per agent | Chat turns = 8s target. Clinical extraction = 15-20s (large reports). Match analysis on Sonnet = 10-15s. One timeout doesn't fit all. | Gateway accepts optional timeout_seconds parameter (default 8). Each agent passes its budget: clinical_context.extract → 20s, llm_conversation → 8s, match_agent.analyze → 15s. Timeout on Claude triggers fallback; timeout on GPT raises LLMGatewayError. | High |
| 5 | OpenAI API key missing/invalid | llm_fallback_enabled=true but OPENAI_API_KEY not set or expired. Every fallback attempt fails immediately. | Check at gateway module load: if llm_fallback_enabled and no OPENAI_API_KEY, log ERROR and set _fallback_available = False. When fallback fires with _fallback_available=False, skip GPT attempt, raise immediately with clear message: "Fallback enabled but OPENAI_API_KEY not configured." | High |
| 6 | Mixed-provider outputs in pipeline | Document pipeline: extract (Claude) → ICD map (GPT fallback) → FHIR gen (Claude). ICD mapping output from GPT may have slightly different JSON structure. | The JSON parser (_parse_json_response) already handles format variations (balanced-bracket extractor from PR #76). No special handling needed IF the parser is robust. Add a test: test_gpt_icd_output_parses_correctly() with a real GPT-style response. | Medium |
| 7 | Langfuse trace continuity | Claude call starts a Langfuse span, fails. GPT call needs to continue the same trace, not create a new one. | Pass the same langfuse_handler to both calls (already in spec). The handler tracks the parent trace. Add metadata={"fallback": True, "primary_failure": reason} to the GPT call's Langfuse metadata so the trace shows the fallback clearly. | Medium |
| 8 | Gateway import at startup | Gateway imports langchain_openai. If langchain-openai package is not installed, every agent that imports the gateway fails at startup, even if fallback is disabled. | Lazy import: from langchain_openai import ChatOpenAI inside _try_provider(), not at module top. If import fails, catch ImportError, log error, set _fallback_available = False. Gateway still works for Claude-only path. | High |
| 9 | Client leak under sustained outage | Each fallback creates a new ChatOpenAI client. Under sustained Claude outage (hours), thousands of GPT clients created. | Use a module-level singleton for the GPT client (same pattern as _llm_singleton in llm_conversation.py). Create once, reuse. Reset on config change (new API key). | Medium |
| 10 | Empty messages list | Agent passes messages=[] to gateway (bug in caller). | Validate: if len(messages) == 0, raise ValueError("messages list cannot be empty") immediately; don't send to any provider. | Low |
| 11 | Very large prompt (>100K tokens) | Clinical extraction with a 50-page medical report. Claude accepts it (200K context). GPT-4o mini has 128K context but may truncate. | Log a warning if estimated input tokens >100K. On fallback to GPT, if the prompt exceeds GPT's context window, the fallback will fail naturally (API error). Log the token count in the fallback event for debugging. | Medium |
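
The concurrency cap from edge case 2 could look roughly like the sketch below. It is an assumption-laden illustration rather than the gateway's code: the class name FallbackLimiter and its attributes are hypothetical, and the limits simply mirror the numbers in the table (queue above 10 concurrent calls, warn above 5).

```python
import asyncio
import logging
from typing import Awaitable, Callable, Optional, TypeVar

logger = logging.getLogger(__name__)
T = TypeVar("T")


class FallbackLimiter:
    """Caps concurrent fallback calls and warns as the in-flight count climbs."""

    def __init__(self, max_concurrent: int = 10, warn_above: int = 5) -> None:
        self._max_concurrent = max_concurrent
        self._warn_above = warn_above
        self._semaphore: Optional[asyncio.Semaphore] = None
        self.active = 0  # calls currently holding a slot
        self.peak = 0    # highest value of active seen so far

    async def run(self, call: Callable[[], Awaitable[T]]) -> T:
        # Create the semaphore lazily so it belongs to the running event loop.
        if self._semaphore is None:
            self._semaphore = asyncio.Semaphore(self._max_concurrent)
        # Callers beyond the cap wait here until a slot frees up.
        async with self._semaphore:
            self.active += 1
            self.peak = max(self.peak, self.active)
            if self.active > self._warn_above:
                logger.warning("%d concurrent fallback calls in flight", self.active)
            try:
                return await call()
            finally:
                self.active -= 1
```

A single module-level instance would wrap each fallback attempt, for example await limiter.run(lambda: some_fallback_call()), where some_fallback_call is whatever coroutine function performs the GPT request.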

PART F — Implementation checklist

Model guidance: Opus for architectural/compliance tasks, Sonnet for mechanical implementation following established patterns.

Opus (new service, security boundary, clinical safety)

  • [ ] app/services/llm_gateway.py created with invoke(), _try_claude(), _try_provider(), _classify_failure(), _emit_fallback_event(), _emit_config_error_event(), _dispatch_alert(), _augment_for_gpt(), _estimate_cost(), LLMGatewayError, LLMGatewayResult TypedDict — new service, failure classification logic, GPT prompt augmentation with medical advice guardrails
  • [ ] app/agents/llm_conversation.py: generate_response + generate_response_streaming refactored, streaming-to-non-streaming fallback handled (core conversation path, voice rules at stake)
  • [ ] app/agents/clinical_context.py — 4 LLM calls refactored — clinical extraction pipeline, data quality impact

Sonnet (refactoring to existing pattern, config, tests)

  • [ ] app/services/chat_extractor.py — refactored (follows gateway pattern)
  • [ ] app/agents/match_agent.py — both nodes refactored, dedup short-circuit preserved
  • [ ] app/agents/intake_agent.py — refactored
  • [ ] app/agents/orchestrator.py — intent classification refactored
  • [ ] app/agents/explanation_agent.py — match explanation refactored
  • [ ] app/services/message_classifier.py — guardrail classification refactored
  • [ ] app/agents/lab_analyzer.py — LLM shadow mode (optional, feature-flagged)
  • [ ] config/feature_flags.yaml: llm_fallback_enabled (default true), llm_fallback_provider (default "gpt-4o-mini")
  • [ ] app/services/decision_recorder.py — new event type
  • [ ] tests/test_llm_gateway.py — 15 unit tests
  • [ ] tests/test_llm_gateway_fallback_fixtures.py — 5 e2e fixtures
  • [ ] 8 existing test files updated to mock llm_gateway.invoke instead of ChatAnthropic
  • [ ] CLAUDE.md §11.2.1 + §11.7 updated to reflect the new fallback path
  • [ ] Manual smoke: temporarily set ANTHROPIC_API_KEY=invalid on a Railway preview env, send a chat turn, confirm fallback fires and the response is coherent
  • [ ] Verify llm.fallback_fired events appear in the events table with the right shape

References

  • Steer: ai-steer/llm-fallback-gateway-steer.md
  • CoT prompts (parser hardening that makes GPT fallback feasible): PR #76, docs/specs/chain-of-thought-prompts.md
  • Voice rules + medical advice CI guard: config/voice_rules.yaml, tests/test_voice_compliance.py, tests/test_no_medical_advice.py (PR #84 draft)
  • OpenAI spend dashboard fetcher (already wired to detect this spend): PR #87, app/services/spend_report_service.py:_fetch_openai