LLM Fallback Gateway — Feature Spec¶
Status: Implemented in Session 39. This document now describes the
implemented design, follow-up work, and known edge cases.
Companion steer: ai-steer/llm-fallback-gateway-steer.md
This is the implementation plan. Read the steer first — it covers the six trigger decisions, why a centralized gateway over per-agent fallbacks, and the cost/quality/voice tradeoffs.
Overview¶
A new app/services/llm_gateway.py wraps every LLM call in the agent
stack. Agents stop constructing ChatAnthropic clients directly and
instead call llm_gateway.invoke(...). The gateway tries Claude
first and GPT-4o mini second, and raises LLMGatewayError if both fail.
Fallback fires on 5xx, 429, auth/billing errors (401/402/403), and
timeout, but NOT on JSON parse failure (per the steer doc, malformed
JSON is a prompt bug, not an availability problem).
The gateway sits behind a Flagsmith flag (llm_fallback_enabled, default
true). The fallback model is controlled by a second flag
(llm_fallback_provider, default gpt-4o-mini).
File-by-file change list¶
Backend — new¶
| File | Purpose |
|---|---|
| NEW app/services/llm_gateway.py | The gateway. ~250 lines: invoke(), _try_claude(), _try_provider(), _classify_failure(), _emit_fallback_event(), _dispatch_alert() stub, model registry, retry policy. |
Backend — modified (8 agent/service files)¶
| File | Change |
|---|---|
| app/agents/clinical_context.py | 4 LLM calls (extract + retry + ICD map + FHIR gen) refactored to llm_gateway.invoke(). Uses cached_system_message() for prompts. CoT parser hardening from PR #76 preserved. |
| app/agents/llm_conversation.py | generate_response and generate_response_streaming route through the gateway. Streaming path falls back to non-streaming GPT-4o mini on Claude failure. Uses cached_system_message() for system message. |
| app/services/chat_extractor.py | Single Haiku call refactored to gateway. Schema parser unchanged. Uses cached_system_message(). |
| app/agents/match_agent.py | Both analyze_clinical_picture and rerank_edge_cases route through the gateway. Dedup short-circuit (PR #76) preserved. Uses cached_system_message(). |
| app/agents/intake_agent.py | Legacy intake LLM call routed through the gateway. Uses cached_system_message(). |
| app/agents/orchestrator.py | Intent classification LLM call (CLASSIFY_INTENT_PROMPT) routed through gateway. Uses cached_system_message(). |
| app/agents/explanation_agent.py | Match explanation LLM call routed through gateway. Uses cached_system_message(). |
| app/services/message_classifier.py | Guardrail input classification LLM call routed through gateway. |
Note (Session 35 → 38 changes): All agent files now use
cached_system_message() from prompt_loader.py instead of
SystemMessage(content=...). The gateway must preserve this
pattern: cached_system_message() enables Anthropic prompt
caching, where cache hits are billed at 10% of the normal input price.
Optional: app/agents/lab_analyzer.py has a comorbidity_llm_shadow
mode (feature-flagged, off by default) that calls Claude Sonnet
for shadow comparison. Route through gateway if shadow mode is on.
Low priority — only fires when flag is explicitly enabled.
Backend — minor edits¶
| File | Change |
|---|---|
| app/config.py | No startup validation tied to Flagsmith. openai_api_key already exists; fallback availability is checked at runtime inside llm_gateway.py because llm_fallback_enabled is a runtime flag, not a static settings value. |
| config/feature_flags.yaml | Add two flags: llm_fallback_enabled (default true), llm_fallback_provider (default "gpt-4o-mini"). |
| app/services/decision_recorder.py | New event type: llm.fallback_fired. |
| app/services/output_validator.py | No change; already provider-agnostic. |
| CLAUDE.md | Truth-up §11.2.1 "Fallback: GPT-4o mini" line — it's now actually true. |
Tests¶
| File | Change |
|---|---|
| NEW tests/test_llm_gateway.py | Unit tests for the gateway. ~15 cases covering each trigger class + retry budget + flag-off behavior + alert dispatch + Langfuse tracing. Mocks both providers. |
| NEW tests/test_llm_gateway_fallback_fixtures.py | Fixture-based tests where the gateway is fed a prepared "Claude returns 503" / "Claude times out" / "Claude returns malformed JSON" sequence and the test asserts the right downstream behavior (fallback / no fallback / deterministic template). |
| tests/test_clinical_context_agent.py | Update mocks: instead of patching ChatAnthropic, patch llm_gateway.invoke. The 3 existing test cases should still pass with no semantic changes. |
| tests/test_session13_agents.py | Same: patch llm_gateway.invoke instead of ChatAnthropic. |
| tests/test_chat_extractor.py | Same. |
| tests/test_session14_mcp_orchestrator.py | Same. |
| tests/test_cot_clinical_prompts.py | The PR #76 tests already mock ChatAnthropic; update to mock llm_gateway.invoke. 17 tests, ~30 lines of fixture changes. |
PART A — app/services/llm_gateway.py¶
The new gateway module. Single entry point for every LLM call.
A1. Public surface¶
from typing import Any, Literal, TypedDict

from langchain_core.messages import BaseMessage


class LLMGatewayResult(TypedDict):
    content: str  # the LLM's text response
    model_used: str  # "claude-haiku-4-5" | "gpt-4o-mini" | ...
    provider: Literal["claude", "gpt"]
    fallback_fired: bool  # True if fallback was used
    primary_failure_reason: str | None
    latency_ms: int
    estimated_cost_usd: float


async def invoke(
    *,
    agent_name: str,  # "clinical_context.map_codes" | "llm_conversation" | ...
    messages: list[BaseMessage],  # langchain-style messages
    max_tokens: int = 1024,
    temperature: float = 0,
    expects_json: bool = False,  # if True, JSON parse failure = degrade, NOT retry
    case_id: str | None = None,  # for events table
    patient_id: str | None = None,  # for events table
    tenant_id: str | None = None,  # for events table
    langfuse_handler: object | None = None,
) -> LLMGatewayResult:
    """The single entry point. Tries Claude first, GPT-4o mini second.

    Raises LLMGatewayError if both fail."""
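A result of this shape could be assembled by a helper along the lines of the `_success_result` call used in the retry logic. The sketch below is an assumption, not the implemented helper: `ProviderReply` and its token fields are hypothetical stand-ins for whatever the provider wrappers return, and the price table is a two-row subset of the one in the cost estimation section.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProviderReply:
    """Hypothetical stand-in for what a provider wrapper returns."""
    content: str
    model: str
    input_tokens: int
    output_tokens: int


# Subset of the spec's price table, in $/MTok (input, output).
PRICES = {"claude-haiku-4-5": (1.00, 5.00), "gpt-4o-mini": (0.15, 0.60)}


def success_result(
    reply: ProviderReply,
    provider: str,
    started_at: float,
    *,
    fallback_fired: bool,
    primary_failure_reason: Optional[str] = None,
) -> dict:
    """Build an LLMGatewayResult-shaped dict from a provider reply."""
    p_in, p_out = PRICES.get(reply.model, (0, 0))
    cost = (reply.input_tokens * p_in + reply.output_tokens * p_out) / 1_000_000
    return {
        "content": reply.content,
        "model_used": reply.model,
        "provider": provider,
        "fallback_fired": fallback_fired,
        "primary_failure_reason": primary_failure_reason,
        # Measured from the start of the primary attempt, so a fallback
        # result also counts the time spent on the failed Claude call.
        "latency_ms": int((time.monotonic() - started_at) * 1000),
        "estimated_cost_usd": cost,
    }
```

Because the retry logic passes `primary_start` on both the success and the fallback path, `latency_ms` on a fallback result reflects total wall-clock time as seen by the caller, not just the GPT call.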
A2. Trigger classification¶
_classify_failure(exception) returns one of:
- "5xx": httpx.HTTPStatusError with status 500–599
- "429": httpx.HTTPStatusError with status 429
- "401": auth/billing failures (401, 403, 402)
- "timeout": httpx.TimeoutException or asyncio timeout
- "json_parse": JSON decoder error from a model that returned 200 OK
- "unknown": fallthrough; treated like 5xx for retry purposes
A3. Retry logic¶
async def invoke(...) -> LLMGatewayResult:
    flag_on = is_feature_enabled("llm_fallback_enabled")
    fallback_provider = get_feature_value("llm_fallback_provider") or "gpt-4o-mini"
    primary_start = time.monotonic()
    try:
        result = await _try_claude(messages, max_tokens, temperature, ...)
        return _success_result(result, "claude", primary_start, fallback_fired=False)
    except Exception as primary_exc:
        reason = _classify_failure(primary_exc)

        # JSON parse failures: do NOT fall back. Surface the error so
        # the caller can degrade to deterministic template.
        if reason == "json_parse" and expects_json:
            await _emit_fallback_event(
                agent_name, "claude", reason, fallback_attempted=False, ...
            )
            raise LLMGatewayError(f"Claude returned malformed JSON: {primary_exc}") from primary_exc

        # Auth/billing failures: emit a config-error event so ops sees
        # it AND retry on the fallback (the wild-goose-chase fix).
        if reason == "401":
            await _emit_config_error_event(agent_name, primary_exc)

        # Flag off: re-raise without trying fallback.
        if not flag_on:
            raise LLMGatewayError(f"Claude failed ({reason}) and fallback disabled") from primary_exc

        # Try the fallback provider.
        fallback_start = time.monotonic()
        try:
            result = await _try_provider(
                fallback_provider, messages, max_tokens, temperature, ...
            )
            await _emit_fallback_event(
                agent_name, "claude", reason,
                fallback_provider=fallback_provider,
                fallback_success=True,
                fallback_latency_ms=int((time.monotonic() - fallback_start) * 1000),
                ...
            )
            return _success_result(result, "gpt", primary_start, fallback_fired=True, primary_failure_reason=reason)
        except Exception as fallback_exc:
            await _emit_fallback_event(
                agent_name, "claude", reason,
                fallback_provider=fallback_provider,
                fallback_success=False,
                fallback_error=str(fallback_exc),
                ...
            )
            await _dispatch_alert(
                "llm_total_failure",
                f"Both Claude and {fallback_provider} failed for agent {agent_name}",
            )
            raise LLMGatewayError(
                f"Both Claude and {fallback_provider} failed: "
                f"primary={primary_exc}, fallback={fallback_exc}"
            ) from fallback_exc
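The `_classify_failure` call above maps an exception to one of the A2 labels. A minimal sketch of that mapping follows, under stated assumptions: it duck-types on `exc.response.status_code` (the attribute `httpx.HTTPStatusError` exposes) and matches httpx timeouts by class name, so the example runs without httpx installed. The real helper may inspect exception types directly.

```python
import asyncio
import json


def classify_failure(exc: BaseException) -> str:
    """Map an exception to one of the A2 trigger labels."""
    # A decode error means the provider answered 200 OK with a bad body.
    if isinstance(exc, json.JSONDecodeError):
        return "json_parse"

    # asyncio.TimeoutError is an alias of TimeoutError on Python 3.11+.
    if isinstance(exc, (asyncio.TimeoutError, TimeoutError)):
        return "timeout"
    # Assumption: detect httpx.TimeoutException subclasses by class name
    # so this sketch has no third-party imports.
    if any(cls.__name__ == "TimeoutException" for cls in type(exc).__mro__):
        return "timeout"

    # httpx.HTTPStatusError carries the response on `exc.response`.
    status = getattr(getattr(exc, "response", None), "status_code", None)
    if isinstance(status, int):
        if status == 429:
            return "429"
        if status in (401, 402, 403):
            return "401"  # auth and billing share one label
        if 500 <= status <= 599:
            return "5xx"

    # Anything else falls through; A2 treats it like 5xx for retries.
    return "unknown"
```

Note that a 4xx outside the listed codes (for example 404) lands in "unknown", which A2 says is handled like a 5xx.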
A4. Provider-specific clients¶
_try_claude builds a ChatAnthropic client with the model from
get_model_for_task(agent_name) (the existing model registry). The
fallback path builds a ChatOpenAI client with the model named by the
llm_fallback_provider flag.
Both clients receive the same langfuse_handler for tracing
continuity. Both calls share the same prompt — prompts arrive
pre-assembled via prompt_loader.py with locale-specific examples
and forbidden phrases already injected by _get_forbidden_phrases_block()
(see prompt-abstraction-steer.md §5).
A5. GPT system prompt augmentation¶
Updated post-prompt-abstraction (Session 35):
The original design injected a _GPT_FORBIDDEN_PROLOGUE block with
forbidden phrases into the GPT system prompt. This is no longer
needed — the assembled prompt already contains the forbidden phrases
via {forbidden_phrases_block} substitution in the conversation
prompt. Injecting them again would duplicate the NEVER list.
Instead, the gateway adds a short GPT-specific preamble that tells GPT-4o mini to follow the existing voice rules in the prompt:
from langchain_core.messages import BaseMessage, SystemMessage

_GPT_PREAMBLE = (
    "You are substituting for Claude on the Curaway platform. "
    "Follow all voice rules, formatting instructions, and safety "
    "guardrails already in this system prompt. Do not add preambles "
    "like 'I'd be happy to help' or 'Of course!'. Respond as if "
    "you are the Curaway clinical intake coordinator."
)


def _augment_for_gpt(messages: list[BaseMessage]) -> list[BaseMessage]:
    """Add GPT-specific preamble. Does NOT re-inject forbidden phrases
    (already in the assembled prompt from prompt_loader)."""
    if messages and isinstance(messages[0], SystemMessage):
        # Copy the list so the caller's messages are not mutated.
        messages = list(messages)
        messages[0] = SystemMessage(
            content=_GPT_PREAMBLE + "\n\n" + messages[0].content
        )
    else:
        # No system message present: prepend one holding only the preamble.
        messages = [SystemMessage(content=_GPT_PREAMBLE)] + list(messages)
    return messages
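The string concatenation above assumes the system message content is a plain `str`. Edge case 3 in Part E2 notes that `cached_system_message()` returns a list of content blocks, which must be flattened before the preamble is prepended. The sketch below shows one way to do that on the raw content value; `flatten_content` and `augmented_system_text` are hypothetical names, the preamble is shortened, and joining blocks with a newline is an assumption (the spec only says "concatenate").

```python
from typing import Union

Content = Union[str, list]

# Shortened stand-in for the spec's _GPT_PREAMBLE.
GPT_PREAMBLE = "You are substituting for Claude on the Curaway platform."


def flatten_content(content: Content) -> str:
    """Collapse Anthropic-style content blocks into one plain string."""
    if isinstance(content, str):
        return content
    parts = []
    for block in content:
        if isinstance(block, str):
            parts.append(block)
        elif isinstance(block, dict) and block.get("type") == "text":
            # Keep only the text; cache_control and other
            # Anthropic-specific keys are dropped.
            parts.append(block.get("text", ""))
    return "\n".join(parts)


def augmented_system_text(content: Content) -> str:
    """Preamble plus flattened system prompt, safe for str-only clients."""
    return GPT_PREAMBLE + "\n\n" + flatten_content(content)
```

The result can be wrapped in `SystemMessage(content=...)` exactly as in `_augment_for_gpt`, so the fallback client never sees `cache_control` markers.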
A6. Shared formatting rules (DEFERRED — follow-up issue)¶
Deferred from gateway MVP. The formatting rules in
conversation_v2.yaml already exist in the assembled prompt. Both
Claude and GPT receive the same prompt (including formatting rules)
via prompt_loader.py. Extracting into a shared YAML fragment
(config/prompts/shared/formatting.yaml) improves maintainability
but is not required for the gateway to function.
Tracked as a follow-up — not blocking gateway implementation.
The forbidden-phrase enforcement remains a three-layer system:
1. Prompt layer — {forbidden_phrases_block} in the system prompt
(same for Claude and GPT, injected by _get_forbidden_phrases_block())
2. Runtime validator — response_policy.py checks every response
regardless of provider
3. CI scanner — test_voice_compliance.py catches regressions in
source files
A6. Cost estimation¶
_PRICE_TABLE = {
    "claude-haiku-4-5": (1.00, 5.00),  # $/MTok in / out
    "claude-sonnet-4-5": (3.00, 15.00),
    "gpt-4o-mini": (0.15, 0.60),
    "gpt-4o": (2.50, 10.00),
}


def _estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    p_in, p_out = _PRICE_TABLE.get(model, (0, 0))
    return (input_tokens * p_in + output_tokens * p_out) / 1_000_000
Used for the Langfuse trace metadata and the events table payload. Same numbers as the spend report's Anthropic price table — could be extracted into a shared constants module if it grows.
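As a worked example of the formula, the snippet below prices one hypothetical call of 2,000 input tokens and 500 output tokens on the primary and fallback models. The token counts are illustrative, not measurements from the platform; the prices are copied from the table above.

```python
PRICE_TABLE = {
    "claude-haiku-4-5": (1.00, 5.00),  # $/MTok in / out
    "gpt-4o-mini": (0.15, 0.60),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    p_in, p_out = PRICE_TABLE.get(model, (0, 0))
    return (input_tokens * p_in + output_tokens * p_out) / 1_000_000


# Haiku:  (2000 * 1.00 + 500 * 5.00) / 1e6 = 4500 / 1e6
# 4o mini: (2000 * 0.15 + 500 * 0.60) / 1e6 = 600 / 1e6
haiku = estimate_cost("claude-haiku-4-5", 2_000, 500)
mini = estimate_cost("gpt-4o-mini", 2_000, 500)
print(f"{haiku:.4f} {mini:.4f}")  # prints "0.0045 0.0006"
```

A model missing from the table is priced at zero rather than raising, so an unrecognized `llm_fallback_provider` value would show up as free calls in the trace metadata.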
A7. Alert dispatch stub¶
import logging

logger = logging.getLogger(__name__)


async def _dispatch_alert(severity: str, message: str) -> None:
    """Stub for future PagerDuty/Slack/email integration.

    Today: logs a warning and does nothing else. The integration
    interface is in place so adding a real sender is one function
    body change, with no caller-site changes required.
    """
    logger.warning("ALERT [%s]: %s", severity, message)
    # Future: route to PagerDuty / Slack / email based on severity
    # via app/services/alert_dispatcher.py (does not exist yet)
PART B — Per-agent refactoring¶
B1. Pattern¶
For each of the 8 agent/service files listed above, replace the existing
ChatAnthropic construction + ainvoke call with a single llm_gateway.invoke call.
Example for clinical_context.py:map_to_medical_codes:
Before (current — post-prompt-abstraction, Session 35):
try:
    llm = _get_llm()
    messages = [
        cached_system_message(prompt),  # ← prompt caching via prompt_loader
        HumanMessage(content=json.dumps(state["extracted_entities"])),
    ]
    response = await llm.ainvoke(messages)
    parsed = _parse_json_response(response.content)
    coded = parsed.get("coded_entities", []) if isinstance(parsed, dict) else parsed
    return {"coded_entities": coded}
except Exception as e:
    return {"coded_entities": [], "errors": [...]}
After (gateway):
try:
    from app.services.llm_gateway import invoke as llm_invoke, LLMGatewayError

    messages = [
        cached_system_message(prompt),  # ← preserved: prompt caching still active
        HumanMessage(content=json.dumps(state["extracted_entities"])),
    ]
    result = await llm_invoke(
        agent_name="clinical_context.map_codes",
        messages=messages,
        max_tokens=2048,
        expects_json=True,
        patient_id=state["patient_id"],
        tenant_id=state["tenant_id"],
        langfuse_handler=...,
    )
    parsed = _parse_json_response(result["content"])
    coded = parsed.get("coded_entities", []) if isinstance(parsed, dict) else parsed
    return {"coded_entities": coded}
except LLMGatewayError as e:
    # Both providers failed → fall back to deterministic template
    return {"coded_entities": [], "errors": [f"llm_unavailable: {e}"]}
Key: cached_system_message() is preserved in the gateway path.
The gateway receives the pre-built messages list (with cache_control
markers) and passes it to whichever provider it calls.
The deterministic template fallback is preserved — it just becomes the third layer of resilience instead of the first. The gateway handles Claude → GPT; the agent handles GPT-also-failed → template.
B2. Streaming consideration¶
llm_conversation.generate_response_streaming uses llm.astream() to
push tokens to the Redis SSE channel. Streaming ownership remains in
llm_conversation.py, not in llm_gateway.py. The current boundary is:
- llm_conversation.py owns Claude streaming, SSE token publishing, and the decision to abandon streaming when the stream fails
- llm_gateway.invoke() owns request-level fallback for the subsequent non-streaming retry path
In practice, if Claude streaming fails before or during the stream,
llm_conversation.py catches that exception and then calls
llm_gateway.invoke() to obtain a non-streaming GPT fallback response
in one shot. The frontend's SSE consumer handles both cases (it already
accepts either streamed tokens OR a single completion event with the
full text).
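The boundary described in this section can be sketched with injected callables. Everything named here is hypothetical: `stream`, `one_shot_fallback`, and `publish` stand in for the Claude stream, the `llm_gateway.invoke()` call, and the Redis SSE publisher, and the event dictionaries are illustrative shapes, not the platform's actual SSE schema. The sketch follows the description above as written and does not distinguish a failure before the first token from one mid-stream.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable


async def stream_with_fallback(
    stream: Callable[[], AsyncIterator[str]],
    one_shot_fallback: Callable[[], Awaitable[str]],
    publish: Callable[[dict], Awaitable[None]],
) -> None:
    """Publish streamed tokens; on failure, publish one completion event."""
    try:
        async for token in stream():
            await publish({"type": "token", "text": token})
        await publish({"type": "stream_end"})
    except Exception:
        # Streaming is abandoned here. The gateway-backed one-shot call
        # supplies the full reply as a single completion event.
        text = await one_shot_fallback()
        await publish({"type": "completion", "text": text})
```

Keeping the stream loop in the caller means the gateway only ever deals with request/response calls, which matches the ownership split listed above.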
B3. Match Agent dedup interaction¶
The Match Agent's analyze_clinical_picture has a dedup short-circuit
(PR #76) that uses an existing CoT-enriched summary if one is provided.
The gateway is called only when the dedup misses. No interaction issues.
B4. Test mock updates¶
Existing tests patch ChatAnthropic directly. They need to patch
llm_gateway.invoke instead. Pattern:
Before:
with patch("app.agents.clinical_context.ChatAnthropic") as MockLLM:
    mock_instance = AsyncMock()
    mock_instance.ainvoke = AsyncMock(return_value=fake_response)
    MockLLM.return_value = mock_instance
After:
with patch("app.services.llm_gateway.invoke") as mock_invoke:
    mock_invoke.return_value = {
        "content": fake_response.content,
        "model_used": "claude-haiku-4-5",
        "provider": "claude",
        "fallback_fired": False,
        "primary_failure_reason": None,
        "latency_ms": 100,
        "estimated_cost_usd": 0.001,
    }
Patch the gateway's exported invoke function, not a local
function-scoped alias. In the current implementation many callers do
from app.services.llm_gateway import invoke as llm_invoke inside
the function body, so app.agents.*.llm_invoke is not a stable
module-level symbol to patch.
Same coverage, marginally cleaner because we no longer mock the LLM
client constructor + the ainvoke method separately.
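The reason patching the exported function works with function-scoped imports can be shown in isolation. The example below is a self-contained sketch: `fake_llm_gateway` is a throwaway module registered in `sys.modules` to stand in for `app.services.llm_gateway`, and `map_codes` is a hypothetical caller.

```python
import asyncio
import sys
import types
from unittest.mock import AsyncMock, patch

# Hypothetical stand-in for app.services.llm_gateway.
gateway = types.ModuleType("fake_llm_gateway")


async def _real_invoke(**kwargs):
    raise RuntimeError("would call a real provider")


gateway.invoke = _real_invoke
sys.modules["fake_llm_gateway"] = gateway


async def map_codes() -> str:
    # Function-scoped import: the name is read from the module at call
    # time, so a patch on the module attribute is what the caller sees.
    from fake_llm_gateway import invoke as llm_invoke

    result = await llm_invoke(agent_name="clinical_context.map_codes", messages=[])
    return result["content"]


def run_patched() -> str:
    fake = AsyncMock(return_value={"content": "[]", "fallback_fired": False})
    with patch("fake_llm_gateway.invoke", new=fake):
        return asyncio.run(map_codes())
```

Had `map_codes` imported `invoke` at module top instead, the caller's module would hold its own reference and the patch target would need to be that module's attribute.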
PART C — Tests¶
C1. tests/test_llm_gateway.py (15 unit tests)¶
# Provider success cases
def test_claude_succeeds_returns_claude_result()
def test_gpt_fallback_succeeds_when_claude_fails_with_5xx()
def test_gpt_fallback_succeeds_when_claude_fails_with_429()
def test_gpt_fallback_succeeds_when_claude_fails_with_401()
def test_gpt_fallback_succeeds_when_claude_times_out()
# Failure cases
def test_json_parse_failure_does_not_fall_back()
def test_both_providers_failing_raises_LLMGatewayError()
def test_alert_dispatched_when_both_fail()
# Flag behavior
def test_flag_off_re_raises_claude_failure()
def test_provider_flag_switches_fallback_target()
# Observability
def test_fallback_event_emitted_on_fallback_fired()
def test_config_error_event_emitted_on_401()
def test_langfuse_handler_passed_to_both_providers()
# Voice rules / GPT augmentation
def test_gpt_preamble_prepended_on_fallback()
def test_gpt_prompt_does_not_duplicate_forbidden_phrases()
def test_gpt_system_prompt_preserves_original_instructions()
C2. tests/test_llm_gateway_fallback_fixtures.py (5 e2e fixtures)¶
Each test uses a synthetic Claude failure sequence and asserts the end-to-end agent behavior. Shows what each agent does when Claude goes down.
async def test_clinical_context_map_codes_fallback_returns_valid_entities()
async def test_chat_extractor_fallback_extracts_medications()
async def test_llm_conversation_fallback_returns_coherent_reply()
async def test_match_agent_analyze_fallback_returns_valid_clinical_summary()
async def test_intake_agent_fallback_returns_canned_template_when_both_fail()
C3. Existing test fixture updates¶
5 existing test files mock ChatAnthropic directly. Each needs ~5
lines updated to mock llm_gateway.invoke instead. Total ~25 lines
of test fixture changes. Functionally identical coverage.
PART D — CLAUDE.md drift fix¶
Update §11.2.1 (Clinical Context Agent) — change:
Model: Claude Haiku 4.5 (~$0.01 per report). Fallback: GPT-4o mini.
to:
Model: Claude Haiku 4.5 (~$0.01 per report). Fallback: GPT-4o mini via
llm_gateway.invoke() (see ADR-0015 / docs/specs/llm-fallback-gateway-feature.md).
Also update §11.7 (Fallback Philosophy) — the table currently says:
| Clinical Context | Store raw text, queue for retry. Patient not blocked. |
Add a note:
Now wrapped by the LLM gateway (PR #X). Claude failure → GPT-4o mini via the gateway. Only if BOTH providers fail does the agent fall through to the legacy "store raw text + queue" behavior.
Same treatment for the other agent rows (Intake, Match, Explanation, Orchestrator) — each gets a one-line note that the gateway is the new primary failure boundary.
PART E — Rollout¶
- Implementation landed (Session 39) — gateway service, feature flags, and agent/service call-site migrations are present in the codebase.
- Verification window: after deployment changes, watch llm.fallback_fired events in the events table and confirm expected fallback rate remains <2% under normal conditions.
- Incident validation: when an Anthropic incident happens (rate limit, 5xx, billing cap), confirm the events table shows the fallback firing and the patient experience continues.
- Voice rule audit on GPT replies: sample GPT fallback responses and confirm they pass output_validator. If the violation rate exceeds 5%, iterate the GPT system prompt augmentation.
- Spec maintenance — keep this document updated with any implementation deltas so it remains the design source of truth, rather than a pre-implementation plan.
PART E2 — Edge Cases¶
| # | Edge Case | Scenario | Handling | Severity |
|---|---|---|---|---|
| 1 | Claude fails mid-stream | Streaming response partially sent (tokens pushed to SSE), then Claude errors | Gateway catches the exception. stream_end event sent to SSE with error: true. Frontend shows what was received + "Response interrupted — please try again." Do NOT attempt GPT fallback for partial streams (patient already saw partial content, switching models mid-sentence is worse). | Critical |
| 2 | Concurrent fallbacks under Claude outage | Document upload triggers 4 LLM calls (extract + ICD + FHIR + match). Claude is down → all 4 fall back to GPT simultaneously. GPT may rate-limit. | Gateway tracks active fallback count via a module-level counter. If >10 concurrent GPT calls, queue with asyncio.Semaphore(10). Log warning at >5. This prevents GPT rate-limit cascade. | High |
| 3 | cached_system_message content-block format with GPT | cached_system_message() returns SystemMessage(content=[{"type": "text", "text": "...", "cache_control": {...}}]). LangChain ChatOpenAI may not accept content-block format; it expects content: str. | _augment_for_gpt() MUST flatten content blocks: extract .text from each block, concatenate, return SystemMessage(content=flat_string). The cache_control key is Anthropic-specific and must be stripped for GPT. Add a test: test_gpt_handles_cached_system_message_format(). | Critical |
| 4 | Timeout budget varies per agent | Chat turns = 8s target. Clinical extraction = 15-20s (large reports). Match analysis on Sonnet = 10-15s. One timeout doesn't fit all. | Gateway accepts optional timeout_seconds parameter (default 8). Each agent passes its budget: clinical_context.extract → 20s, llm_conversation → 8s, match_agent.analyze → 15s. Timeout on Claude triggers fallback; timeout on GPT raises LLMGatewayError. | High |
| 5 | OpenAI API key missing/invalid | llm_fallback_enabled=true but OPENAI_API_KEY not set or expired. Every fallback attempt fails immediately. | Check at gateway module load: if llm_fallback_enabled and no OPENAI_API_KEY, log ERROR and set _fallback_available = False. When fallback fires with _fallback_available=False, skip GPT attempt, raise immediately with clear message: "Fallback enabled but OPENAI_API_KEY not configured." | High |
| 6 | Mixed-provider outputs in pipeline | Document pipeline: extract (Claude) → ICD map (GPT fallback) → FHIR gen (Claude). ICD mapping output from GPT may have slightly different JSON structure. | The JSON parser (_parse_json_response) already handles format variations (balanced-bracket extractor from PR #76). No special handling needed IF the parser is robust. Add a test: test_gpt_icd_output_parses_correctly() with a real GPT-style response. | Medium |
| 7 | Langfuse trace continuity | Claude call starts a Langfuse span, fails. GPT call needs to continue the same trace, not create a new one. | Pass the same langfuse_handler to both calls (already in spec). The handler tracks the parent trace. Add metadata={"fallback": True, "primary_failure": reason} to the GPT call's Langfuse metadata so the trace shows the fallback clearly. | Medium |
| 8 | Gateway import at startup | Gateway imports langchain_openai. If langchain-openai package is not installed, every agent that imports the gateway fails at startup, even if fallback is disabled. | Lazy import: from langchain_openai import ChatOpenAI inside _try_provider(), not at module top. If import fails, catch ImportError, log error, set _fallback_available = False. Gateway still works for Claude-only path. | High |
| 9 | Client leak under sustained outage | Each fallback creates a new ChatOpenAI client. Under sustained Claude outage (hours), thousands of GPT clients created. | Use a module-level singleton for the GPT client (same pattern as _llm_singleton in llm_conversation.py). Create once, reuse. Reset on config change (new API key). | Medium |
| 10 | Empty messages list | Agent passes messages=[] to gateway (bug in caller). | Validate: if len(messages) == 0, raise ValueError("messages list cannot be empty") immediately, without sending to any provider. | Low |
| 11 | Very large prompt (>100K tokens) | Clinical extraction with a 50-page medical report. Claude accepts it (200K context). GPT-4o mini has 128K context but may truncate. | Log a warning if estimated input tokens >100K. On fallback to GPT, if the prompt exceeds GPT's context window, the fallback will fail naturally (API error). Log the token count in the fallback event for debugging. | Medium |
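Edge cases 2 and 4 combine naturally: a bounded number of concurrent fallback calls, each with its own timeout budget. The class below is a minimal sketch of that combination, not the gateway's implementation. `FallbackLimiter` is a hypothetical name; the limit of 10, the warning threshold of 5, and the 8-second default come from the table, while creating the semaphore lazily and excluding queue wait from the timeout are assumptions.

```python
import asyncio
import logging
from typing import Awaitable, Callable, Optional

logger = logging.getLogger(__name__)


class FallbackLimiter:
    """Caps concurrent fallback calls and applies a per-call timeout."""

    def __init__(self, limit: int = 10, warn_above: int = 5) -> None:
        self._limit = limit
        self._warn_above = warn_above
        self._sem: Optional[asyncio.Semaphore] = None
        self.active = 0

    async def run(
        self,
        call: Callable[[], Awaitable[str]],
        timeout_seconds: float = 8.0,
    ) -> str:
        # Create the semaphore lazily so it belongs to the running loop.
        if self._sem is None:
            self._sem = asyncio.Semaphore(self._limit)
        async with self._sem:  # callers beyond the limit wait here
            self.active += 1
            if self.active > self._warn_above:
                logger.warning("%d concurrent fallback calls", self.active)
            try:
                # The timeout covers the provider call only, not the
                # time spent queued behind the semaphore.
                return await asyncio.wait_for(call(), timeout=timeout_seconds)
            finally:
                self.active -= 1
```

A caller would pass its own budget, for example `await limiter.run(call_gpt, timeout_seconds=20)` for clinical extraction, and translate the resulting `asyncio.TimeoutError` into `LLMGatewayError` as the table describes.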
PART F — Implementation checklist¶
Model guidance: Opus for architectural/compliance tasks, Sonnet for mechanical implementation following established patterns.
Opus (new service, security boundary, clinical safety)¶
- [ ] app/services/llm_gateway.py created with invoke(), _try_claude(), _try_provider(), _classify_failure(), _emit_fallback_event(), _emit_config_error_event(), _dispatch_alert(), _augment_for_gpt(), _estimate_cost(), LLMGatewayError, LLMGatewayResult TypedDict (new service, failure classification logic, GPT prompt augmentation with medical advice guardrails)
- [ ] app/agents/llm_conversation.py: generate_response + generate_response_streaming refactored, streaming-to-non-streaming fallback handled (core conversation path, voice rules at stake)
- [ ] app/agents/clinical_context.py: 4 LLM calls refactored (clinical extraction pipeline, data quality impact)
Sonnet (refactoring to existing pattern, config, tests)¶
- [ ] app/services/chat_extractor.py: refactored (follows gateway pattern)
- [ ] app/agents/match_agent.py: both nodes refactored, dedup short-circuit preserved
- [ ] app/agents/intake_agent.py: refactored
- [ ] app/agents/orchestrator.py: intent classification refactored
- [ ] app/agents/explanation_agent.py: match explanation refactored
- [ ] app/services/message_classifier.py: guardrail classification refactored
- [ ] app/agents/lab_analyzer.py: LLM shadow mode (optional, feature-flagged)
- [ ] config/feature_flags.yaml: llm_fallback_enabled (default true), llm_fallback_provider (default "gpt-4o-mini")
- [ ] app/services/decision_recorder.py: new event type
- [ ] tests/test_llm_gateway.py: 15 unit tests
- [ ] tests/test_llm_gateway_fallback_fixtures.py: 5 e2e fixtures
- [ ] 8 existing test files updated to mock llm_gateway.invoke instead of ChatAnthropic
- [ ] CLAUDE.md §11.2.1 + §11.7 updated to reflect the new fallback path
- [ ] Manual smoke: temporarily set ANTHROPIC_API_KEY=invalid on a Railway preview env, send a chat turn, confirm fallback fires and the response is coherent
- [ ] Verify llm.fallback_fired events appear in the events table with the right shape
References¶
- Steer: ai-steer/llm-fallback-gateway-steer.md
- CoT prompts (parser hardening that makes GPT fallback feasible): PR #76, docs/specs/chain-of-thought-prompts.md
- Voice rules + medical advice CI guard: config/voice_rules.yaml, tests/test_voice_compliance.py, tests/test_no_medical_advice.py (PR #84 draft)
- OpenAI spend dashboard fetcher (already wired to detect this spend): PR #87, app/services/spend_report_service.py:_fetch_openai