LLM Routing
Overview
Curaway uses a tiered model selection strategy to balance cost, latency, and accuracy across its AI-powered features. Every LLM call is routed through a centralized model registry that supports A/B testing, fallback chains, and runtime configuration via feature flags.
The guiding principle: use the cheapest model that meets the quality threshold for each task.
Tiered Model Selection
Model Tiers
graph TD
Request[Incoming LLM Request] --> Router[Model Router]
Router --> Tier1{Task Complexity}
Tier1 -->|Simple: 80% of calls| Haiku[Claude Haiku 4.5]
Tier1 -->|Complex: 20% of calls| Sonnet[Claude Sonnet 4.6]
Tier1 -->|Classification: bulk| Mini[GPT-4o mini]
Haiku --> Langfuse[Langfuse Tracking]
Sonnet --> Langfuse
Mini --> Langfuse
style Router fill:#008B8B,color:#fff
style Haiku fill:#4A90D9,color:#fff
style Sonnet fill:#FF7F50,color:#fff
style Mini fill:#6B7280,color:#fff
| Tier | Model | Provider | Use Cases | % of Calls | Avg Cost/Call |
|---|---|---|---|---|---|
| Economy | Claude Haiku 4.5 | Anthropic | Conversations, intake, explanations, orchestration, reranking | ~80% | $0.003 |
| Premium | Claude Sonnet 4.6 | Anthropic | Clinical extraction, complex reasoning, Vision OCR fallback | ~20% | $0.015 |
| Bulk | GPT-4o mini | OpenAI | High-volume classification, intent detection | As needed | $0.001 |
Why Not a Single Model?
Using Claude Sonnet 4.6 for everything would cost ~5x more per patient journey. For conversational intake and template-based explanations, Haiku 4.5 provides comparable quality at a fraction of the cost. The premium tier is reserved for tasks where accuracy directly impacts patient safety.
Cost Per Patient Journey
A typical patient journey involves approximately 6 agent calls:
| Step | Agent | Model | Input Tokens | Output Tokens | Cost |
|---|---|---|---|---|---|
| 1. Document OCR fallback | Clinical Context | Sonnet 4.6 | ~3,000 | ~1,500 | $0.030 |
| 2. Entity extraction | Clinical Context | Sonnet 4.6 | ~2,000 | ~800 | $0.018 |
| 3. Code mapping | Clinical Context | Haiku 4.5 | ~500 | ~300 | $0.002 |
| 4. Intake conversation (3 turns) | Intake | Haiku 4.5 | ~1,500 | ~900 | $0.006 |
| 5. Match reranking | Match | Haiku 4.5 | ~1,000 | ~500 | $0.003 |
| 6. Explanation generation | Explanation | Haiku 4.5 | ~800 | ~600 | $0.003 |
| Total | | | ~8,800 | ~4,600 | $0.062 |
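The per-step figures in the table can be reproduced from token counts and per-1K-token prices. The following is a minimal sketch, assuming the Haiku 4.5 and Sonnet 4.6 prices listed in config/model_registry.yaml; the names `PRICES`, `JOURNEY`, `call_cost`, and `journey_cost` are illustrative and not part of the codebase.

```python
# Per-1K-token prices (input, output) in USD, copied from model_registry.yaml.
PRICES = {
    "claude-haiku-4.5": (0.001, 0.005),
    "claude-sonnet-4.6": (0.003, 0.015),
}

# (step, model, input_tokens, output_tokens) taken from the journey table.
JOURNEY = [
    ("ocr_fallback", "claude-sonnet-4.6", 3000, 1500),
    ("entity_extraction", "claude-sonnet-4.6", 2000, 800),
    ("code_mapping", "claude-haiku-4.5", 500, 300),
    ("intake_conversation", "claude-haiku-4.5", 1500, 900),
    ("match_reranking", "claude-haiku-4.5", 1000, 500),
    ("explanation", "claude-haiku-4.5", 800, 600),
]


def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one call, priced per 1,000 tokens."""
    price_in, price_out = PRICES[model]
    return input_tokens / 1000 * price_in + output_tokens / 1000 * price_out


def journey_cost(steps) -> float:
    """Sum the cost of every step in a journey."""
    return sum(call_cost(model, tin, tout) for _, model, tin, tout in steps)


if __name__ == "__main__":
    for name, model, tin, tout in JOURNEY:
        print(f"{name:<20} {model:<18} ${call_cost(model, tin, tout):.4f}")
    print(f"{'total':<20} {'':<18} ${journey_cost(JOURNEY):.4f}")
```

With these inputs the unrounded total comes out slightly above the table's $0.062, because the table rows are approximate.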
Cost range per patient journey: $0.07 - $0.50
- Lower end: Text-based PDFs (no Vision OCR), simple conditions
- Upper end: Scanned documents (Vision OCR), complex multi-condition cases
- Average: ~$0.15
Monthly Cost Projections
| Scale | Patients/Month | Avg Cost/Patient | Monthly LLM Cost |
|---|---|---|---|
| MVP | 50 | $0.15 | $7.50 |
| Early | 500 | $0.12 | $60 |
| Growth | 5,000 | $0.10 | $500 |
| Scale | 50,000 | $0.08 | $4,000 |
Cost Reduction at Scale
Per-patient cost decreases at scale due to: (1) prompt caching on repeated patterns, (2) more documents processed by PyMuPDF (fewer Vision OCR calls), and (3) model improvements reducing token counts.
Model Registry
Configuration File
All model routing is configured through config/model_registry.yaml:
# config/model_registry.yaml
models:
  claude-haiku-4.5:
    provider: anthropic
    model_id: claude-haiku-4-5-20250514
    max_tokens: 4096
    temperature: 0.3
    tier: economy
    cost_per_1k_input: 0.001
    cost_per_1k_output: 0.005
    rate_limit_rpm: 1000
    timeout_seconds: 30
  claude-sonnet-4.6:
    provider: anthropic
    model_id: claude-sonnet-4-6-20250514
    max_tokens: 8192
    temperature: 0.1
    tier: premium
    cost_per_1k_input: 0.003
    cost_per_1k_output: 0.015
    rate_limit_rpm: 500
    timeout_seconds: 60
  gpt-4o-mini:
    provider: openai
    model_id: gpt-4o-mini
    max_tokens: 4096
    temperature: 0.2
    tier: bulk
    cost_per_1k_input: 0.00015
    cost_per_1k_output: 0.0006
    rate_limit_rpm: 2000
    timeout_seconds: 15

# Task-to-model routing
routing:
  clinical_extraction:
    primary: claude-sonnet-4.6
    fallback: claude-haiku-4.5
    ab_split:
      claude-sonnet-4.6: 100
      claude-haiku-4.5: 0
  patient_conversation:
    primary: claude-haiku-4.5
    fallback: gpt-4o-mini
    ab_split:
      claude-haiku-4.5: 100
  intent_classification:
    primary: gpt-4o-mini
    fallback: claude-haiku-4.5
    ab_split:
      gpt-4o-mini: 80
      claude-haiku-4.5: 20
  match_explanation:
    primary: claude-haiku-4.5
    fallback: null  # Template fallback, no LLM needed
    ab_split:
      claude-haiku-4.5: 100
  document_reranking:
    primary: claude-haiku-4.5
    fallback: null  # Cosine similarity only
    ab_split:
      claude-haiku-4.5: 100
  vision_ocr:
    primary: claude-sonnet-4.6
    fallback: null  # Only used when other OCR methods fail
    ab_split:
      claude-sonnet-4.6: 100
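A registry like this is easy to break with a typo in a model name or a split that no longer adds up. The sketch below shows one possible startup check, assuming the YAML has already been parsed into a plain dict. `validate_registry` is a hypothetical helper, and the rule that `ab_split` weights sum to 100 is an assumption inferred from the examples in this file.

```python
def validate_registry(registry: dict) -> list[str]:
    """Return human-readable problems; an empty list means the registry is consistent."""
    problems = []
    models = set(registry.get("models", {}))

    for task, route in registry.get("routing", {}).items():
        primary = route.get("primary")
        if primary not in models:
            problems.append(f"{task}: unknown primary model {primary!r}")

        # fallback: null is allowed and means "no second model".
        fallback = route.get("fallback")
        if fallback is not None and fallback not in models:
            problems.append(f"{task}: unknown fallback model {fallback!r}")

        split = route.get("ab_split", {})
        for name in sorted(set(split) - models):
            problems.append(f"{task}: ab_split references unknown model {name!r}")

        # Assumption: weights are percentages and must total exactly 100.
        total = sum(split.values())
        if total != 100:
            problems.append(f"{task}: ab_split sums to {total}, expected 100")

    return problems


EXAMPLE_REGISTRY = {
    "models": {"claude-haiku-4.5": {}, "gpt-4o-mini": {}},
    "routing": {
        "intent_classification": {
            "primary": "gpt-4o-mini",
            "fallback": "claude-haiku-4.5",
            "ab_split": {"gpt-4o-mini": 80, "claude-haiku-4.5": 20},
        },
        "match_explanation": {
            "primary": "claude-haiku-4.5",
            "fallback": None,
            "ab_split": {"claude-haiku-4.5": 100},
        },
    },
}

if __name__ == "__main__":
    print(validate_registry(EXAMPLE_REGISTRY) or "registry OK")
```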
A/B Split Percentages
The ab_split configuration enables gradual model migration. For example, to test GPT-4o mini for intent classification:
intent_classification:
  primary: gpt-4o-mini
  ab_split:
    gpt-4o-mini: 80       # 80% of requests use GPT-4o mini
    claude-haiku-4.5: 20  # 20% use Haiku for comparison
Results are tracked in PostHog with the model_ab_test event, comparing accuracy, latency, and cost between the two groups.
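The split can be applied with a weighted random draw. This is a minimal sketch of that idea, not the project's actual `select_model_ab`: here the function takes the `ab_split` mapping directly (the real one receives the routing config), and the optional `rng` parameter is an assumption added to make the behaviour reproducible in tests.

```python
import random
from typing import Optional


def select_model_ab(ab_split: dict, rng: Optional[random.Random] = None) -> str:
    """Pick a model name with probability proportional to its ab_split weight."""
    source = rng if rng is not None else random
    models = list(ab_split)
    weights = [ab_split[name] for name in models]
    if sum(weights) <= 0:
        raise ValueError("ab_split needs at least one positive weight")
    # random.choices performs the weighted draw; zero-weight models are never picked.
    return source.choices(models, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(42)
    split = {"gpt-4o-mini": 80, "claude-haiku-4.5": 20}
    picks = [select_model_ab(split, rng) for _ in range(10_000)]
    print({name: picks.count(name) for name in split})
```

A per-request random draw means the same patient can land in different groups on different calls. If sticky assignment matters for the comparison, hashing a stable identifier into a bucket is a common alternative.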
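Fallback Chains
Routing entries can name a fallback model. If the primary model fails (timeout, rate limit, API error), the system automatically retries with the fallback. Entries with fallback: null have no second model (match_explanation uses a template and document_reranking uses cosine similarity instead), so a primary failure there raises LLMRoutingError: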
async def route_llm_call(
    task: str,
    messages: list[dict],
    tenant_id: str,
) -> LLMResponse:
    """Route an LLM call based on task type and registry config."""
    config = model_registry.get_routing(task)
    model = select_model_ab(config)
    try:
        response = await call_model(model, messages, config)
        await track_usage(task, model, response, tenant_id)
        return response
    except (TimeoutError, RateLimitError, APIError) as e:
        logger.warning(f"Primary model {model} failed for {task}: {e}")
        if config.fallback:
            # Retry once with the configured fallback model
            fallback_response = await call_model(config.fallback, messages, config)
            await track_usage(task, config.fallback, fallback_response, tenant_id, fallback=True)
            return fallback_response
        # No fallback configured (fallback: null): surface the original error
        raise LLMRoutingError(f"All models failed for task {task}") from e
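In the function above, an error raised by the fallback call itself propagates uncaught. One way to treat primary and fallback uniformly is to walk an ordered chain of models. The following is a self-contained sketch under that assumption: `call_with_fallback` and `flaky_call_model` are hypothetical stand-ins, and the broad `except Exception` would be narrowed to transient errors in a real router.

```python
import asyncio


class LLMRoutingError(Exception):
    """Raised when no model in the chain produced a response."""


async def call_with_fallback(call_model, chain, messages):
    """Try each model in order; return (model, response) from the first success."""
    errors = []
    for model in chain:
        if model is None:  # e.g. fallback: null in the registry
            continue
        try:
            return model, await call_model(model, messages)
        except Exception as exc:  # sketch only: catch transient errors in practice
            errors.append(f"{model}: {exc!r}")
    raise LLMRoutingError("all models failed: " + "; ".join(errors))


async def flaky_call_model(model, messages):
    """Stub provider call: the primary always times out, everything else answers."""
    if model == "claude-sonnet-4.6":
        raise TimeoutError("simulated timeout")
    return f"{model} answered {len(messages)} message(s)"


if __name__ == "__main__":
    chain = ["claude-sonnet-4.6", "claude-haiku-4.5"]
    msgs = [{"role": "user", "content": "hello"}]
    print(asyncio.run(call_with_fallback(flaky_call_model, chain, msgs)))
```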
Langfuse Prompt Management
Versioned Prompts
All system prompts are stored and versioned in Langfuse, not hardcoded in the application:
from typing import Optional

from langfuse import Langfuse

langfuse = Langfuse()

async def get_prompt(name: str, version: Optional[int] = None) -> str:
    """Fetch a versioned prompt from Langfuse."""
    prompt = langfuse.get_prompt(
        name=name,
        version=version,        # None = latest version
        cache_ttl_seconds=300,  # Cache for 5 minutes
    )
    return prompt.compile()
Prompt Inventory
| Prompt Name | Current Version | Model | Description |
|---|---|---|---|
| clinical_entity_extraction | v3 | Sonnet 4.6 | Extract conditions, procedures, labs from OCR text |
| medical_code_mapping | v2 | Haiku 4.5 | Map entities to ICD-10, CPT, LOINC codes |
| fhir_resource_generation | v2 | Haiku 4.5 | Generate FHIR R4 JSON from coded entities |
| intake_conversation | v4 | Haiku 4.5 | Conversational preference collection |
| intent_classification | v2 | GPT-4o mini | Classify patient message intent |
| match_explanation | v3 | Haiku 4.5 | Generate natural-language match explanations |
| document_reranker | v1 | Haiku 4.5 | Verify document-to-requirement matches |
| vision_ocr_extraction | v2 | Sonnet 4.6 | Extract text from scanned document images |
| comorbidity_summary | v1 | Haiku 4.5 | Summarize detected comorbidities for patient |
System Prompt Storage
System prompts follow a consistent structure:
# Example: clinical_entity_extraction prompt (v3)
"""
You are a medical document analysis specialist. Extract clinical entities
from the following medical document text.
## Rules
1. Extract ALL conditions, procedures, medications, lab results, and vitals
2. Include laterality (left/right/bilateral) when mentioned
3. Include dates when available
4. Include severity/stage when mentioned
5. Do NOT infer information that is not explicitly stated
6. Return results in the specified JSON schema
## Output Schema
{output_schema}
## Document Text
{document_text}
"""
Prompt Versioning Discipline
Never edit a production prompt in place. Always create a new version, test it against the evaluation dataset, and then promote it. Langfuse tracks which version was used for every generation, enabling precise debugging.
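The create, test, promote discipline amounts to an append-only version list plus a separate pointer to the production version. The sketch below models that with an in-memory class. It is not the Langfuse API: `PromptStore` and its methods are hypothetical and only illustrate why older versions stay retrievable after a promotion.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PromptStore:
    """Append-only prompt versions with an explicit production pointer."""

    versions: dict = field(default_factory=dict)    # name -> list of prompt texts
    production: dict = field(default_factory=dict)  # name -> promoted version number

    def create_version(self, name: str, text: str) -> int:
        """Append a new version; existing versions are never modified."""
        self.versions.setdefault(name, []).append(text)
        return len(self.versions[name])  # version numbers are 1-based

    def promote(self, name: str, version: int) -> None:
        """Point production at an existing version (after evaluation passes)."""
        if not 1 <= version <= len(self.versions.get(name, [])):
            raise ValueError(f"{name} has no version {version}")
        self.production[name] = version

    def get(self, name: str, version: Optional[int] = None) -> str:
        """Return a specific version, or the production version by default."""
        chosen = version if version is not None else self.production[name]
        return self.versions[name][chosen - 1]


if __name__ == "__main__":
    store = PromptStore()
    store.promote("intake_conversation", store.create_version("intake_conversation", "v1 text"))
    candidate = store.create_version("intake_conversation", "v2 text")
    print(store.get("intake_conversation"))             # still v1 until promoted
    store.promote("intake_conversation", candidate)
    print(store.get("intake_conversation"))             # now v2
    print(store.get("intake_conversation", version=1))  # v1 remains available
```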
Use Case Cost Analysis
1. Patient Conversation ($0.01/case)
| Detail | Value |
|---|---|
| Model | Claude Haiku 4.5 |
| Average turns | 3-5 per intake session |
| Avg input tokens/turn | 500 (includes conversation history) |
| Avg output tokens/turn | 300 |
| Cost per turn | ~$0.002 |
| Total per case | ~$0.006 - $0.010 |
The intake conversation is the most frequent LLM interaction but also the cheapest per call. Conversation history is managed with a sliding window to keep input tokens bounded.
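A sliding window over conversation history can be as simple as walking backwards from the newest message until a token budget is spent. This is a minimal sketch, assuming a crude estimate of about four characters per token; `estimate_tokens` and `sliding_window` are illustrative names, and the production code presumably uses a real tokenizer count.

```python
def estimate_tokens(message: dict) -> int:
    """Very rough token estimate: about 4 characters per token (an assumption)."""
    return max(1, len(message["content"]) // 4)


def sliding_window(history: list, max_tokens: int) -> list:
    """Keep the most recent messages whose combined estimate fits max_tokens."""
    kept, used = [], 0
    for message in reversed(history):  # newest first
        cost = estimate_tokens(message)
        if used + cost > max_tokens:
            break  # older messages are dropped
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


if __name__ == "__main__":
    history = [
        {"role": "user" if i % 2 == 0 else "assistant", "content": "x" * 40}
        for i in range(5)
    ]
    print(len(sliding_window(history, max_tokens=25)))  # two 10-token messages fit
```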
2. Clinical Context Extraction ($0.03/report)
| Detail | Value |
|---|---|
| Model | Claude Sonnet 4.6 |
| Average pages | 2-5 per document |
| Avg input tokens | 2,000-3,000 (OCR text + prompt) |
| Avg output tokens | 800-1,200 (structured JSON) |
| Cost per document | ~$0.015 - $0.030 |
This is the most expensive per-call task because it requires Sonnet-level accuracy for clinical data extraction. The cost is justified: incorrect extraction could lead to wrong provider matches.
3. Comorbidity Detection ($0/case)
| Detail | Value |
|---|---|
| Model | None (rule-based) |
| Method | Lookup table of 150+ comorbidity pairs |
| Input | Extracted condition codes |
| Output | Flagged comorbidity pairs with severity |
| Cost | $0 |
COMORBIDITY_PAIRS = {
    ("E11", "I10"): {"name": "Diabetes + Hypertension", "severity": "moderate"},
    ("E66", "G47.3"): {"name": "Obesity + Sleep Apnea", "severity": "moderate"},
    ("E11", "N18"): {"name": "Diabetes + CKD", "severity": "high"},
    ("I25", "E78"): {"name": "CAD + Hyperlipidemia", "severity": "moderate"},
    # ... 146 more pairs
}

# Index by sorted 3-character prefixes so lookups ignore code order and
# subcategory, e.g. ("I25", "E78") -> ("E78", "I25") and "G47.3" -> "G47".
# Without this, unsorted or 4+ character keys above would never match.
_PAIR_INDEX = {
    tuple(sorted(code[:3] for code in pair)): info
    for pair, info in COMORBIDITY_PAIRS.items()
}

def detect_comorbidities(condition_codes: list[str]) -> list[dict]:
    """Detect comorbidity pairs from extracted condition codes."""
    detected = []
    for i, code1 in enumerate(condition_codes):
        for code2 in condition_codes[i + 1:]:
            key = tuple(sorted([code1[:3], code2[:3]]))  # Match on 3-char prefix
            if key in _PAIR_INDEX:
                detected.append(_PAIR_INDEX[key])
    return detected
Why Rule-Based?
Comorbidity detection doesn't need LLM reasoning -- it's a well-defined medical knowledge lookup. Using rules instead of an LLM call saves ~$0.005 per case and eliminates hallucination risk for this critical safety check.
4. Match Explanations ($0.005/case)
| Detail | Value |
|---|---|
| Model | Claude Haiku 4.5 |
| Explanations per match | 3-5 (top providers) |
| Avg input tokens | 800 (match data + prompt) |
| Avg output tokens | 600 (structured explanation) |
| Cost per explanation | ~$0.001 |
| Total per case | ~$0.003 - $0.005 |
Match explanations are generated lazily -- only for the top-ranked providers that are actually shown to the patient. If a patient doesn't scroll past the top 3, explanations for providers 4-5 are never generated.
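Lazy generation can be expressed as a small memoizing wrapper: nothing is generated until a provider card is actually requested, and each explanation is generated at most once. The sketch below assumes a synchronous `generate` callable for brevity; `LazyExplanations` is a hypothetical name, not the service's real class.

```python
class LazyExplanations:
    """Generate an explanation only when it is first requested, then reuse it."""

    def __init__(self, generate):
        self._generate = generate  # callable: provider_id -> explanation text
        self._cache = {}

    def get(self, provider_id: str) -> str:
        if provider_id not in self._cache:
            # First view of this provider: pay for exactly one LLM call.
            self._cache[provider_id] = self._generate(provider_id)
        return self._cache[provider_id]

    @property
    def generated_count(self) -> int:
        """How many explanations have actually been generated so far."""
        return len(self._cache)


if __name__ == "__main__":
    lazy = LazyExplanations(lambda pid: f"Why {pid} is a good match")
    for pid in ["p1", "p2", "p3"]:  # the patient only views the top 3 of 5
        lazy.get(pid)
    print(lazy.generated_count)
```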
Latency Optimizations (Session 31)
Response Streaming
generate_response_streaming() in the chat service uses llm.astream() to push individual tokens to a Redis list as they arrive from the LLM. The frontend consumes them via an SSE endpoint.
Architecture:
SSE endpoint: GET /cases/{id}/chat/stream
- Polls the case:{case_id} Redis list at 0.1s intervals
- Endpoint timeout: 60s (single response window, not long-lived)
- Event types:
  - token: individual LLM output token
  - stream_end: response complete, close connection
  - error / timeout: terminal events
Fallback: When enable_response_streaming flag is off, the chat service falls back to llm.ainvoke() and returns the full response in a single HTTP response. This is the safe default for tenants where streaming is not yet tested.
Performance: Sub-second perceived time-to-first-token (TTFT). The patient sees text appearing within ~200-400ms of the LLM starting generation, compared to waiting 3-8s for the full response.
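The polling loop behind the SSE endpoint can be sketched as an async generator. This is an illustration only: an in-memory `deque` stands in for the Redis list, the `(event, payload)` tuple shape is an assumption, and `sse_event` and `stream_tokens` are hypothetical names. The 0.1s poll interval and 60s timeout mirror the values described above.

```python
import asyncio
import json
import time
from collections import deque


def sse_event(event: str, data: str) -> str:
    """Format one Server-Sent Events frame (event name plus JSON-encoded data)."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


async def stream_tokens(queue: deque, poll_interval: float = 0.1, timeout: float = 60.0):
    """Yield SSE frames until a terminal event arrives or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        while queue:  # drain everything that arrived since the last poll
            kind, payload = queue.popleft()
            yield sse_event(kind, payload)
            if kind in ("stream_end", "error"):
                return  # terminal event: close the stream
        await asyncio.sleep(poll_interval)
    yield sse_event("timeout", "")


async def collect(queue: deque, **kwargs) -> list:
    """Helper for the demo: gather every frame into a list."""
    return [frame async for frame in stream_tokens(queue, **kwargs)]


if __name__ == "__main__":
    q = deque([("token", "Hel"), ("token", "lo"), ("stream_end", "")])
    for frame in asyncio.run(collect(q, poll_interval=0.01)):
        print(frame, end="")
```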
Chat Pipeline Caching
The chat pipeline caches two expensive data fetches using a cache-aside pattern in Redis.
| Cache Layer | Redis Key | TTL | What It Caches |
|---|---|---|---|
| Patient state | chat:state:{tenant_id}:{case_id} | 60s | EHR snapshot, case metadata, document status |
| Conversation context | chat:conv:{tenant_id}:{case_id} | 120s | Last N messages for LLM context window |
Invalidation triggers (cache is deleted on any of these events):
- New chat message sent or received
- Document uploaded or re-processed
- FHIR resource written or updated
- EHR snapshot rebuilt
Flag: enable_chat_cache (Flagsmith). When disabled, every chat turn fetches fresh from the database.
Non-fatal: If Redis is down or returns an error, the cache miss is swallowed and the service falls back to a fresh DB fetch. Cache failures never block the chat pipeline.
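The cache-aside pattern with non-fatal failures looks roughly like the sketch below. The key format and 60s TTL come from the table above; everything else is an assumption for illustration: `InMemoryCache` stands in for Redis, `BrokenCache` simulates an outage, and `get_patient_state` and `load_from_db` are hypothetical names.

```python
import json
import time


class InMemoryCache:
    """Tiny stand-in for Redis GET / SET-with-TTL / DEL."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._data[key]  # expired entries behave like misses
            return None
        return value

    def set(self, key, value, ttl):
        self._data[key] = (value, time.monotonic() + ttl)

    def delete(self, key):
        self._data.pop(key, None)  # called by the invalidation triggers


class BrokenCache:
    """Simulates Redis being unreachable."""

    def get(self, key):
        raise ConnectionError("cache unavailable")

    def set(self, key, value, ttl):
        raise ConnectionError("cache unavailable")


def get_patient_state(cache, tenant_id, case_id, load_from_db, ttl=60):
    """Cache-aside read: try the cache, fall back to the DB, then repopulate."""
    key = f"chat:state:{tenant_id}:{case_id}"
    try:
        cached = cache.get(key)
        if cached is not None:
            return json.loads(cached)
    except Exception:
        pass  # cache errors are swallowed; treat as a miss
    state = load_from_db(tenant_id, case_id)
    try:
        cache.set(key, json.dumps(state), ttl)
    except Exception:
        pass  # failing to write the cache must not block the chat turn
    return state
```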
Prompt Compression
System prompts were rewritten from v1_original to v2_compressed variants, achieving 36-73% token reduction across all agent prompts.
Method: Removed redundant few-shot examples, condensed multi-paragraph instructions into bullet lists, eliminated repeated preamble across prompt sections.
Flag: prompt_version (Flagsmith) — values: v1_original | v2_compressed. Allows instant rollback if compressed prompts degrade output quality.
Parallel + Deferred Pipeline
The chat pipeline splits work into three timing tiers:
Parallel (before response): Input classifier and patient context fetch run concurrently via asyncio.gather(). This removes a serial ~500ms context-fetch wait from every chat turn.
- Flag: enable_parallel_pipeline
Synchronous (response generation): The orchestrator and LLM conversation call run sequentially — these produce the user-visible response.
Deferred (after response starts streaming): Chat extractor (entity extraction from the conversation) and requirement matcher run as background tasks via asyncio.create_task(). These enrich the EHR and update document matching but do not block the user-visible response.
- Flag: enable_deferred_extraction
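The three timing tiers map directly onto asyncio primitives. The sketch below uses stub coroutines with short sleeps in place of the real classifier, context fetch, LLM call, and extractor; all function names are hypothetical. It keeps a reference to the background task, which the asyncio documentation recommends so the task is not garbage-collected mid-execution.

```python
import asyncio


async def classify_input(message: str) -> str:
    await asyncio.sleep(0.05)  # stands in for the input classifier
    return "question"


async def fetch_context(case_id: str) -> dict:
    await asyncio.sleep(0.05)  # stands in for the patient context fetch
    return {"case_id": case_id}


async def generate_response(message: str, intent: str, context: dict) -> str:
    return f"[{intent}] reply to {message!r}"  # stands in for the LLM call


async def extract_entities(message: str, sink: list) -> None:
    await asyncio.sleep(0.05)  # stands in for the chat extractor
    sink.append(("entities", message))


async def handle_turn(case_id: str, message: str, sink: list):
    # Parallel tier: both fetches run concurrently instead of back to back.
    intent, context = await asyncio.gather(
        classify_input(message), fetch_context(case_id)
    )
    # Synchronous tier: produces the user-visible response.
    response = await generate_response(message, intent, context)
    # Deferred tier: scheduled in the background; the caller keeps the task handle.
    task = asyncio.create_task(extract_entities(message, sink))
    return response, task


async def demo():
    sink = []
    response, task = await handle_turn("c1", "hello", sink)
    pending_at_response = list(sink)  # extractor has not run yet
    await task
    return response, pending_at_response, sink


if __name__ == "__main__":
    print(asyncio.run(demo()))
```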
Connection Pooling
LLM clients: ChatAnthropic and ChatOpenAI are now instantiated once at module load (singletons) rather than per-call. Creating a new client per call added ~50-200ms of TCP/TLS connection overhead. The singleton is reset between tests via importlib.reload.
SQLAlchemy: Database connection pool configured with pool_recycle=1800 (recycle connections after 30 minutes) and pool_timeout=30 (wait up to 30s for a connection from the pool).
Overall Latency Impact
| Metric | Before | After | Delta |
|---|---|---|---|
| Average chat response | 13.2s | 7.9s | -41% |
Contributors: LLM client singletons, parallel classifier+context, deferred extraction, prompt compression, chat pipeline caching, response streaming (perceived).
Observability and Cost Tracking
Langfuse Integration
Every LLM call is traced in Langfuse through the observe decorator:
@langfuse.observe(name="clinical_extraction")
async def extract_clinical_entities(text: str, tenant_id: str):
    """Extract clinical entities with full Langfuse tracing."""
    prompt = await get_prompt("clinical_entity_extraction")
    model_config = model_registry.get_routing("clinical_extraction")
    response = await call_model(
        model=model_config.primary,
        messages=[
            {"role": "system", "content": prompt},
            {"role": "user", "content": text},
        ],
    )
    # Langfuse automatically tracks: model, tokens, cost, latency, I/O
    return parse_extraction_response(response)
Cost Dashboard Metrics
| Metric | Tracked In | Alert Threshold |
|---|---|---|
| Daily LLM spend | Langfuse | > $5/day (MVP) |
| Cost per patient journey | Langfuse + PostHog | > $0.50 |
| Fallback rate | Events table | > 10% of calls |
| Average latency (Haiku) | Langfuse | > 3 seconds |
| Average latency (Sonnet) | Langfuse | > 10 seconds |
| Token efficiency | Langfuse | > 5,000 input tokens/call |
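The alert thresholds in the table can be checked with a plain comparison against a metrics snapshot. This is a sketch only: the metric key names and the `breached` helper are assumptions, and the real alerts are presumably configured in Langfuse, PostHog, or the monitoring dashboard rather than in application code.

```python
# Thresholds from the table above; an alert fires when a metric exceeds its limit.
THRESHOLDS = {
    "daily_spend_usd": 5.00,         # MVP budget
    "cost_per_journey_usd": 0.50,
    "fallback_rate": 0.10,           # fraction of calls served by a fallback model
    "avg_latency_haiku_s": 3.0,
    "avg_latency_sonnet_s": 10.0,
    "input_tokens_per_call": 5000,
}


def breached(metrics: dict) -> list:
    """Return the names of metrics above their threshold, sorted alphabetically."""
    return sorted(
        name
        for name, limit in THRESHOLDS.items()
        if metrics.get(name, 0) > limit  # missing metrics never alert
    )


if __name__ == "__main__":
    snapshot = {
        "daily_spend_usd": 6.20,
        "cost_per_journey_usd": 0.15,
        "fallback_rate": 0.12,
        "avg_latency_haiku_s": 1.4,
    }
    print(breached(snapshot))
```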
Monthly Cost Breakdown (MVP)
+---------------------------+--------+--------+---------+
| Category | Calls | Avg $ | Total $ |
+---------------------------+--------+--------+---------+
| Clinical Extraction | 100 | $0.025 | $2.50 |
| Patient Conversations | 300 | $0.003 | $0.90 |
| Match Explanations | 250 | $0.001 | $0.25 |
| Document Reranking | 50 | $0.005 | $0.25 |
| Vision OCR (fallback) | 10 | $0.040 | $0.40 |
| Intent Classification | 300 | $0.001 | $0.30 |
+---------------------------+--------+--------+---------+
| TOTAL | 1,010 | | $4.60 |
+---------------------------+--------+--------+---------+
Post-Seed Plans
MedGemma 4B Evaluation
After seed funding, the team plans to evaluate Google's MedGemma 4B model as a cost-reduction option for clinical tasks:
| Consideration | Details |
|---|---|
| Model | MedGemma 4B (open-weight, medical-specialized) |
| Deployment | Self-hosted on GPU instance |
| Target tasks | Medical code mapping, comorbidity enhancement |
| Expected savings | 60-80% cost reduction for targeted tasks |
| Risk | Lower general reasoning vs. Claude, requires medical validation |
| Evaluation plan | Shadow mode on 500 cases, compare extraction accuracy |
BioMistral Shadow Mode
BioMistral is another candidate for self-hosted medical NLP:
| Consideration | Details |
|---|---|
| Model | BioMistral 7B |
| Deployment | Self-hosted on GPU instance |
| Target tasks | Entity extraction from structured lab reports |
| Expected savings | 70-90% cost reduction for lab report parsing |
| Risk | Narrower training data, may miss edge cases |
| Evaluation plan | Shadow mode alongside Claude Sonnet, compare F1 scores |
graph TD
A[Current: Cloud LLMs Only] --> B[Phase 1: Shadow Testing]
B --> C{Accuracy >= 90%?}
C -->|Yes| D[Phase 2: A/B Split 20%]
D --> E{Cost Savings Confirmed?}
E -->|Yes| F[Phase 3: Promote to Primary]
C -->|No| G[Continue Cloud LLMs]
E -->|No| G
style A fill:#008B8B,color:#fff
style D fill:#FF7F50,color:#fff
style F fill:#4A90D9,color:#fff
Medical AI Model Validation
Any model used for clinical tasks must pass a validation suite of 200+ annotated medical documents before being promoted from shadow mode. Patient safety is non-negotiable -- cost savings never justify reduced clinical accuracy.
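The promotion flow in the diagram, combined with the validation rule, can be written as a small decision function. This is a sketch under stated assumptions: the phase names, the `next_phase` helper, and the choice to keep a model in shadow mode until at least 200 annotated documents have been evaluated are illustrative readings of the text, not the team's documented process.

```python
MIN_VALIDATED_DOCS = 200   # validation suite size named above
MIN_ACCURACY = 0.90        # accuracy gate from the diagram


def next_phase(phase: str, accuracy: float = 0.0, validated_docs: int = 0,
               cost_savings_confirmed: bool = False) -> str:
    """Return the next rollout phase for a candidate self-hosted model."""
    if phase == "shadow":
        if validated_docs < MIN_VALIDATED_DOCS:
            return "shadow"  # assumption: keep collecting evidence
        return "ab_split" if accuracy >= MIN_ACCURACY else "cloud_only"
    if phase == "ab_split":
        return "primary" if cost_savings_confirmed else "cloud_only"
    return phase  # "primary" and "cloud_only" are terminal in this sketch


if __name__ == "__main__":
    print(next_phase("shadow", accuracy=0.93, validated_docs=250))
    print(next_phase("ab_split", cost_savings_confirmed=True))
```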
Configuration Checklist
When adding a new LLM-powered feature:
- Add the model to config/model_registry.yaml with routing and fallback
- Create a versioned prompt in Langfuse
- Implement the call using route_llm_call() with the task identifier
- Add a Flagsmith feature flag to enable/disable the feature
- Add cost tracking assertions in the test suite
- Add latency and fallback rate alerts in the monitoring dashboard
- Document the expected cost per call in this file