LLM Routing¶
Overview¶
Curaway uses a tiered model selection strategy to balance cost, latency, and accuracy across its AI-powered features. Every LLM call is routed through a centralized model registry that supports A/B testing, fallback chains, and runtime configuration via feature flags.
The guiding principle: use the cheapest model that meets the quality threshold for each task.
Tiered Model Selection¶
Model Tiers¶
```mermaid
graph TD
    Request[Incoming LLM Request] --> Router[Model Router]
    Router --> Tier1{Task Complexity}
    Tier1 -->|Simple: 80% of calls| Haiku[Claude Haiku 4.5]
    Tier1 -->|Complex: 20% of calls| Sonnet[Claude Sonnet 4.6]
    Tier1 -->|Classification: bulk| Mini[GPT-4o mini]
    Haiku --> Langfuse[Langfuse Tracking]
    Sonnet --> Langfuse
    Mini --> Langfuse

    style Router fill:#008B8B,color:#fff
    style Haiku fill:#4A90D9,color:#fff
    style Sonnet fill:#FF7F50,color:#fff
    style Mini fill:#6B7280,color:#fff
```
| Tier | Model | Provider | Use Cases | % of Calls | Avg Cost/Call |
|---|---|---|---|---|---|
| Economy | Claude Haiku 4.5 | Anthropic | Conversations, intake, explanations, orchestration | ~80% | $0.003 |
| Premium | Claude Sonnet 4.6 | Anthropic | Clinical extraction, complex reasoning, reranking | ~20% | $0.015 |
| Bulk | GPT-4o mini | OpenAI | High-volume classification, intent detection | As needed | $0.001 |
Why Not a Single Model?
Using Claude Sonnet 4.6 for everything would cost ~5x more per patient journey. For conversational intake and template-based explanations, Haiku 4.5 provides comparable quality at a fraction of the cost. The premium tier is reserved for tasks where accuracy directly impacts patient safety.
Cost Per Patient Journey¶
A typical patient journey involves approximately 6 agent calls:
| Step | Agent | Model | Input Tokens | Output Tokens | Cost |
|---|---|---|---|---|---|
| 1. Document OCR fallback | Clinical Context | Sonnet 4.6 | ~3,000 | ~1,500 | $0.030 |
| 2. Entity extraction | Clinical Context | Sonnet 4.6 | ~2,000 | ~800 | $0.018 |
| 3. Code mapping | Clinical Context | Haiku 4.5 | ~500 | ~300 | $0.002 |
| 4. Intake conversation (3 turns) | Intake | Haiku 4.5 | ~1,500 | ~900 | $0.006 |
| 5. Match reranking | Match | Haiku 4.5 | ~1,000 | ~500 | $0.003 |
| 6. Explanation generation | Explanation | Haiku 4.5 | ~800 | ~600 | $0.003 |
| **Total** | | | ~8,800 | ~4,600 | $0.062 |
Cost range per patient journey: $0.07 - $0.50
- Lower end: Text-based PDFs (no Vision OCR), simple conditions
- Upper end: Scanned documents (Vision OCR), complex multi-condition cases
- Average: ~$0.15
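The per-call figures in the table follow directly from the per-1k-token prices listed in the model registry below. A minimal helper (illustrative, not part of the codebase) reproduces them:

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  cost_per_1k_input: float, cost_per_1k_output: float) -> float:
    """Estimate the dollar cost of a single LLM call from token counts."""
    return (input_tokens / 1000) * cost_per_1k_input \
         + (output_tokens / 1000) * cost_per_1k_output

# Step 2 (entity extraction, Sonnet 4.6): ~2,000 in / ~800 out -> $0.018
sonnet_extraction = estimate_cost(2_000, 800, 0.003, 0.015)

# Step 4 (intake conversation, Haiku 4.5): ~1,500 in / ~900 out -> $0.006
haiku_intake = estimate_cost(1_500, 900, 0.001, 0.005)
```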
Monthly Cost Projections¶
| Scale | Patients/Month | Avg Cost/Patient | Monthly LLM Cost |
|---|---|---|---|
| POC | 50 | $0.15 | $7.50 |
| Early | 500 | $0.12 | $60 |
| Growth | 5,000 | $0.10 | $500 |
| Scale | 50,000 | $0.08 | $4,000 |
Cost Reduction at Scale
Per-patient cost decreases at scale due to: (1) prompt caching on repeated patterns, (2) more documents processed by PyMuPDF (fewer Vision OCR calls), and (3) model improvements reducing token counts.
Model Registry¶
Configuration File¶
All model routing is configured through `config/model_registry.yaml`:
```yaml
# config/model_registry.yaml
models:
  claude-haiku-4.5:
    provider: anthropic
    model_id: claude-haiku-4-5-20250514
    max_tokens: 4096
    temperature: 0.3
    tier: economy
    cost_per_1k_input: 0.001
    cost_per_1k_output: 0.005
    rate_limit_rpm: 1000
    timeout_seconds: 30

  claude-sonnet-4.6:
    provider: anthropic
    model_id: claude-sonnet-4-6-20250514
    max_tokens: 8192
    temperature: 0.1
    tier: premium
    cost_per_1k_input: 0.003
    cost_per_1k_output: 0.015
    rate_limit_rpm: 500
    timeout_seconds: 60

  gpt-4o-mini:
    provider: openai
    model_id: gpt-4o-mini
    max_tokens: 4096
    temperature: 0.2
    tier: bulk
    cost_per_1k_input: 0.00015
    cost_per_1k_output: 0.0006
    rate_limit_rpm: 2000
    timeout_seconds: 15

# Task-to-model routing
routing:
  clinical_extraction:
    primary: claude-sonnet-4.6
    fallback: claude-haiku-4.5
    ab_split:
      claude-sonnet-4.6: 100
      claude-haiku-4.5: 0

  patient_conversation:
    primary: claude-haiku-4.5
    fallback: gpt-4o-mini
    ab_split:
      claude-haiku-4.5: 100

  intent_classification:
    primary: gpt-4o-mini
    fallback: claude-haiku-4.5
    ab_split:
      gpt-4o-mini: 80
      claude-haiku-4.5: 20

  match_explanation:
    primary: claude-haiku-4.5
    fallback: null  # Template fallback, no LLM needed
    ab_split:
      claude-haiku-4.5: 100

  document_reranking:
    primary: claude-haiku-4.5
    fallback: null  # Cosine similarity only
    ab_split:
      claude-haiku-4.5: 100

  vision_ocr:
    primary: claude-sonnet-4.6
    fallback: null  # Only used when other OCR methods fail
    ab_split:
      claude-sonnet-4.6: 100
```
A/B Split Percentages¶
The `ab_split` configuration enables gradual model migration. For example, to test GPT-4o mini for intent classification:
```yaml
intent_classification:
  primary: gpt-4o-mini
  ab_split:
    gpt-4o-mini: 80        # 80% of requests use GPT-4o mini
    claude-haiku-4.5: 20   # 20% use Haiku for comparison
```
Results are tracked in PostHog with the `model_ab_test` event, comparing accuracy, latency, and cost between the two groups.
Fallback Chains¶
Every routing entry has a fallback model. If the primary model fails (timeout, rate limit, API error), the system automatically retries with the fallback:
```python
async def route_llm_call(
    task: str,
    messages: list[dict],
    tenant_id: str,
) -> LLMResponse:
    """Route an LLM call based on task type and registry config."""
    config = model_registry.get_routing(task)
    model = select_model_ab(config)
    try:
        response = await call_model(model, messages, config)
        await track_usage(task, model, response, tenant_id)
        return response
    except (TimeoutError, RateLimitError, APIError) as e:
        logger.warning(f"Primary model {model} failed for {task}: {e}")
        if config.fallback:
            fallback_response = await call_model(config.fallback, messages, config)
            await track_usage(task, config.fallback, fallback_response, tenant_id, fallback=True)
            return fallback_response
        raise LLMRoutingError(f"All models failed for task {task}")
```
Langfuse Prompt Management¶
Versioned Prompts¶
All system prompts are stored and versioned in Langfuse, not hardcoded in the application:
```python
from typing import Optional

from langfuse import Langfuse

langfuse = Langfuse()

async def get_prompt(name: str, version: Optional[int] = None) -> str:
    """Fetch a versioned prompt from Langfuse."""
    prompt = langfuse.get_prompt(
        name=name,
        version=version,          # None = latest version
        cache_ttl_seconds=300,    # Cache for 5 minutes
    )
    return prompt.compile()
```
Prompt Inventory¶
| Prompt Name | Current Version | Model | Description |
|---|---|---|---|
| `clinical_entity_extraction` | v3 | Sonnet 4.6 | Extract conditions, procedures, labs from OCR text |
| `medical_code_mapping` | v2 | Haiku 4.5 | Map entities to ICD-10, CPT, LOINC codes |
| `fhir_resource_generation` | v2 | Haiku 4.5 | Generate FHIR R4 JSON from coded entities |
| `intake_conversation` | v4 | Haiku 4.5 | Conversational preference collection |
| `intent_classification` | v2 | GPT-4o mini | Classify patient message intent |
| `match_explanation` | v3 | Haiku 4.5 | Generate natural-language match explanations |
| `document_reranker` | v1 | Haiku 4.5 | Verify document-to-requirement matches |
| `vision_ocr_extraction` | v2 | Sonnet 4.6 | Extract text from scanned document images |
| `comorbidity_summary` | v1 | Haiku 4.5 | Summarize detected comorbidities for patient |
System Prompt Storage¶
System prompts follow a consistent structure:
```python
# Example: clinical_entity_extraction prompt (v3)
"""
You are a medical document analysis specialist. Extract clinical entities
from the following medical document text.

## Rules

1. Extract ALL conditions, procedures, medications, lab results, and vitals
2. Include laterality (left/right/bilateral) when mentioned
3. Include dates when available
4. Include severity/stage when mentioned
5. Do NOT infer information that is not explicitly stated
6. Return results in the specified JSON schema

## Output Schema

{output_schema}

## Document Text

{document_text}
"""
```
Prompt Versioning Discipline
Never edit a production prompt in place. Always create a new version, test it against the evaluation dataset, and then promote it. Langfuse tracks which version was used for every generation, enabling precise debugging.
Use Case Cost Analysis¶
1. Patient Conversation (~$0.01/case)¶
| Detail | Value |
|---|---|
| Model | Claude Haiku 4.5 |
| Average turns | 3-5 per intake session |
| Avg input tokens/turn | 500 (includes conversation history) |
| Avg output tokens/turn | 300 |
| Cost per turn | ~$0.002 |
| Total per case | ~$0.006 - $0.010 |
The intake conversation is the most frequent LLM interaction but also the cheapest per call. Conversation history is managed with a sliding window to keep input tokens bounded.
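A minimal sketch of such a sliding window, keeping the newest turns that fit a token budget. The window size, token estimator, and `trim_history` name are assumptions for illustration, not the intake agent's actual implementation:

```python
def trim_history(messages: list[dict], max_input_tokens: int = 500,
                 tokens_per_char: float = 0.25) -> list[dict]:
    """Keep the system prompt plus the newest turns within the token budget."""
    system, turns = messages[0], messages[1:]
    kept: list[dict] = []
    budget = max_input_tokens
    for msg in reversed(turns):  # walk newest-first
        est = int(len(msg["content"]) * tokens_per_char)  # rough ~4 chars/token
        if est > budget:
            break
        budget -= est
        kept.append(msg)
    return [system] + list(reversed(kept))  # restore chronological order
```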
2. Clinical Context Extraction ($0.03/report)¶
| Detail | Value |
|---|---|
| Model | Claude Sonnet 4.6 |
| Average pages | 2-5 per document |
| Avg input tokens | 2,000-3,000 (OCR text + prompt) |
| Avg output tokens | 800-1,200 (structured JSON) |
| Cost per document | ~$0.015 - $0.030 |
This is the most expensive per-call task because it requires Sonnet-level accuracy for clinical data extraction. The cost is justified: incorrect extraction could lead to wrong provider matches.
3. Comorbidity Detection ($0/case)¶
| Detail | Value |
|---|---|
| Model | None (rule-based) |
| Method | Lookup table of 150+ comorbidity pairs |
| Input | Extracted condition codes |
| Output | Flagged comorbidity pairs with severity |
| Cost | $0 |
```python
# Keys are unordered pairs of 3-character ICD-10 category prefixes,
# stored in sorted order so lookups match regardless of input order.
COMORBIDITY_PAIRS = {
    ("E11", "I10"): {"name": "Diabetes + Hypertension", "severity": "moderate"},
    ("E66", "G47"): {"name": "Obesity + Sleep Apnea", "severity": "moderate"},
    ("E11", "N18"): {"name": "Diabetes + CKD", "severity": "high"},
    ("E78", "I25"): {"name": "CAD + Hyperlipidemia", "severity": "moderate"},
    # ... 146 more pairs
}

def detect_comorbidities(condition_codes: list[str]) -> list[dict]:
    """Detect comorbidity pairs from extracted condition codes."""
    detected = []
    for i, code1 in enumerate(condition_codes):
        for code2 in condition_codes[i + 1:]:
            key = tuple(sorted([code1[:3], code2[:3]]))  # match on 3-char category
            if key in COMORBIDITY_PAIRS:
                detected.append(COMORBIDITY_PAIRS[key])
    return detected
```
Why Rule-Based?
Comorbidity detection doesn't need LLM reasoning -- it's a well-defined medical knowledge lookup. Using rules instead of an LLM call saves ~$0.005 per case and eliminates hallucination risk for this critical safety check.
4. Match Explanations ($0.005/explanation)¶
| Detail | Value |
|---|---|
| Model | Claude Haiku 4.5 |
| Explanations per match | 3-5 (top providers) |
| Avg input tokens | 800 (match data + prompt) |
| Avg output tokens | 600 (structured explanation) |
| Cost per explanation | ~$0.001 |
| Total per case | ~$0.003 - $0.005 |
Match explanations are generated lazily -- only for the top-ranked providers that are actually shown to the patient. If a patient doesn't scroll past the top 3, explanations for providers 4-5 are never generated.
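A sketch of that lazy pattern: a small cache that only calls the LLM the first time a provider card is actually viewed. The `LazyExplanations` class and the `generate` callable are illustrative names, not the real Match agent API:

```python
from typing import Awaitable, Callable

class LazyExplanations:
    """Generate a match explanation only when its provider card is viewed."""

    def __init__(self, matches: list[dict],
                 generate: Callable[[dict], Awaitable[str]]):
        self._matches = {m["provider_id"]: m for m in matches}
        self._generate = generate          # the ~$0.001 Haiku call
        self._cache: dict[str, str] = {}

    async def get(self, provider_id: str) -> str:
        if provider_id not in self._cache:  # first view triggers the LLM call
            self._cache[provider_id] = await self._generate(
                self._matches[provider_id])
        return self._cache[provider_id]
```

Providers the patient never scrolls to never enter the cache, so they never incur a call.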
Observability and Cost Tracking¶
Langfuse Integration¶
Every LLM call is tracked in Langfuse with:
```python
from langfuse import observe

@observe(name="clinical_extraction")
async def extract_clinical_entities(text: str, tenant_id: str):
    """Extract clinical entities with full Langfuse tracing."""
    prompt = await get_prompt("clinical_entity_extraction")
    model_config = model_registry.get_routing("clinical_extraction")
    response = await call_model(
        model=model_config.primary,
        messages=[
            {"role": "system", "content": prompt},
            {"role": "user", "content": text},
        ],
    )
    # Langfuse automatically tracks: model, tokens, cost, latency, I/O
    return parse_extraction_response(response)
```
Cost Dashboard Metrics¶
| Metric | Tracked In | Alert Threshold |
|---|---|---|
| Daily LLM spend | Langfuse | > $5/day (POC) |
| Cost per patient journey | Langfuse + PostHog | > $0.50 |
| Fallback rate | Events table | > 10% of calls |
| Average latency (Haiku) | Langfuse | > 3 seconds |
| Average latency (Sonnet) | Langfuse | > 10 seconds |
| Token efficiency | Langfuse | > 5,000 input tokens/call |
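The threshold logic behind those alerts can be sketched as a simple comparison against the POC limits above. The metric keys and `breached` helper here are hypothetical; real values would come from Langfuse and the events table:

```python
# POC alert thresholds, mirroring the dashboard table above.
POC_THRESHOLDS = {
    "daily_spend_usd": 5.0,
    "cost_per_journey_usd": 0.50,
    "fallback_rate": 0.10,
    "haiku_latency_s": 3.0,
    "sonnet_latency_s": 10.0,
    "input_tokens_per_call": 5_000,
}

def breached(metrics: dict[str, float]) -> list[str]:
    """Return the names of metrics whose current value exceeds its limit."""
    return [name for name, limit in POC_THRESHOLDS.items()
            if metrics.get(name, 0.0) > limit]
```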
Monthly Cost Breakdown (POC)¶
| Category | Calls | Avg $ | Total $ |
|---|---|---|---|
| Clinical Extraction | 100 | $0.025 | $2.50 |
| Patient Conversations | 300 | $0.003 | $0.90 |
| Match Explanations | 250 | $0.001 | $0.25 |
| Document Reranking | 50 | $0.005 | $0.25 |
| Vision OCR (fallback) | 10 | $0.040 | $0.40 |
| Intent Classification | 300 | $0.001 | $0.30 |
| **Total** | **1,010** | | **$4.60** |
Post-Seed Plans¶
MedGemma 4B Evaluation¶
After seed funding, the team plans to evaluate Google's MedGemma 4B model as a cost-reduction option for clinical tasks:
| Consideration | Details |
|---|---|
| Model | MedGemma 4B (open-weight, medical-specialized) |
| Deployment | Self-hosted on GPU instance |
| Target tasks | Medical code mapping, comorbidity enhancement |
| Expected savings | 60-80% cost reduction for targeted tasks |
| Risk | Lower general reasoning vs. Claude, requires medical validation |
| Evaluation plan | Shadow mode on 500 cases, compare extraction accuracy |
BioMistral Shadow Mode¶
BioMistral is another candidate for self-hosted medical NLP:
| Consideration | Details |
|---|---|
| Model | BioMistral 7B |
| Deployment | Self-hosted on GPU instance |
| Target tasks | Entity extraction from structured lab reports |
| Expected savings | 70-90% cost reduction for lab report parsing |
| Risk | Narrower training data, may miss edge cases |
| Evaluation plan | Shadow mode alongside Claude Sonnet, compare F1 scores |
```mermaid
graph TD
    A[Current: Cloud LLMs Only] --> B[Phase 1: Shadow Testing]
    B --> C{Accuracy >= 90%?}
    C -->|Yes| D[Phase 2: A/B Split 20%]
    D --> E{Cost Savings Confirmed?}
    E -->|Yes| F[Phase 3: Promote to Primary]
    C -->|No| G[Continue Cloud LLMs]
    E -->|No| G

    style A fill:#008B8B,color:#fff
    style D fill:#FF7F50,color:#fff
    style F fill:#4A90D9,color:#fff
```
Medical AI Model Validation
Any model used for clinical tasks must pass a validation suite of 200+ annotated medical documents before being promoted from shadow mode. Patient safety is non-negotiable -- cost savings never justify reduced clinical accuracy.
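Shadow mode can be sketched as follows: the candidate model runs alongside production, its output is recorded for offline comparison, and only the production result is ever returned. The function and callable names (`extract_with_shadow`, `run_production`, `run_candidate`, `record`) are illustrative:

```python
async def extract_with_shadow(text: str, run_production, run_candidate,
                              record) -> dict:
    """Return the production result; log the candidate's output for
    offline accuracy comparison without affecting patient-facing output."""
    production = await run_production(text)
    try:
        candidate = await run_candidate(text)
        record({"production": production, "candidate": candidate})
    except Exception:
        pass  # a shadow-model failure must never affect the patient flow
    return production
```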
Configuration Checklist¶
When adding a new LLM-powered feature:
- Add the model to `config/model_registry.yaml` with routing and fallback
- Create a versioned prompt in Langfuse
- Implement the call using `route_llm_call()` with the task identifier
- Add a Flagsmith feature flag to enable/disable the feature
- Add cost tracking assertions in the test suite
- Add latency and fallback rate alerts in the monitoring dashboard
- Document the expected cost per call in this file