
LLM Routing

Overview

Curaway uses a tiered model selection strategy to balance cost, latency, and accuracy across its AI-powered features. Every LLM call is routed through a centralized model registry that supports A/B testing, fallback chains, and runtime configuration via feature flags.

The guiding principle: use the cheapest model that meets the quality threshold for each task.


Tiered Model Selection

Model Tiers

```mermaid
graph TD
    Request[Incoming LLM Request] --> Router[Model Router]
    Router --> Tier1{Task Complexity}
    Tier1 -->|Simple: 80% of calls| Haiku[Claude Haiku 4.5]
    Tier1 -->|Complex: 20% of calls| Sonnet[Claude Sonnet 4.6]
    Tier1 -->|Classification: bulk| Mini[GPT-4o mini]

    Haiku --> Langfuse[Langfuse Tracking]
    Sonnet --> Langfuse
    Mini --> Langfuse

    style Router fill:#008B8B,color:#fff
    style Haiku fill:#4A90D9,color:#fff
    style Sonnet fill:#FF7F50,color:#fff
    style Mini fill:#6B7280,color:#fff
```
| Tier | Model | Provider | Use Cases | % of Calls | Avg Cost/Call |
|---|---|---|---|---|---|
| Economy | Claude Haiku 4.5 | Anthropic | Conversations, intake, explanations, orchestration | ~80% | $0.003 |
| Premium | Claude Sonnet 4.6 | Anthropic | Clinical extraction, complex reasoning, reranking | ~20% | $0.015 |
| Bulk | GPT-4o mini | OpenAI | High-volume classification, intent detection | As needed | $0.001 |

Why Not a Single Model?

Using Claude Sonnet 4.6 for everything would cost ~5x more per patient journey. For conversational intake and template-based explanations, Haiku 4.5 provides comparable quality at a fraction of the cost. The premium tier is reserved for tasks where accuracy directly impacts patient safety.


Cost Per Patient Journey

A typical patient journey involves approximately 6 agent calls:

| Step | Agent | Model | Input Tokens | Output Tokens | Cost |
|---|---|---|---|---|---|
| 1. Document OCR fallback | Clinical Context | Sonnet 4.6 | ~3,000 | ~1,500 | $0.030 |
| 2. Entity extraction | Clinical Context | Sonnet 4.6 | ~2,000 | ~800 | $0.018 |
| 3. Code mapping | Clinical Context | Haiku 4.5 | ~500 | ~300 | $0.002 |
| 4. Intake conversation (3 turns) | Intake | Haiku 4.5 | ~1,500 | ~900 | $0.006 |
| 5. Match reranking | Match | Haiku 4.5 | ~1,000 | ~500 | $0.003 |
| 6. Explanation generation | Explanation | Haiku 4.5 | ~800 | ~600 | $0.003 |
| **Total** | | | ~8,800 | ~4,600 | $0.062 |
Cost range per patient journey: $0.07 - $0.50
  - Lower end: Text-based PDFs (no Vision OCR), simple conditions
  - Upper end: Scanned documents (Vision OCR), complex multi-condition cases
  - Average: ~$0.15
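The per-step costs in the table follow directly from token counts and the per-1k-token rates in the model registry. As a sanity check, here is a minimal cost helper; `call_cost` is an illustrative function, not part of the codebase, and the rates are those listed in the registry below:

```python
# Per-1k-token rates in USD, taken from config/model_registry.yaml.
RATES = {
    "claude-haiku-4.5": {"input": 0.001, "output": 0.005},
    "claude-sonnet-4.6": {"input": 0.003, "output": 0.015},
}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one LLM call from its token counts."""
    rate = RATES[model]
    return (input_tokens / 1000) * rate["input"] + (output_tokens / 1000) * rate["output"]

# Step 2 above: entity extraction on Sonnet 4.6 with ~2,000 in / ~800 out.
cost = call_cost("claude-sonnet-4.6", 2000, 800)  # -> 0.018, matching the table
```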

Monthly Cost Projections

| Scale | Patients/Month | Avg Cost/Patient | Monthly LLM Cost |
|---|---|---|---|
| POC | 50 | $0.15 | $7.50 |
| Early | 500 | $0.12 | $60 |
| Growth | 5,000 | $0.10 | $500 |
| Scale | 50,000 | $0.08 | $4,000 |

Cost Reduction at Scale

Per-patient cost decreases at scale due to: (1) prompt caching on repeated patterns, (2) more documents processed by PyMuPDF (fewer Vision OCR calls), and (3) model improvements reducing token counts.


Model Registry

Configuration File

All model routing is configured through `config/model_registry.yaml`:

```yaml
# config/model_registry.yaml
models:
  claude-haiku-4.5:
    provider: anthropic
    model_id: claude-haiku-4-5-20250514
    max_tokens: 4096
    temperature: 0.3
    tier: economy
    cost_per_1k_input: 0.001
    cost_per_1k_output: 0.005
    rate_limit_rpm: 1000
    timeout_seconds: 30

  claude-sonnet-4.6:
    provider: anthropic
    model_id: claude-sonnet-4-6-20250514
    max_tokens: 8192
    temperature: 0.1
    tier: premium
    cost_per_1k_input: 0.003
    cost_per_1k_output: 0.015
    rate_limit_rpm: 500
    timeout_seconds: 60

  gpt-4o-mini:
    provider: openai
    model_id: gpt-4o-mini
    max_tokens: 4096
    temperature: 0.2
    tier: bulk
    cost_per_1k_input: 0.00015
    cost_per_1k_output: 0.0006
    rate_limit_rpm: 2000
    timeout_seconds: 15

# Task-to-model routing
routing:
  clinical_extraction:
    primary: claude-sonnet-4.6
    fallback: claude-haiku-4.5
    ab_split:
      claude-sonnet-4.6: 100
      claude-haiku-4.5: 0

  patient_conversation:
    primary: claude-haiku-4.5
    fallback: gpt-4o-mini
    ab_split:
      claude-haiku-4.5: 100

  intent_classification:
    primary: gpt-4o-mini
    fallback: claude-haiku-4.5
    ab_split:
      gpt-4o-mini: 80
      claude-haiku-4.5: 20

  match_explanation:
    primary: claude-haiku-4.5
    fallback: null  # Template fallback, no LLM needed
    ab_split:
      claude-haiku-4.5: 100

  document_reranking:
    primary: claude-haiku-4.5
    fallback: null  # Cosine similarity only
    ab_split:
      claude-haiku-4.5: 100

  vision_ocr:
    primary: claude-sonnet-4.6
    fallback: null  # Only used when other OCR methods fail
    ab_split:
      claude-sonnet-4.6: 100
```
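A minimal sketch of the lookup layer behind `model_registry.get_routing()`, assuming the YAML above has already been parsed into a plain dict (e.g. by PyYAML's `safe_load`); `RoutingConfig` and `ModelRegistry` are illustrative names, not the actual classes:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RoutingConfig:
    primary: str
    fallback: Optional[str]
    ab_split: dict[str, int]

class ModelRegistry:
    """Thin lookup layer over the parsed model_registry.yaml contents."""

    def __init__(self, config: dict):
        self._routing = {
            task: RoutingConfig(
                primary=entry["primary"],
                fallback=entry.get("fallback"),
                # Default to 100% primary if no ab_split is given.
                ab_split=entry.get("ab_split", {entry["primary"]: 100}),
            )
            for task, entry in config["routing"].items()
        }

    def get_routing(self, task: str) -> RoutingConfig:
        if task not in self._routing:
            raise KeyError(f"No routing configured for task '{task}'")
        return self._routing[task]

# A subset of the YAML above, in the shape yaml.safe_load would produce.
registry = ModelRegistry({
    "routing": {
        "clinical_extraction": {
            "primary": "claude-sonnet-4.6",
            "fallback": "claude-haiku-4.5",
            "ab_split": {"claude-sonnet-4.6": 100, "claude-haiku-4.5": 0},
        },
    },
})
```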

A/B Split Percentages

The `ab_split` configuration enables gradual model migration. For example, to test GPT-4o mini for intent classification:

```yaml
intent_classification:
  primary: gpt-4o-mini
  ab_split:
    gpt-4o-mini: 80       # 80% of requests use GPT-4o mini
    claude-haiku-4.5: 20  # 20% use Haiku for comparison
```

Results are tracked in PostHog with the `model_ab_test` event, comparing accuracy, latency, and cost between the two groups.

Fallback Chains

Every routing entry has a fallback model. If the primary model fails (timeout, rate limit, API error), the system automatically retries with the fallback:

```python
async def route_llm_call(
    task: str,
    messages: list[dict],
    tenant_id: str,
) -> LLMResponse:
    """Route an LLM call based on task type and registry config."""
    config = model_registry.get_routing(task)
    model = select_model_ab(config)

    try:
        response = await call_model(model, messages, config)
        await track_usage(task, model, response, tenant_id)
        return response
    except (TimeoutError, RateLimitError, APIError) as e:
        logger.warning(f"Primary model {model} failed for {task}: {e}")

        if config.fallback:
            fallback_response = await call_model(config.fallback, messages, config)
            await track_usage(task, config.fallback, fallback_response, tenant_id, fallback=True)
            return fallback_response

        # Chain the original exception so the root cause survives in tracebacks.
        raise LLMRoutingError(f"All models failed for task {task}") from e
```

Langfuse Prompt Management

Versioned Prompts

All system prompts are stored and versioned in Langfuse, not hardcoded in the application:

```python
from typing import Optional

from langfuse import Langfuse

langfuse = Langfuse()

async def get_prompt(name: str, version: Optional[int] = None) -> str:
    """Fetch a versioned prompt from Langfuse."""
    prompt = langfuse.get_prompt(
        name=name,
        version=version,        # None = latest version
        cache_ttl_seconds=300,  # Cache for 5 minutes
    )
    return prompt.compile()
```

Prompt Inventory

| Prompt Name | Current Version | Model | Description |
|---|---|---|---|
| `clinical_entity_extraction` | v3 | Sonnet 4.6 | Extract conditions, procedures, labs from OCR text |
| `medical_code_mapping` | v2 | Haiku 4.5 | Map entities to ICD-10, CPT, LOINC codes |
| `fhir_resource_generation` | v2 | Haiku 4.5 | Generate FHIR R4 JSON from coded entities |
| `intake_conversation` | v4 | Haiku 4.5 | Conversational preference collection |
| `intent_classification` | v2 | GPT-4o mini | Classify patient message intent |
| `match_explanation` | v3 | Haiku 4.5 | Generate natural-language match explanations |
| `document_reranker` | v1 | Haiku 4.5 | Verify document-to-requirement matches |
| `vision_ocr_extraction` | v2 | Sonnet 4.6 | Extract text from scanned document images |
| `comorbidity_summary` | v1 | Haiku 4.5 | Summarize detected comorbidities for patient |

System Prompt Storage

System prompts follow a consistent structure:

```python
# Example: clinical_entity_extraction prompt (v3)
"""
You are a medical document analysis specialist. Extract clinical entities
from the following medical document text.

## Rules
1. Extract ALL conditions, procedures, medications, lab results, and vitals
2. Include laterality (left/right/bilateral) when mentioned
3. Include dates when available
4. Include severity/stage when mentioned
5. Do NOT infer information that is not explicitly stated
6. Return results in the specified JSON schema

## Output Schema
{output_schema}

## Document Text
{document_text}
"""
```

Prompt Versioning Discipline

Never edit a production prompt in place. Always create a new version, test it against the evaluation dataset, and then promote it. Langfuse tracks which version was used for every generation, enabling precise debugging.


Use Case Cost Analysis

1. Patient Conversation (~$0.01/case)

| Detail | Value |
|---|---|
| Model | Claude Haiku 4.5 |
| Average turns | 3-5 per intake session |
| Avg input tokens/turn | 500 (includes conversation history) |
| Avg output tokens/turn | 300 |
| Cost per turn | ~$0.002 |
| Total per case | ~$0.006 - $0.010 |

The intake conversation is the most frequent LLM interaction but also the cheapest per call. Conversation history is managed with a sliding window to keep input tokens bounded.
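The sliding window mentioned above can be as simple as keeping the system prompt plus the last N turns. This turn-based sketch is illustrative; the production window may be token-based, and `apply_sliding_window` is a hypothetical helper name:

```python
def apply_sliding_window(messages: list[dict], max_turns: int = 6) -> list[dict]:
    """Keep the system prompt plus the most recent conversation turns.

    Bounds input tokens for long intake sessions: older user/assistant
    turns are dropped, while system messages are always retained.
    """
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    return system + turns[-max_turns:]
```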

2. Clinical Context Extraction ($0.03/report)

| Detail | Value |
|---|---|
| Model | Claude Sonnet 4.6 |
| Average pages | 2-5 per document |
| Avg input tokens | 2,000-3,000 (OCR text + prompt) |
| Avg output tokens | 800-1,200 (structured JSON) |
| Cost per document | ~$0.015 - $0.030 |

This is the most expensive per-call task because it requires Sonnet-level accuracy for clinical data extraction. The cost is justified: incorrect extraction could lead to wrong provider matches.

3. Comorbidity Detection ($0/case)

| Detail | Value |
|---|---|
| Model | None (rule-based) |
| Method | Lookup table of 150+ comorbidity pairs |
| Input | Extracted condition codes |
| Output | Flagged comorbidity pairs with severity |
| Cost | $0 |
```python
# Keys are sorted pairs of 3-character ICD-10 prefixes, so they can be
# matched regardless of the order codes appear in the extraction output.
COMORBIDITY_PAIRS = {
    ("E11", "I10"): {"name": "Diabetes + Hypertension", "severity": "moderate"},
    ("E66", "G47"): {"name": "Obesity + Sleep Apnea", "severity": "moderate"},
    ("E11", "N18"): {"name": "Diabetes + CKD", "severity": "high"},
    ("E78", "I25"): {"name": "CAD + Hyperlipidemia", "severity": "moderate"},
    # ... 146 more pairs
}

def detect_comorbidities(condition_codes: list[str]) -> list[dict]:
    """Detect comorbidity pairs from extracted condition codes."""
    detected = []
    for i, code1 in enumerate(condition_codes):
        for code2 in condition_codes[i + 1:]:
            key = tuple(sorted([code1[:3], code2[:3]]))  # Match on 3-char prefix
            if key in COMORBIDITY_PAIRS:
                detected.append(COMORBIDITY_PAIRS[key])
    return detected
```

Why Rule-Based?

Comorbidity detection doesn't need LLM reasoning -- it's a well-defined medical knowledge lookup. Using rules instead of an LLM call saves ~$0.005 per case and eliminates hallucination risk for this critical safety check.

4. Match Explanations (~$0.005/case)

| Detail | Value |
|---|---|
| Model | Claude Haiku 4.5 |
| Explanations per match | 3-5 (top providers) |
| Avg input tokens | 800 (match data + prompt) |
| Avg output tokens | 600 (structured explanation) |
| Cost per explanation | ~$0.001 |
| Total per case | ~$0.003 - $0.005 |

Match explanations are generated lazily -- only for the top-ranked providers that are actually shown to the patient. If a patient doesn't scroll past the top 3, explanations for providers 4-5 are never generated.
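Lazy generation can be sketched as a small cache that only invokes the LLM when a provider card is actually rendered. `LazyExplanations` is an illustrative wrapper, not the actual class name, and `generate_fn` stands in for the Haiku call:

```python
class LazyExplanations:
    """Generate match explanations only when a provider is actually shown."""

    def __init__(self, generate_fn):
        self._generate = generate_fn  # e.g. an LLM call per provider match
        self._cache: dict[str, str] = {}
        self.calls = 0  # number of generation calls actually made

    def get(self, provider_id: str, match_data: dict) -> str:
        if provider_id not in self._cache:
            self.calls += 1
            self._cache[provider_id] = self._generate(provider_id, match_data)
        return self._cache[provider_id]
```

If a patient only views the top 3 of 5 matched providers, only 3 generation calls are ever made, and re-viewing a provider hits the cache.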


Observability and Cost Tracking

Langfuse Integration

Every LLM call is tracked in Langfuse with:

```python
@langfuse.observe(name="clinical_extraction")
async def extract_clinical_entities(text: str, tenant_id: str):
    """Extract clinical entities with full Langfuse tracing."""
    prompt = await get_prompt("clinical_entity_extraction")
    model_config = model_registry.get_routing("clinical_extraction")

    response = await call_model(
        model=model_config.primary,
        messages=[
            {"role": "system", "content": prompt},
            {"role": "user", "content": text},
        ],
    )

    # Langfuse automatically tracks: model, tokens, cost, latency, I/O
    return parse_extraction_response(response)
```

Cost Dashboard Metrics

| Metric | Tracked In | Alert Threshold |
|---|---|---|
| Daily LLM spend | Langfuse | > $5/day (POC) |
| Cost per patient journey | Langfuse + PostHog | > $0.50 |
| Fallback rate | Events table | > 10% of calls |
| Average latency (Haiku) | Langfuse | > 3 seconds |
| Average latency (Sonnet) | Langfuse | > 10 seconds |
| Token efficiency | Langfuse | > 5,000 input tokens/call |
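A hypothetical sketch of how a monitoring job might evaluate these thresholds; the names and values mirror the table, while the actual alerting lives in the Langfuse and PostHog dashboards:

```python
# POC alert thresholds from the table above (illustrative constants).
THRESHOLDS = {
    "daily_spend_usd": 5.0,
    "cost_per_journey_usd": 0.50,
    "fallback_rate": 0.10,
}

def check_alerts(metrics: dict[str, float]) -> list[str]:
    """Return the names of metrics that breached their threshold."""
    return [name for name, limit in THRESHOLDS.items()
            if metrics.get(name, 0.0) > limit]
```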

Monthly Cost Breakdown (POC)

| Category | Calls | Avg $ | Total $ |
|---|---|---|---|
| Clinical Extraction | 100 | $0.025 | $2.50 |
| Patient Conversations | 300 | $0.003 | $0.90 |
| Match Explanations | 250 | $0.001 | $0.25 |
| Document Reranking | 50 | $0.005 | $0.25 |
| Vision OCR (fallback) | 10 | $0.040 | $0.40 |
| Intent Classification | 300 | $0.001 | $0.30 |
| **Total** | **1,010** | | **$4.60** |

Post-Seed Plans

MedGemma 4B Evaluation

After seed funding, the team plans to evaluate Google's MedGemma 4B model as a cost-reduction option for clinical tasks:

| Consideration | Details |
|---|---|
| Model | MedGemma 4B (open-weight, medical-specialized) |
| Deployment | Self-hosted on GPU instance |
| Target tasks | Medical code mapping, comorbidity enhancement |
| Expected savings | 60-80% cost reduction for targeted tasks |
| Risk | Lower general reasoning vs. Claude, requires medical validation |
| Evaluation plan | Shadow mode on 500 cases, compare extraction accuracy |

BioMistral Shadow Mode

BioMistral is another candidate for self-hosted medical NLP:

| Consideration | Details |
|---|---|
| Model | BioMistral 7B |
| Deployment | Self-hosted on GPU instance |
| Target tasks | Entity extraction from structured lab reports |
| Expected savings | 70-90% cost reduction for lab report parsing |
| Risk | Narrower training data, may miss edge cases |
| Evaluation plan | Shadow mode alongside Claude Sonnet, compare F1 scores |
```mermaid
graph TD
    A[Current: Cloud LLMs Only] --> B[Phase 1: Shadow Testing]
    B --> C{Accuracy >= 90%?}
    C -->|Yes| D[Phase 2: A/B Split 20%]
    D --> E{Cost Savings Confirmed?}
    E -->|Yes| F[Phase 3: Promote to Primary]
    C -->|No| G[Continue Cloud LLMs]
    E -->|No| G

    style A fill:#008B8B,color:#fff
    style D fill:#FF7F50,color:#fff
    style F fill:#4A90D9,color:#fff
```

Medical AI Model Validation

Any model used for clinical tasks must pass a validation suite of 200+ annotated medical documents before being promoted from shadow mode. Patient safety is non-negotiable -- cost savings never justify reduced clinical accuracy.
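For the extraction-focused comparisons above (e.g. BioMistral vs. Claude Sonnet F1 scores), per-document scoring can be as simple as a set-based F1 over extracted codes. This sketch assumes entities are compared as normalized code strings; `extraction_f1` is an illustrative helper:

```python
def extraction_f1(predicted: set[str], gold: set[str]) -> float:
    """Micro F1 over extracted entity codes for one document."""
    if not predicted and not gold:
        return 1.0  # both empty: perfect agreement
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```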


Configuration Checklist

When adding a new LLM-powered feature:

  1. Add the model to `config/model_registry.yaml` with routing and fallback
  2. Create a versioned prompt in Langfuse
  3. Implement the call using `route_llm_call()` with the task identifier
  4. Add a Flagsmith feature flag to enable/disable the feature
  5. Add cost tracking assertions in the test suite
  6. Add latency and fallback rate alerts in the monitoring dashboard
  7. Document the expected cost per call in this file