
LLM Routing

Overview

Curaway uses a tiered model selection strategy to balance cost, latency, and accuracy across its AI-powered features. Every LLM call is routed through a centralized model registry that supports A/B testing, fallback chains, and runtime configuration via feature flags.

The guiding principle: use the cheapest model that meets the quality threshold for each task.


Tiered Model Selection

Model Tiers

graph TD
    Request[Incoming LLM Request] --> Router[Model Router]
    Router --> Tier1{Task Complexity}
    Tier1 -->|Simple: 80% of calls| Haiku[Claude Haiku 4.5]
    Tier1 -->|Complex: 20% of calls| Sonnet[Claude Sonnet 4.6]
    Tier1 -->|Classification: bulk| Mini[GPT-4o mini]

    Haiku --> Langfuse[Langfuse Tracking]
    Sonnet --> Langfuse
    Mini --> Langfuse

    style Router fill:#008B8B,color:#fff
    style Haiku fill:#4A90D9,color:#fff
    style Sonnet fill:#FF7F50,color:#fff
    style Mini fill:#6B7280,color:#fff

| Tier | Model | Provider | Use Cases | % of Calls | Avg Cost/Call |
|---|---|---|---|---|---|
| Economy | Claude Haiku 4.5 | Anthropic | Conversations, intake, explanations, reranking, orchestration | ~80% | $0.003 |
| Premium | Claude Sonnet 4.6 | Anthropic | Clinical extraction, complex reasoning, Vision OCR fallback | ~20% | $0.015 |
| Bulk | GPT-4o mini | OpenAI | High-volume classification, intent detection | As needed | $0.001 |

Why Not a Single Model?

Using Claude Sonnet 4.6 for everything would roughly triple the cost of every call that Haiku 4.5 handles today: the registry lists Sonnet 4.6 at $0.003/$0.015 per 1k input/output tokens versus $0.001/$0.005 for Haiku 4.5. For conversational intake and template-based explanations, Haiku 4.5 provides comparable quality at a fraction of the cost. The premium tier is reserved for tasks where accuracy directly impacts patient safety.


Cost Per Patient Journey

A typical patient journey involves approximately 6 agent calls:

| Step | Agent | Model | Input Tokens | Output Tokens | Cost |
|---|---|---|---|---|---|
| 1. Document OCR fallback | Clinical Context | Sonnet 4.6 | ~3,000 | ~1,500 | $0.030 |
| 2. Entity extraction | Clinical Context | Sonnet 4.6 | ~2,000 | ~800 | $0.018 |
| 3. Code mapping | Clinical Context | Haiku 4.5 | ~500 | ~300 | $0.002 |
| 4. Intake conversation (3 turns) | Intake | Haiku 4.5 | ~1,500 | ~900 | $0.006 |
| 5. Match reranking | Match | Haiku 4.5 | ~1,000 | ~500 | $0.003 |
| 6. Explanation generation | Explanation | Haiku 4.5 | ~800 | ~600 | $0.003 |
| Total | | | ~8,800 | ~4,600 | $0.062 |

Cost range per patient journey: $0.07 - $0.50
  - Lower end: Text-based PDFs (no Vision OCR), simple conditions
  - Upper end: Scanned documents (Vision OCR), complex multi-condition cases
  - Average: ~$0.15
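The per-step costs in the journey table can be reproduced from the token counts and the per-1k-token rates in config/model_registry.yaml. The sketch below is illustrative only: the short model keys and the function names are assumptions, not part of the codebase. Summed without rounding, the six steps come to about $0.065; the table's $0.062 reflects rounding each row before adding.

```python
# Per-1k-token rates copied from config/model_registry.yaml.
# The short keys are labels for this sketch, not registry names.
RATES = {
    "sonnet-4.6": {"input": 0.003, "output": 0.015},
    "haiku-4.5": {"input": 0.001, "output": 0.005},
}

# (step, model, input_tokens, output_tokens) from the journey table.
JOURNEY = [
    ("Document OCR fallback", "sonnet-4.6", 3000, 1500),
    ("Entity extraction", "sonnet-4.6", 2000, 800),
    ("Code mapping", "haiku-4.5", 500, 300),
    ("Intake conversation", "haiku-4.5", 1500, 900),
    ("Match reranking", "haiku-4.5", 1000, 500),
    ("Explanation generation", "haiku-4.5", 800, 600),
]


def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one call at the per-1k-token rates above."""
    rate = RATES[model]
    return (input_tokens * rate["input"] + output_tokens * rate["output"]) / 1000


def journey_cost(steps=JOURNEY) -> float:
    """Sum the unrounded cost of every step in a journey."""
    return sum(call_cost(model, i, o) for _, model, i, o in steps)
```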

Monthly Cost Projections

| Scale | Patients/Month | Avg Cost/Patient | Monthly LLM Cost |
|---|---|---|---|
| MVP | 50 | $0.15 | $7.50 |
| Early | 500 | $0.12 | $60 |
| Growth | 5,000 | $0.10 | $500 |
| Scale | 50,000 | $0.08 | $4,000 |

Cost Reduction at Scale

Per-patient cost decreases at scale due to: (1) prompt caching on repeated patterns, (2) more documents processed by PyMuPDF (fewer Vision OCR calls), and (3) model improvements reducing token counts.


Model Registry

Configuration File

All model routing is configured through config/model_registry.yaml:

# config/model_registry.yaml
models:
  claude-haiku-4.5:
    provider: anthropic
    model_id: claude-haiku-4-5-20250514
    max_tokens: 4096
    temperature: 0.3
    tier: economy
    cost_per_1k_input: 0.001
    cost_per_1k_output: 0.005
    rate_limit_rpm: 1000
    timeout_seconds: 30

  claude-sonnet-4.6:
    provider: anthropic
    model_id: claude-sonnet-4-6-20250514
    max_tokens: 8192
    temperature: 0.1
    tier: premium
    cost_per_1k_input: 0.003
    cost_per_1k_output: 0.015
    rate_limit_rpm: 500
    timeout_seconds: 60

  gpt-4o-mini:
    provider: openai
    model_id: gpt-4o-mini
    max_tokens: 4096
    temperature: 0.2
    tier: bulk
    cost_per_1k_input: 0.00015
    cost_per_1k_output: 0.0006
    rate_limit_rpm: 2000
    timeout_seconds: 15

# Task-to-model routing
routing:
  clinical_extraction:
    primary: claude-sonnet-4.6
    fallback: claude-haiku-4.5
    ab_split:
      claude-sonnet-4.6: 100
      claude-haiku-4.5: 0

  patient_conversation:
    primary: claude-haiku-4.5
    fallback: gpt-4o-mini
    ab_split:
      claude-haiku-4.5: 100

  intent_classification:
    primary: gpt-4o-mini
    fallback: claude-haiku-4.5
    ab_split:
      gpt-4o-mini: 80
      claude-haiku-4.5: 20

  match_explanation:
    primary: claude-haiku-4.5
    fallback: null  # Template fallback, no LLM needed
    ab_split:
      claude-haiku-4.5: 100

  document_reranking:
    primary: claude-haiku-4.5
    fallback: null  # Cosine similarity only
    ab_split:
      claude-haiku-4.5: 100

  vision_ocr:
    primary: claude-sonnet-4.6
    fallback: null  # Only used when other OCR methods fail
    ab_split:
      claude-sonnet-4.6: 100
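A registry like this is easy to break with a typo in a model name or a split that no longer adds up. The following is a minimal validation sketch, assuming the YAML has already been parsed into a plain dict; `validate_registry` is a hypothetical helper, and the rule that `ab_split` weights sum to 100 is inferred from the percentages shown above rather than stated by the source.

```python
def validate_registry(registry: dict) -> list[str]:
    """Return human-readable problems; an empty list means the config looks consistent."""
    models = registry.get("models", {})
    problems = []
    for task, route in registry.get("routing", {}).items():
        if route.get("primary") is None:
            problems.append(f"{task}: missing primary model")
        # fallback may legitimately be null (None), so only check named models.
        for role in ("primary", "fallback"):
            name = route.get(role)
            if name is not None and name not in models:
                problems.append(f"{task}: {role} '{name}' is not a defined model")
        split = route.get("ab_split", {})
        for name in split:
            if name not in models:
                problems.append(f"{task}: ab_split model '{name}' is not defined")
        if split and sum(split.values()) != 100:
            problems.append(f"{task}: ab_split sums to {sum(split.values())}, expected 100")
    return problems
```

Running a check of this kind at startup or in CI would surface a misconfigured route before it reaches the router.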

A/B Split Percentages

The ab_split configuration enables gradual model migration. For example, the split below sends 80% of intent-classification requests to GPT-4o mini and 20% to Haiku 4.5 for comparison:

intent_classification:
  primary: gpt-4o-mini
  ab_split:
    gpt-4o-mini: 80    # 80% of requests use GPT-4o mini
    claude-haiku-4.5: 20  # 20% use Haiku for comparison

Results are tracked in PostHog with the model_ab_test event, comparing accuracy, latency, and cost between the two groups.
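The source does not show how a model is chosen from the split. One plausible approach is a weighted random choice, sketched below. This version of `select_model_ab` takes the split dict and the primary model name directly, which is an assumption made to keep the example self-contained; the real function receives the routing config object.

```python
import random
from typing import Optional


def select_model_ab(
    ab_split: dict[str, int],
    primary: str,
    rng: Optional[random.Random] = None,
) -> str:
    """Pick a model name with probability proportional to its ab_split weight."""
    candidates = [(name, weight) for name, weight in ab_split.items() if weight > 0]
    if not candidates:
        return primary  # No usable split configured: use the primary model.
    names, weights = zip(*candidates)
    chooser = rng if rng is not None else random
    return chooser.choices(names, weights=weights, k=1)[0]
```

Passing a seeded `random.Random` makes the choice reproducible in tests. If the same patient should always see the same model, hashing a stable identifier into the 0-99 range would be an alternative to random draws.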

Fallback Chains

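Most routing entries name a fallback model. If the primary model fails (timeout, rate limit, API error), the system automatically retries with the fallback. Entries with fallback: null have no second model to try; those tasks degrade to a non-LLM path (a template for match explanations, cosine similarity alone for reranking) or surface the error: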

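import logging

logger = logging.getLogger(__name__)

# model_registry, select_model_ab, call_model, track_usage, LLMResponse, and the
# error types are application modules imported elsewhere.


async def route_llm_call(
    task: str,
    messages: list[dict],
    tenant_id: str,
) -> LLMResponse:
    """Route an LLM call based on task type and registry config."""
    config = model_registry.get_routing(task)
    model = select_model_ab(config)  # Applies the task's ab_split weights

    try:
        response = await call_model(model, messages, config)
        await track_usage(task, model, response, tenant_id)
        return response
    except (TimeoutError, RateLimitError, APIError) as primary_error:
        logger.warning("Primary model %s failed for %s: %s", model, task, primary_error)

        # fallback: null means there is no second model to try
        if not config.fallback:
            raise LLMRoutingError(
                f"{model} failed for task {task}; no fallback configured"
            ) from primary_error

    # Retry once with the fallback model; wrap its failure in LLMRoutingError too
    try:
        fallback_response = await call_model(config.fallback, messages, config)
    except (TimeoutError, RateLimitError, APIError) as fallback_error:
        raise LLMRoutingError(f"All models failed for task {task}") from fallback_error

    await track_usage(task, config.fallback, fallback_response, tenant_id, fallback=True)
    return fallback_response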

Langfuse Prompt Management

Versioned Prompts

All system prompts are stored and versioned in Langfuse, not hardcoded in the application:

from typing import Optional

from langfuse import Langfuse

langfuse = Langfuse()  # Reads LANGFUSE_* credentials from the environment

async def get_prompt(name: str, version: Optional[int] = None) -> str:
    """Fetch a versioned prompt from Langfuse."""
    prompt = langfuse.get_prompt(
        name=name,
        version=version,        # None = the version labeled "production"
        cache_ttl_seconds=300,  # Cache for 5 minutes
    )
    return prompt.compile()

Prompt Inventory

| Prompt Name | Current Version | Model | Description |
|---|---|---|---|
| clinical_entity_extraction | v3 | Sonnet 4.6 | Extract conditions, procedures, labs from OCR text |
| medical_code_mapping | v2 | Haiku 4.5 | Map entities to ICD-10, CPT, LOINC codes |
| fhir_resource_generation | v2 | Haiku 4.5 | Generate FHIR R4 JSON from coded entities |
| intake_conversation | v4 | Haiku 4.5 | Conversational preference collection |
| intent_classification | v2 | GPT-4o mini | Classify patient message intent |
| match_explanation | v3 | Haiku 4.5 | Generate natural-language match explanations |
| document_reranker | v1 | Haiku 4.5 | Verify document-to-requirement matches |
| vision_ocr_extraction | v2 | Sonnet 4.6 | Extract text from scanned document images |
| comorbidity_summary | v1 | Haiku 4.5 | Summarize detected comorbidities for patient |

System Prompt Storage

System prompts follow a consistent structure:

# Example: clinical_entity_extraction prompt (v3)
"""
You are a medical document analysis specialist. Extract clinical entities
from the following medical document text.

## Rules
1. Extract ALL conditions, procedures, medications, lab results, and vitals
2. Include laterality (left/right/bilateral) when mentioned
3. Include dates when available
4. Include severity/stage when mentioned
5. Do NOT infer information that is not explicitly stated
6. Return results in the specified JSON schema

## Output Schema
{output_schema}

## Document Text
{document_text}
"""

Prompt Versioning Discipline

Never edit a production prompt in place. Always create a new version, test it against the evaluation dataset, and then promote it. Langfuse tracks which version was used for every generation, enabling precise debugging.


Use Case Cost Analysis

1. Patient Conversation ($0.006-$0.010/case)

| Detail | Value |
|---|---|
| Model | Claude Haiku 4.5 |
| Average turns | 3-5 per intake session |
| Avg input tokens/turn | 500 (includes conversation history) |
| Avg output tokens/turn | 300 |
| Cost per turn | ~$0.002 |
| Total per case | ~$0.006 - $0.010 |

The intake conversation is among the most frequent LLM interactions and, at about $0.002 per turn, one of the cheapest per call. Conversation history is managed with a sliding window to keep input tokens bounded.
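The source does not say how the sliding window is sized. A minimal sketch, assuming a token budget and a rough four-characters-per-token estimate (both assumptions, as are the function names), keeps the most recent messages that fit:

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic (about 4 characters per token); swap in a real tokenizer if available."""
    return max(1, len(text) // 4)


def sliding_window(messages: list[dict], max_tokens: int) -> list[dict]:
    """Keep the most recent messages whose estimated tokens fit within max_tokens."""
    kept: list[dict] = []
    used = 0
    # Walk backwards from the newest message and stop at the first one that does not fit.
    for message in reversed(messages):
        cost = estimate_tokens(message["content"])
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # Restore chronological order for the LLM.
    return kept
```

Because the window is bounded by tokens and not by turn count, a single very long message cannot push the input past the budget.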

2. Clinical Context Extraction ($0.03/report)

| Detail | Value |
|---|---|
| Model | Claude Sonnet 4.6 |
| Average pages | 2-5 per document |
| Avg input tokens | 2,000-3,000 (OCR text + prompt) |
| Avg output tokens | 800-1,200 (structured JSON) |
| Cost per document | ~$0.015 - $0.030 |

This is the most expensive per-call task because it requires Sonnet-level accuracy for clinical data extraction. The cost is justified: incorrect extraction could lead to wrong provider matches.

3. Comorbidity Detection ($0/case)

| Detail | Value |
|---|---|
| Model | None (rule-based) |
| Method | Lookup table of 150+ comorbidity pairs |
| Input | Extracted condition codes |
| Output | Flagged comorbidity pairs with severity |
| Cost | $0 |

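from itertools import combinations

COMORBIDITY_PAIRS = {
    ("E11", "I10"): {"name": "Diabetes + Hypertension", "severity": "moderate"},
    ("E66", "G47.3"): {"name": "Obesity + Sleep Apnea", "severity": "moderate"},
    ("E11", "N18"): {"name": "Diabetes + CKD", "severity": "high"},
    ("I25", "E78"): {"name": "CAD + Hyperlipidemia", "severity": "moderate"},
    # ... 146 more pairs
}

# Order-independent index, so codes seen as (E78, I25) still find the ("I25", "E78") entry
_PAIR_INDEX = {frozenset(pair): info for pair, info in COMORBIDITY_PAIRS.items()}

def _prefixes(code: str) -> list[str]:
    """Category and subcategory prefixes of a code: G47.33 -> G47, G47.3, G47.33."""
    return [code[:n] for n in range(3, len(code) + 1) if code[n - 1] != "."]

def detect_comorbidities(condition_codes: list[str]) -> list[dict]:
    """Detect comorbidity pairs from extracted condition codes."""
    detected = []
    for code1, code2 in combinations(condition_codes, 2):
        # Try every prefix so table keys like "G47.3" match as well as 3-char categories
        for a in _prefixes(code1):
            for b in _prefixes(code2):
                match = _PAIR_INDEX.get(frozenset((a, b)))
                if match is not None and match not in detected:
                    detected.append(match)
    return detected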

Why Rule-Based?

Comorbidity detection doesn't need LLM reasoning -- it's a well-defined medical knowledge lookup. Using rules instead of an LLM call saves ~$0.005 per case and eliminates hallucination risk for this critical safety check.

4. Match Explanations ($0.003-$0.005/case)

| Detail | Value |
|---|---|
| Model | Claude Haiku 4.5 |
| Explanations per match | 3-5 (top providers) |
| Avg input tokens | 800 (match data + prompt) |
| Avg output tokens | 600 (structured explanation) |
| Cost per explanation | ~$0.001 |
| Total per case | ~$0.003 - $0.005 |

Match explanations are generated lazily -- only for the top-ranked providers that are actually shown to the patient. If a patient doesn't scroll past the top 3, explanations for providers 4-5 are never generated.
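Lazy generation of this kind can be sketched as a small wrapper that calls the generator only on first access. `LazyExplanations` and its `generate` callback are hypothetical names for illustration; the real service presumably makes an async LLM call where this sketch takes a plain function.

```python
from typing import Callable


class LazyExplanations:
    """Generate each provider's explanation on first access and reuse it afterwards."""

    def __init__(self, generate: Callable[[str], str]):
        self._generate = generate  # Stand-in for the LLM-backed explanation call.
        self._cache: dict[str, str] = {}

    def get(self, provider_id: str) -> str:
        """Return the explanation, generating it only if it was never requested before."""
        if provider_id not in self._cache:
            self._cache[provider_id] = self._generate(provider_id)
        return self._cache[provider_id]

    @property
    def generated_count(self) -> int:
        """Number of explanations actually generated (and therefore paid for)."""
        return len(self._cache)
```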


Latency Optimizations (Session 31)

Response Streaming

generate_response_streaming() in the chat service uses llm.astream() to push individual tokens to a Redis list as they arrive from the LLM. The frontend consumes them via an SSE endpoint.

Architecture:

llm.astream() → token → Redis RPUSH case:{case_id} → SSE poll (100ms) → browser render

SSE endpoint: GET /cases/{id}/chat/stream

  • Polls the case:{case_id} Redis list at 0.1s intervals
  • Endpoint timeout: 60s (single response window, not long-lived)
  • Event types:
      • token: individual LLM output token
      • stream_end: response complete, close connection
      • error / timeout: terminal events
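The polling loop behind such an endpoint might look like the sketch below. It is not the actual handler: `read_from(offset)` stands in for a Redis LRANGE from `offset` to the end of the list, and storing each entry as an (event, data) pair is an assumption about the payload format.

```python
import asyncio
import json
from typing import AsyncIterator, Callable

TERMINAL_EVENTS = {"stream_end", "error", "timeout"}


def format_sse(event: str, data: str) -> str:
    """Encode one Server-Sent Events frame: an event name plus a JSON-encoded data line."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


async def stream_events(
    read_from: Callable[[int], list],
    poll_interval: float = 0.1,
    timeout: float = 60.0,
) -> AsyncIterator[str]:
    """Poll a list-like source for new (event, data) items and yield SSE frames."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    offset = 0  # Index of the next unread list entry.
    while loop.time() < deadline:
        for event, data in read_from(offset):
            offset += 1
            yield format_sse(event, data)
            if event in TERMINAL_EVENTS:
                return  # Close the stream after a terminal event.
        await asyncio.sleep(poll_interval)
    # The response window elapsed without a terminal event.
    yield format_sse("timeout", "")
```

Tracking an offset means each poll returns only tokens the browser has not yet seen, so a slow poll never drops or repeats output.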

Fallback: When the enable_response_streaming flag is off, the chat service falls back to llm.ainvoke() and returns the full response in a single HTTP response. This is the safe default for tenants where streaming is not yet tested.

Performance: Sub-second perceived time-to-first-token (TTFT). The patient sees text appearing within ~200-400ms of the LLM starting generation, compared to waiting 3-8s for the full response.

Chat Pipeline Caching

The chat pipeline caches two expensive data fetches using a cache-aside pattern in Redis.

| Cache Layer | Redis Key | TTL | What It Caches |
|---|---|---|---|
| Patient state | chat:state:{tenant_id}:{case_id} | 60s | EHR snapshot, case metadata, document status |
| Conversation context | chat:conv:{tenant_id}:{case_id} | 120s | Last N messages for LLM context window |

Invalidation triggers (cache is deleted on any of these events):

  • New chat message sent or received
  • Document uploaded or re-processed
  • FHIR resource written or updated
  • EHR snapshot rebuilt

Flag: enable_chat_cache (Flagsmith). When disabled, every chat turn fetches fresh from the database.

Non-fatal: If Redis is down or returns an error, the exception is swallowed, the lookup is treated as a cache miss, and the service falls back to a fresh DB fetch. Cache failures never block the chat pipeline.
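A cache-aside read with non-fatal failures could be sketched as follows. `get_with_cache` is a hypothetical helper, written synchronously for brevity; `cache` is any object exposing `get(key)` and `setex(key, ttl, value)` in the style of a redis-py client, and JSON serialization of the cached value is an assumption.

```python
import json
import logging
from typing import Any, Callable

logger = logging.getLogger(__name__)


def get_with_cache(cache: Any, key: str, ttl_seconds: int, fetch: Callable[[], dict]) -> dict:
    """Cache-aside read: try the cache, fall back to fetch(), never raise on cache errors."""
    try:
        cached = cache.get(key)
        if cached is not None:
            return json.loads(cached)
    except Exception as exc:  # Any cache trouble is treated as a miss.
        logger.warning("cache read failed for %s: %s", key, exc)

    value = fetch()  # Fresh read from the database.
    try:
        cache.setex(key, ttl_seconds, json.dumps(value))
    except Exception as exc:  # A failed write must not fail the request.
        logger.warning("cache write failed for %s: %s", key, exc)
    return value
```

A call for the patient-state layer would pass a key such as `chat:state:{tenant_id}:{case_id}` with a TTL of 60; invalidation on the listed events would be a plain delete of that key.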

Prompt Compression

System prompts were rewritten from v1_original to v2_compressed variants, achieving 36-73% token reduction across all agent prompts.

Method: Removed redundant few-shot examples, condensed multi-paragraph instructions into bullet lists, eliminated repeated preamble across prompt sections.

Flag: prompt_version (Flagsmith) — values: v1_original | v2_compressed. Allows instant rollback if compressed prompts degrade output quality.

Parallel + Deferred Pipeline

The chat pipeline splits work into three timing tiers:

Parallel (before response): Input classifier and patient context fetch run concurrently via asyncio.gather(). This removes a serial ~500ms context-fetch wait from every chat turn.

  • Flag: enable_parallel_pipeline

Synchronous (response generation): The orchestrator and LLM conversation call run sequentially — these produce the user-visible response.

Deferred (after response starts streaming): Chat extractor (entity extraction from the conversation) and requirement matcher run as background tasks via asyncio.create_task(). These enrich the EHR and update document matching but do not block the user-visible response.

  • Flag: enable_deferred_extraction

Connection Pooling

LLM clients: ChatAnthropic and ChatOpenAI are now instantiated once at module load (singletons) rather than per-call. Creating a new client per call added ~50-200ms of TCP/TLS connection overhead. The singleton is reset between tests via importlib.reload.

SQLAlchemy: Database connection pool configured with pool_recycle=1800 (recycle connections after 30 minutes) and pool_timeout=30 (wait up to 30s for a connection from the pool).

Overall Latency Impact

| Metric | Before | After | Delta |
|---|---|---|---|
| Average chat response | 13.2s | 7.9s | -40% |

Contributors: LLM client singletons, parallel classifier+context, deferred extraction, prompt compression, chat pipeline caching, response streaming (perceived).


Observability and Cost Tracking

Langfuse Integration

Every LLM call is traced in Langfuse through the observe decorator:

from langfuse import observe  # SDK v3; in SDK v2 the import is langfuse.decorators.observe

@observe(name="clinical_extraction")
async def extract_clinical_entities(text: str, tenant_id: str):
    """Extract clinical entities with full Langfuse tracing."""
    prompt = await get_prompt("clinical_entity_extraction")

    # Go through the router so A/B splits, fallbacks, and usage tracking apply
    response = await route_llm_call(
        task="clinical_extraction",
        messages=[
            {"role": "system", "content": prompt},
            {"role": "user", "content": text},
        ],
        tenant_id=tenant_id,
    )

    # Langfuse automatically tracks: model, tokens, cost, latency, I/O
    return parse_extraction_response(response)

Cost Dashboard Metrics

| Metric | Tracked In | Alert Threshold |
|---|---|---|
| Daily LLM spend | Langfuse | > $5/day (MVP) |
| Cost per patient journey | Langfuse + PostHog | > $0.50 |
| Fallback rate | Events table | > 10% of calls |
| Average latency (Haiku) | Langfuse | > 3 seconds |
| Average latency (Sonnet) | Langfuse | > 10 seconds |
| Token efficiency | Langfuse | > 5,000 input tokens/call |
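The thresholds in this table translate directly into a check that a scheduled job could run against aggregated metrics. The metric key names and the `breached` helper below are assumptions for illustration; only the limits come from the table.

```python
# Alert limits from the dashboard table; key names are illustrative.
THRESHOLDS = {
    "daily_llm_spend_usd": 5.0,
    "cost_per_journey_usd": 0.50,
    "fallback_rate": 0.10,
    "avg_latency_haiku_s": 3.0,
    "avg_latency_sonnet_s": 10.0,
    "input_tokens_per_call": 5000,
}


def breached(metrics: dict[str, float]) -> list[str]:
    """Return the names of metrics that exceed their alert threshold."""
    # Metrics missing from the input are treated as 0 and never alert.
    return [name for name, limit in THRESHOLDS.items() if metrics.get(name, 0) > limit]
```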

Monthly Cost Breakdown (MVP)

+---------------------------+--------+--------+---------+
| Category                  | Calls  | Avg $  | Total $ |
+---------------------------+--------+--------+---------+
| Clinical Extraction       |    100 | $0.025 |   $2.50 |
| Patient Conversations     |    300 | $0.003 |   $0.90 |
| Match Explanations        |    250 | $0.001 |   $0.25 |
| Document Reranking        |     50 | $0.005 |   $0.25 |
| Vision OCR (fallback)     |     10 | $0.040 |   $0.40 |
| Intent Classification     |    300 | $0.001 |   $0.30 |
+---------------------------+--------+--------+---------+
| TOTAL                     |  1,010 |        |   $4.60 |
+---------------------------+--------+--------+---------+

Post-Seed Plans

MedGemma 4B Evaluation

After seed funding, the team plans to evaluate Google's MedGemma 4B model as a cost-reduction option for clinical tasks:

| Consideration | Details |
|---|---|
| Model | MedGemma 4B (open-weight, medical-specialized) |
| Deployment | Self-hosted on GPU instance |
| Target tasks | Medical code mapping, comorbidity enhancement |
| Expected savings | 60-80% cost reduction for targeted tasks |
| Risk | Lower general reasoning vs. Claude, requires medical validation |
| Evaluation plan | Shadow mode on 500 cases, compare extraction accuracy |

BioMistral Shadow Mode

BioMistral is another candidate for self-hosted medical NLP:

| Consideration | Details |
|---|---|
| Model | BioMistral 7B |
| Deployment | Self-hosted on GPU instance |
| Target tasks | Entity extraction from structured lab reports |
| Expected savings | 70-90% cost reduction for lab report parsing |
| Risk | Narrower training data, may miss edge cases |
| Evaluation plan | Shadow mode alongside Claude Sonnet, compare F1 scores |

graph TD
    A[Current: Cloud LLMs Only] --> B[Phase 1: Shadow Testing]
    B --> C{Accuracy >= 90%?}
    C -->|Yes| D[Phase 2: A/B Split 20%]
    D --> E{Cost Savings Confirmed?}
    E -->|Yes| F[Phase 3: Promote to Primary]
    C -->|No| G[Continue Cloud LLMs]
    E -->|No| G

    style A fill:#008B8B,color:#fff
    style D fill:#FF7F50,color:#fff
    style F fill:#4A90D9,color:#fff

Medical AI Model Validation

Any model used for clinical tasks must pass a validation suite of 200+ annotated medical documents before being promoted from shadow mode. Patient safety is non-negotiable -- cost savings never justify reduced clinical accuracy.


Configuration Checklist

When adding a new LLM-powered feature:

  1. Add the model to config/model_registry.yaml with routing and fallback
  2. Create a versioned prompt in Langfuse
  3. Implement the call using route_llm_call() with the task identifier
  4. Add a Flagsmith feature flag to enable/disable the feature
  5. Add cost tracking assertions in the test suite
  6. Add latency and fallback rate alerts in the monitoring dashboard
  7. Document the expected cost per call in this file