LLM Routing¶

Overview¶

Curaway uses a tiered model selection strategy to balance cost, latency, and accuracy across its AI-powered features. Every LLM call is routed through a centralized model registry that supports A/B testing, fallback chains, and runtime configuration via feature flags.

The guiding principle: use the cheapest model that meets the quality threshold for each task.

Tiered Model Selection¶

Model Tiers¶

graph TD
    Request[Incoming LLM Request] --> Router[Model Router]
    Router --> Tier1{Task Complexity}
    Tier1 -->|Simple: 80% of calls| Haiku[Claude Haiku 4.5]
    Tier1 -->|Complex: 20% of calls| Sonnet[Claude Sonnet 4.6]
    Tier1 -->|Classification: bulk| Mini[GPT-4o mini]

    Haiku --> Langfuse[Langfuse Tracking]
    Sonnet --> Langfuse
    Mini --> Langfuse

    style Router fill:#008B8B,color:#fff
    style Haiku fill:#4A90D9,color:#fff
    style Sonnet fill:#FF7F50,color:#fff
    style Mini fill:#6B7280,color:#fff

Tier	Model	Provider	Use Cases	% of Calls	Avg Cost/Call
Economy	Claude Haiku 4.5	Anthropic	Conversations, intake, explanations, orchestration	~80%	$0.003
Premium	Claude Sonnet 4.6	Anthropic	Clinical extraction, complex reasoning, Vision OCR fallback	~20%	$0.015
Bulk	GPT-4o mini	OpenAI	High-volume classification, intent detection	As needed	$0.001

Why Not a Single Model?

Routing every call to Claude Sonnet 4.6 would raise the cost of economy-tier calls roughly 5x (average $0.015 vs. $0.003 per call in the table above). For conversational intake and template-based explanations, Haiku 4.5 provides comparable quality at a fraction of the cost. The premium tier is reserved for tasks where accuracy directly impacts patient safety.

Cost Per Patient Journey¶

A typical patient journey involves approximately 6 agent calls:

Step	Agent	Model	Input Tokens	Output Tokens	Cost
1. Document OCR fallback	Clinical Context	Sonnet 4.6	~3,000	~1,500	$0.030
2. Entity extraction	Clinical Context	Sonnet 4.6	~2,000	~800	$0.018
3. Code mapping	Clinical Context	Haiku 4.5	~500	~300	$0.002
4. Intake conversation (3 turns)	Intake	Haiku 4.5	~1,500	~900	$0.006
5. Match reranking	Match	Haiku 4.5	~1,000	~500	$0.003
6. Explanation generation	Explanation	Haiku 4.5	~800	~600	$0.003
Total			~8,800	~4,600	$0.062

Cost range per patient journey: $0.07 - $0.50
  - Lower end: Text-based PDFs (no Vision OCR), simple conditions
  - Upper end: Scanned documents (Vision OCR), complex multi-condition cases
  - Average: ~$0.15
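The per-step costs in the table above can be reproduced from token counts and the per-1k-token rates listed in config/model_registry.yaml. The following is a minimal sketch of that arithmetic; the `Rate` class and `estimate_call_cost` helper are hypothetical names used only for illustration, not part of the codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rate:
    """USD per 1,000 tokens, as listed in config/model_registry.yaml."""
    input_per_1k: float
    output_per_1k: float


# Rates copied from the registry; the dict itself is an illustrative stand-in.
RATES = {
    "claude-haiku-4.5": Rate(0.001, 0.005),
    "claude-sonnet-4.6": Rate(0.003, 0.015),
    "gpt-4o-mini": Rate(0.00015, 0.0006),
}


def estimate_call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one call from its token counts."""
    rate = RATES[model]
    return (
        (input_tokens / 1000) * rate.input_per_1k
        + (output_tokens / 1000) * rate.output_per_1k
    )


# Step 2 in the table: entity extraction on Sonnet 4.6 (~2,000 in, ~800 out).
entity_extraction_cost = estimate_call_cost("claude-sonnet-4.6", 2000, 800)

# Step 4 in the table: three intake turns on Haiku 4.5 (~1,500 in, ~900 out).
intake_cost = estimate_call_cost("claude-haiku-4.5", 1500, 900)
```

Under these rates, entity extraction comes to about $0.018 and the intake conversation to about $0.006, matching the table rows.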

Monthly Cost Projections¶

Scale	Patients/Month	Avg Cost/Patient	Monthly LLM Cost
MVP	50	$0.15	$7.50
Early	500	$0.12	$60
Growth	5,000	$0.10	$500
Scale	50,000	$0.08	$4,000

Cost Reduction at Scale

Per-patient cost decreases at scale due to: (1) prompt caching on repeated patterns, (2) more documents processed by PyMuPDF (fewer Vision OCR calls), and (3) model improvements reducing token counts.

Model Registry¶

Configuration File¶

All model routing is configured through config/model_registry.yaml:

# config/model_registry.yaml
models:
  claude-haiku-4.5:
    provider: anthropic
    model_id: claude-haiku-4-5-20250514
    max_tokens: 4096
    temperature: 0.3
    tier: economy
    cost_per_1k_input: 0.001
    cost_per_1k_output: 0.005
    rate_limit_rpm: 1000
    timeout_seconds: 30

  claude-sonnet-4.6:
    provider: anthropic
    model_id: claude-sonnet-4-6-20250514
    max_tokens: 8192
    temperature: 0.1
    tier: premium
    cost_per_1k_input: 0.003
    cost_per_1k_output: 0.015
    rate_limit_rpm: 500
    timeout_seconds: 60

  gpt-4o-mini:
    provider: openai
    model_id: gpt-4o-mini
    max_tokens: 4096
    temperature: 0.2
    tier: bulk
    cost_per_1k_input: 0.00015
    cost_per_1k_output: 0.0006
    rate_limit_rpm: 2000
    timeout_seconds: 15

# Task-to-model routing
routing:
  clinical_extraction:
    primary: claude-sonnet-4.6
    fallback: claude-haiku-4.5
    ab_split:
      claude-sonnet-4.6: 100
      claude-haiku-4.5: 0

  patient_conversation:
    primary: claude-haiku-4.5
    fallback: gpt-4o-mini
    ab_split:
      claude-haiku-4.5: 100

  intent_classification:
    primary: gpt-4o-mini
    fallback: claude-haiku-4.5
    ab_split:
      gpt-4o-mini: 80
      claude-haiku-4.5: 20

  match_explanation:
    primary: claude-haiku-4.5
    fallback: null  # Template fallback, no LLM needed
    ab_split:
      claude-haiku-4.5: 100

  document_reranking:
    primary: claude-haiku-4.5
    fallback: null  # Cosine similarity only
    ab_split:
      claude-haiku-4.5: 100

  vision_ocr:
    primary: claude-sonnet-4.6
    fallback: null  # Only used when other OCR methods fail
    ab_split:
      claude-sonnet-4.6: 100
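A registry like the one above is easy to get subtly wrong (a typo in a model name, weights that do not add up). The sketch below shows one way such a file could be checked after parsing. It assumes the YAML has already been loaded into plain dictionaries and that `ab_split` weights are expected to sum to 100, which holds for every entry shown here but is not stated as a rule; `validate_routing` is a hypothetical helper.

```python
def validate_routing(models: set[str], routing: dict[str, dict]) -> list[str]:
    """Return a list of human-readable problems found in the routing section."""
    errors = []
    for task, entry in routing.items():
        split = entry.get("ab_split", {})
        referenced = {entry["primary"], *split}
        if entry.get("fallback"):  # `fallback: null` means no LLM fallback
            referenced.add(entry["fallback"])
        for name in sorted(referenced - models):
            errors.append(f"{task}: unknown model '{name}'")
        total = sum(split.values())
        if total != 100:  # Assumption: weights are percentages summing to 100
            errors.append(f"{task}: ab_split sums to {total}, expected 100")
    return errors


KNOWN_MODELS = {"claude-haiku-4.5", "claude-sonnet-4.6", "gpt-4o-mini"}

ROUTING = {
    "intent_classification": {
        "primary": "gpt-4o-mini",
        "fallback": "claude-haiku-4.5",
        "ab_split": {"gpt-4o-mini": 80, "claude-haiku-4.5": 20},
    },
    "vision_ocr": {
        "primary": "claude-sonnet-4.6",
        "fallback": None,
        "ab_split": {"claude-sonnet-4.6": 100},
    },
}

problems = validate_routing(KNOWN_MODELS, ROUTING)
```

Running a check like this at startup or in CI would surface configuration mistakes before they reach the router.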

A/B Split Percentages¶

The ab_split configuration enables gradual model migration by sending a fixed percentage of requests to each model. For example, to compare Claude Haiku 4.5 against GPT-4o mini for intent classification:

intent_classification:
  primary: gpt-4o-mini
  ab_split:
    gpt-4o-mini: 80    # 80% of requests use GPT-4o mini
    claude-haiku-4.5: 20  # 20% use Haiku for comparison

Results are tracked in PostHog with the model_ab_test event, comparing accuracy, latency, and cost between the two groups.
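The split itself can be implemented as a weighted random choice over the configured percentages. This is a minimal sketch under that assumption; `pick_model` is a hypothetical stand-in and may differ from how the real `select_model_ab` assigns requests (for example, it does not keep a given patient on the same model across calls).

```python
import random
from typing import Optional


def pick_model(ab_split: dict[str, int], rng: Optional[random.Random] = None) -> str:
    """Pick a model name with probability proportional to its ab_split weight."""
    if rng is None:
        rng = random.Random()
    # Models with weight 0 are configured but must never receive traffic.
    candidates = [name for name, weight in ab_split.items() if weight > 0]
    if not candidates:
        raise ValueError("ab_split has no model with a positive weight")
    weights = [ab_split[name] for name in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


INTENT_SPLIT = {"gpt-4o-mini": 80, "claude-haiku-4.5": 20}

# Seeded generator so the simulated traffic share is reproducible.
rng = random.Random(42)
draws = [pick_model(INTENT_SPLIT, rng) for _ in range(10_000)]
mini_share = draws.count("gpt-4o-mini") / len(draws)
```

Over many requests the observed share should settle near the configured 80/20 split.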

Fallback Chains¶

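Most routing entries define a fallback model. If the primary model fails (timeout, rate limit, API error), the system automatically retries with the fallback. Entries with fallback: null have no LLM fallback; those tasks rely on a non-LLM path (a template or cosine similarity) or surface the error: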

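import logging

logger = logging.getLogger(__name__)

async def route_llm_call(
    task: str,
    messages: list[dict],
    tenant_id: str,
) -> LLMResponse:
    """Route an LLM call based on task type and registry config."""
    config = model_registry.get_routing(task)
    model = select_model_ab(config)  # Pick a model according to the ab_split weights

    try:
        response = await call_model(model, messages, config)
        await track_usage(task, model, response, tenant_id)
        return response
    except (TimeoutError, RateLimitError, APIError) as primary_error:
        logger.warning("Primary model %s failed for %s: %s", model, task, primary_error)

        # Tasks configured with `fallback: null` have no LLM fallback.
        if not config.fallback:
            raise LLMRoutingError(f"No fallback model configured for task {task}") from primary_error

        try:
            fallback_response = await call_model(config.fallback, messages, config)
        except (TimeoutError, RateLimitError, APIError) as fallback_error:
            # Without this handler a failing fallback would leak a raw provider error.
            raise LLMRoutingError(f"All models failed for task {task}") from fallback_error

        await track_usage(task, config.fallback, fallback_response, tenant_id, fallback=True)
        return fallback_response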

Langfuse Prompt Management¶

Versioned Prompts¶

All system prompts are stored and versioned in Langfuse, not hardcoded in the application:

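from typing import Optional

from langfuse import Langfuse

langfuse = Langfuse()  # Reads credentials from the LANGFUSE_* environment variables

async def get_prompt(name: str, version: Optional[int] = None) -> str:
    """Fetch a versioned prompt from Langfuse."""
    # The client call is synchronous; repeated lookups are served from its cache.
    prompt = langfuse.get_prompt(
        name=name,
        version=version,        # None = the version labeled "production"
        cache_ttl_seconds=300,  # Cache for 5 minutes
    )
    return prompt.compile()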

Prompt Inventory¶

Prompt Name	Current Version	Model	Description
`clinical_entity_extraction`	v3	Sonnet 4.6	Extract conditions, procedures, labs from OCR text
`medical_code_mapping`	v2	Haiku 4.5	Map entities to ICD-10, CPT, LOINC codes
`fhir_resource_generation`	v2	Haiku 4.5	Generate FHIR R4 JSON from coded entities
`intake_conversation`	v4	Haiku 4.5	Conversational preference collection
`intent_classification`	v2	GPT-4o mini	Classify patient message intent
`match_explanation`	v3	Haiku 4.5	Generate natural-language match explanations
`document_reranker`	v1	Haiku 4.5	Verify document-to-requirement matches
`vision_ocr_extraction`	v2	Sonnet 4.6	Extract text from scanned document images
`comorbidity_summary`	v1	Haiku 4.5	Summarize detected comorbidities for patient

System Prompt Storage¶

System prompts follow a consistent structure:

# Example: clinical_entity_extraction prompt (v3)
"""
You are a medical document analysis specialist. Extract clinical entities
from the following medical document text.

## Rules
1. Extract ALL conditions, procedures, medications, lab results, and vitals
2. Include laterality (left/right/bilateral) when mentioned
3. Include dates when available
4. Include severity/stage when mentioned
5. Do NOT infer information that is not explicitly stated
6. Return results in the specified JSON schema

## Output Schema
{output_schema}

## Document Text
{document_text}
"""

Prompt Versioning Discipline

Never edit a production prompt in place. Always create a new version, test it against the evaluation dataset, and then promote it. Langfuse tracks which version was used for every generation, enabling precise debugging.

Use Case Cost Analysis¶

1. Patient Conversation (~$0.01/case)¶

Detail	Value
Model	Claude Haiku 4.5
Average turns	3-5 per intake session
Avg input tokens/turn	500 (includes conversation history)
Avg output tokens/turn	300
Cost per turn	~$0.002
Total per case	~$0.006 - $0.010

The intake conversation is the most frequent LLM interaction and one of the cheapest per call. Conversation history is managed with a sliding window to keep input tokens bounded.
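A sliding window over the conversation history can be as simple as keeping the system prompt plus the most recent messages. The sketch below illustrates the idea; the window size of 6 and the `sliding_window` helper are assumptions for illustration, since the actual window size is not specified here.

```python
def sliding_window(messages: list[dict], max_messages: int = 6) -> list[dict]:
    """Keep all system messages plus the most recent non-system messages."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    # Guard against turns[-0:], which would return the whole list.
    recent = turns[-max_messages:] if max_messages > 0 else []
    return system + recent


history = [
    {"role": "system", "content": "intake prompt"},
    {"role": "user", "content": "u1"},
    {"role": "assistant", "content": "a1"},
    {"role": "user", "content": "u2"},
    {"role": "assistant", "content": "a2"},
    {"role": "user", "content": "u3"},
]

window = sliding_window(history, max_messages=4)
```

With `max_messages=4`, the oldest user turn is dropped while the system prompt is always retained, so input size stays bounded as the session grows.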

2. Clinical Context Extraction ($0.03/report)¶

Detail	Value
Model	Claude Sonnet 4.6
Average pages	2-5 per document
Avg input tokens	2,000-3,000 (OCR text + prompt)
Avg output tokens	800-1,200 (structured JSON)
Cost per document	~$0.015 - $0.030

This is one of the most expensive per-call tasks (only the Vision OCR fallback costs more) because it requires Sonnet-level accuracy for clinical data extraction. The cost is justified: incorrect extraction could lead to wrong provider matches.

3. Comorbidity Detection ($0/case)¶

Detail	Value
Model	None (rule-based)
Method	Lookup table of 150+ comorbidity pairs
Input	Extracted condition codes
Output	Flagged comorbidity pairs with severity
Cost	$0

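COMORBIDITY_PAIRS = {
    ("E11", "I10"): {"name": "Diabetes + Hypertension", "severity": "moderate"},
    ("E66", "G47.3"): {"name": "Obesity + Sleep Apnea", "severity": "moderate"},
    ("E11", "N18"): {"name": "Diabetes + CKD", "severity": "high"},
    ("I25", "E78"): {"name": "CAD + Hyperlipidemia", "severity": "moderate"},
    # ... 146 more pairs
}

# Index by frozenset so lookups do not depend on the order of the two codes.
_PAIR_INDEX = {frozenset(pair): info for pair, info in COMORBIDITY_PAIRS.items()}

def _prefixes(code: str) -> set[str]:
    """All prefixes from the 3-char category up, e.g. G47.33 -> G47, G47., G47.3, G47.33."""
    return {code[:n] for n in range(3, len(code) + 1)}

def detect_comorbidities(condition_codes: list[str]) -> list[dict]:
    """Detect comorbidity pairs from extracted condition codes."""
    detected = []
    for i, code1 in enumerate(condition_codes):
        for code2 in condition_codes[i + 1:]:
            # Try every prefix so both category keys (E11) and subcategory keys (G47.3) match.
            for a in _prefixes(code1):
                for b in _prefixes(code2):
                    info = _PAIR_INDEX.get(frozenset((a, b)))
                    if info is not None and info not in detected:
                        detected.append(info)
    return detected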

Why Rule-Based?

Comorbidity detection doesn't need LLM reasoning -- it's a well-defined medical knowledge lookup. Using rules instead of an LLM call saves ~$0.005 per case and eliminates hallucination risk for this critical safety check.

4. Match Explanations ($0.005/case)¶

Detail	Value
Model	Claude Haiku 4.5
Explanations per match	3-5 (top providers)
Avg input tokens	800 (match data + prompt)
Avg output tokens	600 (structured explanation)
Cost per explanation	~$0.001
Total per case	~$0.003 - $0.005

Match explanations are generated lazily -- only for the top-ranked providers that are actually shown to the patient. If a patient doesn't scroll past the top 3, explanations for providers 4-5 are never generated.
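Lazy generation of this kind amounts to computing an explanation on first request and reusing it afterwards. The sketch below shows the pattern with a stub generator; `LazyExplanations` and the stub are hypothetical and stand in for the real Haiku-backed call.

```python
from typing import Callable


class LazyExplanations:
    """Generate an explanation only when a provider card is actually viewed."""

    def __init__(self, generate: Callable[[str], str]) -> None:
        self._generate = generate
        self._cache: dict[str, str] = {}
        self.llm_calls = 0  # Counts how many explanations were really generated

    def get(self, provider_id: str) -> str:
        if provider_id not in self._cache:
            self.llm_calls += 1
            self._cache[provider_id] = self._generate(provider_id)
        return self._cache[provider_id]


def stub_generate(provider_id: str) -> str:
    # Stand-in for the LLM call; the real implementation would call the router.
    return f"Why {provider_id} matches this case"


explanations = LazyExplanations(stub_generate)
ranked = ["prov-1", "prov-2", "prov-3", "prov-4", "prov-5"]

# The patient only views the top 3, then reopens the first card.
viewed = [explanations.get(p) for p in ranked[:3]]
explanations.get("prov-1")
```

In this run only three explanations are generated; providers 4 and 5 never trigger a call, and reopening a card reuses the stored text.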

Latency Optimizations (Session 31)¶

Response Streaming¶

generate_response_streaming() in the chat service uses llm.astream() to push individual tokens to a Redis list as they arrive from the LLM. The frontend consumes them via an SSE endpoint.

Architecture:

llm.astream() → token → Redis RPUSH case:{case_id} → SSE poll (100ms) → browser render

SSE endpoint: GET /cases/{id}/chat/stream

- Polls the case:{case_id} Redis list at 0.1s intervals
- Endpoint timeout: 60s (single response window, not long-lived)
- Event types:
  - token: individual LLM output token
  - stream_end: response complete, close connection
  - error / timeout: terminal events

Fallback: When enable_response_streaming flag is off, the chat service falls back to llm.ainvoke() and returns the full response in a single HTTP response. This is the safe default for tenants where streaming is not yet tested.

Performance: Sub-second perceived time-to-first-token (TTFT). The patient sees text appearing within ~200-400ms of the LLM starting generation, compared to waiting 3-8s for the full response.
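The polling loop behind the SSE endpoint can be sketched without Redis or a web framework. In the example below an in-memory Python list stands in for the Redis list, and the item shape (`{"type": ..., "data": ...}`) is an assumption; only the 0.1s poll interval, the 60s timeout, and the event names come from the description above.

```python
import asyncio
import json
from typing import AsyncIterator

POLL_INTERVAL_S = 0.1    # Poll interval described above
STREAM_TIMEOUT_S = 60.0  # Single response window


def sse_event(event: str, data: str) -> str:
    """Format one Server-Sent Events frame."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


async def stream_events(
    buffer: list[dict],
    poll_interval: float = POLL_INTERVAL_S,
    timeout: float = STREAM_TIMEOUT_S,
) -> AsyncIterator[str]:
    """Yield SSE frames for items appended to `buffer` until a terminal event."""
    cursor = 0
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    while loop.time() < deadline:
        # With Redis this would read the list from `cursor` onward (e.g. LRANGE).
        pending = buffer[cursor:]
        cursor += len(pending)
        for item in pending:
            yield sse_event(item["type"], item.get("data", ""))
            if item["type"] in ("stream_end", "error"):
                return
        await asyncio.sleep(poll_interval)
    yield sse_event("timeout", "")


async def collect(buffer: list[dict], timeout: float = 1.0) -> list[str]:
    return [frame async for frame in stream_events(buffer, 0.01, timeout)]


frames = asyncio.run(collect([
    {"type": "token", "data": "Hel"},
    {"type": "token", "data": "lo"},
    {"type": "stream_end"},
]))
```

Tracking a cursor rather than popping items keeps the read side idempotent, so a reconnecting client could resume from a known offset.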

Chat Pipeline Caching¶

The chat pipeline caches two expensive data fetches using a cache-aside pattern in Redis.

Cache Layer	Redis Key	TTL	What It Caches
Patient state	`chat:state:{tenant_id}:{case_id}`	60s	EHR snapshot, case metadata, document status
Conversation context	`chat:conv:{tenant_id}:{case_id}`	120s	Last N messages for LLM context window

Invalidation triggers (cache is deleted on any of these events):

- New chat message sent or received
- Document uploaded or re-processed
- FHIR resource written or updated
- EHR snapshot rebuilt

Flag: enable_chat_cache (Flagsmith). When disabled, every chat turn fetches fresh from the database.

Non-fatal: If Redis is down or returns an error, the exception is swallowed and treated as a cache miss, and the service falls back to a fresh DB fetch. Cache failures never block the chat pipeline.
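The cache-aside pattern with non-fatal failures can be sketched as follows. The two cache classes are in-memory stand-ins whose `get`/`set(..., ex=...)` methods mirror the shape of a Redis client; `get_with_cache` is a hypothetical helper, and a production version would likely catch the client's specific error type and log it rather than catching `Exception` silently.

```python
import json
from typing import Any, Callable


class DictCache:
    """In-memory stand-in for a Redis client (TTL is accepted but ignored)."""

    def __init__(self) -> None:
        self._data: dict[str, str] = {}

    def get(self, key: str):
        return self._data.get(key)

    def set(self, key: str, value: str, ex: int) -> None:
        self._data[key] = value


class BrokenCache:
    """Simulates Redis being down: every operation raises."""

    def get(self, key: str):
        raise ConnectionError("cache unavailable")

    def set(self, key: str, value: str, ex: int) -> None:
        raise ConnectionError("cache unavailable")


def get_with_cache(cache: Any, key: str, ttl_seconds: int, fetch: Callable[[], dict]) -> dict:
    """Cache-aside read: try the cache, fall back to `fetch`, then repopulate."""
    try:
        cached = cache.get(key)
        if cached is not None:
            return json.loads(cached)
    except Exception:
        pass  # Treat any cache error as a miss
    value = fetch()
    try:
        cache.set(key, json.dumps(value), ex=ttl_seconds)
    except Exception:
        pass  # Failing to populate the cache must not fail the request
    return value


db_reads = 0


def fetch_patient_state() -> dict:
    global db_reads
    db_reads += 1
    return {"case_id": "c1", "documents": 2}


key = "chat:state:tenant-a:c1"
healthy = DictCache()
first = get_with_cache(healthy, key, 60, fetch_patient_state)
second = get_with_cache(healthy, key, 60, fetch_patient_state)  # Served from cache
reads_with_cache = db_reads
degraded = get_with_cache(BrokenCache(), key, 60, fetch_patient_state)
```

With a healthy cache the database is read once for two lookups; with the broken cache the call still succeeds by reading from the database.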

Prompt Compression¶

System prompts were rewritten from v1_original to v2_compressed variants, achieving 36-73% token reduction across all agent prompts.

Method: Removed redundant few-shot examples, condensed multi-paragraph instructions into bullet lists, eliminated repeated preamble across prompt sections.

Flag: prompt_version (Flagsmith) — values: v1_original | v2_compressed. Allows instant rollback if compressed prompts degrade output quality.

Parallel + Deferred Pipeline¶

The chat pipeline splits work into three timing tiers:

Parallel (before response): Input classifier and patient context fetch run concurrently via asyncio.gather(). This removes a serial ~500ms context-fetch wait from every chat turn.

Flag: enable_parallel_pipeline

Synchronous (response generation): The orchestrator and LLM conversation call run sequentially — these produce the user-visible response.

Deferred (after response starts streaming): Chat extractor (entity extraction from the conversation) and requirement matcher run as background tasks via asyncio.create_task(). These enrich the EHR and update document matching but do not block the user-visible response.

Flag: enable_deferred_extraction
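The three timing tiers map onto standard asyncio primitives. The sketch below uses stub coroutines in place of the real classifier, context fetch, LLM call, and extractor (all names here are hypothetical) to show the shape: `asyncio.gather()` for the parallel tier, a plain `await` for the synchronous tier, and `asyncio.create_task()` for the deferred tier.

```python
import asyncio


async def classify_input(message: str) -> str:
    await asyncio.sleep(0.01)  # Stand-in for the input classifier
    return "question"


async def fetch_patient_context(case_id: str) -> dict:
    await asyncio.sleep(0.01)  # Stand-in for the context fetch
    return {"case_id": case_id}


async def generate_response(intent: str, context: dict, message: str) -> str:
    return f"[{intent}] reply for {context['case_id']}"


async def extract_entities(message: str, log: list[str]) -> None:
    await asyncio.sleep(0.01)  # Stand-in for deferred extraction
    log.append("extracted")


async def handle_chat_turn(
    case_id: str, message: str, background: set, log: list[str]
) -> str:
    # Parallel tier: classifier and context fetch run concurrently.
    intent, context = await asyncio.gather(
        classify_input(message), fetch_patient_context(case_id)
    )
    # Synchronous tier: produces the user-visible response.
    reply = await generate_response(intent, context, message)
    # Deferred tier: scheduled but not awaited, so it cannot delay the reply.
    task = asyncio.create_task(extract_entities(message, log))
    background.add(task)  # Hold a reference so the task is not garbage-collected
    task.add_done_callback(background.discard)
    return reply


async def demo():
    background: set = set()
    log: list[str] = []
    reply = await handle_chat_turn("c1", "hi", background, log)
    pending_at_reply = len(background)
    await asyncio.gather(*background)  # Let deferred work finish before exiting
    return reply, pending_at_reply, log


reply, pending_at_reply, log = asyncio.run(demo())
```

Keeping the task in a set matters: the event loop holds only weak references to tasks, so an unreferenced background task may be collected before it finishes.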

Connection Pooling¶

LLM clients: ChatAnthropic and ChatOpenAI are now instantiated once at module load (singletons) rather than per-call. Creating a new client per call added ~50-200ms of TCP/TLS connection overhead. The singleton is reset between tests via importlib.reload.

SQLAlchemy: Database connection pool configured with pool_recycle=1800 (recycle connections after 30 minutes) and pool_timeout=30 (wait up to 30s for a connection from the pool).

Overall Latency Impact¶

Metric	Before	After	Delta
Average chat response	13.2s	7.9s	-40%

Contributors: LLM client singletons, parallel classifier+context, deferred extraction, prompt compression, chat pipeline caching, response streaming (perceived).

Observability and Cost Tracking¶

Langfuse Integration¶

Every LLM call is traced in Langfuse using the observe decorator:

from langfuse import observe  # SDK v2: from langfuse.decorators import observe

@observe(name="clinical_extraction")
async def extract_clinical_entities(text: str, tenant_id: str):
    """Extract clinical entities with full Langfuse tracing."""
    prompt = await get_prompt("clinical_entity_extraction")

    # Go through the router so A/B splits, fallbacks, and usage tracking apply.
    response = await route_llm_call(
        task="clinical_extraction",
        messages=[
            {"role": "system", "content": prompt},
            {"role": "user", "content": text},
        ],
        tenant_id=tenant_id,
    )

    # The decorator records inputs, outputs, and latency for this trace; model,
    # token, and cost fields require the nested LLM call to be instrumented as a generation.
    return parse_extraction_response(response)

Cost Dashboard Metrics¶

Metric	Tracked In	Alert Threshold
Daily LLM spend	Langfuse	> $5/day (MVP)
Cost per patient journey	Langfuse + PostHog	> $0.50
Fallback rate	Events table	> 10% of calls
Average latency (Haiku)	Langfuse	> 3 seconds
Average latency (Sonnet)	Langfuse	> 10 seconds
Token efficiency	Langfuse	> 5,000 input tokens/call

Monthly Cost Breakdown (MVP)¶

+---------------------------+--------+--------+---------+
| Category                  | Calls  | Avg $  | Total $ |
+---------------------------+--------+--------+---------+
| Clinical Extraction       |    100 | $0.025 |   $2.50 |
| Patient Conversations     |    300 | $0.003 |   $0.90 |
| Match Explanations        |    250 | $0.001 |   $0.25 |
| Document Reranking        |     50 | $0.005 |   $0.25 |
| Vision OCR (fallback)     |     10 | $0.040 |   $0.40 |
| Intent Classification     |    300 | $0.001 |   $0.30 |
+---------------------------+--------+--------+---------+
| TOTAL                     |  1,010 |        |   $4.60 |
+---------------------------+--------+--------+---------+

Post-Seed Plans¶

MedGemma 4B Evaluation¶

After seed funding, the team plans to evaluate Google's MedGemma 4B model as a cost-reduction option for clinical tasks:

Consideration	Details
Model	MedGemma 4B (open-weight, medical-specialized)
Deployment	Self-hosted on GPU instance
Target tasks	Medical code mapping, comorbidity enhancement
Expected savings	60-80% cost reduction for targeted tasks
Risk	Lower general reasoning vs. Claude, requires medical validation
Evaluation plan	Shadow mode on 500 cases, compare extraction accuracy

BioMistral Shadow Mode¶

BioMistral is another candidate for self-hosted medical NLP:

Consideration	Details
Model	BioMistral 7B
Deployment	Self-hosted on GPU instance
Target tasks	Entity extraction from structured lab reports
Expected savings	70-90% cost reduction for lab report parsing
Risk	Narrower training data, may miss edge cases
Evaluation plan	Shadow mode alongside Claude Sonnet, compare F1 scores
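Comparing F1 scores in shadow mode means scoring each model's extracted entities against the same annotated reference. The following is a minimal sketch of that comparison on a single document, treating entities as a set of codes; the sample codes and the exact-match criterion are illustrative assumptions, not the evaluation protocol itself.

```python
def f1_score(predicted: set[str], expected: set[str]) -> float:
    """Entity-level F1 using exact set membership."""
    if not predicted and not expected:
        return 1.0  # Nothing to find and nothing found
    true_positives = len(predicted & expected)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(expected)
    return 2 * precision * recall / (precision + recall)


# Annotated reference for one lab report (illustrative codes).
expected = {"E11", "I10", "N18"}

# Outputs from the production model and the shadow candidate on the same input.
production_entities = {"E11", "I10", "N18", "E78"}  # One spurious entity
shadow_entities = {"E11", "I10"}                    # One missed entity

production_f1 = f1_score(production_entities, expected)
shadow_f1 = f1_score(shadow_entities, expected)
```

Here the production output scores about 0.857 (perfect recall, one false positive) and the shadow output scores 0.8 (perfect precision, one miss). In a real evaluation these per-document scores would be aggregated over the full annotated set.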

graph TD
    A[Current: Cloud LLMs Only] --> B[Phase 1: Shadow Testing]
    B --> C{Accuracy >= 90%?}
    C -->|Yes| D[Phase 2: A/B Split 20%]
    D --> E{Cost Savings Confirmed?}
    E -->|Yes| F[Phase 3: Promote to Primary]
    C -->|No| G[Continue Cloud LLMs]
    E -->|No| G

    style A fill:#008B8B,color:#fff
    style D fill:#FF7F50,color:#fff
    style F fill:#4A90D9,color:#fff

Medical AI Model Validation

Any model used for clinical tasks must pass a validation suite of 200+ annotated medical documents before being promoted from shadow mode. Patient safety is non-negotiable -- cost savings never justify reduced clinical accuracy.
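The promotion flow in the diagram, combined with the validation requirement above, can be expressed as a small state transition function. This is a sketch under the stated thresholds (accuracy of at least 90% and at least 200 validated documents); `Phase` and `next_phase` are hypothetical names, and the real gating process may involve additional checks.

```python
from enum import Enum


class Phase(Enum):
    CLOUD_ONLY = "cloud_only"
    SHADOW = "shadow"
    AB_SPLIT = "ab_split_20"
    PRIMARY = "primary"


MIN_ACCURACY = 0.90       # From the diagram: Accuracy >= 90%
MIN_VALIDATED_DOCS = 200  # From the validation requirement above


def next_phase(
    phase: Phase,
    accuracy: float = 0.0,
    validated_docs: int = 0,
    cost_savings_confirmed: bool = False,
) -> Phase:
    """Return the phase a candidate model moves to after its current evaluation."""
    if phase is Phase.CLOUD_ONLY:
        return Phase.SHADOW
    if phase is Phase.SHADOW:
        passed = accuracy >= MIN_ACCURACY and validated_docs >= MIN_VALIDATED_DOCS
        return Phase.AB_SPLIT if passed else Phase.CLOUD_ONLY
    if phase is Phase.AB_SPLIT:
        return Phase.PRIMARY if cost_savings_confirmed else Phase.CLOUD_ONLY
    return phase  # PRIMARY is terminal in this sketch


promoted = next_phase(Phase.SHADOW, accuracy=0.93, validated_docs=250)
rejected = next_phase(Phase.SHADOW, accuracy=0.95, validated_docs=120)
```

Note that a high accuracy score alone is not enough in this sketch: a candidate evaluated on too few annotated documents stays on cloud LLMs.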

Configuration Checklist¶

When adding a new LLM-powered feature:

- Add the model to config/model_registry.yaml with routing and fallback
- Create a versioned prompt in Langfuse
- Implement the call using route_llm_call() with the task identifier
- Add a Flagsmith feature flag to enable/disable the feature
- Add cost tracking assertions in the test suite
- Add latency and fallback rate alerts in the monitoring dashboard
- Document the expected cost per call in this file