LLM Evaluation for Curaway Use Cases

Moved from CLAUDE.md. This is the detailed cost/quality analysis for each LLM use case.


Use Case 1: Patient Conversation (Intake, Questions, Brand Voice)

~500 tokens in, ~300 tokens out per message, ~15 messages per case

| Model | Input ($/1M tokens) | Output ($/1M tokens) | Cost/Case | Quality | Latency | Decision |
| --- | --- | --- | --- | --- | --- | --- |
| Claude Haiku 4.5 | $0.80 | $4.00 | $0.03 | Natural, empathetic, follows brand voice | ~0.5s | Current choice |
| Claude Sonnet 4.6 | $3.00 | $15.00 | $0.10 | Best quality but overkill for conversation | ~1.5s | Overkill for intake |
| GPT-4o mini | $0.15 | $0.60 | $0.005 | Good but less nuanced brand voice | ~0.3s | Budget option |
| GPT-4o | $2.50 | $10.00 | $0.08 | Strong but different tone | ~1.0s | Alternative |
| Gemini 1.5 Flash | $0.075 | $0.30 | $0.003 | Fast/cheap, variable quality | ~0.2s | Exploration |

Decision: Claude Haiku 4.5, at about $3/month for 100 cases.
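
The Cost/Case column can be approximated from the per-message token estimates and the per-1M-token prices in the table. The following is a minimal sketch, not the project's actual calculation: the function name `cost_per_case` is hypothetical, and it assumes each message is billed only on its own tokens (no resent conversation history).

```python
def cost_per_case(input_price_per_m, output_price_per_m,
                  tokens_in=500, tokens_out=300, messages=15):
    """Estimate LLM cost in USD for one patient case.

    Assumes every message is billed on its own token counts; a real
    conversation that resends history on each turn would cost more.
    """
    per_message = (tokens_in * input_price_per_m
                   + tokens_out * output_price_per_m) / 1_000_000
    return per_message * messages


# Prices (USD per 1M tokens) copied from the table above.
haiku = cost_per_case(0.80, 4.00)
sonnet = cost_per_case(3.00, 15.00)

print(f"Haiku per case:  ${haiku:.3f}")
print(f"Sonnet per case: ${sonnet:.3f}")
print(f"Haiku, 100 cases/month: ${haiku * 100:.2f}")
```

Under these assumptions the estimate comes to about $0.024 per case for Haiku and $0.09 for Sonnet, which the table rounds up to $0.03 and $0.10.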


Use Case 2: Clinical Context Agent (Report Parsing → ICD Codes → FHIR)

~3,000 tokens in (report), ~2,000 tokens out (structured extraction), ~3 calls per report

| Model | Input ($/1M tokens) | Output ($/1M tokens) | Cost/Report | Clinical Accuracy | FHIR Compliance | Decision |
| --- | --- | --- | --- | --- | --- | --- |
| Claude Haiku 4.5 | $0.80 | $4.00 | $0.03 | Good general extraction | Mostly correct | Current (MVP) |
| Claude Sonnet 4.6 | $3.00 | $15.00 | $0.12 | Excellent clinical reasoning | Accurate FHIR | Recommended upgrade |
| Claude Opus 4.6 | $15.00 | $75.00 | $0.60 | Best but expensive | Gold standard | Overkill for MVP |
| GPT-4o | $2.50 | $10.00 | $0.08 | Strong extraction | Good FHIR | Alternative |
| GPT-4o mini | $0.15 | $0.60 | $0.005 | Misses nuances | Inconsistent | Not recommended |
| Med-PaLM 2 (Google) | ~$5.00 | ~$15.00 | $0.12 | USMLE-level reasoning | Not FHIR-native | Limited access |
| BioMistral 7B (self-hosted) | GPU ~$0.50/hr | n/a | ~$0.02 | Medical pre-training | Needs fine-tuning | Post-seed option |
| MedGemma 4B (self-hosted) | GPU ~$0.30/hr | n/a | ~$0.01 | Google medical SLM | Needs FHIR template | Post-seed option |
| PMC-LLaMA (self-hosted) | GPU ~$0.50/hr | n/a | ~$0.02 | PubMed pre-trained | Raw extraction only | Research option |

Decision (MVP): Claude Haiku 4.5. Recommended upgrade: Claude Sonnet 4.6 (+$18/mo at 200 reports/month). Post-seed: evaluate MedGemma 4B in shadow mode.
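
"Shadow mode" is read here as: run the candidate model on the same report as the production model, log both results for offline comparison, and never serve the candidate's output. The sketch below illustrates that pattern under those assumptions. `extract_with_shadow` and the two extractor callables are hypothetical stand-ins, not Curaway code.

```python
import logging
from typing import Callable

logger = logging.getLogger("shadow_eval")

Extractor = Callable[[str], set[str]]  # report text -> set of ICD codes


def extract_with_shadow(report: str,
                        production: Extractor,
                        candidate: Extractor) -> set[str]:
    """Return the production result; run the candidate only for comparison."""
    prod_codes = production(report)
    try:
        cand_codes = candidate(report)
        overlap = len(prod_codes & cand_codes)
        union = len(prod_codes | cand_codes)
        # Jaccard agreement between the two models (1.0 if both are empty).
        agreement = overlap / union if union else 1.0
        logger.info("shadow agreement=%.2f prod=%s cand=%s",
                    agreement, sorted(prod_codes), sorted(cand_codes))
    except Exception:
        # A failing shadow model must never affect the patient-facing path.
        logger.exception("shadow candidate failed")
    return prod_codes
```

Agreement between two models is not the same as accuracy. Comparing either model against clinician-labelled codes would still be needed before switching production traffic.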


Use Case 3: Comorbidity Detection from Lab Values

Rule-based (no LLM): app/agents/lab_analyzer.py

| Approach | Cost | Accuracy | Decision |
| --- | --- | --- | --- |
| Rule-based thresholds | $0 | Good for common conditions | Current |
| LLM-based interpretation | $0.02/report | Catches edge cases | Post-seed upgrade |
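
The contents of app/agents/lab_analyzer.py are not shown in this document, so the following is only a sketch of what a rule-based threshold approach generally looks like. The `LabRule` type, the rule list, and the two cut-offs (HbA1c ≥ 6.5%, eGFR < 60) are illustrative assumptions based on widely used screening values. They are not necessarily the rules Curaway applies, and they are not clinical guidance.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class LabRule:
    analyte: str                       # key in the lab-values dict
    triggers: Callable[[float], bool]  # True when the value suggests the condition
    condition: str                     # label reported on a hit


# Illustrative rules only; units are assumed to be % and mL/min/1.73 m^2.
RULES = [
    LabRule("hba1c", lambda v: v >= 6.5, "possible diabetes"),
    LabRule("egfr", lambda v: v < 60, "possible chronic kidney disease"),
]


def detect_comorbidities(labs: dict[str, float]) -> list[str]:
    """Return labels for every rule whose analyte is present and triggers."""
    return [r.condition for r in RULES
            if r.analyte in labs and r.triggers(labs[r.analyte])]


print(detect_comorbidities({"hba1c": 7.1, "egfr": 85}))
```

Rules of this kind are deterministic and free to run, which matches the $0 cost in the table. The trade-off is that values outside the encoded rules go unnoticed.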

Use Case 4: Match Explanations

| Model | Cost/Match Run | Quality | Decision |
| --- | --- | --- | --- |
| Template-based | $0 | Formulaic | Current (MVP) |
| Claude Haiku | $0.01 | Natural | Feature-flagged, available |
| Claude Sonnet | $0.04 | Compelling | Post-seed |

Use Case 5: Embeddings

| Model | $/1M tokens | Dimensions | Quality | Decision |
| --- | --- | --- | --- | --- |
| Voyage AI 3.5-lite | Free (50M/mo) | 1024 | General purpose | Current |
| OpenAI text-embedding-3-small | $0.02 | 1536 | General purpose | Fallback configured |
| BiomedCLIP (self-hosted) | GPU cost | 512 | Medical domain | Post-seed |
| BGE-M3 (self-hosted) | GPU cost | 1024 | Multilingual | Post-seed |

Use Case 6: Voice Transcription

| Model | Price | Medical Accuracy | Decision |
| --- | --- | --- | --- |
| Web Speech API | Free (browser) | Poor for medical terms | Current (MVP) |
| OpenAI Whisper | $0.006/min | Good with medical prompt | Feature-flagged, ready |
| Deepgram Nova-2 Medical | $0.0043/min | Medical vocabulary | Post-seed |

Use Case 7: Pre-Operative Risk Assessment

| Approach | Cost | Latency | Accuracy | Deterministic | Decision |
| --- | --- | --- | --- | --- | --- |
| Rule-based (risk_assessor.py) | $0 | <5ms | Good for common pre-op risks | Yes | Current |
| LLM shadow (Claude Sonnet) | ~$0.04/rebuild | ~1–2s | Catches edge cases | No | Post-Series A |
| Standalone LLM Risk Agent | ~$0.04–$0.20/rebuild | ~1–2s | Best + free-text reasoning | No | After shadow validation |

Full rationale in ADR-0013.


Monthly Cost Projections (100 cases, 200 reports)

| Configuration | Conversation | Clinical | Embeddings | Voice | Total |
| --- | --- | --- | --- | --- | --- |
| Current (all Haiku) | $3 | $6 | $0 | $0 | $9/mo |
| Recommended (Haiku + Sonnet) | $3 | $24 | $0 | $0 | $27/mo |
| Premium (Sonnet everywhere) | $10 | $24 | $0 | $0.15 | $34/mo |
| Budget (GPT-4o mini) | $0.50 | $1 | $0 | $0 | $1.50/mo |
| Max accuracy (Opus clinical) | $3 | $120 | $0 | $0 | $123/mo |
| Post-seed hybrid (Haiku + MedGemma) | $3 | $2 (GPU) | $0 | $0 | $5/mo + GPU |
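
Each total in the table is the sum of the per-area costs, where Conversation is 100 cases × Cost/Case and Clinical is 200 reports × Cost/Report. The sketch below reproduces that arithmetic from the rounded per-unit figures in the Use Case 1 and Use Case 2 tables. The function `monthly_total` and the dictionary names are hypothetical.

```python
CASES_PER_MONTH = 100
REPORTS_PER_MONTH = 200

# Per-unit costs (USD) as rounded in the Use Case 1 and Use Case 2 tables.
COST_PER_CASE = {"haiku": 0.03, "sonnet": 0.10}
COST_PER_REPORT = {"haiku": 0.03, "sonnet": 0.12, "opus": 0.60}


def monthly_total(conversation_model, clinical_model, embeddings=0.0, voice=0.0):
    """Sum the monthly cost (USD) across conversation, clinical, embeddings, voice."""
    conversation = COST_PER_CASE[conversation_model] * CASES_PER_MONTH
    clinical = COST_PER_REPORT[clinical_model] * REPORTS_PER_MONTH
    return round(conversation + clinical + embeddings + voice, 2)


print(monthly_total("haiku", "haiku"))                # current
print(monthly_total("haiku", "sonnet"))               # recommended
print(monthly_total("sonnet", "sonnet", voice=0.15))  # premium
print(monthly_total("haiku", "opus"))                 # max accuracy
```

The premium configuration sums to $34.15, which the table reports as $34/mo.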

Cost Monitoring

  • Langfuse: Tracks all LLM costs in real-time (model, tokens, cost per trace)
  • Health page: /landscape shows LLM Costs card with current month spend by model
  • Budget alert: Set Langfuse alert if monthly cost exceeds $50
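
The $50 budget check can also be expressed independently of any particular tool. The sketch below assumes month-to-date spend per model has already been obtained somehow (for example, exported from the cost tracker) and does not call any Langfuse API. `budget_status` and its linear month-end projection are illustrative assumptions.

```python
import calendar
from datetime import date

MONTHLY_BUDGET_USD = 50.0  # threshold from the budget-alert bullet above


def budget_status(spend_by_model: dict[str, float], today: date) -> dict:
    """Summarise month-to-date spend and a naive linear month-end projection."""
    spent = sum(spend_by_model.values())
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    projected = spent / today.day * days_in_month
    return {
        "spent": round(spent, 2),
        "projected": round(projected, 2),
        "over_budget": spent > MONTHLY_BUDGET_USD,
        "on_track_to_exceed": projected > MONTHLY_BUDGET_USD,
    }


status = budget_status({"haiku": 6.0, "sonnet": 14.0}, date(2025, 6, 10))
print(status)
```

Projecting to month end gives an earlier warning than waiting for the threshold to be crossed, at the cost of false alarms when usage is front-loaded.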

Medical Model Evolution Roadmap

| Stage | When | Models | What Changes |
| --- | --- | --- | --- |
| Stage 1: API-Only (current) | MVP | Claude Haiku 4.5 (conversation) + Haiku (clinical) | No GPU. All inference via API. ~$9/mo for 100 cases. |
| Stage 2: Upgrade Clinical | Post-demo | Claude Haiku (conversation) + Sonnet (clinical) | Better ICD coding accuracy. ~$27/mo. |
| Stage 3: Evaluate Medical SLMs | Post-seed | MedGemma 4B, BioClinicalBERT in shadow mode | Compare clinical accuracy vs Sonnet. No production traffic. |
| Stage 4: Hybrid Deploy | Pre-Series A | MedGemma 4B/27B + Claude Haiku | Self-host MedGemma on A10G GPU (~$250–500/mo). |
| Stage 5: Fine-Tuned Models | Post-Series A | MedGemma fine-tuned on Curaway data | Data flywheel moat. |

Post-MVP Technology Evolution

| Item | MVP Approach | Post-MVP Upgrade | Effort |
| --- | --- | --- | --- |
| Medical NER Pipeline | Claude Haiku | SciSpacy + BioClinicalBERT + MedCAT | 1–2 weeks |
| Clinical Ontology Service | LLM maps to ICD/SNOMED | UMLS Metathesaurus + PyMedTermino | 1 week |
| MedGemma Integration | Claude for all clinical AI | MedGemma 4B multimodal | 2–3 weeks + GPU |
| Vector Embeddings | Voyage AI 3.5-lite | Qdrant + bge-large medical | 1 week |
| ML Matching v2 | Weighted scoring v1 | Learning-to-rank | 2–3 weeks |
| Risk Assessor | Rule-based (Session 33) | LLM shadow mode → full agent | 1–2 + 2–3 weeks |
| FHIR REST Surface | Custom API | Standard FHIR REST endpoints | 2 weeks |
| SMS Notifications | Stub | Full Twilio + TCPA compliance | 1 week |
| Mobile Push | Device registry built | FCM + APNs integration | 1 week |
| Video Consultations | Model built, Daily.co stub | Full SDK integration | 2–3 weeks |
| Cache Strategy | Active (Redis) | Expand to provider data + embeddings | Done |
| LangSmith Evaluations | Not active | Evaluation datasets for ICD accuracy | 1 week |

Post-Series A Technology Evolution

| Technology | Replaces | When to Adopt | Why Not Earlier |
| --- | --- | --- | --- |
| Kubernetes (EKS/GKE) | Railway Pro | 5+ microservices | K8s overhead not justified until team >5 |
| Kafka / AWS MSK | Upstash QStash | Event volume >50K/day | QStash handles 500/day free |
| Temporal | LangGraph + QStash | Multi-day workflows | LangGraph handles orchestration well |
| Aidbox (FHIR Server) | fhir.resources + JSONB | Provider EHR integration | Zero operational cost currently |
| Elasticsearch | PostgreSQL full-text | 5K+ providers | PostgreSQL sufficient for 500 |
| Self-Hosted GPU | API-based LLM | Per-patient cost >$0.50 | API ~$40/mo vs GPU ~$500–2K/mo |
| MedGemma Fine-Tuning | Pretrained + API | 10K+ outcome records | Fine-tuning without data degrades quality |
| React Native Mobile | Responsive web | Mobile retention justifies it | Build when user demand proven |

Key principle: Every upgrade has a trigger point based on scale, data, or business requirements — not on what's trendy.
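
The quantitative trigger points in the Post-Series A table lend themselves to an explicit check. The sketch below is one possible encoding. The `Metrics` fields, the `TRIGGERS` list, and `due_upgrades` are hypothetical names, and only the rows with numeric thresholds are included.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    microservices: int
    events_per_day: int
    providers: int
    llm_cost_per_patient_usd: float
    outcome_records: int


# (technology, predicate) pairs mirroring the numeric "When to Adopt" entries.
TRIGGERS = [
    ("Kubernetes (EKS/GKE)", lambda m: m.microservices >= 5),
    ("Kafka / AWS MSK", lambda m: m.events_per_day > 50_000),
    ("Elasticsearch", lambda m: m.providers >= 5_000),
    ("Self-Hosted GPU", lambda m: m.llm_cost_per_patient_usd > 0.50),
    ("MedGemma Fine-Tuning", lambda m: m.outcome_records >= 10_000),
]


def due_upgrades(metrics: Metrics) -> list[str]:
    """Return the technologies whose adoption trigger has been reached."""
    return [name for name, is_due in TRIGGERS if is_due(metrics)]


today = Metrics(microservices=2, events_per_day=400, providers=500,
                llm_cost_per_patient_usd=0.09, outcome_records=0)
print(due_upgrades(today))
```

Keeping the thresholds in one list makes it easy to review them alongside the table when either changes.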