LLM Evaluation for Curaway Use Cases
Moved from CLAUDE.md. This is the detailed cost/quality analysis for each LLM use case.
Use Case 1: Patient Conversation (Intake, Questions, Brand Voice)
~500 tokens in, ~300 tokens out per message, ~15 messages per case
| Model | Input $/1M | Output $/1M | Cost/Case | Quality | Latency | Decision |
| --- | --- | --- | --- | --- | --- | --- |
| Claude Haiku 4.5 | $0.80 | $4.00 | $0.03 | Natural, empathetic, follows brand voice | ~0.5s | Current choice |
| Claude Sonnet 4.6 | $3.00 | $15.00 | $0.10 | Best quality but overkill for conversation | ~1.5s | Overkill for intake |
| GPT-4o mini | $0.15 | $0.60 | $0.005 | Good but less nuanced brand voice | ~0.3s | Budget option |
| GPT-4o | $2.50 | $10.00 | $0.08 | Strong but different tone | ~1.0s | Alternative |
| Gemini 1.5 Flash | $0.075 | $0.30 | $0.003 | Fast/cheap, variable quality | ~0.2s | Exploration |
Decision: Claude Haiku 4.5 — $3/month for 100 cases.
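
The per-case figures can be approximated from the token counts stated above. The sketch below is a flat estimate that assumes each message is billed independently (no conversation history resent); the function name and defaults are illustrative, not taken from the codebase. The table's figures sit at or slightly above this baseline.

```python
def cost_per_case(input_price, output_price,
                  tokens_in=500, tokens_out=300, messages=15):
    """Return the USD cost of one case; prices are USD per 1M tokens."""
    total_in = tokens_in * messages      # 7,500 input tokens per case
    total_out = tokens_out * messages    # 4,500 output tokens per case
    return (total_in * input_price + total_out * output_price) / 1_000_000


haiku = cost_per_case(0.80, 4.00)     # about 0.024, listed as $0.03
sonnet = cost_per_case(3.00, 15.00)   # about 0.09, listed as $0.10
```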
Use Case 2: Clinical Context Agent (Report Parsing → ICD Codes → FHIR)
~3,000 tokens in (report), ~2,000 tokens out (structured extraction), ~3 calls per report
| Model | Input $/1M | Output $/1M | Cost/Report | Clinical Accuracy | FHIR Compliance | Decision |
| --- | --- | --- | --- | --- | --- | --- |
| Claude Haiku 4.5 | $0.80 | $4.00 | $0.03 | Good general extraction | Mostly correct | Current (MVP) |
| Claude Sonnet 4.6 | $3.00 | $15.00 | $0.12 | Excellent clinical reasoning | Accurate FHIR | Recommended upgrade |
| Claude Opus 4.6 | $15.00 | $75.00 | $0.60 | Best but expensive | Gold standard | Overkill for MVP |
| GPT-4o | $2.50 | $10.00 | $0.08 | Strong extraction | Good FHIR | Alternative |
| GPT-4o mini | $0.15 | $0.60 | $0.005 | Misses nuances | Inconsistent | Not recommended |
| Med-PaLM 2 (Google) | ~$5.00 | ~$15.00 | $0.12 | USMLE-level reasoning | Not FHIR-native | Limited access |
| BioMistral 7B (self-hosted) | GPU ~$0.50/hr | — | ~$0.02 | Medical pre-training | Needs fine-tuning | Post-seed option |
| MedGemma 4B (self-hosted) | GPU ~$0.30/hr | — | ~$0.01 | Google medical SLM | Needs FHIR template | Post-seed option |
| PMC-LLaMA (self-hosted) | GPU ~$0.50/hr | — | ~$0.02 | PubMed pre-trained | Raw extraction only | Research option |
Decision (MVP): Claude Haiku 4.5. Recommended upgrade: Claude Sonnet 4.6 (+$18/mo).
Post-seed: Evaluate MedGemma 4B in shadow mode.
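
Shadow mode here means the candidate model sees the same report as the production model, but its output is only logged for comparison and never returned. A minimal sketch, assuming both models are wrapped as callables that return a set of ICD codes (the function names, the logger name, and the Jaccard agreement metric are illustrative choices, not the project's actual evaluation harness):

```python
import logging
from typing import Callable

logger = logging.getLogger("shadow_eval")  # hypothetical logger name


def jaccard(a: set[str], b: set[str]) -> float:
    """Overlap between two code sets; 1.0 means identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def extract_with_shadow(report_text: str,
                        primary: Callable[[str], set[str]],
                        shadow: Callable[[str], set[str]]) -> set[str]:
    """Return the primary model's codes; log how the shadow model compares."""
    primary_codes = primary(report_text)
    try:
        shadow_codes = shadow(report_text)
        logger.info("shadow agreement=%.2f",
                    jaccard(primary_codes, shadow_codes))
    except Exception:
        # A shadow failure must never affect the production result.
        logger.exception("shadow model failed; primary result unaffected")
    return primary_codes


# Stub callables standing in for the real model clients.
primary_stub = lambda text: {"E11.9", "I10"}
shadow_stub = lambda text: {"E11.9"}
codes = extract_with_shadow("report text", primary_stub, shadow_stub)
```

In a real deployment the shadow call would normally run off the request path (for example in a background job) so it adds no latency; the inline call above keeps the sketch short.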
Use Case 3: Comorbidity Detection from Lab Values
Rule-based (no LLM) — app/agents/lab_analyzer.py
| Approach | Cost | Accuracy | Decision |
| --- | --- | --- | --- |
| Rule-based thresholds | $0 | Good for common conditions | Current |
| LLM-based interpretation | $0.02/report | Catches edge cases | Post-seed upgrade |
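
A rule-based detector of this kind can be as small as a table of thresholds. The sketch below is not the content of app/agents/lab_analyzer.py, which is not shown here; the lab keys, the rule list, and the cut-off values are placeholders chosen for illustration only.

```python
# Placeholder rules for illustration; the real thresholds live in
# app/agents/lab_analyzer.py and may differ.
RULES = [
    # (lab key, comparison, threshold, flagged condition)
    ("hba1c_percent", ">=", 6.5, "possible diabetes"),
    ("egfr_ml_min", "<", 60.0, "possible chronic kidney disease"),
    ("hemoglobin_g_dl", "<", 12.0, "possible anemia"),
]


def detect_comorbidities(labs: dict[str, float]) -> list[str]:
    """Return the conditions whose threshold rule fires for these labs."""
    flagged = []
    for key, op, threshold, condition in RULES:
        value = labs.get(key)
        if value is None:
            continue  # lab not present in this report
        if (op == ">=" and value >= threshold) or (op == "<" and value < threshold):
            flagged.append(condition)
    return flagged


flags = detect_comorbidities({"hba1c_percent": 7.2, "egfr_ml_min": 85.0})
```

Because the rules are plain data, the same input always produces the same flags at zero inference cost. The trade-off is that unusual combinations of values go unnoticed.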
Use Case 4: Match Explanations
| Model | Cost/Match Run | Quality | Decision |
| --- | --- | --- | --- |
| Template-based | $0 | Formulaic | Current (MVP) |
| Claude Haiku | $0.01 | Natural | Feature-flagged, available |
| Claude Sonnet | $0.04 | Compelling | Post-seed |
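
The template-versus-LLM choice can sit behind a single flag. A minimal sketch, assuming a match is a dict with the fields shown; the template wording, field names, and the `use_llm` flag are hypothetical and not taken from the codebase:

```python
from typing import Callable, Optional

# Hypothetical template; the production wording is not shown in this document.
TEMPLATE = "{provider} scored {score:.0%} for {procedure}: {reason}."


def explain_match(match: dict,
                  use_llm: bool = False,
                  llm: Optional[Callable[[dict], str]] = None) -> str:
    """Use the LLM only when the flag is on and a client is supplied."""
    if use_llm and llm is not None:
        return llm(match)
    return TEMPLATE.format(**match)


text = explain_match({
    "provider": "Hospital A",
    "score": 0.87,
    "procedure": "knee replacement",
    "reason": "high procedure volume",
})
```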
Use Case 5: Embeddings (Semantic Search)
| Model | $/1M tokens | Dims | Quality | Decision |
| --- | --- | --- | --- | --- |
| Voyage AI 3.5-lite | Free (50M/mo) | 1024 | General purpose | Current |
| OpenAI text-embedding-3-small | $0.02 | 1536 | General purpose | Fallback configured |
| BiomedCLIP (self-hosted) | GPU cost | 512 | Medical domain | Post-seed |
| BGE-M3 (self-hosted) | GPU cost | 1024 | Multilingual | Post-seed |
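
One consequence of the table is that the primary and fallback models produce vectors of different sizes (1024 vs 1536), and vectors from different models are not comparable. A fallback path should therefore record which model produced each vector. A minimal sketch, assuming both providers are wrapped as callables (the wrapper and stub names are hypothetical):

```python
from typing import Callable, NamedTuple


class Embedding(NamedTuple):
    model: str            # which model produced the vector
    vector: list[float]


def embed_with_fallback(text: str,
                        primary: Callable[[str], list[float]],
                        fallback: Callable[[str], list[float]]) -> Embedding:
    """Try the primary embedder; on any error, use the fallback."""
    try:
        return Embedding("voyage-3.5-lite", primary(text))
    except Exception:
        return Embedding("text-embedding-3-small", fallback(text))


def primary_down(text: str) -> list[float]:
    # Stub simulating an outage of the primary provider.
    raise TimeoutError("primary unavailable")


result = embed_with_fallback("knee replacement", primary_down,
                             lambda t: [0.0] * 1536)
```

Storing `result.model` next to the vector lets the search layer keep each model's vectors in a separate index rather than mixing them.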
Use Case 6: Voice Transcription
| Model | Price | Medical Accuracy | Decision |
| --- | --- | --- | --- |
| Web Speech API | Free (browser) | Poor for medical terms | Current (MVP) |
| OpenAI Whisper | $0.006/min | Good with medical prompt | Feature-flagged, ready |
| Deepgram Nova-2 Medical | $0.0043/min | Medical vocabulary | Post-seed |
Use Case 7: Pre-Operative Risk Assessment
| Approach | Cost | Latency | Accuracy | Determinism | Decision |
| --- | --- | --- | --- | --- | --- |
| Rule-based (risk_assessor.py) | $0 | <5ms | Good for common pre-op risks | Yes | Current |
| LLM shadow (Claude Sonnet) | ~$0.04/rebuild | ~1–2s | Catches edge cases | No | Post-Series A |
| Standalone LLM Risk Agent | ~$0.04–$0.20/rebuild | ~1–2s | Best + free-text reasoning | No | After shadow validation |
Full rationale in ADR-0013.
Monthly Cost Projections (100 cases, 200 reports)
| Configuration | Conversation | Clinical | Embeddings | Voice | Total |
| --- | --- | --- | --- | --- | --- |
| Current (all Haiku) | $3 | $6 | $0 | $0 | $9/mo |
| Recommended (Haiku + Sonnet) | $3 | $24 | $0 | $0 | $27/mo |
| Premium (Sonnet everywhere) | $10 | $24 | $0 | $0.15 | $34/mo |
| Budget (GPT-4o mini) | $0.50 | $1 | $0 | $0 | $1.50/mo |
| Max accuracy (Opus clinical) | $3 | $120 | $0 | $0 | $123/mo |
| Post-seed hybrid (Haiku + MedGemma) | $3 | $2 (GPU) | $0 | $0 | $5/mo + GPU |
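
The totals follow directly from the per-case and per-report costs in Use Cases 1 and 2, scaled to 100 cases and 200 reports. A short sketch of that arithmetic (the function name is illustrative; embeddings are omitted because the current tier is free):

```python
CASES, REPORTS = 100, 200  # monthly volumes assumed by the table


def monthly_total(cost_per_case: float, cost_per_report: float,
                  voice: float = 0.0) -> float:
    """Monthly USD spend for conversation + clinical + voice."""
    return CASES * cost_per_case + REPORTS * cost_per_report + voice


current = monthly_total(0.03, 0.03)        # Haiku everywhere: about $9
recommended = monthly_total(0.03, 0.12)    # Haiku + Sonnet clinical: about $27
max_accuracy = monthly_total(0.03, 0.60)   # Opus clinical: about $123
upgrade_delta = recommended - current      # about $18, the "+$18/mo" upgrade
```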
Cost Monitoring
- Langfuse: Tracks all LLM costs in real-time (model, tokens, cost per trace)
- Health page: /landscape shows an LLM Costs card with current-month spend by model
- Budget alert: Set Langfuse alert if monthly cost exceeds $50
Medical Model Evolution Roadmap
| Stage | When | Models | What Changes |
| --- | --- | --- | --- |
| Stage 1: API-Only (current) | MVP | Claude Haiku 4.5 (conversation) + Haiku (clinical) | No GPU. All inference via API. ~$9/mo for 100 cases. |
| Stage 2: Upgrade Clinical | Post-demo | Claude Haiku (conversation) + Sonnet (clinical) | Better ICD coding accuracy. ~$27/mo. |
| Stage 3: Evaluate Medical SLMs | Post-seed | MedGemma 4B, BioClinicalBERT in shadow mode | Compare clinical accuracy vs Sonnet. No production traffic. |
| Stage 4: Hybrid Deploy | Pre-Series A | MedGemma 4B/27B + Claude Haiku | Self-host MedGemma on A10G GPU (~$250–500/mo). |
| Stage 5: Fine-Tuned Models | Post-Series A | MedGemma fine-tuned on Curaway data | Data flywheel moat. |
Post-MVP Technology Evolution
| Item | MVP Approach | Post-MVP Upgrade | Effort |
| --- | --- | --- | --- |
| Medical NER Pipeline | Claude Haiku | SciSpacy + BioClinicalBERT + MedCAT | 1–2 weeks |
| Clinical Ontology Service | LLM maps to ICD/SNOMED | UMLS Metathesaurus + PyMedTermino | 1 week |
| MedGemma Integration | Claude for all clinical AI | MedGemma 4B multimodal | 2–3 weeks + GPU |
| Vector Embeddings | Voyage AI 3.5-lite | Qdrant + bge-large medical | 1 week |
| ML Matching v2 | Weighted scoring v1 | Learning-to-rank | 2–3 weeks |
| Risk Assessor | Rule-based (Session 33) | LLM shadow mode → full agent | 1–2 + 2–3 weeks |
| FHIR REST Surface | Custom API | Standard FHIR REST endpoints | 2 weeks |
| SMS Notifications | Stub | Full Twilio + TCPA compliance | 1 week |
| Mobile Push | Device registry built | FCM + APNs integration | 1 week |
| Video Consultations | Model built, Daily.co stub | Full SDK integration | 2–3 weeks |
| Cache Strategy | Active (Redis) | Expand to provider data + embeddings | Done |
| LangSmith Evaluations | Not active | Evaluation datasets for ICD accuracy | 1 week |
Post-Series A Technology Evolution
| Technology | Replaces | When to Adopt | Why Not Earlier |
| --- | --- | --- | --- |
| Kubernetes (EKS/GKE) | Railway Pro | 5+ microservices | K8s overhead not justified until team >5 |
| Kafka / AWS MSK | Upstash QStash | Event volume >50K/day | QStash handles 500/day free |
| Temporal | LangGraph + QStash | Multi-day workflows | LangGraph handles orchestration well |
| Aidbox (FHIR Server) | fhir.resources + JSONB | Provider EHR integration | Zero operational cost currently |
| Elasticsearch | PostgreSQL full-text | 5K+ providers | PostgreSQL sufficient for 500 |
| Self-Hosted GPU | API-based LLM | Per-patient cost >$0.50 | API ~$40/mo vs GPU ~$500–2K/mo |
| MedGemma Fine-Tuning | Pretrained + API | 10K+ outcome records | Fine-tuning without data degrades quality |
| React Native Mobile | Responsive web | Mobile retention justifies | Build when user demand proven |
Key principle: Every upgrade has a trigger point based on scale, data, or business requirements — not on what's trendy.
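
The quantitative triggers in the table can be written down as data and checked against live metrics, so that an upgrade decision follows a measurement. A minimal sketch covering only the rows with numeric thresholds; the metric key names are hypothetical:

```python
# Thresholds copied from the "When to Adopt" column; metric keys are assumed.
TRIGGERS = {
    "Kafka / AWS MSK": lambda m: m["events_per_day"] > 50_000,
    "Elasticsearch": lambda m: m["providers"] >= 5_000,
    "Self-Hosted GPU": lambda m: m["cost_per_patient_usd"] > 0.50,
    "MedGemma Fine-Tuning": lambda m: m["outcome_records"] >= 10_000,
}


def due_upgrades(metrics: dict) -> list[str]:
    """Return the upgrades whose trigger condition is currently met."""
    return [name for name, hit in TRIGGERS.items() if hit(metrics)]


today = {"events_per_day": 400, "providers": 500,
         "cost_per_patient_usd": 0.09, "outcome_records": 0}
pending = due_upgrades(today)  # empty list: no trigger has fired
```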