LLM Evaluation for Curaway Use Cases

Moved from CLAUDE.md. This is the detailed cost/quality analysis for each LLM use case.

Use Case 1: Patient Conversation (Intake, Questions, Brand Voice)

~500 tokens in, ~300 tokens out per message, ~15 messages per case

Model	Input $/1M	Output $/1M	Cost/Case	Quality	Latency	Decision
Claude Haiku 4.5	$0.80	$4.00	$0.03	Natural, empathetic, follows brand voice	~0.5s	Current choice
Claude Sonnet 4.6	$3.00	$15.00	$0.10	Best quality but overkill for conversation	~1.5s	Overkill for intake
GPT-4o mini	$0.15	$0.60	$0.005	Good but less nuanced brand voice	~0.3s	Budget option
GPT-4o	$2.50	$10.00	$0.08	Strong but different tone	~1.0s	Alternative
Gemini 1.5 Flash	$0.075	$0.30	$0.003	Fast/cheap, variable quality	~0.2s	Exploration

Decision: Claude Haiku 4.5 — $3/month for 100 cases.
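The per-case figures above follow from the token counts and per-million prices. The sketch below reproduces that arithmetic; the function name `cost_per_case` is an assumption made for illustration, and the token counts are the rough averages quoted above. The raw results ($0.024 for Haiku, $0.09 for Sonnet) sit slightly below the table's values, which appear to be rounded up.

```python
def cost_per_case(tokens_in, tokens_out, price_in_per_m, price_out_per_m, calls):
    """Return the USD cost of `calls` LLM calls with fixed per-call token counts."""
    per_call = (tokens_in * price_in_per_m + tokens_out * price_out_per_m) / 1_000_000
    return per_call * calls

# Figures from the table above: ~500 in / ~300 out per message, ~15 messages per case.
haiku = cost_per_case(500, 300, 0.80, 4.00, calls=15)
sonnet = cost_per_case(500, 300, 3.00, 15.00, calls=15)

print(f"Haiku:  ${haiku:.3f} per case, ${haiku * 100:.2f} per 100 cases")
print(f"Sonnet: ${sonnet:.3f} per case, ${sonnet * 100:.2f} per 100 cases")
```

The same function covers the other rows by swapping in their prices, and covers per-report costs by passing different token counts and call counts.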

~3,000 tokens in (report), ~2,000 tokens out (structured extraction), ~3 calls per report

Model	Input $/1M	Output $/1M	Cost/Report	Clinical Accuracy	FHIR Compliance	Decision
Claude Haiku 4.5	$0.80	$4.00	$0.03	Good general extraction	Mostly correct	Current (MVP)
Claude Sonnet 4.6	$3.00	$15.00	$0.12	Excellent clinical reasoning	Accurate FHIR	Recommended upgrade
Claude Opus 4.6	$15.00	$75.00	$0.60	Best but expensive	Gold standard	Overkill for MVP
GPT-4o	$2.50	$10.00	$0.08	Strong extraction	Good FHIR	Alternative
GPT-4o mini	$0.15	$0.60	$0.005	Misses nuances	Inconsistent	Not recommended
Med-PaLM 2 (Google)	~$5.00	~$15.00	$0.12	USMLE-level reasoning	Not FHIR-native	Limited access
BioMistral 7B (self-hosted)	GPU ~$0.50/hr	—	~$0.02	Medical pre-training	Needs fine-tuning	Post-seed option
MedGemma 4B (self-hosted)	GPU ~$0.30/hr	—	~$0.01	Google medical SLM	Needs FHIR template	Post-seed option
PMC-LLaMA (self-hosted)	GPU ~$0.50/hr	—	~$0.02	PubMed pre-trained	Raw extraction only	Research option

Decision (MVP): Claude Haiku 4.5. Recommended upgrade: Claude Sonnet 4.6 (+$18/mo). Post-seed: Evaluate MedGemma 4B in shadow mode.
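Shadow mode, as used in the decision above, generally means the candidate model sees the same input as the production model but its output is only logged, never served. A minimal sketch of that pattern follows, assuming both models are wrapped as plain callables that return a dict; `extract_with_shadow`, `primary`, and `candidate` are hypothetical names, not identifiers from the Curaway codebase.

```python
import logging
from typing import Callable

logger = logging.getLogger("shadow_eval")

def extract_with_shadow(report: str,
                        primary: Callable[[str], dict],
                        candidate: Callable[[str], dict]) -> dict:
    """Serve the primary model's result; run the candidate only for comparison."""
    result = primary(report)
    try:
        shadow = candidate(report)
        # Log which fields disagree so accuracy can be compared offline.
        diff = {k for k in result.keys() | shadow.keys()
                if result.get(k) != shadow.get(k)}
        logger.info("shadow_diff fields=%s", sorted(diff))
    except Exception:
        # A failing candidate must never affect the served response.
        logger.exception("shadow candidate failed")
    return result

# Stub models standing in for the production and candidate extractors.
served = extract_with_shadow(
    "report text",
    primary=lambda r: {"icd10": "M17.11"},
    candidate=lambda r: {"icd10": "M17.12"},
)
```

In a real deployment the candidate call would likely run asynchronously so it adds no latency to the served path; it is kept synchronous here for brevity.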

Rule-based (no LLM) — app/agents/lab_analyzer.py

Approach	Cost	Accuracy	Decision
Rule-based thresholds	$0	Good for common conditions	Current
LLM-based interpretation	$0.02/report	Catches edge cases	Post-seed upgrade
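For readers unfamiliar with the rule-based approach, the sketch below shows the general shape of threshold checks. It is not the contents of `app/agents/lab_analyzer.py`: the test names and reference ranges are placeholders chosen for illustration and are not clinical guidance.

```python
# Illustrative ranges only: NOT clinical guidance, NOT taken from lab_analyzer.py.
REFERENCE_RANGES = {
    "hemoglobin_g_dl": (12.0, 17.5),
    "platelets_k_ul": (150.0, 450.0),
}

def flag_labs(results: dict[str, float]) -> dict[str, str]:
    """Classify each lab value as low, normal, high, or unknown."""
    flags = {}
    for name, value in results.items():
        if name not in REFERENCE_RANGES:
            # No rule exists for this test: the kind of edge case an LLM pass could cover.
            flags[name] = "unknown"
            continue
        low, high = REFERENCE_RANGES[name]
        flags[name] = "low" if value < low else "high" if value > high else "normal"
    return flags

flags = flag_labs({"hemoglobin_g_dl": 10.0, "platelets_k_ul": 200.0, "ferritin_ng_ml": 30.0})
print(flags)
```

Checks like this are deterministic and cost nothing per report, which is consistent with the $0 row above; the trade-off is that anything without a rule falls through as unknown.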

Model	Cost/Match Run	Quality	Decision
Template-based	$0	Formulaic	Current (MVP)
Claude Haiku	$0.01	Natural	Feature-flagged, available
Claude Sonnet	$0.04	Compelling	Post-seed
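One way the template and feature-flagged LLM options in this table can coexist is sketched below. The template wording, the field names (`provider`, `procedure`, `score`), and the `use_llm` flag are all assumptions for illustration; the actual templates and flag mechanism are not described in this document.

```python
from typing import Callable, Optional

# Hypothetical template; the real wording is not given in this document.
TEMPLATE = "{provider} matches on {procedure} with a score of {score:.0%}."

def explain_match(match: dict, use_llm: bool = False,
                  llm: Optional[Callable[[dict], str]] = None) -> str:
    """Return an LLM-written explanation when enabled, else the $0 template."""
    if use_llm and llm is not None:
        try:
            return llm(match)
        except Exception:
            pass  # fall through to the template if the LLM call fails
    return TEMPLATE.format(**match)

text = explain_match({"provider": "Clinic A", "procedure": "knee replacement", "score": 0.87})
print(text)
```

Keeping the template as the fallback means turning the flag on cannot make explanations unavailable, only more expensive.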

Model	$/1M tokens	Dims	Quality	Decision
Voyage AI 3.5-lite	Free (50M/mo)	1024	General purpose	Current
OpenAI text-embedding-3-small	$0.02	1536	General purpose	Fallback configured
BiomedCLIP (self-hosted)	GPU cost	512	Medical domain	Post-seed
BGE-M3 (self-hosted)	GPU cost	1024	Multilingual	Post-seed
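The current and fallback models in this table produce vectors of different sizes (1024 vs 1536), so a fallback path needs to report which model answered: vectors from different models should not be mixed in one index. The sketch below illustrates that idea with stub embed functions; `embed_with_fallback` and the provider tuples are hypothetical and do not call any real SDK.

```python
from typing import Callable, Sequence

# (model name, expected dimensions, embed function); dimensions come from the table above.
Provider = tuple[str, int, Callable[[str], Sequence[float]]]

def embed_with_fallback(text: str, providers: list[Provider]) -> tuple[str, list[float]]:
    """Try providers in order; return the succeeding model's name and its vector."""
    errors = []
    for name, dims, embed in providers:
        try:
            vector = list(embed(text))
        except Exception as exc:
            errors.append(f"{name}: {exc}")
            continue
        if len(vector) != dims:
            errors.append(f"{name}: expected {dims} dims, got {len(vector)}")
            continue
        return name, vector
    raise RuntimeError("all embedding providers failed: " + "; ".join(errors))

def _primary_down(text: str) -> Sequence[float]:
    raise ConnectionError("primary unavailable")  # simulated outage

name, vec = embed_with_fallback("knee replacement", [
    ("voyage-3.5-lite", 1024, _primary_down),
    ("text-embedding-3-small", 1536, lambda t: [0.0] * 1536),
])
print(name, len(vec))
```

Returning the model name lets the caller route the vector to a collection built for that model rather than silently writing it into the primary index.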

Model	Price	Medical Accuracy	Decision
Web Speech API	Free (browser)	Poor for medical terms	Current (MVP)
OpenAI Whisper	$0.006/min	Good with medical prompt	Feature-flagged, ready
Deepgram Nova-2 Medical	$0.0043/min	Medical vocabulary	Post-seed

Approach	Cost	Latency	Accuracy	Determinism	Decision
Rule-based (`risk_assessor.py`)	$0	<5ms	Good for common pre-op risks	Yes	Current
LLM shadow (Claude Sonnet)	~$0.04/rebuild	~1–2s	Catches edge cases	No	Post-Series A
Standalone LLM Risk Agent	~$0.04–$0.20/rebuild	~1–2s	Best + free-text reasoning	No	After shadow validation

Full rationale in ADR-0013.

Configuration	Conversation	Clinical	Embeddings	Voice	Total
Current (all Haiku)	$3	$6	$0	$0	$9/mo
Recommended (Haiku + Sonnet)	$3	$24	$0	$0	$27/mo
Premium (Sonnet everywhere)	$10	$24	$0	$0.15	$34/mo
Budget (GPT-4o mini)	$0.50	$1	$0	$0	$1.50/mo
Max accuracy (Opus clinical)	$3	$120	$0	$0	$123/mo
Post-seed hybrid (Haiku + MedGemma)	$3	$2 (GPU)	$0	$0	$5/mo + GPU
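The totals in this table are row sums of the four components. The short check below copies the figures from the table and recomputes them; the dictionary layout is just a convenient encoding for this sketch. It also confirms that the clinical upgrade from Haiku to Sonnet adds $18/mo, and shows that the Premium row sums to $34.15, which the table rounds to $34.

```python
# Monthly USD figures copied from the table above. GPU rental is excluded
# from the post-seed row, as in the table.
CONFIGS = {
    "Current (all Haiku)":          {"conversation": 3.00,  "clinical": 6.00,   "embeddings": 0.0, "voice": 0.0},
    "Recommended (Haiku + Sonnet)": {"conversation": 3.00,  "clinical": 24.00,  "embeddings": 0.0, "voice": 0.0},
    "Premium (Sonnet everywhere)":  {"conversation": 10.00, "clinical": 24.00,  "embeddings": 0.0, "voice": 0.15},
    "Budget (GPT-4o mini)":         {"conversation": 0.50,  "clinical": 1.00,   "embeddings": 0.0, "voice": 0.0},
    "Max accuracy (Opus clinical)": {"conversation": 3.00,  "clinical": 120.00, "embeddings": 0.0, "voice": 0.0},
    "Post-seed hybrid (Haiku + MedGemma)": {"conversation": 3.00, "clinical": 2.00, "embeddings": 0.0, "voice": 0.0},
}

totals = {name: sum(parts.values()) for name, parts in CONFIGS.items()}
upgrade_delta = (CONFIGS["Recommended (Haiku + Sonnet)"]["clinical"]
                 - CONFIGS["Current (all Haiku)"]["clinical"])

for name, total in totals.items():
    print(f"{name}: ${total:.2f}/mo")
print(f"Clinical upgrade delta: +${upgrade_delta:.0f}/mo")
```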

Langfuse: Tracks all LLM costs in real time (model, tokens, cost per trace)
Health page: /landscape shows an LLM Costs card with current-month spend by model
Budget alert: Set a Langfuse alert if monthly cost exceeds $50
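The per-model spend and the $50 threshold can be expressed as a simple aggregation over trace records. The sketch below assumes each trace has already been exported as a dict with `model` and `cost_usd` keys; that record shape and the function names are assumptions for illustration, not the Langfuse API.

```python
from collections import defaultdict

BUDGET_USD = 50.0  # monthly threshold from the budget alert above

def spend_by_model(traces: list[dict]) -> dict[str, float]:
    """Sum cost per model for the traces in the current month."""
    totals: dict[str, float] = defaultdict(float)
    for trace in traces:
        totals[trace["model"]] += trace["cost_usd"]
    return dict(totals)

def over_budget(traces: list[dict], budget: float = BUDGET_USD) -> bool:
    """Return True when total spend exceeds the budget."""
    return sum(trace["cost_usd"] for trace in traces) > budget

# Hypothetical exported traces.
traces = [
    {"model": "haiku", "cost_usd": 2.50},
    {"model": "haiku", "cost_usd": 0.50},
    {"model": "sonnet", "cost_usd": 24.00},
]
by_model = spend_by_model(traces)
print(by_model, "over budget:", over_budget(traces))
```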

Stage	When	Models	What Changes
Stage 1: API-Only (current)	MVP	Claude Haiku 4.5 (conversation) + Haiku (clinical)	No GPU. All inference via API. ~$9/mo for 100 cases.
Stage 2: Upgrade Clinical	Post-demo	Claude Haiku (conversation) + Sonnet (clinical)	Better ICD coding accuracy. ~$27/mo.
Stage 3: Evaluate Medical SLMs	Post-seed	MedGemma 4B, BioClinicalBERT in shadow mode	Compare clinical accuracy vs Sonnet. No production traffic.
Stage 4: Hybrid Deploy	Pre-Series A	MedGemma 4B/27B + Claude Haiku	Self-host MedGemma on A10G GPU (~$250–500/mo).
Stage 5: Fine-Tuned Models	Post-Series A	MedGemma fine-tuned on Curaway data	Data flywheel moat.

Item	MVP Approach	Post-MVP Upgrade	Effort
Medical NER Pipeline	Claude Haiku	SciSpacy + BioClinicalBERT + MedCAT	1–2 weeks
Clinical Ontology Service	LLM maps to ICD/SNOMED	UMLS Metathesaurus + PyMedTermino	1 week
MedGemma Integration	Claude for all clinical AI	MedGemma 4B multimodal	2–3 weeks + GPU
Vector Embeddings	Voyage AI 3.5-lite	Qdrant + bge-large medical	1 week
ML Matching v2	Weighted scoring v1	Learning-to-rank	2–3 weeks
Risk Assessor	Rule-based (Session 33)	LLM shadow mode → full agent	1–2 + 2–3 weeks
FHIR REST Surface	Custom API	Standard FHIR REST endpoints	2 weeks
SMS Notifications	Stub	Full Twilio + TCPA compliance	1 week
Mobile Push	Device registry built	FCM + APNs integration	1 week
Video Consultations	Model built, Daily.co stub	Full SDK integration	2–3 weeks
Cache Strategy	Active (Redis)	Expand to provider data + embeddings	Done
LangSmith Evaluations	Not active	Evaluation datasets for ICD accuracy	1 week

Technology	Replaces	When to Adopt	Why Not Earlier
Kubernetes (EKS/GKE)	Railway Pro	5+ microservices	K8s overhead not justified until team >5
Kafka / AWS MSK	Upstash QStash	Event volume >50K/day	QStash handles 500/day free
Temporal	LangGraph + QStash	Multi-day workflows	LangGraph handles orchestration well
Aidbox (FHIR Server)	fhir.resources + JSONB	Provider EHR integration	Zero operational cost currently
Elasticsearch	PostgreSQL full-text	5K+ providers	PostgreSQL sufficient for 500
Self-Hosted GPU	API-based LLM	Per-patient cost >$0.50	API ~$40/mo vs GPU ~$500–2K/mo
MedGemma Fine-Tuning	Pretrained + API	10K+ outcome records	Fine-tuning without data degrades quality
React Native Mobile	Responsive web	Mobile retention justifies	Build when user demand proven

Key principle: Every upgrade has a trigger point based on scale, data, or business requirements — not on what's trendy.
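The trigger points in the table above can be made explicit as predicates over a few operational metrics, so that an upgrade is proposed only when its condition is met. The sketch below encodes the quantitative triggers; the metric key names and the `Trigger` class are hypothetical, and qualitative triggers (for example "Multi-day workflows") are left out because they do not reduce to a number.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Trigger:
    technology: str
    is_due: Callable[[dict], bool]

# Thresholds are taken from the "When to Adopt" column above.
TRIGGERS = [
    Trigger("Kubernetes (EKS/GKE)", lambda m: m["microservices"] >= 5),
    Trigger("Kafka / AWS MSK", lambda m: m["events_per_day"] > 50_000),
    Trigger("Elasticsearch", lambda m: m["providers"] >= 5_000),
    Trigger("Self-Hosted GPU", lambda m: m["cost_per_patient_usd"] > 0.50),
    Trigger("MedGemma Fine-Tuning", lambda m: m["outcome_records"] >= 10_000),
]

def due_upgrades(metrics: dict) -> list[str]:
    """Return the technologies whose adoption trigger is currently met."""
    return [t.technology for t in TRIGGERS if t.is_due(metrics)]

# Hypothetical MVP-scale metrics: no trigger fires.
mvp = {"microservices": 2, "events_per_day": 500, "providers": 500,
       "cost_per_patient_usd": 0.09, "outcome_records": 0}
print(due_upgrades(mvp))
```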