Runtime Trace Example: Case 147f70ac…
This page documents one real production case end-to-end as captured by Langfuse, annotated for migration cost modelling. It is the empirical floor that the sequence diagrams describe abstractly.
Status: ✅ Populated 2026-04-29 from a complete Langfuse pull (all 78 LLM observations captured for this case).
Case summary
| Field | Value |
|---|---|
| Case ID | 147f70ac-fa12-4a03-9eca-1a87ca801ccc |
| Patient tenant | tenant-apollo-001 |
| App URL | https://app.curaway.ai/app/case/147f70ac-fa12-4a03-9eca-1a87ca801ccc |
| First LLM call | 2026-04-26 10:07:26 UTC |
| Last LLM call | 2026-04-26 10:32:57 UTC |
| Wall-clock (LLM activity) | 25 min 31 sec |
| Total LLM observations | 78 |
| Patient chat turns | 12 |
| Total LLM cost | $0.1188 |
| Total tokens (in / out) | 136,972 / 6,575 |
| Models used | Claude Haiku 4.5 (59 calls) + GPT-4o-mini fallback (19 calls) |
The patient was on tenant-apollo-001. PHI is in the trace bodies; this page contains only aggregate numbers and agent names, with no clinical content.
Material finding: fallback fired on 24% of calls
Curaway's design uses Claude Haiku 4.5 as the primary LLM with GPT-4o-mini as the deterministic fallback (per app/services/llm_gateway.py). For this case, 19 of 78 calls (24%) hit the fallback.
| Model | Calls | % of total |
|---|---|---|
| claude-haiku-4-5-20251001 | 59 | 75.6% |
| gpt-4o-mini-2024-07-18 | 19 | 24.4% |
That fallback rate is more than ten times the <2% target. Two known root causes from the work queue / memories produce this pattern:
- Anthropic credit exhaustion: Session 64 documented this exact mode; the failure is silent and only surfaces when LLM quality degrades. (feedback_anthropic_credits_alert.md)
- Anthropic API outage during the trace window (2026-04-26 10:07–10:33 UTC).
Migration implication: the GCP plan must explicitly cover the fallback path's network plane. If Anthropic moves to Vertex AI Anthropic (in-VPC), the fallback (currently public-internet OpenAI) becomes the only PHI egress to public internet — that's a worse compliance posture for cases that hit fallback heavily, like this one. Decision needed: keep both LLMs on the same network plane, accept fallback as a graceful-degradation path that occasionally crosses the VPC boundary, or invest in a second in-VPC LLM (Vertex AI Gemini / Mistral) as the fallback.
Operational implication regardless of migration: a 24% fallback rate is far above the <2% target. If this case is representative, Anthropic credits or quota need attention. A separate alarm on the Langfuse fallback ratio is worth adding.
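A fallback-ratio alarm of the kind suggested above could be sketched as follows. This is a minimal illustration, not the gateway's actual logic: it assumes each observation is a dict with a `model` field, that the primary model's name starts with `claude-`, and that the 2% threshold mirrors the stated target. The function name and the synthetic observation list are hypothetical.

```python
from collections import Counter

# Threshold taken from the stated target of under 2% fallback (assumption: strict ">").
FALLBACK_TARGET = 0.02


def fallback_ratio(observations, primary_prefix="claude-"):
    """Return the share of observations NOT served by the primary model."""
    if not observations:
        return 0.0
    models = Counter(o.get("model", "") for o in observations)
    primary = sum(n for m, n in models.items() if m.startswith(primary_prefix))
    return 1 - primary / len(observations)


# Synthetic stand-in for this case's 78 observations, using the model names
# and call counts reported in the table above.
case_obs = (
    [{"model": "claude-haiku-4-5-20251001"}] * 59
    + [{"model": "gpt-4o-mini-2024-07-18"}] * 19
)

ratio = fallback_ratio(case_obs)
should_alert = ratio > FALLBACK_TARGET
print(f"fallback ratio: {ratio:.1%}, alert: {should_alert}")
```

In practice the input would be the observations pulled from Langfuse for a time window rather than a single case, so that one noisy case does not trigger the alarm on its own.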
LLM call breakdown by agent
| Agent | Calls | Tokens In | Tokens Out | Cost | p50 latency | p95 latency |
|---|---|---|---|---|---|---|
| triage_agent.conversation | 12 | 39,415 | 1,004 | $0.0312 | 2,192ms | 3,915ms |
| intent_extractor.extract | 12 | 23,945 | 1,335 | $0.0217 | 2,645ms | 5,344ms |
| medical_extractor.extract | 12 | 24,300 | 1,185 | $0.0202 | 1,217ms | 6,321ms |
| travel_extractor.extract | 12 | 16,318 | 825 | $0.0145 | 1,061ms | 3,198ms |
| financial_extractor.extract | 12 | 15,709 | 813 | $0.0139 | 1,203ms | 3,144ms |
| logistics_extractor.extract | 12 | 15,416 | 540 | $0.0123 | 1,365ms | 2,311ms |
| medical_extractor.icd_mapping | 6 | 1,869 | 873 | $0.0051 | 1,435ms | 3,835ms |
| TOTAL | 78 | 136,972 | 6,575 | $0.1188 | n/a | n/a |
Observations
- Triage agent dominates cost (26%). Conversation history accumulates: across 12 turns, average ~3.3K input tokens per call. By turn 12, the context has grown to 5K+ tokens.
- Specialised extractors run on every turn. All four (travel, financial, logistics, medical) execute per-turn unconditionally, 12 calls each. The intent classifier also runs per-turn.
- ICD mapping is conditional. Only 6 calls across 12 turns: it fires only when the medical extractor finds new diagnoses to map.
- Medical extractor's p95 (6.3s) is 5× its p50 (1.2s). Outliers consistent with re-prompts on JSON parse failures. Worth investigating if migration adds Cloud Run cold-start sensitivity.
- Intent extractor's p95 (5.3s) is high relative to its tiny output. It's running on the same conversation history as triage; the p95 is dominated by a single outlier call probably during the Anthropic→GPT fallback transition.
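The cost-share and average-input figures quoted in these observations can be re-derived from the per-agent table. The sketch below hard-codes the table's numbers; the `AGENTS` mapping and variable names are illustrative only and are not part of the Curaway codebase.

```python
# (calls, tokens_in, tokens_out, cost_usd) copied from the per-agent table.
AGENTS = {
    "triage_agent.conversation": (12, 39_415, 1_004, 0.0312),
    "intent_extractor.extract": (12, 23_945, 1_335, 0.0217),
    "medical_extractor.extract": (12, 24_300, 1_185, 0.0202),
    "travel_extractor.extract": (12, 16_318, 825, 0.0145),
    "financial_extractor.extract": (12, 15_709, 813, 0.0139),
    "logistics_extractor.extract": (12, 15_416, 540, 0.0123),
    "medical_extractor.icd_mapping": (6, 1_869, 873, 0.0051),
}

total_calls = sum(calls for calls, _, _, _ in AGENTS.values())
total_in = sum(tok_in for _, tok_in, _, _ in AGENTS.values())
total_out = sum(tok_out for _, _, tok_out, _ in AGENTS.values())
total_cost = sum(cost for _, _, _, cost in AGENTS.values())

# Per-agent share of total cost and average input tokens per call.
for name, (calls, tok_in, _tok_out, cost) in AGENTS.items():
    print(f"{name:32s} share={cost / total_cost:5.1%} avg_in={tok_in / calls:7.0f}")

triage_share = AGENTS["triage_agent.conversation"][3] / total_cost
triage_avg_in = AGENTS["triage_agent.conversation"][1] / 12
```

Because the per-row costs are already rounded to four decimals, their sum can differ from the reported total in the last digit.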
Per-turn fan-out pattern
Each patient chat turn issues 6-7 LLM calls, mostly in parallel:
```mermaid
sequenceDiagram
    autonumber
    participant Patient
    participant Triage as triage_agent
    participant Intent as intent_extractor
    participant Medical as medical_extractor
    participant Financial as financial_extractor
    participant Logistics as logistics_extractor
    participant Travel as travel_extractor
    participant ICD as icd_mapping<br/>(conditional)
    Patient->>Triage: chat turn N
    par parallel extraction layer
        Triage->>Intent: classify turn intent
        Triage->>Medical: extract clinical
        Triage->>Financial: extract budget
        Triage->>Logistics: extract logistics
        Triage->>Travel: extract travel constraints
    end
    opt new diagnoses present
        Medical->>ICD: map to ICD-10
    end
    Triage->>Triage: synthesize layer state
    Triage-->>Patient: agent reply
```
Across this case's 12 chat turns, the totals shake out to:
- 12 × triage_agent.conversation
- 12 × intent_extractor.extract
- 12 × medical_extractor.extract
- 12 × financial_extractor.extract
- 12 × logistics_extractor.extract
- 12 × travel_extractor.extract
- 6 × medical_extractor.icd_mapping (only when new clinical content)
= 78 LLM calls per case at this conversation length. Linear in turn count.
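The linear relationship can be written as a small function. This is a sketch under the assumption, taken from this single trace, that six agents run unconditionally on every turn and that ICD mapping adds one call on each turn that surfaces new diagnoses; `llm_calls_per_case` is a hypothetical helper name.

```python
def llm_calls_per_case(turns: int, icd_turns: int, per_turn_agents: int = 6) -> int:
    """Estimate LLM calls for a case.

    turns           -- number of patient chat turns
    icd_turns       -- turns on which ICD mapping fired (0 <= icd_turns <= turns)
    per_turn_agents -- agents that run unconditionally on every turn
    """
    if turns < 0 or not 0 <= icd_turns <= turns:
        raise ValueError("icd_turns must be between 0 and turns")
    return turns * per_turn_agents + icd_turns


# This case: 12 turns, ICD mapping fired on 6 of them.
this_case = llm_calls_per_case(turns=12, icd_turns=6)
print(this_case)

# Bounds for a hypothetical 20-turn conversation.
print(llm_calls_per_case(20, 0), llm_calls_per_case(20, 20))
```

The lower and upper bounds per turn (6 and 7 calls) match the "6-7 LLM calls" range stated for the fan-out.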
Migration cost model
Per-case cost on current setup (this case)
| Layer | Cost on current platform |
|---|---|
| LLM (mixed Haiku 75% + GPT-4o-mini 25%) | $0.1188 |
| Postgres queries (~50 per case) | ~$0.0005 |
| Neo4j queries (~10 per case) | ~$0.0010 |
| Qdrant queries (~5 per case) | ~$0.0005 |
| Redis operations (~100 per case, mostly hits) | ~$0.0001 |
| R2 storage + bandwidth (1-2 documents) | ~$0.0001 |
| QStash dispatch (1-2 tasks) | $0 (free tier) |
| Langfuse trace export (78 traces) | ~$0.0010 |
| Per-case total | ~$0.122 |
Per-case cost projection on GCP (if no fallback)
If migration includes Vertex AI Anthropic (in-VPC) and fallback is reduced through credit/quota fixes:
| Layer | GCP equivalent | Estimated cost |
|---|---|---|
| LLM (Haiku 4.5 via Vertex AI) — 137K in / 6.6K out | ~5% premium over public Anthropic | ~$0.085 |
| Cloud SQL Postgres (db-custom-1-3840) | similar to today | ~$0.0005 |
| Vector search (Vertex AI Vector Search) | per-query pricing × ~5/case | ~$0.0007 |
| Neo4j (Aura with BAA) | unchanged | ~$0.001 |
| Memorystore (small instance) | similar | ~$0.0001 |
| Cloud Storage (GCS Standard) | similar | ~$0.0001 |
| Cloud Tasks (free tier) | $0 | $0 |
| Cloud Logging / self-host Langfuse | similar | ~$0.001 |
| Per-case total (Haiku-only) | | ~$0.088 |
Per-case cost projection on GCP (with current 24% fallback)
If fallback rate stays at this case's level:
LLM cost = 0.76 × $0.090 (Haiku via Vertex) + 0.24 × $0.038 (GPT public) ≈ $0.078
+ infra ~$0.003
= ~$0.081 per case
That is cheaper than both today's per-case total (~$0.122) and the Haiku-only projection (~$0.088), since GPT-4o-mini is cheaper than Haiku for input-heavy workloads. But the network-plane cost (PHI crossing public internet on fallback) does not show up in the dollar figure.
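The blended estimate is a weighted average of the two per-case LLM costs. The sketch below takes its inputs from the formula as written ($0.090 for Haiku via Vertex, $0.038 for GPT-4o-mini, ~$0.003 infra); note that the Haiku-only table uses ~$0.085 for the same line item, so the result shifts slightly depending on which figure is used. The function name is illustrative.

```python
def blended_llm_cost(fallback_rate: float, primary_cost: float, fallback_cost: float) -> float:
    """Weighted per-case LLM cost when a share of calls goes to the fallback model."""
    if not 0.0 <= fallback_rate <= 1.0:
        raise ValueError("fallback_rate must be in [0, 1]")
    return (1 - fallback_rate) * primary_cost + fallback_rate * fallback_cost


INFRA_PER_CASE = 0.003  # approximate non-LLM cost per case from the projection

llm = blended_llm_cost(fallback_rate=0.24, primary_cost=0.090, fallback_cost=0.038)
total = llm + INFRA_PER_CASE
print(f"LLM: ${llm:.4f}  total: ${total:.4f}")

# Sensitivity: the same formula at the <2% target fallback rate.
at_target = blended_llm_cost(0.02, 0.090, 0.038) + INFRA_PER_CASE
print(f"total at 2% fallback: ${at_target:.4f}")
```

Sweeping `fallback_rate` this way makes it easy to see how much of the projected saving depends on fallback traffic that would still cross the public internet.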
At scale (Haiku-primary)
| Volume | Today/month | GCP/month |
|---|---|---|
| 100 cases | $12.20 | $8.80 |
| 1,000 cases | $122 | $88 |
| 10,000 cases | $1,220 | $880 |
| 100,000 cases | $12,200 | $8,800 |
The cost is dominated by LLM inference, not infrastructure. GCP infra costs (Cloud SQL, Memorystore, Cloud Run, Cloud Tasks, GCS) are noise relative to Anthropic/OpenAI billing. Migration won't save infra spend; it changes compliance posture.
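The scale table is a straight multiplication of the per-case totals by monthly case volume. A minimal sketch, assuming the ~$0.122 and ~$0.088 per-case figures from the two tables above and no volume discounts or fixed monthly costs:

```python
PER_CASE_TODAY = 0.122  # current platform, per-case total
PER_CASE_GCP = 0.088    # GCP projection, Haiku-only


def monthly_cost(cases: int, per_case: float) -> float:
    """Linear monthly cost: cases x per-case cost, rounded to cents."""
    return round(cases * per_case, 2)


for volume in (100, 1_000, 10_000, 100_000):
    today = monthly_cost(volume, PER_CASE_TODAY)
    gcp = monthly_cost(volume, PER_CASE_GCP)
    print(f"{volume:>7,} cases  today=${today:>10,.2f}  gcp=${gcp:>10,.2f}")
```

Fixed costs such as a warm Cloud Run instance are not included, which is consistent with the point that infrastructure is small relative to LLM billing at these volumes.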
Cold-start sensitivity (qualitative, needs separate measurement)
Numbers below are estimates for migration sizing decisions; refresh with real Cloud Run measurements after the lift.
| Scenario | Wall-clock impact on first turn |
|---|---|
| Both API + worker Cloud Run instances warm | 0ms baseline |
| API cold (first auth call after idle) | +500-2000ms (JWKS fetch + container init) |
| Worker cold (first extraction job after idle) | +1000-4000ms (cold container + first DB connection) |
| Cold + Anthropic fallback fires | +500-2000ms additional (fallback path latency) |
Mitigation: min-instances=1 on the API service costs ~$5-10/mo and eliminates user-facing cold starts. Worker cold start is acceptable since extraction is async (patient sees "processing" UI).
Compliance note: Langfuse on the HIPAA cloud
This Langfuse instance is hipaa.cloud.langfuse.com — Langfuse's HIPAA-compliant tier, not the standard cloud.langfuse.com. This is a positive finding for the BAA review:
- Langfuse offers BAAs on the HIPAA tier
- Trace bodies (containing PHI: clinical extractions, patient messages) are stored on HIPAA-compliant infrastructure
- This may eliminate the "self-host Langfuse on GKE" option from the data flow map decision matrix — the BAA path on Langfuse Cloud HIPAA may be sufficient
Action for compliance review: verify the Langfuse BAA scope covers Curaway's trace volume + retention requirements, and that the HIPAA tier's controls (encryption, audit, deletion) match Curaway's needs.
Org-wide context (last 31 days)
While pulling this case's data, we also captured an organisation-wide aggregate:
| Metric (org-wide, 2026-03-28 to 2026-04-28) | Value |
|---|---|
| Total LLM observations across all cases | 15,936 |
| Total LLM cost | $63.12 |
| Avg per-day cost | ~$2.04 |
| Estimated cases in this period (78 calls/case) | ~204 cases |
| Estimated avg cost per case | ~$0.31 |
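The estimated average per case ($0.31) is higher than this single case ($0.119), likely because some cases involve longer conversations, document re-extraction, or repeated matching attempts. Because the case count is itself derived by assuming 78 calls per case, the $0.31 figure is equivalent to the org-wide cost per call multiplied by 78, so it is sensitive to that assumption. Use the org-wide average as the migration cost-projection baseline rather than this single trace.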
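The org-wide estimates can be reproduced from the four raw numbers in this section. The sketch below also computes cost per call, which shows where the gap between the org-wide average and this case comes from. Variable names are illustrative, and the 78 calls/case divisor is the same assumption the table makes.

```python
ORG_OBSERVATIONS = 15_936
ORG_COST = 63.12
DAYS = 31
CASE_OBSERVATIONS = 78
CASE_COST = 0.1188

avg_per_day = ORG_COST / DAYS
est_cases = ORG_OBSERVATIONS / CASE_OBSERVATIONS  # assumes 78 calls per case
est_cost_per_case = ORG_COST / est_cases

# Cost per individual LLM call, org-wide vs this case.
org_cost_per_call = ORG_COST / ORG_OBSERVATIONS
case_cost_per_call = CASE_COST / CASE_OBSERVATIONS
per_call_ratio = org_cost_per_call / case_cost_per_call

print(f"avg/day: ${avg_per_day:.2f}")
print(f"estimated cases: {est_cases:.0f}")
print(f"estimated cost/case: ${est_cost_per_case:.2f}")
print(f"cost/call org-wide: ${org_cost_per_call:.4f}, this case: ${case_cost_per_call:.4f}")
print(f"org-wide calls cost {per_call_ratio:.1f}x this case's calls")
```

If the true number of calls per case differs from 78, the estimated case count and per-case cost change with it, while the per-call figures do not.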
How this trace was pulled (for refreshes)
```bash
# Auth from Railway env
LF_HOST=$(railway variables --kv | grep '^LANGFUSE_HOST=' | cut -d= -f2-)
LF_PUB=$(railway variables --kv | grep '^LANGFUSE_PUBLIC_KEY=' | cut -d= -f2-)
LF_SEC=$(railway variables --kv | grep '^LANGFUSE_SECRET_KEY=' | cut -d= -f2-)
CASE_ID=147f70ac-fa12-4a03-9eca-1a87ca801ccc
mkdir -p /tmp/lf_trace

# 1. List all traces for the case (sessionId = case_id) and save them for step 3
curl -s -u "$LF_PUB:$LF_SEC" \
  "$LF_HOST/api/public/traces?sessionId=$CASE_ID&limit=100" \
  -o /tmp/lf_trace/traces.json

# 2. Pull all org-wide observations (paged, no sessionId filter; works around
#    the rate limit on per-trace fetches)
for page in $(seq 1 200); do
  curl -s -u "$LF_PUB:$LF_SEC" \
    "$LF_HOST/api/public/observations?type=GENERATION&page=$page&limit=100" \
    -o "/tmp/obs_page_$page.json"
  sleep 0.5
done

# 3. Filter local data by traceId in case's trace list
python3 -c "
import json, glob
payload = json.load(open('/tmp/lf_trace/traces.json'))
# Accept either a 'data' or a 'traces' top-level key, depending on how the list was saved
case_ids = {t['id'] for t in (payload.get('data') or payload.get('traces', []))}
all_obs = []
for f in glob.glob('/tmp/obs_page_*.json'):
    all_obs += json.load(open(f)).get('data', [])
case_obs = [o for o in all_obs if o.get('traceId') in case_ids]
print(f'Found {len(case_obs)} observations for the case')
"
```
The HIPAA-compliant Langfuse cloud (hipaa.cloud.langfuse.com) has stricter rate limits on individual /traces/{id} fetches than on bulk /observations paging. Bulk + local filter is faster.
To refresh this page on a new case
Replace the case_id in the script above; rerun. The structure of this page (per-agent table, fan-out diagram, cost model) generalises — only the numbers change. Worth running again for:
- A case that involved document upload + extraction (this case had no documents in the captured spans — pure conversational intake)
- A case that completed matching + explanation (this case stopped at intake)
- A case during steady Anthropic (no fallback) for the cleanest cost baseline