Runtime Trace Example — Case `147f70ac…`¶

This page documents one real production case end-to-end as captured by Langfuse, annotated for migration cost modelling. It provides the empirical grounding for what the sequence diagrams describe abstractly.

Status: ✅ Populated 2026-04-29 from a complete Langfuse pull (all 78 LLM observations captured for this case).

Case summary¶

Field	Value
Case ID	`147f70ac-fa12-4a03-9eca-1a87ca801ccc`
Patient tenant	`tenant-apollo-001`
App URL	`https://app.curaway.ai/app/case/147f70ac-fa12-4a03-9eca-1a87ca801ccc`
First LLM call	`2026-04-26 10:07:26 UTC`
Last LLM call	`2026-04-26 10:32:57 UTC`
Wall-clock (LLM activity)	25 min 31 sec
Total LLM observations	78
Patient chat turns	12
Total LLM cost	$0.1188
Total tokens (in / out)	136,972 / 6,575
Models used	Claude Haiku 4.5 (59 calls) + GPT-4o-mini fallback (19 calls)

The patient was on tenant-apollo-001. PHI is in the trace bodies; this page only contains aggregate numbers and agent names — no clinical content.

Material finding — fallback fired on 24% of calls¶

Curaway's design uses Claude Haiku 4.5 as the primary LLM with GPT-4o-mini as the deterministic fallback (per app/services/llm_gateway.py). For this case, 19 of 78 calls (24%) hit the fallback.

Model                            Calls   % of total
─────────────────────────────────────────────────────
claude-haiku-4-5-20251001           59      75.6%
gpt-4o-mini-2024-07-18              19      24.4%

That fallback rate is more than 12× the system's <2% target. Two known causes from the work queue / memories can produce this pattern:

Anthropic credit exhaustion — Session 64 documented this exact mode; the failure is silent and only surfaces when LLM quality degrades. (feedback_anthropic_credits_alert.md)
Anthropic API outage during the trace window (2026-04-26 10:07–10:33 UTC).

Migration implication: the GCP plan must explicitly cover the fallback path's network plane. If the primary Anthropic calls move to Vertex AI (in-VPC), the fallback (currently OpenAI over the public internet) becomes the only PHI egress to the public internet. That is a worse compliance posture for cases that hit fallback heavily, like this one. Decision needed, one of: keep both LLMs on the same network plane; accept fallback as a graceful-degradation path that occasionally crosses the VPC boundary; or invest in a second in-VPC LLM (Vertex AI Gemini / Mistral) as the fallback.

Operational implication regardless of migration: the 24% fallback rate is well outside the normal range. If this case is representative, Anthropic credits or quota need attention, and a separate alarm on the Langfuse fallback ratio is worth adding.
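
A fallback-ratio check of the kind suggested above could be sketched as follows. This is a minimal illustration, not Curaway's alerting code: the function name, the `model` key on each observation dict, and the prefix match used to recognise the primary model are all assumptions. Only the 2% threshold and the 59 / 19 split come from this page.

```python
from typing import Iterable, Mapping

PRIMARY_PREFIX = "claude-haiku"  # assumption: primary model names start with this
TARGET_FALLBACK_RATIO = 0.02     # the <2% target quoted on this page


def fallback_ratio(observations: Iterable[Mapping[str, str]]) -> float:
    """Return the share of observations served by a non-primary model."""
    total = 0
    fallback = 0
    for obs in observations:
        total += 1
        if not obs.get("model", "").startswith(PRIMARY_PREFIX):
            fallback += 1
    return fallback / total if total else 0.0


# Reproduce this case: 59 Haiku calls and 19 GPT-4o-mini calls.
case = (
    [{"model": "claude-haiku-4-5-20251001"}] * 59
    + [{"model": "gpt-4o-mini-2024-07-18"}] * 19
)
ratio = fallback_ratio(case)
should_alert = ratio > TARGET_FALLBACK_RATIO
print(f"fallback ratio: {ratio:.1%}, alert: {should_alert}")
```

For this case the ratio is 19 / 78, so a check against the 2% target would fire.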

LLM call breakdown by agent¶

Agent	Calls	Tokens In	Tokens Out	Cost	p50 latency	p95 latency
`triage_agent.conversation`	12	39,415	1,004	$0.0312	2,192ms	3,915ms
`intent_extractor.extract`	12	23,945	1,335	$0.0217	2,645ms	5,344ms
`medical_extractor.extract`	12	24,300	1,185	$0.0202	1,217ms	6,321ms
`travel_extractor.extract`	12	16,318	825	$0.0145	1,061ms	3,198ms
`financial_extractor.extract`	12	15,709	813	$0.0139	1,203ms	3,144ms
`logistics_extractor.extract`	12	15,416	540	$0.0123	1,365ms	2,311ms
`medical_extractor.icd_mapping`	6	1,869	873	$0.0051	1,435ms	3,835ms
TOTAL	78	136,972	6,575	$0.1188	—	—

Observations¶

Triage agent is the largest single cost (26%). Conversation history accumulates: across 12 turns, input averages ~3.3K tokens per call, and by turn 12 the context has grown to 5K+ tokens.
Specialised extractors run on every turn. All four (travel, financial, logistics, medical) execute per-turn unconditionally, 12 calls each. The intent extractor (`intent_extractor.extract`) also runs per-turn.
ICD mapping is conditional. Only 6 calls vs 12 turns — fires only when the medical extractor finds new diagnoses to map.
Medical extractor's p95 (6.3s) is 5× its p50 (1.2s). Outliers consistent with re-prompts on JSON parse failures. Worth investigating if migration adds Cloud Run cold-start sensitivity.
Intent extractor's p95 (5.3s) is high relative to its tiny output. It's running on the same conversation history as triage; the p95 is dominated by a single outlier call probably during the Anthropic→GPT fallback transition.
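
The per-agent table can be rebuilt from raw observations with a small aggregation. The sketch below is illustrative only: the dict keys (`name`, `tokens_in`, `tokens_out`, `cost`, `latency_ms`) are hypothetical and would need mapping from the actual Langfuse observation fields, and the nearest-rank percentile is one common definition, not necessarily the one Langfuse uses.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile; pct is in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize_by_agent(observations):
    """Group observations by agent name and compute per-agent totals."""
    groups = defaultdict(list)
    for obs in observations:
        groups[obs["name"]].append(obs)
    summary = {}
    for name, rows in groups.items():
        latencies = [r["latency_ms"] for r in rows]
        summary[name] = {
            "calls": len(rows),
            "tokens_in": sum(r["tokens_in"] for r in rows),
            "tokens_out": sum(r["tokens_out"] for r in rows),
            "cost": sum(r["cost"] for r in rows),
            "p50_ms": percentile(latencies, 50),
            "p95_ms": percentile(latencies, 95),
        }
    return summary


# Made-up sample rows, only to show the shape of the input and output.
sample = [
    {"name": "demo_agent", "tokens_in": 100, "tokens_out": 10, "cost": 0.001, "latency_ms": 1000},
    {"name": "demo_agent", "tokens_in": 200, "tokens_out": 20, "cost": 0.002, "latency_ms": 3000},
    {"name": "demo_agent", "tokens_in": 300, "tokens_out": 30, "cost": 0.003, "latency_ms": 2000},
]
stats = summarize_by_agent(sample)["demo_agent"]
print(stats)
```

Under the nearest-rank definition, a p95 over only 12 calls is simply the slowest call, which is why a single outlier can set an agent's p95 in a case of this size.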

Per-turn fan-out pattern¶

Each patient chat turn issues 6-7 LLM calls, mostly in parallel:

sequenceDiagram
    autonumber
    participant Patient
    participant Triage as triage_agent
    participant Intent as intent_extractor
    participant Medical as medical_extractor
    participant Financial as financial_extractor
    participant Logistics as logistics_extractor
    participant Travel as travel_extractor
    participant ICD as icd_mapping<br/>(conditional)

    Patient->>Triage: chat turn N
    par parallel extraction layer
        Triage->>Intent: classify turn intent
        Triage->>Medical: extract clinical
        Triage->>Financial: extract budget
        Triage->>Logistics: extract logistics
        Triage->>Travel: extract travel constraints
    end
    opt new diagnoses present
        Medical->>ICD: map to ICD-10
    end
    Triage->>Triage: synthesize layer state
    Triage-->>Patient: agent reply

Across this case's 12 chat turns, the totals shake out to:
- 12 × triage_agent.conversation
- 12 × intent_extractor.extract
- 12 × medical_extractor.extract
- 12 × financial_extractor.extract
- 12 × logistics_extractor.extract
- 12 × travel_extractor.extract
- 6 × medical_extractor.icd_mapping (only when new clinical content)

= 78 LLM calls per case at this conversation length. Linear in turn count.
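
The linear relationship can be written down directly. This is a small sketch with hypothetical names; it assumes, as the diagram suggests, that ICD mapping fires at most once per turn.

```python
PER_TURN_AGENTS = 6  # triage, intent, medical, financial, logistics, travel


def expected_llm_calls(turns: int, icd_mapping_calls: int) -> int:
    """Calls = 6 unconditional calls per turn + conditional ICD-mapping calls."""
    if not 0 <= icd_mapping_calls <= turns:
        raise ValueError("assumed: ICD mapping fires at most once per turn")
    return PER_TURN_AGENTS * turns + icd_mapping_calls


print(expected_llm_calls(turns=12, icd_mapping_calls=6))  # 78
```

Under these assumptions a 12-turn case makes between 72 and 84 calls, depending on how often new diagnoses appear.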

Migration cost model¶

Per-case cost on current setup (this case)¶

Layer	Cost on current platform
LLM (mixed Haiku 76% + GPT-4o-mini 24%)	$0.1188
Postgres queries (~50 per case)	~$0.0005
Neo4j queries (~10 per case)	~$0.0010
Qdrant queries (~5 per case)	~$0.0005
Redis operations (~100 per case, mostly hits)	~$0.0001
R2 storage + bandwidth (1-2 documents)	~$0.0001
QStash dispatch (1-2 tasks)	$0 (free tier)
Langfuse trace export (78 observations)	~$0.0010
Per-case total	~$0.122

Per-case cost projection on GCP (if no fallback)¶

If migration includes Vertex AI Anthropic (in-VPC) and fallback is reduced through credit/quota fixes:

Layer	GCP equivalent	Estimated cost
LLM (Haiku 4.5 via Vertex AI) — 137K in / 6.6K out	~5% premium over public Anthropic	~$0.085
Cloud SQL Postgres (db-custom-1-3840)	similar to today	~$0.0005
Vector search (Vertex AI Vector Search)	per-query pricing × ~5/case	~$0.0007
Neo4j (Aura with BAA)	unchanged	~$0.001
Memorystore (small instance)	similar	~$0.0001
Cloud Storage (GCS Standard)	similar	~$0.0001
Cloud Tasks (free tier)	$0	$0
Cloud Logging / self-host Langfuse	similar	~$0.001
Per-case total (Haiku-only)		~$0.088

Per-case cost projection on GCP (with current 24% fallback)¶

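If the fallback rate stays at this case's level (note that this calculation uses $0.090 for Haiku via Vertex, slightly above the ~$0.085 in the Haiku-only table):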

LLM cost = 0.76 × $0.090 (Haiku via Vertex)  +  0.24 × $0.038 (GPT public) = $0.078
+ infra ~$0.003
= ~$0.081 per case

That is cheaper than the Haiku-only projection (~$0.088), since GPT-4o-mini costs less than Haiku on input-heavy workloads, and cheaper than today's ~$0.122. But the network-plane cost (PHI crossing the public internet on fallback) doesn't show up in the dollar number.
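
The blended estimate above is a weighted average and can be reproduced in a few lines. The function name is hypothetical; the shares, the per-case model costs, and the ~$0.003 infra figure are the ones used in the calculation above.

```python
def blended_llm_cost(primary_share: float, primary_cost: float, fallback_cost: float) -> float:
    """Weighted per-case LLM cost for a given primary/fallback split."""
    return primary_share * primary_cost + (1 - primary_share) * fallback_cost


INFRA_PER_CASE = 0.003  # approximate infra cost per case, as used above

llm = blended_llm_cost(primary_share=0.76, primary_cost=0.090, fallback_cost=0.038)
total = llm + INFRA_PER_CASE
print(f"LLM: ${llm:.4f}, per case: ${total:.4f}")
```

Varying `primary_share` shows how sensitive the per-case figure is to the fallback rate: at a share of 1.0 the LLM cost is just the Haiku figure.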

At scale (Haiku-primary)¶

Volume	Today/month	GCP/month
100 cases	$12.20	$8.80
1,000 cases	$122	$88
10,000 cases	$1,220	$880
100,000 cases	$12,200	$8,800

The cost is dominated by LLM inference, not infrastructure. GCP infra costs (Cloud SQL, Memorystore, Cloud Run, Cloud Tasks, GCS) are noise relative to Anthropic/OpenAI billing. Migration won't save infra spend; it changes compliance posture.
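
The scale table is a straight multiplication of the two per-case totals. The sketch below makes that explicit; it assumes cost stays linear in case count (no volume discounts, no fixed monthly infra charges), which is a simplification.

```python
PER_CASE_TODAY = 0.122  # per-case total on the current platform
PER_CASE_GCP = 0.088    # Haiku-only per-case projection on GCP


def monthly_cost(cases_per_month: int, per_case: float) -> float:
    """Linear projection: monthly cost = cases × per-case cost."""
    return cases_per_month * per_case


for volume in (100, 1_000, 10_000, 100_000):
    today = monthly_cost(volume, PER_CASE_TODAY)
    gcp = monthly_cost(volume, PER_CASE_GCP)
    print(f"{volume:>7,} cases  today ${today:>10,.2f}  GCP ${gcp:>10,.2f}")
```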

Cold-start sensitivity (qualitative — needs separate measurement)¶

Numbers below are estimates for migration sizing decisions; refresh with real Cloud Run measurements after the lift.

Scenario	Wall-clock impact on first turn
Both API + worker Cloud Run instances warm	0ms baseline
API cold (first auth call after idle)	+500-2000ms (JWKS fetch + container init)
Worker cold (first extraction job after idle)	+1000-4000ms (cold container + first DB connection)
Cold + Anthropic fallback fires	+500-2000ms additional (fallback path latency)

Mitigation: min-instances=1 on the API service costs ~$5-10/mo and eliminates user-facing cold starts. Worker cold start is acceptable since extraction is async (patient sees "processing" UI).

Compliance note — Langfuse on the HIPAA cloud¶

This Langfuse instance is hipaa.cloud.langfuse.com — Langfuse's HIPAA-compliant tier, not the standard cloud.langfuse.com. This is a positive finding for the BAA review:

Langfuse offers BAAs on the HIPAA tier
Trace bodies (containing PHI: clinical extractions, patient messages) are stored on HIPAA-compliant infrastructure
This may eliminate the "self-host Langfuse on GKE" option from the data flow map decision matrix — the BAA path on Langfuse Cloud HIPAA may be sufficient

Action for compliance review: verify the Langfuse BAA scope covers Curaway's trace volume + retention requirements, and that the HIPAA tier's controls (encryption, audit, deletion) match Curaway's needs.

Org-wide context (last 31 days)¶

While pulling this case's data, an organisation-wide aggregate was also captured:

Metric (org-wide, 2026-03-28 to 2026-04-28)	Value
Total LLM observations across all cases	15,936
Total LLM cost	$63.12
Avg per-day cost	~$2.04
Estimated cases in this period (78 calls/case)	~204 cases
Estimated avg cost per case	~$0.31

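The estimated average per case (~$0.31) is about 2.6× this single case ($0.119). Because the case count is itself derived by assuming 78 calls per case, the gap reflects a higher average cost per observation org-wide (~$0.0040 vs ~$0.0015 here), likely from longer conversations with larger contexts, document re-extraction, or repeated matching attempts. If typical cases also make more than 78 calls, the true case count is lower and the per-case average higher still. Use the org-wide average as the migration cost-projection baseline rather than this single trace.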
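
The org-wide figures follow from the totals in the table; the short calculation below reproduces them and also derives cost per observation, which does not depend on the 78-calls-per-case assumption. Variable names are illustrative.

```python
ORG_OBSERVATIONS = 15_936
ORG_COST = 63.12
DAYS = 31
CALLS_PER_CASE = 78   # taken from this single case; an assumption org-wide
CASE_COST = 0.1188

est_cases = ORG_OBSERVATIONS / CALLS_PER_CASE
avg_cost_per_case = ORG_COST / est_cases
cost_per_obs_org = ORG_COST / ORG_OBSERVATIONS
cost_per_obs_case = CASE_COST / CALLS_PER_CASE

print(f"per day:            ${ORG_COST / DAYS:.2f}")
print(f"estimated cases:    {est_cases:.0f}")
print(f"avg cost per case:  ${avg_cost_per_case:.2f}")
print(f"cost per obs (org): ${cost_per_obs_org:.4f}")
print(f"cost per obs (case):${cost_per_obs_case:.4f}")
```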

How this trace was pulled (for refreshes)¶

# Auth from Railway env
LF_HOST=$(railway variables --kv | grep '^LANGFUSE_HOST=' | cut -d= -f2-)
LF_PUB=$(railway variables --kv | grep '^LANGFUSE_PUBLIC_KEY=' | cut -d= -f2-)
LF_SEC=$(railway variables --kv | grep '^LANGFUSE_SECRET_KEY=' | cut -d= -f2-)

# 1. List all traces for the case (sessionId = case_id); save the response for step 3
mkdir -p /tmp/lf_trace
curl -s -u "$LF_PUB:$LF_SEC" \
  "$LF_HOST/api/public/traces?sessionId=147f70ac-fa12-4a03-9eca-1a87ca801ccc&limit=100" \
  -o /tmp/lf_trace/traces.json

# 2. Pull all org-wide observations (paged, no sessionId filter; works around
#    the rate limit on per-trace fetches)
for page in $(seq 1 200); do
  curl -s -u "$LF_PUB:$LF_SEC" \
    "$LF_HOST/api/public/observations?type=GENERATION&page=$page&limit=100" \
    -o /tmp/obs_page_$page.json
  sleep 0.5
done

# 3. Filter local data by traceId in case's trace list
python3 -c "
import json, glob
# The traces list response keeps its items under 'data', like the observations pages
case_ids = {t['id'] for t in json.load(open('/tmp/lf_trace/traces.json'))['data']}
all_obs = []
for f in glob.glob('/tmp/obs_page_*.json'):
    all_obs += json.load(open(f)).get('data', [])
case_obs = [o for o in all_obs if o.get('traceId') in case_ids]
print(f'Found {len(case_obs)} observations for the case')
"

The HIPAA-compliant Langfuse cloud (hipaa.cloud.langfuse.com) has stricter rate limits on individual /traces/{id} fetches than on bulk /observations paging. Bulk + local filter is faster.
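
The bulk-then-filter approach can also be expressed as two small functions, which makes it easy to stop paging at the first empty page instead of always requesting a fixed number of pages. This is a sketch under assumptions: `fetch_page` is a hypothetical callable standing in for the HTTP request, and the `data` / `traceId` keys mirror the script above.

```python
from typing import Callable, Dict, List, Set


def collect_observations(fetch_page: Callable[[int], Dict], max_pages: int = 200) -> List[Dict]:
    """Page through a bulk endpoint, stopping at the first empty page."""
    collected: List[Dict] = []
    for page in range(1, max_pages + 1):
        rows = fetch_page(page).get("data", [])
        if not rows:
            break
        collected.extend(rows)
    return collected


def filter_by_trace(observations: List[Dict], trace_ids: Set[str]) -> List[Dict]:
    """Keep only observations whose traceId belongs to the case."""
    return [o for o in observations if o.get("traceId") in trace_ids]


# Stand-in fetcher with fake data; a real one would call the observations endpoint.
fake_pages = {
    1: {"data": [{"id": "a", "traceId": "t1"}, {"id": "b", "traceId": "t2"}]},
    2: {"data": [{"id": "c", "traceId": "t1"}]},
}
all_obs = collect_observations(lambda page: fake_pages.get(page, {"data": []}))
case_obs = filter_by_trace(all_obs, {"t1"})
print(len(all_obs), len(case_obs))
```

Passing the fetcher in as a parameter keeps rate-limit handling (sleeps, retries) in one place and lets the paging logic be tested without network access.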

To refresh this page on a new case¶

Replace the case_id in the script above; rerun. The structure of this page (per-agent table, fan-out diagram, cost model) generalises — only the numbers change. Worth running again for:

A case that involved document upload + extraction (this case had no documents in the captured spans — pure conversational intake)
A case that completed matching + explanation (this case stopped at intake)
A case during steady Anthropic (no fallback) for the cleanest cost baseline
