Runtime Trace Example — Case 147f70ac…

This page documents one real production case end-to-end as captured by Langfuse, annotated for migration cost modelling. It is the empirical grounding for what the sequence diagrams describe abstractly.

Status: ✅ Populated 2026-04-29 from a complete Langfuse pull (all 78 LLM observations captured for this case).


Case summary

Field                       Value
──────────────────────────────────────────────────────────────────────────────────────────────────
Case ID                     147f70ac-fa12-4a03-9eca-1a87ca801ccc
Patient tenant              tenant-apollo-001
App URL                     https://app.curaway.ai/app/case/147f70ac-fa12-4a03-9eca-1a87ca801ccc
First LLM call              2026-04-26 10:07:26 UTC
Last LLM call               2026-04-26 10:32:57 UTC
Wall-clock (LLM activity)   25 min 31 sec
Total LLM observations      78
Patient chat turns          12
Total LLM cost              $0.1188
Total tokens (in / out)     136,972 / 6,575
Models used                 Claude Haiku 4.5 (59 calls) + GPT-4o-mini fallback (19 calls)

The patient was on tenant-apollo-001. PHI is in the trace bodies; this page only contains aggregate numbers and agent names — no clinical content.


Material finding — fallback fired on 24% of calls

Curaway's design uses Claude Haiku 4.5 as the primary LLM with GPT-4o-mini as the deterministic fallback (per app/services/llm_gateway.py). For this case, 19 of 78 calls (24%) hit the fallback.

Model                            Calls   % of total
─────────────────────────────────────────────────────
claude-haiku-4-5-20251001           59      75.6%
gpt-4o-mini-2024-07-18              19      24.4%

That fallback rate is more than ten times the <2% target. Two known root causes from the work queue / memories produce this pattern:

  1. Anthropic credit exhaustion — Session 64 documented this exact mode; the failure is silent and only surfaces when LLM quality degrades. (feedback_anthropic_credits_alert.md)
  2. Anthropic API outage during the trace window (2026-04-26 10:07–10:33 UTC).

Migration implication: the GCP plan must explicitly cover the fallback path's network plane. If Anthropic moves to Vertex AI Anthropic (in-VPC), the fallback (currently public-internet OpenAI) becomes the only PHI egress to public internet — that's a worse compliance posture for cases that hit fallback heavily, like this one. Decision needed: keep both LLMs on the same network plane, accept fallback as a graceful-degradation path that occasionally crosses the VPC boundary, or invest in a second in-VPC LLM (Vertex AI Gemini / Mistral) as the fallback.

Operational implication regardless of migration: a 24% fallback rate is far outside the normal range. If this case is representative, Anthropic credits or quota need attention. A separate alarm on the Langfuse fallback ratio is worth adding.
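
A minimal sketch of such a fallback-ratio check, assuming each Langfuse observation has been loaded as a dict with a `model` field (as in the model table above). The function name, the `gpt-` prefix test, and the way the <2% target is wired in are illustrative assumptions, not Curaway's actual alerting code.

```python
from collections import Counter

# Assumption: fallback calls can be recognised by model-name prefix.
FALLBACK_PREFIXES = ("gpt-",)
TARGET_MAX_RATIO = 0.02  # the <2% target stated on this page


def fallback_ratio(observations):
    """Return the fraction of observations served by a fallback model."""
    models = Counter(o.get("model") or "" for o in observations)
    total = sum(models.values())
    if total == 0:
        return 0.0
    fallback = sum(n for m, n in models.items() if m.startswith(FALLBACK_PREFIXES))
    return fallback / total


# Reproduce this case's split: 59 Haiku calls and 19 GPT-4o-mini calls.
case_obs = (
    [{"model": "claude-haiku-4-5-20251001"}] * 59
    + [{"model": "gpt-4o-mini-2024-07-18"}] * 19
)
ratio = fallback_ratio(case_obs)
should_alert = ratio > TARGET_MAX_RATIO
print(f"fallback ratio = {ratio:.1%}, alert = {should_alert}")
```

Running the same function over a rolling window of org-wide observations, rather than a single case, would likely give a steadier signal for an alarm.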


LLM call breakdown by agent

Agent                           Calls  Tokens In  Tokens Out     Cost  p50 latency  p95 latency
───────────────────────────────────────────────────────────────────────────────────────────────
triage_agent.conversation          12     39,415       1,004  $0.0312      2,192ms      3,915ms
intent_extractor.extract           12     23,945       1,335  $0.0217      2,645ms      5,344ms
medical_extractor.extract          12     24,300       1,185  $0.0202      1,217ms      6,321ms
travel_extractor.extract           12     16,318         825  $0.0145      1,061ms      3,198ms
financial_extractor.extract        12     15,709         813  $0.0139      1,203ms      3,144ms
logistics_extractor.extract        12     15,416         540  $0.0123      1,365ms      2,311ms
medical_extractor.icd_mapping       6      1,869         873  $0.0051      1,435ms      3,835ms
TOTAL                              78    136,972       6,575  $0.1188

Observations

  • Triage agent dominates cost (26%). Conversation history accumulates: across 12 turns, average ~3.3K input tokens per call. By turn 12, the context has grown to 5K+ tokens.
  • Specialised extractors run on every turn. All four (travel, financial, logistics, medical) execute per-turn unconditionally — 12 calls each. Intent classifier also runs per-turn.
  • ICD mapping is conditional. Only 6 calls vs 12 turns — fires only when the medical extractor finds new diagnoses to map.
  • Medical extractor's p95 (6.3s) is 5× its p50 (1.2s). Outliers consistent with re-prompts on JSON parse failures. Worth investigating if migration adds Cloud Run cold-start sensitivity.
  • Intent extractor's p95 (5.3s) is high relative to its tiny output. It's running on the same conversation history as triage; the p95 is dominated by a single outlier call probably during the Anthropic→GPT fallback transition.
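
The shares and averages quoted in these observations follow directly from the per-agent table. A short sketch that recomputes them from the table's own numbers (the dict layout and variable names are illustrative):

```python
# Per-agent cost in USD, copied from the breakdown table.
agent_cost = {
    "triage_agent.conversation": 0.0312,
    "intent_extractor.extract": 0.0217,
    "medical_extractor.extract": 0.0202,
    "travel_extractor.extract": 0.0145,
    "financial_extractor.extract": 0.0139,
    "logistics_extractor.extract": 0.0123,
    "medical_extractor.icd_mapping": 0.0051,
}
reported_total = 0.1188  # the table's TOTAL row

# Share of total cost per agent, largest first.
shares = {agent: cost / reported_total for agent, cost in agent_cost.items()}
for agent, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{agent:<32}{share:6.1%}")

# Average triage input per call: 39,415 tokens over 12 calls.
avg_triage_input = 39_415 / 12
# Medical extractor tail ratio: p95 over p50, both in milliseconds.
medical_tail_ratio = 6_321 / 1_217
print(f"triage avg input tokens/call: {avg_triage_input:,.0f}")
print(f"medical p95/p50: {medical_tail_ratio:.1f}x")
```

The per-agent costs sum to within rounding of the reported total, which is why the shares are computed against the TOTAL row rather than the sum.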

Per-turn fan-out pattern

Each patient chat turn issues 6-7 LLM calls, mostly in parallel:

sequenceDiagram
    autonumber
    participant Patient
    participant Triage as triage_agent
    participant Intent as intent_extractor
    participant Medical as medical_extractor
    participant Financial as financial_extractor
    participant Logistics as logistics_extractor
    participant Travel as travel_extractor
    participant ICD as icd_mapping<br/>(conditional)

    Patient->>Triage: chat turn N
    par parallel extraction layer
        Triage->>Intent: classify turn intent
        Triage->>Medical: extract clinical
        Triage->>Financial: extract budget
        Triage->>Logistics: extract logistics
        Triage->>Travel: extract travel constraints
    end
    opt new diagnoses present
        Medical->>ICD: map to ICD-10
    end
    Triage->>Triage: synthesize layer state
    Triage-->>Patient: agent reply

Across this case's 12 chat turns, the totals are:

  • 12 × triage_agent.conversation
  • 12 × intent_extractor.extract
  • 12 × medical_extractor.extract
  • 12 × financial_extractor.extract
  • 12 × logistics_extractor.extract
  • 12 × travel_extractor.extract
  • 6 × medical_extractor.icd_mapping (only when new clinical content)

= 78 LLM calls per case at this conversation length. Linear in turn count.
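
The linear relationship can be written as a one-line model. This is a sketch under the fan-out described above (six unconditional calls per turn plus conditional ICD mappings); the function and constant names are hypothetical.

```python
# triage + intent + four specialised extractors run on every turn.
UNCONDITIONAL_CALLS_PER_TURN = 6


def llm_calls(turns, icd_mapping_calls):
    """Total LLM calls for a case: fixed per-turn fan-out plus ICD mappings."""
    return UNCONDITIONAL_CALLS_PER_TURN * turns + icd_mapping_calls


# This case: 12 turns, 6 of which triggered an ICD mapping.
this_case = llm_calls(turns=12, icd_mapping_calls=6)
print(this_case)

# Bounds for a hypothetical 20-turn case (no ICD mappings vs one per turn).
print(llm_calls(20, 0), llm_calls(20, 20))
```

Because the ICD call is conditional, a case with N turns lands between 6N and 7N calls if at most one mapping fires per turn, which matches the "6-7 LLM calls" per turn noted above.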


Migration cost model

Per-case cost on current setup (this case)

Layer                                           Cost on current platform
────────────────────────────────────────────────────────────────────────
LLM (mixed Haiku 76% + GPT-4o-mini 24%)         $0.1188
Postgres queries (~50 per case)                 ~$0.0005
Neo4j queries (~10 per case)                    ~$0.0010
Qdrant queries (~5 per case)                    ~$0.0005
Redis operations (~100 per case, mostly hits)   ~$0.0001
R2 storage + bandwidth (1-2 documents)          ~$0.0001
QStash dispatch (1-2 tasks)                     $0 (free tier)
Langfuse trace export (78 observations)         ~$0.0010
Per-case total                                  ~$0.122

Per-case cost projection on GCP (if no fallback)

If migration includes Vertex AI Anthropic (in-VPC) and fallback is reduced through credit/quota fixes:

Layer                                               GCP equivalent                      Estimated cost
──────────────────────────────────────────────────────────────────────────────────────────────────────
LLM (Haiku 4.5 via Vertex AI), 137K in / 6.6K out   ~5% premium over public Anthropic   ~$0.085
Cloud SQL Postgres (db-custom-1-3840)               similar to today                    ~$0.0005
Vector search (Vertex AI Vector Search)             per-query pricing × ~5/case         ~$0.0007
Neo4j (Aura with BAA)                               unchanged                           ~$0.001
Memorystore (small instance)                        similar                             ~$0.0001
Cloud Storage (GCS Standard)                        similar                             ~$0.0001
Cloud Tasks (free tier)                             $0                                  $0
Cloud Logging / self-host Langfuse                  similar                             ~$0.001
Per-case total (Haiku-only)                                                             ~$0.088

Per-case cost projection on GCP (with current 24% fallback)

If fallback rate stays at this case's level:

LLM cost = 0.76 × $0.090 (Haiku via Vertex)  +  0.24 × $0.038 (GPT public) = $0.0775
+ infra ~$0.003
= ~$0.081 per case

That's actually cheaper than today's mixed bag, since GPT-4o-mini is cheaper than Haiku for input-heavy workloads. But the network-plane cost (PHI crossing public internet on fallback) doesn't show up in the dollar number.
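
The blended figure above can be reproduced, and re-run for other fallback rates, with a small helper. This is a sketch only: the function name is hypothetical, and the $0.090 and $0.038 inputs are taken as stated in the formula (note that the Haiku-only table uses ~$0.085 for the Vertex LLM line, so one of the two figures may need reconciling).

```python
def blended_llm_cost(fallback_rate, primary_cost, fallback_cost):
    """Expected per-case LLM cost when a share of calls goes to the fallback."""
    return (1 - fallback_rate) * primary_cost + fallback_rate * fallback_cost


HAIKU_VIA_VERTEX = 0.090  # USD per case, as used in the formula above
GPT_PUBLIC = 0.038        # USD per case, as used in the formula above
INFRA = 0.003             # USD per case, approximate

llm = blended_llm_cost(0.24, HAIKU_VIA_VERTEX, GPT_PUBLIC)
print(f"LLM = ${llm:.4f}, per case = ${llm + INFRA:.4f}")

# Sensitivity: how the per-case figure moves with the fallback rate.
for rate in (0.0, 0.02, 0.10, 0.24):
    total = blended_llm_cost(rate, HAIKU_VIA_VERTEX, GPT_PUBLIC) + INFRA
    print(f"fallback {rate:4.0%} -> ${total:.4f}")
```

The dollar figure falls as fallback rises, which is exactly why it should not be read on its own: the PHI-egress cost of the fallback path is not priced in.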

At scale (Haiku-primary)

Volume           Today/month   GCP/month
────────────────────────────────────────
100 cases        $12.20        $8.80
1,000 cases      $122          $88
10,000 cases     $1,220        $880
100,000 cases    $12,200       $8,800

The cost is dominated by LLM inference, not infrastructure. GCP infra costs (Cloud SQL, Memorystore, Cloud Run, Cloud Tasks, GCS) are noise relative to Anthropic/OpenAI billing. Migration won't save infra spend; it changes compliance posture.


Cold-start sensitivity (qualitative — needs separate measurement)

Numbers below are estimates for migration sizing decisions; refresh with real Cloud Run measurements after the lift.

Scenario                                        Wall-clock impact on first turn
───────────────────────────────────────────────────────────────────────────────────────────────────
Both API + worker Cloud Run instances warm      0ms baseline
API cold (first auth call after idle)           +500-2000ms (JWKS fetch + container init)
Worker cold (first extraction job after idle)   +1000-4000ms (cold container + first DB connection)
Cold + Anthropic fallback fires                 +500-2000ms additional (fallback path latency)

Mitigation: min-instances=1 on the API service costs ~$5-10/mo and eliminates user-facing cold starts. Worker cold start is acceptable since extraction is async (patient sees "processing" UI).


Compliance note — Langfuse on the HIPAA cloud

This Langfuse instance is hipaa.cloud.langfuse.com — Langfuse's HIPAA-compliant tier, not the standard cloud.langfuse.com. This is a positive finding for the BAA review:

  • Langfuse offers BAAs on the HIPAA tier
  • Trace bodies (containing PHI: clinical extractions, patient messages) are stored on HIPAA-compliant infrastructure
  • This may eliminate the "self-host Langfuse on GKE" option from the data flow map decision matrix — the BAA path on Langfuse Cloud HIPAA may be sufficient

Action for compliance review: verify the Langfuse BAA scope covers Curaway's trace volume + retention requirements, and that the HIPAA tier's controls (encryption, audit, deletion) match Curaway's needs.


Org-wide context (last 31 days)

While pulling this case's data, also captured an organisation-wide aggregate:

Metric (org-wide, 2026-03-28 to 2026-04-28)      Value
───────────────────────────────────────────────────────────
Total LLM observations across all cases          15,936
Total LLM cost                                   $63.12
Avg per-day cost                                 ~$2.04
Estimated cases in this period (78 calls/case)   ~204 cases
Estimated avg cost per case                      ~$0.31

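The estimated average per case ($0.31) is about 2.6× this single case ($0.119). Because the case count is itself derived by assuming 78 calls per case, the gap is really a per-observation difference: roughly $0.0040 per observation org-wide versus $0.0015 here. Likely contributors are longer conversations (more accumulated context per call), document re-extraction, or repeated matching attempts. Use the org-wide average as the migration cost-projection baseline rather than this single trace.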
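
The org-wide estimates are simple ratios. A sketch that makes the 78-calls-per-case assumption explicit and scales the resulting baseline to the volumes used earlier (variable names are illustrative; all inputs are copied from this page):

```python
ORG_OBSERVATIONS = 15_936
ORG_COST_USD = 63.12
WINDOW_DAYS = 31
CALLS_PER_CASE = 78     # taken from this single case; an assumption org-wide
CASE_COST_USD = 0.1188  # this case's measured LLM cost

est_cases = ORG_OBSERVATIONS / CALLS_PER_CASE
est_cost_per_case = ORG_COST_USD / est_cases
cost_per_day = ORG_COST_USD / WINDOW_DAYS

# Per-observation view, which does not depend on the calls-per-case guess.
org_cost_per_obs = ORG_COST_USD / ORG_OBSERVATIONS
case_cost_per_obs = CASE_COST_USD / CALLS_PER_CASE

print(f"estimated cases: {est_cases:.0f}")
print(f"estimated cost per case: ${est_cost_per_case:.2f}")
print(f"cost per day: ${cost_per_day:.2f}")
print(f"per observation: org ${org_cost_per_obs:.4f} vs case ${case_cost_per_obs:.4f}")

# Monthly LLM spend at the org-wide baseline for a few volumes.
for cases in (100, 1_000, 10_000, 100_000):
    print(f"{cases:>7,} cases -> ${cases * est_cost_per_case:,.0f}/month")
```

If real cases average more than 78 calls, the true case count is lower than ~204 and the true per-case cost is higher than $0.31, so the baseline is better read as a lower bound.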


How this trace was pulled (for refreshes)

# Auth from Railway env
LF_HOST=$(railway variables --kv | grep '^LANGFUSE_HOST=' | cut -d= -f2-)
LF_PUB=$(railway variables --kv | grep '^LANGFUSE_PUBLIC_KEY=' | cut -d= -f2-)
LF_SEC=$(railway variables --kv | grep '^LANGFUSE_SECRET_KEY=' | cut -d= -f2-)

# 1. List all traces for the case (sessionId = case_id) and save the response
#    so that step 3 can read it
mkdir -p /tmp/lf_trace
curl -s -u "$LF_PUB:$LF_SEC" \
  "$LF_HOST/api/public/traces?sessionId=147f70ac-fa12-4a03-9eca-1a87ca801ccc&limit=100" \
  -o /tmp/lf_trace/traces.json

# 2. Pull all org-wide observations (paged, no sessionId filter; works around
#    the rate limit on per-trace fetches). 200 pages × 100 covers up to 20,000
#    observations; raise the bound if the org has more.
for page in $(seq 1 200); do
  curl -s -u "$LF_PUB:$LF_SEC" \
    "$LF_HOST/api/public/observations?type=GENERATION&page=$page&limit=100" \
    -o "/tmp/obs_page_$page.json"
  sleep 0.5
done

# 3. Filter local data by traceId in the case's trace list
python3 -c "
import json, glob

# The saved list response is read from its 'data' key, like the observation pages
with open('/tmp/lf_trace/traces.json') as fh:
    case_ids = {t['id'] for t in json.load(fh)['data']}

all_obs = []
for path in glob.glob('/tmp/obs_page_*.json'):
    with open(path) as fh:
        all_obs += json.load(fh).get('data', [])

case_obs = [o for o in all_obs if o.get('traceId') in case_ids]
print(f'Found {len(case_obs)} observations for the case')
"

The HIPAA-compliant Langfuse cloud (hipaa.cloud.langfuse.com) has stricter rate limits on individual /traces/{id} fetches than on bulk /observations paging. Bulk + local filter is faster.
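
The bulk-then-filter approach can also stop as soon as a page comes back empty instead of always requesting a fixed number of pages. A minimal sketch of that loop, with the HTTP call replaced by a stand-in so it runs offline; `fetch_all`, `fetch_page`, and `fake_fetch` are hypothetical names, and the only assumption about the response shape is the `data` list already used in the script above.

```python
def fetch_all(fetch_page, limit=100, max_pages=1000):
    """Collect items from a paged endpoint until an empty page is returned.

    fetch_page(page, limit) is supplied by the caller and must return the
    decoded JSON body, with the page's items in a list under 'data'.
    """
    items = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page, limit).get("data", [])
        if not batch:
            break  # past the last page; no need to keep requesting
        items.extend(batch)
    return items


# Stand-in for the real HTTP request: 250 fake observations over 3 trace ids.
FAKE_OBS = [{"id": i, "traceId": f"t{i % 3}"} for i in range(250)]


def fake_fetch(page, limit):
    start = (page - 1) * limit
    return {"data": FAKE_OBS[start:start + limit]}


all_obs = fetch_all(fake_fetch)

# Local filter by the case's trace ids, as in step 3 of the script.
case_trace_ids = {"t0"}
case_obs = [o for o in all_obs if o.get("traceId") in case_trace_ids]
print(len(all_obs), len(case_obs))
```

In a real refresh, `fetch_page` would wrap the authenticated request to the observations endpoint and keep the same pause between pages to stay under the rate limit.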


To refresh this page on a new case

Replace the case_id in the script above; rerun. The structure of this page (per-agent table, fan-out diagram, cost model) generalises — only the numbers change. Worth running again for:

  • A case that involved document upload + extraction (this case had no documents in the captured spans — pure conversational intake)
  • A case that completed matching + explanation (this case stopped at intake)
  • A case during steady Anthropic (no fallback) for the cleanest cost baseline