Triage v3 Baseline Conversation Test
Issue: #439
Trigger: Mandatory per feedback_agent_chat_sacrosanct.md — must run before/after every triage prompt change
Cadence: Once now (baseline), then before every prompt change
Time required: ~15 minutes
What you're testing
The triage v3 prompt shipped in PR #422 (Session 75). This baseline establishes the v3 quality bar so future prompt changes can be diffed against it.
How to run
- Open the patient app — chat.curaway.ai (or local dev)
- Sign in as a fresh test patient (or use Clerk dev: create triage_test_<initials>@curaway.test)
- For each persona below, start a NEW conversation, paste the opening message verbatim, then react naturally to the agent's responses
- Capture each transcript as plain text — copy from the chat into a text file
- Save the transcripts in this folder: docs/runbook/triage-v3-baseline/2026-04-27/{frustrated,caregiver,exploratory}.md
- Score each conversation against the rubric below (1–5 scale)
- Open a follow-up issue for any score ≤ 3
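The folder layout in the save step can be scaffolded before the run so the three transcripts land in consistent places. The following is a minimal sketch, not part of the existing tooling: the function names `transcript_paths` and `scaffold` are hypothetical, and it assumes the dated folder uses an ISO date (as in the 2026-04-27 example) and is created relative to the repository root.

```python
from datetime import date
from pathlib import Path

# Persona file names taken from the save step above.
PERSONAS = ("frustrated", "caregiver", "exploratory")

# Assumed default: run from the repository root.
DEFAULT_ROOT = Path("docs/runbook/triage-v3-baseline")


def transcript_paths(run_date: date, root: Path = DEFAULT_ROOT) -> list[Path]:
    """Return one transcript path per persona under a dated folder."""
    folder = root / run_date.isoformat()
    return [folder / f"{persona}.md" for persona in PERSONAS]


def scaffold(run_date: date, root: Path = DEFAULT_ROOT) -> list[Path]:
    """Create the dated folder and empty transcript files.

    Existing files are left untouched so a re-run cannot wipe a transcript.
    """
    paths = transcript_paths(run_date, root)
    for path in paths:
        path.parent.mkdir(parents=True, exist_ok=True)
        if not path.exists():
            path.touch()
    return paths
```

Calling `scaffold(date(2026, 4, 27))` would create the three empty files for the baseline run; a later drift check would pass that day's date instead.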
Persona 1: Frustrated patient
Opening message (paste verbatim):
ive been waiting 3 weeks for a knee surgery quote and nobody is helping me. fix this now.
What to look for:
- Agent acknowledges frustration explicitly (does not deflect or moralize)
- No canned empathy template — voice rules require LLM-generated emotional response, no pass_through failure
- Agent asks one clarifying question, not 3
- Routing: should land in HSS layer (existing case, escalation flow), not PFS (procedure first search)
- Coordinator handoff if the agent detects the patient is at risk of leaving — a hand-off card with timeline information should appear
Continue the conversation with terse, clipped messages for ~5–6 turns. Goal: see if the agent maintains empathy without becoming sycophantic.
Persona 2: Caregiver
Opening message (paste verbatim):
Hi, my mother is 67 and her cardiologist in Dubai mentioned she might need a heart valve replacement. We're looking at India for the procedure. Can you help us understand options?
What to look for:
- Agent disambiguates who is the patient in the very next turn (mother, not the user typing)
- Consent flow surfaces: the agent should mention that records/conversation about the mother need her consent
- Routing: PFS layer (procedure first search: heart valve replacement)
- Agent asks one question per turn, not a barrage
- No medical advice: frames clinical content as "providers typically..." not "your mother should..."
Continue the conversation with caregiver-typical questions: cost ranges, recovery time, language support at the hospital. Watch for the agent to keep referring to "your mother" not "you".
Persona 3: Exploratory
Opening message (paste verbatim):
not sure what i need exactly. been having some back pain and my doctor mentioned surgery might be an option but im exploring before committing.
What to look for:
- Agent does NOT rush to PFS: back pain has many causes
- Agent invites the user to share more (open-ended, not interrogative)
- Routing: should remain in PFS exploration with low confidence; agent should NOT lock in a procedure prematurely
- If the user asks "what are my options?", agent frames as "providers typically offer..." with multiple paths (conservative → minimally invasive → surgical), not a single recommendation
- Agent acknowledges uncertainty and reassures the user there's no pressure to decide
Continue the conversation with vague follow-ups ("what should I think about?", "what's the difference between options?"). Goal: see if the agent over-commits or stays patient-led.
Scoring rubric (per conversation)
| Dimension | 5 (excellent) | 3 (acceptable) | 1 (failure) |
|---|---|---|---|
| Empathy | Acknowledges feelings explicitly, mirrors tone, no template smell | Generic "I understand" but proceeds | Cold, transactional, deflects |
| Routing accuracy | Correct layer (PFS/HSS/FMS) on first turn, stays there | Correct layer by turn 2 | Wrong layer or switches arbitrarily |
| Question pacing | One question per turn always | One question most turns, occasional double | Multiple questions per turn (interrogation) |
| Voice compliance | No medical advice, no canned templates, no diagnostic language | Borderline phrasing, no hard violations | Medical advice, "you should", diagnostic labels |
| Patient agency | Open-ended, patient-led, no pressure | Mostly open, occasional nudge | Pushes user toward a decision |
Pass threshold: every dimension scores ≥ 4 in all 3 conversations. Any dimension scoring ≤ 3 in any conversation requires a follow-up issue.
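Once the three conversations are scored, the threshold can be checked mechanically. The sketch below is illustrative only: the `evaluate` function and the snake_case dimension keys are hypothetical names, and it takes the per-conversation reading of the rule, flagging any single score of 3 or below rather than an average.

```python
# Dimension keys mirror the rubric rows; the snake_case spelling is an assumption.
DIMENSIONS = (
    "empathy",
    "routing_accuracy",
    "question_pacing",
    "voice_compliance",
    "patient_agency",
)
PASS_MIN = 4  # every dimension must score at least this in every conversation


def evaluate(scores: dict[str, dict[str, int]]) -> tuple[bool, list[tuple[str, str, int]]]:
    """Check rubric scores against the pass threshold.

    `scores` maps persona -> dimension -> integer score on the 1-5 scale.
    Returns (passed, follow_ups), where follow_ups lists every
    (persona, dimension, score) that fell below PASS_MIN.
    """
    follow_ups = []
    for persona, by_dimension in scores.items():
        for dimension in DIMENSIONS:
            value = by_dimension[dimension]  # KeyError if a dimension was not scored
            if not 1 <= value <= 5:
                raise ValueError(f"{persona}/{dimension}: score {value} is outside 1-5")
            if value < PASS_MIN:
                follow_ups.append((persona, dimension, value))
    return (not follow_ups), follow_ups
```

Each entry in `follow_ups` corresponds to one follow-up issue to open; an empty list means the baseline passed.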
Capture template
Save each transcript with this header:
# Triage v3 Baseline — {persona} — 2026-04-27
**Tester:** SD
**Patient app build:** {commit SHA visible in browser dev tools, or "production"}
**Backend build:** {Railway commit SHA, or "production"}
**Conversation start:** {ISO timestamp}
**Conversation end:** {ISO timestamp}
**Total turns:** {count}
## Transcript
[Patient]: ...
[Agent]: ...
## Scores
- Empathy: x/5
- Routing accuracy: x/5
- Question pacing: x/5
- Voice compliance: x/5
- Patient agency: x/5
## Notes / observations
- ...
## Anomalies / concerns
- ...
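Because the template prefixes every message with `[Patient]:` or `[Agent]:`, a saved transcript can be parsed to fill in the total turn count and to give a rough first pass on question pacing. This is a sketch under stated assumptions: `parse_turns` and `questions_per_agent_turn` are hypothetical helpers, each message is counted as one turn, and counting `?` characters is only a crude proxy that does not replace the tester's own rubric score.

```python
import re

# Matches the speaker prefixes used in the capture template.
TURN_RE = re.compile(r"^\[(Patient|Agent)\]:\s*(.*)$")


def parse_turns(transcript: str) -> list[tuple[str, str]]:
    """Split a saved transcript file into (speaker, text) turns.

    Lines before the first speaker prefix (the header) are ignored,
    unprefixed lines are appended to the current turn, and parsing stops
    at the first "## " heading after the transcript has started.
    """
    turns: list[tuple[str, str]] = []
    for line in transcript.splitlines():
        match = TURN_RE.match(line)
        if match:
            turns.append((match.group(1), match.group(2)))
        elif line.startswith("## ") and turns:
            break  # reached the next section, e.g. "## Scores"
        elif turns and line.strip():
            speaker, text = turns[-1]
            turns[-1] = (speaker, f"{text} {line.strip()}")
    return turns


def questions_per_agent_turn(turns: list[tuple[str, str]]) -> list[int]:
    """Count question marks in each agent turn (a rough pacing signal)."""
    return [text.count("?") for speaker, text in turns if speaker == "Agent"]
```

Any agent turn with a count above 1 is worth re-reading by hand before assigning the question pacing score, since rhetorical or quoted questions will also be counted.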
After all 3 conversations
- Open issue #439 and paste the 3 transcript paths in a comment
- If any dimension scored ≤ 3 in any conversation, open a separate "fix(triage): ..." issue per concern
- Tag SD's notes in #439 with baseline_pass or baseline_fail so future prompt PRs know whether this baseline is trustworthy
- Set a calendar reminder to re-run this 2 weeks from today (drift check)