Feature: Chain-of-Thought Prompt Enhancement for Clinical Agents
Overview
Add explicit step-by-step reasoning instructions to two clinical agent prompts to improve ICD coding accuracy, clinical summary quality, and downstream data flow. This is a prompt-level change with small wiring additions.
Agents in scope: Clinical Parsing (map_to_medical_codes), Clinical Reasoning (clinical summary)
Agents NOT in scope for CoT prompts: Intake, Match, Explanation, Risk (see Decision Record below; Match Agent receives a wiring change only, see §3.3)
Decision Record: Why These Two Agents, Not Others
What we're implementing now
| Agent | Change | Why |
|---|---|---|
| Clinical Parsing (`map_to_medical_codes`) | Add 7-step CoT for ICD/SNOMED mapping | Laterality and specificity errors are the #1 coding risk. CoT forces explicit body-system → site → laterality → code reasoning. |
| Clinical Reasoning (clinical summary) | Add 7-step CoT for clinical picture assessment | Treatment pathway recommendations need explicit severity/contraindication reasoning for defensibility. |
What we're explicitly NOT implementing and why
| Agent/Component | Why NOT | What triggers revisiting |
|---|---|---|
| Intake Agent | Conversational, not reasoning-heavy. CoT would add latency to chatbot responses with no accuracy benefit. | Never — wrong pattern for conversational agents. |
| Match Agent | Deterministic weighted scoring (v2.1). LLM steps (`analyze_clinical_picture`, `rerank_edge_cases`) will consume Clinical Reasoning's CoT output instead of re-reasoning. | If Match Agent's LLM steps need independent clinical reasoning (unlikely; deduplication is better). |
| Explanation Agent | Already produces narrative output. Will benefit from consuming CoT reasoning chain (more grounded explanations) without needing its own CoT. | If match explanations lack clinical depth — add CoT to Explanation Agent's prompt post-demo. |
| Risk Assessor | Rule-based `risk_assessor.py` shipped in Session 33 (#68). ~30 deterministic rules with source provenance. CoT doesn't apply to lookup tables. | LLM Risk Agent shadow mode when 1,000+ cases collected OR clinical advisor flags a rule miss the rule set fundamentally can't catch. See "Future: Risk Assessor Shadow Mode" below. |
| Lab Analyzer | Rule-based `lab_analyzer.py` (Session 21). Same pattern as risk assessor. `comorbidity_llm_shadow` pattern already documented for future LLM upgrade. | Same triggers as Risk Assessor. |
Tradeoffs we're accepting
| Tradeoff | Impact | Mitigation |
|---|---|---|
| +300-500 output tokens per patient journey | ~$0.002 extra cost, ~200ms extra latency | Well within $0.15/patient target. Streaming (SSE) masks latency. |
| Visible reasoning in Langfuse traces | Traces are longer to read | Benefit outweighs cost — traces become eval datasets. |
| Haiku may struggle with multi-step clinical reasoning | Lower accuracy on complex cases | Flagsmith flag allows per-step model escalation to Sonnet. Tiered routing config — no code change. |
| CoT output is longer — guardrail regex may false-reject | Blocked responses on valid clinical output | Verify output_validator.py handles reasoning preamble before JSON (see checklist). |
1. Clinical Parsing Agent: map_to_medical_codes
Current Behavior
Prompt instructs: "Map the following clinical entities to ICD-10 and SNOMED CT codes with confidence scores."
New CoT Instructions
Add the following reasoning chain to the system prompt, before the JSON output schema:
When mapping clinical entities to medical codes, reason through each step explicitly:
1. BODY SYSTEM: Identify the primary body system affected (e.g., musculoskeletal, cardiovascular)
2. CONDITION CATEGORY: Determine the broad condition category within that system (e.g., osteoarthritis within musculoskeletal)
3. ANATOMICAL SPECIFICITY: Identify the exact anatomical site (e.g., knee, not just "lower extremity")
4. LATERALITY: Explicitly determine left, right, bilateral, or unspecified. Check the source text for mentions of "left", "right", "bilateral", "L", "R", "Lt", "Rt". If the report mentions a specific side, you MUST use the lateralized code.
5. SEVERITY/STAGE: Assess if severity or staging information is present (e.g., primary vs secondary osteoarthritis, Kellgren-Lawrence grade)
6. CODE SELECTION: Select the most specific ICD-10 code matching ALL of the above. Verify the code exists. State your confidence.
7. SNOMED MAPPING: Map to the corresponding SNOMED CT concept. Verify clinical equivalence with the ICD-10 code.
Output your reasoning as numbered steps, then the structured result.
Updated Few-Shot Example
Input: "Right knee osteoarthritis, primary, with significant joint space narrowing"
Reasoning:
1. Body system: Musculoskeletal
2. Condition: Osteoarthritis (degenerative joint disease)
3. Site: Knee joint
4. Laterality: RIGHT — source explicitly states "right knee"
5. Severity: Primary (not secondary to trauma/other condition). Joint space narrowing suggests advanced disease (KL Grade III-IV)
6. ICD-10: M17.11 (Primary osteoarthritis, right knee) — NOT M17.9 (unspecified), NOT M17.12 (left)
7. SNOMED: 239873007 (Osteoarthritis of knee) with laterality qualifier
Output:
{
  "icd10_code": "M17.11",
  "icd10_description": "Primary osteoarthritis, right knee",
  "snomed_code": "239873007",
  "snomed_description": "Osteoarthritis of knee",
  "laterality": "right",
  "confidence": 0.95,
  "reasoning_steps": [
    {"step": 1, "label": "body_system", "value": "musculoskeletal"},
    {"step": 2, "label": "condition", "value": "osteoarthritis"},
    {"step": 3, "label": "site", "value": "knee"},
    {"step": 4, "label": "laterality", "value": "right", "source_evidence": "source states 'right knee'"},
    {"step": 5, "label": "severity", "value": "primary, advanced (joint space narrowing)"},
    {"step": 6, "label": "icd10", "value": "M17.11", "rejected_alternatives": ["M17.9", "M17.12"]},
    {"step": 7, "label": "snomed", "value": "239873007"}
  ]
}
Key: The reasoning_steps array is structured and parseable — not freeform text. This enables:
- Harvesting for LangSmith eval datasets
- Storing in events table for training data
- Comparing step 4 (laterality) against FHIR resource bodySite for consistency checks
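The laterality consistency check in the last bullet could be sketched as follows. This is a minimal illustration, not existing project code: the function names are hypothetical, and it assumes the FHIR Condition carries `bodySite` codings whose `display` text contains the word "left", "right", or "bilateral".

```python
from typing import Optional


def laterality_from_steps(reasoning_steps: list[dict]) -> Optional[str]:
    """Return the value of the step labelled 'laterality', or None if absent."""
    for step in reasoning_steps:
        if step.get("label") == "laterality":
            return step.get("value")
    return None


def laterality_matches_body_site(reasoning_steps: list[dict], condition: dict) -> bool:
    """Check step 4 against the display text of Condition.bodySite codings.

    Assumption: the display strings spell out the side as a separate word.
    """
    claimed = laterality_from_steps(reasoning_steps)
    if claimed is None:
        return False
    for site in condition.get("bodySite", []):
        for coding in site.get("coding", []):
            if claimed.lower() in coding.get("display", "").lower().split():
                return True
    return False
```

A mismatch here would not prove the code is wrong, but it is a cheap signal for routing a case to human review.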
2. Clinical Reasoning Agent: Clinical Summary
Current Behavior
Prompt instructs: "Generate a clinical summary and treatment pathway recommendations based on the patient's conditions, observations, and medical history."
New CoT Instructions
Reason through the clinical picture step by step:
1. PRIMARY CONDITION: What is the main condition requiring treatment? State the diagnosis with ICD code.
2. CLINICAL SEVERITY: Based on imaging, symptoms, and functional limitations, how advanced is the condition? (early/moderate/severe)
3. CONSERVATIVE TREATMENT HISTORY: Has the patient tried non-surgical options? What was the outcome? If unknown, flag as "conservative treatment history not documented."
4. SURGICAL INDICATION: Based on severity and failed conservative treatment, is surgical intervention indicated? Why or why not?
5. COMORBIDITY INTERACTIONS: Don't list comorbidities independently — reason about how they INTERACT. Example: diabetes + obesity compounds wound infection risk. Cardiac history + general anesthesia compounds perioperative risk.
6. CONTRAINDICATIONS: Are there any absolute or relative contraindications to the recommended procedure?
7. TREATMENT PATHWAY: Recommend the appropriate treatment pathway with justification tied to steps 1-6.
Output your reasoning as numbered steps, then the structured clinical summary.
Updated Few-Shot Example
Input: Patient Aisha, 58F. Right knee OA (M17.11). BMI 31. Type 2 diabetes (controlled, HbA1c 6.8). X-ray shows Kellgren-Lawrence Grade IV. Failed 6 months physiotherapy and NSAIDs.
Reasoning:
1. Primary condition: Primary osteoarthritis, right knee (M17.11)
2. Severity: SEVERE — KL Grade IV indicates bone-on-bone, end-stage OA
3. Conservative history: 6 months physiotherapy + NSAIDs — FAILED (continued pain and functional limitation)
4. Surgical indication: YES — end-stage OA with failed conservative management meets standard criteria for total knee replacement
5. Comorbidity interactions:
- Diabetes (HbA1c 6.8) + Obesity (BMI 31): MODERATE compound risk — elevated wound infection risk, slower healing. HbA1c 6.8 is below 8.0 surgical threshold.
- Obesity (BMI 31) + TKR: Class I obesity increases implant wear rate but is below 40+ contraindication threshold.
- No cardiac history documented — no anesthesia compounding.
6. Contraindications: None absolute. BMI 31 is below typical 40+ TKR threshold. Diabetes controlled.
7. Pathway: Total knee replacement (right) indicated. Pre-op: maintain HbA1c < 7.0, weight management, pre-op quadriceps strengthening.
Output:
{
  "clinical_summary": { ... },
  "comorbidity_interactions": [
    {
      "conditions": ["type_2_diabetes", "obesity_class_1"],
      "interaction": "compound_wound_infection_risk",
      "severity": "moderate",
      "detail": "HbA1c 6.8 + BMI 31 elevates infection risk but both below surgical thresholds"
    }
  ],
  "reasoning_steps": [
    {"step": 1, "label": "primary_condition", "value": "M17.11 right knee OA"},
    {"step": 2, "label": "severity", "value": "severe (KL Grade IV)"},
    {"step": 3, "label": "conservative_history", "value": "failed 6mo PT + NSAIDs"},
    {"step": 4, "label": "surgical_indication", "value": "yes"},
    {"step": 5, "label": "comorbidity_interactions", "value": "diabetes+obesity moderate compound risk"},
    {"step": 6, "label": "contraindications", "value": "none absolute"},
    {"step": 7, "label": "pathway", "value": "TKR right, pre-op optimization"}
  ]
}
Key: The comorbidity_interactions array is the bridge to the risk assessor. When this LLM-derived interaction analysis diverges from the rule-based risk_assessor.py output, that divergence is exactly the signal needed to justify building the LLM Risk Agent.
3. Wiring Changes
3.1 Flagsmith Feature Flags
Create two flags for A/B quality measurement:
| Flag | Default | Purpose |
|---|---|---|
| `cot_clinical_parsing` | `true` | Enable CoT on `map_to_medical_codes`. When false, use original prompt (no CoT). |
| `cot_clinical_reasoning` | `true` | Enable CoT on clinical summary. When false, use original prompt. |
These are NOT "do we ship" flags — they're measurement flags. Both default on. Allows disabling per-tenant if CoT causes issues, and enables Langfuse quality comparison between CoT and non-CoT traces.
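A flag check along these lines could select the prompt variant. This is a sketch only: `select_prompt` and the `is_enabled` callable are hypothetical stand-ins for however the codebase reads Flagsmith flags, and the prompt strings are abbreviated from §1.

```python
from typing import Callable

# Abbreviated from the "Current Behavior" and "New CoT Instructions" text in §1.
BASE_PROMPT = "Map the following clinical entities to ICD-10 and SNOMED CT codes with confidence scores."
COT_INSTRUCTIONS = "When mapping clinical entities to medical codes, reason through each step explicitly:"


def select_prompt(is_enabled: Callable[[str], bool], flag: str = "cot_clinical_parsing") -> str:
    """Return the CoT prompt when the flag is on, otherwise the original prompt.

    `is_enabled` is a hypothetical wrapper around the feature-flag client.
    """
    if is_enabled(flag):
        return f"{BASE_PROMPT}\n\n{COT_INSTRUCTIONS}"
    return BASE_PROMPT
```

Logging the flag value alongside each trace (the `cot_enabled` field in §3.2) is what makes the CoT vs non-CoT comparison possible later.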
3.2 Events Table: Structured Reasoning Storage
When logging agent calls to the events table, include reasoning_steps in the JSONB payload:
# In the agent's event logging
event_payload = {
    "agent": "clinical_parsing",
    "model": "claude-haiku-4.5",
    "tokens_in": ...,
    "tokens_out": ...,
    "cot_enabled": True,
    "reasoning_steps": coded_entity["reasoning_steps"],  # structured array
    "comorbidity_interactions": summary.get("comorbidity_interactions", []),  # from Clinical Reasoning
}
This structured data becomes:
- Training data for ML Ranking v2 (learning which reasoning patterns lead to accepted matches)
- Fine-tuning data for MedGemma (Stage 5 in Medical Model Evolution Roadmap)
- Ground truth for LangSmith eval datasets (when enough data exists to build them)
3.3 Match Agent Deduplication
Currently, Match Agent's analyze_clinical_picture node calls Claude Sonnet to review all FHIR data and generate a clinical summary. After this feature, Clinical Reasoning Agent already produces a CoT-enriched summary stored on the EHR snapshot.
Change: analyze_clinical_picture should first check if a CoT-enriched clinical summary exists on the patient's EHR snapshot. If it does, use it directly instead of making another Sonnet call. Fall back to the existing Sonnet call only if no summary exists.
This saves ~$0.03 per patient journey (one fewer Sonnet call) and ensures the match is grounded in the same clinical reasoning visible in Langfuse traces.
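The check-then-fall-back behaviour might look like the sketch below. The snapshot key `clinical_summary` and the injected `call_sonnet` callable are assumptions made for illustration; the real node signature is not shown in this document.

```python
from typing import Callable


def analyze_clinical_picture(ehr_snapshot: dict, call_sonnet: Callable[[dict], dict]) -> dict:
    """Reuse a CoT-enriched summary when one exists, else fall back to the LLM call.

    Assumption: a summary counts as CoT-enriched when it carries a non-empty
    `reasoning_steps` array.
    """
    summary = ehr_snapshot.get("clinical_summary")
    if summary and summary.get("reasoning_steps"):
        return summary  # no second Sonnet call
    return call_sonnet(ehr_snapshot)
```

Requiring `reasoning_steps` rather than just any summary keeps pre-CoT snapshots on the existing Sonnet path.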
3.4 Guardrail Validation Check
Verify that app/services/output_validator.py handles CoT output format:
- CoT output = reasoning text (numbered steps) + JSON block
- Current validator may expect pure JSON or check the entire response text for forbidden patterns
- If the validator regex-scans the full output, CoT medical reasoning text (which legitimately mentions conditions, treatments, risks) could trigger false positives on "diagnosis" or "treatment recommendation" patterns
- Resolution options:
a. Extract JSON block from CoT+JSON output, validate only the JSON portion
b. Add CoT-aware context: if `cot_enabled=True` in the agent config, relax pattern matching on the reasoning preamble
c. Validate the full output but add allowed exceptions for structured reasoning steps (similar to existing allowed exceptions in `guardrails.yaml`)
- Choose the simplest option that doesn't weaken guardrails for patient-facing output. CoT reasoning is internal (logged to Langfuse/events); only the structured JSON portion reaches the patient.
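Option (a) could look roughly like the following. This is a minimal sketch, not the current `output_validator.py`: the function name `split_cot_output` is hypothetical, and it assumes the model emits the numbered reasoning first and a single JSON object after it.

```python
import json


def split_cot_output(raw: str) -> tuple[str, dict]:
    """Split 'reasoning text + JSON object' into (reasoning, parsed JSON).

    Tries each '{' in turn and returns the first position from which a
    complete JSON object can be decoded.
    """
    decoder = json.JSONDecoder()
    for idx, ch in enumerate(raw):
        if ch != "{":
            continue
        try:
            obj, _end = decoder.raw_decode(raw, idx)
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict):
            return raw[:idx].strip(), obj
    raise ValueError("no JSON object found in model output")
```

With this split, the existing forbidden-pattern checks can run on the parsed JSON alone, while the reasoning text is logged but never sent to the patient.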
3.5 FHIR Resource Confidence Metadata
When CoT produces a confidence score with justification (step 6 in Clinical Parsing), include it on the FHIR resource's meta field or as an extension:
{
  "resourceType": "Condition",
  "code": {
    "coding": [{
      "system": "http://hl7.org/fhir/sid/icd-10-cm",
      "code": "M17.11"
    }]
  },
  "meta": {
    "extension": [{
      "url": "http://curaway.ai/fhir/coding-confidence",
      "valueDecimal": 0.95
    }]
  }
}
Extensions are FHIR's standard mechanism for carrying data that the base resource definition does not cover, and a confidence value makes resources more useful for provider integrations. Low-confidence codes can be flagged for human review.
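Attaching the score and flagging low-confidence codes could be sketched like this. The extension URL is the one from the example above; the helper names and the 0.8 review threshold are hypothetical and not specified in this document.

```python
CONFIDENCE_URL = "http://curaway.ai/fhir/coding-confidence"
REVIEW_THRESHOLD = 0.8  # hypothetical cut-off, to be set with the clinical advisor


def with_confidence(condition: dict, confidence: float) -> dict:
    """Return a copy of the resource with the confidence extension set on meta."""
    updated = {**condition, "meta": {**condition.get("meta", {})}}
    # Drop any earlier confidence entry so the extension appears at most once.
    extensions = [
        ext for ext in updated["meta"].get("extension", [])
        if ext.get("url") != CONFIDENCE_URL
    ]
    extensions.append({"url": CONFIDENCE_URL, "valueDecimal": confidence})
    updated["meta"]["extension"] = extensions
    return updated


def needs_review(condition: dict, threshold: float = REVIEW_THRESHOLD) -> bool:
    """True when the recorded confidence is below the threshold or missing."""
    for ext in condition.get("meta", {}).get("extension", []):
        if ext.get("url") == CONFIDENCE_URL:
            return ext["valueDecimal"] < threshold
    return True
```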
4. What Comes Later: Triggered by This Feature's Data
4.1 ReAct Pattern (ADR-011)
Current status: Deferred. No baseline Langfuse failure data.
How CoT feeds ReAct: CoT prompts become the inner reasoning kernel inside future ReAct loops. When ReAct is implemented:
- Today (CoT): Prompt says "reason through laterality step by step" → model outputs M17.11 in one pass
- Future (ReAct): Step 1: CoT reasons → outputs M17.11. Step 2 (ACT): calls UMLS API to verify code exists. Step 3 (OBSERVE): UMLS confirms/rejects. Step 4: If mismatch, loop back with observation as context.
Trigger: After MVP demo, collect Langfuse traces for 50+ patient cases. Categorize failures. If >10% of Clinical Parsing traces show incorrect codes that CoT didn't catch, ReAct is justified for that agent.
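For orientation, the future ReAct loop could take a shape like the sketch below. Nothing here is implemented: `reason` stands in for the CoT prompt call and `verify` for a code-lookup call such as the UMLS check, both passed in as hypothetical callables.

```python
from typing import Callable, Optional


def react_code_mapping(
    entity: str,
    reason: Callable[[str, Optional[str]], str],
    verify: Callable[[str], tuple[bool, str]],
    max_iters: int = 3,
) -> Optional[str]:
    """Reason, act, observe, and retry until a code verifies or attempts run out."""
    observation: Optional[str] = None
    for _ in range(max_iters):
        code = reason(entity, observation)  # Step 1: CoT pass proposes a code
        ok, detail = verify(code)           # Steps 2-3: ACT, then OBSERVE the result
        if ok:
            return code
        # Step 4: feed the rejection back as context for the next pass
        observation = f"{code} rejected: {detail}"
    return None
```

The iteration cap matters for cost: each extra pass is another model call, so the 10% failure threshold above is what decides whether the loop is worth it.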
4.2 LLM Risk Agent Shadow Mode
Current status: Rule-based risk_assessor.py shipped (Session 33, #68). ~30 rules, deterministic, <5ms.
How CoT feeds it: The comorbidity_interactions field from Clinical Reasoning CoT output can be compared against risk_assessor.py output on every EHR rebuild. When they diverge, log the divergence to the events table. This follows the comorbidity_llm_shadow pattern from lab_analyzer.py (Session 21).
Implementation (not in this feature — document only):
# In ehr_builder_agent, after risk_assessor runs:
rule_based_risks = risk_assessor.assess(ehr_snapshot)
cot_interactions = clinical_reasoning_output.get("comorbidity_interactions", [])
if divergence_detected(rule_based_risks, cot_interactions):
    log_event("risk_shadow_divergence", {
        "rule_based": rule_based_risks,
        "llm_cot": cot_interactions,
        "patient_id": patient_id,
    })
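`divergence_detected` is not defined in this document. One possible shape is sketched below, under a strong assumption: that both the rule-based output and the CoT interactions are lists of dicts carrying a `conditions` list, as in the `comorbidity_interactions` example in §2. The real `risk_assessor.py` output format may differ.

```python
def _condition_sets(items: list[dict]) -> set[frozenset]:
    """Normalise each entry to an order-independent set of condition names."""
    return {frozenset(item["conditions"]) for item in items if item.get("conditions")}


def divergence_detected(rule_based_risks: list[dict], cot_interactions: list[dict]) -> bool:
    """True when the two sources flag different condition combinations.

    Assumption: both inputs use a `conditions` list per entry.
    """
    return _condition_sets(rule_based_risks) != _condition_sets(cot_interactions)
```

This compares only which combinations were flagged, not their severity; a stricter version could also compare severity once both sides share a scale.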
Trigger: Build the standalone LLM Risk Agent when:
- ≥1,000 real patient cases processed
- Clinical advisor flags a rule miss the rule set fundamentally can't catch
- Divergence log shows >15% disagreement between rule-based and CoT-derived risks
- Complex procedures added (transplants, oncology) where rule space explodes
4.3 LangSmith Evaluation Datasets
Current status: Deferred — "not enough data exists."
How CoT feeds it: Structured reasoning_steps arrays in events table are the exact format needed for LangSmith eval datasets. After 100+ patient cases with CoT traces:
1. Export reasoning_steps from events table
2. Have clinical advisor grade a sample (correct/incorrect per step)
3. Build LangSmith dataset: input → expected reasoning → expected output
4. Run regression tests on every prompt version change
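Steps 1 to 3 could start from an export like the sketch below. It is written against plain dicts rather than the LangSmith client, and the event field names `payload`, `input_text`, and `output` are assumptions about the events table row shape.

```python
def to_eval_examples(events: list[dict]) -> list[dict]:
    """Turn CoT-enabled event rows into input / expected reasoning / expected output records."""
    examples = []
    for event in events:
        payload = event.get("payload", {})
        steps = payload.get("reasoning_steps")
        if not payload.get("cot_enabled") or not steps:
            continue  # skip non-CoT traces
        examples.append({
            "input": event.get("input_text"),
            "expected_reasoning": steps,
            "expected_output": payload.get("output"),
            # One grade slot per step, to be filled in by the clinical advisor.
            "grades": {step["step"]: None for step in steps},
        })
    return examples
```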
4.4 Explanation Agent Enhancement
Not in this feature but planned: If match explanations lack clinical depth after demo, add CoT to Explanation Agent prompts. The Explanation Agent can also reference Clinical Reasoning's reasoning_steps to produce more grounded justifications without re-reasoning.
4.5 Self-Hosted Model Evaluation (GPT-OSS-20B, MedGemma)
When evaluating self-hosted models (post-seed), CoT traces from Claude become the gold-standard comparison dataset. Run the same prompts through MedGemma, compare reasoning_steps structure and accuracy, measure quality delta before switching production traffic.
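A first-pass quality delta could be a per-step agreement rate between the Claude trace and the candidate model's trace. The sketch below assumes both traces use the `reasoning_steps` shape from §1 and treats exact value equality as agreement, which is a simplification; the function name is hypothetical.

```python
def step_agreement(reference: list[dict], candidate: list[dict]) -> float:
    """Fraction of reference steps whose value the candidate reproduces exactly."""
    ref = {step["label"]: step["value"] for step in reference}
    cand = {step["label"]: step["value"] for step in candidate}
    if not ref:
        return 0.0
    matches = sum(1 for label, value in ref.items() if cand.get(label) == value)
    return matches / len(ref)
```

Exact matching suits coded steps such as `laterality` and `icd10`; free-text steps such as `severity` would likely need clinician grading instead.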
5. Token Cost Estimate
| Agent | Current ~Tokens Out | With CoT ~Tokens Out | Delta |
|---|---|---|---|
| Clinical Parsing (`map_to_medical_codes`) | ~300 | ~450 | +150 |
| Clinical Reasoning (summary) | ~400 | ~600 | +200 |
| Per patient total | ~700 | ~1,050 | +350 (~$0.002) |
At 100 cases/month: +$0.20/month. Negligible. Well within $0.15/patient target even with Sonnet escalation on complex cases.
Match Agent deduplication (§3.3) saves ~$0.03/patient, more than offsetting the CoT token cost.
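The arithmetic behind these figures can be reproduced as below. The token counts come from the table above; the output-token price is an assumption (USD 5 per million tokens, chosen because it lands near the document's ~$0.002 figure) and should be checked against current model pricing.

```python
# Assumed price, not stated in this document: USD per million output tokens.
OUTPUT_PRICE_PER_MTOK = 5.00

extra_tokens = (450 - 300) + (600 - 400)                      # parsing + reasoning deltas
extra_cost = extra_tokens * OUTPUT_PRICE_PER_MTOK / 1_000_000  # USD per patient
monthly_extra = extra_cost * 100                               # at 100 cases/month
net_per_patient = extra_cost - 0.03                            # after the §3.3 Sonnet saving

print(f"{extra_tokens} tokens, ${extra_cost:.5f}/patient, ${monthly_extra:.3f}/month")
print(f"net per patient: ${net_per_patient:.5f}")
```

Under that assumed price the per-patient delta is about $0.00175, which rounds to the ~$0.002 quoted, and the net effect after deduplication is a saving.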
6. Implementation Checklist
Must Do: Prompt Changes
- [ ] Update prompt text in `app/agents/prompts/clinical_parsing.py`: add 7-step CoT to `map_to_medical_codes`
- [ ] Update prompt text in `app/agents/prompts/clinical_reasoning.py`: add 7-step CoT to clinical summary
- [ ] Add `reasoning_steps` array to output JSON schema in both prompts
- [ ] Add `comorbidity_interactions` array to Clinical Reasoning output schema
- [ ] Update few-shot examples to include CoT reasoning traces
- [ ] Create new Langfuse prompt versions for both agents (if Langfuse prompt management is active)
Must Do: Wiring
- [ ] Create Flagsmith flags: `cot_clinical_parsing` (default: true), `cot_clinical_reasoning` (default: true)
- [ ] Update agent code to check flag before using CoT vs non-CoT prompt version
- [ ] Update events table logging to include `reasoning_steps` and `comorbidity_interactions` in JSONB payload
- [ ] Update Match Agent `analyze_clinical_picture` to consume existing Clinical Reasoning summary instead of re-calling Sonnet
- [ ] Verify `output_validator.py` handles CoT preamble before JSON; fix if needed (see §3.4)
- [ ] Add confidence score to FHIR resource `meta` extension (see §3.5)
Must Do: Validation
- [ ] Run existing unit tests — zero regressions
- [ ] Manual test with Aisha TKR case — confirm M17.11 (not M17.9), correct laterality, risk reasoning
- [ ] Verify `reasoning_steps` appears in Langfuse traces (numbered, parseable)
- [ ] Verify `comorbidity_interactions` field populated in Clinical Reasoning output
- [ ] Verify Match Agent uses existing summary (check Langfuse: it should NOT show a separate `analyze_clinical_picture` Sonnet call when summary exists)
- [ ] Verify guardrail output validator passes CoT output without false rejections
Must Do: Documentation
- [ ] Update CLAUDE.md §11.6 (Prompt Engineering Strategy): add "Chain-of-thought reasoning on clinical agents"
- [ ] Update CLAUDE.md §11.2.5: change from "Deferred to Post-MVP" to "Rule-based risk_assessor.py shipped (Session 33). LLM Risk Agent deferred to post-Series A."
- [ ] Fix CLAUDE.md Ground Rule 6: `Inter→Montserrat`, remove/update Lovable reference
- [ ] Update CLAUDE.md Post-MVP table Risk Agent row: reflect rule-based shipped, upgrade path = LLM shadow
- [ ] Add Use Case 7 to LLM Evaluation section: Risk Assessment (rule-based current, LLM shadow next, standalone post-Series A)
- [ ] Add Key Decisions Log entry: "CoT prompt enhancement added to Clinical Parsing + Clinical Reasoning agents. Lightweight precursor to ReAct (ADR-011). Reasoning traces stored for LangSmith evals and training data."
- [ ] Add Session 34 entry with all changes
Must NOT Do
- [ ] No new endpoints
- [ ] No migrations
- [ ] No new pip dependencies
- [ ] No changes to LangGraph node definitions
- [ ] No changes to Intake, Explanation, or Risk agents
- [ ] No `max_tokens` restrictions that would truncate reasoning
- [ ] No hidden CoT: reasoning must be visible in traces
Nice to Have (post-demo, document only)
- [ ] A/B test CoT vs non-CoT using Langfuse quality scores across 50+ cases
- [ ] Implement risk assessor shadow comparison using `comorbidity_interactions` (§4.2)
- [ ] Add CoT to Explanation Agent if match explanations lack clinical depth
- [ ] Build LangSmith eval dataset from 100+ CoT reasoning traces (§4.3)
- [ ] Compare CoT traces against self-hosted model output for MedGemma evaluation (§4.5)