AI Steer: Chain-of-Thought Prompt Enhancement for Clinical Agents
Intent
Add explicit chain-of-thought (CoT) reasoning instructions to two clinical agent prompts — Clinical Parsing Agent (map_to_medical_codes) and Clinical Reasoning Agent (clinical summary). This is primarily a prompt-level change with small wiring additions for observability, A/B testing, and downstream data flow.
CoT is the lightweight precursor to ReAct (ADR-011). The reasoning traces produced here become the inner reasoning kernel inside future ReAct loops. Nothing shipped in this feature is throwaway.
Why This Matters
The agent pipeline already has structural CoT via LangGraph node decomposition (extract → map → generate). But within each node, prompts do not instruct step-by-step reasoning. Adding explicit CoT:
- Reduces ICD-10/SNOMED hallucination (especially laterality errors like M17.11 vs M17.12)
- Produces structured reasoning traces that feed LangSmith evals, training data, and risk assessor shadow comparison
- Becomes the reasoning kernel inside future ReAct loops (ADR-011)
- Costs only ~$0.002 extra per patient journey
Scope: What to Change
Prompt files (primary)
- `app/agents/prompts/clinical_parsing.py`: the `map_to_medical_codes` prompt
- `app/agents/prompts/clinical_reasoning.py`: the clinical summary/treatment pathway prompt
Wiring additions (secondary)
- Flagsmith flags: `cot_clinical_parsing` and `cot_clinical_reasoning` for A/B measurement
- Events table: store structured `reasoning_steps` JSON in event payload (not just raw LLM text)
- Output schema: CoT output must include a `comorbidity_interactions` field for risk assessor shadow comparison
- Match Agent deduplication: if Clinical Reasoning produces a CoT-enriched clinical summary, Match Agent's `analyze_clinical_picture` should consume it instead of re-reasoning from scratch
- Guardrail tolerance: verify `output_validator.py` regex patterns handle CoT preamble before JSON block
- CLAUDE.md updates: fix 6 stale items (see feature spec checklist)
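The flag gating above could be sketched roughly as follows. Only the two flag names come from this spec; `select_prompt`, the `FlagReader` callable, and the prompt arguments are hypothetical stand-ins for whatever Flagsmith wrapper and prompt constants the codebase already has.

```python
from typing import Callable, Tuple

# Flag names come from the spec; the agent keys are illustrative.
COT_FLAGS = {
    "clinical_parsing": "cot_clinical_parsing",
    "clinical_reasoning": "cot_clinical_reasoning",
}

# Hypothetical type: any callable that answers "is this flag enabled?".
FlagReader = Callable[[str], bool]


def select_prompt(
    agent: str, baseline: str, cot: str, is_enabled: FlagReader
) -> Tuple[str, str]:
    """Return (variant, prompt) so the variant can be logged for A/B comparison."""
    if is_enabled(COT_FLAGS[agent]):
        return "cot", cot
    return "baseline", baseline
```

Returning the variant label alongside the prompt lets each trace and event record which arm produced it, which is what the A/B measurement needs.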
Scope: What NOT to Change
- No changes to LangGraph node definitions or edges
- No changes to agent orchestrator or supervisor
- No changes to Intake Agent (conversational, CoT not beneficial)
- No changes to Match Agent prompts (deterministic scoring; it reads Clinical Reasoning output)
- No changes to Explanation Agent (already produces narrative; may consume CoT output for grounding)
- No changes to `risk_assessor.py` (rule-based, deterministic; CoT doesn't apply to lookup tables)
- No new endpoints, no migrations, no new dependencies
- No `max_tokens` restrictions that would truncate reasoning output
Key Constraints
Langfuse prompt management
- Changes must create new prompt versions in Langfuse, not just edit Python files
- If prompts are currently hardcoded in Python and not yet pulled from Langfuse, update the Python files AND note that Langfuse migration is pending
Prompt structure
- CoT instructions must come BEFORE the output format specification (JSON schema) — reasoning first, then structured output
- Keep existing few-shot examples. Add CoT reasoning traces to those examples
- Reasoning steps must be numbered and parseable (not freeform prose) — enables harvesting for LangSmith eval datasets later
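As a rough illustration of "reasoning first, then structured output" with parseable numbered steps, the sketch below pairs a prompt layout with a small parser. The prompt wording, `COT_PROMPT_TEMPLATE`, `STEP_PATTERN`, and `parse_reasoning_steps` are all assumptions for illustration, not the actual prompt in `clinical_parsing.py`.

```python
import re
from typing import Dict, List, Union

# Illustrative layout only: reasoning instructions first, output format last.
COT_PROMPT_TEMPLATE = """\
Reason through the mapping in numbered steps before answering:
1. Identify the condition and the affected body site.
2. Determine laterality (right, left, bilateral, or unspecified).
3. Select the most specific ICD-10 and SNOMED codes.

After the numbered steps, output a single JSON object with the mapped codes.

Clinical text:
{clinical_text}
"""

# Matches lines such as "2. Laterality: right." at the start of a line.
STEP_PATTERN = re.compile(r"^\s*(\d+)[.)]\s+(.+?)\s*$", re.MULTILINE)


def parse_reasoning_steps(llm_output: str) -> List[Dict[str, Union[int, str]]]:
    """Extract numbered reasoning steps as {"step": n, "text": ...} dicts."""
    return [
        {"step": int(number), "text": text}
        for number, text in STEP_PATTERN.findall(llm_output)
    ]
```

Freeform prose would not match the pattern, which is the practical reason the steps need to be numbered.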
Visibility
- Do NOT use "hidden CoT" — reasoning steps must appear in Langfuse traces for observability
- Reasoning must be stored as structured JSON in events table payload (training data for ML v2, MedGemma fine-tuning)
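One possible shape for the structured payload is sketched below. Only `reasoning_steps` is named in this spec; the dataclass names and the `agent`, `prompt_variant`, and `raw_output` fields are assumptions, and the real events table schema may differ.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class ReasoningStep:
    step: int
    text: str


@dataclass
class CotEventPayload:
    # `reasoning_steps` is the field named in the spec; the rest is assumed.
    agent: str
    prompt_variant: str
    reasoning_steps: List[ReasoningStep] = field(default_factory=list)
    raw_output: str = ""

    def to_json(self) -> str:
        """Serialize to JSON; asdict() also converts the nested steps."""
        return json.dumps(asdict(self), ensure_ascii=False)
```

Keeping the raw text next to the parsed steps means a parser bug does not lose the original reasoning.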
Model routing
- Currently Clinical Parsing uses Haiku. CoT is more demanding
- If Haiku CoT quality is insufficient on the Aisha test case, escalate the `map_to_medical_codes` step to Sonnet using the existing tiered routing config; no code change needed
- Do NOT escalate the entire Clinical Context Agent, only the code-mapping step
Guardrails
- `output_validator.py` uses regex to check for forbidden patterns in LLM output
- CoT adds reasoning text before the JSON block. Verify the validator either: (a) only checks the JSON portion, or (b) tolerates non-JSON preamble
- If neither, update the validator to extract JSON from CoT+JSON output before validation
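If the validator does need updating, the extraction step might look something like the sketch below. `extract_json_block` is a hypothetical helper, not an existing function in `output_validator.py`; it relies only on the standard library's `json.JSONDecoder.raw_decode`.

```python
import json
from typing import Tuple


def extract_json_block(llm_output: str) -> Tuple[str, dict]:
    """Split CoT output into (preamble, parsed JSON object).

    Tries each "{" from left to right and returns the first position from
    which a complete JSON object decodes. Raises ValueError if none does.
    """
    decoder = json.JSONDecoder()
    start = llm_output.find("{")
    while start != -1:
        try:
            obj, _end = decoder.raw_decode(llm_output, start)
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one.
            start = llm_output.find("{", start + 1)
            continue
        return llm_output[:start].rstrip(), obj
    raise ValueError("no JSON object found in LLM output")
```

The validator could then apply its existing checks to the parsed object (option a), while the preamble is kept separately for tracing.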
Token budget
- CoT adds ~300-500 output tokens per patient journey total
- This is acceptable, well within the $0.15/patient target
- Do not add a `max_tokens` limit that would truncate reasoning
Connection Points: Nothing Throwaway
Every addition in this feature extends into a future system:
| What we ship now | What it feeds later |
|---|---|
| Numbered reasoning steps in Langfuse traces | LangSmith eval datasets (currently deferred — "not enough data") |
| Flagsmith A/B flags for CoT vs non-CoT | Quality delta measurement → feeds ReAct decision (ADR-011) |
| `comorbidity_interactions` field in CoT output | Risk assessor shadow comparison → triggers LLM Risk Agent build |
| Structured reasoning JSON in events table | Training data for ML Ranking v2 and MedGemma fine-tuning (Stage 5) |
| Match Agent consuming Clinical Reasoning summary | Eliminates duplicate LLM call, establishes data flow pattern |
| CoT reasoning as ReAct inner kernel | When ReAct wraps CoT, the "think" step is already optimized |
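The Match Agent row in the table could be wired along these lines. The function name `analyze_clinical_picture` comes from this spec, but the signature, the `clinical_summary` state key, and the `reason_from_scratch` fallback are assumptions about how the LangGraph state is shaped.

```python
from typing import Callable, Optional


def analyze_clinical_picture(
    state: dict, reason_from_scratch: Callable[[dict], str]
) -> str:
    """Prefer the upstream CoT summary; call the LLM only if it is absent."""
    # Assumed key: where Clinical Reasoning stores its CoT-enriched summary.
    summary: Optional[str] = state.get("clinical_summary")
    if summary:
        return summary
    return reason_from_scratch(state)
```

Keeping the fallback means the Match Agent still works when the CoT flag is off or the upstream summary is missing.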
Quality Bar
- After changes, run existing agent tests; zero regressions
- Manually test with Aisha TKR demo case (ICD M17.11):
  - Verify correct laterality (right, not unspecified)
  - Verify correct SNOMED mapping
  - Verify clinical summary includes BMI/age/comorbidity reasoning
  - Verify `comorbidity_interactions` field populated
- Check Langfuse traces show visible numbered reasoning steps
- Verify `output_validator.py` does not reject CoT output
- Verify Match Agent reads Clinical Reasoning summary (no duplicate Claude Sonnet call)
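Some of the manual checks above could later be captured in a small helper such as the sketch below. `check_aisha_case` and the `icd10` and `laterality` keys are hypothetical, and merging parsing and reasoning output into one dict is a simplification; only `comorbidity_interactions` and the M17.11 code come from this spec.

```python
from typing import List


def check_aisha_case(output: dict, reasoning_steps: list) -> List[str]:
    """Return the failed checks; an empty list means the case passes."""
    failures = []
    if output.get("icd10") != "M17.11":
        failures.append("ICD-10 code is not M17.11")
    if output.get("laterality") != "right":
        failures.append("laterality is not 'right'")
    if not output.get("comorbidity_interactions"):
        failures.append("comorbidity_interactions is empty or missing")
    if not reasoning_steps:
        failures.append("no numbered reasoning steps found")
    return failures
```

SNOMED correctness and the quality of the BMI/age/comorbidity reasoning still need a human reviewer; this only covers the mechanical checks.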
References
- ADR-011 (ReAct deferral): CoT is the lightweight alternative implemented now
- Architecture doc v1.2 §11.6 (Prompt Engineering Strategy)
- Session 33 #68: `risk_assessor.py` (rule-based, the pattern CoT shadow comparison follows)
- Session 21: `lab_analyzer.py` `comorbidity_llm_shadow` pattern (the exact template for shadow comparison)
- Existing prompt files in `app/agents/prompts/`
- `config/guardrails.yaml`: output validation rules
- `app/services/output_validator.py`: regex-based output checking