AI Steer: Chain-of-Thought Prompt Enhancement for Clinical Agents

Intent

Add explicit chain-of-thought (CoT) reasoning instructions to two clinical agent prompts — Clinical Parsing Agent (map_to_medical_codes) and Clinical Reasoning Agent (clinical summary). This is primarily a prompt-level change with small wiring additions for observability, A/B testing, and downstream data flow.

CoT is the lightweight precursor to ReAct (ADR-011). The reasoning traces produced here become the inner reasoning kernel inside future ReAct loops. Nothing shipped in this feature is throwaway.

Why This Matters

The agent pipeline already has structural CoT via LangGraph node decomposition (extract → map → generate). But within each node, prompts do not instruct step-by-step reasoning. Adding explicit CoT:

  • Reduces ICD-10/SNOMED hallucination (especially laterality errors like M17.11 vs M17.12)
  • Produces structured reasoning traces that feed LangSmith evals, training data, and risk assessor shadow comparison
  • Becomes the reasoning kernel inside future ReAct loops (ADR-011)
  • Costs only ~$0.002 extra per patient journey

Scope — What to Change

Prompt files (primary)

  1. app/agents/prompts/clinical_parsing.py — the map_to_medical_codes prompt
  2. app/agents/prompts/clinical_reasoning.py — the clinical summary/treatment pathway prompt

Wiring additions (secondary)

  1. Flagsmith flags: cot_clinical_parsing and cot_clinical_reasoning, for A/B measurement
  2. Events table — store structured reasoning_steps JSON in event payload (not just raw LLM text)
  3. Output schema — CoT output must include a comorbidity_interactions field for risk assessor shadow comparison
  4. Match Agent deduplication — if Clinical Reasoning produces a CoT-enriched clinical summary, Match Agent's analyze_clinical_picture should consume it instead of re-reasoning from scratch
  5. Guardrail tolerance — verify output_validator.py regex patterns handle CoT preamble before JSON block
  6. CLAUDE.md updates — fix 6 stale items (see feature spec checklist)
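
The flag wiring in item 1 could look roughly like the sketch below. It assumes a flags object exposing an `is_feature_enabled(name)` method; the `FlagSource` protocol, `StaticFlags` stand-in, `select_parsing_prompt` helper, and both prompt strings are hypothetical names introduced for illustration, not the project's actual code. Only the flag name `cot_clinical_parsing` comes from this spec.

```python
from __future__ import annotations

from typing import Protocol


class FlagSource(Protocol):
    """Hypothetical minimal interface that a feature-flag client could be adapted to."""

    def is_feature_enabled(self, feature_name: str) -> bool: ...


# Placeholder prompt texts; the real prompts live in app/agents/prompts/.
BASELINE_PROMPT = "Map the extracted clinical entities to ICD-10 and SNOMED codes."
COT_PROMPT = (
    "Reason step by step, then map the extracted clinical entities "
    "to ICD-10 and SNOMED codes."
)


def select_parsing_prompt(flags: FlagSource) -> tuple[str, str]:
    """Return (variant, prompt) so the variant label can be logged with the trace."""
    if flags.is_feature_enabled("cot_clinical_parsing"):
        return "cot", COT_PROMPT
    return "baseline", BASELINE_PROMPT


class StaticFlags:
    """In-memory stand-in for a flag client, useful in unit tests."""

    def __init__(self, enabled: set[str]) -> None:
        self._enabled = enabled

    def is_feature_enabled(self, feature_name: str) -> bool:
        return feature_name in self._enabled
```

Returning the variant label alongside the prompt keeps the A/B arm visible in traces, which is what makes the quality delta measurable later.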

Scope — What NOT to Change

  • No changes to LangGraph node definitions or edges
  • No changes to agent orchestrator or supervisor
  • No changes to Intake Agent (conversational, CoT not beneficial)
  • No changes to Match Agent prompts (deterministic scoring; it reads Clinical Reasoning output)
  • No changes to Explanation Agent (already produces narrative; may consume CoT output for grounding)
  • No changes to risk_assessor.py (rule-based, deterministic — CoT doesn't apply to lookup tables)
  • No new endpoints, no migrations, no new dependencies
  • No max_tokens restrictions that would truncate reasoning output

Key Constraints

Langfuse prompt management

  • Changes must create new prompt versions in Langfuse, not just edit Python files
  • If prompts are currently hardcoded in Python and not yet pulled from Langfuse, update the Python files AND note that Langfuse migration is pending

Prompt structure

  • CoT instructions must come BEFORE the output format specification (JSON schema) — reasoning first, then structured output
  • Keep existing few-shot examples. Add CoT reasoning traces to those examples
  • Reasoning steps must be numbered and parseable (not freeform prose) — enables harvesting for LangSmith eval datasets later
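
A minimal sketch of the ordering rule above: numbered reasoning instructions first, JSON schema second. The step wording, the `Step N:` line format, the schema keys other than `comorbidity_interactions`, and the `build_cot_prompt` helper are all assumptions made for illustration; the real prompt text belongs in the prompt files.

```python
import json

# Illustrative reasoning steps; the real prompt may define different ones.
COT_STEPS = [
    "Identify the anatomical site and laterality stated in the notes.",
    "List candidate ICD-10 codes and reject those with the wrong laterality.",
    "Select the matching SNOMED concept.",
    "Note comorbidity interactions relevant to risk.",
]

# Illustrative schema; only comorbidity_interactions is named in this spec.
OUTPUT_SCHEMA = {
    "icd10_code": "string",
    "snomed_code": "string",
    "comorbidity_interactions": ["string"],
}


def build_cot_prompt(clinical_text: str) -> str:
    """Assemble a prompt with numbered reasoning instructions before the JSON schema."""
    numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(COT_STEPS, start=1))
    return (
        "Reason through the following steps in order. "
        "Write each step on its own line as 'Step N: <reasoning>'.\n"
        f"{numbered}\n\n"
        "After the reasoning, output a single JSON object matching this schema:\n"
        f"{json.dumps(OUTPUT_SCHEMA, indent=2)}\n\n"
        f"Clinical notes:\n{clinical_text}"
    )
```

Asking for one `Step N:` line per step is one way to keep the trace parseable; any fixed, machine-readable prefix would serve the same purpose.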

Visibility

  • Do NOT use "hidden CoT" — reasoning steps must appear in Langfuse traces for observability
  • Reasoning must be stored as structured JSON in events table payload (training data for ML v2, MedGemma fine-tuning)
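
One way to turn raw LLM text into a structured `reasoning_steps` payload is sketched below. It assumes the model emits each step on its own line as `Step N: ...`; that format, the payload keys other than `reasoning_steps`, and both helper names are hypothetical and would need to match whatever the final prompt and events schema specify.

```python
import json
import re

# Assumed line format: "Step 3: some reasoning text"
STEP_PATTERN = re.compile(r"^Step[ \t]+(\d+):[ \t]*(.+)$", re.MULTILINE)


def parse_reasoning_steps(llm_text: str) -> list:
    """Extract numbered reasoning steps as a list of {"step", "text"} dicts."""
    return [
        {"step": int(num), "text": text.strip()}
        for num, text in STEP_PATTERN.findall(llm_text)
    ]


def build_event_payload(agent: str, llm_text: str) -> str:
    """Serialize a hypothetical event payload carrying the structured steps."""
    payload = {
        "agent": agent,
        "reasoning_steps": parse_reasoning_steps(llm_text),
    }
    return json.dumps(payload)
```

Storing the parsed list rather than the raw text means later consumers (eval datasets, training data) do not each have to re-implement the parsing.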

Model routing

  • Currently Clinical Parsing uses Haiku. CoT is more demanding
  • If Haiku CoT quality is insufficient on the Aisha test case, escalate map_to_medical_codes step to Sonnet using existing tiered routing config — no code change needed
  • Do NOT escalate the entire Clinical Context Agent — only the code-mapping step
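
The step-level escalation described above can be pictured as a per-step override on top of a default tier. The sketch below is purely illustrative: the actual tiered routing config may be shaped quite differently, and `DEFAULT_TIER`, `STEP_TIER_OVERRIDES`, `resolve_tier`, and the `extract` step key are hypothetical names. Only `map_to_medical_codes`, Haiku, and Sonnet come from this spec.

```python
# Hypothetical routing table; the real tiered routing config may differ.
DEFAULT_TIER = "haiku"

# Escalate only the code-mapping step, leaving every other step on the default tier.
STEP_TIER_OVERRIDES = {
    "map_to_medical_codes": "sonnet",
}


def resolve_tier(step_name: str) -> str:
    """Return the model tier for a step, falling back to the default tier."""
    return STEP_TIER_OVERRIDES.get(step_name, DEFAULT_TIER)
```

Keeping the override keyed by step rather than by agent is what prevents the whole agent from being escalated by accident.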

Guardrails

  • output_validator.py uses regex to check for forbidden patterns in LLM output
  • CoT adds reasoning text before the JSON block — verify the validator either: (a) only checks the JSON portion, or (b) tolerates non-JSON preamble
  • If neither, update the validator to extract JSON from CoT+JSON output before validation
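
If the validator does need updating, extracting the JSON portion could be done along these lines. This is a sketch under the assumption that the model emits reasoning text followed by a single top-level JSON object; `extract_json_block`, `find_forbidden`, and the idea of passing patterns as a list are hypothetical and do not describe the current output_validator.py.

```python
import json
import re


def extract_json_block(llm_text: str) -> dict:
    """Return the last top-level JSON object found in CoT+JSON output."""
    decoder = json.JSONDecoder()
    found = None
    idx = llm_text.find("{")
    while idx != -1:
        try:
            # raw_decode parses one JSON value starting at idx and
            # returns it with the index where that value ended.
            obj, end = decoder.raw_decode(llm_text, idx)
        except json.JSONDecodeError:
            # Not valid JSON at this brace (e.g. a brace inside prose); try the next one.
            idx = llm_text.find("{", idx + 1)
            continue
        if isinstance(obj, dict):
            found = obj
        # Resume after the parsed object so nested objects are not revisited.
        idx = llm_text.find("{", end)
    if found is None:
        raise ValueError("no JSON object found in LLM output")
    return found


def find_forbidden(llm_text: str, forbidden_patterns: list) -> list:
    """Apply regex checks to the JSON portion only, ignoring the CoT preamble."""
    serialized = json.dumps(extract_json_block(llm_text))
    return [p for p in forbidden_patterns if re.search(p, serialized)]
```

Using `json.JSONDecoder.raw_decode` instead of a greedy `\{.*\}` regex avoids mis-parsing when the reasoning text itself contains braces.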

Token budget

  • CoT adds ~300-500 output tokens per patient journey total
  • This is acceptable — well within the $0.15/patient target
  • Do not add max_tokens that would truncate reasoning
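
The budget claim can be sanity-checked with simple arithmetic. The sketch below takes the output-token price as a parameter; the $4 per million output tokens used in the usage note is a placeholder assumption chosen for illustration, not a quoted rate, and should be replaced with the actual price of whichever model serves each step.

```python
def extra_cost_usd(extra_output_tokens: int, usd_per_million_output_tokens: float) -> float:
    """Cost of additional output tokens at a given per-million-token price."""
    return extra_output_tokens * usd_per_million_output_tokens / 1_000_000


def budget_share(extra_cost: float, budget_usd: float) -> float:
    """Fraction of the per-patient budget consumed by the extra cost."""
    return extra_cost / budget_usd
```

Under that placeholder price, `extra_cost_usd(500, 4.0)` comes to $0.002 for the upper end of the 300-500 token range, and `budget_share(0.002, 0.15)` is a little over 1% of the $0.15/patient target.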

Connection Points — Nothing Throwaway

Every addition in this feature extends into a future system:

| What we ship now | What it feeds later |
| --- | --- |
| Numbered reasoning steps in Langfuse traces | LangSmith eval datasets (currently deferred: "not enough data") |
| Flagsmith A/B flags for CoT vs non-CoT | Quality delta measurement → feeds ReAct decision (ADR-011) |
| comorbidity_interactions field in CoT output | Risk assessor shadow comparison → triggers LLM Risk Agent build |
| Structured reasoning JSON in events table | Training data for ML Ranking v2 and MedGemma fine-tuning (Stage 5) |
| Match Agent consuming Clinical Reasoning summary | Eliminates duplicate LLM call, establishes data flow pattern |
| CoT reasoning as ReAct inner kernel | When ReAct wraps CoT, the "think" step is already optimized |

Quality Bar

  • After changes, run existing agent tests — zero regressions
  • Manually test with the Aisha TKR demo case (ICD-10 M17.11):
      • Verify correct laterality (right, not unspecified)
      • Verify correct SNOMED mapping
      • Verify clinical summary includes BMI/age/comorbidity reasoning
      • Verify comorbidity_interactions field populated
  • Check Langfuse traces show visible numbered reasoning steps
  • Verify output_validator.py does not reject CoT output
  • Verify Match Agent reads Clinical Reasoning summary (no duplicate Claude Sonnet call)
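
Parts of the manual check could be captured in a small helper like the one below. It is a sketch only: the output keys `icd10_code` and `reasoning_steps` and the function name `check_tkr_output` are hypothetical, while `comorbidity_interactions` and the expected code M17.11 come from this spec.

```python
def check_tkr_output(output: dict) -> list:
    """Return a list of human-readable failures for the TKR demo case (empty if all pass)."""
    failures = []
    # M17.11 encodes right-knee laterality; M17.12 would be the left knee.
    if output.get("icd10_code") != "M17.11":
        failures.append(f"expected ICD-10 M17.11, got {output.get('icd10_code')!r}")
    if not output.get("comorbidity_interactions"):
        failures.append("comorbidity_interactions is missing or empty")
    if not output.get("reasoning_steps"):
        failures.append("no numbered reasoning steps captured")
    return failures
```

Returning a list of failures rather than raising on the first one makes it easier to see every problem in a single run.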

References

  • ADR-011 (ReAct deferral) — CoT is the lightweight alternative implemented now
  • Architecture doc v1.2 §11.6 (Prompt Engineering Strategy)
  • Session 33 #68 — risk_assessor.py (rule-based, the pattern CoT shadow comparison follows)
  • Session 21 — lab_analyzer.py comorbidity_llm_shadow pattern (the exact template for shadow comparison)
  • Existing prompt files in app/agents/prompts/
  • config/guardrails.yaml — output validation rules
  • app/services/output_validator.py — regex-based output checking