AI Steer: Chain-of-Thought Prompt Enhancement for Clinical Agents

Intent

Add explicit chain-of-thought (CoT) reasoning instructions to two clinical agent prompts: the Clinical Parsing Agent's map_to_medical_codes prompt and the Clinical Reasoning Agent's clinical summary prompt. This is primarily a prompt-level change, with small wiring additions for observability, A/B testing, and downstream data flow.

CoT is the lightweight precursor to ReAct (ADR-011). The reasoning traces produced here become the inner reasoning kernel inside future ReAct loops. Nothing shipped in this feature is throwaway.

Why This Matters

The agent pipeline already has structural CoT via LangGraph node decomposition (extract → map → generate). But within each node, prompts do not instruct step-by-step reasoning. Adding explicit CoT:
- Reduces ICD-10/SNOMED hallucination (especially laterality errors like M17.11 vs M17.12)
- Produces structured reasoning traces that feed LangSmith evals, training data, and risk assessor shadow comparison
- Becomes the reasoning kernel inside future ReAct loops (ADR-011)
- Costs only ~$0.002 extra per patient journey

Scope: What to Change

Prompt files (primary)

app/agents/prompts/clinical_parsing.py — the map_to_medical_codes prompt
app/agents/prompts/clinical_reasoning.py — the clinical summary/treatment pathway prompt

Wiring additions (secondary)

Flagsmith flags — cot_clinical_parsing and cot_clinical_reasoning for A/B measurement
Events table — store structured reasoning_steps JSON in event payload (not just raw LLM text)
Output schema — CoT output must include a comorbidity_interactions field for risk assessor shadow comparison
Match Agent deduplication — if Clinical Reasoning produces a CoT-enriched clinical summary, Match Agent's analyze_clinical_picture should consume it instead of re-reasoning from scratch
Guardrail tolerance — verify output_validator.py regex patterns handle CoT preamble before JSON block
CLAUDE.md updates — fix 6 stale items (see feature spec checklist)
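
The flag-gated prompt selection described above can be sketched as follows. This is a minimal illustration, not the project's implementation: the prompt strings, the `select_parsing_prompt` helper, and the `is_enabled` callable are all hypothetical. Only the flag name `cot_clinical_parsing` comes from this spec. In production, `is_enabled` would wrap the Flagsmith client.

```python
from typing import Callable, Tuple

# Hypothetical prompt variants; the real text lives in app/agents/prompts/.
BASELINE_PROMPT = "Map the extracted findings to ICD-10 and SNOMED codes."
COT_PROMPT = (
    "Reason step by step before answering. Number each step.\n"
    + BASELINE_PROMPT
)


def select_parsing_prompt(is_enabled: Callable[[str], bool]) -> Tuple[str, str]:
    """Return (variant_label, prompt_text) for the map_to_medical_codes step.

    `is_enabled` answers "is this flag on?" for a flag name. It is an
    assumption of this sketch, standing in for the real flag client.
    """
    if is_enabled("cot_clinical_parsing"):
        return "cot", COT_PROMPT
    return "baseline", BASELINE_PROMPT
```

Returning the variant label alongside the prompt lets the caller record which arm produced a given output, which is what makes the A/B comparison measurable later.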

Scope: What NOT to Change

No changes to LangGraph node definitions or edges
No changes to agent orchestrator or supervisor
No changes to Intake Agent (conversational, CoT not beneficial)
No changes to Match Agent prompts (deterministic scoring; it reads Clinical Reasoning output)
No changes to Explanation Agent (already produces narrative; may consume CoT output for grounding)
No changes to risk_assessor.py (rule-based, deterministic — CoT doesn't apply to lookup tables)
No new endpoints, no migrations, no new dependencies
No max_tokens restrictions that would truncate reasoning output

Key Constraints

Langfuse prompt management

Changes must create new prompt versions in Langfuse, not just edit Python files
If prompts are currently hardcoded in Python and not yet pulled from Langfuse, update the Python files AND note that Langfuse migration is pending

Prompt structure

CoT instructions must come BEFORE the output format specification (JSON schema) — reasoning first, then structured output
Keep existing few-shot examples. Add CoT reasoning traces to those examples
Reasoning steps must be numbered and parseable (not freeform prose) — enables harvesting for LangSmith eval datasets later
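
One way to satisfy both ordering and parseability is sketched below. The instruction wording, the `Step N:` line format, and the helper names `build_prompt` and `parse_steps` are assumptions for illustration; the spec only requires that reasoning instructions precede the output format and that steps be numbered and machine-readable.

```python
import re
from typing import Dict, List

# Hypothetical instruction text; the actual wording is up to the prompt author.
COT_INSTRUCTIONS = (
    "Before answering, reason in numbered steps, one per line, "
    "formatted as 'Step N: <reasoning>'."
)
OUTPUT_FORMAT = "Then output a single JSON object matching the schema."


def build_prompt(task: str) -> str:
    # Reasoning instructions are placed before the output format specification.
    return "\n\n".join([task, COT_INSTRUCTIONS, OUTPUT_FORMAT])


# Matches lines such as "Step 2: Right knee primary OA maps to M17.11."
STEP_RE = re.compile(r"^Step (\d+):\s*(.+)$", re.MULTILINE)


def parse_steps(llm_text: str) -> List[Dict[str, object]]:
    """Extract numbered reasoning steps as a list of {"step", "text"} dicts."""
    return [
        {"step": int(number), "text": text.strip()}
        for number, text in STEP_RE.findall(llm_text)
    ]
```

A fixed line prefix keeps the parser trivial and means the same traces can be harvested for eval datasets without a second LLM pass.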

Visibility

Do NOT use "hidden CoT" — reasoning steps must appear in Langfuse traces for observability
Reasoning must be stored as structured JSON in events table payload (training data for ML v2, MedGemma fine-tuning)
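
A possible shape for the structured payload is sketched below. The field names `reasoning_steps` and `comorbidity_interactions` come from this spec; everything else (the class names, the `agent` and `prompt_variant` fields, and the choice of a list of strings for interactions) is an assumption made for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class ReasoningStep:
    step: int
    text: str


@dataclass
class CotEventPayload:
    # `agent` and `prompt_variant` are hypothetical bookkeeping fields.
    agent: str
    prompt_variant: str
    reasoning_steps: List[ReasoningStep] = field(default_factory=list)
    # Assumed to be free-text descriptions; the real schema may differ.
    comorbidity_interactions: List[str] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() recursively converts nested dataclasses to plain dicts.
        return json.dumps(asdict(self), sort_keys=True)
```

Storing steps as discrete objects rather than one text blob keeps them queryable in the events table and directly reusable as training examples.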

Model routing

Clinical Parsing currently uses Haiku, and CoT places heavier reasoning demands on the model
If Haiku CoT quality is insufficient on the Aisha test case, escalate map_to_medical_codes step to Sonnet using existing tiered routing config — no code change needed
Do NOT escalate the entire Clinical Context Agent — only the code-mapping step
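
Step-level escalation can be pictured as a per-step override on top of a default tier. The sketch below is purely illustrative: the existing tiered routing config is not shown in this spec, so the `resolve_model` helper, the override mapping, and the step name `extract_entities` are hypothetical.

```python
from typing import Dict

# Assumed default tier for the parsing pipeline.
DEFAULT_TIER = "haiku"


def resolve_model(step: str, overrides: Dict[str, str]) -> str:
    """Return the model tier for a step, falling back to the default tier."""
    return overrides.get(step, DEFAULT_TIER)


# Escalate only the code-mapping step; every other step keeps the default.
overrides = {"map_to_medical_codes": "sonnet"}
```

Keeping the override keyed by step name is what allows the escalation to stay a config change rather than a code change.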

Guardrails

output_validator.py uses regex to check for forbidden patterns in LLM output
CoT adds reasoning text before the JSON block — verify the validator either: (a) only checks the JSON portion, or (b) tolerates non-JSON preamble
If neither, update the validator to extract JSON from CoT+JSON output before validation
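
If the validator does need updating, one approach is to split the output into its reasoning preamble and its JSON object, then validate only the JSON. The sketch below assumes the JSON object is the last decodable object in the text; `split_cot_and_json` is a hypothetical helper, not an existing function in output_validator.py.

```python
import json
from typing import Optional, Tuple


def split_cot_and_json(llm_text: str) -> Tuple[str, Optional[dict]]:
    """Split CoT+JSON output into (preamble, parsed_json).

    Returns (llm_text, None) when no JSON object can be decoded.
    """
    decoder = json.JSONDecoder()
    last: Optional[Tuple[int, dict]] = None
    pos = llm_text.find("{")
    while pos != -1:
        try:
            # raw_decode parses one JSON value starting at `pos` and
            # returns the value plus the index where it ended.
            obj, end = decoder.raw_decode(llm_text, pos)
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one.
            pos = llm_text.find("{", pos + 1)
            continue
        last = (pos, obj)
        pos = llm_text.find("{", end)
    if last is None:
        return llm_text, None
    start, obj = last
    return llm_text[:start].rstrip(), obj
```

Scanning for the last decodable object, rather than the first brace, tolerates reasoning text that happens to contain stray braces before the real JSON block.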

Token budget

CoT adds ~300-500 output tokens per patient journey total
This is acceptable — well within the $0.15/patient target
Do not add max_tokens that would truncate reasoning
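
The cost arithmetic behind the token budget can be checked with a one-line calculation. The per-token price below is an illustrative assumption, not a figure from this spec; substitute the actual output price of the model in use.

```python
def extra_cot_cost_usd(extra_output_tokens: int, usd_per_million_output_tokens: float) -> float:
    """Added cost of CoT output tokens for one patient journey."""
    return extra_output_tokens * usd_per_million_output_tokens / 1_000_000


# Assumed price of $4.00 per million output tokens (hypothetical).
low = extra_cot_cost_usd(300, 4.0)
high = extra_cot_cost_usd(500, 4.0)
```

Under that assumed price, 300 to 500 extra tokens cost roughly $0.0012 to $0.002 per journey, a small fraction of the $0.15/patient target.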

Connection Points: Nothing Throwaway

Every addition in this feature extends into a future system:

| What we ship now | What it feeds later |
| --- | --- |
| Numbered reasoning steps in Langfuse traces | LangSmith eval datasets (currently deferred: "not enough data") |
| Flagsmith A/B flags for CoT vs non-CoT | Quality delta measurement → feeds ReAct decision (ADR-011) |
| `comorbidity_interactions` field in CoT output | Risk assessor shadow comparison → triggers LLM Risk Agent build |
| Structured reasoning JSON in events table | Training data for ML Ranking v2 and MedGemma fine-tuning (Stage 5) |
| Match Agent consuming Clinical Reasoning summary | Eliminates duplicate LLM call, establishes data flow pattern |
| CoT reasoning as ReAct inner kernel | When ReAct wraps CoT, the "think" step is already optimized |

Quality Bar

After changes, run the existing agent tests and confirm zero regressions
Manually test with the Aisha TKR demo case (ICD-10 M17.11):
Verify correct laterality (right, not unspecified)
Verify correct SNOMED mapping
Verify clinical summary includes BMI/age/comorbidity reasoning
Verify comorbidity_interactions field populated
Check Langfuse traces show visible numbered reasoning steps
Verify output_validator.py does not reject CoT output
Verify Match Agent reads Clinical Reasoning summary (no duplicate Claude Sonnet call)
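
Part of the manual checklist above can be approximated by a small structural check on the parsed output. This is a sketch under assumptions: the `icd10` key and the `check_cot_output` helper are hypothetical, and the check covers only the code, the `comorbidity_interactions` field, and step numbering. It does not replace the manual review of SNOMED mapping or clinical reasoning quality.

```python
from typing import Any, Dict, List


def check_cot_output(output: Dict[str, Any], expected_icd10: str = "M17.11") -> List[str]:
    """Return a list of failure messages; an empty list means all checks passed."""
    failures: List[str] = []
    # Laterality check: the exact code must match, not an unspecified variant.
    if output.get("icd10") != expected_icd10:
        failures.append("unexpected ICD-10 code: %r" % output.get("icd10"))
    if not output.get("comorbidity_interactions"):
        failures.append("comorbidity_interactions is missing or empty")
    steps = output.get("reasoning_steps") or []
    numbers = [step.get("step") for step in steps]
    # Steps must exist and be numbered consecutively from 1.
    if not steps or numbers != list(range(1, len(steps) + 1)):
        failures.append("reasoning_steps are missing or not numbered 1..N")
    return failures
```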

References

ADR-011 (ReAct deferral) — CoT is the lightweight alternative implemented now
Architecture doc v1.2 §11.6 (Prompt Engineering Strategy)
Session 33 #68 — risk_assessor.py (rule-based, the pattern CoT shadow comparison follows)
Session 21 — lab_analyzer.py comorbidity_llm_shadow pattern (the exact template for shadow comparison)
Existing prompt files in app/agents/prompts/
config/guardrails.yaml — output validation rules
app/services/output_validator.py — regex-based output checking