ADR-011: ReAct Pattern — Evaluated, Deferred
Status: Accepted
Date: 2026-04-06
Decision Makers: Srikanth Donthi (CPO/CTO)
Relates to: ADR-007 (Conversation-First UX), ADR-006 (Records-First Intake)
Context
The ReAct pattern (Yao et al., ICLR 2023) interleaves reasoning traces ("Thought") and task-specific actions ("Act") within LLM agent execution, allowing the model to reason about observations before deciding the next action. This contrasts with our current approach where each agent executes as a single-shot or fixed-sequence LLM call chain.
ReAct is directly relevant to Curaway's agent pipeline because:
- Clinical Context Agent extracts diagnoses, ICD codes, and medications from uploaded medical reports. This is exactly the kind of knowledge-intensive reasoning task where ReAct reduces hallucination (hallucination accounted for 0% of ReAct failures vs 56% of chain-of-thought failures in the paper's HotpotQA failure analysis).
- Match Agent queries Qdrant, Neo4j, and PostgreSQL in sequence — a retrieval-heavy pipeline where search reformulation (a core ReAct capability) could recover from poor initial results.
- Intake Agent must reason about what information has been collected vs what's still needed — a state-tracking task where ReAct's explicit thought traces prevent the agent from re-asking answered questions or missing gaps.
Decision
Defer ReAct implementation until post-demo. Do not retrofit agents with interleaved Thought→Action→Observation loops at the current MVP stage.
Rationale
1. No Baseline Failure Data
We have not yet completed a full end-to-end demo (Session 16 — Polish & Demo Prep — is in progress). Without running 20–30 patient journeys through the current single-shot agents and reviewing Langfuse traces, we cannot identify where agents actually fail. ReAct solves hallucination and grounding problems — if the real bottleneck is insufficient Neo4j seed data or missing procedure requirements, ReAct adds complexity without addressing the root cause.
2. Cost Multiplier
ReAct multiplies LLM calls per agent. Each Thought→Action→Observation cycle is a separate completion.
| Agent | Current Calls | With ReAct (est.) | Model | Cost Delta/Patient |
|---|---|---|---|---|
| Intake Agent | 1–2 | 5–7 | GPT-4o mini | +$0.03–0.05 |
| Clinical Context Agent | 3 | 5–8 | Claude Haiku | +$0.02–0.04 |
| Match Agent | 2 | 4–6 | Claude Sonnet | +$0.08–0.12 |
| Explanation Agent | 1 | 1 (no change) | Template/Haiku | $0 |
Total estimated increase: from about $0.15 to roughly $0.25–0.35 per patient journey. This is manageable at POC scale (500 patients: up to $175 vs $75 today), but unjustified without evidence that current agents are failing because of reasoning quality.
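The arithmetic behind these figures can be reproduced with a short sketch. The per-agent deltas and the $0.15 baseline are taken from the table and text above; the function and constant names are illustrative only. Summing the table's deltas directly gives a slightly higher range than the rounded $0.25–0.35 figure, so the totals should be read as estimates.

```python
# Per-agent cost deltas in USD per patient journey, copied from the table above.
COST_DELTAS = {
    "Intake Agent": (0.03, 0.05),
    "Clinical Context Agent": (0.02, 0.04),
    "Match Agent": (0.08, 0.12),
    "Explanation Agent": (0.00, 0.00),
}
BASELINE_PER_JOURNEY = 0.15  # current per-journey estimate from the text
POC_PATIENTS = 500           # POC scale from the text


def journey_cost_range(baseline, deltas):
    """Return (low, high) per-journey cost after adding every agent's delta."""
    low = baseline + sum(lo for lo, _ in deltas.values())
    high = baseline + sum(hi for _, hi in deltas.values())
    return round(low, 2), round(high, 2)


low, high = journey_cost_range(BASELINE_PER_JOURNEY, COST_DELTAS)
print(f"Per journey: ${low:.2f} to ${high:.2f}")
print(f"POC total for {POC_PATIENTS} patients: "
      f"${low * POC_PATIENTS:.0f} to ${high * POC_PATIENTS:.0f}")
```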
3. Latency Impact on Conversation UX
Curaway is a conversation-first interface. Patients watch streaming responses in real time. A 5-step ReAct loop on Claude Sonnet means 5 × 500ms = 2.5 seconds minimum before the agent produces a final response, not counting tool execution time (Neo4j queries, Qdrant searches, FHIR lookups).
Target latencies: ~300ms TTFT (simple turns), 500–800ms (clinical reasoning). ReAct pushes clinical reasoning to 2–5 seconds without mitigation.
Mitigation available (when implemented): Stream intermediate thoughts to the UI as trust-building transparency ("Reviewing your MRI report... Checking procedure requirements..."). SSE infrastructure already supports this.
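As a rough illustration of this mitigation, intermediate thoughts could be framed as Server-Sent Events ahead of the final answer. This is a minimal sketch: the event names `thought` and `final`, the payload fields, and both function names are assumptions, and the existing SSE infrastructure may use a different schema.

```python
import json
from typing import Iterable, Iterator


def format_sse(event: str, data: dict) -> str:
    """Serialize one Server-Sent Event: event line, data line, blank line."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


def stream_react_progress(thoughts: Iterable[str], final_answer: str) -> Iterator[str]:
    """Yield one 'thought' event per intermediate step, then one 'final' event."""
    for step, text in enumerate(thoughts, start=1):
        yield format_sse("thought", {"step": step, "text": text})
    yield format_sse("final", {"text": final_answer})


if __name__ == "__main__":
    demo = stream_react_progress(
        ["Reviewing your MRI report...", "Checking procedure requirements..."],
        "Here is what I found.",
    )
    for chunk in demo:
        print(chunk, end="")
```

Because each thought is emitted as soon as it is produced, the patient sees progress within the first completion rather than waiting for the whole loop to finish.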
4. Determinism and Auditability
The architecture doc explicitly states: "POC uses deterministic orchestration via LangGraph for auditability and trust." ReAct introduces non-determinism within agents — the model decides when to think vs act and which tool to call. Two identical patient profiles could produce different reasoning paths.
For healthcare, reproducible explanations matter. Our existing guardrails (input classifier, output validator, externalized YAML rules) constrain outputs but not reasoning paths.
Mitigation available (when implemented): Use "dense thought" mode (forced Thought→Action→Observation alternation) rather than "sparse thought" mode (model decides when to think). This preserves deterministic structure while adding reasoning.
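One way to picture dense thought mode is a loop that always runs Thought, then Action, then Observation, in that fixed order, with a hard step cap. The sketch below is plain Python rather than LangGraph code; `run_dense_react`, `Step`, `think`, and `act` are hypothetical names, and the two callables stand in for LLM and tool calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Step:
    thought: str
    action: str
    observation: str


def run_dense_react(
    question: str,
    think: Callable[[str, List[Step]], str],
    act: Callable[[str], Tuple[str, str, Optional[str]]],
    max_steps: int = 7,
) -> Tuple[Optional[str], List[Step]]:
    """Run a forced Thought -> Action -> Observation loop.

    `think` returns a thought from the question and history (stand-in for an LLM call).
    `act` maps a thought to (action, observation, answer); a non-None answer ends the loop.
    Returns (answer, trace); answer is None if the step cap is reached first.
    """
    trace: List[Step] = []
    for _ in range(max_steps):
        thought = think(question, trace)            # the model always thinks first
        action, observation, answer = act(thought)  # then acts and observes
        trace.append(Step(thought, action, observation))
        if answer is not None:
            return answer, trace
    return None, trace  # cap reached: caller decides how to fall back
```

The control flow is the same for every input; only the content of each thought and action varies, which keeps the trace shape predictable for audit purposes.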
5. The Paper Supports Waiting
ReAct's strongest results came from fine-tuning on 3,000 trajectories, not from few-shot prompting alone. With prompting only, ReAct roughly matches simpler methods and, on HotpotQA, slightly trails chain-of-thought (Table 1: 27.4 vs 29.4 EM). We won't have training data until real patients use the platform.
6. Budget and Session Priority
POC budget is $1,000 total (~$648 remaining). Engineering time to retrofit agents with ReAct loops, test edge cases (stuck loops, partial state failures), update guardrails, and wire streaming intermediate thoughts to the UI is estimated at 3–4 full Claude Code sessions. Those sessions have higher ROI spent on the TKR demo, Qdrant re-seeding, and the UI/UX revision already prioritized.
Where ReAct Should Be Applied (Post-Demo)
When Langfuse traces reveal specific failure categories, apply ReAct selectively — not uniformly across all agents.
| Agent | Apply ReAct? | Condition |
|---|---|---|
| Intake Agent (document reasoning) | Yes — first candidate | When traces show hallucinated clinical details from uploaded PDFs |
| Match Agent (edge-case reranking) | Yes — second candidate | When traces show poor matches attributable to reasoning errors, not data gaps |
| Clinical Context Agent | Maybe | Only if ICD extraction accuracy is below threshold after Sonnet upgrade |
| Explanation Agent | No | Text generation, not retrieval/reasoning — single-shot is sufficient |
| Simple extraction tasks | No | ICD coding, document parsing, OCR — classification tasks where ReAct overhead isn't justified |
Implementation approach when triggered:
- Collect baseline: Run 20–30 patient journeys, export Langfuse traces
- Categorize failures using the paper's taxonomy: hallucination, reasoning error, search error, label ambiguity
- Retrofit one agent (likely Intake Agent) with ReAct loop using LangGraph's native multi-step support
- Run in shadow mode alongside single-shot via Flagsmith flag
- Compare clinical accuracy in Langfuse traces over 2–4 weeks
- Enable via Flagsmith if quality improvement is measurable
- Cap max steps (7 for reasoning tasks, 5 for retrieval) to prevent runaway loops
- Persist reasoning traces in Langfuse (not in LangGraph state) to avoid context bloat; pass only a structured `reasoning_summary` (3–5 sentences) to downstream agents
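The shadow-mode step in the list above could look roughly like the following. This sketch uses a plain boolean in place of a Flagsmith flag lookup and an in-memory list in place of Langfuse; `run_with_shadow`, `ShadowRecord`, `single_shot`, and `react_variant` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ShadowRecord:
    patient_input: str
    served_output: str
    shadow_output: Optional[str]
    shadow_error: Optional[str] = None


def run_with_shadow(
    patient_input: str,
    single_shot: Callable[[str], str],
    react_variant: Callable[[str], str],
    shadow_enabled: bool,
    log: List[ShadowRecord],
) -> str:
    """Serve the single-shot result; run the ReAct variant for comparison only."""
    served = single_shot(patient_input)
    shadow_output, shadow_error = None, None
    if shadow_enabled:
        try:
            shadow_output = react_variant(patient_input)
        except Exception as exc:
            # A shadow failure must never affect the patient-facing reply.
            shadow_error = repr(exc)
    log.append(ShadowRecord(patient_input, served, shadow_output, shadow_error))
    return served
```

In a real conversation the shadow call would likely run in the background so that it adds no latency to the served reply; the sketch runs it inline only to keep the example short.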
Trigger Point for Revisiting
Revisit this decision when all three conditions are met:
- [ ] TKR demo completed and at least 20 patient journeys traced in Langfuse
- [ ] Failure analysis shows >10% hallucination rate in clinical extraction OR >15% poor-quality matches attributable to agent reasoning errors (not data gaps)
- [ ] Sonnet upgrade for Clinical Context Agent is already deployed (ADR pending) — establishes whether model quality alone resolves the issue before adding architectural complexity
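The three conditions can be expressed as a single check. This is a sketch under stated assumptions: `BaselineStats` and its field names are hypothetical, and the rates would come from manual failure analysis of the Langfuse traces rather than from any automated metric.

```python
from dataclasses import dataclass


@dataclass
class BaselineStats:
    demo_completed: bool
    journeys_traced: int
    hallucination_rate: float      # share of clinical extractions with hallucinated details
    poor_match_rate: float         # share of poor matches caused by reasoning errors, not data gaps
    sonnet_upgrade_deployed: bool


def should_revisit_react(stats: BaselineStats) -> bool:
    """Return True only when all three trigger conditions hold."""
    enough_data = stats.demo_completed and stats.journeys_traced >= 20
    quality_problem = stats.hallucination_rate > 0.10 or stats.poor_match_rate > 0.15
    return enough_data and quality_problem and stats.sonnet_upgrade_deployed
```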
Alternatives Considered
| Alternative | Why Not Now |
|---|---|
| Full ReAct across all agents | Cost/latency multiplier unjustified without failure data |
| CoT-SC → ReAct fallback (paper's best method) | Requires multiple sampling runs per query — too expensive and slow for real-time conversation |
| Inner Monologue (IM) style | Paper showed ReAct outperforms IM (71% vs 53% on ALFWorld) — if we add reasoning, do it properly |
| Chain-of-Thought only | Already implicit in our system prompts — adding explicit CoT without acting doesn't address retrieval failures |
References
- Yao, S. et al. "ReAct: Synergizing Reasoning and Acting in Language Models." ICLR 2023. arXiv:2210.03629v3.
- Internal: `config/guardrails.yaml`, `app/agents/orchestrator.py`, Langfuse dashboard