ADR-011: ReAct Pattern (Evaluated, Deferred)

Status: Accepted
Date: 2026-04-06
Decision Makers: Srikanth Donthi (CPO/CTO)
Relates to: ADR-007 (Conversation-First UX), ADR-006 (Records-First Intake)

Context

The ReAct pattern (Yao et al., ICLR 2023) interleaves reasoning traces ("Thought") and task-specific actions ("Act") within LLM agent execution, allowing the model to reason about observations before deciding the next action. This contrasts with our current approach where each agent executes as a single-shot or fixed-sequence LLM call chain.
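As a rough illustration of the interleaving described above, the loop below alternates a thought, an action, and an observation until the policy decides to finish. It is a minimal sketch, not Curaway's agent code: `react_loop`, `Step`, the scripted policy (standing in for an LLM call), and the `lookup` tool with its placeholder data are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Step:
    thought: str
    action: str          # tool name, or "finish"
    argument: str
    observation: str = ""


# A policy maps (question, steps so far) to (thought, action, argument).
# In a real agent this would be an LLM completion; here it is a plain function.
Policy = Callable[[str, List[Step]], Tuple[str, str, str]]
Tool = Callable[[str], str]


def react_loop(question: str, policy: Policy, tools: Dict[str, Tool],
               max_steps: int = 5) -> Tuple[str, List[Step]]:
    """Run Thought -> Action -> Observation cycles until "finish" or the step cap."""
    trajectory: List[Step] = []
    for _ in range(max_steps):
        thought, action, argument = policy(question, trajectory)
        step = Step(thought, action, argument)
        trajectory.append(step)
        if action == "finish":
            return argument, trajectory
        tool = tools.get(action)
        step.observation = tool(argument) if tool else f"unknown tool: {action}"
    return "", trajectory  # step budget exhausted without an answer


def scripted_policy(question: str, trajectory: List[Step]) -> Tuple[str, str, str]:
    """Hypothetical stand-in for the model: look something up once, then answer."""
    if not trajectory:
        return ("I should check the requirements first.", "lookup", "TKR")
    last = trajectory[-1].observation
    return (f"The lookup returned: {last}", "finish", last)


# Placeholder data only; not real procedure requirements.
tools = {"lookup": lambda key: {"TKR": "placeholder requirements for TKR"}.get(key, "not found")}

answer, steps = react_loop("What does TKR need?", scripted_policy, tools)
print(answer, len(steps))
```

The single-shot approach corresponds to calling the model once with no loop; the extra completions per cycle are what drive the cost and latency concerns discussed later in this ADR.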

ReAct is directly relevant to Curaway's agent pipeline because:

Clinical Context Agent extracts diagnoses, ICD codes, and medications from uploaded medical reports. This is exactly the kind of knowledge-intensive reasoning task where ReAct reduces hallucination: in the paper's HotpotQA failure analysis, hallucination accounted for 0% of ReAct failures vs 56% of chain-of-thought failures.
Match Agent queries Qdrant, Neo4j, and PostgreSQL in sequence — a retrieval-heavy pipeline where search reformulation (a core ReAct capability) could recover from poor initial results.
Intake Agent must reason about what information has been collected vs what's still needed — a state-tracking task where ReAct's explicit thought traces prevent the agent from re-asking answered questions or missing gaps.

Decision

Defer ReAct implementation until post-demo. Do not retrofit agents with interleaved Thought→Action→Observation loops at the current MVP stage.

Rationale

1. No Baseline Failure Data

We have not yet completed a full end-to-end demo (Session 16 — Polish & Demo Prep — is in progress). Without running 20–30 patient journeys through the current single-shot agents and reviewing Langfuse traces, we cannot identify where agents actually fail. ReAct solves hallucination and grounding problems — if the real bottleneck is insufficient Neo4j seed data or missing procedure requirements, ReAct adds complexity without addressing the root cause.

2. Cost Multiplier

ReAct multiplies LLM calls per agent. Each Thought→Action→Observation cycle is a separate completion.

Agent	Current Calls	With ReAct (est.)	Model	Cost Delta/Patient
Intake Agent	1–2	5–7	GPT-4o mini	+$0.03–0.05
Clinical Context Agent	3	5–8	Claude Haiku	+$0.02–0.04
Match Agent	2	4–6	Claude Sonnet	+$0.08–0.12
Explanation Agent	1	1 (no change)	Template/Haiku	$0

Total estimated increase: $0.15 → $0.25–0.35 per patient journey. Manageable at POC scale (500 patients ≈ $125–175 vs $75), but unjustified without evidence that current agents are failing due to reasoning quality.
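The arithmetic behind these figures can be checked with a short script. This is only a sketch using the numbers from the table and paragraph above; the function names are illustrative. Summing the per-agent deltas gives roughly $0.28–0.36 per journey, which is in line with the rounded $0.25–0.35 estimate.

```python
# Per-agent cost delta ranges (USD per patient), copied from the table above.
DELTAS = {
    "Intake Agent": (0.03, 0.05),
    "Clinical Context Agent": (0.02, 0.04),
    "Match Agent": (0.08, 0.12),
    "Explanation Agent": (0.00, 0.00),
}
BASELINE_PER_JOURNEY = 0.15  # current estimated cost per patient journey
POC_PATIENTS = 500


def journey_cost_range(baseline: float, deltas: dict) -> tuple:
    """Baseline plus the summed low and high deltas, rounded to cents."""
    low = baseline + sum(lo for lo, _ in deltas.values())
    high = baseline + sum(hi for _, hi in deltas.values())
    return round(low, 2), round(high, 2)


def at_scale(cost_per_journey: float, patients: int) -> float:
    """Total cost for a given number of patient journeys."""
    return round(cost_per_journey * patients, 2)


low, high = journey_cost_range(BASELINE_PER_JOURNEY, DELTAS)
print(f"Summed from table: ${low:.2f}-${high:.2f} per journey")
print(f"Baseline at POC scale: ${at_scale(BASELINE_PER_JOURNEY, POC_PATIENTS):.0f}")
print(f"Rounded estimate at POC scale: "
      f"${at_scale(0.25, POC_PATIENTS):.0f}-${at_scale(0.35, POC_PATIENTS):.0f}")
```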

3. Latency Impact on Conversation UX

Curaway is a conversation-first interface. Patients watch streaming responses in real time. Assuming roughly 500ms per completion, a 5-step ReAct loop on Claude Sonnet means 5 × 500ms = 2.5 seconds minimum before the agent produces a final response, not counting tool execution time (Neo4j queries, Qdrant searches, FHIR lookups).

Target latencies: ~300ms TTFT (simple turns), 500–800ms (clinical reasoning). ReAct pushes clinical reasoning to 2–5 seconds without mitigation.
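The latency estimate is simple multiplication, sketched below. The 500ms per-call figure and the 800ms target come from the text above; the 200ms tool time is an assumed value used only to show how tool execution would add to the total.

```python
CLINICAL_TARGET_MS = 800  # upper end of the clinical reasoning target above


def react_latency_ms(steps: int, per_call_ms: float, tool_ms_per_step: float = 0.0) -> float:
    """Lower-bound latency for a sequential ReAct loop with no streaming or caching."""
    return steps * (per_call_ms + tool_ms_per_step)


llm_only = react_latency_ms(steps=5, per_call_ms=500)
# tool_ms_per_step=200 is an assumption for illustration, not a measured value.
with_tools = react_latency_ms(steps=5, per_call_ms=500, tool_ms_per_step=200)

print(f"LLM calls only: {llm_only:.0f} ms")
print(f"With assumed tool time: {with_tools:.0f} ms")
print(f"Over clinical target: {llm_only > CLINICAL_TARGET_MS}")
```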

Mitigation available (when implemented): Stream intermediate thoughts to the UI as trust-building transparency ("Reviewing your MRI report... Checking procedure requirements..."). SSE infrastructure already supports this.
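One way this mitigation could look is sketched below: each intermediate thought is emitted as its own Server-Sent Event ahead of the final answer. The event names `thought` and `answer`, the payload shape, and both function names are assumptions made for this example; the existing SSE infrastructure may use different conventions.

```python
import json
from typing import Iterable, Iterator


def sse_event(event: str, payload: dict) -> str:
    """Format one Server-Sent Event: an event name, a JSON data line, a blank line."""
    return f"event: {event}\ndata: {json.dumps(payload)}\n\n"


def stream_thoughts(thoughts: Iterable[str], final_answer: str) -> Iterator[str]:
    """Yield each intermediate thought as it becomes available, then the answer."""
    for step, text in enumerate(thoughts, start=1):
        yield sse_event("thought", {"step": step, "text": text})
    yield sse_event("answer", {"text": final_answer})


events = list(stream_thoughts(
    ["Reviewing your MRI report...", "Checking procedure requirements..."],
    "done",
))
for chunk in events:
    print(chunk, end="")
```

Because `json.dumps` escapes newlines inside strings, each payload stays on a single `data:` line, which keeps the events well-formed.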

4. Determinism and Auditability

The architecture doc explicitly states: "POC uses deterministic orchestration via LangGraph for auditability and trust." ReAct introduces non-determinism within agents — the model decides when to think vs act and which tool to call. Two identical patient profiles could produce different reasoning paths.

For healthcare, reproducible explanations matter. Our existing guardrails (input classifier, output validator, externalized YAML rules) constrain outputs but not reasoning paths.

Mitigation available (when implemented): Use "dense thought" mode (forced Thought→Action→Observation alternation) rather than "sparse thought" mode (model decides when to think). This preserves deterministic structure while adding reasoning.
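To make the "dense thought" constraint concrete, the check below accepts a trajectory only if it follows strict Thought→Action→Observation cycles. It is a hypothetical validator written for this ADR, with an assumed event representation of `(kind, text)` tuples; it is not part of LangGraph or the existing guardrails.

```python
from typing import List, Tuple

EXPECTED_CYCLE = ("thought", "action", "observation")


def is_dense_trajectory(events: List[Tuple[str, str]]) -> bool:
    """True if event kinds strictly cycle thought, action, observation.

    A trajectory may end on an action (the final answer) or an observation,
    but not on a dangling thought.
    """
    if not events:
        return False
    for index, (kind, _text) in enumerate(events):
        if kind != EXPECTED_CYCLE[index % 3]:
            return False
    return events[-1][0] != "thought"


dense = [
    ("thought", "Need the report findings."),
    ("action", "search_report"),
    ("observation", "findings text"),
    ("thought", "Enough to answer."),
    ("action", "finish"),
]
sparse = [
    ("action", "search_report"),      # acts without a preceding thought
    ("observation", "findings text"),
    ("action", "finish"),
]

print(is_dense_trajectory(dense), is_dense_trajectory(sparse))
```

A check like this keeps the shape of every reasoning path fixed and auditable, even though the content of each thought still varies between runs.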

5. The Paper Supports Waiting

ReAct's strongest results came from fine-tuning on 3,000 trajectories, not few-shot prompting alone. With prompting only, ReAct roughly matches simpler methods and on HotpotQA slightly trails chain-of-thought (Table 1: 27.4 vs 29.4 EM). We won't have training data until real patients use the platform.

6. Budget and Session Priority

POC budget is $1,000 total (~$648 remaining). Engineering time to retrofit agents with ReAct loops, test edge cases (stuck loops, partial state failures), update guardrails, and wire streaming intermediate thoughts to the UI is estimated at 3–4 full Claude Code sessions. Those sessions have higher ROI spent on the TKR demo, Qdrant re-seeding, and the UI/UX revision already prioritized.

Where ReAct Should Be Applied (Post-Demo)

When Langfuse traces reveal specific failure categories, apply ReAct selectively — not uniformly across all agents.

Agent	Apply ReAct?	Condition
Intake Agent (document reasoning)	Yes — first candidate	When traces show hallucinated clinical details from uploaded PDFs
Match Agent (edge-case reranking)	Yes — second candidate	When traces show poor matches attributable to reasoning errors, not data gaps
Clinical Context Agent	Maybe	Only if ICD extraction accuracy is below threshold after Sonnet upgrade
Explanation Agent	No	Text generation, not retrieval/reasoning — single-shot is sufficient
Simple extraction tasks	No	ICD coding, document parsing, OCR — classification tasks where ReAct overhead isn't justified

Implementation approach when triggered:

1. Collect baseline: Run 20–30 patient journeys, export Langfuse traces
2. Categorize failures using the paper's taxonomy: hallucination, reasoning error, search error, label ambiguity
3. Retrofit one agent (likely Intake Agent) with ReAct loop using LangGraph's native multi-step support
4. Run in shadow mode alongside single-shot via Flagsmith flag
5. Compare clinical accuracy in Langfuse traces over 2–4 weeks
6. Enable via Flagsmith if quality improvement is measurable
7. Cap max steps (7 for reasoning tasks, 5 for retrieval) to prevent runaway loops
8. Persist reasoning traces in Langfuse (not in LangGraph state) to avoid context bloat; pass only a structured reasoning_summary (3–5 sentences) to downstream agents
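The shadow-mode step above could be wired roughly as follows. This is a sketch under stated assumptions: `run_with_shadow` and `ShadowResult` are hypothetical names, the two agents are passed in as plain callables, and `shadow_enabled` is an ordinary boolean standing in for whatever the Flagsmith flag lookup returns. No Flagsmith, LangGraph, or Langfuse API is called here.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Agent = Callable[[str], str]


@dataclass
class ShadowResult:
    served: str              # what the patient actually sees (single-shot output)
    shadow: Optional[str]    # ReAct output, recorded for comparison only
    agree: Optional[bool]    # None when the shadow run was skipped or failed


def run_with_shadow(document: str, single_shot: Agent, react_agent: Agent,
                    shadow_enabled: bool) -> ShadowResult:
    """Always serve the single-shot result; optionally run ReAct alongside it."""
    served = single_shot(document)
    if not shadow_enabled:
        return ShadowResult(served, None, None)
    try:
        shadow = react_agent(document)
    except Exception:
        # A shadow failure must never affect the served response.
        return ShadowResult(served, None, None)
    return ShadowResult(served, shadow, shadow == served)


# Stub agents for illustration only.
result = run_with_shadow("report text", str.upper, str.upper, shadow_enabled=True)
print(result)
```

Exact string equality is a deliberately crude comparison; in practice the agreement check would be replaced by whatever clinical accuracy measure is used when reviewing the traces.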

Trigger Point for Revisiting

Revisit this decision when all three conditions are met:

[ ] TKR demo completed and at least 20 patient journeys traced in Langfuse
[ ] Failure analysis shows >10% hallucination rate in clinical extraction OR >15% poor-quality matches attributable to agent reasoning errors (not data gaps)
[ ] Sonnet upgrade for Clinical Context Agent is already deployed (ADR pending) — establishes whether model quality alone resolves the issue before adding architectural complexity
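The three conditions can be expressed as a single predicate, sketched below. The thresholds are the ones listed above; the `TraceReview` structure and its field names are hypothetical, and the rates would have to be filled in by hand from a review of Langfuse traces.

```python
from dataclasses import dataclass


@dataclass
class TraceReview:
    demo_completed: bool
    journeys_traced: int
    hallucination_rate: float       # fraction of clinical extractions with hallucinations
    poor_match_rate: float          # fraction of poor matches caused by reasoning errors
    sonnet_upgrade_deployed: bool


def should_revisit_react(review: TraceReview) -> bool:
    """All three checklist conditions must hold before reopening this ADR."""
    enough_data = review.demo_completed and review.journeys_traced >= 20
    failing = review.hallucination_rate > 0.10 or review.poor_match_rate > 0.15
    return enough_data and failing and review.sonnet_upgrade_deployed


# Illustrative values only, not measured results.
example = TraceReview(
    demo_completed=True,
    journeys_traced=24,
    hallucination_rate=0.12,
    poor_match_rate=0.05,
    sonnet_upgrade_deployed=True,
)
print(should_revisit_react(example))
```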

Alternatives Considered

Alternative	Why Not Now
Full ReAct across all agents	Cost/latency multiplier unjustified without failure data
CoT-SC → ReAct fallback (paper's best method)	Requires multiple sampling runs per query — too expensive and slow for real-time conversation
Inner Monologue (IM) style	Paper showed ReAct outperforms IM (71% vs 53% on ALFWorld) — if we add reasoning, do it properly
Chain-of-Thought only	Already implicit in our system prompts — adding explicit CoT without acting doesn't address retrieval failures

References

Yao, S. et al. "ReAct: Synergizing Reasoning and Acting in Language Models." ICLR 2023. arXiv:2210.03629v3.
Internal: config/guardrails.yaml, app/agents/orchestrator.py, Langfuse dashboard