Synchronous Chat Extractor: Steer Document

Feature: Move the chat extractor onto the routing critical path so the orchestrator routes against fresh state, not stale state (Layer 3 of the conversation flow remediation plan)
Version: 1.0
Date: April 2026
Author: Srikanth Donthi (CPO/CTO)
Status: Implemented (synchronous extraction live)
Depends on: Layer 1 (gates_v2, PR #70). Independent of Layer 2: can ship before, after, or in parallel.

1. Problem Statement

The chat extractor (app/services/chat_extractor.py, Session 30) was introduced to catch structured data the main conversation LLM missed — medications, allergies, demographics, preferences. It runs as a dedicated Haiku call per turn.

In Session 31, when we optimised end-to-end latency from 13.2s to 7.9s, the extractor was moved into the deferred / async lane via the enable_deferred_extraction feature flag. The extractor now runs after the response has been streamed to the patient and after the orchestrator has already decided how to route the next message.

The patient says "I take metformin and lisinopril" → response goes out → the extractor catches "metformin, lisinopril" 2-3 seconds later → but the next turn's routing decision was already made against the stale state where medications were unknown. The agent re-asks for medications. The patient gets frustrated. Eventually they give up before reaching matching.

This is a state freshness bug, not an extractor bug. The extractor itself works fine. It just runs at the wrong time.
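The ordering problem can be reproduced with a toy model. The sketch below is illustrative only: `CaseState`, `extract_medications`, `route`, and the two `turn_*` functions are hypothetical stand-ins, and the keyword match replaces what is really a Haiku call. It shows that the same extractor yields a different routing decision depending solely on whether it runs before or after routing.

```python
from dataclasses import dataclass, field


@dataclass
class CaseState:
    # Hypothetical stand-in for the case record the orchestrator reads.
    medications: list[str] = field(default_factory=list)


def extract_medications(message: str) -> list[str]:
    # Toy extractor: the real one is an LLM call, not a keyword match.
    known = ("metformin", "lisinopril")
    return [drug for drug in known if drug in message.lower()]


def route(state: CaseState) -> str:
    # Toy routing rule: keep asking while no medications are on record.
    return "ask_medications" if not state.medications else "continue_intake"


def turn_deferred(state: CaseState, message: str) -> str:
    decision = route(state)                             # routes on stale state
    state.medications += extract_medications(message)   # extractor runs afterwards
    return decision


def turn_sync(state: CaseState, message: str) -> str:
    state.medications += extract_medications(message)   # extractor runs first
    return route(state)                                 # routes on fresh state
```

With the message "I take metformin and lisinopril", `turn_deferred` still returns `ask_medications`, while `turn_sync` returns `continue_intake`.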

2. Design Decision: Sync on the Routing Path, Async for the Heavy EHR Rebuild

Decision: Run the chat extractor synchronously before the orchestrator's routing decision. Keep the EHR rebuild (which is heavier and doesn't affect routing) in the deferred lane.

The chat extractor is one Haiku call (~400ms p50, ~700ms p99). Run serially on the routing critical path it would cost ~400ms per turn; run in parallel with the input classifier (see Section 4), the p50 cost is close to zero. Either way the cost is worth it, because routing on stale state is the single most common cause of the "agent re-asks" loop documented in the Layer 1 steer.

Rationale:

The extractor is fast (~400ms), idempotent, and pure (no DB writes besides the case metadata it produces). It's safe to run on the critical path.
The EHR rebuild that consumes the extractor's output is heavier (it rewrites ehr_snapshot, recomputes risks, updates completeness). The rebuild can stay deferred — we just need the extractor's output available to the orchestrator before it routes, not the full rebuild.
Layer 1's intake_answer_count and meds_confirmed_none flags are populated from the extractor. If the extractor runs after routing, those flags are stale and gates_v2 can't fire.

Rejected alternatives:

Re-route after the extractor finishes (multi-turn delay). Considered at the design level only: it adds one turn of latency from the patient's perspective. Not acceptable.
Make the orchestrator read the case state again right before routing. Doesn't help — the extractor hasn't written to the case yet, so re-reading gives the same stale data.
Run the extractor in parallel with the conversation LLM. Possible but complicated — the extractor's output needs to feed the routing decision, which currently happens before the conversation LLM. Net no improvement on the critical path.
Train the conversation LLM to also extract structured data (single call instead of two). Tempting. But every time we tried this, the conversation LLM either missed extractions or polluted the patient-facing reply with structured-data noise. The extractor exists precisely because that approach failed.

3. Pipeline Order (After This Change)

For a single chat turn with enable_deferred_extraction=false (Layer 3 behaviour):

1. Patient message arrives at /cases/{id}/chat
2. Input classifier (Haiku, parallel with step 3): categorises intent
3. Chat extractor (Haiku, on critical path): extracts meds, allergies, demographics, preferences. Writes to case.extra_metadata immediately. NEW POSITION.
4. Orchestrator routing: reads fresh case state including the extractor's writes. gates_v2 now sees the latest meds/allergies.
5. Conversation LLM: generates the patient-facing reply
6. Response streamed to patient
7. Deferred (post-response, async):
   a. EHR rebuild
   b. Risk re-assessment
   c. Decision recorder write

Steps 2 and 3 still run in parallel via asyncio.gather where safe: the extractor doesn't depend on the classifier and vice versa. Step 4 (routing) waits for both, because it reads the extractor's writes.
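A minimal sketch of that ordering, assuming an asyncio-based handler. The function names (`classify_input`, `extract_chat`, `route`, `handle_turn`) and the dict-shaped case are hypothetical, and the sleeps stand in for the two Haiku calls; only `asyncio.gather` and the `extra_metadata` key come from this document.

```python
import asyncio


async def classify_input(message: str) -> str:
    await asyncio.sleep(0.01)  # stands in for the classifier Haiku call
    return "intake_answer"


async def extract_chat(message: str, case: dict) -> None:
    await asyncio.sleep(0.01)  # stands in for the extractor Haiku call
    if "metformin" in message.lower():
        # Write immediately so routing sees it within the same turn.
        case.setdefault("extra_metadata", {})["medications"] = ["metformin"]


def route(case: dict, intent: str) -> str:
    # Toy routing rule; `intent` is accepted but unused in this sketch.
    meds = case.get("extra_metadata", {}).get("medications")
    return "continue_intake" if meds else "ask_medications"


async def handle_turn(message: str, case: dict) -> str:
    # Classifier and extractor run concurrently; routing waits for both.
    intent, _ = await asyncio.gather(
        classify_input(message),
        extract_chat(message, case),
    )
    decision = route(case, intent)
    # The conversation LLM call and the deferred work (EHR rebuild, risk
    # re-assessment, decision recorder write) would follow from here.
    return decision
```

Calling `asyncio.run(handle_turn("I take metformin", {}))` routes on the freshly written medications, because `gather` only returns once the extractor has finished writing.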

4. Latency Budget

Measured against the current (Session 31) p50 of 7.9s:

Stage	Before (deferred)	After (sync)
Auth + state load	0.3s	0.3s
Input classifier	0.4s (parallel)	0.4s (parallel)
Chat extractor	0s (deferred)	0.4s (parallel)
Orchestrator routing	0.1s	0.1s
Conversation LLM	4.5s	4.5s
Stream + DB writes	0.2s	0.2s
Total	~7.9s p50	~7.9s p50 (no change because extractor runs in parallel)

Worst case (the extractor finishes after the classifier, so the parallel stage waits on the slower call): +200-400ms p99. Acceptable.
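The "no change at p50, a few hundred milliseconds at p99" claim follows from the parallel stage costing the maximum of its two calls rather than their sum. The sketch below works that arithmetic using the figures quoted in this document (classifier 0.4s, extractor 0.4s p50 and 0.7s p99). Pairing the classifier's p50 with the extractor's p99 is a simplifying assumption, and `added_latency` is a hypothetical helper.

```python
def added_latency(classifier_s: float, extractor_s: float) -> float:
    """Extra critical-path time from running the extractor alongside the classifier."""
    # A parallel stage takes as long as its slowest member.
    parallel_stage = max(classifier_s, extractor_s)
    # Before the change, this stage cost only the classifier's time.
    return round(parallel_stage - classifier_s, 3)


p50_delta = added_latency(classifier_s=0.4, extractor_s=0.4)  # typical turn
p99_delta = added_latency(classifier_s=0.4, extractor_s=0.7)  # slow extractor call
```

Here `p50_delta` is 0.0 and `p99_delta` is 0.3, which sits inside the +200-400ms range above.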

If Layer 3 ever shows measurable p99 regression, the fallback is: conditional sync — only run the extractor sync when the message contains potential structured data signals (numbers, drug-name patterns, age phrases). For everyday "what's next?" messages skip the extractor entirely. Defer that optimisation until we measure a real regression.
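If that fallback is ever needed, the gating check could look roughly like the sketch below. Everything here is an assumption: the function name `should_extract_sync`, the regexes, and in particular the short list of drug-name suffixes, which is illustrative and far from exhaustive. A real version would need to be tuned against production transcripts.

```python
import re

# Hypothetical signal patterns; none of these come from the codebase.
_SIGNALS = [
    re.compile(r"\d"),                                        # any number (dose, age, count)
    re.compile(r"\b\w+(?:pril|statin|olol|formin)\b", re.I),  # a few common drug-name suffixes
    re.compile(r"\b(?:years? old|aged?)\b", re.I),            # age phrases
]


def should_extract_sync(message: str) -> bool:
    """Return True when the message may carry structured data worth extracting."""
    return any(pattern.search(message) for pattern in _SIGNALS)
```

Under these patterns, "I take metformin and lisinopril" triggers the extractor and "what's next?" does not. A heuristic like this trades missed extractions (false negatives) for latency, which is one reason to defer it until a regression is actually measured.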

5. Feature Flag

Flag name: chat_extractor_sync
Default: true (we want this on by default; the deferred behaviour is the bug)
Behaviour when disabled: Falls back to Session 31's enable_deferred_extraction path. Both flags can coexist for the rollout window.
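One way the two flags could be reconciled while they coexist is sketched below. The precedence rule (chat_extractor_sync wins whenever it is on) is an assumption, not something this document specifies, and `extraction_mode` and the plain mapping of flag values are hypothetical.

```python
from typing import Mapping


def extraction_mode(flags: Mapping[str, bool]) -> str:
    """Return "sync" or "deferred" for the current turn."""
    # Assumption: chat_extractor_sync takes precedence when both flags are set.
    if flags.get("chat_extractor_sync", True):  # default is true, per this section
        return "sync"
    # Disabled: fall back to the Session 31 deferred path.
    return "deferred"
```

With no flags set, this resolves to `"sync"`; setting `chat_extractor_sync` to false resolves to `"deferred"` regardless of the older flag's value.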

6. Out of Scope (This Layer)

Replacing the conversation LLM with a single-call extractor+responder
Making the EHR rebuild synchronous (it stays deferred — too heavy)
Conditional / message-content-aware extractor invocation
Multi-tenant rate limiting on the extractor
Caching extractor results across turns

7. Success Criteria

Within one week of deploy:

Re-ask rate drops by >=50% for medications and allergies in Langfuse traces (we'll need a small Langfuse query — see Part E in the feature spec)
No measurable p99 latency regression (baseline 9.2s, target <=9.6s)
gates_v2.intake_complete events fire >=2x more frequently within the same case lifecycle, because the meds/allergies flags are now fresh when the gate is evaluated
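The three thresholds can be checked mechanically once the before and after numbers are pulled from traces. The sketch below encodes them as stated above; the function name, its parameters, and the sample inputs in the usage note are hypothetical and are not measured results.

```python
def meets_success_criteria(
    reask_rate_before: float,
    reask_rate_after: float,
    p99_after_s: float,
    gate_events_before: int,
    gate_events_after: int,
) -> bool:
    """Check the three Layer 3 success thresholds against observed values."""
    if reask_rate_before <= 0:
        raise ValueError("baseline re-ask rate must be positive")
    # Relative drop in re-ask rate, e.g. 0.20 -> 0.08 is a 60% drop.
    reask_drop = (reask_rate_before - reask_rate_after) / reask_rate_before
    return (
        reask_drop >= 0.50                               # re-ask rate down by >=50%
        and p99_after_s <= 9.6                           # p99 target (baseline 9.2s)
        and gate_events_after >= 2 * gate_events_before  # intake_complete fires >=2x
    )
```

For example, made-up inputs of `(0.20, 0.08, 9.4, 100, 230)` pass all three checks, while raising the post-deploy re-ask rate to `0.12` (a 40% drop) fails the first.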

8. Rollback

chat_extractor_sync = false in Flagsmith reverts to the deferred path within one cache TTL (60s). No code redeploy required.
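The "within one cache TTL" bound comes from how flag values are cached between reads. The sketch below is a generic TTL cache, not the Flagsmith SDK: `CachedFlag`, the `fetch` callable, and the injectable `clock` are all hypothetical, and the actual caching layer in the service may differ.

```python
import time
from typing import Callable, Optional


class CachedFlag:
    """Caches a boolean flag and re-fetches it once the TTL has elapsed."""

    def __init__(
        self,
        fetch: Callable[[], bool],
        ttl_s: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch      # hypothetical call to the flag provider
        self._ttl_s = ttl_s
        self._clock = clock      # injectable so the TTL can be tested
        self._value: Optional[bool] = None
        self._fetched_at = 0.0

    def get(self) -> bool:
        now = self._clock()
        # Refresh on first use or when the cached value is older than the TTL.
        if self._value is None or now - self._fetched_at >= self._ttl_s:
            self._value = self._fetch()
            self._fetched_at = now
        return self._value
```

A flag flipped to false right after a fetch keeps serving true until 60 seconds have passed, which is the worst-case rollback delay under this model.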