Synchronous Chat Extractor: Steer Document
Feature: Move the chat extractor onto the routing critical path so the
orchestrator routes against fresh state, not stale state (Layer 3 of the
conversation flow remediation plan)
Version: 1.0
Date: April 2026
Author: Srikanth Donthi (CPO/CTO)
Status: Implemented — synchronous extraction live
Depends on: Layer 1 (gates_v2, PR #70). Independent of Layer 2 — can
ship before, after, or in parallel.
1. Problem Statement
The chat extractor (app/services/chat_extractor.py, Session 30) was
introduced to catch structured data the main conversation LLM missed —
medications, allergies, demographics, preferences. It runs as a dedicated
Haiku call per turn.
In Session 31, when we optimised end-to-end latency from 13.2s to 7.9s,
the extractor was moved into the deferred / async lane via the
enable_deferred_extraction feature flag. The extractor now runs after
the response has been streamed to the patient and after the
orchestrator has already decided how to route the next message.
The patient says "I take metformin and lisinopril" → response goes out → the extractor catches "metformin, lisinopril" 2-3 seconds later → but the next turn's routing decision was already made against the stale state where medications were unknown. The agent re-asks for medications. The patient gets frustrated. Eventually they give up before reaching matching.
This is a state freshness bug, not an extractor bug. The extractor itself works fine. It just runs at the wrong time.
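The ordering problem can be reproduced with a toy model. The sketch below is illustrative only: `CaseState`, `extract`, `route`, and `handle_turn` are hypothetical stand-ins, not the real extractor or orchestrator, and the keyword match replaces the Haiku call.

```python
from dataclasses import dataclass, field


@dataclass
class CaseState:
    """Hypothetical stand-in for the case record the orchestrator reads."""
    medications: list[str] = field(default_factory=list)


def extract(message: str, state: CaseState) -> None:
    """Toy extractor: records any known drug name found in the message."""
    for drug in ("metformin", "lisinopril"):
        if drug in message.lower() and drug not in state.medications:
            state.medications.append(drug)


def route(state: CaseState) -> str:
    """Toy router: keeps asking for medications until the state has some."""
    return "ask_medications" if not state.medications else "continue_intake"


def handle_turn(message: str, state: CaseState, deferred: bool) -> str:
    """Return the routing decision for one turn.

    deferred=True  -> route first, extract afterwards (Session 31 ordering)
    deferred=False -> extract first, then route (ordering proposed here)
    """
    if deferred:
        decision = route(state)   # reads state before the extractor has run
        extract(message, state)   # the write lands too late for this decision
        return decision
    extract(message, state)
    return route(state)
```

With the same message and an empty case, only the `deferred` argument changes the outcome: the deferred ordering still returns the re-ask decision, while the synchronous ordering moves on.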
2. Design Decision: Sync on the Routing Path, Async for the Heavy EHR Rebuild
Decision: Run the chat extractor synchronously before the orchestrator's routing decision. Keep the EHR rebuild (which is heavier and doesn't affect routing) in the deferred lane.
The chat extractor is one Haiku call (~400ms p50, ~700ms p99). Adding it to the routing critical path costs up to ~400ms per turn at p50, and less wherever it overlaps the input classifier (see the Latency Budget section). This is worth it because routing on stale state is the single most common cause of the "agent re-asks" loop documented in the Layer 1 steer.
Rationale:
- The extractor is fast (~400ms), idempotent, and pure (no DB writes besides the case metadata it produces). It's safe to run on the critical path.
- The EHR rebuild that consumes the extractor's output is heavier (it rewrites ehr_snapshot, recomputes risks, updates completeness). The rebuild can stay deferred: we just need the extractor's output available to the orchestrator before it routes, not the full rebuild.
- Layer 1's intake_answer_count and meds_confirmed_none flags are populated from the extractor. If the extractor runs after routing, those flags are stale and gates_v2 can't fire.
Rejected alternatives:
- Re-route after the extractor finishes (multi-turn delay). Tried conceptually — adds 1 turn of latency from the patient's perspective. Not acceptable.
- Make the orchestrator read the case state again right before routing. Doesn't help — the extractor hasn't written to the case yet, so re-reading gives the same stale data.
- Run the extractor in parallel with the conversation LLM. Possible but complicated — the extractor's output needs to feed the routing decision, which currently happens before the conversation LLM. Net no improvement on the critical path.
- Train the conversation LLM to also extract structured data (single call instead of two). Tempting. But every time we tried this, the conversation LLM either missed extractions or polluted the patient-facing reply with structured-data noise. The extractor exists precisely because that approach failed.
3. Pipeline Order (After This Change)
For a single chat turn with enable_deferred_extraction=false (Layer 3
behaviour):
1. Patient message arrives at /cases/{id}/chat
2. Input classifier (Haiku, parallel with #3): categorises intent
3. Chat extractor (Haiku, on critical path): extracts meds, allergies, demographics, preferences. Writes to case.extra_metadata immediately. NEW POSITION.
4. Orchestrator routing: reads fresh case state including the extractor's writes. gates_v2 now sees the latest meds/allergies.
5. Conversation LLM: generates the patient-facing reply
6. Response streamed to patient
7. Deferred (post-response, async):
   - EHR rebuild
   - Risk re-assessment
   - Decision recorder write
Steps 2 and 3 run in parallel via asyncio.gather: the extractor doesn't
depend on the classifier and vice versa. Step 4 (routing) starts only
after the extractor's writes have landed, because it reads them.
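A minimal sketch of this ordering with asyncio follows. Every function here is a hypothetical placeholder (short sleeps stand in for the Haiku calls and the EHR rebuild), and the case is modelled as a plain dict rather than the real case model.

```python
import asyncio


async def classify_input(message: str) -> str:
    """Hypothetical stand-in for the Haiku input classifier."""
    await asyncio.sleep(0.01)
    return "intake_answer"


async def extract_structured(message: str, case: dict) -> None:
    """Hypothetical stand-in for the chat extractor; writes to the case dict."""
    await asyncio.sleep(0.01)
    if "metformin" in message.lower():
        case.setdefault("extra_metadata", {})["medications"] = ["metformin"]


def route(case: dict, intent: str) -> str:
    """Toy routing rule: re-ask only when no medications are on the case."""
    meds = case.get("extra_metadata", {}).get("medications")
    return "continue_intake" if meds else "ask_medications"


async def deferred_work(case: dict) -> None:
    """Placeholder for EHR rebuild, risk re-assessment, decision recorder write."""
    await asyncio.sleep(0.01)
    case["ehr_rebuilt"] = True


async def handle_turn(message: str, case: dict) -> tuple[str, asyncio.Task]:
    # Classifier and extractor are independent, so they run together.
    intent, _ = await asyncio.gather(
        classify_input(message),
        extract_structured(message, case),
    )
    # Routing runs only after the extractor's write has landed.
    decision = route(case, intent)
    # Reply generation and streaming are omitted from this sketch.
    # Heavy work is scheduled but not awaited on the critical path.
    background = asyncio.create_task(deferred_work(case))
    return decision, background


async def demo() -> tuple[str, bool]:
    case: dict = {}
    decision, background = await handle_turn("I take metformin", case)
    await background  # only the demo waits; a request handler would not
    return decision, case["ehr_rebuilt"]
```

Returning the task from `handle_turn` keeps a reference to it, which matters because the event loop holds only weak references to tasks created with `asyncio.create_task`.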
4. Latency Budget
Measured against the current (Session 31) p50 of 7.9s:
| Stage | Before (deferred) | After (sync) |
|---|---|---|
| Auth + state load | 0.3s | 0.3s |
| Input classifier | 0.4s (parallel) | 0.4s (parallel) |
| Chat extractor | 0s (deferred) | 0.4s (parallel) |
| Orchestrator routing | 0.1s | 0.1s |
| Conversation LLM | 4.5s | 4.5s |
| Stream + DB writes | 0.2s | 0.2s |
| Total | ~7.9s p50 | ~7.9s p50 (no change because extractor runs in parallel) |
Worst case (the extractor finishes well after the classifier, so the parallel overlap no longer hides it): +200-400ms at p99. Acceptable.
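The arithmetic behind both figures is that a parallel stage lasts as long as its slowest member. The sketch below uses the classifier time from the table and the extractor p50/p99 from section 2; holding the classifier at 0.4s in the tail case is an assumption, since the document gives only one classifier number.

```python
CLASSIFIER_S = 0.4      # input classifier, from the latency table
EXTRACTOR_P50_S = 0.4   # extractor p50, from section 2
EXTRACTOR_P99_S = 0.7   # extractor p99, from section 2


def added_latency(classifier_s: float, extractor_s: float) -> float:
    """Extra critical-path time when the extractor runs beside the classifier.

    The parallel stage takes max(classifier, extractor), so the extractor
    adds only the part of its duration that exceeds the classifier's.
    """
    return max(classifier_s, extractor_s) - classifier_s


p50_cost = added_latency(CLASSIFIER_S, EXTRACTOR_P50_S)  # fully hidden
p99_cost = added_latency(CLASSIFIER_S, EXTRACTOR_P99_S)  # roughly 0.3s exposed
```

Under these assumptions the p50 cost is zero and the p99 cost is about 0.3s, which falls inside the +200-400ms range quoted above.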
If Layer 3 ever shows measurable p99 regression, the fallback is: conditional sync — only run the extractor sync when the message contains potential structured data signals (numbers, drug-name patterns, age phrases). For everyday "what's next?" messages skip the extractor entirely. Defer that optimisation until we measure a real regression.
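If that fallback is ever needed, the pre-check could be as small as a few regular expressions. The patterns below are hypothetical: the document names the signal categories (numbers, drug-name patterns, age phrases) but not the actual rules, and a suffix list this short would miss many drugs.

```python
import re

# Hypothetical signal patterns, one per category named in the text.
_SIGNALS = [
    re.compile(r"\d"),                                        # any number
    re.compile(r"\b\w+(?:formin|pril|statin|olol)\b", re.I),  # common drug suffixes
    re.compile(r"\b(?:years? old|aged?)\b", re.I),            # age phrases
]


def needs_sync_extraction(message: str) -> bool:
    """Return True when the message may carry structured data worth extracting."""
    return any(pattern.search(message) for pattern in _SIGNALS)
```

A false negative here reintroduces the stale-state bug for that turn, so any real rule set would need to err heavily toward running the extractor.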
5. Feature Flag
Flag name: chat_extractor_sync
Default: true (we want this on by default; the deferred behaviour
is the bug)
Behaviour when disabled: Falls back to Session 31's
enable_deferred_extraction path. Both flags can coexist for the
rollout window.
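One way the flag check might look is sketched below. The `extraction_mode` helper and the dict of flag values are hypothetical, and the precedence (chat_extractor_sync wins whenever it is true) is an assumption; the document does not say how the two flags interact when both are set.

```python
def extraction_mode(flags: dict[str, bool]) -> str:
    """Return "sync" or "deferred" for this turn.

    chat_extractor_sync is treated as True when absent, matching the
    default stated above. Assumption: it takes precedence over
    enable_deferred_extraction while both flags exist.
    """
    if flags.get("chat_extractor_sync", True):
        return "sync"
    # Flag disabled: fall back to the Session 31 deferred path.
    return "deferred"
```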
6. Out of Scope (This Layer)
- Replacing the conversation LLM with a single-call extractor+responder
- Making the EHR rebuild synchronous (it stays deferred — too heavy)
- Conditional / message-content-aware extractor invocation
- Multi-tenant rate limiting on the extractor
- Caching extractor results across turns
7. Success Criteria
Within one week of deploy:
- Re-ask rate drops by >=50% for medications and allergies in Langfuse traces (we'll need a small Langfuse query — see Part E in the feature spec)
- No measurable p99 latency regression (baseline 9.2s, target <=9.6s)
- gates_v2.intake_complete events fire >=2x more frequently within the same case lifecycle, because the meds/allergies flags are now fresh when the gate is evaluated
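The three thresholds can be encoded as a single check once the measurements exist. This is a sketch: the function and parameter names are hypothetical, and how re-ask rate and gate events per case are actually measured is left to the Langfuse query mentioned above.

```python
def meets_success_criteria(
    reask_rate_before: float,
    reask_rate_after: float,
    p99_after_s: float,
    gate_events_per_case_before: float,
    gate_events_per_case_after: float,
) -> bool:
    """Check the three criteria: re-ask drop, p99 ceiling, gate-event lift."""
    if reask_rate_before <= 0:
        return False  # no baseline, so the drop cannot be computed
    reask_drop = (reask_rate_before - reask_rate_after) / reask_rate_before
    return (
        reask_drop >= 0.5                      # re-ask rate drops by >=50%
        and p99_after_s <= 9.6                 # target p99 ceiling
        and gate_events_per_case_after >= 2 * gate_events_per_case_before
    )
```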
8. Rollback
chat_extractor_sync = false in Flagsmith reverts to the deferred
path within one cache TTL (60s). No code redeploy required.
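The "within one cache TTL" bound follows from how a time-based flag cache behaves. The class below is a generic sketch, not Flagsmith's SDK: `CachedFlag`, its `fetch` callable, and the injectable `clock` are all hypothetical, with only the 60s TTL taken from the text.

```python
import time
from typing import Callable, Optional


class CachedFlag:
    """Caches a remote boolean flag and refetches it once the TTL expires."""

    def __init__(
        self,
        fetch: Callable[[], bool],
        ttl_s: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl_s = ttl_s
        self._clock = clock
        self._value: Optional[bool] = None
        self._fetched_at = 0.0

    def get(self) -> bool:
        now = self._clock()
        # Refetch on first use or when the cached value is older than the TTL.
        if self._value is None or now - self._fetched_at >= self._ttl_s:
            self._value = self._fetch()
            self._fetched_at = now
        return self._value
```

A process that read the flag just before it was flipped keeps serving the old value until its cached copy ages out, so the worst-case delay is one TTL per process.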