
Orchestrator Planner — Steer Document

Feature: Replace the orchestrator's hand-coded phase machine with an LLM-driven planner (Layer 2 of the conversation flow remediation plan)
Version: 1.0
Date: April 2026
Author: Srikanth Donthi (CPO/CTO)
Status: Design - Not Yet Approved for Implementation
Depends on: gates_v2 (Layer 1) deployed and observed for at least one full session of real traffic


1. Problem Statement

case_orchestrator.handle_message is a long branching function that gates each turn on a flag soup: procedure_identified, intake_complete, records_requested, quick_questions_asked, ehr_constructed, medications_asked, min_info_for_matching, matching_complete, providers_selected, consent_given, forwarded. Several branches early-return after a single LLM call and flip flags as side effects, so the next turn often lands in a completely different branch than the one the previous turn finished in.

A natural follow-up patient utterance ("here's another doc", "I take metformin", "what next?") frequently routes to the wrong sub-handler. The agent's reply looks fine in isolation but the state machine doesn't move forward, so matching is never triggered. Layer 1 (gates_v2, PR #70) loosened the gates to mask the symptom, but the underlying brittleness remains: every new flag added compounds the routing surface area.

This is fundamentally a control flow problem, not a prompt problem. It cannot be fully solved by tweaking thresholds — at some point the routing logic itself has to change.


2. Design Decision: One Planner Call Per Turn

Decision: Replace the if/elif tree in handle_message with a single LLM call that, given the full case state, picks the next action. The orchestrator becomes a thin dispatcher: read state → call planner → run the chosen action handler → record the decision → return.
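The thin-dispatcher flow described above can be sketched as follows. This is a minimal illustration, not the actual orchestrator: the `Dispatcher` and `Decision` classes and the injected `read_state`, `plan`, and `handlers` collaborators are hypothetical names, and the sketch is synchronous for brevity even though the real `handle_message` may be async.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Decision:
    """Hypothetical container for what the planner returns."""
    next_action: str
    reason: str


@dataclass
class Dispatcher:
    # All collaborators are injected stand-ins, not real project functions.
    read_state: Callable[[str], dict]
    plan: Callable[[dict], Decision]
    handlers: dict[str, Callable[[dict, str], str]]
    decision_log: list = field(default_factory=list)

    def handle_message(self, case_id: str, message: str) -> str:
        state = self.read_state(case_id)                # 1. read state
        decision = self.plan(state)                     # 2. call planner
        handler = self.handlers[decision.next_action]   # 3. look up the chosen handler
        reply = handler(state, message)                 #    and run it
        self.decision_log.append((case_id, decision))   # 4. record the decision
        return reply                                    # 5. return


dispatcher = Dispatcher(
    read_state=lambda case_id: {"case_id": case_id, "procedure_identified": False},
    plan=lambda state: Decision("identify_procedure", "no procedure on file yet"),
    handlers={"identify_procedure": lambda state, msg: "Which procedure are you considering?"},
)
reply = dispatcher.handle_message("case-1", "hi")
```

The point of the shape is that `handle_message` contains no branching on flags: all routing knowledge lives in whatever `plan` returns.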

Rationale:

  • One LLM call has the whole picture. The current branches each see only the slice of state they were written to look at.
  • The planner is debuggable — Langfuse already records every LLM call. We get a trace of "given this state, the planner picked X because Y" for every turn. The current branching logic is invisible to Langfuse.
  • Adding a new action (e.g., "request_clarification", "ask_for_imaging") becomes a one-line addition to the planner's enum + a new handler. No new branch in the orchestrator, no flag soup.
  • The planner is one Haiku call (~$0.001/turn). At 1000 turns/day that is $1/day. Trivial cost.

Rejected alternatives:

  • LangGraph state machine. Already considered in Session 11. The problem with LangGraph for this layer is that the state graph itself becomes the new flag soup. We'd be writing nodes and edges instead of if/elif branches. Same complexity, different syntax. LangGraph wins when the graph structure is mostly stable and the work inside each node is heavy. Our case is the opposite: light work per node, many possible transitions, frequent reshuffling.
  • Rule-based router with priority order. Closer to the current design but with an explicit ordered list of rules. Cleaner than the if/elif tree but still requires hand-tuning every time a new conversation pattern emerges.
  • Reinforcement-learning router. Gold-plated. Need labelled trajectories first. Defer to post-Series A.

3. The Planner Contract

3.1 Inputs

The planner receives a structured snapshot of the case state. It does NOT see raw prior messages — those go through the existing _llm_generate sub-handlers. The planner is for routing, not for generating patient-facing text. Inputs:

from typing import TypedDict

class PlannerInput(TypedDict):
    case_id: str
    procedure_identified: bool
    procedure_name: str | None
    intake_complete: bool
    ehr_completeness: float              # 0.0 to 1.0
    has_documents: bool
    documents_pending_processing: bool
    blocking_issues: list[str]           # from doc validator + risk_assessor
    missing_critical_info: list[str]     # demographics, meds, allergies, etc.
    matching_complete: bool
    providers_selected: bool
    consent_given: bool
    forwarded: bool
    last_user_message_summary: str       # 1-line summary, NOT raw text
    last_user_intent: str | None         # from input classifier
    turn_number: int

Building this dict is mechanical — it's already what patient_state.py produces. Layer 2 just plumbs it into the planner call.

3.2 Outputs

from typing import Literal, TypedDict

class PlannerOutput(TypedDict):
    next_action: Literal[
        "identify_procedure",
        "request_records",
        "process_uploaded_documents",
        "collect_intake_info",
        "advance_to_matching",
        "show_matches",
        "request_provider_selection",
        "request_consent",
        "forward_records",
        "answer_question",
        "handle_blocking_issue",
        "celebrate_journey_complete",
    ]
    reason: str                     # 1-2 sentence rationale (logged, not shown)
    missing_for_advance: list[str]  # what's still needed if not advancing
    confidence: float               # 0.0 to 1.0

3.3 Action Handlers

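Every action in the enum maps to exactly one existing handler or code path in case_orchestrator.py (two actions share _handle_matching, and a few map to an existing LLM call or hardcoded block rather than a named function). The current code is reused as-is; only the dispatching logic changes: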

Action                       Existing handler
---------------------------  ----------------------------------------------
identify_procedure           _handle_procedure_identification
request_records              LLM call with records_first phase
process_uploaded_documents   _handle_attachment_response
collect_intake_info          _handle_intake
advance_to_matching          _handle_matching
show_matches                 _handle_matching (returns existing matches)
request_provider_selection   hardcoded prompt (existing block)
request_consent              _handle_consent
forward_records              _handle_forwarding
answer_question              LLM call with general phase
handle_blocking_issue        LLM call with phase prompt that names the issue
celebrate_journey_complete   hardcoded final message (existing block)

No new handlers in Layer 2. Adding new handlers is strictly out of scope — Layer 2 is about routing, not about new capabilities.
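One way to express the mapping above is a plain dispatch table keyed by action name, with a startup check that the table and the enum stay in sync. The sketch below is illustrative only: `_stub` is a hypothetical placeholder standing in for the real handlers, whose signatures are not specified in this document.

```python
from typing import Callable, Literal, get_args

Action = Literal[
    "identify_procedure", "request_records", "process_uploaded_documents",
    "collect_intake_info", "advance_to_matching", "show_matches",
    "request_provider_selection", "request_consent", "forward_records",
    "answer_question", "handle_blocking_issue", "celebrate_journey_complete",
]


def _stub(name: str) -> Callable[[dict], str]:
    """Hypothetical placeholder for an existing handler or code block."""
    def handler(state: dict) -> str:
        return f"ran {name}"
    return handler


# Two actions deliberately share one handler, as in the table above.
_handle_matching = _stub("_handle_matching")

HANDLERS: dict[str, Callable[[dict], str]] = {
    "identify_procedure": _stub("_handle_procedure_identification"),
    "request_records": _stub("records_first LLM call"),
    "process_uploaded_documents": _stub("_handle_attachment_response"),
    "collect_intake_info": _stub("_handle_intake"),
    "advance_to_matching": _handle_matching,
    "show_matches": _handle_matching,
    "request_provider_selection": _stub("provider selection prompt"),
    "request_consent": _stub("_handle_consent"),
    "forward_records": _stub("_handle_forwarding"),
    "answer_question": _stub("general LLM call"),
    "handle_blocking_issue": _stub("blocking-issue LLM call"),
    "celebrate_journey_complete": _stub("final message"),
}

# Fail fast at import time if the enum and the table ever drift apart.
missing = set(get_args(Action)) - set(HANDLERS)
extra = set(HANDLERS) - set(get_args(Action))
assert not missing and not extra, f"missing={missing}, extra={extra}"
```

With this shape, adding an action later means one new `Literal` member plus one new table entry, and the drift check catches the case where only one of the two was updated.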

3.4 Fallback

If the planner LLM call fails (for example a timeout or an API error), returns an unparseable response, or returns an action not in the enum, the orchestrator falls back to the existing if/elif logic. The legacy code stays in place; the planner path is gated behind the planner_v1_enabled feature flag (default false until we observe the planner in shadow mode).
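The fallback rule can be sketched as a validate-or-fall-back wrapper. This assumes the planner returns a JSON object as a string, which this document does not specify; `parse_planner_output`, `choose_action`, `call_planner`, and `legacy_route` are hypothetical names used only for illustration.

```python
import json

VALID_ACTIONS = frozenset({
    "identify_procedure", "request_records", "process_uploaded_documents",
    "collect_intake_info", "advance_to_matching", "show_matches",
    "request_provider_selection", "request_consent", "forward_records",
    "answer_question", "handle_blocking_issue", "celebrate_journey_complete",
})


def parse_planner_output(raw):
    """Return the decoded decision dict, or None if it is unusable."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return None  # unparseable response
    action = data.get("next_action") if isinstance(data, dict) else None
    if not isinstance(action, str) or action not in VALID_ACTIONS:
        return None  # missing action, or action not in the enum
    return data


def choose_action(call_planner, legacy_route, state):
    """Return (action, source), where source is 'planner' or 'legacy'."""
    try:
        decision = parse_planner_output(call_planner(state))
    except Exception:
        decision = None  # planner call itself failed
    if decision is None:
        return legacy_route(state), "legacy"
    return decision["next_action"], "planner"
```

Returning the source alongside the action makes it cheap to log how often the fallback fires, which is useful when reviewing planner errors.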


4. Shadow Mode Before Cutover

Before the planner replaces the if/elif tree, it runs in shadow mode:

  1. Existing logic decides the action (control)
  2. Planner runs in parallel and records its decision (shadow)
  3. Both decisions are written either to a new planner_shadow_decisions table or to the existing events table with event_type=planner.shadow_decision
  4. We compare control vs shadow over ~500 real turns
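The shadow steps above could be wired roughly as follows. This is a sketch under stated assumptions: the event field names other than `event_type` are invented for illustration, `legacy_route`, `call_planner`, and `record_event` are hypothetical injected callables, and the planner runs sequentially here for simplicity rather than in parallel.

```python
from datetime import datetime, timezone
from typing import Callable, Optional


def build_shadow_event(case_id: str, turn_number: int, control_action: str,
                       shadow_action: Optional[str], shadow_reason: str) -> dict:
    """Build one row for the shadow comparison (field names are assumptions)."""
    return {
        "event_type": "planner.shadow_decision",
        "case_id": case_id,
        "turn_number": turn_number,
        "control_action": control_action,
        "shadow_action": shadow_action,
        "shadow_reason": shadow_reason,
        "agreed": control_action == shadow_action,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


def run_turn_in_shadow(state: dict,
                       legacy_route: Callable[[dict], str],
                       call_planner: Callable[[dict], dict],
                       record_event: Callable[[dict], None]) -> str:
    control_action = legacy_route(state)  # control decides the turn
    try:
        shadow = call_planner(state)      # shadow only observes
        shadow_action = shadow["next_action"]
        shadow_reason = shadow.get("reason", "")
    except Exception as exc:
        # A planner failure must never affect the patient-facing turn.
        shadow_action, shadow_reason = None, f"planner error: {exc!r}"
    record_event(build_shadow_event(
        state["case_id"], state["turn_number"],
        control_action, shadow_action, shadow_reason,
    ))
    return control_action


events: list[dict] = []
action = run_turn_in_shadow(
    {"case_id": "case-1", "turn_number": 3},
    legacy_route=lambda s: "collect_intake_info",
    call_planner=lambda s: {"next_action": "advance_to_matching",
                            "reason": "intake looks complete"},
    record_event=events.append,
)
```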

If shadow agreement is >90% on the dominant actions and the disagreements look reasonable on a Langfuse spot-check, flip the flag to make the planner authoritative. If not, we tune the planner prompt and re-shadow.
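The agreement check can be computed from the recorded (control, shadow) pairs. The sketch below is one possible reading: since "dominant actions" is not defined precisely here, it reports both the overall rate and a per-control-action breakdown so the most frequent actions can be inspected separately. The function name and input shape are assumptions.

```python
from collections import Counter


def shadow_agreement(decisions: list[tuple[str, str]]) -> tuple[float, dict[str, float]]:
    """decisions is a list of (control_action, shadow_action) pairs.

    Returns (overall agreement rate, agreement rate per control action).
    """
    totals = Counter(control for control, _ in decisions)
    agreed = Counter(control for control, shadow in decisions if control == shadow)
    overall = sum(agreed.values()) / len(decisions) if decisions else 0.0
    per_action = {action: agreed[action] / count for action, count in totals.items()}
    return overall, per_action


sample = [
    ("collect_intake_info", "collect_intake_info"),
    ("collect_intake_info", "collect_intake_info"),
    ("collect_intake_info", "answer_question"),
    ("advance_to_matching", "advance_to_matching"),
]
overall, per_action = shadow_agreement(sample)
# overall is 0.75 (3 of 4 turns agree); collect_intake_info alone agrees on 2 of 3.
```

The per-action view matters because a high overall rate can hide systematic disagreement on a rare but important action such as advance_to_matching.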

This is the same pattern we use for matching strategies (Section 12 in CLAUDE.md). It's the right safety net for any control-flow change.


5. Feature Flag

Flag name: planner_v1_enabled
Default: false
Rollout plan:

  1. Ship the planner code + shadow mode wiring with the flag false
  2. Manually flip a percentage rollout in Flagsmith (5% → 25% → 100%)
  3. Hard cutover happens only after 95%+ shadow-mode agreement on a week of traffic


6. Out of Scope (This Layer)

  • New action types beyond the 12 listed above
  • Multi-action sequencing in a single turn
  • Replacing the sub-handlers themselves (those keep their current prompts and logic)
  • Removing the legacy if/elif code (kept as the fallback)
  • Frontend changes (planner is server-side only)

7. Success Criteria

Within two weeks of the cutover (planner authoritative, not shadow):

  • Zero increase in the per-turn 5xx rate or LLM failure rate
  • >=95% routing agreement with the human-spot-checked "correct action" on a 100-turn sample
  • >=70% of cases reach matching (Layer 1 target was 60%)
  • No new categories of agent failures introduced (we'll review all planner.error events at end of week 1)

8. Cost & Latency

  • One Haiku call per turn, ~200 input tokens / ~80 output tokens. ~$0.001 per turn.
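  • Latency ~400ms. If the planner runs in parallel with the existing input classifier (also a Haiku call) and both are awaited with asyncio.gather, the marginal latency is roughly zero. This only holds when the planner does not wait on the classifier: last_user_intent in PlannerInput comes from the classifier, so a concurrent planner call would have to run without the current turn's classifier output (for example, passing None). Running the two sequentially instead adds the full ~400ms to the turn.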
  • Shadow mode doubles this for the rollout window. Acceptable.
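The parallel-await idea can be sketched with asyncio.gather. Both coroutines below are hypothetical stand-ins that sleep instead of calling a model, and the sketch assumes the planner does not need the classifier's output for the current turn (last_user_intent is passed as None).

```python
import asyncio


async def classify_input(message: str) -> str:
    """Stand-in for the existing input classifier call."""
    await asyncio.sleep(0.05)  # simulated network latency
    return "provide_info"


async def call_planner(state: dict) -> dict:
    """Stand-in for the planner call."""
    await asyncio.sleep(0.05)  # simulated network latency
    return {"next_action": "collect_intake_info", "reason": "intake incomplete"}


async def classify_and_plan(message: str, state: dict) -> tuple[str, dict]:
    # Both awaitables are scheduled concurrently, so the wall-clock wait is
    # roughly the slower of the two calls rather than their sum.
    intent, decision = await asyncio.gather(
        classify_input(message),
        call_planner(state),
    )
    return intent, decision


intent, decision = asyncio.run(
    classify_and_plan("I take metformin", {"last_user_intent": None})
)
```

asyncio.gather returns results in the order the awaitables were passed, so the unpacking into `intent` and `decision` is safe regardless of which call finishes first.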

9. Rollback

The flag planner_v1_enabled is the kill switch. Setting it to false in Flagsmith reverts to the legacy if/elif tree within one cache TTL (60s). No code redeploy required.
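The "within one cache TTL" behavior can be illustrated with a small TTL-cached flag reader. This is a generic sketch, not the Flagsmith SDK: `CachedFlag` and its `fetch` callable are hypothetical, and the 60-second default simply mirrors the TTL quoted above.

```python
import time
from typing import Callable, Optional


class CachedFlag:
    """Caches a boolean flag and re-fetches it once the TTL has elapsed."""

    def __init__(self, fetch: Callable[[], bool], ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch          # hypothetical call to the flag service
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so the TTL can be tested
        self._value = False          # matches the flag's default of false
        self._expires_at: Optional[float] = None

    def is_enabled(self) -> bool:
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            self._value = bool(self._fetch())
            self._expires_at = now + self._ttl
        return self._value


# Simulated rollback: the remote value flips to False, and the process
# picks it up on the first read after the TTL expires.
now = [0.0]
remote = {"planner_v1_enabled": True}
flag = CachedFlag(lambda: remote["planner_v1_enabled"], 60.0, lambda: now[0])

before = flag.is_enabled()        # fetches and caches True
remote["planner_v1_enabled"] = False
now[0] = 30.0
during = flag.is_enabled()        # still the cached True
now[0] = 60.0
after = flag.is_enabled()         # TTL elapsed, re-fetches False
```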