
v5 / v6 Conversation Prompt — Unified Plan

Authored: 2026-05-15 (consolidation pass post Phase 1 completion)

Authority: This doc supersedes scattered v5 / v6 tracking. When in doubt, refer back here.

Owner: SD + agent platform.

Status: Synthesis / audit only; no new architecture, no fresh ADRs, no implementation. Implementation runs from the per-phase plans (see §8).

Companion specs (still authoritative on their own scope):

  • docs/specs/conversation-v6-feature.md (canonical v6 spec, rev 6, 962 lines)
  • docs/specs/conversation-v5-feature.md (legacy v5 spec, superseded but rules carry forward; 471 lines)
  • docs/specs/v6-rubric-locked.md (9-axis grader rubric)
  • docs/specs/v6-stage-resolver-truth-table.md (truth table for §2.6 resolver)
  • docs/specs/v6-stages-extractors-matrix.md (stage × extractor wiring)
  • docs/specs/v6-rule-location-map.md (lockstep registry for §8.5 CI gate)
  • docs/specs/v6-trio-consistency-findings.md (cross-doc audit, 18 findings)


0. Why this doc exists

SD asked for a single document because the v5 → v6 transition is scattered across:

  • One canonical v6 spec (conversation-v6-feature.md) plus 6 companion spec files
  • Two production prompts in tree (conversation_v4.yaml, conversation_v4.1.yaml)
  • One v6 base scaffold (conversation_v6.yaml) — all TODO(phase-2a) markers
  • Phase 0 (merged) + Phase 1 (open, stacked) PR chain
  • 8 open / closed GitHub issues (#491, #547, #550, #560, #642, #743, #836, #837)
  • Session-memory notes in ~/.claude/projects/-Users-srikanthdonthi-Code-Curaway/memory/

The risk this doc mitigates: a v5 rule, a hot-fix paragraph, or a verbatim phrase silently disappears during the v6 absorption because no single owner knows which file holds the canonical version today. The audit identifies inheritance contracts so Phase 2a can land without drift.

Key framing: v4.1 + v1.1 + #837 hot-fix is the inheritance point, not v4 alone. v6 ABSORBS, it does not replace.


1. Production baseline (as of 2026-05-15)

What is actually running in production today (verified against config/feature_flags.yaml on main and the four hot-fix issues):

| Surface | Value / Path | Source |
|---|---|---|
| prompt_version default | "v4" | config/feature_flags.yaml:192 |
| triage_layer_context_version default | "v1" | config/feature_flags.yaml:469 |
| Valid prompt pairs | (v4, v1) and (v4.1, v1.1) | flag description at config/feature_flags.yaml:194 |
| Mixed pair fallback | (v4, v1) + Telegram alert | flag description at config/feature_flags.yaml:194 |
| mso_patient_offer_enabled default | false (per-tenant flip after live testing) | config/feature_flags.yaml:339-342 |
| prompt_arch default (open PR #924) | "v4" (no v5 fallback by design) | PR #924 diff vs main |
| prompt_arch_v6_tenant_allowlist default (open PR #924) | "[]" (empty list, zero v6 traffic) | PR #924 diff vs main |
| #837 hot-fix paragraphs location | config/prompts/base/conversation_v4.yaml:111-112 ONLY | grep verified 2026-05-15 |
| #837 hot-fix paragraphs in v4.1 | NOT PRESENT (drift; see §6 risk register) | grep verified 2026-05-15 |
| Active layer contexts | config/prompts/layer_contexts/intent_capture.yaml (v1) + intent_capture_v1.1.yaml (v1.1) | filesystem |
| Active phase contexts (fallback) | config/prompts/phase_contexts/v2/*.yaml (intake, records_first, identify_procedure, document_review, general, recovery_offer, recovery_checkin) | filesystem |
| Examples | config/prompts/examples/{locale}/*.yaml | unchanged from v4 era |
| v6 scaffold in tree | config/prompts/base/conversation_v6.yaml, config/prompts/stages.yaml, config/prompts/knowledge/*.yaml (all TODO(phase-2a)) | PR #925 (merged into stack base) |
| v6 dispatcher live? | YES (PR #927 in stack, dormant; compose_v6 raises NotImplementedError) | app/agents/v6_dispatcher.py |
| Default conversation flow | v4 base + v2 phase context + v1 layer context | app/agents/conversation_prompt.py:140 get_system_prompt |
| MSO addendum production location today | config/prompts/base/conversation_v4.yaml:204 mso_offer_addendum: (in body) and v4.1 baked in | spec §3.5.1 row 9 |

Practical summary: Production traffic on 2026-05-15 runs conversation_v4.yaml with phase_contexts/v2/*.yaml + layer_contexts/intent_capture.yaml, gated by prompt_version=v4 + triage_layer_context_version=v1. Tenants with mso_patient_offer_enabled=true get (v4.1, v1.1). The #837 hot-fix paragraphs only live in v4.yaml lines 111-112 today; v4.1 has the original 5-bullet SAFETY block at lines 108-113 without the new treatment-recommendation + scope-rejection bans. This is a real drift — see §6 G-1.
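
The pairing rule described above can be sketched in a few lines. This is an illustration only, not the code behind prompt_loader.resolve_versions: the function name `resolve_pair`, its signature, and the `alert` callback (standing in for the Telegram alert) are assumptions. Only the two valid pairs and the (v4, v1) fallback come from the flag description.

```python
from typing import Callable, Tuple

# Valid (prompt_version, triage_layer_context_version) pairs per the flag description.
VALID_PAIRS = {("v4", "v1"), ("v4.1", "v1.1")}
FALLBACK_PAIR = ("v4", "v1")


def resolve_pair(
    prompt_version: str,
    layer_context_version: str,
    alert: Callable[[str], None],
) -> Tuple[str, str]:
    """Return a valid pair, falling back to (v4, v1) on a mixed pair.

    `alert` is a hypothetical hook standing in for the Telegram alert.
    """
    pair = (prompt_version, layer_context_version)
    if pair in VALID_PAIRS:
        return pair
    alert(f"mixed prompt pair {pair!r}; falling back to {FALLBACK_PAIR!r}")
    return FALLBACK_PAIR
```

Passing the alert hook in as a parameter keeps the sketch free of network calls, so the fallback path can be exercised in a unit test.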


2. Why v5 was folded into v6

Per docs/specs/conversation-v6-feature.md §0, two motivations forced the merge on 2026-05-12:

  1. v5 work (#783) added 6 prompt rules to fix 7 logged P0/P1/P2 bugs (#491, #547, #550, #560, #642, #743, #546) but kept v4 architecture (phase × layer composition, two parallel injection taxonomies, addendum-as-third-mechanism).
  2. A separate brainstorm surfaced that v4 architecture has structural drag: two overlapping concepts (phase / layer), three injection mechanisms (phase context / layer context / addendum), full base-prompt cloning per version, and ~35-40% redundant tokens per turn.

Doing both as one release (rules + architecture) avoids: two clinical advisor review cycles, two validation cycles (3 baselines + 3 after × 3 personas, twice), lockstep PRs during a v5-then-v6 transition window, and rule content being written twice (once into conversation_v5.yaml, once when restructuring into stages.yaml).

Concretely: there is no conversation_v5.yaml file, no v5 value for the prompt_version flag, no v5 row in prompt_loader.resolve_versions. The v5 spec at docs/specs/conversation-v5-feature.md is preserved as a rules reference, but its 6 rule additions land directly in conversation_v6.yaml per the §4 absorption mapping.


3. What's already implemented

Inventory by phase. Verified against PR list (gh pr list --state all) on 2026-05-15.

Phase 0 — Validation harness (MERGED)

| PR | Title | Components shipped |
|---|---|---|
| #921 | Phase 0 Steps 1 + 2 — locked rubric + LLM-grader scorer | docs/specs/v6-rubric-locked.md, config/prompts/scorer/v6_compliance_scorer.yaml, app/services/prompt_compliance_scorer.py, tests/test_prompt_compliance_scorer.py (17 deterministic tests, no LLM calls in CI), scripts/test_compliance_scorer.py (manual dogfood CLI) |
| #922 | Phase 0 Step 3 — fixture corpus | 15 conversations × 9-axis tagging in tests/v6_fixtures/; tests/test_v6_fixture_corpus.py |
| #923 | Phase 0 Steps 4 + 5 + 6 — grader cache + 6 gates + CI | grader caching, determinism wrapper, cost guard; 6 deterministic gates (axes 1, 2, 6, 7, 8, flag YAML); .github/workflows/v6-prompt-compliance.yml |

Live in production: ALL Phase 0 components are merged. None gate prod traffic — they gate PRs that touch prompt content.

Phase 1 — Safety net + scaffolding (OPEN — 6 stacked PRs)

| PR | Title | Base branch | Status | Components shipped |
|---|---|---|---|---|
| #924 | Step 1 — prompt_arch + tenant allowlist flags | main | OPEN | config/feature_flags.yaml: prompt_arch=v4, prompt_arch_v6_tenant_allowlist="[]" |
| #925 | Scaffolding — stages.yaml + knowledge + stubs | main | OPEN | conversation_v6.yaml (94 lines, TODO markers), stages.yaml (117 lines, 12 stages with TODOs), knowledge/{financial_options,post_travel_logistics,insurance_handling,procedure_clinical_facts/knee_replacement}.yaml, stage_resolver.py (269 lines, §2.6 truth table implemented), patient_context_builder.py (155 lines, tenant assertion), prompt_loader_v6.py (62 lines, compose_v6 raises NotImplementedError) |
| #926 | Layer 2 — boot-time YAML artifact validator | feat/v6-phase1-scaffolding | OPEN | app/services/v6_artifact_validator.py, 14 tests |
| #927 | Layer 3 — prompt_arch dispatcher + Langfuse tags | feat/v6-phase1-yaml-validator | OPEN | app/agents/v6_dispatcher.py, get_system_prompt branch, _dispatch_tags contextvars.ContextVar stash |
| #928 | Layer 4 — fallback observability + cost-tracking scaffold | feat/v6-phase1-dispatcher | OPEN | app/services/v6_fallback_monitor.py, Telegram alert for 3 unexpected fallback reasons, record_v6_turn_cost stub |
| #929 | Layer 5 — end-to-end safety-net smoke tests | feat/v6-phase1-cost-monitoring | OPEN | tests/test_v6_phase1_safety_net_smoke.py (8 scenarios) |

Phases 2–9 — Content port + composer + production rollout (SHIPPED 2026-05-16)

Subagent-driven deployment campaign on top of the Phase 1 substrate. All 8 phases shipped in one session.

| PR | Phase | Title | Components shipped |
|---|---|---|---|
| #933 | Phase 1 (fixtures) | Absorption + verbatim + lockstep fixtures (G-3, G-4, G-5) | tests/test_hotfix_837_absorption.py, tests/test_v5_rule_verbatim_preservation.py, tests/test_lockstep_consistency.py (67 tests gating Phases 2-5 via xfail(strict=True) markers; markers cleared progressively) |
| #935 | Phase 2 | #837 backport into v4.1 + HARD BANS port into v6 (closes G-1) | conversation_v4.1.yaml SAFETY block + conversation_v6.yaml HARD BANS section (byte-identical to v4:111-112) |
| #937 | Phase 3 | Port v5 rules 2.1/2.3/2.5/2.6 + base sections | All TODO(phase-2a) markers cleared from conversation_v6.yaml: ROLE, COLLECT, VOICE+rule2.5 emotional-word list, NEVER, FORBIDDEN PHRASES, DOCUMENT-TRUST FRAMING, DEMOGRAPHIC GROUNDING, ONE QUESTION PER TURN (7-axis), REMEMBER |
| #938 | Phase 4 | stages.yaml content port (12 stages) | All TODO(phase-2b) markers cleared from stages.yaml. Recovery stages carry ADR-0018 §K escalation triggers + coordinator handoff flow |
| #939 | Phase 5 | Knowledge addendums (5 files) | CREATED knowledge/mso_patient_offer.yaml (relocated from v4.yaml:207-261). FILLED financial_options, insurance_handling, post_travel_logistics, procedure_clinical_facts/knee_replacement |
| #940 | Phase 6+7 | compose_v6 + 5 dataclass services (closes G-7) | app/services/{case_summary,fhir_observation_summary,document_manifest,workflow_snapshot,patient_preferences}_service.py, knowledge_addendum_selector.py, prompt_loader_v6.py real compose body, patient_context_builder.py real assembly. 194 tests added |
| #941 | Phase 8 | G-15..G-18 cleanups + Phase 6+7 polish | G-15 stage_resolver logging spec amendment, G-16 WorkflowState key remapping docstring, G-17 compose_v6 return-shape guard + alertable fallback reason, G-18 stage_resolver edge case test, knowledge selector lru_cache, cache_segments[0] prompt_version field, soft-return rationale docstrings on 4 services |
| #942 | Phase 9 | Production rollout — flip prompt_arch=v6 + allowlist=["*"] | config/feature_flags.yaml defaults flipped. Internal-only prod (limited Curaway team); rollback by flipping default back to v4 (v4 path remains fully functional and is the dispatcher fallback target). Skipped: 24h dev test (per SD directive to enable fully on prod and test there) |

Live in production (2026-05-16): v6 is the default architecture. prompt_arch_v6_tenant_allowlist=["*"] covers all internal tenants. Identity-aware overrides via Flagsmith remain available for per-tenant ramp-down if needed.
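
How the two flags might combine can be sketched as below. This is a guess at the decision, not the logic in app/agents/v6_dispatcher.py: the function name `select_arch` is hypothetical, and it assumes the allowlist flag is a JSON-encoded list in which "*" matches every tenant and that any unparseable value falls back to v4.

```python
import json


def select_arch(prompt_arch: str, allowlist_json: str, tenant_id: str) -> str:
    """Pick "v6" only when the flag says v6 AND the tenant is allowlisted.

    Assumption: the allowlist flag is a JSON-encoded list where "*" matches
    every tenant. Anything unparseable falls back to v4.
    """
    if prompt_arch != "v6":
        return "v4"
    try:
        allowlist = json.loads(allowlist_json)
    except (TypeError, ValueError):
        # json.JSONDecodeError is a ValueError subclass.
        return "v4"
    if not isinstance(allowlist, list):
        return "v4"
    if "*" in allowlist or tenant_id in allowlist:
        return "v6"
    return "v4"
```

Under these assumptions, rollback is a single flag flip: setting prompt_arch back to "v4" short-circuits before the allowlist is read.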

Async dispatcher conversion (deferred to follow-up): compose_v6 is async; dispatcher remains sync and catches unawaited coroutines via inspect.iscoroutine(). Conversion would touch >5 files outside the v6 chain (triage_agent.py + get_system_prompt callers). Tracked separately.
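
The sync-side guard can be illustrated roughly as follows. The helper name `call_compose_sync` and the reason string are hypothetical, and treating a coroutine as a fallback condition is an assumption about what the dispatcher does after detection; only the use of inspect.iscoroutine() comes from the note above.

```python
import inspect
from typing import Any, Callable, Optional, Tuple


def call_compose_sync(compose: Callable[[], Any]) -> Tuple[Optional[Any], Optional[str]]:
    """Call `compose` from sync code; return (artifact, fallback_reason)."""
    result = compose()
    if inspect.iscoroutine(result):
        # An async compose was called without await. Close the coroutine so
        # Python does not emit a "never awaited" RuntimeWarning, then fall back.
        result.close()
        return None, "compose_returned_coroutine"  # hypothetical reason name
    return result, None
```

A full async conversion would await the coroutine instead; this guard only keeps the sync path from passing a coroutine object along as if it were a prompt artifact.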

CI status (2026-05-16): All v6 prompt-compliance gates green except the LLM grader job which requires ANTHROPIC_API_KEY in GitHub Actions secrets (G-2 — pending SD action; non-blocking since deterministic gates cover axes 1, 2, 6, 7, 8).


4. v5 rules absorption checklist

Tabulating the 6 v5 rules per conversation-v6-feature.md §4 mapping table + v6-rule-location-map.md §2.27. The "Production today" column is verified against config/prompts/base/conversation_v4.yaml + v4.1.yaml on 2026-05-15.

| Rule | Origin issue(s) | Target in v6 | Production status today | Verbatim fixture present? |
|---|---|---|---|---|
| 2.1 Document-trust framing | #560 | conversation_v6.yaml DOCUMENT-TRUST FRAMING section | NOT in v4 baseline; NOT in v4.1. The closest pre-v6 hit is phase_contexts/v2/document_review.yaml lines 6-26, which has the "NO MEDICAL INTERPRETATION" block, but the v5-spec 4-part framing + identity-clarification language is NOT in production today. | NO. tests/test_v5_rule_verbatim_preservation.py does not exist yet. Verbatim phrase list defined in v6-rule-location-map.md §3.1. |
| 2.2a Treatment-recommendation ban | #642 (+ #837 hot-fix) | conversation_v6.yaml HARD BANS section | In conversation_v4.yaml:111 (added by #837, merged 2026-05-12). NOT in conversation_v4.1.yaml (see §6 G-1). | NO. tests/test_hotfix_837_absorption.py does not exist yet. |
| 2.2b Scope-rejection ban | #743 (+ #837 hot-fix) | conversation_v6.yaml HARD BANS section | In conversation_v4.yaml:112. NOT in conversation_v4.1.yaml (see §6 G-1). | NO. Same fixture file as 2.2a (does not exist). |
| 2.3 Unverified demographic claim | #547 | conversation_v6.yaml DEMOGRAPHIC GROUNDING section (REVISED rev 3: moved from stage-scope to BASE because demographic fabrications can fire in any stage) | NOT in v4 baseline; NOT in v4.1. Closest production guard is the existing voice rules. | NO. Verbatim phrase: "The report I'm reading lists the patient as X — is this for someone other than yourself?" This must land in the fixture. |
| 2.4 Records-upload re-offer | B1-v4 finding (no GitHub issue; surfaced by manual v4 conversation audit) | stages.yaml > discovery.guidance AND procedure_identification.guidance + re_offer_on_turn: [2, 3] field | Partial in v4 (records-first emphasis in phase_contexts/v2/records_first.yaml), but the turn-2-3 cadence guarantee is NOT enforced today. Lingering-discovery cases can miss the re-offer entirely. | NO. Fixture should be a 5+ turn discovery stagnation case asserting re-offer language on turns 2 AND 3. |
| 2.5 Emotional verbatim echo | B1 axis-3 finding (no GitHub issue) | conversation_v6.yaml VOICE RULES section | Partial in conversation_v4.yaml:40 "NAME THE SPECIFIC HARD THING" (and v4.1 same line). v5 Rule 2.5 strengthens this with an explicit emotional-word list. The 7-word list (exhausted, scared, desperate, overwhelmed, frustrated, worried, tired) is not in production. | NO. Verbatim word list per v6-rule-location-map.md §3.4. |
| 2.6 Multi-question axis discipline | #491, #550 | conversation_v6.yaml ONE QUESTION PER TURN + stages.yaml per-stage do_not: [stack-questions] redundant placement | Partial: conversation_v4.yaml:38 has "ONE QUESTION ONLY when the patient is emotional" + intent_capture.yaml has pacing rules. The v5-spec "SAME-TURN AXIS DISCIPLINE" enumeration (Laterality / Mechanism / Severity / Timeline / Prior treatment / Demographics / Records availability) is NOT in production today. | NO. Verbatim axis list + WRONG/RIGHT example pair per v6-rule-location-map.md §2.27 v5.RULE.006. |

Inheritance starting point for Phase 2a:

  • Rules 2.1, 2.3, 2.5, 2.6 — start from conversation_v4.yaml (NOT v4.1, because v4.1 differs from v4 only by the MSO addendum being baked in; the bulk of voice / safety / process content is identical).
  • Rules 2.2a, 2.2b — start from conversation_v4.yaml:111-112 (the #837 hot-fix paragraphs) and verbatim-port to conversation_v6.yaml HARD BANS. The exact byte-identical text is non-negotiable (see §6 G-1).
  • MSO addendum — port from conversation_v4.yaml:204 mso_offer_addendum: (the in-body version baked into v4.1) to config/prompts/knowledge/mso_second_opinion.yaml. Gated by the SAME mso_patient_offer_enabled flag (spec §3.5.1 row 9). Phase 2b includes a regression test asserting flag value is honored across v4↔v6 toggle.

Key clarification: v4.1 is MSO-only; it's not a clinical-rules upgrade over v4. The "v4.1 / v1.1 pair" exists solely so tenants with mso_patient_offer_enabled=true get the MSO addendum without flag-conditional prompt assembly. Treating v4.1 as the inheritance point for clinical rules would be wrong — the clinical rules base is conversation_v4.yaml + #837 hot-fix paragraphs.


5. What's pending

Phase-by-phase per conversation-v6-feature.md §6 + reality on 2026-05-15:

Phase 2a — Migration: base prompt rules

  • Scope: Port v5 rules 2.1, 2.2 (verbatim from #837), 2.3, 2.5, 2.6 into conversation_v6.yaml base sections (replacing every TODO(phase-2a) marker).
  • Estimate: 1-2 days (Opus for content judgment, per conversation-v6-feature.md §6 row 2a).
  • Who: Opus author + Dr. Naidu reviewer.
  • Blockers:
      • tests/test_hotfix_837_absorption.py must land FIRST (see §6 G-3); otherwise wording can drift during the port without CI catching it.
      • tests/test_v5_rule_verbatim_preservation.py must land alongside the port (verbatim phrase fixtures per v6-rule-location-map.md §3).
      • tests/test_lockstep_consistency.py must land alongside the port (reads v6-rule-location-map.md, asserts every rule reaches its declared destination).
      • Dr. Naidu base-prompt-rules review gate (mandatory per spec §6 footnote: "All 4 windows MUST be locked on his calendar before Phase 0 starts"). Confirm with SD whether this is locked.
      • LLM grader CI auth (G-2): the Phase 0 grader can't fail-close on prompt content if ANTHROPIC_API_KEY isn't wired.
  • Pre-flight check (per Phase 1 spec §3.5.1 row 8): #535 (Flagsmith identity bug) is CLOSED per gh issue view 535. Phase 1 unblock condition satisfied.

Phase 2b — Migration: stages.yaml content

  • Scope: Port phase + layer content into the 12 stages.yaml entries (replacing every TODO(phase-2b) marker — guidance, cards_to_use, advance_when, do_not, extractors_active). Lockstep — any voice-rule update to v6 also lands in v4.
  • Estimate: 2-3 days (Opus per spec §6 row 2b).
  • Who: Opus author + Dr. Naidu reviewer.
  • Blockers:
      • v6-stages-extractors-matrix.md must publish FIRST (per spec §6 row 2b: "blocking"). Status today: the matrix doc EXISTS at docs/specs/v6-stages-extractors-matrix.md (288 lines, draft 2026-05-12). Confirm Naidu has signed off on the matrix before Phase 2b starts, OR confirm it doesn't require his sign-off and only the stages.yaml content port does.
      • tests/test_extractor_prompts_pii_safe.py (CI gate per spec §3.4) must land. Its scope in the scaffolding is ambiguous: it may already be covered by Phase 0 gates, or it may be a Phase 2b deliverable.
      • Dr. Naidu stages.yaml content gate (mandatory per spec §6 row 2b).
      • Phase 2a must merge first (Phase 2b depends on the base-prompt rule landing site).

Phase 3 — Admin UI extensions

  • Scope: prompt_arch selector in /admin/triage, stage debug endpoint at /api/v1/admin/cases/{case_id}/stage (with Depends(require_case_access)), knowledge addendum toggles.
  • Estimate: 1 day, Sonnet.
  • Blockers: Phase 2a + 2b merged (selector pointing at empty stages is useless).

Phase 4 — Frontend

  • Scope: RichCard.tsx extensions for view_payments, view_summary, view_consultations, stage_indicator; placeholder pages Payments.tsx, Summary.tsx, Consultations.tsx.
  • Estimate: 1 day, Sonnet.
  • Blockers: Phase 2b (stages declare cards_to_use).

Phase 5 — Extractor prompt updates

  • Scope: Replace "layer N" semantics in 5 extractor system prompts with stage-equivalent semantics (semantic-equivalent rewrite, not content change). 5 extractors: intent, medical, travel, logistics, financial. recovery_checkin_extractor (PR #832 / recovery_checkin_extractor.py) is downstream.
  • Estimate: 1-2 days, Opus for content (per spec §6 row 5).
  • Blockers: Phase 2b (stages.yaml extractors_active lists must be populated per the matrix); tests/test_extractor_layer_to_stage_rename.py (NEW per spec Appendix B) must accompany.

Phase 6 — Dual-shadow ramp 10%

  • Scope: Flip prompt_arch=v6 for 10% of tenants via prompt_arch_v6_tenant_allowlist. Side-by-side Langfuse trace comparison vs v4.
  • Estimate: 1 week observation calendar.
  • Acceptance criterion (new rev 5): Per-segment cache hit rate measured in Langfuse — Seg 2 ≥ 60%, Seg 3 ≥ 50% sustained 24h. Block ramp if either fails.
  • Blockers: Phases 2a-5 complete; cost dashboards green; Phase 6 acceptance criterion (cache hit rate) defined.
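
The acceptance criterion above reduces to a small gate check, sketched here. The thresholds come from the criterion itself; the function name, the segment keys ("seg2", "seg3"), and the input shape are assumptions, and pulling the 24h sustained rates out of Langfuse is not shown.

```python
from typing import Mapping

# Thresholds from the Phase 6 acceptance criterion.
MIN_HIT_RATE = {"seg2": 0.60, "seg3": 0.50}


def ramp_allowed(hit_rates: Mapping[str, float]) -> bool:
    """Return True only if every gated segment meets its minimum hit rate.

    `hit_rates` is assumed to hold the 24h sustained rate per segment as a
    fraction in [0, 1]; a missing segment counts as a failure.
    """
    return all(
        hit_rates.get(segment, 0.0) >= minimum
        for segment, minimum in MIN_HIT_RATE.items()
    )
```

Treating a missing segment as a failure keeps the gate fail-closed when a measurement is absent.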

Phase 7 — Manual validation cycle

  • Scope: 3 baselines + 3 after on 3 personas (caregiver/oncology, direct/ortho, exploratory). 9-axis scoring per turn. SD + Dr. Naidu sign off per persona.
  • Estimate: 1 day live testing.
  • Blockers: Phase 6 observation complete; Naidu calendar (4th of 4 mandatory windows per spec §6).

Phase 8 — Ramp to 50% then 100%

  • Scope: Stagger; 24h hold between bumps.
  • Estimate: 3 days.
  • Blockers: Phase 7 sign-off; no regressions in Langfuse + Metabase dashboards.

Phase 9 — 2-week observation

  • Scope: Real-traffic per-case audit on a sample per persona.
  • Estimate: 2 weeks calendar.
  • Blockers: Zero clinical-safety violations during ramp.

Phase 10 — Decommission

  • Scope: Delete phase_contexts/, layer_contexts/, base prompts v1-v4, _LAYER_TO_PHASE mapping, deprecated loader functions, deprecated tests.
  • Estimate: 1-2 days (CORRECTED rev 5 from 0.5d — shadow-import audit on 8+ sites: tests/test_intake_fix5.py, tests/test_conversation_prompt.py, tests/test_prompt_loader.py, tests/test_no_medical_advice.py:PATIENT_FACING_FILES, app/agents/conversation_prompt.py:_get_phase_contexts() callsites, app/services/prompt_loader.py:PHASE_DIR/LAYER_DIR constants).
  • Blockers: Phase 9 complete; all v4 paths confirmed unused via Langfuse; re-export shims (§1.3) deleted.

Aggregate calendar (per spec §6): ~6-7 weeks from Phase 0 start to Phase 9 complete. Phase 0 + Phase 1 are done (~10 calendar days elapsed). Net remaining: ~4-5 weeks if Naidu calendar locks cleanly.


6. Gaps + risks

Items scattered across issues / specs / memory that aren't formally tracked in the phase plan. Each has an ID for cross-reference.

G-1 — conversation_v4.1.yaml is MISSING the #837 hot-fix paragraphs (CRITICAL — folded into Phase 2a)

  • Evidence: Grep on config/prompts/base/conversation_v4.1.yaml for "we don't handle that", "outside our scope", "right next step", "treatment recommendation" returns ZERO matches. The same grep on conversation_v4.yaml returns lines 111-112.
  • Impact: Tenants on mso_patient_offer_enabled=true (pair (v4.1, v1.1)) have the un-patched SAFETY block today. The two P0s (#642 treatment recommendation, #743 scope rejection) that #837 closed for v4-tenants are STILL OPEN for v4.1-tenants.
  • Resolution (per SD 2026-05-16): Folded into Phase 2a kickoff rather than treated as a separate pre-Phase-2a backport PR. Rationale: Phase 2a's first PR already lands the verbatim absorption fixture (test_hotfix_837_absorption.py — see G-3) and ports the #837 paragraphs into conversation_v6.yaml HARD BANS. Bundling the v4.1 backport into the same PR means a single Dr. Naidu review touchpoint covers both the v4.1 patch AND the v6 absorption byte-for-byte. Accepts ~1 week of un-patched v4.1 traffic in exchange for not splitting Naidu's attention across two paragraphs of identical text.
  • Phase 2a scope addendum: The first Phase 2a PR must:
      • Add tests/test_hotfix_837_absorption.py with byte-identical assertions against BOTH conversation_v4.1.yaml AND conversation_v6.yaml HARD BANS.
      • Patch conversation_v4.1.yaml to include the #837 treatment-recommendation + scope-rejection paragraphs verbatim from conversation_v4.yaml:111-112.
      • Port the same paragraphs into conversation_v6.yaml HARD BANS.
  • Single Dr. Naidu confirm covers all three (same paragraphs he already approved for v4).
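
A byte-identical absorption check could look roughly like this. It is a sketch, not the contents of tests/test_hotfix_837_absorption.py: the helper names are hypothetical, and it assumes the paragraphs sit at the same indentation in every file. If the YAML nesting differs between files, a real fixture would have to compare parsed values instead of raw lines.

```python
def hotfix_paragraphs(v4_text: str, first_line: int = 111, last_line: int = 112) -> str:
    """Return the hot-fix lines (1-indexed, inclusive) from the v4 prompt text."""
    lines = v4_text.splitlines()
    return "\n".join(lines[first_line - 1:last_line])


def is_absorbed(
    v4_text: str,
    target_text: str,
    first_line: int = 111,
    last_line: int = 112,
) -> bool:
    """True if the target contains the hot-fix paragraphs byte-for-byte."""
    block = hotfix_paragraphs(v4_text, first_line, last_line)
    # An empty block (file shorter than expected) must never count as a match.
    return bool(block) and block.encode("utf-8") in target_text.encode("utf-8")
```

In a test, `v4_text` would be read from conversation_v4.yaml and `target_text` from conversation_v4.1.yaml and conversation_v6.yaml in turn, with one assertion per target.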

G-2 — Phase 0 LLM-grader CI fails on PRs touching prompt content (HIGH)

  • Evidence: gh run list --workflow="v6 prompt compliance" shows 3 failure runs on feat/v6-phase1-scaffolding (2026-05-15 12:36, 12:40, 12:56). Deterministic gates pass; LLM grader job fails because ANTHROPIC_API_KEY is not set in GitHub Actions secrets for that workflow.
  • Impact: Phase 2a + Phase 2b PRs (which actually change prompt content) cannot pass the 9-axis CI grader — the grader can't run. SD will be tempted to admin-merge prompt changes.
  • Mitigation: Plumb ANTHROPIC_API_KEY into .github/workflows/v6-prompt-compliance.yml (single-line workflow secret add). One-shot fix; SD task.

G-3 — tests/test_hotfix_837_absorption.py does NOT exist (HIGH)

  • Evidence: ls tests/ | grep -iE "hotfix|absorb" returns nothing.
  • Spec reference: conversation-v6-feature.md §5 explicitly calls this fixture out as "NEW rev 5 per compliance review".
  • Impact: Without this fixture, Phase 2a port of the two #837 paragraphs into conversation_v6.yaml HARD BANS can drift in wording, weakening rule 2.2. The spec is explicit: "byte-identical".
  • Recommendation: Land this fixture as the FIRST work of Phase 2a (before any content port).

G-4 — tests/test_v5_rule_verbatim_preservation.py does NOT exist (HIGH)

  • Evidence: ls tests/ | grep -iE "verbatim|v5_rule" returns nothing.
  • Spec reference: conversation-v6-feature.md §4 + v6-rule-location-map.md §3 (40+ verbatim phrases enumerated).
  • Impact: Same drift risk as G-3 but for the broader v5 rule set (4-part doc-trust framing, demographic clarification, emotional word list, axis discipline list).
  • Recommendation: Land alongside test_hotfix_837_absorption.py as Phase 2a pre-work.

G-5 — tests/test_lockstep_consistency.py does NOT exist (MED)

  • Evidence: Not in tests/ directory.
  • Spec reference: conversation-v6-feature.md §8.5 + v6-rule-location-map.md §0.
  • Impact: The lockstep CI gate, which reads v6-rule-location-map.md and asserts that rules land at their declared destinations, is missing. Without it, a rule dropped during Phase 2a/2b goes unnoticed and becomes a silent regression.
  • Recommendation: Land before Phase 2a starts (so the absorbing PR is the FIRST to be gated).

G-6 — stages.yaml > extractors_active is empty in scaffolding (MED)

  • Evidence: All 12 stages in config/prompts/stages.yaml have extractors_active: [] # TODO(phase-2b).
  • Spec reference: spec §3.4 + v6-stages-extractors-matrix.md §3 (30 run cells, 2 cond cells).
  • Impact: Phase 5 extractor work (compose_v6() reads extractors_active to know which extractors to spawn) cannot land before Phase 2b populates the lists.
  • Recommendation: This is the documented Phase 2b deliverable. No action — flagged here for visibility.

G-7 — patient_context_builder dataclass-producing services don't exist (MED)

  • Evidence: app/agents/patient_context_builder.py exists (155 lines, from PR #925) with the assembly interface, but it expects dataclasses CaseSummary, FhirObservationSummary, DocumentManifest, WorkflowSnapshot, PatientPreferences from owning-domain services. None of these dataclass-producing service functions exist yet on main.
  • Spec reference: conversation-v6-feature.md §2.4 revision rev 3 — "dataclass-producing services MUST use BaseRepository._scoped_query(tenant_id)".
  • Impact: compose_v6() cannot move past NotImplementedError without these services. This is a Phase 2a-2b dependency that's not currently broken out as its own work item.
  • Recommendation: Scope into a Phase 2a sub-task. Estimate: 1-2 days. Likely Sonnet (mechanical — wrap existing repository reads in dataclass-returning service functions).

G-8 — Naidu clinical sweep on #837 + #832 — task #169 closed but mid-stream sweep not formally re-scheduled (MED)

  • Evidence: gh issue view 169 is closed (Phase 0 multi-tenancy work). No open issue tracks "Dr. Naidu mid-stream review of merged recovery prompts + #837 wording" specifically. Spec §6 calls out 4 separate Naidu gates but the calendar lock status is not in the doc.
  • Impact: Spec §6 is explicit: "If Dr. Naidu is unavailable >2 weeks for ANY of the 4 gates, the phase pauses." All 4 windows MUST be locked before Phase 0 starts. Phase 0 already shipped — confirm whether the windows are locked for 2a / 2b / 7.
  • Recommendation: SD confirms Naidu calendar status in writing before Phase 2a kickoff.

G-9 — Mid-conversation rollback test (spec §8.2.1) — does it exist? (LOW)

  • Evidence: Spec §8.2.1 describes the expected behavior (mixed prompt_arch stamps within one conversation) but does not list a test file. No tests/test_mid_conversation_rollback_*.py in the tree.
  • Impact: Phase 6 dual-shadow ramp could trigger a mid-conversation arch flip and produce inconsistent traces. Without a fixture, the audit cannot prove the spec §8.2.1 behavior holds.
  • Recommendation: Add to Phase 5 / 6 work list. Estimate: 0.5 day.

G-10 — Identity-aware Flagsmith pass-through (#535) (RESOLVED)

  • Evidence: gh issue view 535 is CLOSED.
  • Status: Phase 1 prereq satisfied (per spec §3.5.1 row 8 — "#535 MUST be resolved before v6 Phase 1 starts"). No action.

G-11 — addendum_priority_clinical_first.py test (LOW)

  • Spec reference: §9 risk row + Appendix B testlist.
  • Status: Not in tree. Knowledge addendums in scaffolding (PR #925) lack priority: and category: fields. Phase 2b deliverable.

G-12 — 18 cross-spec inconsistencies tracked in v6-trio-consistency-findings.md (LOW-MED)

  • Evidence: docs/specs/v6-trio-consistency-findings.md enumerates 18 findings (F-01 through F-18), 5 MAJOR + 13 MINOR.
  • Major ones:
      • F-01: stage count mismatch in #859 OQ.02
      • F-02: intake referenced as stage (not in §1 list)
      • F-03: CI gate algorithm doesn't cross-read sibling specs
      • F-04: cross-spec links missing in #855 and #859
      • F-05: 17 raw open questions across 3 docs → Naidu burn risk
  • Recommendation: Squash MAJOR findings before Phase 2a starts. MINOR can defer.

G-13 — Dr. Naidu review gates not formally scheduled (HIGH)

  • Evidence: Spec §6 footnote (revised) lists 4 mandatory Naidu sign-offs (Phase −1 #837 mini, Phase 2a base rules, Phase 2b stages content, Phase 7 validation). No tracking issue or calendar artifact in the repo.
  • Recommendation: Create one tracking issue per Naidu gate; link from spec §6.

G-14 — apps/patient-app/src/components/chat/rich_content_types.generated.json manifest does NOT exist (LOW)

  • Spec reference: §3.9 — needed for the FE/BE drift CI gate.
  • Status: Scoped as "1 day work" in Phase 1, but not in any merged PR.
  • Recommendation: Land in Phase 4 (frontend phase) alongside the new RichCard.tsx entries.

G-15 — stage_resolver.py violates the companion-doc no-logging purity contract (LOW)

  • Evidence: app/services/stage_resolver.py:146,158,168 emit logger.warning / logger.debug calls. docs/specs/v6-stage-resolver-truth-table.md §1:23 states: "Pure function. No I/O, no LLM, no DB writes, no logging."
  • Impact: Practically harmless today — logger.warning is side-effecting but doesn't change return value. However it breaks property-test stability and the spec contract; a future implementer relying on the pure-function claim could be surprised.
  • Recommendation: Either tighten the spec to "no observable side effects on returned value" OR remove the loggers and surface malformed-state signals via the return value. Decide in Phase 2a kickoff.

G-16 — WorkflowState key-name remapping is silent (LOW)

  • Evidence: app/services/stage_resolver.py:72-79 silently maps spec field names (documents_uploaded, match_results_shown, provider_selected) → live model names (required_documents_uploaded, matching_complete, providers_selected). Test fixtures use the live names so the gap is invisible.
  • Impact: A future implementer following spec §2.6 literally will pass spec-named keys to WorkflowState({...}) and see all values silently default to False — every stage rule will fail to match → fallback to support on every turn.
  • Recommendation: Document the mapping in a top-of-class docstring on stage_resolver.py OR accept both names via a small adapter layer. Address before Phase 2a expands the truth-table surface.
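
The adapter option in the recommendation might look like the sketch below. The three name pairs are taken from the evidence above; the function name `normalize_workflow_keys`, the plain-dict input, and the choice to raise on conflicting values are assumptions, since the actual WorkflowState constructor is not shown here.

```python
from typing import Any, Dict, Mapping

# Spec field name -> live model field name, as listed in G-16.
SPEC_TO_LIVE = {
    "documents_uploaded": "required_documents_uploaded",
    "match_results_shown": "matching_complete",
    "provider_selected": "providers_selected",
}


def normalize_workflow_keys(raw: Mapping[str, Any]) -> Dict[str, Any]:
    """Return a copy of `raw` with spec-named keys renamed to live names.

    Raises ValueError if both names are present with different values, so a
    conflict is loud rather than silently defaulted.
    """
    out: Dict[str, Any] = dict(raw)
    for spec_key, live_key in SPEC_TO_LIVE.items():
        if spec_key not in out:
            continue
        value = out.pop(spec_key)
        if live_key in out and out[live_key] != value:
            raise ValueError(f"conflicting values for {spec_key!r} and {live_key!r}")
        out[live_key] = value
    return out
```

Running caller input through such a function before building the state would let spec-named keys take effect instead of defaulting to False.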

G-17 — compose_v6 returning None produces no v6_fallback_reason (MED)

  • Evidence: app/agents/v6_dispatcher.py:188-193 constructs DispatchResult(arch="v6", v6_artifact=artifact) without inspecting artifact. conversation_prompt.py:176 guards with if dispatch.arch == "v6" and dispatch.v6_artifact is not None — so None silently falls to v4 path with NO v6_fallback_reason trace tag. The Layer 4 monitor cannot alert on this incoherent state.
  • Impact: Phase 2a wiring may briefly produce malformed compose_v6 returns during incremental rollouts. Without a fallback reason tag, the silent v4 fallback is invisible in Langfuse.
  • Recommendation: In Phase 2a, validate compose_v6's return shape (dict with system: str, stage_id: str, cache_segments: list) and emit a new v6_fallback_reason="compose_returned_invalid" trace tag when the shape is wrong. Add compose_returned_invalid to ALERTABLE_FALLBACK_REASONS in the same PR.
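
The return-shape guard recommended above can be sketched as follows. The three required fields and the compose_returned_invalid tag come from the recommendation; the function name and the idea of returning the reason (rather than emitting the trace tag directly) are assumptions made to keep the example self-contained.

```python
from typing import Any, Optional

# Required artifact fields and their types, per the G-17 recommendation.
REQUIRED_FIELDS = {"system": str, "stage_id": str, "cache_segments": list}


def fallback_reason_for(artifact: Any) -> Optional[str]:
    """Return None for a well-formed artifact, else the fallback reason tag."""
    if not isinstance(artifact, dict):
        return "compose_returned_invalid"
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(artifact.get(field), expected_type):
            return "compose_returned_invalid"
    return None
```

A dispatcher could call this right after compose_v6 returns and attach any non-None result as the v6_fallback_reason tag before taking the v4 path, so the fallback shows up in traces.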

G-18 — Stage-resolver rule fall-through edge case has no test (LOW)

  • Evidence: When intent_completion == 1.0 AND documents_uploaded == True AND medical_status.completion < 0.7, rules 3 and 4 both fail (rule 3 requires intent < 1.0, rule 4 requires not documents_uploaded). No subsequent rule matches → fallback to support. tests/test_stage_resolver.py does not exercise this combination.
  • Impact: The intent here may be deliberate (records have been uploaded but the medical_status extractor hasn't caught up yet, so support is the correct conservative answer) — but without a test it's not pinned. A future refactor could silently change the behavior.
  • Recommendation: Add a single test asserting this combination → "support". 10 minutes of work; do during Phase 2a kickoff.

G-19 — FE TransportOfferCard follow-up items (MED)

Identified by post-merge code and test reviews on 2026-05-16. Bundle into a single follow-up PR (fix/transport-offer-card-wiring) when transport endpoints near rollout.

  • patientAction is a no-op in RichCard.tsx:260-263. The card calls patientAction(bookingIdDraft, 'select') and patientAction('', 'decline_all') but the handler in RichCard is a stub. Same Phase D deferred state as RecoveryOfferCard; not a regression. Wire to ConversationAppMessageThread → chat send-message flow before transport endpoints go live.
  • declineAll API silently swallows ALL errors at apps/patient-app/src/services/transportApi.ts:88-90. Narrow to 404 only; re-throw others so the component's error banner fires correctly.
  • /design-preview/transport is publicly reachable (apps/patient-app/src/App.tsx:149). Wrap in ProtectedRoute or import.meta.env.DEV guard for consistency with other design-preview routes.
  • RichCard.tsx transport_offer branch has zero integration tests. Add a test rendering <RichCard> with contentType='transport_offer' + a minimal fixture; assert TransportOfferCard mounts.
  • transportApi.ts has no dedicated test file. Unit-test toTransportOption (snake→camel) + declineAll 404-no-op + 500-rethrow behavior.
  • declineAll rejection path test missing. Mirror the existing selectOption error test.
  • Backend: cross-module private import. app/agents/v6_dispatcher.py:82-85 imports _resolve_prompt_arch and _resolve_v6_tenant_allowlist (both _ prefixed) from prompt_loader.py. Promote to public symbols OR relocate to a shared v6_config.py before Phase 2a expands the dispatch surface.
  • The PostHog vendor-name property was fixed inline in the curaway-health-navigator follow-up PR (vendor_id → vendor_name); no Phase 2a tracking is needed.

7. Inheritance map (CRITICAL — v4.1 / v4 + #837 as starting point)

This map lists, for each v6 absorption section, the exact source text that must be preserved verbatim. It is the input contract for Phase 2a.

| v6 destination section | Source (file + line range) | Verbatim requirement |
|---|---|---|
| conversation_v6.yaml HARD BANS — rule 2.2a (treatment recommendation ban) | config/prompts/base/conversation_v4.yaml:111 | YES, byte-identical. Asserted by tests/test_hotfix_837_absorption.py (must be created). Source phrases: "NEVER recommend a specific procedure, surgery, or course of treatment", "the right next step is", "why [procedure] makes sense for your case", "That's a clinical decision your doctor or specialist makes", "Surfacing what a document contains is allowed; choosing the procedure for the patient is not." |
| conversation_v6.yaml HARD BANS — rule 2.2b (scope rejection ban) | config/prompts/base/conversation_v4.yaml:112 | YES, byte-identical. Source phrases: "NEVER reject a patient based on procedure type, condition, or specialty", "Curaway coordinates care across all specialties", "we don't handle that", "this is outside our scope", "Curaway isn't set up for", "Let me flag this with our care team so we can connect you with the right specialist." |
| conversation_v6.yaml DOCUMENT-TRUST FRAMING — rule 2.1 | docs/specs/conversation-v5-feature.md:63-98 (rule definition; never landed in any base prompt file) | Partial verbatim. Verbatim NEVER phrases: "different from what your doctor told you", "this is not [diagnosis]", "the diagnosis is wrong", "I'm seeing findings that contradict". Verbatim ALWAYS phrases: "I want to make sure these have been factored in", "could you check with the oncologist whether", "Surfacing factual findings IS allowed". The 4-part framing pattern's structure can be modernized; the phrase list cannot. |
| conversation_v6.yaml DEMOGRAPHIC GROUNDING — rule 2.3 | docs/specs/conversation-v5-feature.md:131-155 | YES for the identity clarification phrase: "The report I'm reading lists the patient as X — is this for someone other than yourself?". Surrounding guidance can be modernized. |
| conversation_v6.yaml VOICE RULES — rule 2.5 | config/prompts/base/conversation_v4.yaml:40 (existing "NAME THE SPECIFIC HARD THING") + docs/specs/conversation-v5-feature.md:214-237 (v5 Rule 2.7 strengthening) | YES for the 7-word list: exhausted, scared, desperate, overwhelmed, frustrated, worried, tired. Must appear as a literal list inside section anchored by # V5-RULE-2.7-EMOTIONAL-VERBATIM. |
| conversation_v6.yaml ONE QUESTION PER TURN — rule 2.6 | config/prompts/base/conversation_v4.yaml:38 (existing "ONE QUESTION ONLY") + docs/specs/conversation-v5-feature.md:184-212 (v5 Rule 2.6 SAME-TURN AXIS) | YES for the 7-axis list: Laterality, Mechanism, Severity, Timeline, Prior treatment, Demographics, Records availability. WRONG/RIGHT example pair verbatim. Each stage in stages.yaml declares do_not: [stack-questions] (redundant placement appropriate per spec §4). |
| conversation_v6.yaml JSON RESPONSE FORMAT | config/prompts/base/conversation_v4.yaml:186 envelope OR config/prompts/base/conversation_v4.1.yaml:187 envelope (verified identical between v4 and v4.1 per spec §1.4) | YES, byte-identical. The {"message": "...", "extracted_data": {...}, "detected_comorbidities": [...], "phase_complete": false, "suggested_next": null, "missing_critical_info": []} envelope must appear unchanged. Asserted by tests/test_v5_rule_verbatim_preservation.py per spec §1.4. |
| conversation_v6.yaml REMEMBER | config/prompts/base/conversation_v4.yaml:191-195 (4 numbered rules: ACKNOWLEDGE BEFORE ASKING, NEVER DIAGNOSE, HONOR YOUR PROMISES, NEVER PROJECT EMOTIONS) | Verbatim. These are the "4 most important rules" — the explicit final reminder block. |
| conversation_v6.yaml ROLE / COLLECT BEFORE MATCHING / VOICE / NAME / FORMAT / FACTS sections | config/prompts/base/conversation_v4.yaml (NOT v4.1 — they're identical for these sections, but v4 is canonical) per the line-level map in docs/specs/v6-rule-location-map.md §2.1-2.18 | Mixed verbatim / semantic. The verbatim: column in v6-rule-location-map.md is the per-rule authority. |
| stages.yaml > discovery.guidance + stages.yaml > procedure_identification.guidance — rule 2.4 records re-offer | docs/specs/conversation-v5-feature.md:156-175 + config/prompts/phase_contexts/v2/records_first.yaml + identify_procedure.yaml | Semantic only. re_offer_on_turn: [2, 3] field per spec §4. Cadence-enforced — fixture must show 5-turn discovery stagnation triggers re-offer on turns 2 AND 3. |
| stages.yaml > <stage>.do_not | config/prompts/phase_contexts/v2/*.yaml per-phase DO NOT lists + recovery_offer.yaml + recovery_checkin.yaml patronizing-filler ban list | Verbatim for ban lists. Patronizing-filler list: I hear you, I understand, I'm here for you, completely natural to feel. Source: v6-rule-location-map.md §3.5. |
| knowledge/mso_second_opinion.yaml | config/prompts/base/conversation_v4.yaml:204 mso_offer_addendum: block | Verbatim. Same gating flag (mso_patient_offer_enabled) — Phase 2b regression test asserts flag honored across v4↔v6 toggle. |

The line-level absorption map for every other rule (ROLE, VOICE, NAME, NEVER, CONT, EMO, PROJ, NONSENSE, ABROAD, FIRST, APPROACH, THINK, FORMAT, SAFETY, FACTS, EXAMPLES, JSON, REMEMBER + 6 phase_contexts/v2/*.yaml) lives in docs/specs/v6-rule-location-map.md §2.1-2.26. That doc is the per-rule authority. This §7 is the summary contract for Phase 2a kickoff.
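
The byte-identical rows in this map lend themselves to a plain substring check. The sketch below shows one possible shape for the not-yet-created absorption fixtures; the helper name missing_phrases and the HOTFIX_837_PHRASES constant are hypothetical, and only three of the source phrases listed above are included for brevity.

```python
# Hypothetical subset of the #837 source phrases from the inheritance map.
HOTFIX_837_PHRASES = [
    "NEVER recommend a specific procedure, surgery, or course of treatment",
    "NEVER reject a patient based on procedure type, condition, or specialty",
    "Curaway coordinates care across all specialties",
]


def missing_phrases(prompt_text: str, phrases: list[str]) -> list[str]:
    """Return the phrases that do not occur exactly in `prompt_text`.

    The check is a case-sensitive substring match with no whitespace
    normalization, so a reworded or re-wrapped phrase counts as missing.
    """
    return [phrase for phrase in phrases if phrase not in prompt_text]
```

A real fixture would presumably read config/prompts/base/conversation_v6.yaml (and conversation_v4.1.yaml for the backport) and assert that the returned list is empty. Because no whitespace is normalized, a phrase that the YAML wraps across lines would be reported as missing, which may or may not be the desired strictness.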


8. Phase-by-phase next-steps

Concrete ordered list of what happens AFTER Phase 1 PRs merge. Each step: dependencies, who, rough estimate.

| # | Step | Dependencies | Who | Estimate |
|---|---|---|---|---|
| 1 | Merge Phase 1 stack (#924 + #925 + #926 + #927 + #928 + #929 in dependency order) | LLM-grader CI auth (G-2) fixed OR explicit admin-merge approval | SD + Claude | 1 day calendar (CI thrash) |
| 2 | Plumb ANTHROPIC_API_KEY into v6 CI workflow (close G-2) | None | SD (single secret add) | 5 minutes |
| 3 | Create absorption fixtures: tests/test_hotfix_837_absorption.py + tests/test_v5_rule_verbatim_preservation.py + tests/test_lockstep_consistency.py (close G-3, G-4, G-5) | v6-rule-location-map.md published (DONE) | Sonnet author (tests are deterministic — no LLM) | 1-2 days |
| 4 | Lock Dr. Naidu calendar windows for 2a, 2b, 7 (close G-13) | None | SD | calendar dependent |
| 5 | Squash G-12 MAJOR findings in v6-trio-consistency-findings.md (F-01 through F-05) | None | Opus or Sonnet — one PR per finding | 1-2 days |
| 6 | Phase 2a kickoff — port v5 rules 2.1, 2.2 (#837 verbatim), 2.3, 2.5, 2.6 into conversation_v6.yaml AND backport #837 into conversation_v4.1.yaml in the same PR (close G-1 + Phase 2a content port in one Naidu touchpoint) | Steps 2-5 complete | Opus author + single Naidu reviewer pass | 1-2 days + Naidu calendar |
| 7 | Patient context builder dataclass-producing services (close G-7) | None (parallel to step 6) | Sonnet — mechanical wrap of existing repos | 1-2 days |
| 8 | Phase 2b kickoff — port stages.yaml > <stage>.{guidance, cards_to_use, advance_when, do_not, extractors_active} from phase_contexts/v2/*.yaml + v6-stages-extractors-matrix.md | Phase 2a merged + Naidu signoff | Opus author + Naidu reviewer | 2-3 days + Naidu calendar |
| 9 | Phase 3 — admin UI extensions | Phase 2a + 2b merged | Sonnet | 1 day |
| 10 | Phase 4 — frontend deep-link cards + placeholder pages | Phase 2b merged (stages declare cards_to_use) | Sonnet | 1 day |
| 11 | Phase 5 — extractor prompt language sweep (5 extractors, semantic-equivalent) | Phase 2b merged | Opus for content judgment, Sonnet for tests | 1-2 days |
| 12 | Phase 6 — dual-shadow ramp 10% | Phases 2a-5 merged + cost dashboards green | SD + observation calendar | 1 week observation |
| 13 | Phase 7 — manual validation (3 baselines + 3 after × 3 personas, 9-axis scoring) | Phase 6 observation complete + Naidu calendar | SD + Naidu | 1 day live + Naidu calendar |
| 14 | Phase 8 — ramp to 50% then 100% | Phase 7 sign-off | SD | 3 days |
| 15 | Phase 9 — 2-week observation | Phase 8 ramp complete | SD + Naidu (sampled audits) | 2 weeks calendar |
| 16 | Phase 10 — decommission v4 paths | Phase 9 complete + Langfuse confirms zero v4 traffic | Sonnet, full shadow-import audit | 1-2 days |

9. Open questions for SD

Q1 — Should Phase 2a start before or after the LLM grader CI auth issue is fixed?

Options:
  • (a) Fix CI auth FIRST. Phase 2a then ships with the LLM-graded gate live → maximum confidence, zero rework.
  • (b) Start Phase 2a NOW using deterministic gates only (axes 1, 2, 6, 7, 8 + flag YAML). LLM grader retrofit when auth lands.

Recommendation: (a). The CI fix is 5 minutes; deferring it leaves the spec-mandated 9-axis gate non-functional for the highest-risk PRs.

Q2 — Should stages_version be a separate flag from prompt_arch?

Context: Spec §3.5.1 row 7 calls for a stages_version flag for minor stages.yaml versioning (e.g., v1.0, v1.1), paired with prompt_arch=v6 via VALID_VERSION_PAIRS enforcement in apps/admin-app/src/pages/Triage.tsx.

Options:
  • (a) Add stages_version now (Phase 1 stack extension). Risk: scope creep on a stack that's already 6 PRs deep.
  • (b) Defer to Phase 2b when stages content actually evolves. Risk: first stages.yaml content port has no versioning surface — re-rolling requires a prompt_arch flip.
  • (c) Bake the version into stages.yaml (version: "v1.0" field, already present in PR #925 line 6) and skip the flag. Risk: no Flagsmith rollback granularity for stages content.

Recommendation: (b) — defer. The version: field in stages.yaml is enough until content actually moves.

Q3 — Should the G-1 drift (#837 missing from v4.1) be backported NOW or absorbed by Phase 2a? RESOLVED 2026-05-16

Decision (SD, 2026-05-16): Fold into Phase 2a kickoff (option b). The first Phase 2a PR will bundle the conversation_v4.1.yaml backport with the conversation_v6.yaml HARD BANS port and the test_hotfix_837_absorption.py fixture — single Naidu review touchpoint covers both files since the paragraphs are identical to what he already approved for v4. Accepts ~1 week of un-patched v4.1 traffic to consolidate Naidu's attention.

Q4 — Naidu calendar — are all 4 windows locked?

Spec §6 footnote: "All 4 windows MUST be locked on his calendar before Phase 0 starts." Phase 0 has shipped. Confirm whether 2a / 2b / 7 windows are locked, or whether SD intends to operate without them.

Recommendation: Lock them in writing this week or document the deviation.

Q5 — Should the Phase 0 LLM grader run on EVERY Phase 2a/2b PR or only on the merge-to-main commit?

Context: Spec §3.9 implies per-PR. Cost concern: each grader run is ~$0.30 + 30s. If a Phase 2a PR iterates 5 times, that's about $1.50 and 2-3 min of CI in total (5 × $0.30, 5 × 30s).

Options:
  • (a) Every PR push (highest catch rate).
  • (b) Only on PR open + on each commit author-tagged @grader (manual trigger via PR comment).
  • (c) Only on merge-to-main (lowest cost, slowest feedback).

Recommendation: (a). Roughly $1.50 across five pushes is irrelevant; clinical-safety regressions are not.

Q6 — Should support stage be the default for new cases (per spec §10 Q1)?

Spec note: Currently spec'd as a fallback safety net. Could also be the entry stage. SD has not resolved.

Recommendation: Surface to Naidu in the Phase 2a base-prompt-rules review. He should decide; spec defers.

Q7 — Phase 6 cache-hit acceptance criteria — what if Seg 2 < 60% during ramp?

Spec note: §2.5 acceptance criterion blocks ramp if Seg 2 <60% or Seg 3 <50% sustained 24h.

Options on miss:
  • (a) Pause ramp, investigate cache invalidation patterns (likely culprit: too-aggressive invalidate_case_cache() calls).
  • (b) Ramp anyway with cost mitigation (smaller stage profiles).
  • (c) Raise the threshold (acknowledge cache hit rate is fundamentally constrained by Anthropic's invalidation behavior).

Recommendation: Document the SOP for (a) in docs/runbook/prompt-rollback.md (new — per spec Appendix B Docs section). Don't pre-decide between (a/b/c) — depends on what the dashboard shows.
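
For the runbook, the §2.5 acceptance criterion quoted above can be written as a small predicate. The sketch below assumes hourly cache-hit samples per segment and reads "sustained 24h" as "every one of the last 24 hourly samples is below the threshold"; both are assumptions, as are the function and constant names. Only the 60% / 50% / 24h figures come from the spec note.

```python
SEG2_MIN = 0.60       # Seg 2 cache-hit floor from spec §2.5
SEG3_MIN = 0.50       # Seg 3 cache-hit floor from spec §2.5
SUSTAINED_HOURS = 24  # "sustained 24h", assuming one sample per hour


def sustained_below(samples: list[float], minimum: float) -> bool:
    """True if the last 24 hourly samples exist and are all below `minimum`."""
    window = samples[-SUSTAINED_HOURS:]
    return len(window) == SUSTAINED_HOURS and all(rate < minimum for rate in window)


def ramp_blocked(hourly_seg2: list[float], hourly_seg3: list[float]) -> bool:
    """True if either segment breaches its floor for a full 24h window."""
    return sustained_below(hourly_seg2, SEG2_MIN) or sustained_below(hourly_seg3, SEG3_MIN)
```

Under this reading a single recovered hour resets the window, which is lenient; a rolling-average interpretation would be stricter and is equally defensible until the spec pins the definition down.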


10. Appendix — file / issue / memory index

Specs (read in this order for new readers)

| Doc | Status | Purpose |
|---|---|---|
| docs/specs/conversation-v6-feature.md | rev 6, final for Phase 0 kickoff | Canonical v6 spec, 962 lines |
| docs/specs/conversation-v5-feature.md | legacy / superseded | Original v5 rule definitions (rules 2.1-2.7) — still the canonical wording source for absorption |
| docs/specs/v6-rubric-locked.md | locked rev 2, 257 lines | 9-axis grader rubric (Phase 0 + Phase 7 consumer) |
| docs/specs/v6-stage-resolver-truth-table.md | DRAFT 2026-05-12, 354 lines | Companion to v6 spec §2.6 (NOT YET BLOCKING — Phase 1 is shipping) |
| docs/specs/v6-stages-extractors-matrix.md | DRAFT 2026-05-12, 288 lines | Companion to v6 spec §3.4 — blocks Phase 2b |
| docs/specs/v6-rule-location-map.md | DRAFT, 737 lines | Lockstep registry for §8.5 CI gate — blocks Phase 2a |
| docs/specs/v6-trio-consistency-findings.md | 268 lines | 18-finding cross-doc audit (5 MAJOR + 13 MINOR) |

Tracking issues

| Issue | State | Purpose |
|---|---|---|
| #836 | OPEN | v6 epic — tracks the full Phase 0-10 sequence |
| #837 | MERGED PR | Production hot-fix — two new SAFETY bullets in v4.yaml (treatment recommendation + scope rejection bans) |
| #832 | MERGED PR | Recovery prompts + extractor + orchestrator wiring (ADR-0018 §K) — downstream dependency for Phase 5 |
| #491 | OPEN | Multi-question discipline → v5 rule 2.6 |
| #547 | CLOSED | Demographic fabrication → v5 rule 2.3 |
| #550 | OPEN | Laterality re-ask → v5 rule 2.6 |
| #560 | OPEN | Document trust framing → v5 rule 2.1 |
| #642 | CLOSED | Treatment recommendation → v5 rule 2.2 + #837 hot-fix |
| #743 | CLOSED | Scope rejection → v5 rule 2.2 + #837 hot-fix |
| #535 | CLOSED | Flagsmith identity bug — Phase 1 prereq (RESOLVED) |
| #359 | CLOSED | Prompt versioning + audit trail |

PRs

| PR | State | Title |
|---|---|---|
| #921 | MERGED | Phase 0 Steps 1+2 — locked rubric + LLM-grader scorer |
| #922 | MERGED | Phase 0 Step 3 — fixture corpus |
| #923 | MERGED | Phase 0 Steps 4+5+6 — grader cache + 6 gates + CI |
| #924 | OPEN | Phase 1 Step 1 — prompt_arch + tenant allowlist flags |
| #925 | OPEN | Phase 1 scaffolding — stages.yaml + knowledge + stubs |
| #926 | OPEN | Phase 1 Layer 2 — YAML artifact validator |
| #927 | OPEN | Phase 1 Layer 3 — dispatcher + Langfuse tags |
| #928 | OPEN | Phase 1 Layer 4 — fallback observability + cost scaffold |
| #929 | OPEN | Phase 1 Layer 5 — end-to-end smoke tests |

Memory files relevant to this plan

| File | Purpose |
|---|---|
| feedback_agent_chat_sacrosanct.md | Discipline for every prompt change — 3 baselines + 3 after on 3 personas |
| reference_v4_parser_strict_false.md | json.loads(strict=False) requirement — preserved in spec §1.4 |
| feedback_flagsmith_dual_env.md | Every flag flip applies to BOTH Production and Development envs |
| reference_flagsmith_v2_env_patch.md | V2 versioning + env-scoped PATCH endpoint |
| feedback_check_railway_after_migration_merge.md | Migration Roundtrip CI is continue-on-error: true — confirm prod deploy after merge |
| project_execution_order_transport_v6.md | Transport admin → 3-reviewer subagents → v6 implementation (per SD 2026-05-15) |
| project_work_queue.md | Cross-session items (Clerk webhook, etc.) |
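
The json.loads(strict=False) note above is easy to lose in a port, so a short illustration may help. Python's json module rejects raw control characters inside string values by default and accepts them when strict=False is passed. Whether that is the exact failure the memory note was written for is an assumption; the envelope below is a trimmed, made-up example rather than real model output.

```python
import json

# A literal newline inside the "message" string value, as a model might emit it.
# The \n here is interpreted by Python, so the JSON text contains a raw newline.
raw = '{"message": "line one\nline two", "phase_complete": false}'


def strict_parse_fails(text: str) -> bool:
    """True if the default (strict) parser rejects `text`."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return True
    return False


def parse_envelope(text: str) -> dict:
    """Parse the envelope while tolerating control characters in strings."""
    return json.loads(text, strict=False)
```

With this input the strict parser raises on the embedded newline, while parse_envelope returns the dict with the two-line message intact, which is why dropping strict=False in the v6 path would be a behavioral change.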

Code paths most relevant to v6

| Path | Role |
|---|---|
| config/prompts/base/conversation_v4.yaml | Production base (with #837 hot-fix at lines 111-112) |
| config/prompts/base/conversation_v4.1.yaml | Production base with MSO addendum baked in (MISSING #837 paragraphs — see G-1) |
| config/prompts/base/conversation_v6.yaml | v6 scaffold (TODO markers) |
| config/prompts/stages.yaml | v6 stages scaffold |
| config/prompts/knowledge/*.yaml | v6 knowledge addendums (4 files scaffolded) |
| config/prompts/layer_contexts/intent_capture.yaml | v1 layer context |
| config/prompts/layer_contexts/intent_capture_v1.1.yaml | v1.1 layer context (paired with v4.1) |
| config/prompts/phase_contexts/v2/*.yaml | Production v2 phase contexts (intake, records_first, identify_procedure, document_review, general, recovery_offer, recovery_checkin) |
| config/feature_flags.yaml | Flag defaults |
| app/agents/conversation_prompt.py | get_system_prompt — has v6 dispatcher branch (PR #927) |
| app/services/prompt_loader.py | _resolve_prompt_version, resolve_versions, _resolve_prompt_arch, _resolve_v6_tenant_allowlist |
| app/agents/v6_dispatcher.py | v6 arch dispatch decision (PR #927) |
| app/services/v6_artifact_validator.py | Boot-time YAML validator (PR #926) |
| app/services/v6_fallback_monitor.py | Telegram alert + cost scaffold (PR #928) |
| app/services/stage_resolver.py | §2.6 truth-table resolver (PR #925) |
| app/agents/patient_context_builder.py | §2.4 context block builder (PR #925) |
| app/services/prompt_loader_v6.py | compose_v6 stub (PR #925) |

End of unified plan.