v5 / v6 Conversation Prompt — Unified Plan
Author: 2026-05-15 (consolidation pass post Phase 1 completion)
Authority: This doc supersedes scattered v5 / v6 tracking. When in doubt, refer back here.
Owner: SD + agent platform.
Status: Synthesis / audit only — no new architecture, no fresh ADRs, no implementation. Implementation runs from the per-phase plans (see §8).
Companion specs (still authoritative on their own scope):
- docs/specs/conversation-v6-feature.md (canonical v6 spec, rev 6, 962 lines)
- docs/specs/conversation-v5-feature.md (legacy v5 spec — superseded but rules carry forward; 471 lines)
- docs/specs/v6-rubric-locked.md (9-axis grader rubric)
- docs/specs/v6-stage-resolver-truth-table.md (truth table for §2.6 resolver)
- docs/specs/v6-stages-extractors-matrix.md (stage × extractor wiring)
- docs/specs/v6-rule-location-map.md (lockstep registry for §8.5 CI gate)
- docs/specs/v6-trio-consistency-findings.md (cross-doc audit, 18 findings)
0. Why this doc exists
SD asked for a single document because the v5 → v6 transition is scattered across:
- One canonical v6 spec (conversation-v6-feature.md) plus 6 companion spec files
- Two production prompts in tree (conversation_v4.yaml, conversation_v4.1.yaml)
- One v6 base scaffold (conversation_v6.yaml) — all TODO(phase-2a) markers
- Phase 0 (merged) + Phase 1 (open, stacked) PR chain
- 8 open / closed GitHub issues (#491, #547, #550, #560, #642, #743, #836, #837)
- Session-memory notes in ~/.claude/projects/-Users-srikanthdonthi-Code-Curaway/memory/
The risk this doc mitigates: a v5 rule, a hot-fix paragraph, or a verbatim phrase silently disappears during the v6 absorption because no single owner knows which file holds the canonical version today. The audit identifies inheritance contracts so Phase 2a can land without drift.
Key framing: v4.1 + v1.1 + #837 hot-fix is the inheritance point, not v4 alone. v6 ABSORBS, it does not replace.
1. Production baseline (as of 2026-05-15)
What is actually running in production today (verified against config/feature_flags.yaml on main and the four hot-fix issues):
| Surface | Value / Path | Source |
|---|---|---|
| prompt_version default | "v4" | config/feature_flags.yaml:192 |
| triage_layer_context_version default | "v1" | config/feature_flags.yaml:469 |
| Valid prompt pairs | (v4, v1) and (v4.1, v1.1) | flag description at config/feature_flags.yaml:194 |
| Mixed pair fallback | (v4, v1) + Telegram alert | flag description at config/feature_flags.yaml:194 |
| mso_patient_offer_enabled default | false (per-tenant flip after live testing) | config/feature_flags.yaml:339-342 |
| prompt_arch default (open PR #924) | "v4" (no v5 fallback by design) | PR #924 diff vs main |
| prompt_arch_v6_tenant_allowlist default (open PR #924) | "[]" (empty list — zero v6 traffic) | PR #924 diff vs main |
| #837 hot-fix paragraphs location | config/prompts/base/conversation_v4.yaml:111-112 ONLY | grep verified 2026-05-15 |
| #837 hot-fix paragraphs in v4.1 | NOT PRESENT (drift — see §6 risk register) | grep verified 2026-05-15 |
| Active layer contexts | config/prompts/layer_contexts/intent_capture.yaml (v1) + intent_capture_v1.1.yaml (v1.1) | filesystem |
| Active phase contexts (fallback) | config/prompts/phase_contexts/v2/*.yaml (intake, records_first, identify_procedure, document_review, general, recovery_offer, recovery_checkin) | filesystem |
| Examples | config/prompts/examples/{locale}/*.yaml | unchanged from v4 era |
| v6 scaffold in tree | config/prompts/base/conversation_v6.yaml, config/prompts/stages.yaml, config/prompts/knowledge/*.yaml (all TODO(phase-2a)) | PR #925 (merged into stack base) |
| v6 dispatcher live? | YES (PR #927 in stack, dormant — compose_v6 raises NotImplementedError) | app/agents/v6_dispatcher.py |
| Default conversation flow | v4 base + v2 phase context + v1 layer context | app/agents/conversation_prompt.py:140 get_system_prompt |
| MSO addendum production location today | config/prompts/base/conversation_v4.yaml:204 mso_offer_addendum: (in body) and v4.1 baked in | spec §3.5.1 row 9 |
Practical summary: Production traffic on 2026-05-15 runs conversation_v4.yaml with phase_contexts/v2/*.yaml + layer_contexts/intent_capture.yaml, gated by prompt_version=v4 + triage_layer_context_version=v1. Tenants with mso_patient_offer_enabled=true get (v4.1, v1.1). The #837 hot-fix paragraphs only live in v4.yaml lines 111-112 today; v4.1 has the original 5-bullet SAFETY block at lines 108-113 without the new treatment-recommendation + scope-rejection bans. This is a real drift — see §6 G-1.
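The pair rule described above (two valid pairs, with any mixed pair falling back to (v4, v1) plus an alert) can be sketched as follows. This is an illustration only, not the code in prompt_loader: the function name `resolve_prompt_pair` and the `alert` callback are hypothetical stand-ins, and only the two valid pairs and the fallback target come from the flag description.

```python
from typing import Callable, Tuple

# Valid (prompt_version, triage_layer_context_version) pairs, per the flag description.
VALID_PAIRS = {("v4", "v1"), ("v4.1", "v1.1")}
FALLBACK_PAIR = ("v4", "v1")


def resolve_prompt_pair(
    prompt_version: str,
    layer_context_version: str,
    alert: Callable[[str], None] = lambda msg: None,  # hypothetical alert hook
) -> Tuple[str, str]:
    """Return the requested pair if it is valid, else fall back to (v4, v1) and alert."""
    pair = (prompt_version, layer_context_version)
    if pair in VALID_PAIRS:
        return pair
    # Mixed pair: report it and serve the baseline pair instead.
    alert(f"mixed prompt pair {pair!r}; falling back to {FALLBACK_PAIR!r}")
    return FALLBACK_PAIR
```

In production the alert hook would presumably be the Telegram notifier; here it is injected so the rule can be exercised without any network dependency.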
2. Why v5 was folded into v6
Per docs/specs/conversation-v6-feature.md §0, two motivations forced the merge on 2026-05-12:
- v5 work (#783) added 6 prompt rules to fix 7 logged P0/P1/P2 bugs (#491, #547, #550, #560, #642, #743, #546) but kept v4 architecture (phase × layer composition, two parallel injection taxonomies, addendum-as-third-mechanism).
- A separate brainstorm surfaced that v4 architecture has structural drag: two overlapping concepts (phase / layer), three injection mechanisms (phase context / layer context / addendum), full base-prompt cloning per version, and ~35-40% redundant tokens per turn.
Doing both as one release (rules + architecture) avoids: two clinical advisor review cycles, two validation cycles (3 baselines + 3 after × 3 personas, twice), lockstep PRs during a v5-then-v6 transition window, and rule content being written twice (once into conversation_v5.yaml, once when restructuring into stages.yaml).
Concretely: there is no conversation_v5.yaml file, no v5 value for the prompt_version flag, no v5 row in prompt_loader.resolve_versions. The v5 spec at docs/specs/conversation-v5-feature.md is preserved as a rules reference, but its 6 rule additions land directly in conversation_v6.yaml per the §4 absorption mapping.
3. What's already implemented
Inventory by phase. Verified against PR list (gh pr list --state all) on 2026-05-15.
Phase 0 — Validation harness (MERGED)
| PR | Title | Components shipped |
|---|---|---|
| #921 | Phase 0 Steps 1 + 2 — locked rubric + LLM-grader scorer | docs/specs/v6-rubric-locked.md, config/prompts/scorer/v6_compliance_scorer.yaml, app/services/prompt_compliance_scorer.py, tests/test_prompt_compliance_scorer.py (17 deterministic tests, no LLM calls in CI), scripts/test_compliance_scorer.py (manual dogfood CLI) |
| #922 | Phase 0 Step 3 — fixture corpus | 15 conversations × 9-axis tagging in tests/v6_fixtures/; tests/test_v6_fixture_corpus.py |
| #923 | Phase 0 Steps 4 + 5 + 6 — grader cache + 6 gates + CI | grader caching, determinism wrapper, cost guard; 6 deterministic gates (axes 1, 2, 6, 7, 8, flag YAML); .github/workflows/v6-prompt-compliance.yml |
Live in production: ALL Phase 0 components are merged. None gate prod traffic — they gate PRs that touch prompt content.
Phase 1 — Safety net + scaffolding (OPEN — 6 stacked PRs)
| PR | Title | Base branch | Status | Components shipped |
|---|---|---|---|---|
| #924 | Step 1 — prompt_arch + tenant allowlist flags | main | OPEN | config/feature_flags.yaml: prompt_arch=v4, prompt_arch_v6_tenant_allowlist="[]" |
| #925 | Scaffolding — stages.yaml + knowledge + stubs | main | OPEN | conversation_v6.yaml (94 lines, TODO markers), stages.yaml (117 lines, 12 stages with TODOs), knowledge/{financial_options,post_travel_logistics,insurance_handling,procedure_clinical_facts/knee_replacement}.yaml, stage_resolver.py (269 lines, §2.6 truth table implemented), patient_context_builder.py (155 lines, tenant assertion), prompt_loader_v6.py (62 lines, compose_v6 raises NotImplementedError) |
| #926 | Layer 2 — boot-time YAML artifact validator | feat/v6-phase1-scaffolding | OPEN | app/services/v6_artifact_validator.py, 14 tests |
| #927 | Layer 3 — prompt_arch dispatcher + Langfuse tags | feat/v6-phase1-yaml-validator | OPEN | app/agents/v6_dispatcher.py, get_system_prompt branch, _dispatch_tags contextvars.ContextVar stash |
| #928 | Layer 4 — fallback observability + cost-tracking scaffold | feat/v6-phase1-dispatcher | OPEN | app/services/v6_fallback_monitor.py, Telegram alert for 3 unexpected fallback reasons, record_v6_turn_cost stub |
| #929 | Layer 5 — end-to-end safety-net smoke tests | feat/v6-phase1-cost-monitoring | OPEN | tests/test_v6_phase1_safety_net_smoke.py (8 scenarios) |
Phases 2–9 — Content port + composer + production rollout (SHIPPED 2026-05-16)
Subagent-driven deployment campaign on top of the Phase 1 substrate. All 8 phases shipped in one session.
| PR | Phase | Title | Components shipped |
|---|---|---|---|
| #933 | Phase 1 (fixtures) | Absorption + verbatim + lockstep fixtures (G-3, G-4, G-5) | tests/test_hotfix_837_absorption.py, tests/test_v5_rule_verbatim_preservation.py, tests/test_lockstep_consistency.py (67 tests gating Phases 2-5 via xfail(strict=True) markers; markers cleared progressively) |
| #935 | Phase 2 | #837 backport into v4.1 + HARD BANS port into v6 (closes G-1) | conversation_v4.1.yaml SAFETY block + conversation_v6.yaml HARD BANS section (byte-identical to v4:111-112) |
| #937 | Phase 3 | Port v5 rules 2.1/2.3/2.5/2.6 + base sections | All TODO(phase-2a) markers cleared from conversation_v6.yaml: ROLE, COLLECT, VOICE+rule2.5 emotional-word list, NEVER, FORBIDDEN PHRASES, DOCUMENT-TRUST FRAMING, DEMOGRAPHIC GROUNDING, ONE QUESTION PER TURN (7-axis), REMEMBER |
| #938 | Phase 4 | stages.yaml content port (12 stages) | All TODO(phase-2b) markers cleared from stages.yaml. Recovery stages carry ADR-0018 §K escalation triggers + coordinator handoff flow |
| #939 | Phase 5 | Knowledge addendums (5 files) | CREATED knowledge/mso_patient_offer.yaml (relocated from v4.yaml:207-261). FILLED financial_options, insurance_handling, post_travel_logistics, procedure_clinical_facts/knee_replacement |
| #940 | Phase 6+7 | compose_v6 + 5 dataclass services (closes G-7) | app/services/{case_summary,fhir_observation_summary,document_manifest,workflow_snapshot,patient_preferences}_service.py, knowledge_addendum_selector.py, prompt_loader_v6.py real compose body, patient_context_builder.py real assembly. 194 tests added |
| #941 | Phase 8 | G-15..G-18 cleanups + Phase 6+7 polish | G-15 stage_resolver logging spec amendment, G-16 WorkflowState key remapping docstring, G-17 compose_v6 return-shape guard + alertable fallback reason, G-18 stage_resolver edge case test, knowledge selector lru_cache, cache_segments[0] prompt_version field, soft-return rationale docstrings on 4 services |
| #942 | Phase 9 | Production rollout — flip prompt_arch=v6 + allowlist=["*"] | config/feature_flags.yaml defaults flipped. Internal-only prod (limited Curaway team); rollback by flipping default back to v4 (v4 path remains fully functional and is the dispatcher fallback target). Skipped: 24h dev test (per SD directive to enable fully on prod and test there) |
Live in production (2026-05-16): v6 is the default architecture. prompt_arch_v6_tenant_allowlist=["*"] covers all internal tenants. Identity-aware overrides via Flagsmith remain available for per-tenant ramp-down if needed.
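How the two flags could combine into a per-tenant decision is sketched below. This is an assumption-laden illustration, not the dispatcher's code: the function name `select_arch` is hypothetical, the allowlist is assumed to be a JSON-encoded list (matching the "[]" and ["*"] values quoted in this doc), and treating a malformed allowlist as "serve v4" is a conservative choice made for the sketch.

```python
import json


def select_arch(prompt_arch: str, allowlist_json: str, tenant_id: str) -> str:
    """Return "v6" only when the arch flag says v6 AND the tenant is allowlisted."""
    if prompt_arch != "v6":
        return "v4"
    try:
        allowlist = json.loads(allowlist_json)
    except json.JSONDecodeError:
        # Assumption: an unreadable allowlist means no tenant is opted in.
        return "v4"
    if not isinstance(allowlist, list):
        return "v4"
    # "*" is the wildcard entry that covers every tenant.
    if "*" in allowlist or tenant_id in allowlist:
        return "v6"
    return "v4"
```

With the pre-rollout defaults (prompt_arch="v4", allowlist "[]") every tenant resolves to v4; with the Phase 9 defaults every tenant resolves to v6.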
Async dispatcher conversion (deferred to follow-up): compose_v6 is async; dispatcher remains sync and catches unawaited coroutines via inspect.iscoroutine(). Conversion would touch >5 files outside the v6 chain (triage_agent.py + get_system_prompt callers). Tracked separately.
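The detection step mentioned above can be illustrated with a small sketch. It shows only how sync code can notice that a callee handed back an un-awaited coroutine; what the real dispatcher does next is not stated in this doc, so the helper name `call_compose_sync` and the reason label are hypothetical.

```python
import inspect
from typing import Any, Callable, Optional, Tuple


def call_compose_sync(
    compose: Callable[..., Any], *args: Any
) -> Tuple[Optional[Any], Optional[str]]:
    """Call compose from sync code and refuse to use an un-awaited coroutine."""
    result = compose(*args)
    if inspect.iscoroutine(result):
        # Close the coroutine so Python does not emit a "never awaited" warning.
        result.close()
        return None, "compose_returned_coroutine"  # hypothetical reason label
    return result, None
```

A sync compose function passes straight through, while an `async def` one is reported by label instead of leaking a coroutine object into prompt assembly.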
CI status (2026-05-16): All v6 prompt-compliance gates green except the LLM grader job which requires ANTHROPIC_API_KEY in GitHub Actions secrets (G-2 — pending SD action; non-blocking since deterministic gates cover axes 1, 2, 6, 7, 8).
4. v5 rules absorption checklist
Tabulating the 6 v5 rules per conversation-v6-feature.md §4 mapping table + v6-rule-location-map.md §2.27. The "Production today" column is verified against config/prompts/base/conversation_v4.yaml + v4.1.yaml on 2026-05-15.
| Rule | Origin issue(s) | Target in v6 | Production status today | Verbatim fixture present? |
|---|---|---|---|---|
| 2.1 Document-trust framing | #560 | conversation_v6.yaml DOCUMENT-TRUST FRAMING section | NOT in v4 baseline; NOT in v4.1. The closest pre-v6 hit is phase_contexts/v2/document_review.yaml lines 6-26 which has the "NO MEDICAL INTERPRETATION" block, but the v5-spec 4-part framing + identity-clarification language is NOT in production today. | NO. tests/test_v5_rule_verbatim_preservation.py does not exist yet. Verbatim phrase list defined in v6-rule-location-map.md §3.1. |
| 2.2a Treatment-recommendation ban | #642 (+ #837 hot-fix) | conversation_v6.yaml HARD BANS section | In conversation_v4.yaml:111 (added by #837 merged 2026-05-12). NOT in conversation_v4.1.yaml — see §6 G-1. | NO. tests/test_hotfix_837_absorption.py does not exist yet. |
| 2.2b Scope-rejection ban | #743 (+ #837 hot-fix) | conversation_v6.yaml HARD BANS section | In conversation_v4.yaml:112. NOT in conversation_v4.1.yaml — see §6 G-1. | NO. Same fixture file as 2.2a (does not exist). |
| 2.3 Unverified demographic claim | #547 | conversation_v6.yaml DEMOGRAPHIC GROUNDING section (REVISED rev 3 — moved from stage-scope to BASE because demographic fabrications can fire in any stage) | NOT in v4 baseline; NOT in v4.1. Closest production guard is the existing voice rules. | NO. Verbatim phrase: "The report I'm reading lists the patient as X — is this for someone other than yourself?" — must land in fixture. |
| 2.4 Records-upload re-offer | B1-v4 finding (no GitHub issue — surfaced by manual v4 conversation audit) | stages.yaml > discovery.guidance AND procedure_identification.guidance + re_offer_on_turn: [2, 3] field | Partial in v4 (records-first emphasis in phase_contexts/v2/records_first.yaml) but the turn-2-3 cadence guarantee is NOT enforced today. Lingering-discovery cases can miss the re-offer entirely. | NO. Fixture should be a 5+ turn discovery stagnation case asserting re-offer language on turns 2 AND 3. |
| 2.5 Emotional verbatim echo | B1 axis-3 finding (no GitHub issue) | conversation_v6.yaml VOICE RULES section | Partial in conversation_v4.yaml:40 "NAME THE SPECIFIC HARD THING" (and v4.1 same line). v5 Rule 2.5 strengthens this with an explicit emotional-word list. The 7-word list (exhausted, scared, desperate, overwhelmed, frustrated, worried, tired) is not in production. | NO. Verbatim word list per v6-rule-location-map.md §3.4. |
| 2.6 Multi-question axis discipline | #491, #550 | conversation_v6.yaml ONE QUESTION PER TURN + stages.yaml per-stage do_not: [stack-questions] redundant placement | Partial: conversation_v4.yaml:38 has "ONE QUESTION ONLY when the patient is emotional" + intent_capture.yaml has pacing rules. The v5-spec "SAME-TURN AXIS DISCIPLINE" enumeration (Laterality / Mechanism / Severity / Timeline / Prior treatment / Demographics / Records availability) is NOT in production today. | NO. Verbatim axis list + WRONG/RIGHT example pair per v6-rule-location-map.md §2.27 v5.RULE.006. |
Inheritance starting point for Phase 2a:
- Rules 2.1, 2.3, 2.5, 2.6 — start from conversation_v4.yaml (NOT v4.1, because v4.1 differs from v4 only by the MSO addendum being baked in; the bulk of voice / safety / process content is identical).
- Rules 2.2a, 2.2b — start from conversation_v4.yaml:111-112 (the #837 hot-fix paragraphs) and verbatim-port to conversation_v6.yaml HARD BANS. The exact byte-identical text is non-negotiable (see §6 G-1).
- MSO addendum — port from conversation_v4.yaml:204 mso_offer_addendum: (the in-body version baked into v4.1) to config/prompts/knowledge/mso_second_opinion.yaml. Gated by the SAME mso_patient_offer_enabled flag (spec §3.5.1 row 9). Phase 2b includes a regression test asserting flag value is honored across v4↔v6 toggle.
Key clarification: v4.1 is MSO-only; it's not a clinical-rules upgrade over v4. The "v4.1 / v1.1 pair" exists solely so tenants with mso_patient_offer_enabled=true get the MSO addendum without flag-conditional prompt assembly. Treating v4.1 as the inheritance point for clinical rules would be wrong — the clinical rules base is conversation_v4.yaml + #837 hot-fix paragraphs.
5. What's pending
Phase-by-phase per conversation-v6-feature.md §6 + reality on 2026-05-15:
Phase 2a — Migration: base prompt rules
- Scope: Port v5 rules 2.1, 2.2 (verbatim from #837), 2.3, 2.5, 2.6 into conversation_v6.yaml base sections (replacing every TODO(phase-2a) marker).
- Estimate: 1-2 days (Opus for content judgment, per conversation-v6-feature.md §6 row 2a).
- Who: Opus author + Dr. Naidu reviewer.
- Blockers:
  - tests/test_hotfix_837_absorption.py must land FIRST (see §6 G-3) — otherwise wording can drift during port without CI catching it.
  - tests/test_v5_rule_verbatim_preservation.py must land alongside the port (verbatim phrase fixtures per v6-rule-location-map.md §3).
  - tests/test_lockstep_consistency.py must land alongside the port (reads v6-rule-location-map.md, asserts every rule reaches its declared destination).
  - Dr. Naidu base-prompt-rules review gate (mandatory per spec §6 footnote — "All 4 windows MUST be locked on his calendar before Phase 0 starts" — confirm with SD whether this is locked).
  - LLM grader CI auth (G-2) — Phase 0 grader can't fail-close on prompt content if ANTHROPIC_API_KEY isn't wired.
- Pre-flight check (per Phase 1 spec §3.5.1 row 8): #535 (Flagsmith identity bug) is CLOSED per gh issue view 535. Phase 1 unblock condition satisfied.
Phase 2b — Migration: stages.yaml content
- Scope: Port phase + layer content into the 12 stages.yaml entries (replacing every TODO(phase-2b) marker — guidance, cards_to_use, advance_when, do_not, extractors_active). Lockstep — any voice-rule update to v6 also lands in v4.
- Estimate: 2-3 days (Opus per spec §6 row 2b).
- Who: Opus author + Dr. Naidu reviewer.
- Blockers:
  - v6-stages-extractors-matrix.md must publish FIRST (per spec §6 row 2b — "blocking"). Status today: the matrix doc EXISTS at docs/specs/v6-stages-extractors-matrix.md (288 lines, draft 2026-05-12). Confirm Naidu has signed off on the matrix before Phase 2b starts, OR confirm it doesn't require his sign-off and only the stages.yaml content port does.
  - tests/test_extractor_prompts_pii_safe.py (CI gate per spec §3.4) must land — scaffolding scope ambiguous, may already be covered by Phase 0 gates or may be Phase 2b deliverable.
  - Dr. Naidu stages.yaml content gate (mandatory per spec §6 row 2b).
  - Phase 2a must merge first (Phase 2b depends on the base-prompt rule landing site).
Phase 3 — Admin UI extensions
- Scope: prompt_arch selector in /admin/triage, stage debug endpoint at /api/v1/admin/cases/{case_id}/stage (with Depends(require_case_access)), knowledge addendum toggles.
- Estimate: 1 day, Sonnet.
- Blockers: Phase 2a + 2b merged (selector pointing at empty stages is useless).
Phase 4 — Frontend deep-link cards
- Scope: RichCard.tsx extensions for view_payments, view_summary, view_consultations, stage_indicator; placeholder pages Payments.tsx, Summary.tsx, Consultations.tsx.
- Estimate: 1 day, Sonnet.
- Blockers: Phase 2b (stages declare cards_to_use).
Phase 5 — Extractor prompt updates
- Scope: Replace "layer N" semantics in 5 extractor system prompts with stage-equivalent semantics (semantic-equivalent rewrite, not content change). 5 extractors: intent, medical, travel, logistics, financial. recovery_checkin_extractor (PR #832 / recovery_checkin_extractor.py) is downstream.
- Estimate: 1-2 days, Opus for content (per spec §6 row 5).
- Blockers: Phase 2b (stages.yaml extractors_active lists must be populated per the matrix); tests/test_extractor_layer_to_stage_rename.py (NEW per spec Appendix B) must accompany.
Phase 6 — Dual-shadow ramp 10%
- Scope: Flip prompt_arch=v6 for 10% of tenants via prompt_arch_v6_tenant_allowlist. Side-by-side Langfuse trace comparison vs v4.
- Estimate: 1 week observation calendar.
- Acceptance criterion (new rev 5): Per-segment cache hit rate measured in Langfuse — Seg 2 ≥ 60%, Seg 3 ≥ 50% sustained 24h. Block ramp if either fails.
- Blockers: Phases 2a-5 complete; cost dashboards green; Phase 6 acceptance criterion (cache hit rate) defined.
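The Phase 6 acceptance criterion reduces to a simple threshold check, sketched here under stated assumptions: the segment keys "seg2" and "seg3" and the function name `ramp_blockers` are hypothetical, and the input is assumed to already be the 24h sustained hit rate pulled from Langfuse (as a fraction, not a percentage).

```python
from typing import Dict, List

# Minimum sustained cache hit rates from the Phase 6 acceptance criterion.
THRESHOLDS: Dict[str, float] = {"seg2": 0.60, "seg3": 0.50}


def ramp_blockers(hit_rates: Dict[str, float]) -> List[str]:
    """Return the segments below threshold; an empty list means the ramp may proceed."""
    failing = []
    for segment, minimum in THRESHOLDS.items():
        # A missing measurement counts as a failure rather than a pass.
        if hit_rates.get(segment, 0.0) < minimum:
            failing.append(segment)
    return failing
```

Treating a missing measurement as a failure keeps the gate fail-closed: the ramp is blocked unless both segments are measured and meet their thresholds.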
Phase 7 — Manual validation cycle
- Scope: 3 baselines + 3 after on 3 personas (caregiver/oncology, direct/ortho, exploratory). 9-axis scoring per turn. SD + Dr. Naidu sign off per persona.
- Estimate: 1 day live testing.
- Blockers: Phase 6 observation complete; Naidu calendar (4th of 4 mandatory windows per spec §6).
Phase 8 — Ramp to 50% then 100%
- Scope: Stagger; 24h hold between bumps.
- Estimate: 3 days.
- Blockers: Phase 7 sign-off; no regressions in Langfuse + Metabase dashboards.
Phase 9 — 2-week observation
- Scope: Real-traffic per-case audit on a sample per persona.
- Estimate: 2 weeks calendar.
- Blockers: Zero clinical-safety violations during ramp.
Phase 10 — Decommission
- Scope: Delete phase_contexts/, layer_contexts/, base prompts v1-v4, _LAYER_TO_PHASE mapping, deprecated loader functions, deprecated tests.
- Estimate: 1-2 days (CORRECTED rev 5 from 0.5d — shadow-import audit on 8+ sites: tests/test_intake_fix5.py, tests/test_conversation_prompt.py, tests/test_prompt_loader.py, tests/test_no_medical_advice.py:PATIENT_FACING_FILES, app/agents/conversation_prompt.py:_get_phase_contexts() callsites, app/services/prompt_loader.py:PHASE_DIR/LAYER_DIR constants).
- Blockers: Phase 9 complete; all v4 paths confirmed unused via Langfuse; re-export shims (§1.3) deleted.
Aggregate calendar (per spec §6): ~6-7 weeks from Phase 0 start to Phase 9 complete. Phase 0 + Phase 1 are done (~10 calendar days elapsed). Net remaining: ~4-5 weeks if Naidu calendar locks cleanly.
6. Gaps + risks
Items scattered across issues / specs / memory that aren't formally tracked in the phase plan. Each has an ID for cross-reference.
G-1 — conversation_v4.1.yaml is MISSING the #837 hot-fix paragraphs (CRITICAL — folded into Phase 2a)
- Evidence: Grep on config/prompts/base/conversation_v4.1.yaml for "we don't handle that", "outside our scope", "right next step", "treatment recommendation" returns ZERO matches. The same grep on conversation_v4.yaml returns lines 111-112.
- Impact: Tenants on mso_patient_offer_enabled=true (pair (v4.1, v1.1)) have the un-patched SAFETY block today. The two P0s (#642 treatment recommendation, #743 scope rejection) that #837 closed for v4-tenants are STILL OPEN for v4.1-tenants.
- Resolution (per SD 2026-05-16): Folded into Phase 2a kickoff rather than treated as a separate pre-Phase-2a backport PR. Rationale: Phase 2a's first PR already lands the verbatim absorption fixture (test_hotfix_837_absorption.py — see G-3) and ports the #837 paragraphs into conversation_v6.yaml HARD BANS. Bundling the v4.1 backport into the same PR means a single Dr. Naidu review touchpoint covers both the v4.1 patch AND the v6 absorption byte-for-byte. Accepts ~1 week of un-patched v4.1 traffic in exchange for not splitting Naidu's attention across two paragraphs of identical text.
- Phase 2a scope addendum: The first Phase 2a PR must:
  - Add tests/test_hotfix_837_absorption.py with byte-identical assertions against BOTH conversation_v4.1.yaml AND conversation_v6.yaml HARD BANS.
  - Patch conversation_v4.1.yaml to include the #837 treatment-recommendation + scope-rejection paragraphs verbatim from conversation_v4.yaml:111-112.
  - Port the same paragraphs into conversation_v6.yaml HARD BANS.
  - Single Dr. Naidu confirm covers all three (same paragraphs he already approved for v4).
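A byte-identical absorption check of the kind the scope addendum calls for could look roughly like this. It is a sketch, not the contents of tests/test_hotfix_837_absorption.py: the function name `is_absorbed` is hypothetical, it works on strings that a real fixture would read from the YAML files, and it assumes the paragraphs keep the same indentation in the target file.

```python
def is_absorbed(source_text: str, first: int, last: int, target_text: str) -> bool:
    """True when source lines first..last (1-indexed, inclusive) appear in the
    target as one contiguous, character-for-character identical block."""
    block = "\n".join(source_text.splitlines()[first - 1 : last])
    # An empty block (line range past end of file) must never count as absorbed.
    return bool(block) and block in target_text
```

For the case described here, `first=111` and `last=112` against the text of conversation_v4.yaml, with the check run once against conversation_v4.1.yaml and once against conversation_v6.yaml. Comparing the block as a whole, rather than line by line, also catches reordering of the two paragraphs.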
G-2 — Phase 0 LLM-grader CI fails on PRs touching prompt content (HIGH)
- Evidence: gh run list --workflow="v6 prompt compliance" shows 3 failure runs on feat/v6-phase1-scaffolding (2026-05-15 12:36, 12:40, 12:56). Deterministic gates pass; LLM grader job fails because ANTHROPIC_API_KEY is not set in GitHub Actions secrets for that workflow.
- Impact: Phase 2a + Phase 2b PRs (which actually change prompt content) cannot pass the 9-axis CI grader — the grader can't run. SD will be tempted to admin-merge prompt changes.
- Mitigation: Plumb ANTHROPIC_API_KEY into .github/workflows/v6-prompt-compliance.yml (single-line workflow secret add). One-shot fix; SD task.
G-3 — tests/test_hotfix_837_absorption.py does NOT exist (HIGH)
- Evidence: ls tests/ | grep -iE "hotfix|absorb" returns nothing.
- Spec reference: conversation-v6-feature.md §5 explicitly calls this fixture out as "NEW rev 5 per compliance review".
- Impact: Without this fixture, Phase 2a port of the two #837 paragraphs into conversation_v6.yaml HARD BANS can drift in wording, weakening rule 2.2. The spec is explicit: "byte-identical".
- Recommendation: Land this fixture as the FIRST work of Phase 2a (before any content port).
G-4 — tests/test_v5_rule_verbatim_preservation.py does NOT exist (HIGH)
- Evidence: ls tests/ | grep -iE "verbatim|v5_rule" returns nothing.
- Spec reference: conversation-v6-feature.md §4 + v6-rule-location-map.md §3 (40+ verbatim phrases enumerated).
- Impact: Same drift risk as G-3 but for the broader v5 rule set (4-part doc-trust framing, demographic clarification, emotional word list, axis discipline list).
- Recommendation: Land alongside test_hotfix_837_absorption.py as Phase 2a pre-work.
G-5 — tests/test_lockstep_consistency.py does NOT exist (MED)
- Evidence: Not in tests/ directory.
- Spec reference: conversation-v6-feature.md §8.5 + v6-rule-location-map.md §0.
- Impact: The lockstep CI gate that reads v6-rule-location-map.md and asserts rules land at declared destinations is missing. Without it, a rule dropped during Phase 2a/2b becomes a silent regression.
- Recommendation: Land before Phase 2a starts (so the absorbing PR is the FIRST to be gated).
G-6 — stages.yaml > extractors_active is empty in scaffolding (MED)
- Evidence: All 12 stages in config/prompts/stages.yaml have extractors_active: [] # TODO(phase-2b).
- Spec reference: spec §3.4 + v6-stages-extractors-matrix.md §3 (30 run cells, 2 cond cells).
- Impact: Phase 3 extractor work (compose_v6() reads extractors_active to know which extractors to spawn) cannot land before Phase 2b populates the lists.
- Recommendation: This is the documented Phase 2b deliverable. No action — flagged here for visibility.
G-7 — patient_context_builder dataclass-producing services don't exist (MED)
- Evidence: app/agents/patient_context_builder.py exists (155 lines, from PR #925) with the assembly interface, but it expects dataclasses CaseSummary, FhirObservationSummary, DocumentManifest, WorkflowSnapshot, PatientPreferences from owning-domain services. None of these dataclass-producing service functions exist yet on main.
- Spec reference: conversation-v6-feature.md §2.4 revision rev 3 — "dataclass-producing services MUST use BaseRepository._scoped_query(tenant_id)".
- Impact: compose_v6() cannot move past NotImplementedError without these services. This is a Phase 2a-2b dependency that's not currently broken out as its own work item.
- Recommendation: Scope into a Phase 2a sub-task. Estimate: 1-2 days. Likely Sonnet (mechanical — wrap existing repository reads in dataclass-returning service functions).
G-8 — Naidu clinical sweep on #837 + #832 — task #169 closed but mid-stream sweep not formally re-scheduled (MED)
- Evidence: gh issue view 169 is closed (Phase 0 multi-tenancy work). No open issue tracks "Dr. Naidu mid-stream review of merged recovery prompts + #837 wording" specifically. Spec §6 calls out 4 separate Naidu gates but the calendar lock status is not in the doc.
- Impact: Spec §6 is explicit: "If Dr. Naidu is unavailable >2 weeks for ANY of the 4 gates, the phase pauses." All 4 windows MUST be locked before Phase 0 starts. Phase 0 already shipped — confirm whether the windows are locked for 2a / 2b / 7.
- Recommendation: SD confirms Naidu calendar status in writing before Phase 2a kickoff.
G-9 — Mid-conversation rollback test (spec §8.2.1) — does it exist? (LOW)
- Evidence: Spec §8.2.1 describes the expected behavior (mixed prompt_arch stamps within one conversation) but does not list a test file. No tests/test_mid_conversation_rollback_*.py in the tree.
- Impact: Phase 6 dual-shadow ramp could trigger a mid-conversation arch flip and produce inconsistent traces. Without a fixture, the audit cannot prove the spec §8.2.1 behavior holds.
- Recommendation: Add to Phase 5 / 6 work list. Estimate: 0.5 day.
G-10 — Identity-aware Flagsmith pass-through (#535) (RESOLVED)
- Evidence: gh issue view 535 is CLOSED.
- Status: Phase 1 prereq satisfied (per spec §3.5.1 row 8 — "#535 MUST be resolved before v6 Phase 1 starts"). No action.
G-11 — addendum_priority_clinical_first.py test (LOW)
- Spec reference: §9 risk row + Appendix B testlist.
- Status: Not in tree. Knowledge addendums in scaffolding (PR #925) lack priority: and category: fields. Phase 2b deliverable.
G-12 — 18 cross-spec inconsistencies tracked in v6-trio-consistency-findings.md (LOW-MED)
- Evidence: docs/specs/v6-trio-consistency-findings.md enumerates 18 findings (F-01 through F-18), 5 MAJOR + 13 MINOR.
- Major ones:
  - F-01: stage count mismatch in #859 OQ.02
  - F-02: intake referenced as stage (not in §1 list)
  - F-03: CI gate algorithm doesn't cross-read sibling specs
  - F-04: cross-spec links missing in #855 and #859
  - F-05: 17 raw open questions across 3 docs → Naidu burn risk
- Recommendation: Squash MAJOR findings before Phase 2a starts. MINOR can defer.
G-13 — Dr. Naidu review gates not formally scheduled (HIGH)
- Evidence: Spec §6 footnote (revised) lists 4 mandatory Naidu sign-offs (Phase −1 #837 mini, Phase 2a base rules, Phase 2b stages content, Phase 7 validation). No tracking issue or calendar artifact in the repo.
- Recommendation: Create one tracking issue per Naidu gate; link from spec §6.
G-14 — apps/patient-app/src/components/chat/rich_content_types.generated.json manifest does NOT exist (LOW)
- Spec reference: §3.9 — needed for the FE/BE drift CI gate.
- Status: Scoped as "1 day work" in Phase 1, but not in any merged PR.
- Recommendation: Land in Phase 4 (frontend phase) alongside the new RichCard.tsx entries.
G-15 — stage_resolver.py violates the companion-doc no-logging purity contract (LOW)
- Evidence: app/services/stage_resolver.py:146,158,168 emit logger.warning / logger.debug calls. docs/specs/v6-stage-resolver-truth-table.md §1:23 states: "Pure function. No I/O, no LLM, no DB writes, no logging."
- Impact: Practically harmless today — logger.warning is side-effecting but doesn't change return value. However it breaks property-test stability and the spec contract; a future implementer relying on the pure-function claim could be surprised.
- Recommendation: Either tighten the spec to "no observable side effects on returned value" OR remove the loggers and surface malformed-state signals via the return value. Decide in Phase 2a kickoff.
G-16 — WorkflowState key-name remapping is silent (LOW)
- Evidence: app/services/stage_resolver.py:72-79 silently maps spec field names (documents_uploaded, match_results_shown, provider_selected) → live model names (required_documents_uploaded, matching_complete, providers_selected). Test fixtures use the live names so the gap is invisible.
- Impact: A future implementer following spec §2.6 literally will pass spec-named keys to WorkflowState({...}) and see all values silently default to False — every stage rule will fail to match → fallback to support on every turn.
- Recommendation: Document the mapping in a top-of-class docstring on stage_resolver.py OR accept both names via a small adapter layer. Address before Phase 2a expands the truth-table surface.
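The "small adapter layer" option might look like the sketch below. The three key pairs are the ones listed in the evidence; the function name `normalize_workflow_keys` is hypothetical, and raising on conflicting duplicates is a design choice made for this sketch, not something the spec prescribes.

```python
from typing import Any, Dict

# Spec field name -> live model field name, per the remapping described in G-16.
SPEC_TO_LIVE = {
    "documents_uploaded": "required_documents_uploaded",
    "match_results_shown": "matching_complete",
    "provider_selected": "providers_selected",
}


def normalize_workflow_keys(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Rename spec-named keys to live-model names; reject conflicting duplicates."""
    normalized: Dict[str, Any] = {}
    for key, value in raw.items():
        live_key = SPEC_TO_LIVE.get(key, key)
        # Both spellings supplied with different values is ambiguous, so fail loudly.
        if live_key in normalized and normalized[live_key] != value:
            raise ValueError(f"conflicting values for {live_key!r}")
        normalized[live_key] = value
    return normalized
```

Running inputs through a normalizer like this before building the workflow state would make the remapping explicit and testable, instead of letting spec-named keys silently default to False.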
G-17 — compose_v6 returning None produces no v6_fallback_reason (MED)
- Evidence: app/agents/v6_dispatcher.py:188-193 constructs DispatchResult(arch="v6", v6_artifact=artifact) without inspecting artifact. conversation_prompt.py:176 guards with if dispatch.arch == "v6" and dispatch.v6_artifact is not None — so None silently falls to v4 path with NO v6_fallback_reason trace tag. The Layer 4 monitor cannot alert on this incoherent state.
- Impact: Phase 2a wiring may briefly produce malformed compose_v6 returns during incremental rollouts. Without a fallback reason tag, the silent v4 fallback is invisible in Langfuse.
- Recommendation: In Phase 2a, validate compose_v6's return shape (dict with system: str, stage_id: str, cache_segments: list) and emit a new v6_fallback_reason="compose_returned_invalid" trace tag when the shape is wrong. Add compose_returned_invalid to ALERTABLE_FALLBACK_REASONS in the same PR.
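The recommended shape guard can be sketched as a small pure function. The three required fields and the reason string come from the recommendation above; the helper name `compose_fallback_reason` is hypothetical, and how the result is attached to the trace is left out.

```python
from typing import Any, Optional

INVALID_REASON = "compose_returned_invalid"


def compose_fallback_reason(artifact: Any) -> Optional[str]:
    """Return the fallback reason when the artifact shape is wrong, else None."""
    if not isinstance(artifact, dict):
        # Covers None as well as any other non-dict return value.
        return INVALID_REASON
    shape_ok = (
        isinstance(artifact.get("system"), str)
        and isinstance(artifact.get("stage_id"), str)
        and isinstance(artifact.get("cache_segments"), list)
    )
    return None if shape_ok else INVALID_REASON
```

A dispatcher could call this before constructing its result and, on a non-None reason, tag the trace and take the v4 path, so the fallback shows up in monitoring instead of happening silently.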
G-18 — Stage-resolver rule fall-through edge case has no test (LOW)
- Evidence: When `intent_completion == 1.0 AND documents_uploaded == True AND medical_status.completion < 0.7`, rules 3 and 4 both fail (rule 3 requires `intent < 1.0`, rule 4 requires `not documents_uploaded`). No subsequent rule matches → fallback to `support`. `tests/test_stage_resolver.py` does not exercise this combination.
- Impact: The intent here may be deliberate (records have been uploaded but the medical_status extractor hasn't caught up yet, so `support` is the correct conservative answer) — but without a test it's not pinned. A future refactor could silently change the behavior.
- Recommendation: Add a single test asserting this combination → `"support"`. 10 minutes of work; do during Phase 2a kickoff.
G-19 — FE TransportOfferCard follow-up items (MED)
Identified by post-merge code + test reviews 2026-05-16. Bundle into a single follow-up PR (`fix/transport-offer-card-wiring`) when transport endpoints near rollout.
- `patientAction` is a no-op in `RichCard.tsx:260-263`. The card calls `patientAction(bookingIdDraft, 'select')` and `patientAction('', 'decline_all')` but the handler in RichCard is a stub. Same Phase D deferred state as RecoveryOfferCard; not a regression. Wire to `ConversationApp` → `MessageThread` → chat send-message flow before transport endpoints go live.
- `declineAll` API silently swallows ALL errors at `apps/patient-app/src/services/transportApi.ts:88-90`. Narrow to 404 only; re-throw others so the component's error banner fires correctly.
- `/design-preview/transport` is publicly reachable (`apps/patient-app/src/App.tsx:149`). Wrap in `ProtectedRoute` or `import.meta.env.DEV` guard for consistency with other design-preview routes.
- `RichCard.tsx` `transport_offer` branch has zero integration tests. Add a test rendering `<RichCard>` with `contentType='transport_offer'` + a minimal fixture; assert `TransportOfferCard` mounts.
- `transportApi.ts` has no dedicated test file. Unit-test `toTransportOption` (snake→camel) + `declineAll` 404-no-op + 500-rethrow behavior.
- `declineAll` rejection path test missing. Mirror the existing `selectOption` error test.
- Backend: cross-module private import. `app/agents/v6_dispatcher.py:82-85` imports `_resolve_prompt_arch` and `_resolve_v6_tenant_allowlist` (both `_`-prefixed) from `prompt_loader.py`. Promote to public symbols OR relocate to a shared `v6_config.py` before Phase 2a expands the dispatch surface.
- Vendor name PostHog property fixed inline in the curaway-health-navigator follow-up PR (vendor_id → vendor_name) — no Phase 2a tracking needed.
7. Inheritance map (CRITICAL — v4.1 / v4 + #837 as starting point)
For each v6 absorption section, the exact source text that must be preserved verbatim. This is the input contract for Phase 2a.
| v6 destination section | Source (file + line range) | Verbatim requirement |
|---|---|---|
| `conversation_v6.yaml` HARD BANS — rule 2.2a (treatment recommendation ban) | `config/prompts/base/conversation_v4.yaml:111` | YES, byte-identical. Asserted by `tests/test_hotfix_837_absorption.py` (must be created). Source phrases: "NEVER recommend a specific procedure, surgery, or course of treatment", "the right next step is", "why [procedure] makes sense for your case", "That's a clinical decision your doctor or specialist makes", "Surfacing what a document contains is allowed; choosing the procedure for the patient is not." |
| `conversation_v6.yaml` HARD BANS — rule 2.2b (scope rejection ban) | `config/prompts/base/conversation_v4.yaml:112` | YES, byte-identical. Source phrases: "NEVER reject a patient based on procedure type, condition, or specialty", "Curaway coordinates care across all specialties", "we don't handle that", "this is outside our scope", "Curaway isn't set up for", "Let me flag this with our care team so we can connect you with the right specialist." |
| `conversation_v6.yaml` DOCUMENT-TRUST FRAMING — rule 2.1 | `docs/specs/conversation-v5-feature.md:63-98` (rule definition; never landed in any base prompt file) | Partial verbatim. Verbatim NEVER phrases: "different from what your doctor told you", "this is not [diagnosis]", "the diagnosis is wrong", "I'm seeing findings that contradict". Verbatim ALWAYS phrases: "I want to make sure these have been factored in", "could you check with the oncologist whether", "Surfacing factual findings IS allowed". The 4-part framing pattern's structure can be modernized; the phrase list cannot. |
| `conversation_v6.yaml` DEMOGRAPHIC GROUNDING — rule 2.3 | `docs/specs/conversation-v5-feature.md:131-155` | YES for the identity clarification phrase: "The report I'm reading lists the patient as X — is this for someone other than yourself?". Surrounding guidance can be modernized. |
| `conversation_v6.yaml` VOICE RULES — rule 2.5 | `config/prompts/base/conversation_v4.yaml:40` (existing "NAME THE SPECIFIC HARD THING") + `docs/specs/conversation-v5-feature.md:214-237` (v5 Rule 2.7 strengthening) | YES for the 7-word list: exhausted, scared, desperate, overwhelmed, frustrated, worried, tired. Must appear as a literal list inside section anchored by `# V5-RULE-2.7-EMOTIONAL-VERBATIM`. |
| `conversation_v6.yaml` ONE QUESTION PER TURN — rule 2.6 | `config/prompts/base/conversation_v4.yaml:38` (existing "ONE QUESTION ONLY") + `docs/specs/conversation-v5-feature.md:184-212` (v5 Rule 2.6 SAME-TURN AXIS) | YES for the 7-axis list: Laterality, Mechanism, Severity, Timeline, Prior treatment, Demographics, Records availability. WRONG/RIGHT example pair verbatim. Each stage in `stages.yaml` declares `do_not: [stack-questions]` (redundant placement appropriate per spec §4). |
| `conversation_v6.yaml` JSON RESPONSE FORMAT | `config/prompts/base/conversation_v4.yaml:186` envelope OR `config/prompts/base/conversation_v4.1.yaml:187` envelope (verified identical between v4 and v4.1 per spec §1.4) | YES, byte-identical. The `{"message": "...", "extracted_data": {...}, "detected_comorbidities": [...], "phase_complete": false, "suggested_next": null, "missing_critical_info": []}` envelope must appear unchanged. Asserted by `tests/test_v5_rule_verbatim_preservation.py` per spec §1.4. |
| `conversation_v6.yaml` REMEMBER | `config/prompts/base/conversation_v4.yaml:191-195` (4 numbered rules: ACKNOWLEDGE BEFORE ASKING, NEVER DIAGNOSE, HONOR YOUR PROMISES, NEVER PROJECT EMOTIONS) | Verbatim. These are the "4 most important rules" — the explicit final reminder block. |
| `conversation_v6.yaml` ROLE / COLLECT BEFORE MATCHING / VOICE / NAME / FORMAT / FACTS sections | `config/prompts/base/conversation_v4.yaml` (NOT v4.1 — they're identical for these sections, but v4 is canonical) per the line-level map in `docs/specs/v6-rule-location-map.md` §2.1-2.18 | Mixed verbatim / semantic. The `verbatim:` column in `v6-rule-location-map.md` is the per-rule authority. |
| `stages.yaml > discovery.guidance` + `stages.yaml > procedure_identification.guidance` — rule 2.4 records re-offer | `docs/specs/conversation-v5-feature.md:156-175` + `config/prompts/phase_contexts/v2/records_first.yaml` + `identify_procedure.yaml` | Semantic only. `re_offer_on_turn: [2, 3]` field per spec §4. Cadence-enforced — fixture must show 5-turn discovery stagnation triggers re-offer on turns 2 AND 3. |
| `stages.yaml > <stage>.do_not` | `config/prompts/phase_contexts/v2/*.yaml` per-phase DO NOT lists + `recovery_offer.yaml` + `recovery_checkin.yaml` patronizing-filler ban list | Verbatim for ban lists. Patronizing-filler list: I hear you, I understand, I'm here for you, completely natural to feel. Source: `v6-rule-location-map.md` §3.5. |
| `knowledge/mso_second_opinion.yaml` | `config/prompts/base/conversation_v4.yaml:204` `mso_offer_addendum:` block | Verbatim. Same gating flag (`mso_patient_offer_enabled`) — Phase 2b regression test asserts flag honored across v4↔v6 toggle. |
The line-level absorption map for every other rule (ROLE, VOICE, NAME, NEVER, CONT, EMO, PROJ, NONSENSE, ABROAD, FIRST, APPROACH, THINK, FORMAT, SAFETY, FACTS, EXAMPLES, JSON, REMEMBER + 6 phase_contexts/v2/*.yaml) lives in docs/specs/v6-rule-location-map.md §2.1-2.26. That doc is the per-rule authority. This §7 is the summary contract for Phase 2a kickoff.
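As a rough illustration of what a verbatim-preservation fixture could assert, the sketch below checks that a handful of the #837 source phrases from the table survive byte-for-byte in a prompt text. The names `HOTFIX_837_PHRASES` and `missing_phrases` are hypothetical, the phrase list is only a subset, and the in-memory strings stand in for the YAML file a real fixture would load. A substring check covers the phrase-level requirement only; a full byte-identical paragraph comparison would need the source paragraph itself.

```python
# Subset of the #837 source phrases quoted in the inheritance table.
HOTFIX_837_PHRASES = [
    "NEVER recommend a specific procedure, surgery, or course of treatment",
    "That's a clinical decision your doctor or specialist makes",
    "NEVER reject a patient based on procedure type, condition, or specialty",
    "Curaway coordinates care across all specialties",
]


def missing_phrases(prompt_text: str, phrases: list[str]) -> list[str]:
    """Return the phrases that do not appear byte-for-byte in prompt_text."""
    return [phrase for phrase in phrases if phrase not in prompt_text]


# Illustrative stand-ins for loaded prompt text (not the real prompt content).
sample_prompt = "SAFETY:\n- " + "\n- ".join(HOTFIX_837_PHRASES)
drifted_prompt = sample_prompt.replace("That's a clinical", "That is a clinical")

print(missing_phrases(sample_prompt, HOTFIX_837_PHRASES))
print(missing_phrases(drifted_prompt, HOTFIX_837_PHRASES))
```

The second call reports the one phrase that was reworded, which is the kind of silent drift the inheritance contract is meant to catch.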
8. Phase-by-phase next-steps
Concrete ordered list of what happens AFTER Phase 1 PRs merge. Each step: dependencies, who, rough estimate.
| # | Step | Dependencies | Who | Estimate |
|---|---|---|---|---|
| 1 | Merge Phase 1 stack (#924 + #925 + #926 + #927 + #928 + #929 in dependency order) | LLM-grader CI auth (G-2) fixed OR explicit admin-merge approval | SD + Claude | 1 day calendar (CI thrash) |
| 2 | Plumb `ANTHROPIC_API_KEY` into v6 CI workflow (close G-2) | None | SD (single secret add) | 5 minutes |
| 3 | Create absorption fixtures — `tests/test_hotfix_837_absorption.py` + `tests/test_v5_rule_verbatim_preservation.py` + `tests/test_lockstep_consistency.py` (close G-3, G-4, G-5) | `v6-rule-location-map.md` published (DONE) | Sonnet author (tests are deterministic — no LLM) | 1-2 days |
| 4 | Lock Dr. Naidu calendar windows for 2a, 2b, 7 (close G-13) | None | SD | calendar dependent |
| 5 | Squash G-12 MAJOR findings in `v6-trio-consistency-findings.md` (F-01 through F-05) | None | Opus or Sonnet — one PR per finding | 1-2 days |
| 6 | Phase 2a kickoff — port v5 rules 2.1, 2.2 (#837 verbatim), 2.3, 2.5, 2.6 into `conversation_v6.yaml` AND backport #837 into `conversation_v4.1.yaml` in the same PR (close G-1 + Phase 2a content port in one Naidu touchpoint) | Steps 2-5 complete | Opus author + single Naidu reviewer pass | 1-2 days + Naidu calendar |
| 7 | Patient context builder dataclass-producing services (close G-7) | None (parallel to step 6) | Sonnet — mechanical wrap of existing repos | 1-2 days |
| 8 | Phase 2b kickoff — port `stages.yaml > <stage>.{guidance, cards_to_use, advance_when, do_not, extractors_active}` from `phase_contexts/v2/*.yaml` + `v6-stages-extractors-matrix.md` | Phase 2a merged + Naidu signoff | Opus author + Naidu reviewer | 2-3 days + Naidu calendar |
| 9 | Phase 3 — admin UI extensions | Phase 2a + 2b merged | Sonnet | 1 day |
| 10 | Phase 4 — frontend deep-link cards + placeholder pages | Phase 2b merged (stages declare `cards_to_use`) | Sonnet | 1 day |
| 11 | Phase 5 — extractor prompt language sweep (5 extractors, semantic-equivalent) | Phase 2b merged | Opus for content judgment, Sonnet for tests | 1-2 days |
| 12 | Phase 6 — dual-shadow ramp 10% | Phases 2a-5 merged + cost dashboards green | SD + observation calendar | 1 week observation |
| 13 | Phase 7 — manual validation (3 baselines + 3 after × 3 personas, 9-axis scoring) | Phase 6 observation complete + Naidu calendar | SD + Naidu | 1 day live + Naidu calendar |
| 14 | Phase 8 — ramp to 50% then 100% | Phase 7 sign-off | SD | 3 days |
| 15 | Phase 9 — 2-week observation | Phase 8 ramp complete | SD + Naidu (sampled audits) | 2 weeks calendar |
| 16 | Phase 10 — decommission v4 paths | Phase 9 complete + Langfuse confirms zero v4 traffic | Sonnet, full shadow-import audit | 1-2 days |
9. Open questions for SD
Q1 — Should Phase 2a start before or after the LLM grader CI auth issue is fixed?
Options:
- (a) Fix CI auth FIRST. Phase 2a then ships with the LLM-graded gate live → maximum confidence, zero rework.
- (b) Start Phase 2a NOW using deterministic gates only (axes 1, 2, 6, 7, 8 + flag YAML). LLM grader retrofit when auth lands.
Recommendation: (a). The CI fix is 5 minutes; deferring it leaves the spec-mandated 9-axis gate non-functional for the highest-risk PRs.
Q2 — Should stages_version be a separate flag from prompt_arch?
Context: Spec §3.5.1 row 7 calls for a stages_version flag for minor stages.yaml versioning (e.g., v1.0, v1.1), paired with prompt_arch=v6 via VALID_VERSION_PAIRS enforcement in apps/admin-app/src/pages/Triage.tsx.
Options:
- (a) Add stages_version now (Phase 1 stack extension). Risk: scope creep on a stack that's already 6 PRs deep.
- (b) Defer to Phase 2b when stages content actually evolves. Risk: first stages.yaml content port has no versioning surface — re-rolling requires a prompt_arch flip.
- (c) Bake the version into stages.yaml (version: "v1.0" field, already present in PR #925 line 6) and skip the flag. Risk: no Flagsmith rollback granularity for stages content.
Recommendation: (b) — defer. The version: field in stages.yaml is enough until content actually moves.
Q3 — Should the G-1 drift (#837 missing from v4.1) be backported NOW or absorbed by Phase 2a? RESOLVED 2026-05-16
Decision (SD, 2026-05-16): Fold into Phase 2a kickoff (option b). The first Phase 2a PR will bundle the conversation_v4.1.yaml backport with the conversation_v6.yaml HARD BANS port and the test_hotfix_837_absorption.py fixture — single Naidu review touchpoint covers both files since the paragraphs are identical to what he already approved for v4. Accepts ~1 week of un-patched v4.1 traffic to consolidate Naidu's attention.
Q4 — Naidu calendar — are all 4 windows locked?
Spec §6 footnote: "All 4 windows MUST be locked on his calendar before Phase 0 starts." Phase 0 has shipped. Confirm whether 2a / 2b / 7 windows are locked, or whether SD intends to operate without them.
Recommendation: Lock them in writing this week or document the deviation.
Q5 — Should the Phase 0 LLM grader run on EVERY Phase 2a/2b PR or only on the merge-to-main commit?
Context: Spec §3.9 implies per-PR. Cost concern: each grader run is ~$0.30 + 30s. If Phase 2a iterates 5 times, that's ~$1.50 + 2-3 min of CI in total.
Options:
- (a) Every PR push (highest catch rate).
- (b) Only on PR open + on each commit author-tagged @grader (manual trigger via PR comment).
- (c) Only on merge-to-main (lowest cost, slowest feedback).
Recommendation: (a) — ~$1.50 across five iterations is irrelevant; clinical-safety regressions are not.
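The cost figures can be checked with a few lines of arithmetic. The sketch assumes one grader run per iteration (an assumption; under option (a) a single iteration may involve several pushes) and uses the ~$0.30 and ~30s per-run figures from the context line. The function name `grader_totals` is illustrative.

```python
GRADER_COST_CENTS = 30  # ~$0.30 per grader run, from the Q5 context
GRADER_SECONDS = 30     # ~30s of CI per grader run, from the Q5 context


def grader_totals(runs: int) -> tuple[float, float]:
    """Return (total dollars, total CI minutes) for the given number of runs."""
    return runs * GRADER_COST_CENTS / 100, runs * GRADER_SECONDS / 60


dollars, minutes = grader_totals(5)
print(f"${dollars:.2f} and {minutes:.1f} min across 5 runs")  # $1.50 and 2.5 min across 5 runs
```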
Q6 — Should support stage be the default for new cases (per spec §10 Q1)?
Spec note: Currently spec'd as a fallback safety net. Could also be the entry stage. SD has not resolved.
Recommendation: Surface to Naidu in the Phase 2a base-prompt-rules review. He should decide; spec defers.
Q7 — Phase 6 cache-hit acceptance criteria — what if Seg 2 < 60% during ramp?
Spec note: §2.5 acceptance criterion blocks ramp if Seg 2 <60% or Seg 3 <50% sustained 24h.
Options on miss:
- (a) Pause ramp, investigate cache invalidation patterns (likely culprit: too-aggressive invalidate_case_cache() calls).
- (b) Ramp anyway with cost mitigation (smaller stage profiles).
- (c) Raise the threshold (acknowledge cache hit rate is fundamentally constrained by Anthropic's invalidation behavior).
Recommendation: Document the SOP for (a) in docs/runbook/prompt-rollback.md (new — per spec Appendix B Docs section). Don't pre-decide between (a/b/c) — depends on what the dashboard shows.
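For the runbook, the §2.5 blocking condition can be expressed as a small predicate. This is a sketch under stated assumptions: hit rates arrive as hourly `(seg2, seg3)` samples, and "sustained 24h" is read as every sample in the trailing 24 hours falling below its threshold. Neither the sampling interval nor that reading is defined in this document, and `ramp_blocked` is a hypothetical name.

```python
SEG2_MIN = 0.60  # Seg 2 cache-hit floor from the spec note
SEG3_MIN = 0.50  # Seg 3 cache-hit floor from the spec note


def ramp_blocked(hourly_samples: list[tuple[float, float]], window_hours: int = 24) -> bool:
    """True when Seg 2 or Seg 3 stays below its floor for the whole trailing window.

    hourly_samples holds (seg2_hit_rate, seg3_hit_rate) pairs, oldest first.
    """
    if len(hourly_samples) < window_hours:
        return False  # not enough data to call the miss "sustained"
    window = hourly_samples[-window_hours:]
    seg2_low = all(seg2 < SEG2_MIN for seg2, _ in window)
    seg3_low = all(seg3 < SEG3_MIN for _, seg3 in window)
    return seg2_low or seg3_low
```

A single recovered hour inside the window clears the block under this reading; a stricter SOP might use a rolling average instead.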
10. Appendix — file / issue / memory index
Specs (read in this order for new readers)
| Doc | Status | Purpose |
|---|---|---|
| `docs/specs/conversation-v6-feature.md` | rev 6, final for Phase 0 kickoff | Canonical v6 spec, 962 lines |
| `docs/specs/conversation-v5-feature.md` | legacy / superseded | Original v5 rule definitions (rules 2.1-2.7) — still the canonical wording source for absorption |
| `docs/specs/v6-rubric-locked.md` | locked rev 2, 257 lines | 9-axis grader rubric (Phase 0 + Phase 7 consumer) |
| `docs/specs/v6-stage-resolver-truth-table.md` | DRAFT 2026-05-12, 354 lines | Companion to v6 spec §2.6 (NOT YET BLOCKING — Phase 1 is shipping) |
| `docs/specs/v6-stages-extractors-matrix.md` | DRAFT 2026-05-12, 288 lines | Companion to v6 spec §3.4 — blocks Phase 2b |
| `docs/specs/v6-rule-location-map.md` | DRAFT, 737 lines | Lockstep registry for §8.5 CI gate — blocks Phase 2a |
| `docs/specs/v6-trio-consistency-findings.md` | 268 lines | 18-finding cross-doc audit (5 MAJOR + 13 MINOR) |
Tracking issues
| Issue | State | Purpose |
|---|---|---|
| #836 | OPEN | v6 epic — tracks the full Phase 0-10 sequence |
| #837 | MERGED PR | Production hot-fix — two new SAFETY bullets in v4.yaml (treatment recommendation + scope rejection bans) |
| #832 | MERGED PR | Recovery prompts + extractor + orchestrator wiring (ADR-0018 §K) — downstream dependency for Phase 5 |
| #491 | OPEN | Multi-question discipline → v5 rule 2.6 |
| #547 | CLOSED | Demographic fabrication → v5 rule 2.3 |
| #550 | OPEN | Laterality re-ask → v5 rule 2.6 |
| #560 | OPEN | Document trust framing → v5 rule 2.1 |
| #642 | CLOSED | Treatment recommendation → v5 rule 2.2 + #837 hot-fix |
| #743 | CLOSED | Scope rejection → v5 rule 2.2 + #837 hot-fix |
| #535 | CLOSED | Flagsmith identity bug — Phase 1 prereq (RESOLVED) |
| #359 | CLOSED | Prompt versioning + audit trail |
PRs
| PR | State | Title |
|---|---|---|
| #921 | MERGED | Phase 0 Steps 1+2 — locked rubric + LLM-grader scorer |
| #922 | MERGED | Phase 0 Step 3 — fixture corpus |
| #923 | MERGED | Phase 0 Steps 4+5+6 — grader cache + 6 gates + CI |
| #924 | OPEN | Phase 1 Step 1 — prompt_arch + tenant allowlist flags |
| #925 | OPEN | Phase 1 scaffolding — stages.yaml + knowledge + stubs |
| #926 | OPEN | Phase 1 Layer 2 — YAML artifact validator |
| #927 | OPEN | Phase 1 Layer 3 — dispatcher + Langfuse tags |
| #928 | OPEN | Phase 1 Layer 4 — fallback observability + cost scaffold |
| #929 | OPEN | Phase 1 Layer 5 — end-to-end smoke tests |
Memory files relevant to this plan
| File | Purpose |
|---|---|
| `feedback_agent_chat_sacrosanct.md` | Discipline for every prompt change — 3 baselines + 3 after on 3 personas |
| `reference_v4_parser_strict_false.md` | `json.loads(strict=False)` requirement — preserved in spec §1.4 |
| `feedback_flagsmith_dual_env.md` | Every flag flip applies to BOTH Production and Development envs |
| `reference_flagsmith_v2_env_patch.md` | V2 versioning + env-scoped PATCH endpoint |
| `feedback_check_railway_after_migration_merge.md` | Migration Roundtrip CI is `continue-on-error: true` — confirm prod deploy after merge |
| `project_execution_order_transport_v6.md` | Transport admin → 3-reviewer subagents → v6 implementation (per SD 2026-05-15) |
| `project_work_queue.md` | Cross-session items (Clerk webhook, etc.) |
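The `json.loads(strict=False)` requirement in the table refers to standard-library behavior: in the default strict mode, Python's JSON decoder rejects raw control characters (such as a literal newline) inside string values, while `strict=False` accepts them. The payload below is a made-up example, and the guess that model output is where such characters come from is an assumption, not something this document states.

```python
import json

# A JSON document whose string value contains a literal newline character.
raw = '{"message": "line one\nline two"}'

try:
    json.loads(raw)  # default strict=True
    strict_ok = True
except json.JSONDecodeError:
    strict_ok = False

lenient = json.loads(raw, strict=False)

print(strict_ok)                 # False
print(repr(lenient["message"]))  # 'line one\nline two'
```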
Code paths most relevant to v6
| Path | Role |
|---|---|
| `config/prompts/base/conversation_v4.yaml` | Production base (with #837 hot-fix at lines 111-112) |
| `config/prompts/base/conversation_v4.1.yaml` | Production base with MSO addendum baked in (MISSING #837 paragraphs — see G-1) |
| `config/prompts/base/conversation_v6.yaml` | v6 scaffold (TODO markers) |
| `config/prompts/stages.yaml` | v6 stages scaffold |
| `config/prompts/knowledge/*.yaml` | v6 knowledge addendums (4 files scaffolded) |
| `config/prompts/layer_contexts/intent_capture.yaml` | v1 layer context |
| `config/prompts/layer_contexts/intent_capture_v1.1.yaml` | v1.1 layer context (paired with v4.1) |
| `config/prompts/phase_contexts/v2/*.yaml` | Production v2 phase contexts (intake, records_first, identify_procedure, document_review, general, recovery_offer, recovery_checkin) |
| `config/feature_flags.yaml` | Flag defaults |
| `app/agents/conversation_prompt.py` | `get_system_prompt` — has v6 dispatcher branch (PR #927) |
| `app/services/prompt_loader.py` | `_resolve_prompt_version`, `resolve_versions`, `_resolve_prompt_arch`, `_resolve_v6_tenant_allowlist` |
| `app/agents/v6_dispatcher.py` | v6 arch dispatch decision (PR #927) |
| `app/services/v6_artifact_validator.py` | Boot-time YAML validator (PR #926) |
| `app/services/v6_fallback_monitor.py` | Telegram alert + cost scaffold (PR #928) |
| `app/services/stage_resolver.py` | §2.6 truth-table resolver (PR #925) |
| `app/agents/patient_context_builder.py` | §2.4 context block builder (PR #925) |
| `app/services/prompt_loader_v6.py` | `compose_v6` stub (PR #925) |
End of unified plan.