Runbook: Patient Tenant Cutover
Status: Operator runbook. Source of truth for execution: `docs/specs/patient-tenant-cutover-phase2-feature.md` (Phase 2a) and a future Phase 2b spec.
This runbook is the quick-reference card for the operator running the maintenance window. The full risk analysis, mid-window connection behavior, predicate derivations, smoke matrix, and rollback safety bounds live in the spec. Read the spec end-to-end at least once before scheduling the window; consult this runbook during execution.
Phase 1: DONE (PR #512, merged 2026-04-30)
tenant-curaway-patients row exists with mirrored apollo settings (ap-south-1/en/INR/daily). Added to PROTECTED_TENANT_IDS. No data has moved.
Phase 2a: patient data cutover
Spec: docs/specs/patient-tenant-cutover-phase2-feature.md — 21-check dev-branch smoke matrix; concrete SQL predicates verified against live schema (Session 86 audit); execution order pinned in §"Required execution order inside the cutover transaction".
Quick prerequisites
- Phase 1 migration `5d49162011a5_add_tenant_curaway_patients_row` already applied (verify `railway run alembic current` returns `5d49162011a5`).
- DB snapshot rehearsed on Railway dev branch within last 7 days; restore time documented.
- `POST /api/v1/admin/system/drain-streams` admin endpoint shipped + baked ≥ 24h.
- Telegram alerts wired (`cutover_started`, `cutover_completed`, `cutover_failed`).
- Backend image with `default_tenant_id='tenant-curaway-patients'` AND the flipped `_ORG_TENANT_MAP_FALLBACK` patient-org row staged on `chore/post-cutover-default-tenant`.
- Vercel rewrite for `/maintenance.html` rehearsed on a preview branch.
- All Set A smoke checks (21 of them) green on the dev-branch rehearsal.
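The first prerequisite can be checked mechanically rather than by eye. The following is a minimal sketch, assuming `alembic current` prints the revision id as the first token of a line (optionally followed by `(head)`). The helper names `revision_is_applied` and `check_live` are hypothetical and not part of the repo.

```python
import subprocess

PHASE1_REVISION = "5d49162011a5"  # taken from the Phase 1 migration filename


def revision_is_applied(alembic_output: str, revision: str = PHASE1_REVISION) -> bool:
    """Return True if any output line starts with the expected revision id."""
    for line in alembic_output.splitlines():
        tokens = line.split()
        if tokens and tokens[0] == revision:
            return True
    return False


def check_live() -> bool:
    """Run the command named in the prerequisite list (needs an authed railway CLI)."""
    result = subprocess.run(
        ["railway", "run", "alembic", "current"],
        capture_output=True,
        text=True,
        check=True,
    )
    return revision_is_applied(result.stdout)
```

Matching on the first token avoids a false positive when the revision id only appears inside a log message or a longer string.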
Execution sequence (~15 min target)
The full spec timeline lives in §"Maintenance window timeline". Highlights:
- T+0:00: Vercel rewrite live on patient-app; `/admin/system/drain-streams` fired; Slack `#curaway-ops` window-open notice.
- T+0:01: `pg_dump` snapshot to S3.
- T+0:04: `python scripts/cutover_patient_tenant.py --execute --operator-confirmed` (single transaction, follows the 14-step execution order from the spec).
- T+0:09: Backend deploy: `default_tenant_id=tenant-curaway-patients` + `_ORG_TENANT_MAP_FALLBACK` patient-org → `tenant-curaway-patients`.
- T+0:11: Vercel env update: `VITE_TENANT_ID=tenant-curaway-patients`; patient-app redeploys.
- T+0:11–T+0:13: Post-cutover cleanup sweep (catches SSE/WS pollution if any escaped drain-streams) + Set B prod-subset smoke (A1, A2, A3, A4, A6, A7, A8, A9; ~2 min).
- T+0:15: Banner down; Slack window-close notice.
If any Set B check fails: rollback per spec §11 (decision tree depends on time elapsed; T+0:30 boundary is the safety line for option-2 reverse migration).
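For the window announcement it can help to turn the T+ offsets into wall-clock times. This is a small sketch under stated assumptions: the offsets are copied from the sequence above, the step labels are abbreviated, and the window start in the example is made up rather than a scheduled date.

```python
from datetime import datetime, timedelta, timezone

# (minutes after window open, abbreviated step label)
TIMELINE = [
    (0, "Vercel rewrite live; drain-streams fired; window-open notice"),
    (1, "pg_dump snapshot to S3"),
    (4, "cutover_patient_tenant.py --execute --operator-confirmed"),
    (9, "Backend deploy"),
    (11, "Vercel env update; patient-app redeploys"),
    (11, "Cleanup sweep + Set B smoke (runs through T+0:13)"),
    (15, "Banner down; window-close notice"),
]


def wall_clock_schedule(window_open: datetime) -> list[tuple[str, str]]:
    """Map each T+ offset to an HH:MM time relative to the given window start."""
    return [
        ((window_open + timedelta(minutes=offset)).strftime("%H:%M"), step)
        for offset, step in TIMELINE
    ]


# Hypothetical window start, for illustration only.
example_open = datetime(2026, 1, 1, 20, 30, tzinfo=timezone.utc)
for clock, step in wall_clock_schedule(example_open):
    print(f"{clock} UTC  {step}")
```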
Rollback time bound
Worst-case rollback takes 15 minutes (8 min snapshot restore + 4 min revert deploys + 3 min smoke). State this bound in the maintenance-window announcement so support knows when to expect a status update either way.
Alembic downgrade trap
Do not run `alembic downgrade -1` past `5d49162011a5_add_tenant_curaway_patients_row` between Phase 2a and Phase 2b. The Phase 1 down-guard will refuse anyway. If a true Phase 2a rollback is needed, restore from the snapshot; do NOT use `alembic downgrade`.
Dev-branch rehearsal procedure
Run this end-to-end at least 24h before scheduling the prod window. Every Set A check must pass.
1. Provision a Railway dev branch
# In the curaway_src repo, with railway CLI already authed
railway environment new --name cutover-rehearsal-2026-MM-DD
railway link --environment cutover-rehearsal-2026-MM-DD
2. Restore a prod snapshot to the dev branch
# 1. Snapshot prod (run from a project linked to PROD)
railway link --environment production
SNAPSHOT=/tmp/cutover_rehearsal_$(date -u +%Y%m%dT%H%M%SZ).sql.gz
# Single quotes defer $DATABASE_URL_ADMIN expansion to the shell started by
# railway run, so the linked environment's value is used, not the local one.
railway run -- sh -c 'pg_dump --no-owner --no-privileges "$DATABASE_URL_ADMIN"' \
| gzip > "$SNAPSHOT"
# 2. Restore to dev branch (reuse the exact file written above, not a glob)
railway link --environment cutover-rehearsal-2026-MM-DD
gunzip -c "$SNAPSHOT" \
| railway run -- sh -c 'psql "$DATABASE_URL_ADMIN"'
3. Run pre-flight `--dry-run` against the dev branch
Verdict must be READY (rc=0). Any blocker must be addressed (update predicates / classifier dict / spec) before proceeding.
4. Execute the cutover against the dev branch
CUTOVER_ACTOR="rehearsal:$(whoami)" \
railway run -- python scripts/cutover_patient_tenant.py \
--execute --operator-confirmed \
--report-path /tmp/cutover_report_rehearsal_$(date -u +%Y%m%dT%H%M%SZ).json
Expected: rc=0, "CUTOVER COMMITTED" message, report file written with commit_status: "committed" and full row-id allow-list.
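The report file can be checked with a few lines of Python before moving on to the smoke matrix. This is a sketch only: `commit_status` is the field named above, while the allow-list key `row_id_allow_list` is a guessed name that must be adjusted to the real report schema.

```python
import json
from pathlib import Path


def report_is_committed(report_path: str, allow_list_key: str = "row_id_allow_list") -> bool:
    """Return True when the report shows a committed cutover and a non-empty allow-list.

    `allow_list_key` is a hypothetical field name; pass the real one if it differs.
    """
    report = json.loads(Path(report_path).read_text(encoding="utf-8"))
    committed = report.get("commit_status") == "committed"
    has_allow_list = bool(report.get(allow_list_key))
    return committed and has_allow_list
```

Treating an empty or missing allow-list as a failure keeps a truncated report from passing just because the status string is present.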
5. Run Set A smoke matrix
5a. SQL-based subset (12 checks, automated)
Expected: rc=0, "ALL 12 CHECKS PASSED" message. Any FAIL → investigate before proceeding.
5b. Manual checks (9 of the 21, operator-executed)
- A11 Patient login E2E: sign in to the dev-branch patient app as a real patient account. Verify: chat, file upload, match request, document WebSocket all work end-to-end.
- A12 Idempotency-key replay: capture an idempotency key from before the cutover (`SELECT key FROM idempotency_keys LIMIT 1`), then `curl -X POST … -H "X-Idempotency-Key: <key>" …` and confirm the cached response is returned (HTTP 2xx, body matches the original); the cached `case_id` must resolve on `tenant-curaway-patients`.
- A13 Provider portal sanity: sign into the provider portal on the dev branch. List cases. Forwarded cases must be visible (cross-tenant via `case_shares`).
- A14 Coordinator portal sanity: sign into the coordinator portal. View any case that just moved. The Performance page (#159) loads (not 403, assuming #514 grants are live).
- A15 RLS policies: log in as a non-patient-tenant user and attempt a cross-tenant read of `patients` rows. Expect 403.
- A16 Langfuse trace: send a patient message via the patient app on dev branch. Inspect the latest Langfuse trace for `tenant_id=tenant-curaway-patients` in metadata.
- A17 Background job sanity: trigger any QStash callback (e.g., `POST /qstash/document-processed` with a moved document_reference id). The callback must find the row on `tenant-curaway-patients` and complete successfully.
- A18 Module-constant reload: exec into the dev-branch container, run `python -c "from app.routers.public_shared import _DEFAULT_TENANT; print(_DEFAULT_TENANT)"`. Must print `tenant-curaway-patients`.
- A19 Drain-streams works: open the case WebSocket on the dev branch from a real patient session (or a wscat shell with a Clerk JWT): `wss://<dev-branch>/api/v1/cases/<id>/ws?token=<jwt>`. In a separate shell: `curl -X POST -H "Authorization: Bearer <super-admin-jwt>" https://<dev-branch>/api/v1/admin/system/drain-streams`. The WebSocket must emit `{"type":"drained","message":"…"}` and close within 1 second. (Verify in browser DevTools → Network → WS frames panel.) The retired SSE handlers `/messages/stream` + `/chat/stream` + `/documents/stream` were removed in #586; don't use them.
- A20-active Cleanup sweep moves a straggler: the smoke runner's automated A20 only proves no straggler EXISTS post-rehearsal. To prove the cleanup mechanism can move one, do this:
  - Insert a synthetic patient-classified event onto apollo that simulates a lost-write from a long-lived SSE: `INSERT INTO events (id, event_type, tenant_id, payload, source_service, created_at) VALUES (gen_random_uuid()::text, 'intake.started', 'tenant-apollo-001', '{"actor": "rehearsal:synthetic-straggler"}'::jsonb, 'rehearsal', NOW());`
  - Confirm the synthetic row exists on apollo: `SELECT count(*) FROM events WHERE tenant_id='tenant-apollo-001' AND payload->>'actor' = 'rehearsal:synthetic-straggler'` → returns 1.
  - Run the cleanup predicate (mirrors the migration's events UPDATE): `UPDATE events SET tenant_id='tenant-curaway-patients' WHERE tenant_id='tenant-apollo-001' AND event_type IN ('intake.started', /* ...patient lifecycle types from EVENT_TYPE_CLASSIFICATION... */)`.
  - Confirm the synthetic row is now on the patient tenant: same SELECT but `tenant_id='tenant-curaway-patients'` → returns 1.
  - Re-run the smoke runner; A20 must still PASS.
  - Cleanup: `DELETE FROM events WHERE payload->>'actor' = 'rehearsal:synthetic-straggler'` so the rehearsal artifact doesn't pollute future audits.
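The cleanup predicate in A20-active elides the list of patient lifecycle event types. One way to avoid hand-typing that list is to build the statement from the classifier. The sketch below assumes a dict that maps event type to an owner label; the real `EVENT_TYPE_CLASSIFICATION` may be shaped differently, and `build_cleanup_update` is a hypothetical helper. Only `intake.started` and the two tenant ids come from the runbook.

```python
# Hypothetical stand-in for EVENT_TYPE_CLASSIFICATION; the real dict lives in
# the codebase and may use a different shape or different labels.
EXAMPLE_CLASSIFICATION = {
    "intake.started": "patient",
    "example.provider_event": "provider",  # made-up entry for illustration
}

SOURCE_TENANT = "tenant-apollo-001"
TARGET_TENANT = "tenant-curaway-patients"


def build_cleanup_update(classification: dict[str, str]) -> tuple[str, tuple]:
    """Build a parameterized UPDATE that moves patient-classified events.

    Uses %s placeholders (psycopg style) so event types are passed as
    parameters instead of being interpolated into the SQL string.
    """
    patient_types = sorted(t for t, owner in classification.items() if owner == "patient")
    if not patient_types:
        raise ValueError("no patient-classified event types; refusing to build the UPDATE")
    sql = (
        "UPDATE events SET tenant_id = %s "
        "WHERE tenant_id = %s AND event_type = ANY(%s)"
    )
    return sql, (TARGET_TENANT, SOURCE_TENANT, patient_types)
```

With a psycopg cursor this would be run as `cursor.execute(sql, params)`; the Python list is adapted to a PostgreSQL array for the `ANY(...)` comparison.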
6. Document the rehearsal
Capture in the operator log: snapshot timestamp, restore wall-clock, --execute duration, smoke A1-A21 results (paste the smoke runner output + the manual-check observations), report file path.
If all 21 checks pass: announce the prod maintenance window per the runbook execution sequence above.
If any check fails: do NOT proceed. Update the relevant code/spec/classifier, ship through review, re-rehearse on a fresh dev branch.
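The operator-log entry and the go/no-go rule can be captured in one small record. This is an illustrative sketch, not an existing tool: the class and field names are hypothetical, and it assumes the 21 checks are identified as A1 through A21 as in the log description above.

```python
from dataclasses import dataclass, field

EXPECTED_CHECKS = [f"A{n}" for n in range(1, 22)]  # A1..A21


@dataclass
class RehearsalRecord:
    snapshot_timestamp: str
    restore_seconds: float
    execute_seconds: float
    report_path: str
    results: dict[str, bool] = field(default_factory=dict)  # check id -> passed

    def missing_checks(self) -> list[str]:
        """Checks with no recorded result."""
        return [c for c in EXPECTED_CHECKS if c not in self.results]

    def failed_checks(self) -> list[str]:
        """Checks recorded as failed."""
        return [c for c in EXPECTED_CHECKS if self.results.get(c) is False]

    def go(self) -> bool:
        """Go only when every one of the 21 checks is recorded and passed."""
        return not self.missing_checks() and not self.failed_checks()
```

Counting an unrecorded check as a blocker mirrors the rule above: a skipped manual check is not a pass.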
Phase 2b: apollo rename + UUID-PK conversion (deferred)
Per ADR-0023 lazy-migration policy + #426 prerequisites. Bundled scope:
- Rename tenant-apollo-001 → tenant-provider-apollo-hospitals (or similar)
- Convert tenant-apollo-001 PK to UUID
- Convert tenant-curaway-patients PK to UUID
- Update remaining hardcoded references
Spec written when scheduled. Estimated additional downtime: 5–8 min. Operator-facing UI must already render slug-not-UUID before this window (file as a sub-issue).
Phase 3: cleanup (deferred)
- Retire the `_ORG_TENANT_MAP_FALLBACK` patient-org dict entry entirely (Phase 2a flips its value but the entry stays as defense-in-depth).
- Optional: rename `tenant-apollo-001` if not done in Phase 2b.