Runbook — Patient Tenant Cutover

Status: Operator runbook. Source of truth for execution: docs/specs/patient-tenant-cutover-phase2-feature.md (Phase 2a) and a future Phase 2b spec.

This runbook is the quick-reference card for the operator running the maintenance window. The full risk analysis, mid-window connection behavior, predicate derivations, smoke matrix, and rollback safety bounds live in the spec. Read the spec end-to-end at least once before scheduling the window; consult this runbook during execution.


Phase 1 — DONE (PR #512, merged 2026-04-30)

tenant-curaway-patients row exists with mirrored apollo settings (ap-south-1/en/INR/daily). Added to PROTECTED_TENANT_IDS. No data has moved.


Phase 2a — patient data cutover

Spec: docs/specs/patient-tenant-cutover-phase2-feature.md — 21-check dev-branch smoke matrix; concrete SQL predicates verified against live schema (Session 86 audit); execution order pinned in §"Required execution order inside the cutover transaction".

Quick prerequisites

  • Phase 1 migration 5d49162011a5_add_tenant_curaway_patients_row already applied (verify railway run alembic current returns 5d49162011a5).
  • DB snapshot rehearsed on Railway dev branch within last 7 days; restore time documented.
  • POST /api/v1/admin/system/drain-streams admin endpoint shipped + baked ≥ 24h.
  • Telegram alerts wired (cutover_started, cutover_completed, cutover_failed).
  • Backend image with default_tenant_id='tenant-curaway-patients' AND the flipped _ORG_TENANT_MAP_FALLBACK patient-org row staged on chore/post-cutover-default-tenant.
  • Vercel rewrite for /maintenance.html rehearsed on a preview branch.
  • All Set A smoke checks (21 of them) green on the dev-branch rehearsal.
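The first prerequisite lends itself to a mechanical check. The following is a minimal sketch, assuming the text printed by alembic current has been captured and that the applied revision appears as the first token on its own line. The helper name revision_applied is hypothetical and not part of the repo.

```python
PHASE1_REVISION = "5d49162011a5"  # revision id from the first prerequisite


def revision_applied(alembic_current_output: str,
                     revision: str = PHASE1_REVISION) -> bool:
    """Return True if `revision` is the first token of any output line.

    Assumption: `alembic current` prints a line such as "5d49162011a5 (head)",
    while log lines start with a different token (for example "INFO").
    """
    for line in alembic_current_output.splitlines():
        tokens = line.split()
        if tokens and tokens[0] == revision:
            return True
    return False


# Illustrative captured output, not a real log from this project.
sample = "INFO  [alembic.runtime.migration] Will assume transactional DDL.\n5d49162011a5 (head)\n"
print(revision_applied(sample))
```

Matching on the first token rather than a substring avoids a false positive when the revision id only appears inside a log message.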

Execution sequence (~15 min target)

The full spec timeline lives in §"Maintenance window timeline". Highlights:

  1. T+0:00 — Vercel rewrite live on patient-app; /admin/system/drain-streams fired; Slack #curaway-ops window-open notice.
  2. T+0:01 — pg_dump snapshot to S3.
  3. T+0:04 — python scripts/cutover_patient_tenant.py --execute --operator-confirmed (single transaction, follows the 14-step execution order from the spec).
  4. T+0:09 — Backend deploy: default_tenant_id=tenant-curaway-patients + _ORG_TENANT_MAP_FALLBACK patient-org → tenant-curaway-patients.
  5. T+0:11 — Vercel env update: VITE_TENANT_ID=tenant-curaway-patients; patient-app redeploys.
  6. T+0:11–T+0:13 — Post-cutover cleanup sweep (catches SSE/WS pollution if any escaped drain-streams) + Set B prod-subset smoke (A1, A2, A3, A4, A6, A7, A8, A9 — ~2 min).
  7. T+0:15 — Banner down; Slack window-close notice.

If any Set B check fails: rollback per spec §11 (decision tree depends on time elapsed; T+0:30 boundary is the safety line for option-2 reverse migration).
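The timeline offsets above can be sanity-checked before the window is announced. This is a rough sketch, assuming the labels read as T+H:MM and that step 6 is represented by its start time. The names parse_offset, is_monotonic and past_safety_line are illustrative, and whether the T+0:30 boundary itself is still inside the safe window is an assumption; the spec's §11 decision tree remains authoritative.

```python
from datetime import timedelta

# Offsets copied from the execution sequence (step 6 uses its start time).
TIMELINE = ["T+0:00", "T+0:01", "T+0:04", "T+0:09", "T+0:11", "T+0:11", "T+0:15"]
WINDOW_TARGET = timedelta(minutes=15)
REVERSE_MIGRATION_SAFETY_LINE = timedelta(minutes=30)  # the T+0:30 boundary


def parse_offset(label: str) -> timedelta:
    """Parse a "T+H:MM" label into a timedelta."""
    if not label.startswith("T+"):
        raise ValueError(f"not a timeline label: {label!r}")
    hours, minutes = label[2:].split(":")
    return timedelta(hours=int(hours), minutes=int(minutes))


def is_monotonic(labels) -> bool:
    """True when no step is scheduled before the step that precedes it."""
    offsets = [parse_offset(label) for label in labels]
    return all(a <= b for a, b in zip(offsets, offsets[1:]))


def past_safety_line(elapsed: timedelta) -> bool:
    """Assumption: exactly T+0:30 still counts as inside the safe window."""
    return elapsed > REVERSE_MIGRATION_SAFETY_LINE


print(is_monotonic(TIMELINE), parse_offset(TIMELINE[-1]) <= WINDOW_TARGET)
```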

Rollback time bound

Worst case: 15 minutes (8 min snapshot restore + 4 min revert deploys + 3 min smoke). Include this bound in the maintenance-window announcement so support knows when to expect a status update, whichever way the window goes.
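The 15-minute bound is the sum of three budgets, so it is worth re-deriving whenever a rehearsal measures a different restore time. A small sketch of that arithmetic follows; the dictionary keys and the helper names worst_case and restore_over_budget are illustrative only.

```python
from datetime import timedelta

# Budgets taken from the rollback time bound stated above.
ROLLBACK_BUDGET = {
    "snapshot restore": timedelta(minutes=8),
    "revert deploys": timedelta(minutes=4),
    "smoke": timedelta(minutes=3),
}


def worst_case(budget) -> timedelta:
    """Sum the per-step budgets into one worst-case bound."""
    return sum(budget.values(), timedelta())


def restore_over_budget(measured_restore: timedelta, budget=ROLLBACK_BUDGET) -> bool:
    """True when a rehearsed restore took longer than its budgeted share."""
    return measured_restore > budget["snapshot restore"]


print(worst_case(ROLLBACK_BUDGET))
```

If the rehearsed restore exceeds its share, the announced bound should be raised before the window is scheduled rather than silently absorbed.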

Alembic downgrade trap

Do not run alembic downgrade -1 past 5d49162011a5_add_tenant_curaway_patients_row between Phase 2a and Phase 2b. The Phase 1 down-guard will refuse anyway. If a true Phase 2a rollback is needed, restore from snapshot; do NOT use alembic downgrade.


Dev-branch rehearsal procedure

Run this end-to-end at least 24h before scheduling the prod window. Every Set A check must pass.

1. Provision a Railway dev branch

# In the curaway_src repo, with railway CLI already authed
railway environment new --name cutover-rehearsal-2026-MM-DD
railway link --environment cutover-rehearsal-2026-MM-DD

2. Restore a prod snapshot to the dev branch

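# 1. Snapshot prod (run from a project linked to PROD)
railway link --environment production
# Single quotes defer $DATABASE_URL_ADMIN expansion to the shell that
# `railway run` starts, where the linked environment's variables are injected.
# Unquoted, the local shell would expand it before railway runs.
railway run -- sh -c 'pg_dump --no-owner --no-privileges "$DATABASE_URL_ADMIN"' \
  | gzip > /tmp/cutover_rehearsal_$(date -u +%Y%m%dT%H%M%SZ).sql.gz

# 2. Restore to dev branch
railway link --environment cutover-rehearsal-2026-MM-DD
# The glob must match exactly one snapshot; remove older files first.
gunzip -c /tmp/cutover_rehearsal_*.sql.gz \
  | railway run -- sh -c 'psql "$DATABASE_URL_ADMIN"'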

3. Run pre-flight --dry-run against the dev branch

railway run -- python scripts/cutover_patient_tenant.py --dry-run

Verdict must be READY (rc=0). Any blocker must be addressed (update predicates / classifier dict / spec) before proceeding.
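The rc=0 requirement can be enforced by a thin wrapper so that a non-zero exit never gets overlooked in a long terminal scroll. The sketch below assumes nothing about the script beyond its exit code; run_gate is a hypothetical helper, and DRY_RUN simply repeats the command from this step.

```python
import subprocess
import sys
from typing import Sequence


def run_gate(cmd: Sequence[str]) -> bool:
    """Run `cmd` and return True only when it exits with rc=0."""
    result = subprocess.run(list(cmd), capture_output=True, text=True)
    if result.returncode != 0:
        # Surface the script's own output so the blocker can be read directly.
        print(f"BLOCKED rc={result.returncode}", file=sys.stderr)
        print(result.stdout, file=sys.stderr)
        print(result.stderr, file=sys.stderr)
    return result.returncode == 0


# The dry-run command from this step; it is not executed on import.
DRY_RUN = ["railway", "run", "--", "python",
           "scripts/cutover_patient_tenant.py", "--dry-run"]
```

An operator script could call run_gate(DRY_RUN) and stop before the execute step when it returns False.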

4. Execute the cutover against the dev branch

CUTOVER_ACTOR="rehearsal:$(whoami)" \
  railway run -- python scripts/cutover_patient_tenant.py \
  --execute --operator-confirmed \
  --report-path /tmp/cutover_report_rehearsal_$(date -u +%Y%m%dT%H%M%SZ).json

Expected: rc=0, "CUTOVER COMMITTED" message, report file written with commit_status: "committed" and full row-id allow-list.
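The report file can be checked programmatically as well as by eye. This is a minimal sketch, assuming the report is a JSON object with commit_status as a top-level key. The key that holds the row-id allow-list is not named in this runbook, so the sketch does not check it; load_report and is_committed are hypothetical names.

```python
import json
from pathlib import Path


def load_report(report_path) -> dict:
    """Read the cutover report written via --report-path."""
    return json.loads(Path(report_path).read_text(encoding="utf-8"))


def is_committed(report: dict) -> bool:
    """True only when the report states the transaction was committed.

    Assumption: `commit_status` sits at the top level of the JSON object.
    """
    return report.get("commit_status") == "committed"


print(is_committed({"commit_status": "committed"}))
```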

5. Run Set A smoke matrix

5a. SQL-based subset (12 checks, automated)

railway run -- python scripts/cutover_smoke_set_a.py

Expected: rc=0, "ALL 12 CHECKS PASSED" message. Any FAIL → investigate before proceeding.

5b. Manual checks (9 of the 21 — operator-executed)

  • A11 Patient login E2E — sign in to the dev-branch patient app as a real patient account. Verify: chat, file upload, match request, document WebSocket all work end-to-end.
  • A12 Idempotency-key replay — capture an idempotency key from before the cutover (SELECT key FROM idempotency_keys LIMIT 1), then curl -X POST … -H "X-Idempotency-Key: <key>" … and confirm the cached response is returned (HTTP 2xx, body matches the original); the cached case_id must resolve on tenant-curaway-patients.
  • A13 Provider portal sanity — sign into the provider portal on the dev branch. List cases. Forwarded cases must be visible (cross-tenant via case_shares).
  • A14 Coordinator portal sanity — sign into the coordinator portal. View any case that just moved. The Performance page (#159) loads (not 403, assuming #514 grants are live).
  • A15 RLS policies — log in as a non-patient-tenant user and attempt a cross-tenant read of patients rows. Expect 403.
  • A16 Langfuse trace — send a patient message via the patient app on dev branch. Inspect the latest Langfuse trace for tenant_id=tenant-curaway-patients in metadata.
  • A17 Background job sanity — trigger any QStash callback (e.g., POST /qstash/document-processed with a moved document_reference id). The callback must find the row on tenant-curaway-patients and complete successfully.
  • A18 Module-constant reload — exec into the dev-branch container, run python -c "from app.routers.public_shared import _DEFAULT_TENANT; print(_DEFAULT_TENANT)". Must print tenant-curaway-patients.
  • A19 Drain-streams works — open the case WebSocket on the dev branch from a real patient session (or a wscat shell with a Clerk JWT): wss://<dev-branch>/api/v1/cases/<id>/ws?token=<jwt>. In a separate shell: curl -X POST -H "Authorization: Bearer <super-admin-jwt>" https://<dev-branch>/api/v1/admin/system/drain-streams. The WebSocket must emit {"type":"drained","message":"…"} and close within 1 second. (Verify in browser DevTools → Network → WS frames panel.) The retired SSE handlers /messages/stream + /chat/stream + /documents/stream were removed in #586 — don't use them.
  • A20-active Cleanup sweep moves a straggler — the smoke runner's automated A20 only proves no straggler EXISTS post-rehearsal. To prove the cleanup mechanism can move one, do this:
      1. Insert a synthetic patient-classified event onto apollo that simulates a lost-write from a long-lived SSE: INSERT INTO events (id, event_type, tenant_id, payload, source_service, created_at) VALUES (gen_random_uuid()::text, 'intake.started', 'tenant-apollo-001', '{"actor": "rehearsal:synthetic-straggler"}'::jsonb, 'rehearsal', NOW());
      2. Confirm the synthetic row exists on apollo: SELECT count(*) FROM events WHERE tenant_id='tenant-apollo-001' AND payload->>'actor' = 'rehearsal:synthetic-straggler' → returns 1.
      3. Run the cleanup predicate (mirrors the migration's events UPDATE): UPDATE events SET tenant_id='tenant-curaway-patients' WHERE tenant_id='tenant-apollo-001' AND event_type IN ('intake.started', /* ...patient lifecycle types from EVENT_TYPE_CLASSIFICATION... */).
      4. Confirm the synthetic row is now on the patient tenant: same SELECT but tenant_id='tenant-curaway-patients' → returns 1.
      5. Re-run the smoke runner. A20 must still PASS.
      6. Cleanup: DELETE FROM events WHERE payload->>'actor' = 'rehearsal:synthetic-straggler' so the rehearsal artifact doesn't pollute future audits.
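The A20-active sequence (insert, confirm, sweep, confirm, clean up) can be dry-practised locally before touching the dev branch. The sketch below is a simulation only: an in-memory SQLite table stands in for the Postgres events table, the payload is stored as JSON text and filtered in Python instead of with payload->>'actor', and PATIENT_EVENT_TYPES contains just the one event type named in this runbook rather than the full EVENT_TYPE_CLASSIFICATION set.

```python
import json
import sqlite3

APOLLO = "tenant-apollo-001"
PATIENTS = "tenant-curaway-patients"
ACTOR = "rehearsal:synthetic-straggler"
# Stand-in for the patient lifecycle types; only this one is named in the runbook.
PATIENT_EVENT_TYPES = ("intake.started",)


def count_synthetic(conn, tenant_id: str) -> int:
    """Count synthetic straggler rows currently on `tenant_id`."""
    rows = conn.execute("SELECT payload FROM events WHERE tenant_id = ?", (tenant_id,))
    return sum(1 for (payload,) in rows if json.loads(payload).get("actor") == ACTOR)


def rehearse_sweep():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events (id TEXT PRIMARY KEY, event_type TEXT, "
        "tenant_id TEXT, payload TEXT, source_service TEXT)"
    )
    # Step 1: the synthetic straggler lands on apollo.
    conn.execute(
        "INSERT INTO events VALUES (?, ?, ?, ?, ?)",
        ("evt-1", "intake.started", APOLLO, json.dumps({"actor": ACTOR}), "rehearsal"),
    )
    before = count_synthetic(conn, APOLLO)  # step 2
    # Step 3: the cleanup predicate moves patient-classified rows off apollo.
    marks = ",".join("?" for _ in PATIENT_EVENT_TYPES)
    conn.execute(
        f"UPDATE events SET tenant_id = ? WHERE tenant_id = ? AND event_type IN ({marks})",
        (PATIENTS, APOLLO, *PATIENT_EVENT_TYPES),
    )
    on_apollo = count_synthetic(conn, APOLLO)
    on_patients = count_synthetic(conn, PATIENTS)  # step 4
    # Step 6: remove the rehearsal artifact.
    ids = [row_id for row_id, payload in conn.execute("SELECT id, payload FROM events")
           if json.loads(payload).get("actor") == ACTOR]
    conn.executemany("DELETE FROM events WHERE id = ?", [(row_id,) for row_id in ids])
    remaining = conn.execute("SELECT count(*) FROM events").fetchone()[0]
    conn.close()
    return before, on_apollo, on_patients, remaining


print(rehearse_sweep())  # (1, 0, 1, 0)
```

The returned tuple mirrors the expected observations: one row on apollo before the sweep, none after, one on the patient tenant, and nothing left once cleanup has run.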

6. Document the rehearsal

Capture in the operator log: snapshot timestamp, restore wall-clock, --execute duration, smoke A1-A21 results (paste the smoke runner output + the manual-check observations), report file path.

If all 21 checks pass: announce the prod maintenance window per the runbook execution sequence above.

If any check fails: do NOT proceed. Update the relevant code/spec/classifier, ship through review, re-rehearse on a fresh dev branch.
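The go/no-go rule above reduces to "all 21 checks recorded as passed, with nothing missing". A sketch of that tally follows, assuming check ids run A1 to A21, that A11 to A19 are the manual checks listed in 5b, and that the remaining twelve are the automated subset (an inference from the counts in this runbook). The function go_no_go is hypothetical.

```python
ALL_CHECKS = [f"A{n}" for n in range(1, 22)]   # A1..A21
MANUAL = [f"A{n}" for n in range(11, 20)]      # A11..A19, operator-executed
AUTOMATED = [check for check in ALL_CHECKS if check not in MANUAL]


def go_no_go(results: dict):
    """Return (go, blockers). A missing or non-True result counts as a blocker."""
    blockers = [check for check in ALL_CHECKS if results.get(check) is not True]
    return (not blockers, blockers)


all_green = {check: True for check in ALL_CHECKS}
print(go_no_go(all_green))
```

Treating a missing result as a failure keeps a skipped manual check from being read as a pass.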


Phase 2b — apollo rename + UUID-PK conversion (deferred)

Per ADR-0023 lazy-migration policy + #426 prerequisites. Bundled scope:

  • Rename tenant-apollo-001 → tenant-provider-apollo-hospitals (or similar)
  • Convert tenant-apollo-001 PK to UUID
  • Convert tenant-curaway-patients PK to UUID
  • Update remaining hardcoded references

The spec will be written when this phase is scheduled. Estimated additional downtime: 5–8 min. Operator-facing UI must already render the slug, not the UUID, before this window (file as a sub-issue).


Phase 3 — cleanup (deferred)

  • Retire the _ORG_TENANT_MAP_FALLBACK patient-org dict entry entirely (Phase 2a flips its value but the entry stays as defense-in-depth).
  • Optional: rename tenant-apollo-001 if not done in Phase 2b.