MSO Video — Daily.co Operations Runbook

Phase 2a.2 (ADR-0018) ships the Daily.co integration for MSO teleconsultation. This runbook is the operational reference for app/services/video_room_service.py + app/services/video_room_pii_filter.py + the QStash cron handlers in app/jobs/.

Related: ADR-0018 §G, ADR-0025, docs/specs/mso-teleconsultation-feature.md.

1. Daily.co API key rotation

Keys live in Railway as DAILY_API_KEY. To rotate:

  1. Log in to the Daily.co dashboard → Developers → API keys.
  2. Generate a new key. Do NOT delete the old key yet.
  3. Update DAILY_API_KEY in Railway (railway variables set ...) on the backend service. Railway redeploys automatically.
  4. Smoke check after deploy: hit /api/v1/admin/health/external-deps and confirm daily_co.last_check_ok=true. Alternatively, run the smoke test in §4.
  5. Delete the old key from the Daily.co dashboard.

If rotation breaks production: roll the env var back; Railway redeploys in ~60s.
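
The smoke check in step 4 could be scripted roughly as follows. This is a minimal sketch, not the project's tooling: it assumes the health endpoint returns JSON that nests the flag as {"daily_co": {"last_check_ok": ...}} and that it accepts a bearer token. The helper names, the response shape, and the auth scheme are all assumptions.

```python
import json
import urllib.request


def daily_check_ok(payload: dict) -> bool:
    """Return True only if daily_co.last_check_ok is literally true.

    Assumes the payload shape {"daily_co": {"last_check_ok": ...}};
    the real health response may be structured differently.
    """
    daily = payload.get("daily_co")
    return isinstance(daily, dict) and daily.get("last_check_ok") is True


def fetch_health(base_url: str, token: str) -> dict:
    """GET the external-deps health endpoint (bearer auth is an assumption)."""
    req = urllib.request.Request(
        f"{base_url}/api/v1/admin/health/external-deps",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)
```

A post-rotation check would then be daily_check_ok(fetch_health(base_url, token)). Treating anything other than a literal true as a failure keeps a missing or renamed field from passing silently.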

2. Verifying recording is disabled

ADR-0025 commits to "no recording, ever". Three layers enforce this:

  1. enable_recording=false is set on every POST /rooms call.
  2. The round-trip config.enable_recording field is asserted in VideoRoomService.create_room. A non-false value triggers an immediate destroy_room and raises VideoRoomProvisionFailedError.
  3. teleconsultation_sessions.recording_url has a CHECK (... IS NULL) constraint at the DB level (Phase 2a.1).
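
The first two layers amount to a create-then-verify pattern, which might look like the sketch below. RoomClient is a hypothetical stand-in for whatever HTTP client the service uses, and the exception class here only mirrors the real one by name. Treating a missing field as a violation is an assumption of this sketch.

```python
from typing import Any, Protocol


class VideoRoomProvisionFailedError(RuntimeError):
    """Stand-in for the service's error of the same name."""


class RoomClient(Protocol):
    """Hypothetical thin client over the provider's REST API."""

    def create_room(self, name: str, properties: dict[str, Any]) -> dict[str, Any]: ...

    def destroy_room(self, name: str) -> None: ...


def create_room_checked(client: RoomClient, name: str) -> dict[str, Any]:
    """Create a room with recording off, then verify the echoed config."""
    room = client.create_room(name, {"enable_recording": False})
    echoed = room.get("config", {}).get("enable_recording")
    if echoed is not False:
        # Any non-false value (here also a missing field) is a violation:
        # tear the room down before surfacing the error.
        client.destroy_room(name)
        raise VideoRoomProvisionFailedError(
            f"room {name!r} came back with enable_recording={echoed!r}"
        )
    return room
```

Destroying before raising means a failed check never leaves a recording-capable room behind, even if the caller swallows the exception.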

Manual verification:

  • Daily.co dashboard → Rooms → select the room → Configuration → confirm "Recording" is disabled.
  • Or call assert_recording_disabled(daily_room_id) from a tenant-scoped admin shell.
  • Phase 2a.8 ships a CI smoke test that asserts this on every main-branch build.

If a room is ever found with recording enabled:

  1. Destroy it immediately (destroy_room).
  2. File a P0 incident; this is an ADR-0025 violation.
  3. Audit the Daily.co console for any other rooms with recording enabled, then rotate the API key (§1).

3. HIPAA tier upgrade path

The MVP runs on Daily.co's non-HIPAA tier. The runtime check warns in production once a tenant has patient rows (Gate 3). Upgrade flow:

  1. Confirm the Phase 2a.8 smoke is green AND SD has completed one internal test session end-to-end.
  2. Subscribe to Daily.co's HIPAA tier (BAA execution required).
  3. Flip Flagsmith flag daily_hipaa_tier_enabled to true.
  4. Flip Flagsmith flag mso_post_launch to true immediately after.
  5. Confirm VideoRoomService.create_room logs no warnings on the next booking (the WARN-only path is now bypassed).

If mso_post_launch=true is flipped before daily_hipaa_tier_enabled=true, the next booking raises VideoHipaaTierRequiredError (503 VIDEO_HIPAA_TIER_REQUIRED_001) and a CRITICAL Telegram alert fires. Resolution: flip daily_hipaa_tier_enabled=true or revert mso_post_launch=false.
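
The interaction between the two flags can be summarized as a small decision function. This is an illustrative sketch of the behaviour described above, not the code in VideoRoomService: the function name, the enum, and the has_patient_rows parameter are hypothetical, and the production-only condition on the warning is left out for brevity.

```python
from enum import Enum


class HipaaGate(Enum):
    OK = "ok"        # provision silently
    WARN = "warn"    # provision, but log a warning
    BLOCK = "block"  # raise VideoHipaaTierRequiredError (503)


def hipaa_gate(
    *, hipaa_tier_enabled: bool, post_launch: bool, has_patient_rows: bool
) -> HipaaGate:
    """Map the flag combination to the runtime check's outcome."""
    if hipaa_tier_enabled:
        # On the HIPAA tier the check is bypassed entirely.
        return HipaaGate.OK
    if post_launch:
        # Post-launch without the HIPAA tier is a hard block.
        return HipaaGate.BLOCK
    if has_patient_rows:
        # Pre-launch with real patient data: warn only.
        return HipaaGate.WARN
    return HipaaGate.OK
```

Read top to bottom, the function shows why the flip order matters: with daily_hipaa_tier_enabled still false, turning on mso_post_launch moves every booking from WARN to BLOCK.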

4. Smoke test (post-deploy)

Run from a shell with DAILY_API_KEY exported (use a sandbox key in non-prod). All commands assume mso_video_enabled=true is flipped for the calling tenant in Flagsmith.

# 1. Create a test room (fixed placeholder consultation_id)
CID="00000000-0000-4000-8000-000000000001"
curl -X POST https://api.curaway.ai/api/v1/internal/_test/video/create \
  -H "X-Tenant-ID: tenant-mso-panel" \
  -H "Authorization: Bearer $INTERNAL_API_SECRET" \
  -H "Content-Type: application/json" \
  -d "{\"consultation_id\": \"$CID\", \"scheduled_for\": \"2026-12-31T10:00:00Z\"}"

# Expected: 200 with {daily_room_id: "consult-00000000", daily_meeting_url: ...}

# 2. Confirm recording is disabled
curl "https://api.curaway.ai/api/v1/internal/_test/video/assert-recording-disabled?room=consult-00000000" \
  -H "Authorization: Bearer $INTERNAL_API_SECRET"

# Expected: 200 with {ok: true}

# 3. Destroy the room
curl -X DELETE https://api.curaway.ai/api/v1/internal/_test/video/consult-00000000 \
  -H "Authorization: Bearer $INTERNAL_API_SECRET"

# Expected: 200 (or 404 if already destroyed — both are success)

The internal test endpoints above are gated by INTERNAL_API_SECRET and ship in Phase 2a.8. Until then, run the smoke from a Python REPL attached to the Railway shell.

5. Telegram alert recipients

  • daily_co_outage — fires on persistent Daily.co API failure. Routes to the standard alert chat (TELEGRAM_ALERT_CHAT_ID). Severity: WARNING.
  • daily_hipaa_tier_required — fires when a post-launch tenant tries to provision without HIPAA tier. Severity: CRITICAL.

If alerts go silent: confirm TELEGRAM_BOT_TOKEN + TELEGRAM_ALERT_CHAT_ID are set on Railway and that the Telegram bot is still in the chat.

6. Circuit breaker

VideoRoomService uses a module-level circuit breaker that opens after 5 failures within 60s and stays open for 30s. To reset it manually from a Python REPL:

from app.services.video_room_service import reset_video_room_service
reset_video_room_service()

This drops the singleton and resets the breaker. Useful for clearing state after a transient Daily.co outage.
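
For intuition about the thresholds, a breaker with the same numbers can be sketched as below. This is not the implementation in video_room_service.py: the class name is hypothetical, the sliding-window counting is an assumption, and the sketch closes fully after the cool-down instead of going through a half-open probe state.

```python
import time
from collections import deque
from typing import Callable, Optional


class SlidingWindowBreaker:
    """Opens after max_failures within window_s; stays open for open_s."""

    def __init__(
        self,
        max_failures: int = 5,
        window_s: float = 60.0,
        open_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max_failures = max_failures
        self._window_s = window_s
        self._open_s = open_s
        self._clock = clock
        self._failures = deque()  # timestamps of recent failures
        self._opened_at: Optional[float] = None

    def allow(self) -> bool:
        """Return True if a call may proceed."""
        if self._opened_at is None:
            return True
        if self._clock() - self._opened_at >= self._open_s:
            self.reset()  # cool-down elapsed: close the breaker
            return True
        return False

    def record_failure(self) -> None:
        now = self._clock()
        self._failures.append(now)
        # Drop failures that have aged out of the window.
        while self._failures and now - self._failures[0] > self._window_s:
            self._failures.popleft()
        if len(self._failures) >= self._max_failures:
            self._opened_at = now

    def reset(self) -> None:
        """Manual reset: forget failures and close the breaker."""
        self._failures.clear()
        self._opened_at = None
```

The injectable clock is only there so the timing can be exercised in tests without sleeping.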

7. PII boundary violations

app/services/video_room_pii_filter.py raises VideoRoomPIIBoundaryViolation when a Daily.co payload would leak PII. This is a hard fail. If you see one in logs:

  1. Check the violation list in the exception detail.
  2. Identify the calling site — likely a new field added to a Daily.co payload that contains a name/email/etc.
  3. Fix the call site to use the role labels (Doctor/Patient/Companion) and consult-XXXXXXXX room name format.
  4. Add a regression test in tests/test_video_room_pii_filter.py.
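
As a rough picture of what such a boundary check enforces, consider the sketch below. It is not the contents of video_room_pii_filter.py: check_payload and its parameters are hypothetical, the exception class only shares the real one's name, the email regex is deliberately crude, and reading XXXXXXXX as eight lowercase hex characters is an assumption based on the consult-00000000 example in §4.

```python
import re

ALLOWED_ROLE_LABELS = frozenset({"Doctor", "Patient", "Companion"})
# Assumption: XXXXXXXX is eight lowercase hex characters.
ROOM_NAME_RE = re.compile(r"consult-[0-9a-f]{8}")
# Crude email detector, for illustration only.
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


class VideoRoomPIIBoundaryViolation(ValueError):
    """Stand-in for the filter's exception; carries the violation list."""

    def __init__(self, violations: list[str]) -> None:
        super().__init__("; ".join(violations))
        self.violations = violations


def check_payload(room_name: str, user_name: str, extra: dict[str, str]) -> None:
    """Raise if any outbound field falls outside the allowed formats."""
    violations: list[str] = []
    if not ROOM_NAME_RE.fullmatch(room_name):
        violations.append("room_name does not match consult-XXXXXXXX")
    if user_name not in ALLOWED_ROLE_LABELS:
        violations.append("user_name is not one of the role labels")
    for key, value in extra.items():
        if EMAIL_RE.search(value):
            violations.append(f"field {key!r} looks like an email address")
    if violations:
        raise VideoRoomPIIBoundaryViolation(violations)
```

The violation messages name the offending field but never echo its value, so the exception itself does not carry PII into the logs.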

Pre-launch Rollout Checklist (Phase 2a.8 → MSO Video Live)

Run these steps in order. Do not skip or reorder.

  1. [ ] Verify mso_video_enabled is false in production Flagsmith
  2. [ ] Verify mso_post_launch is false in production Flagsmith
  3. [ ] Deploy this PR to production. Migration head should be d6e7f8a9b1c2 (no new migrations from 2a.8 — wiring + tests only)
  4. [ ] Register QStash cron schedules (idempotent — safe to re-run):
    # From Railway shell or locally with production env vars:
    python -m app.register_schedules
    # Confirms: mso-room-provision (*/5) + mso-room-destroy (*/5) registered
    
    Or via admin API: POST /api/v1/internal/schedules (requires X-Internal-Secret)
  5. [ ] Flip mso_lifecycle_cron_enabled to true in dev environment first; verify cron fires within 5 min (Railway logs: mso-room-provision: enabled=True)
  6. [ ] Flip mso_lifecycle_cron_enabled to true in production
  7. [ ] Run main-smoke workflow manually (workflow_dispatch on main-smoke.yml); verify recording-disabled assertion passes on the run output
  8. [ ] Internal test session (SD as patient OR doctor):
    a. Schedule a session against a test MSO doctor
    b. Verify cron provisions the room T-15min before (watch Railway logs for mso.session.room_provisioned)
    c. Join from both sides; hold for 1+ minute; end the session
    d. Verify charge state in test Stripe / Razorpay dashboard
    e. Confirm recording_url IS NULL on the session row:
    SELECT id, status, recording_url FROM teleconsultation_sessions
    WHERE status = 'completed' ORDER BY ended_at DESC LIMIT 5;
    -- recording_url should be NULL on every row
    
  9. [ ] HIPAA tier upgrade (separate decision, ~$500/mo): When SD is ready, upgrade Daily.co plan to HIPAA-covered tier; flip daily_hipaa_tier_enabled to true in production. Until then, the runtime check is WARN-only (not blocking).
  10. [ ] Flip mso_post_launch to true (escalates the runtime HIPAA check from WARN to CRITICAL+block)
  11. [ ] Flip mso_video_enabled to true in production. MSO video is now live.

Rollback: set mso_video_enabled=false in Flagsmith — instant. Sessions in-flight are unaffected (their room URLs still work until the cron destroys them at 90min or ended_at+60s).
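
To estimate how long an in-flight room stays reachable after a rollback, the destroy rule can be sketched as below. This rests on two assumptions that the runbook does not state: that the 90-minute cap is measured from room creation, and that whichever deadline comes first wins. The function and constant names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional

MAX_ROOM_AGE = timedelta(minutes=90)      # assumed to count from room creation
POST_END_GRACE = timedelta(seconds=60)


def destroy_due_at(
    room_created_at: datetime, ended_at: Optional[datetime] = None
) -> datetime:
    """Earliest time at which the destroy cron would consider the room due."""
    hard_cap = room_created_at + MAX_ROOM_AGE
    if ended_at is None:
        # Session never ended cleanly: only the hard cap applies.
        return hard_cap
    return min(hard_cap, ended_at + POST_END_GRACE)
```

Because mso-room-destroy runs on a */5 schedule, actual teardown can lag the computed time by up to five minutes.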