Test-Data Hygiene Runbook
Epic: https://github.com/curaway-ai/curaway-backend/issues/1194 (D4)
Added: 2026-05-25
The Core Rule
One persona per test account.
Multiple conversations per account are OK (new case_id, same patient_id). Multiple personae require multiple accounts. Persona pollution is a non-recoverable error: quarantine the account and replace it.
A "persona" is a unique combination of:
- Identity (name, age, gender, nationality)
- Clinical presentation (diagnosis, condition, procedure type)
- Intake scenario (caregiver vs. self, language, location)
Once a patient account carries clinical data for more than one persona, the analytics signal is degraded and the account cannot be rehabilitated without deleting historical conversations (which we never do; see Data Governance Constraint below).
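As a rough illustration of the rule, a persona can be treated as a hashable key, and an account is polluted once it carries more than one distinct key. The `Persona` class, its field names, and the sample values (including the language codes) are hypothetical and not taken from the codebase.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Persona:
    """Hypothetical persona key; field names are illustrative only."""
    # Identity
    name: str
    age: Optional[int]
    gender: str
    nationality: str
    # Clinical presentation
    diagnosis: str
    procedure_type: str
    # Intake scenario
    intake_by: str  # "self" or "caregiver"
    language: str
    location: str


def is_polluted(personas_seen: List[Persona]) -> bool:
    """An account is polluted once it carries more than one distinct persona."""
    return len(set(personas_seen)) > 1


# Sample values are assumptions made for this sketch.
maria = Persona("Maria", 45, "F", "ARE", "C50", "ONCO-CHEMO", "self", "en", "Dubai")
meskerem = Persona("Meskerem", 38, "F", "ETH", "Q27", "VASC-001", "self", "en", "Ethiopia")

# Several conversations for the same persona are fine.
assert not is_polluted([maria, maria])
# A second persona on the same account counts as pollution.
assert is_polluted([maria, meskerem])
```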
What Pollution Looks Like
Signs an account has been re-used across personae:
| Signal | Example from demo-patient-aisha-001 |
|---|---|
| > 10 cases | 64 cases across 5 personae |
| Mixed demographics | Maria (45F Dubai) + Abdul Moeed (male Jeddah) + Meskerem (38F Ethiopia) |
| Unrelated ICD chapters | C50 (breast) + C91 (leukaemia) + Q27 (absent iliac vein) |
| Mismatched documents | NGS/HLA reports next to WhatsApp images |
| Stacked FHIR resources | 42 Conditions / 129 Observations from different body systems |
Quarantine Procedure
When you discover a polluted account:
- Do not delete it. Historical conversations must remain inspectable.
- Run the quarantine script (dry-run first):
# Dry-run — prints what would change, no writes
railway run -s curaway --environment production -- \
python scripts/quarantine_polluted_demo_patients.py
# Apply
railway run -s curaway --environment production -- \
python scripts/quarantine_polluted_demo_patients.py --apply
For additional polluted accounts beyond the pre-configured list, pass their external_auth_id values via --extra-ids:
railway run -s curaway --environment production -- \
python scripts/quarantine_polluted_demo_patients.py \
--apply \
--extra-ids demo-patient-xyz-001 demo-patient-abc-002
- Verify the flag was set (is_test_polluted should be true for the quarantined account).
- Create a replacement account using scripts/seed_persona_accounts.py (or add a new function in app/seeds/seed_persona_accounts.py following the existing pattern).
- Update this document: add the quarantined account to the table below.
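The verification step in the procedure above could be scripted roughly as follows. This is a minimal sketch: it assumes the table is named `patients` (inferred from the migration filename) and uses an in-memory SQLite database as a stand-in for production. The `is_quarantined` helper is hypothetical, and a Postgres driver would use `%s` placeholders instead of `?`.

```python
import sqlite3


def is_quarantined(conn, external_auth_id: str) -> bool:
    """Return True if the patient row has is_test_polluted set."""
    row = conn.execute(
        "SELECT is_test_polluted FROM patients WHERE external_auth_id = ?",
        (external_auth_id,),
    ).fetchone()
    if row is None:
        raise LookupError(f"no patient with external_auth_id={external_auth_id!r}")
    # NULL (a row that pre-dates the column) counts as not quarantined.
    return bool(row[0])


# In-memory stand-in for the real database, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patients (external_auth_id TEXT PRIMARY KEY, is_test_polluted BOOLEAN)"
)
conn.executemany(
    "INSERT INTO patients VALUES (?, ?)",
    [("demo-patient-aisha-001", 1), ("demo-patient-maria-001", 0)],
)

assert is_quarantined(conn, "demo-patient-aisha-001")
assert not is_quarantined(conn, "demo-patient-maria-001")
```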
Quarantined Accounts
| external_auth_id | Tenant | Quarantined | Reason | Replacement |
|---|---|---|---|---|
| demo-patient-aisha-001 | tenant-curaway-patients | 2026-05-25 (#1194 D2) | 5 unrelated personae, 64 cases, 39 documents, 42 FHIR Conditions stacked on one demographic record | demo-patient-maria-001, demo-patient-abdul-001, demo-patient-meskerem-001 |
Run scripts/quarantine_polluted_demo_patients.py --apply against production to actually set the DB flag for demo-patient-aisha-001. This PR only adds the column and script; the production write is a manual step for SD.
Clean Persona Accounts (#1194 D3)
Three clean replacement accounts are pre-seeded via scripts/seed_persona_accounts.py:
| external_auth_id | Persona | Demographics | Procedure |
|---|---|---|---|
| demo-patient-maria-001 | Maria | 45F, Dubai (ARE), born 1981-03-15 | ONCO-CHEMO (stage 2 IDC left breast) |
| demo-patient-abdul-001 | Abdul Moeed | male, Jeddah (SAU), DOB unknown | BMT-001 (leukaemia, no subtype) |
| demo-patient-meskerem-001 | Meskerem | 38F, Ethiopia (ETH), born 1988-07-22 | VASC-001 (absent iliac vein) |
Each account ships with:
- One Patient record (demographic only)
- One empty Case (procedure identified, intake not started)
- One ConsentRecord (clinical_data_processing granted)
- No documents, no FHIR resources pre-loaded
Seeding the clean accounts
# Seed all three (idempotent):
railway run -s curaway --environment production -- \
python scripts/seed_persona_accounts.py
# Seed a subset:
railway run -s curaway --environment production -- \
python scripts/seed_persona_accounts.py --persona maria,meskerem
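The two behaviours shown above (idempotent seeding and the comma-separated `--persona` subset) might be implemented along these lines. This is a sketch under stated assumptions, not the actual script: `parse_persona_arg`, `seed`, and the dict-backed `store` are hypothetical stand-ins for the real seed functions and database.

```python
PERSONAS = {
    "maria": "demo-patient-maria-001",
    "abdul": "demo-patient-abdul-001",
    "meskerem": "demo-patient-meskerem-001",
}


def parse_persona_arg(value):
    """Turn 'maria,meskerem' into a validated list; None selects all personas."""
    if value is None:
        return list(PERSONAS)
    names = [n.strip() for n in value.split(",") if n.strip()]
    unknown = [n for n in names if n not in PERSONAS]
    if unknown:
        raise ValueError(f"unknown persona(s): {', '.join(unknown)}")
    return names


def seed(store, names):
    """Create missing accounts only; return the ids created on this run."""
    created = []
    for name in names:
        auth_id = PERSONAS[name]
        if auth_id not in store:  # existing accounts are left untouched
            store[auth_id] = {"persona": name, "cases": 1, "documents": 0}
            created.append(auth_id)
    return created


store = {}
first = seed(store, parse_persona_arg("maria,meskerem"))  # creates two accounts
second = seed(store, parse_persona_arg(None))             # only abdul is missing
third = seed(store, parse_persona_arg(None))              # nothing left to create
```

Re-running the seed is safe because an account that already exists is skipped rather than overwritten, which is what keeps a clean persona account from being reset or polluted by a second run.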
Abdul Moeed: Caregiver Gap
Abdul's scenario involves a mother-as-caregiver flow (Syeda speaks for her son Abdul Moeed). Caregiver relationship modelling is out of scope for this PR. The account is seeded as a plain patient record with an empty DOB. Follow-up tracked in https://github.com/curaway-ai/curaway-backend/issues/1194.
SQL: Filter Out Test and Quarantined Patients
Use both filters in analytics queries and Metabase dashboards:
-- Exclude test patients AND quarantined accounts
WHERE COALESCE(p.metadata->>'is_test', 'false') != 'true'
  -- NULL-safe: rows that pre-date the column (NULL) are kept
  AND COALESCE(p.is_test_polluted, false) = false
Prevent Pollution Going Forward
Follow these rules when creating or using test patient accounts:
- One persona per external_auth_id. Never reuse an account for a different name, nationality, or clinical presentation.
- Use the E2E seed scripts; never create test patients manually in prod:
# All E2E personas (includes tkr, maria, abdul, meskerem):
railway run -s curaway -- python scripts/seed_e2e.py
# Clean persona accounts only:
railway run -s curaway -- python scripts/seed_persona_accounts.py
- If you need a new persona, add a function in app/seeds/seed_persona_accounts.py following the existing pattern and register it in scripts/seed_e2e.py.
- Never re-run an intake flow on an existing test patient to simulate a different clinical scenario; create a new patient account instead.
- Heuristic alert: accounts with more than 10 cases AND FHIR Conditions spanning 3+ ICD-10 chapter groups are flagged as "suspect polluted" by scripts/quarantine_polluted_demo_patients.py. Run the scan periodically.
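The heuristic in the last rule can be sketched as a small predicate. How the real script maps ICD-10 codes to "chapter groups" is not shown in this runbook, so the block ranges below, the fallback to the first letter, and the function names are all assumptions made for illustration.

```python
# Illustrative ICD-10 block ranges only; the real script's grouping may differ.
ICD10_GROUPS = [
    ("C50", "C50", "breast neoplasm"),
    ("C81", "C96", "lymphoid/haematopoietic neoplasm"),
    ("Q20", "Q28", "congenital circulatory malformation"),
]


def icd_group(code: str) -> str:
    """Map an ICD-10 code to a coarse group label (hypothetical mapping)."""
    category = code.split(".")[0].upper()[:3]
    for low, high, label in ICD10_GROUPS:
        if low <= category <= high:
            return label
    return category[:1]  # assumption: unknown codes fall back to their first letter


def is_suspect_polluted(case_count, condition_codes, max_cases=10, min_groups=3):
    """Flag accounts with more than max_cases cases AND min_groups or more groups."""
    groups = {icd_group(code) for code in condition_codes}
    return case_count > max_cases and len(groups) >= min_groups


# The figures for demo-patient-aisha-001 from the table earlier in this runbook.
assert is_suspect_polluted(64, ["C50", "C91", "Q27"])
# Many cases but a single clinical picture is not flagged.
assert not is_suspect_polluted(64, ["C50", "C50"])
```

Both conditions must hold, so a long-running account with one consistent persona is not flagged merely for having many cases.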
Data Governance Constraint
Quarantined accounts are never deleted because:
- Historical conversations contain real AI agent decisions that must remain auditable (GDPR Article 5(1)(e) storage limitation applies to unnecessary data, but audit data is excluded under Article 17(3)(b)).
- The conversations may be needed to reproduce and debug reported clinical agent behaviour issues.
The is_test_polluted flag is the signal to downstream systems (analytics,
matching, cohort analysis) to exclude the account. It does not restrict
read access for authorised operators.
Downstream Readers (Wired, #1194 D5)
The flag is active: five code paths consult is_test_polluted and
quarantined patients are filtered out of admin, cohort, and external-LLM
surfaces. Seven additional sites intentionally do not filter on the
flag — they cover auth, GDPR cascades, per-case workflows, and the
quarantine script itself. The split keeps polluted accounts inspectable
to authorised operators while preventing them from leaking into matching,
analytics, or external MCP consumers.
Sites that filter (exclude polluted)
| Site | How |
|---|---|
| app/repositories/patient_repository.py::list_active | New exclude_polluted: bool = False kwarg; admin/MCP callers opt in |
| app/repositories/admin_person_repository.py::search_all_platform_users | New exclude_polluted: bool = True kwarg (admin default) |
| app/repositories/admin_person_repository.py::count_members_breakdown | Hardcoded exclusion (admin context) |
| app/repositories/admin_person_repository.py::list_all_members | Hardcoded exclusion (admin context) |
| app/mcp/server.py (search_patients tool) | Passes exclude_polluted=True to patient_service.list_patients |
All filters use the NULL-safe form
or_(Patient.is_test_polluted.is_(False), Patient.is_test_polluted.is_(None))
so rows that pre-date the column (NULL after ALTER TABLE) aren't
accidentally hidden.
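The difference between the NULL-safe form and a plain equality check is easy to reproduce. The sketch below uses an in-memory SQLite table rather than the real SQLAlchemy models; the row ids and both functions are hypothetical stand-ins, with `list_active` only mimicking the shape of the `exclude_polluted` kwarg.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (external_auth_id TEXT, is_test_polluted BOOLEAN)")
conn.executemany(
    "INSERT INTO patients VALUES (?, ?)",
    [
        ("clean-new", 0),        # flag explicitly false
        ("clean-legacy", None),  # row that pre-dates the column
        ("polluted", 1),         # quarantined account
    ],
)


def list_active(conn, exclude_polluted=False):
    """Opt-in exclusion using the NULL-safe predicate (false OR NULL)."""
    sql = "SELECT external_auth_id FROM patients"
    if exclude_polluted:
        sql += " WHERE is_test_polluted = 0 OR is_test_polluted IS NULL"
    sql += " ORDER BY external_auth_id"
    return [row[0] for row in conn.execute(sql)]


def list_active_naive(conn):
    """Plain equality: NULL = 0 is not true, so legacy rows silently disappear."""
    sql = (
        "SELECT external_auth_id FROM patients "
        "WHERE is_test_polluted = 0 ORDER BY external_auth_id"
    )
    return [row[0] for row in conn.execute(sql)]


print(list_active(conn, exclude_polluted=True))  # ['clean-legacy', 'clean-new']
print(list_active_naive(conn))                   # ['clean-new']
```

The naive query drops the legacy row because a comparison with NULL evaluates to unknown rather than true, which is the failure mode the NULL-safe form avoids.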
Sites that intentionally do NOT filter
| Site | Rationale |
|---|---|
| patient_repository.get_by_id / get_by_auth_id | Auth flows: a polluted user mid-session must not get a hard 404 |
| app/routers/patients.py sign-up + self-service endpoints | Backward-compat: patients still see their own data |
| app/services/ehr_rebuild_service.py | Per-case context: polluted accounts retain history for debugging |
| app/services/match_service.py | Per-case (case_id scopes the read) |
| app/services/data_subject_handler.py | GDPR Article 17: erasure must succeed on quarantined accounts |
| scripts/quarantine_polluted_demo_patients.py | The script itself sets the flag, so it must read polluted rows |
To regenerate the audit, search for Patient.is_test_polluted,
PatientRepository, and list_active callers and re-classify any new
ones.
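The search described above can be done with any grep-like tool. As one possible sketch, the standard-library script below walks a source tree and lists every line that mentions one of the three identifiers. The `find_call_sites` helper is hypothetical, and classifying each hit as filtering or non-filtering remains a manual step.

```python
import re
from pathlib import Path

# The three identifiers named in the audit instructions above.
PATTERN = re.compile(r"Patient\.is_test_polluted|PatientRepository|list_active")


def find_call_sites(root):
    """Return (path, line number, stripped line) for every matching line under root."""
    hits = []
    for path in sorted(Path(root).rglob("*.py")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if PATTERN.search(line):
                hits.append((str(path), lineno, line.strip()))
    return hits
```

Calling `find_call_sites("app")` from the repository root would produce the list of candidate sites to compare against the two tables in this section.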
Related Files
| File | Purpose |
|---|---|
| app/models/patient.py | is_test_polluted boolean column |
| alembic/versions/b1c2d3e4f5a8_add_is_test_polluted_to_patients.py | Migration that adds the column |
| scripts/quarantine_polluted_demo_patients.py | One-off quarantine script |
| app/seeds/seed_persona_accounts.py | Clean persona seed functions |
| scripts/seed_persona_accounts.py | Standalone seed script |
| scripts/seed_e2e.py | E2E seed entry point (now includes maria/abdul/meskerem) |
| app/repositories/patient_repository.py | list_active(exclude_polluted=...) (D5) |
| app/repositories/admin_person_repository.py | Admin reads with quarantine filter (D5) |
| app/mcp/server.py | MCP search_patients excludes polluted (D5) |
| tests/test_is_test_polluted_downstream_readers.py | D5 reader coverage |
| docs/runbook/test-data.md | Legacy test-data tagging runbook (is_test flag) |
| docs/runbook/case-id-scoping.md | CI scanner runbook: prevents new ungated list_by_patient callers from re-introducing the cross-case LLM context leak that motivated the pollution discovery |