Document Scoping — Current State and Future Plan¶
Current State (MVP)¶
Documents are case-scoped by a timestamp filter — uploaded_at >= case.created_at.
This is enforced in three places that ALL query document_references:
| Query | File | Filter |
|---|---|---|
| Document checklist | app/routers/cases.py:945 | patient_id + tenant_id + uploaded_at >= case.created_at |
| EHR rebuild | app/services/ehr_rebuild_service.py:85 | patient_id + tenant_id + analysis_status='completed' + uploaded_at >= case.created_at |
| Patient state coverage | app/services/patient_state.py:166 | patient_id + uploaded_at >= case.created_at |
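The filter shared by these query sites can be sketched against an in-memory SQLite database. This is only an illustration under assumptions: the two-table schema is a hypothetical reduction of the real one, and documents_for_case is an illustrative helper, not a function from the codebase.

```python
import sqlite3

# Hypothetical minimal schema; the real tables have more columns.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cases (
    id TEXT PRIMARY KEY, patient_id TEXT, tenant_id TEXT, created_at TEXT
);
CREATE TABLE document_references (
    id TEXT PRIMARY KEY, patient_id TEXT, tenant_id TEXT, uploaded_at TEXT
);
INSERT INTO cases VALUES ('case-new', 'p1', 't1', '2025-03-01T09:00:00');
-- Two documents uploaded before the case existed, one uploaded after.
INSERT INTO document_references VALUES ('doc-1', 'p1', 't1', '2025-01-02T10:00:00');
INSERT INTO document_references VALUES ('doc-2', 'p1', 't1', '2025-02-10T14:30:00');
INSERT INTO document_references VALUES ('doc-3', 'p1', 't1', '2025-03-01T09:05:00');
""")


def documents_for_case(conn, case_id):
    """Return ids of documents visible to a case under the timestamp proxy."""
    rows = conn.execute(
        """
        SELECT d.id
        FROM document_references AS d
        JOIN cases AS c
          ON c.patient_id = d.patient_id AND c.tenant_id = d.tenant_id
        WHERE c.id = :case_id
          AND d.uploaded_at >= c.created_at  -- ISO-8601 strings sort chronologically
        ORDER BY d.uploaded_at
        """,
        {"case_id": case_id},
    ).fetchall()
    return [row[0] for row in rows]


visible = documents_for_case(conn, "case-new")
print(visible)
```

Only the document uploaded after the case was created is returned; the two earlier uploads for the same patient are filtered out.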
Why timestamp filter (proxy)¶
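When the MVP filter was written, the document_references table had no case_id column. A nullable column has since been added (see the Shipped section below), but moving fully to case_id scoping requires: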
- Schema migration to add case_id (nullable to start)
- Backfill existing rows by inferring from uploaded_at → matching case
- Update presign + confirm endpoints to accept and store case_id
- Update frontend upload flow to pass case_id
- Migrate all 3 query sites from timestamp filter to case_id filter
- Drop the timestamp filter and make case_id NOT NULL
For the MVP demo, the timestamp proxy is reliable enough: the only edge case is two cases created within the same second, which does not happen in the real demo flow.
Why case-scoped (not patient-scoped)¶
We want EHR + checklist + patient state to all show the SAME picture for a given case. Patient-scoped queries caused this bug:
User created a new case → progress rail showed "4 of 15 mandatory" docs already uploaded (because previous cases for the same patient had docs). But the EHR was empty (because it's snapshot-stored on the case row). The mismatch confused users and looked broken.
Case-scoping makes the inconsistency impossible: a fresh case has 0 documents, 0 conditions extracted from docs, and 0 lab observations. Clean slate.
Cross-Case Document Reuse — Existing Mechanism¶
Patients shouldn't have to re-upload their X-ray for every new case. The
_check_existing_records() function in case_orchestrator.py already
handles this:
reusable = await _check_existing_records(db, patient_id, tenant_id, case)
if reusable:
    # Show patient: "I found previous records that are still valid:
    # SD_20250209.pdf, knee_xray.pdf. Use these for this case?"
    ...
This fires on the records-first turn. The user sees previous records and gets to opt in. Explicit consent, not silent inheritance.
Shipped: case_id Column Migration (Session 39, PR #152)¶
Phases 1–2 shipped in Session 39 via Alembic migration a2b3c4d5e6f7.
148 existing document_references rows were backfilled using the timestamp-inference
logic described below. The column stays nullable until Phase 3 (upload flow refactor)
is complete. Phases 3–5 remain as planned below.
Future Plan — Proper case_id Column (Phases 3–5 remaining)¶
Phase 1: Schema migration (post-MVP) — Shipped (Session 39, PR #152)¶
Migration a2b3c4d5e6f7 ran in production. 148 documents backfilled.
ALTER TABLE document_references
    ADD COLUMN case_id VARCHAR(36) REFERENCES cases(id);
CREATE INDEX idx_document_references_case_id
    ON document_references(case_id);
Make case_id nullable initially so existing rows don't break.
Phase 2: Backfill¶
For each existing document_references row:
- Find the case where case.patient_id = doc.patient_id AND case.created_at <= doc.uploaded_at AND (next case for that patient does not exist OR doc.uploaded_at < next_case.created_at)
- Set doc.case_id = matched_case.id
Rows that can't be matched (e.g., orphaned uploads) get case_id = NULL
and are excluded from queries.
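The matching rule above can be sketched in plain Python. This is a minimal illustration, not the shipped backfill code: the Case dataclass and infer_case_id are hypothetical names, and the sketch assumes cases for one patient do not share a created_at value.

```python
from bisect import bisect_right
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass(frozen=True)
class Case:
    """Hypothetical stand-in for a row of the cases table."""
    id: str
    patient_id: str
    created_at: datetime


def infer_case_id(
    patient_id: str, uploaded_at: datetime, cases: Iterable[Case]
) -> Optional[str]:
    """Return the latest case for the patient created at or before the upload.

    Returns None when no case matches, which corresponds to case_id = NULL.
    """
    patient_cases = sorted(
        (c for c in cases if c.patient_id == patient_id),
        key=lambda c: c.created_at,
    )
    created = [c.created_at for c in patient_cases]
    # bisect_right counts the cases with created_at <= uploaded_at, so the
    # case at idx - 1 satisfies: created_at <= uploaded_at < next.created_at.
    idx = bisect_right(created, uploaded_at)
    return patient_cases[idx - 1].id if idx else None


cases = [
    Case("case-a", "p1", datetime(2025, 1, 1)),
    Case("case-b", "p1", datetime(2025, 3, 1)),
    Case("case-c", "p2", datetime(2025, 2, 1)),
]
print(infer_case_id("p1", datetime(2025, 2, 1), cases))   # between case-a and case-b
print(infer_case_id("p1", datetime(2024, 12, 1), cases))  # before any case
```

An upload that predates every case for its patient has no match and stays unassigned, as described for orphaned uploads.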
Phase 3: Update upload flow¶
- Frontend uploadFileToR2() accepts caseId parameter
- Backend presign endpoint accepts case_id in request body
- Backend confirm endpoint stores it on the document record
- All 3 query sites switch from uploaded_at >= case_created to case_id = :case_id
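One way the backend side could guard against uploads that omit case_id is to validate the presign request body before doing anything else. The sketch below is framework-agnostic and rests on assumptions: PresignRequest, parse_presign_request, and the filename field are hypothetical, and the real endpoint's request shape may differ.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class PresignRequest:
    """Hypothetical request shape; only case_id comes from the plan above."""
    filename: str
    case_id: str


def parse_presign_request(body: Mapping[str, Any]) -> PresignRequest:
    """Validate a presign request body, rejecting uploads without a case_id."""
    case_id = body.get("case_id")
    if not isinstance(case_id, str) or not case_id.strip():
        # Rejecting here keeps unscoped documents from being created.
        raise ValueError("case_id is required for document uploads")
    filename = body.get("filename")
    if not isinstance(filename, str) or not filename:
        raise ValueError("filename is required")
    return PresignRequest(filename=filename, case_id=case_id)


request = parse_presign_request({"filename": "knee_xray.pdf", "case_id": "case-1"})
print(request)
```

Failing fast at the presign step means an upload path that forgets to pass case_id surfaces as an error rather than as a document with no case.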
Phase 4: Make case_id NOT NULL¶
Once backfill is complete and all upload paths set case_id, change the
column to NOT NULL and drop the timestamp filter from queries.
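Because unmatched rows keep case_id = NULL after the backfill, a pre-flight count of such rows is a natural check before tightening the constraint. The sketch below uses SQLite and a hypothetical two-column table purely for illustration; count_unscoped_documents is not an existing helper.

```python
import sqlite3


def count_unscoped_documents(conn: sqlite3.Connection) -> int:
    """Count document_references rows that still have no case_id."""
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM document_references WHERE case_id IS NULL"
    ).fetchone()
    return count


# Demo with a hypothetical reduced table; the real table has more columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE document_references (id TEXT PRIMARY KEY, case_id TEXT)")
conn.executemany(
    "INSERT INTO document_references VALUES (?, ?)",
    [("doc-1", "case-1"), ("doc-2", None)],  # doc-2 stands for an unmatched upload
)

remaining = count_unscoped_documents(conn)
print(f"{remaining} row(s) still have case_id = NULL")
```

A non-zero count means the NOT NULL change would fail or those rows need to be resolved first.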
Phase 5: Cross-case reuse via explicit linking¶
Add a join table for explicit cross-case document reuse:
CREATE TABLE case_document_links (
    id VARCHAR(36) PRIMARY KEY,
    case_id VARCHAR(36) REFERENCES cases(id),
    document_id VARCHAR(36) REFERENCES document_references(id),
    linked_at TIMESTAMP NOT NULL,
    linked_by VARCHAR(36), -- patient or admin
    UNIQUE(case_id, document_id)
);
When the patient clicks "use this previous record for the new case", we
INSERT a row in case_document_links instead of duplicating the document.
Queries union the documents owned by the case (via case_id) and the
documents linked to the case (via case_document_links).
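The union of owned and linked documents could look like the following sketch, run against in-memory SQLite. The reduced document_references table and the helper name documents_owned_or_linked are assumptions made for the example; only the case_document_links columns follow the DDL above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical reduced table; the real one has more columns.
CREATE TABLE document_references (id TEXT PRIMARY KEY, case_id TEXT);
CREATE TABLE case_document_links (
    id TEXT PRIMARY KEY,
    case_id TEXT,
    document_id TEXT,
    linked_at TEXT NOT NULL,
    linked_by TEXT,
    UNIQUE(case_id, document_id)
);
INSERT INTO document_references VALUES ('knee_xray', 'case-1');
INSERT INTO document_references VALUES ('lab_report', 'case-2');
-- The patient opted to reuse the X-ray from case-1 in case-2.
INSERT INTO case_document_links
    VALUES ('link-1', 'case-2', 'knee_xray', '2025-03-01T09:10:00', 'patient-1');
""")


def documents_owned_or_linked(conn, case_id):
    """Return ids of documents owned by the case or explicitly linked to it."""
    rows = conn.execute(
        """
        SELECT id FROM document_references WHERE case_id = :case_id
        UNION
        SELECT document_id FROM case_document_links WHERE case_id = :case_id
        ORDER BY 1
        """,
        {"case_id": case_id},
    ).fetchall()
    return [row[0] for row in rows]


print(documents_owned_or_linked(conn, "case-2"))
print(documents_owned_or_linked(conn, "case-1"))
```

UNION (rather than UNION ALL) removes duplicates, so a document that is both owned by and linked to the same case appears once.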
Migration Risks¶
- Backfill correctness: the timestamp inference may be ambiguous if a patient has overlapping cases (rare but possible). Manual review needed.
- Frontend coordination: every upload path needs to pass case_id. Missing one means orphaned documents.
- Query consistency: all 3 sites must migrate at the same time. Mixing case_id and timestamp filters leads to the same bug we just fixed.
Acceptance Criteria¶
Phases 1–4 are complete when:
- All tests in pytest tests/test_document_scoping.py (new test file) pass
- A new case shows 0 documents in checklist + 0 in EHR + 0 in patient state
- A patient with 5 documents from 3 cases shows the right docs per case
- The cross-case reuse flow (_check_existing_records) still works
- E2E test e2e/conversation-regression.spec.ts::EHR data flow passes
Status¶
| Phase | Status |
|---|---|
| MVP timestamp filter (3 query sites) | Complete |
| Phase 1: case_id column migration | Complete — migration a2b3c4d5e6f7, Session 39 PR #152 |
| Phase 2: Backfill | Complete — 148 documents backfilled |
| Phase 3: Upload flow refactor | Not started |
| Phase 4: NOT NULL + drop timestamp filter | Not started |
| Phase 5: Cross-case linking table | Not started |