Document Scoping: Current State and Future Plan

Current State (MVP)

Documents are case-scoped by a timestamp filter — uploaded_at >= case.created_at.

This is enforced in three places that ALL query document_references:

| Query | File | Filter |
| --- | --- | --- |
| Document checklist | `app/routers/cases.py:945` | `patient_id + tenant_id + uploaded_at >= case.created_at` |
| EHR rebuild | `app/services/ehr_rebuild_service.py:85` | `patient_id + tenant_id + analysis_status='completed' + uploaded_at >= case.created_at` |
| Patient state coverage | `app/services/patient_state.py:166` | `patient_id + uploaded_at >= case.created_at` |
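
All three filters share one shape. The sketch below reproduces it against an in-memory SQLite database. The two-table schema, the `file_name` column, and the `docs_for_case` helper are assumptions made for illustration, not the code at the paths listed above.

```python
import sqlite3


def docs_for_case(conn, case_id):
    """Return file names scoped to a case via the timestamp proxy."""
    rows = conn.execute(
        """
        SELECT d.file_name
        FROM document_references AS d
        JOIN cases AS c
          ON c.patient_id = d.patient_id AND c.tenant_id = d.tenant_id
        WHERE c.id = ? AND d.uploaded_at >= c.created_at
        ORDER BY d.uploaded_at
        """,
        (case_id,),
    ).fetchall()
    return [row[0] for row in rows]


# Hypothetical data: one patient, two cases, one upload after each case.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cases (id TEXT PRIMARY KEY, patient_id TEXT, tenant_id TEXT, created_at TEXT);
CREATE TABLE document_references (
    id TEXT PRIMARY KEY, patient_id TEXT, tenant_id TEXT, file_name TEXT, uploaded_at TEXT
);
INSERT INTO cases VALUES ('case-1', 'p1', 't1', '2025-01-01T00:00:00');
INSERT INTO cases VALUES ('case-2', 'p1', 't1', '2025-03-01T00:00:00');
INSERT INTO document_references VALUES ('d1', 'p1', 't1', 'old_xray.pdf', '2025-01-05T10:00:00');
INSERT INTO document_references VALUES ('d2', 'p1', 't1', 'new_labs.pdf', '2025-03-02T09:00:00');
""")

newest_case_docs = docs_for_case(conn, "case-2")  # only the upload made after case-2 was created
```

The proxy is a lower bound only: in this sketch `case-1` also matches the later upload. The Phase 2 backfill rule, by contrast, also bounds the upload time by the next case's created_at.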

Why timestamp filter (proxy)

At MVP time the document_references table had no case_id column (one has since been added; see the Shipped section below). Moving fully to case_id scoping requires:

1. Schema migration to add case_id (nullable to start)
2. Backfill existing rows by inferring from uploaded_at → matching case
3. Update presign + confirm endpoints to accept and store case_id
4. Update frontend upload flow to pass case_id
5. Migrate all 3 query sites from timestamp filter to case_id filter
6. Drop the timestamp filter and make case_id NOT NULL

For the MVP demo, the timestamp proxy is reliable enough — the only edge case is two cases created within the same second, which doesn't happen in real demo flow.

Why case-scoped (not patient-scoped)

We want EHR + checklist + patient state to all show the SAME picture for a given case. Patient-scoped queries caused this bug:

User created a new case → progress rail showed "4 of 15 mandatory" docs already uploaded (because previous cases for the same patient had docs). But the EHR was empty (because it's snapshot-stored on the case row). Mismatch confused users and looked broken.

Case-scoping makes the inconsistency impossible: a fresh case has 0 documents, 0 conditions extracted from docs, and 0 lab observations. Clean slate.

Cross-Case Document Reuse: Existing Mechanism

Patients shouldn't have to re-upload their X-ray for every new case. The _check_existing_records() function in case_orchestrator.py already handles this:

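```python
# Fragment: runs inside an async handler in case_orchestrator.py.
reusable = await _check_existing_records(db, patient_id, tenant_id, case)
if reusable:
    # Show patient: "I found previous records that are still valid:
    # SD_20250209.pdf, knee_xray.pdf. Use these for this case?"
    ...  # opt-in prompt elided
```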

This fires on the records-first turn. The user sees previous records and gets to opt in. Explicit consent, not silent inheritance.

Shipped: case_id Column Migration (Session 39, PR #152)

Phases 1–2 shipped in Session 39 via Alembic migration a2b3c4d5e6f7. 148 existing document_references rows were backfilled using the timestamp-inference logic described below. The column stays nullable until Phase 3 (upload flow refactor) is complete. Phases 3–5 remain as planned below.

Future Plan: Proper case_id Column (Phases 3–5 remaining)

Phase 1: Schema migration (post-MVP; shipped in Session 39, PR #152)

Migration a2b3c4d5e6f7 ran in production. 148 documents backfilled.

```sql
ALTER TABLE document_references
ADD COLUMN case_id VARCHAR(36) REFERENCES cases(id);

CREATE INDEX idx_document_references_case_id
ON document_references(case_id);
```

Make case_id nullable initially so existing rows don't break.

Phase 2: Backfill

For each existing document_references row:

1. Find the case where case.patient_id = doc.patient_id AND case.created_at <= doc.uploaded_at AND (next case for that patient does not exist OR doc.uploaded_at < next_case.created_at)
2. Set doc.case_id = matched_case.id

Rows that can't be matched (e.g., orphaned uploads) get case_id = NULL and are excluded from queries.
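
The matching rule above amounts to picking the latest case created at or before the upload. A minimal sketch, assuming ISO-8601 timestamp strings and per-patient case lists pre-sorted by created_at; `infer_case_id` and the dict shapes are hypothetical and not taken from migration a2b3c4d5e6f7:

```python
from bisect import bisect_right


def infer_case_id(doc, cases_by_patient):
    """Return the id of the latest case created at or before the upload, else None."""
    cases = cases_by_patient.get(doc["patient_id"], [])  # assumed sorted by created_at
    created = [case["created_at"] for case in cases]
    # Index of the first case created strictly after the upload.
    idx = bisect_right(created, doc["uploaded_at"])
    return cases[idx - 1]["id"] if idx else None


# Hypothetical data for one patient with two cases.
cases_by_patient = {
    "p1": [
        {"id": "case-1", "created_at": "2025-01-01T00:00:00"},
        {"id": "case-2", "created_at": "2025-03-01T00:00:00"},
    ]
}
between_cases = {"patient_id": "p1", "uploaded_at": "2025-01-05T10:00:00"}
after_latest = {"patient_id": "p1", "uploaded_at": "2025-03-02T09:00:00"}
before_any = {"patient_id": "p1", "uploaded_at": "2024-12-01T00:00:00"}  # orphaned upload

matched = [
    infer_case_id(d, cases_by_patient)
    for d in (between_cases, after_latest, before_any)
]
```

An upload that precedes every case for its patient, or belongs to a patient with no cases, comes back as None, which corresponds to the orphaned rows described above.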

Phase 3: Update upload flow

- Frontend uploadFileToR2() accepts caseId parameter
- Backend presign endpoint accepts case_id in request body
- Backend confirm endpoint stores it on the document record
- All 3 query sites switch from uploaded_at >= case.created_at to case_id = :case_id

Phase 4: Make case_id NOT NULL

Once backfill is complete and all upload paths set case_id, change the column to NOT NULL and drop the timestamp filter from queries.
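
Before tightening the constraint it may help to confirm that no row would violate it. A hedged sketch against in-memory SQLite; `count_unscoped_documents` is a hypothetical helper, and the table is reduced to the two columns that matter here:

```python
import sqlite3


def count_unscoped_documents(conn):
    """Count document_references rows that still have no case_id."""
    (n,) = conn.execute(
        "SELECT COUNT(*) FROM document_references WHERE case_id IS NULL"
    ).fetchone()
    return n


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE document_references (id TEXT PRIMARY KEY, case_id TEXT)")
conn.executemany(
    "INSERT INTO document_references VALUES (?, ?)",
    [("d1", "case-1"), ("d2", None)],  # d2 stands in for an orphaned upload
)

blocking = count_unscoped_documents(conn)  # NOT NULL is only safe when this is 0
```

Rows the backfill left as NULL (orphaned uploads) are counted here too, so they would need to be assigned or removed before the constraint can be applied.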

Phase 5: Cross-case reuse via explicit linking

Add a join table for explicit cross-case document reuse:

```sql
CREATE TABLE case_document_links (
    id VARCHAR(36) PRIMARY KEY,
    case_id VARCHAR(36) REFERENCES cases(id),
    document_id VARCHAR(36) REFERENCES document_references(id),
    linked_at TIMESTAMP NOT NULL,
    linked_by VARCHAR(36),  -- patient or admin
    UNIQUE(case_id, document_id)
);
```

When the patient clicks "use this previous record for the new case", we INSERT a row in case_document_links instead of duplicating the document. Queries union the documents owned by the case (via case_id) and the documents linked to the case (via case_document_links).
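
The union described above could look roughly like this. It is a sketch against in-memory SQLite with a reduced schema; `CASE_DOCUMENTS_SQL` and `documents_for_case` are hypothetical names, and the production query would carry the tenant and status filters as well.

```python
import sqlite3

# Documents owned by the case, plus documents linked to it.
# UNION (not UNION ALL) drops a document that appears on both sides.
CASE_DOCUMENTS_SQL = """
SELECT d.id FROM document_references AS d
WHERE d.case_id = :case_id
UNION
SELECT d.id FROM document_references AS d
JOIN case_document_links AS l ON l.document_id = d.id
WHERE l.case_id = :case_id
ORDER BY 1
"""


def documents_for_case(conn, case_id):
    """Return ids of documents owned by or explicitly linked to the case."""
    rows = conn.execute(CASE_DOCUMENTS_SQL, {"case_id": case_id}).fetchall()
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE document_references (id TEXT PRIMARY KEY, case_id TEXT);
CREATE TABLE case_document_links (
    id TEXT PRIMARY KEY, case_id TEXT, document_id TEXT,
    linked_at TEXT NOT NULL, linked_by TEXT,
    UNIQUE (case_id, document_id)
);
INSERT INTO document_references VALUES ('d1', 'case-1'), ('d2', 'case-2');
-- The patient opted to reuse d1 (owned by case-1) for case-2.
INSERT INTO case_document_links VALUES ('l1', 'case-2', 'd1', '2025-03-02T09:00:00', 'patient');
""")

case_2_docs = documents_for_case(conn, "case-2")  # owned d2 plus linked d1
```

Linking leaves ownership untouched: `case-1` still lists only its own document, and the UNIQUE constraint rejects a second link of the same document to the same case.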

Migration Risks

- Backfill correctness: the timestamp inference may be ambiguous if a patient has overlapping cases (rare but possible). Manual review needed.
- Frontend coordination: every upload path needs to pass case_id. Missing one means orphaned documents.
- Query consistency: all 3 sites must migrate at the same time. Mixing case_id and timestamp filters leads to the same bug we just fixed.

Acceptance Criteria

Phases 1–4 are complete when:

- `pytest tests/test_document_scoping.py` (new test file) passes
- A new case shows 0 documents in checklist + 0 in EHR + 0 in patient state
- A patient with 5 documents from 3 cases shows the right docs per case
- The cross-case reuse flow (_check_existing_records) still works
- E2E test e2e/conversation-regression.spec.ts::EHR data flow passes
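
A possible starting point for the new test file, covering the second and third criteria. This is a sketch only: it runs against in-memory SQLite with a reduced schema, and `make_db` and `count_docs` are hypothetical stand-ins for the real fixtures and query sites.

```python
import sqlite3


def make_db():
    """Build one patient with four cases; five documents sit on the first three."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE cases (id TEXT PRIMARY KEY, patient_id TEXT);
    CREATE TABLE document_references (id TEXT PRIMARY KEY, patient_id TEXT, case_id TEXT);
    """)
    conn.executemany(
        "INSERT INTO cases VALUES (?, 'p1')",
        [("c1",), ("c2",), ("c3",), ("c4",)],
    )
    conn.executemany(
        "INSERT INTO document_references VALUES (?, 'p1', ?)",
        [("d1", "c1"), ("d2", "c1"), ("d3", "c2"), ("d4", "c3"), ("d5", "c3")],
    )
    return conn


def count_docs(conn, case_id):
    """Count documents scoped to a case by case_id."""
    (n,) = conn.execute(
        "SELECT COUNT(*) FROM document_references WHERE case_id = ?", (case_id,)
    ).fetchone()
    return n


def test_new_case_starts_empty():
    # c4 was just created and has no uploads of its own.
    assert count_docs(make_db(), "c4") == 0


def test_five_documents_split_across_three_cases():
    conn = make_db()
    assert [count_docs(conn, c) for c in ("c1", "c2", "c3")] == [2, 1, 2]
```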

Status

| Phase | Status |
| --- | --- |
| MVP timestamp filter (3 query sites) | Complete |
| Phase 1: case_id column migration | Complete (migration `a2b3c4d5e6f7`, Session 39 PR #152) |
| Phase 2: Backfill | Complete (148 documents backfilled) |
| Phase 3: Upload flow refactor | Not started |
| Phase 4: NOT NULL + drop timestamp filter | Not started |
| Phase 5: Cross-case linking table | Not started |