07 — Document Processing Pipeline

Three-Tier OCR/Parsing

Tier	Tool	When Used	Characteristics
Primary	PyMuPDF (local)	All PDFs, first attempt	Fast, local, no API cost, handles standard medical PDFs
Fallback 1	Unstructured.io	Complex layouts, tables, multi-column	Better structure preservation, free tier of 1,000 pages/month
Fallback 2	Claude Vision	Scanned docs, handwritten notes	Highest accuracy for degraded quality, highest cost

Fallback trigger: PyMuPDF extraction returns empty text (fewer than 50 characters) or garbled text (more than 90% non-alphanumeric characters).
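
The trigger can be sketched as a small predicate. This is a minimal illustration, not the pipeline's actual code: the function names are hypothetical, and three details are assumptions the text does not settle (the two conditions are OR'd, length is measured after trimming, and whitespace is excluded from the non-alphanumeric ratio).

```python
def non_alnum_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are neither letters nor digits."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 1.0  # nothing usable at all
    return sum(not c.isalnum() for c in chars) / len(chars)


def needs_fallback(text: str, min_chars: int = 50, max_non_alnum: float = 0.90) -> bool:
    """Return True when extracted text looks empty or garbled.

    Thresholds come from the trigger rule above; combining them with OR
    is an assumption.
    """
    stripped = text.strip()
    if len(stripped) < min_chars:
        return True  # effectively empty
    return non_alnum_ratio(stripped) > max_non_alnum  # mostly symbols


if __name__ == "__main__":
    print(needs_fallback(""))  # empty input always triggers the fallback
```

In practice the input would be the text returned by PyMuPDF for a document, for example the concatenated results of `page.get_text()` across its pages.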

Upload Flow (Detailed)

Frontend                    API                     R2              QStash          Parser          Clinical Agent
   │                         │                      │                 │               │                  │
   ├─ POST /uploads/presign──▶                      │                 │               │                  │
   │◀── { upload_url, doc_id, storage_key } ────────│                 │               │                  │
   │                         │                      │                 │               │                  │
   ├─ PUT upload_url (file)──────────────────────────▶                │               │                  │
   │                         │                      │                 │               │                  │
   ├─ POST /uploads/confirm──▶                      │                 │               │                  │
   │                         ├─ document_reference ──│                │               │                  │
   │                         │   status: uploaded    │                │               │                  │
   │                         ├─ event: document_uploaded              │               │                  │
   │                         ├─ QStash dispatch ─────────────────────▶│               │                  │
   │◀── SSE: status=parsing ─┤                      │                │               │                  │
   │                         │                      │     callback ──▶│               │                  │
   │                         │                      │                 │── PyMuPDF ────▶               │
   │                         │                      │                 │◀── text ──────│               │
   │                         │                      │                 │               │                  │
   │                         ├─ status: parsed ──────│                │               │                  │
   │                         ├─ event: document_parsed                │               │                  │
   │◀── SSE: status=parsed ──┤                      │                │               │                  │
   │                         │                      │                 │               │                  │
   │                         ├─ AUTO-CHAIN ──────────────────────────────────────────▶│                  │
   │                         │                      │                 │               ├── extract ───────▶
   │                         │                      │                 │               │                  │
   │                         │                      │                 │               │◀── FHIR ─────────│
   │                         ├─ status: analyzed ────│                │               │                  │
   │◀── SSE: status=analyzed ┤                      │                │               │                  │
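
The first three client steps in the diagram (presign, direct PUT to storage, confirm) can be sketched as follows. The endpoint paths and the response fields `upload_url`, `doc_id`, and `storage_key` come from the diagram; the request bodies and the injected `post` and `put` callables are hypothetical stand-ins for whatever HTTP client the frontend uses.

```python
from typing import Callable

# Hypothetical transport signatures:
#   post(path, json_body) -> parsed JSON response
#   put(url, data)        -> None
PostFn = Callable[[str, dict], dict]
PutFn = Callable[[str, bytes], None]


def upload_document(post: PostFn, put: PutFn, filename: str, data: bytes) -> str:
    """Run presign -> PUT -> confirm and return the document id."""
    # Step 1: ask the API for a presigned upload URL.
    # The request body fields are assumptions.
    presign = post("/uploads/presign", {"filename": filename, "size": len(data)})

    # Step 2: send the bytes directly to object storage, bypassing the API.
    put(presign["upload_url"], data)

    # Step 3: confirm, so the API can record the document and queue parsing.
    post(
        "/uploads/confirm",
        {"doc_id": presign["doc_id"], "storage_key": presign["storage_key"]},
    )
    return presign["doc_id"]
```

Passing the transport in as arguments keeps the sequence testable without a network. Status updates after the confirm step arrive over SSE and are not modelled here.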

Document Status Transitions

uploaded → parsing → parsed → analyzed
                  ↘ failed (retry via QStash, max 3 attempts)
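
One way to enforce these transitions is a lookup table plus a retry guard. This is a sketch under assumptions: the diagram is read as allowing `failed` only from `parsing`, a retry is modelled as `failed` to `parsing`, and "max 3 attempts" is interpreted as allowing a retry while `retry_count` is below 3. The names `TRANSITIONS` and `can_transition` are hypothetical.

```python
MAX_ATTEMPTS = 3  # from the diagram: "max 3 attempts"

# Allowed transitions as read off the diagram. The failed -> parsing
# retry edge is an assumption about how QStash retries are recorded.
TRANSITIONS = {
    "uploaded": {"parsing"},
    "parsing": {"parsed", "failed"},
    "parsed": {"analyzed"},
    "failed": {"parsing"},
    "analyzed": set(),
}


def can_transition(current: str, target: str, retry_count: int = 0) -> bool:
    """Return True if moving from current to target is allowed."""
    if target not in TRANSITIONS.get(current, set()):
        return False
    if current == "failed":
        # Retries are only allowed while attempts remain.
        return retry_count < MAX_ATTEMPTS
    return True
```

For example, `can_transition("failed", "parsing", retry_count=3)` is rejected, which leaves the document in `failed` once the retry budget is spent.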

File Storage Security

All files are stored in Cloudflare R2 under the key {tenant_id}/{patient_id}/{file_id}
No public access: downloads use presigned read URLs generated on demand
Upload presigned URLs expire after 15 minutes
GDPR deletion cascades to R2 (DataSubjectRequestHandler)
Binary files never stored in PostgreSQL
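
The key layout and the 15-minute upload window from the list above can be sketched like this. The helper names are hypothetical, and the segment validation (rejecting empty values, slashes, and dot segments so one tenant's key cannot point into another's prefix) is an assumption, not a documented rule.

```python
from datetime import datetime, timedelta

UPLOAD_URL_TTL = timedelta(minutes=15)  # upload presigned URL lifetime


def build_storage_key(tenant_id: str, patient_id: str, file_id: str) -> str:
    """Build the R2 object key {tenant_id}/{patient_id}/{file_id}."""
    parts = (tenant_id, patient_id, file_id)
    for part in parts:
        # Assumed safeguard: no empty segments and no path tricks.
        if not part or "/" in part or part in (".", ".."):
            raise ValueError(f"invalid key segment: {part!r}")
    return "/".join(parts)


def upload_url_expired(issued_at: datetime, now: datetime) -> bool:
    """Return True once the upload window has elapsed."""
    return now - issued_at >= UPLOAD_URL_TTL
```

Expiry of a presigned URL is enforced by the storage service itself; a check like `upload_url_expired` would only be useful for deciding when a client should request a fresh URL.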

Document Reference Schema

CREATE TABLE document_references (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    patient_id UUID NOT NULL REFERENCES patients(id),
    tenant_id VARCHAR(100) NOT NULL,
    filename VARCHAR(255) NOT NULL,
    content_type VARCHAR(100),
    file_size_bytes INTEGER,
    storage_key VARCHAR(500) NOT NULL,        -- R2 object key
    ocr_status VARCHAR(20) DEFAULT 'uploaded', -- uploaded|parsing|parsed|failed
    analysis_status VARCHAR(20),               -- pending|analyzing|completed|failed
    extracted_text TEXT,
    extracted_data JSONB,                       -- structured extraction results
    document_type VARCHAR(50),                  -- radiology|lab|consultation|insurance|other
    retry_count INTEGER DEFAULT 0,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW()
);

Auto-Chain: Document → Clinical Context Agent

When the document_parsed event fires:

1. EHR Builder checks whether extracted_text has clinical content (not just headers/footers)
2. If clinical content is detected, it dispatches to the Clinical Context Agent
3. The agent processes the text and emits fhir_resource_created events
4. EHR Builder assembles the new FHIR resources into the patient record
5. intake_progress is recalculated
6. SSE notifies the frontend of all updates
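
Steps 1 and 2 can be sketched as an event handler that gates the dispatch. How EHR Builder actually detects clinical content is not specified here, so the heuristic below (drop lines that look like page furniture, then require a minimum word count) is purely a placeholder assumption, and `BOILERPLATE`, `has_clinical_content`, and `on_document_parsed` are hypothetical names.

```python
import re
from typing import Callable

# Hypothetical patterns for lines that are page furniture, not content.
BOILERPLATE = re.compile(
    r"^(page \d+( of \d+)?|confidential|\d{1,2}/\d{1,2}/\d{2,4})$",
    re.IGNORECASE,
)


def has_clinical_content(text: str, min_words: int = 5) -> bool:
    """Placeholder check: enough words remain after dropping boilerplate lines."""
    kept = []
    for line in text.splitlines():
        line = line.strip()
        if line and not BOILERPLATE.match(line):
            kept.append(line)
    return sum(len(line.split()) for line in kept) >= min_words


def on_document_parsed(
    doc_id: str, extracted_text: str, dispatch: Callable[[str], None]
) -> bool:
    """Dispatch to the Clinical Context Agent only when the text looks clinical."""
    if not has_clinical_content(extracted_text):
        return False  # headers/footers only: stop the chain here
    dispatch(doc_id)
    return True
```

The `dispatch` callable stands in for whatever queues work for the Clinical Context Agent; steps 3 to 6 happen downstream of it and are not modelled.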

Frontend: Document Status in Conversation

Documents appear as cards in the conversation thread:

┌──────────────────────────────┐
│ 📄 knee_xray_report.pdf     │
│ ████████████░░░ Analyzing... │
│ Uploaded 2 min ago           │
└──────────────────────────────┘
         ↓ (after analysis)
┌──────────────────────────────┐
│ 📄 knee_xray_report.pdf  ✅ │
│ Found: M17.11 (OA right knee)│
│   Kellgren-Lawrence Grade 4  │
│   3 observations extracted   │
│ [View Details]               │
└──────────────────────────────┘