07 — Document Processing Pipeline

Three-Tier OCR/Parsing

Tier        Tool             When Used                              Characteristics
Primary     PyMuPDF (local)  All PDFs, first attempt                Fast, local, no API cost, handles standard medical PDFs
Fallback 1  Unstructured.io  Complex layouts, tables, multi-column  Better structure preservation, 1K pages/mo free
Fallback 2  Claude Vision    Scanned docs, handwritten notes        Highest accuracy for degraded quality, highest cost

Fallback trigger: PyMuPDF extraction returns empty or garbled text, defined as fewer than 50 characters or more than 90% non-alphanumeric characters.
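
The trigger can be sketched as a small predicate over the text PyMuPDF returns. This is a minimal illustration, not the pipeline's actual code: the function name `needs_fallback`, the decision to ignore whitespace when computing the ratio, and treating the two thresholds as an either/or condition are all assumptions.

```python
def needs_fallback(text: str, min_chars: int = 50, max_non_alnum: float = 0.90) -> bool:
    """Return True when extracted text looks empty or garbled.

    Thresholds mirror the rule above; everything else is an assumption.
    """
    stripped = text.strip()
    # "Empty" case: too little text to be a real page.
    if len(stripped) < min_chars:
        return True
    # "Garbled" case: ignore whitespace so normal spacing is not counted
    # as noise (assumption; the rule does not say how whitespace is treated).
    visible = [ch for ch in stripped if not ch.isspace()]
    non_alnum = sum(1 for ch in visible if not ch.isalnum())
    return non_alnum / len(visible) > max_non_alnum
```

In practice the input would be the concatenated page text from the primary tier; a True result would hand the document to the next tier.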

Upload Flow (Detailed)

Frontend                    API                     R2              QStash          Parser          Clinical Agent
   │                         │                      │                 │               │                  │
   ├─ POST /uploads/presign──▶                      │                 │               │                  │
   │◀── { upload_url, doc_id, storage_key } ────────│                 │               │                  │
   │                         │                      │                 │               │                  │
   ├─ PUT upload_url (file)──────────────────────────▶                │               │                  │
   │                         │                      │                 │               │                  │
   ├─ POST /uploads/confirm──▶                      │                 │               │                  │
   │                         ├─ document_reference ──│                │               │                  │
   │                         │   status: uploaded    │                │               │                  │
   │                         ├─ event: document_uploaded              │               │                  │
   │                         ├─ QStash dispatch ─────────────────────▶│               │                  │
   │◀── SSE: status=parsing ─┤                      │                │               │                  │
   │                         │                      │     callback ──▶│               │                  │
   │                         │                      │                 │── PyMuPDF ────▶               │
   │                         │                      │                 │◀── text ──────│               │
   │                         │                      │                 │               │                  │
   │                         ├─ status: parsed ──────│                │               │                  │
   │                         ├─ event: document_parsed                │               │                  │
   │◀── SSE: status=parsed ──┤                      │                │               │                  │
   │                         │                      │                 │               │                  │
   │                         ├─ AUTO-CHAIN ──────────────────────────────────────────▶│                  │
   │                         │                      │                 │               ├── extract ───────▶
   │                         │                      │                 │               │                  │
   │                         │                      │                 │               │◀── FHIR ─────────│
   │                         ├─ status: analyzed ────│                │               │                  │
   │◀── SSE: status=analyzed ┤                      │                │               │                  │
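
The status updates pushed to the frontend in this flow travel over Server-Sent Events. A rough sketch of how one such message could be serialized follows; the event name `document_status` and the payload fields are hypothetical, and only the `event:` / `data:` framing with a blank-line terminator comes from the SSE wire format itself.

```python
import json


def format_sse(event: str, payload: dict) -> str:
    """Serialize one Server-Sent Events message.

    The event name and payload shape are illustrative assumptions.
    """
    data = json.dumps(payload, separators=(",", ":"))
    # Each SSE message ends with a blank line.
    return f"event: {event}\ndata: {data}\n\n"


message = format_sse("document_status", {"doc_id": "abc", "status": "parsing"})
```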

Document Status Transitions

uploaded → parsing → parsed → analyzed
                  ↘ failed (retry via QStash, max 3 attempts)
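
One way to enforce these transitions is a small lookup table plus a retry guard. The sketch below assumes that `failed` is reachable only from `parsing` (as drawn above), that a retry moves `failed` back to `parsing`, and that "max 3 attempts" means three parsing attempts in total, with `retry_count` counting re-dispatches after the first. The function name `transition` is hypothetical.

```python
MAX_ATTEMPTS = 3  # total parsing attempts, including the first (assumption)

# Allowed transitions, read off the diagram above.
ALLOWED = {
    "uploaded": {"parsing"},
    "parsing": {"parsed", "failed"},
    "parsed": {"analyzed"},
    "failed": {"parsing"},  # retry path
    "analyzed": set(),
}


def transition(current: str, target: str, retry_count: int) -> tuple[str, int]:
    """Validate a status change and return (new_status, new_retry_count)."""
    if target not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition: {current} -> {target}")
    if current == "failed":
        # Attempts made so far = first attempt + retries.
        if retry_count + 1 >= MAX_ATTEMPTS:
            raise ValueError("retry limit reached")
        retry_count += 1
    return target, retry_count
```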

File Storage Security

  • All files are stored in Cloudflare R2 under the key {tenant_id}/{patient_id}/{file_id}
  • No public access: downloads go through presigned read URLs, generated on demand
  • Upload presigned URLs expire after 15 minutes
  • GDPR deletion cascades to R2 (DataSubjectRequestHandler)
  • Binary files never stored in PostgreSQL
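
As an illustration of the key layout and the upload expiry above, the following sketch builds an object key and holds the 15-minute TTL as a constant. The helper name `build_storage_key` and the character whitelist are assumptions added here to keep path separators out of key segments; the source specifies only the layout and the expiry.

```python
import re
from datetime import timedelta

# Upload presigned URLs expire after 15 minutes (from the rules above).
UPLOAD_URL_TTL = timedelta(minutes=15)

# Assumption: IDs are UUID-like or slug-like, so anything else is rejected.
_SAFE_SEGMENT = re.compile(r"[A-Za-z0-9_-]+")


def build_storage_key(tenant_id: str, patient_id: str, file_id: str) -> str:
    """Return the R2 object key {tenant_id}/{patient_id}/{file_id}."""
    parts = (tenant_id, patient_id, file_id)
    for part in parts:
        if not _SAFE_SEGMENT.fullmatch(part):
            raise ValueError(f"unsafe key segment: {part!r}")
    return "/".join(parts)
```

The resulting key is what would be stored in `storage_key` and passed to whichever client generates the presigned URL.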

Document Reference Schema

CREATE TABLE document_references (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    patient_id UUID NOT NULL REFERENCES patients(id),
    tenant_id VARCHAR(100) NOT NULL,
    filename VARCHAR(255) NOT NULL,
    content_type VARCHAR(100),
    file_size_bytes INTEGER,
    storage_key VARCHAR(500) NOT NULL,        -- R2 object key
    ocr_status VARCHAR(20) DEFAULT 'uploaded', -- uploaded|parsing|parsed|failed
    analysis_status VARCHAR(20),               -- pending|analyzing|completed|failed
    extracted_text TEXT,
    extracted_data JSONB,                       -- structured extraction results
    document_type VARCHAR(50),                  -- radiology|lab|consultation|insurance|other
    retry_count INTEGER DEFAULT 0,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW()
);

Auto-Chain: Document → Clinical Context Agent

When the document_parsed event fires:

  1. EHR Builder checks if extracted_text has clinical content (not just headers/footers)
  2. If clinical content is detected, it dispatches to the Clinical Context Agent
  3. Agent processes the text and emits fhir_resource_created events
  4. EHR Builder assembles the new FHIR resources into the patient record
  5. intake_progress is recalculated
  6. SSE notifies the frontend of all updates
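
The first two steps of the chain (the content check and the dispatch) might look roughly like this. The source does not say how "clinical content" is detected, so the heuristic here is purely hypothetical: it drops lines that repeat verbatim, as page headers and footers usually do, and requires a minimum number of remaining words. The names `has_clinical_content` and `on_document_parsed`, and the `dispatch` callable, are likewise illustrative.

```python
from collections import Counter
from typing import Callable


def has_clinical_content(extracted_text: str, min_words: int = 20) -> bool:
    """Hypothetical heuristic: enough non-repeated text to be worth analyzing."""
    lines = [ln.strip() for ln in extracted_text.splitlines() if ln.strip()]
    counts = Counter(lines)
    # Lines that repeat verbatim are treated as headers/footers (assumption).
    body = [ln for ln in lines if counts[ln] == 1]
    return sum(len(ln.split()) for ln in body) >= min_words


def on_document_parsed(doc: dict, dispatch: Callable[[dict], None]) -> bool:
    """Dispatch to the Clinical Context Agent only when content is present."""
    if not has_clinical_content(doc.get("extracted_text") or ""):
        return False
    dispatch({"doc_id": doc["id"], "patient_id": doc["patient_id"]})
    return True
```

Passing `dispatch` in as a callable keeps the sketch independent of the queueing mechanism; in this pipeline it would presumably wrap the QStash dispatch.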

Frontend: Document Status in Conversation

Documents appear as cards in the conversation thread:

┌──────────────────────────────┐
│ 📄 knee_xray_report.pdf     │
│ ████████████░░░ Analyzing... │
│ Uploaded 2 min ago           │
└──────────────────────────────┘
         ↓ (after analysis)
┌──────────────────────────────┐
│ 📄 knee_xray_report.pdf  ✅ │
│ Found: M17.11 (OA right knee)│
│   Kellgren-Lawrence Grade 4  │
│   3 observations extracted   │
│ [View Details]               │
└──────────────────────────────┘