07 — Document Processing Pipeline
Three-Tier OCR/Parsing
| Tier | Tool | When Used | Characteristics |
|---|---|---|---|
| Primary | PyMuPDF (local) | All PDFs, first attempt | Fast, local, no API cost, handles standard medical PDFs |
| Fallback 1 | Unstructured.io | Complex layouts, tables, multi-column | Better structure preservation, 1K pages/mo free |
| Fallback 2 | Claude Vision | Scanned docs, handwritten notes | Highest accuracy for degraded quality, highest cost |
Fallback trigger: PyMuPDF extraction returns empty or garbled text (fewer than 50 characters, or more than 90% non-alphanumeric characters).
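The trigger can be sketched as a small predicate over the extracted text. This is a minimal illustration, not the pipeline's actual code: the name needs_fallback is hypothetical, and the choices to strip the text first and to ignore whitespace when computing the ratio are assumptions.

```python
def needs_fallback(
    text: str,
    min_chars: int = 50,
    max_non_alnum_ratio: float = 0.90,
) -> bool:
    """Return True when extracted text looks empty or garbled."""
    stripped = text.strip()
    # Empty or near-empty output: fewer than 50 characters.
    if len(stripped) < min_chars:
        return True
    # Assumption: whitespace is ignored so ordinary word spacing
    # does not count as noise.
    visible = [ch for ch in stripped if not ch.isspace()]
    non_alnum = sum(1 for ch in visible if not ch.isalnum())
    # Garbled output: more than 90% non-alphanumeric characters.
    return non_alnum / len(visible) > max_non_alnum_ratio
```

A caller would run this on the PyMuPDF result and move to the next tier only when it returns True.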
Upload Flow (Detailed)
Participants: Frontend, API, R2, QStash, Parser, Clinical Agent.
1. Frontend → API: POST /uploads/presign. The API responds with { upload_url, doc_id, storage_key }.
2. Frontend → R2: PUT upload_url (file).
3. Frontend → API: POST /uploads/confirm.
4. API: writes document_reference (status: uploaded) and emits event: document_uploaded.
5. API → QStash: QStash dispatch. API → Frontend: SSE status=parsing.
6. QStash → Parser: callback. The Parser runs PyMuPDF and gets the text back.
7. API: status: parsed, event: document_parsed. API → Frontend: SSE status=parsed.
8. API → Clinical Agent: AUTO-CHAIN. The agent receives the extract request and returns FHIR.
9. API: status: analyzed. API → Frontend: SSE status=analyzed.
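From the client's side, the flow comes down to three calls: presign, PUT to R2, confirm. The sketch below shows that order with the HTTP layer injected as callables so it stays self-contained. The function name upload_document, the callables api_post and put_file, and the request body fields are all assumptions; only the two endpoint paths and the response fields upload_url, doc_id and storage_key come from the flow above.

```python
from typing import Any, Callable, Dict

ApiPost = Callable[[str, Dict[str, Any]], Dict[str, Any]]
PutFile = Callable[[str, bytes], None]


def upload_document(
    api_post: ApiPost,
    put_file: PutFile,
    filename: str,
    data: bytes,
) -> str:
    """Run presign -> PUT -> confirm and return the document id."""
    # Step 1: ask the API for a presigned upload URL.
    # The request body shape is an assumption.
    presign = api_post("/uploads/presign", {"filename": filename})

    # Step 2: send the bytes straight to R2, bypassing the API.
    put_file(presign["upload_url"], data)

    # Step 3: tell the API the upload finished so it can record the
    # document and queue parsing. Body fields are assumptions.
    api_post(
        "/uploads/confirm",
        {"doc_id": presign["doc_id"], "storage_key": presign["storage_key"]},
    )
    return presign["doc_id"]
```

After the confirm call returns, progress arrives over SSE rather than in the HTTP response.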
Document Status Transitions
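As a minimal sketch, the statuses that appear in the upload flow (uploaded, parsing, parsed, analyzed) plus the failed value from the schema can be written as a transition table. The exact edges are assumptions, in particular the retry edge from failed back to parsing, which is inferred only from the presence of retry_count. ALLOWED_TRANSITIONS, can_transition and advance are hypothetical names.

```python
# Assumed transition table; the edges are inferred, not confirmed.
ALLOWED_TRANSITIONS = {
    "uploaded": {"parsing"},
    "parsing": {"parsed", "failed"},
    "parsed": {"analyzed"},
    "failed": {"parsing"},  # assumption: a retry re-enters parsing
    "analyzed": set(),      # terminal
}


def can_transition(current: str, new: str) -> bool:
    """Return True if moving from `current` to `new` is allowed."""
    return new in ALLOWED_TRANSITIONS.get(current, set())


def advance(current: str, new: str) -> str:
    """Return the new status, or raise ValueError on an illegal move."""
    if not can_transition(current, new):
        raise ValueError(f"illegal transition: {current} -> {new}")
    return new
```

Keeping the table in one place makes it easy to reject out-of-order updates, for example a late callback trying to move a document from analyzed back to parsing.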
File Storage Security
- All files are stored in Cloudflare R2 under the key {tenant_id}/{patient_id}/{file_id}
- No public access: downloads go through presigned read URLs generated on demand
- Upload presigned URLs expire after 15 minutes
- GDPR deletion requests cascade to R2 (handled by DataSubjectRequestHandler)
- Binary files are never stored in PostgreSQL
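The key layout and the 15-minute upload window can be illustrated with a few helpers. This is a sketch under assumptions: the function names, the allowed character set for key segments, and the tenant-prefix check are hypothetical and are not taken from the codebase.

```python
import re
from datetime import datetime, timedelta, timezone

UPLOAD_URL_TTL = timedelta(minutes=15)

# Assumption: IDs contain only letters, digits, "_" and "-",
# which rules out "/" and ".." inside a segment.
_SAFE_SEGMENT = re.compile(r"[A-Za-z0-9_-]+")


def build_storage_key(tenant_id: str, patient_id: str, file_id: str) -> str:
    """Build the R2 object key {tenant_id}/{patient_id}/{file_id}."""
    for name, value in (
        ("tenant_id", tenant_id),
        ("patient_id", patient_id),
        ("file_id", file_id),
    ):
        if not _SAFE_SEGMENT.fullmatch(value):
            raise ValueError(f"unsafe {name}: {value!r}")
    return f"{tenant_id}/{patient_id}/{file_id}"


def upload_url_expiry(issued_at: datetime) -> datetime:
    """Return when an upload URL issued at `issued_at` stops working."""
    return issued_at + UPLOAD_URL_TTL


def belongs_to_tenant(storage_key: str, tenant_id: str) -> bool:
    """Check the tenant prefix before signing a read URL for a key."""
    return storage_key.startswith(f"{tenant_id}/")
```

Validating each segment before building the key keeps one tenant's prefix from being escaped through a crafted ID.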
Document Reference Schema
CREATE TABLE document_references (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
patient_id UUID NOT NULL REFERENCES patients(id),
tenant_id VARCHAR(100) NOT NULL,
filename VARCHAR(255) NOT NULL,
content_type VARCHAR(100),
file_size_bytes INTEGER,
storage_key VARCHAR(500) NOT NULL, -- R2 object key
ocr_status VARCHAR(20) DEFAULT 'uploaded', -- uploaded|parsing|parsed|failed
analysis_status VARCHAR(20), -- pending|analyzing|completed|failed
extracted_text TEXT,
extracted_data JSONB, -- structured extraction results
document_type VARCHAR(50), -- radiology|lab|consultation|insurance|other
retry_count INTEGER DEFAULT 0,
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW()
);
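The status and type columns are plain VARCHARs whose allowed values live only in comments, so an application-side model may want to enforce them. The sketch below mirrors a subset of the columns; the class name, the choice of fields, and the validation itself are assumptions, not part of the schema.

```python
from dataclasses import dataclass
from typing import Optional

# Allowed values copied from the schema comments.
OCR_STATUSES = {"uploaded", "parsing", "parsed", "failed"}
ANALYSIS_STATUSES = {"pending", "analyzing", "completed", "failed"}
DOCUMENT_TYPES = {"radiology", "lab", "consultation", "insurance", "other"}


@dataclass
class DocumentReference:
    """Application-side mirror of a document_references row (subset)."""

    patient_id: str
    tenant_id: str
    filename: str
    storage_key: str
    ocr_status: str = "uploaded"
    analysis_status: Optional[str] = None
    document_type: Optional[str] = None
    retry_count: int = 0

    def __post_init__(self) -> None:
        # Reject values the column comments do not list.
        if self.ocr_status not in OCR_STATUSES:
            raise ValueError(f"bad ocr_status: {self.ocr_status!r}")
        if (
            self.analysis_status is not None
            and self.analysis_status not in ANALYSIS_STATUSES
        ):
            raise ValueError(f"bad analysis_status: {self.analysis_status!r}")
        if (
            self.document_type is not None
            and self.document_type not in DOCUMENT_TYPES
        ):
            raise ValueError(f"bad document_type: {self.document_type!r}")
```

The same guarantee could be moved into the database with CHECK constraints; validating in the model is just one option.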
Auto-Chain: Document → Clinical Context Agent
When the document_parsed event fires:
1. EHR Builder checks whether extracted_text contains clinical content (not just headers and footers)
2. If clinical content is detected, it dispatches the document to the Clinical Context Agent
3. The agent processes the document and emits fhir_resource_created events
4. EHR Builder assembles the new FHIR resources into the patient record
5. intake_progress is recalculated
6. SSE notifies the frontend of all updates
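Steps 1 and 2 above can be sketched as an event handler with a content gate. The real clinical-content check is not described here, so has_clinical_content is a hypothetical heuristic (drop page markers, then require a minimum word count), and on_document_parsed and its dispatch callback are assumed names.

```python
import re
from typing import Callable

# Lines that are only a page marker such as "Page 1 of 3" or "12".
_PAGE_MARKER = re.compile(
    r"^\s*(page\s+\d+(\s+of\s+\d+)?|\d+)\s*$", re.IGNORECASE
)


def has_clinical_content(text: str, min_words: int = 20) -> bool:
    """Hypothetical heuristic: enough words remain after dropping
    blank lines and page markers."""
    body_lines = [
        line
        for line in text.splitlines()
        if line.strip() and not _PAGE_MARKER.match(line)
    ]
    words = sum(len(line.split()) for line in body_lines)
    return words >= min_words


def on_document_parsed(
    doc_id: str,
    extracted_text: str,
    dispatch: Callable[[str], None],
) -> bool:
    """Dispatch to the Clinical Context Agent only when the text
    looks clinical. Returns True if the document was dispatched."""
    if not has_clinical_content(extracted_text):
        return False
    dispatch(doc_id)
    return True
```

Gating before dispatch avoids spending an agent run on documents that contain nothing but boilerplate.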
Frontend: Document Status in Conversation
Documents appear as cards in the conversation thread:
┌──────────────────────────────┐
│ 📄 knee_xray_report.pdf │
│ ████████████░░░ Analyzing... │
│ Uploaded 2 min ago │
└──────────────────────────────┘
↓ (after analysis)
┌──────────────────────────────┐
│ 📄 knee_xray_report.pdf ✅ │
│ Found: M17.11 (OA right knee)│
│ Kellgren-Lawrence Grade 4 │
│ 3 observations extracted │
│ [View Details] │
└──────────────────────────────┘