ADR-0003: PyMuPDF-First OCR Strategy

Status: Accepted
Date: 2026-03-22
Session: 23B

Context

Patients upload medical documents (lab results, discharge summaries, pathology reports, imaging reports) as PDFs. Before the AI agent can analyze these documents, the text must be extracted reliably. Medical PDFs come in two flavors:

Digitally-generated PDFs -- text is embedded in the PDF and can be extracted directly.
Scanned PDFs -- the document is an image; text must be recovered via OCR.

The extraction pipeline must be fast (users are waiting in a chat interface), accurate (medical terminology must be preserved), and reliable (no external API dependency for the common case).

Decision

Use PyMuPDF (fitz) as the primary text extraction method, running synchronously inline during document processing. Reserve Unstructured.io and Claude Vision as fallbacks for scanned documents where PyMuPDF returns insufficient text.
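A minimal sketch of the primary path, assuming the upload is available as bytes in memory. `fitz.open(stream=..., filetype="pdf")` and `page.get_text()` are PyMuPDF calls; the function names `open_pdf` and `extract_page_texts` are illustrative and are not taken from the project's codebase.

```python
from typing import Iterable, List


def open_pdf(data: bytes):
    """Open an in-memory PDF with PyMuPDF.

    The import is lazy so this module loads even where PyMuPDF
    is not installed (requires `pip install pymupdf` to call).
    """
    import fitz  # PyMuPDF

    return fitz.open(stream=data, filetype="pdf")


def extract_page_texts(pages: Iterable) -> List[str]:
    """Return the embedded text of each page, in page order.

    `pages` is anything that yields objects with a `get_text()` method,
    such as an open PyMuPDF document.
    """
    return [page.get_text().strip() for page in pages]
```

Called inline in the upload handler, this might look like `texts = extract_page_texts(open_pdf(pdf_bytes))`, so the text is ready before the agent sees the document.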

Rationale

Speed. PyMuPDF extracts text from a typical 5-page PDF in under 1 second. Since extraction runs synchronously, the text is immediately available when the chat agent processes the document. No queue, no polling, no race conditions.
No external dependency. PyMuPDF is a Python library that runs locally in the container. It does not call any external API, so the common case (digitally-generated PDFs) incurs no network latency, rate limiting, or per-request cost.
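Accuracy on digital PDFs. For PDFs with embedded text, PyMuPDF reads the actual text layer rather than recognizing characters from pixels, so medical terminology and lab values are preserved without OCR errors. Table structure is the exception and may be flattened (see Consequences).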
Fallback chain. When PyMuPDF returns fewer than 50 characters from a page (which suggests a scanned image), the pipeline falls back to Unstructured.io for layout-aware OCR, and then to Claude Vision for complex cases (handwritten notes, poor scan quality).
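The fallback chain can be sketched as a per-page decision. This is a hedged illustration, not the project's implementation: the OCR stages are passed in as plain callables (hypothetical stand-ins for the Unstructured.io and Claude Vision clients), and reusing the same 50-character threshold to decide when to escalate from one fallback to the next is an assumption, since the ADR does not state that criterion.

```python
from typing import Callable, Sequence, Tuple

MIN_CHARS_PER_PAGE = 50  # threshold stated in this ADR

# A fallback takes the page index and returns recovered text ("" on failure).
Extractor = Callable[[int], str]


def extract_page_with_fallback(
    page_index: int,
    embedded_text: str,
    fallbacks: Sequence[Tuple[str, Extractor]],
    min_chars: int = MIN_CHARS_PER_PAGE,
) -> Tuple[str, str]:
    """Return (source, text) for one page.

    The embedded text is used when it has at least `min_chars` characters.
    Otherwise each fallback is tried in order until one clears the threshold
    (assumed escalation rule); if none does, the longest result is returned.
    """
    best_source, best_text = "pymupdf", embedded_text.strip()
    if len(best_text) >= min_chars:
        return best_source, best_text
    for name, extractor in fallbacks:
        text = extractor(page_index).strip()
        if len(text) >= min_chars:
            return name, text
        if len(text) > len(best_text):
            best_source, best_text = name, text
    return best_source, best_text
```

Passing the fallbacks as an ordered list keeps the expensive stages out of the common path: for a digital page no fallback is ever called.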

Alternatives Considered

Alternative	Pros	Cons	Verdict
Unstructured.io first	Layout-aware, handles tables well	API-dependent, slower (2-5s per document), adds cost	Reserved as fallback
Tesseract OCR	Open source, well-known	Poor quality on medical documents with small fonts and dense formatting	Rejected
LlamaParse	Good at structured extraction	Evaluated but deferred; adds another API dependency and cost	Deferred
Claude Vision directly	Highest quality for complex documents	Expensive per page, slower, overkill for digital PDFs	Reserved as last-resort fallback

Consequences

Positive: Sub-second extraction for the majority of uploads (digitally-generated PDFs). No external API cost for the common path.
Positive: Synchronous extraction eliminates the race condition where the agent tries to read a document before OCR completes (see also ADR-0010).
Negative: PyMuPDF cannot extract text from scanned-image PDFs. The fallback chain adds complexity.
Negative: PyMuPDF's table extraction is basic. Documents with complex tables (multi-page lab panels) may lose structure.
Accepted risk: The fallback chain (PyMuPDF -> Unstructured.io -> Claude Vision) introduces conditional complexity, but each step is well-defined and measurable via the text-length heuristic.
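One way to keep the chain measurable is to record which extractor produced each page and track the share served by each. A minimal sketch, assuming each page is tagged with a source label; the label strings and the function name `summarize_sources` are illustrative.

```python
from collections import Counter
from typing import Dict, Iterable


def summarize_sources(sources: Iterable[str]) -> Dict[str, float]:
    """Return the fraction of pages served by each extractor."""
    counts = Counter(sources)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: count / total for name, count in counts.items()}
```

A rising share for the fallback labels would signal that more uploads are scanned than expected, or that the 50-character threshold needs revisiting.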