ADR-0003: PyMuPDF-First OCR Strategy
Status: Accepted | Date: 2026-03-22 | Session: 23B
Context
Patients upload medical documents (lab results, discharge summaries, pathology reports, imaging reports) as PDFs. Before the AI agent can analyze these documents, the text must be extracted reliably. Medical PDFs come in two flavors:
- Digitally-generated PDFs -- text is embedded in the PDF and can be extracted directly.
- Scanned PDFs -- the document is an image; text must be recovered via OCR.
The extraction pipeline must be fast (users are waiting in a chat interface), accurate (medical terminology must be preserved), and reliable (no external API dependency for the common case).
Decision
Use PyMuPDF (fitz) as the primary text extraction method, running synchronously inline during document processing. Reserve Unstructured.io and Claude Vision as fallbacks for scanned documents where PyMuPDF returns insufficient text.
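
As a minimal sketch of the primary path, assuming PyMuPDF is installed and the upload is available as bytes, per-page extraction could look like the following. The function name `extract_page_texts` is illustrative and not taken from the project's codebase; `fitz.open` and `Page.get_text` are PyMuPDF's own calls.

```python
from typing import List


def extract_page_texts(pdf_bytes: bytes) -> List[str]:
    """Return the embedded text of each page, in page order.

    Hypothetical helper for illustration; the real pipeline may differ.
    """
    # Imported lazily so this module still loads where PyMuPDF is absent.
    # PyMuPDF is importable as `fitz` (newer releases also expose `pymupdf`).
    import fitz

    # Open from memory: no temporary file and no external service involved.
    with fitz.open(stream=pdf_bytes, filetype="pdf") as doc:
        # get_text() with no arguments returns the page's plain text layer.
        # A scanned page with no text layer yields an empty string.
        return [page.get_text() for page in doc]
```

Because the call is synchronous and local, the returned list is available to the caller as soon as the function returns, which is the property this decision relies on.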
Rationale
- Speed. PyMuPDF extracts text from a typical 5-page PDF in under 1 second. Since extraction runs synchronously, the text is immediately available when the chat agent processes the document. No queue, no polling, no race conditions.
- No external dependency. PyMuPDF is a Python library that runs locally in the container. It does not call any external API, so the common case (digitally-generated PDFs) incurs no network latency, rate limiting, or per-call cost.
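- Accuracy on digital PDFs. For PDFs with embedded text, PyMuPDF reads the actual text layer instead of recognizing characters from pixels, so medical terminology and lab values come through as encoded in the file, with no OCR misrecognition. Visual layout, such as table structure, is not guaranteed to survive (see Consequences).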
- Fallback chain. When PyMuPDF returns fewer than 50 characters from a page (indicating a scanned image), the pipeline falls back to Unstructured.io for layout-aware OCR, and then to Claude Vision for complex cases (handwritten notes, poor scan quality).
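
The fallback chain described above can be sketched as a small routing function. This is an illustration under stated assumptions, not the project's implementation: the 50-character threshold comes from this ADR, but stripping whitespace before counting, reusing the same threshold to decide when to escalate from Unstructured.io to Claude Vision, and the `layout_ocr` / `vision_ocr` callables are all assumptions made for the example.

```python
from typing import Callable, Tuple

MIN_CHARS_PER_PAGE = 50  # threshold stated in this ADR

# Hypothetical signature: takes a page index, returns recognized text.
OcrFn = Callable[[int], str]


def has_enough_text(text: str, min_chars: int = MIN_CHARS_PER_PAGE) -> bool:
    """True when a page yielded at least `min_chars` non-padding characters."""
    # Assumption: surrounding whitespace is ignored when counting.
    return len(text.strip()) >= min_chars


def resolve_page_text(
    page_index: int,
    embedded_text: str,
    layout_ocr: OcrFn,
    vision_ocr: OcrFn,
) -> Tuple[str, str]:
    """Return (tier, text) for one page, trying the cheapest tier first."""
    # Tier 1: text PyMuPDF already extracted from the text layer.
    if has_enough_text(embedded_text):
        return "pymupdf", embedded_text

    # Tier 2: layout-aware OCR (stand-in for an Unstructured.io call).
    ocr_text = layout_ocr(page_index)
    if has_enough_text(ocr_text):  # assumption: same heuristic reused here
        return "unstructured", ocr_text

    # Tier 3: last resort (stand-in for a Claude Vision call).
    return "claude_vision", vision_ocr(page_index)
```

Passing the OCR tiers in as callables keeps the routing logic testable without network access, and the returned tier label records which step produced the text.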
Alternatives Considered
| Alternative | Pros | Cons | Verdict |
|---|---|---|---|
| Unstructured.io first | Layout-aware, handles tables well | API-dependent, slower (2-5s per document), adds cost | Reserved as fallback |
| Tesseract OCR | Open source, well-known | Poor quality on medical documents with small fonts and dense formatting | Rejected |
| LlamaParse | Good at structured extraction | Adds another API dependency and cost | Evaluated, deferred |
| Claude Vision directly | Highest quality for complex documents | Expensive per page, slower, overkill for digital PDFs | Reserved as last-resort fallback |
Consequences
- Positive: Sub-second extraction for the majority of uploads (digitally-generated PDFs). No external API cost for the common path.
- Positive: Synchronous extraction eliminates the race condition where the agent tries to read a document before OCR completes (see also ADR-0010).
- Negative: PyMuPDF's direct text extraction yields little or no text from scanned-image PDFs, which have no text layer to read. The fallback chain that covers them adds complexity.
- Negative: PyMuPDF's table extraction is basic. Documents with complex tables (multi-page lab panels) may lose structure.
- Accepted risk: The fallback chain (PyMuPDF -> Unstructured.io -> Claude Vision) introduces conditional complexity, but each step is well-defined and measurable via the text-length heuristic.
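
One way the text-length heuristic could make the chain measurable is to report, per document, how many pages fell below the threshold. The sketch below is hypothetical: `ExtractionStats` and `summarize` are names invented for this example, and ignoring surrounding whitespace when counting is an assumption.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ExtractionStats:
    """Per-document summary of how often the primary path was insufficient."""

    total_pages: int
    fallback_pages: int

    @property
    def fallback_rate(self) -> float:
        # Guard against empty documents to avoid division by zero.
        return self.fallback_pages / self.total_pages if self.total_pages else 0.0


def summarize(page_texts: Sequence[str], min_chars: int = 50) -> ExtractionStats:
    """Count pages whose extracted text is shorter than `min_chars`."""
    # Assumption: whitespace-only output counts as no text.
    fallback = sum(1 for text in page_texts if len(text.strip()) < min_chars)
    return ExtractionStats(total_pages=len(page_texts), fallback_pages=fallback)
```

Tracking this rate over time would show what share of uploads actually leave the sub-second path, which is the figure the accepted risk depends on.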