OpenDataLoader PDF: Evaluation & Integration Plan
Status: Evaluated, deferred (implement when ready)
Date: 2026-04-17
Session: 41
Repo: https://github.com/opendataloader-project/opendataloader-pdf
License: Apache 2.0
Overview
OpenDataLoader PDF is an open-source PDF parser that produces structured, AI-ready output. It has two extraction modes: local (a deterministic Java parser, 0.015s/page) and hybrid (routes complex pages to Docling + EasyOCR + the SmolVLM 256M vision model, 0.463s/page). The Python SDK installs with pip install opendataloader-pdf. The repo has 17.7K GitHub stars and is under active development (last push April 2026).
Why Evaluate
Curaway's #1 patient upload is lab reports: structured tables of parameters, values, and reference ranges. The current PyMuPDF-based pipeline scores 0.401 on table extraction accuracy (benchmarked as PyMuPDF4LLM). OpenDataLoader in hybrid mode scores 0.928, a 2.3x improvement on the most common document type.
Benchmark Comparison
| Tool | Overall accuracy | Table accuracy | Speed/page | Cost |
|---|---|---|---|---|
| opendataloader-pdf (hybrid) | 0.907 | 0.928 | 0.463s | Free (local) |
| opendataloader-pdf (local) | 0.831 | — | 0.015s | Free (local) |
| Unstructured.io (hi_res) | 0.841 | — | API-dependent | Free 1K pages/mo, then paid |
| Docling | 0.882 | — | — | Free (local) |
| PyMuPDF4LLM | 0.732 | 0.401 | <0.01s | Free (local) |
| Marker | 0.861 | — | — | Free (local) |
Source: opendataloader-bench (200 real-world PDFs).
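The headline ratios can be checked directly from the table. A minimal sketch, using only the figures listed above (the variable names are illustrative):

```python
# Scores copied from the benchmark table above.
table_accuracy = {"opendataloader-pdf (hybrid)": 0.928, "PyMuPDF4LLM": 0.401}
overall_accuracy = {"opendataloader-pdf (hybrid)": 0.907, "PyMuPDF4LLM": 0.732}
seconds_per_page = {"hybrid": 0.463, "local": 0.015}

table_gain = table_accuracy["opendataloader-pdf (hybrid)"] / table_accuracy["PyMuPDF4LLM"]
overall_gain = overall_accuracy["opendataloader-pdf (hybrid)"] / overall_accuracy["PyMuPDF4LLM"]
hybrid_slowdown = seconds_per_page["hybrid"] / seconds_per_page["local"]

print(f"table accuracy gain:   {table_gain:.2f}x")      # 2.31x
print(f"overall accuracy gain: {overall_gain:.2f}x")    # 1.24x
print(f"hybrid vs local time:  {hybrid_slowdown:.1f}x slower per page")  # 30.9x
```

The table gain is much larger than the overall gain, which is why the case for adoption rests on lab reports rather than on documents in general.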
Current Pipeline vs Proposed
Current (PyMuPDF + Unstructured.io)
Patient uploads PDF
→ PyMuPDF extracts text inline (fast, 0 cost)
→ If scanned: Unstructured.io OCR (1K pages/mo free)
→ Raw text → Claude Haiku clinical extraction → FHIR resources
Weaknesses:
- Lab report tables lose structure, so Claude has to reconstruct them from flattened text
- Unstructured.io free tier caps at 1K pages/mo
- No bounding box data for document validation
Proposed (dual-path with feature flag)
Patient uploads PDF
├── Flagsmith: pdf_extractor = "pymupdf" (default)
│ → PyMuPDF inline (current behavior, unchanged)
│
└── Flagsmith: pdf_extractor = "opendataloader"
→ opendataloader hybrid server (sidecar)
→ Structured Markdown + JSON with table bounding boxes
→ Same downstream: Claude Haiku → FHIR
Why dual-path:
- A/B test extraction quality on real patient documents
- PyMuPDF stays as the fast path for simple PDFs (text-heavy, no tables)
- opendataloader activates for complex PDFs (lab reports, multi-page radiology)
- Shadow mode first: run both extractors, compare outputs via Langfuse, then promote the winner
Matches existing patterns: WeightedScoringV1 vs AgentEnhancedMatching, risk_assessor vs future risk_assessor_llm_shadow.
Output Formats
opendataloader-pdf produces multiple output formats per document:
Markdown — clean text with table structure preserved:
## Complete Blood Count
| Parameter | Value | Unit | Reference Range |
|-----------|-------|------|-----------------|
| Hemoglobin | 12.8 | g/dL | 12.0 - 16.0 |
| WBC | 7.2 | x10³/µL | 4.5 - 11.0 |
| Platelets | 245 | x10³/µL | 150 - 400 |
JSON — structured elements with bounding boxes:
{
  "elements": [
    {
      "type": "table",
      "page": 1,
      "bbox": [72, 200, 540, 450],
      "content": {
        "headers": ["Parameter", "Value", "Unit", "Reference Range"],
        "rows": [
          ["Hemoglobin", "12.8", "g/dL", "12.0 - 16.0"],
          ["WBC", "7.2", "x10³/µL", "4.5 - 11.0"]
        ]
      }
    }
  ]
}
The JSON with bounding boxes is valuable for:
- Document validation (knowing WHERE on the page a value was found)
- Extraction audit trail (provenance for compliance)
- Future UI features (highlighting source regions in DocumentViewer)
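As an illustration of the audit-trail use, the sketch below flattens each table element into one record per row and attaches the page and table bounding box. It assumes the JSON has exactly the shape of the example above; the real output schema may differ, and the function name and the `_page` / `_table_bbox` keys are hypothetical.

```python
import json

# Sample copied from the JSON example above.
sample = json.loads("""
{
  "elements": [
    {
      "type": "table",
      "page": 1,
      "bbox": [72, 200, 540, 450],
      "content": {
        "headers": ["Parameter", "Value", "Unit", "Reference Range"],
        "rows": [
          ["Hemoglobin", "12.8", "g/dL", "12.0 - 16.0"],
          ["WBC", "7.2", "x10³/µL", "4.5 - 11.0"]
        ]
      }
    }
  ]
}
""")


def table_rows_with_provenance(document: dict) -> list[dict]:
    """Return one dict per table row, keyed by header, plus page and table bbox."""
    records = []
    for element in document.get("elements", []):
        if element.get("type") != "table":
            continue  # only tables carry headers/rows in the example shape
        headers = element["content"]["headers"]
        for row in element["content"]["rows"]:
            record = dict(zip(headers, row))
            record["_page"] = element["page"]
            record["_table_bbox"] = element["bbox"]  # table-level box, not per cell
            records.append(record)
    return records


records = table_rows_with_provenance(sample)
for record in records:
    print(record["Parameter"], record["Value"], "page", record["_page"])
```

Note that the bounding box in the example is for the whole table, so this gives table-level provenance only; highlighting an individual value would need finer-grained coordinates.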
Medical Document Impact
opendataloader-pdf has no medical-domain awareness. It is a layout parser. The clinical intelligence remains in Claude Haiku (existing Clinical Context Agent). The improvement is in the quality of input to Claude:
| Document type | Current input to Claude | With opendataloader |
|---|---|---|
| Lab report (structured table) | Flattened text, table structure lost | Preserved table with headers, rows, reference ranges |
| Radiology report (narrative + measurements) | Mixed text, measurements may be misplaced | Semantic sections with bounding boxes |
| Discharge summary (multi-page narrative) | Good (mostly text) | Similar quality — marginal improvement |
| Scanned handwritten prescription | Unstructured.io OCR | EasyOCR 80+ languages, 300+ DPI |
Biggest win: Lab reports. Better table input → Claude extracts parameter values more accurately → better comorbidity detection → better PFS scoring.
Deployment Architecture
Railway sidecar
opendataloader hybrid mode runs a local server. Deploy as a sidecar process on Railway:
Railway container:
├── FastAPI (main app, port 8000)
└── opendataloader-pdf hybrid server (port 5002)
Dockerfile additions:
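# Add Java 11 for opendataloader (skip recommended packages and clear apt lists to limit image growth)
RUN apt-get update \
 && apt-get install -y --no-install-recommends openjdk-11-jre-headless \
 && rm -rf /var/lib/apt/lists/*
# Install opendataloader-pdf
RUN pip install --no-cache-dir opendataloader-pdf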
Startup: Add opendataloader-pdf-hybrid --port 5002 & to the entrypoint so the server is launched in the background before FastAPI starts, or use Railway's multi-process support.
Alternative: Separate Railway service for opendataloader if resource isolation is needed. Internal networking, no public exposure.
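Because the sidecar is started in the background, the main app may come up before the hybrid server is listening. One way to guard against that is a small readiness check. The sketch below only uses the Python standard library and only tests that the TCP port accepts connections. It does not assume any health endpoint on the hybrid server, and the function name is hypothetical.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 30.0, interval_s: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # create_connection raises OSError (e.g. connection refused) until the server listens
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            time.sleep(interval_s)
    return False


# Possible use at FastAPI startup (port 5002 as in the layout above):
#     sidecar_ready = wait_for_port("127.0.0.1", 5002)
#     if not sidecar_ready: keep routing to PyMuPDF and log a warning
```

An open port does not prove that the models behind the server have finished loading, so the PyMuPDF fallback described under Edge Cases is still needed.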
Service integration
New service: app/services/opendataloader_extractor.py
# Conceptual: the hybrid server's request/response contract is not specified in this plan
async def extract_with_opendataloader(file_bytes: bytes, filename: str) -> ExtractedDocument:
    """Extract structured content from PDF using opendataloader-pdf hybrid mode."""
    # 1. Call the local hybrid server (sidecar)
    # 2. Collect structured Markdown + JSON with bounding boxes
    # 3. Return the same ExtractedDocument interface as the PyMuPDF path
    raise NotImplementedError
Wired into existing document_processing.py:
async def run_post_ocr_pipeline(document_id, ...):
    extractor = get_feature_flag("pdf_extractor")  # "pymupdf" or "opendataloader"
    if extractor == "opendataloader":
        extracted = await extract_with_opendataloader(file_bytes, filename)
    else:
        extracted = extract_with_pymupdf(file_bytes, filename)  # current path
    # Rest of pipeline unchanged: Claude extraction → FHIR → EHR rebuild
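The dual-path design depends on both extractors returning the same type. This plan does not define ExtractedDocument, so the following is only a guess at what a shared shape could look like; every field name here is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ExtractedDocument:
    """Hypothetical shared return type; the real ExtractedDocument may differ."""

    text: str       # plain text or Markdown handed to Claude Haiku
    extractor: str  # "pymupdf" or "opendataloader"
    # Layout elements with page/bbox; stays empty on the PyMuPDF path.
    elements: list[dict] = field(default_factory=list)

    @property
    def has_layout(self) -> bool:
        """True when bounding-box data is available for validation or audit."""
        return bool(self.elements)


pymupdf_doc = ExtractedDocument(
    text="Hemoglobin 12.8 g/dL 12.0 - 16.0",
    extractor="pymupdf",
)
opendataloader_doc = ExtractedDocument(
    text="| Parameter | Value |\n|---|---|\n| Hemoglobin | 12.8 |",
    extractor="opendataloader",
    elements=[{"type": "table", "page": 1, "bbox": [72, 200, 540, 450]}],
)
print(pymupdf_doc.has_layout, opendataloader_doc.has_layout)
```

Keeping the layout data optional lets the downstream Claude step consume `text` from either path unchanged, while validation features can check `has_layout` before relying on bounding boxes.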
Shadow Mode
Before promoting opendataloader as default:
- Both extractors run on every PDF (behind the pdf_extractor_shadow flag)
- Both outputs logged to Langfuse with comparison metadata
- Claude runs on both inputs separately, so FHIR output quality can be compared
- Metrics tracked:
  - Parameter extraction count (did opendataloader find more values?)
  - Table reconstruction accuracy (manual spot-check)
  - Comorbidity detection rate (did better input lead to more detected conditions?)
  - Processing time (PyMuPDF <0.01s/page vs opendataloader hybrid ~0.46s/page)
- After ~200 documents: evaluate and decide whether to promote
Concerns
| Concern | Severity | Mitigation |
|---|---|---|
| JVM per call (~200-500ms startup) | Medium | Run hybrid server as persistent sidecar (warm JVM) |
| Java 11+ in Docker image | Low | openjdk-11-jre-headless adds ~150MB to image |
| Relative newness (11 months) | Low | Pin version, abstract behind service interface |
| No medical NER | Non-issue | Claude Haiku handles this — opendataloader replaces layout extraction, not clinical intelligence |
| Hybrid mode needs Docling backend | Low | Bundled with pip install, runs locally |
| PDF-only (no Word, PPT) | Non-issue | Curaway patients upload PDFs and images, not Office docs |
Flagsmith Flags
| Flag | Type | Default | Description |
|---|---|---|---|
| pdf_extractor | String | "pymupdf" | Which extraction engine to use: "pymupdf" or "opendataloader" |
| pdf_extractor_shadow | Boolean | false | Run both extractors and log comparison (shadow mode) |
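Since a string flag can hold any value, it may be worth normalising it before branching so that a typo in Flagsmith cannot route documents to a non-existent extractor. A minimal sketch, independent of the Flagsmith SDK; the function names are hypothetical, and the allowed values and defaults come from the table above.

```python
ALLOWED_EXTRACTORS = ("pymupdf", "opendataloader")
DEFAULT_EXTRACTOR = "pymupdf"


def resolve_extractor(raw_value: object) -> str:
    """Normalise the pdf_extractor flag; anything unexpected falls back to the default."""
    if isinstance(raw_value, str):
        # Tolerate surrounding whitespace, stray quotes, and case differences.
        candidate = raw_value.strip().strip('"').lower()
        if candidate in ALLOWED_EXTRACTORS:
            return candidate
    return DEFAULT_EXTRACTOR


def resolve_shadow(raw_value: object) -> bool:
    """Shadow mode is on only for an explicit boolean True."""
    return raw_value is True


print(resolve_extractor("opendataloader"))  # opendataloader
print(resolve_extractor("docling"))         # pymupdf
print(resolve_shadow(False))                # False
```

Defaulting to "pymupdf" on any unrecognised value keeps the current behavior as the failure mode, which matches the "no behavior change until promoted" intent.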
Implementation Effort
| Task | Effort |
|---|---|
| Add opendataloader-pdf to backend dependencies | 1 day |
| Add Java 11 to Railway Docker image | 0.5 day |
| Run opendataloader hybrid server as sidecar | 0.5 day |
| Create opendataloader_extractor.py service | 1 day |
| Wire into document_processing.py behind Flagsmith flag | 0.5 day |
| Shadow mode comparison logging in Langfuse | 0.5 day |
| Tests | 1 day |
| Total | ~5 days |
Edge Cases
| Scenario | Handling |
|---|---|
| opendataloader server crashes | Fall back to PyMuPDF automatically. Log error. Alert via existing API failure notification system. |
| PDF too large for hybrid mode (>100 pages) | Use local-only mode (0.015s/page) for documents over threshold. Configurable. |
| Encrypted/password-protected PDF | Same as current — reject at upload with user-friendly error. |
| Corrupted PDF | opendataloader returns error → fall back to PyMuPDF → if both fail, store raw, queue for retry. |
| Non-Latin script lab report (Arabic, Hindi) | Hybrid mode supports 80+ languages via EasyOCR. Better than current PyMuPDF (Latin only without Unstructured). |
| Image-only PDF (photo of handwritten prescription) | Hybrid mode OCR handles this. Current path requires Unstructured.io. Removes that dependency. |
| Shadow mode doubles processing time | Acceptable during evaluation (~200 docs). Disable shadow after evaluation complete. |
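The crash, corruption, and large-document rows above amount to a page-count threshold plus a fallback chain. The sketch below shows one way to express that. It uses plain synchronous callables and stub extractors for brevity, and all names (including the "retry_queue" marker) are hypothetical.

```python
from typing import Callable, Optional

HYBRID_PAGE_LIMIT = 100  # threshold from the table above; intended to be configurable


def choose_mode(page_count: int, limit: int = HYBRID_PAGE_LIMIT) -> str:
    """Use hybrid mode up to the page limit, local-only mode beyond it."""
    return "hybrid" if page_count <= limit else "local"


def extract_with_fallback(
    file_bytes: bytes,
    primary: Callable[[bytes], str],
    fallback: Callable[[bytes], str],
    on_error: Callable[[str], None] = print,
) -> tuple[Optional[str], str]:
    """Try opendataloader, then PyMuPDF; if both fail, signal that a retry should be queued."""
    for name, extractor in (("opendataloader", primary), ("pymupdf", fallback)):
        try:
            return extractor(file_bytes), name
        except Exception as exc:
            on_error(f"{name} extraction failed: {exc!r}")  # hook for logging/alerting
    return None, "retry_queue"


# Stubs: the sidecar is down, PyMuPDF still works.
def crashed_sidecar(file_bytes: bytes) -> str:
    raise ConnectionError("hybrid server unreachable")


def working_pymupdf(file_bytes: bytes) -> str:
    return "Hemoglobin 12.8 g/dL"


errors: list[str] = []
text, used = extract_with_fallback(b"%PDF-", crashed_sidecar, working_pymupdf, errors.append)
print(used, choose_mode(12), choose_mode(250))
```

Returning the name of the extractor that actually produced the text makes fallbacks visible in logs, so a silently degraded sidecar shows up as a shift in that field rather than as a drop in extraction quality.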
Decision
Recommendation: Add as shadow-mode dual-path. The table extraction improvement (0.928 vs 0.401) is significant for lab reports — the most common patient upload. Removing Unstructured.io dependency is a bonus. Low risk — feature flagged, no behavior change until promoted after ~200 document evaluation.
Depends on: Nothing — can be implemented independently of the platform restructuring.