OpenDataLoader PDF: Evaluation & Integration Plan

Status: Evaluated, deferred (implement when ready)
Date: 2026-04-17
Session: 41
Repo: https://github.com/opendataloader-project/opendataloader-pdf
License: Apache 2.0

Overview

OpenDataLoader PDF is an open-source PDF parser producing structured, AI-ready output. Two extraction modes: local (deterministic Java parser, 0.015s/page) and hybrid (routes complex pages to Docling + EasyOCR + SmolVLM 256M vision model, 0.463s/page). Python SDK: `pip install opendataloader-pdf`. 17.7K GitHub stars, active development (last push April 2026).

Why Evaluate

Curaway's most common patient upload is the lab report: structured tables of parameters, values, and reference ranges. The current extractor scores 0.401 on table extraction accuracy (benchmarked as PyMuPDF4LLM in the comparison table). OpenDataLoader in hybrid mode scores 0.928, a 2.3x improvement on the most common document type.

Benchmark Comparison

| Tool | Overall accuracy | Table accuracy | Speed/page | Cost |
|------|------------------|----------------|------------|------|
| opendataloader-pdf (hybrid) | 0.907 | 0.928 | 0.463s | Free (local) |
| opendataloader-pdf (local) | 0.831 | n/a | 0.015s | Free (local) |
| Unstructured.io (hi_res) | 0.841 | n/a | API-dependent | Free 1K pages/mo, then paid |
| Docling | 0.882 | n/a | n/a | Free (local) |
| PyMuPDF4LLM | 0.732 | 0.401 | <0.01s | Free (local) |
| Marker | 0.861 | n/a | n/a | Free (local) |

Source: opendataloader-bench (200 real-world PDFs).

Current Pipeline vs Proposed

Current (PyMuPDF + Unstructured.io)

Patient uploads PDF
  → PyMuPDF extracts text inline (fast, 0 cost)
  → If scanned: Unstructured.io OCR (1K pages/mo free)
  → Raw text → Claude Haiku clinical extraction → FHIR resources

Weaknesses:

- Lab report tables lose structure, so Claude has to reconstruct them from flattened text
- Unstructured.io free tier caps at 1K pages/mo
- No bounding box data for document validation

Proposed (dual-path with feature flag)

Patient uploads PDF
  ├── Flagsmith: pdf_extractor = "pymupdf" (default)
  │     → PyMuPDF inline (current behavior, unchanged)
  │
  └── Flagsmith: pdf_extractor = "opendataloader"
        → opendataloader hybrid server (sidecar)
        → Structured Markdown + JSON with table bounding boxes
        → Same downstream: Claude Haiku → FHIR

Why dual-path:

- A/B test extraction quality on real patient documents
- PyMuPDF stays as the fast path for simple PDFs (text-heavy, no tables)
- opendataloader activates for complex PDFs (lab reports, multi-page radiology)
- Shadow mode first: run both extractors, compare outputs via Langfuse, then promote the winner

Matches existing patterns: WeightedScoringV1 vs AgentEnhancedMatching, risk_assessor vs future risk_assessor_llm_shadow.

Output Formats

opendataloader-pdf produces multiple output formats per document:

Markdown — clean text with table structure preserved:

## Complete Blood Count

| Parameter | Value | Unit | Reference Range |
|-----------|-------|------|-----------------|
| Hemoglobin | 12.8 | g/dL | 12.0 - 16.0 |
| WBC | 7.2 | x10³/µL | 4.5 - 11.0 |
| Platelets | 245 | x10³/µL | 150 - 400 |

JSON — structured elements with bounding boxes:

{
  "elements": [
    {
      "type": "table",
      "page": 1,
      "bbox": [72, 200, 540, 450],
      "content": {
        "headers": ["Parameter", "Value", "Unit", "Reference Range"],
        "rows": [
          ["Hemoglobin", "12.8", "g/dL", "12.0 - 16.0"],
          ["WBC", "7.2", "x10³/µL", "4.5 - 11.0"]
        ]
      }
    }
  ]
}

The JSON with bounding boxes is valuable for:

- Document validation (knowing WHERE on the page a value was found)
- Extraction audit trail (provenance for compliance)
- Future UI features (highlighting source regions in DocumentViewer)
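To make the provenance idea concrete, here is a minimal sketch that flattens the JSON shape shown above into one record per table row, each carrying its page number and the bounding box of the enclosing table. The `LabValue` type and `lab_values_from_json` helper are hypothetical names, and the sketch assumes the JSON has exactly the `elements` / `content.headers` / `content.rows` layout of the sample. Note that the sample only exposes a table-level bbox, so that is what each row inherits.

```python
import json
from dataclasses import dataclass

# Sample payload copied from the JSON example in this section.
SAMPLE = """
{
  "elements": [
    {
      "type": "table",
      "page": 1,
      "bbox": [72, 200, 540, 450],
      "content": {
        "headers": ["Parameter", "Value", "Unit", "Reference Range"],
        "rows": [
          ["Hemoglobin", "12.8", "g/dL", "12.0 - 16.0"],
          ["WBC", "7.2", "x10³/µL", "4.5 - 11.0"]
        ]
      }
    }
  ]
}
"""


@dataclass(frozen=True)
class LabValue:
    """One table row plus where it was found (hypothetical type)."""
    parameter: str
    value: str
    unit: str
    reference_range: str
    page: int
    table_bbox: tuple  # bbox of the enclosing table, not of the single cell


def lab_values_from_json(raw: str) -> list[LabValue]:
    """Flatten every table element into rows that keep page and bbox."""
    doc = json.loads(raw)
    values = []
    for element in doc.get("elements", []):
        if element.get("type") != "table":
            continue  # only tables carry lab parameters in this sketch
        content = element["content"]
        for row in content["rows"]:
            # Pair each cell with its column header so lookups are by name.
            cells = dict(zip(content["headers"], row))
            values.append(LabValue(
                parameter=cells.get("Parameter", ""),
                value=cells.get("Value", ""),
                unit=cells.get("Unit", ""),
                reference_range=cells.get("Reference Range", ""),
                page=element["page"],
                table_bbox=tuple(element["bbox"]),
            ))
    return values
```

Calling `lab_values_from_json(SAMPLE)` yields two records, both pointing at page 1 and the same table bbox, which is enough to drive an audit trail or a highlight in a viewer.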

Medical Document Impact

opendataloader-pdf has no medical-domain awareness. It is a layout parser. The clinical intelligence remains in Claude Haiku (existing Clinical Context Agent). The improvement is in the quality of input to Claude:

| Document type | Current input to Claude | With opendataloader |
|---------------|-------------------------|---------------------|
| Lab report (structured table) | Flattened text, table structure lost | Preserved table with headers, rows, reference ranges |
| Radiology report (narrative + measurements) | Mixed text, measurements may be misplaced | Semantic sections with bounding boxes |
| Discharge summary (multi-page narrative) | Good (mostly text) | Similar quality, marginal improvement |
| Scanned handwritten prescription | Unstructured.io OCR | EasyOCR 80+ languages, 300+ DPI |

Biggest win: Lab reports. Better table input → Claude extracts parameter values more accurately → better comorbidity detection → better PFS scoring.

Deployment Architecture

Railway sidecar

opendataloader hybrid mode runs a local server. Deploy as a sidecar process on Railway:

Railway container:
  ├── FastAPI (main app, port 8000)
  └── opendataloader-pdf hybrid server (port 5002)

Dockerfile additions:

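# Add Java 11 for opendataloader
# (package availability depends on the base image's distro release)
RUN apt-get update \
    && apt-get install -y --no-install-recommends openjdk-11-jre-headless \
    && rm -rf /var/lib/apt/lists/*

# Install opendataloader-pdf
RUN pip install --no-cache-dir opendataloader-pdf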

Startup: Add `opendataloader-pdf-hybrid --port 5002 &` to the entrypoint so the hybrid server launches before FastAPI starts, or use Railway's multi-process support.
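Backgrounding the server with `&` does not guarantee it is ready when FastAPI begins accepting uploads, since the JVM needs time to warm up. One way to enforce the ordering is a small readiness check, sketched below. The `wait_for_port` and `start_sidecar` helpers are hypothetical, and the sketch assumes an open TCP port is a good enough readiness signal (the server may expose a better health endpoint).

```python
import socket
import subprocess
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0,
                  interval: float = 0.5) -> bool:
    """Return True once a TCP connection succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)  # not up yet; retry until the deadline
    return False


def start_sidecar(port: int = 5002) -> subprocess.Popen:
    """Launch the hybrid server and block until its port is open."""
    # Command taken from the startup note above.
    proc = subprocess.Popen(["opendataloader-pdf-hybrid", "--port", str(port)])
    if not wait_for_port("127.0.0.1", port):
        proc.terminate()
        raise RuntimeError(f"hybrid server did not open port {port} in time")
    return proc
```

A launcher script could call `start_sidecar()` first and only then start the FastAPI process, keeping the returned handle so the sidecar can be terminated on shutdown.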

Alternative: Separate Railway service for opendataloader if resource isolation is needed. Internal networking, no public exposure.

Service integration

New service: app/services/opendataloader_extractor.py

# Conceptual
async def extract_with_opendataloader(file_bytes: bytes, filename: str) -> ExtractedDocument:
    """Extract structured content from PDF using opendataloader-pdf hybrid mode."""
    # Call local hybrid server
    # Return structured Markdown + JSON with bounding boxes
    # Same ExtractedDocument interface as PyMuPDF path
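
A slightly fuller sketch of the conceptual function above might look like the following. It is illustrative only: the `ExtractedDocument` fields are hypothetical stand-ins for the real shared type, and the actual call to the hybrid server is injected as a `convert` callable because its exact signature is not specified here. The assumed contract is that `convert(pdf_path, out_dir)` writes `<stem>.md` and `<stem>.json` into `out_dir`.

```python
import asyncio
import json
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable


@dataclass
class ExtractedDocument:
    """Hypothetical stand-in for the shared extractor result type."""
    text: str                                     # Markdown with tables preserved
    elements: list = field(default_factory=list)  # JSON elements with bboxes
    extractor: str = "opendataloader"


# Assumed contract, not a confirmed SDK signature:
# convert(pdf_path, out_dir) writes <stem>.md and <stem>.json into out_dir.
Converter = Callable[[Path, Path], None]


async def extract_with_opendataloader(
    file_bytes: bytes, filename: str, convert: Converter
) -> ExtractedDocument:
    """Write the upload to a temp dir, convert it, and read both outputs."""
    with tempfile.TemporaryDirectory() as tmp:
        # Keep only the base name so a client-supplied path cannot escape tmp.
        safe_name = Path(filename).name or "upload.pdf"
        pdf_path = Path(tmp) / safe_name
        pdf_path.write_bytes(file_bytes)
        out_dir = Path(tmp) / "out"
        out_dir.mkdir()

        # The converter is blocking, so run it off the event loop.
        await asyncio.to_thread(convert, pdf_path, out_dir)

        markdown = (out_dir / f"{pdf_path.stem}.md").read_text(encoding="utf-8")
        parsed = json.loads(
            (out_dir / f"{pdf_path.stem}.json").read_text(encoding="utf-8")
        )
    return ExtractedDocument(text=markdown, elements=parsed.get("elements", []))
```

Injecting `convert` keeps the function testable without a running sidecar. In the service it could be bound once (for example with `functools.partial`) so callers still see the two-argument signature.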

Wired into existing document_processing.py:

async def run_post_ocr_pipeline(document_id, ...):
    extractor = get_feature_flag("pdf_extractor")  # "pymupdf" or "opendataloader"

    if extractor == "opendataloader":
        extracted = await extract_with_opendataloader(file_bytes, filename)
    else:
        extracted = extract_with_pymupdf(file_bytes, filename)  # current path

    # Rest of pipeline unchanged: Claude extraction → FHIR → EHR rebuild

Shadow Mode

Before promoting opendataloader as default:

1. Both extractors run on every PDF (behind the `pdf_extractor_shadow` flag)
2. Both outputs are logged to Langfuse with comparison metadata
3. Claude runs on both inputs separately; compare FHIR output quality
4. Metrics tracked:
   - Parameter extraction count (did opendataloader find more values?)
   - Table reconstruction accuracy (manual spot-check)
   - Comorbidity detection rate (did better input → more detected conditions?)
   - Processing time (PyMuPDF <0.01s vs opendataloader 0.5s)
5. After ~200 documents: evaluate and decide whether to promote
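
The comparison metadata from the steps above could be assembled roughly as follows. This is a sketch under assumptions: `ExtractorRun` and `comparison_metadata` are hypothetical names, the parameter and condition lists are assumed to come from the downstream Claude step for each input, and the resulting dict is simply what would be attached to a trace (no Langfuse call is shown).

```python
from dataclasses import dataclass


@dataclass
class ExtractorRun:
    """Outcome of one extractor plus its downstream results (hypothetical)."""
    name: str               # e.g. "pymupdf" or "opendataloader"
    seconds: float          # extraction wall-clock time
    parameters: list[str]   # lab parameter names extracted downstream
    conditions: list[str]   # comorbidities detected downstream


def comparison_metadata(primary: ExtractorRun, shadow: ExtractorRun) -> dict:
    """Build the per-document metrics tracked during shadow mode."""
    p, s = set(primary.parameters), set(shadow.parameters)
    return {
        "primary": primary.name,
        "shadow": shadow.name,
        # Parameter extraction count per extractor.
        "parameter_count": {primary.name: len(p), shadow.name: len(s)},
        # Which values each side found that the other missed.
        "only_in_shadow": sorted(s - p),
        "only_in_primary": sorted(p - s),
        # Comorbidity detection per extractor.
        "condition_count": {
            primary.name: len(set(primary.conditions)),
            shadow.name: len(set(shadow.conditions)),
        },
        # Processing time per extractor.
        "seconds": {primary.name: primary.seconds, shadow.name: shadow.seconds},
    }
```

Table reconstruction accuracy is deliberately absent, since the plan treats it as a manual spot-check. The `only_in_*` lists give reviewers a starting point for that check.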

Concerns

| Concern | Severity | Mitigation |
|---------|----------|------------|
| JVM per call (~200-500ms startup) | Medium | Run hybrid server as persistent sidecar (warm JVM) |
| Java 11+ in Docker image | Low | `openjdk-11-jre-headless` adds ~150MB to image |
| Relative newness (11 months) | Low | Pin version, abstract behind service interface |
| No medical NER | Non-issue | Claude Haiku handles this; opendataloader replaces layout extraction, not clinical intelligence |
| Hybrid mode needs Docling backend | Low | Bundled with pip install, runs locally |
| PDF-only (no Word, PPT) | Non-issue | Curaway patients upload PDFs and images, not Office docs |

Flagsmith Flags

| Flag | Type | Default | Description |
|------|------|---------|-------------|
| `pdf_extractor` | String | `"pymupdf"` | Which extraction engine to use: `"pymupdf"` or `"opendataloader"` |
| `pdf_extractor_shadow` | Boolean | `false` | Run both extractors and log comparison (shadow mode) |

Implementation Effort

| Task | Effort |
|------|--------|
| Add opendataloader-pdf to backend dependencies | 1 day |
| Add Java 11 to Railway Docker image | 0.5 day |
| Run opendataloader hybrid server as sidecar | 0.5 day |
| Create `opendataloader_extractor.py` service | 1 day |
| Wire into `document_processing.py` behind Flagsmith flag | 0.5 day |
| Shadow mode comparison logging in Langfuse | 0.5 day |
| Tests | 1 day |
| Total | ~5 days |

Edge Cases

| Scenario | Handling |
|----------|----------|
| opendataloader server crashes | Fall back to PyMuPDF automatically. Log error. Alert via existing API failure notification system. |
| PDF too large for hybrid mode (>100 pages) | Use local-only mode (0.015s/page) for documents over threshold. Configurable. |
| Encrypted/password-protected PDF | Same as current: reject at upload with user-friendly error. |
| Corrupted PDF | opendataloader returns error → fall back to PyMuPDF → if both fail, store raw, queue for retry. |
| Non-Latin script lab report (Arabic, Hindi) | Hybrid mode supports 80+ languages via EasyOCR. Better than current PyMuPDF (Latin only without Unstructured). |
| Image-only PDF (photo of handwritten prescription) | Hybrid mode OCR handles this. Current path requires Unstructured.io. Removes that dependency. |
| Shadow mode doubles processing time | Acceptable during evaluation (~200 docs). Disable shadow after evaluation complete. |
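
The crash, corruption, and size-threshold rows in the table above can be sketched as a small fallback chain. All names here (`choose_mode`, `extract_with_fallback`, `ExtractionFailed`, `HYBRID_MAX_PAGES`) are hypothetical, and the extractors are passed in as plain callables so the sketch makes no assumption about either library's API. Alerting and the retry queue are left to the caller.

```python
import logging
from typing import Callable

logger = logging.getLogger("pdf_extraction")

Extractor = Callable[[bytes], str]

# Threshold from the table above; intended to be configurable.
HYBRID_MAX_PAGES = 100


class ExtractionFailed(Exception):
    """Both extractors failed; caller should store raw and queue a retry."""


def choose_mode(page_count: int, max_pages: int = HYBRID_MAX_PAGES) -> str:
    """Use hybrid mode up to the threshold, local-only mode beyond it."""
    return "hybrid" if page_count <= max_pages else "local"


def extract_with_fallback(
    file_bytes: bytes, primary: Extractor, fallback: Extractor
) -> tuple[str, str]:
    """Try opendataloader first, then PyMuPDF; report which one produced text."""
    try:
        return primary(file_bytes), "opendataloader"
    except Exception:
        # Covers both a crashed server and a parser error on a corrupted PDF.
        logger.exception("opendataloader failed; falling back to PyMuPDF")
    try:
        return fallback(file_bytes), "pymupdf"
    except Exception as exc:
        raise ExtractionFailed("both extractors failed") from exc
```

Returning the extractor name alongside the text lets downstream logging record which path actually served each document, which matters when reading shadow-mode comparisons.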

Decision

Recommendation: Add as shadow-mode dual-path. The table extraction improvement (0.928 vs 0.401) is significant for lab reports — the most common patient upload. Removing Unstructured.io dependency is a bonus. Low risk — feature flagged, no behavior change until promoted after ~200 document evaluation.

Depends on: Nothing — can be implemented independently of the platform restructuring.