OpenDataLoader PDF — Evaluation & Integration Plan

Status: Evaluated, deferred (implement when ready)
Date: 2026-04-17
Session: 41
Repo: https://github.com/opendataloader-project/opendataloader-pdf
License: Apache 2.0


Overview

OpenDataLoader PDF is an open-source PDF parser producing structured, AI-ready output. Two extraction modes: local (deterministic Java parser, 0.015s/page) and hybrid (routes complex pages to Docling + EasyOCR + SmolVLM 256M vision model, 0.463s/page). Python SDK: pip install opendataloader-pdf. 17.7K GitHub stars, active development (last push April 2026).

Why Evaluate

Curaway's #1 patient upload is lab reports — structured tables with parameters, values, reference ranges. Current pipeline (PyMuPDF) scores 0.401 on table extraction accuracy. OpenDataLoader scores 0.928 — a 2.3x improvement on the most common document type.

Benchmark Comparison

| Tool | Overall accuracy | Table accuracy | Speed/page | Cost |
|------|------------------|----------------|------------|------|
| opendataloader-pdf (hybrid) | 0.907 | 0.928 | 0.463s | Free (local) |
| opendataloader-pdf (local) | 0.831 | | 0.015s | Free (local) |
| Unstructured.io (hi_res) | 0.841 | | API-dependent | Free 1K pages/mo, then paid |
| Docling | 0.882 | | | Free (local) |
| PyMuPDF4LLM | 0.732 | 0.401 | <0.01s | Free (local) |
| Marker | 0.861 | | | Free (local) |

Source: opendataloader-bench (200 real-world PDFs).

Current Pipeline vs Proposed

Current (PyMuPDF + Unstructured.io)

Patient uploads PDF
  → PyMuPDF extracts text inline (fast, 0 cost)
  → If scanned: Unstructured.io OCR (1K pages/mo free)
  → Raw text → Claude Haiku clinical extraction → FHIR resources

Weaknesses:

- Lab report tables lose structure, so Claude has to reconstruct them from flattened text
- Unstructured.io free tier caps at 1K pages/mo
- No bounding box data for document validation

Proposed (dual-path with feature flag)

Patient uploads PDF
  ├── Flagsmith: pdf_extractor = "pymupdf" (default)
  │     → PyMuPDF inline (current behavior, unchanged)
  └── Flagsmith: pdf_extractor = "opendataloader"
        → opendataloader hybrid server (sidecar)
        → Structured Markdown + JSON with table bounding boxes
        → Same downstream: Claude Haiku → FHIR

Why dual-path:

- A/B test extraction quality on real patient documents
- PyMuPDF stays as the fast path for simple PDFs (text-heavy, no tables)
- opendataloader activates for complex PDFs (lab reports, multi-page radiology)
- Shadow mode first: run both extractors, compare outputs via Langfuse, then promote the winner

Matches existing patterns: WeightedScoringV1 vs AgentEnhancedMatching, risk_assessor vs future risk_assessor_llm_shadow.

Output Formats

opendataloader-pdf produces multiple output formats per document:

Markdown — clean text with table structure preserved:

## Complete Blood Count

| Parameter | Value | Unit | Reference Range |
|-----------|-------|------|-----------------|
| Hemoglobin | 12.8 | g/dL | 12.0 - 16.0 |
| WBC | 7.2 | x10³/µL | 4.5 - 11.0 |
| Platelets | 245 | x10³/µL | 150 - 400 |

JSON — structured elements with bounding boxes:

{
  "elements": [
    {
      "type": "table",
      "page": 1,
      "bbox": [72, 200, 540, 450],
      "content": {
        "headers": ["Parameter", "Value", "Unit", "Reference Range"],
        "rows": [
          ["Hemoglobin", "12.8", "g/dL", "12.0 - 16.0"],
          ["WBC", "7.2", "x10³/µL", "4.5 - 11.0"]
        ]
      }
    }
  ]
}

The JSON with bounding boxes is valuable for:

- Document validation (knowing WHERE on the page a value was found)
- Extraction audit trail (provenance for compliance)
- Future UI features (highlighting source regions in DocumentViewer)
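As a rough illustration of how that provenance could be carried downstream, the sketch below flattens table elements into one record per row, each tagged with the page and table-level bounding box. It assumes the JSON shape shown in the sample above, which is illustrative; the real output schema should be checked before relying on these keys. The function name iter_table_rows is hypothetical.

```python
from typing import Any, Iterator


def iter_table_rows(doc: dict[str, Any]) -> Iterator[dict[str, Any]]:
    """Yield one record per table row, tagged with page and table bounding box."""
    for element in doc.get("elements", []):
        if element.get("type") != "table":
            continue
        content = element.get("content", {})
        headers = content.get("headers", [])
        for row in content.get("rows", []):
            yield {
                "values": dict(zip(headers, row)),  # header -> cell text
                "page": element.get("page"),
                "bbox": element.get("bbox"),  # table-level box, as in the sample
            }


# Sample document copied from the JSON example above (assumed shape).
sample = {
    "elements": [
        {
            "type": "table",
            "page": 1,
            "bbox": [72, 200, 540, 450],
            "content": {
                "headers": ["Parameter", "Value", "Unit", "Reference Range"],
                "rows": [
                    ["Hemoglobin", "12.8", "g/dL", "12.0 - 16.0"],
                    ["WBC", "7.2", "x10³/µL", "4.5 - 11.0"],
                ],
            },
        }
    ]
}

records = list(iter_table_rows(sample))
```

Each record keeps the cell values as strings; numeric parsing and unit handling stay with the clinical extraction step.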

Medical Document Impact

opendataloader-pdf has no medical-domain awareness. It is a layout parser. The clinical intelligence remains in Claude Haiku (existing Clinical Context Agent). The improvement is in the quality of input to Claude:

| Document type | Current input to Claude | With opendataloader |
|---------------|-------------------------|---------------------|
| Lab report (structured table) | Flattened text, table structure lost | Preserved table with headers, rows, reference ranges |
| Radiology report (narrative + measurements) | Mixed text, measurements may be misplaced | Semantic sections with bounding boxes |
| Discharge summary (multi-page narrative) | Good (mostly text) | Similar quality, marginal improvement |
| Scanned handwritten prescription | Unstructured.io OCR | EasyOCR 80+ languages, 300+ DPI |

Biggest win: Lab reports. Better table input → Claude extracts parameter values more accurately → better comorbidity detection → better PFS scoring.

Deployment Architecture

Railway sidecar

opendataloader hybrid mode runs a local server. Deploy as a sidecar process on Railway:

Railway container:
  ├── FastAPI (main app, port 8000)
  └── opendataloader-pdf hybrid server (port 5002)

Dockerfile additions:

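# Add Java 11 for opendataloader (package availability depends on the base image's distro release)
RUN apt-get update \
    && apt-get install -y --no-install-recommends openjdk-11-jre-headless \
    && rm -rf /var/lib/apt/lists/*

# Install opendataloader-pdf (pin a version once one is chosen)
RUN pip install --no-cache-dir opendataloader-pdf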

Startup: launch the hybrid server in the background from the entrypoint (opendataloader-pdf-hybrid --port 5002 &) before FastAPI starts, or use Railway's multi-process support.

Alternative: Separate Railway service for opendataloader if resource isolation is needed. Internal networking, no public exposure.
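One way to avoid FastAPI accepting uploads before the sidecar is ready is a small launcher that starts the hybrid server, waits for its port to accept connections, and only then starts the app. The following is a minimal sketch under assumptions: the sidecar command and port come from the startup note above, while the helper names and the app.main:app module path are hypothetical placeholders.

```python
import socket
import subprocess
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.2)  # not listening yet; retry shortly
    return False


def start_sidecar_then_app() -> None:
    """Start the hybrid server, wait until it listens, then run the main app."""
    # Command taken from the startup note; no flags beyond --port are assumed.
    sidecar = subprocess.Popen(["opendataloader-pdf-hybrid", "--port", "5002"])
    if not wait_for_port("127.0.0.1", 5002):
        sidecar.terminate()
        raise RuntimeError("opendataloader hybrid server did not start in time")
    # Hypothetical ASGI target; replace with the project's real module path.
    subprocess.run(
        ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"],
        check=True,
    )
```

Calling start_sidecar_then_app() from the container entrypoint would replace the backgrounded shell command. A TCP connect only shows that the port is open, not that models are loaded, so a real readiness probe may need more than this.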

Service integration

New service: app/services/opendataloader_extractor.py

# Conceptual
async def extract_with_opendataloader(file_bytes: bytes, filename: str) -> ExtractedDocument:
    """Extract structured content from PDF using opendataloader-pdf hybrid mode."""
    # Call local hybrid server
    # Return structured Markdown + JSON with bounding boxes
    # Same ExtractedDocument interface as PyMuPDF path

Wired into existing document_processing.py:

async def run_post_ocr_pipeline(document_id, ...):
    extractor = get_feature_flag("pdf_extractor")  # "pymupdf" or "opendataloader"

    if extractor == "opendataloader":
        extracted = await extract_with_opendataloader(file_bytes, filename)
    else:
        extracted = extract_with_pymupdf(file_bytes, filename)  # current path

    # Rest of pipeline unchanged: Claude extraction → FHIR → EHR rebuild

Shadow Mode

Before promoting opendataloader as default:

  1. Both extractors run on every PDF (behind pdf_extractor_shadow flag)
  2. Both outputs logged to Langfuse with comparison metadata
  3. Claude runs on both inputs separately; compare FHIR output quality
  4. Metrics tracked:
       - Parameter extraction count (did opendataloader find more values?)
       - Table reconstruction accuracy (manual spot-check)
       - Comorbidity detection rate (did better input → more detected conditions?)
       - Processing time (PyMuPDF <0.01s vs opendataloader 0.5s)
  5. After ~200 documents: evaluate and decide whether to promote
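The steps above can be sketched as a wrapper that always returns the primary extractor's output and treats the shadow run as best-effort. This is a minimal sketch, not the planned implementation: the extractors and the log callable are injected stand-ins (log is where a Langfuse call would go, and no Langfuse API is assumed), and count_table_rows is a hypothetical proxy metric that counts Markdown pipe-table body rows.

```python
import time
from typing import Any, Callable


def _is_separator(line: str) -> bool:
    """True for a Markdown table separator line such as |---|:--:|."""
    return line.startswith("|") and set(line) <= set("|-: ")


def count_table_rows(markdown: str) -> int:
    """Count pipe-table body rows, excluding header and separator lines."""
    lines = [ln.strip() for ln in markdown.splitlines()]
    rows = 0
    for i, ln in enumerate(lines):
        if not ln.startswith("|") or _is_separator(ln):
            continue
        is_header = i + 1 < len(lines) and _is_separator(lines[i + 1])
        if not is_header:
            rows += 1
    return rows


def run_shadow_comparison(
    file_bytes: bytes,
    primary: Callable[[bytes], str],
    shadow: Callable[[bytes], str],
    log: Callable[[dict[str, Any]], None],
) -> str:
    """Run both extractors, log comparison metadata, return only the primary output."""
    t0 = time.perf_counter()
    primary_text = primary(file_bytes)
    record: dict[str, Any] = {
        "primary_seconds": time.perf_counter() - t0,
        "primary_chars": len(primary_text),
    }
    try:
        t1 = time.perf_counter()
        shadow_text = shadow(file_bytes)
        record.update(
            shadow_seconds=time.perf_counter() - t1,
            shadow_chars=len(shadow_text),
            shadow_table_rows=count_table_rows(shadow_text),
            shadow_error=None,
        )
    except Exception as exc:  # a shadow failure must never affect the live result
        record["shadow_error"] = repr(exc)
    log(record)
    return primary_text
```

Running the shadow extractor after the primary keeps the comparison simple, at the cost of the doubled processing time already accepted for the evaluation window.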

Concerns

| Concern | Severity | Mitigation |
|---------|----------|------------|
| JVM per call (~200-500ms startup) | Medium | Run hybrid server as persistent sidecar (warm JVM) |
| Java 11+ in Docker image | Low | openjdk-11-jre-headless adds ~150MB to image |
| Relative newness (11 months) | Low | Pin version, abstract behind service interface |
| No medical NER | Non-issue | Claude Haiku handles this; opendataloader replaces layout extraction, not clinical intelligence |
| Hybrid mode needs Docling backend | Low | Bundled with pip install, runs locally |
| PDF-only (no Word, PPT) | Non-issue | Curaway patients upload PDFs and images, not Office docs |

Flagsmith Flags

| Flag | Type | Default | Description |
|------|------|---------|-------------|
| pdf_extractor | String | "pymupdf" | Which extraction engine to use: "pymupdf" or "opendataloader" |
| pdf_extractor_shadow | Boolean | false | Run both extractors and log comparison (shadow mode) |
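Because a string flag can be mistyped or missing, it may help to normalize flag values before routing so that any unexpected value falls back to the current behavior. A minimal sketch, assuming the raw values have already been fetched from Flagsmith by existing code; the helper names resolve_extractor and resolve_shadow are hypothetical.

```python
from typing import Optional

ALLOWED_EXTRACTORS = {"pymupdf", "opendataloader"}
DEFAULT_EXTRACTOR = "pymupdf"  # matches the flag's documented default


def resolve_extractor(raw_value: Optional[str]) -> str:
    """Normalize the pdf_extractor flag, falling back to the default on bad input."""
    if not isinstance(raw_value, str):
        return DEFAULT_EXTRACTOR
    value = raw_value.strip().strip('"').lower()
    return value if value in ALLOWED_EXTRACTORS else DEFAULT_EXTRACTOR


def resolve_shadow(raw_value: object) -> bool:
    """Enable shadow mode only for an explicit True or the string "true"."""
    if isinstance(raw_value, bool):
        return raw_value
    return isinstance(raw_value, str) and raw_value.strip().lower() == "true"
```

Defaulting to "pymupdf" on any unrecognized value keeps a misconfigured flag from changing patient-facing behavior.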

Implementation Effort

| Task | Effort |
|------|--------|
| Add opendataloader-pdf to backend dependencies | 1 day |
| Add Java 11 to Railway Docker image | 0.5 day |
| Run opendataloader hybrid server as sidecar | 0.5 day |
| Create opendataloader_extractor.py service | 1 day |
| Wire into document_processing.py behind Flagsmith flag | 0.5 day |
| Shadow mode comparison logging in Langfuse | 0.5 day |
| Tests | 1 day |
| Total | ~5 days |

Edge Cases

| Scenario | Handling |
|----------|----------|
| opendataloader server crashes | Fall back to PyMuPDF automatically. Log error. Alert via existing API failure notification system. |
| PDF too large for hybrid mode (>100 pages) | Use local-only mode (0.015s/page) for documents over threshold. Configurable. |
| Encrypted/password-protected PDF | Same as current: reject at upload with user-friendly error. |
| Corrupted PDF | opendataloader returns error → fall back to PyMuPDF → if both fail, store raw, queue for retry. |
| Non-Latin script lab report (Arabic, Hindi) | Hybrid mode supports 80+ languages via EasyOCR. Better than current PyMuPDF (Latin only without Unstructured). |
| Image-only PDF (photo of handwritten prescription) | Hybrid mode OCR handles this. Current path requires Unstructured.io. Removes that dependency. |
| Shadow mode doubles processing time | Acceptable during evaluation (~200 docs). Disable shadow after evaluation complete. |
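The crash, corruption, and page-threshold rows can be expressed as two small helpers. The sketch below is illustrative only: the extractors are injected callables rather than real client code, on_error stands in for the existing logging and alerting hook, and the names choose_mode and extract_with_fallback are hypothetical.

```python
import asyncio
from typing import Awaitable, Callable

HYBRID_PAGE_LIMIT = 100  # threshold from the edge-case list; meant to be configurable


def choose_mode(page_count: int, limit: int = HYBRID_PAGE_LIMIT) -> str:
    """Pick hybrid mode up to the page limit, local-only mode beyond it."""
    return "hybrid" if page_count <= limit else "local"


async def extract_with_fallback(
    file_bytes: bytes,
    primary: Callable[[bytes], Awaitable[str]],
    fallback: Callable[[bytes], str],
    on_error: Callable[[Exception], None],
) -> str:
    """Try the opendataloader path; on any failure, report it and use the fallback."""
    try:
        return await primary(file_bytes)
    except Exception as exc:
        on_error(exc)  # log and alert hook
        # If the fallback also raises, the exception propagates so the caller
        # can store the raw file and queue a retry.
        return fallback(file_bytes)
```

A caller would wrap this in its own try/except to implement the "both fail" branch, since storing the raw upload and queueing a retry belong to the pipeline rather than to the extractor.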

Decision

Recommendation: Add as shadow-mode dual-path. The table extraction improvement (0.928 vs 0.401) is significant for lab reports — the most common patient upload. Removing Unstructured.io dependency is a bonus. Low risk — feature flagged, no behavior change until promoted after ~200 document evaluation.

Depends on: Nothing — can be implemented independently of the platform restructuring.