Document Pipeline¶

Overview¶

The Document Pipeline handles the complete lifecycle of medical documents in Curaway -- from patient upload through OCR extraction, clinical entity recognition, FHIR resource generation, and requirement matching. It is designed to be resilient, asynchronous, and privacy-aware.

DICOM support: For the planned DICOM imaging workflow (.dcm files, WADO-RS retrieval, pixel-data extraction), see docs/specs/dicom-support-feature.md.

Upload Flow¶

Architecture¶

sequenceDiagram
    participant Client as Frontend (Next.js)
    participant API as FastAPI
    participant R2 as Cloudflare R2
    participant QS as QStash
    participant OCR as OCR Pipeline
    participant CCA as Clinical Context Agent
    participant DB as PostgreSQL

    Client->>API: POST /uploads/presign
    API->>API: Validate: extension, size, consent
    API-->>Client: {presigned_url, document_id}
    Client->>R2: PUT file (presigned URL)
    Client->>API: POST /uploads/confirm
    API->>DB: Update document status → "uploaded"
    API->>QS: Enqueue OCR callback
    QS->>OCR: Trigger OCR processing
    OCR->>DB: Update document with extracted text
    OCR->>CCA: Trigger Clinical Context Agent
    CCA->>DB: Store FHIR resources
    CCA->>DB: Update document status → "processed"

Step 1: Presign Request¶

POST /api/v1/uploads/presign

class PresignRequest(BaseModel):
    case_id: UUID
    filename: str
    content_type: str           # "application/pdf", "image/jpeg", etc.
    file_size_bytes: int

class PresignResponse(BaseModel):
    document_id: UUID
    presigned_url: str          # R2 presigned PUT URL (15-minute expiry)
    expires_at: datetime

The API performs these validations before generating a presigned URL:

Validation	Rule	Error Code
File extension	`.pdf`, `.jpg`, `.jpeg`, `.png`, `.tiff`, `.dicom`	`INVALID_EXTENSION`
File size (minimum)	>= 5 KB	`FILE_TOO_SMALL`
File size (maximum)	<= 20 MB	`FILE_TOO_LARGE`
Content type	Must match extension	`CONTENT_TYPE_MISMATCH`
Patient consent	Active consent record must exist	`CONSENT_REQUIRED`
Case status	Case must be in active workflow phase	`CASE_NOT_ACTIVE`
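The stateless rows of the table above can be sketched as a single function. This is a minimal illustration, not the service's actual code: the name `validate_presign` and the extension-to-MIME mapping are assumptions, and the consent and case-status checks are left out because they need database access.

```python
from pathlib import PurePath
from typing import Optional

# Assumed mapping for illustration; the real allow-list lives in service config.
ALLOWED_TYPES = {
    ".pdf": {"application/pdf"},
    ".jpg": {"image/jpeg"},
    ".jpeg": {"image/jpeg"},
    ".png": {"image/png"},
    ".tiff": {"image/tiff"},
    ".dicom": {"application/dicom"},
}
MIN_SIZE_BYTES = 5 * 1024          # 5 KB
MAX_SIZE_BYTES = 20 * 1024 * 1024  # 20 MB


def validate_presign(filename: str, content_type: str, file_size_bytes: int) -> Optional[str]:
    """Return the first failing error code, or None when the file passes."""
    ext = PurePath(filename).suffix.lower()
    if ext not in ALLOWED_TYPES:
        return "INVALID_EXTENSION"
    if file_size_bytes < MIN_SIZE_BYTES:
        return "FILE_TOO_SMALL"
    if file_size_bytes > MAX_SIZE_BYTES:
        return "FILE_TOO_LARGE"
    if content_type not in ALLOWED_TYPES[ext]:
        return "CONTENT_TYPE_MISMATCH"
    return None
```

Returning the first failing code keeps the response simple; a real handler might collect every failure instead.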

Step 2: Direct Upload to R2¶

The client uploads directly to Cloudflare R2 using the presigned URL. This avoids routing large files through the API server:

// Frontend upload (Next.js)
const upload = async (file: File, presignedUrl: string) => {
  const response = await fetch(presignedUrl, {
    method: "PUT",
    body: file,
    headers: { "Content-Type": file.type },
  });
  if (!response.ok) throw new UploadError("R2 upload failed");
};

Step 3: Confirm Upload¶

POST /api/v1/uploads/confirm

class ConfirmRequest(BaseModel):
    document_id: UUID
    sha256_hash: str            # Client-computed hash for dedup

On confirmation, the API:

1. Verifies the file exists in R2
2. Checks the SHA-256 hash for deduplication
3. Updates document status to uploaded
4. Runs inline PyMuPDF OCR if the enable_inline_ocr flag is on (fast path, see below)
5. Enqueues the full OCR callback via QStash unless the inline path already processed the document (async fallback / complex documents)

Inline OCR Fast Path¶

When enable_inline_ocr is enabled (Flagsmith), PyMuPDF attempts text extraction synchronously inside confirm_upload. The decision tree determines whether to proceed inline or fall through to the async QStash path:

confirm_upload
  ├─ PyMuPDF extracts text
  │   ├─ > 100 chars extracted → process synchronously, skip QStash
  │   ├─ <= 100 chars → fall through to QStash async (scanned/image PDF)
  │   └─ error → fall through to QStash async
  └─ Flag off → always enqueue to QStash

The 100-character threshold distinguishes text-based PDFs (which PyMuPDF handles well) from scanned documents (which need Unstructured.io or Claude Vision via the full OCR chain).
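The decision tree above reduces to a small pure function. The sketch below is illustrative only: `choose_ocr_path` is a hypothetical name, and it assumes the threshold is applied to the raw character count of the extracted text.

```python
from typing import Optional

INLINE_TEXT_THRESHOLD = 100  # characters, per the decision tree


def choose_ocr_path(flag_enabled: bool, extracted_text: Optional[str]) -> str:
    """Return "inline" or "qstash".

    extracted_text is None when PyMuPDF raised an error.
    """
    if not flag_enabled:
        return "qstash"  # flag off: always enqueue
    if extracted_text is None:
        return "qstash"  # extraction error: fall through to async
    if len(extracted_text) > INLINE_TEXT_THRESHOLD:
        return "inline"  # text-based PDF: process synchronously
    return "qstash"      # likely scanned or image PDF
```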

Both paths share app/services/document_processing.run_post_ocr_pipeline() to ensure identical downstream processing — Clinical Context Agent, requirement matching, embedding matching, EHR rebuild, and SSE progress events all run through the same function regardless of which path extracted the text.

Step 4: QStash OCR Callback¶

QStash delivers an async callback to the OCR processing endpoint:

@router.post("/internal/callbacks/ocr")
async def ocr_callback(
    request: QStashCallbackRequest,
    # QStash sends its signature in the Upstash-Signature header
    x_qstash_signature: str = Header(alias="Upstash-Signature"),
):
    """Process OCR for an uploaded document (QStash callback)."""
    verify_qstash_signature(request, x_qstash_signature)
    document = await document_service.get(request.document_id)
    await ocr_pipeline.process(document)

Why QStash?

OCR processing can take 5-30 seconds depending on document complexity. QStash provides reliable async delivery with automatic retries (3 attempts, exponential backoff), dead letter queues, and signature verification -- all on the free tier.

OCR Stack¶

The OCR pipeline uses an ordered fallback chain, trying each method in sequence until one succeeds:

graph TD
    A[Document Upload] --> B{PyMuPDF}
    B -->|Success| G[Extracted Text]
    B -->|Failure/Low Quality| C{Unstructured.io}
    C -->|Success| G
    C -->|Failure| D{Claude Vision}
    D -->|Success| G
    D -->|Failure| E[Manual Review Flag]

    style B fill:#008B8B,color:#fff
    style C fill:#4A90D9,color:#fff
    style D fill:#FF7F50,color:#fff
    style E fill:#FF0000,color:#fff
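The ordered fallback in the diagram can be sketched as a loop over extractor coroutines. This is a simplified illustration under assumptions: `run_ocr_chain` is a hypothetical name, extractors return plain text rather than an OCRResult, and any exception (including a low-quality signal) is treated as a reason to try the next method.

```python
from typing import Awaitable, Callable, Optional, Sequence

# An extractor takes a document path and returns the extracted text.
Extractor = Callable[[str], Awaitable[str]]


async def run_ocr_chain(document_path: str, extractors: Sequence[Extractor]) -> Optional[str]:
    """Try each extractor in order; return the first result, or None if all fail."""
    for extract in extractors:
        try:
            return await extract(document_path)
        except Exception:
            # Failure or low quality: move on to the next method in the chain.
            continue
    # Every method failed; the caller would flag the document for manual review.
    return None
```

A production version would likely log each failure and catch narrower exception types.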

Method Comparison¶

Method	Type	Speed	Cost	Best For
PyMuPDF	Synchronous, local	~1 second	$0	Text-based PDFs, well-formatted documents
Unstructured.io	API call	~5 seconds	Free tier (1K pages/mo)	Complex layouts, tables, multi-column
Claude Vision	LLM API call	~10 seconds	~$0.02/page	Scanned PDFs, handwritten notes, poor quality images

PyMuPDF (Primary)¶

import fitz  # PyMuPDF

async def extract_with_pymupdf(document_path: str) -> OCRResult:
    """Primary OCR: fast, free, synchronous text extraction."""
    pages = []
    # Open the document as a context manager so the file handle is always closed.
    with fitz.open(document_path) as doc:
        for page in doc:
            text = page.get_text("text")
            pages.append(PageResult(
                page_number=page.number + 1,
                text=text,
                confidence=estimate_text_quality(text),
            ))

    # A zero-page document has nothing to average; treat it as low quality
    # so the caller falls through to the next OCR method.
    if not pages:
        raise LowQualityError("PyMuPDF found no pages to extract")

    overall_confidence = sum(p.confidence for p in pages) / len(pages)

    if overall_confidence < 0.6:
        raise LowQualityError(f"PyMuPDF confidence {overall_confidence:.2f} below threshold")

    return OCRResult(
        method="pymupdf",
        pages=pages,
        confidence=overall_confidence,
    )

Unstructured.io (Fallback 1)¶

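import asyncio

from unstructured.partition.auto import partition

async def extract_with_unstructured(document_path: str) -> OCRResult:
    """Fallback 1: Unstructured.io for complex layouts."""
    # partition() is a blocking call, so run it in a worker thread
    # to keep the event loop free.
    elements = await asyncio.to_thread(partition, filename=document_path)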
    text = "\n".join(str(el) for el in elements)

    return OCRResult(
        method="unstructured",
        pages=[PageResult(page_number=1, text=text, confidence=0.8)],
        confidence=0.8,
    )

Claude Vision (Fallback 2)¶

async def extract_with_claude_vision(document_path: str) -> OCRResult:
    """Fallback 2: Claude Vision for scanned/handwritten documents."""
    images = convert_pdf_to_images(document_path)

    pages = []
    for i, image in enumerate(images):
        response = await anthropic_client.messages.create(
            model="claude-sonnet-4-6-20250514",
            messages=[{
                "role": "user",
                "content": [
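                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            # Base64 image sources require a media_type; this assumes
                            # convert_pdf_to_images returns base64-encoded PNG pages.
                            "media_type": "image/png",
                            "data": image,
                        },
                    },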
                    {"type": "text", "text": VISION_OCR_PROMPT},
                ],
            }],
            max_tokens=4000,
        )
        pages.append(PageResult(
            page_number=i + 1,
            text=response.content[0].text,
            confidence=0.85,
        ))

    return OCRResult(method="claude_vision", pages=pages, confidence=0.85)

Claude Vision Cost

Claude Vision is the most expensive OCR method (~$0.02/page). It is only invoked when both PyMuPDF and Unstructured.io fail, which typically means the document is a scanned image or contains handwritten notes.

Clinical Context Agent Pipeline¶

After OCR extraction, the Clinical Context Agent processes the text through a 4-node LangGraph workflow (see Agent System for full details):

graph LR
    A[extract_clinical_entities] --> B[map_to_medical_codes]
    B --> C[generate_fhir_resources]
    C --> D[store_resources]

    style A fill:#008B8B,color:#fff
    style B fill:#008B8B,color:#fff
    style C fill:#008B8B,color:#fff
    style D fill:#008B8B,color:#fff

Entity Extraction Output¶

class ExtractedEntities(BaseModel):
    """Clinical entities extracted from a medical document."""
    conditions: list[ConditionEntity]       # Diagnoses, findings
    procedures: list[ProcedureEntity]       # Past/recommended procedures
    medications: list[MedicationEntity]     # Current medications
    lab_results: list[LabResult]            # Lab values with ranges
    vitals: list[VitalSign]                 # Blood pressure, heart rate, etc.
    allergies: list[AllergyEntity]
    imaging_findings: list[ImagingFinding]
    physician_notes: list[str]              # Free-text clinical notes
    document_date: Optional[date]
    patient_name_in_doc: Optional[str]      # For cross-reference validation
    laterality: Optional[str]               # "left", "right", "bilateral"

Embedding-Based Document Matching¶

Purpose¶

When a patient uploads a document, the system must determine which procedure requirements it satisfies. This is done through embedding-based matching against the requirement_embeddings Qdrant collection.

Flow¶

graph TD
    A[Extracted Document Text] --> B[Voyage AI Embedding]
    B --> C[Qdrant Cosine Similarity Search]
    C --> D{Confidence >= 0.85?}
    D -->|Yes| E[Auto-Match to Requirement]
    D -->|No| F{Confidence >= 0.60?}
    F -->|Yes| G[LLM Re-Ranker Verification]
    F -->|No| H[Unmatched - Flag for Review]
    G -->|Confirmed| E
    G -->|Rejected| H

    style D fill:#008B8B,color:#fff
    style F fill:#008B8B,color:#fff
    style G fill:#FF7F50,color:#fff
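The routing in the diagram above can be expressed as a small function. This is a sketch for illustration; `route_match` and the returned labels are hypothetical names, while the two cutoffs come from the diagram.

```python
AUTO_MATCH_THRESHOLD = 0.85  # at or above: auto-match
RERANK_THRESHOLD = 0.60      # at or above (but below 0.85): LLM re-ranker


def route_match(similarity: float) -> str:
    """Map a cosine similarity score to the next matching action."""
    if similarity >= AUTO_MATCH_THRESHOLD:
        return "auto_match"
    if similarity >= RERANK_THRESHOLD:
        return "rerank"
    return "unmatched"
```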

Matching Thresholds¶

Cosine Similarity	Action	Cost
>= 0.85	Auto-match (high confidence)	$0
0.60 - 0.84	LLM re-ranker verification	~$0.005
< 0.60	Unmatched, flag for manual review	$0

Re-Ranker Prompt¶

RERANKER_PROMPT = """
You are a medical document classifier. Given an extracted document summary
and a list of candidate requirements, determine which requirement (if any)
the document satisfies.

Document summary: {document_summary}

Candidate requirements:
{candidates}

For each candidate, respond with:
- requirement_id: The ID of the requirement
- confidence: 0.0 to 1.0
- reasoning: Brief explanation

Only return matches with confidence >= 0.70.
"""

Requirement Embeddings (70 Vectors)¶

The 70 requirement embedding vectors cover document types across all 12 supported procedures:

Category	Count	Examples
Blood Work	15	CBC, metabolic panel, coagulation, HbA1c, thyroid
Imaging	18	X-ray, MRI, CT, echocardiogram, angiogram, OPG
Cardiac Tests	8	ECG, stress test, echo, Holter monitor
Clearances	10	Cardiac clearance, dental clearance, anesthesia eval
Specialist Reports	12	Orthopedic assessment, oncology report, pulmonary function
Other	7	Vaccination records, insurance pre-auth, travel fitness certificate

Parameter-Level Document Matching¶

Purpose¶

Embedding-based matching works well for document-level classification ("this is a blood work report"), but cannot verify whether a specific lab panel's individual parameters are present. A blood work report that contains Hemoglobin, RBC, WBC, and Platelets should match "Complete Blood Count" even if the document filename or category doesn't mention CBC.

Parameter matching solves this by cross-referencing the extracted observations (individual lab values from the Clinical Context Agent) against defined parameter sets for each procedure requirement.

How It Works¶

Document observations (from Clinical Context Agent)
    → Extract parameter names (e.g., "Hemoglobin", "WBC", "Creatinine")
    → Compare against _PARAM_SETS for each requirement
    → If ≥50% of expected parameters found → match confirmed

Implementation: app/routers/cases.py:_match_doc_by_parameters() (line 1072)

13 Parameter Sets (~80 Parameters)¶

Requirement	Expected Parameters	Min Match
Complete Blood Count (CBC)	hemoglobin, rbc, wbc, hematocrit, platelets, mcv, mch	4/7 (≥50%)
Basic Metabolic Panel	glucose, calcium, sodium, potassium, chloride, bicarbonate, bun, creatinine	4/8
Coagulation Panel	pt, inr, aptt	2/3
HbA1c	hba1c, glycated hemoglobin, hemoglobin a1c	1/3
Urinalysis	ph, specific gravity, protein, glucose, blood, leukocyte	3/6
Blood Type	blood group, blood type, abo, rh factor	2/4
Lipid Panel	cholesterol, triglyceride, hdl, ldl	2/4
Liver Function	alt, ast, bilirubin, albumin, alkaline phosphatase	3/5
Kidney Function	creatinine, bun, egfr, urea	2/4
Thyroid	tsh, t3, t4, free t4	2/4

Plus 3 alias sets (cbc, metabolic, coagulation) that map to the same parameters.
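A minimal sketch of the coverage rule, assuming exact matching on lower-cased parameter names. The function name `match_by_parameters` and the two-entry `PARAM_SETS` dictionary are illustrative stand-ins for the real `_PARAM_SETS`, which is not reproduced here.

```python
import math

# Two sets copied from the table above for illustration only.
PARAM_SETS = {
    "Complete Blood Count (CBC)": {
        "hemoglobin", "rbc", "wbc", "hematocrit", "platelets", "mcv", "mch",
    },
    "Coagulation Panel": {"pt", "inr", "aptt"},
}


def match_by_parameters(observed_names, param_sets=PARAM_SETS, min_coverage=0.5):
    """Return requirements whose expected parameters are at least min_coverage present."""
    observed = {name.strip().lower() for name in observed_names}
    matches = []
    for requirement, expected in param_sets.items():
        # Round up so that 50% of 7 parameters means 4, as in the table.
        needed = math.ceil(len(expected) * min_coverage)
        if len(observed & expected) >= needed:
            matches.append(requirement)
    return matches
```

For example, `match_by_parameters(["Hemoglobin", "RBC", "WBC", "Platelets"])` matches the CBC set with 4 of 7 parameters. Sets that consist of aliases for one value, such as HbA1c (1/3 in the table), would need a per-set minimum rather than this uniform rule.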

Matching Priority¶

The document checklist uses a three-tier matching strategy in priority order:

1. Category + filename match: exact document category or filename keywords
2. Parameter-level match: extracted observations against parameter sets (≥50% coverage)
3. Embedding match: Voyage AI embedding similarity via Qdrant (≥0.5 threshold)

Each document can only satisfy one requirement (tracked by claimed_ids set to prevent double-counting).
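One way to picture the tiered strategy with a `claimed_ids` set is sketched below. The function name `build_checklist`, the dictionary shapes, and the tier-by-tier iteration order are assumptions made for illustration; the real checklist code may iterate differently.

```python
from typing import Callable, Dict, List, Optional

# A matcher answers: does this document satisfy this requirement?
Matcher = Callable[[dict, dict], bool]


def build_checklist(
    requirements: List[dict],
    documents: List[dict],
    matchers: List[Matcher],
) -> Dict[str, Optional[str]]:
    """Map requirement id to document id, trying matchers in priority order."""
    claimed_ids = set()  # documents already assigned to a requirement
    result: Dict[str, Optional[str]] = {req["id"]: None for req in requirements}
    for matcher in matchers:  # higher-priority tiers claim documents first
        for req in requirements:
            if result[req["id"]] is not None:
                continue  # requirement already satisfied by an earlier tier
            for doc in documents:
                if doc["id"] not in claimed_ids and matcher(req, doc):
                    result[req["id"]] = doc["id"]
                    claimed_ids.add(doc["id"])
                    break
    return result
```

Because a claimed document is skipped in later tiers, a single report cannot be counted toward two requirements.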

Validation Keywords (12 Sets, ~60 Keywords)¶

A separate keyword system in app/services/document_validator.py validates whether uploaded documents cover required tests. This is used by the smart document validator (Session 31) to check test validity, source acceptance, and on-site requirements.

Test	Keywords
Complete Blood Count	cbc, hemoglobin, wbc, rbc, platelet, hematocrit
Basic Metabolic Panel	glucose, sodium, potassium, creatinine, bun, calcium, metabolic
Coagulation Panel	pt, inr, aptt, ptt, coagulation, prothrombin
HbA1c	hba1c, glycated, a1c, hemoglobin a1c
Chest X-ray	chest, x-ray, xray, cxr, thorax
ECG	ecg, ekg, electrocardiogram, cardiac rhythm
Knee X-ray	knee, x-ray, xray, radiograph
MRI	mri, magnetic resonance
Urinalysis	urine, urinalysis
Thyroid Function	tsh, t3, t4, thyroid
Lipid Panel	cholesterol, ldl, hdl, triglyceride, lipid
Blood Type	blood type, blood group, abo, rh factor

Combined Coverage¶

Between parameter sets (80+ parameters) and validation keywords (60+ keywords), the system covers ~143 distinct medical parameters across lab work, imaging, and cardiac tests. This enables accurate document-to-requirement matching even when document metadata is sparse or the Clinical Context Agent extracts individual values without labeling the parent panel.

Document Validation¶

Smart Document Validation¶

The document pipeline runs seven automated validators after extraction. Results are stored on the document record and surfaced conversationally through the orchestrator — error-severity issues block workflow advancement until the patient clarifies.

class DocumentValidator:
    """Validates extracted document data for anomalies."""

    async def validate(
        self,
        document: Document,
        extracted: ExtractedEntities,
        case: Case,
    ) -> list[ValidationIssue]:
        issues = []
        issues.extend(self._check_expired_tests(extracted, case))
        issues.extend(self._check_onsite_required(extracted, case))
        issues.extend(self._check_source_acceptance(extracted, case))
        issues.extend(self._check_laterality_mismatch(extracted, case))
        issues.extend(self._check_patient_name_mismatch(extracted, case))
        issues.extend(self._check_ocr_quality(document))
        issues.extend(self._check_wrong_body_part(extracted, case))
        return issues

Validator	Description	Severity	Data Source
Expired Tests	Test result older than `validity_days` from Neo4j `REQUIRES_TEST` relationship (e.g., CBC >30 days old)	`warning`	Neo4j `validity_days` property
On-Site Required	Hospital requires the test to be performed on arrival (e.g., fresh blood work) — patient is informed but document still accepted	`info`	Neo4j `on_site_required` boolean
Source Acceptance	Lab/facility does not meet the provider's required accreditation level (JCI/NABH required but report from unaccredited lab)	`warning`	Neo4j `source_acceptance` enum
Laterality Mismatch	Document says "left knee" but case is for "right knee"	`error`	Extracted laterality vs. case laterality
Patient Name Mismatch	Name in document header does not match patient record	`warning`	Extracted `patient_name_in_doc` vs. patient profile
Poor OCR Quality	OCR confidence below 0.5 on any page — extracted data may be unreliable	`warning`	OCR confidence score
Wrong Body Part	Hip X-ray uploaded for a cardiac case. Systemic tests (blood work, lab panels, ECG, urinalysis) are exempt from this check	`error`	Extracted body part vs. case procedure
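As an illustration of the Expired Tests row, the date comparison might look like the sketch below. `is_expired` is a hypothetical helper, and the choice to skip undated documents is an assumption rather than documented behavior.

```python
from datetime import date, timedelta
from typing import Optional


def is_expired(document_date: Optional[date], validity_days: int, today: date) -> bool:
    """True when the test result is older than the validity window."""
    if document_date is None:
        # Assumption: undated documents are not flagged by this check.
        return False
    return today - document_date > timedelta(days=validity_days)
```

With a 30-day window, a result dated exactly 30 days before `today` is still accepted, and one dated 35 days before is flagged.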

Laterality Mismatch Example¶

def _check_laterality_mismatch(
    self,
    extracted: ExtractedEntities,
    case: Case,
) -> list[ValidationIssue]:
    """Detect left/right mismatch between document and case."""
    if not extracted.laterality or not case.laterality:
        return []

    if extracted.laterality != case.laterality:
        return [ValidationIssue(
            type="laterality_mismatch",
            severity="error",
            message=(
                f"Document indicates '{extracted.laterality}' side, "
                f"but case is for '{case.laterality}' side. "
                f"Please verify and upload the correct document."
            ),
        )]
    return []

Anomaly Detection Wired into Orchestrator

Validation issues are not just logged -- they are surfaced to the patient through the chat orchestrator. If an error-severity issue is detected, the orchestrator pauses the workflow and asks the patient to verify or re-upload the document.
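The blocking rule can be sketched as a small triage step. The name `triage` and the use of plain dictionaries for issues are assumptions for illustration; the severity values are the ones used in the validator table.

```python
from typing import List, Tuple

SEVERITY_ORDER = {"error": 0, "warning": 1, "info": 2}


def triage(issues: List[dict]) -> Tuple[bool, List[dict]]:
    """Return (workflow_blocked, issues sorted most severe first)."""
    ordered = sorted(issues, key=lambda issue: SEVERITY_ORDER[issue["severity"]])
    blocked = any(issue["severity"] == "error" for issue in ordered)
    return blocked, ordered
```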

Deduplication and Version Management¶

SHA-256 Deduplication¶

Every uploaded document has its SHA-256 hash computed client-side and verified server-side:

async def check_duplicate(
    tenant_id: str,
    case_id: UUID,
    sha256_hash: str,
) -> Optional[Document]:
    """Check if this exact document has already been uploaded."""
    existing = await db.fetch_one(
        """
        SELECT * FROM documents
        WHERE tenant_id = $1 AND case_id = $2 AND sha256_hash = $3
        AND status != 'deleted'
        """,
        tenant_id, case_id, sha256_hash,
    )
    return Document(**existing) if existing else None

If a duplicate is detected, the API returns the existing document instead of creating a new one.
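Server-side verification of the client-supplied hash could be done by streaming the stored object through `hashlib`. The helper names below are hypothetical, and how the chunks are read from R2 is outside the scope of this sketch.

```python
import hashlib
import hmac
from typing import Iterable


def sha256_of_chunks(chunks: Iterable[bytes]) -> str:
    """Hash a file incrementally so large uploads are never held fully in memory."""
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def hash_matches(client_hash: str, chunks: Iterable[bytes]) -> bool:
    """Compare the client-computed hex digest with the server-computed one."""
    return hmac.compare_digest(client_hash.lower(), sha256_of_chunks(chunks))
```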

Version Management (Supersede on Re-Upload)¶

When a patient uploads a newer version of a document type (e.g., updated blood work), the system supersedes the old version:

async def supersede_document(
    document_id: UUID,
    new_document_id: UUID,
    tenant_id: str,
):
    """Mark an existing document as superseded by a newer version."""
    await db.execute(
        """
        UPDATE documents
        SET status = 'superseded',
            superseded_by = $1,
            updated_at = NOW()
        WHERE id = $2 AND tenant_id = $3
        """,
        new_document_id, document_id, tenant_id,
    )

Document status lifecycle:

stateDiagram-v2
    [*] --> pending: Presign generated
    pending --> uploaded: Client confirms upload
    uploaded --> processing: QStash triggers OCR
    processing --> processed: OCR + extraction complete
    processing --> failed: OCR failure (all methods)
    processed --> matched: Requirement matched
    processed --> unmatched: No requirement match
    processed --> superseded: Newer version uploaded
    matched --> superseded: Newer version uploaded
    failed --> uploaded: Retry triggered
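The state diagram above can be encoded as a transition table to guard status updates. This is a sketch; `ALLOWED_TRANSITIONS` and `can_transition` are illustrative names, and treating `unmatched` and `superseded` as terminal follows from the diagram showing no outgoing edges for them.

```python
# Each key maps to the set of statuses it may move to, per the diagram.
ALLOWED_TRANSITIONS = {
    "pending": {"uploaded"},
    "uploaded": {"processing"},
    "processing": {"processed", "failed"},
    "processed": {"matched", "unmatched", "superseded"},
    "matched": {"superseded"},
    "failed": {"uploaded"},
    "unmatched": set(),
    "superseded": set(),
}


def can_transition(current: str, new: str) -> bool:
    """True when the diagram allows moving from current to new."""
    return new in ALLOWED_TRANSITIONS.get(current, set())
```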

Frontend Validation¶

The frontend performs pre-upload validation to catch obvious issues before the network request:

const UPLOAD_CONFIG = {
  allowedExtensions: [".pdf", ".jpg", ".jpeg", ".png", ".tiff"],
  minSizeBytes: 5 * 1024,        // 5 KB
  maxSizeBytes: 20 * 1024 * 1024, // 20 MB
};

function validateFile(file: File): ValidationResult {
  const ext = getExtension(file.name).toLowerCase();

  if (!UPLOAD_CONFIG.allowedExtensions.includes(ext)) {
    return { valid: false, error: `Unsupported file type: ${ext}` };
  }
  if (file.size < UPLOAD_CONFIG.minSizeBytes) {
    return { valid: false, error: "File is too small (minimum 5 KB)" };
  }
  if (file.size > UPLOAD_CONFIG.maxSizeBytes) {
    return { valid: false, error: "File is too large (maximum 20 MB)" };
  }
  return { valid: true };
}

Error Toast Pattern¶

Validation errors are shown as inline toast notifications in the upload area:

const handleUpload = async (file: File) => {
  const validation = validateFile(file);
  if (!validation.valid) {
    toast.error(validation.error, {
      position: "bottom-center",
      duration: 5000,
    });
    return;
  }
  // Proceed with presign + upload flow
};

Real-Time Upload Progress (SSE)¶

Patients see live processing status without polling. The GET /api/v1/patients/{id}/documents/stream endpoint streams Server-Sent Events as the 6-step pipeline advances.

6-Step Pipeline¶

Step	Enum	Label	Trigger
1	`upload_received`	Upload received	`confirm_upload` completes
2	`ocr_started`	Reading document	Before PyMuPDF / Unstructured.io / Vision
3	`ocr_complete`	Text extracted	OCR text saved to DB
4	`analysis_started`	Analyzing findings	Clinical Context Agent begins
5	`analysis_complete`	Analysis complete	FHIR resources saved
6	`matching_complete`	Ready	Requirement match + embedding match recorded

Each step is emitted by emit_progress() in document_processing.py, which pushes a ProgressEvent to the doc_progress:{patient_id} Redis channel. Progress emission also creates a Langfuse span for pipeline observability.

Event Schema¶

Every progress event carries this payload:

{
  "step": 3,
  "label": "Text extracted",
  "status": "complete",
  "timestamp": "2026-04-05T10:23:45Z",
  "detail": "PyMuPDF extracted 2,847 characters",
  "document_id": "a1b2c3d4-..."
}

status is one of: in_progress, complete, error.
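A sketch of how such a payload could be assembled and validated before publishing. The `ProgressEvent` dataclass fields mirror the schema above, but the helper `build_progress_event` and its validation rules are assumptions, not the actual `emit_progress()` implementation.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional

STEP_LABELS = {
    1: "Upload received",
    2: "Reading document",
    3: "Text extracted",
    4: "Analyzing findings",
    5: "Analysis complete",
    6: "Ready",
}
VALID_STATUSES = {"in_progress", "complete", "error"}


@dataclass
class ProgressEvent:
    step: int
    label: str
    status: str
    timestamp: str
    document_id: str
    detail: Optional[str] = None


def build_progress_event(
    step: int,
    status: str,
    document_id: str,
    detail: Optional[str] = None,
    now: Optional[datetime] = None,
) -> str:
    """Return the JSON payload for one pipeline step."""
    if step not in STEP_LABELS:
        raise ValueError(f"unknown step: {step}")
    if status not in VALID_STATUSES:
        raise ValueError(f"unknown status: {status}")
    # Normalize to UTC so the trailing "Z" is accurate.
    moment = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    event = ProgressEvent(
        step=step,
        label=STEP_LABELS[step],
        status=status,
        timestamp=moment.strftime("%Y-%m-%dT%H:%M:%SZ"),
        document_id=document_id,
        detail=detail,
    )
    return json.dumps(asdict(event))
```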

SSE Event Types¶

Event	When	Payload
`connected`	On open	`{documents: [...]}` — current document list
`progress`	Each pipeline step	`{step, label, status, timestamp, detail?, document_id}`
`document_update`	DB status change	`{document_id, ocr_status, analysis_status, extracted_data}`
`heartbeat`	Every 5s (no activity)	`{timestamp}` — keeps connection alive through proxies
`done`	All docs terminal	`{message: "All documents processed"}`
`timeout`	After 5 minutes	`{message: "Stream timeout, refresh to check status"}`
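On the wire, each of these events is a named Server-Sent Events frame. The helper below is a minimal sketch of that framing; `format_sse` is a hypothetical name, and it relies on `json.dumps` producing single-line output so one `data:` field is enough.

```python
import json


def format_sse(event: str, data: dict) -> str:
    """Serialize one named SSE frame; the blank line terminates the event."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"
```

A frame built this way is dispatched to the matching `addEventListener` handler on an `EventSource` client.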

Heartbeat + Error Propagation¶

Heartbeat: Fires every 5 seconds when no other event has been emitted. Prevents proxy/CDN connection drops.
Error propagation: If any pipeline step fails (OCR error, agent timeout, matching failure), the step emits status: "error" with a detail message. The stream always terminates with a final step 6 ("Ready") event — even on error — so the frontend can reliably hide the progress indicator.

Dual-Source Architecture¶

The stream interleaves two data sources:

Redis doc_progress:{patient_id} channel — fast path (300ms poll). emit_progress() in document_processing.py pushes to this channel at each pipeline stage.
DB polling — slow path (every 5s). Detects ocr_status / analysis_status changes that may not have fired a Redis event.

Redis errors are non-fatal — the stream degrades to DB-only polling.

// Frontend connection pattern
const es = new EventSource(
  `${baseUrl}/api/v1/patients/${patientId}/documents/stream?tenant_id=${tenantId}`
);
es.addEventListener('progress', (e) => {
  const { step, label, status } = JSON.parse(e.data);
  // Update step indicator UI
});
es.addEventListener('done', () => es.close());

Pre-Upload & Pre-Processing Validation¶

The following checks MUST run at the stage shown before a document advances past that stage. The Status column records whether each check is enforced today (by file_validator.py, document_validator.py, or the listed source of truth); every check listed below is currently enforced.

Check	Stage	Source of truth	Status
Filename / extension allow-list	Presign	`guardrails.yaml`	enforced
MIME type allow-list	Presign	`guardrails.yaml`	enforced
File size (min/max)	Presign	`guardrails.yaml`	enforced
Patient name match (header vs Clerk profile)	Post-OCR	`document_validator.py`	enforced
Document validity window (test date vs `validity_days` in Neo4j `REQUIRES_TEST`)	Post-extraction	Neo4j `REQUIRES_TEST.validity_days`	enforced — flagged as `expired` issue
Laterality / body-part match (left vs right knee, lab tests exempt)	Post-OCR	`document_validator.py`	enforced
Source acceptance (e.g., JCI/NABH-only)	Post-extraction	Neo4j `REQUIRES_TEST.source_acceptance`	enforced — informational
OCR quality / confidence floor	Post-OCR	`document_validator.py`	enforced
Duplicate / superseded version detection	Confirm	`document_service.py`	enforced
Cross-case reuse of analyzed records	Pre-analysis	`case_orchestrator._check_existing_records`	enforced

Rule: No document advances to Clinical Context Agent until the validity window, name, and body-part checks have produced at least an info-level result. Error-severity issues block workflow advancement and must be resolved conversationally.

Agent Conversation Requirements¶

The conversational agent has three obligations tied to this pipeline:

1. When requesting documents, the agent MUST surface the validity window from REQUIRES_TEST.validity_days in plain language (e.g., "a CBC from the last 30 days"). The document checklist card already renders validity badges; the agent's spoken response must match.
2. When a document fails validity / name / laterality checks, the agent must report the specific issue ("This CBC is 45 days old; we need one from the last 30") rather than a generic error.
3. At the end of the conversation, after consent and before forwarding records to providers, the agent MUST notify the patient about:
   - Tests that will be performed on-site at the receiving hospital (REQUIRES_TEST.on_site_required = true)
   - Tests that may be repeated on-site even if the patient already uploaded results, and why (e.g., "Bumrungrad repeats blood work on arrival per their protocol")
   - Source-acceptance restrictions that may force a repeat (e.g., non-JCI/NABH imaging will be redone)

This notification is rendered as a structured rich card (on_site_tests_notice) in addition to the LLM's prose, so the patient has a durable reference. Driven by Neo4j REQUIRES_TEST + per-provider REQUIRES_FOR_PROCEDURE overrides.

Pipeline Metrics¶

Metric	Target	Current (MVP)
OCR success rate (any method)	> 95%	98%
Average OCR time	< 10 seconds	3 seconds (PyMuPDF)
Clinical extraction accuracy	> 85%	88% (Claude Sonnet)
Requirement auto-match rate	> 60%	65%
False positive match rate	< 5%	3%
End-to-end pipeline time	< 30 seconds	12 seconds average