Document-to-Requirement Matching via Embeddings: Implementation Plan

Priority: High (demo-blocking quality issue)
Effort: 4-6 hours
Dependencies: Voyage AI (deployed), Qdrant (deployed)

Architecture

Upload Time (one-time per document):
  Document → OCR → Clinical Context Agent → Entities + Observations
                                          ↓
                                    Build clinical summary text
                                          ↓
                                    Embed via Voyage AI (1024 dims)
                                          ↓
                                    Store embedding in Qdrant collection "document_embeddings"
                                          ↓
                                    Query "requirement_embeddings" collection for top-K matches
                                          ↓
                                    LLM re-ranker confirms/rejects each match
                                          ↓
                                    Store matched_requirements on document metadata

Checklist Time (every panel refresh):
  Read document.extracted_data.matched_requirements → done
  Zero matching logic, zero keyword lists, zero maintenance

Step-by-Step Implementation

Step 1: Seed Requirement Embeddings (one-time)

Create a Qdrant collection requirement_embeddings with all procedure requirements embedded:

import uuid

# For each procedure in the procedure_requirements table:
for proc in all_procedures:
    for req in proc.required_documents:
        # Parentheses are needed so the adjacent f-strings join into one string
        text = (
            f"{req['name']}. {req.get('description', '')}. "
            f"Type: {req['type']}. "
            f"For procedure: {proc.procedure_name}."
        )

        embedding = voyage_embed(text)  # 1024 dims

        # Qdrant point IDs must be unsigned integers or UUIDs, so derive a
        # stable UUID from the natural key instead of using the raw string.
        natural_key = f"{proc.procedure_code}:{req['type']}:{req['name']}"

        qdrant.upsert("requirement_embeddings", {
            "id": str(uuid.uuid5(uuid.NAMESPACE_URL, natural_key)),
            "vector": embedding,
            "payload": {
                "procedure_code": proc.procedure_code,
                "procedure_name": proc.procedure_name,
                "requirement_name": req["name"],
                "requirement_type": req["type"],
                "mandatory": req.get("mandatory", False),
                "description": req.get("description", ""),
            }
        })

Example texts that get embedded:
- "Complete Blood Count (CBC). Full blood panel including WBC, RBC, hemoglobin, hematocrit, platelets. Type: blood_work. For procedure: Total Knee Replacement."
- "Panoramic X-Ray (OPG). Full mouth dental X-ray showing all teeth, jawbone, and TMJ. Type: dental. For procedure: Dental Implants Full Mouth."
- "MRI Knee. Magnetic resonance imaging of the knee joint showing cartilage, ligaments, and meniscus. Type: imaging. For procedure: Total Knee Replacement."
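The template behind these examples can live in one helper so that seeding and any later re-seed produce identical text. This is a minimal sketch: `build_requirement_text` is a hypothetical name, and dropping the description sentence when it is empty is an assumption (the seeding loop above would otherwise emit an empty ". ." segment).

```python
def build_requirement_text(req: dict, procedure_name: str) -> str:
    """Render one requirement as the sentence-style text that gets embedded."""
    parts = [req["name"]]
    description = (req.get("description") or "").strip()
    if description:
        # Strip a trailing period so the join below does not double it
        parts.append(description.rstrip("."))
    parts.append(f"Type: {req['type']}")
    parts.append(f"For procedure: {procedure_name}")
    return ". ".join(parts) + "."


example_text = build_requirement_text(
    {
        "name": "MRI Knee",
        "description": "Magnetic resonance imaging of the knee joint "
                       "showing cartilage, ligaments, and meniscus.",
        "type": "imaging",
    },
    "Total Knee Replacement",
)
```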

Collection size: ~100-200 vectors (10 procedures × 10-20 requirements each). Tiny — well within Qdrant free tier.

When to re-seed: When a new procedure is added or requirements change. Can be triggered via admin endpoint or QStash schedule.
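
One way to tell which requirements actually changed is to keep a hash of each embedded text and compare it on every seed run, so only new or edited requirements are re-embedded. This is a sketch under assumptions: `content_hash`, `plan_reseed`, and the idea of storing a hash per requirement are illustrative and not part of the plan above.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of the text that was embedded."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reseed(current_texts: dict, stored_hashes: dict):
    """Compare current requirement texts against previously stored hashes.

    Both dicts are keyed by a requirement key such as "code:type:name".
    Returns (keys_to_embed, keys_to_delete), each sorted for stable output.
    """
    to_embed = sorted(
        key for key, text in current_texts.items()
        if stored_hashes.get(key) != content_hash(text)
    )
    to_delete = sorted(key for key in stored_hashes if key not in current_texts)
    return to_embed, to_delete
```

Keys in `to_embed` cover both new and edited requirements; keys in `to_delete` are requirements that no longer exist and whose vectors could be removed.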

Step 2: Document Embedding at Upload Time

In the QStash OCR callback (internal.py), after Clinical Context Agent runs:

# Build clinical summary text from the document's analysis
entities = doc.extracted_data.get("extracted_entities", [])
observations = doc.extracted_data.get("observations", [])

# Skip unnamed entities so the summary contains no empty segments
entity_text = ". ".join(e["name"] for e in entities[:10] if e.get("name"))
obs_text = ". ".join(
    f"{o.get('parameter_name', '')}: {o.get('value', '')} {o.get('unit', '')}".strip()
    for o in observations[:20]
)

# Parentheses are needed so the adjacent f-strings join into one string
doc_summary = (
    f"Medical document: {doc.original_filename}. "
    f"Category: {doc.document_category}. "
    f"Conditions found: {entity_text}. "
    f"Lab values: {obs_text}."
)

# Embed
doc_embedding = voyage_embed(doc_summary)

# Store in Qdrant for future reference.
# Qdrant accepts only unsigned integers or UUIDs as point IDs; this assumes
# doc.id is one of those.
qdrant.upsert("document_embeddings", {
    "id": doc.id,
    "vector": doc_embedding,
    "payload": {
        "document_id": doc.id,
        "patient_id": doc.patient_id,
        "filename": doc.original_filename,
        "category": doc.document_category,
    }
})
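
Steps 1 and 2 both call `voyage_embed`, which this plan does not define. The sketch below shows one possible shape for it. It takes the client as an argument (unlike the one-argument calls above) so it can run offline; in production the client would presumably be a `voyageai.Client()`, whose `embed(texts, model=..., input_type=...)` call returns an object with an `embeddings` list. `FakeClient`, `EXPECTED_DIMS`, and the dimension check are assumptions added for illustration.

```python
from types import SimpleNamespace
from typing import Any, Optional, Sequence

EXPECTED_DIMS = 1024  # vector size both Qdrant collections are planned with


def voyage_embed(
    text: str,
    client: Any,
    model: str = "voyage-3.5-lite",
    input_type: Optional[str] = None,
) -> Sequence[float]:
    """Embed one text and check the vector size before it reaches Qdrant."""
    if not text.strip():
        raise ValueError("cannot embed empty text")
    result = client.embed([text], model=model, input_type=input_type)
    vector = result.embeddings[0]
    if len(vector) != EXPECTED_DIMS:
        raise ValueError(f"expected {EXPECTED_DIMS} dims, got {len(vector)}")
    return vector


class FakeClient:
    """Stand-in for the real embedding client so the sketch runs offline."""

    def embed(self, texts, model, input_type=None):
        return SimpleNamespace(embeddings=[[0.0] * EXPECTED_DIMS for _ in texts])


vector = voyage_embed("Medical document: cbc_report.pdf.", FakeClient())
```

Checking the dimension up front turns a model or configuration mismatch into an immediate error at embed time rather than a failed upsert later.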

Step 3: Match Document Against Requirements

Query the requirement_embeddings collection with the document embedding:

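from qdrant_client.models import FieldCondition, Filter, MatchValue

# Restrict the search to the patient's current procedure (if known).
# Filtering inside the query, rather than after it, keeps requirements from
# other procedures from using up the top-5 slots.
# Assumes `qdrant` is a qdrant_client.QdrantClient instance.
procedure_filter = None
if case.procedure_code:
    procedure_filter = Filter(must=[
        FieldCondition(key="procedure_code", match=MatchValue(value=case.procedure_code))
    ])

# Find the top-5 matching requirements
matches = qdrant.search(
    collection_name="requirement_embeddings",
    query_vector=doc_embedding,
    query_filter=procedure_filter,
    limit=5,
    score_threshold=0.65,  # minimum similarity
)

# Result (illustrative): [
#   {requirement_name: "CBC", score: 0.91, procedure_code: "D6010"},
#   {requirement_name: "HbA1c", score: 0.87, procedure_code: "D6010"},
# ]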
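
The order of filtering and top-K selection matters: if the top 5 are taken across all procedures first, requirements from other procedures can fill those slots and leave nothing for the patient's own procedure. The toy example below illustrates this with plain cosine similarity. It is a sketch only: the vectors, the "TKR" procedure code, and the helper names are made up, and cosine is assumed as the similarity metric.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query, points, k, threshold, procedure_code=None):
    """points: list of (procedure_code, requirement_name, vector) tuples.

    When procedure_code is given, candidates are filtered BEFORE ranking.
    Returns up to k (score, procedure_code, requirement_name) tuples.
    """
    candidates = [
        p for p in points if procedure_code is None or p[0] == procedure_code
    ]
    scored = [(cosine(query, vec), code, name) for code, name, vec in candidates]
    scored = [s for s in scored if s[0] >= threshold]
    scored.sort(key=lambda s: s[0], reverse=True)
    return scored[:k]


points = [
    ("TKR", "MRI Knee", [1.0, 0.0]),
    ("TKR", "Knee X-Ray", [0.9, 0.1]),
    ("D6010", "CBC", [0.7, 0.7]),
]
query = [1.0, 0.05]

# Rank first, filter afterwards: both slots go to TKR, so nothing survives
filtered_after = [s for s in top_k(query, points, 2, 0.65) if s[1] == "D6010"]

# Filter first, then rank: the D6010 requirement is found
filtered_before = top_k(query, points, 2, 0.65, procedure_code="D6010")
```

Here `filtered_after` is empty while `filtered_before` contains the CBC requirement, even though both runs use the same threshold and K.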

Step 4: LLM Re-Ranker (Optional, Recommended)

The vector search may return false positives (for example, a dental consultation note matching "Dental CT Scan" because both mention "dental"). A quick LLM confirmation pass filters these out:

# Only for matches with score 0.65-0.85 (uncertain range)
# High-confidence matches (>0.85) skip re-ranking
HIGH_CONFIDENCE = 0.85
uncertain_matches = [m for m in matches if m.score <= HIGH_CONFIDENCE]

if uncertain_matches:
    requirement_lines = "\n".join(
        f"- {m.payload['requirement_name']}: {m.payload['description']}"
        for m in uncertain_matches
    )
    confirmation = await haiku_classify(
        f"Document content summary: {doc_summary[:500]}\n\n"
        f"Does this document satisfy the following requirements?\n"
        f"{requirement_lines}\n\n"
        'Return JSON: {"confirmed": ["req_name_1"], "rejected": ["req_name_2"]}'
    )

Cost: $0.001 per uncertain document. High-confidence matches (>0.85 similarity) skip this step entirely.
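
The plan does not show how the re-ranker's reply turns into the `final_matches` used in Step 5. A minimal sketch of that glue follows. It assumes `haiku_classify` returns the raw JSON string requested in the prompt, and it drops uncertain matches when the reply cannot be parsed; `Match` and `select_final_matches` are hypothetical names.

```python
import json
from dataclasses import dataclass

HIGH_CONFIDENCE = 0.85


@dataclass(frozen=True)
class Match:
    """Simplified stand-in for a Qdrant search hit."""
    requirement_name: str
    score: float


def select_final_matches(matches, llm_reply: str):
    """Keep high-confidence matches plus uncertain ones the LLM confirmed.

    Returns (final_matches, confirmed_names).
    """
    try:
        confirmed_names = set(json.loads(llm_reply).get("confirmed", []))
    except (json.JSONDecodeError, AttributeError, TypeError):
        # Malformed reply: be conservative and confirm nothing
        confirmed_names = set()
    final_matches = [
        m for m in matches
        if m.score > HIGH_CONFIDENCE or m.requirement_name in confirmed_names
    ]
    return final_matches, confirmed_names


matches = [Match("CBC", 0.91), Match("HbA1c", 0.72), Match("Dental CT Scan", 0.68)]
reply = '{"confirmed": ["HbA1c"], "rejected": ["Dental CT Scan"]}'
final_matches, confirmed_names = select_final_matches(matches, reply)
```

Treating an unparseable reply as "nothing confirmed" favors a missing checklist tick over a wrong one; the opposite default is equally easy to implement if false negatives are the bigger concern.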

Step 5: Store Results on Document

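# final_matches: high-confidence matches plus the uncertain matches the
# re-ranker confirmed. confirmed_names: the requirement names the re-ranker
# returned under "confirmed" in Step 4.
matched_requirements = [
    {
        "requirement_name": m.payload["requirement_name"],
        "requirement_type": m.payload["requirement_type"],
        "procedure_code": m.payload["procedure_code"],
        "confidence": m.score,
        # True only when the re-ranker reviewed and accepted the match;
        # matches above 0.85 skip the re-ranker and are accepted on score alone.
        "confirmed_by_llm": m.payload["requirement_name"] in confirmed_names,
    }
    for m in final_matches
]

# Reassign instead of mutating in place, so ORMs that detect changes only on
# attribute assignment (for example a plain JSON column) persist the update.
doc.extracted_data = {**doc.extracted_data, "matched_requirements": matched_requirements}
await db.commit()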

Step 6: Update Checklist Endpoint

Replace the entire _match_doc_to_requirement function:

# In the document-checklist endpoint:
claimed_ids = set()  # each uploaded document satisfies at most one requirement
for req in all_required:
    matched_doc = None
    for doc in uploaded_docs:
        if doc["id"] in claimed_ids:
            continue
        matched_reqs = (doc.get("extracted_data") or {}).get("matched_requirements", [])
        # Match on name and type together; matching on type alone would let
        # any blood_work document satisfy every blood_work requirement.
        if any(
            mr["requirement_name"] == req["name"] and mr["requirement_type"] == req["type"]
            for mr in matched_reqs
        ):
            matched_doc = doc
            claimed_ids.add(doc["id"])
            break
    # ... build checklist entry from matched_doc

Zero keyword lists. Zero content matching. Just read pre-computed tags.
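
For reference, here is a self-contained version of the lookup that can be run against sample data. It is a sketch: `build_checklist` and the shape of each checklist entry are hypothetical, and matching on both requirement name and type is an assumption about how strict the lookup should be.

```python
def build_checklist(all_required, uploaded_docs):
    """Build checklist entries purely from pre-computed matched_requirements."""
    claimed_ids = set()  # a document is assigned to at most one requirement
    checklist = []
    for req in all_required:
        matched_doc = None
        for doc in uploaded_docs:
            if doc["id"] in claimed_ids:
                continue
            tags = (doc.get("extracted_data") or {}).get("matched_requirements", [])
            if any(
                t["requirement_name"] == req["name"]
                and t["requirement_type"] == req["type"]
                for t in tags
            ):
                matched_doc = doc
                claimed_ids.add(doc["id"])
                break
        checklist.append({
            "requirement": req["name"],
            "mandatory": req.get("mandatory", False),
            "status": "uploaded" if matched_doc else "missing",
            "document_id": matched_doc["id"] if matched_doc else None,
        })
    return checklist


required = [
    {"name": "CBC", "type": "blood_work", "mandatory": True},
    {"name": "MRI Knee", "type": "imaging", "mandatory": True},
]
docs = [
    {"id": "doc-1", "extracted_data": {"matched_requirements": [
        {"requirement_name": "CBC", "requirement_type": "blood_work"},
    ]}},
    {"id": "doc-2", "extracted_data": None},  # not yet processed
]
checklist = build_checklist(required, docs)
```

Because of `claimed_ids`, a single lab report tagged with both CBC and HbA1c would tick only one of them; whether one document should be allowed to satisfy several requirements is a design decision this sketch leaves as in the endpoint code.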

Cost Analysis

| Component | Per Document | Per Month (100 cases, 3 docs each) |
| --- | --- | --- |
| Voyage embedding (document) | $0.0001 | $0.03 |
| Qdrant search (5 results) | $0 (free tier) | $0 |
| LLM re-ranker (uncertain only, ~30%) | $0.001 × 0.3 = $0.0003 | $0.09 |
| Total | $0.0004 | $0.12 |

Negligible: about $0.0004 per document, well under a tenth of a cent.
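
The table's arithmetic can be reproduced in a few lines, which makes it easy to re-run with different volumes. The prices and the 30% uncertain share are the plan's own estimates; the function name is illustrative.

```python
from decimal import Decimal

EMBED_COST = Decimal("0.0001")    # Voyage embedding, per document
RERANK_COST = Decimal("0.001")    # LLM re-ranker, per uncertain document
UNCERTAIN_SHARE = Decimal("0.3")  # estimated share of documents needing re-ranking


def estimate_cost(cases: int, docs_per_case: int):
    """Return (cost per document, cost per month) in dollars."""
    per_doc = EMBED_COST + RERANK_COST * UNCERTAIN_SHARE
    return per_doc, per_doc * cases * docs_per_case


per_doc, per_month = estimate_cost(cases=100, docs_per_case=3)
print(f"per document: ${per_doc}, per month: ${per_month:.2f}")
```

`Decimal` is used instead of floats so that the small per-document amounts add up exactly.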

Qdrant Collection Schema

Collection: requirement_embeddings
  - Vector: 1024 dims (Voyage AI voyage-3.5-lite)
  - Payload: procedure_code, procedure_name, requirement_name, requirement_type, mandatory, description
  - Size: ~100-200 vectors
  - Index: HNSW (default)

Collection: document_embeddings (optional, for future cross-case reuse)
  - Vector: 1024 dims
  - Payload: document_id, patient_id, filename, category
  - Size: grows with uploads
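
As a sketch, the two collections could be described with the same vector settings, written here in the shape of the JSON body that Qdrant's create-collection REST endpoint accepts. The `Cosine` distance is an assumption (the plan only mentions a similarity score), and no index settings are given because HNSW is the default.

```python
import json

VECTOR_SIZE = 1024  # must match the embedding model's output dimension


def collection_config(size: int = VECTOR_SIZE, distance: str = "Cosine") -> dict:
    """Vector parameters for one collection; distance metric is assumed."""
    return {"vectors": {"size": size, "distance": distance}}


COLLECTIONS = {
    name: collection_config()
    for name in ("requirement_embeddings", "document_embeddings")
}

print(json.dumps(COLLECTIONS, indent=2))
```

With the Python client, the equivalent vector settings would be `VectorParams(size=1024, distance=Distance.COSINE)` from `qdrant_client.models`.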

Migration Path

1. Seed requirement embeddings: one-time script, run via admin endpoint
2. Add embedding step to QStash OCR callback: embeds + matches after CCA
3. Update checklist endpoint: read matched_requirements from document metadata
4. Remove _match_doc_to_requirement: no longer needed
5. Backfill existing documents: one-time script to embed + match all existing docs
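
For the backfill step, the selection logic can be kept separate from the embedding and matching calls. The sketch below picks documents that have never been matched and groups them into batches. It assumes documents are available as dicts with an `extracted_data` field; the function names and the batch size are illustrative.

```python
from typing import Iterable, Iterator, List


def needs_backfill(doc: dict) -> bool:
    """True when the document has never been through embedding-based matching."""
    extracted = doc.get("extracted_data") or {}
    return "matched_requirements" not in extracted


def backfill_batches(docs: Iterable[dict], batch_size: int = 50) -> Iterator[List[dict]]:
    """Yield lists of at most batch_size documents that still need matching."""
    batch: List[dict] = []
    for doc in docs:
        if not needs_backfill(doc):
            continue
        batch.append(doc)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


sample_docs = [
    {"id": 1, "extracted_data": {}},
    {"id": 2, "extracted_data": {"matched_requirements": []}},  # already processed
    {"id": 3, "extracted_data": None},
    {"id": 4},
]
batches = [[d["id"] for d in b] for b in backfill_batches(sample_docs, batch_size=2)]
```

A document whose `matched_requirements` is an empty list counts as processed, so the script can be re-run safely after an interruption without redoing finished documents.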

When New Procedures Are Added

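1. Admin adds the procedure to the procedure_requirements table.
2. Run the seed script for that procedure; its requirement embeddings are created in Qdrant.
3. Done. No code changes. New uploads match against the new requirements automatically. Existing documents pick them up only after a backfill re-runs matching, because matches are computed at upload time.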

Files to Change

| File | Change |
| --- | --- |
| `app/integrations/embedding_service.py` | Add `embed_text()` wrapper (may already exist) |
| `app/integrations/qdrant_service.py` | Add `search_requirements()` method |
| `app/routers/internal.py` | Add embedding + matching step after CCA in OCR callback |
| `app/routers/cases.py` | Simplify checklist to read `matched_requirements` |
| `scripts/seed_requirement_embeddings.py` | New: one-time seeding script |
| `config/feature_flags.yaml` | Add `embedding_document_matching` flag |
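
The `embedding_document_matching` flag lets the checklist fall back to the old matcher during rollout. A minimal sketch of that switch follows, assuming the flags have already been loaded into a dict; the function names and the two matcher callables are placeholders.

```python
from typing import Callable, Mapping


def pick_matcher(
    flags: Mapping[str, bool],
    legacy_matcher: Callable,
    embedding_matcher: Callable,
) -> Callable:
    """Choose the checklist matcher; default to legacy when the flag is absent."""
    if flags.get("embedding_document_matching", False):
        return embedding_matcher
    return legacy_matcher


def legacy(doc, req):       # placeholder for the keyword-based matcher
    return "legacy"


def precomputed(doc, req):  # placeholder for reading matched_requirements
    return "embedding"


matcher = pick_matcher({"embedding_document_matching": True}, legacy, precomputed)
```

Defaulting to the legacy path when the flag is missing keeps a misconfigured environment on the known behavior.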