Document-to-Requirement Matching via Embeddings: Implementation Plan
Priority: High (demo-blocking quality issue)
Effort: 4-6 hours
Dependencies: Voyage AI (deployed), Qdrant (deployed)
Architecture
Upload Time (one-time per document):
  Document → OCR → Clinical Context Agent → Entities + Observations
  ↓
  Build clinical summary text
  ↓
  Embed via Voyage AI (1024 dims)
  ↓
  Store embedding in Qdrant collection "document_embeddings"
  ↓
  Query "requirement_embeddings" collection for top-K matches
  ↓
  LLM re-ranker confirms/rejects each match
  ↓
  Store matched_requirements on document metadata
Checklist Time (every panel refresh):
  Read document.extracted_data.matched_requirements → done
  No matching logic and no keyword lists to maintain at read time
Step-by-Step Implementation
Step 1: Seed Requirement Embeddings (one-time)
Create a Qdrant collection requirement_embeddings with all procedure requirements embedded:
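import uuid
from qdrant_client.models import PointStruct

# For each procedure in the procedure_requirements table:
for proc in all_procedures:
    for req in proc.required_documents:
        # Parentheses are needed so the adjacent f-strings join into one string
        text = (
            f"{req['name']}. {req.get('description', '')}. "
            f"Type: {req['type']}. "
            f"For procedure: {proc.procedure_name}."
        )
        embedding = voyage_embed(text)  # 1024 dims
        # Qdrant point IDs must be unsigned integers or UUIDs, so derive a
        # deterministic UUID from the natural key instead of using the raw string
        key = f"{proc.procedure_code}:{req['type']}:{req['name']}"
        qdrant.upsert(
            collection_name="requirement_embeddings",
            points=[PointStruct(
                id=str(uuid.uuid5(uuid.NAMESPACE_URL, key)),
                vector=embedding,
                payload={
                    "procedure_code": proc.procedure_code,
                    "procedure_name": proc.procedure_name,
                    "requirement_name": req["name"],
                    "requirement_type": req["type"],
                    "mandatory": req.get("mandatory", False),
                    "description": req.get("description", ""),
                },
            )],
        )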
Example embedding inputs:
- "Complete Blood Count (CBC). Full blood panel including WBC, RBC, hemoglobin, hematocrit, platelets. Type: blood_work. For procedure: Total Knee Replacement."
- "Panoramic X-Ray (OPG). Full mouth dental X-ray showing all teeth, jawbone, and TMJ. Type: dental. For procedure: Dental Implants Full Mouth."
- "MRI Knee. Magnetic resonance imaging of the knee joint showing cartilage, ligaments, and meniscus. Type: imaging. For procedure: Total Knee Replacement."
Collection size: ~100-200 vectors (10 procedures × 10-20 requirements each). Tiny — well within Qdrant free tier.
When to re-seed: When a new procedure is added or requirements change. Can be triggered via admin endpoint or QStash schedule.
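One possible way to decide which requirements actually need re-embedding is to fingerprint the text that gets embedded and compare it with the fingerprint from the previous seed run. The sketch below is illustrative only: `requirement_text`, `fingerprint`, and `requirements_to_reseed` are hypothetical helpers, and keeping the previous fingerprints in a dict keyed by `procedure_code:type:name` is an assumed design that the plan does not specify.

```python
import hashlib


def requirement_text(proc_name: str, req: dict) -> str:
    """Build the text that is embedded for one requirement (same shape as Step 1)."""
    return (
        f"{req['name']}. {req.get('description', '')}. "
        f"Type: {req['type']}. "
        f"For procedure: {proc_name}."
    )


def fingerprint(text: str) -> str:
    """Stable SHA-256 hex digest of the embedded text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def requirements_to_reseed(proc_code, proc_name, reqs, previous):
    """Return the requirements whose text is new or changed since the last seed.

    `previous` maps "procedure_code:type:name" keys to the fingerprint saved
    during the last run (hypothetical storage, e.g. a field in the payload).
    """
    stale = []
    for req in reqs:
        key = f"{proc_code}:{req['type']}:{req['name']}"
        if previous.get(key) != fingerprint(requirement_text(proc_name, req)):
            stale.append(req)
    return stale
```

Only the requirements returned by `requirements_to_reseed` would need a fresh Voyage call, which keeps a scheduled re-seed cheap when nothing has changed.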
Step 2: Document Embedding at Upload Time
In the QStash OCR callback (internal.py), after Clinical Context Agent runs:
from qdrant_client.models import PointStruct

# Build clinical summary text from the document's analysis
entities = doc.extracted_data.get("extracted_entities", [])
observations = doc.extracted_data.get("observations", [])
entity_text = ". ".join(e.get("name", "") for e in entities[:10])
obs_text = ". ".join(
    f"{o.get('parameter_name', '')}: {o.get('value', '')} {o.get('unit', '')}"
    for o in observations[:20]
)
# Parentheses are needed so the adjacent f-strings join into one string
doc_summary = (
    f"Medical document: {doc.original_filename}. "
    f"Category: {doc.document_category}. "
    f"Conditions found: {entity_text}. "
    f"Lab values: {obs_text}."
)
# Embed
doc_embedding = voyage_embed(doc_summary)
# Store in Qdrant for future reference.
# Qdrant accepts only unsigned integers or UUIDs as point IDs;
# this assumes doc.id is a UUID.
qdrant.upsert(
    collection_name="document_embeddings",
    points=[PointStruct(
        id=str(doc.id),
        vector=doc_embedding,
        payload={
            "document_id": str(doc.id),
            "patient_id": doc.patient_id,
            "filename": doc.original_filename,
            "category": doc.document_category,
        },
    )],
)
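When entities or observations have blank fields, joining them directly leaves empty fragments (for example ". . ") in the text sent to the embedder. A minimal sketch of a summary builder that skips blanks follows; `build_doc_summary` and its parameters are hypothetical names, and the exact wording of the summary is an assumption modelled on the snippet above.

```python
def _clean(value) -> str:
    """Return a stripped string, treating None as empty."""
    return "" if value is None else str(value).strip()


def build_doc_summary(filename, category, entities, observations,
                      max_entities=10, max_observations=20):
    """Compose the text to embed, skipping blank entity names and observations."""
    names = [_clean(e.get("name")) for e in entities[:max_entities]]
    names = [n for n in names if n]

    labs = []
    for obs in observations[:max_observations]:
        param = _clean(obs.get("parameter_name"))
        if not param:
            continue  # an observation without a parameter name carries no signal
        reading = " ".join(
            part for part in (_clean(obs.get("value")), _clean(obs.get("unit"))) if part
        )
        labs.append(f"{param}: {reading}" if reading else param)

    sections = [f"Medical document: {filename}", f"Category: {category}"]
    if names:
        sections.append("Conditions found: " + ", ".join(names))
    if labs:
        sections.append("Lab values: " + ", ".join(labs))
    return ". ".join(sections) + "."
```

Sections with no content are omitted entirely, so a document with no extracted entities still produces a clean, short summary.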
Step 3: Match Document Against Requirements
Query the requirement_embeddings collection with the document embedding:
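from qdrant_client.models import FieldCondition, Filter, MatchValue

# Restrict the search to the patient's current procedure (if known) inside the
# query itself. Filtering after a global top-5 can discard every match when
# other procedures' requirements occupy the top-5 slots.
procedure_filter = None
if case.procedure_code:
    procedure_filter = Filter(
        must=[FieldCondition(key="procedure_code", match=MatchValue(value=case.procedure_code))]
    )
# Find the top-5 matching requirements
matches = qdrant.search(
    collection_name="requirement_embeddings",
    query_vector=doc_embedding,
    query_filter=procedure_filter,
    limit=5,
    score_threshold=0.65,  # minimum similarity
)
# Result (illustrative): [
#   {requirement_name: "CBC", score: 0.91, procedure_code: "D6010"},
#   {requirement_name: "HbA1c", score: 0.87, procedure_code: "D6010"},
# ]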
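To make the search semantics concrete, here is a small pure-Python sketch of a filtered top-K search with a score threshold. It assumes cosine similarity as the distance metric, which the plan does not state, and `cosine_similarity` and `top_k_matches` are hypothetical helpers standing in for what the vector database does internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_matches(query, candidates, k=5, score_threshold=0.65, procedure_code=None):
    """Return up to k (score, payload) pairs, best first.

    `candidates` is a list of (payload, vector) pairs. The procedure filter is
    applied before ranking, so other procedures cannot crowd out the top-k.
    """
    scored = []
    for payload, vector in candidates:
        if procedure_code and payload["procedure_code"] != procedure_code:
            continue
        score = cosine_similarity(query, vector)
        if score >= score_threshold:
            scored.append((score, payload))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]
```

Applying the filter before the ranking step is the design point: with a global top-5 followed by a filter, a document could end up with zero matches even though its own procedure has relevant requirements.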
Step 4: LLM Re-Ranker (Optional, Recommended)
Vector search can return false positives; for example, a dental consultation note may match "Dental CT Scan" only because both mention "dental". A quick LLM confirmation pass filters these out:
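import json
# Only matches scoring 0.65-0.85 (uncertain range) go to the LLM;
# high-confidence matches (>0.85) skip re-ranking
confident_matches = [m for m in matches if m.score > 0.85]
uncertain_matches = [m for m in matches if m.score <= 0.85]
confirmed_matches = []
if uncertain_matches:
    confirmation = await haiku_classify(
        f"Document content summary: {doc_summary[:500]}\n\n"
        f"Does this document satisfy the following requirements?\n"
        + "\n".join(f"- {m.payload['requirement_name']}: {m.payload['description']}" for m in uncertain_matches)
        + "\n\nReturn JSON: {\"confirmed\": [\"req_name_1\"], \"rejected\": [\"req_name_2\"]}"
    )
    # Assumes haiku_classify returns the model reply as a raw JSON string
    confirmed_names = set(json.loads(confirmation).get("confirmed", []))
    confirmed_matches = [m for m in uncertain_matches if m.payload["requirement_name"] in confirmed_names]
final_matches = confident_matches + confirmed_matches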
Cost: $0.001 per uncertain document. High-confidence matches (>0.85 similarity) skip this step entirely.
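LLM replies do not always come back as clean JSON, so the confirmation step benefits from defensive parsing. The following is a minimal sketch, assuming the reply is plain text that contains the JSON object requested in the prompt; `parse_confirmation` is a hypothetical helper, and treating malformed output as "nothing confirmed" is an assumed policy rather than one stated in the plan.

```python
import json
import re


def parse_confirmation(reply: str, allowed_names: set) -> set:
    """Extract confirmed requirement names from an LLM reply.

    Tolerates prose around the JSON object and ignores any name that was not
    in the list sent to the model. On malformed output it returns an empty
    set, so unverified matches are treated as rejected.
    """
    found = re.search(r"\{.*\}", reply, flags=re.DOTALL)
    if found is None:
        return set()
    try:
        data = json.loads(found.group(0))
    except json.JSONDecodeError:
        return set()
    confirmed = data.get("confirmed", []) if isinstance(data, dict) else []
    if not isinstance(confirmed, list):
        return set()
    return {
        name for name in confirmed
        if isinstance(name, str) and name in allowed_names
    }
```

Restricting the result to `allowed_names` guards against the model inventing or rephrasing requirement names, which would otherwise slip through a plain membership test later on.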
Step 5: Store Results on Document
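matched_requirements = [
    {
        "requirement_name": m.payload["requirement_name"],
        "requirement_type": m.payload["requirement_type"],
        "procedure_code": m.payload["procedure_code"],
        "confidence": m.score,
        # True only when the re-ranker actually confirmed the match;
        # high-confidence matches (>0.85) skip the LLM and stay False
        "confirmed_by_llm": m in confirmed_matches,
    }
    for m in final_matches
]
# Reassign the dict rather than mutating it in place: ORMs such as SQLAlchemy
# do not detect in-place changes to a plain JSON column
doc.extracted_data = {**doc.extracted_data, "matched_requirements": matched_requirements}
await db.commit()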
Step 6: Update Checklist Endpoint
Replace the entire _match_doc_to_requirement function:
# In the document-checklist endpoint:
claimed_ids = set()  # each document may satisfy at most one requirement
for req in all_required:
    matched_doc = None
    for doc in uploaded_docs:
        if doc["id"] in claimed_ids:
            continue
        matched_reqs = (doc.get("extracted_data") or {}).get("matched_requirements", [])
        # Note: the requirement_type clause lets any document of that type
        # satisfy the requirement; drop it if only exact names should count
        if any(
            mr["requirement_name"] == req["name"] or mr["requirement_type"] == req["type"]
            for mr in matched_reqs
        ):
            matched_doc = doc
            claimed_ids.add(doc["id"])
            break
    # ... build checklist entry from matched_doc
Zero keyword lists. Zero content matching. Just read pre-computed tags.
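The loop above can be wrapped into a small function and exercised with sample data. This is a sketch under assumptions: `build_checklist` and the shape of the returned entries (`requirement`, `satisfied`, `document_id`) are hypothetical, since the plan does not show how a checklist entry is built.

```python
def build_checklist(all_required, uploaded_docs):
    """Pair each requirement with at most one uploaded document using stored tags."""
    claimed_ids = set()
    checklist = []
    for req in all_required:
        matched_doc = None
        for doc in uploaded_docs:
            if doc["id"] in claimed_ids:
                continue  # a document satisfies at most one requirement
            tags = (doc.get("extracted_data") or {}).get("matched_requirements", [])
            if any(
                tag["requirement_name"] == req["name"]
                or tag["requirement_type"] == req["type"]
                for tag in tags
            ):
                matched_doc = doc
                claimed_ids.add(doc["id"])
                break
        checklist.append({
            "requirement": req["name"],
            "satisfied": matched_doc is not None,
            "document_id": matched_doc["id"] if matched_doc else None,
        })
    return checklist


required = [
    {"name": "CBC", "type": "blood_work"},
    {"name": "MRI Knee", "type": "imaging"},
    {"name": "Chest X-Ray", "type": "imaging"},
]
docs = [
    {"id": "d1", "extracted_data": {"matched_requirements": [
        {"requirement_name": "CBC", "requirement_type": "blood_work"}]}},
    {"id": "d2", "extracted_data": {"matched_requirements": [
        {"requirement_name": "MRI Knee", "requirement_type": "imaging"}]}},
    {"id": "d3", "extracted_data": None},  # not yet processed
]
checklist = build_checklist(required, docs)
```

With this sample, the first two requirements are paired with `d1` and `d2`, while "Chest X-Ray" stays unsatisfied because the only imaging document has already been claimed.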
Cost Analysis
| Component | Per Document | Per Month (100 cases, 3 docs each) |
|---|---|---|
| Voyage embedding (document) | $0.0001 | $0.03 |
| Qdrant search (5 results) | $0 (free tier) | $0 |
| LLM re-ranker (uncertain only, ~30%) | $0.001 × 0.3 = $0.0003 | $0.09 |
| Total | $0.0004 | $0.12 |
Negligible: about $0.0004 per document, well under one cent.
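The table's arithmetic can be reproduced with a few lines, which also makes it easy to plug in a different volume. The unit prices and the 30% uncertain share are the figures from the table; `monthly_cost` is a hypothetical helper.

```python
VOYAGE_COST_PER_DOC = 0.0001   # embedding one document summary
RERANK_COST_PER_CALL = 0.001   # one LLM confirmation call
UNCERTAIN_FRACTION = 0.30      # share of documents that need the re-ranker


def monthly_cost(cases: int, docs_per_case: int):
    """Return (cost per document, cost per month) in USD."""
    docs = cases * docs_per_case
    per_doc = VOYAGE_COST_PER_DOC + RERANK_COST_PER_CALL * UNCERTAIN_FRACTION
    return per_doc, per_doc * docs


per_doc, total = monthly_cost(cases=100, docs_per_case=3)
print(f"per document: ${per_doc:.4f}, per month: ${total:.2f}")
# prints: per document: $0.0004, per month: $0.12
```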
Qdrant Collection Schema
Collection: requirement_embeddings
- Vector: 1024 dims (Voyage AI voyage-3.5-lite)
- Payload: procedure_code, procedure_name, requirement_name, requirement_type, mandatory, description
- Size: ~100-200 vectors
- Index: HNSW (default)
Collection: document_embeddings (optional, for future cross-case reuse)
- Vector: 1024 dims
- Payload: document_id, patient_id, filename, category
- Size: grows with uploads
Migration Path
- Seed requirement embeddings: one-time script, run via admin endpoint
- Add embedding step to QStash OCR callback: embeds + matches after CCA
- Update checklist endpoint: read matched_requirements from document metadata
- Remove _match_doc_to_requirement: no longer needed
- Backfill existing documents: one-time script to embed + match all existing docs
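For the backfill step, one possible approach is to select documents that have never been matched and process them in small batches. The sketch below assumes documents are available as dicts with an `extracted_data` field; `needs_backfill` and `batched` are hypothetical helpers, and the batch size is arbitrary.

```python
from itertools import islice


def needs_backfill(doc: dict) -> bool:
    """True when the document has never been through the matching step.

    An empty matched_requirements list means "processed, nothing matched",
    so only the absence of the key triggers a backfill.
    """
    return "matched_requirements" not in (doc.get("extracted_data") or {})


def batched(items, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


docs = [
    {"id": 1, "extracted_data": {"matched_requirements": []}},
    {"id": 2, "extracted_data": {}},
    {"id": 3, "extracted_data": None},
    {"id": 4},
]
pending = [d["id"] for d in docs if needs_backfill(d)]
batches = list(batched(pending, 2))
```

Each batch would then go through the same embed, search, and re-rank steps as a fresh upload, so the backfill reuses the upload-time code path instead of duplicating it.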
When New Procedures Are Added
- Admin adds the procedure to the procedure_requirements table
- Run the seed script for that procedure → embeddings created in Qdrant
- Done. No code changes. New uploads match the new requirements automatically; existing documents pick them up after a backfill.
Files to Change
| File | Change |
|---|---|
| app/integrations/embedding_service.py | Add embed_text() wrapper (may already exist) |
| app/integrations/qdrant_service.py | Add search_requirements() method |
| app/routers/internal.py | Add embedding + matching step after CCA in OCR callback |
| app/routers/cases.py | Simplify checklist to read matched_requirements |
| scripts/seed_requirement_embeddings.py | New: one-time seeding script |
| config/feature_flags.yaml | Add embedding_document_matching flag |
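The embedding_document_matching flag could gate the new path so the legacy matcher remains available during rollout. This is only a sketch: it assumes the flags from config/feature_flags.yaml are already loaded into a dict, and `match_requirements`, `embedding_matcher`, and `legacy_matcher` are hypothetical names for the dispatch function and the two code paths.

```python
def match_requirements(doc, flags, embedding_matcher, legacy_matcher):
    """Use the embedding pipeline only when the feature flag is enabled.

    `flags` is the parsed feature-flag mapping; a missing flag counts as off,
    so the legacy matcher stays the default until the flag is set explicitly.
    """
    if flags.get("embedding_document_matching", False):
        return embedding_matcher(doc)
    return legacy_matcher(doc)


# Usage with stand-in matchers:
doc = {"id": "d1"}
new_path = match_requirements(
    doc, {"embedding_document_matching": True},
    embedding_matcher=lambda d: "embedding",
    legacy_matcher=lambda d: "legacy",
)
old_path = match_requirements(
    doc, {},
    embedding_matcher=lambda d: "embedding",
    legacy_matcher=lambda d: "legacy",
)
```

Defaulting to the legacy path when the flag is absent means a misconfigured environment falls back to current behaviour rather than to an unseeded Qdrant collection.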