ADR-0006: Records-First Intelligence

Status: Accepted | Date: 2026-03-26 | Session: 21

Context

Curaway matches patients to clinical trials and specialists based on their medical history. The matching algorithm needs input data to work with. Two approaches were considered:

Match on raw text. Feed the raw text of uploaded medical documents directly into an LLM or embedding model and match against provider descriptions.
Match on structured records. First extract structured data from documents (diagnoses as ICD-10 codes, procedures as CPT codes, observations as FHIR resources), then match on the structured representation.

The choice affects explainability, accuracy, auditability, and the ability to apply deterministic business rules (e.g., "this trial requires a confirmed HER2+ diagnosis").

Decision

Build structured EHR records first, conforming to FHIR R4 resource types, then apply matching logic on the structured data. An EHR completeness score gates matching: the system requires a completeness score above 50% before generating matches.
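
The completeness gate could be sketched roughly as follows. This ADR does not define how the completeness score is computed, so the sketch assumes it is the fraction of an expected set of FHIR R4 resource types that appear at least once in the patient's record. The names EXPECTED_RESOURCE_TYPES, completeness_score, and can_generate_matches are hypothetical and are not taken from the Curaway codebase.

```python
# Assumption: a "complete" record contains at least one resource of each of
# these FHIR R4 types. The real scoring inputs are not specified in this ADR.
EXPECTED_RESOURCE_TYPES = ("Condition", "Observation", "Procedure", "MedicationStatement")

# From the decision: matching requires a completeness score above 50%.
COMPLETENESS_THRESHOLD = 0.5


def completeness_score(resources: list[dict]) -> float:
    """Fraction of expected resource types present at least once."""
    present = {resource.get("resourceType") for resource in resources}
    found = sum(1 for rtype in EXPECTED_RESOURCE_TYPES if rtype in present)
    return found / len(EXPECTED_RESOURCE_TYPES)


def can_generate_matches(resources: list[dict]) -> bool:
    """Gate matching on a score strictly above the threshold."""
    return completeness_score(resources) > COMPLETENESS_THRESHOLD


# A record holding a single lab report: only one Observation resource.
one_lab_report = [{"resourceType": "Observation"}]

# A fuller record covering three of the four expected types.
fuller_record = [
    {"resourceType": "Condition"},
    {"resourceType": "Observation"},
    {"resourceType": "Procedure"},
]

print(completeness_score(one_lab_report), can_generate_matches(one_lab_report))
print(completeness_score(fuller_record), can_generate_matches(fuller_record))
```

Under these assumptions, a record with only one lab report scores 0.25 and is held back, while a record covering three of the four types scores 0.75 and proceeds to matching. A score of exactly 0.5 is also held back, because the decision requires a score above 50%.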

Rationale

Explainability. When a patient is matched to a trial, the system can explain why: "Matched because your record includes ICD-10 C50.9 (breast cancer) and observation HER2 status = positive." Raw text matching would be a black box.
Auditability. Healthcare applications require audit trails. Structured records with standardized codes (ICD-10, SNOMED CT, LOINC) provide a clear, reproducible basis for matching decisions.
Deterministic rules. Many clinical trial eligibility criteria are deterministic (age range, specific diagnosis code, lab value thresholds). Structured records enable rule-based filtering before semantic matching, reducing false positives.
Completeness gating. The 50% completeness threshold prevents premature matching on incomplete data. If a patient has uploaded only one lab report, the system prompts them to provide more records before generating matches.
Reusability. Structured EHR data is useful beyond matching: it powers the patient summary, the timeline view, and the data export for providers.
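
The explainability and deterministic-rule points can be illustrated together with a minimal sketch. It assumes a flattened view of the structured record (age, ICD-10 codes, and named observation values) and returns a human-readable reason for each rule it checks. PatientRecord, TrialCriteria, and check_eligibility are hypothetical names chosen for illustration, and the observation key "HER2 status" is a placeholder label, not a real LOINC code.

```python
from dataclasses import dataclass, field


@dataclass
class PatientRecord:
    """Hypothetical flattened view of a patient's structured record."""
    age: int
    diagnosis_codes: set[str]  # ICD-10 codes, e.g. {"C50.9"}
    observations: dict[str, str] = field(default_factory=dict)  # name -> value


@dataclass
class TrialCriteria:
    """Hypothetical deterministic eligibility criteria for one trial."""
    min_age: int
    max_age: int
    required_diagnosis: str
    required_observations: dict[str, str] = field(default_factory=dict)


def check_eligibility(patient: PatientRecord, trial: TrialCriteria) -> tuple[bool, list[str]]:
    """Return (eligible, reasons); each rule contributes one readable reason."""
    eligible = True
    reasons: list[str] = []

    if trial.min_age <= patient.age <= trial.max_age:
        reasons.append(f"age {patient.age} is within {trial.min_age}-{trial.max_age}")
    else:
        eligible = False
        reasons.append(f"age {patient.age} is outside {trial.min_age}-{trial.max_age}")

    if trial.required_diagnosis in patient.diagnosis_codes:
        reasons.append(f"record includes ICD-10 {trial.required_diagnosis}")
    else:
        eligible = False
        reasons.append(f"record lacks ICD-10 {trial.required_diagnosis}")

    for name, expected in trial.required_observations.items():
        actual = patient.observations.get(name)
        if actual == expected:
            reasons.append(f"observation {name} = {actual}")
        else:
            eligible = False
            reasons.append(f"observation {name} is {actual!r}, expected {expected!r}")

    return eligible, reasons


patient = PatientRecord(age=52, diagnosis_codes={"C50.9"},
                        observations={"HER2 status": "positive"})
trial = TrialCriteria(min_age=18, max_age=75, required_diagnosis="C50.9",
                      required_observations={"HER2 status": "positive"})

eligible, reasons = check_eligibility(patient, trial)
print(eligible)
for reason in reasons:
    print(" -", reason)
```

Because each rule records its own reason, the same list can serve as the user-facing explanation and as an audit entry. A raw-text match has no equivalent trace.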

Alternatives Considered

Alternative	Pros	Cons	Verdict
Match on raw text	Faster to implement, no extraction step needed	Unexplainable results, sensitive to document formatting, no deterministic rules possible	Rejected
Match on keywords	Simple, fast	Brittle (synonyms, abbreviations, negation handling), high false positive rate	Rejected
Hybrid (structured + semantic)	Combines deterministic, explainable filtering with semantic matching	More complex, risk of conflicting signals	Planned for v2 (structured first, semantic refinement second)
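
The "structured first, semantic refinement second" ordering planned for v2 might look like the sketch below. It is an illustration only. It assumes each trial carries a required ICD-10 code and a precomputed embedding vector, and it uses cosine similarity as a stand-in for whatever semantic comparison is eventually chosen. The function names and the dictionary keys "id", "required_code", and "vector" are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_trials(patient_codes: set[str], patient_vector: list[float],
                trials: list[dict]) -> list[dict]:
    """Step 1: deterministic filter on coded data. Step 2: semantic ordering."""
    # Structured first: drop trials whose required code is absent from the record.
    eligible = [t for t in trials if t["required_code"] in patient_codes]
    # Semantic second: order the survivors by similarity to the patient vector.
    return sorted(
        eligible,
        key=lambda t: cosine_similarity(patient_vector, t["vector"]),
        reverse=True,
    )


trials = [
    {"id": "A", "required_code": "C50.9", "vector": [0.0, 1.0]},
    {"id": "B", "required_code": "C50.9", "vector": [1.0, 0.0]},
    {"id": "C", "required_code": "E11.9", "vector": [1.0, 0.0]},
]

ranked = rank_trials({"C50.9"}, [1.0, 0.0], trials)
print([t["id"] for t in ranked])
```

In this ordering the semantic step can only reorder trials that already passed the coded filter. This is one way to limit the "conflicting signals" risk noted in the table: trial C is excluded despite a perfect vector match because the record lacks its required code.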

Consequences

Positive: Every match has an explainable, auditable basis in coded medical data.
Positive: Deterministic eligibility rules filter out obvious mismatches before expensive LLM-based semantic comparison.
Positive: Structured records enable future features (patient summary, timeline, FHIR export) without re-processing documents.
Negative: The extraction step adds latency and complexity. The LLM must reliably extract ICD-10, SNOMED CT, and LOINC codes from free-text medical documents.
Negative: Extraction accuracy is imperfect. Misidentified codes lead to incorrect or missed matches.
Accepted risk: The 50% completeness threshold is somewhat arbitrary. It will be tuned based on real-world data about how much information is needed for useful matches.