Traceability & Feedback Loops¶
Full decision traceability and model feedback loops for the Curaway agentic AI platform. Every AI decision is recorded, linked to patient outcomes, and fed back into prompt tuning, weight adjustment, and threshold calibration.
Design Principles¶
- Every decision traceable — why did the agent ask this question, skip that check, or rank this provider first?
- Every outcome linked — did the patient accept the match? Did the provider respond? Was the extraction correct?
- Every correction fed back — wrong ICD code → better prompt. Rejected match → adjusted weights. Missed condition → new few-shot example.
Architecture Overview¶
graph TD
A[Patient Interaction] --> B[Decision Engine]
B --> C[Decision Record]
C --> D[Langfuse Trace]
C --> E[Events Table]
C --> F[Audit Log]
G[Patient Outcome] --> H[Outcome Recorder]
H --> I{Link to Decisions}
I --> J[Match Accepted/Rejected]
I --> K[Provider Response]
I --> L[Patient Satisfaction]
I --> M[Provider Clinical Feedback]
J --> N[Feedback Store]
K --> N
L --> N
M --> N
N --> O[Eval Pipeline]
O --> P[Prompt Tuning]
O --> Q[Weight Adjustment]
O --> R[Threshold Calibration]
O --> S[ML Training Data]
P --> B
Q --> B
R --> B
style B fill:#008B8B,color:#fff
style N fill:#FF7F50,color:#fff
style O fill:#004D4D,color:#fff
Layer 1: Decision Records¶
Every patient turn produces a structured Decision Record stored in the events table.
Schema¶
{
    "event_type": "agent.decision",
    "case_id": "uuid",
    "turn_number": 3,
    "timestamp": "2026-04-01T10:30:00Z",

    # What the patient provided
    "input": {
        "message": "Dallas, no pain, blood works attached",
        "attachments": [{"document_id": "uuid", "filename": "blood_work.pdf"}],
        "input_method": "text",
    },

    # Classification
    "classification": {
        "category": "medical_travel",
        "confidence": 0.94,
        "model": "claude-haiku-4.5",
        "prompt_version": "classifier_v2",
    },

    # Document processing (per document)
    "document_processing": [
        {
            "document_id": "uuid",
            "ocr_method": "pymupdf",
            "ocr_duration_ms": 320,
            "chars_extracted": 4200,
            "clinical_context_agent": {
                "model": "claude-haiku-4.5",
                "prompt_version": "clinical_context_v3",
                "tokens_in": 3200,
                "tokens_out": 1800,
                "cost_usd": 0.012,
                "conditions_extracted": 7,
                "observations_extracted": 97,
                "langfuse_trace_id": "trace-uuid",
            },
            "lab_analyzer": {
                "comorbidities_detected": ["fatty_liver", "bradycardia", "impaired_glucose"],
                "method": "rule_based",
            },
            "validator": {
                "checks_run": ["laterality", "document_age", "patient_name", "ocr_quality"],
                "checks_skipped": ["body_part"],  # systemic test exemption
                "issues_found": [],
            },
            "embedding_match": {
                "requirement_matched": "complete_blood_count",
                "similarity_score": 0.91,
                "confirmed_by_llm": False,  # score > 0.85, skipped re-ranker
                "coverage": 0.95,  # 19/20 required parameters found
                "missing_params": ["reticulocyte_count"],
            },
        }
    ],

    # State changes this turn caused
    "state_delta": {
        "location": {"before": None, "after": {"city": "Dallas", "country": "USA"}, "source": "message_extraction"},
        "conditions": {"before": [], "after": ["fatty_liver", "bradycardia", "impaired_glucose"], "source": "lab_analyzer"},
        "documents.cbc": {"before": "not_provided", "after": "complete", "source": "blood_work.pdf"},
        "ehr_completeness": {"before": 0.17, "after": 0.45},
    },

    # Routing decision
    "routing": {
        "branch_chosen": "_handle_attachment_response",
        "reason": "attachments processed + procedure already identified",
        "branches_skipped": [
            {"branch": "quick_questions", "reason": "location already known from message"},
            {"branch": "records_request", "reason": "records already provided as attachments"},
        ],
        "workflow_updates": {"records_requested": True, "procedure_identified": True},
    },

    # Response generation
    "response": {
        "model": "claude-haiku-4.5",
        "prompt_version": "attachment_response_v2",
        "tokens_in": 800,
        "tokens_out": 400,
        "cost_usd": 0.004,
        "content_type": "text",
        "langfuse_trace_id": "trace-uuid",
    },

    # What's needed next
    "pending_actions": ["awaiting_patient_response", "awaiting_imaging_upload"],
}
Storage¶
Decision Records are stored as event_type = "agent.decision" in the existing events table (JSONB payload). No new table needed. Indexed by case_id and timestamp for efficient retrieval.
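A minimal sketch of the envelope-building step, assuming a hypothetical `build_decision_event` helper (the function name, required-section check, and row shape are illustrative, not the actual service API):

```python
import json
import uuid
from datetime import datetime, timezone

# Sections every Decision Record payload must carry (per the schema above)
REQUIRED_SECTIONS = {"classification", "state_delta", "routing", "response"}


def build_decision_event(case_id: str, turn_number: int, payload: dict) -> dict:
    """Wrap a per-turn decision payload in the events-table envelope.

    The payload lands in a JSONB column; case_id and timestamp are the
    indexed keys used for retrieval.
    """
    missing = REQUIRED_SECTIONS - payload.keys()
    if missing:
        raise ValueError(f"decision payload missing sections: {sorted(missing)}")
    return {
        "id": str(uuid.uuid4()),
        "event_type": "agent.decision",
        "case_id": case_id,
        "turn_number": turn_number,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "payload": json.dumps(payload),  # serialized for the JSONB column
    }
```

Validating the envelope before insert keeps malformed records out of the trace, which matters once evals start consuming them.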
Layer 2: Outcome Events¶
Track what happens after the AI makes a decision.
Events to Capture¶
| Event | Trigger | Links To |
|---|---|---|
| `match.presented` | Match results shown to patient | decision records for this case |
| `match.accepted` | Patient selects a provider | `match.presented` event |
| `match.rejected` | Patient rejects all matches or requests re-match | `match.presented` event |
| `provider.notified` | Records forwarded to provider | `match.accepted` event |
| `provider.responded` | Provider acknowledges/schedules | `provider.notified` event |
| `provider.feedback` | Provider corrects EHR data | clinical extraction decision records |
| `patient.satisfaction` | Post-journey NPS survey | all decision records for this case |
| `extraction.correction` | Manual correction of AI extraction | specific document processing record |
Outcome Schema¶
{
    "event_type": "match.accepted",
    "case_id": "uuid",
    "timestamp": "2026-04-02T14:00:00Z",
    "outcome": {
        "provider_selected": "apollo-chennai",
        "provider_rank": 1,  # was it the top match?
        "total_matches_shown": 5,
        "time_to_decision_minutes": 45,
        "doctor_selected": "dr-rajesh-patel",  # if doctor-level matching enabled
    },

    # Link back to the decision that produced the matches
    "decision_refs": ["event-id-of-matching-decision"],

    # Signals for feedback
    "feedback_signals": {
        "top_match_accepted": True,
        "selection_reason": None,  # future: patient can explain why they chose
    },
}
Layer 2.5: Auto-Reviewer (Active Now)¶
The auto-reviewer enables the feedback flywheel without provider partnerships. It compares the LLM's Clinical Context Agent extractions against the rule-based Lab Analyzer's detections to automatically generate ground truth.
graph LR
A[Document Uploaded] --> B[Clinical Context Agent<br/>LLM Extraction]
A --> C[Lab Analyzer<br/>Rule-Based Detection]
B --> D{Compare}
C --> D
D -->|Match| E[Auto-confirmed]
D -->|LLM missed| F[Auto-correction<br/>feedback record]
D -->|LLM extra| G[Queue for<br/>clinical advisor]
How it works¶
- Automated: the Lab Analyzer applies deterministic clinical thresholds (HbA1c > 6.5 → diabetes, eGFR < 60 → CKD), so its detections can serve as ground truth
- Compares: conditions the rules found vs. conditions the LLM extracted
- Creates feedback records for misses (`correction_type = "condition_missed"`, `reviewed_by = "automated_lab_analyzer"`)
- Auto-confirms matches (both agree → `quality_score = 1.0`)
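The comparison step is essentially a set difference between rule-based detections and LLM extractions. A sketch under that assumption (function and field names are illustrative, not the actual `auto_reviewer.py` API):

```python
def auto_review(llm_conditions: set, rule_conditions: set) -> list:
    """Compare LLM-extracted conditions against rule-based lab detections.

    Rule-detected conditions the LLM missed become auto-corrections;
    conditions both agree on are auto-confirmed; LLM-only conditions
    are queued for the clinical advisor.
    """
    records = []
    for cond in rule_conditions - llm_conditions:  # LLM missed
        records.append({
            "correction_type": "condition_missed",
            "reviewed_by": "automated_lab_analyzer",
            "condition": cond,
        })
    for cond in rule_conditions & llm_conditions:  # both agree
        records.append({
            "condition": cond,
            "quality_score": 1.0,
            "reviewed_by": "automated_lab_analyzer",
        })
    for cond in llm_conditions - rule_conditions:  # LLM extra
        records.append({
            "condition": cond,
            "status": "pending_clinical_review",
        })
    return records
```

The rule-based side only covers lab-detectable conditions, so LLM-only extras are not automatically wrong; that is why they route to the clinical advisor rather than generating corrections.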
Activation status¶
- Service: `app/services/auto_reviewer.py`
- Manual trigger: `POST /api/v1/cases/{id}/auto-review`
- Batch: `POST /internal/eval/auto-review` (QStash nightly)
- Initial run: 16 cases reviewed, 88 feedback records created
- Pattern detector: finds recurring misses for prompt improvement
Three layers of ground truth¶
| Layer | Source | Covers | Accuracy |
|---|---|---|---|
| Automated | Lab analyzer rules vs LLM | Lab-detectable conditions | High (deterministic) |
| Clinical advisor | Dr. Shrikanth Naidu via API | Complex clinical conditions | Gold standard |
| Patient behavior | Match acceptance signals | Matching quality | Implicit signal |
Layer 3: Feedback Store¶
Links decisions to outcomes for analysis and model improvement.
Schema¶
CREATE TABLE feedback_records (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  tenant_id UUID NOT NULL,
  case_id UUID NOT NULL,
  feedback_type VARCHAR(50) NOT NULL,  -- prompt_quality, match_quality, extraction_accuracy, clinical_correction
  decision_event_id UUID NOT NULL,     -- the agent.decision event
  outcome_event_id UUID,               -- the outcome event (if available)

  -- What was the AI's output?
  ai_output JSONB NOT NULL,

  -- What was the correct/desired output?
  ground_truth JSONB,                  -- null if not yet reviewed

  -- Scoring
  quality_score DECIMAL(3,2),          -- 0.00-1.00 (automated or manual)
  reviewed_by VARCHAR(100),            -- "automated", "clinical_advisor", "provider"
  reviewed_at TIMESTAMPTZ,

  -- Actionable feedback
  correction_type VARCHAR(50),         -- "icd_code_wrong", "condition_missed", "match_rank_wrong", "question_redundant"
  correction_detail JSONB,             -- specific correction data
  applied_to_prompt BOOLEAN DEFAULT false,  -- has this been used to update a prompt?
  applied_at TIMESTAMPTZ,

  created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX idx_fb_case ON feedback_records(case_id);
CREATE INDEX idx_fb_type ON feedback_records(feedback_type);
CREATE INDEX idx_fb_unapplied ON feedback_records(applied_to_prompt) WHERE applied_to_prompt = false;
Layer 4: Three Feedback Loops¶
Loop 1: Prompt Quality (cadence: weekly)¶
graph LR
A[Decision Record] --> B[Langfuse Trace]
B --> C[Manual Review<br/>or Automated Eval]
C --> D{Quality OK?}
D -->|No| E[Create Feedback Record]
E --> F[Update Prompt<br/>in Langfuse]
F --> G[A/B Test New<br/>vs Old Prompt]
G --> H[Promote Winner]
D -->|Yes| I[No Action]
What's evaluated:
| Agent | Evaluation Criteria |
|---|---|
| Clinical Context | ICD code accuracy, condition completeness, observation extraction rate |
| Intake | Question relevance (was this already known?), information extraction accuracy |
| Match | Explanation quality, reasoning accuracy, no hallucinated stats |
| Explanation | Medical accuracy, readability, locale appropriateness |
| Classifier | Category accuracy, false positive rate for off-topic/medical-advice |
Automated evaluators (Langfuse):
# ICD accuracy evaluator
def eval_icd_accuracy(trace):
    """Compare AI-extracted ICD codes against gold standard."""
    extracted = trace.output.get("icd_codes", [])
    gold = get_gold_standard(trace.metadata["document_id"])
    if not gold:
        return None  # no ground truth yet
    precision = len(set(extracted) & set(gold)) / max(len(extracted), 1)
    recall = len(set(extracted) & set(gold)) / max(len(gold), 1)
    f1 = 2 * precision * recall / max(precision + recall, 0.001)
    return {"precision": precision, "recall": recall, "f1": f1}

# Question relevance evaluator
def eval_question_relevance(trace):
    """Was this question necessary given what we already knew?"""
    state_before = trace.metadata.get("patient_state_before", {})
    question_asked = trace.output.get("question_topic")
    if question_asked == "location" and state_before.get("location"):
        return {"relevant": False, "reason": "location_already_known"}
    return {"relevant": True}
Loop 2: Matching Quality (cadence: per-case, analyzed monthly)¶
graph LR
A[Match Presented] --> B{Patient Action}
B -->|Accepts Top Match| C[Strong Signal:<br/>Ranking Correct]
B -->|Accepts Lower Match| D[Moderate Signal:<br/>Ranking Suboptimal]
B -->|Rejects All| E[Weak Signal:<br/>Investigate Why]
B -->|Requests Re-match| F[Negative Signal:<br/>Criteria Wrong]
C --> G[Feedback Store]
D --> G
E --> G
F --> G
G --> H[Monthly Analysis]
H --> I[Adjust Dimension Weights]
H --> J[Tune Score Thresholds]
H --> K[Identify Scoring Gaps]
Signals captured:
| Signal | Weight | Meaning |
|---|---|---|
| Top match accepted | +1.0 | Ranking was correct |
| Match 2-3 accepted | +0.5 | Close, but top wasn't best |
| Match 4-5 accepted | +0.2 | Significant ranking error |
| All rejected, re-matched | -0.5 | Criteria or weights wrong |
| Provider responded positively | +0.3 | Provider-side validation |
| Provider declined case | -0.3 | Match was one-sided |
| Patient satisfaction >= 4/5 | +0.5 | End-to-end success |
| Patient satisfaction <= 2/5 | -0.5 | Something went wrong |
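The table above maps directly to a scoring function. A minimal sketch (the signal-name keys and helper names are illustrative; the weights come from the table):

```python
# Feedback weights from the signals table
SIGNAL_WEIGHTS = {
    "top_match_accepted": 1.0,
    "match_2_3_accepted": 0.5,
    "match_4_5_accepted": 0.2,
    "all_rejected_rematched": -0.5,
    "provider_responded_positively": 0.3,
    "provider_declined": -0.3,
    "satisfaction_high": 0.5,   # patient satisfaction >= 4/5
    "satisfaction_low": -0.5,   # patient satisfaction <= 2/5
}


def rank_signal(provider_rank: int) -> str:
    """Map the accepted provider's 1-based rank to a signal name."""
    if provider_rank == 1:
        return "top_match_accepted"
    if provider_rank <= 3:
        return "match_2_3_accepted"
    return "match_4_5_accepted"


def case_signal_score(signals: list) -> float:
    """Sum the feedback weights observed for one case."""
    return sum(SIGNAL_WEIGHTS[s] for s in signals)
```

For example, a case where the rank-2 match was accepted and the provider responded positively would score 0.5 + 0.3 = 0.8 under this scheme.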
Weight adjustment (monthly):
# Simplified; production would use gradient-based optimization
from collections import defaultdict

def adjust_weights(feedback_records, current_weights):
    """Adjust matching weights based on accumulated feedback signals."""
    dimension_scores = defaultdict(list)
    for record in feedback_records:
        match_decision = record.ai_output
        outcome = record.ground_truth  # which provider was actually selected
        # Which dimensions predicted the selected provider best?
        selected_scores = match_decision["scores"][outcome["provider_id"]]
        for dim, score in selected_scores.items():
            # Positive signals reward dimensions that scored the selected
            # provider highly; negative signals penalize them.
            dimension_scores[dim].append(score * outcome["signal"])
    # Dimensions that predicted well get a weight boost, others a decrease
    adjustments = {}
    for dim, scores in dimension_scores.items():
        avg = sum(scores) / len(scores)
        adjustments[dim] = current_weights[dim] * (1 + 0.1 * avg)  # 10% max adjustment per cycle
    # Normalize so weights sum to 1.0
    total = sum(adjustments.values())
    return {k: v / total for k, v in adjustments.items()}
Loop 3: Clinical Accuracy (cadence: per-provider-interaction)¶
graph LR
A[AI Extracts EHR] --> B[Records Forwarded<br/>to Provider]
B --> C[Provider Reviews EHR]
C --> D{Corrections Needed?}
D -->|Yes| E[Provider Submits<br/>Corrections]
D -->|No| F[Confirmed Accurate]
E --> G[Feedback Store]
F --> G
G --> H[Correction Patterns<br/>Analysis]
H --> I[Update Few-Shot<br/>Examples]
H --> J[Adjust Confidence<br/>Thresholds]
H --> K[Flag Systematic<br/>Errors]
Provider feedback endpoint (post-POC):
POST /api/v1/cases/{case_id}/provider-feedback
{
  "provider_id": "apollo-chennai",
  "reviewer": "Dr. Rajesh Patel",
  "ehr_review": {
    "conditions_confirmed": ["M17.11"],
    "conditions_added": ["E11.9"],  // AI missed diabetes
    "conditions_removed": [],
    "observations_corrected": [
      {"parameter": "HbA1c", "ai_value": 5.8, "correct_value": 6.2}
    ],
    "overall_accuracy": 0.85
  },
  "notes": "Good extraction overall. Missed pre-diabetic indication from HbA1c trend."
}
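A sketch of how such a submission could be expanded into individual feedback records (field names follow the request body above; the function itself is hypothetical, not the actual endpoint handler):

```python
def feedback_records_from_review(case_id: str, review: dict) -> list:
    """Expand a provider's EHR review into per-correction feedback records."""
    records = []
    ehr = review["ehr_review"]
    for code in ehr["conditions_added"]:  # conditions the AI missed
        records.append({
            "case_id": case_id,
            "feedback_type": "clinical_correction",
            "correction_type": "condition_missed",
            "correction_detail": {"icd_code": code},
            "reviewed_by": review["reviewer"],
        })
    for obs in ehr["observations_corrected"]:  # values the AI got wrong
        records.append({
            "case_id": case_id,
            "feedback_type": "clinical_correction",
            "correction_type": "observation_value_wrong",
            "correction_detail": obs,
            "reviewed_by": review["reviewer"],
        })
    return records
```

One record per correction (rather than one per review) is what lets the pattern detector below group recurring error types across cases.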
Correction → prompt improvement:
# When a correction pattern appears 3+ times, create a new few-shot example
def check_correction_patterns():
    patterns = db.query("""
        SELECT correction_detail->>'type' AS error_type,
               COUNT(*) AS occurrences,
               ARRAY_AGG(id) AS ids
        FROM feedback_records
        WHERE feedback_type = 'clinical_correction'
          AND applied_to_prompt = false
        GROUP BY correction_detail->>'type'
        HAVING COUNT(*) >= 3
    """)
    for pattern in patterns:
        if pattern.error_type == "condition_missed":
            # Generate a new few-shot example showing the missed condition
            new_example = create_prompt_example(pattern)
            # Update the prompt in Langfuse
            update_langfuse_prompt("clinical_context", add_example=new_example)
            # Mark this feedback as applied
            mark_feedback_applied(pattern.ids)
Layer 5: Eval Pipeline¶
Scheduled evaluation runs compare AI outputs against accumulated ground truth.
Scheduled Evaluations¶
| Eval | Schedule | Data Source | Metric |
|---|---|---|---|
| ICD extraction accuracy | Nightly | provider_feedback + manual annotations | Precision, Recall, F1 |
| Comorbidity detection rate | Nightly | lab_analyzer outputs vs clinical confirmation | Sensitivity, Specificity |
| Match ranking quality | Weekly | match acceptance signals | NDCG@5, MRR |
| Question relevance | Weekly | decision records (was question needed?) | Redundancy rate |
| Document coverage scoring | Nightly | parameter extraction vs requirements | Coverage accuracy |
| Prompt regression | On prompt change | A/B comparison on held-out test set | Quality delta |
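For the match-ranking eval, NDCG@5 and MRR can be computed directly from acceptance signals by treating the accepted provider as the single relevant item. A minimal sketch under that assumption:

```python
import math


def mrr(accepted_ranks: list) -> float:
    """Mean reciprocal rank of the accepted provider across cases (1-based ranks)."""
    return sum(1.0 / r for r in accepted_ranks) / len(accepted_ranks)


def ndcg_at_5(accepted_rank: int) -> float:
    """NDCG@5 with a single relevant item.

    DCG = 1 / log2(rank + 1); the ideal DCG (relevant item at rank 1) is 1,
    so no further normalization is needed. Ranks beyond the cutoff score 0.
    """
    if accepted_rank > 5:
        return 0.0
    return 1.0 / math.log2(accepted_rank + 1)
```

With a single relevant item per case, NDCG@5 and reciprocal rank are both simple functions of the accepted provider's position, which is exactly what the `match.accepted` outcome event records.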
Eval Dashboard (Langfuse + Metabase)¶
┌─────────────────────────────────────────────┐
│ Clinical Extraction Accuracy March 2026 │
│ ────────────────────────────────────────── │
│ ICD-10 Precision: 0.87 (+0.03 vs Feb) │
│ ICD-10 Recall: 0.79 (+0.05 vs Feb) │
│ Condition F1: 0.83 │
│ Observation Rate: 94% of lab values found │
│ │
│ Top Missed Conditions: │
│ 1. Pre-diabetes (HbA1c 5.7-6.4) — 12 cases│
│ 2. Mild CKD (eGFR 60-89) — 8 cases │
│ 3. Subclinical hypothyroid — 5 cases │
│ │
│ Prompt Version: clinical_context_v3 │
│ Recommended: Add pre-diabetes few-shot │
└─────────────────────────────────────────────┘
Implementation Plan¶
Database Changes¶
-- New table for feedback records
CREATE TABLE feedback_records (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  tenant_id UUID NOT NULL REFERENCES tenants(id),
  case_id UUID NOT NULL,
  feedback_type VARCHAR(50) NOT NULL,
  decision_event_id UUID NOT NULL,
  outcome_event_id UUID,
  ai_output JSONB NOT NULL,
  ground_truth JSONB,
  quality_score DECIMAL(3,2),
  reviewed_by VARCHAR(100),
  reviewed_at TIMESTAMPTZ,
  correction_type VARCHAR(50),
  correction_detail JSONB,
  applied_to_prompt BOOLEAN DEFAULT false,
  applied_at TIMESTAMPTZ,
  created_at TIMESTAMPTZ DEFAULT NOW(),
  updated_at TIMESTAMPTZ DEFAULT NOW()
);
-- New event types to add
-- agent.decision (per-turn decision record)
-- match.accepted / match.rejected
-- provider.feedback (clinical corrections)
-- patient.satisfaction (NPS survey)
-- extraction.correction (manual fix)
New Services¶
| Service | Purpose |
|---|---|
| `app/services/decision_recorder.py` | Builds and stores Decision Records on every turn |
| `app/services/outcome_tracker.py` | Records match acceptance, provider response, satisfaction |
| `app/services/feedback_service.py` | Links decisions to outcomes, manages feedback records |
| `app/services/eval_runner.py` | Nightly eval pipeline, computes metrics, flags regressions |
| `app/services/weight_optimizer.py` | Monthly matching weight adjustment from feedback signals |
New Endpoints¶
| Method | Path | Description |
|---|---|---|
| GET | `/api/v1/cases/{id}/decisions` | Full decision trace for a case |
| POST | `/api/v1/cases/{id}/provider-feedback` | Provider submits EHR corrections |
| POST | `/api/v1/cases/{id}/satisfaction` | Patient NPS survey |
| GET | `/api/v1/internal/eval/summary` | Latest eval metrics dashboard |
| POST | `/api/v1/internal/eval/run` | Trigger manual eval run |
| GET | `/api/v1/internal/feedback/pending` | Unapplied corrections awaiting prompt updates |
QStash Scheduled Tasks¶
| Task | Schedule | Description |
|---|---|---|
| `eval-extraction-accuracy` | `0 2 * * *` (daily 2am) | Compare extractions against ground truth |
| `eval-match-quality` | `0 3 * * 1` (weekly Mon 3am) | Analyze match acceptance patterns |
| `eval-prompt-regression` | On prompt change | A/B test new vs old prompt |
| `feedback-pattern-detector` | `0 4 * * *` (daily 4am) | Find recurring correction patterns |
| `weight-optimizer` | `0 5 1 * *` (monthly 1st 5am) | Adjust matching weights from signals |
Langfuse Integration¶
Current Tracing (already built)¶
# Every LLM call is traced
trace = create_trace(agent_name="case_chat", session_id=case_id, user_id=patient_id)
handler = get_langchain_handler(trace)
# → Captures: model, tokens, cost, latency, prompt, completion
Enhanced Tracing (to build)¶
# Add decision spans to existing traces
with trace.span("routing_decision") as span:
    span.input = {"patient_state": patient_state, "message": message}
    branch = determine_routing_branch(patient_state, message)
    span.output = {"branch": branch, "reason": reason, "skipped": skipped_branches}

with trace.span("document_processing") as span:
    span.input = {"document_id": doc_id, "ocr_method": "pymupdf"}
    result = process_document(doc_id)
    span.output = {"conditions": len(result.conditions), "observations": len(result.observations)}

with trace.span("embedding_match") as span:
    span.input = {"document_summary": summary, "collection": "requirement_embeddings"}
    matches = qdrant_search(embedding, limit=5)
    span.output = {"top_match": matches[0].payload, "score": matches[0].score, "coverage": coverage}

# Decision record auto-built from trace spans
decision_record = build_decision_record_from_trace(trace)
await store_decision_event(case_id, decision_record)
Langfuse Evaluators¶
# Register automated evaluators in Langfuse
langfuse.register_evaluator(
    name="icd_extraction_accuracy",
    description="Compares extracted ICD codes against provider-confirmed codes",
    function=eval_icd_accuracy,
    applies_to={"agent_name": "clinical_context_agent"},
)

langfuse.register_evaluator(
    name="question_relevance",
    description="Was this intake question necessary given known patient state?",
    function=eval_question_relevance,
    applies_to={"agent_name": "intake_agent"},
)

langfuse.register_evaluator(
    name="match_ranking_quality",
    description="Did the patient accept the top-ranked match?",
    function=eval_match_ranking,
    applies_to={"agent_name": "match_agent"},
)
Data Flow Summary¶
| What | Where Stored | Retention | Access |
|---|---|---|---|
| Decision Records | Events table (JSONB) | Indefinite | GET /cases/{id}/decisions |
| Langfuse Traces | Langfuse Cloud | 90 days (free tier) | Langfuse dashboard |
| Outcome Events | Events table | Indefinite | Internal analytics |
| Feedback Records | feedback_records table | Indefinite | Internal + provider portal |
| Eval Metrics | Events table + Metabase | Indefinite | Eval dashboard |
| Audit Logs | audit_logs table | Indefinite, immutable | Compliance |
Privacy & Compliance¶
- Decision Records contain no PII — patient referenced by UUID only
- Langfuse traces contain no PII — prompts use anonymized data
- Feedback Records may contain provider names (not patient PII)
- Provider clinical feedback is stored with provider consent
- All feedback data is tenant-scoped
- GDPR deletion cascades through decision records and feedback records
- Audit trail is append-only — even corrections don't delete the original decision