Traceability & Feedback Loops

Full decision traceability and model feedback loops for the Curaway agentic AI platform. Every AI decision is recorded, linked to patient outcomes, and fed back into prompt tuning, weight adjustment, and threshold calibration.


Design Principles

  1. Every decision traceable — why did the agent ask this question, skip that check, or rank this provider first?
  2. Every outcome linked — did the patient accept the match? Did the provider respond? Was the extraction correct?
  3. Every correction fed back — wrong ICD code → better prompt. Rejected match → adjusted weights. Missed condition → new few-shot example.

Architecture Overview

graph TD
    A[Patient Interaction] --> B[Decision Engine]
    B --> C[Decision Record]
    C --> D[Langfuse Trace]
    C --> E[Events Table]
    C --> F[Audit Log]

    G[Patient Outcome] --> H[Outcome Recorder]
    H --> I{Link to Decisions}
    I --> J[Match Accepted/Rejected]
    I --> K[Provider Response]
    I --> L[Patient Satisfaction]
    I --> M[Provider Clinical Feedback]

    J --> N[Feedback Store]
    K --> N
    L --> N
    M --> N

    N --> O[Eval Pipeline]
    O --> P[Prompt Tuning]
    O --> Q[Weight Adjustment]
    O --> R[Threshold Calibration]
    O --> S[ML Training Data]

    P --> B
    Q --> B
    R --> B

    style B fill:#008B8B,color:#fff
    style N fill:#FF7F50,color:#fff
    style O fill:#004D4D,color:#fff

Layer 1: Decision Records

Every patient turn produces a structured Decision Record stored in the events table.

Schema

{
    "event_type": "agent.decision",
    "case_id": "uuid",
    "turn_number": 3,
    "timestamp": "2026-04-01T10:30:00Z",

    # What the patient provided
    "input": {
        "message": "Dallas, no pain, blood works attached",
        "attachments": [{"document_id": "uuid", "filename": "blood_work.pdf"}],
        "input_method": "text",
    },

    # Classification
    "classification": {
        "category": "medical_travel",
        "confidence": 0.94,
        "model": "claude-haiku-4.5",
        "prompt_version": "classifier_v2",
    },

    # Document processing (per document)
    "document_processing": [
        {
            "document_id": "uuid",
            "ocr_method": "pymupdf",
            "ocr_duration_ms": 320,
            "chars_extracted": 4200,
            "clinical_context_agent": {
                "model": "claude-haiku-4.5",
                "prompt_version": "clinical_context_v3",
                "tokens_in": 3200,
                "tokens_out": 1800,
                "cost_usd": 0.012,
                "conditions_extracted": 7,
                "observations_extracted": 97,
                "langfuse_trace_id": "trace-uuid",
            },
            "lab_analyzer": {
                "comorbidities_detected": ["fatty_liver", "bradycardia", "impaired_glucose"],
                "method": "rule_based",
            },
            "validator": {
                "checks_run": ["laterality", "document_age", "patient_name", "ocr_quality"],
                "checks_skipped": ["body_part"],  # systemic test exemption
                "issues_found": [],
            },
            "embedding_match": {
                "requirement_matched": "complete_blood_count",
                "similarity_score": 0.91,
                "confirmed_by_llm": False,  # score > 0.85, skipped re-ranker
                "coverage": 0.95,  # 19/20 required parameters found
                "missing_params": ["reticulocyte_count"],
            },
        }
    ],

    # State changes this turn caused
    "state_delta": {
        "location": {"before": None, "after": {"city": "Dallas", "country": "USA"}, "source": "message_extraction"},
        "conditions": {"before": [], "after": ["fatty_liver", "bradycardia", "impaired_glucose"], "source": "lab_analyzer"},
        "documents.cbc": {"before": "not_provided", "after": "complete", "source": "blood_work.pdf"},
        "ehr_completeness": {"before": 0.17, "after": 0.45},
    },

    # Routing decision
    "routing": {
        "branch_chosen": "_handle_attachment_response",
        "reason": "attachments processed + procedure already identified",
        "branches_skipped": [
            {"branch": "quick_questions", "reason": "location already known from message"},
            {"branch": "records_request", "reason": "records already provided as attachments"},
        ],
        "workflow_updates": {"records_requested": True, "procedure_identified": True},
    },

    # Response generation
    "response": {
        "model": "claude-haiku-4.5",
        "prompt_version": "attachment_response_v2",
        "tokens_in": 800,
        "tokens_out": 400,
        "cost_usd": 0.004,
        "content_type": "text",
        "langfuse_trace_id": "trace-uuid",
    },

    # What's needed next
    "pending_actions": ["awaiting_patient_response", "awaiting_imaging_upload"],
}
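The coverage and missing_params fields in embedding_match can be derived directly from the requirement's parameter list. A minimal sketch (the placeholder parameter names are illustrative):

```python
def compute_coverage(required_params: list[str], found_params: set[str]) -> tuple[float, list[str]]:
    """Fraction of required lab parameters found in the document, plus the missing ones."""
    missing = [p for p in required_params if p not in found_params]
    coverage = (len(required_params) - len(missing)) / len(required_params)
    return round(coverage, 2), missing

# 19 of 20 CBC parameters present → coverage 0.95
required = [f"param_{i}" for i in range(19)] + ["reticulocyte_count"]
found = set(required) - {"reticulocyte_count"}
coverage, missing = compute_coverage(required, found)
# → (0.95, ["reticulocyte_count"])
```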

Storage

Decision Records are stored as event_type = "agent.decision" in the existing events table (JSONB payload). No new table needed. Indexed by case_id and timestamp for efficient retrieval.


Layer 2: Outcome Events

Track what happens after the AI makes a decision.

Events to Capture

| Event | Trigger | Links To |
|---|---|---|
| match.presented | Match results shown to patient | decision records for this case |
| match.accepted | Patient selects a provider | match.presented event |
| match.rejected | Patient rejects all matches or requests re-match | match.presented event |
| provider.notified | Records forwarded to provider | match.accepted event |
| provider.responded | Provider acknowledges/schedules | provider.notified event |
| provider.feedback | Provider corrects EHR data | clinical extraction decision records |
| patient.satisfaction | Post-journey NPS survey | all decision records for this case |
| extraction.correction | Manual correction of AI extraction | specific document processing record |

Outcome Schema

{
    "event_type": "match.accepted",
    "case_id": "uuid",
    "timestamp": "2026-04-02T14:00:00Z",

    "outcome": {
        "provider_selected": "apollo-chennai",
        "provider_rank": 1,  # was it the top match?
        "total_matches_shown": 5,
        "time_to_decision_minutes": 45,
        "doctor_selected": "dr-rajesh-patel",  # if doctor-level matching enabled
    },

    # Link back to the decision that produced the matches
    "decision_refs": ["event-id-of-matching-decision"],

    # Signals for feedback
    "feedback_signals": {
        "top_match_accepted": True,
        "selection_reason": None,  # future: patient can explain why they chose
    },
}
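The feedback_signals block can be derived mechanically from the outcome payload. A sketch using the field names from the schema above:

```python
def derive_feedback_signals(outcome: dict) -> dict:
    """Derive feedback signals from a match.accepted outcome payload."""
    return {
        # rank 1 means the top-ranked match was the one the patient chose
        "top_match_accepted": outcome["provider_rank"] == 1,
        "selection_reason": None,  # future: patient-supplied explanation
    }

signals = derive_feedback_signals({"provider_rank": 1, "total_matches_shown": 5})
# → {"top_match_accepted": True, "selection_reason": None}
```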

Layer 2.5: Auto-Reviewer (Active Now)

The auto-reviewer enables the feedback flywheel without provider partnerships. It compares the LLM's Clinical Context Agent extractions against the rule-based Lab Analyzer's detections to automatically generate ground truth.

graph LR
    A[Document Uploaded] --> B[Clinical Context Agent<br/>LLM Extraction]
    A --> C[Lab Analyzer<br/>Rule-Based Detection]
    B --> D{Compare}
    C --> D
    D -->|Match| E[Auto-confirmed]
    D -->|LLM missed| F[Auto-correction<br/>feedback record]
    D -->|LLM extra| G[Queue for<br/>clinical advisor]

How it works

  • Deterministic baseline: the Lab Analyzer applies fixed clinical thresholds (HbA1c > 6.5 → diabetes, eGFR < 60 → CKD), so its detections serve as reliable ground truth for lab-derived conditions
  • Compare: the conditions the rules detected vs. the conditions the LLM extracted
  • Misses create feedback records (correction_type = "condition_missed", reviewed_by = "automated_lab_analyzer")
  • Matches are auto-confirmed (both agree → quality_score = 1.0)
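The comparison step boils down to set arithmetic over the two condition lists. A hedged sketch (function and field names are illustrative, not the actual auto_reviewer.py interface):

```python
def compare_extractions(rule_conditions: set[str], llm_conditions: set[str]) -> dict:
    """Compare rule-based Lab Analyzer detections against LLM extractions."""
    confirmed = rule_conditions & llm_conditions  # both agree → auto-confirmed
    missed = rule_conditions - llm_conditions     # LLM missed → auto-correction record
    extra = llm_conditions - rule_conditions      # LLM extra → clinical advisor queue
    return {
        "auto_confirmed": sorted(confirmed),
        "corrections": [
            {"correction_type": "condition_missed",
             "reviewed_by": "automated_lab_analyzer",
             "condition": c}
            for c in sorted(missed)
        ],
        "advisor_queue": sorted(extra),
        # full agreement scores 1.0; otherwise the confirmed fraction of rule detections
        "quality_score": 1.0 if not missed else round(len(confirmed) / len(rule_conditions), 2),
    }

result = compare_extractions(
    rule_conditions={"fatty_liver", "bradycardia", "impaired_glucose"},
    llm_conditions={"fatty_liver", "bradycardia", "hypertension"},
)
# impaired_glucose missed → one correction record; hypertension queued for advisor review
```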

Activation status

  • Service: app/services/auto_reviewer.py
  • Manual trigger: POST /api/v1/cases/{id}/auto-review
  • Batch: POST /internal/eval/auto-review (QStash nightly)
  • Initial run: 16 cases reviewed, 88 feedback records created
  • Pattern detector: finds recurring misses for prompt improvement

Three layers of ground truth

| Layer | Source | Covers | Accuracy |
|---|---|---|---|
| Automated | Lab analyzer rules vs LLM | Lab-detectable conditions | High (deterministic) |
| Clinical advisor | Dr. Shrikanth Naidu via API | Complex clinical conditions | Gold standard |
| Patient behavior | Match acceptance signals | Matching quality | Implicit signal |

Layer 3: Feedback Store

Links decisions to outcomes for analysis and model improvement.

Schema

CREATE TABLE feedback_records (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id UUID NOT NULL,
    case_id UUID NOT NULL,
    feedback_type VARCHAR(50) NOT NULL,  -- prompt_quality, match_quality, extraction_accuracy, clinical_correction
    decision_event_id UUID NOT NULL,     -- the agent.decision event
    outcome_event_id UUID,               -- the outcome event (if available)

    -- What was the AI's output?
    ai_output JSONB NOT NULL,

    -- What was the correct/desired output?
    ground_truth JSONB,                  -- null if not yet reviewed

    -- Scoring
    quality_score DECIMAL(3,2),          -- 0.00-1.00 (automated or manual)
    reviewed_by VARCHAR(100),            -- "automated", "clinical_advisor", "provider"
    reviewed_at TIMESTAMPTZ,

    -- Actionable feedback
    correction_type VARCHAR(50),         -- "icd_code_wrong", "condition_missed", "match_rank_wrong", "question_redundant"
    correction_detail JSONB,             -- specific correction data
    applied_to_prompt BOOLEAN DEFAULT false,  -- has this been used to update a prompt?
    applied_at TIMESTAMPTZ,

    created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX idx_fb_case ON feedback_records(case_id);
CREATE INDEX idx_fb_type ON feedback_records(feedback_type);
CREATE INDEX idx_fb_unapplied ON feedback_records(applied_to_prompt) WHERE applied_to_prompt = false;
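A feedback record linking a decision to its outcome might be assembled as below before it goes through the normal DB layer; this is a sketch, and the helper name is hypothetical:

```python
from datetime import datetime, timezone

def build_feedback_record(decision_event_id, ai_output, *, outcome_event_id=None,
                          feedback_type="extraction_accuracy", ground_truth=None):
    """Assemble a feedback_records row payload; ground_truth stays None until reviewed."""
    return {
        "feedback_type": feedback_type,
        "decision_event_id": decision_event_id,
        "outcome_event_id": outcome_event_id,
        "ai_output": ai_output,
        "ground_truth": ground_truth,
        "quality_score": None,        # set later by automated or manual review
        "applied_to_prompt": False,   # flipped once a prompt update consumes it
        "created_at": datetime.now(timezone.utc).isoformat(),
    }
```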

Layer 4: Three Feedback Loops

Loop 1: Prompt Quality (cadence: weekly)

graph LR
    A[Decision Record] --> B[Langfuse Trace]
    B --> C[Manual Review<br/>or Automated Eval]
    C --> D{Quality OK?}
    D -->|No| E[Create Feedback Record]
    E --> F[Update Prompt<br/>in Langfuse]
    F --> G[A/B Test New<br/>vs Old Prompt]
    G --> H[Promote Winner]
    D -->|Yes| I[No Action]

What's evaluated:

| Agent | Evaluation Criteria |
|---|---|
| Clinical Context | ICD code accuracy, condition completeness, observation extraction rate |
| Intake | Question relevance (was this already known?), information extraction accuracy |
| Match | Explanation quality, reasoning accuracy, no hallucinated stats |
| Explanation | Medical accuracy, readability, locale appropriateness |
| Classifier | Category accuracy, false positive rate for off-topic/medical-advice |

Automated evaluators (Langfuse):

# ICD accuracy evaluator
def eval_icd_accuracy(trace):
    """Compare AI-extracted ICD codes against gold standard."""
    extracted = trace.output.get("icd_codes", [])
    gold = get_gold_standard(trace.metadata["document_id"])
    if not gold:
        return None  # no ground truth yet
    precision = len(set(extracted) & set(gold)) / max(len(extracted), 1)
    recall = len(set(extracted) & set(gold)) / max(len(gold), 1)
    return {"precision": precision, "recall": recall, "f1": 2 * precision * recall / max(precision + recall, 0.001)}

# Question relevance evaluator
def eval_question_relevance(trace):
    """Was this question necessary given what we already knew?"""
    state_before = trace.metadata.get("patient_state_before", {})
    question_asked = trace.output.get("question_topic")
    if question_asked == "location" and state_before.get("location"):
        return {"relevant": False, "reason": "location_already_known"}
    return {"relevant": True}

Loop 2: Matching Quality (cadence: per-case, analyzed monthly)

graph LR
    A[Match Presented] --> B{Patient Action}
    B -->|Accepts Top Match| C[Strong Signal:<br/>Ranking Correct]
    B -->|Accepts Lower Match| D[Moderate Signal:<br/>Ranking Suboptimal]
    B -->|Rejects All| E[Weak Signal:<br/>Investigate Why]
    B -->|Requests Re-match| F[Negative Signal:<br/>Criteria Wrong]

    C --> G[Feedback Store]
    D --> G
    E --> G
    F --> G

    G --> H[Monthly Analysis]
    H --> I[Adjust Dimension Weights]
    H --> J[Tune Score Thresholds]
    H --> K[Identify Scoring Gaps]

Signals captured:

| Signal | Weight | Meaning |
|---|---|---|
| Top match accepted | +1.0 | Ranking was correct |
| Match 2-3 accepted | +0.5 | Close, but top wasn't best |
| Match 4-5 accepted | +0.2 | Significant ranking error |
| All rejected, re-matched | -0.5 | Criteria or weights wrong |
| Provider responded positively | +0.3 | Provider-side validation |
| Provider declined case | -0.3 | Match was one-sided |
| Patient satisfaction >= 4/5 | +0.5 | End-to-end success |
| Patient satisfaction <= 2/5 | -0.5 | Something went wrong |

Weight adjustment (monthly):

# Simplified — production would use gradient-based optimization
from collections import defaultdict

def adjust_weights(feedback_records, current_weights):
    """Adjust matching weights based on accumulated feedback signals."""
    dimension_scores = defaultdict(list)

    for record in feedback_records:
        match_decision = record.ai_output
        outcome = record.ground_truth  # which provider was actually selected

        # Which dimensions predicted the selected provider best?
        # A positive signal rewards dimensions that scored the chosen provider
        # highly; a negative signal penalizes them by the same arithmetic.
        selected_scores = match_decision["scores"][outcome["provider_id"]]
        for dim, score in selected_scores.items():
            dimension_scores[dim].append(score * outcome["signal"])

    # Dimensions that predicted well get a weight boost, others a decrease
    adjustments = {}
    for dim, scores in dimension_scores.items():
        avg = sum(scores) / len(scores)
        adjustments[dim] = current_weights[dim] * (1 + 0.1 * avg)  # ±10% max per cycle

    # Normalize so weights sum to 1.0
    total = sum(adjustments.values())
    return {k: v / total for k, v in adjustments.items()}

Loop 3: Clinical Accuracy (cadence: per-provider-interaction)

graph LR
    A[AI Extracts EHR] --> B[Records Forwarded<br/>to Provider]
    B --> C[Provider Reviews EHR]
    C --> D{Corrections Needed?}
    D -->|Yes| E[Provider Submits<br/>Corrections]
    D -->|No| F[Confirmed Accurate]

    E --> G[Feedback Store]
    F --> G

    G --> H[Correction Patterns<br/>Analysis]
    H --> I[Update Few-Shot<br/>Examples]
    H --> J[Adjust Confidence<br/>Thresholds]
    H --> K[Flag Systematic<br/>Errors]

Provider feedback endpoint (post-POC):

POST /api/v1/cases/{case_id}/provider-feedback
{
    "provider_id": "apollo-chennai",
    "reviewer": "Dr. Rajesh Patel",
    "ehr_review": {
        "conditions_confirmed": ["M17.11"],
        "conditions_added": ["E11.9"],      // AI missed diabetes
        "conditions_removed": [],
        "observations_corrected": [
            {"parameter": "HbA1c", "ai_value": 5.8, "correct_value": 6.2}
        ],
        "overall_accuracy": 0.85
    },
    "notes": "Good extraction overall. Missed pre-diabetic indication from HbA1c trend."
}

Correction → prompt improvement:

# When a correction pattern appears 3+ times, create a new few-shot example
def check_correction_patterns():
    patterns = db.query("""
        SELECT correction_detail->>'type' AS error_type,
               COUNT(*) AS occurrences,
               ARRAY_AGG(id) AS feedback_ids
        FROM feedback_records
        WHERE feedback_type = 'clinical_correction'
          AND applied_to_prompt = false
        GROUP BY correction_detail->>'type'
        HAVING COUNT(*) >= 3
    """)

    for pattern in patterns:
        if pattern.error_type == "condition_missed":
            # Generate a new few-shot example showing the missed condition
            new_example = create_prompt_example(pattern)
            # Update prompt in Langfuse
            update_langfuse_prompt("clinical_context", add_example=new_example)
            # Mark the contributing feedback records as applied
            mark_feedback_applied(pattern.feedback_ids)

Layer 5: Eval Pipeline

Automated nightly evaluation runs comparing AI outputs against accumulated ground truth.

Scheduled Evaluations

| Eval | Schedule | Data Source | Metric |
|---|---|---|---|
| ICD extraction accuracy | Nightly | provider_feedback + manual annotations | Precision, Recall, F1 |
| Comorbidity detection rate | Nightly | lab_analyzer outputs vs clinical confirmation | Sensitivity, Specificity |
| Match ranking quality | Weekly | match acceptance signals | NDCG@5, MRR |
| Question relevance | Weekly | decision records (was question needed?) | Redundancy rate |
| Document coverage scoring | Nightly | parameter extraction vs requirements | Coverage accuracy |
| Prompt regression | On prompt change | A/B comparison on held-out test set | Quality delta |
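NDCG@5 and MRR for the match-ranking eval can be computed from acceptance signals. A sketch where relevance is 1 for the provider the patient accepted and 0 otherwise:

```python
import math

def mrr(relevances: list[int]) -> float:
    """Reciprocal rank of the first relevant (accepted) result; 0 if none."""
    for i, rel in enumerate(relevances, start=1):
        if rel:
            return 1 / i
    return 0.0

def ndcg_at_k(relevances: list[int], k: int = 5) -> float:
    """Normalized discounted cumulative gain over the top k results."""
    dcg = sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances[:k], start=1))
    ideal = sorted(relevances, reverse=True)
    idcg = sum(rel / math.log2(i + 1) for i, rel in enumerate(ideal[:k], start=1))
    return dcg / idcg if idcg else 0.0

# Patient accepted the 2nd-ranked provider out of 5 shown
rels = [0, 1, 0, 0, 0]
# mrr(rels) → 0.5; ndcg_at_k(rels) → 1/log2(3) ≈ 0.631
```

With a single accepted provider per case, NDCG@5 and MRR move together; NDCG becomes more informative if graded relevance (e.g. provider responded, patient satisfied) is folded in later.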

Eval Dashboard (Langfuse + Metabase)

┌──────────────────────────────────────────────┐
│  Clinical Extraction Accuracy    March 2026  │
│  ──────────────────────────────────────────  │
│  ICD-10 Precision:  0.87 (+0.03 vs Feb)      │
│  ICD-10 Recall:     0.79 (+0.05 vs Feb)      │
│  Condition F1:      0.83                     │
│  Observation Rate:  94% of lab values found  │
│                                              │
│  Top Missed Conditions:                      │
│  1. Pre-diabetes (HbA1c 5.7-6.4) — 12 cases  │
│  2. Mild CKD (eGFR 60-89) — 8 cases          │
│  3. Subclinical hypothyroid — 5 cases        │
│                                              │
│  Prompt Version: clinical_context_v3         │
│  Recommended: Add pre-diabetes few-shot      │
└──────────────────────────────────────────────┘

Implementation Plan

Database Changes

-- New table for feedback records
CREATE TABLE feedback_records (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id UUID NOT NULL REFERENCES tenants(id),
    case_id UUID NOT NULL,
    feedback_type VARCHAR(50) NOT NULL,
    decision_event_id UUID NOT NULL,
    outcome_event_id UUID,
    ai_output JSONB NOT NULL,
    ground_truth JSONB,
    quality_score DECIMAL(3,2),
    reviewed_by VARCHAR(100),
    reviewed_at TIMESTAMPTZ,
    correction_type VARCHAR(50),
    correction_detail JSONB,
    applied_to_prompt BOOLEAN DEFAULT false,
    applied_at TIMESTAMPTZ,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW()
);

-- New event types to add
-- agent.decision (per-turn decision record)
-- match.accepted / match.rejected
-- provider.feedback (clinical corrections)
-- patient.satisfaction (NPS survey)
-- extraction.correction (manual fix)

New Services

| Service | Purpose |
|---|---|
| app/services/decision_recorder.py | Builds and stores Decision Records on every turn |
| app/services/outcome_tracker.py | Records match acceptance, provider response, satisfaction |
| app/services/feedback_service.py | Links decisions to outcomes, manages feedback records |
| app/services/eval_runner.py | Nightly eval pipeline, computes metrics, flags regressions |
| app/services/weight_optimizer.py | Monthly matching weight adjustment from feedback signals |

New Endpoints

| Method | Path | Description |
|---|---|---|
| GET | /api/v1/cases/{id}/decisions | Full decision trace for a case |
| POST | /api/v1/cases/{id}/provider-feedback | Provider submits EHR corrections |
| POST | /api/v1/cases/{id}/satisfaction | Patient NPS survey |
| GET | /api/v1/internal/eval/summary | Latest eval metrics dashboard |
| POST | /api/v1/internal/eval/run | Trigger manual eval run |
| GET | /api/v1/internal/feedback/pending | Unapplied corrections awaiting prompt updates |

QStash Scheduled Tasks

| Task | Schedule | Description |
|---|---|---|
| eval-extraction-accuracy | 0 2 * * * (daily 2am) | Compare extractions against ground truth |
| eval-match-quality | 0 3 * * 1 (weekly Mon 3am) | Analyze match acceptance patterns |
| eval-prompt-regression | On prompt change | A/B test new vs old prompt |
| feedback-pattern-detector | 0 4 * * * (daily 4am) | Find recurring correction patterns |
| weight-optimizer | 0 5 1 * * (monthly 1st 5am) | Adjust matching weights from signals |

Langfuse Integration

Current Tracing (already built)

# Every LLM call is traced
trace = create_trace(agent_name="case_chat", session_id=case_id, user_id=patient_id)
handler = get_langchain_handler(trace)
# → Captures: model, tokens, cost, latency, prompt, completion

Enhanced Tracing (to build)

# Add decision spans to existing traces
with trace.span("routing_decision") as span:
    span.input = {"patient_state": patient_state, "message": message}
    branch = determine_routing_branch(patient_state, message)
    span.output = {"branch": branch, "reason": reason, "skipped": skipped_branches}

with trace.span("document_processing") as span:
    span.input = {"document_id": doc_id, "ocr_method": "pymupdf"}
    result = process_document(doc_id)
    span.output = {"conditions": len(result.conditions), "observations": len(result.observations)}

with trace.span("embedding_match") as span:
    span.input = {"document_summary": summary, "collection": "requirement_embeddings"}
    matches = qdrant_search(embedding, limit=5)
    span.output = {"top_match": matches[0].payload, "score": matches[0].score, "coverage": coverage}

# Decision record auto-built from trace spans
decision_record = build_decision_record_from_trace(trace)
await store_decision_event(case_id, decision_record)

Langfuse Evaluators

# Register automated evaluators in Langfuse
langfuse.register_evaluator(
    name="icd_extraction_accuracy",
    description="Compares extracted ICD codes against provider-confirmed codes",
    function=eval_icd_accuracy,
    applies_to={"agent_name": "clinical_context_agent"},
)

langfuse.register_evaluator(
    name="question_relevance",
    description="Was this intake question necessary given known patient state?",
    function=eval_question_relevance,
    applies_to={"agent_name": "intake_agent"},
)

langfuse.register_evaluator(
    name="match_ranking_quality",
    description="Did the patient accept the top-ranked match?",
    function=eval_match_ranking,
    applies_to={"agent_name": "match_agent"},
)

Data Flow Summary

| What | Where Stored | Retention | Access |
|---|---|---|---|
| Decision Records | Events table (JSONB) | Indefinite | GET /cases/{id}/decisions |
| Langfuse Traces | Langfuse Cloud | 90 days (free tier) | Langfuse dashboard |
| Outcome Events | Events table | Indefinite | Internal analytics |
| Feedback Records | feedback_records table | Indefinite | Internal + provider portal |
| Eval Metrics | Events table + Metabase | Indefinite | Eval dashboard |
| Audit Logs | audit_logs table | Indefinite, immutable | Compliance |

Privacy & Compliance

  • Decision Records contain no PII — patient referenced by UUID only
  • Langfuse traces contain no PII — prompts use anonymized data
  • Feedback Records may contain provider names (not patient PII)
  • Provider clinical feedback is stored with provider consent
  • All feedback data is tenant-scoped
  • GDPR deletion cascades through decision records and feedback records
  • Audit trail is append-only — even corrections don't delete the original decision