Traceability & Feedback Loops¶
Full decision traceability and model feedback loops for the Curaway agentic AI platform. Every AI decision is recorded, linked to patient outcomes, and fed back into prompt tuning, weight adjustment, and threshold calibration.
Design Principles¶
- Every decision traceable — why did the agent ask this question, skip that check, or rank this provider first?
- Every outcome linked — did the patient accept the match? Did the provider respond? Was the extraction correct?
- Every correction fed back — wrong ICD code → better prompt. Rejected match → adjusted weights. Missed condition → new few-shot example.
Architecture Overview¶
graph TD
A[Patient Interaction] --> B[Decision Engine]
B --> C[Decision Record]
C --> D[Langfuse Trace]
C --> E[Events Table]
C --> F[Audit Log]
G[Patient Outcome] --> H[Outcome Recorder]
H --> I{Link to Decisions}
I --> J[Match Accepted/Rejected]
I --> K[Provider Response]
I --> L[Patient Satisfaction]
I --> M[Provider Clinical Feedback]
J --> N[Feedback Store]
K --> N
L --> N
M --> N
N --> O[Eval Pipeline]
O --> P[Prompt Tuning]
O --> Q[Weight Adjustment]
O --> R[Threshold Calibration]
O --> S[ML Training Data]
P --> B
Q --> B
R --> B
style B fill:#008B8B,color:#fff
style N fill:#FF7F50,color:#fff
style O fill:#004D4D,color:#fff
Layer 1: Decision Records¶
Every patient turn produces a structured Decision Record stored in the events table.
Schema¶
{
    "event_type": "agent.decision",
    "case_id": "uuid",
    "turn_number": 3,
    "timestamp": "2026-04-01T10:30:00Z",

    # What the patient provided
    "input": {
        "message": "Dallas, no pain, blood works attached",
        "attachments": [{"document_id": "uuid", "filename": "blood_work.pdf"}],
        "input_method": "text",
    },

    # Classification
    "classification": {
        "category": "medical_travel",
        "confidence": 0.94,
        "model": "claude-haiku-4.5",
        "prompt_version": "classifier_v2",
    },

    # Document processing (per document)
    "document_processing": [
        {
            "document_id": "uuid",
            "ocr_method": "pymupdf",
            "ocr_duration_ms": 320,
            "chars_extracted": 4200,
            "clinical_context_agent": {
                "model": "claude-haiku-4.5",
                "prompt_version": "clinical_context_v3",
                "tokens_in": 3200,
                "tokens_out": 1800,
                "cost_usd": 0.012,
                "conditions_extracted": 7,
                "observations_extracted": 97,
                "langfuse_trace_id": "trace-uuid",
            },
            "lab_analyzer": {
                "comorbidities_detected": ["fatty_liver", "bradycardia", "impaired_glucose"],
                "method": "rule_based",
            },
            "validator": {
                "checks_run": ["laterality", "document_age", "patient_name", "ocr_quality"],
                "checks_skipped": ["body_part"],  # systemic test exemption
                "issues_found": [],
            },
            "embedding_match": {
                "requirement_matched": "complete_blood_count",
                "similarity_score": 0.91,
                "confirmed_by_llm": False,  # score > 0.85, skipped re-ranker
                "coverage": 0.95,  # 19/20 required parameters found
                "missing_params": ["reticulocyte_count"],
            },
        }
    ],

    # State changes this turn caused
    "state_delta": {
        "location": {"before": None, "after": {"city": "Dallas", "country": "USA"}, "source": "message_extraction"},
        "conditions": {"before": [], "after": ["fatty_liver", "bradycardia", "impaired_glucose"], "source": "lab_analyzer"},
        "documents.cbc": {"before": "not_provided", "after": "complete", "source": "blood_work.pdf"},
        "ehr_completeness": {"before": 0.17, "after": 0.45},
    },

    # Routing decision
    "routing": {
        "branch_chosen": "_handle_attachment_response",
        "reason": "attachments processed + procedure already identified",
        "branches_skipped": [
            {"branch": "quick_questions", "reason": "location already known from message"},
            {"branch": "records_request", "reason": "records already provided as attachments"},
        ],
        "workflow_updates": {"records_requested": True, "procedure_identified": True},
    },

    # Response generation
    "response": {
        "model": "claude-haiku-4.5",
        "prompt_version": "attachment_response_v2",
        "tokens_in": 800,
        "tokens_out": 400,
        "cost_usd": 0.004,
        "content_type": "text",
        "langfuse_trace_id": "trace-uuid",
    },

    # What's needed next
    "pending_actions": ["awaiting_patient_response", "awaiting_imaging_upload"],
}
Storage¶
Decision Records are stored as event_type = "agent.decision" in the existing events table (JSONB payload). No new table needed. Indexed by case_id and timestamp for efficient retrieval.
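A minimal sketch of the envelope-building step, assuming a hypothetical `build_decision_event` helper (the function name, required-section check, and row shape are illustrative, not the actual service API):

```python
import json
import uuid
from datetime import datetime, timezone

# Sections every Decision Record payload must carry (per the schema above)
REQUIRED_SECTIONS = {"classification", "state_delta", "routing", "response"}


def build_decision_event(case_id: str, turn_number: int, payload: dict) -> dict:
    """Wrap a per-turn decision payload in the events-table envelope.

    The payload lands in a JSONB column; case_id and timestamp are the
    indexed keys used for retrieval.
    """
    missing = REQUIRED_SECTIONS - payload.keys()
    if missing:
        raise ValueError(f"decision payload missing sections: {sorted(missing)}")
    return {
        "id": str(uuid.uuid4()),
        "event_type": "agent.decision",
        "case_id": case_id,
        "turn_number": turn_number,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "payload": json.dumps(payload),  # serialized for the JSONB column
    }
```

Validating the envelope before insert keeps malformed records out of the trace, which matters once evals start consuming them.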
Layer 2: Outcome Events¶
Track what happens after the AI makes a decision.
Events to Capture¶
| Event | Trigger | Links To |
|---|---|---|
| `match.presented` | Match results shown to patient | decision records for this case |
| `match.accepted` | Patient selects a provider | `match.presented` event |
| `match.rejected` | Patient rejects all matches or requests re-match | `match.presented` event |
| `provider.notified` | Records forwarded to provider | `match.accepted` event |
| `provider.responded` | Provider acknowledges/schedules | `provider.notified` event |
| `provider.feedback` | Provider corrects EHR data | clinical extraction decision records |
| `patient.satisfaction` | Post-journey NPS survey | all decision records for this case |
| `extraction.correction` | Manual correction of AI extraction | specific document processing record |
Outcome Schema¶
{
    "event_type": "match.accepted",
    "case_id": "uuid",
    "timestamp": "2026-04-02T14:00:00Z",
    "outcome": {
        "provider_selected": "apollo-chennai",
        "provider_rank": 1,  # was it the top match?
        "total_matches_shown": 5,
        "time_to_decision_minutes": 45,
        "doctor_selected": "dr-rajesh-patel",  # if doctor-level matching enabled
    },

    # Link back to the decision that produced the matches
    "decision_refs": ["event-id-of-matching-decision"],

    # Signals for feedback
    "feedback_signals": {
        "top_match_accepted": True,
        "selection_reason": None,  # future: patient can explain why they chose
    },
}
Layer 2.5: Auto-Reviewer (Active Now)¶
The auto-reviewer enables the feedback flywheel without provider partnerships. It compares the LLM's Clinical Context Agent extractions against the rule-based Lab Analyzer's detections to automatically generate ground truth.
graph LR
A[Document Uploaded] --> B[Clinical Context Agent<br/>LLM Extraction]
A --> C[Lab Analyzer<br/>Rule-Based Detection]
B --> D{Compare}
C --> D
D -->|Match| E[Auto-confirmed]
D -->|LLM missed| F[Auto-correction<br/>feedback record]
D -->|LLM extra| G[Queue for<br/>clinical advisor]
How it works¶
- Automated: the Lab Analyzer applies deterministic clinical thresholds (HbA1c > 6.5 → diabetes, eGFR < 60 → CKD), so its detections can serve as ground truth
- Compares: conditions the rules found vs. conditions the LLM extracted
- Creates feedback records for misses (`correction_type = "condition_missed"`, `reviewed_by = "automated_lab_analyzer"`)
- Auto-confirms matches (both agree → `quality_score = 1.0`)
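The comparison step is essentially a set difference between rule-based detections and LLM extractions. A sketch under that assumption (function and field names are illustrative, not the actual `auto_reviewer.py` API):

```python
def auto_review(llm_conditions: set, rule_conditions: set) -> list:
    """Compare LLM-extracted conditions against rule-based lab detections.

    Rule-detected conditions the LLM missed become auto-corrections;
    conditions both agree on are auto-confirmed; LLM-only conditions
    are queued for the clinical advisor.
    """
    records = []
    for cond in rule_conditions - llm_conditions:  # LLM missed
        records.append({
            "correction_type": "condition_missed",
            "reviewed_by": "automated_lab_analyzer",
            "condition": cond,
        })
    for cond in rule_conditions & llm_conditions:  # both agree
        records.append({
            "condition": cond,
            "quality_score": 1.0,
            "reviewed_by": "automated_lab_analyzer",
        })
    for cond in llm_conditions - rule_conditions:  # LLM extra
        records.append({
            "condition": cond,
            "status": "pending_clinical_review",
        })
    return records
```

The rule-based side only covers lab-detectable conditions, so LLM-only extras are not automatically wrong; that is why they route to the clinical advisor rather than generating corrections.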
Activation status¶
- Service: `app/services/auto_reviewer.py`
- Manual trigger: `POST /api/v1/cases/{id}/auto-review`
- Batch: `POST /internal/eval/auto-review` (QStash nightly)
- Initial run: 16 cases reviewed, 88 feedback records created
- Pattern detector: finds recurring misses for prompt improvement
Three layers of ground truth¶
| Layer | Source | Covers | Accuracy |
|---|---|---|---|
| Automated | Lab analyzer rules vs LLM | Lab-detectable conditions | High (deterministic) |
| Clinical advisor | Dr. Shrikanth Naidu via API | Complex clinical conditions | Gold standard |
| Patient behavior | Match acceptance signals | Matching quality | Implicit signal |
Layer 3: Feedback Store¶
Links decisions to outcomes for analysis and model improvement.
Schema¶
CREATE TABLE feedback_records (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  tenant_id UUID NOT NULL,
  case_id UUID NOT NULL,
  feedback_type VARCHAR(50) NOT NULL,  -- prompt_quality, match_quality, extraction_accuracy, clinical_correction
  decision_event_id UUID NOT NULL,     -- the agent.decision event
  outcome_event_id UUID,               -- the outcome event (if available)

  -- What was the AI's output?
  ai_output JSONB NOT NULL,

  -- What was the correct/desired output?
  ground_truth JSONB,                  -- null if not yet reviewed

  -- Scoring
  quality_score DECIMAL(3,2),          -- 0.00-1.00 (automated or manual)
  reviewed_by VARCHAR(100),            -- "automated", "clinical_advisor", "provider"
  reviewed_at TIMESTAMPTZ,

  -- Actionable feedback
  correction_type VARCHAR(50),         -- "icd_code_wrong", "condition_missed", "match_rank_wrong", "question_redundant"
  correction_detail JSONB,             -- specific correction data
  applied_to_prompt BOOLEAN DEFAULT false,  -- has this been used to update a prompt?
  applied_at TIMESTAMPTZ,

  created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX idx_fb_case ON feedback_records(case_id);
CREATE INDEX idx_fb_type ON feedback_records(feedback_type);
CREATE INDEX idx_fb_unapplied ON feedback_records(applied_to_prompt) WHERE applied_to_prompt = false;
Layer 4: Three Feedback Loops¶
Loop 1: Prompt Quality (cadence: weekly)¶
graph LR
A[Decision Record] --> B[Langfuse Trace]
B --> C[Manual Review<br/>or Automated Eval]
C --> D{Quality OK?}
D -->|No| E[Create Feedback Record]
E --> F[Update Prompt<br/>in Langfuse]
F --> G[A/B Test New<br/>vs Old Prompt]
G --> H[Promote Winner]
D -->|Yes| I[No Action]
What's evaluated:
| Agent | Evaluation Criteria |
|---|---|
| Clinical Context | ICD code accuracy, condition completeness, observation extraction rate |
| Intake | Question relevance (was this already known?), information extraction accuracy |
| Match | Explanation quality, reasoning accuracy, no hallucinated stats |
| Explanation | Medical accuracy, readability, locale appropriateness |
| Classifier | Category accuracy, false positive rate for off-topic/medical-advice |
Automated evaluators (Langfuse):
# ICD accuracy evaluator
def eval_icd_accuracy(trace):
    """Compare AI-extracted ICD codes against gold standard."""
    extracted = trace.output.get("icd_codes", [])
    gold = get_gold_standard(trace.metadata["document_id"])
    if not gold:
        return None  # no ground truth yet
    precision = len(set(extracted) & set(gold)) / max(len(extracted), 1)
    recall = len(set(extracted) & set(gold)) / max(len(gold), 1)
    f1 = 2 * precision * recall / max(precision + recall, 0.001)
    return {"precision": precision, "recall": recall, "f1": f1}

# Question relevance evaluator
def eval_question_relevance(trace):
    """Was this question necessary given what we already knew?"""
    state_before = trace.metadata.get("patient_state_before", {})
    question_asked = trace.output.get("question_topic")
    if question_asked == "location" and state_before.get("location"):
        return {"relevant": False, "reason": "location_already_known"}
    return {"relevant": True}
Loop 2: Matching Quality (cadence: per-case, analyzed monthly)¶
graph LR
A[Match Presented] --> B{Patient Action}
B -->|Accepts Top Match| C[Strong Signal:<br/>Ranking Correct]
B -->|Accepts Lower Match| D[Moderate Signal:<br/>Ranking Suboptimal]
B -->|Rejects All| E[Weak Signal:<br/>Investigate Why]
B -->|Requests Re-match| F[Negative Signal:<br/>Criteria Wrong]
C --> G[Feedback Store]
D --> G
E --> G
F --> G
G --> H[Monthly Analysis]
H --> I[Adjust Dimension Weights]
H --> J[Tune Score Thresholds]
H --> K[Identify Scoring Gaps]
Signals captured:
| Signal | Weight | Meaning |
|---|---|---|
| Top match accepted | +1.0 | Ranking was correct |
| Match 2-3 accepted | +0.5 | Close, but top wasn't best |
| Match 4-5 accepted | +0.2 | Significant ranking error |
| All rejected, re-matched | -0.5 | Criteria or weights wrong |
| Provider responded positively | +0.3 | Provider-side validation |
| Provider declined case | -0.3 | Match was one-sided |
| Patient satisfaction >= 4/5 | +0.5 | End-to-end success |
| Patient satisfaction <= 2/5 | -0.5 | Something went wrong |
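The table above maps directly to a scoring function. A minimal sketch (the signal-name keys and helper names are illustrative; the weights come from the table):

```python
# Feedback weights from the signals table
SIGNAL_WEIGHTS = {
    "top_match_accepted": 1.0,
    "match_2_3_accepted": 0.5,
    "match_4_5_accepted": 0.2,
    "all_rejected_rematched": -0.5,
    "provider_responded_positively": 0.3,
    "provider_declined": -0.3,
    "satisfaction_high": 0.5,   # patient satisfaction >= 4/5
    "satisfaction_low": -0.5,   # patient satisfaction <= 2/5
}


def rank_signal(provider_rank: int) -> str:
    """Map the accepted provider's 1-based rank to a signal name."""
    if provider_rank == 1:
        return "top_match_accepted"
    if provider_rank <= 3:
        return "match_2_3_accepted"
    return "match_4_5_accepted"


def case_signal_score(signals: list) -> float:
    """Sum the feedback weights observed for one case."""
    return sum(SIGNAL_WEIGHTS[s] for s in signals)
```

For example, a case where the rank-2 match was accepted and the provider responded positively would score 0.5 + 0.3 = 0.8 under this scheme.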
Weight adjustment (monthly):
# Simplified; production would use gradient-based optimization
from collections import defaultdict

def adjust_weights(feedback_records, current_weights):
    """Adjust matching weights based on accumulated feedback signals."""
    dimension_scores = defaultdict(list)
    for record in feedback_records:
        match_decision = record.ai_output
        outcome = record.ground_truth  # which provider was actually selected
        # Which dimensions predicted the selected provider best?
        selected_scores = match_decision["scores"][outcome["provider_id"]]
        for dim, score in selected_scores.items():
            # Positive signals reward dimensions that scored the selected
            # provider highly; negative signals penalize them.
            dimension_scores[dim].append(score * outcome["signal"])
    # Dimensions that predicted well get a weight boost, others a decrease
    adjustments = {}
    for dim, scores in dimension_scores.items():
        avg = sum(scores) / len(scores)
        adjustments[dim] = current_weights[dim] * (1 + 0.1 * avg)  # 10% max adjustment per cycle
    # Normalize so weights sum to 1.0
    total = sum(adjustments.values())
    return {k: v / total for k, v in adjustments.items()}
Loop 3: Clinical Accuracy (cadence: per-provider-interaction)¶
graph LR
A[AI Extracts EHR] --> B[Records Forwarded<br/>to Provider]
B --> C[Provider Reviews EHR]
C --> D{Corrections Needed?}
D -->|Yes| E[Provider Submits<br/>Corrections]
D -->|No| F[Confirmed Accurate]
E --> G[Feedback Store]
F --> G
G --> H[Correction Patterns<br/>Analysis]
H --> I[Update Few-Shot<br/>Examples]
H --> J[Adjust Confidence<br/>Thresholds]
H --> K[Flag Systematic<br/>Errors]
Provider feedback endpoint (post-POC):
POST /api/v1/cases/{case_id}/provider-feedback
{
  "provider_id": "apollo-chennai",
  "reviewer": "Dr. Rajesh Patel",
  "ehr_review": {
    "conditions_confirmed": ["M17.11"],
    "conditions_added": ["E11.9"],  // AI missed diabetes
    "conditions_removed": [],
    "observations_corrected": [
      {"parameter": "HbA1c", "ai_value": 5.8, "correct_value": 6.2}
    ],
    "overall_accuracy": 0.85
  },
  "notes": "Good extraction overall. Missed pre-diabetic indication from HbA1c trend."
}
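A sketch of how such a submission could be expanded into individual feedback records (field names follow the request body above; the function itself is hypothetical, not the actual endpoint handler):

```python
def feedback_records_from_review(case_id: str, review: dict) -> list:
    """Expand a provider's EHR review into per-correction feedback records."""
    records = []
    ehr = review["ehr_review"]
    for code in ehr["conditions_added"]:  # conditions the AI missed
        records.append({
            "case_id": case_id,
            "feedback_type": "clinical_correction",
            "correction_type": "condition_missed",
            "correction_detail": {"icd_code": code},
            "reviewed_by": review["reviewer"],
        })
    for obs in ehr["observations_corrected"]:  # values the AI got wrong
        records.append({
            "case_id": case_id,
            "feedback_type": "clinical_correction",
            "correction_type": "observation_value_wrong",
            "correction_detail": obs,
            "reviewed_by": review["reviewer"],
        })
    return records
```

One record per correction (rather than one per review) is what lets the pattern detector below group recurring error types across cases.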
Correction → prompt improvement:
# When a correction pattern appears 3+ times, create a new few-shot example
def check_correction_patterns():
    patterns = db.query("""
        SELECT correction_detail->>'type' AS error_type,
               COUNT(*) AS occurrences,
               ARRAY_AGG(id) AS ids
        FROM feedback_records
        WHERE feedback_type = 'clinical_correction'
          AND applied_to_prompt = false
        GROUP BY correction_detail->>'type'
        HAVING COUNT(*) >= 3
    """)
    for pattern in patterns:
        if pattern.error_type == "condition_missed":
            # Generate a new few-shot example showing the missed condition
            new_example = create_prompt_example(pattern)
            # Update the prompt in Langfuse
            update_langfuse_prompt("clinical_context", add_example=new_example)
            # Mark this feedback as applied
            mark_feedback_applied(pattern.ids)
Layer 5: Eval Pipeline¶
Scheduled evaluation runs compare AI outputs against accumulated ground truth.
Scheduled Evaluations¶
| Eval | Schedule | Data Source | Metric |
|---|---|---|---|
| ICD extraction accuracy | Nightly | provider_feedback + manual annotations | Precision, Recall, F1 |
| Comorbidity detection rate | Nightly | lab_analyzer outputs vs clinical confirmation | Sensitivity, Specificity |
| Match ranking quality | Weekly | match acceptance signals | NDCG@5, MRR |
| Question relevance | Weekly | decision records (was question needed?) | Redundancy rate |
| Document coverage scoring | Nightly | parameter extraction vs requirements | Coverage accuracy |
| Prompt regression | On prompt change | A/B comparison on held-out test set | Quality delta |
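For the match-ranking eval, NDCG@5 and MRR can be computed directly from acceptance signals by treating the accepted provider as the single relevant item. A minimal sketch under that assumption:

```python
import math


def mrr(accepted_ranks: list) -> float:
    """Mean reciprocal rank of the accepted provider across cases (1-based ranks)."""
    return sum(1.0 / r for r in accepted_ranks) / len(accepted_ranks)


def ndcg_at_5(accepted_rank: int) -> float:
    """NDCG@5 with a single relevant item.

    DCG = 1 / log2(rank + 1); the ideal DCG (relevant item at rank 1) is 1,
    so no further normalization is needed. Ranks beyond the cutoff score 0.
    """
    if accepted_rank > 5:
        return 0.0
    return 1.0 / math.log2(accepted_rank + 1)
```

With a single relevant item per case, NDCG@5 and reciprocal rank are both simple functions of the accepted provider's position, which is exactly what the `match.accepted` outcome event records.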
Eval Dashboard (Langfuse + Metabase)¶
┌─────────────────────────────────────────────┐
│ Clinical Extraction Accuracy March 2026 │
│ ────────────────────────────────────────── │
│ ICD-10 Precision: 0.87 (+0.03 vs Feb) │
│ ICD-10 Recall: 0.79 (+0.05 vs Feb) │
│ Condition F1: 0.83 │
│ Observation Rate: 94% of lab values found │
│ │
│ Top Missed Conditions: │
│ 1. Pre-diabetes (HbA1c 5.7-6.4) — 12 cases│
│ 2. Mild CKD (eGFR 60-89) — 8 cases │
│ 3. Subclinical hypothyroid — 5 cases │
│ │
│ Prompt Version: clinical_context_v3 │
│ Recommended: Add pre-diabetes few-shot │
└─────────────────────────────────────────────┘
Implementation Plan¶
Database Changes¶
-- New table for feedback records
CREATE TABLE feedback_records (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  tenant_id UUID NOT NULL REFERENCES tenants(id),
  case_id UUID NOT NULL,
  feedback_type VARCHAR(50) NOT NULL,
  decision_event_id UUID NOT NULL,
  outcome_event_id UUID,
  ai_output JSONB NOT NULL,
  ground_truth JSONB,
  quality_score DECIMAL(3,2),
  reviewed_by VARCHAR(100),
  reviewed_at TIMESTAMPTZ,
  correction_type VARCHAR(50),
  correction_detail JSONB,
  applied_to_prompt BOOLEAN DEFAULT false,
  applied_at TIMESTAMPTZ,
  created_at TIMESTAMPTZ DEFAULT NOW(),
  updated_at TIMESTAMPTZ DEFAULT NOW()
);
-- New event types to add
-- agent.decision (per-turn decision record)
-- match.accepted / match.rejected
-- provider.feedback (clinical corrections)
-- patient.satisfaction (NPS survey)
-- extraction.correction (manual fix)
New Services¶
| Service | Purpose |
|---|---|
| `app/services/decision_recorder.py` | Builds and stores Decision Records on every turn |
| `app/services/outcome_tracker.py` | Records match acceptance, provider response, satisfaction |
| `app/services/feedback_service.py` | Links decisions to outcomes, manages feedback records |
| `app/services/eval_runner.py` | Nightly eval pipeline, computes metrics, flags regressions |
| `app/services/weight_optimizer.py` | Monthly matching weight adjustment from feedback signals |
New Endpoints¶
| Method | Path | Description |
|---|---|---|
| GET | `/api/v1/cases/{id}/decisions` | Full decision trace for a case |
| POST | `/api/v1/cases/{id}/provider-feedback` | Provider submits EHR corrections |
| POST | `/api/v1/cases/{id}/satisfaction` | Patient NPS survey |
| GET | `/api/v1/internal/eval/summary` | Latest eval metrics dashboard |
| POST | `/api/v1/internal/eval/run` | Trigger manual eval run |
| GET | `/api/v1/internal/feedback/pending` | Unapplied corrections awaiting prompt updates |
QStash Scheduled Tasks¶
| Task | Schedule | Description |
|---|---|---|
| `eval-extraction-accuracy` | `0 2 * * *` (daily 2am) | Compare extractions against ground truth |
| `eval-match-quality` | `0 3 * * 1` (weekly Mon 3am) | Analyze match acceptance patterns |
| `eval-prompt-regression` | On prompt change | A/B test new vs old prompt |
| `feedback-pattern-detector` | `0 4 * * *` (daily 4am) | Find recurring correction patterns |
| `weight-optimizer` | `0 5 1 * *` (monthly 1st 5am) | Adjust matching weights from signals |
Langfuse Integration¶
Current Tracing (already built)¶
# Every LLM call is traced
trace = create_trace(agent_name="case_chat", session_id=case_id, user_id=patient_id)
handler = get_langchain_handler(trace)
# → Captures: model, tokens, cost, latency, prompt, completion
Enhanced Tracing (to build)¶
# Add decision spans to existing traces
with trace.span("routing_decision") as span:
    span.input = {"patient_state": patient_state, "message": message}
    branch = determine_routing_branch(patient_state, message)
    span.output = {"branch": branch, "reason": reason, "skipped": skipped_branches}

with trace.span("document_processing") as span:
    span.input = {"document_id": doc_id, "ocr_method": "pymupdf"}
    result = process_document(doc_id)
    span.output = {"conditions": len(result.conditions), "observations": len(result.observations)}

with trace.span("embedding_match") as span:
    span.input = {"document_summary": summary, "collection": "requirement_embeddings"}
    matches = qdrant_search(embedding, limit=5)
    span.output = {"top_match": matches[0].payload, "score": matches[0].score, "coverage": coverage}

# Decision record auto-built from trace spans
decision_record = build_decision_record_from_trace(trace)
await store_decision_event(case_id, decision_record)
Langfuse Evaluators¶
# Register automated evaluators in Langfuse
langfuse.register_evaluator(
    name="icd_extraction_accuracy",
    description="Compares extracted ICD codes against provider-confirmed codes",
    function=eval_icd_accuracy,
    applies_to={"agent_name": "clinical_context_agent"},
)

langfuse.register_evaluator(
    name="question_relevance",
    description="Was this intake question necessary given known patient state?",
    function=eval_question_relevance,
    applies_to={"agent_name": "intake_agent"},
)

langfuse.register_evaluator(
    name="match_ranking_quality",
    description="Did the patient accept the top-ranked match?",
    function=eval_match_ranking,
    applies_to={"agent_name": "match_agent"},
)
Data Flow Summary¶
| What | Where Stored | Retention | Access |
|---|---|---|---|
| Decision Records | Events table (JSONB) | Indefinite | GET /cases/{id}/decisions |
| Langfuse Traces | Langfuse Cloud | 90 days (free tier) | Langfuse dashboard |
| Outcome Events | Events table | Indefinite | Internal analytics |
| Feedback Records | feedback_records table | Indefinite | Internal + provider portal |
| Eval Metrics | Events table + Metabase | Indefinite | Eval dashboard |
| Audit Logs | audit_logs table | Indefinite, immutable | Compliance |
Privacy & Compliance¶
- Decision Records contain no PII — patient referenced by UUID only
- Langfuse traces contain no PII — prompts use anonymized data
- Feedback Records may contain provider names (not patient PII)
- Provider clinical feedback is stored with provider consent
- All feedback data is tenant-scoped
- GDPR deletion cascades through decision records and feedback records
- Audit trail is append-only — even corrections don't delete the original decision