Langfuse Prompt Management — Implementation Recommendation

Status: Recommended for post-demo implementation
Effort: 2-3 hours (dedicated session)
Priority: Medium (do after demo stabilizes, before real patient tuning begins)


Problem Statement

All 8 LLM prompts are hardcoded as Python string constants across 3 files. Changing any prompt requires a code change, commit, PR, deploy, and Railway restart. This is fine for development but becomes a bottleneck when:

  • Tuning brand voice based on patient feedback
  • Adjusting guardrail boundaries (too strict or too lenient)
  • A/B testing different intake approaches
  • CPO/CTO wants to iterate on prompts without engineering deploys

Proposed Solution

Migrate all LLM prompts to Langfuse Prompt Management (already deployed on the HIPAA instance at hipaa.cloud.langfuse.com). Prompts become editable in the Langfuse dashboard with versioning, rollback, and A/B testing. Current hardcoded prompts become the fallback when Langfuse is unavailable.


Approach

Phase 1: Create Prompts in Langfuse Dashboard

Create 8 prompts in the Langfuse UI with template variables:

| Prompt Name | Variables | Current Location |
| --- | --- | --- |
| curaway_system_v1 | {{phase_context}}, {{patient_context}} | llm_conversation.py:18 |
| curaway_phase_identify_v1 | None | llm_conversation.py:68 |
| curaway_phase_records_v1 | None | llm_conversation.py:76 |
| curaway_phase_intake_v1 | None | llm_conversation.py:83 |
| curaway_phase_document_review_v1 | None | llm_conversation.py:92 |
| curaway_phase_general_v1 | None | llm_conversation.py:101 |
| curaway_classifier_v1 | {{categories}}, {{message}} | message_classifier.py |
| curaway_clinical_extraction_v1 | {{raw_text}}, {{report_type}} | clinical_context.py |
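
The existing constants are rendered with Python `str.format` (`{phase_context}`), while the prompts listed above use Langfuse-style `{{phase_context}}` variables. A minimal sketch of a one-off conversion helper for the migration, assuming the current prompts are plain `str.format` templates; `to_langfuse_template` is a hypothetical name, not part of the codebase or the Langfuse SDK:

```python
from string import Formatter


def to_langfuse_template(py_template: str) -> tuple[str, list[str]]:
    """Convert a str.format template to {{var}} syntax.

    Returns the converted text and the variable names in first-seen order.
    Escaped braces in the input ('{{' and '}}') come out as literal braces.
    """
    parts: list[str] = []
    variables: list[str] = []
    for literal, field, _spec, _conversion in Formatter().parse(py_template):
        parts.append(literal)
        if field is None:
            continue  # trailing literal text, no placeholder
        if not field.isidentifier():
            # Positional ('{}', '{0}') or attribute ('{a.b}') fields have no
            # direct {{var}} equivalent, so fail loudly instead of guessing.
            raise ValueError(f"unsupported placeholder: {field!r}")
        parts.append("{{" + field + "}}")
        if field not in variables:
            variables.append(field)
    return "".join(parts), variables


text, names = to_langfuse_template(
    "Context: {phase_context}\nPatient: {patient_context}"
)
print(text)
print(names)
```

The returned variable list can be checked against the Variables column before pasting the converted text into the dashboard.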

Phase 2: Add Prompt Fetching Helper

# app/agents/prompt_loader.py

import logging
from typing import Optional

from langfuse import get_client

logger = logging.getLogger(__name__)

# Populated from the current hardcoded prompts (see Phase 4).
_FALLBACKS: dict[str, str] = {}


def get_prompt(name: str, variables: Optional[dict] = None, fallback: str = "") -> str:
    """Fetch a prompt from Langfuse with automatic fallback.

    - Cache: Langfuse SDK caches prompts (configurable TTL, default 60s)
    - Fallback: if Langfuse unavailable, use hardcoded default
    - Logging: logs when fallback is used (indicates Langfuse issue)
    """
    variables = variables or {}
    try:
        client = get_client()
        prompt = client.get_prompt(name)
        return prompt.compile(**variables)
    except Exception as e:
        logger.warning("Langfuse prompt '%s' unavailable: %s; using fallback", name, e)
        # Fallbacks keep Python str.format placeholders ({var}), unlike the
        # {{var}} syntax of the Langfuse-hosted templates.
        template = _FALLBACKS.get(name, fallback)
        return template.format(**variables) if variables else template

Phase 3: Refactor LLM Calls

Before:

system = CURAWAY_SYSTEM_PROMPT.format(
    phase_context=phase_context,
    patient_context=patient_context,
)

After:

system = get_prompt("curaway_system_v1", {
    "phase_context": phase_context,
    "patient_context": patient_context,
})
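
After the refactor, a variable that is renamed in the dashboard but not in the calling code can leave a literal `{{placeholder}}` in the text sent to the model. A small guard can catch that before the LLM call. This is a sketch under the assumption that placeholders follow the `{{name}}` form shown in Phase 1; `find_unresolved` is a hypothetical helper, not an SDK function:

```python
import re

# Matches {{name}} with optional inner whitespace.
_PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}")


def find_unresolved(rendered: str) -> list[str]:
    """Return placeholder names still present in a rendered prompt.

    Names are reported once each, in the order they first appear.
    """
    seen: list[str] = []
    for name in _PLACEHOLDER.findall(rendered):
        if name not in seen:
            seen.append(name)
    return seen


rendered = "Phase: intake. Patient: {{patient_context}}"
leftover = find_unresolved(rendered)
if leftover:
    print("unresolved variables:", leftover)
```

Whether to raise, log, or switch to the fallback when the list is non-empty is a policy choice left to the caller.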

Phase 4: Register Fallbacks

Move current hardcoded prompts into a _FALLBACKS dict in prompt_loader.py. These are the safety net — if Langfuse is down, the agent still works with the last-known-good prompts from the codebase.


Advantages

| Advantage | Details |
| --- | --- |
| Hot-reload | Edit prompt in Langfuse → requests pick up the new version once the SDK cache expires (default 60s). No deploy needed. |
| Versioning | Every edit creates a version. Compare v3 vs v4 side by side. Roll back instantly. |
| A/B testing | Label competing prompt versions in Langfuse, split traffic between the labels in application code, and compare quality metrics per version. |
| Trace linkage | Each trace shows which prompt version generated it. "Why did the agent say X?" → check the prompt version in the trace. |
| Non-engineer editing | CPO can tweak brand voice, adjust guardrail wording, or refine phase instructions directly in the Langfuse dashboard. |
| Audit trail | Full history of who changed what prompt and when. Important for healthcare compliance. |
| Prompt analytics | Langfuse reports metrics per prompt version (latency, plus any quality or satisfaction scores recorded on traces), so versions can be compared on outcomes. |
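
For A/B testing, the split between versions still has to be decided somewhere. A common pattern is to hash a stable identifier so that each patient session always sees the same variant. The sketch below assumes two Langfuse labels named `prod-a` and `prod-b` and an `assign_variant` helper, all of which are hypothetical; fetching the labeled version from Langfuse is left out.

```python
import hashlib


def assign_variant(session_id: str, labels: tuple = ("prod-a", "prod-b")) -> str:
    """Deterministically map a session to one of the prompt labels.

    The same session_id always yields the same label, so a conversation
    does not switch prompts partway through.
    """
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(labels)
    return labels[bucket]


label = assign_variant("session-1234")
print(label)
```

Hashing avoids storing an assignment table, at the cost of only approximately even splits for small sample sizes.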

Risks

| Risk | Likelihood | Impact | Mitigation |
| --- | --- | --- | --- |
| Langfuse down | Low (99.9% SLA, HIPAA instance) | High (prompts unavailable) | Fallback to hardcoded defaults. Log when fallback is used. |
| Bad prompt edit breaks agent | Medium (human error) | Medium (poor responses until reverted) | Langfuse has instant version rollback. Test in staging before promoting. |
| Latency increase | Low (SDK caches with 60s TTL) | Negligible (+50ms on cache miss) | First call per 60s fetches, rest served from cache. |
| Fallback drift | Medium (over time, Langfuse version diverges from code fallback) | Low (fallback is safety net, not primary) | Periodic check: compare active Langfuse version with code fallback. Alert if they diverge significantly. |
| Template variable mismatch | Low | Medium (prompt renders with missing variables) | Do not rely on compile-time validation: the SDK may leave unmatched {{placeholders}} in the output rather than raise. Add tests that render each prompt and assert none remain. |
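
The periodic drift check can be approximated with the standard library. This sketch assumes the active Langfuse text has already been fetched as a string; `drift_ratio`, `has_drifted`, and the 0.9 threshold are illustrative assumptions rather than tuned values. It rewrites `{{var}}` as `{var}` and collapses whitespace before comparing, so syntax differences alone do not count as drift.

```python
import difflib
import re


def _normalize(template: str) -> str:
    """Map {{var}} to {var} and collapse runs of whitespace."""
    single_brace = re.sub(r"\{\{\s*(\w+)\s*\}\}", r"{\1}", template)
    return " ".join(single_brace.split())


def drift_ratio(langfuse_text: str, fallback_text: str) -> float:
    """Similarity in [0, 1]; 1.0 means identical after normalization."""
    matcher = difflib.SequenceMatcher(
        None, _normalize(langfuse_text), _normalize(fallback_text)
    )
    return matcher.ratio()


def has_drifted(langfuse_text: str, fallback_text: str, threshold: float = 0.9) -> bool:
    """True when similarity falls below the (assumed) alert threshold."""
    return drift_ratio(langfuse_text, fallback_text) < threshold


print(has_drifted("Hello {{name}}", "Hello  {name}"))
print(has_drifted("Hello {{name}}", "Completely different text about something else"))
```

A scheduled job could run this for each of the 8 prompts and log a warning for any that cross the threshold.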

Tradeoffs

| Factor | Hardcoded (current) | Langfuse Managed (proposed) |
| --- | --- | --- |
| Change speed | Code change → commit → PR → deploy (10-30 min) | Edit in dashboard → live once the SDK cache expires (up to ~60s at the default TTL) |
| Reliability | Always available (in the codebase) | Depends on Langfuse uptime (with fallback) |
| Version control | Git history | Langfuse version history (separate from Git) |
| Testing | CI/CD pipeline tests | Manual testing in Langfuse + trace review |
| Access control | Requires code access | Langfuse dashboard access (role-based) |
| Complexity | Simple (strings in Python) | Moderate (fetch → cache → compile → fallback) |
| Multi-environment | Same prompts in dev/staging/prod (unless branched) | Langfuse projects per environment |

Cost Implications

| Item | Cost |
| --- | --- |
| Langfuse prompt management | $0 incremental (included in existing plan) |
| Additional API calls | ~0 (prompts cached by SDK, no per-request fetch) |
| HIPAA instance | Already provisioned at hipaa.cloud.langfuse.com |
| Engineering effort | 2-3 hours one-time migration |
| Ongoing maintenance | ~5 min/week to keep fallbacks synced |

Total incremental cost: $0/month
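
The "~0 additional API calls" estimate follows from the cache: each worker process refetches a given prompt at most once per TTL window, regardless of request volume. A quick upper bound, assuming the 60s TTL and 8 prompts from this document and a hypothetical worker count:

```python
def max_fetches_per_day(prompt_count: int, ttl_seconds: int, workers: int = 1) -> int:
    """Upper bound on prompt fetches per day with a per-process TTL cache."""
    seconds_per_day = 24 * 60 * 60
    return prompt_count * (seconds_per_day // ttl_seconds) * workers


print(max_fetches_per_day(prompt_count=8, ttl_seconds=60))             # 11520
print(max_fetches_per_day(prompt_count=8, ttl_seconds=60, workers=2))  # 23040
```

The call volume is bounded and independent of traffic, though not literally zero. Raising the TTL lowers it proportionally, at the cost of slower prompt rollout.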

Files Affected

| File | Change |
| --- | --- |
| app/agents/prompt_loader.py | New: prompt fetching with fallback |
| app/agents/llm_conversation.py | Replace CURAWAY_SYSTEM_PROMPT and PHASE_CONTEXTS with get_prompt() calls |
| app/services/message_classifier.py | Replace inline classifier prompt with get_prompt() |
| app/agents/clinical_context.py | Replace extraction prompt with get_prompt() |
| requirements.txt | No change (langfuse already installed) |

Decision Criteria

Do it when:

  • You're actively tuning prompts based on patient/demo feedback
  • You want to A/B test different intake approaches
  • Non-engineers need to edit prompts
  • You need an audit trail for prompt changes (compliance)

Don't do it yet if:

  • Prompts are stable and rarely change
  • Only engineers touch prompts
  • Demo is imminent and stability is the priority

  1. Now: Keep current inline prompts. Session 24 guardrails.yaml covers configurable rules.
  2. Post-demo (1-2 weeks): Migrate prompts to Langfuse. Current prompts become fallbacks.
  3. With real patients: Use Langfuse A/B testing to optimize intake flow, brand voice, and guardrail sensitivity.