Langfuse Prompt Management: Implementation Recommendation

Status: Recommended for post-demo implementation
Effort: 2-3 hours (dedicated session)
Priority: Medium. Do after the demo stabilizes and before real patient tuning begins.

Problem Statement

All 8 LLM prompts are hardcoded as Python string constants across 3 files. Changing any prompt requires a code change, commit, PR, deploy, and Railway restart. This is fine for development but becomes a bottleneck when:

Tuning brand voice based on patient feedback
Adjusting guardrail boundaries (too strict or too lenient)
A/B testing different intake approaches
Letting the CPO or CTO iterate on prompts without engineering deploys

Proposed Solution

Migrate all LLM prompts to Langfuse Prompt Management (already deployed on the HIPAA instance at hipaa.cloud.langfuse.com). Prompts become editable in the Langfuse dashboard with versioning, rollback, and A/B testing. Current hardcoded prompts become the fallback when Langfuse is unavailable.

Approach

Phase 1: Create Prompts in Langfuse Dashboard

Create 8 prompts in the Langfuse UI with template variables:

Prompt Name	Variables	Current Location
`curaway_system_v1`	`{{phase_context}}`, `{{patient_context}}`	`llm_conversation.py:18`
`curaway_phase_identify_v1`	None	`llm_conversation.py:68`
`curaway_phase_records_v1`	None	`llm_conversation.py:76`
`curaway_phase_intake_v1`	None	`llm_conversation.py:83`
`curaway_phase_document_review_v1`	None	`llm_conversation.py:92`
`curaway_phase_general_v1`	None	`llm_conversation.py:101`
`curaway_classifier_v1`	`{{categories}}`, `{{message}}`	`message_classifier.py`
`curaway_clinical_extraction_v1`	`{{raw_text}}`, `{{report_type}}`	`clinical_context.py`
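
The current constants are rendered with `str.format`, so they presumably use single-brace placeholders such as `{phase_context}`, while Langfuse templates use `{{phase_context}}`. The following is a minimal sketch of converting one syntax to the other before pasting a prompt into the dashboard. `to_langfuse_template` is a hypothetical helper introduced here for illustration; it is not part of the Langfuse SDK.

```python
from string import Formatter


def to_langfuse_template(py_template: str) -> str:
    """Convert a str.format template ({name}) to Langfuse syntax ({{name}}).

    Hypothetical migration helper. Escaped braces in the source ('{{' / '}}')
    become literal single braces, which Langfuse treats as plain text.
    """
    parts = []
    for literal, field, spec, conv in Formatter().parse(py_template):
        parts.append(literal)
        if field is None:
            continue  # trailing literal text, no placeholder to convert
        if not field.isidentifier() or spec or conv:
            # Positional fields, attribute access, and format specs have no
            # Langfuse equivalent, so fail loudly instead of guessing.
            raise ValueError(f"unsupported placeholder: {{{field}}}")
        parts.append("{{" + field + "}}")
    return "".join(parts)


converted = to_langfuse_template("Phase: {phase_context}\nPatient: {patient_context}")
print(converted)
```

Running the conversion once per constant during Phase 1 avoids hand-editing braces in eight prompts.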

Phase 2: Add Prompt Fetching Helper

# app/agents/prompt_loader.py

import logging
from typing import Optional

from langfuse import get_client

logger = logging.getLogger(__name__)

# Populated from the current hardcoded prompts (see Phase 4).
# Fallback templates use str.format placeholders, e.g. {phase_context}.
_FALLBACKS: dict[str, str] = {}


def get_prompt(name: str, variables: Optional[dict] = None, fallback: str = "") -> str:
    """Fetch a prompt from Langfuse with automatic fallback.

    - Cache: Langfuse SDK caches prompts (configurable TTL, default 60s)
    - Fallback: if Langfuse unavailable, use hardcoded default
    - Logging: logs when fallback is used (indicates Langfuse issue)
    """
    try:
        client = get_client()
        prompt = client.get_prompt(name)
        return prompt.compile(**(variables or {}))
    except Exception as e:
        # Broad catch is deliberate: no Langfuse failure should break the agent.
        logger.warning("Langfuse prompt '%s' unavailable: %s; using fallback", name, e)
        template = _FALLBACKS.get(name, fallback)
        if variables:
            return template.format(**variables)
        return template
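
An alternative worth considering is to let the SDK handle the fallback. The Langfuse Python SDK's `get_prompt` accepts `fallback` and `cache_ttl_seconds` arguments, and the returned prompt exposes an `is_fallback` flag; confirm both against the installed SDK version. The sketch below assumes that behavior. The function name `get_prompt_sdk_fallback` is hypothetical, and the client is passed in as a parameter so the function can be exercised with a stub.

```python
import logging
from typing import Any, Optional

logger = logging.getLogger(__name__)


def get_prompt_sdk_fallback(
    client: Any,
    name: str,
    fallback_template: str,
    variables: Optional[dict] = None,
    cache_ttl_seconds: int = 60,
) -> str:
    """Variant of get_prompt that delegates the fallback to the SDK.

    `client` is whatever langfuse.get_client() returns. `fallback_template`
    must use Langfuse placeholder syntax ({{name}}), not str.format syntax,
    because the SDK compiles it the same way as a fetched prompt.
    """
    prompt = client.get_prompt(
        name,
        fallback=fallback_template,
        cache_ttl_seconds=cache_ttl_seconds,
    )
    # Assumed attribute: set by the SDK when the fallback text was used.
    if getattr(prompt, "is_fallback", False):
        logger.warning("Langfuse prompt '%s' unavailable; using fallback", name)
    return prompt.compile(**(variables or {}))
```

With this variant the fallback and the managed prompt share one placeholder syntax, at the cost of relying on SDK behavior instead of an explicit `try`/`except`.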

Phase 3: Refactor LLM Calls

Before:

system = CURAWAY_SYSTEM_PROMPT.format(
    phase_context=phase_context,
    patient_context=patient_context,
)

After:

system = get_prompt("curaway_system_v1", {
    "phase_context": phase_context,
    "patient_context": patient_context,
})
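
The five phase prompts take no variables, so replacing `PHASE_CONTEXTS` mostly means mapping a phase to a prompt name. A minimal sketch follows; the phase keys are assumptions inferred from the prompt names in the Phase 1 table, and the real keys of `PHASE_CONTEXTS` may differ.

```python
# Phase keys are assumed from the prompt names in the Phase 1 table;
# the real keys of PHASE_CONTEXTS may differ.
PHASE_PROMPT_NAMES = {
    "identify": "curaway_phase_identify_v1",
    "records": "curaway_phase_records_v1",
    "intake": "curaway_phase_intake_v1",
    "document_review": "curaway_phase_document_review_v1",
    "general": "curaway_phase_general_v1",
}


def phase_prompt_name(phase: str) -> str:
    """Return the Langfuse prompt name for a phase, defaulting to general."""
    return PHASE_PROMPT_NAMES.get(phase, PHASE_PROMPT_NAMES["general"])


print(phase_prompt_name("intake"))
```

The call site would then read `phase_context = get_prompt(phase_prompt_name(phase))`, which keeps unknown phases on the general prompt instead of raising.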

Phase 4: Register Fallbacks

Move current hardcoded prompts into a _FALLBACKS dict in prompt_loader.py. These are the safety net — if Langfuse is down, the agent still works with the last-known-good prompts from the codebase.
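
One pitfall when fallbacks are later synced by copying text out of Langfuse: the copied text contains `{{name}}` placeholders, which `str.format` reads as escaped literal braces and never fills in. A minimal sketch of a renderer that accepts Langfuse syntax, so fallback text can be stored verbatim. `render_fallback` is a hypothetical helper, not an SDK function.

```python
import re

# Matches {{name}} with optional inner whitespace.
_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_fallback(template: str, variables: dict) -> str:
    """Substitute {{name}} placeholders; raise if any variable is missing."""
    needed = {m.group(1) for m in _PLACEHOLDER.finditer(template)}
    missing = needed - set(variables)
    if missing:
        raise KeyError(f"missing prompt variables: {sorted(missing)}")
    return _PLACEHOLDER.sub(lambda m: str(variables[m.group(1)]), template)


# str.format treats doubled braces as escapes, so the variable is never filled in.
print("Phase: {{phase_context}}".format(phase_context="intake"))  # -> Phase: {phase_context}
print(render_fallback("Phase: {{phase_context}}", {"phase_context": "intake"}))  # -> Phase: intake
```

Single braces, such as a JSON example inside a prompt, pass through unchanged, which `str.format` would reject.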

Advantages

Advantage	Details
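Hot-reload	Edit a prompt in Langfuse and give the new version the label the app fetches (`production` by default). Running instances pick it up once the SDK cache expires (60s default TTL). No deploy needed.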
Versioning	Every edit creates a version. Compare v3 vs v4 side by side. Roll back instantly.
A/B testing	Label two versions (for example `prod-a` and `prod-b`), have the application split traffic between the labels, and compare quality metrics per version in Langfuse.
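Trace linkage	When the fetched prompt object is linked to the generation, each trace shows which prompt version produced it. "Why did the agent say X?" → check the prompt version in the trace. The `get_prompt()` helper returns only the compiled string, so linkage needs the prompt object passed through as well.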
Non-engineer editing	CPO can tweak brand voice, adjust guardrail wording, or refine phase instructions directly in the Langfuse dashboard.
Audit trail	Full history of who changed what prompt and when. Important for healthcare compliance.
Prompt analytics	For linked traces, Langfuse reports metrics per prompt version (such as latency and user-satisfaction scores), which shows which versions perform best.
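
For the A/B testing row, the traffic split happens in application code. A minimal sketch of a deterministic split, so that one conversation always sees the same variant. The label names and the `ab_label` helper are illustrative assumptions; the chosen label would be passed as `client.get_prompt(name, label=...)`.

```python
import hashlib


def ab_label(session_id: str, labels=("prod-a", "prod-b")) -> str:
    """Deterministically map a session to one prompt label.

    Hashing the session id (instead of calling random.choice per request)
    keeps every turn of a conversation on the same prompt variant.
    """
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    return labels[digest[0] % len(labels)]


label = ab_label("session-123")
print(label)
```

Hashing a stable identifier also makes the assignment reproducible when a trace is reviewed later.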

Risks

Risk	Likelihood	Impact	Mitigation
Langfuse down	Low (99.9% SLA, HIPAA instance)	High (prompts unavailable)	Fallback to hardcoded defaults. Log when fallback is used.
Bad prompt edit breaks agent	Medium (human error)	Medium (poor responses until reverted)	Langfuse has instant version rollback. Test in staging before promoting.
Latency increase	Low (SDK caches with 60s TTL)	Negligible (+50ms on cache miss)	First call per 60s fetches, rest served from cache.
Fallback drift	Medium (over time, Langfuse version diverges from code fallback)	Low (fallback is safety net, not primary)	Periodic check: compare active Langfuse version with code fallback. Alert if they diverge significantly.
Template variable mismatch	Low	Medium (prompt renders with missing variables)	Do not assume `compile()` rejects missing variables; add tests that render each prompt and fail if any `{{...}}` placeholder remains.
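
The periodic fallback-drift check can be a small script run in CI or on a schedule. A minimal sketch, assuming both sides use the same placeholder syntax. `find_drift` is a hypothetical helper; the `remote` dict would be built from the active Langfuse prompts (for text prompts, the raw template is assumed to be available on the fetched object's `prompt` attribute).

```python
import hashlib


def _fingerprint(template: str) -> str:
    """Hash a template after collapsing whitespace, so reflowed text matches."""
    normalized = " ".join(template.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_drift(fallbacks: dict, remote: dict) -> list:
    """Return sorted prompt names whose remote text is missing or differs."""
    return sorted(
        name
        for name, text in fallbacks.items()
        if name not in remote or _fingerprint(text) != _fingerprint(remote[name])
    )


fallbacks = {"curaway_phase_general_v1": "Be helpful.", "curaway_phase_intake_v1": "Ask about symptoms."}
remote = {"curaway_phase_general_v1": "Be  helpful.\n", "curaway_phase_intake_v1": "Ask about history."}
print(find_drift(fallbacks, remote))  # -> ['curaway_phase_intake_v1']
```

Any divergence is reported here; deciding which differences count as significant is left to whoever reviews the output.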

Tradeoffs

Factor	Hardcoded (current)	Langfuse Managed (proposed)
Change speed	Code change → commit → PR → deploy (10-30 min)	Edit in dashboard → live once the SDK cache refreshes (up to ~60 sec at the default TTL)
Reliability	Always available (in the codebase)	Depends on Langfuse uptime (with fallback)
Version control	Git history	Langfuse version history (separate from Git)
Testing	CI/CD pipeline tests	Manual testing in Langfuse + trace review
Access control	Requires code access	Langfuse dashboard access (role-based)
Complexity	Simple (strings in Python)	Moderate (fetch → cache → compile → fallback)
Multi-environment	Same prompts in dev/staging/prod (unless branched)	Langfuse projects per environment
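
Managed prompts need not give up automated testing entirely. A minimal sketch of a CI-friendly check that a template's placeholders match the variables listed in the Phase 1 table. `EXPECTED_VARIABLES` and `check_template` are hypothetical names, and the template text would come from Langfuse or from the code fallbacks.

```python
import re

# Variables per prompt, taken from the Phase 1 table. Prompts not listed take none.
EXPECTED_VARIABLES = {
    "curaway_system_v1": {"phase_context", "patient_context"},
    "curaway_classifier_v1": {"categories", "message"},
    "curaway_clinical_extraction_v1": {"raw_text", "report_type"},
}


def template_variables(template: str) -> set:
    """Collect the names used in {{name}} placeholders."""
    return set(re.findall(r"\{\{\s*(\w+)\s*\}\}", template))


def check_template(name: str, template: str) -> tuple:
    """Return (missing, unexpected) variable sets for one prompt."""
    expected = EXPECTED_VARIABLES.get(name, set())
    found = template_variables(template)
    return expected - found, found - expected


missing, unexpected = check_template("curaway_classifier_v1", "Classify {{message}} into {{labels}}")
print(sorted(missing), sorted(unexpected))  # -> ['categories'] ['labels']
```

A test that asserts both sets are empty for every prompt would catch a renamed or dropped variable before the edited version reaches patients.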

Cost Implications

Item	Cost
Langfuse prompt management	$0 incremental — included in existing plan
Additional API calls	~0 — prompts cached by SDK, no per-request fetch
HIPAA instance	Already provisioned at `hipaa.cloud.langfuse.com`
Engineering effort	2-3 hours one-time migration
Ongoing maintenance	~5 min/week to keep fallbacks synced

Total incremental cost: $0/month

Files Affected

File	Change
`app/agents/prompt_loader.py`	New — prompt fetching with fallback
`app/agents/llm_conversation.py`	Replace `CURAWAY_SYSTEM_PROMPT` and `PHASE_CONTEXTS` with `get_prompt()` calls
`app/services/message_classifier.py`	Replace inline classifier prompt with `get_prompt()`
`app/agents/clinical_context.py`	Replace extraction prompt with `get_prompt()`
`requirements.txt`	No change — `langfuse` already installed

Decision Criteria

Do it when:
- You're actively tuning prompts based on patient/demo feedback
- You want to A/B test different intake approaches
- Non-engineers need to edit prompts
- You need an audit trail for prompt changes (compliance)

Don't do it yet if:
- Prompts are stable and rarely change
- Only engineers touch prompts
- Demo is imminent and stability is the priority

Recommended Timeline

Now: Keep current inline prompts. Session 24 guardrails.yaml covers configurable rules.
Post-demo (1-2 weeks): Migrate prompts to Langfuse. Current prompts become fallbacks.
With real patients: Use Langfuse A/B testing to optimize intake flow, brand voice, and guardrail sensitivity.