Langfuse Prompt Management: Implementation Recommendation
- Status: Recommended for post-demo implementation
- Effort: 2-3 hours (dedicated session)
- Priority: Medium (do after the demo stabilizes, before real patient tuning begins)
Problem Statement
All 8 LLM prompts are hardcoded as Python string constants across 3 files. Changing any prompt requires a code change, commit, PR, deploy, and Railway restart. This is fine for development but becomes a bottleneck when:
- Tuning brand voice based on patient feedback
- Adjusting guardrail boundaries (too strict or too lenient)
- A/B testing different intake approaches
- CPO/CTO wants to iterate on prompts without engineering deploys
Proposed Solution
Migrate all LLM prompts to Langfuse Prompt Management (already deployed on the HIPAA instance at hipaa.cloud.langfuse.com). Prompts become editable in the Langfuse dashboard with versioning, rollback, and A/B testing. Current hardcoded prompts become the fallback when Langfuse is unavailable.
Approach
Phase 1: Create Prompts in Langfuse Dashboard
Create 8 prompts in the Langfuse UI with template variables:
| Prompt Name | Variables | Current Location |
|---|---|---|
| `curaway_system_v1` | `{{phase_context}}`, `{{patient_context}}` | `llm_conversation.py:18` |
| `curaway_phase_identify_v1` | None | `llm_conversation.py:68` |
| `curaway_phase_records_v1` | None | `llm_conversation.py:76` |
| `curaway_phase_intake_v1` | None | `llm_conversation.py:83` |
| `curaway_phase_document_review_v1` | None | `llm_conversation.py:92` |
| `curaway_phase_general_v1` | None | `llm_conversation.py:101` |
| `curaway_classifier_v1` | `{{categories}}`, `{{message}}` | `message_classifier.py` |
| `curaway_clinical_extraction_v1` | `{{raw_text}}`, `{{report_type}}` | `clinical_context.py` |
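Creating the eight prompts by hand in the UI works, but the same step could also be scripted so the dashboard starts from exactly what is in the codebase. The sketch below assumes the existing constants use Python `str.format` placeholders (`{name}`) and converts them to Langfuse's `{{name}}` syntax before calling the SDK's `create_prompt`. The helper names `to_langfuse_template` and `seed_prompts` and the choice of the `production` label are illustrative assumptions, and escaped literal braces in the source strings are not handled.

```python
import re

# Matches a single-brace str.format placeholder such as {patient_context},
# but leaves anything that is already double-braced untouched.
_FORMAT_PLACEHOLDER = re.compile(r"(?<!\{)\{(\w+)\}(?!\})")


def to_langfuse_template(format_string: str) -> str:
    """Convert '{name}' placeholders to Langfuse's '{{name}}' syntax."""
    return _FORMAT_PLACEHOLDER.sub(r"{{\1}}", format_string)


def seed_prompts(client, prompts: dict[str, str], label: str = "production") -> list[str]:
    """Create one text prompt per entry and return the names that were created.

    `client` is assumed to be a Langfuse client (e.g. from `get_client()`).
    """
    created = []
    for name, format_string in prompts.items():
        client.create_prompt(
            name=name,
            type="text",
            prompt=to_langfuse_template(format_string),
            labels=[label],
        )
        created.append(name)
    return created
```

Passing the client in as an argument keeps the function easy to exercise with a stub before pointing it at the HIPAA instance.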
Phase 2: Add Prompt Fetching Helper
```python
# app/agents/prompt_loader.py
import logging

from langfuse import get_client

logger = logging.getLogger(__name__)

_FALLBACKS: dict[str, str] = {}  # Populated from the current hardcoded prompts


def get_prompt(name: str, variables: dict | None = None, fallback: str = "") -> str:
    """Fetch a prompt from Langfuse with automatic fallback.

    - Cache: the Langfuse SDK caches prompts (configurable TTL, default 60s)
    - Fallback: if Langfuse is unavailable, use the hardcoded default
    - Logging: logs when the fallback is used (indicates a Langfuse issue)
    """
    try:
        client = get_client()
        prompt = client.get_prompt(name)
        return prompt.compile(**(variables or {}))
    except Exception as e:
        logger.warning("Langfuse prompt '%s' unavailable: %s; using fallback", name, e)
        # Fallbacks are plain str.format templates ({name} placeholders).
        template = _FALLBACKS.get(name, fallback)
        if variables:
            return template.format(**variables)
        return template
```
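The Langfuse Python SDK's `get_prompt` also accepts a `fallback` argument and, per the SDK documentation, marks the returned object with `is_fallback`. A variant of the helper could lean on that instead of a manual `try`/`except`, which would keep a single placeholder syntax (`{{name}}`) for both paths. The sketch below is an alternative under stated assumptions, not the helper proposed above: `get_prompt_native` is a hypothetical name, the client is passed in so the example can be tested with a stub, and the 300-second TTL is an arbitrary illustration.

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def get_prompt_native(
    client,
    name: str,
    fallback_template: str,
    variables: Optional[dict] = None,
    cache_ttl_seconds: int = 300,
) -> str:
    """Fetch and compile a prompt, letting the SDK supply the fallback."""
    prompt = client.get_prompt(
        name,
        fallback=fallback_template,  # must use {{name}} placeholders
        cache_ttl_seconds=cache_ttl_seconds,
    )
    # getattr keeps this tolerant of SDK versions without the attribute.
    if getattr(prompt, "is_fallback", False):
        logger.warning("Langfuse prompt '%s' unavailable; using fallback", name)
    return prompt.compile(**(variables or {}))
```

The tradeoff is that the fallback strings would need to be stored in Langfuse syntax rather than as `str.format` templates.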
Phase 3: Refactor LLM Calls
Before:
```python
system = CURAWAY_SYSTEM_PROMPT.format(
    phase_context=phase_context,
    patient_context=patient_context,
)
```
After:
```python
system = get_prompt("curaway_system_v1", {
    "phase_context": phase_context,
    "patient_context": patient_context,
})
```
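Phase 4: Register Fallbacks
Move the current hardcoded prompts into a `_FALLBACKS` dict in `prompt_loader.py`. These are the safety net: if Langfuse is down, the agent still works with the last-known-good prompts from the codebase. Because the helper renders fallbacks with `str.format`, they must keep single-brace `{name}` placeholders, whereas the Langfuse versions use `{{name}}`.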
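Because the fallbacks and the dashboard versions are edited independently, a small comparison can flag when they have diverged. The following is a minimal sketch, assuming the remote prompt texts have already been fetched into a plain dict (for example from the `.prompt` attribute of each fetched prompt) and that both sides have been brought to the same placeholder syntax. `find_drift` and its whitespace normalization are illustrative choices.

```python
def _normalize(text: str) -> str:
    """Collapse whitespace so cosmetic edits do not count as drift."""
    return " ".join(text.split())


def find_drift(fallbacks: dict[str, str], remote: dict[str, str]) -> dict[str, str]:
    """Return {prompt_name: reason} for every fallback out of sync with remote."""
    drift = {}
    for name, local_text in fallbacks.items():
        if name not in remote:
            drift[name] = "missing in Langfuse"
        elif _normalize(local_text) != _normalize(remote[name]):
            drift[name] = "text differs"
    return drift
```

A check like this could run on a schedule or in CI and only report, leaving the decision to resync to a person.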
Advantages
| Advantage | Details |
|---|---|
| Hot-reload | Edit prompt in Langfuse → next request uses new version. No deploy needed. |
| Versioning | Every edit creates a version. Compare v3 vs v4 side by side. Roll back instantly. |
| A/B testing | Langfuse supports prompt experiments — route traffic between versions and compare quality metrics. |
| Trace linkage | Each trace shows which prompt version generated it, provided the fetched prompt object is linked to the generation when the LLM call is made. "Why did the agent say X?" → check the prompt version in the trace. |
| Non-engineer editing | CPO can tweak brand voice, adjust guardrail wording, or refine phase instructions directly in the Langfuse dashboard. |
| Audit trail | Full history of who changed what prompt and when. Important for healthcare compliance. |
| Prompt analytics | Langfuse shows which prompts produce the best outcomes (lower latency, higher user satisfaction). |
Risks
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Langfuse down | Low (99.9% SLA, HIPAA instance) | High (prompts unavailable) | Fallback to hardcoded defaults. Log when fallback is used. |
| Bad prompt edit breaks agent | Medium (human error) | Medium (poor responses until reverted) | Langfuse has instant version rollback. Test in staging before promoting. |
| Latency increase | Low (SDK caches with 60s TTL) | Negligible (+50ms on cache miss) | First call per 60s fetches, rest served from cache. |
| Fallback drift | Medium (over time, Langfuse version diverges from code fallback) | Low (fallback is safety net, not primary) | Periodic check: compare active Langfuse version with code fallback. Alert if they diverge significantly. |
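| Template variable mismatch | Low | Medium (prompt renders with missing variables) | Do not rely on `compile()` alone: it may leave an unmatched `{{placeholder}}` in the output rather than raise. Add tests that compare each template's placeholders with the variables passed at the call site. |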
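The template variable mismatch row can be covered by a unit test that compares the placeholders in each template with the variables the call site passes. A minimal sketch, assuming Langfuse-style `{{name}}` placeholders; the function names are hypothetical.

```python
import re

# Matches {{name}} with optional whitespace inside the braces.
_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def placeholders(template: str) -> set[str]:
    """Return the set of variable names referenced in a template."""
    return set(_PLACEHOLDER.findall(template))


def check_variables(template: str, provided: set[str]) -> tuple[set[str], set[str]]:
    """Return (missing, unused) variable names for a template and a call site."""
    expected = placeholders(template)
    return expected - provided, provided - expected
```

For example, a test for `curaway_classifier_v1` would assert that both returned sets are empty when called with `{"categories", "message"}`.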
Tradeoffs
| Factor | Hardcoded (current) | Langfuse Managed (proposed) |
|---|---|---|
| Change speed | Code change → commit → PR → deploy (10-30 min) | Edit in dashboard → instant (~5 sec) |
| Reliability | Always available (in the codebase) | Depends on Langfuse uptime (with fallback) |
| Version control | Git history | Langfuse version history (separate from Git) |
| Testing | CI/CD pipeline tests | Manual testing in Langfuse + trace review |
| Access control | Requires code access | Langfuse dashboard access (role-based) |
| Complexity | Simple (strings in Python) | Moderate (fetch → cache → compile → fallback) |
| Multi-environment | Same prompts in dev/staging/prod (unless branched) | Langfuse projects per environment |
Cost Implications
| Item | Cost |
|---|---|
| Langfuse prompt management | $0 incremental — included in existing plan |
| Additional API calls | ~0 — prompts cached by SDK, no per-request fetch |
| HIPAA instance | Already provisioned at hipaa.cloud.langfuse.com |
| Engineering effort | 2-3 hours one-time migration |
| Ongoing maintenance | ~5 min/week to keep fallbacks synced |
Total incremental cost: $0/month
Files Affected
| File | Change |
|---|---|
| `app/agents/prompt_loader.py` | New: prompt fetching with fallback |
| `app/agents/llm_conversation.py` | Replace `CURAWAY_SYSTEM_PROMPT` and `PHASE_CONTEXTS` with `get_prompt()` calls |
| `app/services/message_classifier.py` | Replace inline classifier prompt with `get_prompt()` |
| `app/agents/clinical_context.py` | Replace extraction prompt with `get_prompt()` |
| `requirements.txt` | No change: `langfuse` already installed |
Decision Criteria
Do it when:
- You're actively tuning prompts based on patient/demo feedback
- You want to A/B test different intake approaches
- Non-engineers need to edit prompts
- You need an audit trail for prompt changes (compliance)

Don't do it yet if:
- Prompts are stable and rarely change
- Only engineers touch prompts
- The demo is imminent and stability is the priority
Recommended Timeline
- Now: Keep current inline prompts. Session 24 guardrails.yaml covers configurable rules.
- Post-demo (1-2 weeks): Migrate prompts to Langfuse. Current prompts become fallbacks.
- With real patients: Use Langfuse A/B testing to optimize intake flow, brand voice, and guardrail sensitivity.
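One way the A/B step could work is to publish two prompt versions under different labels and choose a label per patient. The sketch below uses a stable hash so a given patient always sees the same variant. The label names `prod-a` and `prod-b`, the function name, and the use of a patient ID as the bucketing key are all assumptions. The chosen label would then be passed to the SDK as `client.get_prompt(name, label=...)`.

```python
import hashlib


def pick_variant(patient_id: str, labels: tuple[str, ...] = ("prod-a", "prod-b")) -> str:
    """Deterministically map a patient ID to one prompt label."""
    digest = hashlib.sha256(patient_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(labels)
    return labels[bucket]
```

Hashing rather than random choice keeps a conversation on one variant across requests, which makes per-variant quality comparisons in the traces easier to interpret.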