Prompt Abstraction Layer: Steer Document
Date: 2026-04-10
Author: Srikanth Donthi (CPO/CTO) + Claude Code session 35
Status: Implemented — PR #92 merged, 14/14 parity verified
Companion spec: prompt-abstraction-feature.md
1. Problem Statement
All 19 LLM prompts in the Curaway backend are hardcoded as Python string constants across 8 files. Few-shot examples (~28.5% of total prompt tokens) are embedded inline. This creates five problems:
- Language scaling: Adding Arabic requires editing all 12 example-bearing prompts in Python source, and a third language doubles that work. There is no way to load language-specific examples without code changes.
- Token waste: Every LLM call sends all examples regardless of relevance. The extraction prompt is 59% examples: 4,200 chars of radiology and lab report samples are sent even when processing a dental X-ray.
- Clinical review friction: Dr. Naidu (Clinical Advisor) reviews wording in xlsx files, not Python string constants. Prompt examples should be reviewable the same way, in a human-readable format outside of code.
- No prompt caching: Anthropic's prompt caching bills cached input tokens at 10% of the normal input price, but it requires a stable prefix. With examples embedded inline and dynamic placeholders scattered throughout, identifying the cacheable boundary is difficult.
- Version drift: V1 and V2 prompts coexist with different example counts (V1 has 0 conversation examples, V2 has 5). Phase contexts have inconsistent example coverage. There is no central inventory of what examples exist.
2. Design Principles
2.1 Separation of concerns
Base prompt (rules, schema, guardrails) → Python constant or YAML
Few-shot examples → YAML, per-locale, per-category
Forbidden phrases → Already in voice_rules.yaml
Dynamic context (patient, phase, emotion) → Injected at runtime
The base prompt changes with code deploys. Examples change with clinical review cycles. Dynamic context changes per call. These three concerns have different change frequencies and different reviewers — they should not live in the same string.
2.2 Every call gets full examples, always
Prompt caching is a billing optimization, not a content optimization. The LLM sees the full prompt on every call. Caching means Anthropic's servers recognize the prefix and charge 10% for the cached portion. There is zero quality difference between a cache hit and a cache miss.
This means:
- Examples are never "skipped" or "omitted" for any call
- Language-specific examples are loaded from YAML and injected into the prompt before sending
- The assembled prompt is byte-identical to today's hardcoded strings
2.3 Fail to English, never fail to nothing
If the requested locale's examples are missing, fall back to English. If English examples are missing, fail loud (raise, caught in CI). A prompt without examples is worse than a prompt with wrong-language examples, because wrong-language examples still give the model structural guidance.
2.4 Prompt caching boundary
The cacheable prefix = base prompt + examples + forbidden phrases + phase context.
Everything before {patient_context} is stable across calls within
the same conversation. Patient context and conversation history are
the variable tail.
┌──────────────────────────────────┐
│ CACHED PREFIX (~80% of input) │
│ ├─ Base prompt (rules, schema) │
│ ├─ Few-shot examples (locale) │
│ ├─ Forbidden phrases │
│ └─ Phase context │
├──────────────────────────────────┤
│ VARIABLE TAIL (~20% of input) │
│ ├─ Patient context │
│ ├─ Conversation history │
│ └─ Current message │
└──────────────────────────────────┘
3. Scope: What Moves, What Stays
Moves to YAML (few-shot examples)
| Prompt | Current location | Examples to extract |
|---|---|---|
| CURAWAY_SYSTEM_PROMPT_V2 | llm_conversation.py:210-228 | 5 WRONG/RIGHT conversation pairs |
| EXTRACTION_SYSTEM_PROMPT | clinical_extraction.py:55-207 | 2 rich examples (radiology + lab) |
| ICD_MAPPING_SYSTEM_PROMPT | clinical_extraction.py:252-278 | 1 coded example |
| ICD_MAPPING_SYSTEM_PROMPT_COT | clinical_extraction.py:350-398 | 1 COT reasoning example |
| FHIR_GENERATION_SYSTEM_PROMPT | clinical_extraction.py:429-477 | 1 FHIR resource example |
| EXPLANATION_SYSTEM_PROMPT | explanation.py:26-36 | 1 explanation example |
| CLINICAL_ANALYSIS_SYSTEM_PROMPT | match_analysis.py:21-38 | 1 analysis example |
| CLINICAL_ANALYSIS_COT | match_analysis.py:125-161 | 1 COT analysis example |
| RERANK_SYSTEM_PROMPT | match_analysis.py:187-195 | 1 rerank example |
| CLASSIFY_INTENT_PROMPT | orchestrator.py:61-67 | 6 intent classification examples |
| INTAKE_SYSTEM_PROMPT_V1 | intake.py:38-49 | 1 intake example |
| Phase: document_review (V1) | llm_conversation.py:390-404 | 2 response shape examples |
Total: 23 example blocks across 12 prompts.
Stays in Python (base prompts + dynamic logic)
- Base prompt text (rules, schemas, guardrails)
- _get_system_prompt() assembly function
- _get_phase_contexts() phase routing
- _get_forbidden_phrases_block() voice rules injection
- All placeholder substitution ({patient_context}, {phase_context}, {emotional_context}, {forbidden_phrases_block})
- All runtime template variables in MATCHER_USER_TEMPLATE
- Feature flag routing (V1 vs V2, COT vs legacy)
Already external (no change needed)
- config/voice_rules.yaml: forbidden phrases
- config/guardrails.yaml: message classifier categories
- Flagsmith feature flags: prompt version routing
4. Directory Structure
config/prompts/
├─ base/ # Base prompts (rules + schema, no examples)
│ ├─ conversation_v1.yaml
│ ├─ conversation_v2.yaml
│ ├─ clinical_extraction.yaml
│ ├─ icd_mapping.yaml
│ ├─ icd_mapping_cot.yaml
│ ├─ fhir_generation.yaml
│ ├─ intake_v1.yaml
│ ├─ intake_v2.yaml
│ ├─ explanation.yaml
│ ├─ clinical_analysis.yaml
│ ├─ clinical_analysis_cot.yaml
│ ├─ rerank.yaml
│ ├─ intent_classifier.yaml
│ ├─ chat_extractor_v1.yaml
│ ├─ chat_extractor_v2.yaml
│ └─ requirement_matcher.yaml
│
├─ examples/ # Few-shot examples, per locale
│ ├─ en/
│ │ ├─ conversation.yaml # 5 WRONG/RIGHT pairs
│ │ ├─ clinical_extraction.yaml # 2 rich medical report examples
│ │ ├─ icd_mapping.yaml # 1 ICD/SNOMED coding example
│ │ ├─ icd_mapping_cot.yaml # 1 COT reasoning example
│ │ ├─ fhir_generation.yaml # 1 FHIR resource example
│ │ ├─ intake.yaml # 1 intake exchange example
│ │ ├─ explanation.yaml # 1 match explanation example
│ │ ├─ clinical_analysis.yaml # 1 clinical analysis example
│ │ ├─ clinical_analysis_cot.yaml # 1 COT analysis example
│ │ ├─ rerank.yaml # 1 rerank example
│ │ ├─ intent_classifier.yaml # 6 intent routing examples
│ │ └─ document_review.yaml # 2 doc review response examples
│ │
│ └─ ar/ # Arabic (future — Phase 1 multilingual)
│ ├─ conversation.yaml
│ ├─ clinical_extraction.yaml
│ └─ ...
│
├─ phase_contexts/ # Phase-specific context blocks
│ ├─ v1/
│ │ ├─ identify_procedure.yaml
│ │ ├─ records_first.yaml
│ │ ├─ intake.yaml
│ │ ├─ document_review.yaml
│ │ └─ general.yaml
│ └─ v2/
│ ├─ identify_procedure.yaml
│ ├─ records_first.yaml
│ ├─ intake.yaml
│ ├─ document_review.yaml
│ └─ general.yaml
│
└─ forbidden/ # Already exists as voice_rules.yaml
└─ (symlink or reference to config/voice_rules.yaml)
YAML format
# config/prompts/examples/en/conversation.yaml
---
description: "Conversation few-shot examples — English"
locale: en
prompt_key: conversation
version: 2
examples:
  - label: "Patient mentions pain and waiting"
    user: "My left knee hurts badly. I can't walk my dog anymore. I'm tired of waiting 18 months for this surgery."
    wrong: "It's completely natural to feel that way — exploring treatment options abroad is a big step. That's exactly why Curaway exists. Would you like to see how we evaluate providers?"
    right: "Eighteen months waiting while you can't even walk your dog — that's a long time to lose the small things that matter. Let's see what we can do to get you moving again. Has your doctor confirmed it's a **Total Knee Replacement** you need?"
  - label: "Patient scared of safety"
    user: "I'm scared. Is this safe to do abroad?"
    wrong: "Don't worry, you're in good hands! All our hospitals are excellent."
    right: "Fair question — going abroad for surgery is a real decision. Every hospital we work with is internationally accredited (JCI or equivalent), and you'll see their outcomes data, surgeon credentials, and patient reviews before making any choice. Would you like to start by seeing a few hospitals, or would you rather understand the process first?"
  # ... 3 more pairs

# config/prompts/examples/en/clinical_extraction.yaml
---
description: "Clinical extraction few-shot examples — English"
locale: en
prompt_key: clinical_extraction
version: 1
examples:
  - label: "Radiology report — knee osteoarthritis"
    input: |
      Report: X-ray Right Knee
      Findings: Moderate to severe osteoarthritis...
    output: |
      {
        "entities": [...],
        "observations": [...],
        ...
      }
  - label: "Lab report — CBC + metabolic panel"
    input: |
      Complete Blood Count:
      WBC: 7.2 x10^9/L ...
    output: |
      {
        "entities": [...],
        "observations": [...],
        ...
      }
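A structural check on these files could look like the sketch below. It operates on the already-parsed mapping (the result of yaml.safe_load()), so it has no third-party dependencies. The function name validate_examples_doc and the two accepted example shapes are assumptions drawn from the two samples above; the shape used by the intent classifier examples is not shown in this document and would need its own entry.

```python
from typing import Any

# Top-level keys that both sample files above carry.
REQUIRED_TOP_LEVEL = ("description", "locale", "prompt_key", "version", "examples")
# Assumption: an example is either a WRONG/RIGHT pair or an input/output pair.
PAIR_SHAPES = ({"user", "wrong", "right"}, {"input", "output"})


def validate_examples_doc(doc: dict[str, Any]) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = [f"missing key: {key}" for key in REQUIRED_TOP_LEVEL if key not in doc]
    examples = doc.get("examples")
    if not isinstance(examples, list) or not examples:
        problems.append("examples must be a non-empty list")
        return problems
    for index, example in enumerate(examples):
        if not isinstance(example, dict):
            problems.append(f"example {index} is not a mapping")
            continue
        if "label" not in example:
            problems.append(f"example {index} has no label")
        fields = set(example) - {"label"}
        # An example passes if it contains every field of at least one shape.
        if not any(shape <= fields for shape in PAIR_SHAPES):
            problems.append(f"example {index} matches no known shape")
    return problems
```

Returning a list of problems rather than raising lets a CI test report every defect in a file at once, which suits a review cycle where a clinician may edit several examples in one pass.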
Why YAML over JSON
- Multiline strings are readable (| block scalar)
- Comments allowed (useful for reviewer notes)
- Same tooling as voice_rules.yaml and feature_flags.yaml
- yaml.safe_load() is already used throughout the codebase
5. Prompt Assembly
5.1 The loader
# app/services/prompt_loader.py
def load_prompt(
    prompt_key: str,
    locale: str = "en",
    version: str | None = None,
) -> str:
    """Assemble a complete prompt from base + examples.

    Args:
        prompt_key: e.g., "conversation", "clinical_extraction"
        locale: ISO 639-1 code (e.g., "en", "ar")
        version: optional version suffix (e.g., "v1", "v2", "cot")

    Returns:
        Assembled prompt string ready for LLM.

    Raises:
        PromptLoadError: if base prompt is missing (hard failure).
            Never raises for missing locale; falls back to "en".
    """
5.2 Assembly order
1. Load base prompt YAML → base_text
2. Load examples YAML (requested locale, fallback to "en") → examples_text
3. Format examples into the prompt's expected shape:
- WRONG/RIGHT pairs → "EXAMPLES OF HOW TO RESPOND:\n\n..."
- Input/Output pairs → "Example Input 1:\n...\nExample Output 1:\n..."
- Intent examples → "Examples:\n..."
4. Inject examples into base_text at the {examples} placeholder
5. Return assembled string
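Steps 3 and 4 might be sketched as follows. The section headers come from the list above, but the per-pair layout inside format_wrong_right (the "Patient:", "WRONG:", "RIGHT:" line prefixes) is an assumption: the real formatter has to reproduce whatever layout the hardcoded strings use, or the byte-identical parity requirement fails. All function names here are hypothetical.

```python
EXAMPLES_PLACEHOLDER = "{examples}"


def format_wrong_right(examples: list[dict[str, str]]) -> str:
    """Render WRONG/RIGHT pairs under the conversation header (layout assumed)."""
    blocks = [
        f"Patient: {ex['user']}\nWRONG: {ex['wrong']}\nRIGHT: {ex['right']}"
        for ex in examples
    ]
    return "EXAMPLES OF HOW TO RESPOND:\n\n" + "\n\n".join(blocks)


def format_input_output(examples: list[dict[str, str]]) -> str:
    """Render numbered input/output pairs, counting from 1."""
    blocks = [
        f"Example Input {n}:\n{ex['input']}\nExample Output {n}:\n{ex['output']}"
        for n, ex in enumerate(examples, start=1)
    ]
    return "\n\n".join(blocks)


def inject_examples(base_text: str, examples_text: str) -> str:
    """Step 4: substitute only {examples}, leaving runtime placeholders intact."""
    if EXAMPLES_PLACEHOLDER not in base_text:
        raise ValueError("base prompt has no {examples} placeholder")
    # str.replace rather than str.format: literal JSON braces in the prompt
    # would make format() raise or mis-substitute.
    return base_text.replace(EXAMPLES_PLACEHOLDER, examples_text)
```

Using plain replacement for the single {examples} token keeps the other placeholders ({patient_context} and so on) untouched for the caller to fill at runtime.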
5.3 Placeholder contract
Each base prompt YAML has a placeholders field listing what the
caller must substitute at runtime:
# config/prompts/base/conversation_v2.yaml
placeholders:
  - name: "{examples}"
    source: "prompt_loader (from examples/)"
    injected_by: "prompt_loader.load_prompt()"
  - name: "{forbidden_phrases_block}"
    source: "config/voice_rules.yaml"
    injected_by: "_get_forbidden_phrases_block()"
  - name: "{emotional_context}"
    source: "emotional_state.py"
    injected_by: "generate_response_streaming()"
  - name: "{phase_context}"
    source: "config/prompts/phase_contexts/"
    injected_by: "generate_response_streaming()"
  - name: "{patient_context}"
    source: "case_orchestrator.py"
    injected_by: "generate_response_streaming()"
This makes the contract explicit — CI can verify that every placeholder listed in the YAML is substituted before the prompt reaches the LLM.
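One way such a CI check could work is to compare the placeholders that actually appear in the base prompt text against the names declared in the placeholders field. The sketch below is an assumption about how that test might be written, not the project's actual test; check_placeholder_contract is a hypothetical name, and the regex mirrors the lowercase-and-underscore naming used in the YAML above.

```python
import re

PLACEHOLDER_RE = re.compile(r"\{[a-z_]+\}")


def check_placeholder_contract(base_text: str, declared: list[str]) -> list[str]:
    """Compare placeholders used in the text against the declared list."""
    used = set(PLACEHOLDER_RE.findall(base_text))
    declared_set = set(declared)
    # Used but not declared: the caller has no way to know it must substitute it.
    problems = [f"undeclared placeholder: {name}" for name in sorted(used - declared_set)]
    # Declared but not used: the contract is stale.
    problems += [f"declared but unused: {name}" for name in sorted(declared_set - used)]
    return problems
```

A base prompt that embeds JSON schema text with tokens such as {value} would trip the "undeclared" branch, so a real test would probably need a small allow-list for those.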
6. Failsafe Design
6.1 Failure modes and responses
| Failure | Detection | Response | Patient impact |
|---|---|---|---|
| Base prompt YAML missing | load_prompt() raises PromptLoadError | Hard fail: blocks deployment. CI test catches this. | None. Never reaches production. |
| Base prompt YAML malformed | yaml.safe_load() raises | Hard fail, same as above. | None. |
| Examples YAML missing for requested locale | _load_examples() returns None | Fall back to English examples. Log warning. | None. Model gets English examples and responds in the patient's language anyway. |
| Examples YAML missing for English (fallback) | _load_examples("en") returns None | Hard fail: PromptLoadError. CI catches. | None. Never reaches production. |
| Examples YAML malformed | yaml.safe_load() raises or schema validation fails | Fall back to English. Log error. | None. Model gets valid English examples. |
| Examples YAML empty | Loaded but examples: [] | Fall back to English. Log warning. | None. |
| Placeholder not substituted | CI test scans assembled prompt for {...} patterns | Hard fail in CI. At runtime, _validate_prompt() logs error but sends prompt anyway (raw placeholder is harmless noise). | Minimal. Model ignores unrecognized {placeholder} text. |
| YAML file read permission error | OSError from Path.read_text() | Fall back to English. Log error. | None. |
| prompt_loader.py itself has a bug | Unit tests cover all code paths. Integration test assembles every prompt in CI. | Hard fail in CI. | None. Never reaches production. |
6.2 The fallback chain
Requested locale (e.g., "ar")
│
├─ YAML exists + valid → use it
│
├─ YAML missing or malformed
│ │
│ └─ Fall back to "en"
│ │
│ ├─ "en" YAML exists + valid → use it + log warning
│ │
│ └─ "en" YAML missing → PromptLoadError (blocks CI)
│
└─ All examples loading fails (shouldn't happen)
│
└─ Send base prompt WITHOUT examples + log error
(model still works — examples improve quality but aren't required)
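The first two branches of the chain can be sketched as a small function. To keep the sketch free of file I/O, the reader is passed in as a callable; read_examples is a hypothetical stand-in for _load_examples() and is assumed to return None for a missing file, an empty list for an empty one, and to raise OSError or ValueError for an unreadable or malformed one. The final "send base prompt without examples" branch is left to the caller, which would catch PromptLoadError at runtime.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("prompt_loader")


class PromptLoadError(Exception):
    """Raised when no usable examples exist, not even English."""


def load_examples_with_fallback(
    prompt_key: str,
    locale: str,
    read_examples: Callable[[str, str], Optional[list]],
) -> tuple[list, str]:
    """Return (examples, locale_used), trying the requested locale, then "en"."""
    # dict.fromkeys keeps order and drops the duplicate when locale == "en".
    for candidate in dict.fromkeys([locale, "en"]):
        try:
            examples = read_examples(prompt_key, candidate)
        except (OSError, ValueError) as exc:  # unreadable or malformed file
            logger.error("examples for %s/%s unusable: %s", candidate, prompt_key, exc)
            examples = None
        if examples:  # None (missing) and [] (empty) both fall through
            if candidate != locale:
                logger.warning("no %s examples for %s; using en", locale, prompt_key)
            return examples, candidate
    raise PromptLoadError(f"no English examples for {prompt_key}")
```

Returning the locale that was actually used makes it easy to record the fallback in traces, which supports the prioritization logging described for locales that have no examples yet.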
6.3 Runtime validation
Before sending any assembled prompt to the LLM:
import logging
import re

logger = logging.getLogger(__name__)

# Known safe: {value} patterns inside JSON schema examples.
# ("{}" never matches the regex below, so it needs no entry here.)
SAFE_PLACEHOLDERS = {"{value}"}


def _validate_prompt(prompt: str, prompt_key: str) -> str:
    """Check for unresolved placeholders and log if found."""
    unresolved = re.findall(r"\{[a-z_]+\}", prompt)
    problems = [p for p in unresolved if p not in SAFE_PLACEHOLDERS]
    if problems:
        logger.error(
            "Unresolved placeholders in %s: %s (sending anyway)",
            prompt_key, problems,
        )
    return prompt
This is a log-and-continue strategy, not a hard fail — because an unresolved placeholder is less harmful than blocking the patient's conversation.
7. Prompt Caching Integration
7.1 How Anthropic prompt caching works
- Client sends the full prompt with cache_control markers
- Anthropic hashes the prefix up to the marker
- If the hash matches a cached version, the cached tokens are billed at 10% of the base input price
- Cache TTL: 5 minutes (resets on each hit)
- Cache is scoped to the Anthropic workspace (platform-wide)
7.2 Where to place cache markers
The assembled prompt has this structure:
[base prompt] + [examples] + [forbidden phrases] + [phase context]
────────────────────── CACHEABLE PREFIX ──────────────────────
[patient context] + [conversation history] + [current message]
────────────────────── VARIABLE TAIL ──────────────────────────
Place cache_control: {"type": "ephemeral"} on the last content block
of the cacheable prefix. In the Anthropic Messages API the system prompt
is a top-level system parameter rather than a message with role "system",
so the marker goes on the system text block:
system = [
    {
        "type": "text",
        "text": assembled_prompt,  # base + examples + forbidden + phase
        "cache_control": {"type": "ephemeral"},
    },
]
messages = [
    # ... conversation history (not cached)
    {"role": "user", "content": current_message},
]
7.3 Cache hit rate projection
| Scenario | Cache warm? | Why |
|---|---|---|
| Same patient, same case, next message | Yes | Same system prompt prefix, <5 min between messages |
| Different patient, same locale + phase | Yes | Identical system prompt prefix (platform-scoped cache) |
| Same patient, phase transition | Partial | Phase context changes; base + examples stay cached only if they sit behind their own cache breakpoint ahead of the phase context |
| First call of the day | Miss | Cold start, full price |
| After code deploy | Miss | Base prompt changed, new hash |
At even modest traffic (10+ active conversations), the cache is warm continuously for English. Each additional locale adds its own cache entry — Arabic conversations share a cache, English conversations share a cache.
7.4 Cost projection
| Metric | Current (no caching) | With abstraction + caching |
|---|---|---|
| Conversation input tokens/month (100 cases) | ~930K | ~580K effective (37% reduction) |
| Extraction input tokens/month | ~356K | ~178K effective (50% reduction) |
| Total monthly input tokens | ~1.58M | ~962K |
| Monthly LLM cost (Haiku-heavy mix) | ~$14 | ~$9 |
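The "effective" token figures in the table can be reasoned about with a simple billing model: tokens read from cache count at 10% of their size, everything else counts in full. The sketch below is an illustration under that model only. It ignores the surcharge Anthropic applies when a cache entry is first written, and cached_fraction and hit_rate are hypothetical inputs, not measured values from this project.

```python
def effective_input_tokens(
    total: float,
    cached_fraction: float,
    hit_rate: float,
    read_multiplier: float = 0.10,
) -> float:
    """Billing-equivalent tokens when part of the input is served from cache."""
    cached = total * cached_fraction * hit_rate
    return (total - cached) + cached * read_multiplier


def reduction(before: float, after: float) -> float:
    """Fractional reduction from before to after."""
    return (before - after) / before


# Checking the table's percentages (token counts in thousands):
conversation = reduction(930, 580)   # about 0.376, the table's 37%
extraction = reduction(356, 178)     # exactly 0.5
overall = reduction(1580, 962)       # about 0.391
```

Under this model the reduction equals 0.9 × cached_fraction × hit_rate, so the conversation row's roughly 37% corresponds to a product of about 0.42, well below the ~80% prefix share from section 2.4 at a perfect hit rate.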
8. Scoping: Platform, Case, or Patient?
8.1 What is scoped to what
| Component | Scope | Changes when |
|---|---|---|
| Base prompt text | Platform | Code deploy |
| Few-shot examples | Platform × locale | Clinical review cycle |
| Forbidden phrases | Platform | Voice rules update |
| Phase context | Platform × version | Feature flag change |
| Patient context | Case | Each message (grows) |
| Conversation history | Case | Each message (grows) |
| Locale selection | Case (from preferred_locale) | Set at intake; updated if the patient switches language (see 8.3) |
| Emotional context | Message | Computed per message |
8.2 Locale detection
The patient's locale is detected at the case level:
- First message arrives
- Claude identifies the language as part of its normal response
- chat_extractor extracts language from the response
- case.preferred_locale is set (e.g., "ar")
- All subsequent LLM calls for that case load Arabic examples
If locale detection fails, default to English. The patient can override by saying "please respond in Arabic" — the extractor picks this up and updates the locale.
8.3 Mid-conversation locale switch
If a patient switches languages mid-conversation (e.g., starts in English, switches to Arabic):
- Extractor detects the new language
- case.preferred_locale is updated
- Next LLM call loads Arabic examples
- Prompt cache misses once (new prefix), then warms the Arabic cache
- No patient-visible disruption: the model adapts naturally
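The locale rules in 8.2 and 8.3 (use the detected language, keep the stored one otherwise, default to English) reduce to a few lines. This is a minimal sketch with hypothetical helper names; it assumes the extractor may emit tags such as "ar-SA" or "AR" and that the examples directory is keyed by the bare lowercase language code.

```python
from typing import Optional

DEFAULT_LOCALE = "en"


def normalize_locale(raw: Optional[str]) -> Optional[str]:
    """Reduce tags like "ar-SA" or "AR" to a bare language code; None if blank."""
    if not raw or not raw.strip():
        return None
    return raw.strip().lower().replace("_", "-").split("-")[0]


def resolve_locale(detected: Optional[str], current: Optional[str] = None) -> str:
    """Prefer a newly detected language, then the stored one, then English."""
    return normalize_locale(detected) or normalize_locale(current) or DEFAULT_LOCALE


def locale_changed(detected: Optional[str], current: Optional[str]) -> bool:
    """True when the detected language differs from what the case already uses."""
    return resolve_locale(detected, current) != resolve_locale(None, current)
```

A caller could use locale_changed to decide when to update case.preferred_locale and to anticipate the single cache miss that follows a switch.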
9. Edge Cases
9.1 Mixed-language documents
A patient uploads a report with Arabic headers and English lab values.
- OCR: PyMuPDF / Unstructured.io extracts both scripts correctly
- Extraction prompt: English few-shot examples still work — Claude handles multilingual input with English examples. The examples demonstrate the output schema, not the input language.
- ICD mapping: ICD-10 codes are language-neutral. Claude maps Arabic condition names to the same codes.
- Future improvement: Arabic extraction examples would improve accuracy for Arabic-specific medical terminology.
9.2 Locale with no examples yet
A patient speaks Hindi. No Hindi examples exist.
- Loader falls back to English examples
- Model still responds in Hindi (Claude is multilingual)
- Quality is slightly lower for Hindi-specific medical terms
- Logged as a warning for prioritization
9.3 Very long examples (extraction prompt)
The clinical extraction examples are 4,200 chars (59% of the prompt). These are detailed input/output pairs with FHIR-like JSON structures.
- These examples are critical for output quality — without them, the model's JSON structure drifts
- They are also the biggest caching win — stable across all calls, cached at 10% cost
- They should NOT be compressed further — the richness is intentional
9.4 COT examples reference reasoning steps
The ICD mapping COT and clinical analysis COT examples include
structured reasoning_steps arrays. These are designed for:
- LLM output quality (model follows the reasoning pattern)
- LangSmith eval datasets (steps stored in events table)
- MedGemma fine-tuning (steps become training data)
If Arabic COT examples are needed, they should follow the same 7-step structure with Arabic medical terminology. The step labels (BODY_SYSTEM, CONDITION_CATEGORY, etc.) stay in English — they're machine-readable keys, not patient-facing.
9.5 Concurrent YAML file update during request
If a YAML file is updated on disk while a request is in flight:
- The loader reads the file at call time (no caching of YAML content in memory by default)
- The request gets the new content, which is fine
- If the file is partially written (rare), yaml.safe_load() may fail → falls back to English
- For production safety, add an in-memory cache with 60s TTL (matches the Flagsmith cache pattern)
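The 60-second in-memory cache mentioned in the last bullet could be as small as the sketch below. TTLCache is a hypothetical class, not the existing Flagsmith cache; the clock is injectable so the expiry logic can be tested without sleeping, and the sketch is not thread-safe as written.

```python
import time
from typing import Any, Callable


class TTLCache:
    """Cache loader results per key for a fixed number of seconds."""

    def __init__(self, ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[str, tuple[float, Any]] = {}

    def get_or_load(self, key: str, loader: Callable[[], Any]) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # still fresh
        # If loader raises, the exception propagates and nothing is stored,
        # so the next call retries the load.
        value = loader()
        self._entries[key] = (now, value)
        return value
```

With this in front of the file read, a partially written YAML file can affect at most the requests that trigger a reload, and the trade-off is that an edited file takes up to one TTL to become visible.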
9.6 Prompt version rollback
If the V2 prompt regresses and we need to roll back to V1:
- Flagsmith flag prompt_version → "v1_original"
- V1 base prompt loads from config/prompts/base/conversation_v1.yaml
- V1 examples load from the same examples/en/conversation.yaml (examples are version-agnostic: WRONG/RIGHT pairs work for both)
- V1 phase contexts load from config/prompts/phase_contexts/v1/
- Zero code changes needed
9.7 New prompt added in future
When a developer adds a new LLM prompt (e.g., for case porting):
- Add base prompt YAML to config/prompts/base/
- Add English examples to config/prompts/examples/en/
- Use load_prompt() in the Python code
- CI test auto-discovers all YAML files and validates them
The CI test uses glob("config/prompts/base/*.yaml") to find all
prompts and verifies each has a corresponding English example file.
No manual registration needed.
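A discovery check along these lines is sketched below. Two details are assumptions layered on top of the text: base file names carry version suffixes (conversation_v2.yaml) that the examples files in section 4 do not, so the sketch strips a trailing _vN; and the directory listing shows base prompts with no examples file at all (chat_extractor, requirement_matcher), so the sketch uses a hypothetical NO_EXAMPLES allow-list for them.

```python
import re
from pathlib import Path

VERSION_SUFFIX = re.compile(r"_v\d+$")
# Assumption: base prompts that intentionally ship without examples.
NO_EXAMPLES = {"chat_extractor", "requirement_matcher"}


def examples_key(base_stem: str) -> str:
    """conversation_v2 -> conversation; icd_mapping_cot stays as-is."""
    return VERSION_SUFFIX.sub("", base_stem)


def missing_english_examples(prompts_root: Path) -> list[str]:
    """List base prompt files that have no matching examples/en/ file."""
    missing = []
    for base_file in sorted((prompts_root / "base").glob("*.yaml")):
        key = examples_key(base_file.stem)
        if key in NO_EXAMPLES:
            continue
        if not (prompts_root / "examples" / "en" / f"{key}.yaml").is_file():
            missing.append(base_file.name)
    return missing
```

A CI test would then assert that missing_english_examples(Path("config/prompts")) is empty, so a new base prompt without English examples fails the build with the file name in the message.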
10. Migration Strategy
10.1 Zero-downtime migration
The migration replaces hardcoded string constants with
load_prompt() calls. The assembled output is byte-identical.
Verification: A CI test (test_prompt_migration_parity.py)
assembles each prompt via the new loader and asserts
character-for-character equality with the old hardcoded constant.
This test is temporary and is removed after the migration PR merges.
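A bare equality assert on two multi-kilobyte prompts produces an unreadable failure, so a parity test benefits from a helper that pinpoints the first mismatch. The sketch below is an assumption about how such a helper might look; first_difference and assert_parity are hypothetical names and are not taken from test_prompt_migration_parity.py.

```python
from typing import Optional


def first_difference(expected: str, actual: str) -> Optional[str]:
    """Describe the first mismatch between two strings, or None if identical."""
    if expected == actual:
        return None
    for index, (a, b) in enumerate(zip(expected, actual)):
        if a != b:
            return f"char {index}: expected {a!r}, got {b!r}"
    # One string is a strict prefix of the other.
    return f"length differs: expected {len(expected)}, got {len(actual)}"


def assert_parity(prompt_key: str, hardcoded: str, assembled: str) -> None:
    """Fail with the prompt key and mismatch position if the strings differ."""
    diff = first_difference(hardcoded, assembled)
    assert diff is None, f"{prompt_key}: {diff}"
```

Trailing whitespace and newline differences between YAML block scalars and Python string literals are a likely source of mismatches, and the repr in the message makes those visible.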
10.2 Migration order
- Lowest risk first: Prompts called infrequently (rerank, explanation, intent classifier)
- Highest impact second: Extraction + ICD mapping (biggest example overhead, most caching benefit)
- Conversation last: Most complex (5 placeholders, 5 phase contexts, forbidden phrases injection, emotional context)
10.3 Rollback
Every migrated prompt retains the old constant (commented out) for one release cycle. If the loader fails in production, a one-line revert restores the hardcoded constant.
11. What This Enables
| Capability | How | When |
|---|---|---|
| Arabic support | Drop ar/ YAML files in examples/ | Phase 1 multilingual |
| Clinical review workflow | Export examples to xlsx, review, re-import | Already established pattern |
| A/B testing examples | Feature flag selects example version | Post-abstraction |
| Prompt caching | Stable prefix → Anthropic cache | Immediate after migration |
| Fine-tuning dataset | YAML examples → training pairs export | Post-Series A |
| Prompt analytics | Track which examples are loaded per call via Langfuse | Post-abstraction |
| New language in < 1 day | Translate YAML files, no Python changes | After Arabic proves the pattern |
12. Success Criteria
- Zero behavioral change: assembled prompts are byte-identical to hardcoded constants (verified by parity test)
- All CI tests pass: no regressions in voice compliance, medical advice scanner, or existing prompt tests
- Prompt caching active: Langfuse traces show cache_read tokens on repeated calls within the same conversation
- 39% input token reduction: measured via Anthropic usage report after 1 week in production
- Arabic examples loadable: load_prompt("conversation", "ar") returns a valid assembled prompt (even before Arabic content exists, falls back to English)
13. References
- Companion feature spec: prompt-abstraction-feature.md
- Prompt audit (Session 35): 19 prompts across 8 files, 23 example blocks, ~9,900 chars of inline examples
- Anthropic prompt caching docs: https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching
- Voice rules (existing YAML pattern): config/voice_rules.yaml
- Feature flags (existing YAML pattern): config/feature_flags.yaml
- Medical advice review workflow: curaway-medical-advice-review.xlsx (Session 35: xlsx export → Dr. Naidu review → approved strings)