Prompt Abstraction Layer: Steer Document

Date: 2026-04-10
Author: Srikanth Donthi (CPO/CTO) + Claude Code session 35
Status: Implemented (PR #92 merged, 14/14 parity verified)
Companion spec: prompt-abstraction-feature.md

1. Problem Statement

All 19 LLM prompts in the Curaway backend are hardcoded as Python string constants across 8 files. Few-shot examples (~28.5% of total prompt tokens) are embedded inline. This creates five problems:

Language scaling: Adding Arabic requires editing the 12 example-bearing prompts in the Python source. Adding a third language doubles the work. There's no way to load language-specific examples without code changes.
Token waste: Every LLM call sends all examples regardless of relevance. The extraction prompt is 59% examples — 4,200 chars of radiology/lab report samples sent even when processing a dental X-ray.
Clinical review friction: Dr. Naidu (Clinical Advisor) reviews wording in xlsx files, not Python string constants. Prompt examples should be reviewable the same way — in a human-readable format outside of code.
No prompt caching: Anthropic's prompt caching reduces cached input tokens to 10% cost. But caching requires a stable prefix. With examples embedded inline and dynamic placeholders scattered throughout, identifying the cacheable boundary is difficult.
Version drift: V1 and V2 prompts coexist with different example counts (V1 has 0 conversation examples, V2 has 5). Phase contexts have inconsistent example coverage. No central inventory of what examples exist.

2. Design Principles

2.1 Separation of concerns

Base prompt (rules, schema, guardrails)  → Python constant or YAML
Few-shot examples                        → YAML, per-locale, per-category
Forbidden phrases                        → Already in voice_rules.yaml
Dynamic context (patient, phase, emotion) → Injected at runtime

The base prompt changes with code deploys. Examples change with clinical review cycles. Dynamic context changes per call. These three concerns have different change frequencies and different reviewers — they should not live in the same string.

2.2 Every call gets full examples, always

Prompt caching is a billing optimization, not a content optimization. The LLM sees the full prompt on every call. Caching means Anthropic's servers recognize the prefix and charge 10% for the cached portion. There is zero quality difference between a cache hit and a cache miss.

This means:

Examples are never "skipped" or "omitted" for any call
Language-specific examples are loaded from YAML and injected into the prompt before sending
The assembled prompt is byte-identical to today's hardcoded strings

2.3 Fail to English, never fail to nothing

If the requested locale's examples are missing, fall back to English. If English examples are missing, fail loud (raise, caught in CI). A prompt without examples is worse than a prompt with wrong-language examples — the model still benefits from structural guidance.

2.4 Prompt caching boundary

The cacheable prefix = base prompt + examples + forbidden phrases + phase context. Everything before {patient_context} is stable across calls within the same conversation phase. Patient context and conversation history are the variable tail.

┌──────────────────────────────────┐
│  CACHED PREFIX (~80% of input)   │
│  ├─ Base prompt (rules, schema)  │
│  ├─ Few-shot examples (locale)   │
│  ├─ Forbidden phrases            │
│  └─ Phase context                │
├──────────────────────────────────┤
│  VARIABLE TAIL (~20% of input)   │
│  ├─ Patient context              │
│  ├─ Conversation history         │
│  └─ Current message              │
└──────────────────────────────────┘
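The boundary in the diagram can be made concrete by splitting a template at the first variable placeholder. This is a minimal sketch, assuming the template places {patient_context} after all stable content; split_at_cache_boundary and the sample template are hypothetical and not taken from the codebase.

```python
def split_at_cache_boundary(
    template: str, marker: str = "{patient_context}"
) -> tuple[str, str]:
    """Split a prompt template into (stable_prefix, variable_tail).

    Everything before `marker` is treated as cacheable; the marker and
    everything after it form the per-call tail.
    """
    index = template.find(marker)
    if index == -1:
        # No variable section: the whole template is a stable prefix.
        return template, ""
    return template[:index], template[index:]


# Hypothetical template, used only to illustrate the split.
template = "RULES\n{examples}\n{phase_context}\nPATIENT:\n{patient_context}\n"
prefix, tail = split_at_cache_boundary(template)
```

The split only yields a stable prefix if no per-message placeholder (for example {emotional_context}) appears before the marker; anything that varies per call has to sit in the tail.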

3. Scope: What Moves, What Stays

Moves to YAML (few-shot examples)

Prompt	Current location	Examples to extract
`CURAWAY_SYSTEM_PROMPT_V2`	llm_conversation.py:210-228	5 WRONG/RIGHT conversation pairs
`EXTRACTION_SYSTEM_PROMPT`	clinical_extraction.py:55-207	2 rich examples (radiology + lab)
`ICD_MAPPING_SYSTEM_PROMPT`	clinical_extraction.py:252-278	1 coded example
`ICD_MAPPING_SYSTEM_PROMPT_COT`	clinical_extraction.py:350-398	1 COT reasoning example
`FHIR_GENERATION_SYSTEM_PROMPT`	clinical_extraction.py:429-477	1 FHIR resource example
`EXPLANATION_SYSTEM_PROMPT`	explanation.py:26-36	1 explanation example
`CLINICAL_ANALYSIS_SYSTEM_PROMPT`	match_analysis.py:21-38	1 analysis example
`CLINICAL_ANALYSIS_COT`	match_analysis.py:125-161	1 COT analysis example
`RERANK_SYSTEM_PROMPT`	match_analysis.py:187-195	1 rerank example
`CLASSIFY_INTENT_PROMPT`	orchestrator.py:61-67	6 intent classification examples
`INTAKE_SYSTEM_PROMPT_V1`	intake.py:38-49	1 intake example
Phase: `document_review` (V1)	llm_conversation.py:390-404	2 response shape examples

Total: 23 example blocks across 12 prompts.

Stays in Python (base prompts + dynamic logic)

Base prompt text (rules, schemas, guardrails)
_get_system_prompt() assembly function
_get_phase_contexts() phase routing
_get_forbidden_phrases_block() voice rules injection
All placeholder substitution ({patient_context}, {phase_context}, {emotional_context}, {forbidden_phrases_block})
All runtime template variables in MATCHER_USER_TEMPLATE
Feature flag routing (V1 vs V2, COT vs legacy)

Already external (no change needed)

config/voice_rules.yaml — forbidden phrases
config/guardrails.yaml — message classifier categories
Flagsmith feature flags — prompt version routing

4. Directory Structure

config/prompts/
  ├─ base/                              # Base prompts (rules + schema, no examples)
  │   ├─ conversation_v1.yaml
  │   ├─ conversation_v2.yaml
  │   ├─ clinical_extraction.yaml
  │   ├─ icd_mapping.yaml
  │   ├─ icd_mapping_cot.yaml
  │   ├─ fhir_generation.yaml
  │   ├─ intake_v1.yaml
  │   ├─ intake_v2.yaml
  │   ├─ explanation.yaml
  │   ├─ clinical_analysis.yaml
  │   ├─ clinical_analysis_cot.yaml
  │   ├─ rerank.yaml
  │   ├─ intent_classifier.yaml
  │   ├─ chat_extractor_v1.yaml
  │   ├─ chat_extractor_v2.yaml
  │   └─ requirement_matcher.yaml
  │
  ├─ examples/                          # Few-shot examples, per locale
  │   ├─ en/
  │   │   ├─ conversation.yaml          # 5 WRONG/RIGHT pairs
  │   │   ├─ clinical_extraction.yaml   # 2 rich medical report examples
  │   │   ├─ icd_mapping.yaml           # 1 ICD/SNOMED coding example
  │   │   ├─ icd_mapping_cot.yaml       # 1 COT reasoning example
  │   │   ├─ fhir_generation.yaml       # 1 FHIR resource example
  │   │   ├─ intake.yaml                # 1 intake exchange example
  │   │   ├─ explanation.yaml           # 1 match explanation example
  │   │   ├─ clinical_analysis.yaml     # 1 clinical analysis example
  │   │   ├─ clinical_analysis_cot.yaml # 1 COT analysis example
  │   │   ├─ rerank.yaml                # 1 rerank example
  │   │   ├─ intent_classifier.yaml     # 6 intent routing examples
  │   │   └─ document_review.yaml       # 2 doc review response examples
  │   │
  │   └─ ar/                            # Arabic (future — Phase 1 multilingual)
  │       ├─ conversation.yaml
  │       ├─ clinical_extraction.yaml
  │       └─ ...
  │
  ├─ phase_contexts/                    # Phase-specific context blocks
  │   ├─ v1/
  │   │   ├─ identify_procedure.yaml
  │   │   ├─ records_first.yaml
  │   │   ├─ intake.yaml
  │   │   ├─ document_review.yaml
  │   │   └─ general.yaml
  │   └─ v2/
  │       ├─ identify_procedure.yaml
  │       ├─ records_first.yaml
  │       ├─ intake.yaml
  │       ├─ document_review.yaml
  │       └─ general.yaml
  │
  └─ forbidden/                         # Already exists as voice_rules.yaml
      └─ (symlink or reference to config/voice_rules.yaml)

YAML format

# config/prompts/examples/en/conversation.yaml
---
description: "Conversation few-shot examples — English"
locale: en
prompt_key: conversation
version: 2

examples:
  - label: "Patient mentions pain and waiting"
    user: "My left knee hurts badly. I can't walk my dog anymore. I'm tired of waiting 18 months for this surgery."
    wrong: "It's completely natural to feel that way — exploring treatment options abroad is a big step. That's exactly why Curaway exists. Would you like to see how we evaluate providers?"
    right: "Eighteen months waiting while you can't even walk your dog — that's a long time to lose the small things that matter. Let's see what we can do to get you moving again. Has your doctor confirmed it's a **Total Knee Replacement** you need?"

  - label: "Patient scared of safety"
    user: "I'm scared. Is this safe to do abroad?"
    wrong: "Don't worry, you're in good hands! All our hospitals are excellent."
    right: "Fair question — going abroad for surgery is a real decision. Every hospital we work with is internationally accredited (JCI or equivalent), and you'll see their outcomes data, surgeon credentials, and patient reviews before making any choice. Would you like to start by seeing a few hospitals, or would you rather understand the process first?"

  # ... 3 more pairs

# config/prompts/examples/en/clinical_extraction.yaml
---
description: "Clinical extraction few-shot examples — English"
locale: en
prompt_key: clinical_extraction
version: 1

examples:
  - label: "Radiology report — knee osteoarthritis"
    input: |
      Report: X-ray Right Knee
      Findings: Moderate to severe osteoarthritis...
    output: |
      {
        "entities": [...],
        "observations": [...],
        ...
      }

  - label: "Lab report — CBC + metabolic panel"
    input: |
      Complete Blood Count:
      WBC: 7.2 x10^9/L ...
    output: |
      {
        "entities": [...],
        "observations": [...],
        ...
      }

Why YAML over JSON

Multiline strings are readable (| block scalar)
Comments allowed (useful for reviewer notes)
Same tooling as voice_rules.yaml and feature_flags.yaml
yaml.safe_load() is already used throughout the codebase
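Once a file has been parsed with yaml.safe_load(), the result is a plain dict that can be checked against the format shown above. The following is a sketch of such a check, assuming only the two example shapes illustrated in this section (WRONG/RIGHT pairs and input/output pairs); validate_examples_doc and its constants are hypothetical names, and other shapes such as intent examples would need their own entries.

```python
REQUIRED_TOP_LEVEL = ("description", "locale", "prompt_key", "version", "examples")

# Field sets for the two example shapes shown in the YAML samples.
PAIR_SHAPES = (
    {"user", "wrong", "right"},  # WRONG/RIGHT conversation pairs
    {"input", "output"},         # input/output extraction pairs
)


def validate_examples_doc(doc: dict) -> list[str]:
    """Return a list of problems in a parsed examples file (empty if valid)."""
    problems = []
    for key in REQUIRED_TOP_LEVEL:
        if key not in doc:
            problems.append(f"missing top-level key: {key}")

    examples = doc.get("examples") or []
    if not examples:
        problems.append("examples list is empty")

    for i, example in enumerate(examples):
        if "label" not in example:
            problems.append(f"example {i}: missing label")
        fields = set(example) - {"label"}
        # An example is accepted if it contains all fields of a known shape.
        if not any(shape <= fields for shape in PAIR_SHAPES):
            problems.append(f"example {i}: unrecognized shape {sorted(fields)}")
    return problems
```

Returning a list rather than raising lets a CI test report every problem in a file at once.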

5. Prompt Assembly

5.1 The loader

# app/services/prompt_loader.py

def load_prompt(
    prompt_key: str,
    locale: str = "en",
    version: str | None = None,
) -> str:
    """Assemble a complete prompt from base + examples.

    Args:
        prompt_key: e.g., "conversation", "clinical_extraction"
        locale: ISO 639-1 code (e.g., "en", "ar")
        version: optional version suffix (e.g., "v1", "v2", "cot")

    Returns:
        Assembled prompt string ready for LLM.

    Raises:
        PromptLoadError: if the base prompt is missing, or if the English
            fallback examples are missing (hard failures).

    A missing non-English locale never raises; the loader falls back to "en".
    """

5.2 Assembly order

1. Load base prompt YAML → base_text
2. Load examples YAML (requested locale, fallback to "en") → examples_text
3. Format examples into the prompt's expected shape:
   - WRONG/RIGHT pairs → "EXAMPLES OF HOW TO RESPOND:\n\n..."
   - Input/Output pairs → "Example Input 1:\n...\nExample Output 1:\n..."
   - Intent examples → "Examples:\n..."
4. Inject examples into base_text at the {examples} placeholder
5. Return assembled string
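Steps 3 and 4 can be sketched as follows. The section headers come from the list above; the per-line labels (Patient:, WRONG:, RIGHT:) and all three function names are assumptions for illustration, and the real formatters have to reproduce whatever the hardcoded constants contain to keep byte parity.

```python
def format_wrong_right(examples: list[dict]) -> str:
    """Render WRONG/RIGHT pairs under the header named in step 3."""
    blocks = [
        f"Patient: {ex['user']}\nWRONG: {ex['wrong']}\nRIGHT: {ex['right']}"
        for ex in examples
    ]
    return "EXAMPLES OF HOW TO RESPOND:\n\n" + "\n\n".join(blocks)


def format_input_output(examples: list[dict]) -> str:
    """Render numbered input/output pairs, starting at 1."""
    parts = [
        f"Example Input {n}:\n{ex['input']}\nExample Output {n}:\n{ex['output']}"
        for n, ex in enumerate(examples, start=1)
    ]
    return "\n\n".join(parts)


def inject_examples(base_text: str, examples_text: str) -> str:
    """Step 4: fill only the {examples} slot, leaving other placeholders intact."""
    if "{examples}" not in base_text:
        raise ValueError("base prompt has no {examples} placeholder")
    # str.replace is used instead of str.format because base prompts contain
    # literal JSON braces and placeholders that are substituted later.
    return base_text.replace("{examples}", examples_text)
```

Using plain replacement keeps the remaining placeholders untouched for the runtime substitution described in 5.3.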

5.3 Placeholder contract

Each base prompt YAML has a placeholders field listing what the caller must substitute at runtime:

# config/prompts/base/conversation_v2.yaml
placeholders:
  - name: "{examples}"
    source: "prompt_loader (from examples/)"
    injected_by: "prompt_loader.load_prompt()"
  - name: "{forbidden_phrases_block}"
    source: "config/voice_rules.yaml"
    injected_by: "_get_forbidden_phrases_block()"
  - name: "{emotional_context}"
    source: "emotional_state.py"
    injected_by: "generate_response_streaming()"
  - name: "{phase_context}"
    source: "config/prompts/phase_contexts/"
    injected_by: "generate_response_streaming()"
  - name: "{patient_context}"
    source: "case_orchestrator.py"
    injected_by: "generate_response_streaming()"

This makes the contract explicit — CI can verify that every placeholder listed in the YAML is substituted before the prompt reaches the LLM.
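A CI check of this contract could compare the placeholders that appear in the base text with the names declared in the YAML. This is a sketch under the assumption that placeholders are lowercase identifiers in braces; check_placeholder_contract is a hypothetical helper, and literal tokens such as {value} inside JSON schema snippets would need an allow-list.

```python
import re

# Matches tokens such as {examples} or {patient_context}.
PLACEHOLDER_RE = re.compile(r"\{[a-z_]+\}")


def check_placeholder_contract(
    base_text: str, declared: list[str]
) -> dict[str, set[str]]:
    """Compare placeholders found in the text against the declared names."""
    found = set(PLACEHOLDER_RE.findall(base_text))
    declared_set = set(declared)
    return {
        "undeclared": found - declared_set,  # used in text, absent from YAML
        "unused": declared_set - found,      # listed in YAML, absent from text
    }


result = check_placeholder_contract(
    "Rules...\n{examples}\n{phase_context}\n{typo_context}",
    ["{examples}", "{phase_context}", "{patient_context}"],
)
```

Here result reports {typo_context} as undeclared and {patient_context} as unused; a CI test would fail if either set is non-empty.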

6. Failsafe Design

6.1 Failure modes and responses

Failure	Detection	Response	Patient impact
Base prompt YAML missing	`load_prompt()` raises `PromptLoadError`	Hard fail — blocks deployment. CI test catches this.	None — never reaches production.
Base prompt YAML malformed	`yaml.safe_load()` raises	Hard fail — same as above.	None.
Examples YAML missing for requested locale	`_load_examples()` returns None	Fall back to English examples. Log warning.	None — model gets English examples, responds in patient's language anyway.
Examples YAML missing for English (fallback)	`_load_examples("en")` returns None	Hard fail — `PromptLoadError`. CI catches.	None — never reaches production.
Examples YAML malformed	`yaml.safe_load()` raises or schema validation fails	Fall back to English. Log error.	None — model gets valid English examples.
Examples YAML empty	Loaded but `examples: []`	Fall back to English. Log warning.	None.
Placeholder not substituted	CI test scans assembled prompt for `{...}` patterns	Hard fail in CI. At runtime, `_validate_prompt()` logs error but sends prompt anyway (raw placeholder is harmless noise).	Minimal — model ignores unrecognized `{placeholder}` text.
YAML file read permission error	`OSError` from `Path.read_text()`	Fall back to English. Log error.	None.
prompt_loader.py itself has a bug	Unit tests cover all code paths. Integration test assembles every prompt in CI.	Hard fail in CI.	None — never reaches production.

6.2 The fallback chain

Requested locale (e.g., "ar")
  │
  ├─ YAML exists + valid → use it
  │
  ├─ YAML missing or malformed
  │   │
  │   └─ Fall back to "en"
  │       │
  │       ├─ "en" YAML exists + valid → use it + log warning
  │       │
  │       └─ "en" YAML missing → PromptLoadError (blocks CI)
  │
  └─ All examples loading fails (shouldn't happen)
      │
      └─ Send base prompt WITHOUT examples + log error
          (model still works — examples improve quality but aren't required)
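The first two levels of the chain can be sketched as a small resolver. This is an illustration only: read_examples is a hypothetical callable standing in for the file read plus YAML parse, and the broad except is an assumption that any read or parse failure should trigger the fallback.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)


class PromptLoadError(Exception):
    """Raised when neither the requested locale nor English yields examples."""


def resolve_examples(
    prompt_key: str,
    locale: str,
    read_examples: Callable[[str, str], Optional[list]],
) -> tuple[list, str]:
    """Return (examples, locale_used), trying `locale` first and then "en"."""
    # dict.fromkeys de-duplicates while keeping order, so "en" is tried once.
    for candidate in dict.fromkeys([locale, "en"]):
        try:
            examples = read_examples(candidate, prompt_key)
        except Exception as exc:  # e.g. OSError or a YAML parse error
            logger.error(
                "Examples for %s/%s unreadable: %s", candidate, prompt_key, exc
            )
            examples = None
        if examples:  # None and [] both count as missing
            if candidate != locale:
                logger.warning(
                    "No %s examples for %s; using %s", locale, prompt_key, candidate
                )
            return examples, candidate
    raise PromptLoadError(
        f"No usable examples for {prompt_key!r} in {locale!r} or 'en'"
    )
```

The last-resort branch (sending the base prompt without examples) is left to the caller, which can catch PromptLoadError at runtime if that behavior is wanted.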

6.3 Runtime validation

Before sending any assembled prompt to the LLM:

import logging
import re

logger = logging.getLogger(__name__)

# Known safe: {value} appears inside JSON schema examples.
# An empty "{}" never matches the pattern below, so it needs no entry here.
SAFE_PLACEHOLDERS = {"{value}"}

def _validate_prompt(prompt: str, prompt_key: str) -> str:
    """Log any unresolved {placeholder} tokens, then return the prompt unchanged."""
    unresolved = re.findall(r"\{[a-z_]+\}", prompt)
    problems = [p for p in unresolved if p not in SAFE_PLACEHOLDERS]
    if problems:
        logger.error(
            "Unresolved placeholders in %s: %s; sending anyway",
            prompt_key, problems,
        )
    return prompt

This is a log-and-continue strategy, not a hard fail — because an unresolved placeholder is less harmful than blocking the patient's conversation.

7. Prompt Caching Integration

7.1 How Anthropic prompt caching works

Client sends the full prompt with cache_control markers
Anthropic hashes the prefix up to the marker
If the hash matches a cached version, cached tokens cost 10%
Cache TTL: 5 minutes (resets on each hit)
Cache is scoped to the Anthropic workspace (platform-wide)

7.2 Where to place cache markers

The assembled prompt has this structure:

[base prompt] + [examples] + [forbidden phrases] + [phase context]
────────────────────── CACHEABLE PREFIX ──────────────────────
[patient context] + [conversation history] + [current message]
────────────────────── VARIABLE TAIL ──────────────────────────

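Place cache_control: {"type": "ephemeral"} on the last content block of the cacheable prefix. When calling the Anthropic Messages API directly, the system prompt is a top-level system parameter rather than a message with role "system", so the marker goes on a system text block and the variable tail follows in a second block (cacheable_prefix and variable_tail are illustrative names for the two halves of the assembled prompt):

system = [
    {
        "type": "text",
        "text": cacheable_prefix,  # base + examples + forbidden + phase
        "cache_control": {"type": "ephemeral"},
    },
    {"type": "text", "text": variable_tail},  # patient context (not cached)
]
messages = [
    # ... conversation history (not cached)
    {"role": "user", "content": current_message},
]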

7.3 Cache hit rate projection

Scenario	Cache warm?	Why
Same patient, same case, next message	Yes	Same system prompt prefix, <5 min between messages
Different patient, same locale + phase	Yes	Identical system prompt prefix (platform-scoped cache)
Same patient, phase transition	Partial	Phase context changes, so the prefix differs from that point on; base + examples are reused only if they end at their own cache breakpoint, placed before the phase context
First call of the day	Miss	Cold start, full price
After code deploy	Miss	Base prompt changed, new hash

At even modest traffic (10+ active conversations), the cache is warm continuously for English. Each additional locale adds its own cache entry — Arabic conversations share a cache, English conversations share a cache.

7.4 Cost projection

Metric	Current (no caching)	With abstraction + caching
Conversation input tokens/month (100 cases)	~930K	~580K effective (37% reduction)
Extraction input tokens/month	~356K	~178K effective (50% reduction)
Total monthly input tokens (all prompts, including those not itemized above)	~1.58M	~962K
Monthly LLM cost (Haiku-heavy mix)	~$14	~$9
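The "effective" token counts depend on how much of each prompt is cacheable and how often the cache is hit. The sketch below shows the arithmetic only; it does not reproduce the table's figures. It assumes cache reads are billed at 10% of the base input price (as stated in 7.1) and cache writes at 125% (Anthropic's published multiplier for the 5-minute cache at the time of writing). The function name and parameters are hypothetical.

```python
def effective_input_tokens(
    total_tokens: float,
    cached_fraction: float,
    hit_rate: float,
    read_multiplier: float = 0.10,
    write_multiplier: float = 1.25,
) -> float:
    """Return billing-equivalent input tokens under prompt caching.

    cached_fraction: share of input tokens that sit in the cacheable prefix.
    hit_rate: share of calls where that prefix is served from cache.
    """
    cached = total_tokens * cached_fraction
    uncached = total_tokens - cached
    hits = cached * hit_rate * read_multiplier           # cheap cache reads
    misses = cached * (1 - hit_rate) * write_multiplier  # cache writes cost extra
    return uncached + hits + misses


# With an 80% cacheable prefix and a perfect hit rate, 1,000 input tokens
# are billed like 280: 200 uncached plus 800 at 10%.
best_case = effective_input_tokens(1000, cached_fraction=0.8, hit_rate=1.0)
```

With a hit rate of 0 the same call is billed like 1,200 tokens, so caching only pays off when traffic keeps the cache warm.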

8. Scoping: Platform, Case, or Patient?

8.1 What is scoped to what

Component	Scope	Changes when
Base prompt text	Platform	Code deploy
Few-shot examples	Platform × locale	Clinical review cycle
Forbidden phrases	Platform	Voice rules update
Phase context	Platform × version	Feature flag change
Patient context	Case	Each message (grows)
Conversation history	Case	Each message (grows)
Locale selection	Case (from `preferred_locale`)	Set once at intake
Emotional context	Message	Computed per message

8.2 Locale detection

The patient's locale is detected at the case level:

First message arrives
Claude identifies the language as part of its normal response
chat_extractor extracts language from the response
case.preferred_locale is set (e.g., "ar")
All subsequent LLM calls for that case load Arabic examples

If locale detection fails, default to English. The patient can override by saying "please respond in Arabic" — the extractor picks this up and updates the locale.

8.3 Mid-conversation locale switch

If a patient switches languages mid-conversation (e.g., starts in English, switches to Arabic):

Extractor detects the new language
case.preferred_locale is updated
Next LLM call loads Arabic examples
Prompt cache misses once (new prefix), then warms the Arabic cache
No patient-visible disruption — the model adapts naturally

9. Edge Cases

9.1 Mixed-language documents

A patient uploads a report with Arabic headers and English lab values.

OCR: PyMuPDF / Unstructured.io extracts both scripts correctly
Extraction prompt: English few-shot examples still work — Claude handles multilingual input with English examples. The examples demonstrate the output schema, not the input language.
ICD mapping: ICD-10 codes are language-neutral. Claude maps Arabic condition names to the same codes.
Future improvement: Arabic extraction examples would improve accuracy for Arabic-specific medical terminology.

9.2 Locale with no examples yet

A patient speaks Hindi. No Hindi examples exist.

Loader falls back to English examples
Model still responds in Hindi (Claude is multilingual)
Quality is slightly lower for Hindi-specific medical terms
Logged as a warning for prioritization

9.3 Very long examples (extraction prompt)

The clinical extraction examples are 4,200 chars (59% of the prompt). These are detailed input/output pairs with FHIR-like JSON structures.

These examples are critical for output quality — without them, the model's JSON structure drifts
They are also the biggest caching win — stable across all calls, cached at 10% cost
They should NOT be compressed further — the richness is intentional

9.4 COT examples reference reasoning steps

The ICD mapping COT and clinical analysis COT examples include structured reasoning_steps arrays. These are designed for:

LLM output quality (model follows the reasoning pattern)
LangSmith eval datasets (steps stored in events table)
MedGemma fine-tuning (steps become training data)

If Arabic COT examples are needed, they should follow the same 7-step structure with Arabic medical terminology. The step labels (BODY_SYSTEM, CONDITION_CATEGORY, etc.) stay in English — they're machine-readable keys, not patient-facing.

9.5 Concurrent YAML file update during request

If a YAML file is updated on disk while a request is in flight:

The loader reads the file at call time (no caching of YAML content in memory by default)
The request gets the new content — this is fine
If the file is partially written (rare), yaml.safe_load() may fail → falls back to English
For production safety, add an in-memory cache with 60s TTL (matches the Flagsmith cache pattern)
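The 60s in-memory cache mentioned in the last point could look like the sketch below. TTLCache is a hypothetical class, not the Flagsmith cache itself; it is not thread-safe as written, and the injectable clock exists only to make expiry testable.

```python
import time
from typing import Callable


class TTLCache:
    """Cache loader results per key and reload them after ttl_seconds."""

    def __init__(
        self,
        ttl_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[str, tuple[float, object]] = {}

    def get_or_load(self, key: str, loader: Callable[[], object]) -> object:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # still fresh
        # Expired or never loaded. If loader raises, nothing is stored and
        # the exception propagates to the caller.
        value = loader()
        self._entries[key] = (now, value)
        return value
```

A loader would call something like cache.get_or_load(str(path), lambda: parse(path)), where parse is whatever function reads and parses the YAML file.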

9.6 Prompt version rollback

If the V2 prompt regresses and we need to roll back to V1:

Flagsmith flag prompt_version → "v1_original"
V1 base prompt loads from config/prompts/base/conversation_v1.yaml
V1 examples load from the same examples/en/conversation.yaml (examples are version-agnostic — WRONG/RIGHT pairs work for both)
V1 phase contexts load from config/prompts/phase_contexts/v1/
Zero code changes needed

9.7 New prompt added in future

When a developer adds a new LLM prompt (e.g., for case porting):

Add base prompt YAML to config/prompts/base/
Add English examples to config/prompts/examples/en/
Use load_prompt() in the Python code
CI test auto-discovers all YAML files and validates them

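The CI test uses glob("config/prompts/base/*.yaml") to find all prompts and verifies that each one that declares an {examples} placeholder has a corresponding English example file. Versioned base prompts such as conversation_v2.yaml share one example file (conversation.yaml). No manual registration needed.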
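A discovery check along these lines might be written as below. Two details are assumptions and not confirmed by the layout in section 4: that a base prompt needs examples only when its text contains {examples}, and that a trailing _v<digits> suffix is dropped to find the example file. Both function names are hypothetical.

```python
import re
from pathlib import Path


def example_key_for(base_stem: str) -> str:
    """Hypothetical mapping: drop a trailing _v<digits> version suffix."""
    return re.sub(r"_v\d+$", "", base_stem)


def missing_english_examples(prompts_root: Path) -> list[str]:
    """List base prompt files that expect examples but have no en/ file."""
    missing = []
    for base_file in sorted((prompts_root / "base").glob("*.yaml")):
        if "{examples}" not in base_file.read_text(encoding="utf-8"):
            continue  # this prompt declares no examples slot
        key = example_key_for(base_file.stem)
        example_file = prompts_root / "examples" / "en" / f"{key}.yaml"
        if not example_file.is_file():
            missing.append(base_file.name)
    return missing
```

In CI this would be called as missing_english_examples(Path("config/prompts")) and asserted to return an empty list.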

10. Migration Strategy

10.1 Zero-downtime migration

The migration replaces hardcoded string constants with load_prompt() calls. The assembled output is byte-identical.

Verification: A CI test (test_prompt_migration_parity.py) assembles each prompt via the new loader and asserts character-for-character equality with the old hardcoded constant. This test is temporary and is removed after the migration PR merges.
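The core of such a parity check is a strict string comparison with a readable diff on failure. The helper below is a sketch and not the contents of test_prompt_migration_parity.py; assert_parity is a hypothetical name.

```python
import difflib


def assert_parity(prompt_key: str, assembled: str, hardcoded: str) -> None:
    """Fail with a line diff if the two prompts differ by even one character."""
    if assembled == hardcoded:
        return
    diff = "\n".join(
        difflib.unified_diff(
            hardcoded.splitlines(),
            assembled.splitlines(),
            fromfile=f"{prompt_key} (hardcoded)",
            tofile=f"{prompt_key} (assembled)",
            lineterm="",
        )
    )
    raise AssertionError(f"Prompt parity failed for {prompt_key}:\n{diff}")
```

The equality check is what enforces byte parity; the diff is only a debugging aid and can come out empty when the strings differ solely in trailing newlines.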

10.2 Migration order

Lowest risk first: Prompts called infrequently (rerank, explanation, intent classifier)
Highest impact second: Extraction + ICD mapping (biggest example overhead, most caching benefit)
Conversation last: Most complex (5 placeholders, 5 phase contexts, forbidden phrases injection, emotional context)

10.3 Rollback

Every migrated prompt retains the old constant (commented out) for one release cycle. If the loader fails in production, a one-line revert restores the hardcoded constant.

11. What This Enables

Capability	How	When
Arabic support	Drop `ar/` YAML files in examples/	Phase 1 multilingual
Clinical review workflow	Export examples to xlsx, review, re-import	Already established pattern
A/B testing examples	Feature flag selects example version	Post-abstraction
Prompt caching	Stable prefix → Anthropic cache	Immediate after migration
Fine-tuning dataset	YAML examples → training pairs export	Post-Series A
Prompt analytics	Track which examples are loaded per call via Langfuse	Post-abstraction
New language in < 1 day	Translate YAML files, no Python changes	After Arabic proves the pattern

12. Success Criteria

Zero behavioral change — assembled prompts are byte-identical to hardcoded constants (verified by parity test)
All CI tests pass — no regressions in voice compliance, medical advice scanner, or existing prompt tests
Prompt caching active — Langfuse traces show cache_read tokens on repeated calls within the same conversation
39% input token reduction — measured via Anthropic usage report after 1 week in production
Arabic examples loadable — load_prompt("conversation", "ar") returns a valid assembled prompt (even before Arabic content exists, falls back to English)

13. References

Companion feature spec: prompt-abstraction-feature.md
Prompt audit (Session 35): 19 prompts across 8 files, 23 example blocks, ~9,900 chars of inline examples
Anthropic prompt caching docs: https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching
Voice rules (existing YAML pattern): config/voice_rules.yaml
Feature flags (existing YAML pattern): config/feature_flags.yaml
Medical advice review workflow: curaway-medical-advice-review.xlsx (Session 35 — xlsx export → Dr. Naidu review → approved strings)