Prompt Abstraction Layer — Steer Document

Date: 2026-04-10
Author: Srikanth Donthi (CPO/CTO) + Claude Code session 35
Status: Implemented (PR #92 merged, 14/14 parity verified)
Companion spec: prompt-abstraction-feature.md


1. Problem Statement

All 19 LLM prompts in the Curaway backend are hardcoded as Python string constants across 8 files. Few-shot examples (~28.5% of total prompt tokens) are embedded inline. This creates five problems:

  1. Language scaling: Adding Arabic requires editing every one of the 12 prompts that embed examples (see section 3). Adding a third language doubles the work. There's no way to load language-specific examples without code changes.

  2. Token waste: Every LLM call sends all examples regardless of relevance. The extraction prompt is 59% examples — 4,200 chars of radiology/lab report samples sent even when processing a dental X-ray.

  3. Clinical review friction: Dr. Naidu (Clinical Advisor) reviews wording in xlsx files, not Python string constants. Prompt examples should be reviewable the same way — in a human-readable format outside of code.

  4. No prompt caching: Anthropic's prompt caching bills cached input tokens at 10% of the normal input price. But caching requires a stable prefix. With examples embedded inline and dynamic placeholders scattered throughout, identifying the cacheable boundary is difficult.

  5. Version drift: V1 and V2 prompts coexist with different example counts (V1 has 0 conversation examples, V2 has 5). Phase contexts have inconsistent example coverage. No central inventory of what examples exist.


2. Design Principles

2.1 Separation of concerns

Base prompt (rules, schema, guardrails)  → Python constant or YAML
Few-shot examples                        → YAML, per-locale, per-category
Forbidden phrases                        → Already in voice_rules.yaml
Dynamic context (patient, phase, emotion) → Injected at runtime

The base prompt changes with code deploys. Examples change with clinical review cycles. Dynamic context changes per call. These three concerns have different change frequencies and different reviewers — they should not live in the same string.

2.2 Every call gets full examples — always

Prompt caching is a billing optimization, not a content optimization. The LLM sees the full prompt on every call. Caching means Anthropic's servers recognize the prefix and charge 10% for the cached portion. There is zero quality difference between a cache hit and a cache miss.

This means:

  • Examples are never "skipped" or "omitted" for any call
  • Language-specific examples are loaded from YAML and injected into the prompt before sending
  • The assembled prompt is byte-identical to today's hardcoded strings

2.3 Fail to English, never fail to nothing

If the requested locale's examples are missing, fall back to English. If English examples are missing, fail loud (raise, caught in CI). A prompt without examples is worse than a prompt with wrong-language examples — the model still benefits from structural guidance.

2.4 Prompt caching boundary

The cacheable prefix = base prompt + examples + forbidden phrases + phase context. Everything before {patient_context} is stable across calls within the same conversation phase. Patient context, conversation history, and the current message are the variable tail.

┌──────────────────────────────────┐
│  CACHED PREFIX (~80% of input)   │
│  ├─ Base prompt (rules, schema)  │
│  ├─ Few-shot examples (locale)   │
│  ├─ Forbidden phrases            │
│  └─ Phase context                │
├──────────────────────────────────┤
│  VARIABLE TAIL (~20% of input)   │
│  ├─ Patient context              │
│  ├─ Conversation history         │
│  └─ Current message              │
└──────────────────────────────────┘
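
The boundary can be located mechanically. The following is a minimal sketch, assuming the template marks the start of the variable tail with the {patient_context} placeholder; split_at_cache_boundary and CACHE_BOUNDARY are hypothetical names used for illustration, not functions from the codebase.

```python
CACHE_BOUNDARY = "{patient_context}"  # assumption: first per-call placeholder


def split_at_cache_boundary(
    template: str, boundary: str = CACHE_BOUNDARY
) -> tuple[str, str]:
    """Split a prompt template into (stable prefix, variable tail).

    The prefix is everything before the boundary placeholder; the tail
    starts at the placeholder itself. If the placeholder is absent, the
    whole template is treated as stable prefix.
    """
    index = template.find(boundary)
    if index == -1:
        return template, ""
    return template[:index], template[index:]


template = "RULES\nEXAMPLES\nFORBIDDEN\nPHASE\n{patient_context}\n{history}"
prefix, tail = split_at_cache_boundary(template)
print(f"cacheable share of template: {len(prefix) / len(template):.0%}")
```

Measuring the share on the real templates is one way to check the ~80% figure in the diagram against each prompt.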

3. Scope: What Moves, What Stays

Moves to YAML (few-shot examples)

| Prompt | Current location | Examples to extract |
|---|---|---|
| CURAWAY_SYSTEM_PROMPT_V2 | llm_conversation.py:210-228 | 5 WRONG/RIGHT conversation pairs |
| EXTRACTION_SYSTEM_PROMPT | clinical_extraction.py:55-207 | 2 rich examples (radiology + lab) |
| ICD_MAPPING_SYSTEM_PROMPT | clinical_extraction.py:252-278 | 1 coded example |
| ICD_MAPPING_SYSTEM_PROMPT_COT | clinical_extraction.py:350-398 | 1 COT reasoning example |
| FHIR_GENERATION_SYSTEM_PROMPT | clinical_extraction.py:429-477 | 1 FHIR resource example |
| EXPLANATION_SYSTEM_PROMPT | explanation.py:26-36 | 1 explanation example |
| CLINICAL_ANALYSIS_SYSTEM_PROMPT | match_analysis.py:21-38 | 1 analysis example |
| CLINICAL_ANALYSIS_COT | match_analysis.py:125-161 | 1 COT analysis example |
| RERANK_SYSTEM_PROMPT | match_analysis.py:187-195 | 1 rerank example |
| CLASSIFY_INTENT_PROMPT | orchestrator.py:61-67 | 6 intent classification examples |
| INTAKE_SYSTEM_PROMPT_V1 | intake.py:38-49 | 1 intake example |
| Phase: document_review (V1) | llm_conversation.py:390-404 | 2 response shape examples |

Total: 23 example blocks across 12 prompts.

Stays in Python (base prompts + dynamic logic)

  • Base prompt text (rules, schemas, guardrails)
  • _get_system_prompt() assembly function
  • _get_phase_contexts() phase routing
  • _get_forbidden_phrases_block() voice rules injection
  • All placeholder substitution ({patient_context}, {phase_context}, {emotional_context}, {forbidden_phrases_block})
  • All runtime template variables in MATCHER_USER_TEMPLATE
  • Feature flag routing (V1 vs V2, COT vs legacy)

Already external (no change needed)

  • config/voice_rules.yaml — forbidden phrases
  • config/guardrails.yaml — message classifier categories
  • Flagsmith feature flags — prompt version routing

4. Directory Structure

config/prompts/
  ├─ base/                              # Base prompts (rules + schema, no examples)
  │   ├─ conversation_v1.yaml
  │   ├─ conversation_v2.yaml
  │   ├─ clinical_extraction.yaml
  │   ├─ icd_mapping.yaml
  │   ├─ icd_mapping_cot.yaml
  │   ├─ fhir_generation.yaml
  │   ├─ intake_v1.yaml
  │   ├─ intake_v2.yaml
  │   ├─ explanation.yaml
  │   ├─ clinical_analysis.yaml
  │   ├─ clinical_analysis_cot.yaml
  │   ├─ rerank.yaml
  │   ├─ intent_classifier.yaml
  │   ├─ chat_extractor_v1.yaml
  │   ├─ chat_extractor_v2.yaml
  │   └─ requirement_matcher.yaml
  ├─ examples/                          # Few-shot examples, per locale
  │   ├─ en/
  │   │   ├─ conversation.yaml          # 5 WRONG/RIGHT pairs
  │   │   ├─ clinical_extraction.yaml   # 2 rich medical report examples
  │   │   ├─ icd_mapping.yaml           # 1 ICD/SNOMED coding example
  │   │   ├─ icd_mapping_cot.yaml       # 1 COT reasoning example
  │   │   ├─ fhir_generation.yaml       # 1 FHIR resource example
  │   │   ├─ intake.yaml                # 1 intake exchange example
  │   │   ├─ explanation.yaml           # 1 match explanation example
  │   │   ├─ clinical_analysis.yaml     # 1 clinical analysis example
  │   │   ├─ clinical_analysis_cot.yaml # 1 COT analysis example
  │   │   ├─ rerank.yaml                # 1 rerank example
  │   │   ├─ intent_classifier.yaml     # 6 intent routing examples
  │   │   └─ document_review.yaml       # 2 doc review response examples
  │   │
  │   └─ ar/                            # Arabic (future — Phase 1 multilingual)
  │       ├─ conversation.yaml
  │       ├─ clinical_extraction.yaml
  │       └─ ...
  ├─ phase_contexts/                    # Phase-specific context blocks
  │   ├─ v1/
  │   │   ├─ identify_procedure.yaml
  │   │   ├─ records_first.yaml
  │   │   ├─ intake.yaml
  │   │   ├─ document_review.yaml
  │   │   └─ general.yaml
  │   └─ v2/
  │       ├─ identify_procedure.yaml
  │       ├─ records_first.yaml
  │       ├─ intake.yaml
  │       ├─ document_review.yaml
  │       └─ general.yaml
  └─ forbidden/                         # Already exists as voice_rules.yaml
      └─ (symlink or reference to config/voice_rules.yaml)

YAML format

# config/prompts/examples/en/conversation.yaml
---
description: "Conversation few-shot examples - English"
locale: en
prompt_key: conversation
version: 2

examples:
  - label: "Patient mentions pain and waiting"
    user: "My left knee hurts badly. I can't walk my dog anymore. I'm tired of waiting 18 months for this surgery."
    wrong: "It's completely natural to feel that way - exploring treatment options abroad is a big step. That's exactly why Curaway exists. Would you like to see how we evaluate providers?"
    right: "Eighteen months waiting while you can't even walk your dog - that's a long time to lose the small things that matter. Let's see what we can do to get you moving again. Has your doctor confirmed it's a **Total Knee Replacement** you need?"

  - label: "Patient scared of safety"
    user: "I'm scared. Is this safe to do abroad?"
    wrong: "Don't worry, you're in good hands! All our hospitals are excellent."
    right: "Fair question - going abroad for surgery is a real decision. Every hospital we work with is internationally accredited (JCI or equivalent), and you'll see their outcomes data, surgeon credentials, and patient reviews before making any choice. Would you like to start by seeing a few hospitals, or would you rather understand the process first?"

  # ... 3 more pairs

# config/prompts/examples/en/clinical_extraction.yaml
---
description: "Clinical extraction few-shot examples - English"
locale: en
prompt_key: clinical_extraction
version: 1

examples:
  - label: "Radiology report - knee osteoarthritis"
    input: |
      Report: X-ray Right Knee
      Findings: Moderate to severe osteoarthritis...
    output: |
      {
        "entities": [...],
        "observations": [...],
        ...
      }

  - label: "Lab report - CBC + metabolic panel"
    input: |
      Complete Blood Count:
      WBC: 7.2 x10^9/L ...
    output: |
      {
        "entities": [...],
        "observations": [...],
        ...
      }

Why YAML over JSON

  • Multiline strings are readable (| block scalar)
  • Comments allowed (useful for reviewer notes)
  • Same tooling as voice_rules.yaml and feature_flags.yaml
  • yaml.safe_load() is already used throughout the codebase
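
Because the loader falls back when "schema validation fails", it helps to pin down what a valid examples file looks like. The sketch below is one possible check, written against the dict that yaml.safe_load() would return so it needs no YAML dependency itself. The required fields and the two example shapes are taken from the samples above; validate_examples_doc is a hypothetical helper, and other shapes (for instance the intent classifier examples) are not covered here.

```python
REQUIRED_TOP_LEVEL = ("locale", "prompt_key", "version", "examples")

# Each example must contain all keys of at least one shape
# (assumption: only the two shapes shown in this section exist).
EXAMPLE_SHAPES = (
    {"label", "user", "wrong", "right"},  # WRONG/RIGHT conversation pair
    {"label", "input", "output"},         # input/output extraction pair
)


def validate_examples_doc(doc: object) -> list[str]:
    """Return a list of problems; an empty list means the document is valid."""
    if not isinstance(doc, dict):
        return ["top level must be a mapping"]
    problems = [f"missing field: {key}" for key in REQUIRED_TOP_LEVEL if key not in doc]
    examples = doc.get("examples")
    if not isinstance(examples, list) or not examples:
        problems.append("examples must be a non-empty list")
        return problems
    for i, example in enumerate(examples):
        if not isinstance(example, dict):
            problems.append(f"examples[{i}] must be a mapping")
            continue
        keys = set(example)
        if not any(shape <= keys for shape in EXAMPLE_SHAPES):
            problems.append(f"examples[{i}] matches no known shape")
    return problems
```

Returning a list of problems rather than raising lets the caller decide between a hard fail (English) and a logged fallback (other locales).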

5. Prompt Assembly

5.1 The loader

# app/services/prompt_loader.py

def load_prompt(
    prompt_key: str,
    locale: str = "en",
    version: str | None = None,
) -> str:
    """Assemble a complete prompt from base + examples.

    Args:
        prompt_key: e.g., "conversation", "clinical_extraction"
        locale: ISO 639-1 code (e.g., "en", "ar")
        version: optional version suffix (e.g., "v1", "v2", "cot")

    Returns:
        Assembled prompt string ready for LLM.

    Raises:
        PromptLoadError: if the base prompt or the English fallback
            examples are missing (hard failure).
        Never raises for a missing non-English locale; falls back to "en".
    """

5.2 Assembly order

1. Load base prompt YAML → base_text
2. Load examples YAML (requested locale, fallback to "en") → examples_text
3. Format examples into the prompt's expected shape:
   - WRONG/RIGHT pairs → "EXAMPLES OF HOW TO RESPOND:\n\n..."
   - Input/Output pairs → "Example Input 1:\n...\nExample Output 1:\n..."
   - Intent examples → "Examples:\n..."
4. Inject examples into base_text at the {examples} placeholder
5. Return assembled string
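
Steps 3 to 5 can be sketched as follows. This is an illustration under stated assumptions, not the shipped loader: format_examples and inject_examples are hypothetical names, the section headers come from the list above, and the exact per-pair layout of WRONG/RIGHT examples is a guess (the real layout must match the legacy constants for parity).

```python
def format_examples(examples: list[dict]) -> str:
    """Render parsed examples into prompt text (step 3)."""
    if not examples:
        return ""
    if "wrong" in examples[0]:
        # Header is from the spec; the per-pair layout is an assumption.
        pairs = [
            f'Patient: "{ex["user"]}"\nWRONG: "{ex["wrong"]}"\nRIGHT: "{ex["right"]}"'
            for ex in examples
        ]
        return "EXAMPLES OF HOW TO RESPOND:\n\n" + "\n\n".join(pairs)
    # Input/output pairs are numbered from 1.
    blocks = [
        f"Example Input {n}:\n{ex['input'].rstrip()}\n"
        f"Example Output {n}:\n{ex['output'].rstrip()}"
        for n, ex in enumerate(examples, start=1)
    ]
    return "\n\n".join(blocks)


def inject_examples(base_text: str, examples: list[dict]) -> str:
    """Insert the rendered examples at the {examples} placeholder (step 4)."""
    if "{examples}" not in base_text:
        raise ValueError("base prompt has no {examples} placeholder")
    # str.replace instead of str.format: base prompts contain literal JSON
    # braces and other placeholders that must survive untouched.
    return base_text.replace("{examples}", format_examples(examples))
```

Using str.replace keeps {patient_context} and the other runtime placeholders intact for the caller to substitute later.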

5.3 Placeholder contract

Each base prompt YAML has a placeholders field listing what the caller must substitute at runtime:

# config/prompts/base/conversation_v2.yaml
placeholders:
  - name: "{examples}"
    source: "prompt_loader (from examples/)"
    injected_by: "prompt_loader.load_prompt()"
  - name: "{forbidden_phrases_block}"
    source: "config/voice_rules.yaml"
    injected_by: "_get_forbidden_phrases_block()"
  - name: "{emotional_context}"
    source: "emotional_state.py"
    injected_by: "generate_response_streaming()"
  - name: "{phase_context}"
    source: "config/prompts/phase_contexts/"
    injected_by: "generate_response_streaming()"
  - name: "{patient_context}"
    source: "case_orchestrator.py"
    injected_by: "generate_response_streaming()"

This makes the contract explicit — CI can verify that every placeholder listed in the YAML is substituted before the prompt reaches the LLM.
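
One way CI could enforce the contract is to compare the placeholders that actually appear in the base text with the names declared in the YAML. The sketch below assumes placeholders follow the {snake_case} pattern used throughout this document; check_placeholder_contract is a hypothetical helper, and literal {value}-style tokens inside JSON schema text would need to be excluded separately.

```python
import re

PLACEHOLDER = re.compile(r"\{[a-z_]+\}")


def check_placeholder_contract(base_text: str, declared: list[str]) -> list[str]:
    """Compare placeholders used in the base text with those declared in YAML."""
    used = set(PLACEHOLDER.findall(base_text))
    listed = set(declared)
    problems = [f"used but not declared: {name}" for name in sorted(used - listed)]
    problems += [f"declared but not used: {name}" for name in sorted(listed - used)]
    return problems


text = "Rules.\n{examples}\n{forbidden_phrases_block}\n{patient_context}"
declared = ["{examples}", "{forbidden_phrases_block}", "{patient_context}"]
print(check_placeholder_contract(text, declared))  # []
```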


6. Failsafe Design

6.1 Failure modes and responses

| Failure | Detection | Response | Patient impact |
|---|---|---|---|
| Base prompt YAML missing | load_prompt() raises PromptLoadError | Hard fail: blocks deployment. CI test catches this. | None: never reaches production. |
| Base prompt YAML malformed | yaml.safe_load() raises | Hard fail: same as above. | None. |
| Examples YAML missing for requested locale | _load_examples() returns None | Fall back to English examples. Log warning. | None: model gets English examples, responds in patient's language anyway. |
| Examples YAML missing for English (fallback) | _load_examples("en") returns None | Hard fail: PromptLoadError. CI catches. | None: never reaches production. |
| Examples YAML malformed | yaml.safe_load() raises or schema validation fails | Fall back to English. Log error. | None: model gets valid English examples. |
| Examples YAML empty | Loaded but examples: [] | Fall back to English. Log warning. | None. |
| Placeholder not substituted | CI test scans assembled prompt for {...} patterns | Hard fail in CI. At runtime, _validate_prompt() logs error but sends prompt anyway (raw placeholder is harmless noise). | Minimal: model ignores unrecognized {placeholder} text. |
| YAML file read permission error | OSError from Path.read_text() | Fall back to English. Log error. | None. |
| prompt_loader.py itself has a bug | Unit tests cover all code paths. Integration test assembles every prompt in CI. | Hard fail in CI. | None: never reaches production. |

6.2 The fallback chain

Requested locale (e.g., "ar")
  ├─ YAML exists + valid → use it
  ├─ YAML missing or malformed
  │   │
  │   └─ Fall back to "en"
  │       │
  │       ├─ "en" YAML exists + valid → use it + log warning
  │       │
  │       └─ "en" YAML missing → PromptLoadError (blocks CI)
  └─ All examples loading fails at runtime (shouldn't happen once CI passes)
      └─ Send base prompt WITHOUT examples + log error
          (last resort: the model still works; examples improve quality but aren't required)
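
The first two branches of the chain can be sketched as below. This is a minimal illustration, assuming a reader callable that returns the parsed examples list or None when the file is missing, malformed, or unreadable; load_examples_with_fallback and the load parameter are hypothetical, and the PromptLoadError class is redefined locally only so the snippet is self-contained.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)


class PromptLoadError(Exception):
    """Raised when no usable examples exist, including the English fallback."""


def load_examples_with_fallback(
    prompt_key: str,
    locale: str,
    load: Callable[[str, str], Optional[list]],
) -> list:
    """Return examples for `locale`, falling back to "en".

    `load(prompt_key, locale)` is a hypothetical reader that returns the
    parsed examples list, or None if the file cannot be used.
    """
    examples = load(prompt_key, locale)
    if examples:  # None and [] both trigger the fallback
        return examples
    if locale != "en":
        logger.warning(
            "No usable %r examples for %s; falling back to 'en'", locale, prompt_key
        )
        examples = load(prompt_key, "en")
        if examples:
            return examples
    raise PromptLoadError(f"English examples missing for {prompt_key}")
```

The last-resort branch (send the base prompt without examples) is left to the caller, which can catch PromptLoadError at runtime while CI lets it propagate.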

6.3 Runtime validation

Before sending any assembled prompt to the LLM:

import logging
import re

logger = logging.getLogger(__name__)

# Placeholders look like {snake_case}; empty braces and JSON objects never match.
_PLACEHOLDER_RE = re.compile(r"\{[a-z_]+\}")
# Known safe: {value} patterns inside JSON schema examples.
_SAFE_PLACEHOLDERS = {"{value}"}


def _validate_prompt(prompt: str, prompt_key: str) -> str:
    """Check for unresolved placeholders and log if found."""
    unresolved = _PLACEHOLDER_RE.findall(prompt)
    problems = [p for p in unresolved if p not in _SAFE_PLACEHOLDERS]
    if problems:
        logger.error(
            "Unresolved placeholders in %s: %s; sending anyway",
            prompt_key, problems,
        )
    return prompt

This is a log-and-continue strategy, not a hard fail — because an unresolved placeholder is less harmful than blocking the patient's conversation.


7. Prompt Caching Integration

7.1 How Anthropic prompt caching works

  • Client sends the full prompt with cache_control markers
  • Anthropic hashes the prefix up to the marker
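  • If the hash matches a cached version, the cached tokens are billed at 10% of the base input price (writing a new 5-minute cache entry costs 25% more than base input)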
  • Cache TTL: 5 minutes (resets on each hit)
  • Cache is scoped to the Anthropic workspace (platform-wide)

7.2 Where to place cache markers

The assembled prompt has this structure:

[base prompt] + [examples] + [forbidden phrases] + [phase context]
────────────────────── CACHEABLE PREFIX ──────────────────────
[patient context] + [conversation history] + [current message]
────────────────────── VARIABLE TAIL ──────────────────────────

Place cache_control: {"type": "ephemeral"} on the last content block of the cacheable prefix. With the Anthropic Messages API the system prompt is passed through the top-level system parameter rather than as a "system" role message, so in practice the marker goes on the system text block:

system = [
    {
        "type": "text",
        "text": assembled_prompt,  # base + examples + forbidden + phase
        "cache_control": {"type": "ephemeral"},
    },
]
messages = [
    # ... conversation history (not cached)
    {"role": "user", "content": current_message},
]

7.3 Cache hit rate projection

| Scenario | Cache warm? | Why |
|---|---|---|
| Same patient, same case, next message | Yes | Same system prompt prefix, <5 min between messages |
| Different patient, same locale + phase | Yes | Identical system prompt prefix (platform-scoped cache) |
| Same patient, phase transition | Partial | Phase context changes; base + examples stay cached only if an earlier cache breakpoint covers them |
| First call of the day | Miss | Cold start, full price |
| After code deploy | Miss | Base prompt changed, new hash |

At even modest traffic (10+ active conversations), the cache is warm continuously for English. Each additional locale adds its own cache entry — Arabic conversations share a cache, English conversations share a cache.

7.4 Cost projection

| Metric | Current (no caching) | With abstraction + caching |
|---|---|---|
| Conversation input tokens/month (100 cases) | ~930K | ~580K effective (37% reduction) |
| Extraction input tokens/month | ~356K | ~178K effective (50% reduction) |
| Total monthly input tokens | ~1.58M | ~962K |
| Monthly LLM cost (Haiku-heavy mix) | ~$14 | ~$9 |
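
The document does not state how these projections were derived. As a rough cross-check, the sketch below models effective tokens as a function of the cached share of each call and the cache hit rate, assuming cache reads are billed at 10% and ignoring cache-write surcharges. Both function names are hypothetical.

```python
def effective_input_tokens(
    total_tokens: float,
    cached_fraction: float,
    hit_rate: float,
    read_multiplier: float = 0.10,
) -> float:
    """Billing-equivalent input tokens under prompt caching.

    cached_fraction: share of each call's input inside the cached prefix.
    hit_rate: share of calls that hit a warm cache.
    read_multiplier: price of a cached token relative to an uncached one.
    Cache-write surcharges are ignored in this sketch.
    """
    discounted = total_tokens * cached_fraction * hit_rate
    return total_tokens - discounted * (1.0 - read_multiplier)


def implied_cached_share(
    before: float, after: float, read_multiplier: float = 0.10
) -> float:
    """Invert the formula: cached_fraction * hit_rate implied by a reduction."""
    return (1.0 - after / before) / (1.0 - read_multiplier)


print(round(implied_cached_share(930_000, 580_000), 2))  # 0.42 (conversation)
print(round(implied_cached_share(356_000, 178_000), 2))  # 0.56 (extraction)
```

Under these assumptions, the conversation row corresponds to roughly 42% of input tokens being billed at the cached rate, and the extraction row to roughly 56%.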

8. Scoping: Platform, Case, or Patient?

8.1 What is scoped to what

| Component | Scope | Changes when |
|---|---|---|
| Base prompt text | Platform | Code deploy |
| Few-shot examples | Platform × locale | Clinical review cycle |
| Forbidden phrases | Platform | Voice rules update |
| Phase context | Platform × version | Feature flag change |
| Patient context | Case | Each message (grows) |
| Conversation history | Case | Each message (grows) |
| Locale selection | Case (from preferred_locale) | Set at intake; updated on a language switch (8.3) |
| Emotional context | Message | Computed per message |

8.2 Locale detection

The patient's locale is detected at the case level:

  1. First message arrives
  2. Claude identifies the language as part of its normal response
  3. chat_extractor extracts language from the response
  4. case.preferred_locale is set (e.g., "ar")
  5. All subsequent LLM calls for that case load Arabic examples

If locale detection fails, default to English. The patient can override by saying "please respond in Arabic" — the extractor picks this up and updates the locale.

8.3 Mid-conversation locale switch

If a patient switches languages mid-conversation (e.g., starts in English, switches to Arabic):

  1. Extractor detects the new language
  2. case.preferred_locale is updated
  3. Next LLM call loads Arabic examples
  4. Prompt cache misses once (new prefix), then warms the Arabic cache
  5. No patient-visible disruption — the model adapts naturally

9. Edge Cases

9.1 Mixed-language documents

A patient uploads a report with Arabic headers and English lab values.

  • OCR: PyMuPDF / Unstructured.io extracts both scripts correctly
  • Extraction prompt: English few-shot examples still work — Claude handles multilingual input with English examples. The examples demonstrate the output schema, not the input language.
  • ICD mapping: ICD-10 codes are language-neutral. Claude maps Arabic condition names to the same codes.
  • Future improvement: Arabic extraction examples would improve accuracy for Arabic-specific medical terminology.

9.2 Locale with no examples yet

A patient speaks Hindi. No Hindi examples exist.

  • Loader falls back to English examples
  • Model still responds in Hindi (Claude is multilingual)
  • Quality is slightly lower for Hindi-specific medical terms
  • Logged as a warning for prioritization

9.3 Very long examples (extraction prompt)

The clinical extraction examples are 4,200 chars (59% of the prompt). These are detailed input/output pairs with FHIR-like JSON structures.

  • These examples are critical for output quality — without them, the model's JSON structure drifts
  • They are also the biggest caching win — stable across all calls, cached at 10% cost
  • They should NOT be compressed further — the richness is intentional

9.4 COT examples reference reasoning steps

The ICD mapping COT and clinical analysis COT examples include structured reasoning_steps arrays. These are designed for:

  1. LLM output quality (model follows the reasoning pattern)
  2. LangSmith eval datasets (steps stored in events table)
  3. MedGemma fine-tuning (steps become training data)

If Arabic COT examples are needed, they should follow the same 7-step structure with Arabic medical terminology. The step labels (BODY_SYSTEM, CONDITION_CATEGORY, etc.) stay in English — they're machine-readable keys, not patient-facing.

9.5 Concurrent YAML file update during request

If a YAML file is updated on disk while a request is in flight:

  • The loader reads the file at call time (no caching of YAML content in memory by default)
  • The request gets the new content — this is fine
  • If the file is partially written (rare), yaml.safe_load() may fail → falls back to English
  • For production safety, add an in-memory cache with 60s TTL (matches the Flagsmith cache pattern)
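
A 60-second in-memory cache could look like the sketch below. It is a generic illustration, not the Flagsmith cache implementation mentioned above; TTLCache is a hypothetical class, it is not thread-safe as written, and the injectable clock exists only to make expiry testable.

```python
import time
from typing import Callable, Dict, Tuple


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds."""

    def __init__(
        self,
        ttl_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._entries: Dict[str, Tuple[float, object]] = {}

    def get_or_load(self, key: str, loader: Callable[[], object]) -> object:
        """Return the cached value for key, calling loader() if absent or stale."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]
        value = loader()  # if this raises, any stale entry is left in place
        self._entries[key] = (now, value)
        return value
```

A loader would call something like cache.get_or_load(str(path), lambda: parse(path)), so a partially written file affects at most one refresh and the previous entry is not overwritten when parsing raises.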

9.6 Prompt version rollback

If the V2 prompt regresses and we need to roll back to V1:

  • Set Flagsmith flag prompt_version to "v1_original"
  • V1 base prompt loads from config/prompts/base/conversation_v1.yaml
  • V1 examples load from the same examples/en/conversation.yaml (examples are version-agnostic — WRONG/RIGHT pairs work for both)
  • V1 phase contexts load from config/prompts/phase_contexts/v1/
  • Zero code changes needed

9.7 New prompt added in future

When a developer adds a new LLM prompt (e.g., for case porting):

  1. Add base prompt YAML to config/prompts/base/
  2. Add English examples to config/prompts/examples/en/
  3. Use load_prompt() in the Python code
  4. CI test auto-discovers all YAML files and validates them

The CI test uses glob("config/prompts/base/*.yaml") to find all prompts and verifies each has a corresponding English example file. No manual registration needed.
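
The discovery check could be sketched as follows. It assumes that a versioned base file such as conversation_v2.yaml maps to examples/en/conversation.yaml by dropping the _vN suffix, which matches the directory listing in section 4 but is not stated explicitly. find_missing_english_examples and examples_key are hypothetical names, and base prompts that intentionally ship without examples would need an explicit allow-list, which this sketch omits.

```python
import re
import tempfile
from pathlib import Path


def examples_key(base_file: Path) -> str:
    """Map a base prompt file to its examples key (assumed: drop a _vN suffix)."""
    return re.sub(r"_v\d+$", "", base_file.stem)


def find_missing_english_examples(prompts_dir: Path) -> list[str]:
    """List base prompts that have no matching file under examples/en/."""
    missing = []
    for base_file in sorted((prompts_dir / "base").glob("*.yaml")):
        expected = prompts_dir / "examples" / "en" / f"{examples_key(base_file)}.yaml"
        if not expected.is_file():
            missing.append(base_file.name)
    return missing


# Demo against a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "base").mkdir()
    (root / "examples" / "en").mkdir(parents=True)
    for name in ("conversation_v1.yaml", "conversation_v2.yaml", "rerank.yaml"):
        (root / "base" / name).write_text("text: placeholder\n")
    (root / "examples" / "en" / "conversation.yaml").write_text("examples: []\n")
    demo_missing = find_missing_english_examples(root)

print(demo_missing)  # ['rerank.yaml']
```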


10. Migration Strategy

10.1 Zero-downtime migration

The migration replaces hardcoded string constants with load_prompt() calls. The assembled output is byte-identical.

Verification: A CI test (test_prompt_migration_parity.py) assembles each prompt via the new loader and asserts character-for-character equality with the old hardcoded constant. This test is temporary and is removed after the migration PR merges.
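
When a parity assertion fails on a prompt that is several thousand characters long, a plain equality check gives little to work with. The helper below is a sketch of how such a test might report the first mismatch; first_difference and assert_parity are hypothetical names and are not taken from test_prompt_migration_parity.py.

```python
from typing import Optional


def first_difference(expected: str, actual: str) -> Optional[int]:
    """Return the index of the first differing character, or None if equal."""
    for index, (a, b) in enumerate(zip(expected, actual)):
        if a != b:
            return index
    if len(expected) != len(actual):
        # One string is a strict prefix of the other.
        return min(len(expected), len(actual))
    return None


def assert_parity(prompt_key: str, expected: str, actual: str) -> None:
    """Fail with a readable window around the first mismatch."""
    index = first_difference(expected, actual)
    if index is not None:
        lo, hi = max(0, index - 20), index + 20
        raise AssertionError(
            f"{prompt_key}: mismatch at char {index}: "
            f"expected {expected[lo:hi]!r}, got {actual[lo:hi]!r}"
        )
```

Showing repr() of the surrounding window makes whitespace and newline differences visible, which are the most likely source of drift when moving strings into YAML block scalars.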

10.2 Migration order

  1. Lowest risk first: Prompts called infrequently (rerank, explanation, intent classifier)
  2. Highest impact second: Extraction + ICD mapping (biggest example overhead, most caching benefit)
  3. Conversation last: Most complex (5 placeholders, 5 phase contexts, forbidden phrases injection, emotional context)

10.3 Rollback

Every migrated prompt retains the old constant (commented out) for one release cycle. If the loader fails in production, a one-line revert restores the hardcoded constant.


11. What This Enables

| Capability | How | When |
|---|---|---|
| Arabic support | Drop ar/ YAML files in examples/ | Phase 1 multilingual |
| Clinical review workflow | Export examples to xlsx, review, re-import | Already established pattern |
| A/B testing examples | Feature flag selects example version | Post-abstraction |
| Prompt caching | Stable prefix → Anthropic cache | Immediate after migration |
| Fine-tuning dataset | YAML examples → training pairs export | Post-Series A |
| Prompt analytics | Track which examples are loaded per call via Langfuse | Post-abstraction |
| New language in < 1 day | Translate YAML files, no Python changes | After Arabic proves the pattern |

12. Success Criteria

  1. Zero behavioral change — assembled prompts are byte-identical to hardcoded constants (verified by parity test)
  2. All CI tests pass — no regressions in voice compliance, medical advice scanner, or existing prompt tests
  3. Prompt caching active — Langfuse traces show cache_read tokens on repeated calls within the same conversation
  4. 39% input token reduction — measured via Anthropic usage report after 1 week in production
  5. Arabic examples loadable: load_prompt("conversation", "ar") returns a valid assembled prompt (even before Arabic content exists, it falls back to English)

13. References

  • Companion feature spec: prompt-abstraction-feature.md
  • Prompt audit (Session 35): 19 prompts across 8 files, 23 example blocks, ~9,900 chars of inline examples
  • Anthropic prompt caching docs: https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching
  • Voice rules (existing YAML pattern): config/voice_rules.yaml
  • Feature flags (existing YAML pattern): config/feature_flags.yaml
  • Medical advice review workflow: curaway-medical-advice-review.xlsx (Session 35 — xlsx export → Dr. Naidu review → approved strings)