Prompt Abstraction Layer: Feature Spec

Date: 2026-04-10
Status: Implemented (PR #92 merged)
Companion steer: ai-steer/prompt-abstraction-steer.md

1. Summary

Extract all 23 inline few-shot example blocks from 12 Python prompt constants into YAML files under config/prompts/. Build a prompt_loader.py service that assembles prompts from base + locale-specific examples at runtime. Enable Anthropic prompt caching on the stable prefix. Zero behavioral change: assembled output is byte-identical to today's hardcoded strings.

2. File-by-File Change List

New files

File	Description	LOC
`app/services/prompt_loader.py`	Prompt assembly service	~120
`config/prompts/base/conversation_v1.yaml`	V1 conversation base	~80
`config/prompts/base/conversation_v2.yaml`	V2 conversation base	~130
`config/prompts/base/clinical_extraction.yaml`	Extraction base	~30
`config/prompts/base/icd_mapping.yaml`	ICD mapping base	~25
`config/prompts/base/icd_mapping_cot.yaml`	ICD COT base	~40
`config/prompts/base/fhir_generation.yaml`	FHIR generation base	~25
`config/prompts/base/intake_v1.yaml`	Intake V1 base	~25
`config/prompts/base/intake_v2.yaml`	Intake V2 base	~12
`config/prompts/base/explanation.yaml`	Explanation base	~15
`config/prompts/base/clinical_analysis.yaml`	Analysis base	~15
`config/prompts/base/clinical_analysis_cot.yaml`	Analysis COT base	~40
`config/prompts/base/rerank.yaml`	Rerank base	~12
`config/prompts/base/intent_classifier.yaml`	Intent classifier base	~15
`config/prompts/base/chat_extractor_v1.yaml`	Chat extractor V1	~15
`config/prompts/base/chat_extractor_v2.yaml`	Chat extractor V2	~20
`config/prompts/base/requirement_matcher.yaml`	Requirement matcher base	~10
`config/prompts/examples/en/conversation.yaml`	5 WRONG/RIGHT pairs	~60
`config/prompts/examples/en/clinical_extraction.yaml`	2 rich examples	~160
`config/prompts/examples/en/icd_mapping.yaml`	1 coding example	~30
`config/prompts/examples/en/icd_mapping_cot.yaml`	1 COT example	~50
`config/prompts/examples/en/fhir_generation.yaml`	1 FHIR example	~40
`config/prompts/examples/en/intake.yaml`	1 intake example	~15
`config/prompts/examples/en/explanation.yaml`	1 explanation example	~15
`config/prompts/examples/en/clinical_analysis.yaml`	1 analysis example	~20
`config/prompts/examples/en/clinical_analysis_cot.yaml`	1 COT example	~40
`config/prompts/examples/en/rerank.yaml`	1 rerank example	~12
`config/prompts/examples/en/intent_classifier.yaml`	6 intent examples	~15
`config/prompts/examples/en/document_review.yaml`	2 response examples	~20
`config/prompts/phase_contexts/v1/*.yaml`	5 V1 phase contexts	~5 files
`config/prompts/phase_contexts/v2/*.yaml`	5 V2 phase contexts	~5 files
`tests/test_prompt_loader.py`	Loader unit tests	~120
`tests/test_prompt_migration_parity.py`	Parity test (temporary)	~80

Modified files

File	Change
`app/agents/llm_conversation.py`	Replace inline V1/V2 constants with `load_prompt()`. Update `_get_system_prompt()` to use loader. Update `_get_phase_contexts()` to load from YAML. Add `cache_control` to system message. Keep old constants commented out for one release.
`app/agents/prompts/clinical_extraction.py`	Replace 4 inline constants with `load_prompt()` calls.
`app/agents/prompts/intake.py`	Replace 2 inline constants with `load_prompt()` calls.
`app/agents/prompts/explanation.py`	Replace 1 inline constant with `load_prompt()` call.
`app/agents/prompts/match_analysis.py`	Replace 3 inline constants with `load_prompt()` calls.
`app/agents/orchestrator.py`	Replace `CLASSIFY_INTENT_PROMPT` with `load_prompt()` call.
`app/services/chat_extractor.py`	Replace 2 inline constants with `load_prompt()` calls.
`app/services/requirement_matcher.py`	Replace 1 inline constant with `load_prompt()` call. `MATCHER_USER_TEMPLATE` stays as Python (runtime template vars).

3. `prompt_loader.py`: Public Surface

from app.services.prompt_loader import load_prompt, load_phase_context

# Load a complete prompt (base + locale examples)
prompt: str = load_prompt(
    prompt_key="conversation",       # matches base/conversation_v2.yaml
    locale="ar",                     # ISO 639-1
    version="v2",                    # optional, defaults to latest
)

# Load a phase context
context: str = load_phase_context(
    phase="records_first",           # phase name
    version="v2",                    # v1 or v2
)
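The "defaults to latest" behavior of the version argument could be resolved along the following lines. This is a minimal sketch, not the shipped loader: `resolve_base_name` and its `available` argument (the set of file stems found in config/prompts/base/) are hypothetical names, and it assumes versioned files always end in `_v<number>`.

```python
import re
from typing import Optional


def resolve_base_name(prompt_key: str, version: Optional[str], available: set) -> str:
    """Pick the base file stem for a prompt key and an optional version."""
    if version is not None:
        name = f"{prompt_key}_{version}"
        if name not in available:
            raise FileNotFoundError(f"no base prompt named {name}")
        return name
    # No version requested: collect numeric versions such as conversation_v1, conversation_v2.
    pattern = re.compile(rf"^{re.escape(prompt_key)}_v(\d+)$")
    versions = []
    for stem in available:
        match = pattern.match(stem)
        if match:
            versions.append(int(match.group(1)))
    if versions:
        return f"{prompt_key}_v{max(versions)}"
    if prompt_key in available:
        return prompt_key  # unversioned prompt such as rerank
    raise FileNotFoundError(f"no base prompt for key {prompt_key}")
```

Comparing versions as integers rather than strings keeps a future `v10` ahead of `v9`.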

Internal functions

def _load_yaml(path: Path) -> dict:
    """Load + parse YAML. Raises PromptLoadError on missing/malformed."""

def _load_base(prompt_key: str, version: str | None) -> str:
    """Load base prompt text from config/prompts/base/."""

def _load_examples(prompt_key: str, locale: str) -> str | None:
    """Load examples for locale. Returns None if missing (caller falls back)."""

def _format_examples(prompt_key: str, examples: list[dict]) -> str:
    """Format raw example dicts into the prompt's expected text shape.
    Different prompts use different formats:
    - conversation: WRONG/RIGHT pairs
    - extraction: Input/Output blocks
    - intent: one-line examples
    """

def _validate_prompt(prompt: str, prompt_key: str) -> str:
    """Check for unresolved {placeholders}. Log errors, never block."""

@lru_cache(maxsize=64)
def _cached_load(path_str: str, mtime: float) -> dict:
    """In-memory YAML cache keyed on path + mtime.
    A modified file has a new mtime, so the lookup misses and the file is re-read.
    The stale entry is not removed right away; it ages out under the LRU limit.
    maxsize=64 covers all prompts × locales with room to spare."""
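The per-prompt formatting that `_format_examples` describes might look roughly like this. It is a sketch under assumptions: the field names (`wrong`, `right`, `input`, `output`, `message`, `intent`) and the exact separators are hypothetical, and the real shapes must be whatever reproduces the original constants byte for byte.

```python
def format_examples(prompt_key: str, examples: list) -> str:
    """Render example dicts in the text shape a given prompt expects."""
    if prompt_key == "conversation":
        # WRONG/RIGHT pairs, one pair per block
        blocks = [f"WRONG: {ex['wrong']}\nRIGHT: {ex['right']}" for ex in examples]
        return "\n\n".join(blocks)
    if prompt_key == "clinical_extraction":
        # Input/Output blocks
        blocks = [f"Input:\n{ex['input']}\n\nOutput:\n{ex['output']}" for ex in examples]
        return "\n\n".join(blocks)
    if prompt_key == "intent_classifier":
        # One line per example
        return "\n".join(f'"{ex["message"]}" -> {ex["intent"]}' for ex in examples)
    raise ValueError(f"no example format registered for {prompt_key}")
```

Raising on an unknown key, rather than guessing a default format, keeps a newly added prompt from silently shipping with the wrong example layout.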

Error type

class PromptLoadError(Exception):
    """Raised when a required prompt file is missing or malformed.
    Only raised for base prompts and English examples (hard requirements).
    Never raised for non-English locale examples (falls back to English)."""
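The fallback rule in that docstring can be illustrated with a small sketch. The `read` callable stands in for the YAML lookup and is hypothetical, as is the function name. It is assumed here that a malformed non-English file is treated the same as a missing one.

```python
from typing import Callable, Optional


class PromptLoadError(Exception):
    """Stand-in for the loader's error type."""


def load_examples_with_fallback(
    prompt_key: str,
    locale: str,
    read: Callable[[str, str], Optional[str]],
) -> str:
    """Return examples for the locale, falling back to English."""
    if locale != "en":
        try:
            text = read(prompt_key, locale)
        except PromptLoadError:
            text = None  # malformed non-English file: treat like a missing one
        if text:  # missing or empty falls through to English
            return text
    text = read(prompt_key, "en")
    if not text:
        # English is the hard requirement, so this is the only place that raises.
        raise PromptLoadError(f"English examples missing for {prompt_key}")
    return text
```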

Caching strategy (in-memory)

YAML files are small (< 10KB each) and rarely change. Cache them in memory using lru_cache keyed on (file_path, file_mtime). When a file is modified its mtime changes, so the next lookup misses the cache and re-reads the file. The stale entry is not removed immediately; it ages out under the maxsize=64 LRU limit. This avoids reading disk on every LLM call while still picking up changes without a restart.

There is no explicit TTL: an mtime change forces a reload. For production stability, also add a 60-second floor so a file is not re-statted more than once per minute. The trade-off is that an edited file can take up to 60 seconds to be picked up:

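import time
from pathlib import Path

# path → (mtime, last_checked), where last_checked is a time.monotonic() reading
_stat_cache: dict[str, tuple[float, float]] = {}

def _get_mtime(path: Path) -> float:
    """Return the file's mtime, re-statting at most once every 60 seconds."""
    now = time.monotonic()
    cached = _stat_cache.get(str(path))
    if cached and now - cached[1] < 60:
        return cached[0]  # checked recently: skip the stat call
    mtime = path.stat().st_mtime
    _stat_cache[str(path)] = (mtime, now)
    return mtime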
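To make the mtime-keyed cache concrete, here is a small self-contained demonstration. It is a sketch rather than the loader itself: it reads plain text instead of parsing YAML, omits the 60-second stat floor, and `cached_read`, `read_current` and `demo` are hypothetical names.

```python
import os
import tempfile
from functools import lru_cache
from pathlib import Path


@lru_cache(maxsize=64)
def cached_read(path_str: str, mtime: float) -> str:
    """Read the file; mtime is unused in the body and only widens the cache key."""
    return Path(path_str).read_text(encoding="utf-8")


def read_current(path: Path) -> str:
    """Look the file up under its current mtime."""
    return cached_read(str(path), path.stat().st_mtime)


def demo() -> tuple:
    cached_read.cache_clear()
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "rerank.yaml"
        path.write_text("text: first\n", encoding="utf-8")
        first = read_current(path)
        read_current(path)  # same (path, mtime) key: served from memory
        path.write_text("text: second\n", encoding="utf-8")
        # Push mtime forward explicitly so the demo does not depend on
        # the filesystem's timestamp resolution.
        stat = path.stat()
        os.utime(path, (stat.st_atime, stat.st_mtime + 10))
        second = read_current(path)  # new key: cache miss, file re-read
        return first, second, cached_read.cache_info().currsize
```

`demo()` returns the old text, the new text, and a cache size of 2: the entry for the old mtime is still held after the reload, which is why the `maxsize` bound matters.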

4. Migration: Prompt-by-Prompt

Group 1: Low frequency, low risk (do first)

Prompt	Calls/case	Risk
`RERANK_SYSTEM_PROMPT`	1	Lowest — rarely fires
`EXPLANATION_SYSTEM_PROMPT`	5-10	Low — independent per provider
`CLASSIFY_INTENT_PROMPT`	per message	Low — simple classification
`INTAKE_SYSTEM_PROMPT_V1/V2`	per turn	Low — short, no examples in V2

Group 2: High caching benefit (do second)

Prompt	Calls/case	Caching benefit
`EXTRACTION_SYSTEM_PROMPT`	per doc	Highest — 59% examples, cached across all docs
`ICD_MAPPING_SYSTEM_PROMPT`	per doc	High — 30% examples
`ICD_MAPPING_SYSTEM_PROMPT_COT`	per doc	High — 29% examples
`FHIR_GENERATION_SYSTEM_PROMPT`	per doc	Medium — 22% examples

Group 3: Conversation (do last, most complex)

Prompt	Complexity
`CURAWAY_SYSTEM_PROMPT_V1/V2`	5 placeholders, forbidden phrases, emotional context
`PHASE_CONTEXTS_V1/V2`	5 phases × 2 versions = 10 YAML files
Phase: `document_review` (V1)	Has inline examples
`EXTRACTOR_PROMPT_V1/V2`	Inline rule-examples (not separable)
`CLINICAL_ANALYSIS_*`	COT variant has rich examples

Migration per prompt

For each prompt:

1. Create config/prompts/base/{key}.yaml with the rules/schema portion (everything except examples)
2. Create config/prompts/examples/en/{key}.yaml with the examples
3. Add {examples} placeholder in the base YAML where examples were
4. Update the Python file to call load_prompt(key, locale)
5. Run test_prompt_migration_parity.py to verify byte-equality
6. Comment out (don't delete) the old constant
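The split-and-reassemble idea behind these steps can be sketched on a toy prompt. The constant below is invented for illustration and is not one of the real prompts; `split_prompt` and `assemble` are hypothetical helpers.

```python
ORIGINAL = (
    "You rank providers.\n"
    "Example:\n"
    "Input: knee surgery\nOutput: orthopedics\n"
    "Return JSON only."
)
EXAMPLES = "Input: knee surgery\nOutput: orthopedics\n"


def split_prompt(original: str, examples: str) -> str:
    """Return the base text with the examples span swapped for {examples}."""
    if original.count(examples) != 1:
        raise ValueError("examples block must occur exactly once")
    return original.replace(examples, "{examples}")


def assemble(base: str, examples: str) -> str:
    """Put the examples back where the placeholder sits."""
    # str.replace rather than str.format, so other {placeholders} and
    # literal braces in the base text pass through untouched.
    return base.replace("{examples}", examples)


base = split_prompt(ORIGINAL, EXAMPLES)
# Byte-level parity check, mirroring what the parity test asserts.
assert assemble(base, EXAMPLES).encode("utf-8") == ORIGINAL.encode("utf-8")
```

When the two halves live in YAML, trailing newlines are a likely source of byte mismatches: a literal block scalar written with `|` keeps one trailing newline, while `|-` strips it.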

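5. Anthropic Prompt Caching Integration

Where to add `cache_control`

Add cache_control only to the system prompt's content block. The snippet uses a role-based message list with a "system" entry. When calling the Anthropic Messages API directly, the system prompt is passed through the top-level system parameter instead, and the same content-block shape applies there: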

# app/agents/llm_conversation.py — generate_response_streaming()

messages = [
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": assembled_system_prompt,
                "cache_control": {"type": "ephemeral"},
            }
        ],
    },
    # ... conversation history messages (not cached)
    {"role": "user", "content": current_message},
]
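Anthropic's cache matches on the prompt prefix up to the block that carries `cache_control`, so any per-request text placed before that marker defeats the cache. The conversation prompt has runtime placeholders such as `{emotional_context}` and `{patient_context}`, so one option is to keep the stable rules and examples in a first block and put the per-request text in a second, unmarked block. The helper below is a sketch of that arrangement; `build_system_blocks` is a hypothetical name, and splitting the prompt this way is an assumption, not something the current assembly does.

```python
def build_system_blocks(stable_prefix: str, dynamic_suffix: str = "") -> list:
    """Return system content blocks with a cache breakpoint after the stable prefix."""
    blocks = [
        {
            "type": "text",
            "text": stable_prefix,
            # Everything up to and including this block is eligible for caching.
            "cache_control": {"type": "ephemeral"},
        }
    ]
    if dynamic_suffix:
        # Per-request text goes after the breakpoint so it does not change the cached prefix.
        blocks.append({"type": "text", "text": dynamic_suffix})
    return blocks
```

The result can be used as the `content` of the system message shown above.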

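LangChain integration

How cache control reaches the API depends on the installed langchain-anthropic version. This spec assumes ChatAnthropic accepts it either via the extra_body parameter or via the newer anthropic_cache_control message extension. Confirm which mechanism the installed version supports before wiring it in, then verify with the Langfuse token counters that caching is active.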

Langfuse tracking

After enabling caching, Langfuse traces will show:

- cache_creation_input_tokens: tokens written to cache (first call)
- cache_read_input_tokens: tokens read from cache (subsequent calls)

Monitor these in the Langfuse dashboard to verify caching is active.
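A quick way to turn those counters into a single number is a cache hit ratio. The sketch below assumes the usage data is available as a plain dict with Anthropic's field names, where `input_tokens` counts only the uncached portion of the input. How Langfuse exposes these values is not specified here, and `cache_hit_ratio` is a hypothetical helper.

```python
def cache_hit_ratio(usage: dict) -> float:
    """Share of all input tokens that were served from the prompt cache."""
    read = usage.get("cache_read_input_tokens") or 0
    created = usage.get("cache_creation_input_tokens") or 0
    uncached = usage.get("input_tokens") or 0
    total = read + created + uncached
    return read / total if total else 0.0
```

A ratio that stays at 0.0 across repeated calls with the same system prompt suggests the cache breakpoint is not being honored.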

6. Test Plan

Unit tests (`tests/test_prompt_loader.py`)

test_load_prompt_returns_string: basic happy path
test_load_prompt_includes_examples: examples are in the output
test_load_prompt_locale_fallback: missing locale falls back to English
test_load_prompt_en_fallback_required: missing English raises PromptLoadError
test_load_prompt_malformed_yaml: malformed non-English examples fall back to English; malformed base or English YAML raises PromptLoadError
test_load_prompt_empty_examples: empty locale examples fall back to English
test_load_phase_context: loads correct phase for version
test_load_phase_context_missing: raises on missing phase
test_format_examples_conversation: WRONG/RIGHT format
test_format_examples_extraction: Input/Output format
test_format_examples_intent: one-line format
test_validate_prompt_catches_placeholders: detects unresolved {...}
test_yaml_cache_invalidation: mtime change forces reload
test_concurrent_loads: thread-safe loading
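The placeholder check exercised by test_validate_prompt_catches_placeholders could be built on a regular expression like the one below. This is a sketch: `find_unresolved` and its `allowed` argument are hypothetical. The allow-list reflects the assumption that runtime placeholders such as `{phase_context}` are filled in later and should not be reported.

```python
import re

# Matches {identifier} but not {{escaped}} braces or JSON-like {"key": 1}.
PLACEHOLDER = re.compile(r"(?<!\{)\{([A-Za-z_][A-Za-z0-9_]*)\}(?!\})")


def find_unresolved(prompt: str, allowed: frozenset = frozenset()) -> list:
    """Return the sorted placeholder names still present, excluding allowed ones."""
    names = {name for name in PLACEHOLDER.findall(prompt) if name not in allowed}
    return sorted(names)
```

Because the pattern only accepts identifier-like names, JSON snippets inside few-shot examples do not trigger false positives.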

Parity test (`tests/test_prompt_migration_parity.py`), temporary

For each migrated prompt, assert:

def test_conversation_v2_parity():
    """Assembled prompt matches the original hardcoded constant."""
    from app.services.prompt_loader import load_prompt
    assembled = load_prompt("conversation", "en", "v2")
    # Replace dynamic placeholders with known values
    assembled = assembled.replace("{forbidden_phrases_block}", KNOWN_FORBIDDEN)
    assembled = assembled.replace("{emotional_context}", "neutral")
    assembled = assembled.replace("{phase_context}", "")
    assembled = assembled.replace("{patient_context}", "")
    assert assembled == ORIGINAL_CURAWAY_SYSTEM_PROMPT_V2

One test per prompt. Removed after the migration PR is stable in production for 1 week.
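When a parity assertion fails on a prompt that is several kilobytes long, the default assertion output can be hard to read. A small helper that reports the first differing position may help. `first_difference` is a hypothetical addition, not part of the planned test file.

```python
from typing import Optional


def first_difference(expected: str, actual: str, context: int = 20) -> Optional[str]:
    """Describe the first position where the strings differ, or return None if equal."""
    if expected == actual:
        return None
    limit = min(len(expected), len(actual))
    # If one string is a prefix of the other, the difference starts at the shorter length.
    index = next((i for i in range(limit) if expected[i] != actual[i]), limit)
    start = max(0, index - context)
    return (
        f"first difference at index {index}: "
        f"expected {expected[start:index + context]!r}, "
        f"got {actual[start:index + context]!r}"
    )
```

A parity test could then use `assert assembled == ORIGINAL, first_difference(ORIGINAL, assembled)`. The `repr` formatting makes whitespace differences such as a missing trailing newline visible.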

CI integration

# .github/workflows/ci.yml — add step
- name: Prompt validation
  run: |
    python -m pytest tests/test_prompt_loader.py -v
    python -m pytest tests/test_prompt_migration_parity.py -v

Validation test (permanent)

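import re
from pathlib import Path

from app.services.prompt_loader import load_prompt

BASE_DIR = Path("config/prompts/base")
EN_DIR = Path("config/prompts/examples/en")

def _split_key(stem: str) -> tuple[str, str | None]:
    """Split a base file stem such as "conversation_v2" into ("conversation", "v2")."""
    match = re.fullmatch(r"(.+)_(v\d+)", stem)
    return (match.group(1), match.group(2)) if match else (stem, None)

def test_all_base_prompts_have_english_examples():
    """Every base prompt YAML has a corresponding English examples YAML."""
    for base_file in BASE_DIR.glob("*.yaml"):
        # Some prompts legitimately have no examples (e.g. the chat extractors
        # and the requirement matcher); PROMPTS_WITHOUT_EXAMPLES holds their file stems.
        if base_file.stem in PROMPTS_WITHOUT_EXAMPLES:
            continue
        # Example files are shared across versions:
        # conversation_v1 and conversation_v2 both use examples/en/conversation.yaml.
        key, _version = _split_key(base_file.stem)
        assert (EN_DIR / f"{key}.yaml").exists(), \
            f"Missing English examples for {key}"

def test_all_prompts_assemble_cleanly():
    """Every base+example combination produces a valid prompt."""
    for base_file in BASE_DIR.glob("*.yaml"):
        # load_prompt takes the key and version separately, not the file stem
        key, version = _split_key(base_file.stem)
        prompt = load_prompt(key, "en", version)
        assert len(prompt) > 100, f"Prompt {key} suspiciously short"
        # Check no YAML artifacts leaked into the prompt
        assert "---\n" not in prompt
        assert "description:" not in prompt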

7. Implementation Checklist

[ ] Create config/prompts/ directory structure
[ ] Write app/services/prompt_loader.py with load_prompt(), load_phase_context(), caching, validation
[ ] Write tests/test_prompt_loader.py (14 unit tests)
[ ] Group 1 migration:
[ ] Extract RERANK_SYSTEM_PROMPT → base + examples YAML
[ ] Extract EXPLANATION_SYSTEM_PROMPT → base + examples YAML
[ ] Extract CLASSIFY_INTENT_PROMPT → base + examples YAML
[ ] Extract INTAKE_SYSTEM_PROMPT_V1/V2 → base + examples YAML
[ ] Update Python files to use load_prompt()
[ ] Run parity tests
[ ] Group 2 migration:
[ ] Extract EXTRACTION_SYSTEM_PROMPT → base + examples YAML
[ ] Extract ICD_MAPPING_SYSTEM_PROMPT → base + examples YAML
[ ] Extract ICD_MAPPING_SYSTEM_PROMPT_COT → base + examples YAML
[ ] Extract FHIR_GENERATION_SYSTEM_PROMPT → base + examples YAML
[ ] Update Python files to use load_prompt()
[ ] Run parity tests
[ ] Group 3 migration:
[ ] Extract CURAWAY_SYSTEM_PROMPT_V1 → base + examples YAML
[ ] Extract CURAWAY_SYSTEM_PROMPT_V2 → base + examples YAML
[ ] Extract 10 phase contexts → phase_contexts/v1/*.yaml and v2/*.yaml
[ ] Extract EXTRACTOR_PROMPT_V1/V2 → base YAML (no separable examples)
[ ] Extract CLINICAL_ANALYSIS_* → base + examples YAML
[ ] Update Python files to use load_prompt() and load_phase_context()
[ ] Run parity tests
[ ] A/B testing framework for examples with Flagsmith integration
[ ] Write tests/test_prompt_migration_parity.py (1 test per prompt)
[ ] Add prompt caching (cache_control) to generate_response_streaming()
[ ] Add prompt caching to clinical extraction calls
[ ] Update CI workflow to run prompt validation tests
[ ] Verify Langfuse traces show cache_read_input_tokens
[ ] Run full patient flow end-to-end (sign up → intake → doc upload → matching)
[ ] Commit, push, open PR

8. Out of Scope

Arabic example YAML files (Phase 1 multilingual — separate PR)
Prompt analytics dashboard (post-abstraction follow-up)
Fine-tuning dataset export from YAML (post-Series A)
MATCHER_USER_TEMPLATE migration (runtime template vars, stays in Python)
ConversationApp dark mode (separate track)
Case record porting implementation (separate track)

9. Risks

Risk	Likelihood	Impact	Mitigation
YAML parse error in production	Low	Medium: prompt fails	CI validates all YAML. At runtime, a malformed non-English examples file falls back to English; a malformed base or English file raises PromptLoadError.
Parity test misses a difference	Low	High: behavioral regression	Run full e2e flow before merging. Keep old constants commented for rollback.
Prompt caching doesn't activate	Medium	Low: just costs more	Check Langfuse for `cache_read_input_tokens`. Verify that the `cache_control` field on the system content block matches the Anthropic docs and that the prompt meets the model's minimum cacheable length.
In-memory cache grows unbounded	Very low	Low: 64 entries × <10KB each = <640KB	`lru_cache(maxsize=64)` caps it.
Developer adds prompt without YAML	Medium	Low: falls back	CI test `test_all_base_prompts_have_english_examples` catches it.
File system latency on YAML reads	Very low	Low: 60s stat cache floor	Reads are cached in memory. Each file is re-statted at most once per minute.

10. References

Companion steer: ai-steer/prompt-abstraction-steer.md
Prompt audit data (Session 35): 19 prompts, 23 example blocks, ~9,900 chars inline
Anthropic prompt caching: https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching
Existing YAML patterns: config/voice_rules.yaml, config/feature_flags.yaml
Medical advice review workflow: xlsx export → clinical advisor review → approved strings