Prompt Abstraction Layer — Feature Spec

Date: 2026-04-10
Status: Implemented (PR #92 merged)
Companion steer: ai-steer/prompt-abstraction-steer.md


1. Summary

Extract all 23 inline few-shot example blocks from 12 Python prompt constants into YAML files under config/prompts/. Build a prompt_loader.py service that assembles prompts from base + locale-specific examples at runtime. Enable Anthropic prompt caching on the stable prefix. Zero behavioral change: the assembled output is byte-identical to today's hardcoded strings.


2. File-by-File Change List

New files

| File | Description | LOC |
| --- | --- | --- |
| app/services/prompt_loader.py | Prompt assembly service | ~120 |
| config/prompts/base/conversation_v1.yaml | V1 conversation base | ~80 |
| config/prompts/base/conversation_v2.yaml | V2 conversation base | ~130 |
| config/prompts/base/clinical_extraction.yaml | Extraction base | ~30 |
| config/prompts/base/icd_mapping.yaml | ICD mapping base | ~25 |
| config/prompts/base/icd_mapping_cot.yaml | ICD COT base | ~40 |
| config/prompts/base/fhir_generation.yaml | FHIR generation base | ~25 |
| config/prompts/base/intake_v1.yaml | Intake V1 base | ~25 |
| config/prompts/base/intake_v2.yaml | Intake V2 base | ~12 |
| config/prompts/base/explanation.yaml | Explanation base | ~15 |
| config/prompts/base/clinical_analysis.yaml | Analysis base | ~15 |
| config/prompts/base/clinical_analysis_cot.yaml | Analysis COT base | ~40 |
| config/prompts/base/rerank.yaml | Rerank base | ~12 |
| config/prompts/base/intent_classifier.yaml | Intent classifier base | ~15 |
| config/prompts/base/chat_extractor_v1.yaml | Chat extractor V1 | ~15 |
| config/prompts/base/chat_extractor_v2.yaml | Chat extractor V2 | ~20 |
| config/prompts/base/requirement_matcher.yaml | Requirement matcher base | ~10 |
| config/prompts/examples/en/conversation.yaml | 5 WRONG/RIGHT pairs | ~60 |
| config/prompts/examples/en/clinical_extraction.yaml | 2 rich examples | ~160 |
| config/prompts/examples/en/icd_mapping.yaml | 1 coding example | ~30 |
| config/prompts/examples/en/icd_mapping_cot.yaml | 1 COT example | ~50 |
| config/prompts/examples/en/fhir_generation.yaml | 1 FHIR example | ~40 |
| config/prompts/examples/en/intake.yaml | 1 intake example | ~15 |
| config/prompts/examples/en/explanation.yaml | 1 explanation example | ~15 |
| config/prompts/examples/en/clinical_analysis.yaml | 1 analysis example | ~20 |
| config/prompts/examples/en/clinical_analysis_cot.yaml | 1 COT example | ~40 |
| config/prompts/examples/en/rerank.yaml | 1 rerank example | ~12 |
| config/prompts/examples/en/intent_classifier.yaml | 6 intent examples | ~15 |
| config/prompts/examples/en/document_review.yaml | 2 response examples | ~20 |
| config/prompts/phase_contexts/v1/*.yaml | 5 V1 phase contexts | ~5 files |
| config/prompts/phase_contexts/v2/*.yaml | 5 V2 phase contexts | ~5 files |
| tests/test_prompt_loader.py | Loader unit tests | ~120 |
| tests/test_prompt_migration_parity.py | Parity test (temporary) | ~80 |

Modified files

| File | Change |
| --- | --- |
| app/agents/llm_conversation.py | Replace inline V1/V2 constants with load_prompt(). Update _get_system_prompt() to use loader. Update _get_phase_contexts() to load from YAML. Add cache_control to system message. Keep old constants commented out for one release. |
| app/agents/prompts/clinical_extraction.py | Replace 4 inline constants with load_prompt() calls. |
| app/agents/prompts/intake.py | Replace 2 inline constants with load_prompt() calls. |
| app/agents/prompts/explanation.py | Replace 1 inline constant with load_prompt() call. |
| app/agents/prompts/match_analysis.py | Replace 3 inline constants with load_prompt() calls. |
| app/agents/orchestrator.py | Replace CLASSIFY_INTENT_PROMPT with load_prompt() call. |
| app/services/chat_extractor.py | Replace 2 inline constants with load_prompt() calls. |
| app/services/requirement_matcher.py | Replace 1 inline constant with load_prompt() call. MATCHER_USER_TEMPLATE stays as Python (runtime template vars). |

3. prompt_loader.py — Public Surface

from app.services.prompt_loader import load_prompt, load_phase_context

# Load a complete prompt (base + locale examples)
prompt: str = load_prompt(
    prompt_key="conversation",       # matches base/conversation_v2.yaml
    locale="ar",                     # ISO 639-1
    version="v2",                    # optional, defaults to latest
)

# Load a phase context
context: str = load_phase_context(
    phase="records_first",           # phase name
    version="v2",                    # v1 or v2
)

Internal functions

def _load_yaml(path: Path) -> dict:
    """Load + parse YAML. Raises PromptLoadError on missing/malformed."""

def _load_base(prompt_key: str, version: str | None) -> str:
    """Load base prompt text from config/prompts/base/."""

def _load_examples(prompt_key: str, locale: str) -> str | None:
    """Load examples for locale. Returns None if missing (caller falls back)."""

def _format_examples(prompt_key: str, examples: list[dict]) -> str:
    """Format raw example dicts into the prompt's expected text shape.
    Different prompts use different formats:
    - conversation: WRONG/RIGHT pairs
    - extraction: Input/Output blocks
    - intent: one-line examples
    """

def _validate_prompt(prompt: str, prompt_key: str) -> str:
    """Check for unresolved {placeholders}. Log errors, never block."""

@lru_cache(maxsize=64)
def _cached_load(path_str: str, mtime: float) -> dict:
    """In-memory YAML cache keyed on path + mtime.
    Invalidates when file is modified (mtime changes).
    maxsize=64 covers all prompts × locales with room to spare."""
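
To make the per-prompt formatting described for _format_examples concrete, here is a minimal sketch that dispatches on prompt_key. The dict field names (wrong, right, input, output, text, label), the registry, and the rendered layout are all assumptions for illustration, not the project's actual schema; a real formatter would have to reproduce the original constants byte for byte.

```python
from typing import Callable


def _fmt_conversation(ex: dict) -> str:
    # WRONG/RIGHT pair; the "wrong"/"right" field names are assumed.
    return f"WRONG: {ex['wrong']}\nRIGHT: {ex['right']}"


def _fmt_extraction(ex: dict) -> str:
    # Input/Output block; the "input"/"output" field names are assumed.
    return f"Input:\n{ex['input']}\n\nOutput:\n{ex['output']}"


def _fmt_intent(ex: dict) -> str:
    # One-line example; the "text"/"label" field names are assumed.
    return f"{ex['text']} -> {ex['label']}"


# Hypothetical registry mapping prompt keys to their formatter.
_FORMATTERS: dict[str, Callable[[dict], str]] = {
    "conversation": _fmt_conversation,
    "clinical_extraction": _fmt_extraction,
    "intent_classifier": _fmt_intent,
}


def format_examples(prompt_key: str, examples: list[dict]) -> str:
    """Render raw example dicts as one text block, separated by blank lines."""
    try:
        formatter = _FORMATTERS[prompt_key]
    except KeyError:
        raise ValueError(f"No example formatter registered for {prompt_key!r}") from None
    return "\n\n".join(formatter(ex) for ex in examples)
```

A registry keeps each format in one small function, so adding a prompt with a new example shape does not touch the existing formatters.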

Error type

class PromptLoadError(Exception):
    """Raised when a required prompt file is missing or malformed.
    Only raised for base prompts and English examples (hard requirements).
    Never raised for non-English locale examples (falls back to English)."""
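
The error policy in that docstring (English is a hard requirement, other locales fall back) can be sketched as follows. The store argument is a hypothetical in-memory stand-in for the YAML files on disk, and resolve_examples is an illustrative name, not part of the real loader.

```python
class PromptLoadError(Exception):
    """Required prompt data is missing or malformed."""


def resolve_examples(
    store: dict[str, dict[str, list[dict]]],
    prompt_key: str,
    locale: str,
) -> list[dict]:
    """Return examples for `locale`, falling back to English.

    `store[locale][prompt_key]` stands in for
    config/prompts/examples/<locale>/<prompt_key>.yaml.
    """
    localized = store.get(locale, {}).get(prompt_key)
    if localized:
        return localized
    # Missing or empty locale file: fall back to English without raising.
    english = store.get("en", {}).get(prompt_key)
    if not english:
        # English examples are a hard requirement, so this is an error.
        raise PromptLoadError(f"English examples missing for {prompt_key!r}")
    return english
```

Under this sketch an empty non-English file behaves the same as a missing one, which matches the fallback cases listed in the test plan.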

Caching strategy (in-memory)

YAML files are small (< 10KB each) and rarely change. Cache them in memory using lru_cache keyed on (file_path, file_mtime). When a file is modified, its mtime changes, so the next lookup misses the cache and re-reads the file. The stale entry is not removed immediately; it ages out under the LRU policy, bounded by maxsize. This avoids reading disk on every LLM call while still picking up changes without a restart.

There is no explicit TTL: a changed mtime forces a reload. For production stability, also add a 60-second floor so the file is not re-stat'ed more than once per minute. The trade-off is that an edit can take up to 60 seconds to become visible:

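import time
from pathlib import Path

# path -> (mtime, monotonic time of the last stat call)
_stat_cache: dict[str, tuple[float, float]] = {}

def _get_mtime(path: Path) -> float:
    """Return the file's mtime, re-statting at most once every 60 seconds."""
    now = time.monotonic()
    cached = _stat_cache.get(str(path))
    if cached and now - cached[1] < 60:
        return cached[0]  # checked recently: skip the stat call
    mtime = path.stat().st_mtime
    _stat_cache[str(path)] = (mtime, now)
    return mtime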
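
A minimal sketch of how an mtime-keyed lru_cache behaves, assuming the mtime is passed in purely as part of the cache key. It reads plain text instead of parsing YAML to stay dependency-free, stats the file on every call (no 60-second floor), and the names _cached_read and read_current are illustrative.

```python
from functools import lru_cache
from pathlib import Path


@lru_cache(maxsize=64)
def _cached_read(path_str: str, mtime: float) -> str:
    # `mtime` is unused in the body; it only participates in the cache key,
    # so a changed mtime produces a new key and therefore a fresh read.
    return Path(path_str).read_text(encoding="utf-8")


def read_current(path: Path) -> str:
    """Return file contents, re-reading only when the mtime has changed."""
    return _cached_read(str(path), path.stat().st_mtime)
```

Note that the entry for the old mtime stays in the cache until the LRU policy pushes it out; `_cached_read.cache_info()` shows the current size if that needs checking.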

4. Migration: Prompt-by-Prompt

Group 1 — Low frequency, low risk (do first)

| Prompt | Calls/case | Risk |
| --- | --- | --- |
| RERANK_SYSTEM_PROMPT | 1 | Lowest (rarely fires) |
| EXPLANATION_SYSTEM_PROMPT | 5-10 | Low (independent per provider) |
| CLASSIFY_INTENT_PROMPT | per message | Low (simple classification) |
| INTAKE_SYSTEM_PROMPT_V1/V2 | per turn | Low (short, no examples in V2) |

Group 2 — High caching benefit (do second)

| Prompt | Calls/case | Caching benefit |
| --- | --- | --- |
| EXTRACTION_SYSTEM_PROMPT | per doc | Highest (59% examples, cached across all docs) |
| ICD_MAPPING_SYSTEM_PROMPT | per doc | High (30% examples) |
| ICD_MAPPING_SYSTEM_PROMPT_COT | per doc | High (29% examples) |
| FHIR_GENERATION_SYSTEM_PROMPT | per doc | Medium (22% examples) |

Group 3 — Conversation (do last, most complex)

| Prompt | Complexity |
| --- | --- |
| CURAWAY_SYSTEM_PROMPT_V1/V2 | 5 placeholders, forbidden phrases, emotional context |
| PHASE_CONTEXTS_V1/V2 | 5 phases × 2 versions = 10 YAML files |
| Phase: document_review (V1) | Has inline examples |
| EXTRACTOR_PROMPT_V1/V2 | Inline rule-examples (not separable) |
| CLINICAL_ANALYSIS_* | COT variant has rich examples |

Migration per prompt

For each prompt:

  1. Create config/prompts/base/{key}.yaml with the rules/schema portion (everything except examples)
  2. Create config/prompts/examples/en/{key}.yaml with the examples
  3. Add {examples} placeholder in the base YAML where examples were
  4. Update the Python file to call load_prompt(key, locale)
  5. Run test_prompt_migration_parity.py to verify byte-equality
  6. Comment out (don't delete) the old constant
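
Steps 3 and 5 above can be sketched as follows: substitute the {examples} placeholder, then list any placeholders that remain. This is a sketch under assumptions. The function names are illustrative, and the placeholder pattern (lowercase snake_case names in braces) is inferred from names such as {phase_context} used elsewhere in this spec.

```python
import re

# Assumed placeholder shape: {lower_snake_case}. Braces around anything else
# (for example JSON in a prompt) are deliberately not matched.
_PLACEHOLDER = re.compile(r"\{([a-z_][a-z0-9_]*)\}")


def assemble(base_text: str, examples_text: str) -> str:
    """Insert formatted examples where the base YAML left {examples}."""
    # str.replace rather than str.format, so literal braces and runtime
    # placeholders elsewhere in the prompt are left untouched.
    return base_text.replace("{examples}", examples_text)


def unresolved_placeholders(prompt: str, allowed: frozenset[str] = frozenset()) -> list[str]:
    """Return placeholder names still present, minus those filled at runtime."""
    names = {m.group(1) for m in _PLACEHOLDER.finditer(prompt)}
    return sorted(names - allowed)
```

Passing the runtime-filled names (for example phase_context) as allowed keeps the check focused on placeholders the loader was supposed to resolve itself.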

5. Anthropic Prompt Caching Integration

Where to add cache_control

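Add cache_control only to the system prompt, not to the conversation history or the current user message. Note that the native Anthropic Messages API takes the system prompt through its top-level system parameter rather than a role: system entry, so the messages-array shape shown here assumes a wrapper (such as LangChain) that performs that conversion: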

# app/agents/llm_conversation.py — generate_response_streaming()

messages = [
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": assembled_system_prompt,
                "cache_control": {"type": "ephemeral"},
            }
        ],
    },
    # ... conversation history messages (not cached)
    {"role": "user", "content": current_message},
]
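
For a direct Anthropic SDK call without a wrapper, a minimal sketch of the equivalent request puts the cached block in the top-level system parameter. The helper name build_cached_request and its defaults are hypothetical; the model id is left to the caller rather than assumed here.

```python
from typing import Any


def build_cached_request(
    model: str,
    system_prompt: str,
    history: list[dict[str, Any]],
    user_message: str,
    max_tokens: int = 1024,
) -> dict[str, Any]:
    """Build keyword arguments for a Messages API call with a cached system prompt."""
    return {
        "model": model,
        "max_tokens": max_tokens,
        # The stable prefix: one text block marked for ephemeral caching.
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # History and the current turn follow the cached prefix, uncached.
        "messages": [*history, {"role": "user", "content": user_message}],
    }
```

The resulting dict could be passed as `client.messages.create(**kwargs)` on an `anthropic.Anthropic()` client; that call is not made here, so the sketch runs without network access or credentials.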

LangChain integration

LangChain's ChatAnthropic may support cache control via the extra_body parameter or a newer anthropic_cache_control message extension, depending on the release. Confirm which mechanism the installed langchain-anthropic version documents before relying on either.

Langfuse tracking

After enabling caching, Langfuse traces will show:

  • cache_creation_input_tokens: tokens written to cache (first call)
  • cache_read_input_tokens: tokens read from cache (subsequent calls)

Monitor these in the Langfuse dashboard to verify caching is active.
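
As a quick check outside the dashboard, the two counters can be folded into a single ratio. This is a sketch that assumes the usage numbers are available as a plain dict with Anthropic's field names; how they are pulled out of a Langfuse trace is not shown, and cache_read_ratio is an illustrative name.

```python
def cache_read_ratio(usage: dict[str, int]) -> float:
    """Fraction of input tokens served from cache for one call.

    Total input is taken as uncached input_tokens plus tokens written to
    the cache plus tokens read from the cache.
    """
    read = usage.get("cache_read_input_tokens", 0)
    created = usage.get("cache_creation_input_tokens", 0)
    uncached = usage.get("input_tokens", 0)
    total = read + created + uncached
    return read / total if total else 0.0
```

A ratio that stays at 0.0 across repeated calls with the same system prompt suggests caching is not active.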


6. Test Plan

Unit tests (tests/test_prompt_loader.py)

  1. test_load_prompt_returns_string — basic happy path
  2. test_load_prompt_includes_examples — examples are in the output
  3. test_load_prompt_locale_fallback — missing locale falls back to English
  4. test_load_prompt_en_fallback_required — missing English raises PromptLoadError
  5. test_load_prompt_malformed_yaml — graceful fallback on bad YAML
  6. test_load_prompt_empty_examples — falls back to English
  7. test_load_phase_context — loads correct phase for version
  8. test_load_phase_context_missing — raises on missing phase
  9. test_format_examples_conversation — WRONG/RIGHT format
  10. test_format_examples_extraction — Input/Output format
  11. test_format_examples_intent — one-line format
  12. test_validate_prompt_catches_placeholders — detects unresolved {...}
  13. test_yaml_cache_invalidation — mtime change forces reload
  14. test_concurrent_loads — thread-safe loading
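
One possible shape for test_concurrent_loads is sketched below. CountingLoader is a hypothetical stand-in for the real loader, used only so the example runs on its own; the actual test would call load_prompt instead.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class CountingLoader:
    """Stand-in loader: caches by key and counts simulated disk reads."""

    def __init__(self) -> None:
        self._cache: dict[str, str] = {}
        self._lock = threading.Lock()
        self.reads = 0

    def load(self, key: str) -> str:
        # The lock makes check-then-populate atomic across threads.
        with self._lock:
            if key not in self._cache:
                self.reads += 1
                self._cache[key] = f"prompt for {key}"
            return self._cache[key]


def run_concurrent_loads(loader: CountingLoader, key: str,
                         workers: int = 16, calls: int = 200) -> set[str]:
    """Call loader.load(key) from many threads; return the distinct results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return set(pool.map(lambda _: loader.load(key), range(calls)))
```

The test would then assert that every thread saw the same prompt text and, if the loader guards its cache with a lock as in this stand-in, that the underlying read happened once.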

Parity test (tests/test_prompt_migration_parity.py) — temporary

For each migrated prompt, assert:

def test_conversation_v2_parity():
    """Assembled prompt matches the original hardcoded constant."""
    from app.services.prompt_loader import load_prompt
    assembled = load_prompt("conversation", "en", "v2")
    # Replace dynamic placeholders with known values
    assembled = assembled.replace("{forbidden_phrases_block}", KNOWN_FORBIDDEN)
    assembled = assembled.replace("{emotional_context}", "neutral")
    assembled = assembled.replace("{phase_context}", "")
    assembled = assembled.replace("{patient_context}", "")
    assert assembled == ORIGINAL_CURAWAY_SYSTEM_PROMPT_V2

One test per prompt. Removed after the migration PR is stable in production for 1 week.
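
When a byte-equality assertion fails on a multi-kilobyte prompt, the default assertion output can be hard to read. A small helper like the following sketch (the name first_difference is hypothetical) pinpoints the first differing character:

```python
from typing import Optional


def first_difference(expected: str, actual: str, context: int = 20) -> Optional[str]:
    """Describe the first position where two strings differ, or None if equal."""
    if expected == actual:
        return None
    limit = min(len(expected), len(actual))
    # If one string is a prefix of the other, the difference is at `limit`.
    index = next((i for i in range(limit) if expected[i] != actual[i]), limit)
    start = max(0, index - context)
    return (
        f"first difference at index {index}: "
        f"expected {expected[start:index + context]!r}, "
        f"got {actual[start:index + context]!r}"
    )
```

It can serve as the assertion message, for example `assert assembled == ORIGINAL, first_difference(ORIGINAL, assembled)`, so whitespace-only differences show up through repr.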

CI integration

# .github/workflows/ci.yml — add step
- name: Prompt validation
  run: |
    python -m pytest tests/test_prompt_loader.py -v
    python -m pytest tests/test_prompt_migration_parity.py -v

Validation test (permanent)

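import re
from pathlib import Path

from app.services.prompt_loader import load_prompt

BASE_DIR = Path("config/prompts/base")
EN_DIR = Path("config/prompts/examples/en")

def _split_stem(stem: str) -> tuple[str, str | None]:
    """Split 'conversation_v2' into ('conversation', 'v2'); other stems pass through."""
    match = re.fullmatch(r"(.+)_(v\d+)", stem)
    return (match.group(1), match.group(2)) if match else (stem, None)

def test_all_base_prompts_have_english_examples():
    """Every base prompt YAML has a corresponding English examples YAML."""
    for base_file in BASE_DIR.glob("*.yaml"):
        # Example files are unversioned (conversation.yaml serves V1 and V2).
        key, _ = _split_stem(base_file.stem)
        # Some prompts legitimately have no examples (chat extractor, matcher).
        # PROMPTS_WITHOUT_EXAMPLES is assumed to hold unversioned keys.
        if key in PROMPTS_WITHOUT_EXAMPLES:
            continue
        assert (EN_DIR / f"{key}.yaml").exists(), \
            f"Missing English examples for {key}"

def test_all_prompts_assemble_cleanly():
    """Every base+example combination produces a valid prompt."""
    for base_file in BASE_DIR.glob("*.yaml"):
        # load_prompt takes the unversioned key plus a separate version.
        key, version = _split_stem(base_file.stem)
        prompt = load_prompt(key, "en", version)
        assert len(prompt) > 100, f"Prompt {key} suspiciously short"
        # Check no YAML artifacts leaked into the prompt
        assert "---\n" not in prompt
        assert "description:" not in prompt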

7. Implementation Checklist

  • [ ] Create config/prompts/ directory structure
  • [ ] Write app/services/prompt_loader.py with load_prompt(), load_phase_context(), caching, validation
  • [ ] Write tests/test_prompt_loader.py (14 unit tests)
  • [ ] Group 1 migration:
      • [ ] Extract RERANK_SYSTEM_PROMPT → base + examples YAML
      • [ ] Extract EXPLANATION_SYSTEM_PROMPT → base + examples YAML
      • [ ] Extract CLASSIFY_INTENT_PROMPT → base + examples YAML
      • [ ] Extract INTAKE_SYSTEM_PROMPT_V1/V2 → base + examples YAML
      • [ ] Update Python files to use load_prompt()
      • [ ] Run parity tests
  • [ ] Group 2 migration:
      • [ ] Extract EXTRACTION_SYSTEM_PROMPT → base + examples YAML
      • [ ] Extract ICD_MAPPING_SYSTEM_PROMPT → base + examples YAML
      • [ ] Extract ICD_MAPPING_SYSTEM_PROMPT_COT → base + examples YAML
      • [ ] Extract FHIR_GENERATION_SYSTEM_PROMPT → base + examples YAML
      • [ ] Update Python files to use load_prompt()
      • [ ] Run parity tests
  • [ ] Group 3 migration:
      • [ ] Extract CURAWAY_SYSTEM_PROMPT_V1 → base + examples YAML
      • [ ] Extract CURAWAY_SYSTEM_PROMPT_V2 → base + examples YAML
      • [ ] Extract 10 phase contexts → phase_contexts/v1/*.yaml and v2/*.yaml
      • [ ] Extract EXTRACTOR_PROMPT_V1/V2 → base YAML (no separable examples)
      • [ ] Extract CLINICAL_ANALYSIS_* → base + examples YAML
      • [ ] Update Python files to use load_prompt() and load_phase_context()
      • [ ] Run parity tests
  • [ ] A/B testing framework for examples with Flagsmith integration
  • [ ] Write tests/test_prompt_migration_parity.py (1 test per prompt)
  • [ ] Add prompt caching (cache_control) to generate_response_streaming()
  • [ ] Add prompt caching to clinical extraction calls
  • [ ] Update CI workflow to run prompt validation tests
  • [ ] Verify Langfuse traces show cache_read_input_tokens
  • [ ] Run full patient flow end-to-end (sign up → intake → doc upload → matching)
  • [ ] Commit, push, open PR

8. Out of Scope

  • Arabic example YAML files (Phase 1 multilingual — separate PR)
  • Prompt analytics dashboard (post-abstraction follow-up)
  • Fine-tuning dataset export from YAML (post-Series A)
  • MATCHER_USER_TEMPLATE migration (runtime template vars, stays in Python)
  • ConversationApp dark mode (separate track)
  • Case record porting implementation (separate track)

9. Risks

| Risk | Likelihood | Impact | Mitigation |
| --- | --- | --- | --- |
| YAML parse error in production | Low | Medium (prompt fails) | CI validates all YAML. Runtime falls back to English. |
| Parity test misses a difference | Low | High (behavioral regression) | Run full e2e flow before merging. Keep old constants commented for rollback. |
| Prompt caching doesn't activate | Medium | Low (just costs more) | Check Langfuse for cache_read tokens. Verify cache_control header format matches Anthropic docs. |
| In-memory cache grows unbounded | Very low | Low (64 entries × <10KB each = <640KB) | lru_cache(maxsize=64) caps it. |
| Developer adds prompt without YAML | Medium | Low (falls back) | CI test test_all_base_prompts_have_english_examples catches it. |
| File system latency on YAML reads | Very low | Low (60s stat cache floor) | Reads cached in memory. Only re-stats once per minute. |

10. References

  • Companion steer: ai-steer/prompt-abstraction-steer.md
  • Prompt audit data (Session 35): 19 prompts, 23 example blocks, ~9,900 chars inline
  • Anthropic prompt caching: https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching
  • Existing YAML patterns: config/voice_rules.yaml, config/feature_flags.yaml
  • Medical advice review workflow: xlsx export → clinical advisor review → approved strings