Prompt Abstraction Layer: Feature Spec
Date: 2026-04-10
Status: Implemented — PR #92 merged
Companion steer: ai-steer/prompt-abstraction-steer.md
1. Summary
Extract all 23 inline few-shot example blocks from 12 Python prompt
constants into YAML files under config/prompts/. Build a
prompt_loader.py service that assembles prompts from base + locale-
specific examples at runtime. Enable Anthropic prompt caching on the
stable prefix. Zero behavioral change — assembled output is byte-
identical to today's hardcoded strings.
2. File-by-File Change List
New files
| File | Description | LOC |
|---|---|---|
| app/services/prompt_loader.py | Prompt assembly service | ~120 |
| config/prompts/base/conversation_v1.yaml | V1 conversation base | ~80 |
| config/prompts/base/conversation_v2.yaml | V2 conversation base | ~130 |
| config/prompts/base/clinical_extraction.yaml | Extraction base | ~30 |
| config/prompts/base/icd_mapping.yaml | ICD mapping base | ~25 |
| config/prompts/base/icd_mapping_cot.yaml | ICD COT base | ~40 |
| config/prompts/base/fhir_generation.yaml | FHIR generation base | ~25 |
| config/prompts/base/intake_v1.yaml | Intake V1 base | ~25 |
| config/prompts/base/intake_v2.yaml | Intake V2 base | ~12 |
| config/prompts/base/explanation.yaml | Explanation base | ~15 |
| config/prompts/base/clinical_analysis.yaml | Analysis base | ~15 |
| config/prompts/base/clinical_analysis_cot.yaml | Analysis COT base | ~40 |
| config/prompts/base/rerank.yaml | Rerank base | ~12 |
| config/prompts/base/intent_classifier.yaml | Intent classifier base | ~15 |
| config/prompts/base/chat_extractor_v1.yaml | Chat extractor V1 | ~15 |
| config/prompts/base/chat_extractor_v2.yaml | Chat extractor V2 | ~20 |
| config/prompts/base/requirement_matcher.yaml | Requirement matcher base | ~10 |
| config/prompts/examples/en/conversation.yaml | 5 WRONG/RIGHT pairs | ~60 |
| config/prompts/examples/en/clinical_extraction.yaml | 2 rich examples | ~160 |
| config/prompts/examples/en/icd_mapping.yaml | 1 coding example | ~30 |
| config/prompts/examples/en/icd_mapping_cot.yaml | 1 COT example | ~50 |
| config/prompts/examples/en/fhir_generation.yaml | 1 FHIR example | ~40 |
| config/prompts/examples/en/intake.yaml | 1 intake example | ~15 |
| config/prompts/examples/en/explanation.yaml | 1 explanation example | ~15 |
| config/prompts/examples/en/clinical_analysis.yaml | 1 analysis example | ~20 |
| config/prompts/examples/en/clinical_analysis_cot.yaml | 1 COT example | ~40 |
| config/prompts/examples/en/rerank.yaml | 1 rerank example | ~12 |
| config/prompts/examples/en/intent_classifier.yaml | 6 intent examples | ~15 |
| config/prompts/examples/en/document_review.yaml | 2 response examples | ~20 |
| config/prompts/phase_contexts/v1/*.yaml | 5 V1 phase contexts | ~5 files |
| config/prompts/phase_contexts/v2/*.yaml | 5 V2 phase contexts | ~5 files |
| tests/test_prompt_loader.py | Loader unit tests | ~120 |
| tests/test_prompt_migration_parity.py | Parity test (temporary) | ~80 |
Modified files
| File | Change |
|---|---|
| app/agents/llm_conversation.py | Replace inline V1/V2 constants with load_prompt(). Update _get_system_prompt() to use loader. Update _get_phase_contexts() to load from YAML. Add cache_control to system message. Keep old constants commented out for one release. |
| app/agents/prompts/clinical_extraction.py | Replace 4 inline constants with load_prompt() calls. |
| app/agents/prompts/intake.py | Replace 2 inline constants with load_prompt() calls. |
| app/agents/prompts/explanation.py | Replace 1 inline constant with load_prompt() call. |
| app/agents/prompts/match_analysis.py | Replace 3 inline constants with load_prompt() calls. |
| app/agents/orchestrator.py | Replace CLASSIFY_INTENT_PROMPT with load_prompt() call. |
| app/services/chat_extractor.py | Replace 2 inline constants with load_prompt() calls. |
| app/services/requirement_matcher.py | Replace 1 inline constant with load_prompt() call. MATCHER_USER_TEMPLATE stays as Python (runtime template vars). |
3. prompt_loader.py: Public Surface
from app.services.prompt_loader import load_prompt, load_phase_context

# Load a complete prompt (base + locale examples)
prompt: str = load_prompt(
    prompt_key="conversation",  # matches base/conversation_v2.yaml
    locale="ar",                # ISO 639-1
    version="v2",               # optional, defaults to latest
)

# Load a phase context
context: str = load_phase_context(
    phase="records_first",  # phase name
    version="v2",           # v1 or v2
)
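The comment "matches base/conversation_v2.yaml" implies a mapping from (prompt_key, version) to a file name, with "latest" chosen when no version is given. A minimal sketch of one way that resolution could work follows. The function name `resolve_base_path` and the rule "highest `_vN` suffix wins" are assumptions for illustration, not taken from the merged PR.

```python
from __future__ import annotations

import re
from pathlib import Path


def resolve_base_path(base_dir: Path, prompt_key: str, version: str | None = None) -> Path:
    """Map (prompt_key, version) to a file under base_dir.

    Assumed naming: '{key}_{version}.yaml' for versioned prompts and
    '{key}.yaml' for unversioned ones.
    """
    if version is not None:
        return base_dir / f"{prompt_key}_{version}.yaml"

    # No version requested: pick the highest '{key}_vN.yaml' if any exist.
    # The exact-match regex keeps 'icd_mapping' from matching 'icd_mapping_cot'.
    pattern = re.compile(rf"^{re.escape(prompt_key)}_v(\d+)\.yaml$")
    versioned = []
    for candidate in base_dir.glob(f"{prompt_key}_v*.yaml"):
        match = pattern.match(candidate.name)
        if match:
            versioned.append((int(match.group(1)), candidate))
    if versioned:
        return max(versioned, key=lambda item: item[0])[1]

    # Otherwise fall back to the unversioned file name.
    return base_dir / f"{prompt_key}.yaml"
```

Under these assumptions, `resolve_base_path(Path("config/prompts/base"), "conversation")` would select `conversation_v2.yaml` when both V1 and V2 files are present.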
Internal functions
def _load_yaml(path: Path) -> dict:
    """Load + parse YAML. Raises PromptLoadError on missing/malformed."""

def _load_base(prompt_key: str, version: str | None) -> str:
    """Load base prompt text from config/prompts/base/."""

def _load_examples(prompt_key: str, locale: str) -> str | None:
    """Load examples for locale. Returns None if missing (caller falls back)."""

def _format_examples(prompt_key: str, examples: list[dict]) -> str:
    """Format raw example dicts into the prompt's expected text shape.
    Different prompts use different formats:
    - conversation: WRONG/RIGHT pairs
    - extraction: Input/Output blocks
    - intent: one-line examples
    """

def _validate_prompt(prompt: str, prompt_key: str) -> str:
    """Check for unresolved {placeholders}. Log errors, never block."""

@lru_cache(maxsize=64)
def _cached_load(path_str: str, mtime: float) -> dict:
    """In-memory YAML cache keyed on path + mtime.
    A modified file has a new mtime, which produces a new cache key and
    forces a re-read; the stale entry ages out under the LRU policy.
    maxsize=64 covers all prompts × locales with room to spare."""
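The three example formats listed for `_format_examples` can be illustrated with a small dispatch function. This is a sketch only: the dict field names (`wrong`, `right`, `input`, `output`, `text`, `label`) and the exact rendered text are hypothetical, since the real shapes must reproduce the original hardcoded strings byte for byte.

```python
from __future__ import annotations


def format_examples(prompt_key: str, examples: list[dict]) -> str:
    """Render example dicts as prompt text; all field names are hypothetical."""
    if prompt_key == "conversation":
        # WRONG/RIGHT pairs, one pair per block.
        blocks = [f"WRONG: {ex['wrong']}\nRIGHT: {ex['right']}" for ex in examples]
        return "\n\n".join(blocks)
    if prompt_key == "clinical_extraction":
        # Input/Output blocks.
        blocks = [f"Input:\n{ex['input']}\n\nOutput:\n{ex['output']}" for ex in examples]
        return "\n\n".join(blocks)
    if prompt_key == "intent_classifier":
        # One line per example.
        return "\n".join(f'"{ex["text"]}" -> {ex["label"]}' for ex in examples)
    raise KeyError(f"No example formatter registered for {prompt_key!r}")
```

Raising on an unknown key, rather than guessing a format, keeps a newly added prompt from silently rendering its examples in the wrong shape.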
Error type
class PromptLoadError(Exception):
    """Raised when a required prompt file is missing or malformed.
    Only raised for base prompts and English examples (hard requirements).
    Never raised for non-English locale examples (falls back to English)."""
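The fallback rule in the docstring above (non-English locales fall back to English, while missing English examples are a hard error) can be sketched as follows. The helper names `pick_examples` and `assemble`, and the use of already-loaded strings instead of YAML files, are simplifications assumed for illustration.

```python
from __future__ import annotations


class PromptLoadError(Exception):
    """Raised when a required prompt resource is missing."""


def pick_examples(examples_by_locale: dict[str, str], locale: str) -> str:
    """Return examples for locale, falling back to English when absent or empty."""
    text = examples_by_locale.get(locale)
    if text:
        return text
    english = examples_by_locale.get("en")
    if not english:
        raise PromptLoadError("English examples are required but missing")
    return english


def assemble(base_text: str, examples_by_locale: dict[str, str], locale: str) -> str:
    """Insert the chosen examples at the {examples} placeholder."""
    # str.replace (not str.format) leaves runtime placeholders such as
    # {patient_context} untouched for the caller to fill in later.
    return base_text.replace("{examples}", pick_examples(examples_by_locale, locale))
```

Using `str.replace` instead of `str.format` also avoids errors from literal braces, which are common in prompts that contain JSON examples.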
Caching strategy (in-memory)
YAML files are small (< 10KB each) and rarely change. Cache them in
memory using lru_cache keyed on (file_path, file_mtime).
When a file is modified, its mtime changes, so the next lookup misses
the cache and re-reads the file. The stale entry is not evicted
immediately; it ages out under the LRU policy. This avoids reading disk
on every LLM call while still picking up changes without a restart.
There is no explicit TTL: an mtime change forces a reload. For production
stability, also add a 60-second floor (don't re-stat the file more
than once per minute). With the floor in place, an edit can take up to
60 seconds to become visible:
import time
from pathlib import Path

# path → (mtime, last_checked)
_stat_cache: dict[str, tuple[float, float]] = {}

def _get_mtime(path: Path) -> float:
    """Return the file's mtime, calling stat() at most once per 60 seconds."""
    now = time.monotonic()
    cached = _stat_cache.get(str(path))
    if cached and now - cached[1] < 60:
        return cached[0]
    mtime = path.stat().st_mtime
    _stat_cache[str(path)] = (mtime, now)
    return mtime
4. Migration: Prompt-by-Prompt
Group 1: Low frequency, low risk (do first)
| Prompt | Calls/case | Risk |
|---|---|---|
| RERANK_SYSTEM_PROMPT | 1 | Lowest (rarely fires) |
| EXPLANATION_SYSTEM_PROMPT | 5-10 | Low (independent per provider) |
| CLASSIFY_INTENT_PROMPT | per message | Low (simple classification) |
| INTAKE_SYSTEM_PROMPT_V1/V2 | per turn | Low (short, no examples in V2) |
Group 2: High caching benefit (do second)
| Prompt | Calls/case | Caching benefit |
|---|---|---|
| EXTRACTION_SYSTEM_PROMPT | per doc | Highest (59% examples, cached across all docs) |
| ICD_MAPPING_SYSTEM_PROMPT | per doc | High (30% examples) |
| ICD_MAPPING_SYSTEM_PROMPT_COT | per doc | High (29% examples) |
| FHIR_GENERATION_SYSTEM_PROMPT | per doc | Medium (22% examples) |
Group 3: Conversation (do last, most complex)
| Prompt | Complexity |
|---|---|
| CURAWAY_SYSTEM_PROMPT_V1/V2 | 5 placeholders, forbidden phrases, emotional context |
| PHASE_CONTEXTS_V1/V2 | 5 phases × 2 versions = 10 YAML files |
| Phase: document_review (V1) | Has inline examples |
| EXTRACTOR_PROMPT_V1/V2 | Inline rule-examples (not separable) |
| CLINICAL_ANALYSIS_* | COT variant has rich examples |
Migration per prompt
For each prompt:
- Create config/prompts/base/{key}.yaml with the rules/schema portion (everything except examples)
- Create config/prompts/examples/en/{key}.yaml with the examples
- Add {examples} placeholder in the base YAML where examples were
- Update the Python file to call load_prompt(key, locale)
- Run test_prompt_migration_parity.py to verify byte-equality
- Comment out (don't delete) the old constant
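The split-and-reassemble steps above can be checked on a toy constant before touching a real prompt. The prompt text below is invented for illustration and `reassemble` is a hypothetical stand-in for the loader; only the technique (placeholder substitution followed by a byte-level comparison) reflects the steps.

```python
# Before migration: rules and examples live in one hardcoded string.
ORIGINAL = (
    "You classify intents.\n"
    "Examples:\n"
    "\"book a visit\" -> booking\n"
    "Answer with one label."
)

# After migration: rules stay in the base text, examples move to their own file.
BASE = "You classify intents.\nExamples:\n{examples}\nAnswer with one label."
EXAMPLES_EN = "\"book a visit\" -> booking"


def reassemble(base: str, examples: str) -> str:
    """Put the examples back where the placeholder marks their old position."""
    return base.replace("{examples}", examples)


# Byte-equality, not just visual equality: whitespace differences count.
assert reassemble(BASE, EXAMPLES_EN).encode("utf-8") == ORIGINAL.encode("utf-8")
```

Trailing newlines are a likely source of parity failures: a YAML literal block scalar written with `|` keeps one trailing newline, while `|-` strips it, so the chomping indicator has to match what the original constant contained.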
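5. Anthropic Prompt Caching Integration
Where to add cache_control
Add cache_control only to the system prompt block, never to the conversation history. The role-based list below matches the message shape LangChain accepts; when calling the Anthropic Messages API directly, the system prompt goes in the top-level system parameter rather than in messages.
# app/agents/llm_conversation.py: generate_response_streaming()
messages = [
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": assembled_system_prompt,
                "cache_control": {"type": "ephemeral"},
            }
        ],
    },
    # ... conversation history messages (not cached)
    {"role": "user", "content": current_message},
]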
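For code paths that call the Anthropic Python SDK directly instead of going through LangChain, the cached block moves to the top-level `system` parameter. The sketch below only builds the keyword arguments; the helper name `build_request`, the `max_tokens` value and the idea of passing the model name in as a parameter are assumptions for illustration.

```python
from __future__ import annotations


def build_request(system_prompt: str, history: list[dict], user_message: str, model: str) -> dict:
    """Build keyword arguments for a direct Anthropic Messages API call."""
    return {
        "model": model,
        "max_tokens": 1024,  # illustrative value
        # Stable prefix: everything up to and including this block is cacheable.
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # Conversation turns come after the cache breakpoint and are not cached.
        "messages": [*history, {"role": "user", "content": user_message}],
    }
```

The result would be passed as `client.messages.create(**build_request(...))` on an `anthropic.Anthropic()` client. Anthropic only caches prefixes above a model-dependent minimum length (1,024 tokens or more), so short system prompts will not produce cache hits even with `cache_control` set.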
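LangChain integration
Recent langchain-anthropic releases forward a cache_control entry set on a
content block, as in the system message above. The other mechanisms
mentioned during planning (the extra_body parameter, or an
anthropic_cache_control message extension) are unconfirmed and depend on
the installed version. Check the installed version's documentation and use
the API it actually exposes.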
Langfuse tracking
After enabling caching, Langfuse traces will show:
- cache_creation_input_tokens — tokens written to cache (first call)
- cache_read_input_tokens — tokens read from cache (subsequent)
Monitor these in the Langfuse dashboard to verify caching is active.
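A quick way to turn those two counters into a single health number is the share of input tokens served from cache. The sketch below assumes a usage dict carrying Anthropic's field names (`input_tokens`, `cache_creation_input_tokens`, `cache_read_input_tokens`); how these appear in a Langfuse export is not specified here, and the helper name is hypothetical.

```python
def cache_read_ratio(usage: dict) -> float:
    """Fraction of a call's input tokens that were read from the prompt cache."""
    read = usage.get("cache_read_input_tokens") or 0
    created = usage.get("cache_creation_input_tokens") or 0
    uncached = usage.get("input_tokens") or 0
    total = read + created + uncached
    return read / total if total else 0.0
```

A ratio near zero on every call after the first suggests the prefix is changing between calls or is too short to be cached.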
6. Test Plan
Unit tests (tests/test_prompt_loader.py)
- test_load_prompt_returns_string: basic happy path
- test_load_prompt_includes_examples: examples are in the output
- test_load_prompt_locale_fallback: missing locale falls back to English
- test_load_prompt_en_fallback_required: missing English raises PromptLoadError
- test_load_prompt_malformed_yaml: graceful fallback on bad YAML
- test_load_prompt_empty_examples: falls back to English
- test_load_phase_context: loads correct phase for version
- test_load_phase_context_missing: raises on missing phase
- test_format_examples_conversation: WRONG/RIGHT format
- test_format_examples_extraction: Input/Output format
- test_format_examples_intent: one-line format
- test_validate_prompt_catches_placeholders: detects unresolved {...}
- test_yaml_cache_invalidation: mtime change forces reload
- test_concurrent_loads: thread-safe loading
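As an illustration of what test_validate_prompt_catches_placeholders might exercise, here is a sketch of a regex-based placeholder check with its test. The helper `unresolved_placeholders` and the `allowed` parameter for runtime placeholders are assumptions, not the loader's actual interface.

```python
import re

# Matches {identifier} only, so literal JSON such as {"code": "J45"} is ignored.
_PLACEHOLDER = re.compile(r"\{([a-z_][a-z0-9_]*)\}")


def unresolved_placeholders(prompt: str, allowed: frozenset = frozenset()) -> list:
    """Return placeholder names still present in the prompt, minus allowed ones."""
    return [name for name in _PLACEHOLDER.findall(prompt) if name not in allowed]


def test_validate_prompt_catches_placeholders():
    prompt = 'Rules.\n{examples}\nReturn JSON like {"code": "J45"}.\n{patient_context}'
    # Both identifier-style placeholders are reported; the JSON braces are not.
    assert unresolved_placeholders(prompt) == ["examples", "patient_context"]
    # Placeholders filled at request time can be whitelisted.
    assert unresolved_placeholders(prompt, frozenset({"patient_context"})) == ["examples"]
```

Restricting the pattern to identifier characters matters for the extraction and FHIR prompts, whose examples contain literal braces.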
Parity test (tests/test_prompt_migration_parity.py), temporary
For each migrated prompt, assert:
def test_conversation_v2_parity():
    """Assembled prompt matches the original hardcoded constant."""
    from app.services.prompt_loader import load_prompt
    assembled = load_prompt("conversation", "en", "v2")
    # Replace dynamic placeholders with known values
    assembled = assembled.replace("{forbidden_phrases_block}", KNOWN_FORBIDDEN)
    assembled = assembled.replace("{emotional_context}", "neutral")
    assembled = assembled.replace("{phase_context}", "")
    assembled = assembled.replace("{patient_context}", "")
    assert assembled == ORIGINAL_CURAWAY_SYSTEM_PROMPT_V2
One test per prompt. Removed after the migration PR is stable in production for 1 week.
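When a parity assertion fails on a prompt that is several thousand characters long, locating the mismatch by eye is slow. A small helper along these lines could be used in the assertion message; the name `first_difference` and the context width are illustrative choices.

```python
from __future__ import annotations


def first_difference(expected: str, actual: str, context: int = 20) -> str | None:
    """Describe the first position where two strings diverge, or None if equal."""
    if expected == actual:
        return None
    limit = min(len(expected), len(actual))
    # If one string is a prefix of the other, the difference is at the shorter length.
    index = next((i for i in range(limit) if expected[i] != actual[i]), limit)
    start = max(0, index - context)
    return (
        f"first difference at char {index}: "
        f"expected {expected[start:index + context]!r}, "
        f"got {actual[start:index + context]!r}"
    )
```

A parity test could then read `assert assembled == ORIGINAL, first_difference(ORIGINAL, assembled)`, where `ORIGINAL` stands for the constant under test. The `!r` formatting makes stray newlines and trailing spaces visible.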
CI integration
# .github/workflows/ci.yml: add step
- name: Prompt validation
  run: |
    python -m pytest tests/test_prompt_loader.py -v
    python -m pytest tests/test_prompt_migration_parity.py -v
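Validation test (permanent)
import re
from pathlib import Path

from app.services.prompt_loader import load_prompt

def _split_version(stem: str) -> tuple[str, str | None]:
    """Split 'conversation_v2' into ('conversation', 'v2'); others return (stem, None)."""
    match = re.fullmatch(r"(.+)_(v\d+)", stem)
    return (match.group(1), match.group(2)) if match else (stem, None)

def test_all_base_prompts_have_english_examples():
    """Every base prompt YAML has a corresponding English examples YAML."""
    base_dir = Path("config/prompts/base")
    en_dir = Path("config/prompts/examples/en")
    for base_file in base_dir.glob("*.yaml"):
        # Some prompts legitimately have no examples (extractor_v2, matcher)
        if base_file.stem in PROMPTS_WITHOUT_EXAMPLES:
            continue
        # Assumption: versioned bases (conversation_v1, conversation_v2) share
        # one examples file (conversation.yaml), so drop the version suffix.
        key, _ = _split_version(base_file.stem)
        assert (en_dir / f"{key}.yaml").exists(), \
            f"Missing English examples for {key}"

def test_all_prompts_assemble_cleanly():
    """Every base+example combination produces a valid prompt."""
    for base_file in Path("config/prompts/base").glob("*.yaml"):
        # load_prompt takes the key and version separately, not the file stem.
        key, version = _split_version(base_file.stem)
        prompt = load_prompt(key, "en", version)
        assert len(prompt) > 100, f"Prompt {key} suspiciously short"
        # Check no YAML artifacts leaked into the prompt
        assert "---\n" not in prompt
        assert "description:" not in prompt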
7. Implementation Checklist
- [ ] Create config/prompts/ directory structure
- [ ] Write app/services/prompt_loader.py with load_prompt(), load_phase_context(), caching, validation
- [ ] Write tests/test_prompt_loader.py (14 unit tests)
- [ ] Group 1 migration:
    - [ ] Extract RERANK_SYSTEM_PROMPT → base + examples YAML
    - [ ] Extract EXPLANATION_SYSTEM_PROMPT → base + examples YAML
    - [ ] Extract CLASSIFY_INTENT_PROMPT → base + examples YAML
    - [ ] Extract INTAKE_SYSTEM_PROMPT_V1/V2 → base + examples YAML
    - [ ] Update Python files to use load_prompt()
    - [ ] Run parity tests
- [ ] Group 2 migration:
    - [ ] Extract EXTRACTION_SYSTEM_PROMPT → base + examples YAML
    - [ ] Extract ICD_MAPPING_SYSTEM_PROMPT → base + examples YAML
    - [ ] Extract ICD_MAPPING_SYSTEM_PROMPT_COT → base + examples YAML
    - [ ] Extract FHIR_GENERATION_SYSTEM_PROMPT → base + examples YAML
    - [ ] Update Python files to use load_prompt()
    - [ ] Run parity tests
- [ ] Group 3 migration:
    - [ ] Extract CURAWAY_SYSTEM_PROMPT_V1 → base + examples YAML
    - [ ] Extract CURAWAY_SYSTEM_PROMPT_V2 → base + examples YAML
    - [ ] Extract 10 phase contexts → phase_contexts/v1/*.yaml and v2/*.yaml
    - [ ] Extract EXTRACTOR_PROMPT_V1/V2 → base YAML (no separable examples)
    - [ ] Extract CLINICAL_ANALYSIS_* → base + examples YAML
    - [ ] Update Python files to use load_prompt() and load_phase_context()
    - [ ] Run parity tests
- [ ] A/B testing framework for examples with Flagsmith integration
- [ ] Write tests/test_prompt_migration_parity.py (1 test per prompt)
- [ ] Add prompt caching (cache_control) to generate_response_streaming()
- [ ] Add prompt caching to clinical extraction calls
- [ ] Update CI workflow to run prompt validation tests
- [ ] Verify Langfuse traces show cache_read_input_tokens
- [ ] Run full patient flow end-to-end (sign up → intake → doc upload → matching)
- [ ] Commit, push, open PR
8. Out of Scope
- Arabic example YAML files (Phase 1 multilingual, separate PR)
- Prompt analytics dashboard (post-abstraction follow-up)
- Fine-tuning dataset export from YAML (post-Series A)
- MATCHER_USER_TEMPLATE migration (runtime template vars, stays in Python)
- ConversationApp dark mode (separate track)
- Case record porting implementation (separate track)
9. Risks
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
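| YAML parse error in production | Low | Medium (prompt fails to load) | CI validates all YAML. At runtime, a broken non-English examples file falls back to English; a broken base or English file raises PromptLoadError. |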
| Parity test misses a difference | Low | High — behavioral regression | Run full e2e flow before merging. Keep old constants commented for rollback. |
| Prompt caching doesn't activate | Medium | Low (just costs more) | Check Langfuse for cache_read_input_tokens. Verify that the cache_control content-block format matches the Anthropic docs and that the cached prefix meets the model's minimum cacheable length. |
| In-memory cache grows unbounded | Very low | Low — 64 entries × <10KB each = <640KB | lru_cache(maxsize=64) caps it. |
| Developer adds prompt without YAML | Medium | Low — falls back | CI test test_all_base_prompts_have_english_examples catches it. |
| File system latency on YAML reads | Very low | Low — 60s stat cache floor | Reads cached in memory. Only re-stats once per minute. |
10. References
- Companion steer: ai-steer/prompt-abstraction-steer.md
- Prompt audit data (Session 35): 19 prompts, 23 example blocks, ~9,900 chars inline
- Anthropic prompt caching: https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching
- Existing YAML patterns: config/voice_rules.yaml, config/feature_flags.yaml
- Medical advice review workflow: xlsx export → clinical advisor review → approved strings