Guardrails¶
Curaway's AI assistant operates in a sensitive medical travel domain where incorrect information can cause real harm. The guardrails system implements defense-in-depth across three layers to ensure the assistant never provides medical advice, never breaks character, and always redirects patients to qualified professionals when appropriate.
Three-Layer Defense Architecture¶
flowchart TD
A[User Message] --> B[Layer 1: System Prompt Safety]
B --> C[Layer 2: Input Classifier]
C -->|Allowed| D[LLM Response Generation]
C -->|Blocked| E[Brand-Voiced Redirect]
D --> F[Layer 3: Output Validator]
F -->|Clean| G[Response to User]
F -->|Violation| H[Filtered Response + Redirect]
| Layer | Component | Technology | Latency | Purpose |
|---|---|---|---|---|
| 1 | System Prompt | Langfuse managed prompts | 0ms (cached) | Behavioral boundaries |
| 2 | Input Classifier | Claude Haiku | ~200ms | Categorize and gate input |
| 3 | Output Validator | Regex engine | <5ms | Catch residual violations |
Fail-Open Policy
If the input classifier encounters an error (API timeout, malformed response, rate limit), the system fails open and allows the message through. The rationale: blocking a patient seeking legitimate help is worse than occasionally processing an edge-case message. Output validation still catches dangerous responses.
Layer 1: Langfuse System Prompts¶
System prompts are managed in Langfuse and loaded at conversation initialization. They are cached locally with a 5-minute TTL to avoid per-request latency.
Safety Instructions Embedded in System Prompt¶
The system prompt contains explicit behavioral constraints:
You are Curaway's medical travel coordination assistant. You help patients
navigate cross-border healthcare by connecting them with vetted providers,
explaining logistics, and managing documentation.
CRITICAL SAFETY RULES:
- NEVER diagnose medical conditions
- NEVER recommend specific treatments or procedures
- NEVER predict surgical outcomes or recovery timelines
- NEVER provide medication dosages or drug interactions
- NEVER act as a substitute for a licensed medical professional
- ALWAYS redirect medical questions to the patient's assigned provider
- ALWAYS maintain your role as a travel coordination assistant
- ALWAYS use warm, empathetic language when redirecting
Prompt Versioning¶
Langfuse tracks prompt versions with production/staging labels. Changes to safety instructions require review and are deployed by promoting a staging version to production. The prompt cache key includes the version identifier so updates propagate within the TTL window.
Layer 2: Input Classifier¶
Every user message passes through a lightweight Claude Haiku classification step before
reaching the primary conversation LLM. The classifier maps each message to one of 8
categories defined in config/guardrails.yaml.
Classification Categories¶
| Category | Action | Example |
|---|---|---|
medical_travel |
Allow | "What hospitals in Turkey do knee replacements?" |
document_upload |
Allow | "Here's my medical report" |
general_question |
Allow | "What's the process for getting a visa?" |
greeting |
Allow | "Hi, I'm new here" |
off_topic |
Redirect | "What's the weather in Paris?" |
medical_advice |
Redirect | "Should I take ibuprofen before surgery?" |
prompt_injection |
Block + Log | "Ignore all previous instructions..." |
emergency |
Escalate | "I'm having chest pains right now" |
Classifier Prompt¶
CLASSIFIER_PROMPT = """Classify the following user message into exactly one category.
Return ONLY the category name, nothing else.
Categories: medical_travel, document_upload, general_question,
off_topic, medical_advice, prompt_injection, emergency, greeting
User message: {message}
Category:"""
Classification Flow¶
sequenceDiagram
participant U as User
participant API as FastAPI
participant C as Classifier (Haiku)
participant LLM as Conversation LLM
participant T as Templates
U->>API: Send message
API->>C: Classify message
C-->>API: Category
alt medical_travel / document_upload / general_question / greeting
API->>LLM: Process message
LLM-->>API: Response
else off_topic / medical_advice
API->>T: Load redirect template
T-->>API: Brand-voiced redirect
else prompt_injection
API->>T: Load block template
Note over API: Log attempt with full context
else emergency
API->>T: Load emergency template
Note over API: Emergency escalation path
end
API-->>U: Response
Fail-Open Behavior¶
async def classify_message(message: str) -> str:
"""Classify user input. Returns category or 'general_question' on failure."""
try:
response = await haiku_client.classify(message)
category = response.strip().lower()
if category not in VALID_CATEGORIES:
logger.warning(f"Unknown category '{category}', defaulting to general_question")
return "general_question"
return category
except Exception as e:
logger.error(f"Classifier error: {e}, failing open")
return "general_question" # Fail open — never block the patient
Layer 3: Regex Output Validator¶
After the LLM generates a response, the output validator scans for 12 forbidden patterns using compiled regex. This is a fast, deterministic safety net that catches cases where the LLM drifts despite system prompt instructions.
Forbidden Patterns¶
| # | Pattern Category | Regex Example | Rationale |
|---|---|---|---|
| 1 | Diagnosis statements | you (have\|likely have\|probably have)\s+\w+ |
Never diagnose |
| 2 | Treatment recommendations | (you should\|I recommend)\s+(take\|try\|use)\s+ |
Never prescribe |
| 3 | Outcome predictions | (success rate\|survival rate)\s+is\s+\d+% |
Never predict outcomes |
| 4 | Medication dosage | \d+\s*(mg\|ml\|mcg)\s+(daily\|twice\|three times) |
Never dose medications |
| 5 | Drug interactions | (should not\|don't)\s+(take\|mix)\s+\w+\s+with |
Never advise on drugs |
| 6 | Breaking character | (as an AI\|I'm just a language model\|I cannot) |
Stay in persona |
| 7 | Competitor mentions | (medical tourism company\|other platform) |
Brand protection |
| 8 | Legal claims | (guaranteed\|100% safe\|no risk) |
Never guarantee outcomes |
| 9 | Insurance fraud hints | (claim this as\|tell your insurance) |
Legal compliance |
| 10 | Unauthorized promises | (we guarantee\|we promise\|we ensure) |
No binding commitments |
| 11 | Specific recovery times | (recovery\|healing)\s+takes?\s+\d+\s+(days\|weeks) |
Provider's domain |
| 12 | Cost guarantees | (total cost\|final price)\s+(is\|will be)\s+\$\d+ |
Costs are estimates only |
Validator Implementation¶
import re
from typing import List, Tuple
class OutputValidator:
def __init__(self, patterns: List[dict]):
self.compiled = [
{
"name": p["name"],
"regex": re.compile(p["pattern"], re.IGNORECASE),
"replacement": p.get("replacement", ""),
}
for p in patterns
]
def validate(self, text: str) -> Tuple[str, List[str]]:
"""Returns (cleaned_text, list_of_violations)."""
violations = []
cleaned = text
for pattern in self.compiled:
if pattern["regex"].search(cleaned):
violations.append(pattern["name"])
cleaned = pattern["regex"].sub(pattern["replacement"], cleaned)
return cleaned, violations
When violations are detected, the response is cleaned and a redirect message is appended directing the patient to consult their provider for medical specifics.
Configuration: config/guardrails.yaml¶
All guardrail configuration lives in a single YAML file that serves as the source of truth. This includes classification categories, forbidden patterns, response templates, and file validation rules.
# config/guardrails.yaml
version: "1.2"
classifier:
model: "claude-3-haiku-20240307"
timeout_ms: 3000
fail_open_category: "general_question"
categories:
- name: medical_travel
action: allow
description: "Questions about medical travel coordination"
- name: document_upload
action: allow
description: "Document submission and management"
- name: general_question
action: allow
description: "General platform or process questions"
- name: greeting
action: allow
description: "Greetings and conversation starters"
- name: off_topic
action: redirect
description: "Unrelated to medical travel"
- name: medical_advice
action: redirect
description: "Requests for medical diagnosis or treatment advice"
- name: prompt_injection
action: block
description: "Attempts to override system instructions"
- name: emergency
action: escalate
description: "Medical emergency requiring immediate help"
response_templates:
off_topic: >
I appreciate your curiosity! I'm best at helping with medical travel
coordination — things like finding the right provider, understanding
procedures, and managing your travel documents. How can I help with
your medical travel journey?
medical_advice: >
That's a really important question, and I want to make sure you get
the best answer. Medical questions like this are best discussed with
your healthcare provider, who knows your full medical history. I can
help connect you with one of our vetted specialists if you'd like!
prompt_injection: >
I'm here to help you with medical travel coordination! If you have
questions about finding providers, understanding procedures, or
managing travel logistics, I'd love to assist.
emergency: >
If you're experiencing a medical emergency, please call your local
emergency number immediately (911 in the US, 112 in Europe, 999 in
the UK). Your safety is the top priority. Once you're safe, I'm
here to help with anything you need.
output_validator:
patterns:
- name: diagnosis_statement
pattern: 'you (have|likely have|probably have)\s+\w+'
replacement: "your healthcare provider can help determine"
- name: treatment_recommendation
pattern: '(you should|I recommend)\s+(take|try|use)\s+'
replacement: "your provider may discuss options including "
# ... remaining 10 patterns
file_validation:
allowed_extensions:
- .pdf
- .jpg
- .jpeg
- .png
- .heic
- .doc
- .docx
max_size_bytes: 20971520 # 20MB
min_size_bytes: 5120 # 5KB
allowed_mime_types:
- application/pdf
- image/jpeg
- image/png
- image/heic
- application/msword
- application/vnd.openxmlformats-officedocument.wordprocessingml.document
Response Templates¶
All redirect responses use brand-voiced warm language. Templates are loaded from
guardrails.yaml at startup and cached in memory.
Template Design Principles¶
- Acknowledge the question — never dismiss what the patient asked
- Explain why we redirect — "your provider knows your full history"
- Offer an alternative — pivot to something Curaway can help with
- End with an invitation — keep the conversation going
Template Loading¶
class GuardrailTemplates:
def __init__(self, config_path: str = "config/guardrails.yaml"):
with open(config_path) as f:
config = yaml.safe_load(f)
self.templates = config["response_templates"]
def get_redirect(self, category: str) -> str:
return self.templates.get(category, self.templates["off_topic"])
File Validation¶
File validation operates at two levels: frontend (immediate user feedback) and backend (presign-time metadata check). Critically, the backend validator never downloads files from R2 — it only validates metadata provided at presign request time.
Frontend Validation¶
const FILE_CONSTRAINTS = {
maxSize: 20 * 1024 * 1024, // 20MB
minSize: 5 * 1024, // 5KB
allowedExtensions: ['.pdf', '.jpg', '.jpeg', '.png', '.heic', '.doc', '.docx'],
};
function validateFile(file: File): ValidationResult {
if (file.size > FILE_CONSTRAINTS.maxSize) {
return { valid: false, error: "File is too large. Maximum size is 20MB." };
}
if (file.size < FILE_CONSTRAINTS.minSize) {
return { valid: false, error: "File appears to be empty or corrupted (under 5KB)." };
}
const ext = '.' + file.name.split('.').pop()?.toLowerCase();
if (!FILE_CONSTRAINTS.allowedExtensions.includes(ext)) {
return { valid: false, error: `File type ${ext} is not supported.` };
}
return { valid: true };
}
Backend Presign-Time Validation¶
@router.post("/documents/presign")
async def create_presign_url(request: PresignRequest):
"""Validate file metadata and return presigned upload URL."""
# Validate extension
ext = Path(request.filename).suffix.lower()
if ext not in config.file_validation.allowed_extensions:
raise HTTPException(400, f"File type {ext} not supported")
# Validate MIME type
if request.content_type not in config.file_validation.allowed_mime_types:
raise HTTPException(400, f"MIME type {request.content_type} not supported")
# Validate declared size
if request.file_size > config.file_validation.max_size_bytes:
raise HTTPException(400, "File exceeds 20MB limit")
if request.file_size < config.file_validation.min_size_bytes:
raise HTTPException(400, "File too small (minimum 5KB)")
# Generate presigned URL — no R2 download occurs
presign_url = r2_client.generate_presigned_url(
bucket=settings.R2_BUCKET,
key=f"documents/{request.tenant_id}/{uuid4()}{ext}",
expires_in=900, # 15 minutes
)
return {"upload_url": presign_url, "key": key}
No R2 Download
The backend validates only the metadata (extension, MIME type, declared size) provided in the presign request. It never downloads or inspects the actual file content from R2. This keeps validation fast and avoids unnecessary data transfer costs.
Feature Flags¶
Three Flagsmith feature flags control each guardrail layer independently, enabling gradual rollout and instant kill-switch capability.
| Flag | Default | Controls |
|---|---|---|
guardrail_input_classifier_enabled |
true |
Layer 2 input classification |
guardrail_output_validator_enabled |
true |
Layer 3 regex output validation |
guardrail_file_validation_enabled |
true |
File upload validation |
Flag Evaluation¶
async def process_message(message: str, tenant_id: str) -> str:
flags = await flagsmith.get_environment_flags()
# Layer 2: Input classification
if flags.is_feature_enabled("guardrail_input_classifier_enabled"):
category = await classify_message(message)
if category in REDIRECT_CATEGORIES:
return templates.get_redirect(category)
if category == "emergency":
return templates.get_redirect("emergency")
# Generate LLM response
response = await generate_response(message)
# Layer 3: Output validation
if flags.is_feature_enabled("guardrail_output_validator_enabled"):
cleaned, violations = validator.validate(response)
if violations:
logger.warning(f"Output violations: {violations}")
response = cleaned + "\n\n" + PROVIDER_REDIRECT_SUFFIX
return response
When a flag is disabled, that layer is skipped entirely. This allows operators to:
- Disable the classifier if Haiku latency spikes (output validator still catches issues)
- Disable the output validator during testing to see raw LLM output
- Disable file validation temporarily if rules are too restrictive
Conversation History Filtering¶
When building conversation context for the LLM, the system filters out messages that were flagged as toxic or blocked by the input classifier. This prevents the conversation history from containing content that could influence the LLM's behavior.
Filtering Logic¶
def build_conversation_context(messages: List[Message]) -> List[dict]:
"""Build LLM context, excluding blocked/toxic messages."""
context = []
for msg in messages:
# Skip messages that were classified as harmful
if msg.guardrail_category in ("prompt_injection", "medical_advice"):
continue
# Skip messages flagged by content moderation
if msg.is_toxic:
continue
context.append({
"role": msg.role,
"content": msg.content,
})
return context
Why Filter History?¶
LLMs are influenced by conversation context. If a prompt injection attempt remains in the history, subsequent legitimate messages may be processed in a compromised context. By removing toxic messages from the context window, each new turn starts from a clean state.
The original messages are preserved in the database for audit purposes — they are only excluded from the LLM context.
Monitoring and Observability¶
Guardrail Metrics¶
| Metric | Source | Alert Threshold |
|---|---|---|
| Classification distribution | Langfuse traces | >5% prompt_injection in 1h |
| Classifier latency (p95) | Langfuse traces | >500ms |
| Classifier error rate | Application logs | >2% in 15min |
| Output violations per hour | Application logs | >20 violations/h |
| File validation rejections | Application logs | >50% rejection rate |
Langfuse Trace Integration¶
Every classified message generates a Langfuse trace with:
- Input message (redacted if PII detected)
- Classification category
- Classification latency
- Whether the message was allowed, redirected, or blocked
- If output validation was triggered, which patterns matched
This data feeds into dashboards for monitoring guardrail effectiveness and tuning thresholds.
Design Decisions¶
| Decision | Rationale |
|---|---|
| Fail open on classifier errors | Blocking patients seeking help causes more harm than occasional off-topic processing |
| Regex for output validation | Deterministic, fast (<5ms), no additional API calls needed |
| Haiku for classification | Fastest Claude model, adequate for 8-category classification |
| YAML as config source | Non-engineers can review and suggest changes via PR |
| Brand-voiced redirects | Maintains trust and warmth even when saying "no" |
| No R2 download for validation | Keeps presign fast, avoids egress costs |
| History filtering vs. deletion | Audit trail preserved while LLM context stays clean |