
Guardrails

Curaway's AI assistant operates in a sensitive medical travel domain where incorrect information can cause real harm. The guardrails system implements defense-in-depth across three layers to ensure the assistant never provides medical advice, never breaks character, and always redirects patients to qualified professionals when appropriate.


Three-Layer Defense Architecture

flowchart TD
    A[User Message] --> B[Layer 1: System Prompt Safety]
    B --> C[Layer 2: Input Classifier]
    C -->|Allowed| D[LLM Response Generation]
    C -->|Blocked| E[Brand-Voiced Redirect]
    D --> F[Layer 3: Output Validator]
    F -->|Clean| G[Response to User]
    F -->|Violation| H[Filtered Response + Redirect]
| Layer | Component | Technology | Latency | Purpose |
|-------|-----------|------------|---------|---------|
| 1 | System Prompt | Langfuse managed prompts | 0ms (cached) | Behavioral boundaries |
| 2 | Input Classifier | Claude Haiku | ~200ms | Categorize and gate input |
| 3 | Output Validator | Regex engine | <5ms | Catch residual violations |

Fail-Open Policy

If the input classifier encounters an error (API timeout, malformed response, rate limit), the system fails open and allows the message through. The rationale: blocking a patient seeking legitimate help is worse than occasionally processing an edge-case message. Output validation still catches dangerous responses.


Layer 1: Langfuse System Prompts

System prompts are managed in Langfuse and loaded at conversation initialization. They are cached locally with a 5-minute TTL to avoid per-request latency.
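The caching behavior can be sketched as a small in-memory TTL cache. This is a minimal illustration, not the actual implementation; `fetch_fn` is a hypothetical stand-in for the Langfuse client call:

```python
import time
from typing import Callable, Dict, Tuple

class PromptCache:
    """In-memory prompt cache with a TTL (5 minutes per the policy above)."""

    def __init__(self, fetch_fn: Callable[[str], str], ttl_seconds: int = 300):
        self.fetch_fn = fetch_fn          # stand-in for the Langfuse fetch
        self.ttl = ttl_seconds
        self._store: Dict[str, Tuple[str, float]] = {}

    def get(self, name: str) -> str:
        entry = self._store.get(name)
        if entry is not None and time.monotonic() - entry[1] < self.ttl:
            return entry[0]               # cache hit: no network call
        prompt = self.fetch_fn(name)      # miss or expired: refetch and restamp
        self._store[name] = (prompt, time.monotonic())
        return prompt
```

Repeated lookups within the TTL never touch the network, which is what keeps Layer 1 at effectively zero per-request latency.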

Safety Instructions Embedded in System Prompt

The system prompt contains explicit behavioral constraints:

You are Curaway's medical travel coordination assistant. You help patients
navigate cross-border healthcare by connecting them with vetted providers,
explaining logistics, and managing documentation.

CRITICAL SAFETY RULES:
- NEVER diagnose medical conditions
- NEVER recommend specific treatments or procedures
- NEVER predict surgical outcomes or recovery timelines
- NEVER provide medication dosages or drug interactions
- NEVER act as a substitute for a licensed medical professional
- ALWAYS redirect medical questions to the patient's assigned provider
- ALWAYS maintain your role as a travel coordination assistant
- ALWAYS use warm, empathetic language when redirecting

Prompt Versioning

Langfuse tracks prompt versions with production/staging labels. Changes to safety instructions require review and are deployed by promoting a staging version to production. The prompt cache key includes the version identifier so updates propagate within the TTL window.


Layer 2: Input Classifier

Every user message passes through a lightweight Claude Haiku classification step before reaching the primary conversation LLM. The classifier maps each message to one of 8 categories defined in config/guardrails.yaml.

Classification Categories

| Category | Action | Example |
|----------|--------|---------|
| medical_travel | Allow | "What hospitals in Turkey do knee replacements?" |
| document_upload | Allow | "Here's my medical report" |
| general_question | Allow | "What's the process for getting a visa?" |
| greeting | Allow | "Hi, I'm new here" |
| off_topic | Redirect | "What's the weather in Paris?" |
| medical_advice | Redirect | "Should I take ibuprofen before surgery?" |
| prompt_injection | Block + Log | "Ignore all previous instructions..." |
| emergency | Escalate | "I'm having chest pains right now" |

Classifier Prompt

CLASSIFIER_PROMPT = """Classify the following user message into exactly one category.
Return ONLY the category name, nothing else.

Categories: medical_travel, document_upload, general_question,
off_topic, medical_advice, prompt_injection, emergency, greeting

User message: {message}

Category:"""
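Because the model is asked to return only a bare category name, the raw completion still needs light normalization before use. A minimal sketch (the helper name is illustrative, not from the codebase):

```python
VALID_CATEGORIES = {
    "medical_travel", "document_upload", "general_question", "greeting",
    "off_topic", "medical_advice", "prompt_injection", "emergency",
}

def normalize_category(raw: str) -> str:
    """Map the model's raw completion to a known category.

    Strips whitespace and lowercases, then falls back to 'general_question'
    per the fail-open policy when the model returns anything outside the
    configured set.
    """
    category = raw.strip().lower()
    return category if category in VALID_CATEGORIES else "general_question"
```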

Classification Flow

sequenceDiagram
    participant U as User
    participant API as FastAPI
    participant C as Classifier (Haiku)
    participant LLM as Conversation LLM
    participant T as Templates

    U->>API: Send message
    API->>C: Classify message
    C-->>API: Category
    alt medical_travel / document_upload / general_question / greeting
        API->>LLM: Process message
        LLM-->>API: Response
    else off_topic / medical_advice
        API->>T: Load redirect template
        T-->>API: Brand-voiced redirect
    else prompt_injection
        API->>T: Load block template
        Note over API: Log attempt with full context
    else emergency
        API->>T: Load emergency template
        Note over API: Emergency escalation path
    end
    API-->>U: Response

Fail-Open Behavior

async def classify_message(message: str) -> str:
    """Classify user input. Returns category or 'general_question' on failure."""
    try:
        response = await haiku_client.classify(message)
        category = response.strip().lower()
        if category not in VALID_CATEGORIES:
            logger.warning(f"Unknown category '{category}', defaulting to general_question")
            return "general_question"
        return category
    except Exception as e:
        logger.error(f"Classifier error: {e}, failing open")
        return "general_question"  # Fail open — never block the patient

Layer 3: Regex Output Validator

After the LLM generates a response, the output validator scans for 12 forbidden patterns using compiled regex. This is a fast, deterministic safety net that catches cases where the LLM drifts despite system prompt instructions.

Forbidden Patterns

| # | Pattern Category | Regex | Rationale |
|---|------------------|-------|-----------|
| 1 | Diagnosis statements | `you (have\|likely have\|probably have)\s+\w+` | Never diagnose |
| 2 | Treatment recommendations | `(you should\|I recommend)\s+(take\|try\|use)\s+` | Never prescribe |
| 3 | Outcome predictions | `(success rate\|survival rate)\s+is\s+\d+%` | Never predict outcomes |
| 4 | Medication dosage | `\d+\s*(mg\|ml\|mcg)\s+(daily\|twice\|three times)` | Never dose medications |
| 5 | Drug interactions | `(should not\|don't)\s+(take\|mix)\s+\w+\s+with` | Never advise on drugs |
| 6 | Breaking character | `(as an AI\|I'm just a language model\|I cannot)` | Stay in persona |
| 7 | Competitor mentions | `(medical tourism company\|other platform)` | Brand protection |
| 8 | Legal claims | `(guaranteed\|100% safe\|no risk)` | Never guarantee outcomes |
| 9 | Insurance fraud hints | `(claim this as\|tell your insurance)` | Legal compliance |
| 10 | Unauthorized promises | `(we guarantee\|we promise\|we ensure)` | No binding commitments |
| 11 | Specific recovery times | `(recovery\|healing)\s+takes?\s+\d+\s+(days\|weeks)` | Provider's domain |
| 12 | Cost guarantees | `(total cost\|final price)\s+(is\|will be)\s+\$\d+` | Costs are estimates only |

Validator Implementation

import re
from typing import List, Tuple

class OutputValidator:
    def __init__(self, patterns: List[dict]):
        self.compiled = [
            {
                "name": p["name"],
                "regex": re.compile(p["pattern"], re.IGNORECASE),
                "replacement": p.get("replacement", ""),
            }
            for p in patterns
        ]

    def validate(self, text: str) -> Tuple[str, List[str]]:
        """Returns (cleaned_text, list_of_violations)."""
        violations = []
        cleaned = text
        for pattern in self.compiled:
            if pattern["regex"].search(cleaned):
                violations.append(pattern["name"])
                cleaned = pattern["regex"].sub(pattern["replacement"], cleaned)
        return cleaned, violations

When violations are detected, the response is cleaned and a redirect message is appended directing the patient to consult their provider for medical specifics.
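For illustration, here is pattern #1 run against a drifted response (the input sentence is invented), showing the detect-then-substitute behavior of the validator:

```python
import re

# Pattern #1 and its replacement, as configured in guardrails.yaml
diagnosis = re.compile(r"you (have|likely have|probably have)\s+\w+", re.IGNORECASE)
replacement = "your healthcare provider can help determine"

response = "Based on your symptoms, you probably have tendonitis."
violations = []
if diagnosis.search(response):
    violations.append("diagnosis_statement")   # recorded for logging/metrics
    response = diagnosis.sub(replacement, response)

# response is now:
# "Based on your symptoms, your healthcare provider can help determine."
```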


Configuration: config/guardrails.yaml

All guardrail configuration lives in a single YAML file that serves as the source of truth. This includes classification categories, forbidden patterns, response templates, and file validation rules.

# config/guardrails.yaml

version: "1.2"

classifier:
  model: "claude-3-haiku-20240307"
  timeout_ms: 3000
  fail_open_category: "general_question"
  categories:
    - name: medical_travel
      action: allow
      description: "Questions about medical travel coordination"
    - name: document_upload
      action: allow
      description: "Document submission and management"
    - name: general_question
      action: allow
      description: "General platform or process questions"
    - name: greeting
      action: allow
      description: "Greetings and conversation starters"
    - name: off_topic
      action: redirect
      description: "Unrelated to medical travel"
    - name: medical_advice
      action: redirect
      description: "Requests for medical diagnosis or treatment advice"
    - name: prompt_injection
      action: block
      description: "Attempts to override system instructions"
    - name: emergency
      action: escalate
      description: "Medical emergency requiring immediate help"

response_templates:
  off_topic: >
    I appreciate your curiosity! I'm best at helping with medical travel
    coordination — things like finding the right provider, understanding
    procedures, and managing your travel documents. How can I help with
    your medical travel journey?
  medical_advice: >
    That's a really important question, and I want to make sure you get
    the best answer. Medical questions like this are best discussed with
    your healthcare provider, who knows your full medical history. I can
    help connect you with one of our vetted specialists if you'd like!
  prompt_injection: >
    I'm here to help you with medical travel coordination! If you have
    questions about finding providers, understanding procedures, or
    managing travel logistics, I'd love to assist.
  emergency: >
    If you're experiencing a medical emergency, please call your local
    emergency number immediately (911 in the US, 112 in Europe, 999 in
    the UK). Your safety is the top priority. Once you're safe, I'm
    here to help with anything you need.

output_validator:
  patterns:
    - name: diagnosis_statement
      pattern: 'you (have|likely have|probably have)\s+\w+'
      replacement: "your healthcare provider can help determine"
    - name: treatment_recommendation
      pattern: '(you should|I recommend)\s+(take|try|use)\s+'
      replacement: "your provider may discuss options including "
    # ... remaining 10 patterns

file_validation:
  allowed_extensions:
    - .pdf
    - .jpg
    - .jpeg
    - .png
    - .heic
    - .doc
    - .docx
  max_size_bytes: 20971520  # 20MB
  min_size_bytes: 5120      # 5KB
  allowed_mime_types:
    - application/pdf
    - image/jpeg
    - image/png
    - image/heic
    - application/msword
    - application/vnd.openxmlformats-officedocument.wordprocessingml.document
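One way to lift the parsed YAML into typed, validated settings is a frozen dataclass per section. This is a sketch, fed a literal dict here so it stands alone; the real code would pass in the `yaml.safe_load` result:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class FileValidationConfig:
    allowed_extensions: Tuple[str, ...]
    max_size_bytes: int
    min_size_bytes: int

def load_file_validation(raw: dict) -> FileValidationConfig:
    """Validate and freeze the file_validation section of guardrails.yaml."""
    fv = raw["file_validation"]
    if fv["min_size_bytes"] >= fv["max_size_bytes"]:
        raise ValueError("min_size_bytes must be below max_size_bytes")
    return FileValidationConfig(
        allowed_extensions=tuple(fv["allowed_extensions"]),
        max_size_bytes=fv["max_size_bytes"],
        min_size_bytes=fv["min_size_bytes"],
    )
```

Failing fast on an inconsistent config at startup beats discovering it on the first upload request.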

Response Templates

All redirect responses use brand-voiced warm language. Templates are loaded from guardrails.yaml at startup and cached in memory.

Template Design Principles

  1. Acknowledge the question — never dismiss what the patient asked
  2. Explain why we redirect — "your provider knows your full history"
  3. Offer an alternative — pivot to something Curaway can help with
  4. End with an invitation — keep the conversation going

Template Loading

import yaml

class GuardrailTemplates:
    def __init__(self, config_path: str = "config/guardrails.yaml"):
        with open(config_path) as f:
            config = yaml.safe_load(f)
        self.templates = config["response_templates"]

    def get_redirect(self, category: str) -> str:
        return self.templates.get(category, self.templates["off_topic"])

File Validation

File validation operates at two levels: frontend (immediate user feedback) and backend (presign-time metadata check). Critically, the backend validator never downloads files from R2 — it only validates metadata provided at presign request time.

Frontend Validation

const FILE_CONSTRAINTS = {
  maxSize: 20 * 1024 * 1024,  // 20MB
  minSize: 5 * 1024,          // 5KB
  allowedExtensions: ['.pdf', '.jpg', '.jpeg', '.png', '.heic', '.doc', '.docx'],
};

function validateFile(file: File): ValidationResult {
  if (file.size > FILE_CONSTRAINTS.maxSize) {
    return { valid: false, error: "File is too large. Maximum size is 20MB." };
  }
  if (file.size < FILE_CONSTRAINTS.minSize) {
    return { valid: false, error: "File appears to be empty or corrupted (under 5KB)." };
  }
  const ext = '.' + file.name.split('.').pop()?.toLowerCase();
  if (!FILE_CONSTRAINTS.allowedExtensions.includes(ext)) {
    return { valid: false, error: `File type ${ext} is not supported.` };
  }
  return { valid: true };
}

Backend Presign-Time Validation

@router.post("/documents/presign")
async def create_presign_url(request: PresignRequest):
    """Validate file metadata and return presigned upload URL."""
    # Validate extension
    ext = Path(request.filename).suffix.lower()
    if ext not in config.file_validation.allowed_extensions:
        raise HTTPException(400, f"File type {ext} not supported")

    # Validate MIME type
    if request.content_type not in config.file_validation.allowed_mime_types:
        raise HTTPException(400, f"MIME type {request.content_type} not supported")

    # Validate declared size
    if request.file_size > config.file_validation.max_size_bytes:
        raise HTTPException(400, "File exceeds 20MB limit")
    if request.file_size < config.file_validation.min_size_bytes:
        raise HTTPException(400, "File too small (minimum 5KB)")

    # Generate presigned URL — no R2 download occurs
    key = f"documents/{request.tenant_id}/{uuid4()}{ext}"
    presign_url = r2_client.generate_presigned_url(
        bucket=settings.R2_BUCKET,
        key=key,
        expires_in=900,  # 15 minutes
    )
    return {"upload_url": presign_url, "key": key}

No R2 Download

The backend validates only the metadata (extension, MIME type, declared size) provided in the presign request. It never downloads or inspects the actual file content from R2. This keeps validation fast and avoids unnecessary data transfer costs.


Feature Flags

Three Flagsmith feature flags control each guardrail layer independently, enabling gradual rollout and instant kill-switch capability.

| Flag | Default | Controls |
|------|---------|----------|
| guardrail_input_classifier_enabled | true | Layer 2 input classification |
| guardrail_output_validator_enabled | true | Layer 3 regex output validation |
| guardrail_file_validation_enabled | true | File upload validation |

Flag Evaluation

async def process_message(message: str, tenant_id: str) -> str:
    flags = await flagsmith.get_environment_flags()

    # Layer 2: Input classification
    if flags.is_feature_enabled("guardrail_input_classifier_enabled"):
        category = await classify_message(message)
        if category in REDIRECT_CATEGORIES:  # off_topic, medical_advice
            return templates.get_redirect(category)
        if category == "prompt_injection":
            logger.warning("Prompt injection attempt blocked")
            return templates.get_redirect("prompt_injection")
        if category == "emergency":
            return templates.get_redirect("emergency")

    # Generate LLM response
    response = await generate_response(message)

    # Layer 3: Output validation
    if flags.is_feature_enabled("guardrail_output_validator_enabled"):
        cleaned, violations = validator.validate(response)
        if violations:
            logger.warning(f"Output violations: {violations}")
            response = cleaned + "\n\n" + PROVIDER_REDIRECT_SUFFIX

    return response

When a flag is disabled, that layer is skipped entirely. This allows operators to:

  • Disable the classifier if Haiku latency spikes (output validator still catches issues)
  • Disable the output validator during testing to see raw LLM output
  • Disable file validation temporarily if rules are too restrictive

Conversation History Filtering

When building conversation context for the LLM, the system filters out messages that were flagged as toxic or blocked by the input classifier. This prevents the conversation history from containing content that could influence the LLM's behavior.

Filtering Logic

def build_conversation_context(messages: List[Message]) -> List[dict]:
    """Build LLM context, excluding blocked/toxic messages."""
    context = []
    for msg in messages:
        # Skip messages that were classified as harmful
        if msg.guardrail_category in ("prompt_injection", "medical_advice"):
            continue
        # Skip messages flagged by content moderation
        if msg.is_toxic:
            continue
        context.append({
            "role": msg.role,
            "content": msg.content,
        })
    return context

Why Filter History?

LLMs are influenced by conversation context. If a prompt injection attempt remains in the history, subsequent legitimate messages may be processed in a compromised context. By removing toxic messages from the context window, each new turn starts from a clean state.

The original messages are preserved in the database for audit purposes — they are only excluded from the LLM context.


Monitoring and Observability

Guardrail Metrics

| Metric | Source | Alert Threshold |
|--------|--------|-----------------|
| Classification distribution | Langfuse traces | >5% prompt_injection in 1h |
| Classifier latency (p95) | Langfuse traces | >500ms |
| Classifier error rate | Application logs | >2% in 15min |
| Output violations per hour | Application logs | >20 violations/h |
| File validation rejections | Application logs | >50% rejection rate |
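The 5% prompt-injection alert, for example, reduces to a simple windowed ratio check. A sketch (the function name is illustrative; the real alerting lives in the monitoring stack):

```python
def injection_alert(counts: dict, threshold: float = 0.05) -> bool:
    """True when prompt_injection exceeds the threshold share of
    classifications in the current window (>5% in 1h per the table above)."""
    total = sum(counts.values())
    if total == 0:
        return False
    return counts.get("prompt_injection", 0) / total > threshold
```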

Langfuse Trace Integration

Every classified message generates a Langfuse trace with:

  • Input message (redacted if PII detected)
  • Classification category
  • Classification latency
  • Whether the message was allowed, redirected, or blocked
  • If output validation was triggered, which patterns matched

This data feeds into dashboards for monitoring guardrail effectiveness and tuning thresholds.


Design Decisions

| Decision | Rationale |
|----------|-----------|
| Fail open on classifier errors | Blocking patients seeking help causes more harm than occasional off-topic processing |
| Regex for output validation | Deterministic, fast (<5ms), no additional API calls needed |
| Haiku for classification | Fastest Claude model, adequate for 8-category classification |
| YAML as config source | Non-engineers can review and suggest changes via PR |
| Brand-voiced redirects | Maintains trust and warmth even when saying "no" |
| No R2 download for validation | Keeps presign fast, avoids egress costs |
| History filtering vs. deletion | Audit trail preserved while LLM context stays clean |