Data Ingestion

Curaway's data platform is populated through a series of seed scripts that load providers, procedures, doctors, and graph/vector data in a specific order. This document covers every seed script, what it creates, and the correct execution sequence.


Seed Execution Order

The seed scripts must run in a specific order due to foreign key dependencies between tables and cross-system references (PostgreSQL, Neo4j, Qdrant).

flowchart TD
    A[1. seed<br/>Tenant + Consent + Legal] --> B[2. seed_providers<br/>42 Providers]
    B --> C[3. seed_demo<br/>Procedures + Links + Demo Patient]
    C --> D[4. seed_graph<br/>Neo4j Graph Population]
    D --> E[5. seed_embeddings<br/>Qdrant Vectors]
    E --> F[6. seed_doctors<br/>8 Doctors + Neo4j Sync]

Quick Start

# Full seed from scratch (run in order)
python -m app.seed                # Tenant, consent purposes, legal agreements
python -m app.seed_providers      # 42 providers across 8 countries
python -m app.seed_demo           # Procedures, links, demo patient
python -m app.seed_graph          # Neo4j nodes and relationships
python -m app.seed_embeddings     # Qdrant vector embeddings
python -m app.seed_doctors        # 8 doctors with Neo4j sync

Order Matters

Running seed_graph before seed_providers will create an empty graph. Running seed_embeddings before seed_demo will miss procedure embeddings. Always follow the sequence above.
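One way to make the sequence hard to get wrong is a small wrapper that runs the modules in order and stops at the first failure. The sketch below is an assumption rather than something shipped with the seed scripts: `run_seeds` and its `runner` parameter are hypothetical names, and only the module names come from the Quick Start list.

```python
import subprocess
import sys

# Module names taken from the Quick Start sequence; order is significant.
SEED_MODULES = [
    "app.seed",
    "app.seed_providers",
    "app.seed_demo",
    "app.seed_graph",
    "app.seed_embeddings",
    "app.seed_doctors",
]


def run_seeds(runner=subprocess.run, python=sys.executable):
    """Run each seed module in order; stop at the first non-zero exit code.

    Returns the list of modules that completed successfully.
    """
    completed = []
    for module in SEED_MODULES:
        result = runner([python, "-m", module])
        if result.returncode != 0:
            print(f"{module} failed with exit code {result.returncode}; stopping")
            break
        completed.append(module)
    return completed
```

Calling `run_seeds()` with the defaults shells out to the current interpreter. The `runner` argument exists so the ordering logic can be exercised without touching any database.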


1. Tenant Seeding: python -m app.seed

The foundation seed creates the tenant, consent purposes, and legal agreements that everything else depends on.

What It Creates

| Entity | Count | Details |
|---|---|---|
| Tenant | 1 | tenant-apollo-001 |
| Consent purposes | 6 | 4 required + 2 optional |
| Legal agreements | 2 | ToS v1, Privacy Policy v1 |

| Purpose | Required | Description |
|---|---|---|
| data_processing | Yes | Core data processing for service delivery |
| medical_data_sharing | Yes | Sharing medical records with matched providers |
| cross_border_transfer | Yes | Transferring data across international borders |
| communication | Yes | Essential service communications |
| marketing | No | Marketing emails and promotional content |
| analytics | No | Anonymous usage analytics |

Idempotency

The seed script checks for existing records before inserting, so running it multiple times is safe: records that already exist are skipped.

async def seed_tenant():
    existing = await db.fetch_one(
        "SELECT id FROM tenants WHERE tenant_id = :tid",
        {"tid": "tenant-apollo-001"},
    )
    if existing:
        logger.info("Tenant already exists, skipping")
        return
    # Create tenant...
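The same check-then-insert pattern extends to the six consent purposes. The following is a minimal, self-contained sketch that uses an in-memory SQLite database as a stand-in for PostgreSQL. The `consent_purposes` table and its columns are assumptions made for illustration and may not match the real schema.

```python
import sqlite3

# Purpose names and required flags from the table above.
CONSENT_PURPOSES = [
    ("data_processing", True),
    ("medical_data_sharing", True),
    ("cross_border_transfer", True),
    ("communication", True),
    ("marketing", False),
    ("analytics", False),
]


def seed_consent_purposes(conn):
    """Insert any missing consent purposes; return how many rows were added."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS consent_purposes ("
        "purpose TEXT PRIMARY KEY, required INTEGER NOT NULL)"
    )
    inserted = 0
    for purpose, required in CONSENT_PURPOSES:
        exists = conn.execute(
            "SELECT 1 FROM consent_purposes WHERE purpose = ?", (purpose,)
        ).fetchone()
        if exists:
            continue  # already seeded, skip
        conn.execute(
            "INSERT INTO consent_purposes (purpose, required) VALUES (?, ?)",
            (purpose, int(required)),
        )
        inserted += 1
    conn.commit()
    return inserted


conn = sqlite3.connect(":memory:")
first = seed_consent_purposes(conn)   # first run inserts every purpose
second = seed_consent_purposes(conn)  # second run finds them and adds nothing
```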

2. Provider Seeding: python -m app.seed_providers

Seeds 42 medical providers across 8 countries, representing the initial curated network.

Provider Distribution by Country

| Country | Count | Example Providers |
|---|---|---|
| India | 7 | Apollo Hospitals Chennai, Fortis Memorial Gurgaon, Max Super Speciality Delhi |
| Turkey | 6 | Acibadem Altunizade, Memorial Bahcelievler, Medicana International Istanbul |
| Thailand | 5 | Bumrungrad International, Bangkok Hospital, Samitivej Sukhumvit |
| Mexico | 5 | Hospital Galenia Cancun, Christus Muguerza Monterrey, Star Medica Merida |
| South Korea | 5 | Samsung Medical Center, Severance Hospital, Asan Medical Center |
| UAE | 5 | Cleveland Clinic Abu Dhabi, Mediclinic City Hospital, American Hospital Dubai |
| Spain | 5 | Quironsalud Barcelona, Hospital Universitario HM Madrid, Teknon Barcelona |
| Costa Rica | 4 | CIMA Hospital San Jose, Clinica Biblica, Hospital Metropolitano |

Provider Data Model

Each provider record includes:

provider = {
    "name": "Bumrungrad International Hospital",
    "country": "Thailand",
    "city": "Bangkok",
    "tier": "premium",           # premium, standard, value
    "accreditations": ["JCI", "TEMOS"],
    "specialties": ["orthopedics", "cardiology", "oncology"],
    "bed_count": 580,
    "international_patient_center": True,
    "languages_spoken": ["en", "th", "ar", "ja"],
    "year_established": 1980,
    "website": "https://www.bumrungrad.com",
    "contact_email": "intl@bumrungrad.com",
    "description": "One of Southeast Asia's largest private hospitals...",
    "tenant_id": "tenant-apollo-001",
}

Country Coverage Map

                    Provider Network Coverage

    Costa Rica (4)      Spain (5)         UAE (5)
         |                 |                |
    Mexico (5)          Turkey (6)      India (7)
                           |                |
                      South Korea (5)   Thailand (5)

    Total: 42 providers across 8 countries
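The per-country distribution can double as a sanity check after seeding. The sketch below compares loaded provider records against the counts in the table. `country_mismatches` is a hypothetical helper, and it assumes each record exposes a `country` key as in the data model above.

```python
from collections import Counter

# Per-country counts from the distribution table.
EXPECTED_BY_COUNTRY = {
    "India": 7, "Turkey": 6, "Thailand": 5, "Mexico": 5,
    "South Korea": 5, "UAE": 5, "Spain": 5, "Costa Rica": 4,
}


def country_mismatches(providers, expected=EXPECTED_BY_COUNTRY):
    """Return {country: (expected, actual)} for every country whose count differs."""
    actual = Counter(p["country"] for p in providers)
    countries = set(expected) | set(actual)
    return {
        country: (expected.get(country, 0), actual.get(country, 0))
        for country in sorted(countries)
        if expected.get(country, 0) != actual.get(country, 0)
    }
```

An empty result means the seeded rows match the documented distribution. Any entry points at a country with missing or extra providers.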

3. Procedure Templates & Demo: python -m app.seed_demo

This script creates procedure templates with parent-child inheritance, links providers to procedures, and creates the demo patient.

Procedure Templates (12)

| Category | Procedure | Parent Template | ICD-10 |
|---|---|---|---|
| Orthopedics | Total Knee Replacement (TKR) | ORTHO_BASE | 0SRD0JZ |
| Orthopedics | Total Hip Replacement (THR) | ORTHO_BASE | 0SR9019 |
| Orthopedics | Spinal Fusion | ORTHO_BASE | 0SG0070 |
| Cardiology | Coronary Artery Bypass (CABG) | CARDIAC_BASE | 0210093 |
| Cardiology | Heart Valve Replacement | CARDIAC_BASE | 02RF07Z |
| Oncology | Mastectomy | ONCO_BASE | 0HTT0ZZ |
| Oncology | Prostatectomy | ONCO_BASE | 0VT00ZZ |
| Dental | Dental Implants (Full Arch) | DENTAL_BASE | 0DH607Z |
| Dental | Dental Veneers | DENTAL_BASE | - |
| Fertility | IVF Cycle | FERTILITY_BASE | 3E0P3GC |
| Bariatric | Gastric Sleeve | BARIATRIC_BASE | 0DB64Z3 |
| Ophthalmology | LASIK | OPHTH_BASE | 08B1XZZ |

Parent Template Inheritance

flowchart TD
    OB[ORTHO_BASE<br/>Pre-op clearance, imaging, PT protocol] --> TKR[TKR<br/>+ knee-specific prep]
    OB --> THR[THR<br/>+ hip-specific prep]
    OB --> SF[Spinal Fusion<br/>+ spine-specific prep]

    CB[CARDIAC_BASE<br/>Cardiac clearance, echo, stress test] --> CABG[CABG<br/>+ bypass-specific prep]
    CB --> HVR[Heart Valve<br/>+ valve-specific prep]

Parent templates define shared requirements (pre-op clearance, imaging protocols, recovery milestones). Child procedures inherit these and add procedure-specific requirements.

ORTHO_BASE = {
    "category": "orthopedics",
    "pre_op_requirements": [
        "Complete blood count",
        "Metabolic panel",
        "Chest X-ray",
        "EKG",
        "Orthopedic surgeon clearance",
        "Physical therapy assessment",
    ],
    "post_op_milestones": [
        "Pain management protocol initiated",
        "First physical therapy session",
        "Weight-bearing assessment",
        "Discharge evaluation",
        "Follow-up imaging",
    ],
    "typical_hospital_stay_days": 3,
    "typical_recovery_weeks": 8,
}

TKR = {
    **ORTHO_BASE,
    "name": "Total Knee Replacement",
    "icd_10": "0SRD0JZ",
    "snomed_ct": "609588000",
    "pre_op_requirements": ORTHO_BASE["pre_op_requirements"] + [
        "Knee MRI",
        "Weight-bearing X-rays",
        "Bone density scan (if over 65)",
    ],
    "typical_hospital_stay_days": 2,
    "typical_recovery_weeks": 6,
}
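One caveat with the `**ORTHO_BASE` spread is that it makes a shallow copy. Any list the child does not rebuild, such as `post_op_milestones`, is the same object as the parent's, so an in-place append on the child would also change the parent. A minimal sketch of a helper that avoids this follows. `derive_template` is a hypothetical name, the base template is trimmed for brevity, and the appended milestone is purely illustrative.

```python
import copy

# Trimmed-down parent template for illustration.
ORTHO_BASE = {
    "category": "orthopedics",
    "pre_op_requirements": ["Complete blood count", "Chest X-ray"],
    "post_op_milestones": ["First physical therapy session"],
    "typical_hospital_stay_days": 3,
}


def derive_template(parent, extra_pre_op=(), **overrides):
    """Build a child template from a parent without sharing mutable state."""
    child = copy.deepcopy(parent)                    # independent copies of every list
    child["pre_op_requirements"] += list(extra_pre_op)
    child.update(overrides)                          # child-specific scalar overrides
    return child


TKR = derive_template(
    ORTHO_BASE,
    extra_pre_op=["Knee MRI"],
    name="Total Knee Replacement",
    typical_hospital_stay_days=2,
)

# Mutating the child leaves the parent untouched (illustrative milestone).
TKR["post_op_milestones"].append("Knee flexion check")
```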

The script also creates 38 provider-procedure links that connect 15 providers to 10 procedures, each with procedure-specific pricing.

| Procedure | # Providers Linked | Price Range (USD) |
|---|---|---|
| Total Knee Replacement | 6 | $5,500 - $14,000 |
| Total Hip Replacement | 5 | $6,000 - $15,000 |
| CABG | 4 | $12,000 - $35,000 |
| Dental Implants | 4 | $3,500 - $12,000 |
| IVF Cycle | 3 | $4,000 - $8,500 |
| Gastric Sleeve | 4 | $4,500 - $9,000 |
| LASIK | 3 | $1,000 - $3,500 |
| Heart Valve Replacement | 3 | $15,000 - $40,000 |
| Spinal Fusion | 3 | $10,000 - $25,000 |
| Dental Veneers | 3 | $2,000 - $8,000 |

provider_procedure_link = {
    "provider_id": provider.id,
    "procedure_id": procedure.id,
    "price_amount": 850000,      # $8,500.00 in cents
    "price_currency": "USD",
    "estimated_duration_days": 14,  # Total trip including recovery
    "success_rate_source": "hospital_reported",
    "volume_per_year": 450,
    "is_active": True,
}
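Because `price_amount` is stored as an integer number of cents, display code has to convert it back. The helpers below are a sketch of that conversion using `Decimal` to avoid floating-point rounding. The function names are hypothetical, and the code assumes a currency with two decimal places, which holds for USD.

```python
from decimal import Decimal


def format_price(amount_minor, currency="USD"):
    """Format an integer amount in minor units (cents) for display."""
    amount = Decimal(amount_minor) / Decimal(100)
    return f"{currency} {amount:,.2f}"


def to_minor_units(amount):
    """Convert a decimal amount such as '8500.00' to integer cents."""
    minor = Decimal(str(amount)) * 100
    if minor != minor.to_integral_value():
        raise ValueError(f"{amount!r} has more than two decimal places")
    return int(minor)
```

For the link shown above, `format_price(850000)` gives `USD 8,500.00`, and `to_minor_units("8500.00")` returns `850000`.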

Demo Patient: Aisha Al-Rashid

The seed creates a demo patient for testing and demonstration purposes.

demo_patient = {
    "full_name": encrypt_field("Aisha Al-Rashid"),
    "email": encrypt_field("aisha.alrashid@example.com"),
    "phone": encrypt_field("+971-50-123-4567"),
    "date_of_birth": "1985-03-15",
    "country_of_residence": "UAE",
    "preferred_language": "en",
    "timezone": "Asia/Dubai",
    "tenant_id": "tenant-apollo-001",
}

Aisha's profile includes:

| Field | Value |
|---|---|
| Name | Aisha Al-Rashid |
| Age | 41 |
| Country | UAE |
| Language | English |
| Timezone | Asia/Dubai |
| Tenant | tenant-apollo-001 |
| Medical need | Total Knee Replacement |
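The record stores `date_of_birth` rather than an age, so any displayed age is only correct relative to a reference date. A small sketch of deriving it is shown below. `age_on` is a hypothetical helper, not part of the seed script.

```python
from datetime import date


def age_on(date_of_birth, today):
    """Return the age in whole years on `today` for an ISO date-of-birth string."""
    born = date.fromisoformat(date_of_birth)
    # Subtract one year if this year's birthday has not happened yet.
    before_birthday = (today.month, today.day) < (born.month, born.day)
    return today.year - born.year - int(before_birthday)
```

With the demo patient's `1985-03-15`, the function returns 41 for any reference date from 2026-03-15 up to the day before her next birthday.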

4. Neo4j Graph Population: python -m app.seed_graph

Populates the Neo4j knowledge graph with nodes and relationships derived from the PostgreSQL data.

Node Types

| Node Label | Count | Source |
|---|---|---|
| Provider | 42 | providers table |
| Procedure | 12 | procedures table |
| Country | 8 | Derived from providers |
| Specialty | ~20 | Derived from provider specialties |
| Accreditation | ~10 | Derived from provider accreditations |

Relationship Types

| Relationship | From | To | Count |
|---|---|---|---|
| OFFERS | Provider | Procedure | 38 |
| LOCATED_IN | Provider | Country | 42 |
| SPECIALIZES_IN | Provider | Specialty | ~80 |
| ACCREDITED_BY | Provider | Accreditation | ~60 |
| BELONGS_TO | Procedure | Specialty | 12 |

Graph Creation Queries

// Create Provider nodes
MERGE (p:Provider {id: $provider_id})
SET p.name = $name,
    p.city = $city,
    p.tier = $tier,
    p.bed_count = $bed_count,
    p.year_established = $year_established

// Create Country nodes and relationships
MERGE (c:Country {name: $country_name, iso_code: $iso_code})
MERGE (p)-[:LOCATED_IN]->(c)

// Create Procedure nodes and OFFERS relationships
MERGE (proc:Procedure {id: $procedure_id})
SET proc.name = $name,
    proc.category = $category,
    proc.icd_10 = $icd_10
MERGE (p)-[r:OFFERS]->(proc)
SET r.price_amount = $price,
    r.price_currency = $currency,
    r.volume_per_year = $volume

// Create Specialty nodes
MERGE (s:Specialty {name: $specialty_name})
MERGE (p)-[:SPECIALIZES_IN]->(s)
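Country, Specialty, and Accreditation nodes are derived from provider rows rather than read from their own tables. The sketch below shows one way that derivation could look in plain Python before anything is sent to Neo4j. `derive_graph_elements` is a hypothetical helper, and it assumes the provider fields shown in the data model earlier.

```python
def derive_graph_elements(providers):
    """Collect distinct derived node names and provider edges from provider records."""
    countries, specialties, accreditations = set(), set(), set()
    edges = []  # (provider_name, relationship_type, target_name)

    for provider in providers:
        countries.add(provider["country"])
        edges.append((provider["name"], "LOCATED_IN", provider["country"]))

        for specialty in provider["specialties"]:
            specialties.add(specialty)
            edges.append((provider["name"], "SPECIALIZES_IN", specialty))

        for accreditation in provider["accreditations"]:
            accreditations.add(accreditation)
            edges.append((provider["name"], "ACCREDITED_BY", accreditation))

    return {
        "Country": sorted(countries),
        "Specialty": sorted(specialties),
        "Accreditation": sorted(accreditations),
        "edges": edges,
    }
```

Each distinct name maps to one `MERGE` on the corresponding node label, and each tuple in `edges` maps to one relationship `MERGE`. Using sets here mirrors the deduplication that `MERGE` performs in the graph.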

Graph Visualization

                         ┌──────────────┐
                         │   Country    │
                         │   (India)    │
                         └──────┬───────┘
                                │ LOCATED_IN
                    ┌───────────┴───────────┐
                    │                       │
              ┌─────┴──────┐         ┌──────┴─────┐
              │  Provider   │         │  Provider   │
              │  (Apollo)   │         │  (Fortis)   │
              └──┬───┬──┬──┘         └──┬───┬──────┘
                 │   │  │               │   │
    OFFERS ──────┘   │  └── SPEC_IN     │   └── OFFERS
                     │                  │
              ┌──────┴──────┐    ┌──────┴──────┐
              │  Procedure   │    │  Specialty   │
              │    (TKR)     │    │(Orthopedics) │
              └──────────────┘    └──────────────┘

5. Qdrant Embeddings: python -m app.seed_embeddings

Generates vector embeddings for semantic search using the Voyage AI voyage-3.5-lite model (1024 dimensions).

Embedding Collections

| Collection | Vectors | Source | Embedding Content |
|---|---|---|---|
| providers | 42 | Provider profiles | Name + city + country + specialties + description |
| conditions | 12 | Procedure templates | Name + category + description + requirements |
| requirement_embeddings | 70 | Travel requirements | Requirement text + country + procedure context |

Total: 124 vectors (42 + 12 + 70)

Embedding Generation

from qdrant_client.models import PointStruct
from voyageai import Client as VoyageClient

voyage = VoyageClient(api_key=VOYAGE_API_KEY)

async def generate_provider_embeddings():
    """Generate and upsert provider embeddings to Qdrant."""
    # `db` and `qdrant` are the application's database and Qdrant client handles
    providers = await db.fetch_all("SELECT * FROM providers")

    for provider in providers:
        # Compose embedding text from multiple fields, without the stray
        # indentation a triple-quoted f-string would embed in the text
        text = " ".join([
            f"{provider['name']} is a {provider['tier']} medical facility",
            f"located in {provider['city']}, {provider['country']}.",
            f"Specialties: {', '.join(provider['specialties'])}.",
            f"Accreditations: {', '.join(provider['accreditations'])}.",
            provider["description"],
        ])

        # Generate embedding via Voyage AI (blocking call on the sync client)
        embedding = voyage.embed(
            texts=[text],
            model="voyage-3.5-lite",
        ).embeddings[0]  # 1024 dimensions

        # Upsert to Qdrant; point IDs must be unsigned integers or UUIDs
        qdrant.upsert(
            collection_name="providers",
            points=[
                PointStruct(
                    id=str(provider["id"]),
                    vector=embedding,
                    payload={
                        "name": provider["name"],
                        "country": provider["country"],
                        "city": provider["city"],
                        "tier": provider["tier"],
                        "specialties": provider["specialties"],
                    },
                )
            ],
        )
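The loop above makes one embedding request per provider. Since the embedding call accepts a list of texts, requests can be grouped. The sketch below shows the batching logic on its own. `embed_in_batches` and `embed_fn` are hypothetical names, and the default batch size of 64 is an arbitrary assumption, not a documented limit.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(texts, embed_fn, batch_size=64):
    """Embed `texts` in order, calling `embed_fn` once per batch.

    `embed_fn` takes a list of strings and returns one vector per string.
    """
    vectors = []
    for chunk in batched(texts, batch_size):
        vectors.extend(embed_fn(chunk))
    return vectors
```

With the Voyage client from the block above, `embed_fn` could be `lambda chunk: voyage.embed(texts=chunk, model="voyage-3.5-lite").embeddings`. The output order matches the input order, so vectors can be zipped back onto their provider rows.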

Collection Configuration

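from qdrant_client.models import Distance, VectorParams

# Recreate collections with cosine similarity (drops any existing vectors).
# recreate_collection is deprecated in recent qdrant-client releases, so the
# drop and create steps are spelled out explicitly.
for collection_name in ["providers", "conditions", "requirement_embeddings"]:
    if qdrant.collection_exists(collection_name):
        qdrant.delete_collection(collection_name)
    qdrant.create_collection(
        collection_name=collection_name,
        vectors_config=VectorParams(
            size=1024,              # voyage-3.5-lite dimension
            distance=Distance.COSINE,
        ),
    )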
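For readers unfamiliar with the cosine setting, the sketch below computes the same measure in plain Python. It is illustrative only, since Qdrant computes similarity internally, and `cosine_similarity` is a name chosen here.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

The measure depends only on direction, not magnitude: a vector and a scaled copy of it score 1.0, while orthogonal vectors score 0.0.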

Requirement Embeddings

The 70 requirement vectors encode travel-specific knowledge:

| Category | Example Requirements | Count |
|---|---|---|
| Visa | "Medical visa requirements for India from USA" | 16 |
| Insurance | "International health insurance for cardiac surgery" | 10 |
| Pre-op | "Pre-operative requirements for knee replacement" | 12 |
| Travel | "Post-surgery flight restrictions for orthopedic patients" | 10 |
| Accommodation | "Recovery-friendly hotels near Bumrungrad Hospital" | 8 |
| Documents | "Medical records translation and apostille requirements" | 8 |
| Costs | "Payment methods and financing for medical travel" | 6 |

6. Doctor Seeding: python -m app.seed_doctors

Seeds 8 doctors with varying levels of profile completeness across 8 providers (one doctor per provider in the table below). This was introduced in Session 26 to support the doctor-level matching feature.

Doctor Profiles

| Doctor | Provider | Specialty | Status | Completeness |
|---|---|---|---|---|
| Dr. Rajesh Sharma | Apollo Chennai | Orthopedic Surgery | Verified | Full |
| Dr. Ayse Demir | Acibadem Istanbul | Cardiothoracic Surgery | Verified | Full |
| Dr. Somchai Patel | Bumrungrad Bangkok | Orthopedic Surgery | Complete | Full |
| Dr. Maria Garcia | Hospital Galenia Cancun | Bariatric Surgery | Complete | Full |
| Dr. Jin-Woo Park | Samsung Medical Center | Oncology | Complete | Standard |
| Dr. Carlos Mendez | CIMA San Jose | Fertility/IVF | Complete | Standard |
| Dr. Ahmed Hassan | Cleveland Clinic Abu Dhabi | Cardiology | Basic | Minimal |
| Dr. Elena Volkov | Quironsalud Barcelona | Ophthalmology | Basic | Minimal |

Completeness Levels

| Level | Fields Populated | Use Case |
|---|---|---|
| Verified (2) | All fields + board certifications + publications + verified badge | Full matching and display |
| Complete (4) | All clinical fields + credentials | Standard matching |
| Basic (2) | Name, specialty, provider affiliation | Placeholder for future enrichment |
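The levels above can be read as a rule over which fields are populated. The classifier below is one possible interpretation, not the project's actual logic. The field groupings are assumptions based on the table and the doctor data model, and `completeness_level` is a hypothetical name.

```python
# Assumed field groupings; the real seed script may draw these lines differently.
BASIC_FIELDS = ("name", "specialty", "provider_id")
COMPLETE_FIELDS = BASIC_FIELDS + (
    "sub_specialty", "years_experience", "education", "languages",
)
VERIFIED_FIELDS = COMPLETE_FIELDS + ("board_certifications", "publications_count")


def _populated(doctor, fields):
    """True if every field is present and not None, empty string, or empty list."""
    return all(doctor.get(field) not in (None, "", []) for field in fields)


def completeness_level(doctor):
    """Classify a doctor record as Verified, Complete, Basic, or Incomplete."""
    if not _populated(doctor, BASIC_FIELDS):
        return "Incomplete"
    if _populated(doctor, VERIFIED_FIELDS) and doctor.get("verified") is True:
        return "Verified"
    if _populated(doctor, COMPLETE_FIELDS):
        return "Complete"
    return "Basic"
```

A record with only a name, specialty, and provider affiliation classifies as Basic. Adding the clinical fields moves it to Complete, and Verified additionally requires certifications, a publication count, and the `verified` flag.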

Doctor Data Model

doctor = {
    "name": "Dr. Rajesh Sharma",
    "provider_id": apollo_chennai.id,
    "specialty": "Orthopedic Surgery",
    "sub_specialty": "Joint Replacement",
    "years_experience": 22,
    "education": [
        {"degree": "MBBS", "institution": "AIIMS Delhi", "year": 1998},
        {"degree": "MS Orthopedics", "institution": "AIIMS Delhi", "year": 2002},
        {"degree": "Fellowship", "institution": "HSS New York", "year": 2004},
    ],
    "board_certifications": ["National Board of Examinations (India)", "AO Trauma"],
    "languages": ["en", "hi", "ta"],
    "procedures_performed": 3500,
    "publications_count": 45,
    "verified": True,
    "verification_date": "2025-11-15",
    "profile_image_url": None,  # To be added
    "bio": "Dr. Sharma is one of India's leading joint replacement surgeons...",
    "tenant_id": "tenant-apollo-001",
}

Neo4j Doctor Sync

The doctor seed script also syncs doctor nodes to Neo4j:

// Create or update the Doctor node
MERGE (d:Doctor {id: $doctor_id})
SET d.name = $name,
    d.specialty = $specialty,
    d.sub_specialty = $sub_specialty,
    d.years_experience = $years_experience,
    d.verified = $verified

// Link to Provider (Cypher requires WITH between a write clause and MATCH)
WITH d
MATCH (p:Provider {id: $provider_id})
MERGE (d)-[:PRACTICES_AT]->(p)

// Link to Specialty
MERGE (s:Specialty {name: $specialty})
MERGE (d)-[:SPECIALIZES_IN]->(s)

// Link to Procedures
WITH d
MATCH (proc:Procedure {name: $procedure_name})
MERGE (d)-[:PERFORMS]->(proc)

Graph After Doctor Seeding

    ┌────────────┐     PRACTICES_AT     ┌──────────────┐
    │   Doctor    │────────────────────>│   Provider    │
    │ (Dr.Sharma) │                     │   (Apollo)    │
    └──┬────┬────┘                     └──────────────┘
       │    │
       │    └── SPECIALIZES_IN ──> [Orthopedics]
       └── PERFORMS ──> [TKR]  [THR]

Provider Language Services

Six providers carry detailed language service data, which enables better matching for patients who need support in a specific language.

Language Service Types

| Service Type | Description | Example |
|---|---|---|
| interpreter | On-site medical interpreter | Arabic interpreter at Bumrungrad |
| coordinator | International patient coordinator | English coordinator at Acibadem |
| document_translation | Medical document translation | Japanese documents at Samsung MC |

Updated Providers

| Provider | Interpreters | Coordinators | Document Languages |
|---|---|---|---|
| Bumrungrad Bangkok | en, ar, ja, zh | en, ar, ja | en, ar, ja, zh, de |
| Apollo Chennai | en, hi, ar, bn | en, ar | en, hi, ar, ta |
| Acibadem Istanbul | en, ar, de, ru | en, ar, de | en, ar, de, ru, fr |
| Samsung Medical Center | en, ja, zh, ru | en, ja, zh | en, ja, zh, ko |
| Cleveland Clinic Abu Dhabi | en, ar, hi, ur | en, ar | en, ar, hi, fr |
| Hospital Galenia Cancun | en, es | en, es | en, es, fr |

Data Structure

import json

language_services = {
    "interpreter_languages": ["en", "ar", "ja", "zh"],
    "coordinator_languages": ["en", "ar", "ja"],
    "document_translation_languages": ["en", "ar", "ja", "zh", "de"],
}

# Stored as JSONB on the providers table
await db.execute(
    """
    UPDATE providers SET language_services = :services
    WHERE id = :provider_id
    """,
    {"services": json.dumps(language_services), "provider_id": provider.id},
)
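As an illustration of how this structure could feed matching, the sketch below looks up which service types a provider offers in a given language. `services_for_language` is a hypothetical helper. Only the three key names come from the structure above.

```python
# Maps each service type to the key that lists its supported languages.
SERVICE_KEYS = {
    "interpreter": "interpreter_languages",
    "coordinator": "coordinator_languages",
    "document_translation": "document_translation_languages",
}


def services_for_language(language_services, lang):
    """Return the service types available in `lang`, in SERVICE_KEYS order."""
    return [
        service
        for service, key in SERVICE_KEYS.items()
        if lang in language_services.get(key, [])
    ]


bumrungrad = {
    "interpreter_languages": ["en", "ar", "ja", "zh"],
    "coordinator_languages": ["en", "ar", "ja"],
    "document_translation_languages": ["en", "ar", "ja", "zh", "de"],
}
mandarin_services = services_for_language(bumrungrad, "zh")
```

For the Bumrungrad data, Mandarin is covered by interpreters and document translation but not by a coordinator, so a matcher could rank the provider accordingly.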

Seed Verification

After running all seeds, verify the data is consistent across stores.

Verification Checklist

# PostgreSQL counts
python -c "
from app.db import get_db
db = get_db()
print('Providers:', db.execute('SELECT COUNT(*) FROM providers').scalar())
print('Procedures:', db.execute('SELECT COUNT(*) FROM procedures').scalar())
print('Links:', db.execute('SELECT COUNT(*) FROM provider_procedures').scalar())
print('Doctors:', db.execute('SELECT COUNT(*) FROM doctors').scalar())
print('Patients:', db.execute('SELECT COUNT(*) FROM patients').scalar())
"

# Neo4j counts
python -c "
from app.graph import get_driver
with get_driver().session() as s:
    for label in ['Provider','Procedure','Doctor','Country','Specialty']:
        count = s.run(f'MATCH (n:{label}) RETURN count(n)').single()[0]
        print(f'{label}: {count}')
"

# Qdrant counts
python -c "
from app.vector import get_client
client = get_client()
for name in ['providers','conditions','requirement_embeddings']:
    info = client.get_collection(name)
    print(f'{name}: {info.points_count} vectors')
"

Expected Counts

| Store | Entity | Expected Count |
|---|---|---|
| PostgreSQL | providers | 42 |
| PostgreSQL | procedures | 12 |
| PostgreSQL | provider_procedures | 38 |
| PostgreSQL | doctors | 8 |
| PostgreSQL | patients | 1 (demo) |
| Neo4j | Provider nodes | 42 |
| Neo4j | Procedure nodes | 12 |
| Neo4j | Doctor nodes | 8 |
| Neo4j | Country nodes | 8 |
| Qdrant | providers | 42 vectors |
| Qdrant | conditions | 12 vectors |
| Qdrant | requirement_embeddings | 70 vectors |
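The expected counts lend themselves to an automated comparison. The sketch below assumes the actual counts have already been collected, for example from the three commands above, into a dictionary keyed by store and entity. `find_mismatches` and the key layout are hypothetical.

```python
# Expected counts from the table, keyed by (store, entity).
EXPECTED_COUNTS = {
    ("PostgreSQL", "providers"): 42,
    ("PostgreSQL", "procedures"): 12,
    ("PostgreSQL", "provider_procedures"): 38,
    ("PostgreSQL", "doctors"): 8,
    ("PostgreSQL", "patients"): 1,
    ("Neo4j", "Provider"): 42,
    ("Neo4j", "Procedure"): 12,
    ("Neo4j", "Doctor"): 8,
    ("Neo4j", "Country"): 8,
    ("Qdrant", "providers"): 42,
    ("Qdrant", "conditions"): 12,
    ("Qdrant", "requirement_embeddings"): 70,
}


def find_mismatches(actual, expected=EXPECTED_COUNTS):
    """Return {(store, entity): (expected, actual)} for every count that differs.

    A key missing from `actual` is reported with an actual value of None.
    """
    return {
        key: (want, actual.get(key))
        for key, want in expected.items()
        if actual.get(key) != want
    }
```

An empty result means all stores agree with the table. A missing Doctor count in Neo4j, for instance, would suggest that `seed_doctors` did not finish its graph sync.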

Rebuilding Data

To completely rebuild all data (e.g., after schema changes):

# 1. Clear existing data (destructive!)
python -m app.clear_data          # Truncates PostgreSQL tables
python -m app.clear_graph         # Deletes all Neo4j nodes
python -m app.clear_embeddings    # Drops Qdrant collections

# 2. Re-run full seed sequence
python -m app.seed
python -m app.seed_providers
python -m app.seed_demo
python -m app.seed_graph
python -m app.seed_embeddings
python -m app.seed_doctors

Destructive Operation

The clear scripts permanently delete all data. Only run these in development or staging environments. Production data should be migrated, not re-seeded.
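A guard at the top of each clear script is one way to enforce this. The sketch below is an assumption about how such a check might look. The `APP_ENV` variable name and the allowed values are hypothetical and would need to match however the deployment actually identifies its environment.

```python
import os

# Hypothetical environment names in which destructive scripts may run.
ALLOWED_ENVS = {"development", "staging"}


def assert_safe_to_clear(environ=None):
    """Raise RuntimeError unless APP_ENV names an allowed environment.

    An unset variable is treated as unsafe, so the default is to refuse.
    """
    environ = os.environ if environ is None else environ
    env = environ.get("APP_ENV", "").strip().lower()
    if env not in ALLOWED_ENVS:
        raise RuntimeError(
            f"Refusing to clear data: APP_ENV={env or '<unset>'} is not one of "
            f"{sorted(ALLOWED_ENVS)}"
        )
    return env
```

Failing closed when the variable is unset means a misconfigured production host is refused rather than cleared.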