Data Ingestion¶

Curaway's data platform is populated through a series of seed scripts that load providers, procedures, doctors, and graph/vector data in a specific order. This document covers every seed script, what it creates, and the correct execution sequence.

Seed Execution Order¶

The seed scripts must run in a specific order due to foreign key dependencies between tables and cross-system references (PostgreSQL, Neo4j, Qdrant).

flowchart TD
    A[1. seed<br/>Tenant + Consent + Legal] --> B[2. seed_providers<br/>42 Providers]
    B --> C[3. seed_demo<br/>Procedures + Links + Demo Patient]
    C --> D[4. seed_graph<br/>Neo4j Graph Population]
    D --> E[5. seed_embeddings<br/>Qdrant Vectors]
    E --> F[6. seed_doctors<br/>8 Doctors + Neo4j Sync]

Quick Start¶

# Full seed from scratch (run in order)
python -m app.seed                # Tenant, consent purposes, legal agreements
python -m app.seed_providers      # 42 providers across 8 countries
python -m app.seed_demo           # Procedures, links, demo patient
python -m app.seed_graph          # Neo4j nodes and relationships
python -m app.seed_embeddings     # Qdrant vector embeddings
python -m app.seed_doctors        # 8 doctors with Neo4j sync

Order Matters

Running seed_graph before seed_providers will create an empty graph. Running seed_embeddings before seed_demo will miss procedure embeddings. Always follow the sequence above.
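
One way to guard against out-of-order runs is a small wrapper that executes the modules in sequence and stops at the first failure. The sketch below takes its module names from the Quick Start; `run_seeds`, `run_module`, and the `runner` parameter are hypothetical names for illustration, not part of the codebase.

```python
import subprocess
import sys
from typing import Callable, List, Sequence

# Module names from the Quick Start, in dependency order.
SEED_MODULES = [
    "app.seed",
    "app.seed_providers",
    "app.seed_demo",
    "app.seed_graph",
    "app.seed_embeddings",
    "app.seed_doctors",
]


def run_module(module: str) -> int:
    """Run `python -m <module>` and return its exit code."""
    return subprocess.run([sys.executable, "-m", module]).returncode


def run_seeds(
    modules: Sequence[str] = SEED_MODULES,
    runner: Callable[[str], int] = run_module,
) -> List[str]:
    """Run each seed in order and stop at the first non-zero exit code.

    Returns the modules that completed successfully.
    """
    completed: List[str] = []
    for module in modules:
        if runner(module) != 0:
            print(f"{module} failed; later seeds were not run", file=sys.stderr)
            break
        completed.append(module)
    return completed
```

Passing a different `runner` makes the ordering logic testable without touching any database.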

1. Tenant Seeding: `python -m app.seed`¶

The foundation seed creates the tenant, consent purposes, and legal agreements that everything else depends on.

What It Creates¶

Entity	Count	Details
Tenant	1	`tenant-apollo-001`
Consent purposes	6	4 required + 2 optional
Legal agreements	2	ToS v1, Privacy Policy v1

Purpose	Required	Description
`data_processing`	Yes	Core data processing for service delivery
`medical_data_sharing`	Yes	Sharing medical records with matched providers
`cross_border_transfer`	Yes	Transferring data across international borders
`communication`	Yes	Essential service communications
`marketing`	No	Marketing emails and promotional content
`analytics`	No	Anonymous usage analytics

Idempotency¶

The seed script checks for existing records before inserting. Running it multiple times is safe because it skips records that already exist.

async def seed_tenant():
    existing = await db.fetch_one(
        "SELECT id FROM tenants WHERE tenant_id = :tid",
        {"tid": "tenant-apollo-001"},
    )
    if existing:
        logger.info("Tenant already exists, skipping")
        return
    # Create tenant...
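
The check-then-insert pattern can be tried in isolation. The following sketch uses an in-memory SQLite database in place of the project's PostgreSQL connection; the table layout, the placeholder tenant name, and the `ensure_tenant` helper are assumptions made for illustration.

```python
import sqlite3


def ensure_tenant(conn: sqlite3.Connection, tenant_id: str, name: str) -> bool:
    """Insert the tenant if it is missing. Returns True when a row was created."""
    existing = conn.execute(
        "SELECT id FROM tenants WHERE tenant_id = ?", (tenant_id,)
    ).fetchone()
    if existing:
        return False  # already seeded, nothing to do
    conn.execute(
        "INSERT INTO tenants (tenant_id, name) VALUES (?, ?)", (tenant_id, name)
    )
    conn.commit()
    return True


# Hypothetical schema; the real tenants table may differ.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tenants (id INTEGER PRIMARY KEY, tenant_id TEXT UNIQUE, name TEXT)"
)

first = ensure_tenant(conn, "tenant-apollo-001", "Placeholder tenant")
second = ensure_tenant(conn, "tenant-apollo-001", "Placeholder tenant")
row_count = conn.execute("SELECT COUNT(*) FROM tenants").fetchone()[0]
```

The second call is a no-op, so the table still holds a single row. A unique constraint on `tenant_id` is a useful backstop if two seed runs ever overlap.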

2. Provider Seeding: `python -m app.seed_providers`¶

Seeds 42 medical providers across 8 countries, representing the initial curated network.

Provider Distribution by Country¶

Country	Count	Example Providers
India	7	Apollo Hospitals Chennai, Fortis Memorial Gurgaon, Max Super Speciality Delhi
Turkey	6	Acibadem Altunizade, Memorial Bahcelievler, Medicana International Istanbul
Thailand	5	Bumrungrad International, Bangkok Hospital, Samitivej Sukhumvit
Mexico	5	Hospital Galenia Cancun, Christus Muguerza Monterrey, Star Medica Merida
South Korea	5	Samsung Medical Center, Severance Hospital, Asan Medical Center
UAE	5	Cleveland Clinic Abu Dhabi, Mediclinic City Hospital, American Hospital Dubai
Spain	5	Quironsalud Barcelona, Hospital Universitario HM Madrid, Teknon Barcelona
Costa Rica	4	CIMA Hospital San Jose, Clinica Biblica, Hospital Metropolitano

Provider Data Model¶

Each provider record includes:

provider = {
    "name": "Bumrungrad International Hospital",
    "country": "Thailand",
    "city": "Bangkok",
    "tier": "premium",           # premium, standard, value
    "accreditations": ["JCI", "TEMOS"],
    "specialties": ["orthopedics", "cardiology", "oncology"],
    "bed_count": 580,
    "international_patient_center": True,
    "languages_spoken": ["en", "th", "ar", "ja"],
    "year_established": 1980,
    "website": "https://www.bumrungrad.com",
    "contact_email": "intl@bumrungrad.com",
    "description": "One of Southeast Asia's largest private hospitals...",
    "tenant_id": "tenant-apollo-001",
}

Country Coverage Map¶

                    Provider Network Coverage

    Costa Rica (4)      Spain (5)         UAE (5)
         |                 |                |
    Mexico (5)          Turkey (6)      India (7)
                           |                |
                      South Korea (5)   Thailand (5)

    Total: 42 providers across 8 countries
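
The per-country figures can double as a post-seed sanity check. This sketch compares seeded provider rows against the distribution table; `country_mismatches` is a hypothetical helper, and the rows are assumed to be dict-like with a `country` key as in the data model above.

```python
from collections import Counter

# Per-country counts copied from the distribution table.
EXPECTED_BY_COUNTRY = {
    "India": 7,
    "Turkey": 6,
    "Thailand": 5,
    "Mexico": 5,
    "South Korea": 5,
    "UAE": 5,
    "Spain": 5,
    "Costa Rica": 4,
}


def country_mismatches(providers, expected=EXPECTED_BY_COUNTRY):
    """Return {country: (expected, actual)} for every country whose count differs."""
    actual = Counter(p["country"] for p in providers)
    countries = set(expected) | set(actual)
    return {
        country: (expected.get(country, 0), actual.get(country, 0))
        for country in sorted(countries)
        if expected.get(country, 0) != actual.get(country, 0)
    }
```

An empty result means the seeded rows match the table: 42 providers across 8 countries.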

3. Procedure Templates & Demo: `python -m app.seed_demo`¶

This script creates procedure templates with parent-child inheritance, links providers to procedures, and creates the demo patient.

Procedure Templates (12)¶

Category	Procedure	Parent Template	ICD-10-PCS
Orthopedics	Total Knee Replacement (TKR)	ORTHO_BASE	0SRD0JZ
Orthopedics	Total Hip Replacement (THR)	ORTHO_BASE	0SR9019
Orthopedics	Spinal Fusion	ORTHO_BASE	0SG0070
Cardiology	Coronary Artery Bypass (CABG)	CARDIAC_BASE	0210093
Cardiology	Heart Valve Replacement	CARDIAC_BASE	02RF07Z
Oncology	Mastectomy	ONCO_BASE	0HTT0ZZ
Oncology	Prostatectomy	ONCO_BASE	0VT00ZZ
Dental	Dental Implants (Full Arch)	DENTAL_BASE	0DH607Z
Dental	Dental Veneers	DENTAL_BASE	-
Fertility	IVF Cycle	FERTILITY_BASE	3E0P3GC
Bariatric	Gastric Sleeve	BARIATRIC_BASE	0DB64Z3
Ophthalmology	LASIK	OPHTH_BASE	08B1XZZ

Parent Template Inheritance¶

flowchart TD
    OB[ORTHO_BASE<br/>Pre-op clearance, imaging, PT protocol] --> TKR[TKR<br/>+ knee-specific prep]
    OB --> THR[THR<br/>+ hip-specific prep]
    OB --> SF[Spinal Fusion<br/>+ spine-specific prep]

    CB[CARDIAC_BASE<br/>Cardiac clearance, echo, stress test] --> CABG[CABG<br/>+ bypass-specific prep]
    CB --> HVR[Heart Valve<br/>+ valve-specific prep]

Parent templates define shared requirements (pre-op clearance, imaging protocols, recovery milestones). Child procedures inherit these and add procedure-specific requirements.

ORTHO_BASE = {
    "category": "orthopedics",
    "pre_op_requirements": [
        "Complete blood count",
        "Metabolic panel",
        "Chest X-ray",
        "EKG",
        "Orthopedic surgeon clearance",
        "Physical therapy assessment",
    ],
    "post_op_milestones": [
        "Pain management protocol initiated",
        "First physical therapy session",
        "Weight-bearing assessment",
        "Discharge evaluation",
        "Follow-up imaging",
    ],
    "typical_hospital_stay_days": 3,
    "typical_recovery_weeks": 8,
}

TKR = {
    **ORTHO_BASE,
    "name": "Total Knee Replacement",
    "icd_10": "0SRD0JZ",
    "snomed_ct": "609588000",
    "pre_op_requirements": ORTHO_BASE["pre_op_requirements"] + [
        "Knee MRI",
        "Weight-bearing X-rays",
        "Bone density scan (if over 65)",
    ],
    "typical_hospital_stay_days": 2,
    "typical_recovery_weeks": 6,
}
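
Note that `**ORTHO_BASE` makes a shallow copy: lists the child does not override, such as `post_op_milestones`, are shared by reference with the parent. A minimal sketch of a helper that avoids this is shown below; `derive_template` is a hypothetical name, and the shortened requirement lists are only for brevity.

```python
from copy import deepcopy
from typing import Any, Optional


def derive_template(
    parent: dict,
    extra_pre_op: Optional[list] = None,
    **overrides: Any,
) -> dict:
    """Return a child template that inherits from `parent` without sharing state."""
    child = deepcopy(parent)  # nested lists are copied, not shared
    child["pre_op_requirements"] = (
        child.get("pre_op_requirements", []) + list(extra_pre_op or [])
    )
    child.update(overrides)  # child-specific fields win over inherited ones
    return child


ORTHO_BASE = {
    "category": "orthopedics",
    "pre_op_requirements": ["Complete blood count", "EKG"],
    "post_op_milestones": ["First physical therapy session"],
    "typical_hospital_stay_days": 3,
}

TKR = derive_template(
    ORTHO_BASE,
    extra_pre_op=["Knee MRI"],
    name="Total Knee Replacement",
    typical_hospital_stay_days=2,
)
```

With this approach, appending a milestone to `TKR` leaves `ORTHO_BASE` and its other children untouched.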

Provider-Procedure Links (38 Links)¶

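The 38 links connect 15 providers to 10 procedures, and each link carries procedure-specific pricing. Two of the 12 templates, Mastectomy and Prostatectomy, have no provider links in the seed data.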

Procedure	# Providers Linked	Price Range (USD)
Total Knee Replacement	6	$5,500 - $14,000
Total Hip Replacement	5	$6,000 - $15,000
CABG	4	$12,000 - $35,000
Dental Implants	4	$3,500 - $12,000
IVF Cycle	3	$4,000 - $8,500
Gastric Sleeve	4	$4,500 - $9,000
LASIK	3	$1,000 - $3,500
Heart Valve Replacement	3	$15,000 - $40,000
Spinal Fusion	3	$10,000 - $25,000
Dental Veneers	3	$2,000 - $8,000

Link Data Model¶

provider_procedure_link = {
    "provider_id": provider.id,
    "procedure_id": procedure.id,
    "price_amount": 850000,      # $8,500.00 in cents
    "price_currency": "USD",
    "estimated_duration_days": 14,  # Total trip including recovery
    "success_rate_source": "hospital_reported",
    "volume_per_year": 450,
    "is_active": True,
}
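
Storing `price_amount` as an integer number of cents avoids floating-point rounding. The helpers below sketch the conversion in both directions; `format_price` and `to_cents` are hypothetical names, and the code assumes a currency with two decimal places, as USD has.

```python
from decimal import Decimal


def format_price(amount_minor: int, currency: str = "USD") -> str:
    """Render an integer amount in cents as a display string."""
    amount = Decimal(amount_minor) / Decimal(100)
    return f"{amount:,.2f} {currency}"


def to_cents(amount: str) -> int:
    """Convert a decimal string such as '8500.00' to integer cents."""
    cents = Decimal(amount) * 100
    if cents != cents.to_integral_value():
        raise ValueError(f"{amount!r} has more than two decimal places")
    return int(cents)


display = format_price(850000)  # the link example above
stored = to_cents("8500.00")
```

Here `display` is `"8,500.00 USD"` and `stored` is `850000`, matching the comment in the link data model.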

Demo Patient: Aisha Al-Rashid¶

The seed creates a demo patient for testing and demonstration purposes.

demo_patient = {
    "full_name": encrypt_field("Aisha Al-Rashid"),
    "email": encrypt_field("aisha.alrashid@example.com"),
    "phone": encrypt_field("+971-50-123-4567"),
    "date_of_birth": "1985-03-15",
    "country_of_residence": "UAE",
    "preferred_language": "en",
    "timezone": "Asia/Dubai",
    "tenant_id": "tenant-apollo-001",
}

Aisha's profile includes:

Field	Value
Name	Aisha Al-Rashid
Age	41
Country	UAE
Language	English
Timezone	Asia/Dubai
Tenant	tenant-apollo-001
Medical need	Total Knee Replacement
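
The seed record shown earlier holds a date of birth rather than an age, so a displayed age presumably has to be derived and will change over time. A minimal sketch of that calculation follows; `age_on` is a hypothetical helper.

```python
from datetime import date


def age_on(date_of_birth: date, today: date) -> int:
    """Return the age in completed years on the given day."""
    years = today.year - date_of_birth.year
    # Subtract one if this year's birthday has not happened yet.
    if (today.month, today.day) < (date_of_birth.month, date_of_birth.day):
        years -= 1
    return years


dob = date(1985, 3, 15)
before_birthday = age_on(dob, date(2026, 3, 14))
on_birthday = age_on(dob, date(2026, 3, 15))
```

For the demo patient this gives 40 the day before her 2026 birthday and 41 from that birthday onward.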

4. Neo4j Graph Population: `python -m app.seed_graph`¶

Populates the Neo4j knowledge graph with nodes and relationships derived from the PostgreSQL data.

Node Types¶

Node Label	Count	Source
Provider	42	providers table
Procedure	12	procedures table
Country	8	Derived from providers
Specialty	~20	Derived from provider specialties
Accreditation	~10	Derived from provider accreditations

Relationship Types¶

Relationship	From	To	Count
`OFFERS`	Provider	Procedure	38
`LOCATED_IN`	Provider	Country	42
`SPECIALIZES_IN`	Provider	Specialty	~80
`ACCREDITED_BY`	Provider	Accreditation	~60
`BELONGS_TO`	Procedure	Specialty	12

Graph Creation Queries¶

// Each commented block is a separate statement with its own parameters.
// Later blocks re-match the provider by id so that `p` is always bound.

// Create Provider nodes
MERGE (p:Provider {id: $provider_id})
SET p.name = $name,
    p.city = $city,
    p.tier = $tier,
    p.bed_count = $bed_count,
    p.year_established = $year_established

// Create Country nodes and relationships
MATCH (p:Provider {id: $provider_id})
MERGE (c:Country {name: $country_name, iso_code: $iso_code})
MERGE (p)-[:LOCATED_IN]->(c)

// Create Procedure nodes and OFFERS relationships
MATCH (p:Provider {id: $provider_id})
MERGE (proc:Procedure {id: $procedure_id})
SET proc.name = $name,
    proc.category = $category,
    proc.icd_10 = $icd_10
MERGE (p)-[r:OFFERS]->(proc)
SET r.price_amount = $price,
    r.price_currency = $currency,
    r.volume_per_year = $volume

// Create Specialty nodes
MATCH (p:Provider {id: $provider_id})
MERGE (s:Specialty {name: $specialty_name})
MERGE (p)-[:SPECIALIZES_IN]->(s)
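
From Python, queries like these are typically run with a parameter dictionary built from each PostgreSQL row. The sketch below covers only the provider and country part and uses a subset of the properties; `provider_params` and `sync_provider` are hypothetical names, and `tx` is assumed to be a Neo4j transaction object exposing `run(query, parameters)`.

```python
PROVIDER_QUERY = """
MERGE (p:Provider {id: $provider_id})
SET p.name = $name, p.city = $city, p.tier = $tier
MERGE (c:Country {name: $country_name})
MERGE (p)-[:LOCATED_IN]->(c)
"""


def provider_params(provider: dict) -> dict:
    """Map a provider row to the Cypher parameters used by PROVIDER_QUERY."""
    return {
        "provider_id": str(provider["id"]),
        "name": provider["name"],
        "city": provider["city"],
        "tier": provider["tier"],
        "country_name": provider["country"],
    }


def sync_provider(tx, provider: dict) -> None:
    """Upsert one provider and its country inside the given transaction."""
    tx.run(PROVIDER_QUERY, provider_params(provider))
```

With the official driver this could be invoked as `session.execute_write(sync_provider, provider)`. Because every clause is a `MERGE`, re-running the sync does not duplicate nodes or relationships.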

Graph Visualization¶

                         ┌──────────────┐
                         │   Country    │
                         │   (India)    │
                         └──────┬───────┘
                                │ LOCATED_IN
                    ┌───────────┴───────────┐
                    │                       │
              ┌─────┴──────┐         ┌──────┴─────┐
              │  Provider   │         │  Provider   │
              │  (Apollo)   │         │  (Fortis)   │
              └──┬───┬──┬──┘         └──┬───┬──────┘
                 │   │  │               │   │
    OFFERS ──────┘   │  └── SPEC_IN     │   └── OFFERS
                     │                  │
              ┌──────┴──────┐    ┌──────┴──────┐
              │  Procedure   │    │  Specialty   │
              │    (TKR)     │    │(Orthopedics) │
              └──────────────┘    └──────────────┘

5. Qdrant Embeddings: `python -m app.seed_embeddings`¶

Generates vector embeddings for semantic search using the Voyage AI voyage-3.5-lite model (1024 dimensions).

Embedding Collections¶

Collection	Vectors	Source	Embedding Content
providers	42	Provider profiles	Name + city + country + specialties + description
conditions	12	Procedure templates	Name + category + description + requirements
requirement_embeddings	70	Travel requirements	Requirement text + country + procedure context

Total: 124 vectors (42 + 12 + 70)

Embedding Generation¶

from qdrant_client.models import PointStruct
from voyageai import Client as VoyageClient

voyage = VoyageClient(api_key=VOYAGE_API_KEY)

async def generate_provider_embeddings():
    """Generate and upsert provider embeddings to Qdrant."""
    # `db` and `qdrant` are the application's database and Qdrant clients.
    providers = await db.fetch_all("SELECT * FROM providers")

    for provider in providers:
        # Compose embedding text from multiple fields. Joining the parts avoids
        # the indentation and newlines a triple-quoted f-string would embed.
        text = " ".join([
            f"{provider['name']} is a {provider['tier']} medical facility",
            f"located in {provider['city']}, {provider['country']}.",
            f"Specialties: {', '.join(provider['specialties'])}.",
            f"Accreditations: {', '.join(provider['accreditations'])}.",
            provider["description"],
        ])

        # Generate embedding via Voyage AI (a synchronous call)
        embedding = voyage.embed(
            texts=[text],
            model="voyage-3.5-lite",
        ).embeddings[0]  # 1024 dimensions

        # Upsert to Qdrant; point IDs must be unsigned integers or UUIDs
        qdrant.upsert(
            collection_name="providers",
            points=[PointStruct(
                id=str(provider["id"]),
                vector=embedding,
                payload={
                    "name": provider["name"],
                    "country": provider["country"],
                    "city": provider["city"],
                    "tier": provider["tier"],
                    "specialties": provider["specialties"],
                },
            )],
        )
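
The loop above sends one embedding request per provider. Since the `embed` call takes a list of texts, requests can be grouped to cut round trips. The sketch below shows the batching logic only; `chunked`, `embed_in_batches`, and the default `batch_size` are illustrative assumptions, and the `embed` argument stands in for the real client call.

```python
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence, size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_in_batches(
    texts: Sequence[str],
    embed: Callable[[List[str]], List[List[float]]],
    batch_size: int = 32,
) -> List[List[float]]:
    """Embed all texts, calling `embed` once per batch and keeping input order."""
    vectors: List[List[float]] = []
    for batch in chunked(texts, batch_size):
        vectors.extend(embed(list(batch)))
    return vectors
```

With the Voyage client from this section, `embed` could be `lambda batch: voyage.embed(texts=batch, model="voyage-3.5-lite").embeddings`. For 42 provider texts and a batch size of 32, that is two requests instead of 42.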

Collection Configuration¶

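from qdrant_client.models import Distance, VectorParams

# Create collections with cosine similarity.
# Note: recreate_collection drops any existing collection with the same name
# before creating it, so previously stored vectors are lost. Recent
# qdrant-client releases deprecate it in favor of collection_exists plus
# create_collection.
for collection_name in ["providers", "conditions", "requirement_embeddings"]:
    qdrant.recreate_collection(
        collection_name=collection_name,
        vectors_config=VectorParams(
            size=1024,              # voyage-3.5-lite dimension
            distance=Distance.COSINE,
        ),
    )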
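
`Distance.COSINE` ranks vectors by the cosine of the angle between them, so vector length does not affect the score. The plain-Python sketch below is for intuition only; Qdrant computes the metric internally, and `cosine_similarity` is a hypothetical helper.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Vectors pointing in the same direction score 1.0 regardless of magnitude, and orthogonal vectors score 0.0.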

Requirement Embeddings¶

The 70 requirement vectors encode travel-specific knowledge:

Category	Example Requirements	Count
Visa	"Medical visa requirements for India from USA"	16
Insurance	"International health insurance for cardiac surgery"	10
Pre-op	"Pre-operative requirements for knee replacement"	12
Travel	"Post-surgery flight restrictions for orthopedic patients"	10
Accommodation	"Recovery-friendly hotels near Bumrungrad Hospital"	8
Documents	"Medical records translation and apostille requirements"	8
Costs	"Payment methods and financing for medical travel"	6

6. Doctor Seeding: `python -m app.seed_doctors`¶

Seeds 8 doctors with varying levels of profile completeness across 8 providers, one doctor per provider. The script was introduced in Session 26 to support doctor-level matching.

Doctor Profiles¶

Doctor	Provider	Specialty	Status	Completeness
Dr. Rajesh Sharma	Apollo Chennai	Orthopedic Surgery	Verified	Full
Dr. Ayse Demir	Acibadem Istanbul	Cardiothoracic Surgery	Verified	Full
Dr. Somchai Patel	Bumrungrad Bangkok	Orthopedic Surgery	Complete	Full
Dr. Maria Garcia	Hospital Galenia Cancun	Bariatric Surgery	Complete	Full
Dr. Jin-Woo Park	Samsung Medical Center	Oncology	Complete	Standard
Dr. Carlos Mendez	CIMA San Jose	Fertility/IVF	Complete	Standard
Dr. Ahmed Hassan	Cleveland Clinic Abu Dhabi	Cardiology	Basic	Minimal
Dr. Elena Volkov	Quironsalud Barcelona	Ophthalmology	Basic	Minimal

Completeness Levels¶

Level	Fields Populated	Use Case
Verified (2)	All fields + board certifications + publications + verified badge	Full matching and display
Complete (4)	All clinical fields + credentials	Standard matching
Basic (2)	Name, specialty, provider affiliation	Placeholder for future enrichment
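
The three levels can be read as a rule over which fields are populated. The sketch below is one possible interpretation of the table, not the project's actual logic: the field groupings and the `profile_status` helper are assumptions, with field names borrowed from the doctor data model in this section.

```python
BASIC_FIELDS = ("name", "specialty", "provider_id")
CLINICAL_FIELDS = ("sub_specialty", "years_experience", "education")
VERIFIED_FIELDS = ("board_certifications", "publications_count")


def _filled(doctor: dict, fields) -> bool:
    """True when every listed field is present and not empty."""
    return all(doctor.get(field) not in (None, "", []) for field in fields)


def profile_status(doctor: dict) -> str:
    """Classify a doctor record as 'Basic', 'Complete', or 'Verified'."""
    if not _filled(doctor, BASIC_FIELDS):
        raise ValueError("name, specialty, and provider_id are required")
    if not _filled(doctor, CLINICAL_FIELDS):
        return "Basic"
    if doctor.get("verified") is True and _filled(doctor, VERIFIED_FIELDS):
        return "Verified"
    return "Complete"
```

A record with only a name, specialty, and provider affiliation classifies as `Basic`; adding clinical fields moves it to `Complete`, and the verified flag plus certifications and publications moves it to `Verified`.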

Doctor Data Model¶

doctor = {
    "name": "Dr. Rajesh Sharma",
    "provider_id": apollo_chennai.id,
    "specialty": "Orthopedic Surgery",
    "sub_specialty": "Joint Replacement",
    "years_experience": 22,
    "education": [
        {"degree": "MBBS", "institution": "AIIMS Delhi", "year": 1998},
        {"degree": "MS Orthopedics", "institution": "AIIMS Delhi", "year": 2002},
        {"degree": "Fellowship", "institution": "HSS New York", "year": 2004},
    ],
    "board_certifications": ["National Board of Examinations (India)", "AO Trauma"],
    "languages": ["en", "hi", "ta"],
    "procedures_performed": 3500,
    "publications_count": 45,
    "verified": True,
    "verification_date": "2025-11-15",
    "profile_image_url": None,  # To be added
    "bio": "Dr. Sharma is one of India's leading joint replacement surgeons...",
    "tenant_id": "tenant-apollo-001",
}

Neo4j Doctor Sync¶

The doctor seed script also syncs doctor nodes to Neo4j:

// Each commented block is a separate statement.
// Later blocks re-match the doctor by id so that `d` is always bound.

// Create Doctor node
MERGE (d:Doctor {id: $doctor_id})
SET d.name = $name,
    d.specialty = $specialty,
    d.sub_specialty = $sub_specialty,
    d.years_experience = $years_experience,
    d.verified = $verified

// Link to Provider
MATCH (d:Doctor {id: $doctor_id})
MATCH (p:Provider {id: $provider_id})
MERGE (d)-[:PRACTICES_AT]->(p)

// Link to Specialty
MATCH (d:Doctor {id: $doctor_id})
MERGE (s:Specialty {name: $specialty})
MERGE (d)-[:SPECIALIZES_IN]->(s)

// Link to Procedures
MATCH (d:Doctor {id: $doctor_id})
MATCH (proc:Procedure {name: $procedure_name})
MERGE (d)-[:PERFORMS]->(proc)

Graph After Doctor Seeding¶

    ┌────────────┐     PRACTICES_AT     ┌──────────────┐
    │   Doctor    │────────────────────>│   Provider    │
    │ (Dr.Sharma) │                     │   (Apollo)    │
    └──┬────┬────┘                     └──────────────┘
       │    │
       │    └── SPECIALIZES_IN ──> [Orthopedics]
       │
       └── PERFORMS ──> [TKR]  [THR]

Provider Language Services¶

Six providers have been updated with detailed language service data, which improves matching for patients who need support in a specific language.

Language Service Types¶

Service Type	Description	Example
`interpreter`	On-site medical interpreter	Arabic interpreter at Bumrungrad
`coordinator`	International patient coordinator	English coordinator at Acibadem
`document_translation`	Medical document translation	Japanese documents at Samsung MC

Updated Providers¶

Provider	Interpreters	Coordinators	Document Languages
Bumrungrad Bangkok	en, ar, ja, zh	en, ar, ja	en, ar, ja, zh, de
Apollo Chennai	en, hi, ar, bn	en, ar	en, hi, ar, ta
Acibadem Istanbul	en, ar, de, ru	en, ar, de	en, ar, de, ru, fr
Samsung Medical Center	en, ja, zh, ru	en, ja, zh	en, ja, zh, ko
Cleveland Clinic Abu Dhabi	en, ar, hi, ur	en, ar	en, ar, hi, fr
Hospital Galenia Cancun	en, es	en, es	en, es, fr

Data Structure¶

import json

language_services = {
    "interpreter_languages": ["en", "ar", "ja", "zh"],
    "coordinator_languages": ["en", "ar", "ja"],
    "document_translation_languages": ["en", "ar", "ja", "zh", "de"],
}

# Stored as JSONB on the providers table
await db.execute(
    """
    UPDATE providers SET language_services = :services
    WHERE id = :provider_id
    """,
    {"services": json.dumps(language_services), "provider_id": provider.id},
)
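
Once stored, the structure can be queried per service type during matching. The sketch below assumes the three keys shown above; `SERVICE_KEYS`, `supports_language`, and `available_services` are hypothetical names.

```python
# Maps each service type from the table to its key in `language_services`.
SERVICE_KEYS = {
    "interpreter": "interpreter_languages",
    "coordinator": "coordinator_languages",
    "document_translation": "document_translation_languages",
}


def supports_language(language_services: dict, language: str, service_type: str) -> bool:
    """True if the provider offers `service_type` in the given language code."""
    key = SERVICE_KEYS[service_type]
    return language in language_services.get(key, [])


def available_services(language_services: dict, language: str) -> list:
    """List the service types a provider offers in the given language."""
    return [
        service
        for service in SERVICE_KEYS
        if supports_language(language_services, language, service)
    ]


# Values from the Bumrungrad Bangkok row.
bumrungrad = {
    "interpreter_languages": ["en", "ar", "ja", "zh"],
    "coordinator_languages": ["en", "ar", "ja"],
    "document_translation_languages": ["en", "ar", "ja", "zh", "de"],
}
german = available_services(bumrungrad, "de")
```

For a German-speaking patient, `german` contains only `"document_translation"`, which a matcher could treat as partial language support.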

Seed Verification¶

After running all seeds, verify the data is consistent across stores.

Verification Checklist¶

# PostgreSQL counts
python -c "
from app.db import get_db
db = get_db()
print('Providers:', db.execute('SELECT COUNT(*) FROM providers').scalar())
print('Procedures:', db.execute('SELECT COUNT(*) FROM procedures').scalar())
print('Links:', db.execute('SELECT COUNT(*) FROM provider_procedures').scalar())
print('Doctors:', db.execute('SELECT COUNT(*) FROM doctors').scalar())
print('Patients:', db.execute('SELECT COUNT(*) FROM patients').scalar())
"

# Neo4j counts
python -c "
from app.graph import get_driver
with get_driver().session() as s:
    for label in ['Provider','Procedure','Doctor','Country','Specialty']:
        count = s.run(f'MATCH (n:{label}) RETURN count(n)').single()[0]
        print(f'{label}: {count}')
"

# Qdrant counts
python -c "
from app.vector import get_client
client = get_client()
for name in ['providers','conditions','requirement_embeddings']:
    info = client.get_collection(name)
    print(f'{name}: {info.points_count} vectors')
"

Expected Counts¶

Store	Entity	Expected Count
PostgreSQL	providers	42
PostgreSQL	procedures	12
PostgreSQL	provider_procedures	38
PostgreSQL	doctors	8
PostgreSQL	patients	1 (demo)
Neo4j	Provider nodes	42
Neo4j	Procedure nodes	12
Neo4j	Doctor nodes	8
Neo4j	Country nodes	8
Qdrant	providers	42 vectors
Qdrant	conditions	12 vectors
Qdrant	requirement_embeddings	70 vectors
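
The expected counts can also be encoded so that a verification run reports only the differences. A minimal sketch follows; `EXPECTED` mirrors the table above, and `count_mismatches` is a hypothetical helper that takes counts collected by the checklist commands.

```python
# (store, entity) -> expected count, copied from the table above.
EXPECTED = {
    ("PostgreSQL", "providers"): 42,
    ("PostgreSQL", "procedures"): 12,
    ("PostgreSQL", "provider_procedures"): 38,
    ("PostgreSQL", "doctors"): 8,
    ("PostgreSQL", "patients"): 1,
    ("Neo4j", "Provider"): 42,
    ("Neo4j", "Procedure"): 12,
    ("Neo4j", "Doctor"): 8,
    ("Neo4j", "Country"): 8,
    ("Qdrant", "providers"): 42,
    ("Qdrant", "conditions"): 12,
    ("Qdrant", "requirement_embeddings"): 70,
}


def count_mismatches(actual: dict, expected: dict = EXPECTED) -> dict:
    """Return {(store, entity): (expected, actual)} for every differing count.

    A count missing from `actual` is reported with None as the actual value.
    """
    return {
        key: (want, actual.get(key))
        for key, want in expected.items()
        if actual.get(key) != want
    }
```

An empty dictionary means all three stores agree with the table.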

Rebuilding Data¶

To completely rebuild all data (e.g., after schema changes):

# 1. Clear existing data (destructive!)
python -m app.clear_data          # Truncates PostgreSQL tables
python -m app.clear_graph         # Deletes all Neo4j nodes
python -m app.clear_embeddings    # Drops Qdrant collections

# 2. Re-run full seed sequence
python -m app.seed
python -m app.seed_providers
python -m app.seed_demo
python -m app.seed_graph
python -m app.seed_embeddings
python -m app.seed_doctors

Destructive Operation

The clear scripts permanently delete all data. Only run these in development or staging environments. Production data should be migrated, not re-seeded.
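
A clear script can enforce this rule itself by refusing to run outside an allow-list of environments. The sketch below is illustrative: the `APP_ENV` variable name, the allowed values, and `assert_safe_to_clear` are all assumptions, not existing project settings.

```python
import os
from typing import Mapping, Optional

# Hypothetical environment names in which clearing data is acceptable.
ALLOWED_ENVIRONMENTS = {"development", "staging"}


def assert_safe_to_clear(environ: Optional[Mapping[str, str]] = None) -> str:
    """Return the environment name, or raise if clearing is not allowed."""
    env_map = os.environ if environ is None else environ
    env = env_map.get("APP_ENV", "").strip().lower()
    if env not in ALLOWED_ENVIRONMENTS:
        raise RuntimeError(
            f"Refusing to clear data: APP_ENV={env or 'unset'!r} is not in "
            f"{sorted(ALLOWED_ENVIRONMENTS)}"
        )
    return env
```

Treating an unset variable as unsafe means a misconfigured shell fails closed instead of wiping data.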