Data Ingestion¶
Curaway's data platform is populated through a series of seed scripts that load providers, procedures, doctors, and graph/vector data in a specific order. This document covers every seed script, what it creates, and the correct execution sequence.
Seed Execution Order¶
The seed scripts must run in a specific order due to foreign key dependencies between tables and cross-system references (PostgreSQL, Neo4j, Qdrant).
flowchart TD
A[1. seed<br/>Tenant + Consent + Legal] --> B[2. seed_providers<br/>42 Providers]
B --> C[3. seed_demo<br/>Procedures + Links + Demo Patient]
C --> D[4. seed_graph<br/>Neo4j Graph Population]
D --> E[5. seed_embeddings<br/>Qdrant Vectors]
E --> F[6. seed_doctors<br/>8 Doctors + Neo4j Sync]
Quick Start¶
# Full seed from scratch (run in order)
python -m app.seed # Tenant, consent purposes, legal agreements
python -m app.seed_providers # 42 providers across 8 countries
python -m app.seed_demo # Procedures, links, demo patient
python -m app.seed_graph # Neo4j nodes and relationships
python -m app.seed_embeddings # Qdrant vector embeddings
python -m app.seed_doctors # 8 doctors with Neo4j sync
Order Matters
Running seed_graph before seed_providers will create an empty graph. Running
seed_embeddings before seed_demo will miss procedure embeddings. Always follow
the sequence above.
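The ordering rule can be enforced by a small wrapper rather than by hand. The following is a minimal sketch, not part of the documented tooling: `run_seeds` and the injectable `runner` parameter are hypothetical names, and only the module names come from the sequence above. It stops at the first failing seed so that later scripts never run against incomplete data.

```python
import subprocess
import sys
from typing import Callable, List, Sequence

# Module names taken from the sequence above; order is significant.
SEED_MODULES = [
    "app.seed",
    "app.seed_providers",
    "app.seed_demo",
    "app.seed_graph",
    "app.seed_embeddings",
    "app.seed_doctors",
]


def _run_module(cmd: Sequence[str]) -> int:
    """Run one seed command and return its exit code."""
    return subprocess.run(cmd).returncode


def run_seeds(runner: Callable[[Sequence[str]], int] = _run_module) -> List[str]:
    """Run each seed in order and stop at the first non-zero exit code.

    Returns the modules that completed successfully.
    """
    completed = []
    for module in SEED_MODULES:
        if runner([sys.executable, "-m", module]) != 0:
            break  # later seeds depend on this one, so do not continue
        completed.append(module)
    return completed
```

Calling `run_seeds()` with no arguments would invoke the real modules. Passing a fake `runner` makes the ordering logic testable without touching any database.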
1. Tenant Seeding: python -m app.seed¶
The foundation seed creates the tenant, consent purposes, and legal agreements that everything else depends on.
What It Creates¶
| Entity | Count | Details |
|---|---|---|
| Tenant | 1 | tenant-apollo-001 |
| Consent purposes | 6 | 4 required + 2 optional |
| Legal agreements | 2 | ToS v1, Privacy Policy v1 |
Consent Purposes¶
| Purpose | Required | Description |
|---|---|---|
| data_processing | Yes | Core data processing for service delivery |
| medical_data_sharing | Yes | Sharing medical records with matched providers |
| cross_border_transfer | Yes | Transferring data across international borders |
| communication | Yes | Essential service communications |
| marketing | No | Marketing emails and promotional content |
| analytics | No | Anonymous usage analytics |
Idempotency¶
The seed script checks for existing records before inserting, so running it multiple times is safe: records that already exist are skipped.
async def seed_tenant():
    existing = await db.fetch_one(
        "SELECT id FROM tenants WHERE tenant_id = :tid",
        {"tid": "tenant-apollo-001"},
    )
    if existing:
        logger.info("Tenant already exists, skipping")
        return
    # Create tenant...
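The check-then-insert pattern can be tried in isolation. The sketch below uses an in-memory SQLite database purely as a stand-in for PostgreSQL. The table schema and the `conn`/`tenant_id` parameters are assumptions for illustration, not the project's actual code.

```python
import sqlite3


def seed_tenant(conn: sqlite3.Connection, tenant_id: str) -> bool:
    """Insert the tenant if absent. Returns True when a row was created."""
    existing = conn.execute(
        "SELECT id FROM tenants WHERE tenant_id = ?", (tenant_id,)
    ).fetchone()
    if existing:
        return False  # already seeded, nothing to do
    conn.execute("INSERT INTO tenants (tenant_id) VALUES (?)", (tenant_id,))
    conn.commit()
    return True


# Hypothetical minimal schema; the real tenants table has more columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tenants (id INTEGER PRIMARY KEY, tenant_id TEXT UNIQUE)")

first = seed_tenant(conn, "tenant-apollo-001")   # creates the row
second = seed_tenant(conn, "tenant-apollo-001")  # skips, row already exists
row_count = conn.execute("SELECT COUNT(*) FROM tenants").fetchone()[0]
```

The `UNIQUE` constraint is a second line of defense: if two seed runs raced each other, the database would reject the duplicate even though both passed the existence check.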
2. Provider Seeding: python -m app.seed_providers¶
Seeds 42 medical providers across 8 countries, representing the initial curated network.
Provider Distribution by Country¶
| Country | Count | Example Providers |
|---|---|---|
| India | 7 | Apollo Hospitals Chennai, Fortis Memorial Gurgaon, Max Super Speciality Delhi |
| Turkey | 6 | Acibadem Altunizade, Memorial Bahcelievler, Medicana International Istanbul |
| Thailand | 5 | Bumrungrad International, Bangkok Hospital, Samitivej Sukhumvit |
| Mexico | 5 | Hospital Galenia Cancun, Christus Muguerza Monterrey, Star Medica Merida |
| South Korea | 5 | Samsung Medical Center, Severance Hospital, Asan Medical Center |
| UAE | 5 | Cleveland Clinic Abu Dhabi, Mediclinic City Hospital, American Hospital Dubai |
| Spain | 5 | Quironsalud Barcelona, Hospital Universitario HM Madrid, Teknon Barcelona |
| Costa Rica | 4 | CIMA Hospital San Jose, Clinica Biblica, Hospital Metropolitano |
Provider Data Model¶
Each provider record includes:
provider = {
"name": "Bumrungrad International Hospital",
"country": "Thailand",
"city": "Bangkok",
"tier": "premium", # premium, standard, value
"accreditations": ["JCI", "TEMOS"],
"specialties": ["orthopedics", "cardiology", "oncology"],
"bed_count": 580,
"international_patient_center": True,
"languages_spoken": ["en", "th", "ar", "ja"],
"year_established": 1980,
"website": "https://www.bumrungrad.com",
"contact_email": "intl@bumrungrad.com",
"description": "One of Southeast Asia's largest private hospitals...",
"tenant_id": "tenant-apollo-001",
}
Country Coverage Map¶
Provider network coverage: Costa Rica (4), Spain (5), UAE (5), Mexico (5), Turkey (6), India (7), South Korea (5), Thailand (5).
Total: 42 providers across 8 countries
3. Procedure Templates & Demo: python -m app.seed_demo¶
This script creates procedure templates with parent-child inheritance, links providers to procedures, and creates the demo patient.
Procedure Templates (12)¶
| Category | Procedure | Parent Template | ICD-10-PCS |
|---|---|---|---|
| Orthopedics | Total Knee Replacement (TKR) | ORTHO_BASE | 0SRD0JZ |
| Orthopedics | Total Hip Replacement (THR) | ORTHO_BASE | 0SR9019 |
| Orthopedics | Spinal Fusion | ORTHO_BASE | 0SG0070 |
| Cardiology | Coronary Artery Bypass (CABG) | CARDIAC_BASE | 0210093 |
| Cardiology | Heart Valve Replacement | CARDIAC_BASE | 02RF07Z |
| Oncology | Mastectomy | ONCO_BASE | 0HTT0ZZ |
| Oncology | Prostatectomy | ONCO_BASE | 0VT00ZZ |
| Dental | Dental Implants (Full Arch) | DENTAL_BASE | 0DH607Z |
| Dental | Dental Veneers | DENTAL_BASE | - |
| Fertility | IVF Cycle | FERTILITY_BASE | 3E0P3GC |
| Bariatric | Gastric Sleeve | BARIATRIC_BASE | 0DB64Z3 |
| Ophthalmology | LASIK | OPHTH_BASE | 08B1XZZ |
Parent Template Inheritance¶
flowchart TD
OB[ORTHO_BASE<br/>Pre-op clearance, imaging, PT protocol] --> TKR[TKR<br/>+ knee-specific prep]
OB --> THR[THR<br/>+ hip-specific prep]
OB --> SF[Spinal Fusion<br/>+ spine-specific prep]
CB[CARDIAC_BASE<br/>Cardiac clearance, echo, stress test] --> CABG[CABG<br/>+ bypass-specific prep]
CB --> HVR[Heart Valve<br/>+ valve-specific prep]
Parent templates define shared requirements (pre-op clearance, imaging protocols, recovery milestones). Child procedures inherit these and add procedure-specific requirements.
ORTHO_BASE = {
"category": "orthopedics",
"pre_op_requirements": [
"Complete blood count",
"Metabolic panel",
"Chest X-ray",
"EKG",
"Orthopedic surgeon clearance",
"Physical therapy assessment",
],
"post_op_milestones": [
"Pain management protocol initiated",
"First physical therapy session",
"Weight-bearing assessment",
"Discharge evaluation",
"Follow-up imaging",
],
"typical_hospital_stay_days": 3,
"typical_recovery_weeks": 8,
}
TKR = {
**ORTHO_BASE,
"name": "Total Knee Replacement",
"icd_10": "0SRD0JZ",
"snomed_ct": "609588000",
"pre_op_requirements": ORTHO_BASE["pre_op_requirements"] + [
"Knee MRI",
"Weight-bearing X-rays",
"Bone density scan (if over 65)",
],
"typical_hospital_stay_days": 2,
"typical_recovery_weeks": 6,
}
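One detail of the `**ORTHO_BASE` spread is worth illustrating: it makes a shallow copy, so any inherited list that the child does not replace (for example `post_op_milestones`) is the same list object as the parent's. The sketch below shows one way to avoid accidental sharing. `derive_template` is a hypothetical helper, the trimmed `base` dictionary is a shortened stand-in for `ORTHO_BASE`, and the appended milestone is invented for the demonstration.

```python
import copy
from typing import List


def derive_template(parent: dict, extra_pre_op: List[str], **overrides) -> dict:
    """Build a child template from a parent without sharing mutable lists."""
    child = copy.deepcopy(parent)  # independent copies of nested lists
    child["pre_op_requirements"] = child["pre_op_requirements"] + list(extra_pre_op)
    child.update(overrides)  # scalar overrides such as stay length
    return child


# Shortened stand-in for ORTHO_BASE.
base = {
    "category": "orthopedics",
    "pre_op_requirements": ["Complete blood count", "EKG"],
    "post_op_milestones": ["First physical therapy session"],
    "typical_hospital_stay_days": 3,
}

tkr = derive_template(
    base,
    extra_pre_op=["Knee MRI"],
    name="Total Knee Replacement",
    typical_hospital_stay_days=2,
)

# Hypothetical milestone: mutating the child must not leak into the parent.
tkr["post_op_milestones"].append("Knee flexion check")
```

With a plain `{**base, ...}` spread, the final `append` would also change `base["post_op_milestones"]`. With the deep copy it does not.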
Provider-Procedure Links (38 Links)¶
38 links connect 15 providers to 10 procedures, including procedure-specific pricing.
| Procedure | # Providers Linked | Price Range (USD) |
|---|---|---|
| Total Knee Replacement | 6 | $5,500 - $14,000 |
| Total Hip Replacement | 5 | $6,000 - $15,000 |
| CABG | 4 | $12,000 - $35,000 |
| Dental Implants | 4 | $3,500 - $12,000 |
| IVF Cycle | 3 | $4,000 - $8,500 |
| Gastric Sleeve | 4 | $4,500 - $9,000 |
| LASIK | 3 | $1,000 - $3,500 |
| Heart Valve Replacement | 3 | $15,000 - $40,000 |
| Spinal Fusion | 3 | $10,000 - $25,000 |
| Dental Veneers | 3 | $2,000 - $8,000 |
Link Data Model¶
provider_procedure_link = {
"provider_id": provider.id,
"procedure_id": procedure.id,
"price_amount": 850000, # $8,500.00 in cents
"price_currency": "USD",
"estimated_duration_days": 14, # Total trip including recovery
"success_rate_source": "hospital_reported",
"volume_per_year": 450,
"is_active": True,
}
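Storing `price_amount` as an integer number of cents avoids floating-point rounding, but it means every read and write needs a conversion. A minimal sketch of that conversion follows. The helper names are hypothetical, and the code assumes a currency with two decimal places, as USD has.

```python
from decimal import Decimal


def to_minor_units(amount: str) -> int:
    """Convert a decimal string such as '8500.00' to integer cents."""
    return int((Decimal(amount) * 100).to_integral_value())


def format_price(amount_minor: int, currency: str = "USD") -> str:
    """Render integer cents as a display string with thousands separators."""
    major = Decimal(amount_minor) / Decimal(100)
    return f"{currency} {major:,.2f}"


stored = to_minor_units("8500.00")  # value written to price_amount
display = format_price(stored)      # value shown to the patient
```

Using `Decimal` rather than `float` keeps amounts such as 19.99 exact during the conversion.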
Demo Patient: Aisha Al-Rashid¶
The seed creates a demo patient for testing and demonstration purposes.
demo_patient = {
"full_name": encrypt_field("Aisha Al-Rashid"),
"email": encrypt_field("aisha.alrashid@example.com"),
"phone": encrypt_field("+971-50-123-4567"),
"date_of_birth": "1985-03-15",
"country_of_residence": "UAE",
"preferred_language": "en",
"timezone": "Asia/Dubai",
"tenant_id": "tenant-apollo-001",
}
Aisha's profile includes:
| Field | Value |
|---|---|
| Name | Aisha Al-Rashid |
| Age | 41 (derived from date of birth 1985-03-15) |
| Country | UAE |
| Language | English |
| Timezone | Asia/Dubai |
| Tenant | tenant-apollo-001 |
| Medical need | Total Knee Replacement |
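Because the record stores `date_of_birth` rather than an age, any displayed age is a computed value that changes over time. A minimal sketch of that computation, with `age_on` as a hypothetical helper name:

```python
from datetime import date


def age_on(date_of_birth: date, today: date) -> int:
    """Return the age in whole years on the given day."""
    years = today.year - date_of_birth.year
    # Subtract one if this year's birthday has not happened yet.
    if (today.month, today.day) < (date_of_birth.month, date_of_birth.day):
        years -= 1
    return years


dob = date(1985, 3, 15)
age_before_birthday = age_on(dob, date(2026, 3, 14))
age_on_birthday = age_on(dob, date(2026, 3, 15))
```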
4. Neo4j Graph Population: python -m app.seed_graph¶
Populates the Neo4j knowledge graph with nodes and relationships derived from the PostgreSQL data.
Node Types¶
| Node Label | Count | Source |
|---|---|---|
| Provider | 42 | providers table |
| Procedure | 12 | procedures table |
| Country | 8 | Derived from providers |
| Specialty | ~20 | Derived from provider specialties |
| Accreditation | ~10 | Derived from provider accreditations |
Relationship Types¶
| Relationship | From | To | Count |
|---|---|---|---|
| OFFERS | Provider | Procedure | 38 |
| LOCATED_IN | Provider | Country | 42 |
| SPECIALIZES_IN | Provider | Specialty | ~80 |
| ACCREDITED_BY | Provider | Accreditation | ~60 |
| BELONGS_TO | Procedure | Specialty | 12 |
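The derived node and relationship counts follow from the provider records: each distinct country, specialty, or accreditation becomes one node, and each provider contributes one relationship per list entry. The sketch below illustrates that derivation on two sample records. The function name is hypothetical, the second provider is invented, and the real seed script may compute these differently.

```python
from typing import Dict, List


def derive_graph_counts(providers: List[dict]) -> Dict[str, int]:
    """Count the nodes and relationships a provider list would produce."""
    countries = {p["country"] for p in providers}
    specialties = {s for p in providers for s in p["specialties"]}
    accreditations = {a for p in providers for a in p["accreditations"]}
    return {
        "Country": len(countries),
        "Specialty": len(specialties),
        "Accreditation": len(accreditations),
        "LOCATED_IN": len(providers),  # exactly one country per provider
        "SPECIALIZES_IN": sum(len(set(p["specialties"])) for p in providers),
        "ACCREDITED_BY": sum(len(set(p["accreditations"])) for p in providers),
    }


sample = [
    {
        "name": "Bumrungrad International Hospital",
        "country": "Thailand",
        "specialties": ["orthopedics", "cardiology", "oncology"],
        "accreditations": ["JCI", "TEMOS"],
    },
    {
        "name": "Example Hospital",  # invented record for illustration
        "country": "India",
        "specialties": ["orthopedics", "cardiology"],
        "accreditations": ["JCI", "NABH"],
    },
]

counts = derive_graph_counts(sample)
```

Shared values such as `orthopedics` or `JCI` produce a single node but one relationship per provider, which is why relationship counts grow faster than node counts.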
Graph Creation Queries¶
// Create Provider nodes
MERGE (p:Provider {id: $provider_id})
SET p.name = $name,
p.city = $city,
p.tier = $tier,
p.bed_count = $bed_count,
p.year_established = $year_established
// Create Country nodes and relationships
MERGE (c:Country {name: $country_name, iso_code: $iso_code})
MERGE (p)-[:LOCATED_IN]->(c)
// Create Procedure nodes and OFFERS relationships
MERGE (proc:Procedure {id: $procedure_id})
SET proc.name = $name,
proc.category = $category,
proc.icd_10 = $icd_10
MERGE (p)-[r:OFFERS]->(proc)
SET r.price_amount = $price,
r.price_currency = $currency,
r.volume_per_year = $volume
// Create Specialty nodes
MERGE (s:Specialty {name: $specialty_name})
MERGE (p)-[:SPECIALIZES_IN]->(s)
Graph Visualization¶
Example subgraph: the Provider nodes Apollo and Fortis each have a LOCATED_IN relationship to the Country node India. Providers connect to Procedure nodes (for example TKR) through OFFERS and to Specialty nodes (for example Orthopedics) through SPECIALIZES_IN.
5. Qdrant Embeddings: python -m app.seed_embeddings¶
Generates vector embeddings for semantic search using the Voyage AI voyage-3.5-lite model (1024 dimensions).
Embedding Collections¶
| Collection | Vectors | Source | Embedding Content |
|---|---|---|---|
| providers | 42 | Provider profiles | Name + city + country + specialties + description |
| conditions | 12 | Procedure templates | Name + category + description + requirements |
| requirement_embeddings | 70 | Travel requirements | Requirement text + country + procedure context |
Total: 124 vectors (42 + 12 + 70)
Embedding Generation¶
from voyageai import Client as VoyageClient

voyage = VoyageClient(api_key=VOYAGE_API_KEY)

async def generate_provider_embeddings():
    """Generate and upsert provider embeddings to Qdrant."""
    providers = await db.fetch_all("SELECT * FROM providers")
    for provider in providers:
        # Compose embedding text from multiple fields
        text = f"""
        {provider['name']} is a {provider['tier']} medical facility
        located in {provider['city']}, {provider['country']}.
        Specialties: {', '.join(provider['specialties'])}.
        Accreditations: {', '.join(provider['accreditations'])}.
        {provider['description']}
        """
        # Generate embedding via Voyage AI
        embedding = voyage.embed(
            texts=[text],
            model="voyage-3.5-lite",
        ).embeddings[0]  # 1024 dimensions
        # Upsert to Qdrant
        qdrant.upsert(
            collection_name="providers",
            points=[{
                "id": str(provider["id"]),
                "vector": embedding,
                "payload": {
                    "name": provider["name"],
                    "country": provider["country"],
                    "city": provider["city"],
                    "tier": provider["tier"],
                    "specialties": provider["specialties"],
                },
            }],
        )
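The loop above makes one embedding request per provider. Since `voyage.embed` takes a list of texts, the same work could be grouped into fewer requests. The sketch below covers only the parts that need no network access: composing the text and splitting it into batches. Both helper names are hypothetical, and the batch size of 16 is an arbitrary assumption rather than a documented limit.

```python
from typing import Iterator, List


def compose_provider_text(provider: dict) -> str:
    """Join the provider fields used for embedding into one string."""
    return (
        f"{provider['name']} is a {provider['tier']} medical facility "
        f"located in {provider['city']}, {provider['country']}. "
        f"Specialties: {', '.join(provider['specialties'])}. "
        f"Accreditations: {', '.join(provider['accreditations'])}. "
        f"{provider['description']}"
    )


def batched(items: List, size: int) -> Iterator[List]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# 42 placeholder texts stand in for the composed provider descriptions.
texts = [f"provider {i}" for i in range(42)]
batch_sizes = [len(batch) for batch in batched(texts, 16)]
```

Each batch would then be passed as the `texts` argument of a single embed call, with the returned embeddings matched back to providers by position.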
Collection Configuration¶
from qdrant_client.models import Distance, VectorParams

# Create collections with cosine similarity
# Note: recreate_collection drops an existing collection and its vectors first
for collection_name in ["providers", "conditions", "requirement_embeddings"]:
    qdrant.recreate_collection(
        collection_name=collection_name,
        vectors_config=VectorParams(
            size=1024,  # voyage-3.5-lite dimension
            distance=Distance.COSINE,
        ),
    )
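For readers unfamiliar with the cosine metric selected above, the following sketch computes it in plain Python. It illustrates the metric itself and says nothing about how Qdrant implements it internally. The function name is hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / norm


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])      # unrelated vectors
```

Cosine similarity depends only on direction, not magnitude, so two texts with similar meaning score highly even if their embedding vectors differ in length.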
Requirement Embeddings¶
The 70 requirement vectors encode travel-specific knowledge:
| Category | Example Requirements | Count |
|---|---|---|
| Visa | "Medical visa requirements for India from USA" | 16 |
| Insurance | "International health insurance for cardiac surgery" | 10 |
| Pre-op | "Pre-operative requirements for knee replacement" | 12 |
| Travel | "Post-surgery flight restrictions for orthopedic patients" | 10 |
| Accommodation | "Recovery-friendly hotels near Bumrungrad Hospital" | 8 |
| Documents | "Medical records translation and apostille requirements" | 8 |
| Costs | "Payment methods and financing for medical travel" | 6 |
6. Doctor Seeding: python -m app.seed_doctors¶
Seeds 8 doctors with varying levels of profile completeness, one at each of the 8 providers listed below. This was introduced in Session 26 to support the doctor-level matching feature.
Doctor Profiles¶
| Doctor | Provider | Specialty | Status | Completeness |
|---|---|---|---|---|
| Dr. Rajesh Sharma | Apollo Chennai | Orthopedic Surgery | Verified | Full |
| Dr. Ayse Demir | Acibadem Istanbul | Cardiothoracic Surgery | Verified | Full |
| Dr. Somchai Patel | Bumrungrad Bangkok | Orthopedic Surgery | Complete | Full |
| Dr. Maria Garcia | Hospital Galenia Cancun | Bariatric Surgery | Complete | Full |
| Dr. Jin-Woo Park | Samsung Medical Center | Oncology | Complete | Standard |
| Dr. Carlos Mendez | CIMA San Jose | Fertility/IVF | Complete | Standard |
| Dr. Ahmed Hassan | Cleveland Clinic Abu Dhabi | Cardiology | Basic | Minimal |
| Dr. Elena Volkov | Quironsalud Barcelona | Ophthalmology | Basic | Minimal |
Completeness Levels¶
| Level | Fields Populated | Use Case |
|---|---|---|
| Verified (2) | All fields + board certifications + publications + verified badge | Full matching and display |
| Complete (4) | All clinical fields + credentials | Standard matching |
| Basic (2) | Name, specialty, provider affiliation | Placeholder for future enrichment |
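The three levels can be read as nested field sets: each level requires everything in the level below plus more. The sketch below shows one way such a classification could work. The exact field lists are assumptions inferred from the table and the doctor data model, and `completeness_level` is a hypothetical helper.

```python
# Assumed field sets; the real seed script may draw the lines differently.
BASIC_FIELDS = ("name", "specialty", "provider_id")
COMPLETE_FIELDS = BASIC_FIELDS + (
    "sub_specialty", "years_experience", "education", "languages",
)
VERIFIED_FIELDS = COMPLETE_FIELDS + ("board_certifications", "publications_count")


def _has_all(record: dict, fields: tuple) -> bool:
    """True when every field is present and not empty."""
    return all(record.get(f) not in (None, "", []) for f in fields)


def completeness_level(doctor: dict) -> str:
    """Classify a doctor record as verified, complete, basic, or incomplete."""
    if doctor.get("verified") is True and _has_all(doctor, VERIFIED_FIELDS):
        return "verified"
    if _has_all(doctor, COMPLETE_FIELDS):
        return "complete"
    if _has_all(doctor, BASIC_FIELDS):
        return "basic"
    return "incomplete"


basic = {"name": "Dr. A", "specialty": "Cardiology", "provider_id": 1}
complete = dict(basic, sub_specialty="Interventional", years_experience=10,
                education=[{"degree": "MBBS"}], languages=["en"])
verified = dict(complete, board_certifications=["Board X"],
                publications_count=3, verified=True)
```

Checking the strictest level first means a record is always assigned the highest level it qualifies for.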
Doctor Data Model¶
doctor = {
"name": "Dr. Rajesh Sharma",
"provider_id": apollo_chennai.id,
"specialty": "Orthopedic Surgery",
"sub_specialty": "Joint Replacement",
"years_experience": 22,
"education": [
{"degree": "MBBS", "institution": "AIIMS Delhi", "year": 1998},
{"degree": "MS Orthopedics", "institution": "AIIMS Delhi", "year": 2002},
{"degree": "Fellowship", "institution": "HSS New York", "year": 2004},
],
"board_certifications": ["National Board of Examinations (India)", "AO Trauma"],
"languages": ["en", "hi", "ta"],
"procedures_performed": 3500,
"publications_count": 45,
"verified": True,
"verification_date": "2025-11-15",
"profile_image_url": None, # To be added
"bio": "Dr. Sharma is one of India's leading joint replacement surgeons...",
"tenant_id": "tenant-apollo-001",
}
Neo4j Doctor Sync¶
The doctor seed script also syncs doctor nodes to Neo4j:
// Create Doctor node
MERGE (d:Doctor {id: $doctor_id})
SET d.name = $name,
    d.specialty = $specialty,
    d.sub_specialty = $sub_specialty,
    d.years_experience = $years_experience,
    d.verified = $verified
// Link to Provider (WITH carries d forward; Cypher requires it before a MATCH that follows an update)
WITH d
MATCH (p:Provider {id: $provider_id})
MERGE (d)-[:PRACTICES_AT]->(p)
// Link to Specialty
MERGE (s:Specialty {name: $specialty})
MERGE (d)-[:SPECIALIZES_IN]->(s)
// Link to Procedures
WITH d
MATCH (proc:Procedure {name: $procedure_name})
MERGE (d)-[:PERFORMS]->(proc)
Graph After Doctor Seeding¶
┌────────────┐ PRACTICES_AT ┌──────────────┐
│ Doctor │────────────────────>│ Provider │
│ (Dr.Sharma) │ │ (Apollo) │
└──┬────┬────┘ └──────────────┘
│ │
│ └── SPECIALIZES_IN ──> [Orthopedics]
│
└── PERFORMS ──> [TKR] [THR]
Provider Language Services¶
6 providers have been updated with detailed language service data, enabling better matching for patients who need specific language support.
Language Service Types¶
| Service Type | Description | Example |
|---|---|---|
| interpreter | On-site medical interpreter | Arabic interpreter at Bumrungrad |
| coordinator | International patient coordinator | English coordinator at Acibadem |
| document_translation | Medical document translation | Japanese documents at Samsung MC |
Updated Providers¶
| Provider | Interpreters | Coordinators | Document Languages |
|---|---|---|---|
| Bumrungrad Bangkok | en, ar, ja, zh | en, ar, ja | en, ar, ja, zh, de |
| Apollo Chennai | en, hi, ar, bn | en, ar | en, hi, ar, ta |
| Acibadem Istanbul | en, ar, de, ru | en, ar, de | en, ar, de, ru, fr |
| Samsung Medical Center | en, ja, zh, ru | en, ja, zh | en, ja, zh, ko |
| Cleveland Clinic Abu Dhabi | en, ar, hi, ur | en, ar | en, ar, hi, fr |
| Hospital Galenia Cancun | en, es | en, es | en, es, fr |
Data Structure¶
language_services = {
"interpreter_languages": ["en", "ar", "ja", "zh"],
"coordinator_languages": ["en", "ar", "ja"],
"document_translation_languages": ["en", "ar", "ja", "zh", "de"],
}
# Stored as JSONB on the providers table
await db.execute(
"""
UPDATE providers SET language_services = :services
WHERE id = :provider_id
""",
{"services": json.dumps(language_services), "provider_id": provider.id},
)
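Once the JSONB structure is in place, matching a patient's language against a provider is a membership test on the relevant list. The sketch below assumes the key naming shown above (`<service_type>_languages`). Both helper names are hypothetical, and the two sample providers use the interpreter and document languages from the table.

```python
from typing import Dict, List


def supports_language(language_services: dict, language: str, service_type: str) -> bool:
    """Check whether a language is offered for one service type."""
    key = f"{service_type}_languages"
    return language in language_services.get(key, [])


def providers_for_language(
    providers: Dict[str, dict], language: str, service_type: str = "interpreter"
) -> List[str]:
    """Return the names of providers offering the language, sorted alphabetically."""
    return sorted(
        name
        for name, services in providers.items()
        if supports_language(services, language, service_type)
    )


sample = {
    "Bumrungrad Bangkok": {
        "interpreter_languages": ["en", "ar", "ja", "zh"],
        "document_translation_languages": ["en", "ar", "ja", "zh", "de"],
    },
    "Hospital Galenia Cancun": {
        "interpreter_languages": ["en", "es"],
        "document_translation_languages": ["en", "es", "fr"],
    },
}

arabic_interpreters = providers_for_language(sample, "ar")
```

Using `.get(key, [])` means providers without language service data simply never match, rather than raising an error.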
Seed Verification¶
After running all seeds, verify the data is consistent across stores.
Verification Checklist¶
# PostgreSQL counts
python -c "
from app.db import get_db
db = get_db()
print('Providers:', db.execute('SELECT COUNT(*) FROM providers').scalar())
print('Procedures:', db.execute('SELECT COUNT(*) FROM procedures').scalar())
print('Links:', db.execute('SELECT COUNT(*) FROM provider_procedures').scalar())
print('Doctors:', db.execute('SELECT COUNT(*) FROM doctors').scalar())
print('Patients:', db.execute('SELECT COUNT(*) FROM patients').scalar())
"
# Neo4j counts
python -c "
from app.graph import get_driver
with get_driver().session() as s:
    for label in ['Provider','Procedure','Doctor','Country','Specialty']:
        count = s.run(f'MATCH (n:{label}) RETURN count(n)').single()[0]
        print(f'{label}: {count}')
"
# Qdrant counts
python -c "
from app.vector import get_client
client = get_client()
for name in ['providers','conditions','requirement_embeddings']:
    info = client.get_collection(name)
    print(f'{name}: {info.points_count} vectors')
"
Expected Counts¶
| Store | Entity | Expected Count |
|---|---|---|
| PostgreSQL | providers | 42 |
| PostgreSQL | procedures | 12 |
| PostgreSQL | provider_procedures | 38 |
| PostgreSQL | doctors | 8 |
| PostgreSQL | patients | 1 (demo) |
| Neo4j | Provider nodes | 42 |
| Neo4j | Procedure nodes | 12 |
| Neo4j | Doctor nodes | 8 |
| Neo4j | Country nodes | 8 |
| Qdrant | providers | 42 vectors |
| Qdrant | conditions | 12 vectors |
| Qdrant | requirement_embeddings | 70 vectors |
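Comparing printed counts against this table by eye is error-prone, so the check could be automated. The sketch below encodes the expected counts and reports any differences. The `store.entity` key format and the `find_mismatches` helper are assumptions for illustration. Collecting the actual counts from each store is left to the commands shown earlier.

```python
from typing import Dict, List

# Expected counts copied from the table above.
EXPECTED_COUNTS = {
    "postgresql.providers": 42,
    "postgresql.procedures": 12,
    "postgresql.provider_procedures": 38,
    "postgresql.doctors": 8,
    "postgresql.patients": 1,
    "neo4j.Provider": 42,
    "neo4j.Procedure": 12,
    "neo4j.Doctor": 8,
    "neo4j.Country": 8,
    "qdrant.providers": 42,
    "qdrant.conditions": 12,
    "qdrant.requirement_embeddings": 70,
}


def find_mismatches(actual: Dict[str, int]) -> List[str]:
    """Return one message per entity whose count differs from the expected value."""
    problems = []
    for key, expected in EXPECTED_COUNTS.items():
        found = actual.get(key)  # None when the count was never collected
        if found != expected:
            problems.append(f"{key}: expected {expected}, found {found}")
    return problems


# Example: every count matches except doctors, as if seed_doctors had not run.
observed = dict(EXPECTED_COUNTS, **{"postgresql.doctors": 0})
report = find_mismatches(observed)
```

An empty list means all stores agree with the table. A non-empty list names the seed step that likely needs to be re-run.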
Rebuilding Data¶
To completely rebuild all data (e.g., after schema changes):
# 1. Clear existing data (destructive!)
python -m app.clear_data # Truncates PostgreSQL tables
python -m app.clear_graph # Deletes all Neo4j nodes
python -m app.clear_embeddings # Drops Qdrant collections
# 2. Re-run full seed sequence
python -m app.seed
python -m app.seed_providers
python -m app.seed_demo
python -m app.seed_graph
python -m app.seed_embeddings
python -m app.seed_doctors
Destructive Operation
The clear scripts permanently delete all data. Only run these in development or staging environments. Production data should be migrated, not re-seeded.
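A guard at the top of each clear script could enforce this rule instead of relying on convention. The following is a minimal sketch under stated assumptions: the `APP_ENV` variable name and the `assert_safe_to_clear` helper are hypothetical, and the actual clear scripts may already implement a different safeguard.

```python
import os
from typing import Mapping, Optional

# Environments where destructive clears are permitted (assumed names).
ALLOWED_ENVS = {"development", "staging"}


def assert_safe_to_clear(env: Optional[Mapping[str, str]] = None) -> str:
    """Raise unless the environment is explicitly marked as non-production.

    Returns the environment name when clearing is allowed.
    """
    env = os.environ if env is None else env
    name = env.get("APP_ENV", "")  # hypothetical variable name
    if name not in ALLOWED_ENVS:
        raise RuntimeError(
            f"Refusing to clear data: APP_ENV={name!r} is not one of {sorted(ALLOWED_ENVS)}"
        )
    return name
```

The guard fails closed: an unset or misspelled variable blocks the clear, so production is protected even when configuration is missing.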