Provider Storefront Data Completeness Audit

Date: 2026-05-13
Issue: #709 (backlog: complete provider storefront data for 42 seeded providers)
Auditor: Claude Code (research-only, no data modified)
Branch: chore/audit-provider-storefront-completeness


Methodology

All 42 providers were sourced from app/seed_providers.py (Phase 1 cutover, Session 91, 2026-05-04). Storefront enrichment was sourced from app/seed_storefront.py and app/seed_storefront_enrichment.py. The Provider model in app/models/provider.py was used as the canonical field reference. Per-field population counts were computed by inspecting the seed dictionaries (NULL / empty list / empty dict = missing). Completeness scores were recalculated using the scoring function defined in app/seed_storefront.py (compute_completeness). No live database was queried.
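The counting rule above can be sketched as follows. This is a minimal illustration rather than the script used for the audit: the helper names and the sample records are hypothetical, and only the missing-value rule (NULL, empty list, empty dict) comes from the methodology.

```python
def is_missing(value):
    """Audit rule: NULL, an empty list, or an empty dict counts as missing."""
    return value is None or value == [] or value == {}


def population_counts(providers, fields):
    """Return {field: number of providers with a non-missing value}."""
    counts = {field: 0 for field in fields}
    for provider in providers:
        for field in fields:
            if not is_missing(provider.get(field)):
                counts[field] += 1
    return counts


# Hypothetical records for illustration; not taken from app/seed_providers.py.
sample = [
    {"name": "Provider A", "tagline": "Care without borders", "travel_info": {}},
    {"name": "Provider B", "tagline": None, "travel_info": {"airport": "XYZ"}},
]
counts = population_counts(sample, ["name", "tagline", "travel_info", "logo_url"])
```

Note that a numeric zero (for example total_reviews = 0) is treated as populated under this rule, since only NULL and empty containers count as missing.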

The ProviderListing frontend (apps/patient-app/src/pages/storefront/ProviderListing.tsx) was inspected to determine which fields affect renderability.


Field-by-Field Completeness Table

Total providers: 42

| Field | Model column | Populated | Missing | % populated | Source |
|---|---|---|---|---|---|
| name | String(255) NOT NULL | 42 | 0 | 100% | seed_providers.py |
| slug | String(100) NOT NULL | 42 | 0 | 100% | seed_providers.py |
| description | Text nullable | 42 | 0 | 100% | seed_providers.py |
| website_url | Text nullable | 42 | 0 | 100% | seed_providers.py |
| country_code | String(3) NOT NULL | 42 | 0 | 100% | seed_providers.py |
| city | String(100) NOT NULL | 42 | 0 | 100% | seed_providers.py |
| latitude | Float nullable | 42 | 0 | 100% | seed_providers.py |
| longitude | Float nullable | 42 | 0 | 100% | seed_providers.py |
| timezone | String(50) NOT NULL | 42 | 0 | 100% | seed_providers.py |
| specialties | FlexibleJSON NOT NULL | 42 | 0 | 100% | seed_providers.py |
| accreditations | FlexibleJSON nullable | 42 | 0 | 100% | seed_providers.py |
| languages_supported | FlexibleJSON NOT NULL | 42 | 0 | 100% | seed_providers.py |
| bed_count | Integer nullable | 42 | 0 | 100% | seed_providers.py |
| annual_international_patients | Integer nullable | 42 | 0 | 100% | seed_providers.py |
| cost_index | Float NOT NULL default=1.0 | 42 | 0 | 100% | seed_providers.py |
| procedure_costs | FlexibleJSON nullable | 42 | 0 | 100% | seed_providers.py |
| outcome_score | Float nullable | 42 | 0 | 100% | seed_providers.py |
| patient_satisfaction | Float nullable | 42 | 0 | 100% | seed_providers.py |
| total_reviews | Integer default=0 | 42 | 0 | 100% | seed_providers.py |
| cultural_accommodations | FlexibleJSON nullable | 42 | 0 | 100% | seed_providers.py |
| dietary_options | FlexibleJSON nullable | 42 | 0 | 100% | seed_providers.py |
| logo_url | Text nullable | 0 | 42 | 0% | not seeded |
| address | Text nullable | 0 | 42 | 0% | not seeded |
| language_services | FlexibleJSON nullable | 0 | 42 | 0% | not seeded |
| hero_image_url | Text nullable | 6 | 36 | 14% | seed_storefront_enrichment.py (6 demo) |
| tagline | String(500) nullable | 6 | 36 | 14% | seed_storefront.py (6 demo) |
| operating_theaters | Integer nullable | 6 | 36 | 14% | seed_storefront.py (6 demo) |
| icu_beds | Integer nullable | 6 | 36 | 14% | seed_storefront.py (6 demo) |
| countries_served | Integer nullable | 6 | 36 | 14% | seed_storefront.py (6 demo) |
| cultural_support | FlexibleJSON nullable | 6 | 36 | 14% | seed_storefront.py (6 demo) |
| travel_info | FlexibleJSON nullable | 6 | 36 | 14% | seed_storefront.py (6 demo) |

Notes on scoring-weight fields:

  • completeness_score and completeness_tier are computed columns; not independent data gaps.
  • is_active and is_verified default correctly from seed; not storefront content gaps.


Top Gaps (Most Missing Data)

Ranked by impact on patient-facing storefront quality:

Gap 1: logo_url — 42/42 missing (0% populated)

No provider has a logo URL. The ProviderCard component in ProviderListing.tsx does not render a logo in the current listing view, but ProviderDetail pages (not audited here) likely reference logo_url; this should be confirmed before the work is prioritized. logo_url is also one of the four Core identity fields in completeness_score. A missing logo weakens brand trust and can make a provider look unverified. Sourcing: scrape from official websites or request files directly from provider contacts.

Gap 2: language_services — 42/42 missing (0% populated)

The language_services field (added Session 26) is a structured dict expected to hold interpreter availability, phone line details, and on-site language coverage. It contributes 5% to completeness_score. None of the 42 providers have it populated. While the listing page does not render it directly, it feeds matching-engine preference scoring (preferences weight 10% in the scoring formula). Missing data degrades match quality for patients who filter by language needs. Sourcing: AI extraction from provider website "International Patients" pages.

Gap 3: address — 42/42 missing (0% populated)

Full street-level address is absent for all providers. latitude/longitude are present (100%), so map pins work, but postal address is needed for travel bookings, insurance submissions, and visa applications. Not currently weighted in completeness_score but referenced in travel coordination flows. Sourcing: Google Maps API lookup using existing lat/lon, or scrape from website.

Gap 4: Storefront cluster (hero_image_url, tagline, operating_theaters, icu_beds, countries_served, cultural_support, travel_info) — 36/42 missing (14% populated each)

Only 6 "demo" providers received full storefront enrichment: apollo-chennai, fortis-gurgaon, bumrungrad-bangkok, medicana-ankara, cleveland-clinic-abudhabi, quironsalud-barcelona. The remaining 36 providers have NULL in all 7 storefront-tier fields. The scoring function caps these 36 providers at a computed completeness_score of ~0.696 (the enhanced-tier ceiling without storefront data), while the 6 demo providers score ~0.912 (premium tier). The result is a two-tier catalog: the non-demo majority sits uniformly one tier (enhanced) below the demo set (premium).

Gap 5: procedure_costs — granularity gap (all 42 populated, but thin)

All 42 providers have procedure_costs populated (100%), but the non-demo 36 have only 2–4 procedure keys with min/max ranges. The ProviderProcedure table with richer per-procedure data (annual volume, success rate, lead surgeon, package includes) only exists for the 6 demo providers via seed_storefront_enrichment.py. This is not a NULL gap but a data-depth gap affecting matching confidence.


Per-Provider Review: Renderability in ProviderListing.tsx

ProviderListing.tsx renders a ProviderCard for each provider. The card requires:

  • Mandatory for non-blank card: name, city, country_code, specialties — all 42 have these.
  • Conditionally rendered (absent = omitted, not crash): accreditations, description, tagline, bed_count, doctor_count, year_established.

Conclusion: None of the 42 providers is unrenderable. The listing page is defensively coded — missing fields produce omitted sections, not JS errors. However, 36 providers will render a degraded card:

| Degradation type | Affected providers | Count |
|---|---|---|
| No hero image (ProviderDetail will be blank) | All except 6 demo | 36 |
| No tagline (falls back to description) | All except 6 demo | 36 |
| No OR/ICU stats in capacity section | All except 6 demo | 36 |
| No travel info (ProviderDetail travel tab empty) | All except 6 demo | 36 |
| No logo (affects ProviderDetail header) | All 42 | 42 |

Lowest-quality providers (thin specialty coverage or low review counts, combined with storefront gaps):

| Provider | Country | total_reviews | specialties | Storefront |
|---|---|---|---|---|
| Prisma Dental | CRI | 3,200 | dental only | no tagline/hero |
| IVI Valencia | ESP | 6,200 | fertility only | no tagline/hero |
| JK Plastic Surgery Center | KOR | 4,500 | cosmetic, dental | no tagline/hero |
| BDMS Chiang Mai | THA | 620 | 4 | no tagline/hero |
| Thumbay University Hospital | ARE | 420 | 4 | no tagline/hero |
| HM Sanchinarro Madrid | ESP | 380 | 3 | no tagline/hero |

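The last two (Thumbay, HM Sanchinarro) have the lowest review counts in this table (420 and 380), which makes them the weakest in match ranking. They do not have the fewest specialties: Prisma Dental and IVI Valencia are single-specialty providers and JK Plastic Surgery Center lists two, so those three are the least informative for patients browsing by specialty.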


Priority 1: Hero images + taglines for 36 non-demo providers (HIGH IMPACT, LOW EFFORT)

Method: AI extraction from provider websites (Claude Haiku / web scrape + prompt).

  • For each of the 36 providers, fetch website_url, extract a hero/banner image URL from the HTML <meta property="og:image"> tag or the first hero <img>, and generate a one-sentence tagline from the About/Overview page copy.
  • Existing pattern: scripts/scrape_provider_images.py already exists; extend it to also extract taglines.
  • Estimate: 2–3 hours engineering + $0.05 LLM cost per provider = ~$1.80 total.
  • Risk: Low. Unsplash placeholder images are acceptable as a fallback (already used for demo providers).
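The og:image step could look roughly like the sketch below. It uses only the Python standard library and does not reproduce scripts/scrape_provider_images.py, whose contents were not inspected; the class and function names are hypothetical, and fetching the page and falling back to a hero <img> are left out.

```python
from html.parser import HTMLParser


class OgImageParser(HTMLParser):
    """Collects the content of the first <meta property="og:image"> tag."""

    def __init__(self):
        super().__init__()
        self.og_image = None

    def handle_starttag(self, tag, attrs):
        # Only the first matching tag is kept.
        if tag != "meta" or self.og_image is not None:
            return
        attr = dict(attrs)
        if attr.get("property") == "og:image" and attr.get("content"):
            self.og_image = attr["content"]


def extract_og_image(html):
    """Return the og:image URL found in the HTML, or None if there is none."""
    parser = OgImageParser()
    parser.feed(html)
    return parser.og_image


# Hypothetical page snippet for illustration.
page = (
    '<html><head><meta property="og:image" '
    'content="https://example.com/hero.jpg"></head><body></body></html>'
)
hero_url = extract_og_image(page)
```

When extract_og_image returns None, the caller would fall back to the first hero <img> or to a placeholder image, as described in the risk note.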

Priority 2: language_services for all 42 providers (MEDIUM IMPACT, MEDIUM EFFORT)

Method: AI extraction from provider "International Patients" / "Languages" pages.

  • Structure: {"phone_interpretation": bool, "on_site_languages": [...], "written_translation": bool, "patient_liaison": bool}.
  • Most providers publish this in their international patient departments.
  • Estimate: 4–6 hours engineering. $0.10 LLM per provider = ~$4.20 total.
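Because the values would come from AI extraction, it may be worth checking each payload against the proposed structure before seeding it. The sketch below is one way to do that; the four keys come from the structure above, while the validator name and the assumption that on_site_languages holds plain strings are illustrative only.

```python
from typing import TypedDict


class LanguageServices(TypedDict):
    phone_interpretation: bool
    on_site_languages: list[str]  # element type is an assumption
    written_translation: bool
    patient_liaison: bool


def validate_language_services(payload):
    """Return a list of problems; an empty list means the payload fits the shape."""
    problems = []
    for key in ("phone_interpretation", "written_translation", "patient_liaison"):
        if not isinstance(payload.get(key), bool):
            problems.append(f"{key} must be a bool")
    languages = payload.get("on_site_languages")
    if not isinstance(languages, list) or not all(
        isinstance(item, str) for item in languages
    ):
        problems.append("on_site_languages must be a list of strings")
    return problems


# Hypothetical extraction result for illustration.
candidate: LanguageServices = {
    "phone_interpretation": True,
    "on_site_languages": ["English", "Arabic"],
    "written_translation": False,
    "patient_liaison": True,
}
issues = validate_language_services(candidate)
```

Rejecting malformed payloads at this point keeps string values such as "yes" from being stored where the matching engine expects a boolean.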

Priority 3: cultural_support + travel_info for 36 non-demo providers (MEDIUM IMPACT, MEDIUM EFFORT)

Method: Expand the seed_storefront.py PROVIDER_STOREFRONT dict to cover all 42 providers using the same schema as the 6 demo providers. Data can be AI-extracted from provider websites.

  • cultural_support fields: prayer_facilities, halal_kitchen, interpreter_languages, visa_assistance, airport_pickup.
  • travel_info fields: nearest airport, km distance, recommended stay days, nearby hotels.
  • Estimate: 1 day engineering. Most fields can be inferred from cultural_accommodations (already populated) and geography.

Priority 4: logo_url for all 42 providers (MEDIUM IMPACT, LOW EFFORT)

Method: Scrape from <link rel="icon"> or <meta property="og:image"> on the page at website_url.

  • Existing script: scripts/scrape_provider_images.py may already attempt this.
  • Fallback: Use Clearbit Logo API (free tier, domain-based lookup).
  • Estimate: 1–2 hours engineering.
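Icon links are often relative, so the scraped href needs to be resolved against the page URL before it is stored in logo_url. A minimal standard-library sketch, with hypothetical names and without the network fetch:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class IconLinkParser(HTMLParser):
    """Collects href values of <link> tags whose rel contains the token 'icon'."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "icon" in rel_tokens and attr.get("href"):
            self.hrefs.append(attr["href"])


def find_icon_url(html, page_url):
    """Return the first icon URL as an absolute URL, or None if none is declared."""
    parser = IconLinkParser()
    parser.feed(html)
    if not parser.hrefs:
        return None
    return urljoin(page_url, parser.hrefs[0])


# Hypothetical page snippet and URL for illustration.
snippet = '<head><link rel="shortcut icon" href="/static/favicon.png"></head>'
logo_url = find_icon_url(snippet, "https://example.com/about/")
```

A favicon is usually small, so a result from this path is best treated as a provisional logo until a larger asset is sourced.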

Priority 5: address for all 42 providers (LOW IMPACT, LOW EFFORT)

Method: Google Maps Geocoding API reverse-lookup from existing latitude/longitude.

  • Cost: about $5 per 1,000 requests at list price, so 42 one-off lookups cost roughly $0.21 even with no free allowance (confirm current pricing and free-usage terms before running).
  • Estimate: 1 hour engineering.
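A reverse lookup amounts to one GET request per provider against the Geocoding API's JSON endpoint with a latlng parameter, then reading formatted_address from the first result. The sketch below builds the request URL and parses a response body; the helper names, the coordinates, and the sample response are hypothetical, and the HTTP call itself is omitted so the example runs without a key.

```python
import json
from urllib.parse import urlencode

GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"


def reverse_geocode_url(latitude, longitude, api_key):
    """Build the reverse-geocoding request URL for one provider."""
    query = urlencode({"latlng": f"{latitude},{longitude}", "key": api_key})
    return f"{GEOCODE_ENDPOINT}?{query}"


def parse_formatted_address(response_body):
    """Return the first formatted_address, or None when the lookup found nothing."""
    data = json.loads(response_body)
    if data.get("status") != "OK" or not data.get("results"):
        return None
    return data["results"][0].get("formatted_address")


# Hypothetical coordinates, key, and response body for illustration.
url = reverse_geocode_url(12.5, 77.25, "YOUR_API_KEY")
sample_body = json.dumps(
    {"status": "OK", "results": [{"formatted_address": "1 Example Road, Example City"}]}
)
address = parse_formatted_address(sample_body)
```

Reverse geocoding returns the nearest known address, which may not be the hospital's official postal address, so results are worth spot-checking against each provider's website.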

Priority 6: operating_theaters, icu_beds, countries_served for 36 providers (LOW IMPACT, MEDIUM EFFORT)

Method: Manual data entry from provider annual reports / JCI inspection reports / official websites.

  • These are not rendered in the listing page (only in the ProviderDetail capacity section).
  • Estimate: 1–2 days of manual research or AI web extraction in total, across all 36 providers.


Effort Estimate Summary

| Priority | Fields | Providers | Engineering | LLM cost | Total est. |
|---|---|---|---|---|---|
| P1 | hero_image_url, tagline | 36 | 2–3 h | ~$2 | 1 day |
| P2 | language_services | 42 | 4–6 h | ~$4 | 1 day |
| P3 | cultural_support, travel_info | 36 | 8 h | ~$4 | 2 days |
| P4 | logo_url | 42 | 1–2 h | $0 | 0.5 day |
| P5 | address | 42 | 1 h | $0 | 0.25 day |
| P6 | operating_theaters, icu_beds, countries_served | 36 | 16 h | ~$4 | 2–3 days |
| Total | 10 fields | 42 providers | ~32–36 h | ~$14 | ~7 days |

Appendix: Demo vs Non-Demo Completeness Score Gap

| Tier | Score range | Count | Providers |
|---|---|---|---|
| premium | 0.85–1.0 | 6 | apollo-chennai, fortis-gurgaon, medicana-ankara, bumrungrad-bangkok, cleveland-clinic-abudhabi, quironsalud-barcelona |
| enhanced | 0.65–0.85 | 36 | All remaining 36 providers |
| standard | 0.40–0.65 | 0 | none |
| basic | <0.40 | 0 | none |

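The 6 premium providers score ~0.912. The 36 enhanced providers uniformly score ~0.696. The gap (~0.216) is explained by the storefront-tier fields that carry scoring weight: operating_theaters and icu_beds (half of Capacity, 5 points), tagline and hero_image_url (all of Storefront, 10 points), and cultural_support and travel_info (two thirds of Cultural & Travel, ~6.7 points), assuming fields within a domain are weighted equally. countries_served does not appear in the scoring weights, and language_services is missing for all 42 providers, including the demo set, so neither contributes to the gap. Closing these gaps for the 36 non-demo providers would move them to premium tier (≥0.85) and bring the platform to uniform data quality; populating language_services and logo_url would then raise all 42 further.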
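The tier boundaries in the table can be expressed as a small mapping function. This is a sketch only: the function name is hypothetical, and treating each lower bound as inclusive is an assumption, since the table does not say which tier a score of exactly 0.85 or 0.65 falls into.

```python
def completeness_tier(score):
    """Map a completeness score in [0, 1] to a tier name (lower bounds inclusive)."""
    if score >= 0.85:
        return "premium"
    if score >= 0.65:
        return "enhanced"
    if score >= 0.40:
        return "standard"
    return "basic"


demo_tier = completeness_tier(0.912)      # score reported for the 6 demo providers
non_demo_tier = completeness_tier(0.696)  # score reported for the other 36
```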


Scoring Weight Reference

From app/seed_storefront.py::compute_completeness:

| Domain | Weight | Key fields |
|---|---|---|
| Core identity | 15% | name, description, logo_url, website_url |
| Location | 10% | country_code, city, latitude, longitude |
| Clinical | 15% | specialties, accreditations, languages_supported |
| Outcomes | 15% | outcome_score, patient_satisfaction, total_reviews |
| Costs | 10% | cost_index, procedure_costs |
| Capacity | 10% | bed_count, annual_international_patients, operating_theaters, icu_beds |
| Storefront | 10% | tagline, hero_image_url |
| Cultural & Travel | 10% | cultural_support, travel_info, cultural_accommodations |
| Language services | 5% | language_services |
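The weights above can be checked against the reported scores with a short sketch. It is not the compute_completeness implementation from app/seed_storefront.py: it assumes each domain's weight is split equally across its key fields, and the function and constant names are hypothetical.

```python
# Domain weights and key fields as listed in the table above.
WEIGHTS = {
    "Core identity": (0.15, ["name", "description", "logo_url", "website_url"]),
    "Location": (0.10, ["country_code", "city", "latitude", "longitude"]),
    "Clinical": (0.15, ["specialties", "accreditations", "languages_supported"]),
    "Outcomes": (0.15, ["outcome_score", "patient_satisfaction", "total_reviews"]),
    "Costs": (0.10, ["cost_index", "procedure_costs"]),
    "Capacity": (0.10, ["bed_count", "annual_international_patients",
                        "operating_theaters", "icu_beds"]),
    "Storefront": (0.10, ["tagline", "hero_image_url"]),
    "Cultural & Travel": (0.10, ["cultural_support", "travel_info",
                                 "cultural_accommodations"]),
    "Language services": (0.05, ["language_services"]),
}

# Weighted fields reported as populated for all 42 providers.
SEEDED_FOR_ALL = [
    "name", "description", "website_url", "country_code", "city", "latitude",
    "longitude", "specialties", "accreditations", "languages_supported",
    "outcome_score", "patient_satisfaction", "total_reviews", "cost_index",
    "procedure_costs", "bed_count", "annual_international_patients",
    "cultural_accommodations",
]
# Weighted fields reported as populated only for the 6 demo providers.
DEMO_ONLY = ["tagline", "hero_image_url", "operating_theaters", "icu_beds",
             "cultural_support", "travel_info"]


def is_populated(value):
    return value is not None and value != [] and value != {}


def completeness_sketch(provider):
    """Sum each domain weight times the fraction of its key fields populated."""
    score = 0.0
    for weight, fields in WEIGHTS.values():
        filled = sum(is_populated(provider.get(field)) for field in fields)
        score += weight * filled / len(fields)
    return score


non_demo = dict.fromkeys(SEEDED_FOR_ALL, "placeholder")
demo = {**non_demo, **dict.fromkeys(DEMO_ONLY, "placeholder")}
non_demo_score = completeness_sketch(non_demo)
demo_score = completeness_sketch(demo)
```

Under these assumptions the sketch yields about 0.696 for a non-demo provider and 0.9125 for a demo provider, in line with the ~0.696 and ~0.912 figures reported in this audit.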

Research only — no seed data, models, or production DB modified. Source files inspected: app/models/provider.py, app/seed_providers.py, app/seed_storefront.py, app/seed_storefront_enrichment.py, apps/patient-app/src/pages/storefront/ProviderListing.tsx