Provider Storefront Data Completeness Audit
Date: 2026-05-13
Issue: #709 — backlog: complete provider storefront data for 42 seeded providers
Auditor: Claude Code (research-only, no data modified)
Branch: chore/audit-provider-storefront-completeness
Methodology
All 42 providers were sourced from app/seed_providers.py (Phase 1 cutover, Session 91, 2026-05-04). Storefront enrichment was sourced from app/seed_storefront.py and app/seed_storefront_enrichment.py. The Provider model in app/models/provider.py was used as the canonical field reference. Per-field population counts were computed by inspecting the seed dictionaries (NULL / empty list / empty dict = missing). Completeness scores were recalculated using the scoring function defined in app/seed_storefront.py (compute_completeness). No live database was queried.
The ProviderListing frontend (apps/patient-app/src/pages/storefront/ProviderListing.tsx) was inspected to determine which fields affect renderability.
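The counting rule above (NULL, empty list, or empty dict counts as missing) can be sketched as follows. This is a minimal illustration, not the script used for the audit; the function names and the sample records are hypothetical, and the real seed dictionaries are not reproduced here.

```python
from typing import Any


def is_missing(value: Any) -> bool:
    """Apply the audit rule: None, [] and {} count as missing."""
    return value is None or value == [] or value == {}


def population_counts(
    providers: list[dict[str, Any]], fields: list[str]
) -> dict[str, tuple[int, int]]:
    """Return {field: (populated, missing)} across all provider dicts."""
    counts: dict[str, tuple[int, int]] = {}
    for field in fields:
        populated = sum(1 for p in providers if not is_missing(p.get(field)))
        counts[field] = (populated, len(providers) - populated)
    return counts


# Hypothetical sample records (not real seed data).
sample = [
    {"name": "A", "tagline": "Care first", "language_services": {}},
    {"name": "B", "tagline": None, "language_services": None},
    {"name": "C", "accreditations": []},
]
result = population_counts(
    sample, ["name", "tagline", "language_services", "accreditations"]
)
print(result)
```

Note that under this rule a numeric zero (for example total_reviews = 0) still counts as populated, which matches the definition given in the methodology.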
Field-by-Field Completeness Table
Total providers: 42
| Field | Model Column | Populated | Missing | % Pop | Source |
|---|---|---|---|---|---|
| name | String(255) NOT NULL | 42 | 0 | 100% | seed_providers.py |
| slug | String(100) NOT NULL | 42 | 0 | 100% | seed_providers.py |
| description | Text nullable | 42 | 0 | 100% | seed_providers.py |
| website_url | Text nullable | 42 | 0 | 100% | seed_providers.py |
| country_code | String(3) NOT NULL | 42 | 0 | 100% | seed_providers.py |
| city | String(100) NOT NULL | 42 | 0 | 100% | seed_providers.py |
| latitude | Float nullable | 42 | 0 | 100% | seed_providers.py |
| longitude | Float nullable | 42 | 0 | 100% | seed_providers.py |
| timezone | String(50) NOT NULL | 42 | 0 | 100% | seed_providers.py |
| specialties | FlexibleJSON NOT NULL | 42 | 0 | 100% | seed_providers.py |
| accreditations | FlexibleJSON nullable | 42 | 0 | 100% | seed_providers.py |
| languages_supported | FlexibleJSON NOT NULL | 42 | 0 | 100% | seed_providers.py |
| bed_count | Integer nullable | 42 | 0 | 100% | seed_providers.py |
| annual_international_patients | Integer nullable | 42 | 0 | 100% | seed_providers.py |
| cost_index | Float NOT NULL default=1.0 | 42 | 0 | 100% | seed_providers.py |
| procedure_costs | FlexibleJSON nullable | 42 | 0 | 100% | seed_providers.py |
| outcome_score | Float nullable | 42 | 0 | 100% | seed_providers.py |
| patient_satisfaction | Float nullable | 42 | 0 | 100% | seed_providers.py |
| total_reviews | Integer default=0 | 42 | 0 | 100% | seed_providers.py |
| cultural_accommodations | FlexibleJSON nullable | 42 | 0 | 100% | seed_providers.py |
| dietary_options | FlexibleJSON nullable | 42 | 0 | 100% | seed_providers.py |
| logo_url | Text nullable | 0 | 42 | 0% | not seeded |
| address | Text nullable | 0 | 42 | 0% | not seeded |
| language_services | FlexibleJSON nullable | 0 | 42 | 0% | not seeded |
| hero_image_url | Text nullable | 6 | 36 | 14% | seed_storefront_enrichment.py (6 demo) |
| tagline | String(500) nullable | 6 | 36 | 14% | seed_storefront.py (6 demo) |
| operating_theaters | Integer nullable | 6 | 36 | 14% | seed_storefront.py (6 demo) |
| icu_beds | Integer nullable | 6 | 36 | 14% | seed_storefront.py (6 demo) |
| countries_served | Integer nullable | 6 | 36 | 14% | seed_storefront.py (6 demo) |
| cultural_support | FlexibleJSON nullable | 6 | 36 | 14% | seed_storefront.py (6 demo) |
| travel_info | FlexibleJSON nullable | 6 | 36 | 14% | seed_storefront.py (6 demo) |
Notes on scoring-weight fields:
- completeness_score and completeness_tier are computed columns; not independent data gaps.
- is_active and is_verified default correctly from seed; not storefront content gaps.
Top Gaps (Most Missing Data)
Ranked by impact on patient-facing storefront quality:
Gap 1: logo_url (42/42 missing, 0% populated)
No provider has a logo URL. The ProviderCard component in ProviderListing.tsx does not render a logo in the current listing view, but ProviderDetail pages (not audited here) likely reference logo_url. A missing logo weakens brand trust and can make a provider look unverified. logo_url also counts toward the Core identity domain (15%) of completeness_score. Sourcing: scrape from official websites or request files directly from provider contacts.
Gap 2: language_services (42/42 missing, 0% populated)
The language_services field (added Session 26) is a structured dict expected to hold interpreter availability, phone line details, and on-site language coverage. It contributes 5% to completeness_score. None of the 42 providers have it populated. While the listing page does not render it directly, it feeds matching-engine preference scoring (preferences weight 10% in the scoring formula). Missing data degrades match quality for patients who filter by language needs. Sourcing: AI extraction from provider website "International Patients" pages.
Gap 3: address (42/42 missing, 0% populated)
Full street-level address is absent for all providers. latitude/longitude are present (100%), so map pins work, but postal address is needed for travel bookings, insurance submissions, and visa applications. Not currently weighted in completeness_score but referenced in travel coordination flows. Sourcing: Google Maps API lookup using existing lat/lon, or scrape from website.
Gap 4: Storefront cluster (hero_image_url, tagline, operating_theaters, icu_beds, countries_served, cultural_support, travel_info): 36/42 missing (14% populated each)
Only 6 "demo" providers received full storefront enrichment: apollo-chennai, fortis-gurgaon, bumrungrad-bangkok, medicana-ankara, cleveland-clinic-abudhabi, quironsalud-barcelona. The remaining 36 providers have NULL in all 7 storefront-tier fields. Under the scoring function, these 36 providers top out at a computed completeness_score of ~0.696 (the ceiling without storefront data, which falls in the enhanced tier), while the 6 demo providers score ~0.912 (premium tier). The result is a uniform split into two tiers: 6 premium providers and 36 enhanced providers, with no variation inside either group.
Gap 5: procedure_costs (granularity gap: all 42 populated, but thin)
All 42 providers have procedure_costs populated (100%), but the non-demo 36 have only 2–4 procedure keys with min/max ranges. The ProviderProcedure table with richer per-procedure data (annual volume, success rate, lead surgeon, package includes) only exists for the 6 demo providers via seed_storefront_enrichment.py. This is not a NULL gap but a data-depth gap affecting matching confidence.
Per-Provider Review: Renderability in ProviderListing.tsx
ProviderListing.tsx renders a ProviderCard for each provider. The card requires:
- Mandatory for non-blank card: name, city, country_code, specialties (all 42 have these).
- Conditionally rendered (absent = omitted, not crash): accreditations, description, tagline, bed_count, doctor_count, year_established.
Conclusion: None of the 42 providers is unrenderable. The listing page is defensively coded — missing fields produce omitted sections, not JS errors. However, 36 providers will render a degraded card:
| Degradation Type | Affected Providers | Count |
|---|---|---|
| No hero image (ProviderDetail will be blank) | All except 6 demo | 36 |
| No tagline (falls back to description) | All except 6 demo | 36 |
| No OR/ICU stats in capacity section | All except 6 demo | 36 |
| No travel info (ProviderDetail travel tab empty) | All except 6 demo | 36 |
| No logo (affects ProviderDetail header) | All 42 | 42 |
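The degradation table can be reproduced per provider with a small check like the one below. It is a sketch under assumptions: the rule list mirrors the table rows, the helper names are hypothetical, and a group such as OR/ICU is treated as degraded only when every field in it is missing.

```python
from typing import Any

# Each rule: (fields that must all be missing, label from the table above).
DEGRADATION_RULES: list[tuple[tuple[str, ...], str]] = [
    (("hero_image_url",), "no hero image"),
    (("tagline",), "no tagline (falls back to description)"),
    (("operating_theaters", "icu_beds"), "no OR/ICU stats"),
    (("travel_info",), "no travel info"),
    (("logo_url",), "no logo"),
]


def _missing(provider: dict[str, Any], field: str) -> bool:
    """None, empty list, and empty dict count as missing."""
    value = provider.get(field)
    return value is None or value == [] or value == {}


def card_degradations(provider: dict[str, Any]) -> list[str]:
    """List the degradation labels that apply to one provider record."""
    return [
        label
        for fields, label in DEGRADATION_RULES
        if all(_missing(provider, f) for f in fields)
    ]


# Hypothetical records: one shaped like a demo provider, one like a non-demo provider.
demo_like = {
    "hero_image_url": "https://example.org/hero.jpg",
    "tagline": "Example tagline",
    "operating_theaters": 12,
    "icu_beds": 40,
    "travel_info": {"nearest_airport": "XXX"},
}
non_demo_like = {"name": "Example Hospital"}

print(card_degradations(demo_like))      # only the logo gap applies
print(len(card_degradations(non_demo_like)))
```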
Lowest-quality providers (combination of thin specialties, minimal reviews, and storefront gaps):
| Provider | Country | total_reviews | specialties | Storefront |
|---|---|---|---|---|
| Prisma Dental | CRI | 3,200 | dental only | no tagline/hero |
| IVI Valencia | ESP | 6,200 | fertility only | no tagline/hero |
| JK Plastic Surgery Center | KOR | 4,500 | cosmetic, dental | no tagline/hero |
| BDMS Chiang Mai | THA | 620 | 4 | no tagline/hero |
| Thumbay University Hospital | ARE | 420 | 4 | no tagline/hero |
| HM Sanchinarro Madrid | ESP | 380 | 3 | no tagline/hero |
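The last two (Thumbay, HM Sanchinarro) have the lowest review counts (420 and 380), and the first three cover the narrowest range of specialties (one or two each). Combined with the missing storefront fields, this makes these six the weakest in match ranking and the least informative to patients.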
Recommended Remediation Plan
Priority 1: Hero images + taglines for 36 non-demo providers (HIGH IMPACT, LOW EFFORT)
Method: AI extraction from provider websites (Claude Haiku / web scrape + prompt).
- For each of the 36 providers, fetch website_url, extract a hero/banner image URL from the HTML <meta property="og:image"> tag or the first hero <img>, and generate a 1-sentence tagline from the About/Overview page copy.
- Existing pattern: scripts/scrape_provider_images.py already exists — extend it to also extract taglines.
- Estimate: 2–3 hours engineering + $0.05 LLM cost per provider = ~$1.80 total.
- Risk: Low. Unsplash placeholder images acceptable as fallback (already used for demo providers).
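The og:image step in this plan might look like the sketch below. It uses only the standard library and operates on HTML that has already been fetched; it is not taken from scripts/scrape_provider_images.py, whose contents were not inspected, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser
from typing import Optional


class OgImageParser(HTMLParser):
    """Capture the content of the first <meta property="og:image"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.og_image: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.og_image is not None:
            return
        attr = dict(attrs)  # attribute names arrive lowercased
        if attr.get("property") == "og:image" and attr.get("content"):
            self.og_image = attr["content"]


def extract_og_image(html: str) -> Optional[str]:
    """Return the og:image URL, or None if the page does not declare one."""
    parser = OgImageParser()
    parser.feed(html)
    return parser.og_image


# Hypothetical page snippet for illustration.
page = """
<html><head>
  <meta property="og:title" content="Example Hospital">
  <meta property="og:image" content="https://example.org/img/hero.jpg">
</head><body></body></html>
"""
print(extract_og_image(page))
```

When the function returns None, the plan's fallback (first hero <img>, or an Unsplash placeholder) would apply.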
Priority 2: language_services for all 42 providers (MEDIUM IMPACT, MEDIUM EFFORT)
Method: AI extraction from provider "International Patients" / "Languages" pages.
- Structure: {"phone_interpretation": bool, "on_site_languages": [...], "written_translation": bool, "patient_liaison": bool}.
- Most providers publish this on their international patient department pages.
- Estimate: 4–6 hours engineering. $0.10 LLM per provider = ~$4.20 total.
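Because the extraction is LLM-driven, it may help to validate each result against the proposed structure before it is written to the seed file. The sketch below assumes that on_site_languages holds plain strings (the source only shows a list) and uses a hypothetical validator name.

```python
from typing import Any, TypedDict


class LanguageServices(TypedDict):
    phone_interpretation: bool
    on_site_languages: list[str]  # element type is an assumption
    written_translation: bool
    patient_liaison: bool


def validate_language_services(data: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the dict fits the shape."""
    errors: list[str] = []
    for key in ("phone_interpretation", "written_translation", "patient_liaison"):
        if not isinstance(data.get(key), bool):
            errors.append(f"{key} must be a bool")
    langs = data.get("on_site_languages")
    if not isinstance(langs, list) or not all(isinstance(x, str) for x in langs):
        errors.append("on_site_languages must be a list of strings")
    return errors


# Hypothetical extraction results.
good: LanguageServices = {
    "phone_interpretation": True,
    "on_site_languages": ["en", "ar"],
    "written_translation": False,
    "patient_liaison": True,
}
bad = {"phone_interpretation": "yes", "on_site_languages": "en"}

print(validate_language_services(good))
print(validate_language_services(bad))
```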
Priority 3: cultural_support + travel_info for 36 non-demo providers (MEDIUM IMPACT, MEDIUM EFFORT)
Method: Expand seed_storefront.py PROVIDER_STOREFRONT dict to cover all 42 providers using the same schema as the 6 demo providers. Data can be AI-extracted from provider websites.
- cultural_support fields: prayer_facilities, halal_kitchen, interpreter_languages, visa_assistance, airport_pickup.
- travel_info fields: nearest airport, km distance, recommended stay days, nearby hotels.
- Estimate: 1 day engineering. Most fields can be inferred from cultural_accommodations (already populated) and geography.
Priority 4: logo_url for all 42 providers (MEDIUM IMPACT, LOW EFFORT)
Method: Scrape the <link rel="icon"> or <meta property="og:image"> target from the page at website_url.
- Existing script: scripts/scrape_provider_images.py may already attempt this.
- Fallback: Use Clearbit Logo API (free tier, domain-based lookup).
- Estimate: 1–2 hours engineering.
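For the icon route, favicon links are often relative, so the href needs to be resolved against the page URL. A minimal standard-library sketch, with hypothetical names and no claim about what scripts/scrape_provider_images.py currently does:

```python
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urljoin


class IconLinkParser(HTMLParser):
    """Capture the href of the first <link> whose rel includes "icon"."""

    def __init__(self) -> None:
        super().__init__()
        self.href: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.href is not None:
            return
        attr = dict(attrs)
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "icon" in rel_tokens and attr.get("href"):
            self.href = attr["href"]


def extract_icon_url(html: str, base_url: str) -> Optional[str]:
    """Return an absolute icon URL, or None if no icon link is declared."""
    parser = IconLinkParser()
    parser.feed(html)
    return urljoin(base_url, parser.href) if parser.href else None


# Hypothetical page snippet with a relative icon path.
head = '<head><link rel="shortcut icon" href="/static/favicon.ico"></head>'
print(extract_icon_url(head, "https://example.org/about/"))
```

A favicon is usually a small square image, so it may be too low-resolution for a ProviderDetail header; treating it as a fallback rather than the primary logo source seems prudent.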
Priority 5: address for all 42 providers (LOW IMPACT, LOW EFFORT)
Method: Google Maps Geocoding API reverse-lookup from existing latitude/longitude.
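- Cost: $5 per 1,000 requests at list price; check the current free-usage allowance before running, since Google has changed it over time.
- 42 providers = 42 requests, which is about $0.21 at list price even without a free tier.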
- Estimate: 1 hour engineering.
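A reverse lookup against the Geocoding API is a single GET with a latlng parameter, and the street address is read from formatted_address on the first result. The sketch below only builds the request URL and parses a response dict, so it runs without network access or a key; the helper names, the placeholder key, and the sample response are hypothetical.

```python
from typing import Any, Optional
from urllib.parse import urlencode

GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"


def reverse_geocode_url(lat: float, lng: float, api_key: str) -> str:
    """Build the reverse-geocoding request URL for one coordinate pair."""
    query = urlencode({"latlng": f"{lat},{lng}", "key": api_key})
    return f"{GEOCODE_ENDPOINT}?{query}"


def formatted_address(response: dict[str, Any]) -> Optional[str]:
    """Return the first result's formatted_address, or None on any failure."""
    if response.get("status") != "OK" or not response.get("results"):
        return None
    return response["results"][0].get("formatted_address")


# Placeholder key and a hand-written response in the API's documented shape.
url = reverse_geocode_url(13.0, 80.2, "YOUR_API_KEY")
sample_response = {
    "status": "OK",
    "results": [{"formatted_address": "1 Example Road, Example City"}],
}
print(url)
print(formatted_address(sample_response))
```

Reverse geocoding returns the address nearest to the coordinates, which may not be the provider's official postal address, so a spot check against each website is worth the time.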
Priority 6: operating_theaters, icu_beds, countries_served for 36 providers (LOW IMPACT, MEDIUM EFFORT)
Method: Manual data entry from provider annual reports / JCI inspection reports / official websites.
- These are not rendered in the listing page (only in ProviderDetail capacity section).
- Estimate: 1–2 days of manual research or AI web extraction across the 36 providers.
Effort Estimate Summary
| Priority | Fields | Providers | Engineering | LLM Cost | Total Est. |
|---|---|---|---|---|---|
| P1 | hero_image_url, tagline | 36 | 2–3 h | ~$2 | 1 day |
| P2 | language_services | 42 | 4–6 h | ~$4 | 1 day |
| P3 | cultural_support, travel_info | 36 | 8 h | ~$4 | 2 days |
| P4 | logo_url | 42 | 1–2 h | $0 | 0.5 day |
| P5 | address | 42 | 1 h | $0 | 0.25 day |
| P6 | operating_theaters, icu_beds, countries_served | 36 | 16 h | ~$4 | 2–3 days |
| Total | 10 fields | 42 providers | ~40 h | ~$14 | ~7 days |
Appendix: Demo vs Non-Demo Completeness Score Gap
| Tier | Score Range | Count | Providers |
|---|---|---|---|
| premium | 0.85–1.0 | 6 | apollo-chennai, fortis-gurgaon, medicana-ankara, bumrungrad-bangkok, cleveland-clinic-abudhabi, quironsalud-barcelona |
| enhanced | 0.65–0.85 | 36 | All remaining 36 providers |
| standard | 0.40–0.65 | 0 | none |
| basic | <0.40 | 0 | none |
The 6 premium providers score ~0.912. The 36 enhanced providers uniformly score ~0.696. The gap (0.216) is explained by the missing storefront-tier fields (tagline, hero_image_url, operating_theaters, icu_beds, countries_served, cultural_support, travel_info). language_services and logo_url are missing for all 42 providers, so they lower both groups equally and do not contribute to the gap; they are what keeps the premium group below 1.0. Closing the storefront gaps for the 36 non-demo providers would move them to premium tier and bring the platform to uniform data quality.
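The tier bands in the table can be expressed as a simple threshold lookup. This is a sketch rather than the project's code: the function name is hypothetical, and treating each lower bound as inclusive is an assumption, since the table does not say which tier a score of exactly 0.85 or 0.65 belongs to.

```python
def completeness_tier(score: float) -> str:
    """Map a completeness score in [0, 1] to a tier name.

    Lower bounds are treated as inclusive (an assumption).
    """
    if score >= 0.85:
        return "premium"
    if score >= 0.65:
        return "enhanced"
    if score >= 0.40:
        return "standard"
    return "basic"


# The two scores reported in this audit.
print(completeness_tier(0.912))
print(completeness_tier(0.696))
```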
Scoring Weight Reference
From app/seed_storefront.py::compute_completeness:
| Domain | Weight | Key Fields |
|---|---|---|
| Core identity | 15% | name, description, logo_url, website_url |
| Location | 10% | country_code, city, latitude, longitude |
| Clinical | 15% | specialties, accreditations, languages_supported |
| Outcomes | 15% | outcome_score, patient_satisfaction, total_reviews |
| Costs | 10% | cost_index, procedure_costs |
| Capacity | 10% | bed_count, annual_international_patients, operating_theaters, icu_beds |
| Storefront | 10% | tagline, hero_image_url |
| Cultural & Travel | 10% | cultural_support, travel_info, cultural_accommodations |
| Language services | 5% | language_services |
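As a cross-check, the two reported scores can be reproduced from this table under one assumption: each domain's weight is split equally among its listed key fields. compute_completeness itself is not reproduced here and may work differently, so the sketch below is only an approximation that happens to match the audit's figures.

```python
# Domain weights and key fields copied from the table above.
DOMAINS: dict[str, tuple[float, list[str]]] = {
    "core_identity": (0.15, ["name", "description", "logo_url", "website_url"]),
    "location": (0.10, ["country_code", "city", "latitude", "longitude"]),
    "clinical": (0.15, ["specialties", "accreditations", "languages_supported"]),
    "outcomes": (0.15, ["outcome_score", "patient_satisfaction", "total_reviews"]),
    "costs": (0.10, ["cost_index", "procedure_costs"]),
    "capacity": (0.10, ["bed_count", "annual_international_patients",
                        "operating_theaters", "icu_beds"]),
    "storefront": (0.10, ["tagline", "hero_image_url"]),
    "cultural_travel": (0.10, ["cultural_support", "travel_info",
                               "cultural_accommodations"]),
    "language_services": (0.05, ["language_services"]),
}


def approx_completeness(populated: set[str]) -> float:
    """Sum each domain's weight times the fraction of its fields populated."""
    return sum(
        weight * sum(f in populated for f in fields) / len(fields)
        for weight, fields in DOMAINS.values()
    )


all_fields = {f for _, fields in DOMAINS.values() for f in fields}
# Demo providers: everything except the two fields missing for all 42.
demo = all_fields - {"logo_url", "language_services"}
# Non-demo providers: additionally missing the weighted storefront-tier fields.
non_demo = demo - {"tagline", "hero_image_url", "operating_theaters",
                   "icu_beds", "cultural_support", "travel_info"}

demo_score = approx_completeness(demo)
non_demo_score = approx_completeness(non_demo)
print(round(demo_score, 4), round(non_demo_score, 4))
```

Under this assumption the demo group lands at about 0.9125 and the non-demo group at about 0.6958, in line with the ~0.912 and ~0.696 reported. countries_served is not among the listed key fields, so it has no effect in this sketch.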
Research only — no seed data, models, or production DB modified.
Source files inspected: app/models/provider.py, app/seed_providers.py, app/seed_storefront.py, app/seed_storefront_enrichment.py, apps/patient-app/src/pages/storefront/ProviderListing.tsx