Provider Storefront Data Completeness Audit¶

Date: 2026-05-13
Issue: #709 (backlog: complete provider storefront data for 42 seeded providers)
Auditor: Claude Code (research-only, no data modified)
Branch: chore/audit-provider-storefront-completeness

Methodology¶

All 42 providers were sourced from app/seed_providers.py (Phase 1 cutover, Session 91, 2026-05-04). Storefront enrichment was sourced from app/seed_storefront.py and app/seed_storefront_enrichment.py. The Provider model in app/models/provider.py was used as the canonical field reference. Per-field population counts were computed by inspecting the seed dictionaries (NULL / empty list / empty dict = missing). Completeness scores were recalculated using the scoring function defined in app/seed_storefront.py (compute_completeness). No live database was queried.
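The counting rule described above (NULL, empty list, or empty dict counts as missing) can be sketched as follows. This is a minimal illustration, not the script used for the audit: the helper names `is_missing` and `population_counts` and the two-provider sample are hypothetical, and the real seed dictionaries may be shaped differently.

```python
from collections import Counter


def is_missing(value):
    """Apply the audit's rule: None, [] and {} count as missing.

    Zero and False are deliberately treated as populated, so a
    total_reviews of 0 would not be flagged as a gap.
    """
    return value is None or value == [] or value == {}


def population_counts(providers, fields):
    """Return {field: (populated, missing)} across a list of provider dicts."""
    populated = Counter()
    for provider in providers:
        for field in fields:
            if not is_missing(provider.get(field)):
                populated[field] += 1
    total = len(providers)
    return {field: (populated[field], total - populated[field]) for field in fields}


# Hypothetical two-provider sample, not real seed data.
sample = [
    {"name": "A", "tagline": "Care without borders", "logo_url": None, "travel_info": {}},
    {"name": "B", "tagline": None, "logo_url": None, "travel_info": {"airport": "XYZ"}},
]
report = population_counts(sample, ["name", "tagline", "logo_url", "travel_info"])
for field, (have, lack) in report.items():
    print(f"{field}: {have} populated, {lack} missing")
```

A field absent from a dictionary is handled the same way as an explicit None, because `provider.get(field)` returns None for missing keys.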

The ProviderListing frontend (apps/patient-app/src/pages/storefront/ProviderListing.tsx) was inspected to determine which fields affect renderability.

Field-by-Field Completeness Table¶

Total providers: 42

Field	Model Column	Populated	Missing	% Pop	Source
`name`	`String(255) NOT NULL`	42	0	100%	seed_providers.py
`slug`	`String(100) NOT NULL`	42	0	100%	seed_providers.py
`description`	`Text nullable`	42	0	100%	seed_providers.py
`website_url`	`Text nullable`	42	0	100%	seed_providers.py
`country_code`	`String(3) NOT NULL`	42	0	100%	seed_providers.py
`city`	`String(100) NOT NULL`	42	0	100%	seed_providers.py
`latitude`	`Float nullable`	42	0	100%	seed_providers.py
`longitude`	`Float nullable`	42	0	100%	seed_providers.py
`timezone`	`String(50) NOT NULL`	42	0	100%	seed_providers.py
`specialties`	`FlexibleJSON NOT NULL`	42	0	100%	seed_providers.py
`accreditations`	`FlexibleJSON nullable`	42	0	100%	seed_providers.py
`languages_supported`	`FlexibleJSON NOT NULL`	42	0	100%	seed_providers.py
`bed_count`	`Integer nullable`	42	0	100%	seed_providers.py
`annual_international_patients`	`Integer nullable`	42	0	100%	seed_providers.py
`cost_index`	`Float NOT NULL default=1.0`	42	0	100%	seed_providers.py
`procedure_costs`	`FlexibleJSON nullable`	42	0	100%	seed_providers.py
`outcome_score`	`Float nullable`	42	0	100%	seed_providers.py
`patient_satisfaction`	`Float nullable`	42	0	100%	seed_providers.py
`total_reviews`	`Integer default=0`	42	0	100%	seed_providers.py
`cultural_accommodations`	`FlexibleJSON nullable`	42	0	100%	seed_providers.py
`dietary_options`	`FlexibleJSON nullable`	42	0	100%	seed_providers.py
`logo_url`	`Text nullable`	0	42	0%	not seeded
`address`	`Text nullable`	0	42	0%	not seeded
`language_services`	`FlexibleJSON nullable`	0	42	0%	not seeded
`hero_image_url`	`Text nullable`	6	36	14%	seed_storefront_enrichment.py (6 demo)
`tagline`	`String(500) nullable`	6	36	14%	seed_storefront.py (6 demo)
`operating_theaters`	`Integer nullable`	6	36	14%	seed_storefront.py (6 demo)
`icu_beds`	`Integer nullable`	6	36	14%	seed_storefront.py (6 demo)
`countries_served`	`Integer nullable`	6	36	14%	seed_storefront.py (6 demo)
`cultural_support`	`FlexibleJSON nullable`	6	36	14%	seed_storefront.py (6 demo)
`travel_info`	`FlexibleJSON nullable`	6	36	14%	seed_storefront.py (6 demo)

Notes on scoring-weight fields:

- completeness_score and completeness_tier are computed columns; they are not independent data gaps.
- is_active and is_verified default correctly from seed; they are not storefront content gaps.

Top Gaps (Most Missing Data)¶

Ranked by impact on patient-facing storefront quality:

Gap 1: `logo_url` — 42/42 missing (0% populated)¶

No provider has a logo URL. The ProviderCard component in ProviderListing.tsx does not explicitly render a logo in the current listing view, but ProviderDetail pages (not audited here) likely reference logo_url. The field is also one of the four Core identity fields in the scoring weights (15% domain weight), so its absence lowers every provider's completeness_score. A blank logo degrades brand trust and makes providers look unvalidated. Sourcing: scrape from official websites or request files directly from provider contacts.

Gap 2: `language_services` — 42/42 missing (0% populated)¶

The language_services field (added Session 26) is a structured dict expected to hold interpreter availability, phone line details, and on-site language coverage. It contributes 5% to completeness_score. None of the 42 providers have it populated. While the listing page does not render it directly, it feeds matching-engine preference scoring (preferences weight 10% in the scoring formula). Missing data degrades match quality for patients who filter by language needs. Sourcing: AI extraction from provider website "International Patients" pages.

Gap 3: `address` — 42/42 missing (0% populated)¶

Full street-level address is absent for all providers. latitude/longitude are present (100%), so map pins work, but postal address is needed for travel bookings, insurance submissions, and visa applications. Not currently weighted in completeness_score but referenced in travel coordination flows. Sourcing: Google Maps API lookup using existing lat/lon, or scrape from website.

Gap 4: Storefront cluster (`hero_image_url`, `tagline`, `operating_theaters`, `icu_beds`, `countries_served`, `cultural_support`, `travel_info`) — 36/42 missing (14% populated each)¶

Only 6 "demo" providers received full storefront enrichment: apollo-chennai, fortis-gurgaon, bumrungrad-bangkok, medicana-ankara, cleveland-clinic-abudhabi, quironsalud-barcelona. The remaining 36 providers have NULL in all 7 storefront-tier fields. The scoring function shows these 36 providers cap at a computed completeness_score of ~0.696 (within the enhanced tier, 0.65–0.85), while the 6 demo providers score ~0.912 (premium tier). The catalog is therefore split uniformly into two tiers, with the non-demo majority sitting one tier below the demo providers.

Gap 5: `procedure_costs` — granularity gap (all 42 populated, but thin)¶

All 42 providers have procedure_costs populated (100%), but the non-demo 36 have only 2–4 procedure keys with min/max ranges. The ProviderProcedure table with richer per-procedure data (annual volume, success rate, lead surgeon, package includes) only exists for the 6 demo providers via seed_storefront_enrichment.py. This is not a NULL gap but a data-depth gap affecting matching confidence.

Per-Provider Review: Renderability in ProviderListing.tsx¶

ProviderListing.tsx renders a ProviderCard for each provider. The card requires:

Mandatory for non-blank card: name, city, country_code, specialties — all 42 have these.
Conditionally rendered (absent = omitted, not crash): accreditations, description, tagline, bed_count, doctor_count, year_established.

Conclusion: None of the 42 providers is unrenderable. The listing page is defensively coded — missing fields produce omitted sections, not JS errors. However, 36 providers will render a degraded card:

Degradation Type	Affected Providers	Count
No hero image (ProviderDetail will be blank)	All except 6 demo	36
No tagline (falls back to `description`)	All except 6 demo	36
No OR/ICU stats in capacity section	All except 6 demo	36
No travel info (ProviderDetail travel tab empty)	All except 6 demo	36
No logo (affects ProviderDetail header)	All 42	42

Lowest-quality providers (thin specialties or low review counts, combined with storefront gaps):

Provider	Country	total_reviews	specialties	Storefront
Prisma Dental	CRI	3,200	dental only	no tagline/hero
IVI Valencia	ESP	6,200	fertility only	no tagline/hero
JK Plastic Surgery Center	KOR	4,500	cosmetic, dental	no tagline/hero
BDMS Chiang Mai	THA	620	4	no tagline/hero
Thumbay University Hospital	ARE	420	4	no tagline/hero
HM Sanchinarro Madrid	ESP	380	3	no tagline/hero

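The last two (Thumbay, HM Sanchinarro) have the lowest review counts (420 and 380), making them the weakest in match ranking and the least informative to patients. They do not have the fewest specialties, however: Prisma Dental and IVI Valencia each list a single specialty.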

Recommended Remediation Plan¶

Priority 1: Hero images + taglines for 36 non-demo providers (HIGH IMPACT, LOW EFFORT)¶

Method: AI extraction from provider websites (Claude Haiku, or web scrape plus prompt).

- For each of the 36 providers, fetch website_url, extract a hero/banner image URL from the page's <meta property="og:image"> tag or its first hero <img>, and generate a one-sentence tagline from the About/Overview page copy.
- Existing pattern: scripts/scrape_provider_images.py already exists; extend it to also extract taglines.
- Estimate: 2–3 hours engineering + $0.05 LLM cost per provider × 36 providers = ~$1.80 total.
- Risk: Low. Unsplash placeholder images are acceptable as a fallback (already used for demo providers).
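The Open Graph lookup in this step can be sketched with the standard library alone. This is an illustration under assumptions, not the contents of scripts/scrape_provider_images.py: the class and function names are hypothetical, the sample HTML is made up, and fetching the page is left out so the example runs offline.

```python
from html.parser import HTMLParser


class OgImageParser(HTMLParser):
    """Collect the content of the first <meta property="og:image"> tag."""

    def __init__(self):
        super().__init__()
        self.og_image = None

    def handle_starttag(self, tag, attrs):
        # Keep only the first match; later og:image tags are ignored.
        if tag != "meta" or self.og_image is not None:
            return
        attr_map = dict(attrs)
        if attr_map.get("property") == "og:image":
            self.og_image = attr_map.get("content")


def extract_og_image(html):
    """Return the og:image URL found in an HTML string, or None."""
    parser = OgImageParser()
    parser.feed(html)
    return parser.og_image


# Made-up page fragment for illustration only.
sample_html = """
<html><head>
  <meta property="og:title" content="Example Hospital">
  <meta property="og:image" content="https://example.com/hero.jpg" />
</head><body></body></html>
"""
print(extract_og_image(sample_html))
```

When the function returns None, the caller would fall back to the first hero <img> or to a placeholder image, as the risk note above allows.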

Priority 2: `language_services` for all 42 providers (MEDIUM IMPACT, MEDIUM EFFORT)¶

Method: AI extraction from provider "International Patients" / "Languages" pages.

- Structure: {"phone_interpretation": bool, "on_site_languages": [...], "written_translation": bool, "patient_liaison": bool}.
- Most providers publish this in their international patient departments.
- Estimate: 4–6 hours engineering. $0.10 LLM per provider × 42 providers = ~$4.20 total.
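Because the values would come from LLM extraction, it may be worth checking each payload against the proposed structure before it is written into seed data. The sketch below assumes the four keys listed above are the complete schema and that on_site_languages holds strings; the function name `validate_language_services` is hypothetical.

```python
# Expected key -> type mapping, taken from the structure proposed above.
EXPECTED_TYPES = {
    "phone_interpretation": bool,
    "on_site_languages": list,
    "written_translation": bool,
    "patient_liaison": bool,
}


def validate_language_services(payload):
    """Return a list of problems; an empty list means the payload fits."""
    errors = []
    for key, expected_type in EXPECTED_TYPES.items():
        if key not in payload:
            errors.append(f"missing key: {key}")
        elif not isinstance(payload[key], expected_type):
            errors.append(f"{key}: expected {expected_type.__name__}")
    # Assumption: every language entry is a string such as "en" or "Arabic".
    languages = payload.get("on_site_languages")
    if isinstance(languages, list) and not all(isinstance(x, str) for x in languages):
        errors.append("on_site_languages: entries must be strings")
    return errors


# Illustrative payloads, not extracted from any real provider site.
good = {
    "phone_interpretation": True,
    "on_site_languages": ["en", "ar"],
    "written_translation": False,
    "patient_liaison": True,
}
bad = {"phone_interpretation": "yes", "on_site_languages": ["en", 5]}

print(validate_language_services(good))
print(validate_language_services(bad))
```

Rejecting string values such as "yes" for boolean keys is a deliberate choice here: it forces the extraction prompt to emit real booleans rather than relying on later coercion.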

Priority 3: `cultural_support` + `travel_info` for 36 non-demo providers (MEDIUM IMPACT, MEDIUM EFFORT)¶

Method: Expand the PROVIDER_STOREFRONT dict in seed_storefront.py to cover all 42 providers using the same schema as the 6 demo providers. Data can be AI-extracted from provider websites.

- cultural_support fields: prayer_facilities, halal_kitchen, interpreter_languages, visa_assistance, airport_pickup.
- travel_info fields: nearest airport, km distance, recommended stay days, nearby hotels.
- Estimate: 1 day engineering. Most fields can be inferred from cultural_accommodations (already populated) and geography.

Priority 4: `logo_url` for all 42 providers (MEDIUM IMPACT, LOW EFFORT)¶

Method: Scrape from <link rel="icon"> or <meta property="og:image"> on the page at website_url.

- Existing script: scripts/scrape_provider_images.py may already attempt this.
- Fallback: Use a domain-based logo lookup such as the Clearbit Logo API (free tier); confirm the service is still available before depending on it.
- Estimate: 1–2 hours engineering.

Priority 5: `address` for all 42 providers (LOW IMPACT, LOW EFFORT)¶

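Method: Google Maps Geocoding API reverse lookup from existing latitude/longitude.

- Cost: $5 per 1,000 requests at list price. The free tier has historically been a $200 monthly credit rather than a cap of 200 requests/month; confirm current pricing before running.
- 42 providers = 42 requests, well within the free tier.
- Estimate: 1 hour engineering.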
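A reverse lookup against the Geocoding API is a single GET request with a `latlng` parameter, and the postal address is read from `formatted_address` on the first result. The sketch below builds the request URL and parses a response without making a network call; the helper names, the API key placeholder, the coordinates, and the sample response are all illustrative.

```python
from urllib.parse import urlencode

GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"


def reverse_geocode_url(lat, lng, api_key):
    """Build the reverse-geocoding request URL for one coordinate pair."""
    query = urlencode({"latlng": f"{lat},{lng}", "key": api_key})
    return f"{GEOCODE_ENDPOINT}?{query}"


def first_formatted_address(response):
    """Pull the top formatted_address out of a decoded JSON response.

    Returns None when the status is not OK or no results came back, so the
    caller can leave the address field NULL instead of storing a bad value.
    """
    if response.get("status") != "OK":
        return None
    results = response.get("results") or []
    return results[0].get("formatted_address") if results else None


# Illustrative coordinates and a trimmed, made-up response body.
url = reverse_geocode_url(13.7466, 100.5522, "YOUR_API_KEY")
sample_response = {
    "status": "OK",
    "results": [{"formatted_address": "1 Example Road, Bangkok 10110, Thailand"}],
}
print(url)
print(first_formatted_address(sample_response))
```

Reverse geocoding returns the nearest addressable location, which can differ from a hospital's official postal address, so a spot check against the provider's website is advisable.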

Priority 6: `operating_theaters`, `icu_beds`, `countries_served` for 36 providers (LOW IMPACT, MEDIUM EFFORT)¶

Method: Manual data entry from provider annual reports / JCI inspection reports / official websites.

- These are not rendered in the listing page (only in the ProviderDetail capacity section).
- Estimate: 1–2 days in total, via manual research or AI web extraction run per provider.

Effort Estimate Summary¶

Priority	Fields	Providers	Engineering	LLM Cost	Total Est.
P1	`hero_image_url`, `tagline`	36	2–3 h	~$2	1 day
P2	`language_services`	42	4–6 h	~$4	1 day
P3	`cultural_support`, `travel_info`	36	8 h	~$4	2 days
P4	`logo_url`	42	1–2 h	$0	0.5 day
P5	`address`	42	1 h	$0	0.25 day
P6	`operating_theaters`, `icu_beds`, `countries_served`	36	16 h	~$4	2–3 days
Total	10 fields	42 providers	~32–36 h	~$14	~7 days

Appendix: Demo vs Non-Demo Completeness Score Gap¶

Tier	Score Range	Count	Providers
`premium`	0.85–1.0	6	apollo-chennai, fortis-gurgaon, medicana-ankara, bumrungrad-bangkok, cleveland-clinic-abudhabi, quironsalud-barcelona
`enhanced`	0.65–0.85	36	All remaining 36 providers
`standard`	0.40–0.65	0	—
`basic`	<0.40	0	—

The 6 premium providers score ~0.912. The 36 enhanced providers uniformly score ~0.696. The gap (0.216) is explained by the 7 storefront fields missing from the non-demo providers (tagline, hero_image_url, operating_theaters, icu_beds, countries_served, cultural_support, travel_info). language_services is missing for all 42 providers, so it lowers both groups equally and does not contribute to the gap. Closing the storefront gaps for the 36 non-demo providers would move them to premium tier and bring the platform to uniform data quality.
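The tier boundaries in the table above can be expressed as a simple threshold function. This is a sketch only: the table's ranges share their endpoints, so treating each lower bound as inclusive is an assumption, and the function name `completeness_tier` is hypothetical rather than taken from the codebase.

```python
def completeness_tier(score):
    """Map a completeness score in [0, 1] to a tier name.

    Assumption: a score exactly on a boundary (for example 0.85)
    belongs to the higher tier.
    """
    if score >= 0.85:
        return "premium"
    if score >= 0.65:
        return "enhanced"
    if score >= 0.40:
        return "standard"
    return "basic"


# The two scores reported in this appendix.
for score in (0.912, 0.696):
    print(score, completeness_tier(score))
```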

Scoring Weight Reference¶

From app/seed_storefront.py::compute_completeness:

Domain	Weight	Key Fields
Core identity	15%	name, description, logo_url, website_url
Location	10%	country_code, city, latitude, longitude
Clinical	15%	specialties, accreditations, languages_supported
Outcomes	15%	outcome_score, patient_satisfaction, total_reviews
Costs	10%	cost_index, procedure_costs
Capacity	10%	bed_count, annual_international_patients, operating_theaters, icu_beds
Storefront	10%	tagline, hero_image_url
Cultural & Travel	10%	cultural_support, travel_info, cultural_accommodations
Language services	5%	language_services
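The weights above are enough to reproduce both reported scores, under one assumption that is not confirmed by the source: each domain's weight is split evenly across its listed key fields. The sketch below is not compute_completeness itself; the names `WEIGHTS` and `estimate_score` are hypothetical. Note that countries_served does not appear in the weight table, so it does not affect this estimate.

```python
# Domain -> (weight, key fields), copied from the table above.
WEIGHTS = {
    "core_identity": (0.15, ["name", "description", "logo_url", "website_url"]),
    "location": (0.10, ["country_code", "city", "latitude", "longitude"]),
    "clinical": (0.15, ["specialties", "accreditations", "languages_supported"]),
    "outcomes": (0.15, ["outcome_score", "patient_satisfaction", "total_reviews"]),
    "costs": (0.10, ["cost_index", "procedure_costs"]),
    "capacity": (0.10, ["bed_count", "annual_international_patients",
                        "operating_theaters", "icu_beds"]),
    "storefront": (0.10, ["tagline", "hero_image_url"]),
    "cultural_travel": (0.10, ["cultural_support", "travel_info",
                               "cultural_accommodations"]),
    "language_services": (0.05, ["language_services"]),
}


def estimate_score(populated_fields):
    """Sum each domain's weight times the fraction of its fields populated."""
    score = 0.0
    for weight, fields in WEIGHTS.values():
        present = sum(1 for field in fields if field in populated_fields)
        score += weight * present / len(fields)
    return score


all_fields = {field for _, fields in WEIGHTS.values() for field in fields}
missing_for_all_42 = {"logo_url", "language_services"}
missing_for_non_demo = {"tagline", "hero_image_url", "operating_theaters",
                        "icu_beds", "cultural_support", "travel_info"}

demo = all_fields - missing_for_all_42
non_demo = demo - missing_for_non_demo

print(round(estimate_score(demo), 4))      # 0.9125
print(round(estimate_score(non_demo), 4))  # 0.6958
```

Under this assumption the demo providers lose 0.0375 for logo_url and 0.05 for language_services, and the non-demo providers lose a further 0.05 (capacity), 0.10 (storefront), and about 0.067 (cultural and travel), which matches the reported ~0.912 and ~0.696.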

Research only — no seed data, models, or production DB modified. Source files inspected: app/models/provider.py, app/seed_providers.py, app/seed_storefront.py, app/seed_storefront_enrichment.py, apps/patient-app/src/pages/storefront/ProviderListing.tsx
