Runbook — Matching Engine

Operational guide for the matching engine: how to flip strategies, roll out the registry-driven scorer, and recover from a regression.

Components

  • Postgres providers table — canonical source of provider data (ADR-0026).
  • Neo4j projection — graph view rebuilt from Postgres + reference YAML. Maintained via the doctor-graph sync (#770) and admin rebuild endpoint.
  • config/matching/parameters/<domain>.yaml — 147-parameter registry (Phase 1 / #767). Loaded by app/services/matching/registry.py.
  • app/services/matching_engine.py — strategy implementations (WeightedScoringV1, GraphEnhancedWeightedV1). Registry-driven scorer lands in PR-B.
  • app/services/match_service.py — orchestrates input gathering, strategy selection, and response shaping.

Feature flags

| Flag | Default | Effect |
| --- | --- | --- |
| matching_strategy | weighted_v2_1 | Selects which strategy runs for matching. Values: weighted_v1 (graph-enhanced), weighted_v1_legacy, ml_ranking_v2, hybrid_v3. |
| matching_engine_v2 | false | When true, the match service routes through the registry-driven scorer instead of matching_strategy. Phase 1 of #767. |
| matching_weights_v1 | "" | Optional JSON override of the legacy weight dict. Empty string = code defaults. |
| matching_max_providers | "3" | Cap on results returned per case. |
| matching_shadow_mode | false | Log v2 scores without surfacing them to patients. |
| agent_enhanced_matching | false | LLM rerank on top of weighted output. |
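
The precedence between these flags can be sketched as follows. This is a minimal illustration, assuming the flags arrive as a plain dict; the function names `select_scorer`, `parse_weight_override`, and `result_cap` are hypothetical and are not taken from app/services/match_service.py.

```python
import json
from typing import Any, Mapping, Optional


def select_scorer(flags: Mapping[str, Any]) -> str:
    """Return a label for the scoring path a request would take."""
    # matching_engine_v2 takes precedence over matching_strategy when true.
    if flags.get("matching_engine_v2", False) is True:
        return "registry_v2"
    # Otherwise use the named strategy, falling back to the documented default.
    return str(flags.get("matching_strategy", "weighted_v2_1"))


def parse_weight_override(raw: str) -> Optional[dict]:
    """Empty string means 'use code defaults'; anything else must be a JSON object."""
    if raw.strip() == "":
        return None
    parsed = json.loads(raw)
    if not isinstance(parsed, dict):
        raise ValueError("matching_weights_v1 must be a JSON object")
    return parsed


def result_cap(raw: Any, fallback: int = 3) -> int:
    """matching_max_providers is stored as a string; coerce it defensively."""
    try:
        value = int(raw)
    except (TypeError, ValueError):
        return fallback
    return value if value > 0 else fallback
```

Falling back to 3 on an unparseable cap is an assumption made for the sketch; the real service may prefer to fail loudly.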

Rolling out matching_engine_v2

The new scorer reads only status: active parameters from the registry and emits final_score, match_confidence, and domain_breakdown. Default is OFF so the legacy path remains canonical at merge time — this is the one-flip rollback contract for Phase 1.
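
To make the envelope concrete, here is a minimal sketch of a registry-driven scorer. It is not the PR-B implementation. It assumes each registry entry carries `name`, `domain`, `status`, and a hypothetical `weight` field, that provider values are already normalized to [0, 1], and that `match_confidence` is the share of active weight backed by real data. All of these are illustrative assumptions.

```python
from collections import defaultdict
from typing import Any, Dict, List, Mapping, Optional


def score_provider(
    parameters: List[Mapping[str, Any]],
    normalized: Mapping[str, Optional[float]],
) -> Dict[str, Any]:
    """Score one provider from registry entries and pre-normalized values."""
    # Only status: active parameters take part; seeded ones are ignored.
    active = [p for p in parameters if p.get("status") == "active"]
    total_weight = sum(float(p["weight"]) for p in active)
    if total_weight == 0:
        return {"final_score": 0.0, "match_confidence": 0.0, "domain_breakdown": {}}

    weighted_sum = 0.0
    covered_weight = 0.0
    breakdown: Dict[str, float] = defaultdict(float)
    for p in active:
        value = normalized.get(str(p["name"]))
        if value is None:
            # No default-fill: a missing field contributes nothing.
            continue
        weight = float(p["weight"])
        weighted_sum += weight * value
        covered_weight += weight
        breakdown[str(p["domain"])] += weight * value / total_weight

    return {
        "final_score": weighted_sum / total_weight,
        # Assumption: confidence = share of active weight backed by real data.
        "match_confidence": covered_weight / total_weight,
        "domain_breakdown": dict(breakdown),
    }
```

Under these assumptions a provider with no populated fields scores 0.0 with zero confidence, which is the kind of behavior change relative to legacy default-fill that this runbook describes.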

Rollout sequence (per tenant)

  1. Pre-flight checklist
     • Tenant has at least 5 providers with outcome_score, cost_index, accreditations, and languages_supported populated.
     • Spot-check a known-good case: legacy strategy returns sensible ranking. Capture the top-3 provider IDs as the "before" baseline.
  2. Enable in Flagsmith (identity-override path)
     • Flagsmith → Identities → search for the tenant's identity (typically tenant:<tenant-slug>).
     • Add an override for matching_engine_v2 = true.
     • Use Token <api-key> auth, NOT Api-Key — see scripts/sync_flagsmith.py for the canonical header shape.
  3. Verify
     • Re-run the same case via POST /api/v1/cases/{id}/matches.
     • Confirm response includes match_confidence and domain_breakdown keys (envelope shape change is the v2 signature).
     • Confirm top-3 ranking has not regressed for fully-populated providers; sparse providers should rank LOWER than they did under the legacy default-fill behavior. This is the intended behavior change, not a bug.
  4. Monitor for 24h
     • Watch for elevated MATCH_* error codes in Sentry / Telegram.
     • Watch the match_confidence distribution in Metabase: most tenants should see median ~0.6-0.8 once parameters are seeded. If median sits at <0.3, parameter coverage is too thin and the v2 envelope will visibly underperform legacy on patient UX.
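
The checks in this sequence can be scripted. The sketch below is hedged: the helper names are hypothetical, the response is assumed to expose `match_confidence` and `domain_breakdown` as top-level keys, and the "healthy" / "watch" labels are an illustrative reading of the thresholds above, not an official classification.

```python
from statistics import median
from typing import Any, Iterable, List, Mapping, Sequence, Set

REQUIRED_FIELDS = ("outcome_score", "cost_index", "accreditations", "languages_supported")


def count_ready_providers(providers: Iterable[Mapping[str, Any]]) -> int:
    """Count providers with every pre-flight field populated."""
    def populated(value: Any) -> bool:
        # 0 and 0.0 count as populated; None and empty containers do not.
        return value is not None and value != "" and value != [] and value != {}
    return sum(
        1 for p in providers if all(populated(p.get(f)) for f in REQUIRED_FIELDS)
    )


def missing_v2_keys(response: Mapping[str, Any]) -> List[str]:
    """Return the v2 signature keys absent from a match response."""
    return [k for k in ("match_confidence", "domain_breakdown") if k not in response]


def dropped_from_top3(
    before: Sequence[str], after: Sequence[str], fully_populated: Set[str]
) -> List[str]:
    """Fully-populated providers in the baseline top-3 that left the new top-3."""
    return [pid for pid in before[:3] if pid in fully_populated and pid not in after[:3]]


def confidence_health(confidences: Sequence[float]) -> str:
    """Classify the median match_confidence against the runbook thresholds."""
    if not confidences:
        return "no-data"
    m = median(confidences)
    if m < 0.3:
        return "coverage-too-thin"
    return "healthy" if m >= 0.6 else "watch"
```

A tenant passes pre-flight when `count_ready_providers(...) >= 5`; a non-empty result from `missing_v2_keys` or `dropped_from_top3` after the flip is a reason to investigate before the 24h monitoring window.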

Rollback (one flip)

If matching quality regresses for sparse-catalog tenants, flip the flag back to false per tenant — or globally:

# Flagsmith CLI (or dashboard → Flags → matching_engine_v2)
flagsmith flag update matching_engine_v2 --enabled false

The legacy WeightedScoringV1 / GraphEnhancedWeightedV1 path resumes on the next request (no deploy needed). No data migration is required — the v2 scorer reads the same Postgres / Neo4j stores.

The legacy path is scheduled for deletion in Phase 2 (PG → Neo4j projection worker). Until then it stays in place as the rollback target.

Updating the registry

  1. Edit the relevant config/matching/parameters/<domain>.yaml.
  2. Run python scripts/generate_matching_parameters_reference.py to refresh the human-readable doc.
  3. Run pytest tests/test_matching_registry.py locally.
  4. PR + merge — CI re-validates the registry on every change.
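
For orientation, the kind of check CI might run on a domain file can be sketched as below. This is not the contents of tests/test_matching_registry.py. It assumes the YAML has already been loaded into a list of dicts, that `domain_weight_share` is summed across every entry in a file, and that `source_path` is a bare attribute name; `known_attributes` is a hypothetical stand-in for the provider model's fields.

```python
import math
from typing import Any, Iterable, List, Mapping, Set


def validate_domain(
    entries: Iterable[Mapping[str, Any]],
    known_attributes: Set[str],
) -> List[str]:
    """Return human-readable problems for one domain file's parameter entries."""
    entries = list(entries)
    problems: List[str] = []

    # Shares within a single file are expected to sum to 1.0.
    total = sum(float(e.get("domain_weight_share", 0.0)) for e in entries)
    if not math.isclose(total, 1.0, abs_tol=1e-6):
        problems.append(f"domain_weight_share sum != 1.0 (got {total:.4f})")

    # Every source_path must point at an attribute that still exists.
    for e in entries:
        path = e.get("source_path")
        if path not in known_attributes:
            problems.append(f"source_path references unknown attribute: {path}")
    return problems
```

An empty list means the file passes both checks; running something like this locally before opening the PR shortens the feedback loop.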

A parameter graduates from seeded to active by:

  • Verifying ≥50% provider coverage for the underlying field (Metabase → matching coverage dashboard).
  • Adding a normalizer block to the parameter entry.
  • Bumping the active-count assertion in tests/test_matching_registry.py (currently pinned at 14).
  • Updating docs/architecture/matching-engine.md.
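
The data-dependent parts of the graduation criteria can be checked mechanically before a PR is opened. The following is a rough sketch: `field_coverage` and `graduation_blockers` are hypothetical helpers, and the entry is assumed to be a dict with `status` and an optional `normalizer` key.

```python
from typing import Any, List, Mapping, Sequence

MIN_COVERAGE = 0.5  # the >=50% provider coverage bar


def field_coverage(providers: Sequence[Mapping[str, Any]], field: str) -> float:
    """Fraction of providers with a non-empty value for the given field."""
    if not providers:
        return 0.0
    filled = sum(1 for p in providers if p.get(field) not in (None, "", [], {}))
    return filled / len(providers)


def graduation_blockers(entry: Mapping[str, Any], coverage: float) -> List[str]:
    """List what still blocks a seeded parameter from being marked active."""
    blockers: List[str] = []
    if entry.get("status") != "seeded":
        blockers.append("entry is not in seeded status")
    if coverage < MIN_COVERAGE:
        blockers.append(f"coverage {coverage:.0%} is below {MIN_COVERAGE:.0%}")
    if not entry.get("normalizer"):
        blockers.append("normalizer block is missing")
    return blockers
```

The test-count bump and the architecture doc update remain manual steps; this sketch only covers coverage and the normalizer block.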

Common failures

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| CI: "domain_weight_share sum != 1.0" | YAML edit didn't rebalance other params in the file | Adjust other entries in the same file so the sum returns to 1.0. |
| CI: "source_path references unknown attribute" | Provider field renamed; registry stale | Update source_path to the new ORM attribute name. |
| matching_engine_v2=true but envelope unchanged | Backend cached old flag | Wait FLAGSMITH_CACHE_TTL (default 60s) or restart the backend. |
| Sparse provider scores 0.0 instead of low-but-nonzero | New scorer working as designed; legacy default-fill removed | Expected. Inform the tenant; suggest provider data import. |
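
The stale-flag symptom follows from time-based caching. The sketch below shows the general mechanism only; it is not the backend's actual cache. `TTLFlagCache` is a hypothetical class, and the 60-second default mirrors FLAGSMITH_CACHE_TTL.

```python
import time
from typing import Callable, Dict, Tuple


class TTLFlagCache:
    """Cache flag lookups for a fixed time-to-live."""

    def __init__(
        self,
        fetch: Callable[[str], bool],
        ttl_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, bool]] = {}

    def get(self, flag: str) -> bool:
        now = self._clock()
        cached = self._entries.get(flag)
        # Serve the cached value until it is older than the TTL.
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]
        value = self._fetch(flag)
        self._entries[flag] = (now, value)
        return value

    def clear(self) -> None:
        """Forget everything, similar in effect to a backend restart."""
        self._entries.clear()
```

With a cache like this, a flag flipped in Flagsmith is not observed until the cached entry expires, which is why waiting out the TTL or restarting the backend both resolve the symptom.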

References

  • ADR-0026 — Matching framework architecture
  • #765 — Parent epic
  • #767 — Phase 1 (this runbook)
  • #770 — Phase 0 doctor-graph sync