Runbook: Matching Engine
Operational guide for the matching engine: how to flip strategies, roll out the registry-driven scorer, and recover from a regression.
Components
- Postgres `providers` table: canonical source of provider data (ADR-0026).
- Neo4j projection: graph view rebuilt from Postgres + reference YAML. Maintained via the doctor-graph sync (#770) and admin rebuild endpoint.
- `config/matching/parameters/<domain>.yaml`: 147-parameter registry (Phase 1 / #767). Loaded by `app/services/matching/registry.py`.
- `app/services/matching_engine.py`: strategy implementations (`WeightedScoringV1`, `GraphEnhancedWeightedV1`). Registry-driven scorer lands in PR-B.
- `app/services/match_service.py`: orchestrates input gathering, strategy selection, and response shaping.
Feature flags
| Flag | Default | Effect |
|---|---|---|
| `matching_strategy` | `weighted_v2_1` | Selects which strategy runs for matching. Values: `weighted_v1` (graph-enhanced), `weighted_v1_legacy`, `ml_ranking_v2`, `hybrid_v3`. |
| `matching_engine_v2` | `false` | When true, the match service routes through the registry-driven scorer instead of `matching_strategy`. Phase 1 of #767. |
| `matching_weights_v1` | `""` | Optional JSON override of the legacy weight dict. Empty string = code defaults. |
| `matching_max_providers` | `"3"` | Cap on results returned per case. |
| `matching_shadow_mode` | `false` | Log v2 scores without surfacing them to patients. |
| `agent_enhanced_matching` | `false` | LLM rerank on top of weighted output. |
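The `matching_weights_v1` semantics (empty string means code defaults, otherwise a JSON override) can be sketched roughly as follows. This is a minimal illustration, not the engine's actual code: `DEFAULT_WEIGHTS`, its keys and values, and `resolve_weights` are hypothetical, and both the merge behavior and the fallback on malformed JSON are assumptions.

```python
import json

# Hypothetical code defaults; the real legacy weight dict lives in the strategy code.
DEFAULT_WEIGHTS = {"outcome_score": 0.5, "cost_index": 0.3, "languages_supported": 0.2}


def resolve_weights(flag_value, defaults=DEFAULT_WEIGHTS):
    """Return the weight dict to use for a raw matching_weights_v1 flag value."""
    # Empty string (the flag default) means "use code defaults".
    if not flag_value.strip():
        return dict(defaults)
    try:
        override = json.loads(flag_value)
    except json.JSONDecodeError:
        # Assumption: a malformed override falls back to defaults instead of failing.
        return dict(defaults)
    if not isinstance(override, dict):
        return dict(defaults)
    # Assumption: override keys replace defaults; keys not listed keep their default.
    return {**defaults, **override}
```

For example, `resolve_weights('{"cost_index": 0.4}')` changes only `cost_index` and leaves the other hypothetical defaults untouched.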
Rolling out matching_engine_v2
The new scorer reads only `status: active` parameters from the registry
and emits `final_score`, `match_confidence`, and `domain_breakdown`.
The flag defaults to off so the legacy path remains canonical at merge time;
this is the one-flip rollback contract for Phase 1.
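The routing and filtering rules described here can be sketched as below. This is an illustration under stated assumptions: the function names, the `"registry_scorer"` label, and the list-of-dicts registry shape are hypothetical, and the real logic lives in `app/services/match_service.py` and `app/services/matching/registry.py`.

```python
def active_parameters(registry):
    """Keep only registry entries marked status: active; other entries are ignored."""
    return [p for p in registry if p.get("status") == "active"]


def select_scoring_path(flags):
    """Return which scoring path a request takes, given evaluated flag values."""
    if flags.get("matching_engine_v2", False):
        # Hypothetical label for the registry-driven (v2) scorer.
        return "registry_scorer"
    # Flag off (the default): the legacy path, chosen by matching_strategy.
    return flags["matching_strategy"]
```

Because the v2 branch is gated on a single boolean, turning the flag off is enough to return every request to the strategy named by `matching_strategy`.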
Rollout sequence (per tenant)
- Pre-flight checklist
  - Tenant has at least 5 providers with `outcome_score`, `cost_index`, `accreditations`, and `languages_supported` populated.
  - Spot-check a known-good case: legacy strategy returns sensible ranking. Capture the top-3 provider IDs as the "before" baseline.
- Enable in Flagsmith (identity-override path)
  - Flagsmith → Identities → search for the tenant's identity (typically `tenant:<tenant-slug>`).
  - Add an override for `matching_engine_v2` → `true`.
  - Use `Token <api-key>` auth, NOT `Api-Key`; see `scripts/sync_flagsmith.py` for the canonical header shape.
- Verify
  - Re-run the same case via `POST /api/v1/cases/{id}/matches`.
  - Confirm response includes `match_confidence` and `domain_breakdown` keys (envelope shape change is the v2 signature).
  - Confirm top-3 ranking has not regressed for fully-populated providers; sparse providers should rank LOWER than they did under the legacy default-fill behavior. This is the intended behavior change, not a bug.
- Monitor for 24h
  - Watch for elevated `MATCH_*` error codes in Sentry / Telegram.
  - Watch the `match_confidence` distribution in Metabase: most tenants should see median ~0.6-0.8 once parameters are seeded. If median sits at <0.3, parameter coverage is too thin and the v2 envelope will visibly underperform legacy on patient UX.
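The verify and monitor checks above could be scripted along these lines. This is a minimal sketch: the helper names are hypothetical, and it assumes `match_confidence` and `domain_breakdown` sit at the top level of the response body, which the runbook does not state.

```python
from statistics import median

# Keys whose presence marks the v2 envelope, per the Verify step.
V2_KEYS = {"match_confidence", "domain_breakdown"}


def has_v2_envelope(response):
    """True when the response dict carries both v2 envelope keys (assumed top-level)."""
    return V2_KEYS.issubset(response)


def top3_regressions(before, after):
    """Provider IDs from the baseline top 3 that are missing from the new top 3."""
    return [pid for pid in before[:3] if pid not in after[:3]]


def coverage_too_thin(confidences, floor=0.3):
    """True when the median match_confidence sits below the 0.3 floor."""
    return median(confidences) < floor
```

A non-empty result from `top3_regressions` is only a prompt for review: a sparse provider dropping out of the top 3 is the intended behavior change, while a fully-populated provider dropping out is the regression to investigate.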
Rollback (one flip)
If matching quality regresses for sparse-catalog tenants, flip the
flag back to false per tenant — or globally:
```bash
# Flagsmith CLI (or dashboard → Flags → matching_engine_v2)
flagsmith flag update matching_engine_v2 --enabled false
```
The legacy WeightedScoringV1 / GraphEnhancedWeightedV1 path resumes
on the next request (no deploy needed). No data migration is required —
the v2 scorer reads the same Postgres / Neo4j stores.
Phase 2 (PG → Neo4j projection worker) is when the legacy path is deleted; until then, the legacy path stays alive as a rollback.
Updating the registry
- Edit the relevant `config/matching/parameters/<domain>.yaml`.
- Run `python scripts/generate_matching_parameters_reference.py` to refresh the human-readable doc.
- Run `pytest tests/test_matching_registry.py` locally.
- PR + merge; CI re-validates the registry on every change.
A parameter graduates from seeded to active by:
- Verifying ≥50% provider coverage for the underlying field (Metabase → matching coverage dashboard).
- Adding a `normalizer` block to the parameter entry.
- Bumping the active-count assertion in `tests/test_matching_registry.py` (currently pinned at 14).
- Updating `docs/architecture/matching-engine.md`.
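The first two graduation criteria can be checked mechanically before opening the PR. The sketch below is illustrative only: `coverage` and `can_graduate` are hypothetical helpers, providers are modeled as plain dicts, and `source_path` is treated as a flat field name, which may not match the real registry schema.

```python
def coverage(providers, field):
    """Fraction of providers with a non-empty value for the given field."""
    if not providers:
        return 0.0
    populated = sum(1 for p in providers if p.get(field) not in (None, "", []))
    return populated / len(providers)


def can_graduate(param, providers, min_coverage=0.5):
    """True when a seeded parameter meets the coverage bar and has a normalizer block."""
    # Assumption: source_path names a flat provider field.
    field = param["source_path"]
    has_normalizer = bool(param.get("normalizer"))
    return coverage(providers, field) >= min_coverage and has_normalizer
```

The Metabase coverage dashboard remains the source of truth for the 50% threshold; a local check like this only catches obvious misses early.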
Common failures
| Symptom | Likely cause | Fix |
|---|---|---|
| CI: "domain_weight_share sum != 1.0" | YAML edit didn't rebalance other params in the file | Adjust other entries in the same file so the sum returns to 1.0. |
| CI: "source_path references unknown attribute" | Provider field renamed; registry stale | Update source_path to the new ORM attribute name. |
| `matching_engine_v2=true` but envelope unchanged | Backend cached old flag | Wait `FLAGSMITH_CACHE_TTL` (default 60s) or restart the backend. |
| Sparse provider scores 0.0 instead of low-but-nonzero | New scorer working as designed; legacy default-fill removed | Expected. Inform the tenant; suggest provider data import. |
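For the `domain_weight_share` failure in the first row, the check and one possible repair can be sketched as below. The helper names are hypothetical, shares are modeled as a plain name-to-share dict, and proportional rescaling is only one way to rebalance; the CI validator's actual tolerance and scope (all entries or only active ones) are not specified here.

```python
import math


def shares_sum_to_one(shares, tol=1e-9):
    """True when the domain_weight_share values in one file sum to 1.0 within tol."""
    return math.isclose(math.fsum(shares.values()), 1.0, abs_tol=tol)


def rebalance(shares, changed):
    """Scale every entry except `changed` proportionally so the total returns to 1.0."""
    remaining = 1.0 - shares[changed]
    others_total = math.fsum(v for k, v in shares.items() if k != changed)
    if remaining < 0 or others_total <= 0:
        raise ValueError("cannot rebalance: edited share exceeds 1.0 or nothing left to scale")
    factor = remaining / others_total
    return {k: (v if k == changed else v * factor) for k, v in shares.items()}
```

Proportional rescaling preserves the relative importance of the untouched parameters, but whether that is desirable is a product decision, so review the resulting numbers before committing them to the YAML.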