Runbook — Matching Engine

Operational guide for the matching engine: how to flip strategies, roll out the registry-driven scorer, and recover from a regression.

Components

Postgres providers table — canonical source of provider data (ADR-0026).
Neo4j projection — graph view rebuilt from Postgres + reference YAML. Maintained via the doctor-graph sync (#770) and the admin rebuild endpoint.
config/matching/parameters/<domain>.yaml — 147-parameter registry (Phase 1 / #767). Loaded by app/services/matching/registry.py.
app/services/matching_engine.py — strategy implementations (WeightedScoringV1, GraphEnhancedWeightedV1). Registry-driven scorer lands in PR-B.
app/services/match_service.py — orchestrates input gathering, strategy selection, and response shaping.

Feature flags

Flag	Default	Effect
`matching_strategy`	`weighted_v2_1`	Selects which strategy runs for matching. Values: `weighted_v1` (graph-enhanced), `weighted_v1_legacy`, `ml_ranking_v2`, `hybrid_v3`.
`matching_engine_v2`	`false`	When true, the match service routes through the registry-driven scorer instead of `matching_strategy`. Phase 1 of #767.
`matching_weights_v1`	`""`	Optional JSON override of the legacy weight dict. Empty string = code defaults.
`matching_max_providers`	`"3"`	Cap on results returned per case.
`matching_shadow_mode`	`false`	Log v2 scores without surfacing them to patients.
`agent_enhanced_matching`	`false`	LLM rerank on top of weighted output.
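
The flag table above can be read as a small resolution step that runs before matching. The following is a minimal sketch of that step, not the code in app/services/match_service.py: the `MatchingConfig` dataclass and `resolve_matching_config` function are hypothetical names, the defaults are copied from the table, and boolean flags are assumed to arrive as real booleans rather than strings.

```python
import json
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class MatchingConfig:
    """Hypothetical container for the resolved flag values."""
    use_registry_scorer: bool
    strategy: str
    weight_override: Optional[dict]
    max_providers: int


def resolve_matching_config(flags: Mapping[str, object]) -> MatchingConfig:
    # Empty string means "use the code defaults", so it maps to None.
    raw_weights = str(flags.get("matching_weights_v1", "") or "")
    weight_override = json.loads(raw_weights) if raw_weights.strip() else None

    return MatchingConfig(
        # Assumption: boolean flags are already bool, not "true"/"false" strings.
        use_registry_scorer=bool(flags.get("matching_engine_v2", False)),
        strategy=str(flags.get("matching_strategy", "weighted_v2_1")),
        weight_override=weight_override,
        # The table lists the cap as the string "3", so it is parsed here.
        max_providers=int(str(flags.get("matching_max_providers", "3"))),
    )


config = resolve_matching_config({"matching_engine_v2": True})
# When use_registry_scorer is True, `strategy` is ignored by the caller.
print(config)
```

Resolving every flag on each request, rather than once at startup, is what lets a flag flip take effect without a deploy.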

Rolling out `matching_engine_v2`

The new scorer reads only parameters marked status: active in the registry and emits final_score, match_confidence, and domain_breakdown. The flag defaults to OFF so the legacy path remains canonical at merge time; this is the one-flip rollback contract for Phase 1.
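
To make the output envelope concrete, here is a toy scorer under loudly stated assumptions. It is not the PR-B implementation: the `weight` field, the idea that provider values are already normalized to [0, 1] with higher meaning better, and the definition of `match_confidence` as the share of active weight backed by real data are all hypothetical choices made for illustration.

```python
from typing import Any, Dict, List, Mapping


def score_provider(
    provider: Mapping[str, Any],
    parameters: List[Mapping[str, Any]],
) -> Dict[str, Any]:
    # Only parameters marked active take part in scoring.
    active = [p for p in parameters if p.get("status") == "active"]
    total_weight = sum(p["weight"] for p in active)
    if total_weight == 0:
        return {"final_score": 0.0, "match_confidence": 0.0, "domain_breakdown": {}}

    weighted_sum = 0.0
    covered_weight = 0.0
    domain_breakdown: Dict[str, float] = {}
    for param in active:
        value = provider.get(param["source_path"])
        if value is None:
            continue  # no default-fill: missing data contributes nothing
        contribution = param["weight"] * float(value)
        weighted_sum += contribution
        covered_weight += param["weight"]
        domain = param["domain"]
        domain_breakdown[domain] = domain_breakdown.get(domain, 0.0) + contribution

    return {
        "final_score": weighted_sum / total_weight,
        # Hypothetical definition: share of active weight that had data.
        "match_confidence": covered_weight / total_weight,
        "domain_breakdown": domain_breakdown,
    }


params = [
    {"domain": "quality", "source_path": "outcome_score", "weight": 0.5, "status": "active"},
    {"domain": "cost", "source_path": "cost_index", "weight": 0.5, "status": "active"},
    {"domain": "access", "source_path": "languages_score", "weight": 0.25, "status": "seeded"},
]
print(score_provider({"outcome_score": 0.8}, params))
```

Under this sketch a provider with no populated fields scores 0.0 rather than receiving a default, which mirrors the sparse-provider behavior described elsewhere in this runbook.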

Rollout sequence (per tenant)

1. Pre-flight checklist
   - Tenant has at least 5 providers with outcome_score, cost_index, accreditations, and languages_supported populated.
   - Spot-check a known-good case: legacy strategy returns sensible ranking. Capture the top-3 provider IDs as the "before" baseline.
2. Enable in Flagsmith (identity-override path)
   - Flagsmith → Identities → search for the tenant's identity (typically tenant:<tenant-slug>).
   - Add an override for matching_engine_v2 → true.
   - Use Token <api-key> auth, NOT Api-Key. See scripts/sync_flagsmith.py for the canonical header shape.
3. Verify
   - Re-run the same case via POST /api/v1/cases/{id}/matches.
   - Confirm the response includes the match_confidence and domain_breakdown keys (the envelope shape change is the v2 signature).
   - Confirm the top-3 ranking has not regressed for fully-populated providers; sparse providers should rank LOWER than they did under the legacy default-fill behavior. This is the intended behavior change, not a bug.
4. Monitor for 24h
   - Watch for elevated MATCH_* error codes in Sentry / Telegram.
   - Watch the match_confidence distribution in Metabase: most tenants should see a median of ~0.6-0.8 once parameters are seeded. If the median sits below 0.3, parameter coverage is too thin and the v2 envelope will visibly underperform legacy on patient UX.
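
The pre-flight, verify, and monitor checks above can be expressed as small predicates, for example in a notebook or ad hoc script. This is a sketch only: the function names are hypothetical, providers and responses are assumed to be plain dicts, and the v2 keys are assumed to sit at the top level of whatever mapping is passed in.

```python
from statistics import median
from typing import Any, Iterable, Mapping, Sequence

REQUIRED_FIELDS = ("outcome_score", "cost_index", "accreditations", "languages_supported")


def is_populated(value: Any) -> bool:
    # Treat None, empty strings, and empty lists as missing; 0 counts as data.
    return value is not None and value != "" and value != []


def preflight_ok(providers: Iterable[Mapping[str, Any]], minimum: int = 5) -> bool:
    """True when at least `minimum` providers have every required field."""
    complete = sum(
        1 for p in providers
        if all(is_populated(p.get(field)) for field in REQUIRED_FIELDS)
    )
    return complete >= minimum


def has_v2_envelope(response: Mapping[str, Any]) -> bool:
    """True when both v2 signature keys are present."""
    return "match_confidence" in response and "domain_breakdown" in response


def confidence_too_thin(confidences: Sequence[float], floor: float = 0.3) -> bool:
    """True when the median confidence falls below the 0.3 warning level.

    Raises statistics.StatisticsError if `confidences` is empty.
    """
    return median(confidences) < floor
```

A failed `preflight_ok` is a reason to postpone the flip for that tenant, while `confidence_too_thin` returning True after the flip points at thin parameter coverage rather than a scorer bug.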

Rollback (one flip)

If matching quality regresses for sparse-catalog tenants, flip the flag back to false per tenant — or globally:

# Flagsmith CLI (or dashboard → Flags → matching_engine_v2)
flagsmith flag update matching_engine_v2 --enabled false

The legacy WeightedScoringV1 / GraphEnhancedWeightedV1 path resumes on the next request (no deploy needed). No data migration is required — the v2 scorer reads the same Postgres / Neo4j stores.

The legacy path is deleted in Phase 2 (PG → Neo4j projection worker); until then it stays in place as the rollback target.

Updating the registry

Edit the relevant config/matching/parameters/<domain>.yaml.
Run python scripts/generate_matching_parameters_reference.py to refresh the human-readable doc.
Run pytest tests/test_matching_registry.py locally.
PR + merge — CI re-validates the registry on every change.
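
As an illustration of the kind of check CI performs on a registry file, the sketch below validates one domain's entries after they have been parsed from YAML. It is an assumption-laden stand-in for tests/test_matching_registry.py, not a copy of it: `validate_domain` is a hypothetical name, every entry in the file is assumed to carry `domain_weight_share` and `source_path`, and the share is assumed to be summed over all entries in the file.

```python
import math
from typing import Any, List, Mapping, Set


def validate_domain(
    entries: List[Mapping[str, Any]],
    known_attributes: Set[str],
) -> List[str]:
    """Return a list of human-readable validation errors (empty when valid)."""
    errors: List[str] = []

    # Shares within one file must add up to 1.0, allowing for float rounding.
    total = sum(entry["domain_weight_share"] for entry in entries)
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        errors.append(f"domain_weight_share sum != 1.0 (got {total:.4f})")

    # Every source_path must name an attribute that still exists on the model.
    for entry in entries:
        if entry["source_path"] not in known_attributes:
            errors.append(
                f"source_path references unknown attribute: {entry['source_path']}"
            )
    return errors


entries = [
    {"source_path": "outcome_score", "domain_weight_share": 0.6},
    {"source_path": "cost_index", "domain_weight_share": 0.4},
]
print(validate_domain(entries, {"outcome_score", "cost_index"}))
```

Comparing with `math.isclose` instead of `==` avoids spurious failures when several decimal shares do not sum to exactly 1.0 in floating point.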

A parameter graduates from seeded to active by:

Verifying ≥50% provider coverage for the underlying field (Metabase → matching coverage dashboard).
Adding a normalizer block to the parameter entry.
Bumping the active-count assertion in tests/test_matching_registry.py (currently pinned at 14).
Updating docs/architecture/matching-engine.md.
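
The first two graduation criteria lend themselves to a quick eligibility check before opening a PR. The sketch below is illustrative only: `field_coverage` and `can_graduate` are hypothetical helpers, providers are assumed to be plain dicts keyed by the parameter's `source_path`, and coverage is computed locally rather than read from the Metabase dashboard.

```python
from typing import Any, Iterable, Mapping


def field_coverage(providers: Iterable[Mapping[str, Any]], field: str) -> float:
    """Fraction of providers with a non-empty value for `field`."""
    rows = list(providers)
    if not rows:
        return 0.0
    populated = sum(1 for row in rows if row.get(field) not in (None, "", []))
    return populated / len(rows)


def can_graduate(
    parameter: Mapping[str, Any],
    providers: Iterable[Mapping[str, Any]],
    threshold: float = 0.5,
) -> bool:
    """True when a seeded parameter meets the coverage and normalizer criteria."""
    if parameter.get("status") != "seeded":
        return False
    if not parameter.get("normalizer"):
        return False  # a normalizer block must be present before activation
    return field_coverage(providers, parameter["source_path"]) >= threshold
```

Passing this check does not replace the remaining steps: the active-count assertion and the architecture doc still need to be updated in the same PR.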

Common failures

Symptom	Likely cause	Fix
CI: "domain_weight_share sum != 1.0"	YAML edit didn't rebalance other params in the file	Adjust other entries in the same file so the sum returns to 1.0.
CI: "source_path references unknown attribute"	Provider field renamed; registry stale	Update `source_path` to the new ORM attribute name.
`matching_engine_v2=true` but envelope unchanged	Backend cached old flag	Wait `FLAGSMITH_CACHE_TTL` (default 60s) or restart the backend.
Sparse provider scores 0.0 instead of low-but-nonzero	New scorer working as designed; legacy default-fill removed	Expected. Inform the tenant; suggest provider data import.
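
The "envelope unchanged" row is easier to reason about with a picture of how a time-to-live cache behaves. The class below is a generic sketch, not the backend's actual cache: `TtlFlagCache` and its `fetch` and `clock` parameters are hypothetical, and the only detail taken from this runbook is the 60-second default.

```python
import time
from typing import Callable, Dict, Tuple


class TtlFlagCache:
    """Caches flag values and refetches them once they are older than the TTL."""

    def __init__(
        self,
        fetch: Callable[[str], object],
        ttl_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[object, float]] = {}

    def get(self, name: str) -> object:
        now = self._clock()
        cached = self._entries.get(name)
        if cached is not None and now - cached[1] < self._ttl:
            return cached[0]  # still fresh: a remote flip is not visible yet
        value = self._fetch(name)
        self._entries[name] = (value, now)
        return value
```

With a cache like this, a flag flipped in Flagsmith can take up to one TTL to show up in responses, and a restart empties the cache so the next request refetches immediately.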

Related

ADR-0026 — Matching framework architecture
#765 — Parent epic
#767 — Phase 1 (this runbook)
#770 — Phase 0 doctor-graph sync
