Sequence — Auth + Tenant Resolution

Every authenticated request to the Curaway API passes through RBACMiddleware. This middleware verifies the Clerk JWT, resolves the tenant context, populates request.state with roles + permissions, and (for portal users) JIT-syncs Clerk org-role claims into the user_roles table.

This is the hottest path in the system because it runs on every request. It is also the most critical path for the migration, since GCP Cloud Run cold starts and IAM round-trips will land here first.

Audience: Engineering team + security review. Companion to ADR-0018 (multi-tenancy) and ADR-0021 (Clerk org-role mapping).


Sequence

sequenceDiagram
    autonumber
    actor User as User<br/>(patient or portal)
    participant FE as Frontend<br/>(Vercel)
    participant API as Curaway API
    participant MW as RBACMiddleware
    participant Redis as Upstash Redis<br/>(5-min TTL cache)
    participant PG as PostgreSQL
    participant Sync as clerk_role_sync
    participant FS as Flagsmith<br/>(runtime flags)
    participant Telegram as Telegram<br/>(alerts)

    User->>FE: action requiring auth
    FE->>API: GET /api/v1/cases<br/>Authorization: Bearer <Clerk JWT>

    API->>MW: dispatch(request)
    MW->>MW: skip if path is /health, /docs, /api/v1/public/*

    MW->>MW: _extract_user_id(JWT)<br/>(JWT verified via Clerk public key)
    MW->>MW: _extract_org_context(JWT)<br/>(org_id, org_role)

    alt portal user (org_id present)
        MW->>Redis: GET org_tenant:{org_id}
        alt cache hit
            Redis-->>MW: { tenant_id, portal_type }
        else cache miss
            MW->>PG: SELECT tenant_id, org_role<br/>FROM tenant_org_mappings<br/>WHERE clerk_org_id = ?
            PG-->>MW: row
            MW->>Redis: SET (TTL 300s)
        end

        opt header was also sent
            MW->>MW: assert X-Tenant-ID == org-resolved<br/>(spoofing guard)
        end

        opt JIT sync (clerk_role_auto_assign_enabled = on)
            MW->>FS: is_feature_enabled("clerk_role_auto_assign_enabled")
            FS-->>MW: true
            MW->>Sync: reconcile(user, tenant, jwt_role, org_id, portal_type)
            Sync->>Sync: lookup_key = portal_type + ":" + jwt_role<br/>(strip "org:" prefix)
            alt key in clerk_role_mapping.yaml
                Sync->>PG: SELECT existing user_roles<br/>where granted_by='system:clerk_jwt'
                alt drift (existing role != desired)
                    Sync->>PG: UPDATE old row → is_active=false
                    Sync->>PG: INSERT new auto row
                else no row
                    Sync->>PG: INSERT auto row
                else manual row exists
                    Note over Sync,PG: manual rows preserved (sticky)
                else null mapping (silent skip)
                    Note over Sync: e.g. patient personal org
                end
            else key not in mapping
                Sync--xTelegram: WARNING alert<br/>("Unknown Clerk org context")
            end
        end
    else patient (no org_id)
        MW->>MW: tenant_id = X-Tenant-ID header
    end

    MW->>PG: SELECT roles + permissions<br/>FROM user_roles JOIN roles<br/>WHERE user_id = ? AND tenant_id = ?
    PG-->>MW: roles + permissions
    MW->>MW: set request.state.{user_id, tenant_id, roles, permissions}

    MW->>API: call_next(request)
    Note over API: route handler runs<br/>@require_permission decorator<br/>checks request.state.permissions
    API-->>FE: response
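
The JIT sync branch of the diagram can be read as a small decision function. The following is a minimal sketch of that reading, not the actual clerk_role_sync code: `ROLE_MAPPING`, `lookup_key`, `plan_reconcile`, the returned action names, and the sample mapping entries are all hypothetical. It assumes the YAML file parses into a dict keyed by `portal_type:role`, that a null value means "skip silently", and that a manual row takes precedence over any automatic change (the diagram does not state the precedence explicitly).

```python
from typing import Optional

# Hypothetical stand-in for the parsed clerk_role_mapping.yaml.
# Keys are "<portal_type>:<role>"; a None value means "skip silently".
ROLE_MAPPING: dict[str, Optional[str]] = {
    "provider:admin": "provider_admin",
    "provider:member": "provider_staff",
    "patient:member": None,  # e.g. a patient's personal org
}


def lookup_key(portal_type: str, jwt_role: str) -> str:
    """Build the mapping key, stripping the "org:" prefix from the JWT role."""
    return f"{portal_type}:{jwt_role.removeprefix('org:')}"


def plan_reconcile(
    portal_type: str,
    jwt_role: str,
    existing_auto_role: Optional[str],
    has_manual_row: bool,
    mapping: dict[str, Optional[str]] = ROLE_MAPPING,
) -> str:
    """Return the action the sync step would take, without touching the DB."""
    key = lookup_key(portal_type, jwt_role)
    if key not in mapping:
        return "alert_unknown"  # unknown Clerk org context: send WARNING alert
    desired = mapping[key]
    if desired is None:
        return "skip"  # null mapping: silent skip
    if has_manual_row:
        return "keep_manual"  # manual rows are sticky (assumed precedence)
    if existing_auto_role is None:
        return "insert"  # no auto row yet
    if existing_auto_role != desired:
        return "replace"  # drift: deactivate old row, insert new one
    return "noop"  # auto row already matches
```

Keeping the decision separate from the SQL makes each branch of the diagram testable without a database; the caller would translate the returned action into the UPDATE and INSERT statements shown above.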

Migration callouts

| Concern | Today | GCP target | Notes |
| --- | --- | --- | --- |
| JWT verify | Clerk SDK (network call to fetch JWKS, then local verify) | Unchanged | Verify Cloud Run cold-start time on the first JWKS fetch; it could add a 50-150ms latency tail to the first request. Pre-warm or extend keep-alive. |
| Tenant cache | Upstash Redis | Memorystore | 5-min TTL keeps the DB lookup off the hot path. Memorystore latency from Cloud Run is ~1ms intra-region. |
| Tenant DB lookup | Cloud SQL Postgres tenant_org_mappings | Unchanged | Index on clerk_org_id already present. |
| In-memory fallback dict | Hardcoded org IDs in middleware | Remove post-migration | The fallback is only reached when both Redis AND Cloud SQL are down; on GCP that's a region-wide outage where falling through is unsafe. Make this a hard-fail. |
| JIT sync trigger | Flagsmith flag check | Unchanged | Flagsmith Cloud SaaS; FLAGSMITH_SERVER_KEY env var. |
| YAML mapping load | config/clerk_role_mapping.yaml read at module import | Unchanged | Bundled with deploy artifact. |
| Telegram alert (unknown role) | HTTP POST to Telegram bot API | Unchanged | Replace with Cloud Logging + Pub/Sub-driven Telegram dispatcher if Cloud Run egress to Telegram is restricted by org policy. |
| Permission lookup | Cloud SQL JOIN on user_roles × roles | Unchanged | Two tables, both small (< 10K rows). Cache-friendly. |
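
The tenant cache and fallback rows describe a cache-aside lookup that should fail hard rather than fall through. A minimal sketch of that shape follows, under stated assumptions: `resolve_tenant`, `TenantResolutionError`, `InMemoryCache`, and the `db_lookup` callable are hypothetical names, the cached value is assumed to be JSON, and the in-memory cache only imitates the `get` / `set(..., ex=ttl)` calls of a Redis client so the example runs on its own.

```python
import json
from typing import Callable, Optional

TENANT_CACHE_TTL_SECONDS = 300  # the 5-minute TTL from the diagram


class TenantResolutionError(RuntimeError):
    """Raised when the org cannot be resolved; there is no fallback dict."""


class InMemoryCache:
    """Tiny stand-in for a Redis client so the sketch is self-contained."""

    def __init__(self) -> None:
        self.store: dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self.store.get(key)

    def set(self, key: str, value: str, ex: int) -> None:
        self.store[key] = value  # the TTL is ignored in this stand-in


def resolve_tenant(
    org_id: str,
    cache: InMemoryCache,
    db_lookup: Callable[[str], Optional[dict]],
) -> dict:
    """Cache-aside lookup: try the cache, then the DB, then fail hard."""
    key = f"org_tenant:{org_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)  # cache hit
    row = db_lookup(org_id)  # cache miss: a DB error propagates to the caller
    if row is None:
        raise TenantResolutionError(f"no tenant mapping for org {org_id!r}")
    cache.set(key, json.dumps(row), ex=TENANT_CACHE_TTL_SECONDS)
    return row
```

In this sketch a database exception is deliberately not caught, so the request fails instead of continuing with a guessed tenant. How the middleware turns that failure into an HTTP status is left open here.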

Critical migration risks

  1. JWKS warm-cache on cold start. Clerk JWT verification fetches the JWKS once per process, so on a Cloud Run cold start the first request waits for that fetch (~50-150ms over the public internet). Mitigation: set Cloud Run min-instances ≥ 1 on the API service; the cost (~$5-10/mo for a small instance) is worth it for p99 latency.

  2. Redis cache stampede on deploy. When all Cloud Run instances start fresh, every authenticated request from a portal user triggers a Redis miss followed by a DB query, and these run in parallel. With the current 5-min TTL, every org is queried at least once during a deploy storm. Mitigation: pre-warm the cache during health-check init.

  3. Unbounded JIT failure mode. If clerk_role_sync.reconcile() raises an unhandled exception, the middleware swallows it (broad except) and returns the user with empty permissions. The user then sees 403 everywhere. No alert fires for this case. Migration prep: replace the broad except with specific handlers for LookupError, OperationalError, IntegrityError; route everything else to a CRITICAL alert.

  4. Tenant ID spoofing via header. When both Clerk org and X-Tenant-ID header are present, the middleware logs a warning but uses the org-derived value. Verify on Cloud Run that headers can't be injected by an upstream proxy bypassing this guard (e.g. Cloud CDN, Cloud Armor). Document the trust boundary.

  5. tenant_org_mappings is the new SPOF. Pre-migration, the in-memory fallback dict masked DB outages. Post-migration (when we recommend removing the dict), Cloud SQL availability becomes the auth path's hard floor. This is an argument for running Cloud SQL with a regional replica and automatic failover.
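
For risk 3, the narrower exception handling could look roughly like the sketch below. It is an illustration under assumptions, not the middleware's code: `run_jit_sync` and `send_alert` are hypothetical, the two database exception classes are local stand-ins for whatever the driver or ORM raises, and the choice to log the expected failures while alerting only on unexpected ones is one possible policy.

```python
import logging
from typing import Callable

logger = logging.getLogger("rbac.jit_sync")


# Local stand-ins so the sketch runs without a database driver; the real
# service would presumably import these from its driver or ORM.
class OperationalError(Exception):
    pass


class IntegrityError(Exception):
    pass


def run_jit_sync(
    reconcile: Callable[[], None],
    send_alert: Callable[[str, str], None],
) -> bool:
    """Run the sync step; return True on success and False on any failure."""
    try:
        reconcile()
        return True
    except LookupError as exc:
        # Expected: the role or mapping entry could not be found.
        logger.warning("JIT sync lookup failed: %r", exc)
    except OperationalError as exc:
        # Expected: the database was unreachable or timed out.
        logger.warning("JIT sync database unavailable: %r", exc)
    except IntegrityError as exc:
        # Expected: a concurrent request wrote the same row first.
        logger.warning("JIT sync constraint conflict: %r", exc)
    except Exception as exc:
        # Unexpected: this is the case that is silent today.
        logger.exception("JIT sync failed unexpectedly")
        send_alert("CRITICAL", f"JIT sync failed unexpectedly: {exc!r}")
    return False
```

Returning a boolean lets the caller decide whether a failed sync should still proceed to the permission lookup; the source does not say which behavior is intended.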


Code references