Source: docs/operations/public-api-launch-runbook.md

# Public API Launch Operations Runbook

**Owner:** platform-oncall
**Created:** 2026-02-24
**Last validated:** 2026-02-24
**Plan ID:** P11-G
**Exit gate:** Launch incident/rollback/comms runbooks exist and have completed tabletop evidence

---

## A. Known Launch Blockers / Risks

Items that must be resolved before the public launch gate can be approved. The P11-H go/no-go packet links here directly.

| # | Blocker | Current State | Owner | Target Date | Source |
|---|---------|---------------|-------|-------------|--------|
| 1 | **7-day canary clean run** | Window started 2026-02-24 (P11-F #426 merge). Planned completion 2026-03-03. | platform-oncall | 2026-03-03 | P11-F exit gate |
| 2 | **P11-H approval record completion** | Go/no-go packet draft exists, but approvals, launch window, and rollback authority still require final signoff entries. | Dan | Before launch decision | P11-H exit gate |

**Hard rule:** If blocker #1 (7-day canary clean run) is incomplete, P11-H **cannot** be marked approved for broad public launch.

### Recently Resolved Blocker (for audit traceability)

- `integration_keys` FORCE RLS defense-in-depth issue is now resolved via a constrained pre-auth policy and FORCE RLS re-enable. See `docs/BACKLOG.md` line 102 and PRs #431 + #433.

### Other Open Items (Non-Blocking)

| Item | Severity | Notes |
|------|----------|-------|
| `logIntegrationAuthFailure()` pre-auth audit calls | LOW | Will fail silently under FORCE RLS (already try/catch). Failed auth attempts are also logged via the structured logger. |
| Improve `verify-force-rls-staging.sh` failure output | LOW | Include exact failing table names in stderr/Slack. |
| Unit tests for deep health check logic | LOW | DB check behavior, migration drift detection — covered E2E by the staging health workflow. |

---

## B. Incident Triage

Failure modes are ordered by likelihood during the launch window. Each follows the symptom → diagnostics → root cause → resolution pattern from the P11-F canary runbook.

### B1. High Error Rate (5xx spike)

**Symptom:** Error rate exceeds 5% of requests over a 5-minute window. Canary may report `FAIL_ENQUEUE` or `FAIL_AUTH`. Slack alerts fire from the canary workflow.

**Diagnostics:**

1. Check API health: `curl https://rgl8r-staging-api.onrender.com/health`
2. Check deep health (DB + migrations): `curl "https://rgl8r-staging-api.onrender.com/health?deep=true"`
3. Check the Render dashboard for deploy status, crash loops, memory pressure
4. Check recent deploy history — was a new commit deployed in the last 30 minutes?

**Root Cause Checklist:**

- [ ] Bad deploy (new code introduced a regression)
- [ ] Database connectivity issue (connection pool exhaustion, Postgres restart)
- [ ] External dependency failure (if applicable)
- [ ] Memory/CPU exhaustion on the Render instance
- [ ] Expired or misconfigured environment variables

**Resolution:**

1. If bad deploy → Render rollback (see [Lever 1](#lever-1-render-rollback))
2. If DB connectivity → Check the Render Postgres dashboard, verify `DATABASE_URL`
3. If resource exhaustion → Scale the Render instance or restart the service
4. If env var issue → Fix in the Render dashboard, redeploy

### B2. Degraded Latency

**Symptom:** API responses exceed SLO thresholds (auth >3s, enqueue >5s, job completion >90s). Canary reports an SLO breach. Response times are degraded but requests succeed. By default, SLO-only breaches fail the workflow but do not page.

**Diagnostics:**

1. Check canary timing output: auth, enqueue, and completion latency values
2. Check Render metrics: CPU, memory, response time
3. Check concurrent job count: `GET /api/jobs?status=PROCESSING` (count of in-flight jobs)
4. Check whether a Render free-tier cold start is the cause (first request after an idle period)

**Root Cause Checklist:**

- [ ] Render instance cold start (free tier spins down after inactivity)
- [ ] High concurrent job load (job processor contention)
- [ ] Database query performance degradation (missing indexes, table bloat)
- [ ] Network latency between Render and Postgres

**Resolution:**

1. If cold start → Expected on the free tier; document as transient, close if a single occurrence
2. If job contention → Tighten queue admission caps (see [Lever 6](#lever-6-queue-admission-cap-override))
3. If DB performance → Check for long-running queries, consider `VACUUM ANALYZE`
4. If persistent → Consider a Render plan upgrade for the launch window

### B3. Tenant Isolation Breach

**Symptom:** Data from one tenant is visible to another tenant's API calls. This is a **severity BLOCK** incident.

**Diagnostics:**

1. Immediately run RLS verification: `gh workflow run staging-force-rls-checks.yml`
2. Check whether the affected endpoint uses the `withTenant()` wrapper
3. Check recent deploys for changes to RLS-scoped routes
4. Verify `current_setting('app.current_tenant_id')` is set correctly in the request context

**Root Cause Checklist:**

- [ ] Missing `withTenant()` wrapper on a new or modified route
- [ ] Raw `prisma` query bypassing the tenant GUC
- [ ] RLS policy dropped or altered by a migration
- [ ] FORCE RLS disabled on a table

**Resolution:**

1. **Immediately** disable the affected endpoint if possible (return 503)
2. Run emergency RLS verification (see [Lever 4](#lever-4-emergency-rls-verification))
3. Fix the root cause (add `withTenant()`, restore the RLS policy)
4. Re-run the verification workflow to confirm the fix
5. **This is a mandatory post-incident review trigger** (see [Section G](#g-post-incident-review))

### B4. Job Queue Stall

**Symptom:** Jobs stuck in `PENDING` or `PROCESSING` state. No new jobs completing. Upload/enqueue requests succeed but results never arrive.

**Diagnostics:**

1. Check job status distribution: `GET /api/jobs` (look for many PENDING/PROCESSING, no recent COMPLETED)
2. Check whether the job processor is running (Render service logs — look for `[job-processor]` entries)
3. Check for stuck transactions in Postgres
4. Check queue admission state: are caps being hit?

**Root Cause Checklist:**

- [ ] Job processor crashed or stopped polling
- [ ] Database lock contention (long-running transaction blocking job claims)
- [ ] Job processor failing silently (processing but erroring without marking FAILED)
- [ ] Queue admission caps too tight (new jobs rejected)

**Resolution:**

1. If processor stopped → Restart the Render service
2. If lock contention → Identify and terminate the blocking query
3. If silent failures → Check job processor logs, fix error handling
4. If admission caps → Adjust caps (see [Lever 6](#lever-6-queue-admission-cap-override))

### B5. Auth Failures

**Symptom:** Integration key exchange (`POST /api/auth/token/integration`) returning 401/403. Canary reports `FAIL_AUTH`. Legitimate clients cannot authenticate.

**Diagnostics:**

1. Check the API health endpoint (is the service up?)
2. Check whether the JWT signing keys are valid: verify the `JWT_PRIVATE_KEY` and `JWT_PUBLIC_KEY` env vars
3. Check whether the integration key is valid (not revoked/expired)
4. Check recent deploys for auth middleware changes

**Root Cause Checklist:**

- [ ] JWT keys expired or misconfigured after deploy
- [ ] Auth middleware regression in a new deploy
- [ ] Integration key revoked or rotated without updating the client
- [ ] Clock skew causing JWT validation failures

**Resolution:**

1. If key misconfiguration → Fix env vars in Render, redeploy
2. If auth regression → Render rollback (see [Lever 1](#lever-1-render-rollback))
3. If key revoked → Issue a new key, update the client configuration
4. If clock skew → Restart the service (NTP sync)

### B6. Quota Exhaustion

**Symptom:** Legitimate clients receiving `429 RATE_LIMITED` responses. Canary may report `FAIL_ENQUEUE` with a 429 status.
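When triaging 429s, it helps to read the server's advisory backoff programmatically rather than eyeballing headers. A minimal sketch, assuming response headers were saved with `curl -D`; the `retry_after_seconds` helper is illustrative, not part of the service:

```shell
#!/usr/bin/env bash
# Sketch: extract the Retry-After value (in seconds) from captured HTTP
# response headers, falling back to a default backoff if the header is absent.

retry_after_seconds() {
  # $1: file containing raw response headers (e.g. written by `curl -D`)
  local value
  value=$(awk 'tolower($1) == "retry-after:" { gsub(/\r/, ""); print $2; exit }' "$1")
  # Fall back to a 30s default when the server did not send the header
  echo "${value:-30}"
}

# Example usage (commented; token and URL as in the diagnostics below):
# curl -s -D /tmp/headers.txt -o /dev/null \
#   -H "Authorization: Bearer <jwt>" \
#   "https://rgl8r-staging-api.onrender.com/api/jobs"
# sleep "$(retry_after_seconds /tmp/headers.txt)"
```

This also doubles as a quick check of whether a misbehaving client could simply be ignoring the header.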
**Diagnostics:**

1. Check which quota is being hit (per-key rate limit vs. queue admission cap)
2. Check the `Retry-After` header value in 429 responses
3. Check whether a single client is consuming disproportionate resources
4. Check abuse guard middleware logs for blocked keys

**Root Cause Checklist:**

- [ ] Client retry loop (agent not respecting `Retry-After`)
- [ ] Quota limits too aggressive for actual traffic patterns
- [ ] Burst of legitimate traffic exceeding defaults
- [ ] Abusive client consuming shared capacity

**Resolution:**

1. If abusive client → Revoke the integration key (see [Lever 5](#lever-5-integration-key-revocation))
2. If limits too tight → Adjust env vars and redeploy (see [Lever 6](#lever-6-queue-admission-cap-override))
3. If legitimate burst → Temporarily increase limits, monitor
4. If retry loop → Contact the client, point to the API docs retry guidance

---

## C. Rollback Levers

Each lever is documented as a verified run step with the exact command/path, expected time-to-effect, blast radius, and last-validated date.

### Lever 1: Render Rollback

**Action:** Roll back to a previous known-good deploy via the Render dashboard.

**Exact UI path:**

1. Navigate to the Render dashboard → select the `rgl8r-staging-api` service
2. Click **Manual Deploy** in the top-right
3. Select the prior commit SHA from the dropdown (the last known-good deploy)
4. Click **Deploy**

> Note: This is a UI-driven action. There is no CLI equivalent for Render deploys at this time.

**Time-to-effect:** 2–5 minutes (Render build + deploy cycle)

**Blast radius:**

- Rolls back ALL application code to the selected commit
- Does NOT roll back database migrations (Prisma migrations are forward-only)
- Does NOT affect environment variables (those persist independently)
- Affects all tenants on the Render service

**Last validated:** 2026-02-24

---

### Lever 2: Feature Flag Toggles

**Action:** Disable specific feature subsystems via environment variables without a code rollback.
**Exact commands (set in Render dashboard → Environment):**

| Flag | Value to Disable | Effect |
|------|------------------|--------|
| `SIMA_CRE_MODE` | `off` | Disables CRE attribute-based rule refinement; SIMA returns raw measure outcomes only |
| `TRADE_DETECTOR_MODE` | `legacy` | Reverts TRADE detection to the hardcoded legacy path |
| `SHIP_CARRIER_CONFIG_SOURCE` | `hardcoded` | Reverts SHIP carrier rules to hardcoded defaults, bypassing DB-loaded rules |

After changing any flag, trigger a manual deploy in Render (or the service auto-restarts on env var change, depending on Render config).

**Time-to-effect:** 1–3 minutes (env var update + service restart)

**Blast radius:**

- Each flag is scoped to its subsystem only
- Other modules continue operating normally
- No data loss — flags control read-path behavior, not write-path

**Last validated:** 2026-02-24

---

### Lever 3: Disable Canary Workflow

**Action:** Stop the canary from running (useful when the canary infrastructure itself is broken, not the API).

**Exact command:**

```bash
gh workflow disable public-api-canary.yml
```

**Re-enable:**

```bash
gh workflow enable public-api-canary.yml
```

**Time-to-effect:** Immediate (the next scheduled run will not fire)

**Blast radius:**

- Stops all canary monitoring — no Slack alerts, no GitHub issue upserts
- Does NOT affect the API itself — only the monitoring workflow stops
- The 7-day clean window resets if the workflow stays disabled for a significant time

**Last validated:** 2026-02-24

---

### Lever 4: Emergency RLS Verification

**Action:** Run the staging FORCE RLS verification workflow to confirm tenant isolation is intact.
**Exact command:**

```bash
gh workflow run staging-force-rls-checks.yml
```

**Check results:**

```bash
gh run list --workflow='staging-force-rls-checks.yml' --limit 5
```

**Time-to-effect:** 1–2 minutes (workflow execution)

**Blast radius:**

- Read-only verification — does not modify any database state
- Reports which tables have FORCE RLS enabled/disabled
- Alerts to Slack if configured

**Last validated:** 2026-02-24

---

### Lever 5: Integration Key Revocation

**Action:** Revoke a specific integration key to cut off a misbehaving or compromised client.

**Exact command:**

```bash
curl -X POST "https://rgl8r-staging-api.onrender.com/api/integration-keys/<KEY_ID>/revoke" \
  -H "Authorization: Bearer <jwt>" \
  -H "Content-Type: application/json"
```

> Note: This is a tenant-scoped route (`POST /api/integration-keys/:id/revoke`) that requires admin privileges (`admin:keys` scope). The JWT must belong to a Clerk-authenticated admin of the tenant that owns the key.

**Time-to-effect:** Immediate (the next auth attempt with the revoked key fails)

**Blast radius:**

- Only affects the specific integration key revoked
- Other keys for the same tenant continue working
- The revoked key cannot be un-revoked — a new key must be created

**Last validated:** 2026-02-24

---

### Lever 6: Queue Admission Cap Override

**Action:** Tighten queue admission caps to throttle inbound job volume during incidents.

**Exact commands (set in Render dashboard → Environment):**

| Variable | Default | Emergency Value | Effect |
|----------|---------|-----------------|--------|
| `ENQUEUE_MAX_INFLIGHT_TOTAL` | 80 | 10 | Max total in-flight jobs across all types |
| `ENQUEUE_MAX_INFLIGHT_UPLOADS` | 30 | 5 | Max in-flight upload jobs |
| `ENQUEUE_MAX_INFLIGHT_SIMA` | 20 | 3 | Max in-flight SIMA batch jobs |

After changing: redeploy or wait for the Render auto-restart.
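As an on-call sanity check, current in-flight counts can be compared against the emergency values before and after the override. A minimal sketch under stated assumptions: counts are supplied by hand (or from a `GET /api/jobs` query), and `check_caps` is a hypothetical helper, not part of the service:

```shell
#!/usr/bin/env bash
# Sketch: report which emergency admission caps would already be saturated,
# given in-flight counts. Defaults mirror the "Emergency Value" column; the
# same env vars override them when set.

check_caps() {
  # $1: total in-flight jobs, $2: in-flight uploads, $3: in-flight SIMA batches
  local total=$1 uploads=$2 sima=$3 tripped=0
  [ "$total" -ge "${ENQUEUE_MAX_INFLIGHT_TOTAL:-10}" ] && { echo "TOTAL cap saturated ($total)"; tripped=1; }
  [ "$uploads" -ge "${ENQUEUE_MAX_INFLIGHT_UPLOADS:-5}" ] && { echo "UPLOADS cap saturated ($uploads)"; tripped=1; }
  [ "$sima" -ge "${ENQUEUE_MAX_INFLIGHT_SIMA:-3}" ] && { echo "SIMA cap saturated ($sima)"; tripped=1; }
  [ "$tripped" -eq 0 ] && echo "all caps have headroom"
  return 0
}
```

Any saturated cap means new enqueue requests of that type will be rejected with 429 until in-flight jobs drain.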
**Time-to-effect:** 1–3 minutes (env var update + service restart)

**Blast radius:**

- New enqueue requests exceeding the caps receive `429 RATE_LIMITED` with a `Retry-After` header
- Already in-flight jobs continue processing normally
- Does NOT affect read endpoints (results, evidence, jobs list)

**Last validated:** 2026-02-24

---

## D. Degraded-Mode Operation

### When to Activate

Enter degraded mode when ANY of the following conditions persist for 10+ minutes:

- Error rate >5% of total requests
- Latency >3x SLO targets (auth >9s, enqueue >15s, completion >270s)
- Job queue stall (zero completions in 10 minutes with pending jobs)

### What Degrades vs. Hard-Fails

| Category | Behavior in Degraded Mode |
|----------|---------------------------|
| **Read endpoints** (results, evidence, jobs list, summary) | Stay up — these are stateless DB reads |
| **Auth endpoints** (token exchange) | Stay up — lightweight, no queue dependency |
| **Health endpoint** | Stays up — always responds (may report `degraded` status) |
| **Write/enqueue endpoints** (upload, batch, catalog) | May throttle — queue admission caps tightened, 429s expected |
| **Job processing** | May slow — the processor continues but at reduced throughput |
| **Canary workflow** | Continues running — failures expected, Slack alerts continue |

### Customer Experience During Degradation

- **Reads:** Normal response times, full data access
- **Writes:** Some requests may receive `429 RATE_LIMITED` with a `Retry-After` header. Clients following the API contract (`Retry-After` backoff) will succeed on retry.
- **Jobs:** Longer completion times. Polling clients see `PENDING`/`PROCESSING` for extended periods, but jobs eventually complete.
- **No data loss:** All accepted writes are durable. Throttled requests are rejected cleanly (not silently dropped).

---

## E. On-Call Ownership

### Launch Window (First 7 Days Post-Public)

| Role | Owner | Contact |
|------|-------|---------|
| **Primary on-call** | Dan | Slack: `#rgl8r-ops` (primary), phone: `<PLACEHOLDER>` |
| **Backup on-call** | `<PLACEHOLDER>` | Slack: `#rgl8r-ops`, phone: `<PLACEHOLDER>` |
| **Engineering lead** | Dan | Slack: `#rgl8r-eng`, phone: `<PLACEHOLDER>` |

### Escalation Ladder

Consistent with the P11-F canary runbook's L1/L2/L3 pattern.

| Level | Trigger | Action | Response Time |
|-------|---------|--------|---------------|
| **L1** | Any incident detected (canary failure, customer report, monitoring alert) | Post to the `#rgl8r-ops` Slack channel. Begin triage using Section B. | < 15 minutes |
| **L2** | 3+ consecutive canary failures OR incident unresolved after 30 minutes | Page primary on-call. If no response in 15 min, page backup. | < 15 minutes after L1 |
| **L3** | Sustained outage (>2 hours) OR tenant isolation breach OR data integrity issue | Escalate to the engineering lead. Begin customer comms (Section F). | Immediate after L2 |

### Contact Channels

| Channel | Purpose |
|---------|---------|
| `#rgl8r-ops` (Slack) | Primary incident coordination. All L1 triage starts here. |
| `#rgl8r-eng` (Slack) | Engineering discussion, root cause analysis |
| Phone tree | L2/L3 escalation when Slack response time exceeds 15 minutes |
| GitHub Issues | Persistent incident tracking (hard canary failures auto-create via workflow; SLO-only breaches auto-create only if `CANARY_PAGE_ON_SLO_BREACH=true`) |

### Handoff Procedure

When transferring on-call responsibility between shifts:

1. **Outgoing** posts a handoff summary to `#rgl8r-ops`:
   - Current system status (healthy / degraded / incident)
   - Open incidents with links to GitHub issues
   - Pending actions or watch items
   - Any recent changes (deploys, config updates)
2. **Incoming** acknowledges the handoff in `#rgl8r-ops`
3. **Incoming** verifies access to the Render dashboard, GitHub repo, and Slack channels
4. Both confirm the handoff is complete

---

## F. External Comms Templates

All templates use `<PLACEHOLDER>` syntax for fields that must be filled per-incident.

### Status Update Template

Use for public-facing status updates during an incident. Post to the status page and/or `#rgl8r-ops`.

```
**[<STATUS>]** <INCIDENT_TITLE>

**Time:** <TIMESTAMP_UTC>
**Affected services:** <SERVICE_LIST>
**Impact:** <IMPACT_DESCRIPTION>

**Update:** <DESCRIPTION_OF_CURRENT_STATE>

**Next update:** <NEXT_UPDATE_TIME_UTC> or when status changes.
```

Where `<STATUS>` is one of:

- **INVESTIGATING** — Aware of the issue, diagnosing root cause
- **IDENTIFIED** — Root cause found, working on a fix
- **MONITORING** — Fix deployed, monitoring for recurrence
- **RESOLVED** — Incident resolved, normal operations restored

### Customer Notification: Incident Start

```
Subject: [RGL8R] Service Degradation — <INCIDENT_TITLE>

Hi <CUSTOMER_NAME>,

We are currently experiencing <IMPACT_DESCRIPTION> affecting <SERVICE_LIST>.

What you may notice:
- <SYMPTOM_1>
- <SYMPTOM_2>

Our team is actively investigating. We will provide an update by <NEXT_UPDATE_TIME_UTC>.

If you have questions, please reach out to <SUPPORT_CONTACT>.

— RGL8R Platform Team
```

### Customer Notification: Progress Update

```
Subject: [RGL8R] Update — <INCIDENT_TITLE>

Hi <CUSTOMER_NAME>,

Update on the <INCIDENT_TITLE> incident:

**Status:** <STATUS>
**Root cause:** <ROOT_CAUSE_SUMMARY>
**Current action:** <WHAT_WE_ARE_DOING>
**Expected resolution:** <ETA_OR_NEXT_STEPS>

We will provide another update by <NEXT_UPDATE_TIME_UTC>.

— RGL8R Platform Team
```

### Customer Notification: Resolution

```
Subject: [RGL8R] Resolved — <INCIDENT_TITLE>

Hi <CUSTOMER_NAME>,

The <INCIDENT_TITLE> incident has been resolved.

**Timeline:**
- <START_TIME_UTC>: Issue detected
- <IDENTIFIED_TIME_UTC>: Root cause identified
- <RESOLVED_TIME_UTC>: Fix deployed and verified

**Root cause:** <ROOT_CAUSE_SUMMARY>
**Impact:** <IMPACT_SUMMARY>
**Prevention:** <WHAT_WE_ARE_DOING_TO_PREVENT_RECURRENCE>

We apologize for any inconvenience. If you have questions, please reach out to <SUPPORT_CONTACT>.

— RGL8R Platform Team
```

---

## G. Post-Incident Review

### Mandatory Review Triggers

A post-incident review is **required** for:

- Any tenant isolation breach (regardless of duration)
- Any outage exceeding 1 hour
- Any data integrity issue
- Any incident requiring customer notification

### Incident Summary Template

Complete within 48 hours of incident resolution.

```markdown
# Post-Incident Review: <INCIDENT_TITLE>

**Date:** <DATE>
**Duration:** <START_TIME> — <END_TIME> (<DURATION>)
**Severity:** <BLOCK / HIGH / MED>
**Author:** <NAME>
**Reviewers:** <NAMES>

## Timeline

| Time (UTC) | Event |
|------------|-------|
| HH:MM | <EVENT_DESCRIPTION> |
| HH:MM | <EVENT_DESCRIPTION> |
| ... | ... |

## Impact

- **Users affected:** <COUNT_OR_DESCRIPTION>
- **Services affected:** <SERVICE_LIST>
- **Data impact:** <NONE / DESCRIPTION>
- **SLO impact:** <METRICS>

## Root Cause

<DESCRIPTION_OF_ROOT_CAUSE>

## Resolution

<WHAT_WAS_DONE_TO_FIX>

## Contributing Factors

- <FACTOR_1>
- <FACTOR_2>

## Follow-Up Actions

| Action | Owner | Target Date | Status |
|--------|-------|-------------|--------|
| <ACTION> | <OWNER> | <DATE> | Open |
| <ACTION> | <OWNER> | <DATE> | Open |

## Lessons Learned

- **What went well:** <DESCRIPTION>
- **What could be improved:** <DESCRIPTION>
- **Where we got lucky:** <DESCRIPTION>
```

### Blameless Postmortem Guidelines

1. **Focus on systems, not individuals.** The goal is to improve processes and tooling, not assign blame.
2. **Assume everyone acted with the best information available** at the time of the incident.
3. **Document contributing factors**, not "human error." If a person made a mistake, ask what about the system made that mistake possible or likely.
4. **Prioritize follow-up actions** by impact and feasibility. Every action gets an owner and a target date.
5. **Share the review** with the team. Transparency builds trust and ensures lessons are broadly applied.
6. **Track follow-up actions** in `docs/BACKLOG.md` or GitHub issues. Reviews without follow-through are theater.

---

## Related Documents

- Canary runbook: `docs/operations/public-api-canary-runbook.md`
- SLO baseline: `docs/operations/public-api-slo-baseline.md`
- Staging health runbook: `docs/operations/staging-health-monitoring-runbook.md`
- Staging RLS verification: `docs/operations/staging-force-rls-runbook.md`
- Environment variables: `docs/ENVIRONMENT_VARIABLES.md`
- Tabletop evidence: `docs/operations/p11-g-tabletop-evidence-2026-02-24.md`
- Backlog: `docs/BACKLOG.md`
- Execution plan: `docs/EXECUTION_PLAN.md`