Public API Launch Runbook
Source: docs/operations/public-api-launch-runbook.md
# Public API Launch Operations Runbook
**Owner:** platform-oncall
**Created:** 2026-02-24
**Last validated:** 2026-02-24
**Plan ID:** P11-G
**Exit gate:** Launch incident/rollback/comms runbooks exist and have completed tabletop evidence
---
## A. Known Launch Blockers / Risks
Items that must be resolved before the public launch gate can be approved. The P11-H go/no-go packet links here directly.
| # | Blocker | Current State | Owner | Target Date | Source |
|---|---------|--------------|-------|-------------|--------|
| 1 | **7-day canary clean run** | Window started 2026-02-24 (P11-F #426 merge). Planned completion 2026-03-03. | platform-oncall | 2026-03-03 | P11-F exit gate |
| 2 | **P11-H approval record completion** | Go/no-go packet draft exists, but approvals, launch window, and rollback authority still require final signoff entries. | Dan | Before launch decision | P11-H exit gate |
**Hard rule:** If blocker #1 (7-day canary clean run) is incomplete, P11-H **cannot** be marked approved for broad public launch.
### Recently Resolved Blocker (for audit traceability)
- `integration_keys` FORCE RLS defense-in-depth issue is now resolved via constrained pre-auth policy and FORCE RLS re-enable. See `docs/BACKLOG.md` line 102 and PRs #431 + #433.
### Other Open Items (Non-Blocking)
| Item | Severity | Notes |
|------|----------|-------|
| `logIntegrationAuthFailure()` pre-auth audit calls | LOW | Calls fail silently under FORCE RLS (already wrapped in try/catch). Failed auth attempts are also logged via the structured logger. |
| Improve `verify-force-rls-staging.sh` failure output | LOW | Include exact failing table names in stderr/Slack. |
| Unit tests for deep health check logic | LOW | DB check behavior, migration drift detection — covered E2E by staging health workflow. |
---
## B. Incident Triage
Failure modes are ordered by likelihood during launch window. Each follows the symptom → diagnostics → root cause → resolution pattern from the P11-F canary runbook.
### B1. High Error Rate (5xx spike)
**Symptom:** Error rate exceeds 5% of requests over a 5-minute window. Canary may report `FAIL_ENQUEUE` or `FAIL_AUTH`. Slack alerts fire from canary workflow.
**Diagnostics:**
1. Check API health: `curl https://rgl8r-staging-api.onrender.com/health`
2. Check deep health (DB + migrations): `curl https://rgl8r-staging-api.onrender.com/health?deep=true`
3. Check Render dashboard for deploy status, crash loops, memory pressure
4. Check recent deploy history — was a new commit deployed in the last 30 minutes?
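Diagnostic steps 1 and 2 can be wrapped in a small helper so a non-200 status fails loudly instead of scrolling past in a terminal. This is an illustrative sketch: the `check_health` name and the 200-means-healthy convention are our assumptions, not a documented API contract.

```bash
# check_health URL -> prints "<status> <url>", returns non-zero unless HTTP 200
check_health() {
  local url="$1" code
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url") || return 1
  printf '%s %s\n' "$code" "$url"
  [ "$code" = "200" ]
}

# Usage during triage (shallow check first, then the deep DB + migration check):
#   check_health "https://rgl8r-staging-api.onrender.com/health"
#   check_health "https://rgl8r-staging-api.onrender.com/health?deep=true"
```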
**Root Cause Checklist:**
- [ ] Bad deploy (new code introduced regression)
- [ ] Database connectivity issue (connection pool exhaustion, Postgres restart)
- [ ] External dependency failure (if applicable)
- [ ] Memory/CPU exhaustion on Render instance
- [ ] Expired or misconfigured environment variables
**Resolution:**
1. If bad deploy → Render rollback (see [Lever 1](#lever-1-render-rollback))
2. If DB connectivity → Check Render Postgres dashboard, verify `DATABASE_URL`
3. If resource exhaustion → Scale Render instance or restart service
4. If env var issue → Fix in Render dashboard, redeploy
### B2. Degraded Latency
**Symptom:** API responses exceed SLO thresholds (auth >3s, enqueue >5s, job completion >90s). Canary reports SLO breach. Response times degraded but requests succeed. By default, SLO-only breaches fail the workflow but do not page.
**Diagnostics:**
1. Check canary timing output: auth, enqueue, and completion latency values
2. Check Render metrics: CPU, memory, response time
3. Check concurrent job count: `GET /api/jobs?status=PROCESSING` (count of in-flight jobs)
4. Check if Render free-tier cold start is the cause (first request after idle period)
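Steps 1 and 2 can be quantified with curl's built-in timers. The helper names below are ours; the 3-second threshold in the usage comment comes from the auth SLO stated above.

```bash
# slo_breach SECONDS THRESHOLD -> true (exit 0) if the latency exceeds the threshold
slo_breach() {
  awk -v t="$1" -v max="$2" 'BEGIN { exit !(t > max) }'
}

# probe_latency URL -> end-to-end request time in seconds (curl's time_total)
probe_latency() {
  curl -s -o /dev/null -w '%{time_total}' --max-time 30 "$1"
}

# Usage: flag an auth-SLO breach (>3s) on the health endpoint
#   t=$(probe_latency "https://rgl8r-staging-api.onrender.com/health")
#   slo_breach "$t" 3 && echo "SLO breach: ${t}s"
```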
**Root Cause Checklist:**
- [ ] Render instance cold start (free tier spins down after inactivity)
- [ ] High concurrent job load (job processor contention)
- [ ] Database query performance degradation (missing indexes, table bloat)
- [ ] Network latency between Render and Postgres
**Resolution:**
1. If cold start → Expected on free tier; document as transient, close if single occurrence
2. If job contention → Tighten queue admission caps (see [Lever 6](#lever-6-queue-admission-cap))
3. If DB performance → Check for long-running queries, consider VACUUM ANALYZE
4. If persistent → Consider Render plan upgrade for launch window
### B3. Tenant Isolation Breach
**Symptom:** Data from one tenant visible to another tenant's API calls. This is a **severity BLOCK** incident.
**Diagnostics:**
1. Immediately run RLS verification: `gh workflow run staging-force-rls-checks.yml`
2. Check if the affected endpoint uses `withTenant()` wrapper
3. Check recent deploys for changes to RLS-scoped routes
4. Verify `current_setting('app.current_tenant_id')` is set correctly in request context
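Alongside the verification workflow, the "FORCE RLS disabled" root-cause check can be run directly against the database. The catalog columns below (`relrowsecurity`, `relforcerowsecurity`) are standard Postgres; the `public` schema filter is an assumption about this app's schema.

```bash
# SQL listing RLS state for every ordinary table in the public schema
RLS_CHECK_SQL="
SELECT relname,
       relrowsecurity      AS rls_enabled,
       relforcerowsecurity AS force_rls
FROM pg_class
WHERE relkind = 'r'
  AND relnamespace = 'public'::regnamespace
ORDER BY relname;
"

# Run against staging (requires a DATABASE_URL with read access):
#   psql "$DATABASE_URL" -c "$RLS_CHECK_SQL"
```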
**Root Cause Checklist:**
- [ ] Missing `withTenant()` wrapper on a new or modified route
- [ ] Raw `prisma` query bypassing tenant GUC
- [ ] RLS policy dropped or altered by a migration
- [ ] FORCE RLS disabled on a table
**Resolution:**
1. **Immediately** disable the affected endpoint if possible (return 503)
2. Run emergency RLS verification (see [Lever 4](#lever-4-emergency-rls-verification))
3. Fix the root cause (add `withTenant()`, restore RLS policy)
4. Re-run verification workflow to confirm fix
5. **This is a mandatory post-incident review trigger** (see [Section G](#g-post-incident-review))
### B4. Job Queue Stall
**Symptom:** Jobs stuck in `PENDING` or `PROCESSING` state. No new jobs completing. Upload/enqueue requests succeed but results never arrive.
**Diagnostics:**
1. Check job status distribution: `GET /api/jobs` (look for many PENDING/PROCESSING, no recent COMPLETED)
2. Check if job processor is running (Render service logs — look for `[job-processor]` entries)
3. Check for stuck transactions in Postgres
4. Check queue admission state: are caps being hit?
**Root Cause Checklist:**
- [ ] Job processor crashed or stopped polling
- [ ] Database lock contention (long-running transaction blocking job claims)
- [ ] Job processor failing silently (processing but erroring without marking FAILED)
- [ ] Queue admission caps too tight (new jobs rejected)
**Resolution:**
1. If processor stopped → Restart Render service
2. If lock contention → Identify and terminate blocking query
3. If silent failures → Check job processor logs, fix error handling
4. If admission caps → Adjust caps (see [Lever 6](#lever-6-queue-admission-cap))
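Diagnostic step 3 (stuck transactions) and resolution step 2 (terminate the blocking query) can be driven from one catalog query. `pg_blocking_pids()` and `pg_stat_activity` are built into Postgres; the 5-minute idle threshold is our illustrative choice, not a documented cutoff.

```bash
# SQL surfacing sessions that are blocking others or idling in a transaction
BLOCKERS_SQL="
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       state,
       now() - xact_start    AS xact_age,
       left(query, 80)       AS query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0
   OR (state = 'idle in transaction' AND now() - xact_start > interval '5 minutes')
ORDER BY xact_age DESC;
"

# Inspect first, then terminate only the confirmed blocker:
#   psql "$DATABASE_URL" -c "$BLOCKERS_SQL"
#   psql "$DATABASE_URL" -c "SELECT pg_terminate_backend(<PID>);"
```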
### B5. Auth Failures
**Symptom:** Integration key exchange (`POST /api/auth/token/integration`) returning 401/403. Canary reports `FAIL_AUTH`. Legitimate clients cannot authenticate.
**Diagnostics:**
1. Check API health endpoint (is the service up?)
2. Check if JWT signing keys are valid: verify `JWT_PRIVATE_KEY` and `JWT_PUBLIC_KEY` env vars
3. Check if the integration key is valid (not revoked/expired)
4. Check recent deploys for auth middleware changes
**Root Cause Checklist:**
- [ ] JWT keys expired or misconfigured after deploy
- [ ] Auth middleware regression in new deploy
- [ ] Integration key revoked or rotated without updating client
- [ ] Clock skew causing JWT validation failures
**Resolution:**
1. If key misconfiguration → Fix env vars in Render, redeploy
2. If auth regression → Render rollback (see [Lever 1](#lever-1-render-rollback))
3. If key revoked → Issue new key, update client configuration
4. If clock skew → Restart service (NTP sync)
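For the clock-skew root cause, comparing the local clock against the server's `Date` response header is usually enough to confirm or rule it out. The helper below is a sketch; the `date -d` parsing shown in the usage comment is GNU-specific.

```bash
# skew_seconds LOCAL_EPOCH SERVER_EPOCH -> absolute difference in seconds
skew_seconds() {
  local d=$(( $1 - $2 ))
  echo "${d#-}"
}

# Usage: compare local time against the API's Date header (GNU date syntax):
#   server=$(curl -sI "https://rgl8r-staging-api.onrender.com/health" \
#              | awk -F': ' 'tolower($1)=="date" {print $2}')
#   skew_seconds "$(date -u +%s)" "$(date -u -d "$server" +%s)"
```

A skew of more than a minute or two is enough to break JWT `iat`/`exp` validation on short-lived tokens.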
### B6. Quota Exhaustion
**Symptom:** Legitimate clients receiving `429 RATE_LIMITED` responses. Canary may report `FAIL_ENQUEUE` with 429 status.
**Diagnostics:**
1. Check which quota is being hit (per-key rate limit vs. queue admission cap)
2. Check `Retry-After` header value in 429 responses
3. Check if a single client is consuming disproportionate resources
4. Check abuse guard middleware logs for blocked keys
**Root Cause Checklist:**
- [ ] Client retry loop (agent not respecting `Retry-After`)
- [ ] Quota limits too aggressive for actual traffic patterns
- [ ] Burst of legitimate traffic exceeding defaults
- [ ] Abusive client consuming shared capacity
**Resolution:**
1. If abusive client → Revoke integration key (see [Lever 5](#lever-5-integration-key-revocation))
2. If limits too tight → Adjust env vars and redeploy (see [Lever 6](#lever-6-queue-admission-cap))
3. If legitimate burst → Temporarily increase limits, monitor
4. If retry loop → Contact client, point to API docs retry guidance
---
## C. Rollback Levers
Each lever is documented as a verified run step with exact command/path, expected time-to-effect, blast radius, and last validated date.
### Lever 1: Render Rollback
**Action:** Roll back to a previous known-good deploy via Render dashboard.
**Exact UI path:**
1. Navigate to Render dashboard → select `rgl8r-staging-api` service
2. Click **Manual Deploy** in the top-right
3. Select the prior commit SHA from the dropdown (the last known-good deploy)
4. Click **Deploy**
> Note: This is a UI-driven action. There is no CLI equivalent for Render deploys at this time.
**Time-to-effect:** 2–5 minutes (Render build + deploy cycle)
**Blast radius:**
- Rolls back ALL application code to the selected commit
- Does NOT roll back database migrations (Prisma migrations are forward-only)
- Does NOT affect environment variables (those persist independently)
- Affects all tenants on the Render service
**Last validated:** 2026-02-24
---
### Lever 2: Feature Flag Toggles
**Action:** Disable specific feature subsystems via environment variables without a code rollback.
**Exact commands (set in Render dashboard → Environment):**
| Flag | Value to Disable | Effect |
|------|-----------------|--------|
| `SIMA_CRE_MODE` | `off` | Disables CRE attribute-based rule refinement; SIMA returns raw measure outcomes only |
| `TRADE_DETECTOR_MODE` | `legacy` | Reverts TRADE detection to hardcoded legacy path |
| `SHIP_CARRIER_CONFIG_SOURCE` | `hardcoded` | Reverts SHIP carrier rules to hardcoded defaults, bypasses DB-loaded rules |
After changing any flag: trigger a manual deploy in Render (or the service auto-restarts on env var change, depending on Render config).
**Time-to-effect:** 1–3 minutes (env var update + service restart)
**Blast radius:**
- Each flag is scoped to its subsystem only
- Other modules continue operating normally
- No data loss — flags control read-path behavior, not write-path
**Last validated:** 2026-02-24
---
### Lever 3: Disable Canary Workflow
**Action:** Stop the canary from running (useful when canary infrastructure itself is broken, not the API).
**Exact command:**
```bash
gh workflow disable public-api-canary.yml
```
**Re-enable:**
```bash
gh workflow enable public-api-canary.yml
```
**Time-to-effect:** Immediate (next scheduled run will not fire)
**Blast radius:**
- Stops all canary monitoring — no Slack alerts, no GitHub issue upserts
- Does NOT affect the API itself — only the monitoring workflow stops
- The 7-day clean window resets if the workflow is disabled for a significant time
**Last validated:** 2026-02-24
---
### Lever 4: Emergency RLS Verification
**Action:** Run the staging FORCE RLS verification workflow to confirm tenant isolation is intact.
**Exact command:**
```bash
gh workflow run staging-force-rls-checks.yml
```
**Check results:**
```bash
gh run list --workflow='staging-force-rls-checks.yml' --limit 5
```
**Time-to-effect:** 1–2 minutes (workflow execution)
**Blast radius:**
- Read-only verification — does not modify any database state
- Reports which tables have FORCE RLS enabled/disabled
- Alerts to Slack if configured
**Last validated:** 2026-02-24
---
### Lever 5: Integration Key Revocation
**Action:** Revoke a specific integration key to cut off a misbehaving or compromised client.
**Exact command:**
```bash
curl -X POST "https://rgl8r-staging-api.onrender.com/api/integration-keys/<KEY_ID>/revoke" \
  -H "Authorization: Bearer <jwt>" \
  -H "Content-Type: application/json"
```
> Note: This is a tenant-scoped route (`POST /api/integration-keys/:id/revoke`) that requires admin privileges (`admin:keys` scope). The JWT must belong to a Clerk-authenticated admin of the tenant that owns the key.
**Time-to-effect:** Immediate (next auth attempt with the revoked key fails)
**Blast radius:**
- Only affects the specific integration key revoked
- Other keys for the same tenant continue working
- The revoked key cannot be un-revoked — a new key must be created
**Last validated:** 2026-02-24
---
### Lever 6: Queue Admission Cap Override
**Action:** Tighten queue admission caps to throttle inbound job volume during incidents.
**Exact commands (set in Render dashboard → Environment):**
| Variable | Default | Emergency Value | Effect |
|----------|---------|----------------|--------|
| `ENQUEUE_MAX_INFLIGHT_TOTAL` | 80 | 10 | Max total in-flight jobs across all types |
| `ENQUEUE_MAX_INFLIGHT_UPLOADS` | 30 | 5 | Max in-flight upload jobs |
| `ENQUEUE_MAX_INFLIGHT_SIMA` | 20 | 3 | Max in-flight SIMA batch jobs |
After changing: redeploy or wait for Render auto-restart.
**Time-to-effect:** 1–3 minutes (env var update + service restart)
**Blast radius:**
- New enqueue requests exceeding caps receive `429 RATE_LIMITED` with `Retry-After` header
- Already in-flight jobs continue processing normally
- Does NOT affect read endpoints (results, evidence, jobs list)
**Last validated:** 2026-02-24
---
## D. Degraded-Mode Operation
### When to Activate
Enter degraded mode when ANY of the following conditions persist for 10+ minutes:
- Error rate >5% of total requests
- Latency >3x SLO targets (auth >9s, enqueue >15s, completion >270s)
- Job queue stall (zero completions in 10 minutes with pending jobs)
### What Degrades vs. Hard-Fails
| Category | Behavior in Degraded Mode |
|----------|--------------------------|
| **Read endpoints** (results, evidence, jobs list, summary) | Stay up — these are stateless DB reads |
| **Auth endpoints** (token exchange) | Stay up — lightweight, no queue dependency |
| **Health endpoint** | Stay up — always responds (may report `degraded` status) |
| **Write/enqueue endpoints** (upload, batch, catalog) | May throttle — queue admission caps tightened, 429s expected |
| **Job processing** | May slow — processor continues but at reduced throughput |
| **Canary workflow** | Continues running — failures expected, Slack alerts continue |
### Customer Experience During Degradation
- **Reads:** Normal response times, full data access
- **Writes:** Some requests may receive `429 RATE_LIMITED` with `Retry-After` header. Clients following the API contract (`Retry-After` backoff) will succeed on retry.
- **Jobs:** Longer completion times. Polling clients see `PENDING`/`PROCESSING` for extended periods but jobs eventually complete.
- **No data loss:** All accepted writes are durable. Throttled requests are rejected cleanly (not silently dropped).
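For clients, the write-path contract above boils down to capped exponential backoff, overridden by `Retry-After` whenever the server provides it. This is a minimal sketch; the helper name and the 2s base / 60s cap are illustrative, not contractual.

```bash
# next_delay ATTEMPT BASE CAP -> capped exponential backoff in seconds.
# If the 429 response carried a Retry-After header, sleep that long instead.
next_delay() {
  local attempt="$1" base="$2" cap="$3"
  local d=$(( base * (1 << attempt) ))
  [ "$d" -gt "$cap" ] && d="$cap"
  echo "$d"
}

# Usage inside a retry loop (try_enqueue is a hypothetical client call):
#   for attempt in 0 1 2 3 4; do
#     try_enqueue && break
#     sleep "$(next_delay "$attempt" 2 60)"
#   done
```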
---
## E. On-Call Ownership
### Launch Window (First 7 Days Post-Public)
| Role | Owner | Contact |
|------|-------|---------|
| **Primary on-call** | Dan | Slack: `#rgl8r-ops` (primary), phone: `<PLACEHOLDER>` |
| **Backup on-call** | `<PLACEHOLDER>` | Slack: `#rgl8r-ops`, phone: `<PLACEHOLDER>` |
| **Engineering lead** | Dan | Slack: `#rgl8r-eng`, phone: `<PLACEHOLDER>` |
### Escalation Ladder
Consistent with P11-F canary runbook's L1/L2/L3 pattern.
| Level | Trigger | Action | Response Time |
|-------|---------|--------|---------------|
| **L1** | Any incident detected (canary failure, customer report, monitoring alert) | Post to `#rgl8r-ops` Slack channel. Begin triage using Section B. | < 15 minutes |
| **L2** | 3+ consecutive canary failures OR incident unresolved after 30 minutes | Page primary on-call. If no response in 15 min, page backup. | < 15 minutes after L1 |
| **L3** | Sustained outage (>2 hours) OR tenant isolation breach OR data integrity issue | Escalate to engineering lead. Begin customer comms (Section F). | Immediate after L2 |
### Contact Channels
| Channel | Purpose |
|---------|---------|
| `#rgl8r-ops` (Slack) | Primary incident coordination. All L1 triage starts here. |
| `#rgl8r-eng` (Slack) | Engineering discussion, root cause analysis |
| Phone tree | L2/L3 escalation when Slack response time exceeds 15 minutes |
| GitHub Issues | Persistent incident tracking (hard canary failures auto-create via workflow; SLO-only breaches auto-create only if `CANARY_PAGE_ON_SLO_BREACH=true`) |
### Handoff Procedure
When transferring on-call responsibility between shifts:
1. **Outgoing** posts a handoff summary to `#rgl8r-ops`:
- Current system status (healthy / degraded / incident)
- Open incidents with links to GitHub issues
- Pending actions or watch items
- Any recent changes (deploys, config updates)
2. **Incoming** acknowledges the handoff in `#rgl8r-ops`
3. **Incoming** verifies access to Render dashboard, GitHub repo, Slack channels
4. Both confirm the handoff is complete
---
## F. External Comms Templates
All templates use `<PLACEHOLDER>` syntax for fields that must be filled per-incident.
### Status Update Template
Use for public-facing status updates during an incident. Post to status page and/or `#rgl8r-ops`.
```
**[<STATUS>]** <INCIDENT_TITLE>
**Time:** <TIMESTAMP_UTC>
**Affected services:** <SERVICE_LIST>
**Impact:** <IMPACT_DESCRIPTION>
**Update:** <DESCRIPTION_OF_CURRENT_STATE>
**Next update:** <NEXT_UPDATE_TIME_UTC> or when status changes.
```
Where `<STATUS>` is one of:
- **INVESTIGATING** — Aware of the issue, diagnosing root cause
- **IDENTIFIED** — Root cause found, working on fix
- **MONITORING** — Fix deployed, monitoring for recurrence
- **RESOLVED** — Incident resolved, normal operations restored
### Customer Notification: Incident Start
```
Subject: [RGL8R] Service Degradation — <INCIDENT_TITLE>
Hi <CUSTOMER_NAME>,
We are currently experiencing <IMPACT_DESCRIPTION> affecting <SERVICE_LIST>.
What you may notice:
- <SYMPTOM_1>
- <SYMPTOM_2>
Our team is actively investigating. We will provide an update by <NEXT_UPDATE_TIME_UTC>.
If you have questions, please reach out to <SUPPORT_CONTACT>.
— RGL8R Platform Team
```
### Customer Notification: Progress Update
```
Subject: [RGL8R] Update — <INCIDENT_TITLE>
Hi <CUSTOMER_NAME>,
Update on the <INCIDENT_TITLE> incident:
**Status:** <STATUS>
**Root cause:** <ROOT_CAUSE_SUMMARY>
**Current action:** <WHAT_WE_ARE_DOING>
**Expected resolution:** <ETA_OR_NEXT_STEPS>
We will provide another update by <NEXT_UPDATE_TIME_UTC>.
— RGL8R Platform Team
```
### Customer Notification: Resolution
```
Subject: [RGL8R] Resolved — <INCIDENT_TITLE>
Hi <CUSTOMER_NAME>,
The <INCIDENT_TITLE> incident has been resolved.
**Timeline:**
- <START_TIME_UTC>: Issue detected
- <IDENTIFIED_TIME_UTC>: Root cause identified
- <RESOLVED_TIME_UTC>: Fix deployed and verified
**Root cause:** <ROOT_CAUSE_SUMMARY>
**Impact:** <IMPACT_SUMMARY>
**Prevention:** <WHAT_WE_ARE_DOING_TO_PREVENT_RECURRENCE>
We apologize for any inconvenience. If you have questions, please reach out to <SUPPORT_CONTACT>.
— RGL8R Platform Team
```
---
## G. Post-Incident Review
### Mandatory Review Triggers
A post-incident review is **required** for:
- Any tenant isolation breach (regardless of duration)
- Any outage exceeding 1 hour
- Any data integrity issue
- Any incident requiring customer notification
### Incident Summary Template
Complete within 48 hours of incident resolution.
```markdown
# Post-Incident Review: <INCIDENT_TITLE>
**Date:** <DATE>
**Duration:** <START_TIME> — <END_TIME> (<DURATION>)
**Severity:** <BLOCK / HIGH / MED>
**Author:** <NAME>
**Reviewers:** <NAMES>
## Timeline
| Time (UTC) | Event |
|------------|-------|
| HH:MM | <EVENT_DESCRIPTION> |
| HH:MM | <EVENT_DESCRIPTION> |
| ... | ... |
## Impact
- **Users affected:** <COUNT_OR_DESCRIPTION>
- **Services affected:** <SERVICE_LIST>
- **Data impact:** <NONE / DESCRIPTION>
- **SLO impact:** <METRICS>
## Root Cause
<DESCRIPTION_OF_ROOT_CAUSE>
## Resolution
<WHAT_WAS_DONE_TO_FIX>
## Contributing Factors
- <FACTOR_1>
- <FACTOR_2>
## Follow-Up Actions
| Action | Owner | Target Date | Status |
|--------|-------|-------------|--------|
| <ACTION> | <OWNER> | <DATE> | Open |
| <ACTION> | <OWNER> | <DATE> | Open |
## Lessons Learned
- **What went well:** <DESCRIPTION>
- **What could be improved:** <DESCRIPTION>
- **Where we got lucky:** <DESCRIPTION>
```
### Blameless Postmortem Guidelines
1. **Focus on systems, not individuals.** The goal is to improve processes and tooling, not assign blame.
2. **Assume everyone acted with the best information available** at the time of the incident.
3. **Document contributing factors**, not "human error." If a person made a mistake, ask what about the system made that mistake possible or likely.
4. **Prioritize follow-up actions** by impact and feasibility. Every action gets an owner and a target date.
5. **Share the review** with the team. Transparency builds trust and ensures lessons are broadly applied.
6. **Track follow-up actions** in `docs/BACKLOG.md` or GitHub issues. Reviews without follow-through are theater.
---
## Related Documents
- Canary runbook: `docs/operations/public-api-canary-runbook.md`
- SLO baseline: `docs/operations/public-api-slo-baseline.md`
- Staging health runbook: `docs/operations/staging-health-monitoring-runbook.md`
- Staging RLS verification: `docs/operations/staging-force-rls-runbook.md`
- Environment variables: `docs/ENVIRONMENT_VARIABLES.md`
- Tabletop evidence: `docs/operations/p11-g-tabletop-evidence-2026-02-24.md`
- Backlog: `docs/BACKLOG.md`
- Execution plan: `docs/EXECUTION_PLAN.md`