Source: docs/operations/public-api-launch-runbook.md

# Public API Launch Operations Runbook

**Owner:** platform-oncall
**Created:** 2026-02-24
**Last validated:** 2026-02-24
**Plan ID:** P11-G
**Exit gate:** Launch incident/rollback/comms runbooks exist and have completed tabletop evidence

---

## A. Known Launch Blockers / Risks

Items that must be resolved before the public launch gate can be approved. The P11-H go/no-go packet links here directly.

| # | Blocker | Current State | Owner | Target Date | Source |
|---|---------|---------------|-------|-------------|--------|
| 1 | **7-day canary clean run** | Window started 2026-02-24 (P11-F #426 merge). Planned completion 2026-03-03. | platform-oncall | 2026-03-03 | P11-F exit gate |
| 2 | **P11-H approval record completion** | Go/no-go packet draft exists, but approvals, launch window, and rollback authority still require final signoff entries. | Dan | Before launch decision | P11-H exit gate |

**Hard rule:** If blocker #1 (7-day canary clean run) is incomplete, P11-H **cannot** be marked approved for broad public launch.

### Recently Resolved Blocker (for audit traceability)

- `integration_keys` FORCE RLS defense-in-depth issue is now resolved via a constrained pre-auth policy and FORCE RLS re-enable. See `docs/BACKLOG.md` line 102 and PRs #431 + #433.

### Other Open Items (Non-Blocking)

| Item | Severity | Notes |
|------|----------|-------|
| `logIntegrationAuthFailure()` pre-auth audit calls | LOW | Will fail silently under FORCE RLS (already try/catch). Failed auth attempts are also logged via the structured logger. |
| Improve `verify-force-rls-staging.sh` failure output | LOW | Include exact failing table names in stderr/Slack. |
| Unit tests for deep health check logic | LOW | DB check behavior, migration drift detection — covered E2E by the staging health workflow. |

---

## B. Incident Triage

Failure modes are ordered by likelihood during the launch window. Each follows the symptom → diagnostics → root cause → resolution pattern from the P11-F canary runbook.

### B1. High Error Rate (5xx spike)

**Symptom:** Error rate exceeds 5% of requests over a 5-minute window. Canary may report `FAIL_ENQUEUE` or `FAIL_AUTH`. Slack alerts fire from the canary workflow.

**Diagnostics:**

1. Check API health: `curl https://rgl8r-staging-api.onrender.com/health`
2. Check deep health (DB + migrations): `curl "https://rgl8r-staging-api.onrender.com/health?deep=true"`
3. Check the Render dashboard for deploy status, crash loops, memory pressure
4. Check recent deploy history — was a new commit deployed in the last 30 minutes?

**Root Cause Checklist:**

- [ ] Bad deploy (new code introduced a regression)
- [ ] Database connectivity issue (connection pool exhaustion, Postgres restart)
- [ ] External dependency failure (if applicable)
- [ ] Memory/CPU exhaustion on the Render instance
- [ ] Expired or misconfigured environment variables

**Resolution:**

1. If bad deploy → Render rollback (see [Lever 1](#lever-1-render-rollback))
2. If DB connectivity → Check the Render Postgres dashboard, verify `DATABASE_URL`
3. If resource exhaustion → Scale the Render instance or restart the service
4. If env var issue → Fix in the Render dashboard, redeploy

### B2. Degraded Latency

**Symptom:** API responses exceed SLO thresholds (auth >3s, enqueue >5s, job completion >90s). Canary reports an SLO breach. Response times are degraded but requests succeed. By default, SLO-only breaches fail the workflow but do not page.

**Diagnostics:**

1. Check canary timing output: auth, enqueue, and completion latency values
2. Check Render metrics: CPU, memory, response time
3. Check concurrent job count: `GET /api/jobs?status=PROCESSING` (count of in-flight jobs)
4. Check whether a Render free-tier cold start is the cause (first request after an idle period)

**Root Cause Checklist:**

- [ ] Render instance cold start (free tier spins down after inactivity)
- [ ] High concurrent job load (job processor contention)
- [ ] Database query performance degradation (missing indexes, table bloat)
- [ ] Network latency between Render and Postgres

**Resolution:**

1. If cold start → Expected on the free tier; document as transient, close if a single occurrence
2. If job contention → Tighten queue admission caps (see [Lever 6](#lever-6-queue-admission-cap-override))
3. If DB performance → Check for long-running queries, consider `VACUUM ANALYZE`
4. If persistent → Consider a Render plan upgrade for the launch window

### B3. Tenant Isolation Breach

**Symptom:** Data from one tenant is visible to another tenant's API calls. This is a **severity BLOCK** incident.

**Diagnostics:**

1. Immediately run RLS verification: `gh workflow run staging-force-rls-checks.yml`
2. Check whether the affected endpoint uses the `withTenant()` wrapper
3. Check recent deploys for changes to RLS-scoped routes
4. Verify `current_setting('app.current_tenant_id')` is set correctly in the request context

**Root Cause Checklist:**

- [ ] Missing `withTenant()` wrapper on a new or modified route
- [ ] Raw `prisma` query bypassing the tenant GUC
- [ ] RLS policy dropped or altered by a migration
- [ ] FORCE RLS disabled on a table

**Resolution:**

1. **Immediately** disable the affected endpoint if possible (return 503)
2. Run emergency RLS verification (see [Lever 4](#lever-4-emergency-rls-verification))
3. Fix the root cause (add `withTenant()`, restore the RLS policy)
4. Re-run the verification workflow to confirm the fix
5. **This is a mandatory post-incident review trigger** (see [Section G](#g-post-incident-review))

### B4. Job Queue Stall

**Symptom:** Jobs stuck in `PENDING` or `PROCESSING` state. No new jobs completing. Upload/enqueue requests succeed but results never arrive.

**Diagnostics:**

1. Check job status distribution: `GET /api/jobs` (look for many PENDING/PROCESSING, no recent COMPLETED)
2. Check whether the job processor is running (Render service logs — look for `[job-processor]` entries)
3. Check for stuck transactions in Postgres
4. Check queue admission state: are caps being hit?

**Root Cause Checklist:**

- [ ] Job processor crashed or stopped polling
- [ ] Database lock contention (long-running transaction blocking job claims)
- [ ] Job processor failing silently (processing but erroring without marking FAILED)
- [ ] Queue admission caps too tight (new jobs rejected)

**Resolution:**

1. If processor stopped → Restart the Render service
2. If lock contention → Identify and terminate the blocking query
3. If silent failures → Check job processor logs, fix error handling
4. If admission caps → Adjust caps (see [Lever 6](#lever-6-queue-admission-cap-override))

### B5. Auth Failures

**Symptom:** Integration key exchange (`POST /api/auth/token/integration`) returning 401/403. Canary reports `FAIL_AUTH`. Legitimate clients cannot authenticate.

**Diagnostics:**

1. Check the API health endpoint (is the service up?)
2. Check whether the JWT signing keys are valid: verify the `JWT_PRIVATE_KEY` and `JWT_PUBLIC_KEY` env vars
3. Check whether the integration key is valid (not revoked/expired)
4. Check recent deploys for auth middleware changes

**Root Cause Checklist:**

- [ ] JWT keys expired or misconfigured after deploy
- [ ] Auth middleware regression in a new deploy
- [ ] Integration key revoked or rotated without updating the client
- [ ] Clock skew causing JWT validation failures

**Resolution:**

1. If key misconfiguration → Fix env vars in Render, redeploy
2. If auth regression → Render rollback (see [Lever 1](#lever-1-render-rollback))
3. If key revoked → Issue a new key, update the client configuration
4. If clock skew → Restart the service (NTP sync)

### B6. Quota Exhaustion

**Symptom:** Legitimate clients receiving `429 RATE_LIMITED` responses. Canary may report `FAIL_ENQUEUE` with a 429 status.
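When triaging 429s, it helps to read the server's advisory backoff programmatically rather than eyeballing headers. A minimal sketch, assuming response headers were saved with `curl -D`; the `retry_after_seconds` helper is illustrative, not part of the service:

```shell
#!/usr/bin/env bash
# Sketch: extract the Retry-After value (in seconds) from captured HTTP
# response headers, falling back to a default backoff if the header is absent.

retry_after_seconds() {
  # $1: file containing raw response headers (e.g. written by `curl -D`)
  local value
  value=$(awk 'tolower($1) == "retry-after:" { gsub(/\r/, ""); print $2; exit }' "$1")
  # Fall back to a 30s default when the server did not send the header
  echo "${value:-30}"
}

# Example usage (commented; token and URL as in the diagnostics below):
# curl -s -D /tmp/headers.txt -o /dev/null \
#   -H "Authorization: Bearer <jwt>" \
#   "https://rgl8r-staging-api.onrender.com/api/jobs"
# sleep "$(retry_after_seconds /tmp/headers.txt)"
```

This also doubles as a quick check of whether a misbehaving client could simply be ignoring the header.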
**Diagnostics:**

1. Check which quota is being hit (per-key rate limit vs. queue admission cap)
2. Check the `Retry-After` header value in 429 responses
3. Check whether a single client is consuming disproportionate resources
4. Check abuse guard middleware logs for blocked keys

**Root Cause Checklist:**

- [ ] Client retry loop (agent not respecting `Retry-After`)
- [ ] Quota limits too aggressive for actual traffic patterns
- [ ] Burst of legitimate traffic exceeding defaults
- [ ] Abusive client consuming shared capacity

**Resolution:**

1. If abusive client → Revoke the integration key (see [Lever 5](#lever-5-integration-key-revocation))
2. If limits too tight → Adjust env vars and redeploy (see [Lever 6](#lever-6-queue-admission-cap-override))
3. If legitimate burst → Temporarily increase limits, monitor
4. If retry loop → Contact the client, point to the API docs retry guidance

---

## C. Rollback Levers

Each lever is documented as a verified run step with the exact command/path, expected time-to-effect, blast radius, and last-validated date.

### Lever 1: Render Rollback

**Action:** Roll back to a previous known-good deploy via the Render dashboard.

**Exact UI path:**

1. Navigate to the Render dashboard → select the `rgl8r-staging-api` service
2. Click **Manual Deploy** in the top-right
3. Select the prior commit SHA from the dropdown (the last known-good deploy)
4. Click **Deploy**

> Note: This is a UI-driven action. There is no CLI equivalent for Render deploys at this time.

**Time-to-effect:** 2–5 minutes (Render build + deploy cycle)

**Blast radius:**

- Rolls back ALL application code to the selected commit
- Does NOT roll back database migrations (Prisma migrations are forward-only)
- Does NOT affect environment variables (those persist independently)
- Affects all tenants on the Render service

**Last validated:** 2026-02-24

---

### Lever 2: Feature Flag Toggles

**Action:** Disable specific feature subsystems via environment variables without a code rollback.
**Exact commands (set in Render dashboard → Environment):**

| Flag | Value to Disable | Effect |
|------|------------------|--------|
| `SIMA_CRE_MODE` | `off` | Disables CRE attribute-based rule refinement; SIMA returns raw measure outcomes only |
| `TRADE_DETECTOR_MODE` | `legacy` | Reverts TRADE detection to the hardcoded legacy path |
| `SHIP_CARRIER_CONFIG_SOURCE` | `hardcoded` | Reverts SHIP carrier rules to hardcoded defaults, bypassing DB-loaded rules |

After changing any flag, trigger a manual deploy in Render (or the service auto-restarts on env var change, depending on Render config).

**Time-to-effect:** 1–3 minutes (env var update + service restart)

**Blast radius:**

- Each flag is scoped to its subsystem only
- Other modules continue operating normally
- No data loss — flags control read-path behavior, not write-path

**Last validated:** 2026-02-24

---

### Lever 3: Disable Canary Workflow

**Action:** Stop the canary from running (useful when the canary infrastructure itself is broken, not the API).

**Exact command:**

```bash
gh workflow disable public-api-canary.yml
```

**Re-enable:**

```bash
gh workflow enable public-api-canary.yml
```

**Time-to-effect:** Immediate (the next scheduled run will not fire)

**Blast radius:**

- Stops all canary monitoring — no Slack alerts, no GitHub issue upserts
- Does NOT affect the API itself — only the monitoring workflow stops
- The 7-day clean window resets if the workflow stays disabled for a significant time

**Last validated:** 2026-02-24

---

### Lever 4: Emergency RLS Verification

**Action:** Run the staging FORCE RLS verification workflow to confirm tenant isolation is intact.
**Exact command:**

```bash
gh workflow run staging-force-rls-checks.yml
```

**Check results:**

```bash
gh run list --workflow='staging-force-rls-checks.yml' --limit 5
```

**Time-to-effect:** 1–2 minutes (workflow execution)

**Blast radius:**

- Read-only verification — does not modify any database state
- Reports which tables have FORCE RLS enabled/disabled
- Alerts to Slack if configured

**Last validated:** 2026-02-24

---

### Lever 5: Integration Key Revocation

**Action:** Revoke a specific integration key to cut off a misbehaving or compromised client.

**Exact command:**

```bash
curl -X POST "https://rgl8r-staging-api.onrender.com/api/integration-keys/<KEY_ID>/revoke" \
  -H "Authorization: Bearer <jwt>" \
  -H "Content-Type: application/json"
```

> Note: This is a tenant-scoped route (`POST /api/integration-keys/:id/revoke`) that requires admin privileges (`admin:keys` scope). The JWT must belong to a Clerk-authenticated admin of the tenant that owns the key.

**Time-to-effect:** Immediate (the next auth attempt with the revoked key fails)

**Blast radius:**

- Only affects the specific integration key revoked
- Other keys for the same tenant continue working
- The revoked key cannot be un-revoked — a new key must be created

**Last validated:** 2026-02-24

---

### Lever 6: Queue Admission Cap Override

**Action:** Tighten queue admission caps to throttle inbound job volume during incidents.

**Exact commands (set in Render dashboard → Environment):**

| Variable | Default | Emergency Value | Effect |
|----------|---------|-----------------|--------|
| `ENQUEUE_MAX_INFLIGHT_TOTAL` | 80 | 10 | Max total in-flight jobs across all types |
| `ENQUEUE_MAX_INFLIGHT_UPLOADS` | 30 | 5 | Max in-flight upload jobs |
| `ENQUEUE_MAX_INFLIGHT_SIMA` | 20 | 3 | Max in-flight SIMA batch jobs |

After changing: redeploy or wait for the Render auto-restart.
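As an on-call sanity check, current in-flight counts can be compared against the emergency values before and after the override. A minimal sketch under stated assumptions: counts are supplied by hand (or from a `GET /api/jobs` query), and `check_caps` is a hypothetical helper, not part of the service:

```shell
#!/usr/bin/env bash
# Sketch: report which emergency admission caps would already be saturated,
# given in-flight counts. Defaults mirror the "Emergency Value" column; the
# same env vars override them when set.

check_caps() {
  # $1: total in-flight jobs, $2: in-flight uploads, $3: in-flight SIMA batches
  local total=$1 uploads=$2 sima=$3 tripped=0
  [ "$total" -ge "${ENQUEUE_MAX_INFLIGHT_TOTAL:-10}" ] && { echo "TOTAL cap saturated ($total)"; tripped=1; }
  [ "$uploads" -ge "${ENQUEUE_MAX_INFLIGHT_UPLOADS:-5}" ] && { echo "UPLOADS cap saturated ($uploads)"; tripped=1; }
  [ "$sima" -ge "${ENQUEUE_MAX_INFLIGHT_SIMA:-3}" ] && { echo "SIMA cap saturated ($sima)"; tripped=1; }
  [ "$tripped" -eq 0 ] && echo "all caps have headroom"
  return 0
}
```

Any saturated cap means new enqueue requests of that type will be rejected with 429 until in-flight jobs drain.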
**Time-to-effect:** 1–3 minutes (env var update + service restart)

**Blast radius:**

- New enqueue requests exceeding the caps receive `429 RATE_LIMITED` with a `Retry-After` header
- Already in-flight jobs continue processing normally
- Does NOT affect read endpoints (results, evidence, jobs list)

**Last validated:** 2026-02-24

---

## D. Degraded-Mode Operation

### When to Activate

Enter degraded mode when ANY of the following conditions persist for 10+ minutes:

- Error rate >5% of total requests
- Latency >3x SLO targets (auth >9s, enqueue >15s, completion >270s)
- Job queue stall (zero completions in 10 minutes with pending jobs)

### What Degrades vs. Hard-Fails

| Category | Behavior in Degraded Mode |
|----------|---------------------------|
| **Read endpoints** (results, evidence, jobs list, summary) | Stay up — these are stateless DB reads |
| **Auth endpoints** (token exchange) | Stay up — lightweight, no queue dependency |
| **Health endpoint** | Stays up — always responds (may report `degraded` status) |
| **Write/enqueue endpoints** (upload, batch, catalog) | May throttle — queue admission caps tightened, 429s expected |
| **Job processing** | May slow — the processor continues but at reduced throughput |
| **Canary workflow** | Continues running — failures expected, Slack alerts continue |

### Customer Experience During Degradation

- **Reads:** Normal response times, full data access
- **Writes:** Some requests may receive `429 RATE_LIMITED` with a `Retry-After` header. Clients following the API contract (`Retry-After` backoff) will succeed on retry.
- **Jobs:** Longer completion times. Polling clients see `PENDING`/`PROCESSING` for extended periods, but jobs eventually complete.
- **No data loss:** All accepted writes are durable. Throttled requests are rejected cleanly (not silently dropped).

---

## E. On-Call Ownership

### Launch Window (First 7 Days Post-Public)

| Role | Owner | Contact |
|------|-------|---------|
| **Primary on-call** | Dan | Slack: `#rgl8r-ops` (primary), phone: `<PLACEHOLDER>` |
| **Backup on-call** | `<PLACEHOLDER>` | Slack: `#rgl8r-ops`, phone: `<PLACEHOLDER>` |
| **Engineering lead** | Dan | Slack: `#rgl8r-eng`, phone: `<PLACEHOLDER>` |

### Escalation Ladder

Consistent with the P11-F canary runbook's L1/L2/L3 pattern.

| Level | Trigger | Action | Response Time |
|-------|---------|--------|---------------|
| **L1** | Any incident detected (canary failure, customer report, monitoring alert) | Post to the `#rgl8r-ops` Slack channel. Begin triage using Section B. | < 15 minutes |
| **L2** | 3+ consecutive canary failures OR incident unresolved after 30 minutes | Page primary on-call. If no response in 15 min, page backup. | < 15 minutes after L1 |
| **L3** | Sustained outage (>2 hours) OR tenant isolation breach OR data integrity issue | Escalate to the engineering lead. Begin customer comms (Section F). | Immediate after L2 |

### Contact Channels

| Channel | Purpose |
|---------|---------|
| `#rgl8r-ops` (Slack) | Primary incident coordination. All L1 triage starts here. |
| `#rgl8r-eng` (Slack) | Engineering discussion, root cause analysis |
| Phone tree | L2/L3 escalation when Slack response time exceeds 15 minutes |
| GitHub Issues | Persistent incident tracking (hard canary failures auto-create via workflow; SLO-only breaches auto-create only if `CANARY_PAGE_ON_SLO_BREACH=true`) |

### Handoff Procedure

When transferring on-call responsibility between shifts:

1. **Outgoing** posts a handoff summary to `#rgl8r-ops`:
   - Current system status (healthy / degraded / incident)
   - Open incidents with links to GitHub issues
   - Pending actions or watch items
   - Any recent changes (deploys, config updates)
2. **Incoming** acknowledges the handoff in `#rgl8r-ops`
3. **Incoming** verifies access to the Render dashboard, GitHub repo, and Slack channels
4. Both confirm the handoff is complete

---

## F. External Comms Templates

All templates use `<PLACEHOLDER>` syntax for fields that must be filled per-incident.

### Status Update Template

Use for public-facing status updates during an incident. Post to the status page and/or `#rgl8r-ops`.

```
**[<STATUS>]** <INCIDENT_TITLE>

**Time:** <TIMESTAMP_UTC>
**Affected services:** <SERVICE_LIST>
**Impact:** <IMPACT_DESCRIPTION>

**Update:** <DESCRIPTION_OF_CURRENT_STATE>

**Next update:** <NEXT_UPDATE_TIME_UTC> or when status changes.
```

Where `<STATUS>` is one of:

- **INVESTIGATING** — Aware of the issue, diagnosing root cause
- **IDENTIFIED** — Root cause found, working on a fix
- **MONITORING** — Fix deployed, monitoring for recurrence
- **RESOLVED** — Incident resolved, normal operations restored

### Customer Notification: Incident Start

```
Subject: [RGL8R] Service Degradation — <INCIDENT_TITLE>

Hi <CUSTOMER_NAME>,

We are currently experiencing <IMPACT_DESCRIPTION> affecting <SERVICE_LIST>.

What you may notice:
- <SYMPTOM_1>
- <SYMPTOM_2>

Our team is actively investigating. We will provide an update by <NEXT_UPDATE_TIME_UTC>.

If you have questions, please reach out to <SUPPORT_CONTACT>.

— RGL8R Platform Team
```

### Customer Notification: Progress Update

```
Subject: [RGL8R] Update — <INCIDENT_TITLE>

Hi <CUSTOMER_NAME>,

Update on the <INCIDENT_TITLE> incident:

**Status:** <STATUS>
**Root cause:** <ROOT_CAUSE_SUMMARY>
**Current action:** <WHAT_WE_ARE_DOING>
**Expected resolution:** <ETA_OR_NEXT_STEPS>

We will provide another update by <NEXT_UPDATE_TIME_UTC>.

— RGL8R Platform Team
```

### Customer Notification: Resolution

```
Subject: [RGL8R] Resolved — <INCIDENT_TITLE>

Hi <CUSTOMER_NAME>,

The <INCIDENT_TITLE> incident has been resolved.

**Timeline:**
- <START_TIME_UTC>: Issue detected
- <IDENTIFIED_TIME_UTC>: Root cause identified
- <RESOLVED_TIME_UTC>: Fix deployed and verified

**Root cause:** <ROOT_CAUSE_SUMMARY>
**Impact:** <IMPACT_SUMMARY>
**Prevention:** <WHAT_WE_ARE_DOING_TO_PREVENT_RECURRENCE>

We apologize for any inconvenience. If you have questions, please reach out to <SUPPORT_CONTACT>.

— RGL8R Platform Team
```

---

## G. Post-Incident Review

### Mandatory Review Triggers

A post-incident review is **required** for:

- Any tenant isolation breach (regardless of duration)
- Any outage exceeding 1 hour
- Any data integrity issue
- Any incident requiring customer notification

### Incident Summary Template

Complete within 48 hours of incident resolution.

```markdown
# Post-Incident Review: <INCIDENT_TITLE>

**Date:** <DATE>
**Duration:** <START_TIME> — <END_TIME> (<DURATION>)
**Severity:** <BLOCK / HIGH / MED>
**Author:** <NAME>
**Reviewers:** <NAMES>

## Timeline

| Time (UTC) | Event |
|------------|-------|
| HH:MM | <EVENT_DESCRIPTION> |
| HH:MM | <EVENT_DESCRIPTION> |
| ... | ... |

## Impact

- **Users affected:** <COUNT_OR_DESCRIPTION>
- **Services affected:** <SERVICE_LIST>
- **Data impact:** <NONE / DESCRIPTION>
- **SLO impact:** <METRICS>

## Root Cause

<DESCRIPTION_OF_ROOT_CAUSE>

## Resolution

<WHAT_WAS_DONE_TO_FIX>

## Contributing Factors

- <FACTOR_1>
- <FACTOR_2>

## Follow-Up Actions

| Action | Owner | Target Date | Status |
|--------|-------|-------------|--------|
| <ACTION> | <OWNER> | <DATE> | Open |
| <ACTION> | <OWNER> | <DATE> | Open |

## Lessons Learned

- **What went well:** <DESCRIPTION>
- **What could be improved:** <DESCRIPTION>
- **Where we got lucky:** <DESCRIPTION>
```

### Blameless Postmortem Guidelines

1. **Focus on systems, not individuals.** The goal is to improve processes and tooling, not assign blame.
2. **Assume everyone acted with the best information available** at the time of the incident.
3. **Document contributing factors**, not "human error." If a person made a mistake, ask what about the system made that mistake possible or likely.
4. **Prioritize follow-up actions** by impact and feasibility. Every action gets an owner and a target date.
5. **Share the review** with the team. Transparency builds trust and ensures lessons are broadly applied.
6. **Track follow-up actions** in `docs/BACKLOG.md` or GitHub issues. Reviews without follow-through are theater.

---

## Related Documents

- Canary runbook: `docs/operations/public-api-canary-runbook.md`
- SLO baseline: `docs/operations/public-api-slo-baseline.md`
- Staging health runbook: `docs/operations/staging-health-monitoring-runbook.md`
- Staging RLS verification: `docs/operations/staging-force-rls-runbook.md`
- Environment variables: `docs/ENVIRONMENT_VARIABLES.md`
- Tabletop evidence: `docs/operations/p11-g-tabletop-evidence-2026-02-24.md`
- Backlog: `docs/BACKLOG.md`
- Execution plan: `docs/EXECUTION_PLAN.md`