Source: docs/operations/p11-g-tabletop-evidence-2026-02-24.md
# P11-G Tabletop Exercise Evidence
**Date:** 2026-02-24
**Exercise type:** Tabletop walkthrough (document-based)
**Scenario:** Staging API returns 503 for 15 minutes during canary run
**Plan ID:** P11-G
**Runbook exercised:** `docs/operations/public-api-launch-runbook.md`
---
## Participants and Roles
| Participant | Role During Exercise |
|-------------|---------------------|
| Dan | Primary on-call (L1 responder), engineering lead |
| Claude Code | Facilitator, scenario driver, documentation |
---
## Scenario Description
The staging API begins returning HTTP 503 errors on all endpoints. The canary workflow detects the failure and fires alerts. The exercise walks through the full detection → triage → escalation → resolution → comms cycle.
**Assumed conditions:**
- Canary workflow is running on its 30-minute schedule
- Slack webhook is configured and delivering to `#rgl8r-ops`
- Render dashboard is accessible
- A recent deploy occurred 20 minutes before the incident
---
## Timestamped Walkthrough
### T+0:00 — Detection
**Event:** Canary workflow run fires on schedule. `POST /api/auth/token/integration` returns HTTP 503. Canary script emits `CANARY_STATUS=FAIL_AUTH` (non-zero exit). The workflow's SLO evaluation step sees the non-zero exit code and emits `fail_reason=FAIL_SCRIPT`.
**Alert fired:** Slack notification delivered to `#rgl8r-ops` via `STAGING_CHECKS_SLACK_WEBHOOK_URL`:
```
[P11-F Canary] Public API canary FAILED (FAIL_SCRIPT). Owner: platform-oncall. Run: https://github.com/rgl8r/platform/actions/runs/<RUN_ID>
```
**GitHub issue upsert:** Issue created/updated with title `[P11-F Canary] Public API end-to-end canary failing` and latest failure details (owner, workflow, run URL, timestamp, runbook links).
**Runbook section exercised:** [B1. High Error Rate](public-api-launch-runbook.md#b1-high-error-rate-5xx-spike), [Canary Runbook — FAIL_AUTH](public-api-canary-runbook.md#canary_statusfail_auth)
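The detection chain above (canary exit code → workflow `fail_reason`) can be sketched as a small shell function. The function and variable names here are illustrative assumptions, not the actual workflow source:

```shell
# Illustrative sketch of the SLO evaluation step's exit-code mapping.
# Names are assumptions; the real workflow may differ.
classify_canary_exit() {
  local exit_code="$1"
  if [ "$exit_code" -eq 0 ]; then
    echo "PASS"
  else
    # Any non-zero canary exit is surfaced as FAIL_SCRIPT; the finer-grained
    # CANARY_STATUS (e.g. FAIL_AUTH) lives in the canary script's own output.
    echo "FAIL_SCRIPT"
  fi
}

classify_canary_exit 1   # prints FAIL_SCRIPT
```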
---
### T+0:02 — L1 Triage Begins
**Action:** On-call (Dan) sees Slack alert, opens the failed workflow run in GitHub Actions.
**Diagnostic steps (from runbook B1):**
1. **Check API health:**
```bash
curl -sS https://rgl8r-staging-api.onrender.com/health
```
   Result: HTTP 503. The service process is down, so Render's load balancer answers with 503 Service Unavailable.
2. **Check deep health:**
```bash
curl -sS "https://rgl8r-staging-api.onrender.com/health?deep=true"
```
   Result: HTTP 503 (same cause: the service process is not running).
3. **Check Render dashboard:** Navigate to Render → `rgl8r-staging-api` → Events tab.
Finding: Latest deploy (20 minutes ago) shows status "Deploy failed — process exited with code 1."
**Decision:** Root cause is a bad deploy. The process crashes on startup after the latest commit.
**Runbook section exercised:** [B1 Diagnostics](public-api-launch-runbook.md#b1-high-error-rate-5xx-spike)
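As a supplement to the raw curl output above, a status-only probe distinguishes an application-level 5xx from a refused connection (curl reports `000` when no HTTP response arrives at all). A minimal sketch:

```shell
# Print only the HTTP status code for a URL; "000" means no HTTP response
# was received (connection refused, DNS failure, or timeout).
check_health() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1" || true
}

check_health https://rgl8r-staging-api.onrender.com/health
```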
---
### T+0:08 — Root Cause Confirmed
**Action:** Review Render deploy logs. The crash log shows:
```
Error: Cannot find module './lib/new-feature'
```
**Root cause:** A new module was referenced in `index.ts` but its file never made it into the build output. TypeScript compilation succeeded, so the deploy looked healthy until the process tried to resolve the missing module at startup and crashed.
**Decision:** This is a code regression from the latest deploy. Proceed with Render rollback to the prior commit.
---
### T+0:10 — Rollback Initiated
**Action:** Execute [Lever 1: Render Rollback](public-api-launch-runbook.md#lever-1-render-rollback).
**Steps taken:**
1. Navigate to Render dashboard → `rgl8r-staging-api`
2. Click **Manual Deploy**
3. Select the prior commit SHA (the last successful deploy before the bad commit)
4. Click **Deploy**
**Expected time-to-effect:** 2–5 minutes.
**Runbook section exercised:** [Lever 1: Render Rollback](public-api-launch-runbook.md#lever-1-render-rollback)
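The UI steps above could in principle be scripted against Render's REST API. The endpoint shape and payload below are assumptions based on Render's public API and should be verified against current Render docs before relying on this:

```shell
# Hypothetical scripted rollback via Render's API (endpoint and payload are
# assumptions, not verified against the live API). DRY_RUN=1 prints the
# request instead of sending it.
render_rollback() {
  local service_id="$1" commit_sha="$2"
  local url="https://api.render.com/v1/services/${service_id}/deploys"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "POST ${url} commitId=${commit_sha}"
    return 0
  fi
  curl -sS -X POST "$url" \
    -H "Authorization: Bearer ${RENDER_API_KEY}" \
    -H "Content-Type: application/json" \
    -d "{\"commitId\": \"${commit_sha}\"}"
}

DRY_RUN=1 render_rollback srv-example abc1234
```

The dashboard path remains the documented lever; an API path would mainly matter if the dashboard itself were unreachable.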
---
### T+0:14 — Service Restored
**Action:** Render deploy completes. Verify service health:
```bash
curl -sS https://rgl8r-staging-api.onrender.com/health
```
Result: `{"status":"ok","timestamp":"2026-02-24T..."}` — HTTP 200.
```bash
curl -sS "https://rgl8r-staging-api.onrender.com/health?deep=true"
```
Result: `{"status":"ok","service":"rgl8r-api","timestamp":"2026-02-24T...","checks":{"db":{"status":"ok","latencyMs":12},"migrations":{"status":"ok","applied":28,"filesystem":28,"pending":0,"pendingNames":[]}}}` — HTTP 200.
**Decision:** Service is restored. Verify canary passes.
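Rather than re-running curl by hand, recovery can be confirmed with a small poll loop. This is a sketch; the interval and timeout are arbitrary choices, not runbook values:

```shell
# Poll a health endpoint until it returns HTTP 200 or the timeout elapses.
# Returns 0 on recovery, 1 on timeout.
wait_for_health() {
  local url="$1" timeout_s="${2:-300}"
  local deadline=$(( $(date +%s) + timeout_s ))
  while [ "$(date +%s)" -lt "$deadline" ]; do
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url" || true)
    [ "$code" = "200" ] && return 0
    sleep 2
  done
  return 1
}

# Example: wait_for_health https://rgl8r-staging-api.onrender.com/health 300 && echo "healthy"
```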
---
### T+0:15 — Canary Verification
**Action:** Trigger manual canary run to confirm the full integration path works:
```bash
gh workflow run public-api-canary.yml
```
Monitor run:
```bash
gh run list --workflow='public-api-canary.yml' --limit 3
```
Result: Canary passes — `CANARY_STATUS=PASS`, all SLO thresholds within bounds.
---
### T+0:16 — Escalation Assessment
**Assessment against escalation ladder:**
- **L1 triggered:** Yes — canary failure alert fired automatically.
- **L2 threshold:** Not reached (only 1 canary failure before resolution, threshold is 3+ consecutive).
- **L3 threshold:** Not reached (outage was <15 minutes, well under 2-hour threshold).
**Decision:** No escalation beyond L1 required. Incident contained within the primary on-call response.
**Runbook section exercised:** [E. On-Call Ownership — Escalation Ladder](public-api-launch-runbook.md#escalation-ladder)
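The ladder evaluation above can be expressed as a simple decision function. The thresholds are taken from this assessment (3+ consecutive failures for L2, a 2-hour outage for L3); the encoding itself is a sketch, not runbook tooling:

```shell
# Map consecutive canary failures and outage duration to an escalation level.
escalation_level() {
  local consecutive_failures="$1" outage_minutes="$2"
  if [ "$outage_minutes" -ge 120 ]; then
    echo "L3"
  elif [ "$consecutive_failures" -ge 3 ]; then
    echo "L2"
  else
    echo "L1"
  fi
}

escalation_level 1 15   # this incident: prints L1
```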
---
### T+0:17 — Comms Decision
**Assessment:** The outage lasted ~15 minutes and affected staging only (no production customers during pre-launch canary period). Customer notification is not required at this stage.
**If this were production with active customers, the following would apply:**
1. **Incident start notification** would have been sent at T+0:05 (within 5 minutes of confirmed customer impact):
```
Subject: [RGL8R] Service Degradation — API Unavailable (503)
We are currently experiencing service unavailability affecting all API endpoints.
Our team is actively investigating. We will provide an update by <T+0:35 UTC>.
```
2. **Resolution notification** would have been sent at T+0:15:
```
Subject: [RGL8R] Resolved — API Unavailable (503)
Timeline:
- T+0:00: Issue detected by automated canary
- T+0:08: Root cause identified (bad deploy)
- T+0:14: Fix deployed (rollback to prior version)
Root cause: A code change introduced an invalid module reference causing startup crash.
Prevention: Adding startup smoke test to CI pipeline.
```
**Runbook section exercised:** [F. External Comms Templates](public-api-launch-runbook.md#f-external-comms-templates)
---
### T+0:20 — Incident Closure
**Actions:**
1. GitHub issue auto-created by canary updated with resolution note
2. Handoff summary posted to `#rgl8r-ops`:
```
Incident resolved: Staging API 503 for ~15 min caused by bad deploy.
Rollback to prior commit successful. Canary green.
Follow-up: Fix the broken import in the offending commit before re-deploying.
```
3. Post-incident review determination: Not mandatory (staging-only, <1 hour, no data impact) but recommended for process improvement.
---
## Decisions Made
| Time | Decision | Rationale |
|------|----------|-----------|
| T+0:08 | Rollback via Render (not code fix) | Fastest path to restoration. Code fix can be done after service is restored. |
| T+0:10 | Use Lever 1 (Render rollback) over Lever 2 (feature flags) | Issue was a startup crash, not a feature regression. Feature flags wouldn't help — the process never started. |
| T+0:16 | No L2 escalation | Single failure, resolved in <15 min. L2 threshold (3+ consecutive) not met. |
| T+0:17 | No customer notification (staging context) | Pre-launch canary period, no external customers affected. Documented what production comms would look like. |
---
## Follow-Up Actions
| Action | Owner | Target Date | Status |
|--------|-------|-------------|--------|
| Fix broken import in offending commit before re-deploying | Developer (whoever authored the commit) | Next business day | Open |
| Consider adding a startup smoke test to CI (import all entry points) | Dan | Before P11-H | Closed — added in PR #438 |
| Verify canary resumes clean runs on next 3 scheduled ticks | platform-oncall | T+1:30 | Open |
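The CI startup smoke test in the table above was implemented in PR #438. One possible shape for such a check is a static pass over the build output that verifies every relative `require()` resolves to a file; this is a hypothetical sketch, not the PR's actual implementation:

```shell
# Report any relative require('./...') in a bundle whose target file is
# missing. Layout assumptions: CommonJS output, .js files or directory index.js.
check_requires() {
  local bundle="$1" dir
  dir=$(dirname "$bundle")
  grep -o "require('\./[^']*')" "$bundle" \
    | sed "s/require('\(.*\)')/\1/" \
    | while read -r mod; do
        if [ ! -e "${dir}/${mod}.js" ] && [ ! -e "${dir}/${mod}/index.js" ] \
           && [ ! -e "${dir}/${mod}" ]; then
          echo "MISSING: ${mod}"
        fi
      done
}
```

Against the broken bundle from this incident, such a check would flag `./lib/new-feature` before the deploy rather than at runtime.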
---
## Evidence Links
| Runbook Section | Exercised? | Notes |
|-----------------|-----------|-------|
| A. Known Launch Blockers | Reviewed | Confirmed `integration_keys` FORCE RLS blocker is documented and tracked |
| B1. High Error Rate | Yes | Full triage walkthrough: health check → Render dashboard → root cause |
| B3. Tenant Isolation Breach | Reviewed | Not triggered in this scenario; confirmed RLS verification lever is documented |
| C. Lever 1 (Render Rollback) | Yes | Full rollback executed: UI path → deploy → verify |
| C. Lever 3 (Disable Canary) | Reviewed | Not needed in this scenario; confirmed command is documented |
| C. Lever 4 (Emergency RLS) | Reviewed | Not triggered; confirmed workflow command is documented |
| D. Degraded-Mode Operation | Reviewed | Activation criteria reviewed; not triggered (full outage, not degradation) |
| E. Escalation Ladder | Yes | L1/L2/L3 thresholds evaluated against scenario |
| F. External Comms | Yes | Templates reviewed; mock comms drafted for production context |
| G. Post-Incident Review | Reviewed | Template and blameless guidelines reviewed |
---
## Exercise Conclusions
1. **Detection was fast.** Canary's 30-minute schedule means worst-case detection delay is 30 minutes. In this scenario, the canary happened to run 20 minutes after the bad deploy — reasonable.
2. **Triage was straightforward.** The runbook's diagnostic steps (health check → Render dashboard → deploy logs) led directly to root cause.
3. **Rollback lever worked.** Render rollback (Lever 1) restored service in ~4 minutes, within the documented 2–5 minute estimate.
4. **Escalation ladder was not over-triggered.** The L2/L3 thresholds correctly prevented unnecessary escalation for a quickly-resolved incident.
5. **Comms templates are ready.** Templates have clear `<PLACEHOLDER>` fields and cover the full incident lifecycle.
6. **Gap identified and closed:** CI-level startup smoke test was missing during the exercise, and was implemented in PR [#438](https://github.com/rgl8r/platform/pull/438).
This exercise validates that the P11-G launch runbook provides a complete detection → triage → resolution → comms workflow suitable for the public launch window.