# Public API Canary Runbook
**Owner:** platform-oncall
**Workflow:** `.github/workflows/public-api-canary.yml`
**Script:** `scripts/canary-public-api.sh`
**SLO baseline:** `docs/operations/public-api-slo-baseline.md`
---
## What the Canary Tests
The canary exercises the exact public customer integration path every 30 minutes:
1. **Auth** — `POST /api/auth/token/integration` (integration key → JWT)
2. **Enqueue** — `POST /api/sima/batch` (bounded SKU set, explicit `runPolicy: "always"` + `screeningAuthority: "CA"`, slot-based idempotency key)
3. **Poll** — `GET /api/jobs/:id` (exponential backoff, 120s timeout)
4. **Validate** — `GET /api/sima/results/:sku/evidence` (canary SKU in results)
**Canary tenant:** `00000000-0000-0000-0000-000000000099` (staging)
**Canary SKU:** `CANARY-SKU-001` (must exist in staging seed data)
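The four phases above can be sketched end to end as follows. This is a hedged sketch, not `scripts/canary-public-api.sh` itself: the request/response field names (`integrationKey`, `token`, `jobId`, `status`), the 2s backoff base, and the 1800s slot formula are assumptions, and it requires `jq`. Network calls fire only when invoked with `--run`.

```shell
#!/usr/bin/env bash
# Hedged sketch of the four canary phases; field names and the slot formula
# are assumptions, not the real scripts/canary-public-api.sh.
set -eu

BASE_URL="${CANARY_API_BASE_URL:-https://rgl8r-staging-api.onrender.com}"

# 30-minute idempotency slot: every run inside one schedule window shares a key.
slot_key() { echo "canary-slot-$(( ${1:-$(date +%s)} / 1800 ))"; }

run_canary() {
  # 1. Auth: integration key -> JWT
  jwt=$(curl -sS -X POST "$BASE_URL/api/auth/token/integration" \
    -H "Content-Type: application/json" \
    -d "{\"integrationKey\":\"$CANARY_INTEGRATION_KEY\"}" | jq -r '.token')

  # 2. Enqueue: bounded SKU set, explicit run policy, slot-based idempotency key
  job=$(curl -sS -X POST "$BASE_URL/api/sima/batch" \
    -H "Authorization: Bearer $jwt" -H "Content-Type: application/json" \
    -H "Idempotency-Key: $(slot_key)" \
    -d '{"skus":["CANARY-SKU-001"],"runPolicy":"always","screeningAuthority":"CA"}' \
    | jq -r '.jobId')

  # 3. Poll with exponential backoff inside the 120s budget
  deadline=$(( $(date +%s) + 120 )); delay=2
  while [ "$(date +%s)" -lt "$deadline" ]; do
    status=$(curl -sS "$BASE_URL/api/jobs/$job" -H "Authorization: Bearer $jwt" \
      | jq -r '.status')
    case "$status" in COMPLETED|FAILED) break ;; esac
    sleep "$delay"; delay=$(( delay * 2 ))
  done

  # 4. Validate: evidence must exist for the canary SKU (-f fails on 404)
  curl -sfS "$BASE_URL/api/sima/results/CANARY-SKU-001/evidence" \
    -H "Authorization: Bearer $jwt" >/dev/null
}

# Network calls only when explicitly requested.
if [ "${1:-}" = "--run" ]; then run_canary; fi
```

Note the slot key depends only on wall-clock time, so a manual re-run inside the same 30-minute window intentionally replays the scheduled run's request.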
---
## Prerequisites
### Canary SKU in Staging Seed Data
The canary SKU `CANARY-SKU-001` must exist in staging seed data so that the SIMA batch job can produce results for it.
**Verify (after at least one successful canary run):**
```bash
curl -sS "https://rgl8r-staging-api.onrender.com/api/sima/results/CANARY-SKU-001/evidence" \
-H "Authorization: Bearer <jwt>"
```
Expected: HTTP 200 with evidence data.
**Note:** This endpoint returns 404 until the canary has run at least once and created SIMA results for the SKU, so a 404 before the first canary run is expected and does not by itself indicate missing seed data. To bootstrap, run the canary once manually (`gh workflow run public-api-canary.yml`) and let it complete; the first run creates the SIMA results that subsequent runs validate. A 404 that persists after a completed canary run means the SKU is missing from seed data.
### Required Repo Configuration
| Type | Name | Description |
|------|------|-------------|
| Secret | `CANARY_INTEGRATION_KEY` | Dedicated canary integration key (`sk_int_...`) |
| Variable | `CANARY_API_BASE_URL` | Staging API base URL |
| Secret | `STAGING_CHECKS_SLACK_WEBHOOK_URL` | Slack webhook (shared with health checks) |
| Variable | `STAGING_HEALTH_ALERT_OWNER` | Alert owner GitHub username |
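The table above can be applied with the GitHub CLI; a one-time setup sketch (values are placeholders to fill in):

```shell
# Set canary repo secrets/variables via gh (placeholder values).
gh secret set CANARY_INTEGRATION_KEY --body 'sk_int_...'
gh variable set CANARY_API_BASE_URL --body 'https://rgl8r-staging-api.onrender.com'
gh secret set STAGING_CHECKS_SLACK_WEBHOOK_URL --body '<slack-webhook-url>'
gh variable set STAGING_HEALTH_ALERT_OWNER --body '<github-username>'
```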
---
## Failure Triage
### `CANARY_STATUS=FAIL_AUTH`
**Symptom:** Token exchange returned non-200.
**Steps:**
1. Check staging API health: `curl https://rgl8r-staging-api.onrender.com/health`
2. If API is down: check Render dashboard for deploy/crash status
3. If API is up: verify `CANARY_INTEGRATION_KEY` secret is valid (not expired/revoked)
4. Check if the integration key tenant (`00000000-0000-0000-0000-000000000099`) exists in staging DB
5. Check recent deploys that may have changed auth middleware
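A quick reproduction sketch for the steps above: call the token endpoint directly and map the HTTP status to a likely cause. The endpoint path comes from this runbook; the response handling and the status-to-cause mapping are assumptions.

```shell
# Reproduce the token exchange and print only the HTTP status.
auth_status() {
  curl -sS -o /dev/null -w '%{http_code}' -X POST \
    "https://rgl8r-staging-api.onrender.com/api/auth/token/integration" \
    -H "Content-Type: application/json" \
    -d "{\"integrationKey\":\"$1\"}"
}

# Pure lookup: status -> likely cause (mirrors the triage steps above).
explain_auth_status() {
  case "$1" in
    200) echo "auth OK - canary failure was transient or elsewhere" ;;
    401|403) echo "key invalid/expired/revoked - check CANARY_INTEGRATION_KEY" ;;
    404) echo "route missing - check recent auth middleware/route deploys" ;;
    5*) echo "API-side failure - check Render deploy/crash status" ;;
    *) echo "unexpected status $1 - check staging API logs" ;;
  esac
}
```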
### `CANARY_STATUS=FAIL_ENQUEUE`
**Symptom:** SIMA batch enqueue returned non-202.
**Steps:**
1. Check HTTP status code in canary output:
- **400**: Request payload changed upstream. Check `POST /api/sima/batch` route for body schema changes.
- **409**: Idempotency key conflict (payload mismatch). This shouldn't happen with the canary's deterministic payload — investigate if the slot calculation changed.
- **429**: Rate limit or queue admission cap hit. Check if another process is flooding the staging tenant.
- **401/403**: JWT expired or tenant scope issue. Try re-running the canary.
2. Check staging API logs for the rejected request
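The status decode in step 1 can be kept at hand as a small lookup; a pure function, safe to source in a triage shell, whose hints simply restate the list above:

```shell
# Pure lookup: enqueue HTTP status -> likely cause, per the triage list.
enqueue_status_hint() {
  case "$1" in
    400) echo "payload schema drift - diff POST /api/sima/batch body schema" ;;
    409) echo "idempotency payload mismatch - check slot calculation" ;;
    429) echo "rate limit / queue admission cap - look for a flooding process" ;;
    401|403) echo "JWT expired or tenant scope issue - re-run the canary" ;;
    *) echo "unmapped status $1 - check staging API logs" ;;
  esac
}
```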
### `CANARY_STATUS=FAIL_TIMEOUT`
**Symptom:** Job did not reach terminal state within timeout (default 120s).
**Steps:**
1. Check job status in staging: `GET /api/jobs/<jobId>`
2. If PENDING: Job processor may be stuck/down. Check worker logs.
3. If PROCESSING: Job is running but slow. Check DB load, concurrent job count.
4. Check queue admission state: are there many in-flight jobs for the tenant?
5. Check if the timeout is appropriate — can increase via `CANARY_JOB_TIMEOUT_SECONDS` variable
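To judge whether the timeout is appropriate (step 5), it helps to see how many polls fit a budget. This sketch assumes the poll loop doubles a 2-second delay, which is an assumption about the script, not confirmed behavior:

```shell
# Print the backoff delays that fit within a polling budget (seconds),
# assuming the delay doubles from 2s each iteration.
poll_schedule() {
  local budget="$1" delay=2 elapsed=0 out=""
  while [ $(( elapsed + delay )) -le "$budget" ]; do
    out="$out $delay"; elapsed=$(( elapsed + delay )); delay=$(( delay * 2 ))
  done
  echo "${out# }"
}
```

Under these assumptions the default 120s budget yields five polls with the last gap already 32s, so jobs that routinely finish near the deadline argue for raising `CANARY_JOB_TIMEOUT_SECONDS` rather than tightening the backoff.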
### `CANARY_STATUS=FAIL_JOB`
**Symptom:** Job reached FAILED state.
**Steps:**
1. Get job details: `GET /api/jobs/<jobId>` — check `error` field
2. Common causes:
- Missing SIMA measures data in staging DB
- Worker crash during processing
- DB connectivity issue during job execution
3. Check staging API logs for the job execution
### `CANARY_STATUS=FAIL_SKU_MISSING`
**Symptom:** Job completed but canary SKU not found in SIMA results after one built-in retry.
**Steps:**
1. Check canary diagnostics in artifact/summary:
- `CANARY_JOB_STATUS`
- `CANARY_JOB_PROCESSED`
- `CANARY_JOB_TOTAL_SKUS`
- `CANARY_JOB_ERRORS_SKIPPED`
- `CANARY_JOB_FILE_DUPLICATE`
- `CANARY_JOB_COLUMN_VALIDATION_FAILED`
- `CANARY_SKU_MISSING_COUNT`
- `CANARY_SKU_VERIFY_RETRIES`
2. If job diagnostics indicate parser/schema issues (`columnValidationFailed=true`, `processed=0` with non-zero totals), treat as data/ingestion regression and inspect upload + parser paths first.
3. If diagnostics look healthy but evidence is still 404, verify seed/bootstrap state:
- This endpoint returns 404 until at least one canary run has produced SIMA results for the SKU.
- Re-run canary once manually and re-check.
4. Verify staging seed data includes `CANARY-SKU-001`:
```bash
curl -sS "https://rgl8r-staging-api.onrender.com/api/sima/results/CANARY-SKU-001/evidence" \
-H "Authorization: Bearer <jwt>"
```
5. If still 404: Add canary SKU to seed script and re-seed staging.
6. If data exists but evidence endpoint changed: check for route/response format changes.
### `CANARY_STATUS=FAIL_TRANSPORT`
**Symptom:** curl transport-level failure (DNS resolution, connection refused, TLS handshake, timeout). The script emits `FAIL_TRANSPORT` with the curl exit code and error message.
**Steps:**
1. Check the curl exit code in the canary output artifact (common: 6=DNS, 7=connect refused, 28=timeout, 35=TLS)
2. Verify staging API is reachable: `curl -sS https://rgl8r-staging-api.onrender.com/health`
3. If API is down: check Render dashboard for deploy/crash/scaling status
4. If API is up from your location: check GitHub Actions runner network (rare, but possible runner-to-Render connectivity issue)
5. If transient (single occurrence after a period of green): likely infrastructure blip — document and close
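The exit-code legend in step 1 as a pure lookup; the codes match curl's documented exit codes, while the hint wording is ours:

```shell
# Map common curl exit codes to the failure class named in step 1.
curl_exit_hint() {
  case "$1" in
    6)  echo "DNS: could not resolve host" ;;
    7)  echo "connect: connection refused" ;;
    28) echo "timeout: operation timed out" ;;
    35) echo "TLS: handshake failure" ;;
    *)  echo "other transport error (see 'man curl' EXIT CODES)" ;;
  esac
}
```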
### SLO Breach (Latency/Completion Threshold Exceeded)
**Symptom:** Canary script passed (job completed) but a latency or completion threshold was exceeded. In week-1 strict mode, this still causes a workflow hard failure for launch-gate integrity.
**Alert policy:** SLO-only breaches do **not** page by default (to reduce false pages from transient spikes). Enable paging with repo variable `CANARY_PAGE_ON_SLO_BREACH=true`.
**Steps:**
1. Check which threshold was breached (auth/enqueue/completion) in workflow step summary
2. Check staging API performance: is it under load from other sources?
3. Check Render instance health: free-tier cold starts? memory pressure?
4. If transient: document as expected infrastructure variance, close issue
5. If persistent: investigate root cause, consider threshold adjustment
### Unexpected Replay on Scheduled Run
**Symptom:** `CANARY_REPLAYED=true` on a `schedule` trigger.
**Steps:**
1. Check whether the previous scheduled run overlapped this one (did it run longer than 30 minutes?)
2. Check GitHub Actions runner clock skew
3. Verify idempotency key slot calculation matches expectations
4. If persistent: may need to adjust slot window or check concurrency settings
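For step 3, a sketch of the slot check. The 30-minute (1800s) window matches the schedule cadence, but the exact formula the script uses is an assumption; two run start times replay onto the same idempotency key exactly when they fall in the same slot:

```shell
# Assumed slot formula: floor(epoch seconds / 1800).
slot_for() { echo $(( $1 / 1800 )); }

# Two runs replay iff their start times land in the same slot.
same_slot() { [ "$(slot_for "$1")" = "$(slot_for "$2")" ] && echo yes || echo no; }
```

A scheduled run that starts late enough to cross a slot boundary gets a fresh key, which is why clock skew and overlapping runs are the first things to rule out.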
---
## Escalation Path
| Level | Trigger | Action |
|-------|---------|--------|
| L1 | Hard canary failure (`FAIL_AUTH`, `FAIL_ENQUEUE`, `FAIL_TIMEOUT`, `FAIL_JOB`, `FAIL_REPLAY_ON_SCHEDULE`, etc.) | Automatic Slack notification + GitHub issue upsert |
| L1 | SLO-only breach (`FAIL_SLO_BREACH`) | Workflow fails (launch-gate impact), no page by default |
| L2 | 3+ consecutive failures | Page on-call owner per staging health runbook |
| L3 | Sustained outage (>2 hours) | Escalate to engineering lead |
---
## Rollback Levers
### Disable Canary Workflow
```bash
gh workflow disable public-api-canary.yml
```
Use when: the canary infrastructure itself is broken (not the staging API). Re-enable with `gh workflow enable public-api-canary.yml` once fixed.
### Verify Staging Health Independently
```bash
curl https://rgl8r-staging-api.onrender.com/health
scripts/check-staging-health.sh
```
### Force-Verify RLS
```bash
gh workflow run staging-force-rls-checks.yml
```
---
## Canary Key Management
### Key Properties
- **Tenant:** `00000000-0000-0000-0000-000000000099` (staging)
- **Name:** `P11-F Canary (least-privilege)`
- **Scopes:** `sima:write`, `jobs:read` (intent-documenting; runtime scope enforcement is minimal in v1)
- **Recommended expiry:** 90 days (quarterly rotation)
### Rotation Procedure
1. **Create new key:**
```bash
curl -X POST https://rgl8r-staging-api.onrender.com/api/admin/tenants/00000000-0000-0000-0000-000000000099/integration-keys \
-H "Authorization: Bearer <admin-jwt>" \
-H "Content-Type: application/json" \
-d '{"name": "P11-F Canary (least-privilege)", "scopes": ["sima:write", "jobs:read"]}'
```
2. **Update repo secret:**
```bash
gh secret set CANARY_INTEGRATION_KEY --body '<new-key>'
```
3. **Verify canary passes with new key:**
```bash
gh workflow run public-api-canary.yml
```
4. **Revoke old key** after confirming new key works
5. **Set calendar reminder** for next rotation (quarterly)
### Rotation Schedule
| Quarter | Action |
|---------|--------|
| Q2 2026 | Initial key created with PR merge |
| Q3 2026 | First rotation |
| Ongoing | Rotate quarterly |
---
## Related Documents
- SLO baseline: `docs/operations/public-api-slo-baseline.md`
- Staging health runbook: `docs/operations/staging-health-monitoring-runbook.md`
- Environment variables: `docs/ENVIRONMENT_VARIABLES.md` (canary section)