Source: docs/operations/public-api-canary-runbook.md

# Public API Canary Runbook

**Owner:** platform-oncall
**Workflow:** `.github/workflows/public-api-canary.yml`
**Script:** `scripts/canary-public-api.sh`
**SLO baseline:** `docs/operations/public-api-slo-baseline.md`

---

## What the Canary Tests

The canary exercises the exact public customer integration path every 30 minutes:

1. **Auth** — `POST /api/auth/token/integration` (integration key → JWT)
2. **Enqueue** — `POST /api/sima/batch` (bounded SKU set, explicit `runPolicy: "always"` + `screeningAuthority: "CA"`, slot-based idempotency key)
3. **Poll** — `GET /api/jobs/:id` (exponential backoff, 120s timeout)
4. **Validate** — `GET /api/sima/results/:sku/evidence` (canary SKU in results)

**Canary tenant:** `00000000-0000-0000-0000-000000000099` (staging)
**Canary SKU:** `CANARY-SKU-001` (must exist in staging seed data)

---

## Prerequisites

### Canary SKU in Staging Seed Data

The canary SKU `CANARY-SKU-001` must exist in staging seed data so that the SIMA batch job can produce results for it.

**Verify (after at least one successful canary run):**

```bash
curl -sS "https://rgl8r-staging-api.onrender.com/api/sima/results/CANARY-SKU-001/evidence" \
  -H "Authorization: Bearer <jwt>"
```

Expected: HTTP 200 with evidence data. If 404, the SKU either hasn't been processed yet (run the canary once first) or is missing from seed data.

**Note:** This endpoint returns 404 until the canary has run at least once and created SIMA results for the SKU. A 404 before the first canary run is expected — it does not indicate missing seed data. To bootstrap, run the canary once manually (`gh workflow run public-api-canary.yml`) and let it complete. The first run creates the SIMA results that subsequent runs validate.
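The slot-based idempotency key used in the enqueue step (see "What the Canary Tests" above) is worth illustrating concretely, since both `FAIL_ENQUEUE` 409s and unexpected replays trace back to it. The sketch below is a hedged illustration, not the canonical derivation in `scripts/canary-public-api.sh`: the 30-minute slot width, the `canary-slot-` prefix, and the function name are all assumptions.

```bash
# Hypothetical sketch of a slot-based idempotency key (assumed scheme; the
# real derivation in scripts/canary-public-api.sh may differ).
# The key is stable within one 30-minute schedule window, so a retried run
# inside the same window replays the original enqueue instead of creating
# a duplicate job.
canary_idem_key() {
  local epoch="${1:-$(date -u +%s)}"
  local slot=$(( epoch / 1800 ))   # 1800s = the 30-minute schedule cadence
  printf 'canary-slot-%d\n' "$slot"
}
```

Under this scheme, two runs inside the same 30-minute window derive the same key (so a retry replays the original enqueue), while a run in the next window gets a fresh key.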
### Required Repo Configuration

| Type | Name | Description |
|------|------|-------------|
| Secret | `CANARY_INTEGRATION_KEY` | Dedicated canary integration key (`sk_int_...`) |
| Variable | `CANARY_API_BASE_URL` | Staging API base URL |
| Secret | `STAGING_CHECKS_SLACK_WEBHOOK_URL` | Slack webhook (shared with health checks) |
| Variable | `STAGING_HEALTH_ALERT_OWNER` | Alert owner GitHub username |

---

## Failure Triage

### `CANARY_STATUS=FAIL_AUTH`

**Symptom:** Token exchange returned non-200.

**Steps:**

1. Check staging API health: `curl https://rgl8r-staging-api.onrender.com/health`
2. If API is down: check Render dashboard for deploy/crash status
3. If API is up: verify `CANARY_INTEGRATION_KEY` secret is valid (not expired/revoked)
4. Check if the integration key tenant (`00000000-0000-0000-0000-000000000099`) exists in staging DB
5. Check recent deploys that may have changed auth middleware

### `CANARY_STATUS=FAIL_ENQUEUE`

**Symptom:** SIMA batch enqueue returned non-202.

**Steps:**

1. Check HTTP status code in canary output:
   - **400**: Request payload changed upstream. Check `POST /api/sima/batch` route for body schema changes.
   - **409**: Idempotency key conflict (payload mismatch). This shouldn't happen with the canary's deterministic payload — investigate if the slot calculation changed.
   - **429**: Rate limit or queue admission cap hit. Check if another process is flooding the staging tenant.
   - **401/403**: JWT expired or tenant scope issue. Try re-running the canary.
2. Check staging API logs for the rejected request

### `CANARY_STATUS=FAIL_TIMEOUT`

**Symptom:** Job did not reach terminal state within timeout (default 120s).

**Steps:**

1. Check job status in staging: `GET /api/jobs/<jobId>`
2. If PENDING: Job processor may be stuck/down. Check worker logs.
3. If PROCESSING: Job is running but slow. Check DB load, concurrent job count.
4. Check queue admission state: are there many in-flight jobs for the tenant?
5. Check if the timeout is appropriate — it can be increased via the `CANARY_JOB_TIMEOUT_SECONDS` variable

### `CANARY_STATUS=FAIL_JOB`

**Symptom:** Job reached FAILED state.

**Steps:**

1. Get job details: `GET /api/jobs/<jobId>` — check `error` field
2. Common causes:
   - Missing SIMA measures data in staging DB
   - Worker crash during processing
   - DB connectivity issue during job execution
3. Check staging API logs for the job execution

### `CANARY_STATUS=FAIL_SKU_MISSING`

**Symptom:** Job completed but canary SKU not found in SIMA results after one built-in retry.

**Steps:**

1. Check canary diagnostics in artifact/summary:
   - `CANARY_JOB_STATUS`
   - `CANARY_JOB_PROCESSED`
   - `CANARY_JOB_TOTAL_SKUS`
   - `CANARY_JOB_ERRORS_SKIPPED`
   - `CANARY_JOB_FILE_DUPLICATE`
   - `CANARY_JOB_COLUMN_VALIDATION_FAILED`
   - `CANARY_SKU_MISSING_COUNT`
   - `CANARY_SKU_VERIFY_RETRIES`
2. If job diagnostics indicate parser/schema issues (`columnValidationFailed=true`, `processed=0` with non-zero totals), treat as a data/ingestion regression and inspect upload + parser paths first.
3. If diagnostics look healthy but evidence is still 404, verify seed/bootstrap state:
   - This endpoint returns 404 until at least one canary run has produced SIMA results for the SKU.
   - Re-run the canary once manually and re-check.
4. Verify staging seed data includes `CANARY-SKU-001`:

   ```bash
   curl -sS "https://rgl8r-staging-api.onrender.com/api/sima/results/CANARY-SKU-001/evidence" \
     -H "Authorization: Bearer <jwt>"
   ```

5. If still 404: add the canary SKU to the seed script and re-seed staging.
6. If data exists but the evidence endpoint changed: check for route/response format changes.

### `CANARY_STATUS=FAIL_TRANSPORT`

**Symptom:** curl transport-level failure (DNS resolution, connection refused, TLS handshake, timeout). The script emits `FAIL_TRANSPORT` with the curl exit code and error message.

**Steps:**

1. Check the curl exit code in the canary output artifact (common: 6=DNS, 7=connect refused, 28=timeout, 35=TLS)
2. Verify staging API is reachable: `curl -sS https://rgl8r-staging-api.onrender.com/health`
3. If API is down: check Render dashboard for deploy/crash/scaling status
4. If API is up from your location: check GitHub Actions runner network (rare, but possible runner-to-Render connectivity issue)
5. If transient (single occurrence after a period of green): likely an infrastructure blip — document and close

### SLO Breach (Latency/Completion Threshold Exceeded)

**Symptom:** Canary script passed (job completed) but a latency or completion threshold was exceeded. In week-1 strict mode, this still causes a workflow hard failure for launch-gate integrity.

**Alert policy:** SLO-only breaches do **not** page by default (to reduce false pages from transient spikes). Enable paging with repo variable `CANARY_PAGE_ON_SLO_BREACH=true`.

**Steps:**

1. Check which threshold was breached (auth/enqueue/completion) in the workflow step summary
2. Check staging API performance: is it under load from other sources?
3. Check Render instance health: free-tier cold starts? memory pressure?
4. If transient: document as expected infrastructure variance, close the issue
5. If persistent: investigate root cause, consider threshold adjustment

### Unexpected Replay on Scheduled Run

**Symptom:** `CANARY_REPLAYED=true` on a `schedule` trigger.

**Steps:**

1. Check if the previous scheduled run overlapped (ran > 30 minutes?)
2. Check GitHub Actions runner clock skew
3. Verify the idempotency key slot calculation matches expectations
4. If persistent: may need to adjust the slot window or check concurrency settings

---

## Escalation Path

| Level | Trigger | Action |
|-------|---------|--------|
| L1 | Hard canary failure (`FAIL_AUTH`, `FAIL_ENQUEUE`, `FAIL_TIMEOUT`, `FAIL_JOB`, `FAIL_REPLAY_ON_SCHEDULE`, etc.) | Automatic Slack notification + GitHub issue upsert |
| L1 | SLO-only breach (`FAIL_SLO_BREACH`) | Workflow fails (launch-gate impact), no page by default |
| L2 | 3+ consecutive failures | Page on-call owner per staging health runbook |
| L3 | Sustained outage (>2 hours) | Escalate to engineering lead |

---

## Rollback Levers

### Disable Canary Workflow

```bash
gh workflow disable public-api-canary.yml
```

Use when: canary infrastructure itself is broken (not the staging API).

### Verify Staging Health Independently

```bash
curl https://rgl8r-staging-api.onrender.com/health
scripts/check-staging-health.sh
```

### Force-Verify RLS

```bash
gh workflow run staging-force-rls-checks.yml
```

---

## Canary Key Management

### Key Properties

- **Tenant:** `00000000-0000-0000-0000-000000000099` (staging)
- **Name:** `P11-F Canary (least-privilege)`
- **Scopes:** `sima:write`, `jobs:read` (intent-documenting; runtime scope enforcement is minimal in v1)
- **Recommended expiry:** 90 days (quarterly rotation)

### Rotation Procedure

1. **Create new key:**

   ```bash
   curl -X POST https://rgl8r-staging-api.onrender.com/api/admin/tenants/00000000-0000-0000-0000-000000000099/integration-keys \
     -H "Authorization: Bearer <admin-jwt>" \
     -H "Content-Type: application/json" \
     -d '{"name": "P11-F Canary (least-privilege)", "scopes": ["sima:write", "jobs:read"]}'
   ```

2. **Update repo secret:**

   ```bash
   gh secret set CANARY_INTEGRATION_KEY --body '<new-key>'
   ```

3. **Verify canary passes with new key:**

   ```bash
   gh workflow run public-api-canary.yml
   ```

4. **Revoke old key** after confirming the new key works
5. **Set calendar reminder** for next rotation (quarterly)

### Rotation Schedule

| Quarter | Action |
|---------|--------|
| Q2 2026 | Initial key created with PR merge |
| Q3 2026 | First rotation |
| Ongoing | Rotate quarterly |

---

## Related Documents

- SLO baseline: `docs/operations/public-api-slo-baseline.md`
- Staging health runbook: `docs/operations/staging-health-monitoring-runbook.md`
- Environment variables: `docs/ENVIRONMENT_VARIABLES.md` (canary section)
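As a quick-reference companion to the `FAIL_TRANSPORT` triage steps above, the common curl exit codes can be decoded with a small helper. This is an illustrative sketch only: the function does not exist in `scripts/canary-public-api.sh`, and the hint wording is paraphrased from this runbook.

```bash
# Hypothetical triage helper (not part of the repo's scripts).
# Maps a curl exit code captured in the canary artifact to a first
# triage hint, mirroring the FAIL_TRANSPORT section of this runbook.
curl_exit_hint() {
  case "$1" in
    0)  echo "transport OK: the failure was at the HTTP layer, not curl" ;;
    6)  echo "DNS resolution failed: check hostname and Render DNS" ;;
    7)  echo "connection refused: API down or not listening" ;;
    28) echo "timed out: cold start, hung API, or network path issue" ;;
    35) echo "TLS handshake failed: certificate or TLS config issue" ;;
    *)  echo "exit code $1: see the EXIT CODES section of man curl" ;;
  esac
}
```

Example: `curl_exit_hint 7` points straight at the "connection refused" branch, which corresponds to step 3 of the `FAIL_TRANSPORT` triage (check the Render dashboard).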