
Source: docs/operations/public-api-slo-baseline.md

# Public API SLO Baseline

**Published:** 2026-02-23
**Owner:** platform-oncall
**Canary workflow:** `.github/workflows/public-api-canary.yml`
**Dashboard:** GitHub Actions workflow run history for `public-api-canary.yml`

---

## Service Level Objectives

| Metric | Target | Measurement |
|--------|--------|-------------|
| **Availability** | 99.5% canary success rate (rolling 7 days) | % of canary runs with `CANARY_STATUS=PASS` |
| **Auth latency** | p95 ≤ 3,000 ms | `CANARY_AUTH_LATENCY_MS` from canary output |
| **Enqueue latency** | p95 ≤ 5,000 ms | `CANARY_ENQUEUE_LATENCY_MS` from canary output |
| **Job completion (SIMA batch, bounded)** | p95 ≤ 90 s | `CANARY_COMPLETION_SECONDS` from canary output |

### Measurement Cadence

- **Canary frequency:** Every 30 minutes (48 runs/day)
- **7-day window:** 48 runs/day × 7 days = 336 runs
- **7-day clean gate:** Zero unresolved failures required for launch

---

## Alerting Policy

### Week-1 (Pre-Launch Strict Mode)

Any SLO breach or hard failure fails the workflow immediately, which preserves launch-gate integrity.

| Condition | Action | Response time |
|-----------|--------|---------------|
| Hard failure (HTTP error, timeout, FAILED job) | Workflow fails → Slack + GitHub issue upsert | Immediate |
| SLO breach (any threshold exceeded) | Workflow fails (gate stays strict). Paging is suppressed by default to reduce false pages from transient spikes. | Immediate |
| Replay on scheduled run (slot overlap/drift) | Workflow fails → Slack + GitHub issue upsert | Immediate |
| 3+ consecutive hard failures | Escalation per on-call runbook | Manual |

To page on SLO-only breaches, set the repo variable `CANARY_PAGE_ON_SLO_BREACH=true`.

### Post-Launch (Future)

SLO breach handling can be relaxed to sustained-breach escalation (2+ consecutive breaches) via `github-script` querying prior workflow runs (`actions.listWorkflowRuns`). This is not implemented in the initial PR: immediate failure is sufficient and avoids the fragility of artifact-based state.

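
The sustained-breach rule described above has not been implemented, so the following is only a minimal sketch of the decision logic, written in Python rather than `github-script` for readability. It assumes each run is compared directly against the SLO thresholds and that the caller supplies the breach history ordered oldest to newest. The names `DEFAULT_THRESHOLDS`, `run_breaches_slo`, and `is_sustained_breach`, and the metric keys, are hypothetical and do not come from the canary script.

```python
from typing import Mapping, Sequence

# Default limits copied from the SLO table; the key names are hypothetical.
DEFAULT_THRESHOLDS = {
    "auth_latency_ms": 3000,
    "enqueue_latency_ms": 5000,
    "completion_seconds": 90,
}


def run_breaches_slo(
    metrics: Mapping[str, float],
    thresholds: Mapping[str, float] = DEFAULT_THRESHOLDS,
) -> bool:
    """Return True if any measured value exceeds its threshold."""
    return any(metrics[name] > limit for name, limit in thresholds.items())


def is_sustained_breach(
    breach_history: Sequence[bool],
    required_consecutive: int = 2,
) -> bool:
    """Return True if the newest `required_consecutive` runs all breached.

    `breach_history` is ordered oldest to newest (an assumption of this sketch).
    """
    if required_consecutive < 1:
        raise ValueError("required_consecutive must be at least 1")
    if len(breach_history) < required_consecutive:
        return False
    return all(breach_history[-required_consecutive:])
```

With this shape, a single transient spike (`[False, False, True]`) would not escalate, while two breaches in a row (`[False, True, True]`) would.
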

---

## Canary Test Path

The canary exercises the exact public customer integration path:

1. `POST /api/auth/token/integration`: integration key → JWT exchange
2. `POST /api/sima/batch`: enqueue bounded SKU job with explicit `runPolicy: "always"`, `screeningAuthority: "CA"`, and slot-based idempotency key
3. `GET /api/jobs/:id`: poll with exponential backoff until terminal state
4. `GET /api/sima/results/:sku/evidence`: validate canary SKU appears in results

### Design Decisions

- **Bounded SKU list** (`["CANARY-SKU-001"]`): Deterministic runtime for reliable SLO measurement. Avoids unbounded full-tenant scans.
- **Slot-based idempotency** (`canary-<30min-bucket>`): One fresh job per schedule tick; manual reruns within a slot safely replay.
- **Explicit request fields**: Pinned `runPolicy` and `screeningAuthority` prevent future default changes from silently altering canary signal.
- **Post-completion SKU validation**: Catches stale/missing seed data without relying on enqueue rejection.

---

## Thresholds (Configurable)

Thresholds are stored as GitHub repo variables and can be tuned without code changes:

| Variable | Default | Description |
|----------|---------|-------------|
| `CANARY_SLO_AUTH_LATENCY_MS` | 3000 | Max auth latency before SLO breach |
| `CANARY_SLO_ENQUEUE_LATENCY_MS` | 5000 | Max enqueue latency before SLO breach |
| `CANARY_SLO_COMPLETION_SECONDS` | 90 | Max job completion time before SLO breach |

---

## Related Documents

- Canary runbook: `docs/operations/public-api-canary-runbook.md`
- Canary script: `scripts/canary-public-api.sh`
- Workflow: `.github/workflows/public-api-canary.yml`
- Environment variables: `docs/ENVIRONMENT_VARIABLES.md` (canary section)
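
Two mechanisms from the Canary Test Path section, the slot-based idempotency key and exponential-backoff polling, can be illustrated with a short sketch. This is not the logic in `scripts/canary-public-api.sh`. The bucket encoding (epoch seconds divided by 1,800), the function names, and the backoff parameters (2 s initial delay, doubling, 15 s cap, 90 s budget) are all assumptions chosen for illustration.

```python
import time
from typing import Iterator, Optional

SLOT_SECONDS = 30 * 60  # one slot per 30-minute schedule tick


def canary_idempotency_key(now_epoch_seconds: Optional[float] = None) -> str:
    """Build a `canary-<30min-bucket>` key.

    Assumption: the bucket is the epoch time floor-divided by the slot
    length, so every call inside the same 30-minute slot gets the same key.
    """
    if now_epoch_seconds is None:
        now_epoch_seconds = time.time()
    bucket = int(now_epoch_seconds // SLOT_SECONDS)
    return f"canary-{bucket}"


def backoff_delays(
    initial: float = 2.0,
    factor: float = 2.0,
    cap: float = 15.0,
    budget: float = 90.0,
) -> Iterator[float]:
    """Yield poll delays that grow by `factor`, capped at `cap` seconds.

    The delays never sum to more than `budget`; all defaults are hypothetical.
    """
    elapsed = 0.0
    delay = initial
    while elapsed < budget:
        wait = min(delay, cap, budget - elapsed)
        yield wait
        elapsed += wait
        delay *= factor
```

Because the key depends only on the slot, a manual rerun inside the same 30-minute window reuses the key and replays the existing job, while the next scheduled tick produces a new key and a fresh job.
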