Source: docs/operations/public-api-slo-baseline.md
# Public API SLO Baseline
**Published:** 2026-02-23
**Owner:** platform-oncall
**Canary workflow:** `.github/workflows/public-api-canary.yml`
**Dashboard:** GitHub Actions workflow run history for `public-api-canary.yml`
---
## Service Level Objectives
| Metric | Target | Measurement |
|--------|--------|-------------|
| **Availability** | ≥ 99.5% canary success rate (rolling 7 days) | % of canary runs with `CANARY_STATUS=PASS` |
| **Auth latency** | p95 ≤ 3,000 ms | `CANARY_AUTH_LATENCY_MS` from canary output |
| **Enqueue latency** | p95 ≤ 5,000 ms | `CANARY_ENQUEUE_LATENCY_MS` from canary output |
| **Job completion (SIMA batch, bounded)** | p95 ≤ 90 s | `CANARY_COMPLETION_SECONDS` from canary output |
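
As a rough illustration of how the targets above could be checked over a window of runs, the sketch below assumes each run's canary output has already been parsed into a record. The `CanaryRun` type, the `evaluate_slos` helper, and the nearest-rank percentile definition are assumptions made for this example; they are not taken from the canary script.

```python
from dataclasses import dataclass


@dataclass
class CanaryRun:
    """One canary run, parsed from its output (hypothetical record type)."""
    status: str                 # CANARY_STATUS, e.g. "PASS"
    auth_latency_ms: float      # CANARY_AUTH_LATENCY_MS
    enqueue_latency_ms: float   # CANARY_ENQUEUE_LATENCY_MS
    completion_seconds: float   # CANARY_COMPLETION_SECONDS


def p95(values):
    """Nearest-rank 95th percentile (assumed definition; others exist)."""
    if not values:
        raise ValueError("p95 needs at least one value")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def evaluate_slos(runs):
    """Return a metric -> bool map; True means the target is met."""
    if not runs:
        raise ValueError("no canary runs in window")
    passes = sum(1 for r in runs if r.status == "PASS")
    return {
        # Integer form of passes / total >= 99.5%
        "availability": passes * 1000 >= 995 * len(runs),
        "auth_latency": p95([r.auth_latency_ms for r in runs]) <= 3000,
        "enqueue_latency": p95([r.enqueue_latency_ms for r in runs]) <= 5000,
        "completion": p95([r.completion_seconds for r in runs]) <= 90,
    }
```

With a full 7-day window, `evaluate_slos(runs)` yields one boolean per row of the table, which makes it straightforward to report which objective was missed.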
### Measurement Cadence
- **Canary frequency:** Every 30 minutes (48 runs/day)
- **7-day window:** 336 runs
- **7-day clean gate:** Zero unresolved failures required for launch
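
The cadence arithmetic can be worked through in a few lines. This is only a sketch: `failure_budget` is a hypothetical helper, and it shows how many failed runs the 99.5% availability target would tolerate in a window. The launch gate above remains stricter, since it requires zero unresolved failures.

```python
RUN_INTERVAL_MINUTES = 30
RUNS_PER_DAY = 24 * 60 // RUN_INTERVAL_MINUTES   # 48
WINDOW_RUNS = RUNS_PER_DAY * 7                   # 336


def failure_budget(total_runs, target_per_mille=995):
    """Largest number of failed runs that still meets the availability target.

    The target is expressed per mille (995 == 99.5%) to keep the math in integers.
    """
    return total_runs * (1000 - target_per_mille) // 1000


# 0.5% of 336 runs is 1.68, so at most 1 failed run fits the 99.5% target.
print(RUNS_PER_DAY, WINDOW_RUNS, failure_budget(WINDOW_RUNS))
```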
---
## Alerting Policy
### Week-1 (Pre-Launch Strict Mode)
Any SLO breach or hard failure fails the workflow immediately, which preserves launch-gate integrity.
| Condition | Action | Response timing |
|-----------|--------|---------|
| Hard failure (HTTP error, timeout, FAILED job) | Workflow fails → Slack + GitHub issue upsert | Immediate |
| SLO breach (any threshold exceeded) | Workflow fails (gate stays strict). Paging is suppressed by default to reduce false pages from transient spikes. | Immediate |
| Replay on scheduled run (slot overlap/drift) | Workflow fails → Slack + GitHub issue upsert | Immediate |
| 3+ consecutive hard failures | Escalation per on-call runbook | Manual |
To page on SLO-only breaches, set repo variable `CANARY_PAGE_ON_SLO_BREACH=true`.
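
The Week-1 policy can be summarized as a small decision function. The sketch below is illustrative only: `Decision` and `decide` are hypothetical names, and it assumes that "paging" refers to the same Slack + GitHub issue upsert notification used for hard failures, which the table does not state explicitly. The manual escalation row is left out because it is handled through the on-call runbook.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    fail_workflow: bool  # mark the workflow run as failed
    notify: bool         # Slack + GitHub issue upsert (assumed to be the "page")


def decide(hard_failure, slo_breach, scheduled_replay, page_on_slo_breach=False):
    """Map canary outcomes to Week-1 actions.

    page_on_slo_breach mirrors the CANARY_PAGE_ON_SLO_BREACH repo variable.
    """
    fail = hard_failure or slo_breach or scheduled_replay
    notify = (
        hard_failure
        or scheduled_replay
        or (slo_breach and page_on_slo_breach)
    )
    return Decision(fail_workflow=fail, notify=notify)
```

Under these assumptions, an SLO-only breach still fails the workflow, so the gate stays strict, but it only notifies when the repo variable is set to `true`.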
### Post-Launch (Future)
After launch, SLO breach handling can be relaxed to sustained-breach escalation (2+ consecutive breaches), implemented via `github-script` querying prior workflow runs (`actions.listWorkflowRuns`). This is not implemented in the initial PR: immediate failure is sufficient and avoids the fragility of artifact-based state.
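
For illustration, the sustained-breach check itself reduces to counting a streak over prior run outcomes. The sketch below assumes those outcomes have already been fetched and ordered newest first; `sustained_breach` is a hypothetical helper, and nothing here reflects an existing implementation.

```python
def sustained_breach(breached_newest_first, threshold=2):
    """Return True when the most recent `threshold` runs all breached an SLO.

    breached_newest_first: iterable of booleans, newest run first.
    """
    streak = 0
    for breached in breached_newest_first:
        if not breached:
            break  # the streak ends at the first healthy run
        streak += 1
        if streak >= threshold:
            return True
    return False
```

A single transient spike (`[True, False, ...]`) would not escalate under this rule, while two breaches in a row would.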
---
## Canary Test Path
The canary exercises the exact public customer integration path:
1. `POST /api/auth/token/integration` — integration key → JWT exchange
2. `POST /api/sima/batch` — enqueue bounded SKU job with explicit `runPolicy: "always"`, `screeningAuthority: "CA"`, and slot-based idempotency key
3. `GET /api/jobs/:id` — poll with exponential backoff until terminal state
4. `GET /api/sima/results/:sku/evidence` — validate canary SKU appears in results
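
Step 3 is the part most sensitive to timing, so a sketch of polling with exponential backoff may help. Everything here is an assumption for illustration: the helper name, the delay parameters, and the set of terminal state names (only `FAILED` is mentioned in this document). The status fetch is injected as a callable so the example does not depend on any particular HTTP client.

```python
import time


def poll_until_terminal(fetch_status, terminal_states, *, initial_delay=1.0,
                        factor=2.0, max_delay=15.0, timeout=90.0,
                        sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status() until it returns a terminal state.

    The delay between polls grows by `factor` up to `max_delay`.
    Raises TimeoutError once `timeout` seconds have elapsed.
    """
    deadline = clock() + timeout
    delay = initial_delay
    while True:
        status = fetch_status()  # e.g. wraps GET /api/jobs/:id
        if status in terminal_states:
            return status
        remaining = deadline - clock()
        if remaining <= 0:
            raise TimeoutError(f"job still {status!r} after {timeout}s")
        sleep(min(delay, remaining))  # never sleep past the deadline
        delay = min(delay * factor, max_delay)
```

Capping the delay keeps the measured completion time close to the real one; an uncapped backoff could overshoot the 90 s target simply because the final poll arrived late.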
### Design Decisions
- **Bounded SKU list** (`["CANARY-SKU-001"]`): Deterministic runtime for reliable SLO measurement. Avoids unbounded full-tenant scans.
- **Slot-based idempotency** (`canary-<30min-bucket>`): One fresh job per schedule tick; manual reruns within a slot safely replay.
- **Explicit request fields**: Pinned `runPolicy` and `screeningAuthority` prevent future default changes from silently altering canary signal.
- **Post-completion SKU validation**: Catches stale/missing seed data without relying on enqueue rejection.
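
The slot-based idempotency key can be sketched as follows. The exact bucket encoding is not specified in this document, so this example assumes the bucket is the Unix timestamp divided by 1,800 seconds; `idempotency_key` is a hypothetical helper.

```python
import time

SLOT_SECONDS = 30 * 60  # one slot per 30-minute schedule tick


def idempotency_key(epoch_seconds=None):
    """Build a `canary-<30min-bucket>` key for the given Unix time (default: now)."""
    if epoch_seconds is None:
        epoch_seconds = time.time()
    bucket = int(epoch_seconds) // SLOT_SECONDS
    return f"canary-{bucket}"
```

Two invocations inside the same 30-minute slot produce the same key, so a manual rerun replays the existing job, while the next schedule tick gets a new key and therefore a fresh job.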
---
## Thresholds (Configurable)
Thresholds are stored as GitHub repo variables and can be tuned without code changes:
| Variable | Default | Description |
|----------|---------|-------------|
| `CANARY_SLO_AUTH_LATENCY_MS` | 3000 | Max auth latency before SLO breach |
| `CANARY_SLO_ENQUEUE_LATENCY_MS` | 5000 | Max enqueue latency before SLO breach |
| `CANARY_SLO_COMPLETION_SECONDS` | 90 | Max job completion time before SLO breach |
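
A minimal sketch of how these variables might be consumed, assuming they reach the canary as environment variables and that an unset repo variable arrives as an empty string. The `load_thresholds` and `breached` helpers are hypothetical, and the real script may parse and compare values differently.

```python
import os

DEFAULT_THRESHOLDS = {
    "CANARY_SLO_AUTH_LATENCY_MS": 3000,
    "CANARY_SLO_ENQUEUE_LATENCY_MS": 5000,
    "CANARY_SLO_COMPLETION_SECONDS": 90,
}


def load_thresholds(env=None):
    """Read thresholds from the environment, falling back to the defaults."""
    env = os.environ if env is None else env
    # `or default` also covers variables that are present but empty.
    return {
        name: int(env.get(name) or default)
        for name, default in DEFAULT_THRESHOLDS.items()
    }


def breached(measurements, thresholds):
    """Return the names of measurements that exceed their threshold.

    measurements uses the canary output names, e.g. CANARY_AUTH_LATENCY_MS,
    which map to CANARY_SLO_AUTH_LATENCY_MS in the thresholds.
    """
    return [
        name
        for name, value in measurements.items()
        if value > thresholds[name.replace("CANARY_", "CANARY_SLO_", 1)]
    ]
```

For example, `breached({"CANARY_AUTH_LATENCY_MS": 3200}, load_thresholds({}))` reports the auth latency measurement, because 3,200 ms exceeds the 3,000 ms default.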
---
## Related Documents
- Canary runbook: `docs/operations/public-api-canary-runbook.md`
- Canary script: `scripts/canary-public-api.sh`
- Workflow: `.github/workflows/public-api-canary.yml`
- Environment variables: `docs/ENVIRONMENT_VARIABLES.md` (canary section)