Staging Health Monitoring Runbook
Source: docs/operations/staging-health-monitoring-runbook.md
# Staging Health Monitoring Runbook
This runbook covers unattended health monitoring for staging web + API services.
## Scope
- API health endpoint: `GET /health`
- Web health endpoint: `GET /api/health`
- Workflow: `.github/workflows/staging-health-checks.yml`
- Checker script: `scripts/check-staging-health.sh`
## Required Repo Configuration
Set these GitHub repository variables:
```bash
gh variable set STAGING_API_HEALTH_URL --body 'https://<api-domain>/health'
gh variable set STAGING_WEB_HEALTH_URL --body 'https://<web-domain>/api/health'
```
Optional variable for routing ownership in Slack message text:
```bash
gh variable set STAGING_HEALTH_ALERT_OWNER --body 'platform-oncall'
```
Optional Slack webhook secret (reuses existing staging checks channel):
```bash
gh secret set STAGING_CHECKS_SLACK_WEBHOOK_URL --body '<slack-webhook-url>'
```
## Triggering Checks
Manual run:
```bash
gh workflow run 'Staging Health Checks' \
-f timeout_seconds=10 \
-f max_attempts=3 \
-f initial_backoff_seconds=2
```
Watch recent runs:
```bash
gh run list --workflow='staging-health-checks.yml' --limit 10
```
The workflow also runs every 30 minutes on cron.
## Failure Triage
When a run fails:
1. Open the failed workflow run and check the step summary plus `staging-health-checks` artifact (`staging-health.out`, `staging-health.err`).
2. Reproduce from local shell:
```bash
scripts/check-staging-health.sh \
--api-url='<api-health-url>' \
--web-url='<web-health-url>' \
--timeout-seconds=10 \
--max-attempts=3 \
--initial-backoff-seconds=2
```
Note: transient transport/edge failures are retried with exponential backoff before the script fails.
3. Isolate target failure:
```bash
curl -i '<api-health-url>'
curl -i '<web-health-url>'
```
4. If only API fails:
- Check Render deploy status and logs for `rgl8r-staging-api`.
- Verify DB connectivity and env vars (`DATABASE_URL`, Clerk keys, `JWT_*`).
- Redeploy the latest successful commit.
5. If only web fails:
- Check Vercel deployment status and function logs.
- Verify `API_URL`, Clerk vars, and proxy route behavior.
- Redeploy latest successful build.
6. If both fail:
- Check upstream outage (Render/Vercel/network).
- Validate domain and DNS settings for both services.
- Confirm no expired cert/domain misconfiguration.
## Recovery Verification
After remediation, run:
```bash
gh workflow run 'Staging Health Checks' -f timeout_seconds=10
```
Confirm one manual pass and at least one subsequent scheduled pass are green.
## Exit Gate Mapping (P9-C2)
This monitoring lane satisfies the P9-C2 implementation requirements:
- Staging web + API health checks are automated.
- Failures route to Slack when `STAGING_CHECKS_SLACK_WEBHOOK_URL` is set.
- This runbook documents triage and recovery.
Keep a 7-day green run history before marking P9-C2 complete.