Staging Health Monitoring Runbook

Source: docs/operations/staging-health-monitoring-runbook.md

# Staging Health Monitoring Runbook
 
This runbook covers unattended health monitoring for staging web + API services.
 
## Scope
 
- API health endpoint: `GET /health`
- Web health endpoint: `GET /api/health`
- Workflow: `.github/workflows/staging-health-checks.yml`
- Checker script: `scripts/check-staging-health.sh`
 
## Required Repo Configuration
 
Set these GitHub repository variables:
 
```bash
gh variable set STAGING_API_HEALTH_URL --body 'https://<api-domain>/health'
gh variable set STAGING_WEB_HEALTH_URL --body 'https://<web-domain>/api/health'
```
 
Optional variable for routing ownership in Slack message text:
 
```bash
gh variable set STAGING_HEALTH_ALERT_OWNER --body 'platform-oncall'
```
 
Optional Slack webhook secret (reuses existing staging checks channel):
 
```bash
gh secret set STAGING_CHECKS_SLACK_WEBHOOK_URL --body '<slack-webhook-url>'
```
 
## Triggering Checks
 
Manual run:
 
```bash
gh workflow run 'Staging Health Checks' \
  -f timeout_seconds=10 \
  -f max_attempts=3 \
  -f initial_backoff_seconds=2
```
 
Watch recent runs:
 
```bash
gh run list --workflow='staging-health-checks.yml' --limit 10
```
 
The workflow also runs every 30 minutes on cron.
 
## Failure Triage
 
When a run fails:
 
1. Open the failed workflow run and check the step summary plus `staging-health-checks` artifact (`staging-health.out`, `staging-health.err`).
2. Reproduce from local shell:
 
   ```bash
   scripts/check-staging-health.sh \
     --api-url='<api-health-url>' \
     --web-url='<web-health-url>' \
     --timeout-seconds=10 \
     --max-attempts=3 \
     --initial-backoff-seconds=2
   ```
 
   Note: transient transport/edge failures are retried with exponential backoff before the script fails.
 
3. Isolate target failure:
 
   ```bash
   curl -i '<api-health-url>'
   curl -i '<web-health-url>'
   ```
 
4. If only API fails:
- Check Render deploy status and logs for `rgl8r-staging-api`.
- Verify DB connectivity and env vars (`DATABASE_URL`, Clerk keys, `JWT_*`).
- Redeploy the latest successful commit.
 
5. If only web fails:
- Check Vercel deployment status and function logs.
- Verify `API_URL`, Clerk vars, and proxy route behavior.
- Redeploy latest successful build.
 
6. If both fail:
- Check upstream outage (Render/Vercel/network).
- Validate domain and DNS settings for both services.
- Confirm no expired cert/domain misconfiguration.
 
## Recovery Verification
 
After remediation, run:
 
```bash
gh workflow run 'Staging Health Checks' -f timeout_seconds=10
```
 
Confirm one manual pass and at least one subsequent scheduled pass are green.
 
## Exit Gate Mapping (P9-C2)
 
This monitoring lane satisfies the P9-C2 implementation requirements:
 
- Staging web + API health checks are automated.
- Failures route to Slack when `STAGING_CHECKS_SLACK_WEBHOOK_URL` is set.
- This runbook documents triage and recovery.
 
Keep a 7-day green run history before marking P9-C2 complete.