Source: docs/operations/staging-health-monitoring-runbook.md

# Staging Health Monitoring Runbook

This runbook covers unattended health monitoring for staging web + API services.

## Scope

- API health endpoint: `GET /health`
- Web health endpoint: `GET /api/health`
- Workflow: `.github/workflows/staging-health-checks.yml`
- Checker script: `scripts/check-staging-health.sh`

## Required Repo Configuration

Set these GitHub repository variables:

```bash
gh variable set STAGING_API_HEALTH_URL --body 'https://<api-domain>/health'
gh variable set STAGING_WEB_HEALTH_URL --body 'https://<web-domain>/api/health'
```

Optional variable for routing ownership in Slack message text:

```bash
gh variable set STAGING_HEALTH_ALERT_OWNER --body 'platform-oncall'
```

Optional Slack webhook secret (reuses existing staging checks channel):

```bash
gh secret set STAGING_CHECKS_SLACK_WEBHOOK_URL --body '<slack-webhook-url>'
```

## Triggering Checks

Manual run:

```bash
gh workflow run 'Staging Health Checks' \
  -f timeout_seconds=10 \
  -f max_attempts=3 \
  -f initial_backoff_seconds=2
```

Watch recent runs:

```bash
gh run list --workflow='staging-health-checks.yml' --limit 10
```

The workflow also runs every 30 minutes on cron.

## Failure Triage

When a run fails:

1. Open the failed workflow run and check the step summary plus `staging-health-checks` artifact (`staging-health.out`, `staging-health.err`).
2. Reproduce from local shell:

   ```bash
   scripts/check-staging-health.sh \
     --api-url='<api-health-url>' \
     --web-url='<web-health-url>' \
     --timeout-seconds=10 \
     --max-attempts=3 \
     --initial-backoff-seconds=2
   ```

   Note: transient transport/edge failures are retried with exponential backoff before the script fails.
3. Isolate target failure:

   ```bash
   curl -i '<api-health-url>'
   curl -i '<web-health-url>'
   ```

4. If only API fails:
   - Check Render deploy status and logs for `rgl8r-staging-api`.
   - Verify DB connectivity and env vars (`DATABASE_URL`, Clerk keys, `JWT_*`).
   - Redeploy the latest successful commit.
5. If only web fails:
   - Check Vercel deployment status and function logs.
   - Verify `API_URL`, Clerk vars, and proxy route behavior.
   - Redeploy latest successful build.
6. If both fail:
   - Check upstream outage (Render/Vercel/network).
   - Validate domain and DNS settings for both services.
   - Confirm no expired cert/domain misconfiguration.

## Recovery Verification

After remediation, run:

```bash
gh workflow run 'Staging Health Checks' -f timeout_seconds=10
```

Confirm one manual pass and at least one subsequent scheduled pass are green.

## Exit Gate Mapping (P9-C2)

This monitoring lane satisfies the P9-C2 implementation requirements:

- Staging web + API health checks are automated.
- Failures route to Slack when `STAGING_CHECKS_SLACK_WEBHOOK_URL` is set.
- This runbook documents triage and recovery.

Keep a 7-day green run history before marking P9-C2 complete.
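
The 7-day green-history gate can be spot-checked from the command line. The following is a minimal sketch and not part of the repository: the function names `recent_conclusions` and `count_non_green` are hypothetical, it assumes GNU `date` and an authenticated `gh` CLI, and it treats any completed run whose conclusion is not `success` as non-green.

```bash
#!/usr/bin/env bash
# Sketch only: helper names below are illustrative, not taken from the repo.

# Print one conclusion per line for runs created in the last 7 days.
# Assumes GNU date (`-d '7 days ago'`); BSD/macOS date needs a different flag.
# The limit of 500 is an assumption that comfortably covers 7 days of
# 30-minute cron runs (336) plus some manual runs.
recent_conclusions() {
  local since
  since="$(date -u -d '7 days ago' +%Y-%m-%dT%H:%M:%SZ)"
  gh run list --workflow='staging-health-checks.yml' --limit 500 \
    --json conclusion,createdAt \
    --jq ".[] | select(.createdAt >= \"$since\") | .conclusion"
}

# Read conclusions from stdin and print how many are not "success".
# Empty lines (runs still in progress have no conclusion yet) are skipped.
count_non_green() {
  local count=0 conclusion
  while IFS= read -r conclusion; do
    [[ -z "$conclusion" ]] && continue
    [[ "$conclusion" == "success" ]] || count=$((count + 1))
  done
  echo "$count"
}
```

Running `recent_conclusions | count_non_green` prints the number of non-green runs in the window; `0` suggests the gate is met. The sketch does not detect missing scheduled runs, so a gap in the cron history still needs a manual look at `gh run list`.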