
Uptime Monitoring Setup — BetterStack External Health Checks

1. Overview

Current setup: The Agentix API exposes a /health endpoint that verifies database (PostgreSQL) and Redis connectivity. It returns HTTP 200 when healthy and HTTP 503 when either dependency is down.
Target: External uptime monitoring via BetterStack that requests /health every 60 seconds and alerts the team by email (and optionally Slack) when the service is degraded.
Why this matters:
  • Internal health checks only help if the server is reachable — external monitoring catches network, DNS, and infrastructure failures
  • 60-second checks, combined with the 2-check confirmation period, mean outages are typically detected within roughly 1-2 minutes
  • Automated alerts reduce mean-time-to-detection (MTTD) from hours to minutes
  • A public status page builds trust with tenants
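The 1-2 minute detection window can be checked with a short worked example. The sketch below makes three modeling assumptions, which are not BetterStack's documented internals:
  • An incident opens on the Nth consecutive failed check (the confirmation period set in Step 1).
  • In the worst case, the outage starts just after a successful check.
  • The final failing request can take the full request timeout to fail.

```python
def worst_case_detection_seconds(interval_s: int, confirmations: int, timeout_s: int) -> int:
    """Upper bound on time from outage start to incident (modeling assumption).

    The outage begins just after a passing check, so the next `confirmations`
    checks, spaced `interval_s` apart, all fail; the last one only fails once
    its `timeout_s` request timeout elapses.
    """
    return interval_s * confirmations + timeout_s


def best_case_detection_seconds(interval_s: int, confirmations: int) -> int:
    """Lower bound: the outage begins just before a check that fails immediately."""
    return interval_s * (confirmations - 1)


for interval in (60, 180):  # paid 60-second checks vs. free-tier 3-minute checks
    print(interval,
          best_case_detection_seconds(interval, 2),
          worst_case_detection_seconds(interval, 2, 10))
# 60 60 130
# 180 180 370
```

Under these assumptions, 60-second checks flag an outage between 1 minute and just over 2 minutes after it starts. On the free tier's 3-minute checks, the window grows to roughly 3 to 6 minutes.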
What the /health endpoint checks:
  • PostgreSQL connectivity (runs a lightweight query)
  • Redis connectivity (sends a PING command)
  • Returns 200 OK with { "status": "healthy" } if both pass
  • Returns 503 Service Unavailable with { "status": "unhealthy", "details": {...} } if either fails
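The endpoint contract above can be sketched as a framework-agnostic function. This is not the Agentix implementation, whose language and framework are not specified here. The `health_response` name, the check names, and the exact contents of `details` are assumptions. Each check is a callable that raises on failure, standing in for the real PostgreSQL query and Redis PING.

```python
import json
from typing import Callable, Dict, Tuple


def health_response(checks: Dict[str, Callable[[], None]]) -> Tuple[int, str]:
    """Run every dependency check and build the /health status code and JSON body.

    A check signals failure by raising; any exception marks that dependency down.
    """
    details: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            check()
            details[name] = "ok"
        except Exception as exc:  # connection refused, timeout, auth error, ...
            details[name] = f"error: {exc}"

    if all(state == "ok" for state in details.values()):
        return 200, json.dumps({"status": "healthy"})
    return 503, json.dumps({"status": "unhealthy", "details": details})


# Stand-in checks; real ones would run e.g. a SELECT 1 query and a Redis PING.
def db_ok() -> None:
    pass


def redis_down() -> None:
    raise ConnectionError("connection refused")


print(health_response({"postgres": db_ok, "redis": redis_down}))
```

The checks run one after another. In a real handler, giving each check its own short timeout would keep /health answering well inside the monitor's 10-second request timeout, even when a dependency hangs.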

2. Prerequisites

  • BetterStack account — sign up at https://betterstack.com (free tier includes 5 monitors with 3-minute checks; upgrade for 60-second intervals)
  • Production API URL (e.g., https://api.agentix.app)
  • Team email addresses for alert recipients

3. Step 1 — Create a Monitor

  1. Sign into the BetterStack dashboard
  2. Navigate to Monitors in the left sidebar
  3. Click Create Monitor
  4. Configure the monitor settings:
    • Monitor type: HTTP(s)
    • URL: https://api.agentix.app/health (substitute your actual production URL)
    • Check frequency: Every 60 seconds
    • Request method: GET
    • Expected status code: 200
    • Confirmation period: 2 checks (waits for 2 consecutive failures before alerting, which avoids false alarms on transient blips)
    • Request timeout: 10 seconds
    • Monitor name: Agentix API — Health (or any descriptive name)
  5. Click Save to create the monitor
Note: The free tier limits check frequency to 3 minutes. For 60-second checks, the Freelancer plan ($16.67/mo billed annually) or higher is required. The 3-minute free tier is still useful for basic coverage.
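The confirmation period described in the settings above can be modeled as a simple rule: open an incident on the Nth consecutive failed check, and do not alert again until a check passes. The sketch below is a simplified model for reasoning about false alarms, not BetterStack's actual algorithm. For instance, it treats a single passing check as recovery. `incident_starts` is a hypothetical helper.

```python
from typing import Iterable, List


def incident_starts(results: Iterable[bool], confirmations: int = 2) -> List[int]:
    """Return the indices of checks at which a new incident would open.

    `results` is a sequence of check outcomes (True = up, False = down).
    """
    starts: List[int] = []
    consecutive_failures = 0
    in_incident = False
    for i, up in enumerate(results):
        if up:
            consecutive_failures = 0
            in_incident = False  # a passing check resolves any open incident
        else:
            consecutive_failures += 1
            if consecutive_failures >= confirmations and not in_incident:
                starts.append(i)
                in_incident = True
    return starts


# A single transient failure (check 1) is ignored; the sustained outage alerts once.
print(incident_starts([True, False, True, False, False, False, True]))  # [4]
```

Raising `confirmations` to 3 would also suppress two-check blips, at the cost of one extra check interval before a real outage is reported.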

4. Step 2 — Configure Email Alerts

BetterStack sends alerts to people added to your escalation policy.
  1. Navigate to On-call > People in the left sidebar
  2. Click Invite team member and add each recipient’s email address
  3. Navigate to On-call > Escalation policies
  4. Edit the default escalation policy (or create a new one):
    • Step 1: Notify the team immediately on incident creation
    • Add all relevant team members
  5. Return to your monitor and verify the escalation policy is assigned
Test email delivery:
  • BetterStack sends a welcome email when you invite team members
  • If no welcome email arrives, check spam/junk folders and verify the email address

5. Step 3 — Configure Slack Alerts (Optional)

For faster response times, add Slack notifications alongside email.
  1. Navigate to Integrations in the left sidebar
  2. Find Slack and click Connect
  3. Authorize BetterStack to post to your Slack workspace
  4. Select the channel for alerts (e.g., #ops-alerts or #engineering)
  5. Return to On-call > Escalation policies
  6. Add a Slack notification step to your escalation policy:
    • Step 1: Notify via Slack channel immediately
    • Step 2: Notify team members via email (if not acknowledged within 5 minutes)
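The two-step policy above behaves roughly like the following sketch. Slack fires as soon as an incident opens, and email goes out only if nobody has acknowledged the incident by the 5-minute mark. The function and its parameters are hypothetical and only model the policy as configured in this step.

```python
from typing import List, Optional


def channels_notified(minutes_open: float,
                      acked_at: Optional[float] = None,
                      email_after: float = 5.0) -> List[str]:
    """Channels notified `minutes_open` minutes into an incident.

    `acked_at` is the minute the incident was acknowledged (None if never).
    Step 1 (Slack) fires at minute 0; step 2 (email) fires at `email_after`
    unless the incident was acknowledged before then.
    """
    notified = ["slack"]
    acked_in_time = acked_at is not None and acked_at < email_after
    if minutes_open >= email_after and not acked_in_time:
        notified.append("email")
    return notified


print(channels_notified(2))              # ['slack']
print(channels_notified(6))              # ['slack', 'email']
print(channels_notified(6, acked_at=3))  # ['slack']
```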

6. Step 4 — Create a Status Page (Optional)

A public status page communicates uptime to tenants without them needing to contact support.
  1. Navigate to Status pages in the left sidebar
  2. Click Create status page
  3. Configure:
    • Name: Agentix Status
    • Subdomain: a BetterStack-hosted subdomain (e.g., agentix); to serve the page at status.agentix.app instead, also set up the custom domain described below
    • Resources: Add the Agentix API — Health monitor
  4. Click Save
  5. Share the status page URL with tenants or link it from the product
Custom domain (optional):
  • Add a CNAME record in your DNS pointing status.agentix.app to BetterStack’s status page domain
  • Configure the custom domain in BetterStack’s status page settings

7. Verification

After creating the monitor, verify everything is working:
  1. Wait 2-3 minutes for the first few checks to complete
  2. In the BetterStack dashboard, confirm the monitor shows Up status with a green indicator
  3. Check that the response time graph is populating
Test alerting end-to-end:
  1. Temporarily change the monitor URL to a non-existent path (e.g., https://api.agentix.app/health-test-invalid)
  2. Wait for 2 check cycles (about 2 minutes at a 60-second interval, or about 6 minutes at the free tier's 3-minute interval)
  3. Confirm an alert email arrives (check spam if not in inbox)
  4. Confirm Slack notification arrives (if configured)
  5. Immediately revert the monitor URL back to https://api.agentix.app/health
  6. Confirm the monitor recovers and shows Up status
  7. Confirm a recovery notification is sent

8. Verification Checklist

  • Monitor exists in BetterStack dashboard with Up status
  • Check frequency is set to 60 seconds (or 3 minutes on free tier)
  • Expected status code is 200
  • Confirmation period is 2 checks
  • Request timeout is 10 seconds
  • At least one team member is configured in the escalation policy
  • Test alert was received via email
  • (Optional) Slack integration is connected and test alert received
  • (Optional) Status page is created and accessible

9. Troubleshooting

Monitor shows “Down” but the app works in browser

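  • Authentication: The /health endpoint must respond without credentials. The app may work in your browser only because the browser sends a session cookie. CORS is not a factor: browsers enforce it, and BetterStack's checks are server-side requests. Verify by running:
    curl -s -o /dev/null -w "%{http_code}\n" --max-time 10 https://api.agentix.app/health

    Expected output: 200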
  • Firewall or WAF: If using Cloudflare or another WAF, ensure BetterStack’s IP ranges are not blocked. BetterStack publishes their monitoring IP ranges in their documentation.
  • DNS resolution: The monitor URL must be publicly resolvable. If the API is behind a private network, external monitoring cannot reach it.
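To reproduce the monitor's request from a machine outside your network, a standard-library Python probe might look like the sketch below. The `classify` and `probe` names are illustrative. The probe mirrors the GET request and 10-second timeout from Step 1. It also checks the documented JSON body, which the status-code monitor configured in Step 1 does not inspect.

```python
import json
import urllib.error
import urllib.request


def classify(status: int, body: str) -> str:
    """Interpret a /health response the way this guide describes the endpoint."""
    if status != 200:
        return f"DOWN (HTTP {status})"
    try:
        payload = json.loads(body)
    except ValueError:
        return "DOWN (HTTP 200 but body is not JSON)"
    if not isinstance(payload, dict) or payload.get("status") != "healthy":
        return "DOWN (HTTP 200 but status is not 'healthy')"
    return "UP"


def probe(url: str, timeout: float = 10.0) -> str:
    """GET `url` once, like the BetterStack monitor, and classify the result."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return classify(resp.status, resp.read().decode("utf-8", "replace"))
    except urllib.error.HTTPError as err:  # non-2xx responses such as 503
        return classify(err.code, err.read().decode("utf-8", "replace"))
    except OSError as err:  # DNS failure, refused connection, or timeout
        return f"DOWN (unreachable: {err})"


# Usage (makes a real network request):
# print(probe("https://api.agentix.app/health"))
```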

Alerts not arriving

  • Email: Check spam/junk folders. Verify the email address in On-call > People. Ensure the escalation policy is assigned to the monitor.
  • Slack: Verify the Slack integration is still authorized (tokens can expire). Reconnect if needed.
  • Escalation policy: Ensure the policy has at least one active step with team members assigned.

False alarms (intermittent “Down” alerts)

  • Increase the confirmation period from 2 to 3 checks
  • Increase the request timeout from 10 to 15 seconds
  • Check if the API has cold-start latency (Railway sleeps inactive services on some plans)
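You can time a few consecutive requests to see whether slow responses or a cold start are eating into the 10-second timeout. The sketch below uses an assumption-laden heuristic with two limits:
  • It flags a first request more than 3x slower than the median of the rest as a likely cold start; the factor is arbitrary.
  • The result only means something if the service was idle before the first request.
`sample_latencies` and `looks_like_cold_start` are hypothetical helpers.

```python
import statistics
import time
import urllib.request
from typing import List


def sample_latencies(url: str, n: int = 5, timeout: float = 10.0,
                     pause_s: float = 1.0) -> List[float]:
    """Time `n` sequential GETs to `url`, in seconds.

    Failed requests (HTTP errors, timeouts) are still timed, so a request that
    times out shows up as roughly `timeout` seconds.
    """
    latencies: List[float] = []
    for _ in range(n):
        start = time.perf_counter()
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                resp.read()
        except OSError:
            pass  # HTTPError and timeouts are OSError subclasses
        latencies.append(time.perf_counter() - start)
        time.sleep(pause_s)
    return latencies


def looks_like_cold_start(latencies: List[float], factor: float = 3.0) -> bool:
    """Heuristic: the first request is much slower than the median of the rest."""
    if len(latencies) < 2:
        return False
    return latencies[0] > factor * statistics.median(latencies[1:])


# Usage (makes real network requests):
# samples = sample_latencies("https://api.agentix.app/health")
# print([round(s, 2) for s in samples], looks_like_cold_start(samples))
```

If the first sample approaches 10 seconds, the likely fixes are raising the timeout (as suggested above) or keeping the service from sleeping.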

Health endpoint returns 503

The /health endpoint returns 503 when PostgreSQL or Redis is unreachable. This is a real issue that requires investigation:
  • Check Railway dashboard for database/Redis service status
  • Check PostgreSQL connection limits (max_connections)
  • Check Redis memory usage and eviction policy
  • Review API logs in Railway for connection errors

10. Ongoing Maintenance

  • Review monthly: Check the uptime percentage in BetterStack dashboard. Aim for 99.9%+ uptime.
  • Email authentication (DMARC): No changes needed. BetterStack sends alert emails from its own domain, so your domain's SPF, DKIM, and DMARC records are not involved.
  • Escalation policy updates: When team members join or leave, update On-call > People and the escalation policy in On-call > Escalation policies.
  • Monitor updates: If the API URL changes (e.g., domain migration), update the monitor URL immediately.
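To put the 99.9% target in concrete terms: the allowed downtime is (1 - target) times the length of the period. A quick sketch, assuming a 30-day month:

```python
def downtime_budget_minutes(target_pct: float, days: int = 30) -> float:
    """Minutes of downtime allowed in `days` days at `target_pct` uptime."""
    return (1 - target_pct / 100) * days * 24 * 60


def uptime_pct(downtime_minutes: float, days: int = 30) -> float:
    """Uptime percentage for a period with the given total downtime."""
    total = days * 24 * 60
    return 100 * (total - downtime_minutes) / total


print(round(downtime_budget_minutes(99.9), 1))  # 43.2
print(round(uptime_pct(60), 3))                 # 99.861
```

A single hour of downtime in a 30-day month therefore already falls below the 99.9% target.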
