PagerDuty is the de facto incident management platform for DevOps and SRE teams at thousands of enterprises. The irony of PagerDuty downtime is acute: your system for managing incidents has its own incident. This guide helps you verify PagerDuty outages quickly and manage incidents when your alerting infrastructure is compromised.
How to Verify a PagerDuty Outage
1. Check the Official PagerDuty Status Page
PagerDuty maintains a detailed status page at status.pagerduty.com. It shows granular component health for the API, Notifications (phone, SMS, email, push), Integrations, and the PagerDuty web app. Critically, check which notification type is affected: a phone outage is very different from a full API outage.
2. Test the PagerDuty API
Verify the API is responding:
curl -s -o /dev/null -w "%{http_code}" \
https://api.pagerduty.com/abilities \
-H "Accept: application/vnd.pagerduty+json;version=2" \
-H "Authorization: Token token=YOUR_API_TOKEN"
# Expect 200 if the PagerDuty API is healthy
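A bare status code is easier to act on once it is mapped to a rough diagnosis. The following is a minimal sketch, not an official PagerDuty tool: the function names `classify_api_status` and `check_pagerduty_api`, the labels they print, and the `PAGERDUTY_API_TOKEN` variable are all hypothetical. It assumes that curl reports `000` when no HTTP response arrives, and that a 401 or 403 means the API is reachable but the token needs attention.

```shell
# Map an HTTP status code from the REST API check to a rough diagnosis.
# The labels are illustrative, not PagerDuty terminology.
classify_api_status() {
  case "$1" in
    200) echo "healthy" ;;
    401|403) echo "reachable-check-token" ;;
    429) echo "reachable-rate-limited" ;;
    5??) echo "server-error" ;;
    000|"") echo "unreachable" ;;
    *) echo "unexpected-$1" ;;
  esac
}

# Run the same request as the check above, with a timeout, and classify the result.
# PAGERDUTY_API_TOKEN is a hypothetical environment variable name.
check_pagerduty_api() {
  code=$(curl -s -o /dev/null -w "%{http_code}" --max-time 10 \
    https://api.pagerduty.com/abilities \
    -H "Accept: application/vnd.pagerduty+json;version=2" \
    -H "Authorization: Token token=${PAGERDUTY_API_TOKEN}")
  classify_api_status "$code"
}
```

Calling `check_pagerduty_api` prints a single label. Anything other than `healthy` is worth cross-checking against the status page before you declare an outage.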
3. Test a Manual Event Integration
Send a test event to PagerDuty's Events API to verify event ingestion is working:
curl -s -o /dev/null -w "%{http_code}" \
-X POST https://events.pagerduty.com/v2/enqueue \
-H "Content-Type: application/json" \
-d '{"routing_key": "YOUR_INTEGRATION_KEY", "event_action": "trigger", "payload": {"summary": "Status test", "severity": "info", "source": "api-status-check"}}'
# Expect 202 Accepted
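Keep in mind that a `trigger` event opens a real alert on the service that owns the routing key, so a connectivity test can notify whoever is on call. One way to keep the test tidy is to send a matching `resolve` event with the same `dedup_key` afterwards. The sketch below builds the two payloads and wraps the request from this step. The function names `build_test_event` and `send_event` and the key `status-test-1` are hypothetical, and it assumes the values passed in contain no characters that need JSON escaping.

```shell
# Build an Events API v2 payload for a connectivity test.
# $1 = routing key, $2 = event action (trigger or resolve), $3 = dedup key
build_test_event() {
  if [ "$2" = "resolve" ]; then
    # A resolve event only needs the key of the alert it closes.
    printf '{"routing_key":"%s","event_action":"resolve","dedup_key":"%s"}\n' "$1" "$3"
  else
    printf '{"routing_key":"%s","event_action":"trigger","dedup_key":"%s","payload":{"summary":"Status test","severity":"info","source":"api-status-check"}}\n' "$1" "$3"
  fi
}

# Send a payload read from stdin and print the HTTP status code.
send_event() {
  curl -s -o /dev/null -w "%{http_code}" --max-time 10 \
    -X POST https://events.pagerduty.com/v2/enqueue \
    -H "Content-Type: application/json" \
    -d @-
}

# Example (makes real requests, so it is left commented out):
#   build_test_event "YOUR_INTEGRATION_KEY" trigger status-test-1 | send_event
#   build_test_event "YOUR_INTEGRATION_KEY" resolve status-test-1 | send_event
```

Both calls should print 202 when ingestion is working. Pointing the test at a dedicated low-urgency service is another way to avoid paging anyone.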
Emergency Playbook: PagerDuty Down During an Active Incident
Immediate Actions (First 5 Minutes):
- Manually page on-call engineers via Slack DMs or personal phone numbers.
- Create a war room channel in Slack: #incident-YYYY-MM-DD.
- Designate one person as incident commander manually.
- Start a shared Google Doc or Confluence page to log the timeline.
Backup Escalation Chain:
Every team should maintain a backup escalation list that does not depend on PagerDuty. Typical structure:
- Primary on-call: <phone number> (text first, call if no response in 2 min)
- Secondary on-call: <phone number>
- Engineering manager: <phone number>
- Director of Engineering: <phone number>
Keep this list in a location that doesn't require PagerDuty access: Confluence, a pinned Slack message, or a laminated card in your office.
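The chain above can also live in a small plain-text file, so the order and timing stay unambiguous under pressure. This is a sketch under assumptions: the `role|phone` file format, the function name `print_escalation_plan`, and the rule of moving to the next contact after a fixed wait are illustrative (the list above only specifies the 2-minute wait for the primary on-call), and any phone numbers are placeholders.

```shell
# Print a timed contact plan from lines of the form "role|phone" on stdin.
# $1 = minutes to wait before moving to the next contact (default 2).
print_escalation_plan() {
  wait_min="${1:-2}"
  elapsed=0
  while IFS='|' read -r role phone; do
    # Skip blank lines and comment lines.
    case "$role" in ""|\#*) continue ;; esac
    echo "T+${elapsed}m: contact ${role} at ${phone}"
    elapsed=$((elapsed + wait_min))
  done
}

# Example, assuming a hypothetical file named escalation.txt:
#   print_escalation_plan 2 < escalation.txt
```

Because the file is plain text, it can be pasted into a pinned Slack message or printed alongside the laminated card.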
Common PagerDuty Failure Modes
- Phone/SMS Notification Failures: The most common partial outage. PagerDuty relies on telephony providers (Twilio and others) for voice and SMS. Carrier issues can delay or drop alerts.
- Events API Ingestion Lag: High-volume alerting (e.g., an alert storm from a monitoring system) can cause the Events API to queue events, which delays processing.
- Web App Availability: The PagerDuty dashboard itself can be degraded while the core alerting infrastructure remains functional.
- AWS Regional Issues: PagerDuty is hosted on AWS. Disruptions in us-east-1 in particular can affect PagerDuty availability.
Building PagerDuty Redundancy
The SRE best practice is to treat your monitoring and incident management infrastructure as itself requiring high availability:
- Use multiple notification channels: Always configure phone AND email AND Slack as escalation paths. If phone fails, email often still works.
- Maintain an offline escalation document: Accessible via Google Drive, Confluence, or printed copy.
- Consider redundant alerting: Some teams use Opsgenie or Rootly as a secondary system for critical services, with PagerDuty as primary.
- Set up direct monitoring on PagerDuty: Use API Status Check to ping the PagerDuty Events API every 5 minutes and alert via alternative channels (email, Slack webhook) if it goes unresponsive.
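The last item can be approximated with a small watchdog that runs outside PagerDuty, for example from cron every 5 minutes. The sketch below is an illustration under stated assumptions, not the API Status Check product: `SLACK_WEBHOOK_URL`, `PD_TEST_ROUTING_KEY`, `STATE_FILE`, the function names, and the three-failure threshold are all hypothetical. It also assumes the routing key belongs to a dedicated test service so that probe events notify nobody, and that the Slack message contains no characters that need JSON escaping.

```shell
# Decide whether to alert: true only on the run where the failure count reaches the threshold.
# $1 = consecutive failures so far (including this run), $2 = threshold
should_alert() {
  [ "$1" -eq "$2" ]
}

# Update the consecutive-failure counter in a state file and print the new count.
# $1 = state file, $2 = HTTP code from the probe (202 means the event was accepted)
record_probe() {
  count=0
  [ -f "$1" ] && count=$(cat "$1")
  if [ "$2" = "202" ]; then count=0; else count=$((${count:-0} + 1)); fi
  echo "$count" > "$1"
  echo "$count"
}

# Post a plain-text message to a Slack incoming webhook.
notify_slack() {
  curl -s -o /dev/null --max-time 10 -X POST "$SLACK_WEBHOOK_URL" \
    -H "Content-Type: application/json" \
    -d "{\"text\": \"$1\"}"
}

# One watchdog run: probe the Events API, then alert via Slack if needed.
# The fixed dedup_key keeps repeated probes grouped under one alert.
run_watchdog() {
  code=$(curl -s -o /dev/null -w "%{http_code}" --max-time 10 \
    -X POST https://events.pagerduty.com/v2/enqueue \
    -H "Content-Type: application/json" \
    -d "{\"routing_key\": \"$PD_TEST_ROUTING_KEY\", \"event_action\": \"trigger\", \"dedup_key\": \"watchdog-probe\", \"payload\": {\"summary\": \"Watchdog probe\", \"severity\": \"info\", \"source\": \"watchdog\"}}")
  failures=$(record_probe "${STATE_FILE:-/tmp/pd-watchdog.count}" "$code")
  if should_alert "$failures" 3; then
    notify_slack "PagerDuty Events API probe failed 3 times in a row (last HTTP code: $code)"
  fi
}
```

Requiring several consecutive failures before alerting trades a few minutes of detection time for fewer false alarms from transient network errors. Tune the threshold to your own tolerance.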
Frequently Asked Questions
Is PagerDuty down or is it a carrier issue?
PagerDuty uses third-party carriers for SMS and phone notifications. If status.pagerduty.com shows "degraded notifications" but the API is healthy, it's likely a carrier issue: your alerts are being queued but delivery is slow.
Will I be compensated if PagerDuty misses an alert?
PagerDuty offers SLA credits on paid plans if uptime drops below guaranteed levels. Review your contract terms for credit amounts and how to file a claim.
Should we consider Opsgenie or Rootly as a backup?
For critical services, yes. Running a lightweight backup alerting system is a reasonable investment. Many teams use PagerDuty for primary on-call management and a simpler webhook-based backup for the most critical alerts.