Is PagerDuty Down? How to Check PagerDuty Status & Troubleshoot Alerts
Quick Answer: To check if PagerDuty is down, visit apistatuscheck.com/api/pagerduty for real-time monitoring, or check the official status.pagerduty.com page. Common signs include missing alert notifications, webhook delivery failures, mobile app sync issues, and integration connection errors with Slack, Microsoft Teams, or monitoring tools.
When your incident management system goes down, you're flying blind during the very moments you need visibility most. PagerDuty serves as the critical alerting backbone for thousands of engineering teams managing production incidents, making any service disruption a potential cascade failure for your entire incident response workflow. Whether you're missing critical alerts, experiencing webhook delays, or seeing integration failures, knowing how to quickly diagnose PagerDuty's status can mean the difference between a swift incident response and a prolonged outage.
How to Check PagerDuty Status in Real-Time
1. API Status Check (Fastest Method)
The most reliable way to verify PagerDuty's operational status is through apistatuscheck.com/api/pagerduty. This real-time monitoring service:
- Tests actual API endpoints including events, incidents, and services
- Monitors webhook delivery with sub-minute precision
- Tracks mobile notification latency across iOS and Android
- Provides instant alerts when PagerDuty issues are detected
- Provides historical uptime analysis over 30/60/90 day periods
- Runs regional health checks (US, EU, multi-region validation)
Unlike status pages that may lag during incidents, API Status Check performs continuous active health checks against PagerDuty's production infrastructure, detecting issues often within 60 seconds of occurrence.
2. Official PagerDuty Status Page
PagerDuty maintains status.pagerduty.com as their authoritative source for service status. The page displays:
- Current operational status for all components
- Active incident investigations and updates
- Scheduled maintenance windows
- Post-incident reports with root cause analysis
- Component-level granularity (API, Events, Webhooks, Mobile Push, Email, SMS, Voice)
Critical tip: Subscribe to status updates immediately. Use the "Subscribe to Updates" button to receive notifications via email, SMS, Slack webhook, or Atom/RSS feed. During major incidents, PagerDuty posts updates every 30-60 minutes until resolution.
3. Test the Events API Directly
For on-call engineers, making a test event submission confirms end-to-end alerting functionality:
# The Events API v2 authenticates with the routing_key in the body;
# no Authorization header is required
curl -X POST https://events.pagerduty.com/v2/enqueue \
-H 'Content-Type: application/json' \
-d '{
"routing_key": "YOUR_INTEGRATION_KEY",
"event_action": "trigger",
"payload": {
"summary": "Health check test",
"severity": "info",
"source": "monitoring"
}
}'
A successful response with HTTP 202 Accepted indicates the Events API is operational. If you don't receive a notification within 1-2 minutes, there may be downstream delivery issues.
4. Check Your Integrations
PagerDuty's value comes from integrations. If alerts aren't flowing, verify:
- Monitoring tool connections (Datadog, New Relic, Prometheus, Grafana)
- Slack/Teams integration status in PagerDuty settings
- Email gateway functionality (check spam/quarantine)
- Webhook endpoints receiving POST requests
Navigate to Configuration → Services → [Your Service] → Integrations to see connection health indicators and recent event counts.
5. Mobile App Connectivity Test
The PagerDuty mobile app (iOS/Android) has a built-in connectivity check:
- Open the PagerDuty mobile app
- Pull down to refresh your incident list
- Navigate to Settings → Test Notifications
- Send a test push notification
If the test notification arrives within 10 seconds, mobile push is working. Delays beyond 30 seconds often indicate service degradation.
Common PagerDuty Issues and How to Identify Them
Missing or Delayed Alert Notifications
Symptoms:
- Critical incidents created but no phone calls, SMS, or push notifications sent
- Alerts arriving 5-30+ minutes after incident creation
- Escalation policies not triggering subsequent on-call responders
- Mobile app showing incidents that never triggered notifications
What it means: This is the most critical failure mode. When the notification pipeline breaks, on-call engineers miss critical production incidents. Check if the issue affects all notification channels (push, SMS, voice) or specific ones.
Diagnostic steps:
# Check if event was received by PagerDuty
curl -H "Authorization: Token token=YOUR_API_KEY" \
-H "Accept: application/vnd.pagerduty+json;version=2" \
-X GET "https://api.pagerduty.com/incidents?since=2026-02-04T00:00:00Z"
# Verify service integration health
curl -H "Authorization: Token token=YOUR_API_KEY" \
-H "Accept: application/vnd.pagerduty+json;version=2" \
-X GET "https://api.pagerduty.com/services/SERVICE_ID"
Webhook Delivery Failures
Webhooks power critical automation workflows. Common webhook issues:
- Events not reaching your endpoint despite PagerDuty showing successful sends
- Webhook signature verification failing unexpectedly
- Massive delays between incident state changes and webhook delivery
- Duplicate webhook deliveries for the same event
Webhook test script:
const crypto = require('crypto');
function verifyWebhookSignature(payload, signature, secret) {
const computedSignature = crypto
.createHmac('sha256', secret)
.update(payload)
.digest('hex');
return `v1=${computedSignature}` === signature;
}
// Express endpoint example
app.post('/pagerduty-webhook', (req, res) => {
const signature = req.headers['x-pagerduty-signature'];
const payload = JSON.stringify(req.body);
if (!verifyWebhookSignature(payload, signature, WEBHOOK_SECRET)) {
console.error('Webhook signature verification failed');
return res.status(403).send('Forbidden');
}
// Process the webhook (v3 webhook payloads carry a single `event` object)
console.log('Received webhook event:', req.body.event);
res.status(200).send('OK');
});
Check PagerDuty's webhook delivery logs (Configuration → Extensions → [Your Webhook] → View Logs) for failed attempts, timeout errors, or rate limiting.
Integration Connection Errors
Slack integration issues:
- Incident updates not posting to Slack channels
- Unable to acknowledge incidents from Slack buttons
- "Slack authorization expired" errors
- Bot missing from critical alert channels
Microsoft Teams issues:
- Connector webhook failures
- Cards not rendering incident details
- Action buttons not working (acknowledge, resolve)
Resolution: Navigate to Integrations → [Slack/Teams] and click "Reconnect" or "Reauthorize." Ensure the bot has proper channel permissions.
API Errors and Rate Limiting
Common error responses during outages:
- 500 Internal Server Error - Backend service failure
- 502 Bad Gateway - Load balancer cannot reach PagerDuty servers
- 503 Service Unavailable - Temporary capacity overload
- 504 Gateway Timeout - Request processing exceeded timeout
- 429 Too Many Requests - Rate limit exceeded (unusual unless you're making excessive calls)
Monitoring script with exponential backoff:
import requests
import time
def check_pagerduty_health(api_key, max_retries=3):
headers = {
'Authorization': f'Token token={api_key}',
'Accept': 'application/vnd.pagerduty+json;version=2'
}
for attempt in range(max_retries):
try:
response = requests.get(
'https://api.pagerduty.com/users',
headers=headers,
timeout=10
)
if response.status_code == 200:
return True, "PagerDuty API healthy"
elif response.status_code >= 500:
# Server error - likely outage
if attempt < max_retries - 1:
wait = 2 ** attempt # Exponential backoff
time.sleep(wait)
continue
return False, f"API returned {response.status_code}"
else:
return False, f"Unexpected status: {response.status_code}"
except requests.exceptions.Timeout:
if attempt < max_retries - 1:
time.sleep(2 ** attempt)
continue
return False, "API request timed out"
except requests.exceptions.ConnectionError:
return False, "Cannot connect to PagerDuty API"
return False, "Max retries exceeded"
# Run health check
is_healthy, message = check_pagerduty_health(YOUR_API_KEY)
print(f"Health check: {message}")
Mobile App Sync Issues
Indicators the mobile app is affected:
- Incidents visible on web but not syncing to mobile
- Unable to acknowledge or resolve incidents from mobile
- "Connection error" when refreshing incident list
- Push notifications not arriving despite web showing them sent
Troubleshooting steps:
- Force close and reopen the app
- Check your device's internet connection (try both WiFi and cellular)
- Verify push notification permissions in device settings
- Log out and log back into the PagerDuty app
- Clear app cache (iOS: Delete and reinstall; Android: Settings → Apps → PagerDuty → Clear Cache)
Incident Routing Failures
Symptoms:
- Incidents created but assigned to the wrong service or escalation policy
- Events from monitoring tools not matching to correct PagerDuty services
- Routing keys suddenly not working
- New incidents stuck in "Unassigned" state
This often indicates integration key misconfiguration, but during outages, the routing engine itself may be impacted.
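To separate key misconfiguration from a platform outage, it helps to confirm what PagerDuty itself thinks is attached to the service. A hedged sketch against the REST API's service endpoint; the response shape assumed here follows the v2 API convention of a top-level service object:

```python
def summarize_integrations(service_payload):
    """Pure helper: (id, summary) pairs from a GET /services/{id} response."""
    svc = service_payload.get("service", {})
    return [(i.get("id"), i.get("summary")) for i in svc.get("integrations", [])]

def list_service_integrations(api_key, service_id):
    """Fetch a service and print the integrations PagerDuty has on record."""
    import requests  # imported lazily so the pure helper needs no dependencies
    resp = requests.get(
        f"https://api.pagerduty.com/services/{service_id}",
        headers={
            "Authorization": f"Token token={api_key}",
            "Accept": "application/vnd.pagerduty+json;version=2",
        },
        timeout=10,
    )
    resp.raise_for_status()
    for integration_id, summary in summarize_integrations(resp.json()):
        print(f"{integration_id}: {summary}")
```

If the integration you expect is missing from the output, the problem is configuration; if the call itself fails with 5xx errors, suspect the platform.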
The Real Impact When PagerDuty Goes Down
Blind Spots During Production Incidents
The paradox of incident management platform outages: you lose visibility precisely when you need it most.
- Critical alerts go unnoticed - Database down? API failing? Nobody knows.
- Mean Time to Detection (MTTD) skyrockets - Instead of 2-minute alert response, issues go undetected for 20+ minutes
- Manual monitoring required - Engineers must constantly check dashboards instead of getting proactive alerts
- On-call schedule chaos - No automated escalation means manual phone trees
For a SaaS company with a 99.9% SLA, a 30-minute PagerDuty outage during a production incident could breach customer commitments and trigger financial penalties.
Cascade Failures in Incident Response
Modern incident response relies on PagerDuty as the coordination hub:
- War room assembly delayed - Slack integration down means no automatic channel creation
- Stakeholder communication breaks - Automated status page updates fail
- Runbook automation disabled - Webhook-triggered remediation scripts don't fire
- Post-incident analysis incomplete - Incident timeline data missing or corrupted
Real-world scenario: A database failover occurs. Normally, PagerDuty receives the alert from Datadog, pages the on-call database engineer, creates a Slack channel, and triggers automated failover scripts via webhooks. With PagerDuty down, none of this happens automatically. The failover takes 45 minutes instead of 5 minutes, causing customer-facing downtime.
On-Call Engineer Stress and Alert Fatigue
When PagerDuty recovers from an outage, on-call engineers face:
- Alert storms - Hundreds of queued notifications deliver simultaneously
- Duplicate pages - Same incident triggers multiple notification attempts
- False escalations - Issues already resolved trigger delayed escalation policies
- Out-of-order webhooks - Automation systems receive events in wrong sequence
This creates alert fatigue and can cause engineers to miss genuinely critical alerts buried in the noise.
Broken Multi-Team Coordination
For large engineering organizations using PagerDuty's team features:
- Cross-team escalations fail - Infrastructure team can't automatically escalate to application team
- Manager escalations blocked - Critical P0 incidents that should notify directors/VPs go unseen
- Follow-the-sun on-call breaks - Global teams can't smoothly hand off incidents across time zones
- Service dependencies ignored - Cascading failures across microservices don't trigger appropriate responders
Compliance and Audit Trail Loss
Many regulated industries (finance, healthcare, government) rely on PagerDuty for compliance:
- Incomplete incident records - Gaps in incident timeline for audit reports
- Response time violations - Unable to prove 5-minute response SLAs were met
- Change management tracking broken - Incident-to-change correlation data missing
- SOC 2/ISO 27001 evidence gaps - Auditors can't verify incident response procedures
During PagerDuty outages, organizations should activate manual incident logging procedures to maintain compliance documentation.
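A manual logging fallback can be as simple as an append-only JSONL file that preserves the incident timeline for auditors. A minimal sketch; the file name and field names here are arbitrary choices, not a standard:

```python
import json
import time
from pathlib import Path

LOG_PATH = Path("manual_incident_log.jsonl")  # arbitrary location

def log_incident_event(incident_id, action, actor, note=""):
    """Append one timestamped entry to the manual audit log."""
    record = {
        "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "incident_id": incident_id,
        "action": action,  # e.g. "detected", "acknowledged", "resolved"
        "actor": actor,
        "note": note,
    }
    with LOG_PATH.open("a") as f:
        f.write(json.dumps(record) + "\n")
    return record
```

Because entries are append-only and timestamped, the file can later be reconciled against PagerDuty's own incident records once service recovers.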
What to Do When PagerDuty Goes Down
1. Implement Multi-Channel Alerting with Fallback
Never rely solely on PagerDuty. Build a multi-layer alerting strategy:
// Multi-provider alerting with automatic failover
const alertProviders = [
{
name: 'pagerduty',
send: async (incident) => {
const response = await fetch('https://events.pagerduty.com/v2/enqueue', {
method: 'POST',
headers: {
'Content-Type': 'application/json',
'Authorization': `Token token=${PAGERDUTY_KEY}`
},
body: JSON.stringify({
routing_key: PAGERDUTY_INTEGRATION_KEY,
event_action: 'trigger',
payload: {
summary: incident.title,
severity: incident.severity,
source: incident.source
}
}),
signal: AbortSignal.timeout(5000) // fetch options use AbortSignal, not timeout
});
return response.ok;
}
},
{
name: 'opsgenie',
send: async (incident) => {
// Fallback to Opsgenie
const response = await fetch('https://api.opsgenie.com/v2/alerts', {
method: 'POST',
headers: {
'Content-Type': 'application/json',
'Authorization': `GenieKey ${OPSGENIE_API_KEY}`
},
body: JSON.stringify({
message: incident.title,
priority: incident.severity,
source: incident.source
}),
signal: AbortSignal.timeout(5000) // fetch options use AbortSignal, not timeout
});
return response.ok;
}
},
{
name: 'slack',
send: async (incident) => {
// Ultimate fallback to Slack
const response = await fetch(SLACK_WEBHOOK_URL, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
text: `🚨 CRITICAL ALERT: ${incident.title}`,
attachments: [{
color: 'danger',
text: `Severity: ${incident.severity}\nSource: ${incident.source}\nPagerDuty may be down - check manually!`
}]
}),
signal: AbortSignal.timeout(5000) // fetch options use AbortSignal, not timeout
});
return response.ok;
}
}
];
async function sendAlertWithFallback(incident) {
for (const provider of alertProviders) {
try {
const success = await provider.send(incident);
if (success) {
console.log(`Alert sent via ${provider.name}`);
return true;
}
} catch (error) {
console.error(`${provider.name} failed:`, error.message);
// Continue to next provider
}
}
// All providers failed - log critically
console.error('ALL ALERTING PROVIDERS FAILED - MANUAL INTERVENTION REQUIRED');
return false;
}
This ensures critical alerts reach your team even if PagerDuty is completely unavailable.
2. Build Webhook Retry Logic with Dead Letter Queue
Webhook failures during PagerDuty outages can break critical automation. Implement robust retry logic:
const { Queue } = require('bullmq');
const Redis = require('ioredis');
const connection = new Redis({ maxRetriesPerRequest: null });
const webhookQueue = new Queue('pagerduty-webhooks', { connection });
async function processWebhook(webhookData) {
const maxAttempts = 5;
const backoffMultiplier = 2;
await webhookQueue.add('process-webhook', webhookData, {
attempts: maxAttempts,
backoff: {
type: 'exponential',
delay: 2000 // Start with 2 seconds, doubles each retry
},
removeOnComplete: 1000, // Keep last 1000 successful jobs
removeOnFail: 5000 // Keep last 5000 failed jobs for analysis
});
}
// Worker to process webhooks
const { Worker } = require('bullmq');
const worker = new Worker('pagerduty-webhooks', async (job) => {
const webhookData = job.data; // job.data is the webhook payload itself
// Attempt to send webhook to your internal system
const response = await fetch('https://your-system.com/incident-handler', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify(webhookData),
signal: AbortSignal.timeout(10000) // fetch options use AbortSignal, not timeout
});
if (!response.ok) {
throw new Error(`Webhook processing failed: ${response.status}`);
}
return { success: true, processedAt: Date.now() };
}, { connection });
// Handle failed webhooks after all retries exhausted
worker.on('failed', (job, err) => {
console.error(`Webhook ${job.id} failed after ${job.attemptsMade} attempts`);
// Send to a dead letter queue for manual review
// (saveToDeadLetterQueue is your own persistence hook, e.g. a DB table)
saveToDeadLetterQueue(job.data);
});
This pattern ensures webhooks are retried intelligently and never silently dropped.
3. Maintain Manual On-Call Backup Procedures
Document and regularly test manual escalation procedures:
Manual On-Call Runbook:
- Primary check: Review monitoring dashboards every 15 minutes during incidents
- Secondary escalation: If no response in 10 minutes, call the on-call engineer directly (keep phone list updated)
- Manager escalation: After 20 minutes, escalate to engineering manager
- War room assembly: Start Slack/Zoom incident channel manually
- Status updates: Post to status page manually every 30 minutes
Store this information redundantly:
- Printed copy in office (for datacenter incidents affecting cloud services)
- Google Doc shared with entire engineering team
- Slack pinned message in #incidents channel
- On-call phone tree in company wiki
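The runbook's timings can also be encoded as a small helper that tells whoever is coordinating what to do next. The thresholds mirror the steps above; adapt them to your own escalation policy:

```python
def escalation_tier(minutes_elapsed, acknowledged=False):
    """Map time since incident start to the manual runbook's next step."""
    if acknowledged:
        return "on-call engaged; continue incident response"
    if minutes_elapsed >= 20:
        return "escalate to engineering manager"
    if minutes_elapsed >= 10:
        return "call the on-call engineer directly"
    return "keep reviewing monitoring dashboards"
```

Wiring this into a cron job or chat bot that posts the current tier into your #incidents channel keeps everyone honest about whether escalation deadlines have passed.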
4. Test Your Incident Response Without PagerDuty
Quarterly "disaster recovery drills" should include PagerDuty outage scenarios:
Drill exercise:
- Announce: "Simulated PagerDuty outage starting now"
- Disable PagerDuty integrations for 30 minutes
- Trigger a simulated production incident
- Measure: How long until the on-call engineer is notified via fallback methods?
- Document: What broke? What worked? What needs improvement?
Teams that regularly practice PagerDuty-less incident response consistently respond faster during real outages.
5. Monitor PagerDuty's Health Continuously
Don't wait for alerts to fail before discovering PagerDuty is down:
import requests
import time
from datetime import datetime
def continuous_pagerduty_health_check():
"""
Runs every 60 seconds to verify PagerDuty is operational
Alerts via alternative channel if health check fails
"""
while True:
try:
# Test Events API
start_time = time.time()
events_response = requests.post(
'https://events.pagerduty.com/v2/enqueue',
# Events API v2 authenticates via routing_key; no Authorization header needed
headers={'Content-Type': 'application/json'},
json={
'routing_key': INTEGRATION_KEY,
'event_action': 'trigger',
'payload': {
'summary': 'Health check',
'severity': 'info',
'source': 'health-monitor',
# custom_details is informational; use an event rule if you need suppression
'custom_details': {'health_check': True}
}
},
timeout=10
)
events_latency = (time.time() - start_time) * 1000
# Test REST API
start_time = time.time()
api_response = requests.get(
'https://api.pagerduty.com/users',
headers={
'Authorization': f'Token token={API_KEY}',
'Accept': 'application/vnd.pagerduty+json;version=2'
},
timeout=10
)
api_latency = (time.time() - start_time) * 1000
# Evaluate health
if events_response.status_code == 202 and api_response.status_code == 200:
if events_latency < 1000 and api_latency < 1000:
print(f"[{datetime.now()}] PagerDuty healthy - Events: {events_latency:.0f}ms, API: {api_latency:.0f}ms")
else:
print(f"[{datetime.now()}] WARNING: High latency - Events: {events_latency:.0f}ms, API: {api_latency:.0f}ms")
send_slack_alert(f"PagerDuty latency elevated: Events {events_latency:.0f}ms, API {api_latency:.0f}ms")
else:
print(f"[{datetime.now()}] ERROR: PagerDuty unhealthy - Events: {events_response.status_code}, API: {api_response.status_code}")
send_slack_alert(f"🚨 PagerDuty may be down! Events API: {events_response.status_code}, REST API: {api_response.status_code}")
except requests.exceptions.Timeout:
print(f"[{datetime.now()}] ERROR: PagerDuty API timeout")
send_slack_alert("🚨 PagerDuty API timeout - possible outage")
except Exception as e:
print(f"[{datetime.now()}] ERROR: Health check failed - {str(e)}")
send_slack_alert(f"🚨 PagerDuty health check error: {str(e)}")
time.sleep(60) # Check every minute
def send_slack_alert(message):
"""Send alert via Slack as backup notification channel"""
requests.post(
SLACK_WEBHOOK_URL,
json={'text': message},
timeout=5
)
if __name__ == '__main__':
continuous_pagerduty_health_check()
Run this as a separate service outside your main application infrastructure. Consider deploying it to a different cloud provider than your primary stack for maximum resilience.
6. Subscribe to Multiple Alert Channels
Don't rely on a single notification method:
- Official status page: status.pagerduty.com - Subscribe via email and Slack
- API Status Check: apistatuscheck.com/api/pagerduty - Real-time monitoring with instant alerts
- Twitter: Follow @PagerDuty and enable mobile notifications for tweets
- Community forums: community.pagerduty.com - Other users often report issues quickly
- RSS feed: Add status.pagerduty.com RSS to your feed reader
The redundancy ensures you learn about outages through multiple channels.
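The RSS option can itself be automated. Status pages hosted on Atlassian Statuspage (PagerDuty's among them) conventionally expose an Atom feed at /history.atom; the feed URL and structure below are assumptions based on that convention, so verify against the live page before relying on it:

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

def parse_status_feed(atom_xml):
    """Pure helper: extract (title, updated) pairs from an Atom feed string."""
    root = ET.fromstring(atom_xml)
    return [
        (entry.findtext(f"{ATOM}title", default=""),
         entry.findtext(f"{ATOM}updated", default=""))
        for entry in root.findall(f"{ATOM}entry")
    ]

def latest_status_entries(feed_url="https://status.pagerduty.com/history.atom"):
    """Fetch and parse the status feed (assumed location)."""
    import requests  # imported lazily so the parser is usable without it
    resp = requests.get(feed_url, timeout=10)
    resp.raise_for_status()
    return parse_status_feed(resp.text)
```

Polling this feed every few minutes and diffing the newest entry's title gives you an outage signal that does not depend on PagerDuty's own notification pipeline.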
Related Resources
- Is Slack Down? - Troubleshoot Slack integration issues when PagerDuty alerts fail to reach your channels
- API Outage Alerts via RSS, Slack & Discord - Build multi-channel alerting systems as PagerDuty backup
- API Monitoring Comparison 2026 - Alternative incident management platforms (Opsgenie, VictorOps, xMatters)
- Why Status Pages Lie - Understand limitations of status pages and why active monitoring matters
Frequently Asked Questions
How often does PagerDuty go down?
PagerDuty maintains industry-leading reliability with 99.99%+ uptime. Major incidents affecting all customers occur 1-2 times per year, typically lasting less than 2 hours. Regional or component-specific issues (like email delivery or specific integrations) may occur more frequently. Most engineering teams experience zero significant PagerDuty downtime annually.
What's the difference between PagerDuty status page and API Status Check?
PagerDuty's official status page (status.pagerduty.com) is manually updated by their operations team, which can lag by 5-15 minutes during incident detection and triage. API Status Check performs automated health checks every 60 seconds against live PagerDuty endpoints (Events API, REST API, webhooks), often detecting issues before they're officially acknowledged. Use both for comprehensive visibility.
Can I get SLA credits for PagerDuty downtime?
PagerDuty offers SLA commitments with their Business and Digital Operations plans (typically 99.9% uptime guarantee). If they fail to meet the SLA, you may be eligible for service credits. Check your specific plan's Service Level Agreement document or contact PagerDuty support with incident dates to request credits. Enterprise customers often have custom SLA terms.
Should I use webhook notifications or polling for critical workflows?
For mission-critical workflows, implement both: rely on webhooks as the primary mechanism (low latency, efficient) but include scheduled polling as a safety net. Poll PagerDuty's incidents API every 1-5 minutes to catch any incidents that webhooks failed to deliver. During outages, webhooks may be delayed or lost entirely, so polling ensures you don't miss critical incidents.
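A polling safety net along those lines can be sketched in a few lines. Here `seen_ids` would be populated by your webhook handler; the endpoint and headers follow the REST API examples earlier in this article:

```python
def new_incident_ids(polled_incidents, seen_ids):
    """Pure helper: incidents the poll found that webhooks never delivered."""
    return [i["id"] for i in polled_incidents if i["id"] not in seen_ids]

def poll_incidents(api_key, since_iso):
    """Safety-net poll of the incidents endpoint (run every 1-5 minutes)."""
    import requests  # imported lazily so the pure helper needs no dependencies
    resp = requests.get(
        "https://api.pagerduty.com/incidents",
        headers={
            "Authorization": f"Token token={api_key}",
            "Accept": "application/vnd.pagerduty+json;version=2",
        },
        params={"since": since_iso},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json().get("incidents", [])
```

Any id returned by `new_incident_ids` represents an incident your webhook pipeline missed and should be routed through your fallback alerting path.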
How do I prevent alert storms after PagerDuty recovers from an outage?
Implement de-duplication logic in your monitoring tools:
- Use PagerDuty's deduplication keys - Same dedup_key merges multiple events into one incident
- Rate limit notifications - If >10 incidents in 60 seconds, suppress notifications and send summary instead
- Auto-resolve old incidents - Script to resolve incidents >30 minutes old after outage recovery
- Intelligent suppression - Don't re-alert for issues already resolved in your system
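The deduplication mechanic can be sketched as follows: derive a stable key from the alert's identity so retries and post-outage replays collapse into a single incident. The helper name and key derivation are illustrative choices; dedup_key itself is a standard Events API v2 field:

```python
import hashlib

def build_trigger_event(routing_key, summary, source, severity="critical"):
    """Events API v2 trigger with a deterministic dedup_key: sending the
    same alert twice merges into a single PagerDuty incident."""
    dedup = hashlib.sha256(f"{source}:{summary}".encode()).hexdigest()[:32]
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "dedup_key": dedup,
        "payload": {"summary": summary, "source": source, "severity": severity},
    }
```

Because the key is a hash of the alert's source and summary rather than a random value, every producer that emits the same alert lands on the same incident.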
What should I use as a backup to PagerDuty?
Popular PagerDuty alternatives for fallback scenarios:
- Opsgenie (by Atlassian) - Similar feature set, good multi-provider option
- VictorOps/Splunk On-Call - Strong incident collaboration features
- xMatters - Enterprise-focused with advanced workflow automation
- Simple fallbacks: Direct Slack webhooks, email forwarding, SMS via Twilio
- Free monitoring options can serve as basic backup alerting
How do I test if my PagerDuty integration is working?
Quick integration test:
- Navigate to your Service in PagerDuty web interface
- Click "New Incident" button to manually create a test incident
- Verify you receive notifications via all configured channels (push, SMS, call)
- Check Slack/Teams channels for incident messages
- Acknowledge from mobile app to test bidirectional sync
- Resolve the incident
Automated test: Use the Events API test script above to create incidents programmatically and verify end-to-end flow.
What regions does PagerDuty operate in?
PagerDuty operates globally with primary infrastructure in:
- United States (US East, US West)
- Europe (EU region available for GDPR compliance)
- Multi-region redundancy for high availability
An outage may affect specific regions while others remain operational. PagerDuty's status page shows regional breakdowns during incidents. Enterprise customers can request specific regional deployments.
How long do PagerDuty outages typically last?
Historical data shows:
- Minor incidents (single component): 15-45 minutes average
- Major incidents (multiple components): 1-3 hours average
- Critical outages (full platform): Rare, typically resolved within 2-4 hours
PagerDuty's engineering team has strong incident response capabilities, often restoring core functionality (event ingestion) within 30 minutes, with full feature recovery following incrementally.
Is there a PagerDuty downtime notification service?
Yes, several monitoring options exist:
- Official: Subscribe to status.pagerduty.com via email, SMS, Slack, or RSS
- Third-party: API Status Check provides real-time monitoring with customizable alerts
- Self-hosted: Deploy the health check scripts above to monitor from your infrastructure
- Community: Join r/pagerduty or PagerDuty Community forums where users report issues
Recommendation: Use multiple methods (official status page + API Status Check) for redundant notification.
Stay Ahead of PagerDuty Outages
Don't let incident management failures catch you off guard. Subscribe to real-time PagerDuty alerts and get notified instantly when issues are detected—before your entire on-call workflow breaks.
API Status Check monitors PagerDuty 24/7 with:
- 60-second health checks across Events API, REST API, and webhooks
- Instant alerts via email, Slack, Discord, or webhook
- Historical uptime tracking and incident timeline analysis
- Multi-platform monitoring for your entire incident response stack
Start monitoring PagerDuty now →
Last updated: February 4, 2026. PagerDuty status information is provided in real-time based on active monitoring. For official incident reports, always refer to status.pagerduty.com.