Complete API Dependency Monitoring Strategy: From Detection to Recovery
Modern applications depend on dozens of third-party APIs—payment processors, authentication services, cloud infrastructure, communication platforms. When these dependencies fail, your application fails.
Yet most engineering teams don't discover API outages until users report them. By then, the damage is done: transactions failed, users frustrated, revenue lost.
This guide covers everything you need to build a resilient API dependency monitoring strategy—from detection to recovery.
Why API Dependency Monitoring Matters
The hidden cost of API downtime:
- Revenue impact: Stripe outage = no payments processed
- User experience: Auth0 down = no one can log in
- Cascading failures: One API failure breaks multiple features
- Mean Time to Recovery (MTTR): The difference between 5-minute and 2-hour outages is usually detection speed
Real-world example: During past major AWS outages, companies with proactive monitoring pivoted to fallback regions within 10 minutes. Those relying on user reports took 2+ hours to respond.
The difference? A comprehensive dependency monitoring strategy.
The 4 Layers of API Dependency Monitoring
Effective monitoring requires multiple layers. Relying on just one creates blind spots.
Layer 1: Status Page Monitoring
What it is: Track official status pages from your API providers.
Why it matters:
- First source of truth during outages
- Often updated before your own monitoring detects issues
- Provides context: "planned maintenance" vs "unexpected outage"
How to implement:
- Centralized dashboard: Use API Status Check to monitor 1,000+ status pages in one place
- RSS/Webhook alerts: Get notified the moment providers update their status
- Historical tracking: Understand which providers are most reliable
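Many provider status pages are hosted on Atlassian Statuspage and expose a machine-readable `/api/v2/status.json` endpoint (GitHub's status page, for example). A minimal polling sketch — `notifySlack` is a hypothetical helper you would supply:

```javascript
// Map a Statuspage-style payload to an alert decision.
// `indicator` is one of: none, minor, major, critical
function assessStatus(payload) {
  const indicator = payload?.status?.indicator ?? 'unknown';
  return { alert: indicator !== 'none', indicator };
}

// Poll one provider and notify on degradation (notifySlack is a
// hypothetical helper; swap in your own alerting channel)
async function pollStatusPage(url, notifySlack) {
  const res = await fetch(url);
  const { alert, indicator } = assessStatus(await res.json());
  if (alert) {
    await notifySlack(`Status page reports "${indicator}" for ${url}`);
  }
  return indicator;
}
```

Run it on a short interval (every 1–2 minutes) per critical provider, or let an aggregator do the polling for you.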
Services to monitor:
- Cloud infrastructure: AWS, Google Cloud, Azure
- Payments: Stripe, PayPal
- Authentication: Auth0, Okta
- Communication: Twilio, SendGrid
- Development tools: GitHub, Vercel, Supabase
See all 1,000+ monitored services →
Layer 2: Active Health Checks
What it is: Ping your dependencies at regular intervals to verify they're responsive.
Why it matters: Status pages aren't always updated immediately. Your health checks might detect issues first.
Implementation approaches:
Simple HTTP health checks:
```bash
# Basic uptime monitoring (the health endpoint path is illustrative —
# use whichever endpoint your provider actually documents)
curl -fsS --max-time 10 https://api.stripe.com/healthcheck || alert_team
```
Synthetic transactions:
```javascript
// Test critical user flows (the auth0/api clients are illustrative)
async function healthCheck() {
  // 1. Create a test user (or better: reuse a dedicated health-check account)
  const user = await auth0.createUser({ email: 'test@example.com' });
  try {
    // 2. Authenticate
    const token = await auth0.login(user);
    // 3. Make an API call
    const response = await api.getData(token);
    // 4. Verify the response
    if (!response.ok) {
      throw new Error('API health check failed');
    }
  } finally {
    // Clean up so test users don't accumulate
    await auth0.deleteUser(user.id);
  }
}
```
What to monitor:
- Response time (p50, p95, p99)
- Error rates (4xx, 5xx)
- Timeout rates
- SSL certificate expiration
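A sketch of how those latency percentiles can be computed over a window of recorded response times (nearest-rank method; your metrics backend normally does this for you):

```javascript
// Nearest-rank percentile over a window of samples
function percentile(samples, p) {
  if (samples.length === 0) return NaN;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// Summarize a window of response times in one pass
function latencySummary(samples) {
  return {
    p50: percentile(samples, 50),
    p95: percentile(samples, 95),
    p99: percentile(samples, 99)
  };
}
```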
Tools:
- Better Stack — synthetic monitoring
- UptimeRobot — simple uptime checks
- Datadog — APM + synthetic monitoring
- Pingdom — global monitoring network
Layer 3: Error Rate Monitoring
What it is: Track the error rates in your application logs to detect API failures affecting real users.
Why it matters: Health checks test one scenario. Production traffic reveals edge cases.
Implementation:
```typescript
// Instrument API calls with error tracking. Note: fetch() has no
// `timeout` option — use AbortSignal.timeout() instead.
async function callExternalAPI(endpoint: string, data: unknown) {
  const startTime = Date.now();
  try {
    const response = await fetch(endpoint, {
      method: 'POST',
      body: JSON.stringify(data),
      signal: AbortSignal.timeout(10_000)
    });
    // Track response time
    metrics.histogram('api.response_time', Date.now() - startTime, {
      service: 'stripe',
      endpoint
    });
    if (!response.ok) {
      // Track errors by type
      metrics.increment('api.errors', {
        service: 'stripe',
        status: response.status,
        type: response.status >= 500 ? 'server' : 'client'
      });
      throw new Error(`API error: ${response.status}`);
    }
    return response.json();
  } catch (error) {
    // AbortSignal.timeout() rejects with a TimeoutError
    if (error.name === 'TimeoutError' || error.name === 'AbortError') {
      metrics.increment('api.timeouts', { service: 'stripe' });
    } else {
      metrics.increment('api.network_errors', { service: 'stripe' });
    }
    throw error;
  }
}
```
Alert on anomalies:
- Error rate > 5% over 5 minutes
- Timeout rate > 2%
- Response time p99 > 5 seconds
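One way to implement those thresholds is a small sliding-window tracker; a sketch (the class and its defaults are illustrative, matching the 5%-over-5-minutes rule above):

```javascript
// Sliding-window error-rate tracker: record call outcomes, then
// check whether the error rate over the window breaches a threshold
class ErrorRateAlarm {
  constructor(windowMs = 5 * 60 * 1000, threshold = 0.05) {
    this.windowMs = windowMs;
    this.threshold = threshold;
    this.events = []; // { at, ok }
  }

  record(ok, now = Date.now()) {
    this.events.push({ at: now, ok });
    // Drop events that have left the window
    this.events = this.events.filter(e => now - e.at <= this.windowMs);
  }

  shouldAlert(now = Date.now()) {
    const recent = this.events.filter(e => now - e.at <= this.windowMs);
    if (recent.length === 0) return false;
    const errors = recent.filter(e => !e.ok).length;
    return errors / recent.length > this.threshold;
  }
}
```

Call `record(response.ok)` from your API wrapper and poll `shouldAlert()` from your alerting loop.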
Tools:
- Sentry — error tracking
- Datadog APM — distributed tracing
- New Relic — application monitoring
Layer 4: User Impact Monitoring
What it is: Track the business metrics that matter when APIs fail.
Why it matters: Alerts should reflect user impact, not just technical metrics.
What to track:
| Metric | Indicates | Alert Threshold |
|---|---|---|
| Successful checkouts | Payment API health | < 95% success rate |
| Login success rate | Auth API health | < 98% success rate |
| Message delivery rate | Communication API health | < 99% delivery |
| API response time (user-facing) | Overall API performance | p95 > 3 seconds |
Example dashboard:
```yaml
# Business impact dashboard
- widget: "Checkout Success Rate"
  query: "successful_checkouts / total_checkout_attempts"
  alert: "< 0.95 for 10 minutes"
- widget: "Failed Payment API Calls"
  query: "count(stripe_errors) / count(stripe_calls)"
  alert: "> 0.05 for 5 minutes"
```
Alerting Strategy: Signal vs Noise
Bad alerting = too many false positives = alert fatigue = missed real outages.
The golden rules:
1. Alert on Impact, Not Symptoms
❌ Bad: "Stripe API returned 503"
✅ Good: "Payment processing success rate dropped to 85% (threshold: 95%)"
2. Use Escalating Severity
- Info: Single health check failure (might be a blip)
- Warning: 3 consecutive failures OR error rate > 5%
- Critical: Error rate > 10% OR user-facing impact confirmed
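These tiers reduce to a small pure function, which keeps the policy testable; a sketch using the thresholds above:

```javascript
// Map health-check and error-rate signals to the severity tiers above
function severity({ consecutiveFailures = 0, errorRate = 0, userImpact = false }) {
  if (errorRate > 0.10 || userImpact) return 'critical';
  if (consecutiveFailures >= 3 || errorRate > 0.05) return 'warning';
  if (consecutiveFailures >= 1) return 'info';
  return 'ok';
}
```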
3. Group Related Alerts
If Stripe is down, you don't need 15 separate alerts for every affected endpoint. Group them:
"🚨 CRITICAL: Stripe API outage detected. Affected: /charges, /customers, /subscriptions. Impact: 0% successful payments."
4. Include Context
Every alert should answer:
- What's failing? (Stripe Charges API)
- How bad? (100% error rate for 5 minutes)
- Impact? (No payments processing, $2K/min revenue loss)
- What to do? (Check Stripe status page, enable fallback)
Alert template:
```text
🚨 [CRITICAL] Stripe API Outage

Status: 100% error rate for 8 minutes
Impact: Payment processing completely down
Revenue loss: ~$16K (estimated)

Quick checks:
1. Stripe status: https://status.stripe.com
2. API Status Check: https://apistatuscheck.com/status/stripe
3. Enable manual payment fallback: /runbook/stripe-fallback

Runbook: https://docs.company.com/runbooks/stripe-outage
```
Building Fallback Strategies
Monitoring is only half the battle. You need fallback strategies for when APIs fail.
Pattern 1: Graceful Degradation
Concept: Disable non-critical features, keep core functionality working.
Example: Auth provider is down
- ✅ Allow previously authenticated users to continue
- ✅ Show "Login temporarily unavailable" message
- ❌ Don't block entire application
```javascript
async function authenticate(credentials) {
  try {
    return await auth0.login(credentials);
  } catch (error) {
    if (isAuth0Down(error)) {
      // Fall back to the local session cache
      const cachedSession = await sessionCache.get(credentials.email);
      if (cachedSession && !isExpired(cachedSession)) {
        logger.warn('Auth0 down, using cached session');
        return cachedSession;
      }
    }
    throw error;
  }
}
```
Pattern 2: Circuit Breaker
Concept: Stop calling a failing API to prevent cascading failures.
Implementation:
```javascript
class CircuitBreaker {
  constructor(threshold = 5, timeout = 60000) {
    this.failureCount = 0;
    this.threshold = threshold;
    this.timeout = timeout;
    this.state = 'CLOSED'; // CLOSED, OPEN, HALF_OPEN
    this.nextAttempt = Date.now();
  }

  async call(fn) {
    if (this.state === 'OPEN') {
      if (Date.now() < this.nextAttempt) {
        throw new Error('Circuit breaker is OPEN');
      }
      // Timeout elapsed: allow one probe request through
      this.state = 'HALF_OPEN';
    }
    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  onSuccess() {
    this.failureCount = 0;
    this.state = 'CLOSED';
  }

  onFailure() {
    this.failureCount++;
    if (this.failureCount >= this.threshold) {
      this.state = 'OPEN';
      this.nextAttempt = Date.now() + this.timeout;
      logger.error('Circuit breaker opened');
    }
  }
}

// Usage
const stripeBreaker = new CircuitBreaker();

async function createCharge(amount) {
  try {
    return await stripeBreaker.call(() => stripe.charges.create({ amount }));
  } catch (error) {
    // Fallback: queue for later or use an alternative processor
    await paymentQueue.add({ amount, provider: 'stripe' });
    throw new Error('Payment queued due to Stripe outage');
  }
}
```
Pattern 3: Failover to Alternative Provider
Concept: Switch to a backup provider when primary fails.
Example: Payment processing
```javascript
async function processPayment(amount, paymentInfo) {
  // Providers in priority order, each behind its own circuit breaker
  const providers = [
    { name: 'stripe', client: stripeClient, breaker: stripeBreaker },
    { name: 'braintree', client: braintreeClient, breaker: braintreeBreaker }
  ];

  for (const provider of providers) {
    try {
      return await provider.breaker.call(() =>
        provider.client.createCharge(amount, paymentInfo)
      );
    } catch (error) {
      logger.warn(`${provider.name} failed, trying next provider`);
    }
  }

  throw new Error('All payment providers failed');
}
```
Pattern 4: Request Queue + Retry
Concept: Queue failed requests and retry when the API recovers.
Implementation:
```javascript
import Bull from 'bull';

const paymentQueue = new Bull('payments', {
  redis: { host: 'localhost', port: 6379 }
});

// Process the queue
paymentQueue.process(async (job) => {
  const { amount, customerId } = job.data;

  // Check if Stripe is healthy (checkAPIStatus is a helper you'd
  // implement, e.g. against the provider's status page)
  const stripeStatus = await checkAPIStatus('stripe');
  if (!stripeStatus.operational) {
    throw new Error('Stripe still down, will retry');
  }

  // Process the payment
  return await stripe.charges.create({ amount, customer: customerId });
});

// Add to the queue on failure
async function processPaymentWithRetry(amount, customerId) {
  try {
    return await stripe.charges.create({ amount, customer: customerId });
  } catch (error) {
    if (isStripeOutage(error)) {
      await paymentQueue.add(
        { amount, customerId },
        {
          attempts: 10,
          backoff: { type: 'exponential', delay: 5000 }
        }
      );
      return { status: 'queued', message: 'Payment will be processed when Stripe recovers' };
    }
    throw error;
  }
}
```
Incident Response Runbook
When an API dependency fails, speed matters. A clear runbook reduces MTTR.
1. Detection (0-2 minutes)
- Alert fires
- On-call engineer acknowledges
- Check official status pages:
- API Status Check aggregator
- Provider's status page
- DownDetector (user reports)
2. Assessment (2-5 minutes)
Determine scope:
- Which API endpoints are affected?
- What's the error rate? (5%? 50%? 100%?)
- How many users are impacted?
- What's the business impact? (revenue loss, support tickets)
Check monitoring dashboards:
Dashboard: API Dependencies
- Error rate by endpoint
- Failed transaction count
- Revenue impact estimate
- User-facing error rate
3. Communication (5-10 minutes)
Internal:
- Update status page
- Alert relevant teams (product, support, leadership)
- Create incident Slack channel
External (if user-facing):
- Update company status page
- Post on social media (if major)
- Prepare support team with messaging
Example status update:
```text
🟡 Partial Outage: Payment Processing

We're experiencing issues with payment processing due to
a third-party provider outage.

Current impact:
- Credit card payments: Unavailable
- Alternative payment methods: Working
- Existing subscriptions: Not affected

Our team is monitoring the situation. Updates every 30 minutes.
Next update: 3:30 PM EST
```
4. Mitigation (10-30 minutes)
Immediate actions:
- Enable circuit breaker (stop hammering the failing API)
- Activate fallback strategy
- Switch to alternative provider (if available)
- Queue requests for retry
Example incident checklist:
Stripe Outage Response:
☐ Confirm outage on status.stripe.com
☐ Enable circuit breaker in production
☐ Switch to Braintree fallback processor
☐ Update checkout UI: "Some payment methods temporarily unavailable"
☐ Monitor Braintree capacity (ensure it can handle extra load)
☐ Alert finance team (manual reconciliation may be needed)
☐ Post status update
5. Recovery (30min - 2hrs)
- Monitor for API recovery
- Gradually re-enable primary provider
- Process queued requests
- Verify normal operation
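"Gradually re-enable primary provider" can be done with a deterministic percentage rollout: hash a stable request id so the same request always routes the same way while you ramp the percentage up. A sketch:

```javascript
// Deterministic percentage rollout: route `percent` of traffic to the
// primary provider, the rest to the fallback, keyed by request id
function routeToPrimary(requestId, percent) {
  let hash = 0;
  for (const ch of String(requestId)) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0;
  }
  return hash % 100 < percent;
}
```

Ramp `percent` from 10 → 50 → 100 while watching error rates; the same request id never flip-flops between providers mid-ramp.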
Health check before full restore:
```bash
# Test several endpoints before switching back (paths are illustrative —
# use authenticated calls against real test-mode resources)
curl -fsS -u "$STRIPE_SECRET_KEY:" https://api.stripe.com/v1/charges/test
curl -fsS -u "$STRIPE_SECRET_KEY:" https://api.stripe.com/v1/customers/test
curl -fsS -u "$STRIPE_SECRET_KEY:" https://api.stripe.com/v1/subscriptions/test

# If all green:
./scripts/disable-circuit-breaker.sh stripe
./scripts/drain-payment-queue.sh
```
6. Post-Mortem (within 1 week)
Document what happened and how to prevent it next time.
Questions to answer:
- What was the root cause?
- How long did detection take?
- How long did mitigation take?
- What was the total impact? (users affected, revenue lost)
- What worked well?
- What didn't work?
- How do we prevent this in the future?
Action items:
- Improve monitoring (earlier detection)
- Add missing fallbacks
- Update runbooks
- Implement circuit breakers if not present
- Consider alternative providers
Real-World Case Studies
Case Study 1: E-Commerce Platform (Stripe Outage)
Scenario: Black Friday, 2PM EST. Stripe API starts returning 503 errors.
Timeline without proper monitoring:
- 2:00 PM: Stripe outage begins
- 2:15 PM: First customer complaint on support chat
- 2:30 PM: Support team escalates to engineering
- 2:45 PM: Engineers identify Stripe as root cause
- 3:00 PM: Manual failover to Braintree begins
- 3:30 PM: Payments restored
- Total downtime: 90 minutes
- Estimated loss: $180K in failed transactions
Timeline with comprehensive monitoring:
- 2:00 PM: Stripe outage begins
- 2:01 PM: API Status Check detects status page update + health checks fail
- 2:02 PM: Circuit breaker activates automatically
- 2:03 PM: Automated failover to Braintree
- 2:05 PM: Engineers acknowledge alert and monitor
- Total downtime: 5 minutes
- Estimated loss: $10K
Difference: $170K saved with proper monitoring + automation.
Case Study 2: SaaS Application (Auth0 Outage)
Scenario: Auth0 authentication service goes down during business hours.
Without fallback strategy:
- Existing users are logged out
- No one can log in
- Support flooded with "can't access my account" tickets
- Business impact: 2,000 users locked out for 45 minutes
With fallback strategy:
- Circuit breaker stops new login attempts after 3 failures
- Existing sessions remain valid (local session cache)
- Login page shows: "Authentication temporarily unavailable. If you're already logged in, you can continue working."
- Business impact: New logins unavailable, but 80% of users unaffected
Choosing the Right Tools
You don't need to build everything from scratch. Here's the modern API monitoring stack:
For Startups (Budget: $0-$200/mo)
Status Page Monitoring:
- API Status Check — Free, monitors 1,000+ services
- RSS feeds to Slack/Discord
Uptime Monitoring:
- UptimeRobot — Free tier: 50 monitors
- Cronitor — $10/mo for 25 monitors
Error Tracking:
- Sentry — Free tier: 5K events/month
Logging:
- BetterStack Logs — $10/mo for 1GB
- Papertrail — Free tier: 50MB/month
Total: ~$100/mo for comprehensive monitoring
For Scale-Ups (Budget: $500-$2K/mo)
Status Page Monitoring:
- API Status Check — Free centralized view
- Custom webhook integration to PagerDuty
Uptime + Synthetics:
- Better Stack — $89/mo: uptime + incident management
- Checkly — $399/mo: Playwright-based synthetic monitoring
APM + Error Tracking:
- Sentry — error tracking
- Datadog APM — distributed tracing
Incident Management:
- PagerDuty — $299+/mo: on-call schedules, escalations
Total: ~$2K/mo for enterprise-grade monitoring
For Enterprises (Budget: $5K+/mo)
- Datadog or New Relic for full observability
- PagerDuty for incident management
- Custom internal tools for business-specific monitoring
- Dedicated SRE team managing tooling
Monitoring Best Practices Checklist
Use this checklist to audit your current monitoring setup:
Detection
- Monitor official status pages for all critical dependencies
- Active health checks for all external APIs (every 1-5 minutes)
- Track error rates in application logs
- Alert on anomalies, not just absolute thresholds
- Test your monitoring (simulate outages quarterly)
Alerting
- Alerts include context (what, impact, severity)
- Escalation policy in place (Slack → PagerDuty → Phone)
- Alert fatigue is managed (< 5 alerts per week during normal ops)
- Alerts route to correct team (payments → payment team, auth → infra)
Response
- Runbooks exist for every critical dependency
- Runbooks include: detection steps, common causes, mitigation steps, rollback procedures
- On-call rotation defined
- Post-mortems are blameless and result in action items
Resilience
- Circuit breakers protect against cascading failures
- Retry logic with exponential backoff
- Request queuing for temporary failures
- Graceful degradation for non-critical features
- Failover to alternative providers (where possible)
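The retry-with-exponential-backoff item above can be sketched as follows; the delay calculation is split out so it is easy to test, and full jitter is added to avoid thundering-herd retries:

```javascript
// Exponential backoff delay with full jitter, capped at capMs
function backoffDelay(attempt, baseMs = 500, capMs = 30_000, random = Math.random) {
  const exp = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * exp);
}

// Retry an async operation, backing off between attempts
async function withRetry(fn, attempts = 5) {
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      if (i === attempts - 1) throw err;
      await new Promise(resolve => setTimeout(resolve, backoffDelay(i)));
    }
  }
}
```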
Testing
- Chaos engineering: regularly test failure scenarios
- Game day exercises: simulate major outages
- Load test failover paths (can Braintree handle 10x traffic?)
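A lightweight way to start with chaos testing: wrap an API client so a configurable fraction of calls fail artificially, then verify your circuit breakers and fallbacks engage. A sketch:

```javascript
// Chaos wrapper: make roughly `rate` of calls through `fn` fail
// artificially (rate = 0 never fails, rate = 1 always fails)
function withChaos(fn, rate, random = Math.random) {
  return async (...args) => {
    if (random() < rate) {
      throw new Error('chaos: injected failure');
    }
    return fn(...args);
  };
}
```

In a game-day exercise you might wrap your payment client with `withChaos(client.createCharge, 0.3)` in a staging environment and confirm the failover path actually carries the load.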
API Dependency Monitoring: Key Takeaways
- Layer your monitoring: Status pages, health checks, error rates, user impact
- Alert on impact, not symptoms: "Payments down" beats "Stripe returned 503"
- Build fallbacks before you need them: Circuit breakers, retries, alternative providers
- Speed matters: The difference between 5-minute and 2-hour MTTR is your monitoring setup
- Test your resilience: Regular chaos engineering prevents surprises
Next Steps
- Audit your dependencies: List every third-party API your application depends on
- Categorize by criticality:
- Critical: App doesn't work without it (auth, payments)
- Important: Major features affected (email, notifications)
- Nice-to-have: Minor features (analytics, social sharing)
- Implement monitoring:
- Start with status page monitoring (5 minutes setup)
- Add health checks for critical APIs
- Set up error rate alerts
- Build fallbacks:
- Start with circuit breakers
- Add retry logic with queuing
- Implement graceful degradation
- Document everything:
- Create runbooks for each dependency
- Define escalation policies
- Schedule quarterly failure testing
Monitor 1,000+ API Status Pages in One Place
API Status Check provides a centralized dashboard for tracking official status pages across:
- Cloud infrastructure: AWS, Google Cloud, Azure, DigitalOcean, Cloudflare
- Payments: Stripe, PayPal, Square, Braintree
- Authentication: Auth0, Okta, Firebase Auth
- Communication: Twilio, SendGrid, Mailgun
- Collaboration: Slack, Discord, Zoom, Notion
- Developer tools: GitHub, GitLab, Vercel, Supabase, Netlify
Start monitoring your API dependencies →
API Status Check
Stop checking API status pages manually
Get instant email alerts when OpenAI, Stripe, AWS, and 100+ APIs go down. Know before your users do.
Free dashboard available · 14-day trial on paid plans · Cancel anytime
Browse Free Dashboard →