Complete API Dependency Monitoring Strategy: From Detection to Recovery

by API Status Check Team

Modern applications depend on dozens of third-party APIs—payment processors, authentication services, cloud infrastructure, communication platforms. When these dependencies fail, your application fails.

Yet most engineering teams don't discover API outages until users report them. By then, the damage is done: transactions failed, users frustrated, revenue lost.

This guide covers everything you need to build a resilient API dependency monitoring strategy—from detection to recovery.

Why API Dependency Monitoring Matters

The hidden cost of API downtime:

  • Revenue impact: Stripe outage = no payments processed
  • User experience: Auth0 down = no one can log in
  • Cascading failures: One API failure breaks multiple features
  • Mean Time to Recovery (MTTR): The difference between 5-minute and 2-hour outages is usually detection speed

Real-world example: When AWS had a major outage in December 2025, companies with proactive monitoring pivoted to fallback regions within 10 minutes. Those relying on user reports took 2+ hours to respond.

The difference? A comprehensive dependency monitoring strategy.

The 4 Layers of API Dependency Monitoring

Effective monitoring requires multiple layers. Relying on just one creates blind spots.

Layer 1: Status Page Monitoring

What it is: Track official status pages from your API providers.

Why it matters:

  • First source of truth during outages
  • Often updated before your own monitoring detects issues
  • Provides context: "planned maintenance" vs "unexpected outage"

How to implement:

  1. Centralized dashboard: Use API Status Check to monitor 1,000+ status pages in one place
  2. RSS/Webhook alerts: Get notified the moment providers update their status
  3. Historical tracking: Understand which providers are most reliable
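Many providers host their status pages on Atlassian Statuspage, which exposes a machine-readable `/api/v2/status.json` endpoint with a `status.indicator` field (`none`, `minor`, `major`, `critical`). A minimal polling sketch, assuming your provider uses that format (the URL and `onChange` handler are yours to supply):

```javascript
// Classify a Statuspage-style payload by its `status.indicator` field.
function classifyStatus(payload) {
  const indicator = payload?.status?.indicator ?? 'unknown';
  if (indicator === 'none') return 'operational';
  if (indicator === 'minor') return 'degraded';
  if (indicator === 'major' || indicator === 'critical') return 'outage';
  return 'unknown';
}

// Poll a provider's status endpoint once a minute and notify on changes.
async function pollStatusPage(url, onChange) {
  let last = null;
  setInterval(async () => {
    try {
      const res = await fetch(url);
      const state = classifyStatus(await res.json());
      if (state !== last) {
        onChange(state); // e.g. post to Slack, page on-call
        last = state;
      }
    } catch {
      onChange('unknown'); // couldn't reach the status page itself
    }
  }, 60_000);
}
```

Pair this with RSS or webhook alerts so you are not dependent on polling frequency alone.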

Services to monitor:

See all 1,000+ monitored services →

Layer 2: Active Health Checks

What it is: Ping your dependencies at regular intervals to verify they're responsive.

Why it matters: Status pages aren't always updated immediately. Your health checks might detect issues first.

Implementation approaches:

Simple HTTP health checks:

# Basic uptime monitoring (the healthcheck URL is illustrative)
curl -fsS --max-time 10 https://api.stripe.com/healthcheck || alert_team

Synthetic transactions:

// Test critical user flows end to end
async function healthCheck() {
  // 1. Create a throwaway test user (unique email avoids collisions between runs)
  const user = await auth0.createUser({ email: `healthcheck+${Date.now()}@example.com` });

  try {
    // 2. Authenticate
    const token = await auth0.login(user);

    // 3. Make an API call with the fresh token
    const response = await api.getData(token);

    // 4. Verify the response
    if (!response.ok) {
      throw new Error('API health check failed');
    }
  } finally {
    // 5. Clean up so test users don't accumulate
    await auth0.deleteUser(user.id);
  }
}

What to monitor:

  • Response time (p50, p95, p99)
  • Error rates (4xx, 5xx)
  • Timeout rates
  • SSL certificate expiration
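Percentiles matter because averages hide tail latency. A quick nearest-rank sketch over a window of response-time samples (sample values are made up):

```javascript
// Nearest-rank percentile over a window of response-time samples (ms).
function percentile(samples, p) {
  if (samples.length === 0) return NaN;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

const latencies = [120, 95, 110, 3000, 130, 105, 98, 101, 99, 115];
const summary = {
  p50: percentile(latencies, 50),
  p95: percentile(latencies, 95),
  p99: percentile(latencies, 99),
};
// One slow outlier (3000 ms) barely moves p50 but dominates p95/p99 —
// which is why alerting on tail percentiles catches what averages hide.
```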

Layer 3: Error Rate Monitoring

What it is: Track the error rates in your application logs to detect API failures affecting real users.

Why it matters: Health checks test one scenario. Production traffic reveals edge cases.

Implementation:

// Instrument API calls with error tracking
async function callExternalAPI(endpoint: string, data: any) {
  const startTime = Date.now();
  
  try {
    const response = await fetch(endpoint, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(data),
      // fetch has no `timeout` option; use an abort signal instead
      signal: AbortSignal.timeout(10000)
    });
    
    // Track response time
    metrics.histogram('api.response_time', Date.now() - startTime, {
      service: 'stripe',
      endpoint
    });
    
    if (!response.ok) {
      // Track error by type
      metrics.increment('api.errors', {
        service: 'stripe',
        status: response.status,
        type: response.status >= 500 ? 'server' : 'client'
      });
      
      throw new Error(`API error: ${response.status}`);
    }
    
    return response.json();
    
  } catch (error) {
    // Track timeout/network errors (AbortSignal.timeout rejects with TimeoutError)
    if (error.name === 'TimeoutError' || error.name === 'AbortError') {
      metrics.increment('api.timeouts', { service: 'stripe' });
    } else {
      metrics.increment('api.network_errors', { service: 'stripe' });
    }
    
    throw error;
  }
}

Alert on anomalies:

  • Error rate > 5% over 5 minutes
  • Timeout rate > 2%
  • Response time p99 > 5 seconds
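The "error rate > 5% over 5 minutes" rule can be implemented as a sliding window over recent call outcomes. A minimal sketch (class name and defaults are illustrative):

```javascript
// Sliding-window error-rate alarm: fires when the failure fraction over
// the last `windowMs` exceeds `threshold`.
class ErrorRateAlarm {
  constructor({ threshold = 0.05, windowMs = 5 * 60_000 } = {}) {
    this.threshold = threshold;
    this.windowMs = windowMs;
    this.events = []; // { at, ok }
  }

  record(ok, at = Date.now()) {
    this.events.push({ at, ok });
    // Drop events that have aged out of the window
    const cutoff = at - this.windowMs;
    while (this.events.length && this.events[0].at < cutoff) this.events.shift();
  }

  shouldAlert(now = Date.now()) {
    const recent = this.events.filter(e => e.at >= now - this.windowMs);
    if (recent.length === 0) return false;
    const errors = recent.filter(e => !e.ok).length;
    return errors / recent.length > this.threshold;
  }
}
```

Wire `record()` into the instrumented `callExternalAPI` wrapper above and check `shouldAlert()` on a timer or per request.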

Layer 4: User Impact Monitoring

What it is: Track the business metrics that matter when APIs fail.

Why it matters: Alerts should reflect user impact, not just technical metrics.

What to track:

| Metric | Indicates | Alert threshold |
|---|---|---|
| Successful checkouts | Payment API health | < 95% success rate |
| Login success rate | Auth API health | < 98% success rate |
| Message delivery rate | Communication API health | < 99% delivery |
| API response time (user-facing) | Overall API performance | p95 > 3 seconds |

Example dashboard:

# Business impact dashboard
- widget: "Checkout Success Rate"
  query: "successful_checkouts / total_checkout_attempts"
  alert: "< 0.95 for 10 minutes"
  
- widget: "Failed Payment API Calls"
  query: "count(stripe_errors) / count(stripe_calls)"
  alert: "> 0.05 for 5 minutes"

Alerting Strategy: Signal vs Noise

Bad alerting = too many false positives = alert fatigue = missed real outages.

The golden rules:

1. Alert on Impact, Not Symptoms

Bad: "Stripe API returned 503"
Good: "Payment processing success rate dropped to 85% (threshold: 95%)"

2. Use Escalating Severity

  • Info: Single health check failure (might be a blip)
  • Warning: 3 consecutive failures OR error rate > 5%
  • Critical: Error rate > 10% OR user-facing impact confirmed
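The escalation rules above can be encoded as a small pure function, so every alert pipeline applies the same thresholds (the signal names are illustrative):

```javascript
// Map raw monitoring signals to the escalating severities described above:
// info for a single blip, warning at 3 consecutive failures or >5% errors,
// critical at >10% errors or confirmed user-facing impact.
function classifySeverity({ consecutiveFailures = 0, errorRate = 0, userImpactConfirmed = false }) {
  if (userImpactConfirmed || errorRate > 0.10) return 'critical';
  if (consecutiveFailures >= 3 || errorRate > 0.05) return 'warning';
  if (consecutiveFailures >= 1) return 'info';
  return 'ok';
}
```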

3. Group Related Alerts

If Stripe is down, you don't need 15 separate alerts for every affected endpoint. Group them:

"🚨 CRITICAL: Stripe API outage detected. Affected: /charges, /customers, /subscriptions. Impact: 0% successful payments."

4. Include Context

Every alert should answer:

  • What's failing? (Stripe Charges API)
  • How bad? (100% error rate for 5 minutes)
  • Impact? (No payments processing, $2K/min revenue loss)
  • What to do? (Check Stripe status page, enable fallback)

Alert template:

🚨 [CRITICAL] Stripe API Outage

Status: 100% error rate for 8 minutes
Impact: Payment processing completely down
Revenue loss: ~$16K (estimated)

Quick checks:
1. Stripe status: https://status.stripe.com
2. API Status Check: https://apistatuscheck.com/status/stripe
3. Enable manual payment fallback: /runbook/stripe-fallback

Runbook: https://docs.company.com/runbooks/stripe-outage

Building Fallback Strategies

Monitoring is only half the battle. You need fallback strategies for when APIs fail.

Pattern 1: Graceful Degradation

Concept: Disable non-critical features, keep core functionality working.

Example: Auth provider is down

  • ✅ Allow previously authenticated users to continue
  • ✅ Show "Login temporarily unavailable" message
  • ❌ Don't block entire application

async function authenticate(credentials) {
  try {
    return await auth0.login(credentials);
  } catch (error) {
    if (isAuth0Down(error)) {
      // Check local session cache
      const cachedSession = await sessionCache.get(credentials.email);
      if (cachedSession && !isExpired(cachedSession)) {
        logger.warn('Auth0 down, using cached session');
        return cachedSession;
      }
    }
    throw error;
  }
}

Pattern 2: Circuit Breaker

Concept: Stop calling a failing API to prevent cascading failures.

Implementation:

class CircuitBreaker {
  constructor(threshold = 5, timeout = 60000) {
    this.failureCount = 0;
    this.threshold = threshold;
    this.timeout = timeout;
    this.state = 'CLOSED'; // CLOSED, OPEN, HALF_OPEN
    this.nextAttempt = Date.now();
  }
  
  async call(fn) {
    if (this.state === 'OPEN') {
      if (Date.now() < this.nextAttempt) {
        throw new Error('Circuit breaker is OPEN');
      }
      this.state = 'HALF_OPEN';
    }
    
    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }
  
  onSuccess() {
    this.failureCount = 0;
    this.state = 'CLOSED';
  }
  
  onFailure() {
    this.failureCount++;
    if (this.failureCount >= this.threshold) {
      this.state = 'OPEN';
      this.nextAttempt = Date.now() + this.timeout;
      logger.error('Circuit breaker opened');
    }
  }
}

// Usage
const stripeBreaker = new CircuitBreaker();

async function createCharge(amount) {
  try {
    return await stripeBreaker.call(() => stripe.charges.create({ amount }));
  } catch (error) {
    // Fallback: Queue for later or use alternative processor
    await paymentQueue.add({ amount, provider: 'stripe' });
    throw new Error('Payment queued due to Stripe outage');
  }
}

Pattern 3: Failover to Alternative Provider

Concept: Switch to a backup provider when primary fails.

Example: Payment processing

async function processPayment(amount, paymentInfo) {
  const providers = [
    { name: 'stripe', client: stripeClient, breaker: stripeBreaker },
    { name: 'braintree', client: braintreeClient, breaker: braintreeBreaker }
  ];
  
  for (const provider of providers) {
    try {
      return await provider.breaker.call(() => 
        provider.client.createCharge(amount, paymentInfo)
      );
    } catch (error) {
      logger.warn(`${provider.name} failed, trying next provider`);
      continue;
    }
  }
  
  throw new Error('All payment providers failed');
}

Pattern 4: Request Queue + Retry

Concept: Queue failed requests and retry when the API recovers.

Implementation:

import Bull from 'bull';

const paymentQueue = new Bull('payments', {
  redis: { host: 'localhost', port: 6379 }
});

// Process queue
paymentQueue.process(async (job) => {
  const { amount, customerId } = job.data;
  
  // Check if Stripe is healthy
  const stripeStatus = await checkAPIStatus('stripe');
  if (!stripeStatus.operational) {
    throw new Error('Stripe still down, will retry');
  }
  
  // Process payment
  return await stripe.charges.create({ amount, customer: customerId });
});

// Add to queue on failure
async function processPaymentWithRetry(amount, customerId) {
  try {
    return await stripe.charges.create({ amount, customer: customerId });
  } catch (error) {
    if (isStripeOutage(error)) {
      await paymentQueue.add(
        { amount, customerId },
        {
          attempts: 10,
          backoff: { type: 'exponential', delay: 5000 }
        }
      );
      return { status: 'queued', message: 'Payment will be processed when Stripe recovers' };
    }
    throw error;
  }
}

Incident Response Runbook

When an API dependency fails, speed matters. A clear runbook reduces MTTR.

1. Detection (0-2 minutes)

Detection should come from your monitoring layers, not user reports: a status page alert, failing health checks, and a rising error rate all pointing at the same dependency are your cue to open an incident.

2. Assessment (2-5 minutes)

Determine scope:

  • Which API endpoints are affected?
  • What's the error rate? (5%? 50%? 100%?)
  • How many users are impacted?
  • What's the business impact? (revenue loss, support tickets)

Check monitoring dashboards:

Dashboard: API Dependencies
- Error rate by endpoint
- Failed transaction count
- Revenue impact estimate
- User-facing error rate

3. Communication (5-10 minutes)

Internal:

  • Update status page
  • Alert relevant teams (product, support, leadership)
  • Create incident Slack channel

External (if user-facing):

  • Update company status page
  • Post on social media (if major)
  • Prepare support team with messaging

Example status update:

🟡 Partial Outage: Payment Processing

We're experiencing issues with payment processing due to 
a third-party provider outage. 

Current impact:
- Credit card payments: Unavailable
- Alternative payment methods: Working
- Existing subscriptions: Not affected

Our team is monitoring the situation. Updates every 30 minutes.

Next update: 3:30 PM EST

4. Mitigation (10-30 minutes)

Immediate actions:

  • Enable circuit breaker (stop hammering the failing API)
  • Activate fallback strategy
  • Switch to alternative provider (if available)
  • Queue requests for retry

Example incident checklist:

Stripe Outage Response:
☐ Confirm outage on status.stripe.com
☐ Enable circuit breaker in production
☐ Switch to Braintree fallback processor
☐ Update checkout UI: "Some payment methods temporarily unavailable"
☐ Monitor Braintree capacity (ensure it can handle extra load)
☐ Alert finance team (manual reconciliation may be needed)
☐ Post status update

5. Recovery (30min - 2hrs)

  • Monitor for API recovery
  • Gradually re-enable primary provider
  • Process queued requests
  • Verify normal operation
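"Gradually re-enable" can mean ramping traffic back to the recovered primary in stages rather than flipping to 100% at once. A sketch, where the ramp schedule and the primary/fallback split are illustrative:

```javascript
// Ramp traffic back to the recovered primary provider in stages.
// Advance to the next stage only after the current one runs clean
// (e.g. 10 minutes without elevated errors).
function makeRouter(schedule = [0.05, 0.25, 0.5, 1.0]) {
  let stage = 0;
  return {
    // Route one request: true → primary (e.g. Stripe), false → fallback
    usePrimary(random = Math.random()) {
      return random < schedule[stage];
    },
    advance() {
      if (stage < schedule.length - 1) stage++;
      return schedule[stage];
    },
    fraction() {
      return schedule[stage];
    },
  };
}
```

If errors reappear at any stage, drop back to the fallback entirely and restart the ramp.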

Health check before full restore:

# Test multiple endpoints before switching back
# (Stripe authenticates curl via `-u <secret_key>:`; the IDs are illustrative)
curl -u "$STRIPE_SECRET_KEY:" https://api.stripe.com/v1/charges/test
curl -u "$STRIPE_SECRET_KEY:" https://api.stripe.com/v1/customers/test
curl -u "$STRIPE_SECRET_KEY:" https://api.stripe.com/v1/subscriptions/test

# If all green:
./scripts/disable-circuit-breaker.sh stripe
./scripts/drain-payment-queue.sh

6. Post-Mortem (within 1 week)

Document what happened and how to prevent it next time.

Questions to answer:

  • What was the root cause?
  • How long did detection take?
  • How long did mitigation take?
  • What was the total impact? (users affected, revenue lost)
  • What worked well?
  • What didn't work?
  • How do we prevent this in the future?

Action items:

  • Improve monitoring (earlier detection)
  • Add missing fallbacks
  • Update runbooks
  • Implement circuit breakers if not present
  • Consider alternative providers

Real-World Case Studies

Case Study 1: E-Commerce Platform (Stripe Outage)

Scenario: Black Friday, 2PM EST. Stripe API starts returning 503 errors.

Timeline without proper monitoring:

  • 2:00 PM: Stripe outage begins
  • 2:15 PM: First customer complaint on support chat
  • 2:30 PM: Support team escalates to engineering
  • 2:45 PM: Engineers identify Stripe as root cause
  • 3:00 PM: Manual failover to Braintree begins
  • 3:30 PM: Payments restored
  • Total downtime: 90 minutes
  • Estimated loss: $180K in failed transactions

Timeline with comprehensive monitoring:

  • 2:00 PM: Stripe outage begins
  • 2:01 PM: API Status Check detects status page update + health checks fail
  • 2:02 PM: Circuit breaker activates automatically
  • 2:03 PM: Automated failover to Braintree
  • 2:05 PM: Engineers acknowledge alert and monitor
  • Total downtime: 5 minutes
  • Estimated loss: $10K

Difference: $170K saved with proper monitoring + automation.

Case Study 2: SaaS Application (Auth0 Outage)

Scenario: Auth0 authentication service goes down during business hours.

Without fallback strategy:

  • Existing users are logged out
  • No one can log in
  • Support flooded with "can't access my account" tickets
  • Business impact: 2,000 users locked out for 45 minutes

With fallback strategy:

  • Circuit breaker stops new login attempts after 3 failures
  • Existing sessions remain valid (local session cache)
  • Login page shows: "Authentication temporarily unavailable. If you're already logged in, you can continue working."
  • Business impact: New logins unavailable, but 80% of users unaffected

Choosing the Right Tools

You don't need to build everything from scratch. Here's the modern API monitoring stack:

For Startups (Budget: $0-$200/mo)

Status Page Monitoring:

  • API Status Check — Free, monitors 1,000+ services
  • RSS feeds to Slack/Discord

Uptime Monitoring:

Error Tracking:

  • Sentry — Free tier: 5K events/month

Logging:

Total: ~$100/mo for comprehensive monitoring

For Scale-Ups (Budget: $500-$2K/mo)

Status Page Monitoring:

  • API Status Check — Free centralized view
  • Custom webhook integration to PagerDuty

Uptime + Synthetics:

  • Better Stack — $89/mo: uptime + incident management
  • Checkly — $399/mo: Playwright-based synthetic monitoring

APM + Error Tracking:

  • Datadog — $1,000+/mo: full observability stack
  • New Relic — $700+/mo: APM + infrastructure

Incident Management:

  • PagerDuty — $299+/mo: on-call schedules, escalations

Total: ~$2K/mo for enterprise-grade monitoring

For Enterprises (Budget: $5K+/mo)

  • Datadog or New Relic for full observability
  • PagerDuty for incident management
  • Custom internal tools for business-specific monitoring
  • Dedicated SRE team managing tooling

Monitoring Best Practices Checklist

Use this checklist to audit your current monitoring setup:

Detection

  • Monitor official status pages for all critical dependencies
  • Active health checks for all external APIs (every 1-5 minutes)
  • Track error rates in application logs
  • Alert on anomalies, not just absolute thresholds
  • Test your monitoring (simulate outages quarterly)

Alerting

  • Alerts include context (what, impact, severity)
  • Escalation policy in place (Slack → PagerDuty → Phone)
  • Alert fatigue is managed (< 5 alerts per week during normal ops)
  • Alerts route to correct team (payments → payment team, auth → infra)

Response

  • Runbooks exist for every critical dependency
  • Runbooks include: detection steps, common causes, mitigation steps, rollback procedures
  • On-call rotation defined
  • Post-mortems are blameless and result in action items

Resilience

  • Circuit breakers protect against cascading failures
  • Retry logic with exponential backoff
  • Request queuing for temporary failures
  • Graceful degradation for non-critical features
  • Failover to alternative providers (where possible)
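"Retry logic with exponential backoff" deserves jitter: if every client retries on the same schedule, they stampede the API the moment it recovers. A minimal full-jitter sketch (defaults are illustrative):

```javascript
// Exponential backoff with full jitter: the delay ceiling doubles each
// attempt, capped, and the actual delay is uniform in [0, ceiling).
function backoffDelay(attempt, { baseMs = 500, capMs = 30_000, random = Math.random() } = {}) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random * ceiling);
}

// Retry a call up to `attempts` times, sleeping between failures.
async function withRetries(fn, { attempts = 5, ...opts } = {}) {
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      if (i === attempts - 1) throw err;
      await new Promise(r => setTimeout(r, backoffDelay(i, opts)));
    }
  }
}
```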

Testing

  • Chaos engineering: regularly test failure scenarios
  • Game day exercises: simulate major outages
  • Load test failover paths (can Braintree handle 10x traffic?)

API Dependency Monitoring: Key Takeaways

  1. Layer your monitoring: Status pages, health checks, error rates, user impact
  2. Alert on impact, not symptoms: "Payments down" beats "Stripe returned 503"
  3. Build fallbacks before you need them: Circuit breakers, retries, alternative providers
  4. Speed matters: The difference between 5-minute and 2-hour MTTR is your monitoring setup
  5. Test your resilience: Regular chaos engineering prevents surprises

Next Steps

  1. Audit your dependencies: List every third-party API your application depends on
  2. Categorize by criticality:
    • Critical: App doesn't work without it (auth, payments)
    • Important: Major features affected (email, notifications)
    • Nice-to-have: Minor features (analytics, social sharing)
  3. Implement monitoring:
    • Start with status page monitoring (5 minutes setup)
    • Add health checks for critical APIs
    • Set up error rate alerts
  4. Build fallbacks:
    • Start with circuit breakers
    • Add retry logic with queuing
    • Implement graceful degradation
  5. Document everything:
    • Create runbooks for each dependency
    • Define escalation policies
    • Schedule quarterly failure testing

Monitor 1,000+ API Status Pages in One Place

API Status Check provides a centralized dashboard for tracking official status pages across:

  • Cloud infrastructure: AWS, Google Cloud, Azure, DigitalOcean, Cloudflare
  • Payments: Stripe, PayPal, Square, Braintree
  • Authentication: Auth0, Okta, Firebase Auth
  • Communication: Twilio, SendGrid, Mailgun
  • Collaboration: Slack, Discord, Zoom, Notion
  • Developer tools: GitHub, GitLab, Vercel, Supabase, Netlify

Start monitoring your API dependencies →


API Status Check

Stop checking API status pages manually

Get instant email alerts when OpenAI, Stripe, AWS, and 100+ APIs go down. Know before your users do.

Get Alerts — $9/mo →

Free dashboard available · 14-day trial on paid plans · Cancel anytime

Browse Free Dashboard →