Circuit Breaker Pattern: Build Resilient APIs That Handle Failures Gracefully

β€’28 min read

When your API depends on external services like Stripe, AWS, or Twilio, failures are inevitable. The circuit breaker pattern prevents cascade failures by automatically detecting when a service is down and temporarily stopping requests. This guide covers everything from basic concepts to production-ready TypeScript implementations.

What Is a Circuit Breaker?

The circuit breaker pattern is a stability pattern that detects when a dependency is failing and stops those failures from cascading through a distributed system. It's inspired by electrical circuit breakers, which protect circuits from damage caused by excess current.

In software, a circuit breaker wraps API calls to external services and monitors for failures. When failures reach a threshold, the circuit "opens" and subsequent calls fail immediately without hitting the failing service. After a timeout, the circuit enters a "half-open" state to test if the service has recovered.

πŸ’‘ The Electrical Analogy

Just like an electrical breaker:

  • Closed (Normal): Current flows, requests succeed
  • Open (Tripped): Current stops, requests fail fast
  • Half-Open (Testing): Limited current flows to test if it's safe to reset

How It Works

// Pseudocode
if (circuitState === CLOSED) {
  try {
    result = await externalServiceCall()
    recordSuccess()
    return result
  } catch (error) {
    recordFailure()
    if (failureThresholdReached()) {
      circuitState = OPEN
      startResetTimer()
    }
    throw error
  }
} else if (circuitState === OPEN) {
  throw new CircuitBreakerOpenError("Service unavailable")
} else if (circuitState === HALF_OPEN) {
  try {
    result = await externalServiceCall()
    circuitState = CLOSED  // Service recovered!
    return result
  } catch (error) {
    circuitState = OPEN  // Still broken
    startResetTimer()
    throw error
  }
}

Why Circuit Breakers Matter

Without circuit breakers, your application suffers when dependencies fail:

The Cascade Failure Problem

⚠️ Real-World Example: Black Friday Disaster

An e-commerce site depends on Stripe for payments. During Black Friday, Stripe experiences a 2-minute outage. Without circuit breakers:

  1. Every checkout request waits 30 seconds for Stripe to timeout
  2. Request threads pile up (100, 200, 500...)
  3. Memory exhaustion crashes the payment service
  4. Now the entire site is down, not just payments
  5. Recovery takes 15 minutes even after Stripe comes back

Total impact: 17 minutes of complete downtime instead of 2 minutes of graceful degradation.
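The fail-fast effect is easy to see in isolation. The standalone sketch below (plain TypeScript, no library; timings scaled down from 30 seconds to 300 ms so it runs quickly) compares waiting out a timeout against rejecting immediately, as an open circuit would:

```typescript
// Toy demonstration: timings scaled down (30 s timeout -> 300 ms)
const TIMEOUT_MS = 300

// A dependency that is down: it never responds, so callers wait out the timeout
function callDeadService(): Promise<string> {
  return new Promise((_, reject) =>
    setTimeout(() => reject(new Error('timeout')), TIMEOUT_MS)
  )
}

// Fail fast: an open circuit rejects without touching the network
function failFast(): Promise<string> {
  return Promise.reject(new Error('circuit open'))
}

// Measure how long a call occupies the caller before failing
async function measure(fn: () => Promise<string>): Promise<number> {
  const start = Date.now()
  try { await fn() } catch { /* expected */ }
  return Date.now() - start
}

async function main() {
  const blocked = await measure(callDeadService)  // ~TIMEOUT_MS
  const fast = await measure(failFast)            // ~0 ms
  console.log({ blocked, fast })
}
main()
```

Multiply the blocked duration by hundreds of concurrent requests and the thread pile-up in the scenario above follows directly.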

Benefits of Circuit Breakers

βœ… Fail Fast

Return errors in milliseconds instead of waiting for timeouts (30+ seconds).

βœ… Prevent Cascade Failures

Stop one failing service from bringing down your entire system.

βœ… Automatic Recovery

Detect when services recover and resume normal operations automatically.

βœ… Better User Experience

Show meaningful errors immediately instead of long loading spinners.

βœ… Resource Protection

Prevent thread pool exhaustion, memory leaks, and connection timeouts.

βœ… Observability

Clear metrics on which dependencies are failing and when.

When to Use Circuit Breakers

Circuit breakers are essential for:

  • External API calls: Payment gateways (Stripe, PayPal), auth providers (Auth0, Okta), cloud services (AWS, Azure)
  • Database connections: Especially when using read replicas or distributed databases
  • Microservices communication: Protect downstream services from upstream failures
  • Third-party integrations: CRMs (Salesforce, HubSpot), monitoring tools (Datadog, New Relic)
  • CDN and storage: Image processing, file uploads, content delivery

The Three States: Closed, Open, Half-Open

A circuit breaker has three states, each with different behavior:

🟒 CLOSED (Normal Operation)

  • Behavior: All requests pass through to the service
  • Monitoring: Track success/failure rate
  • Transition: Opens when failure threshold is reached (e.g., 5 failures in 10 seconds)
// Example: 50% failure rate over 10 requests triggers OPEN
if (failures / totalRequests >= 0.5 && totalRequests >= 10) {
  circuitState = OPEN
}

πŸ”΄ OPEN (Failing Fast)

  • Behavior: All requests fail immediately without calling the service
  • Duration: Stays open for a reset timeout (e.g., 30 seconds)
  • Transition: Enters HALF_OPEN after timeout expires
// Fail fast - no network call
if (circuitState === OPEN) {
  throw new CircuitBreakerOpenError(
    'Stripe API circuit breaker is OPEN. Service unavailable.'
  )
}

🟑 HALF_OPEN (Testing Recovery)

  • Behavior: Allow a limited number of test requests (e.g., 1-3)
  • Success: If test requests succeed, transition to CLOSED
  • Failure: If test requests fail, transition back to OPEN
// Allow 1 test request
if (circuitState === HALF_OPEN) {
  try {
    const result = await service.call()
    circuitState = CLOSED  // Success! Service recovered
    return result
  } catch (error) {
    circuitState = OPEN  // Still broken
    startResetTimer()
    throw error
  }
}

πŸ“Š State Transition Diagram


    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚   CLOSED    β”‚  ← Normal operation
    β”‚ (requests   β”‚
    β”‚  pass thru) β”‚
    β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
           β”‚
           β”‚ Failure threshold reached
           β”‚ (e.g., 50% fail rate)
           β–Ό
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚    OPEN     β”‚  ← Failing fast
    β”‚  (reject    β”‚
    β”‚   all req)  β”‚
    β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
           β”‚
           β”‚ Reset timeout expires
           β”‚ (e.g., 30 seconds)
           β–Ό
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚ HALF_OPEN   β”‚  ← Testing recovery
    β”‚ (allow test β”‚
    β”‚  requests)  β”‚
    β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
           β”‚
           β”œβ”€Successβ†’ CLOSED
           β”‚
           └─Failureβ†’ OPEN

Building a Circuit Breaker from Scratch

Let's build a production-ready circuit breaker in TypeScript to understand exactly how it works.

Step 1: Define the Core Types

// circuit-breaker.ts
export enum CircuitState {
  CLOSED = 'CLOSED',
  OPEN = 'OPEN',
  HALF_OPEN = 'HALF_OPEN',
}

export interface CircuitBreakerConfig {
  failureThreshold: number       // % failures to trigger OPEN (0-1)
  resetTimeout: number            // ms to wait before HALF_OPEN
  rollingWindowSize: number       // number of requests to track
  halfOpenMaxAttempts: number     // test requests in HALF_OPEN
}

export class CircuitBreakerOpenError extends Error {
  constructor(message: string) {
    super(message)
    this.name = 'CircuitBreakerOpenError'
  }
}

interface RequestRecord {
  timestamp: number
  success: boolean
}

Step 2: Implement the Circuit Breaker

export class CircuitBreaker<T> {
  private state: CircuitState = CircuitState.CLOSED
  private requestHistory: RequestRecord[] = []
  private openedAt: number | null = null
  private halfOpenAttempts = 0

  constructor(
    private serviceName: string,
    private config: CircuitBreakerConfig,
    private action: () => Promise<T>
  ) {}

  async execute(): Promise<T> {
    // Check if circuit should transition to HALF_OPEN
    if (this.state === CircuitState.OPEN) {
      if (this.shouldAttemptReset()) {
        this.state = CircuitState.HALF_OPEN
        this.halfOpenAttempts = 0
        console.log(`[${this.serviceName}] Circuit HALF_OPEN - testing recovery`)
      } else {
        throw new CircuitBreakerOpenError(
          `Circuit breaker is OPEN for ${this.serviceName}`
        )
      }
    }

    // In HALF_OPEN, limit test requests
    if (this.state === CircuitState.HALF_OPEN) {
      if (this.halfOpenAttempts >= this.config.halfOpenMaxAttempts) {
        throw new CircuitBreakerOpenError(
          `Circuit breaker is HALF_OPEN for ${this.serviceName}, max test attempts reached`
        )
      }
      this.halfOpenAttempts++
    }

    // Execute the action
    try {
      const result = await this.action()
      this.onSuccess()
      return result
    } catch (error) {
      this.onFailure()
      throw error
    }
  }

  private onSuccess(): void {
    this.recordRequest(true)

    if (this.state === CircuitState.HALF_OPEN) {
      // Success in HALF_OPEN β†’ reset to CLOSED
      this.state = CircuitState.CLOSED
      this.requestHistory = []
      console.log(`[${this.serviceName}] Circuit CLOSED - service recovered`)
    }
  }

  private onFailure(): void {
    this.recordRequest(false)

    if (this.state === CircuitState.HALF_OPEN) {
      // Failure in HALF_OPEN β†’ back to OPEN
      this.openCircuit()
    } else if (this.state === CircuitState.CLOSED) {
      // Check if we should open the circuit
      if (this.shouldOpenCircuit()) {
        this.openCircuit()
      }
    }
  }

  private recordRequest(success: boolean): void {
    this.requestHistory.push({
      timestamp: Date.now(),
      success,
    })

    // Keep only recent requests (sliding window)
    if (this.requestHistory.length > this.config.rollingWindowSize) {
      this.requestHistory.shift()
    }
  }

  private shouldOpenCircuit(): boolean {
    if (this.requestHistory.length < this.config.rollingWindowSize) {
      return false // Not enough data
    }

    const failures = this.requestHistory.filter(r => !r.success).length
    const failureRate = failures / this.requestHistory.length

    return failureRate >= this.config.failureThreshold
  }

  private openCircuit(): void {
    this.state = CircuitState.OPEN
    this.openedAt = Date.now()
    this.requestHistory = []
    console.log(`[${this.serviceName}] Circuit OPEN - service unavailable`)
  }

  private shouldAttemptReset(): boolean {
    if (!this.openedAt) return false
    return Date.now() - this.openedAt >= this.config.resetTimeout
  }

  getState(): CircuitState {
    return this.state
  }

  getMetrics() {
    const failures = this.requestHistory.filter(r => !r.success).length
    const successes = this.requestHistory.length - failures

    return {
      state: this.state,
      totalRequests: this.requestHistory.length,
      failures,
      successes,
      failureRate: this.requestHistory.length > 0 
        ? failures / this.requestHistory.length 
        : 0,
      openedAt: this.openedAt,
    }
  }
}

Step 3: Usage Example with Stripe API

import Stripe from 'stripe'

const stripe = new Stripe(process.env.STRIPE_SECRET_KEY!)

// Create circuit breaker for Stripe payment intents
const stripeCircuit = new CircuitBreaker(
  'Stripe Payment Intent',
  {
    failureThreshold: 0.5,      // Open after 50% failure rate
    resetTimeout: 30000,         // Try recovery after 30 seconds
    rollingWindowSize: 10,       // Track last 10 requests
    halfOpenMaxAttempts: 2,      // Allow 2 test requests
  },
  async () => {
    return stripe.paymentIntents.create({
      amount: 2000,
      currency: 'usd',
    })
  }
)

// In your Express route
app.post('/api/checkout', async (req, res) => {
  try {
    const paymentIntent = await stripeCircuit.execute()
    res.json({ clientSecret: paymentIntent.client_secret })
  } catch (error) {
    if (error instanceof CircuitBreakerOpenError) {
      // Circuit is open - fail gracefully
      return res.status(503).json({
        error: 'Payment service temporarily unavailable. Please try again in a few moments.',
        retryAfter: 30
      })
    }
    
    // Other errors (validation, network, etc.)
    res.status(500).json({ error: 'Payment failed' })
  }
})

Production-Ready Implementation with Opossum

For production systems, use a battle-tested library like Opossum, the most widely adopted circuit breaker for Node.js, with extensive features and active maintenance.

Installation

npm install opossum
npm install @types/opossum --save-dev

Basic Setup with Type Safety

import CircuitBreaker from 'opossum'
import axios from 'axios'

// Define your service function
async function fetchUserData(userId: string) {
  const response = await axios.get(`https://api.example.com/users/${userId}`)
  return response.data
}

// Wrap it with a circuit breaker
const breaker = new CircuitBreaker(fetchUserData, {
  timeout: 5000,              // Request timeout (ms)
  errorThresholdPercentage: 50, // Open at 50% error rate
  resetTimeout: 30000,        // Try recovery after 30s
  rollingCountTimeout: 10000, // 10s rolling window
  rollingCountBuckets: 10,    // 10 buckets = 1s each
  volumeThreshold: 10,        // Min requests before opening
})

// Use it
breaker.fire('user-123')
  .then(data => console.log('User:', data))
  .catch(err => console.error('Circuit breaker error:', err))

Advanced: Multiple Service Circuit Breakers

// services/circuit-breakers.ts
import CircuitBreaker from 'opossum'
import Stripe from 'stripe'
import { twilioClient } from './twilio'
import { auth0Client } from './auth0'

const stripe = new Stripe(process.env.STRIPE_SECRET_KEY!)

// Stripe circuit breaker
export const stripeBreaker = new CircuitBreaker(
  async (amount: number, currency: string) => {
    return stripe.paymentIntents.create({ amount, currency })
  },
  {
    timeout: 10000,
    errorThresholdPercentage: 40,
    resetTimeout: 60000,        // Longer timeout for payment provider
    volumeThreshold: 5,
  }
)

// Twilio SMS circuit breaker
export const twilioBreaker = new CircuitBreaker(
  async (to: string, message: string) => {
    return twilioClient.messages.create({
      body: message,
      to,
      from: process.env.TWILIO_PHONE_NUMBER!,
    })
  },
  {
    timeout: 5000,
    errorThresholdPercentage: 50,
    resetTimeout: 30000,
    volumeThreshold: 10,
  }
)

// Auth0 user creation circuit breaker
export const auth0Breaker = new CircuitBreaker(
  async (email: string, password: string) => {
    return auth0Client.createUser({
      connection: 'Username-Password-Authentication',
      email,
      password,
    })
  },
  {
    timeout: 8000,
    errorThresholdPercentage: 30,  // Lower threshold for auth
    resetTimeout: 45000,
    volumeThreshold: 5,
  }
)

// Setup logging for all breakers
const breakers = { stripeBreaker, twilioBreaker, auth0Breaker }

Object.entries(breakers).forEach(([name, breaker]) => {
  breaker.on('open', () => console.error(`[${name}] Circuit OPEN`))
  breaker.on('halfOpen', () => console.warn(`[${name}] Circuit HALF_OPEN`))
  breaker.on('close', () => console.log(`[${name}] Circuit CLOSED`))
  breaker.on('fallback', (result) => console.log(`[${name}] Fallback used`, result))
})

Fallback Strategies

Circuit breakers work best with fallback functions that provide degraded functionality when services fail:

import CircuitBreaker from 'opossum'

// Primary: Stripe payment
async function createPaymentIntent(amount: number) {
  const stripe = new Stripe(process.env.STRIPE_SECRET_KEY!)
  return stripe.paymentIntents.create({ amount, currency: 'usd' })
}

// Fallback: Queue payment for later processing
async function queuePaymentForLater(amount: number) {
  await redis.lpush('pending-payments', JSON.stringify({
    amount,
    timestamp: Date.now(),
    status: 'queued',
  }))
  
  return {
    id: `queued-${Date.now()}`,
    status: 'queued',
    message: 'Payment queued for processing',
  }
}

const paymentBreaker = new CircuitBreaker(createPaymentIntent, {
  timeout: 10000,
  errorThresholdPercentage: 50,
  resetTimeout: 60000,
})

// Set fallback
paymentBreaker.fallback(queuePaymentForLater)

// Usage
app.post('/api/checkout', async (req, res) => {
  try {
    const result = await paymentBreaker.fire(req.body.amount)
    
    if (result.status === 'queued') {
      // Fallback was used
      return res.status(202).json({
        message: 'Payment processing is temporarily delayed. You will receive confirmation via email.',
        paymentId: result.id,
      })
    }
    
    // Normal Stripe response
    res.json({ clientSecret: result.client_secret })
  } catch (error) {
    res.status(500).json({ error: 'Payment failed' })
  }
})

Health Check Integration

import { stripeBreaker, twilioBreaker, auth0Breaker } from './circuit-breakers'

// Health check endpoint
app.get('/health', (req, res) => {
  const health = {
    status: 'healthy',
    timestamp: new Date().toISOString(),
    services: {
      stripe: {
        state: stripeBreaker.opened ? 'OPEN' : 
               stripeBreaker.halfOpen ? 'HALF_OPEN' : 'CLOSED',
        stats: stripeBreaker.stats,
      },
      twilio: {
        state: twilioBreaker.opened ? 'OPEN' : 
               twilioBreaker.halfOpen ? 'HALF_OPEN' : 'CLOSED',
        stats: twilioBreaker.stats,
      },
      auth0: {
        state: auth0Breaker.opened ? 'OPEN' : 
               auth0Breaker.halfOpen ? 'HALF_OPEN' : 'CLOSED',
        stats: auth0Breaker.stats,
      },
    },
  }

  // Overall status is unhealthy if any critical service circuit is open
  const criticalServicesOpen = stripeBreaker.opened || auth0Breaker.opened
  
  if (criticalServicesOpen) {
    health.status = 'degraded'
    return res.status(503).json(health)
  }

  res.json(health)
})

Advanced Patterns

Pattern 1: Per-Endpoint Circuit Breakers

Different API endpoints have different reliability characteristics. Use separate circuit breakers:

const githubBreakers = {
  // Read operations - can tolerate more failures
  repos: new CircuitBreaker(
    (owner: string, repo: string) => octokit.repos.get({ owner, repo }),
    { errorThresholdPercentage: 60, timeout: 3000 }
  ),
  
  // Write operations - stricter failure threshold
  createIssue: new CircuitBreaker(
    (owner: string, repo: string, title: string) => 
      octokit.issues.create({ owner, repo, title }),
    { errorThresholdPercentage: 30, timeout: 5000 }
  ),
}

Pattern 2: Bulkhead Pattern (Isolation)

Prevent one slow service from exhausting all resources by limiting concurrent requests:

import pLimit from 'p-limit'

// Limit Stripe to max 5 concurrent requests
const stripeLimit = pLimit(5)

const stripeBreaker = new CircuitBreaker(
  async (amount: number) => {
    return stripeLimit(() => 
      stripe.paymentIntents.create({ amount, currency: 'usd' })
    )
  },
  { timeout: 10000 }
)

// Even if Stripe is slow, only 5 requests block at once
// Other services (Twilio, Auth0) remain unaffected

Pattern 3: Retry with Exponential Backoff

Combine circuit breakers with intelligent retries:

import retry from 'async-retry'

const resilientStripeCall = new CircuitBreaker(
  async (amount: number) => {
    return retry(
      async (bail) => {
        try {
          return await stripe.paymentIntents.create({ amount, currency: 'usd' })
        } catch (error: any) {
          // Don't retry 4xx errors (client errors)
          if (error.statusCode >= 400 && error.statusCode < 500) {
            bail(error)
            return
          }
          throw error // Retry 5xx and network errors
        }
      },
      {
        retries: 3,
        minTimeout: 1000,
        maxTimeout: 5000,
        factor: 2,  // Exponential backoff
      }
    )
  },
  { timeout: 15000 }  // Account for retries in timeout
)

// This gives you:
// 1. Retries for transient failures (network blips, 503s)
// 2. Circuit breaker for sustained outages
// 3. No retries for client errors (400, 401, etc.)

Pattern 4: Cache-Aside with Circuit Breakers

Use cached data when the circuit is open:

import NodeCache from 'node-cache'

const cache = new NodeCache({ stdTTL: 300 }) // 5 min cache

const userServiceBreaker = new CircuitBreaker(
  async (userId: string) => {
    const response = await axios.get(`https://api.example.com/users/${userId}`)
    
    // Cache successful response
    cache.set(`user:${userId}`, response.data)
    
    return response.data
  },
  { timeout: 5000 }
)

// Fallback to cache when circuit is open
userServiceBreaker.fallback(async (userId: string) => {
  const cached = cache.get(`user:${userId}`)
  
  if (cached) {
    console.log(`Serving stale data for user ${userId}`)
    return { ...cached, stale: true }
  }
  
  throw new Error('No cached data available')
})

// Usage
const userData = await userServiceBreaker.fire('user-123')
if (userData.stale) {
  // Show UI indicator that data may be outdated
}

Monitoring Circuit Breaker Health

Circuit breakers are only effective if you monitor their behavior and alert on problems.

Metrics to Track

State Changes

  • β€’ Frequency of OPEN state
  • β€’ Time spent in OPEN state
  • β€’ HALF_OPEN β†’ OPEN transitions (failed recovery)
  • β€’ HALF_OPEN β†’ CLOSED transitions (successful recovery)

Request Stats

  • β€’ Total requests
  • β€’ Success/failure rate
  • β€’ Rejected requests (circuit OPEN)
  • β€’ Fallback invocations

Timing

  • β€’ Response time (p50, p95, p99)
  • β€’ Timeout frequency
  • β€’ Time to recovery

Business Impact

  • β€’ Failed payments (Stripe OPEN)
  • β€’ Missed notifications (Twilio OPEN)
  • β€’ Failed logins (Auth0 OPEN)

Prometheus Metrics Integration

import { Counter, Gauge, Histogram, register } from 'prom-client'

// Define metrics
const circuitStateGauge = new Gauge({
  name: 'circuit_breaker_state',
  help: 'Current state of circuit breaker (0=CLOSED, 1=HALF_OPEN, 2=OPEN)',
  labelNames: ['service'],
})

const circuitRequestsTotal = new Counter({
  name: 'circuit_breaker_requests_total',
  help: 'Total requests through circuit breaker',
  labelNames: ['service', 'state', 'result'],
})

const circuitLatency = new Histogram({
  name: 'circuit_breaker_latency_seconds',
  help: 'Circuit breaker request latency',
  labelNames: ['service'],
  buckets: [0.1, 0.5, 1, 2, 5, 10],
})

// Instrument circuit breaker
function instrumentCircuitBreaker(breaker: CircuitBreaker, serviceName: string) {
  breaker.on('success', (result, latency) => {
    circuitRequestsTotal.inc({ service: serviceName, state: 'closed', result: 'success' })
    circuitLatency.observe({ service: serviceName }, latency / 1000)
  })

  breaker.on('failure', (error) => {
    circuitRequestsTotal.inc({ 
      service: serviceName, 
      state: breaker.opened ? 'open' : 'closed', 
      result: 'failure' 
    })
  })

  breaker.on('reject', () => {
    circuitRequestsTotal.inc({ service: serviceName, state: 'open', result: 'rejected' })
  })

  breaker.on('open', () => {
    circuitStateGauge.set({ service: serviceName }, 2)
  })

  breaker.on('halfOpen', () => {
    circuitStateGauge.set({ service: serviceName }, 1)
  })

  breaker.on('close', () => {
    circuitStateGauge.set({ service: serviceName }, 0)
  })
}

// Use it
instrumentCircuitBreaker(stripeBreaker, 'stripe')
instrumentCircuitBreaker(twilioBreaker, 'twilio')

// Metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType)
  res.end(await register.metrics())
})

Alerting Rules (Prometheus + Alertmanager)

# Circuit breaker alert rules
groups:
  - name: circuit_breaker
    rules:
      # Alert if circuit is OPEN for more than 5 minutes
      - alert: CircuitBreakerOpenTooLong
        expr: circuit_breaker_state == 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Circuit breaker {{ $labels.service }} is OPEN"
          description: "{{ $labels.service }} circuit has been OPEN for 5+ minutes"

      # Alert if failure rate > 20% over 5 minutes
      - alert: HighCircuitBreakerFailureRate
        expr: |
          (
            sum by (service) (rate(circuit_breaker_requests_total{result="failure"}[5m]))
            /
            sum by (service) (rate(circuit_breaker_requests_total[5m]))
          ) > 0.2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High failure rate for {{ $labels.service }}"
          description: "{{ $labels.service }} has {{ $value | humanizePercentage }} failure rate"

      # Alert if critical service (Stripe, Auth0) circuit opens
      - alert: CriticalServiceCircuitOpen
        expr: circuit_breaker_state{service=~"stripe|auth0"} == 2
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "CRITICAL: {{ $labels.service }} circuit is OPEN"
          description: "Revenue-critical service {{ $labels.service }} is unavailable"

Datadog Integration

import { StatsD } from 'hot-shots'

const dogstatsd = new StatsD({
  host: 'localhost',
  port: 8125,
  prefix: 'circuit_breaker.',
})

function instrumentWithDatadog(breaker: CircuitBreaker, serviceName: string) {
  breaker.on('success', (result, latency) => {
    dogstatsd.increment(`${serviceName}.success`)
    dogstatsd.timing(`${serviceName}.latency`, latency)
  })

  breaker.on('failure', () => {
    dogstatsd.increment(`${serviceName}.failure`)
  })

  breaker.on('open', () => {
    dogstatsd.gauge(`${serviceName}.state`, 2)
    dogstatsd.event(
      `Circuit Open: ${serviceName}`,
      `Circuit breaker for ${serviceName} has opened`,
      { alert_type: 'error' }
    )
  })

  breaker.on('close', () => {
    dogstatsd.gauge(`${serviceName}.state`, 0)
    dogstatsd.event(
      `Circuit Closed: ${serviceName}`,
      `Circuit breaker for ${serviceName} has recovered`,
      { alert_type: 'success' }
    )
  })
}

instrumentWithDatadog(stripeBreaker, 'stripe')
instrumentWithDatadog(twilioBreaker, 'twilio')

Testing Circuit Breakers

Circuit breakers must be tested to ensure they behave correctly under failure conditions.

Unit Tests with Jest

import CircuitBreaker from 'opossum'

describe('Circuit Breaker', () => {
  let callCount = 0
  let shouldFail = false

  const flakeyService = async () => {
    callCount++
    if (shouldFail) {
      throw new Error('Service failure')
    }
    return { success: true }
  }

  beforeEach(() => {
    callCount = 0
    shouldFail = false
  })

  test('allows requests in CLOSED state', async () => {
    const breaker = new CircuitBreaker(flakeyService, {
      errorThresholdPercentage: 50,
      resetTimeout: 1000,
    })

    const result = await breaker.fire()
    expect(result.success).toBe(true)
    expect(callCount).toBe(1)
  })

  test('opens after threshold failures', async () => {
    const breaker = new CircuitBreaker(flakeyService, {
      errorThresholdPercentage: 50,
      volumeThreshold: 2,
      resetTimeout: 1000,
    })

    shouldFail = true

    // First 2 failures should reach threshold
    await expect(breaker.fire()).rejects.toThrow()
    await expect(breaker.fire()).rejects.toThrow()

    // Circuit should now be OPEN
    expect(breaker.opened).toBe(true)

    // Next request should be rejected without calling service
    shouldFail = false  // Service is "fixed"
    await expect(breaker.fire()).rejects.toThrow()
    expect(callCount).toBe(2)  // Service not called (circuit open)
  })

  test('transitions to HALF_OPEN after reset timeout', async () => {
    jest.useFakeTimers()

    const breaker = new CircuitBreaker(flakeyService, {
      errorThresholdPercentage: 50,
      volumeThreshold: 2,
      resetTimeout: 1000,
    })

    shouldFail = true
    await expect(breaker.fire()).rejects.toThrow()
    await expect(breaker.fire()).rejects.toThrow()

    expect(breaker.opened).toBe(true)

    // Fast-forward time
    jest.advanceTimersByTime(1100)

    // Circuit should allow test request
    shouldFail = false
    const result = await breaker.fire()
    expect(result.success).toBe(true)
    expect(breaker.closed).toBe(true)  // Back to CLOSED

    jest.useRealTimers()
  })

  test('uses fallback when circuit is open', async () => {
    const breaker = new CircuitBreaker(flakeyService, {
      errorThresholdPercentage: 50,
      volumeThreshold: 2,
      resetTimeout: 1000,
    })

    // Set fallback
    breaker.fallback(() => ({ success: false, fallback: true }))

    shouldFail = true
    await breaker.fire()  // 1st failure
    await breaker.fire()  // 2nd failure - circuit opens

    // Next request uses fallback
    const result = await breaker.fire()
    expect(result.fallback).toBe(true)
  })
})

Integration Tests with Real Services

import nock from 'nock'

describe('Stripe Circuit Breaker Integration', () => {
  test('opens circuit when Stripe returns 500s', async () => {
    // Mock Stripe API to return 500 errors
    nock('https://api.stripe.com')
      .post('/v1/payment_intents')
      .times(5)
      .reply(500, { error: 'Internal Server Error' })

    const breaker = new CircuitBreaker(
      async () => {
        const response = await fetch('https://api.stripe.com/v1/payment_intents', {
          method: 'POST',
          headers: { Authorization: 'Bearer test' },
        })
        if (!response.ok) throw new Error('Stripe API error')
        return response.json()
      },
      { errorThresholdPercentage: 50, volumeThreshold: 3 }
    )

    // First 3 failures
    await expect(breaker.fire()).rejects.toThrow()
    await expect(breaker.fire()).rejects.toThrow()
    await expect(breaker.fire()).rejects.toThrow()

    // Circuit should be open
    expect(breaker.opened).toBe(true)
  })
})

Chaos Testing with Toxiproxy

Use Toxiproxy to simulate network failures:

import { ToxiproxyClient } from 'toxiproxy-node-client'

const toxiproxy = new ToxiproxyClient('http://localhost:8474')

describe('Circuit Breaker Chaos Tests', () => {
  test('handles network latency', async () => {
    // Add 5-second latency to Stripe API
    const proxy = await toxiproxy.createProxy({
      name: 'stripe',
      listen: '127.0.0.1:3001',
      upstream: 'api.stripe.com:443',
    })

    await proxy.addToxic({
      type: 'latency',
      attributes: { latency: 5000 },
    })

    const breaker = new CircuitBreaker(stripeCall, { timeout: 3000 })

    // Should timeout and open circuit
    await expect(breaker.fire()).rejects.toThrow()
  })

  test('recovers when network is restored', async () => {
    const proxy = await toxiproxy.getProxy('stripe')
    
    // Remove latency toxic
    await proxy.removeToxic('latency')

    // Circuit should recover
    const result = await breaker.fire()
    expect(result).toBeDefined()
  })
})

Real-World Scenarios

Scenario 1: Stripe Payment Outage

Problem: Stripe has a 5-minute outage during peak checkout hours.

Without Circuit Breaker:

  • β€’ Every checkout waits 30 seconds for Stripe timeout
  • β€’ 1,000 concurrent users = 1,000 blocked threads
  • β€’ Server crashes from memory exhaustion
  • β€’ Total downtime: 5 min (Stripe) + 10 min (recovery) = 15 min
  • β€’ Lost revenue: $50,000 (assuming $200/min average)

With Circuit Breaker:

  • β€’ After 5 failures (10 seconds), circuit opens
  • β€’ Users see: "Payment processing temporarily unavailable. Please try again in a few minutes."
  • β€’ Some users queue their carts (fallback: email notification when payments restore)
  • β€’ Server remains stable, handling other requests
  • β€’ Circuit automatically recovers when Stripe is back
  • β€’ Total impact: 5 min of degraded checkout only
  • β€’ Lost revenue: $1,000 (90% of users return after notification)

Saved: $49,000 + prevented full site outage

Scenario 2: Third-Party API Rate Limiting

Problem: Your app integrates with GitHub API to fetch user repositories. GitHub rate limits you to 5,000 requests/hour.

const githubBreaker = new CircuitBreaker(
  async (username: string) => {
    const response = await octokit.repos.listForUser({ username })
    return response.data
  },
  {
    errorThresholdPercentage: 30,  // Lower threshold for rate limits
    resetTimeout: 3600000,          // 1 hour (GitHub's reset window)
    volumeThreshold: 10,
  }
)

// Fallback: Use cached data
githubBreaker.fallback(async (username: string) => {
  const cached = await redis.get(`github:repos:${username}`)
  if (cached) {
    return { ...JSON.parse(cached), stale: true }
  }
  throw new Error('No cached data available')
})

// In your API route
app.get('/api/users/:username/repos', async (req, res) => {
  try {
    const repos = await githubBreaker.fire(req.params.username)
    
    if (repos.stale) {
      res.setHeader('X-Data-Stale', 'true')
      res.setHeader('Cache-Control', 'max-age=60')
    }
    
    res.json(repos)
  } catch (error) {
    if (error instanceof CircuitBreakerOpenError) {
      res.set('Retry-After', '3600')  // header clients and proxies understand
      return res.status(429).json({
        error: 'GitHub API rate limit exceeded. Please try again later.',
        retryAfter: 3600
      })
    }
    throw error
  }
})

Scenario 3: Microservices Cascade Failure

Architecture: Frontend β†’ API Gateway β†’ Order Service β†’ Inventory Service β†’ Database

Problem: Database becomes slow (5-second queries instead of 50ms).

Without Circuit Breakers:

  • β€’ Inventory Service threads block waiting for DB
  • β€’ Order Service threads block waiting for Inventory
  • β€’ API Gateway threads block waiting for Order Service
  • β€’ Entire system cascades to failure within 2 minutes

With Circuit Breakers:

// Order Service β†’ Inventory Service circuit breaker
// Bind checkStock so it keeps its `this` when invoked by the breaker
const inventoryBreaker = new CircuitBreaker(inventoryService.checkStock.bind(inventoryService), {
  timeout: 2000,
  errorThresholdPercentage: 50,
})

inventoryBreaker.fallback(() => ({
  available: true,  // Optimistic: assume items are available
  estimatedDelivery: '7-10 business days',
  disclaimer: 'Stock levels are being verified'
}))

// Result:
// - Inventory Service fails, circuit opens
// - Order Service continues accepting orders with disclaimer
// - API Gateway remains responsive
// - Only Inventory checks are degraded, not the entire order flow

Common Mistakes to Avoid

❌ Mistake 1: Too Aggressive Thresholds

// BAD: Opens after a single failure
const breaker = new CircuitBreaker(service, {
  errorThresholdPercentage: 1,  // any failure rate trips the circuit...
  volumeThreshold: 1,           // ...after just one request
})

// GOOD: Opens after sustained failures
const breaker = new CircuitBreaker(service, {
  errorThresholdPercentage: 50,
  volumeThreshold: 10,  // Need 10 requests, 50% failing
})

Why: Transient network blips shouldn't open circuits. Wait for sustained failure patterns.

❌ Mistake 2: Sharing Circuit Breakers Across Endpoints

// BAD: One circuit for all Stripe operations
const stripeBreaker = new CircuitBreaker(stripeClient)

// GOOD: Separate circuits for different reliability profiles
const stripePaymentBreaker = new CircuitBreaker((params) => stripeClient.paymentIntents.create(params), { ... })
const stripeRefundBreaker = new CircuitBreaker((params) => stripeClient.refunds.create(params), { ... })
const stripeCustomerBreaker = new CircuitBreaker((id) => stripeClient.customers.retrieve(id), { ... })

Why: Payment creation might fail while customer retrieval works fine. Don't block working endpoints.

❌ Mistake 3: No Fallback Strategy

// BAD: Just throw error
const breaker = new CircuitBreaker(service)

try {
  await breaker.fire()
} catch (error) {
  throw error  // User sees generic error
}

// GOOD: Provide fallback
breaker.fallback(() => getCachedData())

try {
  const data = await breaker.fire()
  res.json(data)
} catch (error) {
  res.set('Retry-After', '30')  // tell clients when it's worth retrying
  res.status(503).json({
    error: 'Service temporarily unavailable',
    message: 'Please try again in a few moments',
    retryAfter: 30
  })
}

Why: Users need actionable information, not just "something went wrong."

❌ Mistake 4: Not Monitoring Circuit State

// BAD: No logging or alerting
const breaker = new CircuitBreaker(service)

// GOOD: Track state changes
breaker.on('open', () => {
  logger.error('Circuit breaker opened for Stripe')
  metrics.increment('circuit.open', { service: 'stripe' })
  sendAlert('CRITICAL: Stripe payments unavailable')
})

breaker.on('close', () => {
  logger.info('Circuit breaker recovered for Stripe')
  metrics.increment('circuit.recovered', { service: 'stripe' })
})

Why: You need to know immediately when critical services fail.

❌ Mistake 5: Timeout Longer Than Circuit Threshold

// BAD: 30-second timeout defeats the purpose
const breaker = new CircuitBreaker(service, {
  timeout: 30000,
  errorThresholdPercentage: 50,
  volumeThreshold: 5,
})

// GOOD: Short timeout to fail fast
const breaker = new CircuitBreaker(service, {
  timeout: 3000,  // Fail after 3 seconds, not 30
  errorThresholdPercentage: 50,
  volumeThreshold: 5,
})

Why: Circuit breakers should fail fast. Long timeouts defeat the purpose.

❌ Mistake 6: Ignoring 4xx Errors

// BAD: Count 400/401 as failures
const breaker = new CircuitBreaker(async () => {
  const response = await fetch(url)
  if (!response.ok) throw new Error('Request failed')
  return response.json()
})

// GOOD: Only count 5xx as circuit-worthy failures
interface HttpError extends Error {
  statusCode?: number
  skipCircuit?: boolean
}

const breaker = new CircuitBreaker(async () => {
  const response = await fetch(url)

  // 4xx = client error, don't count toward circuit failure
  if (response.status >= 400 && response.status < 500) {
    const error: HttpError = new Error('Client error')
    error.statusCode = response.status
    error.skipCircuit = true  // Custom flag
    throw error
  }

  // 5xx = server error, count toward circuit
  if (!response.ok) throw new Error('Server error')

  return response.json()
}, {
  // opossum's errorFilter: errors it returns true for are still thrown
  // to the caller but are NOT counted toward opening the circuit.
  // (A hand-rolled breaker needs an equivalent hook; merely observing
  // the 'failure' event does not stop the failure from being counted.)
  errorFilter: (error: HttpError) => error.skipCircuit === true,
})

Why: 400 Bad Request and 401 Unauthorized are client problems, not service outages.

Production Readiness Checklist

Configuration

  • ☐ Separate circuit breakers for each external service
  • ☐ Different thresholds for read vs write operations
  • ☐ Timeout values based on service SLAs (not arbitrary)
  • ☐ Volume threshold accounts for traffic patterns
  • ☐ Reset timeout matches expected recovery time
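As a rough sketch of the first few items above, thresholds can be centralized per service and operation class rather than scattered across call sites. The key names and numbers here are placeholders; real values should come from each service's SLA and your traffic patterns.

```typescript
interface BreakerOptions {
  timeout: number                  // ms before an in-flight call counts as failed
  errorThresholdPercentage: number // % of failures that opens the circuit
  volumeThreshold: number          // minimum requests before the % is evaluated
  resetTimeout: number             // ms to wait before trying half-open
}

// Reads fail fast and fall back to cache; writes get more headroom
// because retries are costlier. Reset timeouts track expected recovery.
const breakerConfig: Record<string, BreakerOptions> = {
  'stripe.read':  { timeout: 2000, errorThresholdPercentage: 50, volumeThreshold: 10, resetTimeout: 30000 },
  'stripe.write': { timeout: 5000, errorThresholdPercentage: 30, volumeThreshold: 5,  resetTimeout: 60000 },
  'github.read':  { timeout: 3000, errorThresholdPercentage: 30, volumeThreshold: 10, resetTimeout: 3600000 },
}
```

Keeping the table in one module also gives you an obvious place to document why each threshold was chosen.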

Fallback Strategy

  • ☐ Cached data for read operations
  • ☐ Queue for write operations
  • ☐ Default/static values for non-critical features
  • ☐ Meaningful error messages for users
  • ☐ Retry-After headers in 503 responses

Monitoring & Alerting

  • ☐ Log all state transitions (OPEN, HALF_OPEN, CLOSED)
  • ☐ Track success/failure rates per service
  • ☐ Alert when critical circuits open (Stripe, Auth0, etc.)
  • ☐ Dashboard showing circuit states across all services
  • ☐ Metrics exported to Prometheus/Datadog/CloudWatch
  • ☐ /health endpoint reflects circuit states
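The `/health` item on the list above stays easy to test if the state-to-health mapping is a pure function, separate from the route that serves it. The `BreakerState` type and the "any non-closed circuit means degraded" convention are assumptions for this sketch.

```typescript
type BreakerState = 'CLOSED' | 'OPEN' | 'HALF_OPEN'

interface HealthReport {
  status: 'ok' | 'degraded'
  services: Record<string, BreakerState>
}

// Pure mapping: any circuit not fully CLOSED (including HALF_OPEN,
// which is still probing) marks the overall status as degraded.
function healthFromBreakers(states: Record<string, BreakerState>): HealthReport {
  const degraded = Object.values(states).some((s) => s !== 'CLOSED')
  return { status: degraded ? 'degraded' : 'ok', services: states }
}

const report = healthFromBreakers({ stripe: 'OPEN', auth0: 'CLOSED' })
// report.status is 'degraded' because the Stripe circuit is open
```

A route handler then just collects each breaker's current state into the input record and serializes the result.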

Testing

  • ☐ Unit tests for all three states (CLOSED, OPEN, HALF_OPEN)
  • ☐ Integration tests with mocked failures
  • ☐ Chaos engineering tests (Toxiproxy, fault injection)
  • ☐ Load tests to verify circuit behavior under traffic
  • ☐ Test fallback functions independently

Documentation

  • ☐ Document why each threshold was chosen
  • ☐ Runbook for when circuits open
  • ☐ Contact info for service owners (Stripe, AWS, etc.)
  • ☐ Expected recovery time for each service
  • ☐ Business impact when each circuit opens

Conclusion

Circuit breakers are essential for building resilient distributed systems. They prevent cascade failures, protect resources, and improve user experience during outages. By implementing circuit breakers with proper fallback strategies and monitoring, you transform hard failures into graceful degradation.

Key Takeaways

  • Fail fast: Return errors in milliseconds, not seconds
  • Isolate failures: One broken service shouldn't crash your entire system
  • Auto-recover: Circuits automatically detect when services come back
  • Monitor everything: You can't fix what you don't measure
  • Test thoroughly: Chaos engineering reveals gaps in your resilience strategy

Next steps:

  1. Identify your critical external dependencies (Stripe, AWS, Auth0, etc.)
  2. Wrap each dependency in a circuit breaker with appropriate thresholds
  3. Implement fallback strategies for graceful degradation
  4. Add monitoring and alerting for circuit state changes
  5. Test failure scenarios with chaos engineering tools
  6. Document runbooks for when circuits open

Your users won't notice when services go down β€” they'll just see your app continuing to work.

Related Resources