Circuit Breaker Pattern: Build Resilient APIs That Handle Failures Gracefully
When your API depends on external services like Stripe, AWS, or Twilio, failures are inevitable. The circuit breaker pattern prevents cascade failures by automatically detecting when a service is down and temporarily stopping requests. This guide covers everything from basic concepts to production-ready TypeScript implementations.
What Is a Circuit Breaker?
The circuit breaker pattern is a resilience pattern for distributed systems: it detects failures and stops them from cascading. It's inspired by electrical circuit breakers, which protect circuits from damage caused by excess current.
In software, a circuit breaker wraps API calls to external services and monitors for failures. When failures reach a threshold, the circuit "opens" and subsequent calls fail immediately without hitting the failing service. After a timeout, the circuit enters a "half-open" state to test if the service has recovered.
💡 The Electrical Analogy
Just like an electrical breaker:
- Closed (Normal): Current flows, requests succeed
- Open (Tripped): Current stops, requests fail fast
- Half-Open (Testing): Limited current flows to test if it's safe to reset
How It Works
// Pseudocode
if (circuitState === CLOSED) {
try {
result = await externalServiceCall()
recordSuccess()
return result
} catch (error) {
recordFailure()
if (failureThresholdReached()) {
circuitState = OPEN
startResetTimer()
}
throw error
}
} else if (circuitState === OPEN) {
throw new CircuitBreakerOpenError("Service unavailable")
} else if (circuitState === HALF_OPEN) {
try {
result = await externalServiceCall()
circuitState = CLOSED // Service recovered!
return result
} catch (error) {
circuitState = OPEN // Still broken
startResetTimer()
throw error
}
}
Why Circuit Breakers Matter
Without circuit breakers, your application suffers when dependencies fail:
The Cascade Failure Problem
⚠️ Real-World Example: Black Friday Disaster
An e-commerce site depends on Stripe for payments. During Black Friday, Stripe experiences a 2-minute outage. Without circuit breakers:
- Every checkout request waits 30 seconds for Stripe to timeout
- Request threads pile up (100, 200, 500...)
- Memory exhaustion crashes the payment service
- Now the entire site is down, not just payments
- Recovery takes 15 minutes even after Stripe comes back
Total impact: 17 minutes of complete downtime instead of 2 minutes of graceful degradation.
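The thread pile-up above is just Little's Law: average in-flight requests = arrival rate × time each request is held. A quick sketch with assumed, illustrative numbers (50 checkouts/sec, a 30-second timeout versus a ~1 ms fail-fast rejection):

```typescript
// Little's Law: in-flight requests = arrival rate x time each request is held.
// All figures here are illustrative assumptions, not measurements.
function inFlightRequests(arrivalRatePerSec: number, holdTimeSec: number): number {
  return arrivalRatePerSec * holdTimeSec
}

// Without a breaker: every request holds a thread for the full 30 s timeout.
const withoutBreaker = inFlightRequests(50, 30) // 1,500 blocked threads

// With an open circuit: requests are rejected in roughly a millisecond.
const withBreaker = inFlightRequests(50, 0.001)

console.log({ withoutBreaker, withBreaker })
```

The same traffic that piles up 1,500 blocked threads barely registers when the circuit fails fast — which is why the breaker keeps the rest of the site alive.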
Benefits of Circuit Breakers
✅ Fail Fast
Return errors in milliseconds instead of waiting for timeouts (30+ seconds).
✅ Prevent Cascade Failures
Stop one failing service from bringing down your entire system.
✅ Automatic Recovery
Detect when services recover and resume normal operations automatically.
✅ Better User Experience
Show meaningful errors immediately instead of long loading spinners.
✅ Resource Protection
Prevent thread pool exhaustion, memory leaks, and connection timeouts.
✅ Observability
Clear metrics on which dependencies are failing and when.
When to Use Circuit Breakers
Circuit breakers are essential for:
- External API calls: Payment gateways (Stripe, PayPal), auth providers (Auth0, Okta), cloud services (AWS, Azure)
- Database connections: Especially when using read replicas or distributed databases
- Microservices communication: Keep a failing downstream service from dragging down its upstream callers
- Third-party integrations: CRMs (Salesforce, HubSpot), monitoring tools (Datadog, New Relic)
- CDN and storage: Image processing, file uploads, content delivery
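When several of these dependencies each need their own breaker, a common arrangement (a sketch, not tied to any particular library) is a small registry that lazily creates and reuses one breaker per service name:

```typescript
// Minimal registry sketch: one breaker instance per named dependency.
// `BreakerLike` is a stand-in for whatever breaker class you actually use.
interface BreakerLike {
  serviceName: string
}

class BreakerRegistry {
  private breakers = new Map<string, BreakerLike>()

  // Returns the existing breaker for `name`, creating it on first use.
  get(name: string, create: (name: string) => BreakerLike): BreakerLike {
    let breaker = this.breakers.get(name)
    if (!breaker) {
      breaker = create(name)
      this.breakers.set(name, breaker)
    }
    return breaker
  }

  names(): string[] {
    return [...this.breakers.keys()]
  }
}

const registry = new BreakerRegistry()
const stripe = registry.get('stripe', (n) => ({ serviceName: n }))
const again = registry.get('stripe', (n) => ({ serviceName: n }))
console.log(stripe === again) // true: the same breaker is reused
```

Reusing one instance per dependency matters because a breaker's failure history only means something if every call to that dependency flows through the same object.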
The Three States: Closed, Open, Half-Open
A circuit breaker has three states, each with different behavior:
🟢 CLOSED (Normal Operation)
- Behavior: All requests pass through to the service
- Monitoring: Track success/failure rate
- Transition: Opens when failure threshold is reached (e.g., 5 failures in 10 seconds)
// Example: 50% failure rate over 10 requests triggers OPEN
if (failures / totalRequests >= 0.5 && totalRequests >= 10) {
circuitState = OPEN
}
🔴 OPEN (Failing Fast)
- Behavior: All requests fail immediately without calling the service
- Duration: Stays open for a reset timeout (e.g., 30 seconds)
- Transition: Enters HALF_OPEN after timeout expires
// Fail fast - no network call
if (circuitState === OPEN) {
throw new CircuitBreakerOpenError(
'Stripe API circuit breaker is OPEN. Service unavailable.'
)
}
🟡 HALF_OPEN (Testing Recovery)
- Behavior: Allow a limited number of test requests (e.g., 1-3)
- Success: If test requests succeed, transition to CLOSED
- Failure: If test requests fail, transition back to OPEN
// Allow 1 test request
if (circuitState === HALF_OPEN) {
try {
const result = await service.call()
circuitState = CLOSED // Success! Service recovered
return result
} catch (error) {
circuitState = OPEN // Still broken
startResetTimer()
throw error
}
}
🔄 State Transition Diagram
┌─────────────┐
│   CLOSED    │ ← Normal operation
│  (requests  │
│  pass thru) │
└──────┬──────┘
       │
       │ Failure threshold reached
       │ (e.g., 50% fail rate)
       ▼
┌─────────────┐
│    OPEN     │ ← Failing fast
│   (reject   │
│   all req)  │
└──────┬──────┘
       │
       │ Reset timeout expires
       │ (e.g., 30 seconds)
       ▼
┌─────────────┐
│  HALF_OPEN  │ ← Testing recovery
│ (allow test │
│  requests)  │
└──────┬──────┘
       │
       ├── Success → CLOSED
       │
       └── Failure → OPEN
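The transitions above can also be written down as a pure function — a minimal sketch, independent of any breaker library:

```typescript
type State = 'CLOSED' | 'OPEN' | 'HALF_OPEN'
type Event =
  | 'FAILURE_THRESHOLD_REACHED'
  | 'RESET_TIMEOUT_EXPIRED'
  | 'TEST_SUCCESS'
  | 'TEST_FAILURE'

// Encodes exactly the arrows in the diagram; any other event keeps the state.
function transition(state: State, event: Event): State {
  if (state === 'CLOSED' && event === 'FAILURE_THRESHOLD_REACHED') return 'OPEN'
  if (state === 'OPEN' && event === 'RESET_TIMEOUT_EXPIRED') return 'HALF_OPEN'
  if (state === 'HALF_OPEN' && event === 'TEST_SUCCESS') return 'CLOSED'
  if (state === 'HALF_OPEN' && event === 'TEST_FAILURE') return 'OPEN'
  return state
}

console.log(transition('CLOSED', 'FAILURE_THRESHOLD_REACHED')) // OPEN
```

Keeping the transition logic this small is the point: everything else a real breaker does (counting failures, timers, metrics) is bookkeeping around these four arrows.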
Building a Circuit Breaker from Scratch
Let's build a production-ready circuit breaker in TypeScript to understand exactly how it works.
Step 1: Define the Core Types
// circuit-breaker.ts
export enum CircuitState {
CLOSED = 'CLOSED',
OPEN = 'OPEN',
HALF_OPEN = 'HALF_OPEN',
}
export interface CircuitBreakerConfig {
failureThreshold: number // % failures to trigger OPEN (0-1)
resetTimeout: number // ms to wait before HALF_OPEN
rollingWindowSize: number // number of requests to track
halfOpenMaxAttempts: number // test requests in HALF_OPEN
}
export class CircuitBreakerOpenError extends Error {
constructor(message: string) {
super(message)
this.name = 'CircuitBreakerOpenError'
}
}
interface RequestRecord {
timestamp: number
success: boolean
}
Step 2: Implement the Circuit Breaker
export class CircuitBreaker<T> {
private state: CircuitState = CircuitState.CLOSED
private requestHistory: RequestRecord[] = []
private openedAt: number | null = null
private halfOpenAttempts = 0
constructor(
private serviceName: string,
private config: CircuitBreakerConfig,
private action: () => Promise<T>
) {}
async execute(): Promise<T> {
// Check if circuit should transition to HALF_OPEN
if (this.state === CircuitState.OPEN) {
if (this.shouldAttemptReset()) {
this.state = CircuitState.HALF_OPEN
this.halfOpenAttempts = 0
console.log(`[${this.serviceName}] Circuit HALF_OPEN - testing recovery`)
} else {
throw new CircuitBreakerOpenError(
`Circuit breaker is OPEN for ${this.serviceName}`
)
}
}
// In HALF_OPEN, limit test requests
if (this.state === CircuitState.HALF_OPEN) {
if (this.halfOpenAttempts >= this.config.halfOpenMaxAttempts) {
throw new CircuitBreakerOpenError(
`Circuit breaker is HALF_OPEN for ${this.serviceName}, max test attempts reached`
)
}
this.halfOpenAttempts++
}
// Execute the action
try {
const result = await this.action()
this.onSuccess()
return result
} catch (error) {
this.onFailure()
throw error
}
}
private onSuccess(): void {
this.recordRequest(true)
if (this.state === CircuitState.HALF_OPEN) {
// Success in HALF_OPEN β reset to CLOSED
this.state = CircuitState.CLOSED
this.requestHistory = []
console.log(`[${this.serviceName}] Circuit CLOSED - service recovered`)
}
}
private onFailure(): void {
this.recordRequest(false)
if (this.state === CircuitState.HALF_OPEN) {
// Failure in HALF_OPEN β back to OPEN
this.openCircuit()
} else if (this.state === CircuitState.CLOSED) {
// Check if we should open the circuit
if (this.shouldOpenCircuit()) {
this.openCircuit()
}
}
}
private recordRequest(success: boolean): void {
this.requestHistory.push({
timestamp: Date.now(),
success,
})
// Keep only recent requests (sliding window)
if (this.requestHistory.length > this.config.rollingWindowSize) {
this.requestHistory.shift()
}
}
private shouldOpenCircuit(): boolean {
if (this.requestHistory.length < this.config.rollingWindowSize) {
return false // Not enough data
}
const failures = this.requestHistory.filter(r => !r.success).length
const failureRate = failures / this.requestHistory.length
return failureRate >= this.config.failureThreshold
}
private openCircuit(): void {
this.state = CircuitState.OPEN
this.openedAt = Date.now()
this.requestHistory = []
console.log(`[${this.serviceName}] Circuit OPEN - service unavailable`)
}
private shouldAttemptReset(): boolean {
if (!this.openedAt) return false
return Date.now() - this.openedAt >= this.config.resetTimeout
}
getState(): CircuitState {
return this.state
}
getMetrics() {
const failures = this.requestHistory.filter(r => !r.success).length
const successes = this.requestHistory.length - failures
return {
state: this.state,
totalRequests: this.requestHistory.length,
failures,
successes,
failureRate: this.requestHistory.length > 0
? failures / this.requestHistory.length
: 0,
openedAt: this.openedAt,
}
}
}
Step 3: Usage Example with Stripe API
import Stripe from 'stripe'
const stripe = new Stripe(process.env.STRIPE_SECRET_KEY!)
// Create circuit breaker for Stripe payment intents
const stripeCircuit = new CircuitBreaker(
'Stripe Payment Intent',
{
failureThreshold: 0.5, // Open after 50% failure rate
resetTimeout: 30000, // Try recovery after 30 seconds
rollingWindowSize: 10, // Track last 10 requests
halfOpenMaxAttempts: 2, // Allow 2 test requests
},
async () => {
return stripe.paymentIntents.create({
amount: 2000,
currency: 'usd',
})
}
)
// In your Express route
app.post('/api/checkout', async (req, res) => {
try {
const paymentIntent = await stripeCircuit.execute()
res.json({ clientSecret: paymentIntent.client_secret })
} catch (error) {
if (error instanceof CircuitBreakerOpenError) {
// Circuit is open - fail gracefully
return res.status(503).json({
error: 'Payment service temporarily unavailable. Please try again in a few moments.',
retryAfter: 30
})
}
// Other errors (validation, network, etc.)
res.status(500).json({ error: 'Payment failed' })
}
})
Production-Ready Implementation with Opossum
For production systems, use a battle-tested library like Opossum, a widely adopted circuit breaker for Node.js with features well beyond the from-scratch version above.
Installation
npm install opossum
npm install @types/opossum --save-dev
Basic Setup with Type Safety
import CircuitBreaker from 'opossum'
import axios from 'axios'
// Define your service function
async function fetchUserData(userId: string) {
const response = await axios.get(`https://api.example.com/users/${userId}`)
return response.data
}
// Wrap it with a circuit breaker
const breaker = new CircuitBreaker(fetchUserData, {
timeout: 5000, // Request timeout (ms)
errorThresholdPercentage: 50, // Open at 50% error rate
resetTimeout: 30000, // Try recovery after 30s
rollingCountTimeout: 10000, // 10s rolling window
rollingCountBuckets: 10, // 10 buckets = 1s each
volumeThreshold: 10, // Min requests before opening
})
// Use it
breaker.fire('user-123')
.then(data => console.log('User:', data))
.catch(err => console.error('Circuit breaker error:', err))
Advanced: Multiple Service Circuit Breakers
// services/circuit-breakers.ts
import CircuitBreaker from 'opossum'
import Stripe from 'stripe'
import { twilioClient } from './twilio'
import { auth0Client } from './auth0'
const stripe = new Stripe(process.env.STRIPE_SECRET_KEY!)
// Stripe circuit breaker
export const stripeBreaker = new CircuitBreaker(
async (amount: number, currency: string) => {
return stripe.paymentIntents.create({ amount, currency })
},
{
timeout: 10000,
errorThresholdPercentage: 40,
resetTimeout: 60000, // Longer timeout for payment provider
volumeThreshold: 5,
}
)
// Twilio SMS circuit breaker
export const twilioBreaker = new CircuitBreaker(
async (to: string, message: string) => {
return twilioClient.messages.create({
body: message,
to,
from: process.env.TWILIO_PHONE_NUMBER!,
})
},
{
timeout: 5000,
errorThresholdPercentage: 50,
resetTimeout: 30000,
volumeThreshold: 10,
}
)
// Auth0 user creation circuit breaker
export const auth0Breaker = new CircuitBreaker(
async (email: string, password: string) => {
return auth0Client.createUser({
connection: 'Username-Password-Authentication',
email,
password,
})
},
{
timeout: 8000,
errorThresholdPercentage: 30, // Lower threshold for auth
resetTimeout: 45000,
volumeThreshold: 5,
}
)
// Setup logging for all breakers
const breakers = { stripeBreaker, twilioBreaker, auth0Breaker }
Object.entries(breakers).forEach(([name, breaker]) => {
breaker.on('open', () => console.error(`[${name}] Circuit OPEN`))
breaker.on('halfOpen', () => console.warn(`[${name}] Circuit HALF_OPEN`))
breaker.on('close', () => console.log(`[${name}] Circuit CLOSED`))
breaker.on('fallback', (result) => console.log(`[${name}] Fallback used`, result))
})
Fallback Strategies
Circuit breakers work best with fallback functions that provide degraded functionality when services fail:
import CircuitBreaker from 'opossum'
// Primary: Stripe payment
async function createPaymentIntent(amount: number) {
const stripe = new Stripe(process.env.STRIPE_SECRET_KEY!)
return stripe.paymentIntents.create({ amount, currency: 'usd' })
}
// Fallback: Queue payment for later processing
async function queuePaymentForLater(amount: number) {
await redis.lpush('pending-payments', JSON.stringify({
amount,
timestamp: Date.now(),
status: 'queued',
}))
return {
id: `queued-${Date.now()}`,
status: 'queued',
message: 'Payment queued for processing',
}
}
const paymentBreaker = new CircuitBreaker(createPaymentIntent, {
timeout: 10000,
errorThresholdPercentage: 50,
resetTimeout: 60000,
})
// Set fallback
paymentBreaker.fallback(queuePaymentForLater)
// Usage
app.post('/api/checkout', async (req, res) => {
try {
const result = await paymentBreaker.fire(req.body.amount)
if (result.status === 'queued') {
// Fallback was used
return res.status(202).json({
message: 'Payment processing is temporarily delayed. You will receive confirmation via email.',
paymentId: result.id,
})
}
// Normal Stripe response
res.json({ clientSecret: result.client_secret })
} catch (error) {
res.status(500).json({ error: 'Payment failed' })
}
})
Health Check Integration
import { stripeBreaker, twilioBreaker, auth0Breaker } from './circuit-breakers'
// Health check endpoint
app.get('/health', (req, res) => {
const health = {
status: 'healthy',
timestamp: new Date().toISOString(),
services: {
stripe: {
state: stripeBreaker.opened ? 'OPEN' :
stripeBreaker.halfOpen ? 'HALF_OPEN' : 'CLOSED',
stats: stripeBreaker.stats,
},
twilio: {
state: twilioBreaker.opened ? 'OPEN' :
twilioBreaker.halfOpen ? 'HALF_OPEN' : 'CLOSED',
stats: twilioBreaker.stats,
},
auth0: {
state: auth0Breaker.opened ? 'OPEN' :
auth0Breaker.halfOpen ? 'HALF_OPEN' : 'CLOSED',
stats: auth0Breaker.stats,
},
},
}
// Overall status is unhealthy if any critical service circuit is open
const criticalServicesOpen = stripeBreaker.opened || auth0Breaker.opened
if (criticalServicesOpen) {
health.status = 'degraded'
return res.status(503).json(health)
}
res.json(health)
})
Advanced Patterns
Pattern 1: Per-Endpoint Circuit Breakers
Different API endpoints have different reliability characteristics. Use separate circuit breakers:
const githubBreakers = {
// Read operations - can tolerate more failures
repos: new CircuitBreaker(
(owner: string, repo: string) => octokit.repos.get({ owner, repo }),
{ errorThresholdPercentage: 60, timeout: 3000 }
),
// Write operations - stricter failure threshold
createIssue: new CircuitBreaker(
(owner: string, repo: string, title: string) =>
octokit.issues.create({ owner, repo, title }),
{ errorThresholdPercentage: 30, timeout: 5000 }
),
}
Pattern 2: Bulkhead Pattern (Isolation)
Prevent one slow service from exhausting all resources by limiting concurrent requests:
import pLimit from 'p-limit'
// Limit Stripe to max 5 concurrent requests
const stripeLimit = pLimit(5)
const stripeBreaker = new CircuitBreaker(
async (amount: number) => {
return stripeLimit(() =>
stripe.paymentIntents.create({ amount, currency: 'usd' })
)
},
{ timeout: 10000 }
)
// Even if Stripe is slow, only 5 requests block at once
// Other services (Twilio, Auth0) remain unaffected
Pattern 3: Retry with Exponential Backoff
Combine circuit breakers with intelligent retries:
import retry from 'async-retry'
const resilientStripeCall = new CircuitBreaker(
async (amount: number) => {
return retry(
async (bail) => {
try {
return await stripe.paymentIntents.create({ amount, currency: 'usd' })
} catch (error: any) {
// Don't retry 4xx errors (client errors)
if (error.statusCode >= 400 && error.statusCode < 500) {
bail(error)
return
}
throw error // Retry 5xx and network errors
}
},
{
retries: 3,
minTimeout: 1000,
maxTimeout: 5000,
factor: 2, // Exponential backoff
}
)
},
{ timeout: 15000 } // Account for retries in timeout
)
// This gives you:
// 1. Retries for transient failures (network blips, 503s)
// 2. Circuit breaker for sustained outages
// 3. No retries for client errors (400, 401, etc.)
Pattern 4: Cache-Aside with Circuit Breakers
Use cached data when the circuit is open:
import NodeCache from 'node-cache'
const cache = new NodeCache({ stdTTL: 300 }) // 5 min cache
const userServiceBreaker = new CircuitBreaker(
async (userId: string) => {
const response = await axios.get(`https://api.example.com/users/${userId}`)
// Cache successful response
cache.set(`user:${userId}`, response.data)
return response.data
},
{ timeout: 5000 }
)
// Fallback to cache when circuit is open
userServiceBreaker.fallback(async (userId: string) => {
const cached = cache.get(`user:${userId}`)
if (cached) {
console.log(`Serving stale data for user ${userId}`)
return { ...cached, stale: true }
}
throw new Error('No cached data available')
})
// Usage
const userData = await userServiceBreaker.fire('user-123')
if (userData.stale) {
// Show UI indicator that data may be outdated
}
Monitoring Circuit Breaker Health
Circuit breakers are only effective if you monitor their behavior and alert on problems.
Metrics to Track
State Changes
- Frequency of OPEN state
- Time spent in OPEN state
- HALF_OPEN → OPEN transitions (failed recovery)
- HALF_OPEN → CLOSED transitions (successful recovery)
Request Stats
- Total requests
- Success/failure rate
- Rejected requests (circuit OPEN)
- Fallback invocations
Timing
- Response time (p50, p95, p99)
- Timeout frequency
- Time to recovery
Business Impact
- Failed payments (Stripe OPEN)
- Missed notifications (Twilio OPEN)
- Failed logins (Auth0 OPEN)
Prometheus Metrics Integration
import { Counter, Gauge, Histogram, register } from 'prom-client'
// Define metrics
const circuitStateGauge = new Gauge({
name: 'circuit_breaker_state',
help: 'Current state of circuit breaker (0=CLOSED, 1=HALF_OPEN, 2=OPEN)',
labelNames: ['service'],
})
const circuitRequestsTotal = new Counter({
name: 'circuit_breaker_requests_total',
help: 'Total requests through circuit breaker',
labelNames: ['service', 'state', 'result'],
})
const circuitLatency = new Histogram({
name: 'circuit_breaker_latency_seconds',
help: 'Circuit breaker request latency',
labelNames: ['service'],
buckets: [0.1, 0.5, 1, 2, 5, 10],
})
// Instrument circuit breaker
function instrumentCircuitBreaker(breaker: CircuitBreaker, serviceName: string) {
breaker.on('success', (result, latency) => {
circuitRequestsTotal.inc({ service: serviceName, state: 'closed', result: 'success' })
circuitLatency.observe({ service: serviceName }, latency / 1000)
})
breaker.on('failure', (error) => {
circuitRequestsTotal.inc({
service: serviceName,
state: breaker.opened ? 'open' : 'closed',
result: 'failure'
})
})
breaker.on('reject', () => {
circuitRequestsTotal.inc({ service: serviceName, state: 'open', result: 'rejected' })
})
breaker.on('open', () => {
circuitStateGauge.set({ service: serviceName }, 2)
})
breaker.on('halfOpen', () => {
circuitStateGauge.set({ service: serviceName }, 1)
})
breaker.on('close', () => {
circuitStateGauge.set({ service: serviceName }, 0)
})
}
// Use it
instrumentCircuitBreaker(stripeBreaker, 'stripe')
instrumentCircuitBreaker(twilioBreaker, 'twilio')
// Metrics endpoint
app.get('/metrics', async (req, res) => {
res.set('Content-Type', register.contentType)
res.end(await register.metrics())
})
Alerting Rules (Prometheus + Alertmanager)
# Circuit breaker alert rules
groups:
- name: circuit_breaker
rules:
# Alert if circuit is OPEN for more than 5 minutes
- alert: CircuitBreakerOpenTooLong
expr: circuit_breaker_state == 2
for: 5m
labels:
severity: warning
annotations:
summary: "Circuit breaker {{ $labels.service }} is OPEN"
description: "{{ $labels.service }} circuit has been OPEN for 5+ minutes"
# Alert if failure rate > 20% over 5 minutes
- alert: HighCircuitBreakerFailureRate
expr: |
  (
    sum by (service) (rate(circuit_breaker_requests_total{result="failure"}[5m]))
    /
    sum by (service) (rate(circuit_breaker_requests_total[5m]))
  ) > 0.2
for: 5m
labels:
severity: warning
annotations:
summary: "High failure rate for {{ $labels.service }}"
description: "{{ $labels.service }} has {{ $value | humanizePercentage }} failure rate"
# Alert if critical service (Stripe, Auth0) circuit opens
- alert: CriticalServiceCircuitOpen
expr: circuit_breaker_state{service=~"stripe|auth0"} == 2
for: 1m
labels:
severity: critical
annotations:
summary: "CRITICAL: {{ $labels.service }} circuit is OPEN"
description: "Revenue-critical service {{ $labels.service }} is unavailable"
Datadog Integration
import { StatsD } from 'hot-shots'
const dogstatsd = new StatsD({
host: 'localhost',
port: 8125,
prefix: 'circuit_breaker.',
})
function instrumentWithDatadog(breaker: CircuitBreaker, serviceName: string) {
breaker.on('success', (result, latency) => {
dogstatsd.increment(`${serviceName}.success`)
dogstatsd.timing(`${serviceName}.latency`, latency)
})
breaker.on('failure', () => {
dogstatsd.increment(`${serviceName}.failure`)
})
breaker.on('open', () => {
dogstatsd.gauge(`${serviceName}.state`, 2)
dogstatsd.event(
`Circuit Open: ${serviceName}`,
`Circuit breaker for ${serviceName} has opened`,
{ alert_type: 'error' }
)
})
breaker.on('close', () => {
dogstatsd.gauge(`${serviceName}.state`, 0)
dogstatsd.event(
`Circuit Closed: ${serviceName}`,
`Circuit breaker for ${serviceName} has recovered`,
{ alert_type: 'success' }
)
})
}
instrumentWithDatadog(stripeBreaker, 'stripe')
instrumentWithDatadog(twilioBreaker, 'twilio')
Testing Circuit Breakers
Circuit breakers must be tested to ensure they behave correctly under failure conditions.
Unit Tests with Jest
import CircuitBreaker from 'opossum'
describe('Circuit Breaker', () => {
let callCount = 0
let shouldFail = false
const flakeyService = async () => {
callCount++
if (shouldFail) {
throw new Error('Service failure')
}
return { success: true }
}
beforeEach(() => {
callCount = 0
shouldFail = false
})
test('allows requests in CLOSED state', async () => {
const breaker = new CircuitBreaker(flakeyService, {
errorThresholdPercentage: 50,
resetTimeout: 1000,
})
const result = await breaker.fire()
expect(result.success).toBe(true)
expect(callCount).toBe(1)
})
test('opens after threshold failures', async () => {
const breaker = new CircuitBreaker(flakeyService, {
errorThresholdPercentage: 50,
volumeThreshold: 2,
resetTimeout: 1000,
})
shouldFail = true
// First 2 failures should reach threshold
await expect(breaker.fire()).rejects.toThrow()
await expect(breaker.fire()).rejects.toThrow()
// Circuit should now be OPEN
expect(breaker.opened).toBe(true)
// Next request should be rejected without calling service
shouldFail = false // Service is "fixed"
await expect(breaker.fire()).rejects.toThrow()
expect(callCount).toBe(2) // Service not called (circuit open)
})
test('transitions to HALF_OPEN after reset timeout', async () => {
jest.useFakeTimers()
const breaker = new CircuitBreaker(flakeyService, {
errorThresholdPercentage: 50,
volumeThreshold: 2,
resetTimeout: 1000,
})
shouldFail = true
await expect(breaker.fire()).rejects.toThrow()
await expect(breaker.fire()).rejects.toThrow()
expect(breaker.opened).toBe(true)
// Fast-forward time
jest.advanceTimersByTime(1100)
// Circuit should allow test request
shouldFail = false
const result = await breaker.fire()
expect(result.success).toBe(true)
expect(breaker.closed).toBe(true) // Back to CLOSED
jest.useRealTimers()
})
test('uses fallback when circuit is open', async () => {
const breaker = new CircuitBreaker(flakeyService, {
errorThresholdPercentage: 50,
volumeThreshold: 2,
resetTimeout: 1000,
})
// Set fallback
breaker.fallback(() => ({ success: false, fallback: true }))
shouldFail = true
await breaker.fire() // 1st failure
await breaker.fire() // 2nd failure - circuit opens
// Next request uses fallback
const result = await breaker.fire()
expect(result.fallback).toBe(true)
})
})
Integration Tests with Real Services
import nock from 'nock'
describe('Stripe Circuit Breaker Integration', () => {
test('opens circuit when Stripe returns 500s', async () => {
// Mock Stripe API to return 500 errors
nock('https://api.stripe.com')
.post('/v1/payment_intents')
.times(5)
.reply(500, { error: 'Internal Server Error' })
const breaker = new CircuitBreaker(
async () => {
const response = await fetch('https://api.stripe.com/v1/payment_intents', {
method: 'POST',
headers: { Authorization: 'Bearer test' },
})
if (!response.ok) throw new Error('Stripe API error')
return response.json()
},
{ errorThresholdPercentage: 50, volumeThreshold: 3 }
)
// First 3 failures
await expect(breaker.fire()).rejects.toThrow()
await expect(breaker.fire()).rejects.toThrow()
await expect(breaker.fire()).rejects.toThrow()
// Circuit should be open
expect(breaker.opened).toBe(true)
})
})
Chaos Testing with Toxiproxy
Use Toxiproxy to simulate network failures:
import { ToxiproxyClient } from 'toxiproxy-node-client'
const toxiproxy = new ToxiproxyClient('http://localhost:8474')
describe('Circuit Breaker Chaos Tests', () => {
test('handles network latency', async () => {
// Add 5-second latency to Stripe API
const proxy = await toxiproxy.createProxy({
name: 'stripe',
listen: '127.0.0.1:3001',
upstream: 'api.stripe.com:443',
})
await proxy.addToxic({
type: 'latency',
attributes: { latency: 5000 },
})
const breaker = new CircuitBreaker(stripeCall, { timeout: 3000 })
// Should timeout and open circuit
await expect(breaker.fire()).rejects.toThrow()
})
test('recovers when network is restored', async () => {
const proxy = await toxiproxy.getProxy('stripe')
// Remove latency toxic
await proxy.removeToxic('latency')
// Circuit should recover
const result = await breaker.fire()
expect(result).toBeDefined()
})
})
Real-World Scenarios
Scenario 1: Stripe Payment Outage
Problem: Stripe has a 5-minute outage during peak checkout hours.
Without Circuit Breaker:
- Every checkout waits 30 seconds for Stripe to timeout
- 1,000 concurrent users = 1,000 blocked threads
- Server crashes from memory exhaustion
- Total downtime: 5 min (Stripe) + 10 min (recovery) = 15 min
- Lost revenue: $30,000 (assuming $2,000/min average)
With Circuit Breaker:
- After 5 failures (10 seconds), circuit opens
- Users see: "Payment processing temporarily unavailable. Please try again in a few minutes."
- Some users queue their carts (fallback: email notification when payments restore)
- Server remains stable, handling other requests
- Circuit automatically recovers when Stripe is back
- Total impact: 5 min of degraded checkout only
- Lost revenue: $1,000 (90% of affected shoppers complete checkout after the recovery email)
Saved: $29,000 + prevented full site outage
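The savings figure comes from a back-of-envelope model — every input below is an illustrative assumption, not real data: revenue per minute × minutes fully down, versus revenue per minute × degraded minutes × the share of shoppers who never come back.

```typescript
// Back-of-envelope outage cost model; all inputs are illustrative assumptions.
interface OutageInputs {
  revenuePerMin: number     // assumed average checkout revenue per minute
  fullOutageMin: number     // total downtime without a breaker
  degradedMin: number       // degraded-checkout window with a breaker
  recoveredFraction: number // share of affected shoppers who complete later
}

function lostRevenue(i: OutageInputs) {
  const withoutBreaker = i.revenuePerMin * i.fullOutageMin
  const withBreaker = i.revenuePerMin * i.degradedMin * (1 - i.recoveredFraction)
  return { withoutBreaker, withBreaker, saved: withoutBreaker - withBreaker }
}

// Hypothetical figures: $2,000/min, 15 min full outage vs 5 min degraded, 90% return rate.
const impact = lostRevenue({
  revenuePerMin: 2000,
  fullOutageMin: 15,
  degradedMin: 5,
  recoveredFraction: 0.9,
})
console.log(impact)
```

Plug in your own traffic and conversion numbers; the shape of the result (an order-of-magnitude gap) holds across a wide range of assumptions.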
Scenario 2: Third-Party API Rate Limiting
Problem: Your app integrates with GitHub API to fetch user repositories. GitHub rate limits you to 5,000 requests/hour.
const githubBreaker = new CircuitBreaker(
async (username: string) => {
const response = await octokit.repos.listForUser({ username })
return response.data
},
{
errorThresholdPercentage: 30, // Lower threshold for rate limits
resetTimeout: 3600000, // 1 hour (GitHub's reset window)
volumeThreshold: 10,
}
)
// Fallback: Use cached data
githubBreaker.fallback(async (username: string) => {
const cached = await redis.get(`github:repos:${username}`)
if (cached) {
return { ...JSON.parse(cached), stale: true }
}
throw new Error('No cached data available')
})
// In your API route
app.get('/api/users/:username/repos', async (req, res) => {
try {
const repos = await githubBreaker.fire(req.params.username)
if (repos.stale) {
res.setHeader('X-Data-Stale', 'true')
res.setHeader('Cache-Control', 'max-age=60')
}
res.json(repos)
} catch (error) {
// Opossum rejects with its own error when the circuit is open,
// so check the breaker's state rather than the custom class from earlier
if (githubBreaker.opened) {
return res.status(429).json({
error: 'GitHub API rate limit exceeded. Please try again later.',
retryAfter: 3600
})
}
throw error
}
})
Scenario 3: Microservices Cascade Failure
Architecture: Frontend → API Gateway → Order Service → Inventory Service → Database
Problem: Database becomes slow (5-second queries instead of 50ms).
Without Circuit Breakers:
- Inventory Service threads block waiting for DB
- Order Service threads block waiting for Inventory
- API Gateway threads block waiting for Order Service
- Entire system cascades to failure within 2 minutes
With Circuit Breakers:
// Order Service β Inventory Service circuit breaker
const inventoryBreaker = new CircuitBreaker(inventoryService.checkStock, {
timeout: 2000,
errorThresholdPercentage: 50,
})
inventoryBreaker.fallback(() => ({
available: true, // Optimistic: assume items are available
estimatedDelivery: '7-10 business days',
disclaimer: 'Stock levels are being verified'
}))
// Result:
// - Inventory Service fails, circuit opens
// - Order Service continues accepting orders with disclaimer
// - API Gateway remains responsive
// - Only Inventory checks are degraded, not the entire order flow
Common Mistakes to Avoid
❌ Mistake 1: Too Aggressive Thresholds
// BAD: Opens after 1 failure
const breaker = new CircuitBreaker(service, {
errorThresholdPercentage: 100,
volumeThreshold: 1,
})
// GOOD: Opens after sustained failures
const breaker = new CircuitBreaker(service, {
errorThresholdPercentage: 50,
volumeThreshold: 10, // Need 10 requests, 50% failing
})
Why: Transient network blips shouldn't open circuits. Wait for sustained failure patterns.
❌ Mistake 2: Sharing Circuit Breakers Across Endpoints
// BAD: One circuit for all Stripe operations
const stripeBreaker = new CircuitBreaker(stripeClient)
// GOOD: Separate circuits for different reliability profiles
const stripePaymentBreaker = new CircuitBreaker(stripeClient.paymentIntents.create, { ... })
const stripeRefundBreaker = new CircuitBreaker(stripeClient.refunds.create, { ... })
const stripeCustomerBreaker = new CircuitBreaker(stripeClient.customers.retrieve, { ... })
Why: Payment creation might fail while customer retrieval works fine. Don't block working endpoints.
❌ Mistake 3: No Fallback Strategy
// BAD: Just throw error
const breaker = new CircuitBreaker(service)
try {
await breaker.fire()
} catch (error) {
throw error // User sees generic error
}
// GOOD: Provide fallback
breaker.fallback(() => getCachedData())
try {
const data = await breaker.fire()
res.json(data)
} catch (error) {
res.status(503).json({
error: 'Service temporarily unavailable',
message: 'Please try again in a few moments',
retryAfter: 30
})
}
Why: Users need actionable information, not just "something went wrong."
❌ Mistake 4: Not Monitoring Circuit State
// BAD: No logging or alerting
const breaker = new CircuitBreaker(service)
// GOOD: Track state changes
breaker.on('open', () => {
logger.error('Circuit breaker opened for Stripe')
metrics.increment('circuit.open', { service: 'stripe' })
sendAlert('CRITICAL: Stripe payments unavailable')
})
breaker.on('close', () => {
logger.info('Circuit breaker recovered for Stripe')
metrics.increment('circuit.recovered', { service: 'stripe' })
})
Why: You need to know immediately when critical services fail.
❌ Mistake 5: Timeouts Too Long to Fail Fast
// BAD: A 30-second timeout defeats the purpose
const breaker = new CircuitBreaker(service, {
  timeout: 30000,
  errorThresholdPercentage: 50,
  volumeThreshold: 5,
})

// GOOD: Short timeout to fail fast
const breaker = new CircuitBreaker(service, {
  timeout: 3000, // Fail after 3 seconds, not 30
  errorThresholdPercentage: 50,
  volumeThreshold: 5,
})

Why: Circuit breakers should fail fast. Long timeouts defeat the purpose.
❌ Mistake 6: Ignoring 4xx Errors
// BAD: Count 400/401 as failures
const breaker = new CircuitBreaker(async () => {
  const response = await fetch(url)
  if (!response.ok) throw new Error('Request failed')
  return response.json()
})

// GOOD: Only count 5xx as circuit-worthy failures
const breaker = new CircuitBreaker(async () => {
  const response = await fetch(url)
  if (!response.ok) {
    const error = new Error(`Request failed with status ${response.status}`)
    error.statusCode = response.status
    throw error
  }
  return response.json()
}, {
  // errorFilter: return true for errors that should NOT count toward
  // opening the circuit. 4xx = client error, the service itself is healthy;
  // 5xx still counts as a circuit failure. The call still rejects either way.
  errorFilter: (error) => error.statusCode >= 400 && error.statusCode < 500,
})

Why: 400 Bad Request and 401 Unauthorized are client problems, not service outages.
Production Readiness Checklist
Configuration
- ✅ Separate circuit breakers for each external service
- ✅ Different thresholds for read vs write operations
- ✅ Timeout values based on service SLAs (not arbitrary)
- ✅ Volume threshold accounts for traffic patterns
- ✅ Reset timeout matches expected recovery time
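As a rough sketch of how these configuration items might come together, here is a per-operation options map. The service names, option names, and every number below are illustrative placeholders to tune against your own SLAs and traffic, not recommendations:

```typescript
// Illustrative per-operation breaker options keyed by service.operation
interface BreakerOptions {
  timeout: number                  // ms before a call counts as a failure
  errorThresholdPercentage: number // % of failures that opens the circuit
  volumeThreshold: number          // min requests before the circuit can open
  resetTimeout: number             // ms in OPEN before probing via HALF_OPEN
}

const breakerConfig: Record<string, BreakerOptions> = {
  // Reads: fail fast and probe again quickly; stale cache covers the gap
  'stripe.read':  { timeout: 2000, errorThresholdPercentage: 50, volumeThreshold: 10, resetTimeout: 10000 },
  // Writes: more headroom and a longer reset, since retries are expensive
  'stripe.write': { timeout: 5000, errorThresholdPercentage: 70, volumeThreshold: 20, resetTimeout: 30000 },
}
```

Keeping all the thresholds in one map like this also makes it easy to document why each value was chosen, which the checklist below asks for.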
Fallback Strategy
- ✅ Cached data for read operations
- ✅ Queue for write operations
- ✅ Default/static values for non-critical features
- ✅ Meaningful error messages for users
- ✅ Retry-After headers in 503 responses
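A minimal sketch of these fallback strategies, assuming an in-memory cache and queue as stand-ins for whatever real stores (Redis, a durable message queue) you would use in production:

```typescript
// In-memory stand-ins for the real cache and queue
const cache = new Map<string, unknown>()
const writeQueue: Array<{ op: string; payload: unknown }> = []

// Reads: serve possibly-stale cached data rather than an error
function readFallback(key: string): unknown | undefined {
  return cache.get(key)
}

// Writes: record the intent so it can be replayed once the circuit closes
function writeFallback(op: string, payload: unknown): { queued: boolean } {
  writeQueue.push({ op, payload })
  return { queued: true }
}

// Last resort: a meaningful 503 body, paired with a Retry-After header
function serviceUnavailableBody(retryAfterSeconds: number) {
  return {
    error: 'Service temporarily unavailable',
    message: `Please try again in ${retryAfterSeconds} seconds`,
    retryAfter: retryAfterSeconds,
  }
}
```

The read and write paths degrade differently on purpose: stale data is usually acceptable for reads, while writes need to be preserved, not silently dropped.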
Monitoring & Alerting
- ✅ Log all state transitions (OPEN, HALF_OPEN, CLOSED)
- ✅ Track success/failure rates per service
- ✅ Alert when critical circuits open (Stripe, Auth0, etc.)
- ✅ Dashboard showing circuit states across all services
- ✅ Metrics exported to Prometheus/Datadog/CloudWatch
- ✅ /health endpoint reflects circuit states
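For the /health item, a framework-agnostic sketch: derive the endpoint's payload from the current circuit states. The state names, service names, and the "degraded" convention here are illustrative choices, not a standard:

```typescript
type CircuitState = 'CLOSED' | 'OPEN' | 'HALF_OPEN'

// Build the /health response body from a snapshot of circuit states
function healthPayload(circuits: Record<string, CircuitState>) {
  const openCircuits = Object.entries(circuits)
    .filter(([, state]) => state === 'OPEN')
    .map(([name]) => name)
  return {
    // 'degraded', not 'down': fallbacks keep the app serving while circuits are open
    status: openCircuits.length === 0 ? 'ok' : 'degraded',
    circuits,
    openCircuits,
  }
}
```

Reporting "degraded" rather than failing the health check outright keeps load balancers from pulling instances that are still serving fallback responses.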
Testing
- ✅ Unit tests for all three states (CLOSED, OPEN, HALF_OPEN)
- ✅ Integration tests with mocked failures
- ✅ Chaos engineering tests (Toxiproxy, fault injection)
- ✅ Load tests to verify circuit behavior under traffic
- ✅ Test fallback functions independently
Documentation
- ✅ Document why each threshold was chosen
- ✅ Runbook for when circuits open
- ✅ Contact info for service owners (Stripe, AWS, etc.)
- ✅ Expected recovery time for each service
- ✅ Business impact when each circuit opens
Conclusion
Circuit breakers are essential for building resilient distributed systems. They prevent cascade failures, protect resources, and improve user experience during outages. By implementing circuit breakers with proper fallback strategies and monitoring, you transform hard failures into graceful degradation.
Key Takeaways
- Fail fast: Return errors in milliseconds, not seconds
- Isolate failures: One broken service shouldn't crash your entire system
- Auto-recover: Circuits automatically detect when services come back
- Monitor everything: You can't fix what you don't measure
- Test thoroughly: Chaos engineering reveals gaps in your resilience strategy
Next steps:
- Identify your critical external dependencies (Stripe, AWS, Auth0, etc.)
- Wrap each dependency in a circuit breaker with appropriate thresholds
- Implement fallback strategies for graceful degradation
- Add monitoring and alerting for circuit state changes
- Test failure scenarios with chaos engineering tools
- Document runbooks for when circuits open
Your users won't notice when services go down; they'll just see your app continuing to work.
Related Resources
- Is Stripe Down? Real-time Status Monitoring
- Is AWS Down? Check Amazon Web Services Status
- Is Auth0 Down? Authentication Service Status
- Is Twilio Down? SMS and Voice API Status
- Is GitHub Down? Git Repository Hosting Status
- Is Datadog Down? Monitoring Platform Status
- Is New Relic Down? APM Service Status
- Complete Webhook Implementation Guide