API Load Balancing & Scaling: Complete Implementation Guide
Learn how to implement production-ready load balancing and scaling strategies for your API. This comprehensive guide covers load balancer algorithms, horizontal and vertical scaling, auto-scaling patterns, health checks, session persistence, and real-world implementation with NGINX, AWS, and Google Cloud.
What is API Load Balancing?
Load balancing distributes incoming API requests across multiple servers to improve performance, reliability, and availability. Instead of a single server handling all traffic, requests are distributed to multiple backend servers.
Why Load Balancing Matters
Real-world impact:
- Stripe processes 8,000+ requests/second across distributed infrastructure
- GitHub handles 100M+ API requests/day with zero downtime during deployments
- Netflix serves 200M+ users with 99.99% uptime using load-balanced microservices
- Shopify survived Black Friday 2025 (10.3M requests/min peak) with auto-scaling
Benefits:
- High availability: If one server fails, others handle requests (eliminates single point of failure)
- Horizontal scalability: Add more servers to handle increased traffic
- Zero-downtime deployments: Update servers one at a time while load balancer routes traffic to healthy instances
- Better performance: Distribute load evenly, prevent server overload
- Geographic distribution: Route users to nearest server for lower latency
Load Balancing Algorithms
Different strategies for distributing requests to backend servers:
1. Round Robin (Simple & Most Common)
Distributes requests sequentially to each server in rotation.
# NGINX round robin (default)
upstream api_backend {
server api1.example.com:3000;
server api2.example.com:3000;
server api3.example.com:3000;
}
server {
listen 80;
location /api {
proxy_pass http://api_backend;
}
}
When to use: Equal server capacity, stateless requests
Pros: Simple, fair distribution
Cons: Doesn't account for server load or capacity differences
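The rotation logic can be sketched in a few lines of JavaScript (server names here are placeholders, not a real deployment):

```javascript
// Minimal round-robin selector: cycle through backends in order,
// wrapping back to the first after the last.
function createRoundRobin(servers) {
  let index = 0;
  return function next() {
    const server = servers[index];
    index = (index + 1) % servers.length;
    return server;
  };
}

const next = createRoundRobin(['api1:3000', 'api2:3000', 'api3:3000']);
console.log(next()); // api1:3000
console.log(next()); // api2:3000
console.log(next()); // api3:3000
console.log(next()); // api1:3000 (wraps around)
```

This is exactly why round robin ignores load: the selector only tracks position, not how busy each backend is.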
2. Least Connections
Routes requests to the server with the fewest active connections (best for varying request durations).
upstream api_backend {
least_conn; # Enable least connections
server api1.example.com:3000;
server api2.example.com:3000;
server api3.example.com:3000;
}
When to use: APIs with long-running requests (file uploads, complex queries, streaming)
Example: Video encoding API where requests take 10-60 seconds
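A least-connections balancer has to track in-flight requests per backend. Here is a minimal sketch (the acquire/release API is an assumption for illustration; real balancers hook into connection open/close events):

```javascript
// Least-connections selector: pick the backend with the fewest
// in-flight requests, incrementing its count on acquire and
// decrementing on release.
function createLeastConnections(servers) {
  const active = new Map(servers.map((s) => [s, 0]));
  return {
    acquire() {
      let best = null;
      for (const [server, count] of active) {
        if (best === null || count < active.get(best)) best = server;
      }
      active.set(best, active.get(best) + 1);
      return best;
    },
    release(server) {
      active.set(server, active.get(server) - 1);
    },
  };
}
```

In application code you would call release in a finally block so a failed request still frees its slot. A server stuck on a 60-second video-encoding request naturally stops receiving new traffic until it drains.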
3. IP Hash (Session Persistence)
Routes requests from the same IP address to the same server (maintains session affinity).
upstream api_backend {
ip_hash; # Route same IP to same server
server api1.example.com:3000;
server api2.example.com:3000;
server api3.example.com:3000;
}
When to use: Session-based APIs without shared session storage
Warning: Users behind NAT/proxy share an IP → uneven distribution
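The core idea is that hashing the client IP deterministically maps it to a backend index. A toy sketch (real balancers hash the full address with a proper hash function; the octet sum here is only illustrative):

```javascript
// Toy IP-hash selector: derive a stable index from the client IP
// so the same client always lands on the same backend.
function ipHash(ip, servers) {
  // Sum the IPv4 octets as a crude hash (illustrative only)
  const sum = ip.split('.').reduce((acc, octet) => acc + Number(octet), 0);
  return servers[sum % servers.length];
}

const servers = ['api1:3000', 'api2:3000', 'api3:3000'];
// The same IP always maps to the same server:
ipHash('10.0.0.1', servers); // deterministic
```

The NAT warning above falls directly out of this: every client behind the same proxy hashes to the same index, so one backend absorbs all of that office's traffic.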
4. Weighted Round Robin
Assign different weights to servers based on capacity.
upstream api_backend {
server api1.example.com:3000 weight=3; # High-capacity server (gets 3x traffic)
server api2.example.com:3000 weight=2; # Medium-capacity
server api3.example.com:3000 weight=1; # Low-capacity (gets 1x traffic)
}
When to use: Servers with different CPU/RAM (e.g., a mix of m5.large and m5.2xlarge instances)
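A naive weighted round robin can be sketched by expanding each server into the rotation schedule proportionally to its weight (note: NGINX actually uses a "smooth" weighted variant that interleaves turns rather than clustering them):

```javascript
// Naive weighted round robin: repeat each server in the schedule
// `weight` times, then rotate through the expanded schedule.
function createWeightedRoundRobin(entries) {
  const schedule = entries.flatMap(({ server, weight }) =>
    Array(weight).fill(server)
  );
  let i = 0;
  return () => schedule[i++ % schedule.length];
}

const next = createWeightedRoundRobin([
  { server: 'api1:3000', weight: 3 }, // high-capacity: 3 of every 4 requests
  { server: 'api2:3000', weight: 1 }, // low-capacity: 1 of every 4 requests
]);
```

Over any four consecutive calls, api1 is returned three times and api2 once, matching the 3:1 weights in the NGINX config above.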
5. Least Response Time (Advanced)
Routes to the server with the fastest response time and fewest active connections.
When to use: Geo-distributed servers (route to nearest/fastest)
Available in: AWS Application Load Balancer, Google Cloud Load Balancer, Cloudflare
Horizontal vs Vertical Scaling
Horizontal Scaling (Scale Out)
Add more servers to handle increased load.
# Start with 2 servers
upstream api_backend {
server api1.example.com:3000;
server api2.example.com:3000;
}
# Scale to 5 servers during peak traffic
upstream api_backend {
server api1.example.com:3000;
server api2.example.com:3000;
server api3.example.com:3000;
server api4.example.com:3000;
server api5.example.com:3000;
}
Pros:
- Near-unlimited capacity (keep adding servers as traffic grows)
- High availability (if one server fails, others continue)
- Cost-effective (use commodity hardware)
- Matches modern cloud architecture
Cons:
- Requires load balancer
- Must handle distributed state (shared database, Redis sessions)
- More complex deployment
Vertical Scaling (Scale Up)
Increase server capacity (more CPU/RAM).
Example: Upgrade from AWS t3.medium (2 vCPU, 4GB RAM) → m5.2xlarge (8 vCPU, 32GB RAM)
Pros:
- Simple (no architecture changes)
- No distributed state issues
- Works for databases (PostgreSQL, MySQL)
Cons:
- Hard limit (maximum instance size)
- Expensive (diminishing returns above 32GB RAM)
- Single point of failure
- Downtime during upgrade
Decision rule: Start with vertical scaling for simplicity, switch to horizontal when you hit limits or need high availability.
Health Checks & Failure Detection
Load balancers must detect when servers are unhealthy and stop routing traffic to them.
NGINX Health Checks
upstream api_backend {
server api1.example.com:3000 max_fails=3 fail_timeout=30s;
server api2.example.com:3000 max_fails=3 fail_timeout=30s;
server api3.example.com:3000 max_fails=3 fail_timeout=30s;
}
# max_fails=3: Mark server as down after 3 failed requests
# fail_timeout=30s: Wait 30 seconds before trying again
Active Health Check Endpoint
Create a dedicated /health endpoint that verifies all dependencies.
// Express health check endpoint
import express from 'express';
import { PrismaClient } from '@prisma/client';
import Redis from 'ioredis';
const app = express();
const prisma = new PrismaClient();
const redis = new Redis();
app.get('/health', async (req, res) => {
const health = {
status: 'ok',
timestamp: new Date().toISOString(),
checks: {}
};
// Check database connection
try {
await prisma.$queryRaw`SELECT 1`;
health.checks.database = 'ok';
} catch (error) {
health.status = 'degraded';
health.checks.database = 'failed';
}
// Check Redis connection
try {
await redis.ping();
health.checks.redis = 'ok';
} catch (error) {
health.status = 'degraded';
health.checks.redis = 'failed';
}
// Return 200 if ok, 503 if degraded
const statusCode = health.status === 'ok' ? 200 : 503;
res.status(statusCode).json(health);
});
AWS Application Load Balancer Health Checks
# Terraform configuration
resource "aws_lb_target_group" "api" {
name = "api-target-group"
port = 3000
protocol = "HTTP"
vpc_id = aws_vpc.main.id
health_check {
enabled = true
healthy_threshold = 2 # Mark healthy after 2 successful checks
unhealthy_threshold = 3 # Mark unhealthy after 3 failed checks
timeout = 5 # 5 second timeout
interval = 30 # Check every 30 seconds
path = "/health"
matcher = "200" # Expect 200 status code
}
}
Session Persistence (Sticky Sessions)
Some APIs need to route a user's requests to the same server for session continuity.
Cookie-Based Sticky Sessions (NGINX)
upstream api_backend {
# Use cookie-based sticky sessions
sticky cookie srv_id expires=1h domain=.example.com path=/; # Requires NGINX Plus
server api1.example.com:3000;
server api2.example.com:3000;
server api3.example.com:3000;
}
Better Approach: Shared Session Storage
Store sessions in Redis (accessible to all servers) instead of sticky sessions.
// Express with Redis session storage
import session from 'express-session';
import RedisStore from 'connect-redis';
import Redis from 'ioredis';
const redisClient = new Redis({
host: 'redis.example.com',
port: 6379,
});
app.use(
session({
store: new RedisStore({ client: redisClient }),
secret: process.env.SESSION_SECRET,
resave: false,
saveUninitialized: false,
cookie: {
secure: true,
httpOnly: true,
maxAge: 1000 * 60 * 60 * 24, // 24 hours
},
})
);
// Now any server can access session data
app.get('/profile', (req, res) => {
const userId = req.session.userId;
// All servers read from same Redis instance
});
Auto-Scaling Strategies
Metric-Based Auto-Scaling
Automatically add/remove servers based on metrics.
# AWS Auto Scaling Group (Terraform)
resource "aws_autoscaling_policy" "api_scale_up" {
name = "api-scale-up"
autoscaling_group_name = aws_autoscaling_group.api.name
adjustment_type = "ChangeInCapacity"
scaling_adjustment = 2 # Add 2 instances
cooldown = 300
# Trigger when CPU > 70% for 5 minutes
}
resource "aws_cloudwatch_metric_alarm" "cpu_high" {
alarm_name = "api-cpu-high"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 2
metric_name = "CPUUtilization"
namespace = "AWS/EC2"
period = 300 # 5 minutes
statistic = "Average"
threshold = 70
alarm_actions = [aws_autoscaling_policy.api_scale_up.arn]
}
Schedule-Based Auto-Scaling
Scale based on known traffic patterns.
# Scale up before morning peak traffic
resource "aws_autoscaling_schedule" "morning_scale_up" {
scheduled_action_name = "morning-scale-up"
min_size = 5
max_size = 10
desired_capacity = 8
recurrence = "0 8 * * MON-FRI" # 8 AM weekdays
autoscaling_group_name = aws_autoscaling_group.api.name
}
# Scale down after evening peak
resource "aws_autoscaling_schedule" "evening_scale_down" {
scheduled_action_name = "evening-scale-down"
min_size = 2
max_size = 5
desired_capacity = 3
recurrence = "0 22 * * *" # 10 PM daily
autoscaling_group_name = aws_autoscaling_group.api.name
}
Predictive Auto-Scaling
AWS and Google Cloud offer ML-based predictive scaling that learns traffic patterns.
resource "aws_autoscaling_policy" "predictive" {
name = "api-predictive-scaling"
autoscaling_group_name = aws_autoscaling_group.api.name
policy_type = "PredictiveScaling"
predictive_scaling_configuration {
metric_specification {
target_value = 70 # Target 70% CPU
predefined_load_metric_specification {
predefined_metric_type = "ASGTotalCPUUtilization"
}
}
mode = "ForecastAndScale" # Proactively scale before traffic spike
}
}
Database Scaling Patterns
Read Replicas
Route read queries to replicas, write queries to primary.
import { PrismaClient } from '@prisma/client';
// Primary database (writes)
const prismaWrite = new PrismaClient({
datasources: {
db: { url: process.env.DATABASE_WRITE_URL }
}
});
// Read replica (reads)
const prismaRead = new PrismaClient({
datasources: {
db: { url: process.env.DATABASE_READ_URL }
}
});
// Write operations use primary
async function createUser(data) {
return prismaWrite.user.create({ data });
}
// Read operations use replica
async function getUser(id) {
return prismaRead.user.findUnique({ where: { id } });
}
// List queries use replica
async function listUsers() {
return prismaRead.user.findMany();
}
Connection Pooling
Limit database connections per server to avoid overwhelming the database.
// PostgreSQL connection pooling
const prisma = new PrismaClient({
datasources: {
db: {
url: process.env.DATABASE_URL + "?connection_limit=10"
}
}
});
// Rule of thumb: total database connections = number_of_servers * connection_limit
// Example: 5 servers * 10 connections = 50 total connections to the database
Caching Layers
Application-Level Caching (Redis)
import Redis from 'ioredis';
const redis = new Redis();
async function getUser(userId) {
// Check cache first
const cached = await redis.get(`user:${userId}`);
if (cached) {
return JSON.parse(cached);
}
// Cache miss: query database
const user = await prisma.user.findUnique({ where: { id: userId } });
// Store in cache for 5 minutes
await redis.setex(`user:${userId}`, 300, JSON.stringify(user));
return user;
}
CDN for Static API Responses
Use Cloudflare or AWS CloudFront to cache static/infrequently-changing API responses.
// Express: Set cache headers for CDN
app.get('/api/products', async (req, res) => {
const products = await getProducts();
// Cache at CDN for 1 hour
res.set('Cache-Control', 'public, max-age=3600');
res.json(products);
});
// Purge cache when data changes
app.post('/api/products', async (req, res) => {
const product = await createProduct(req.body);
// Purge CDN cache (Cloudflare example)
await fetch('https://api.cloudflare.com/client/v4/zones/{zone_id}/purge_cache', {
method: 'POST',
headers: {
'X-Auth-Email': process.env.CLOUDFLARE_EMAIL,
'X-Auth-Key': process.env.CLOUDFLARE_API_KEY,
'Content-Type': 'application/json',
},
body: JSON.stringify({
files: ['https://api.example.com/api/products']
})
});
res.json(product);
});
Real-World Examples
GitHub API: Multi-Region Load Balancing
GitHub runs API servers in multiple AWS regions and routes users to nearest region using GeoDNS.
- US East: api-us-east.github.com (Virginia)
- EU West: api-eu-west.github.com (Ireland)
- Asia Pacific: api-ap-southeast.github.com (Singapore)
Users in Europe get routed to Ireland (<50ms latency vs >150ms to Virginia).
Stripe API: Auto-Scaling Payment Processing
Stripe processes 8,000+ requests/second with auto-scaling based on queue depth.
- Queue depth < 100: 10 servers
- Queue depth 100-1000: Scale to 20 servers
- Queue depth > 1000: Scale to 50 servers
Black Friday 2025: Scaled from 20 → 200 servers in 5 minutes during the traffic spike.
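The queue-depth tiers above reduce to a simple step function. A sketch (these thresholds are the example's illustrative numbers, not Stripe's actual configuration):

```javascript
// Map queue depth to a desired server count using the
// illustrative tiers above (not a real Stripe config).
function desiredServers(queueDepth) {
  if (queueDepth > 1000) return 50;
  if (queueDepth >= 100) return 20;
  return 10;
}
```

An autoscaler would poll queue depth on an interval and reconcile the running instance count toward this target, with a cooldown to avoid flapping between tiers.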
Netflix: Chaos Engineering for Load Balancers
Netflix randomly kills servers in production to test load balancer failover.
- Chaos Monkey: Randomly terminates EC2 instances
- Result: Load balancers automatically route traffic to healthy servers
- Uptime: 99.99% despite constant instance failures
Common Mistakes
1. Not Testing Failover
Problem: Load balancer configured but never tested when server fails.
Solution: Regularly kill servers in staging to verify failover works.
# Kill a server and verify requests route to the others
docker stop api-server-1
curl https://api.example.com/health # Should still return 200
2. Health Check Endpoint Too Simple
Problem: /health just returns 200 without checking database.
// ❌ Bad: Server is "healthy" even with a broken database
app.get('/health', (req, res) => {
res.sendStatus(200);
});
// ✅ Good: Actually verify dependencies
app.get('/health', async (req, res) => {
try {
await prisma.$queryRaw`SELECT 1`;
await redis.ping();
res.sendStatus(200);
} catch (error) {
res.sendStatus(503);
}
});
3. No Connection Limits to Database
Problem: 10 servers * 100 connections = 1,000 database connections (database crashes).
Solution: Limit connections per server (e.g., 10 connections * 10 servers = 100 total).
4. Sticky Sessions Without Shared State
Problem: Using sticky sessions but sessions stored on local server (if server dies, sessions lost).
Solution: Store sessions in Redis (survives server failures).
5. Not Handling Server Drain During Deployments
Problem: Killing server immediately during deployment (active requests fail).
Solution: Drain connections gracefully before shutdown.
// Graceful shutdown: finish active requests before exiting
process.on('SIGTERM', () => {
console.log('SIGTERM received, draining connections...');
// Stop accepting new requests; the callback runs once active requests finish
server.close(async () => {
// Close database and cache connections
await prisma.$disconnect();
await redis.quit();
process.exit(0);
});
// Safety net: force exit if draining takes longer than 30 seconds
setTimeout(() => process.exit(1), 30000).unref();
});
6. Auto-Scaling Too Slow
Problem: Servers take 5 minutes to start → a traffic spike causes an outage before new servers are ready.
Solution: Use predictive scaling or keep warm standby servers.
7. No Rate Limiting at Load Balancer
Problem: DDoS attack overwhelms load balancer before reaching application.
Solution: Implement rate limiting at load balancer level.
# NGINX rate limiting (1000 requests/minute per IP)
limit_req_zone $binary_remote_addr zone=api_limit:10m rate=1000r/m;
server {
location /api {
limit_req zone=api_limit burst=100;
proxy_pass http://api_backend;
}
}
Production Checklist
Load Balancer Configuration
- ✅ Health checks configured (check database + dependencies)
- ✅ Unhealthy servers removed from rotation
- ✅ Health check interval: 10-30 seconds
- ✅ Timeout: 5 seconds max
- ✅ SSL/TLS termination at load balancer
- ✅ HTTP/2 enabled for better performance
Scaling Configuration
- ✅ Auto-scaling enabled (CPU or request count)
- ✅ Minimum instances: 2+ (for high availability)
- ✅ Maximum instances: Set reasonable limit (avoid runaway costs)
- ✅ Cooldown period: 5 minutes (prevent rapid scaling)
- ✅ Connection limits per server configured
Session & State
- ✅ Sessions stored in Redis (not on server)
- ✅ Database connection pooling configured
- ✅ Read replicas for read-heavy workloads
- ✅ Cache invalidation strategy defined
Monitoring & Alerts
- ✅ Monitor request distribution across servers
- ✅ Alert on unhealthy servers
- ✅ Alert on auto-scaling events
- ✅ Track response time per server
- ✅ Monitor connection pool saturation
Deployment
- ✅ Graceful shutdown implemented (SIGTERM handler)
- ✅ Zero-downtime deployment strategy (rolling updates)
- ✅ Rollback plan for failed deployments
- ✅ Tested failover in staging
Security
- ✅ Rate limiting at load balancer
- ✅ DDoS protection enabled (AWS Shield, Cloudflare)
- ✅ HTTPS only (redirect HTTP to HTTPS)
- ✅ Security headers configured (HSTS, CSP)
Related Resources
Monitor the APIs mentioned in this guide:
- AWS Status – Cloud infrastructure and load balancers
- Google Cloud Status – GCP load balancing services
- Cloudflare Status – CDN and DDoS protection
- Stripe Status – Payment processing at scale
- GitHub Status – API availability and performance
- Datadog Status – Infrastructure monitoring
- New Relic Status – Application performance monitoring
- Vercel Status – Serverless deployments and edge network
Related guides:
- API Observability & Distributed Tracing
- API Caching Strategies
- Circuit Breaker Pattern
- API Error Handling
Summary
Key takeaways:
- Load balancing distributes traffic across multiple servers for high availability and performance
- Choose algorithm based on use case: Round robin (simple), least connections (long requests), IP hash (sticky sessions)
- Horizontal scaling (add servers) beats vertical scaling (bigger server) for modern APIs
- Health checks must verify database + dependencies (not just HTTP 200)
- Auto-scaling prevents outages during traffic spikes (metric-based or predictive)
- Shared state (Redis sessions, read replicas) enables true horizontal scaling
- Test failover regularly: kill servers in staging to verify the load balancer works
- Graceful shutdown prevents request failures during deployments
Start simple (2 servers + round robin), then optimize based on traffic patterns. GitHub, Stripe, and Netflix all started with basic load balancing before evolving to sophisticated multi-region, auto-scaling architectures.