API Load Balancing & Scaling: Complete Implementation Guide
Learn how to implement production-ready load balancing and scaling strategies for your API. This comprehensive guide covers load balancer algorithms, horizontal and vertical scaling, auto-scaling patterns, health checks, session persistence, and real-world implementation with NGINX, AWS, and Google Cloud.
What is API Load Balancing?
Load balancing distributes incoming API requests across multiple servers to improve performance, reliability, and availability. Instead of a single server handling all traffic, requests are distributed to multiple backend servers.
Why Load Balancing Matters
Real-world impact:
- Stripe processes 8,000+ requests/second across distributed infrastructure
- GitHub handles 100M+ API requests/day with zero downtime during deployments
- Netflix serves 200M+ users with 99.99% uptime using load-balanced microservices
- Shopify survived Black Friday 2025 (10.3M requests/min peak) with auto-scaling
Benefits:
- High availability: If one server fails, others handle requests (eliminates single point of failure)
- Horizontal scalability: Add more servers to handle increased traffic
- Zero-downtime deployments: Update servers one at a time while load balancer routes traffic to healthy instances
- Better performance: Distribute load evenly, prevent server overload
- Geographic distribution: Route users to nearest server for lower latency
Load Balancing Algorithms
Different strategies for distributing requests to backend servers:
1. Round Robin (Simple & Most Common)
Distributes requests sequentially to each server in rotation.
# NGINX round robin (default)
upstream api_backend {
server api1.example.com:3000;
server api2.example.com:3000;
server api3.example.com:3000;
}
server {
listen 80;
location /api {
proxy_pass http://api_backend;
}
}
When to use: Equal server capacity, stateless requests
Pros: Simple, fair distribution
Cons: Doesn't account for server load or capacity differences
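The rotation logic can be sketched in a few lines of JavaScript (server names here are placeholders, not a real deployment):

```javascript
// Minimal round-robin selector: cycle through backends in order,
// wrapping back to the first after the last.
function createRoundRobin(servers) {
  let index = 0;
  return function next() {
    const server = servers[index];
    index = (index + 1) % servers.length;
    return server;
  };
}

const next = createRoundRobin(['api1:3000', 'api2:3000', 'api3:3000']);
console.log(next()); // api1:3000
console.log(next()); // api2:3000
console.log(next()); // api3:3000
console.log(next()); // api1:3000 (wraps around)
```

This is exactly why round robin ignores load: the selector only tracks position, not how busy each backend is.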
2. Least Connections
Routes requests to the server with the fewest active connections (best for varying request durations).
upstream api_backend {
least_conn; # Enable least connections
server api1.example.com:3000;
server api2.example.com:3000;
server api3.example.com:3000;
}
When to use: APIs with long-running requests (file uploads, complex queries, streaming)
Example: Video encoding API where requests take 10-60 seconds
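A least-connections balancer has to track in-flight requests per backend. Here is a minimal sketch (the acquire/release API is an assumption for illustration; real balancers hook into connection open/close events):

```javascript
// Least-connections selector: pick the backend with the fewest
// in-flight requests, incrementing its count on acquire and
// decrementing on release.
function createLeastConnections(servers) {
  const active = new Map(servers.map((s) => [s, 0]));
  return {
    acquire() {
      let best = null;
      for (const [server, count] of active) {
        if (best === null || count < active.get(best)) best = server;
      }
      active.set(best, active.get(best) + 1);
      return best;
    },
    release(server) {
      active.set(server, active.get(server) - 1);
    },
  };
}
```

In application code you would call release in a finally block so a failed request still frees its slot. A server stuck on a 60-second video-encoding request naturally stops receiving new traffic until it drains.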
3. IP Hash (Session Persistence)
Routes requests from the same IP address to the same server (maintains session affinity).
upstream api_backend {
ip_hash; # Route same IP to same server
server api1.example.com:3000;
server api2.example.com:3000;
server api3.example.com:3000;
}
When to use: Session-based APIs without shared session storage
Warning: Users behind NAT/proxy share an IP → uneven distribution
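The core idea is that hashing the client IP deterministically maps it to a backend index. A toy sketch (real balancers hash the full address with a proper hash function; the octet sum here is only illustrative):

```javascript
// Toy IP-hash selector: derive a stable index from the client IP
// so the same client always lands on the same backend.
function ipHash(ip, servers) {
  // Sum the IPv4 octets as a crude hash (illustrative only)
  const sum = ip.split('.').reduce((acc, octet) => acc + Number(octet), 0);
  return servers[sum % servers.length];
}

const servers = ['api1:3000', 'api2:3000', 'api3:3000'];
// The same IP always maps to the same server:
ipHash('10.0.0.1', servers); // deterministic
```

The NAT warning above falls directly out of this: every client behind the same proxy hashes to the same index, so one backend absorbs all of that office's traffic.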
4. Weighted Round Robin
Assign different weights to servers based on capacity.
upstream api_backend {
server api1.example.com:3000 weight=3; # High-capacity server (gets 3x traffic)
server api2.example.com:3000 weight=2; # Medium-capacity
server api3.example.com:3000 weight=1; # Low-capacity (gets 1x traffic)
}
When to use: Servers with different CPU/RAM (e.g., a mix of m5.large and m5.2xlarge instances)
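A naive weighted round robin can be sketched by expanding each server into the rotation schedule proportionally to its weight (note: NGINX actually uses a "smooth" weighted variant that interleaves turns rather than clustering them):

```javascript
// Naive weighted round robin: repeat each server in the schedule
// `weight` times, then rotate through the expanded schedule.
function createWeightedRoundRobin(entries) {
  const schedule = entries.flatMap(({ server, weight }) =>
    Array(weight).fill(server)
  );
  let i = 0;
  return () => schedule[i++ % schedule.length];
}

const next = createWeightedRoundRobin([
  { server: 'api1:3000', weight: 3 }, // high-capacity: 3 of every 4 requests
  { server: 'api2:3000', weight: 1 }, // low-capacity: 1 of every 4 requests
]);
```

Over any four consecutive calls, api1 is returned three times and api2 once, matching the 3:1 weights in the NGINX config above.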
5. Least Response Time (Advanced)
Routes to the server with the fastest response time and fewest active connections.
When to use: Geo-distributed servers (route to nearest/fastest)
Available in: AWS Application Load Balancer, Google Cloud Load Balancer, Cloudflare
Horizontal vs Vertical Scaling
Horizontal Scaling (Scale Out)
Add more servers to handle increased load.
# Start with 2 servers
upstream api_backend {
server api1.example.com:3000;
server api2.example.com:3000;
}
# Scale to 5 servers during peak traffic
upstream api_backend {
server api1.example.com:3000;
server api2.example.com:3000;
server api3.example.com:3000;
server api4.example.com:3000;
server api5.example.com:3000;
}
Pros:
- Near-unlimited capacity (keep adding servers as traffic grows)
- High availability (if one server fails, others continue)
- Cost-effective (use commodity hardware)
- Matches modern cloud architecture
Cons:
- Requires load balancer
- Must handle distributed state (shared database, Redis sessions)
- More complex deployment
Vertical Scaling (Scale Up)
Increase server capacity (more CPU/RAM).
Example: Upgrade from AWS t3.medium (2 vCPU, 4GB RAM) → m5.2xlarge (8 vCPU, 32GB RAM)
Pros:
- Simple (no architecture changes)
- No distributed state issues
- Works for databases (PostgreSQL, MySQL)
Cons:
- Hard limit (maximum instance size)
- Expensive (diminishing returns above 32GB RAM)
- Single point of failure
- Downtime during upgrade
Decision rule: Start with vertical scaling for simplicity, switch to horizontal when you hit limits or need high availability.
Health Checks & Failure Detection
Load balancers must detect when servers are unhealthy and stop routing traffic to them.
NGINX Health Checks
upstream api_backend {
server api1.example.com:3000 max_fails=3 fail_timeout=30s;
server api2.example.com:3000 max_fails=3 fail_timeout=30s;
server api3.example.com:3000 max_fails=3 fail_timeout=30s;
}
# max_fails=3: Mark server as down after 3 failed requests
# fail_timeout=30s: Wait 30 seconds before trying again
Active Health Check Endpoint
Create a dedicated /health endpoint that verifies all dependencies.
// Express health check endpoint
import express from 'express';
import { PrismaClient } from '@prisma/client';
import Redis from 'ioredis';
const app = express();
const prisma = new PrismaClient();
const redis = new Redis();
app.get('/health', async (req, res) => {
const health = {
status: 'ok',
timestamp: new Date().toISOString(),
checks: {}
};
// Check database connection
try {
await prisma.$queryRaw`SELECT 1`;
health.checks.database = 'ok';
} catch (error) {
health.status = 'degraded';
health.checks.database = 'failed';
}
// Check Redis connection
try {
await redis.ping();
health.checks.redis = 'ok';
} catch (error) {
health.status = 'degraded';
health.checks.redis = 'failed';
}
// Return 200 if ok, 503 if degraded
const statusCode = health.status === 'ok' ? 200 : 503;
res.status(statusCode).json(health);
});
AWS Application Load Balancer Health Checks
# Terraform configuration
resource "aws_lb_target_group" "api" {
name = "api-target-group"
port = 3000
protocol = "HTTP"
vpc_id = aws_vpc.main.id
health_check {
enabled = true
healthy_threshold = 2 # Mark healthy after 2 successful checks
unhealthy_threshold = 3 # Mark unhealthy after 3 failed checks
timeout = 5 # 5 second timeout
interval = 30 # Check every 30 seconds
path = "/health"
matcher = "200" # Expect 200 status code
}
}
Session Persistence (Sticky Sessions)
Some APIs need to route a user's requests to the same server for session continuity.
Cookie-Based Sticky Sessions (NGINX)
upstream api_backend {
# Use cookie-based sticky sessions
sticky cookie srv_id expires=1h domain=.example.com path=/; # Requires NGINX Plus
server api1.example.com:3000;
server api2.example.com:3000;
server api3.example.com:3000;
}
Better Approach: Shared Session Storage
Store sessions in Redis (accessible to all servers) instead of sticky sessions.
// Express with Redis session storage
import session from 'express-session';
import RedisStore from 'connect-redis';
import Redis from 'ioredis';
const redisClient = new Redis({
host: 'redis.example.com',
port: 6379,
});
app.use(
session({
store: new RedisStore({ client: redisClient }),
secret: process.env.SESSION_SECRET,
resave: false,
saveUninitialized: false,
cookie: {
secure: true,
httpOnly: true,
maxAge: 1000 * 60 * 60 * 24, // 24 hours
},
})
);
// Now any server can access session data
app.get('/profile', (req, res) => {
const userId = req.session.userId;
// All servers read from same Redis instance
});
Auto-Scaling Strategies
Metric-Based Auto-Scaling
Automatically add/remove servers based on metrics.
# AWS Auto Scaling Group (Terraform)
resource "aws_autoscaling_policy" "api_scale_up" {
name = "api-scale-up"
autoscaling_group_name = aws_autoscaling_group.api.name
adjustment_type = "ChangeInCapacity"
scaling_adjustment = 2 # Add 2 instances
cooldown = 300
# Trigger when CPU > 70% for 5 minutes
}
resource "aws_cloudwatch_metric_alarm" "cpu_high" {
alarm_name = "api-cpu-high"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 2
metric_name = "CPUUtilization"
namespace = "AWS/EC2"
period = 300 # 5 minutes
statistic = "Average"
threshold = 70
alarm_actions = [aws_autoscaling_policy.api_scale_up.arn]
}
Schedule-Based Auto-Scaling
Scale based on known traffic patterns.
# Scale up before morning peak traffic
resource "aws_autoscaling_schedule" "morning_scale_up" {
scheduled_action_name = "morning-scale-up"
min_size = 5
max_size = 10
desired_capacity = 8
recurrence = "0 8 * * MON-FRI" # 8 AM weekdays
autoscaling_group_name = aws_autoscaling_group.api.name
}
# Scale down after evening peak
resource "aws_autoscaling_schedule" "evening_scale_down" {
scheduled_action_name = "evening-scale-down"
min_size = 2
max_size = 5
desired_capacity = 3
recurrence = "0 22 * * *" # 10 PM daily
autoscaling_group_name = aws_autoscaling_group.api.name
}
Predictive Auto-Scaling
AWS and Google Cloud offer ML-based predictive scaling that learns traffic patterns.
resource "aws_autoscaling_policy" "predictive" {
name = "api-predictive-scaling"
autoscaling_group_name = aws_autoscaling_group.api.name
policy_type = "PredictiveScaling"
predictive_scaling_configuration {
metric_specification {
target_value = 70 # Target 70% CPU
predefined_load_metric_specification {
predefined_metric_type = "ASGTotalCPUUtilization"
}
}
mode = "ForecastAndScale" # Proactively scale before traffic spike
}
}
Database Scaling Patterns
Read Replicas
Route read queries to replicas, write queries to primary.
import { PrismaClient } from '@prisma/client';
// Primary database (writes)
const prismaWrite = new PrismaClient({
datasources: {
db: { url: process.env.DATABASE_WRITE_URL }
}
});
// Read replica (reads)
const prismaRead = new PrismaClient({
datasources: {
db: { url: process.env.DATABASE_READ_URL }
}
});
// Write operations use primary
async function createUser(data) {
return prismaWrite.user.create({ data });
}
// Read operations use replica
async function getUser(id) {
return prismaRead.user.findUnique({ where: { id } });
}
// List queries use replica
async function listUsers() {
return prismaRead.user.findMany();
}
Connection Pooling
Limit database connections per server to avoid overwhelming the database.
// PostgreSQL connection pooling
const prisma = new PrismaClient({
datasources: {
db: {
url: process.env.DATABASE_URL + "?connection_limit=10"
}
}
});
// Rule of thumb: total database connections = number_of_servers * connection_limit
// Example: 5 servers * 10 connections = 50 total connections to the database
Caching Layers
Application-Level Caching (Redis)
import Redis from 'ioredis';
const redis = new Redis();
async function getUser(userId) {
// Check cache first
const cached = await redis.get(`user:${userId}`);
if (cached) {
return JSON.parse(cached);
}
// Cache miss: query database
const user = await prisma.user.findUnique({ where: { id: userId } });
// Store in cache for 5 minutes
await redis.setex(`user:${userId}`, 300, JSON.stringify(user));
return user;
}
CDN for Static API Responses
Use Cloudflare or AWS CloudFront to cache static/infrequently-changing API responses.
// Express: Set cache headers for CDN
app.get('/api/products', async (req, res) => {
const products = await getProducts();
// Cache at CDN for 1 hour
res.set('Cache-Control', 'public, max-age=3600');
res.json(products);
});
// Purge cache when data changes
app.post('/api/products', async (req, res) => {
const product = await createProduct(req.body);
// Purge CDN cache (Cloudflare example)
await fetch('https://api.cloudflare.com/client/v4/zones/{zone_id}/purge_cache', {
method: 'POST',
headers: {
'X-Auth-Email': process.env.CLOUDFLARE_EMAIL,
'X-Auth-Key': process.env.CLOUDFLARE_API_KEY,
'Content-Type': 'application/json',
},
body: JSON.stringify({
files: ['https://api.example.com/api/products']
})
});
res.json(product);
});
Real-World Examples
GitHub API: Multi-Region Load Balancing
GitHub runs API servers in multiple AWS regions and routes users to nearest region using GeoDNS.
- US East: api-us-east.github.com (Virginia)
- EU West: api-eu-west.github.com (Ireland)
- Asia Pacific: api-ap-southeast.github.com (Singapore)
Users in Europe get routed to Ireland (<50ms latency vs >150ms to Virginia).
Stripe API: Auto-Scaling Payment Processing
Stripe processes 8,000+ requests/second with auto-scaling based on queue depth.
- Queue depth < 100: 10 servers
- Queue depth 100-1000: Scale to 20 servers
- Queue depth > 1000: Scale to 50 servers
Black Friday 2025: Scaled from 20 → 200 servers in 5 minutes during the traffic spike.
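The queue-depth tiers above reduce to a simple step function. A sketch (these thresholds are the example's illustrative numbers, not Stripe's actual configuration):

```javascript
// Map queue depth to a desired server count using the
// illustrative tiers above (not a real Stripe config).
function desiredServers(queueDepth) {
  if (queueDepth > 1000) return 50;
  if (queueDepth >= 100) return 20;
  return 10;
}
```

An autoscaler would poll queue depth on an interval and reconcile the running instance count toward this target, with a cooldown to avoid flapping between tiers.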
Netflix: Chaos Engineering for Load Balancers
Netflix randomly kills servers in production to test load balancer failover.
- Chaos Monkey: Randomly terminates EC2 instances
- Result: Load balancers automatically route traffic to healthy servers
- Uptime: 99.99% despite constant instance failures
Common Mistakes
1. Not Testing Failover
Problem: Load balancer configured but never tested when server fails.
Solution: Regularly kill servers in staging to verify failover works.
# Kill a server and verify requests route to the others
docker stop api-server-1
curl https://api.example.com/health # Should still return 200
2. Health Check Endpoint Too Simple
Problem: /health just returns 200 without checking database.
// ❌ Bad: Server is "healthy" even with a broken database
app.get('/health', (req, res) => {
res.sendStatus(200);
});
// ✅ Good: Actually verify dependencies
app.get('/health', async (req, res) => {
try {
await prisma.$queryRaw`SELECT 1`;
await redis.ping();
res.sendStatus(200);
} catch (error) {
res.sendStatus(503);
}
});
3. No Connection Limits to Database
Problem: 10 servers * 100 connections = 1,000 database connections (database crashes).
Solution: Limit connections per server (e.g., 10 connections * 10 servers = 100 total).
4. Sticky Sessions Without Shared State
Problem: Using sticky sessions but sessions stored on local server (if server dies, sessions lost).
Solution: Store sessions in Redis (survives server failures).
5. Not Handling Server Drain During Deployments
Problem: Killing server immediately during deployment (active requests fail).
Solution: Drain connections gracefully before shutdown.
// Graceful shutdown: finish active requests before exiting
process.on('SIGTERM', () => {
console.log('SIGTERM received, draining connections...');
// Stop accepting new requests; the callback runs once active requests finish
server.close(async () => {
// Close database and cache connections
await prisma.$disconnect();
await redis.quit();
process.exit(0);
});
// Safety net: force exit if draining takes longer than 30 seconds
setTimeout(() => process.exit(1), 30000).unref();
});
6. Auto-Scaling Too Slow
Problem: Servers take 5 minutes to start → a traffic spike causes an outage before new servers are ready.
Solution: Use predictive scaling or keep warm standby servers.
7. No Rate Limiting at Load Balancer
Problem: DDoS attack overwhelms load balancer before reaching application.
Solution: Implement rate limiting at load balancer level.
# NGINX rate limiting (1000 requests/minute per IP)
limit_req_zone $binary_remote_addr zone=api_limit:10m rate=1000r/m;
server {
location /api {
limit_req zone=api_limit burst=100;
proxy_pass http://api_backend;
}
}
Production Checklist
Load Balancer Configuration
- ✅ Health checks configured (check database + dependencies)
- ✅ Unhealthy servers removed from rotation
- ✅ Health check interval: 10-30 seconds
- ✅ Timeout: 5 seconds max
- ✅ SSL/TLS termination at load balancer
- ✅ HTTP/2 enabled for better performance
Scaling Configuration
- ✅ Auto-scaling enabled (CPU or request count)
- ✅ Minimum instances: 2+ (for high availability)
- ✅ Maximum instances: Set reasonable limit (avoid runaway costs)
- ✅ Cooldown period: 5 minutes (prevent rapid scaling)
- ✅ Connection limits per server configured
Session & State
- ✅ Sessions stored in Redis (not on server)
- ✅ Database connection pooling configured
- ✅ Read replicas for read-heavy workloads
- ✅ Cache invalidation strategy defined
Monitoring & Alerts
- ✅ Monitor request distribution across servers
- ✅ Alert on unhealthy servers
- ✅ Alert on auto-scaling events
- ✅ Track response time per server
- ✅ Monitor connection pool saturation
Deployment
- ✅ Graceful shutdown implemented (SIGTERM handler)
- ✅ Zero-downtime deployment strategy (rolling updates)
- ✅ Rollback plan for failed deployments
- ✅ Tested failover in staging
Security
- ✅ Rate limiting at load balancer
- ✅ DDoS protection enabled (AWS Shield, Cloudflare)
- ✅ HTTPS only (redirect HTTP to HTTPS)
- ✅ Security headers configured (HSTS, CSP)
Related Resources
Monitor the APIs mentioned in this guide:
- AWS Status – Cloud infrastructure and load balancers
- Google Cloud Status – GCP load balancing services
- Cloudflare Status – CDN and DDoS protection
- Stripe Status – Payment processing at scale
- GitHub Status – API availability and performance
- Datadog Status – Infrastructure monitoring
- New Relic Status – Application performance monitoring
- Vercel Status – Serverless deployments and edge network
Related guides:
- API Observability & Distributed Tracing
- API Caching Strategies
- Circuit Breaker Pattern
- API Error Handling
Summary
Key takeaways:
- Load balancing distributes traffic across multiple servers for high availability and performance
- Choose algorithm based on use case: Round robin (simple), least connections (long requests), IP hash (sticky sessions)
- Horizontal scaling (add servers) beats vertical scaling (bigger server) for modern APIs
- Health checks must verify database + dependencies (not just HTTP 200)
- Auto-scaling prevents outages during traffic spikes (metric-based or predictive)
- Shared state (Redis sessions, read replicas) enables true horizontal scaling
- Test failover regularly: kill servers in staging to verify the load balancer works
- Graceful shutdown prevents request failures during deployments
Start simple (2 servers + round robin), then optimize based on traffic patterns. GitHub, Stripe, and Netflix all started with basic load balancing before evolving to sophisticated multi-region, auto-scaling architectures.