API Load Balancing & Scaling: Complete Implementation Guide

Learn how to implement production-ready load balancing and scaling strategies for your API. This comprehensive guide covers load balancer algorithms, horizontal and vertical scaling, auto-scaling patterns, health checks, session persistence, and real-world implementation with NGINX, AWS, and Google Cloud.

What is API Load Balancing?

Load balancing distributes incoming API requests across multiple servers to improve performance, reliability, and availability. Instead of a single server handling all traffic, requests are distributed to multiple backend servers.

Why Load Balancing Matters

Real-world impact:

A single server is both a capacity ceiling and a single point of failure. Putting a load balancer in front of two or more servers removes both problems: traffic keeps flowing when one server fails, and capacity grows by adding servers.

Benefits:

- Higher availability: traffic automatically reroutes away from failed servers
- Better performance: no single server becomes a bottleneck
- Zero-downtime deployments: drain one server while the others keep serving
- Elastic capacity: add or remove servers as traffic changes

Load Balancing Algorithms

Different strategies for distributing requests to backend servers:

1. Round Robin (Simple & Most Common)

Distributes requests sequentially to each server in rotation.

# NGINX round robin (default)
upstream api_backend {
    server api1.example.com:3000;
    server api2.example.com:3000;
    server api3.example.com:3000;
}

server {
    listen 80;
    location /api {
        proxy_pass http://api_backend;
    }
}

When to use: Equal server capacity, stateless requests
Pros: Simple, fair distribution
Cons: Doesn't account for server load or capacity differences
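The rotation logic itself is simple. A minimal JavaScript sketch of what the balancer does internally (the server names are placeholders):

```javascript
// Minimal round-robin picker: cycle through servers in order.
function createRoundRobin(servers) {
  let index = 0;
  return () => {
    const server = servers[index];
    index = (index + 1) % servers.length; // advance and wrap around
    return server;
  };
}

const pickNext = createRoundRobin([
  'api1.example.com:3000',
  'api2.example.com:3000',
  'api3.example.com:3000',
]);

console.log(pickNext()); // api1.example.com:3000
console.log(pickNext()); // api2.example.com:3000
console.log(pickNext()); // api3.example.com:3000
console.log(pickNext()); // api1.example.com:3000 (wraps back to the start)
```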

2. Least Connections

Routes each request to the server with the fewest active connections (best for workloads with varying request durations).

upstream api_backend {
    least_conn;  # Enable least connections
    server api1.example.com:3000;
    server api2.example.com:3000;
    server api3.example.com:3000;
}

When to use: APIs with long-running requests (file uploads, complex queries, streaming)
Example: Video encoding API where requests take 10-60 seconds
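Conceptually, the balancer tracks in-flight requests per server and picks the least busy one. A minimal sketch of that bookkeeping (server names are placeholders; in a real balancer these counters live inside the proxy):

```javascript
// Minimal least-connections picker: track in-flight requests per server.
function createLeastConnections(servers) {
  const active = new Map(servers.map((s) => [s, 0]));
  return {
    acquire() {
      // Choose the server with the fewest in-flight requests
      let best = servers[0];
      for (const s of servers) {
        if (active.get(s) < active.get(best)) best = s;
      }
      active.set(best, active.get(best) + 1);
      return best;
    },
    release(server) {
      // Call when the request completes
      active.set(server, active.get(server) - 1);
    },
  };
}

const lb = createLeastConnections(['api1:3000', 'api2:3000']);
const a = lb.acquire(); // api1:3000 (all idle, first wins)
const b = lb.acquire(); // api2:3000 (api1 is now busy)
lb.release(a);          // api1 finishes and becomes least busy again
```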

3. IP Hash (Session Persistence)

Routes requests from the same IP address to the same server (maintains session affinity).

upstream api_backend {
    ip_hash;  # Route same IP to same server
    server api1.example.com:3000;
    server api2.example.com:3000;
    server api3.example.com:3000;
}

When to use: Session-based APIs without shared session storage
Warning: Users behind a NAT or proxy share one IP address → uneven distribution
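The affinity comes from hashing the client IP deterministically. A simplified sketch of the idea (this is not NGINX's actual hash function, just an illustration):

```javascript
// Simplified IP-hash picker: the same client IP always maps to the same server.
function pickByIpHash(ip, servers) {
  let hash = 0;
  for (const ch of ip) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0; // simple 32-bit string hash
  }
  return servers[hash % servers.length];
}

const servers = ['api1:3000', 'api2:3000', 'api3:3000'];
const first = pickByIpHash('203.0.113.7', servers);
const again = pickByIpHash('203.0.113.7', servers);
console.log(first === again); // true: same IP, same backend every time
```

This also makes the NAT caveat concrete: every user behind one shared proxy IP hashes to the same backend.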

4. Weighted Round Robin

Assign different weights to servers based on capacity.

upstream api_backend {
    server api1.example.com:3000 weight=3;  # High-capacity server (gets 3x traffic)
    server api2.example.com:3000 weight=2;  # Medium-capacity
    server api3.example.com:3000 weight=1;  # Low-capacity (gets 1x traffic)
}

When to use: Servers with different CPU/RAM (e.g., mix of m5.large and m5.2xlarge instances)
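One naive way to implement the weighting is to repeat each server in the rotation according to its weight. NGINX actually uses a smoother interleaving algorithm, but the resulting traffic split is the same. A sketch:

```javascript
// Naive weighted round robin: expand each server by its weight, then rotate.
function createWeightedRoundRobin(entries) {
  const rotation = entries.flatMap(({ server, weight }) =>
    Array(weight).fill(server)
  );
  let index = 0;
  return () => {
    const server = rotation[index];
    index = (index + 1) % rotation.length;
    return server;
  };
}

const pickWeighted = createWeightedRoundRobin([
  { server: 'api1:3000', weight: 3 },
  { server: 'api2:3000', weight: 2 },
  { server: 'api3:3000', weight: 1 },
]);
// Over any 6 consecutive picks: api1 three times, api2 twice, api3 once.
```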

5. Least Response Time (Advanced)

Routes to the server with the fastest average response time and the fewest active connections.

When to use: Geo-distributed servers (route to nearest/fastest)
Available in: AWS Application Load Balancer, Google Cloud Load Balancer, Cloudflare
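Managed load balancers compute this internally, but the idea can be sketched: keep a moving average of each server's latency plus its in-flight count, and score servers by both. This is an illustration of the concept, not any vendor's actual algorithm:

```javascript
// Illustrative least-response-time picker: prefer fast, idle servers.
function createLeastResponseTime(servers) {
  const stats = new Map(servers.map((s) => [s, { avgMs: 0, active: 0 }]));
  return {
    pick() {
      let best = null;
      let bestScore = Infinity;
      for (const [server, st] of stats) {
        const score = st.avgMs * (st.active + 1); // penalize slow AND busy
        if (score < bestScore) {
          bestScore = score;
          best = server;
        }
      }
      stats.get(best).active += 1;
      return best;
    },
    record(server, elapsedMs) {
      // Call when the response completes, with the measured latency
      const st = stats.get(server);
      st.active -= 1;
      st.avgMs = st.avgMs === 0 ? elapsedMs : st.avgMs * 0.8 + elapsedMs * 0.2;
    },
  };
}
```

After each response, record() updates an exponential moving average, so a server that slows down is automatically de-prioritized on subsequent picks.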

Horizontal vs Vertical Scaling

Horizontal Scaling (Scale Out)

Add more servers to handle increased load.

# Start with 2 servers
upstream api_backend {
    server api1.example.com:3000;
    server api2.example.com:3000;
}

# Scale to 5 servers during peak traffic
upstream api_backend {
    server api1.example.com:3000;
    server api2.example.com:3000;
    server api3.example.com:3000;
    server api4.example.com:3000;
    server api5.example.com:3000;
}

Pros:

- Near-unlimited capacity: keep adding servers as traffic grows
- High availability: one server failing does not take the API down
- Scale without downtime: add servers while serving traffic

Cons:

- Requires a load balancer and a stateless application design
- Shared state (sessions, caches) must move to external stores like Redis
- More operational complexity: deployments, logging, and monitoring span many servers

Vertical Scaling (Scale Up)

Increase server capacity (more CPU/RAM).

Example: Upgrade from AWS t3.medium (2 vCPU, 4GB RAM) → m5.2xlarge (8 vCPU, 32GB RAM)

Pros:

- Simple: no code or architecture changes required
- No distributed-state problems to solve

Cons:

- Hard ceiling: you eventually run out of bigger instances
- The server remains a single point of failure
- Resizing usually requires a restart (brief downtime)

Decision rule: Start with vertical scaling for simplicity, switch to horizontal when you hit limits or need high availability.

Health Checks & Failure Detection

Load balancers must detect when servers are unhealthy and stop routing traffic to them.

NGINX Health Checks

upstream api_backend {
    server api1.example.com:3000 max_fails=3 fail_timeout=30s;
    server api2.example.com:3000 max_fails=3 fail_timeout=30s;
    server api3.example.com:3000 max_fails=3 fail_timeout=30s;
}

# max_fails=3: Mark server as down after 3 failed requests
# fail_timeout=30s: Wait 30 seconds before trying again

Active Health Check Endpoint

Create a dedicated /health endpoint that verifies all dependencies.

// Express health check endpoint
import express from 'express';
import { PrismaClient } from '@prisma/client';
import Redis from 'ioredis';

const app = express();
const prisma = new PrismaClient();
const redis = new Redis();

app.get('/health', async (req, res) => {
  const health = {
    status: 'ok',
    timestamp: new Date().toISOString(),
    checks: {}
  };

  // Check database connection
  try {
    await prisma.$queryRaw`SELECT 1`;
    health.checks.database = 'ok';
  } catch (error) {
    health.status = 'degraded';
    health.checks.database = 'failed';
  }

  // Check Redis connection
  try {
    await redis.ping();
    health.checks.redis = 'ok';
  } catch (error) {
    health.status = 'degraded';
    health.checks.redis = 'failed';
  }

  // Return 200 if ok, 503 if degraded
  const statusCode = health.status === 'ok' ? 200 : 503;
  res.status(statusCode).json(health);
});

AWS Application Load Balancer Health Checks

// Terraform configuration
resource "aws_lb_target_group" "api" {
  name     = "api-target-group"
  port     = 3000
  protocol = "HTTP"
  vpc_id   = aws_vpc.main.id

  health_check {
    enabled             = true
    healthy_threshold   = 2      # Mark healthy after 2 successful checks
    unhealthy_threshold = 3      # Mark unhealthy after 3 failed checks
    timeout             = 5      # 5 second timeout
    interval            = 30     # Check every 30 seconds
    path                = "/health"
    matcher             = "200"  # Expect 200 status code
  }
}

Session Persistence (Sticky Sessions)

Some APIs need to route a user's requests to the same server for session continuity.

Cookie-Based Sticky Sessions (NGINX)

upstream api_backend {
    # Cookie-based sticky sessions (the sticky directive requires NGINX Plus;
    # open-source NGINX can approximate this with ip_hash or a third-party module)
    sticky cookie srv_id expires=1h domain=.example.com path=/;
    
    server api1.example.com:3000;
    server api2.example.com:3000;
    server api3.example.com:3000;
}

Better Approach: Shared Session Storage

Store sessions in Redis (accessible to all servers) instead of sticky sessions.

// Express with Redis session storage
import session from 'express-session';
import RedisStore from 'connect-redis';
import Redis from 'ioredis';

const redisClient = new Redis({
  host: 'redis.example.com',
  port: 6379,
});

app.use(
  session({
    store: new RedisStore({ client: redisClient }),
    secret: process.env.SESSION_SECRET,
    resave: false,
    saveUninitialized: false,
    cookie: {
      secure: true,
      httpOnly: true,
      maxAge: 1000 * 60 * 60 * 24, // 24 hours
    },
  })
);

// Now any server can access session data
app.get('/profile', (req, res) => {
  const userId = req.session.userId;
  // All servers read from the same Redis instance
  res.json({ userId });
});

Auto-Scaling Strategies

Metric-Based Auto-Scaling

Automatically add/remove servers based on metrics.

// AWS Auto Scaling Group (Terraform)
resource "aws_autoscaling_policy" "api_scale_up" {
  name                   = "api-scale-up"
  autoscaling_group_name = aws_autoscaling_group.api.name
  adjustment_type        = "ChangeInCapacity"
  scaling_adjustment     = 2  # Add 2 instances
  cooldown               = 300

  # Trigger when CPU > 70% for 5 minutes
}

resource "aws_cloudwatch_metric_alarm" "cpu_high" {
  alarm_name          = "api-cpu-high"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = 300  # 5 minutes
  statistic           = "Average"
  threshold           = 70
  alarm_actions       = [aws_autoscaling_policy.api_scale_up.arn]
}

Schedule-Based Auto-Scaling

Scale based on known traffic patterns.

// Scale up before morning peak traffic
resource "aws_autoscaling_schedule" "morning_scale_up" {
  scheduled_action_name  = "morning-scale-up"
  min_size               = 5
  max_size               = 10
  desired_capacity       = 8
  recurrence             = "0 8 * * MON-FRI"  # 8 AM weekdays
  autoscaling_group_name = aws_autoscaling_group.api.name
}

// Scale down after evening peak
resource "aws_autoscaling_schedule" "evening_scale_down" {
  scheduled_action_name  = "evening-scale-down"
  min_size               = 2
  max_size               = 5
  desired_capacity       = 3
  recurrence             = "0 22 * * *"  # 10 PM daily
  autoscaling_group_name = aws_autoscaling_group.api.name
}

Predictive Auto-Scaling

AWS and Google Cloud offer ML-based predictive scaling that learns traffic patterns.

resource "aws_autoscaling_policy" "predictive" {
  name                   = "api-predictive-scaling"
  autoscaling_group_name = aws_autoscaling_group.api.name
  policy_type            = "PredictiveScaling"

  predictive_scaling_configuration {
    metric_specification {
      target_value = 70  # Target 70% CPU
      predefined_load_metric_specification {
        predefined_metric_type = "ASGTotalCPUUtilization"
      }
    }
    mode = "ForecastAndScale"  # Proactively scale before traffic spike
  }
}

Database Scaling Patterns

Read Replicas

Route read queries to replicas, write queries to primary.

import { PrismaClient } from '@prisma/client';

// Primary database (writes)
const prismaWrite = new PrismaClient({
  datasources: {
    db: { url: process.env.DATABASE_WRITE_URL }
  }
});

// Read replica (reads)
const prismaRead = new PrismaClient({
  datasources: {
    db: { url: process.env.DATABASE_READ_URL }
  }
});

// Write operations use primary
async function createUser(data) {
  return prismaWrite.user.create({ data });
}

// Read operations use replica
async function getUser(id) {
  return prismaRead.user.findUnique({ where: { id } });
}

// List queries use replica
async function listUsers() {
  return prismaRead.user.findMany();
}

Connection Pooling

Limit database connections per server to prevent overwhelming database.

// PostgreSQL connection pooling
const prisma = new PrismaClient({
  datasources: {
    db: {
      url: process.env.DATABASE_URL + "?connection_limit=10"
    }
  }
});

// Rule of thumb: total connections = number_of_servers * connection_limit_per_server
// Example: 5 servers * 10 connections = 50 total; keep this below the
// database's max_connections setting

Caching Layers

Application-Level Caching (Redis)

import Redis from 'ioredis';
const redis = new Redis();

async function getUser(userId) {
  // Check cache first
  const cached = await redis.get(`user:${userId}`);
  if (cached) {
    return JSON.parse(cached);
  }

  // Cache miss: query database
  const user = await prisma.user.findUnique({ where: { id: userId } });
  
  // Store in cache for 5 minutes
  await redis.setex(`user:${userId}`, 300, JSON.stringify(user));
  
  return user;
}

CDN for Static API Responses

Use Cloudflare or AWS CloudFront to cache static/infrequently-changing API responses.

// Express: Set cache headers for CDN
app.get('/api/products', async (req, res) => {
  const products = await getProducts();
  
  // Cache at CDN for 1 hour
  res.set('Cache-Control', 'public, max-age=3600');
  res.json(products);
});

// Purge cache when data changes
app.post('/api/products', async (req, res) => {
  const product = await createProduct(req.body);
  
  // Purge CDN cache (Cloudflare example)
  await fetch('https://api.cloudflare.com/client/v4/zones/{zone_id}/purge_cache', {
    method: 'POST',
    headers: {
      'X-Auth-Email': process.env.CLOUDFLARE_EMAIL,
      'X-Auth-Key': process.env.CLOUDFLARE_API_KEY,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      files: ['https://api.example.com/api/products']
    })
  });
  
  res.json(product);
});

Real-World Examples

GitHub API: Multi-Region Load Balancing

GitHub runs API servers in multiple AWS regions and routes users to nearest region using GeoDNS.

Users in Europe get routed to Ireland (<50ms latency vs >150ms to Virginia).

Stripe API: Auto-Scaling Payment Processing

Stripe processes 8,000+ requests/second with auto-scaling based on queue depth.

Black Friday 2025: Scaled from 20 → 200 servers in 5 minutes during traffic spike.

Netflix: Chaos Engineering for Load Balancers

Netflix randomly kills servers in production to test load balancer failover.

Common Mistakes

1. Not Testing Failover

Problem: Load balancer configured but never tested when server fails.

Solution: Regularly kill servers in staging to verify failover works.

# Kill server and verify requests route to others
docker stop api-server-1
curl https://api.example.com/health  # Should still return 200

2. Health Check Endpoint Too Simple

Problem: /health just returns 200 without checking database.

// โŒ Bad: Server is "healthy" even with broken database
app.get('/health', (req, res) => {
  res.sendStatus(200);
});

// ✅ Good: Actually verify dependencies
app.get('/health', async (req, res) => {
  try {
    await prisma.$queryRaw`SELECT 1`;
    await redis.ping();
    res.sendStatus(200);
  } catch (error) {
    res.sendStatus(503);
  }
});

3. No Connection Limits to Database

Problem: 10 servers * 100 connections = 1,000 database connections, often exceeding the database's max_connections limit and crashing it.

Solution: Limit connections per server (e.g., 10 connections * 10 servers = 100 total).

4. Sticky Sessions Without Shared State

Problem: Using sticky sessions but sessions stored on local server (if server dies, sessions lost).

Solution: Store sessions in Redis (survives server failures).

5. Not Handling Server Drain During Deployments

Problem: Killing server immediately during deployment (active requests fail).

Solution: Drain connections gracefully before shutdown.

// Graceful shutdown (wait for active requests to finish)
process.on('SIGTERM', () => {
  console.log('SIGTERM received, draining connections...');

  // Safety net: force exit if draining takes longer than 30 seconds
  const forceExit = setTimeout(() => process.exit(1), 30000);

  // Stop accepting new requests; the callback fires once in-flight
  // requests have completed
  server.close(async () => {
    clearTimeout(forceExit);

    // Close database connections
    await prisma.$disconnect();
    await redis.quit();

    process.exit(0);
  });
});

6. Auto-Scaling Too Slow

Problem: Servers take 5 minutes to start → a traffic spike causes an outage before new servers are ready.

Solution: Use predictive scaling or keep warm standby servers.

7. No Rate Limiting at Load Balancer

Problem: DDoS attack overwhelms load balancer before reaching application.

Solution: Implement rate limiting at load balancer level.

# NGINX rate limiting (1000 requests/minute per IP)
limit_req_zone $binary_remote_addr zone=api_limit:10m rate=1000r/m;

server {
    location /api {
        limit_req zone=api_limit burst=100;
        proxy_pass http://api_backend;
    }
}

Production Checklist

Load Balancer Configuration

- Health checks hit a /health endpoint that verifies real dependencies
- Failover tested: kill a backend and confirm traffic reroutes
- Rate limiting enabled at the load balancer

Scaling Configuration

- Auto-scaling policies defined (metric-based, scheduled, or predictive)
- Scale-up and scale-down thresholds and cooldowns tuned
- Database connection limits sized for the maximum server count

Session & State

- Sessions stored in shared storage (Redis), not on individual servers
- Sticky sessions avoided unless there is no shared store

Monitoring & Alerts

- Latency, error rate, and CPU tracked per server and in aggregate
- Alerts fire on unhealthy backend count and scaling events

Deployment

- Graceful shutdown (SIGTERM connection draining) implemented on every server
- Rolling deployments drain connections before stopping instances

Security

- TLS terminated at the load balancer with valid certificates
- Rate limiting and DDoS protection enabled

Summary

Key takeaways:

- Start with round robin; switch to least connections for long-running requests
- Health checks must verify real dependencies (database, cache), not just return 200
- Prefer shared session storage in Redis over sticky sessions
- Cap database connections per server so the total stays under the database's limit
- Drain connections gracefully on shutdown and test failover regularly

Start simple (2 servers + round robin), then optimize based on traffic patterns. GitHub, Stripe, and Netflix all started with basic load balancing before evolving to sophisticated multi-region, auto-scaling architectures.
