The Four Zero Downtime Deployment Strategies
| Strategy | Traffic Switch | Rollback Speed | Infrastructure Cost | Best For |
|---|---|---|---|---|
| Blue-Green | All-at-once | Instant | 2x | High-stakes releases |
| Canary | Gradual % | Fast | Minimal | Risky changes |
| Rolling | Instance-by-instance | Moderate | None | Most deployments |
| Feature Flags | Code-level | Instant | None | Feature testing |
Blue-Green Deployments
Blue-green deployment maintains two identical production environments. At any time, one is live (receiving all traffic) and the other is idle (staging the next version). A deployment means switching the load balancer to point to the other environment.
How Blue-Green Works
- Production traffic goes to "blue" environment (current version)
- Deploy new version to "green" environment (no traffic)
- Run smoke tests and health checks against green
- Switch load balancer: 100% of traffic goes to green
- Blue environment stays idle (instant rollback: flip load balancer back)
- After confidence period, blue becomes the staging environment for next release
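In Kubernetes, the "flip" in the steps above is often implemented by repointing a Service selector rather than reconfiguring an external load balancer. A minimal sketch, assuming the app runs as two Deployments labeled `version: blue` and `version: green` (labels and ports are illustrative, not from the original):

```yaml
# Hypothetical Kubernetes Service implementing the blue-green switch.
# Both Deployments run side by side; changing `version: blue` to
# `version: green` flips 100% of traffic at once, and changing it
# back is the instant rollback.
apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  selector:
    app: my-app
    version: blue   # <- change to "green" to cut over
  ports:
    - port: 80
      targetPort: 8080
```

Because the selector change is a single atomic update, there is no window where traffic is split between environments.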
What to Monitor During Blue-Green Deployment
- Error rate spike in the first 5 minutes after the flip: this is the highest-risk window. Any error rate above baseline triggers immediate rollback.
- Response time increase: the new version may have performance regressions that were invisible in testing
- Memory and CPU on the green environment: under real production load, resource usage may differ from staging tests
- Database connection pool on both environments: during the transition, both environments may hold connections
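The first item above, an error-rate check in the minutes after the flip, can be encoded as a Prometheus alerting rule. A sketch only; the `job` label and the 2% threshold are assumptions, not part of the original:

```yaml
# Hypothetical Prometheus alerting rule: fire if the 5xx error rate on the
# newly promoted (green) environment exceeds 2% over a 5-minute window.
groups:
  - name: deployment-watch
    rules:
      - alert: PostFlipErrorRateSpike
        expr: |
          sum(rate(http_requests_total{job="green", status=~"5.."}[5m]))
            /
          sum(rate(http_requests_total{job="green"}[5m])) > 0.02
        for: 1m   # require the spike to persist briefly to avoid flapping
        labels:
          severity: page
        annotations:
          summary: "Error rate above 2% after blue-green flip; consider rolling back"
```

The `for: 1m` clause trades a minute of detection latency for fewer false pages from transient blips during the switch.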
Blue-Green Limitations
- Double infrastructure cost: You need two full production environments running. For large deployments, this doubles your cloud bill during deployment windows.
- Database compatibility: Both environments share the database. Your new application version must be backward compatible with the current database schema until rollback is no longer possible.
- Session handling: In-flight user sessions (e.g., multi-step checkouts) during the traffic switch can fail if session state is in-memory rather than persisted.
Canary Deployments
Canary deployments gradually shift traffic to the new version, starting with 1-10% and increasing as confidence grows. The name comes from the "canary in a coal mine": if the canary (the small percentage of real traffic) experiences errors, you roll back before the majority of users are affected.
Canary Traffic Progression
| Stage | Traffic to New Version | Wait Period | Automatic Rollback Trigger |
|---|---|---|---|
| Stage 1 | 1% | 10 minutes | Error rate > 2x baseline |
| Stage 2 | 10% | 20 minutes | Error rate > 1.5x baseline |
| Stage 3 | 50% | 30 minutes | Error rate > 1.2x baseline |
| Complete | 100% | n/a | n/a |
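With Argo Rollouts, a staged progression like the table above maps onto a `steps` list in the Rollout spec. A sketch under the assumption that the workload is managed as a `Rollout` resource; the name is illustrative, and the rollback triggers themselves live in a separate analysis template:

```yaml
# Hypothetical Argo Rollouts canary strategy mirroring the staged table above.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  strategy:
    canary:
      steps:
        - setWeight: 1            # Stage 1: 1% of traffic
        - pause: {duration: 10m}  # wait period before promoting
        - setWeight: 10           # Stage 2: 10%
        - pause: {duration: 20m}
        - setWeight: 50           # Stage 3: 50%
        - pause: {duration: 30m}
        # after the final pause, the rollout promotes to 100%
```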
Automated Canary Analysis
The most powerful canary deployments use automated analysis to decide whether to proceed. Tools like Argo Rollouts, Flagger, and Spinnaker compare metrics between the canary and baseline automatically:
```yaml
# Argo Rollouts canary analysis example
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate
spec:
  args:
    - name: canary-version
  metrics:
    - name: error-rate
      successCondition: result[0] < 0.02   # proceed if error rate < 2%
      failureCondition: result[0] >= 0.05  # roll back if error rate >= 5%
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090  # adjust to your cluster
          query: |
            sum(rate(http_requests_total{status=~"5..",
              version="{{args.canary-version}}"}[5m]))
            /
            sum(rate(http_requests_total{
              version="{{args.canary-version}}"}[5m]))
```
Rolling Deployments
Rolling deployments update instances one at a time (or in small batches). Kubernetes uses rolling updates by default. At any point during the deployment, some pods run the old version and some run the new version.
Kubernetes Rolling Update Configuration
```yaml
apiVersion: apps/v1
kind: Deployment
spec:
  minReadySeconds: 10   # Wait 10s after a pod is Ready before marking it available
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1         # Allow 1 extra pod during the update
      maxUnavailable: 0   # Never take a pod offline until its replacement is ready
```
The key setting is maxUnavailable: 0, which ensures Kubernetes never removes an old pod until a new pod has passed its readiness check. Combine this with a meaningful readiness probe that checks your application is actually serving traffic, not just that the process is running.
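Such a readiness probe might look like the following. A sketch; the `/healthz` path, port, and timings are assumptions, and the endpoint itself must be implemented to verify real serving capability:

```yaml
# Hypothetical readiness probe: the /healthz handler should confirm the app
# can actually serve requests (e.g. dependencies reachable), not merely
# that the process is alive.
readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 3
```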
The Critical Requirement: Backward Compatibility
During a rolling deployment, both the old and new versions of your application run simultaneously. This means API contracts must be backward compatible, database schema changes must work with both versions, and feature flags must handle gradual rollout. Breaking changes require a two-phase deployment: first deploy a backward-compatible version, then remove the old behavior.
Zero Downtime Database Migrations
Database migrations are the hardest part of zero-downtime deployments. The expand-contract pattern (also called parallel change) is the standard approach:
Expand-Contract Migration Pattern
- Expand: Add the new column/table/index without removing the old one. Deploy application code that writes to both old and new locations. Old code still works because it only reads and writes the old columns.
- Backfill: Migrate existing data from old to new column in batches (avoid locking the table with a single massive UPDATE).
- Deploy new code: Update application to read from new column exclusively. Both old and new code can now coexist safely during rolling deployment.
- Contract: Remove the old column in a separate migration, after confirming no old-version pods are still running.
⚠️ Never do this in one migration: ALTER TABLE users RENAME COLUMN username TO name; any old pod still running will immediately fail when it looks for "username". Always use expand-contract instead.
Monitoring Your Zero Downtime Deployment
Regardless of deployment strategy, you need real-time visibility during every deployment window. Monitor these metrics in the 30 minutes following every deployment:
- HTTP error rate (5xx): Any increase above normal baseline triggers rollback review
- P95 response time: Latency regressions are often the first signal of a bad deployment
- Business metrics: Checkout completion rate, login success rate, and the other metrics that matter to users
- External uptime checks: Confirm the deployment didn't break external reachability
Frequently Asked Questions
What is a zero downtime deployment?
A zero downtime deployment is a process for updating an application without any period where it is unavailable. It requires running multiple versions simultaneously during transition, intelligent traffic routing, and health checks to verify the new version before it receives full production traffic.
What is the difference between blue-green and canary deployment?
Blue-green switches all traffic at once between two full environments, providing instant rollback but requiring double infrastructure. Canary gradually increases traffic to the new version (1% → 10% → 50% → 100%), limiting blast radius but requiring more sophisticated routing configuration.
How do you achieve zero downtime database migrations?
Use the expand-contract pattern: Add new columns without removing old ones, deploy code that writes to both, backfill data, then remove old columns in a separate migration. This ensures both old and new application code can run against the same database simultaneously.
What is a rolling deployment?
A rolling deployment updates instances one at a time โ some run the old version, some the new, until all are updated. Kubernetes uses this by default. The key requirement is that both versions must be compatible with each other and with the current database schema.