Chaos Engineering: Complete Guide for Building Resilient Systems (2026)
Netflix terminates random EC2 instances in production every day. Amazon runs scheduled game days in which teams intentionally fail entire availability zones. Google practices DiRT (Disaster Recovery Testing). They all do it for the same reason: you can't trust that your system is resilient until you've tried to break it. This is chaos engineering.
The Core Insight
Production failures are inevitable. The question is whether they happen on your schedule (controlled experiments) or at 3 AM on Black Friday (real outage). Chaos engineering shifts failure discovery from the worst possible moment to a controlled experiment with your team watching and ready to respond.
What Is Chaos Engineering?
Chaos engineering is the discipline of experimenting on a system to build confidence in its ability to withstand turbulent conditions. The practice originated at Netflix, which in 2011 introduced Chaos Monkey — a tool that randomly killed production instances to force engineers to build resilient services.
The formal definition from the Principles of Chaos Engineering (principlesofchaos.org): "Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production."
Chaos engineering is not randomly breaking things. It's a scientific method: form a hypothesis about how your system behaves under specific conditions, run a controlled experiment, measure the results, and either gain confidence or discover a weakness to fix.
The 5 Principles of Chaos Engineering
Build a hypothesis around steady state
Define what "normal" looks like with measurable metrics before running any experiment. Example: "Checkout success rate will stay above 99.5% when the recommendations service is down." Without a hypothesis, you're just breaking things randomly.
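As a concrete illustration, a steady-state hypothesis can be encoded as machine-checkable thresholds. This is a hypothetical sketch — the metric names and limits below are examples, not prescriptions:

```python
# Steady-state hypothesis as explicit, machine-checkable thresholds.
# Metric names and limits are illustrative placeholders.
STEADY_STATE = {
    "checkout_success_rate": (">=", 0.995),
    "p99_latency_ms": ("<=", 500),
}

def steady_state_holds(metrics: dict) -> bool:
    """Return True if every observed metric satisfies its threshold."""
    ops = {">=": lambda a, b: a >= b, "<=": lambda a, b: a <= b}
    return all(ops[op](metrics[name], limit)
               for name, (op, limit) in STEADY_STATE.items())

print(steady_state_holds({"checkout_success_rate": 0.998, "p99_latency_ms": 320}))  # True
print(steady_state_holds({"checkout_success_rate": 0.970, "p99_latency_ms": 320}))  # False
```

Writing the hypothesis as data rather than prose means the same check can gate an automated experiment later.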
Vary real-world events
Inject failures that actually happen: instance crashes, network latency, disk full, CPU saturation, DNS failures, upstream API timeouts. Don't invent exotic failure modes — focus on the things that have failed before or will fail eventually.
Run experiments in production
Staging environments don't match production traffic patterns, data volumes, or caching behavior. The closer to production you run, the more confidence you gain. Start in staging to develop process; graduate to production for real resilience.
Automate experiments to run continuously
One-off experiments have limited value. Automated, recurring experiments catch regressions — a change that breaks resilience is detected in days, not discovered during the next real outage. Netflix's Chaos Monkey runs continuously.
Minimize blast radius
Limit the scope of each experiment: affect 1% of instances, a single availability zone, a non-critical service first. Expand scope as you gain confidence. The goal is learning, not widespread failure.
Types of Chaos Experiments
Infrastructure failures
Examples:
- Kill a pod or VM at random
- Drain a Kubernetes node
- Simulate an AZ outage
What it tests: auto-healing, redundancy, failover
Network degradation
Examples:
- Add 200 ms latency to a service
- Simulate 10% packet loss
- Block traffic to a dependency
What it tests: timeout handling, circuit breakers, retry logic
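A latency-injection experiment exercises exactly this kind of client code. Here is a minimal, hypothetical sketch (not from the article) of a call wrapper with bounded retries — the behavior the experiment is meant to validate:

```python
import time

def call_with_retries(fn, retries=3, backoff_s=0.1):
    """Retry a flaky call a bounded number of times with linear backoff."""
    last_exc = None
    for attempt in range(retries):
        try:
            return fn()
        except TimeoutError as exc:
            last_exc = exc
            time.sleep(backoff_s * (attempt + 1))  # back off before retrying
    raise last_exc  # budget exhausted: surface the failure

# Simulated dependency that times out twice, then recovers.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("upstream slow")
    return "ok"

print(call_with_retries(flaky))  # "ok" after two retried timeouts
```

Injecting 200 ms of latency tells you whether budgets like these actually hold, or whether retries amplify the load on an already-slow dependency.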
Resource exhaustion
Examples:
- Pin CPU at 90% for 5 minutes
- Fill the disk to 95%
- Apply memory pressure (trigger OOM)
What it tests: resource limits, graceful degradation
Dependency failures
Examples:
- Return 500s from a downstream API
- Add a 10-second timeout to database calls
- Simulate DNS failure for an external service
What it tests: fallbacks, circuit breakers, default behavior
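A "return 500s from a downstream API" experiment is the natural test for a circuit breaker with a fallback. This is a deliberately minimal, hedged sketch of the pattern, not a production implementation:

```python
class CircuitBreaker:
    """Minimal circuit breaker: after N consecutive failures, fail fast."""

    def __init__(self, failure_threshold=3):
        self.failure_threshold = failure_threshold
        self.failures = 0
        self.open = False

    def call(self, fn, fallback):
        if self.open:
            return fallback  # breaker open: skip the broken dependency entirely
        try:
            result = fn()
            self.failures = 0  # success resets the failure count
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.open = True  # trip the breaker
            return fallback

breaker = CircuitBreaker()
def failing_api():
    raise RuntimeError("HTTP 500")

for _ in range(4):
    print(breaker.call(failing_api, fallback="cached-recommendations"))
print(breaker.open)  # True -- breaker tripped after 3 consecutive failures
```

A real breaker also needs a half-open state to probe for recovery; the chaos experiment checks both that the fallback serves users and that the breaker eventually closes again.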
State corruption
Examples:
- Corrupt cache entries
- Inject bad data into a queue
- Simulate split-brain in a database
What it tests: data validation, idempotency, reconciliation
Top Chaos Engineering Tools (2026)
Chaos Monkey
Netflix (open source) · AWS
The original chaos tool. Randomly terminates EC2 instances during business hours. Deploys into your own Spinnaker setup. Best for AWS shops that want the classic Chaos Monkey experience.
LitmusChaos
CNCF (ChaosNative) · Kubernetes
CNCF incubating project for Kubernetes chaos. 50+ fault types: pod delete, node drain, network latency, disk fill, CPU/memory stress. Uses Kubernetes CRDs — experiments are declarative YAML. Strong community.
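For illustration, a LitmusChaos pod-delete experiment expressed as a declarative ChaosEngine resource might look roughly like this. Field names follow the v1alpha1 CRD, but the namespace, label, service account, and values are placeholders — check the LitmusChaos documentation for your version:

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: payment-pod-delete        # placeholder name
  namespace: default
spec:
  appinfo:
    appns: default
    applabel: app=payment-service # placeholder target label
    appkind: deployment
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"         # run for 60 seconds
            - name: PODS_AFFECTED_PERC
              value: "33"         # blast radius: 1 of 3 replicas
```

Because experiments are plain Kubernetes resources, they can be version-controlled and applied through the same CI/CD pipeline as the services they test.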
Chaos Toolkit
ChaosIQ (open source) · Multi-cloud
Python-based framework for writing chaos experiments as JSON/YAML. Extensible via plugins for AWS, GCP, Azure, Kubernetes, and more. Experiments are declarative and version-controlled.
Gremlin
Gremlin Inc. · Multi-cloud, Kubernetes, bare metal
Commercial chaos-as-a-service platform. Rich UI, blast radius controls, scheduled experiments, team management, and compliance reporting. Used by DoorDash, Target, Twilio.
AWS Fault Injection Service (FIS)
Amazon Web Services · AWS
AWS-managed chaos service. Injects faults across EC2, ECS, EKS, RDS, and other AWS services. Native IAM integration, CloudWatch integration, and experiment templates.
Azure Chaos Studio
Microsoft · Azure
Azure-native chaos platform. Supports AKS, VMs, App Services, and network faults. Microsoft's managed, Azure-centric counterpart to AWS FIS.
How to Run Your First Chaos Experiment
Start small. Pick a non-critical service. Have monitoring in place before you begin.
Define steady state
What does healthy look like? Define measurable metrics: "checkout success rate > 99%", "p99 latency < 500ms", "error rate < 0.5%". These are your pass/fail criteria.
Form a hypothesis
Predict the outcome: "If we kill one instance of the payment service, checkout success rate will stay above 99% because we have 3 replicas and load balancing." Be specific.
Set your blast radius
Scope the experiment: "Affect only 1 of 3 payment service pods", "Limit to 10% of traffic", "Run for 5 minutes only". Have a kill switch — a runbook step to stop the experiment immediately.
Instrument and monitor
Open your dashboards before starting: error rate, latency, business KPIs. Have your alerting active. You want to see anomalies in real-time, not after the fact.
Run the experiment
Inject the fault (e.g., kubectl delete pod payment-service-xxx). Watch your metrics for 5 minutes. Note any deviation from steady state — even small ones.
Measure and analyze
Did steady state hold? If yes — confidence gained. If no — you found a weakness. Document what broke, why, and create a ticket to fix it. Both outcomes are successes.
Fix and iterate
Remediate any discovered weaknesses. Re-run the experiment to confirm the fix. Then expand scope: add more fault types, increase blast radius, graduate to production.
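The steps above can be sketched as a minimal experiment harness. Everything here is hypothetical — `inject_fault`, `read_metrics`, and `steady_state_ok` are stand-ins for your own tooling:

```python
import random

def run_experiment(targets, inject_fault, read_metrics, steady_state_ok,
                   blast_radius=0.1):
    """Inject a fault into a bounded sample of targets and report the verdict."""
    sample_size = max(1, int(len(targets) * blast_radius))
    victims = random.sample(targets, sample_size)  # limit the blast radius
    for target in victims:
        inject_fault(target)
    metrics = read_metrics()  # observe steady-state metrics after injection
    verdict = "confidence gained" if steady_state_ok(metrics) else "weakness found"
    return victims, verdict

# Toy usage: 10 pods, 10% blast radius, metrics hold.
killed = []
pods = [f"payment-{i}" for i in range(10)]
victims, verdict = run_experiment(
    pods,
    inject_fault=killed.append,  # stand-in for e.g. a kubectl delete
    read_metrics=lambda: {"success_rate": 0.997},
    steady_state_ok=lambda m: m["success_rate"] >= 0.995,
)
print(len(victims), verdict)  # 1 confidence gained
```

Note that both verdicts are returned as successes: a confirmed hypothesis and a discovered weakness are equally valuable outcomes.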
Chaos Engineering Maturity Model
Where does your organization sit?
Level 1 — No chaos engineering
You wait for production to tell you what breaks. Outages are surprises.
Level 2 — Chaos in staging
You run experiments in staging to validate system behavior. Limited real-world signal.
Level 3 — Chaos in production (manual)
Occasional manual experiments in production with team approval. Controlled blast radius.
Level 4 — Automated chaos in production
Experiments run automatically on a schedule. Failures feed into CI/CD gates. Teams expect and own resilience.
Level 5 — Continuous chaos + gamedays
Always-on chaos plus quarterly gamedays testing large-scale scenarios (AZ outage, region failover). Netflix level.
Most organizations are at Level 1-2. Moving from Level 2 to Level 3 is the biggest resilience jump — automation catches regressions that manual experiments miss.
Prerequisites Before Running Chaos
Chaos engineering on a poorly instrumented system is just random breaking. These foundations are required:
✓ Real-time monitoring
You must see impact in seconds. Without observability, you can't tell if the experiment is affecting production.
✓ Alerting configured
Alerts should fire during degraded experiments. If they don't, your alerting has gaps — that's also valuable to know.
✓ On-call runbooks
If an experiment goes wrong, your team needs a documented recovery path. Chaos is not the time to improvise.
✓ Graceful degradation
Services should have fallbacks for when dependencies fail. Chaos tests whether those fallbacks actually work.
✓ Defined SLOs
Steady state = your SLOs. Without SLOs, you have no pass/fail criteria for experiments.
✓ Rollback capability
You need to stop experiments instantly. Automated rollback (e.g., Kubernetes health probes) is better than manual.
Frequently Asked Questions
What is chaos engineering?
Chaos engineering is the practice of deliberately injecting failures, latency, and disruptions into a production (or production-like) system to discover weaknesses before they cause customer-facing outages. The goal is to build confidence that a system can withstand unexpected conditions — not to cause harm, but to surface hidden dependencies and failure modes in a controlled way.
What is Chaos Monkey?
Chaos Monkey is the original chaos engineering tool, created by Netflix in 2011 to test resilience on AWS. It randomly terminates virtual machine instances in production during business hours. The premise: if a random instance dies at any time, and your system survives, you've proven resilience. Netflix open-sourced Chaos Monkey in 2012. It inspired the broader "Simian Army" of chaos tools and the modern chaos engineering discipline.
Is chaos engineering safe to run in production?
Yes, with proper controls. Start with a steady-state hypothesis (define what "healthy" looks like with metrics). Set a blast radius — limit the experiment to a small percentage of instances or a non-critical service. Have a kill switch ready to stop the experiment. Monitor closely during the experiment. Run during business hours when your team can respond. Start in staging, then expand to production gradually. Most mature organizations run chaos in production — it's the only way to truly validate resilience.
What is the difference between chaos engineering and penetration testing?
Chaos engineering tests operational resilience — how your system behaves during infrastructure failures, network partitions, and dependency outages. Penetration testing tests security posture — how your system behaves under adversarial attack. They are complementary: chaos engineering might uncover a cascading failure when a dependency goes down; pen testing might uncover an exploitable vulnerability in that dependency. Some teams run both; they address different risk dimensions.
How do you measure the success of a chaos experiment?
Success = confirming the steady-state hypothesis OR discovering a weakness. Define before the experiment: what metrics define "healthy" (e.g., error rate < 0.1%, p99 latency < 500ms, checkout success rate > 99%). During the experiment, watch those metrics. If they hold — confidence gained. If they degrade — weakness found. Both outcomes are wins: one confirms resilience, the other reveals something to fix before a real outage.
Related Guides
SLA vs SLO vs SLI
Reliability targets — the foundation of steady state
Incident Postmortem Guide
Learn from failures after they happen
On-Call Management Guide
Incident response best practices
DORA Metrics Guide
Change Failure Rate — chaos helps improve this metric
Circuit Breaker Pattern
The pattern chaos testing most often validates
Kubernetes Monitoring Guide
Monitor the systems you're running chaos against