Chaos Engineering: Complete Guide for Building Resilient Systems (2026)

Netflix terminated random EC2 instances in production every day. Amazon runs scheduled game days where they intentionally fail entire availability zones. Google practices DiRT (Disaster Recovery Testing). They all do it for the same reason: you can't trust your system is resilient until you've tried to break it. This is chaos engineering.

The Core Insight

Production failures are inevitable. The question is whether they happen on your schedule (controlled experiments) or at 3 AM on Black Friday (real outage). Chaos engineering shifts failure discovery from the worst possible moment to a controlled experiment with your team watching and ready to respond.

What Is Chaos Engineering?

Chaos engineering is the discipline of experimenting on a system to build confidence in its ability to withstand turbulent conditions. The term was coined by Netflix in 2011 when they introduced Chaos Monkey — a tool that randomly killed production instances to force engineers to build resilient services.

The formal definition from the Principles of Chaos Engineering (principlesofchaos.org): "Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production."

Chaos engineering is not randomly breaking things. It's a scientific method: form a hypothesis about how your system behaves under specific conditions, run a controlled experiment, measure the results, and either gain confidence or discover a weakness to fix.

The 5 Principles of Chaos Engineering

1

Build a hypothesis around steady state

Define what "normal" looks like with measurable metrics before running any experiment. Example: "Checkout success rate will stay above 99.5% when the recommendations service is down." Without a hypothesis, you're just breaking things randomly. (A minimal sketch of a steady-state check follows this list.)

2

Vary real-world events

Inject failures that actually happen: instance crashes, network latency, disk full, CPU saturation, DNS failures, upstream API timeouts. Don't invent exotic failure modes — focus on the things that have failed before or will fail eventually.

3

Run experiments in production

Staging environments don't match production traffic patterns, data volumes, or caching behavior. The closer to production you run, the more confidence you gain. Start in staging to develop the process; graduate to production for real confidence.

4

Automate experiments to run continuously

One-off experiments have limited value. Automated, recurring experiments catch regressions — a change that breaks resilience is detected in days, not discovered during the next real outage. Netflix's Chaos Monkey runs continuously.

5

Minimize blast radius

Limit the scope of each experiment: affect 1% of instances, a single availability zone, a non-critical service first. Expand scope as you gain confidence. The goal is learning, not widespread failure.
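
To make the first principle concrete, here is a minimal sketch in Python of a steady-state check. The metric names, the thresholds, and the get_metric() helper are hypothetical placeholders; in practice the values would come from your monitoring system.

    # Minimal steady-state check (sketch). get_metric() is a hypothetical
    # stand-in for a query against your monitoring system.
    STEADY_STATE = {
        "checkout_success_rate": lambda v: v >= 0.995,  # stays above 99.5%
        "p99_latency_ms": lambda v: v <= 500,           # p99 under 500 ms
        "error_rate": lambda v: v <= 0.005,             # errors under 0.5%
    }

    def steady_state_violations(get_metric):
        """Return the metrics (and values) currently outside steady state."""
        violations = []
        for name, within_bounds in STEADY_STATE.items():
            value = get_metric(name)
            if not within_bounds(value):
                violations.append((name, value))
        return violations  # an empty list means the hypothesis holds

Run the same check before the experiment (to confirm the system is healthy), during it (to watch for deviation), and after it (to confirm recovery).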

Types of Chaos Experiments

Infrastructure failures

EXAMPLES

  • Kill a pod or VM randomly
  • Drain a Kubernetes node
  • Simulate AZ outage

WHAT IT TESTS

Tests auto-healing, redundancy, failover

Network degradation

EXAMPLES

  • Add 200ms latency to a service
  • Simulate 10% packet loss
  • Block traffic to a dependency

WHAT IT TESTS

Tests timeout handling, circuit breakers, retry logic

Resource exhaustion

EXAMPLES

  • CPU at 90% for 5 minutes
  • Fill disk to 95%
  • Memory pressure (OOM)

WHAT IT TESTS

Tests resource limits, graceful degradation

Dependency failures

EXAMPLES

  • Return 500s from a downstream API
  • Add 10-second timeout to database
  • DNS failure for external service

WHAT IT TESTS

Tests fallbacks, circuit breakers, default behavior (a minimal sketch of this category follows this section)

State corruption

EXAMPLES

  • Corrupt cache entries
  • Inject bad data into queue
  • Simulate split-brain in database

WHAT IT TESTS

Tests data validation, idempotency, reconciliation
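
As one concrete illustration of the dependency-failure category, here is a minimal Python sketch of a fault-injecting wrapper around a downstream call. The wrapper name and the rates are illustrative, not taken from any particular tool; in production you would usually inject these faults at the proxy or service-mesh layer with one of the tools below.

    import random
    import time

    # Sketch: wrap a downstream call so a fraction of requests stall or fail,
    # to check that timeouts, retries, and fallbacks actually engage.
    def with_injected_faults(call, error_rate=0.10, added_latency_s=0.0):
        def wrapped(*args, **kwargs):
            if added_latency_s:
                time.sleep(added_latency_s)           # simulate a slow dependency
            if random.random() < error_rate:
                raise RuntimeError("injected fault")  # simulate a 500 / dropped connection
            return call(*args, **kwargs)
        return wrapped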

📡
Recommended

Chaos experiments require real-time monitoring — this is the right stack

Better Stack provides 30-second monitoring, instant alerting, and on-call management. Run chaos experiments with confidence knowing you'll see impact in seconds.

Try Better Stack Free →

Top Chaos Engineering Tools (2026)

Chaos Monkey

Netflix (open source) · AWS

Infrastructure

The original chaos tool. Randomly terminates EC2 instances during business hours. Deployable to your own Spinnaker setup. Best for AWS shops wanting the classic Chaos Monkey experience.

Battle-tested Netflix approach · AWS-native · Open source

LitmusChaos

CNCF (ChaosNative) · Kubernetes

Kubernetes

CNCF incubating project for Kubernetes chaos. 50+ fault types: pod delete, node drain, network latency, disk fill, CPU/memory stress. Uses Kubernetes CRDs — experiments are declarative YAML. Strong community.

100% Kubernetes-native · CNCF backed · GitOps-friendly

Chaos Toolkit

ChaosIQ (open source) · Multi-cloud

Multi-cloud

Python-based framework for writing chaos experiments as JSON/YAML. Extensible via plugins for AWS, GCP, Azure, Kubernetes, and more. Experiments are declarative and version-controlled (a rough sketch of the format appears after the tool list).

Multi-cloud · Code-as-experiment (version control) · Extensible

Gremlin

Gremlin Inc. · Multi-cloud, Kubernetes, bare metal

Commercial SaaS

Commercial chaos-as-a-service platform. Rich UI, blast radius controls, scheduled experiments, team management, and compliance reporting. Used by DoorDash, Target, Twilio.

Polished UI · Enterprise controls · Support + SLA

AWS Fault Injection Service (FIS)

Amazon Web Services · AWS

Managed (AWS)

AWS-managed chaos service. Injects faults across EC2, ECS, EKS, RDS, and other AWS services. Native IAM integration, CloudWatch integration, and experiment templates.

No agents to deploy · Deep AWS integration · CloudWatch observability built-in

Azure Chaos Studio

Microsoft · Azure

Managed (Azure)

Azure-native chaos platform. Supports AKS, VMs, App Services, and network faults. Similar to AWS FIS — managed, no agent, Azure-centric.

No agents · Azure-native · RBAC integration
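
To show what the declarative style looks like, here is a rough sketch of a Chaos Toolkit-style experiment, written as a Python dict rather than the usual JSON/YAML. Field names and the chaosk8s module path are from memory and should be verified against the Chaos Toolkit documentation before use.

    # Rough sketch of a Chaos Toolkit-style experiment (normally JSON or YAML).
    # Verify field and module names against the official docs.
    experiment = {
        "title": "Payment service tolerates losing one replica",
        "description": "Checkout should stay healthy with 2 of 3 replicas running.",
        "steady-state-hypothesis": {
            "title": "Checkout endpoint responds with 200",
            "probes": [{
                "type": "probe",
                "name": "checkout-responds",
                "tolerance": 200,
                "provider": {"type": "http", "url": "https://shop.example.internal/health"},
            }],
        },
        "method": [{
            "type": "action",
            "name": "terminate-one-payment-pod",
            "provider": {
                "type": "python",
                "module": "chaosk8s.pod.actions",  # Kubernetes extension (assumed path)
                "func": "terminate_pods",
                "arguments": {"label_selector": "app=payment", "rand": True},
            },
        }],
        "rollbacks": [],
    }

LitmusChaos expresses the same idea as Kubernetes CRDs instead, so experiments live in your GitOps repo alongside the rest of your manifests.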

How to Run Your First Chaos Experiment

Start small. Pick a non-critical service. Have monitoring in place before you begin. A minimal end-to-end sketch of the experiment loop follows the steps below.

1

Define steady state

What does healthy look like? Define measurable metrics: "checkout success rate > 99%", "p99 latency < 500ms", "error rate < 0.5%". These are your pass/fail criteria.

2

Form a hypothesis

Predict the outcome: "If we kill one instance of the payment service, checkout success rate will stay above 99% because we have 3 replicas and load balancing." Be specific.

3

Set your blast radius

Scope the experiment: "Affect only 1 of 3 payment service pods", "Limit to 10% of traffic", "Run for 5 minutes only". Have a kill switch — a runbook step to stop the experiment immediately.

4

Instrument and monitor

Open your dashboards before starting: error rate, latency, business KPIs. Have your alerting active. You want to see anomalies in real time, not after the fact.

5

Run the experiment

Inject the fault (e.g., kubectl delete pod payment-service-xxx). Watch your metrics for 5 minutes. Note any deviation from steady state — even small ones.

6

Measure and analyze

Did steady state hold? If yes — confidence gained. If no — you found a weakness. Document what broke, why, and create a ticket to fix it. Both outcomes are successes.

7

Fix and iterate

Remediate any discovered weaknesses. Re-run the experiment to confirm the fix. Then expand scope: add more fault types, increase blast radius, graduate to production.
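
Putting steps 4 through 6 together, here is a minimal sketch of an experiment runner in Python. The inject_fault and check_violations callables are hypothetical hooks: the fault might be a kubectl command, and check_violations() would be a steady-state check like the one sketched earlier, returning the metrics currently outside bounds.

    import time

    # Sketch of the run/observe/verdict loop. Both hooks are hypothetical
    # placeholders for your own tooling.
    def run_experiment(inject_fault, check_violations, duration_s=300, interval_s=15):
        if check_violations():
            return {"verdict": "aborted", "reason": "system was not healthy at start"}

        inject_fault()                        # e.g. delete one pod, add latency

        observed = []
        deadline = time.time() + duration_s
        while time.time() < deadline:
            observed.extend(check_violations())
            time.sleep(interval_s)            # keep watching for the whole window

        if observed:
            return {"verdict": "weakness found", "violations": observed}
        return {"verdict": "steady state held"}

A real runner would also wire in the kill switch from step 3 so the experiment can be aborted mid-run, not just before it starts.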

Chaos Engineering Maturity Model

Where does your organization sit?

Level 0

No chaos engineering

You wait for production to tell you what breaks. Outages are surprises.

Level 1

Chaos in staging

You run experiments in staging to validate system behavior. Limited real-world signal.

Level 2

Chaos in production (manual)

Occasional manual experiments in production with team approval. Controlled blast radius.

Level 3

Automated chaos in production

Experiments run automatically on a schedule. Failures feed into CI/CD gates (a sketch of such a gate follows this section). Teams expect and own resilience.

Level 4

Continuous chaos + gamedays

Always-on chaos plus quarterly gamedays testing large-scale scenarios (AZ outage, region failover). Netflix level.

Most organizations are at Level 1-2. Moving from Level 2 to Level 3 is the biggest resilience jump — automation catches regressions that manual experiments miss.
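
As a sketch of what the Level 3 CI/CD gate can look like in practice (the report path and format are assumptions for illustration, not taken from any specific tool): a scheduled pipeline step runs the experiment, writes a report, and a gate fails the build if steady state did not hold.

    import json
    import sys

    # Sketch of a CI gate: read the report written by a scheduled chaos run and
    # fail the pipeline if steady state did not hold.
    def chaos_gate(report_path="chaos-report.json"):
        with open(report_path) as f:
            report = json.load(f)
        if report.get("verdict") != "steady state held":
            print(f"Chaos gate failed: {report}", file=sys.stderr)
            sys.exit(1)
        print("Chaos gate passed: steady state held.")

    if __name__ == "__main__":
        chaos_gate()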

Prerequisites Before Running Chaos

Chaos engineering on a poorly instrumented system is just random breaking. These foundations are required:

Real-time monitoring

You must see impact in seconds. Without observability, you can't tell if the experiment is affecting production.

Alerting configured

Alerts should fire during degraded experiments. If they don't, your alerting has gaps — that's also valuable to know.

On-call runbooks

If an experiment goes wrong, your team needs a documented recovery path. Chaos is not the time to improvise.

Graceful degradation

Services should have fallbacks for when dependencies fail. Chaos tests whether those fallbacks actually work (a minimal sketch follows this list).

Defined SLOs

Steady state = your SLOs. Without SLOs, you have no pass/fail criteria for experiments.

Rollback capability

You need to stop experiments instantly. Automated rollback (e.g., Kubernetes health probes) is better than manual.
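
A minimal sketch of what graceful degradation can mean in code, assuming a hypothetical recommendations client: if the dependency is slow or failing, serve a safe default instead of failing the whole page.

    # Sketch: fall back to a safe default when a non-critical dependency fails.
    # fetch_recommendations is a hypothetical client call; the timeout and the
    # default value are illustrative.
    DEFAULT_RECOMMENDATIONS = []  # e.g. an empty shelf or a cached bestseller list

    def recommendations_for(user_id, fetch_recommendations, timeout_s=0.3):
        try:
            return fetch_recommendations(user_id, timeout=timeout_s)
        except Exception:
            # Degrade gracefully: the page still renders, just without personalization.
            return DEFAULT_RECOMMENDATIONS

A dependency-failure experiment (like the wrapper sketched earlier) is exactly what verifies this fallback path is exercised and returns in time.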

📡
Recommended

The prerequisite for chaos engineering: know when things are down

Better Stack monitors your endpoints every 30 seconds. Know immediately if a chaos experiment causes unexpected degradation beyond your blast radius.

Try Better Stack Free →

Frequently Asked Questions

What is chaos engineering?

Chaos engineering is the practice of deliberately injecting failures, latency, and disruptions into a production (or production-like) system to discover weaknesses before they cause customer-facing outages. The goal is to build confidence that a system can withstand unexpected conditions — not to cause harm, but to surface hidden dependencies and failure modes in a controlled way.

What is Chaos Monkey?

Chaos Monkey is the original chaos engineering tool, created by Netflix in 2011 to test resilience on AWS. It randomly terminates virtual machine instances in production during business hours. The premise: if a random instance dies at any time, and your system survives, you've proven resilience. Netflix open-sourced Chaos Monkey in 2012. It inspired the broader "Simian Army" of chaos tools and the modern chaos engineering discipline.

Is chaos engineering safe to run in production?

Yes, with proper controls. Start with a steady-state hypothesis (define what "healthy" looks like with metrics). Set a blast radius — limit the experiment to a small percentage of instances or a non-critical service. Have a kill switch ready to stop the experiment. Monitor closely during the experiment. Run during business hours when your team can respond. Start in staging, then expand to production gradually. Most mature organizations run chaos in production — it's the only way to truly validate resilience.

What is the difference between chaos engineering and penetration testing?

Chaos engineering tests operational resilience — how your system behaves during infrastructure failures, network partitions, and dependency outages. Penetration testing tests security posture — how your system behaves under adversarial attack. They are complementary: chaos engineering might uncover a cascading failure when a dependency goes down; pen testing might uncover an exploitable vulnerability in that dependency. Some teams run both; they address different risk dimensions.

How do you measure the success of a chaos experiment?

Success = confirming the steady-state hypothesis OR discovering a weakness. Define before the experiment: what metrics define "healthy" (e.g., error rate < 0.1%, p99 latency < 500ms, checkout success rate > 99%). During the experiment, watch those metrics. If they hold — confidence gained. If they degrade — weakness found. Both outcomes are wins: one confirms resilience, the other reveals something to fix before a real outage.
