Error Budget Guide: SRE Concept, Calculation & Implementation (2026)
Error budgets are the core mechanism in SRE for making reliability vs. velocity trade-offs explicit and data-driven. Instead of arguing about whether a deploy is "safe," you check the budget. This guide explains the math, how to set burn rate alerts, and how to write an error budget policy that engineering and product both agree to.
TL;DR — Error Budget Essentials
- ✅ Error budget = 100% - SLO (e.g., 99.9% SLO → 0.1% budget = 43.8 min/month)
- ✅ Burn rate > 1× = consuming budget faster than your SLO allows
- ✅ Alert on: fast burn (14.4× over 1h) + slow burn (6× over 6h)
- ✅ When budget is exhausted: freeze deploys, focus engineering on reliability
- ✅ Error budget policy makes the freeze automatic — not a debate every time
- ✅ 99.9% (three nines) is right for most products; 99.99% costs 4× more
What Is an Error Budget?
An error budget is simply the flip side of your SLO (Service Level Objective). If you commit to 99.9% availability, you're saying: "We will be unavailable at most 0.1% of the time." That 0.1% is your error budget.
The genius of error budgets — as described in Google's Site Reliability Engineering book — is that they make an implicit trade-off explicit. Engineers want to ship fast. Operations teams want stability. Error budgets answer both: ship as fast as you want, as long as you're not burning the budget. When the budget runs out, everyone agrees in advance to stop and fix reliability.
"The error budget provides a clear, objective metric that determines how unreliable the service is allowed to be within a single quarter. This metric removes the politics from negotiations between the SREs and the product developers when deciding how much risk to allow."
— Google SRE Book
Error Budget Calculation
The formula is simple. The tricky part is choosing the right SLI (Service Level Indicator) to measure.
# Error budget formula
error_budget = 100% - SLO
# Convert to time (30-day month; the SLO table below uses an average
# 30.44-day month, which is why 99.9% works out to 43.8 minutes there)
budget_minutes = (100 - slo_percent) / 100 * 30 * 24 * 60
# Convert to allowed failures (request count)
allowed_failures = (100 - slo_percent) / 100 * total_requests
SLO Comparison Table
| SLO | Nines | Monthly Budget | Weekly Budget | Daily Budget |
|---|---|---|---|---|
| 99% | Two nines | 7h 18m | 1h 40m | 14m 24s |
| 99.9% | Three nines ✓ Most teams | 43m 49s | 10m 4s | 1m 26s |
| 99.95% | Three and a half nines | 21m 54s | 5m 2s | 43.2s |
| 99.99% | Four nines (enterprise APIs) | 4m 22s | 1m 0s | 8.6s |
| 99.999% | Five nines (financial, telco) | 26.3s | 6.1s | 0.86s |
Practical note: A 99.9% SLO gives you 43 minutes of downtime per month. A single botched deploy that takes 20 minutes to roll back uses nearly half your monthly budget. That's exactly the point — it makes the cost of incidents real and motivates better testing and deployment practices.
Error Budget Burn Rate
Burn rate measures how fast you're consuming your error budget. A burn rate of 1× means you're on track to use exactly your budget by the end of the window. 10× means you'd exhaust it in 1/10th the time.
# Burn rate formula
burn_rate = error_rate_over_window / (1 - slo)
# Example: SLO = 99.9%, current error rate = 1%
burn_rate = 0.01 / 0.001 = 10×
# At 10× burn: a 30-day budget exhausts in 3 days
# Example: current error rate = 0.05%
burn_rate = 0.0005 / 0.001 = 0.5×
# At 0.5× burn: budget will outlast the window
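The same arithmetic as runnable Python (a sketch; the helper names are ours):

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How fast the budget is being consumed: 1.0 means exactly on budget."""
    return error_rate / (1 - slo)

def days_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """At a sustained burn rate, when does the windowed budget run out?"""
    return window_days / rate

print(burn_rate(0.01, 0.999))            # ≈ 10x burn
print(days_to_exhaustion(10.0))          # 30-day budget gone in 3 days
print(burn_rate(0.0005, 0.999))          # ≈ 0.5x — budget outlasts the window
```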
Google's Multi-Window Burn Rate Alerting
Alerting on a single burn rate window has problems: a long window catches fast incidents too late, while a short window never notices slow, creeping degradation. The Google SRE Workbook therefore recommends pairing two burn rates, each checked over a long and a short window:
| Alert | Burn Rate | Window | Budget Consumed | Urgency |
|---|---|---|---|---|
| Page immediately | 14.4× | 1 hour | 2% in 1h (exhausts in ~2 days) | P0 — wake someone up |
| Page immediately | 14.4× | 5 minutes | Short but steep | P0 — very fast incident |
| Ticket (next business day) | 6× | 6 hours | 5% in 6h (slow degradation) | P1 — investigate during business hours |
| Ticket | 6× | 30 minutes | Moderate sustained degradation | P1 — monitor closely |
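To make the table concrete, here is a sketch of the multi-window decision logic (thresholds from the table; the function shape and window names are illustrative, not from the source):

```python
def evaluate_alerts(rates: dict[str, float]) -> list[str]:
    """rates maps window name -> current burn rate, e.g. {"1h": 15.0, "5m": 16.2}."""
    alerts = []
    # Page only when BOTH the long and short window exceed the fast-burn
    # threshold, so a brief spike that has already recovered wakes nobody.
    if rates.get("1h", 0.0) > 14.4 and rates.get("5m", 0.0) > 14.4:
        alerts.append("page: fast burn")
    # Ticket when sustained degradation shows in both slow-burn windows.
    if rates.get("6h", 0.0) > 6.0 and rates.get("30m", 0.0) > 6.0:
        alerts.append("ticket: slow burn")
    return alerts
```

Requiring both windows is the key design choice: the long window proves the burn is significant, the short window proves it is still happening.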
Prometheus Burn Rate Alert Rules
# Multi-window burn rate alerts for 99.9% SLO
# error_budget = 0.001 (0.1%)
groups:
  - name: error-budget
    rules:
      # Page: fast burn (14.4x over 1h AND 5m)
      - alert: ErrorBudgetFastBurn
        expr: |
          (
            rate(http_requests_total{status=~"5.."}[1h]) /
            rate(http_requests_total[1h]) > 14.4 * 0.001
          ) and (
            rate(http_requests_total{status=~"5.."}[5m]) /
            rate(http_requests_total[5m]) > 14.4 * 0.001
          )
        labels:
          severity: critical
          page: "true"
        annotations:
          summary: "Fast error budget burn — 2% of monthly budget consumed in 1h"
          runbook: "https://wiki.example.com/runbooks/error-budget-fast-burn"

      # Ticket: slow burn (6x over 6h AND 30m)
      - alert: ErrorBudgetSlowBurn
        expr: |
          (
            rate(http_requests_total{status=~"5.."}[6h]) /
            rate(http_requests_total[6h]) > 6 * 0.001
          ) and (
            rate(http_requests_total{status=~"5.."}[30m]) /
            rate(http_requests_total[30m]) > 6 * 0.001
          )
        labels:
          severity: warning
        annotations:
          summary: "Slow error budget burn — 5% consumed in 6h"
          description: "Investigate during business hours. Not a page."

      # Budget exhausted: average error rate over 30 days exceeds the budget
      - alert: ErrorBudgetExhausted
        expr: |
          avg_over_time(
            (rate(http_requests_total{status=~"5.."}[5m]) /
             rate(http_requests_total[5m]))[30d:5m]
          ) > 0.001
        labels:
          severity: critical
        annotations:
          summary: "Error budget for this month is exhausted — freeze non-critical deploys"
Writing an Error Budget Policy
An error budget without a policy is just a number. The policy defines what happens when the budget is consumed — and it must be agreed upon by engineering and product before an incident, not negotiated during one.
✅ Budget Healthy (< 50% consumed)
- → Full deployment velocity — ship features freely
- → Experimentation welcome — A/B tests, risky deploys OK
- → SRE time allocated to feature work and tooling
⚠️ Budget Warning (50–90% consumed)
- → Deploy only well-tested, low-risk changes
- → Require SRE sign-off on deployments to production
- → Postmortem required for any incident consuming > 5% of budget
- → Begin reliability sprint planning for next cycle
🚨 Budget Exhausted (100%+ consumed)
- → Freeze all non-critical deployments until next budget window
- → 100% of SRE engineering time goes to reliability: postmortems, testing, infrastructure hardening
- → Product team must agree to defer features in exchange for stability investment
- → Emergency deploys require VP+ sign-off
- → SLO review: consider whether the SLO is set at the right level
Key insight: The error budget policy isn't punishment — it's a contract. Engineering and product both agree that when reliability suffers, the team pauses feature work to fix it. This prevents the slow accumulation of technical debt where reliability keeps degrading while everyone ships features.
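The tiers above can be encoded so tooling (a deploy gate or dashboard) applies the policy mechanically rather than by debate. A minimal sketch, with thresholds mirroring the policy text (the function and tier strings are ours):

```python
def policy_tier(budget_consumed: float) -> str:
    """Map fraction of error budget consumed (0.0-1.0+) to a policy tier."""
    if budget_consumed >= 1.0:
        return "exhausted: freeze non-critical deploys, reliability work only"
    if budget_consumed >= 0.5:
        return "warning: low-risk changes only, SRE sign-off required"
    return "healthy: full deployment velocity"

print(policy_tier(0.2))   # healthy tier
print(policy_tier(0.7))   # warning tier
print(policy_tier(1.1))   # exhausted tier
```

Wiring this into CI as a pre-deploy check is what makes the freeze "automatic, not a debate."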
Choosing the Right SLI
Your error budget is only as meaningful as the SLI (Service Level Indicator) it's based on. The SLI must measure what users actually care about.
| Service Type | Good SLI | Why It Works |
|---|---|---|
| HTTP API | % requests with 2xx status in < 500ms | Captures both errors and latency in one metric |
| Storage service | % reads returning correct data in < 100ms | Durability + latency; don't just measure uptime |
| Background pipeline | % jobs completing successfully within SLA | Measures correctness + timeliness for batch work |
| Database | % queries completing in < 50ms | Threshold ratios and percentiles beat averages for DB latency |
| Pub/Sub queue | % messages delivered in < 30 seconds | End-to-end delivery SLO, not just queue depth |
Avoid SLIs nobody cares about: CPU usage, container restarts, and deployment success rate are operational metrics, not user-facing SLIs. If your service is running at 90% CPU but serving all requests correctly at low latency, your error budget is fine. Measure what users experience, not what your monitoring system sees.
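As an example of a user-facing SLI, here is the HTTP API row computed as a good-events ratio over sample request data (in production this would typically be a PromQL ratio over your request metrics; the data below is made up):

```python
def availability_latency_sli(requests: list[dict]) -> float:
    """Fraction of requests that were both successful (2xx) and fast (< 500 ms)."""
    good = sum(1 for r in requests
               if 200 <= r["status"] < 300 and r["latency_ms"] < 500)
    return good / len(requests)

sample = [
    {"status": 200, "latency_ms": 120},
    {"status": 200, "latency_ms": 650},   # too slow: counts against the budget
    {"status": 503, "latency_ms": 30},    # error: counts against the budget
    {"status": 204, "latency_ms": 90},
]
print(availability_latency_sli(sample))   # 0.5
```

A single ratio like this captures both failure modes users actually notice, which is why the table recommends combining status and latency in one SLI.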
Frequently Asked Questions
What is an error budget in SRE?
An error budget is the maximum unreliability allowed per your SLO. If your SLO is 99.9%, your error budget is 0.1% — about 43 minutes of downtime per month. It makes reliability trade-offs explicit: when you have budget, ship fast; when it's exhausted, stop and fix reliability.
How do you calculate an error budget?
Error budget = 100% - SLO. For 99.9% SLO: 0.1% budget = 43.8 min/month. In requests: 0.1% × total monthly requests = allowed failures. Track burn rate (current_error_rate / budget_error_rate) to know how fast you're consuming it.
What is error budget burn rate?
Burn rate = actual error rate / SLO error budget rate. Burn rate of 1× = on track to use exactly your budget by window end. 14.4× = you'll exhaust your monthly budget in ~2 days. Alert at 14.4× (fast burn, 1h window) and 6× (slow burn, 6h window) to catch both sudden incidents and gradual degradation.
What should happen when you burn through your error budget?
Your pre-agreed error budget policy kicks in: freeze non-critical deployments, allocate engineering time to reliability work (postmortems, testing, hardening), require sign-off for any new releases. This isn't punitive — it's the agreed trade-off between engineering and product for operating at the chosen SLO level.
What is the difference between 99.9% and 99.99% SLO?
99.9% (three nines) = 43 min/month downtime allowed. 99.99% (four nines) = 4.4 min/month. The cost to go from 99.9% to 99.99% is roughly 4× more redundancy, testing, and operational complexity. Most consumer apps are fine at 99.9%. Financial systems, payment APIs, and infrastructure primitives often justify 99.99%.