Error Budget Guide: SRE Concept, Calculation & Implementation (2026)
Error budgets are the core mechanism in SRE for making reliability vs. velocity trade-offs explicit and data-driven. Instead of arguing about whether a deploy is "safe," you check the budget. This guide explains the math, how to set burn rate alerts, and how to write an error budget policy that engineering and product both agree to.
TL;DR — Error Budget Essentials
- ✅ Error budget = 100% - SLO (e.g., 99.9% SLO → 0.1% budget = 43.8 min/month)
- ✅ Burn rate > 1× = consuming budget faster than your SLO allows
- ✅ Alert on: fast burn (14.4× over 1h) + slow burn (6× over 6h)
- ✅ When budget is exhausted: freeze deploys, focus engineering on reliability
- ✅ Error budget policy makes the freeze automatic — not a debate every time
- ✅ 99.9% (three nines) is right for most products; 99.99% costs 4× more
What Is an Error Budget?
An error budget is simply the flip side of your SLO (Service Level Objective). If you commit to 99.9% availability, you're saying: "We will be unavailable at most 0.1% of the time." That 0.1% is your error budget.
The genius of error budgets — as described in Google's Site Reliability Engineering book — is that they make an implicit trade-off explicit. Engineers want to ship fast. Operations teams want stability. Error budgets answer both: ship as fast as you want, as long as you're not burning the budget. When the budget runs out, everyone agrees in advance to stop and fix reliability.
"The error budget provides a clear, objective metric that determines how unreliable the service is allowed to be within a single quarter. This metric removes the politics from negotiations between the SREs and the product developers when deciding how much risk to allow."
— Google SRE Book
Error Budget Calculation
The formula is simple. The tricky part is choosing the right SLI (Service Level Indicator) to measure.
# Error budget formula
error_budget = 100% - SLO
# Convert to time (30-day month; the SLO table below uses an average
# 30.44-day month, which is why 99.9% works out to 43.8 minutes there)
budget_minutes = (100 - slo_percent) / 100 * 30 * 24 * 60
# Convert to allowed failures (request count)
allowed_failures = (100 - slo_percent) / 100 * total_requests
SLO Comparison Table
| SLO | Nines | Monthly Budget | Weekly Budget | Daily Budget |
|---|---|---|---|---|
| 99% | Two nines | 7h 18m | 1h 40m | 14m 24s |
| 99.9% | Three nines ✓ Most teams | 43m 49s | 10m 4s | 1m 26s |
| 99.95% | Three and a half nines | 21m 54s | 5m 2s | 43.2s |
| 99.99% | Four nines (enterprise APIs) | 4m 22s | 1m 0s | 8.6s |
| 99.999% | Five nines (financial, telco) | 26.3s | 6.1s | 0.86s |
Practical note: A 99.9% SLO gives you 43 minutes of downtime per month. A single botched deploy that takes 20 minutes to roll back uses nearly half your monthly budget. That's exactly the point — it makes the cost of incidents real and motivates better testing and deployment practices.
Error Budget Burn Rate
Burn rate measures how fast you're consuming your error budget. A burn rate of 1× means you're on track to use exactly your budget by the end of the window. 10× means you'd exhaust it in 1/10th the time.
# Burn rate formula
burn_rate = error_rate_over_window / (1 - slo)
# Example: SLO = 99.9%, current error rate = 1%
burn_rate = 0.01 / 0.001 = 10×
# At 10× burn: a 30-day budget exhausts in 3 days
# Example: current error rate = 0.05%
burn_rate = 0.0005 / 0.001 = 0.5×
# At 0.5× burn: budget will outlast the window
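The same arithmetic as runnable Python (a sketch; the helper names are ours):

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How fast the budget is being consumed: 1.0 means exactly on budget."""
    return error_rate / (1 - slo)

def days_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """At a sustained burn rate, when does the windowed budget run out?"""
    return window_days / rate

print(burn_rate(0.01, 0.999))            # ≈ 10x burn
print(days_to_exhaustion(10.0))          # 30-day budget gone in 3 days
print(burn_rate(0.0005, 0.999))          # ≈ 0.5x — budget outlasts the window
```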
Google's Multi-Window Burn Rate Alerting
Alerting on a single burn rate window has problems: a long window catches fast incidents too late, while a short window never notices slow, creeping degradation. The Google SRE Workbook therefore recommends pairing two burn rates, each checked over a long and a short window:
| Alert | Burn Rate | Window | Budget Consumed | Urgency |
|---|---|---|---|---|
| Page immediately | 14.4× | 1 hour | 2% in 1h (exhausts in ~2 days) | P0 — wake someone up |
| Page immediately | 14.4× | 5 minutes | Short but steep | P0 — very fast incident |
| Ticket (next business day) | 6× | 6 hours | 5% in 6h (slow degradation) | P1 — investigate during business hours |
| Ticket | 6× | 30 minutes | Moderate sustained degradation | P1 — monitor closely |
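To make the table concrete, here is a sketch of the multi-window decision logic (thresholds from the table; the function shape and window names are illustrative, not from the source):

```python
def evaluate_alerts(rates: dict[str, float]) -> list[str]:
    """rates maps window name -> current burn rate, e.g. {"1h": 15.0, "5m": 16.2}."""
    alerts = []
    # Page only when BOTH the long and short window exceed the fast-burn
    # threshold, so a brief spike that has already recovered wakes nobody.
    if rates.get("1h", 0.0) > 14.4 and rates.get("5m", 0.0) > 14.4:
        alerts.append("page: fast burn")
    # Ticket when sustained degradation shows in both slow-burn windows.
    if rates.get("6h", 0.0) > 6.0 and rates.get("30m", 0.0) > 6.0:
        alerts.append("ticket: slow burn")
    return alerts
```

Requiring both windows is the key design choice: the long window proves the burn is significant, the short window proves it is still happening.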
Prometheus Burn Rate Alert Rules
# Multi-window burn rate alerts for 99.9% SLO
# error_budget = 0.001 (0.1%)
groups:
  - name: error-budget
    rules:
      # Page: fast burn (14.4x over 1h AND 5m)
      - alert: ErrorBudgetFastBurn
        expr: |
          (
            rate(http_requests_total{status=~"5.."}[1h]) /
            rate(http_requests_total[1h]) > 14.4 * 0.001
          ) and (
            rate(http_requests_total{status=~"5.."}[5m]) /
            rate(http_requests_total[5m]) > 14.4 * 0.001
          )
        labels:
          severity: critical
          page: "true"
        annotations:
          summary: "Fast error budget burn — 2% of monthly budget consumed in 1h"
          runbook: "https://wiki.example.com/runbooks/error-budget-fast-burn"

      # Ticket: slow burn (6x over 6h AND 30m)
      - alert: ErrorBudgetSlowBurn
        expr: |
          (
            rate(http_requests_total{status=~"5.."}[6h]) /
            rate(http_requests_total[6h]) > 6 * 0.001
          ) and (
            rate(http_requests_total{status=~"5.."}[30m]) /
            rate(http_requests_total[30m]) > 6 * 0.001
          )
        labels:
          severity: warning
        annotations:
          summary: "Slow error budget burn — 5% consumed in 6h"
          description: "Investigate during business hours. Not a page."

      # Budget exhausted: average error rate over 30 days exceeds the budget
      - alert: ErrorBudgetExhausted
        expr: |
          avg_over_time(
            (rate(http_requests_total{status=~"5.."}[5m]) /
             rate(http_requests_total[5m]))[30d:5m]
          ) > 0.001
        labels:
          severity: critical
        annotations:
          summary: "Error budget for this month is exhausted — freeze non-critical deploys"
Writing an Error Budget Policy
An error budget without a policy is just a number. The policy defines what happens when the budget is consumed — and it must be agreed upon by engineering and product before an incident, not negotiated during one.
✅ Budget Healthy (< 50% consumed)
- → Full deployment velocity — ship features freely
- → Experimentation welcome — A/B tests, risky deploys OK
- → SRE time allocated to feature work and tooling
⚠️ Budget Warning (50–90% consumed)
- → Deploy only well-tested, low-risk changes
- → Require SRE sign-off on deployments to production
- → Postmortem required for any incident consuming > 5% of budget
- → Begin reliability sprint planning for next cycle
🚨 Budget Exhausted (100%+ consumed)
- → Freeze all non-critical deployments until next budget window
- → 100% of SRE engineering time goes to reliability: postmortems, testing, infrastructure hardening
- → Product team must agree to defer features in exchange for stability investment
- → Emergency deploys require VP+ sign-off
- → SLO review: consider whether the SLO is set at the right level
Key insight: The error budget policy isn't punishment — it's a contract. Engineering and product both agree that when reliability suffers, the team pauses feature work to fix it. This prevents the slow accumulation of technical debt where reliability keeps degrading while everyone ships features.
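The tiers above can be encoded so tooling (a deploy gate or dashboard) applies the policy mechanically rather than by debate. A minimal sketch, with thresholds mirroring the policy text (the function and tier strings are ours):

```python
def policy_tier(budget_consumed: float) -> str:
    """Map fraction of error budget consumed (0.0-1.0+) to a policy tier."""
    if budget_consumed >= 1.0:
        return "exhausted: freeze non-critical deploys, reliability work only"
    if budget_consumed >= 0.5:
        return "warning: low-risk changes only, SRE sign-off required"
    return "healthy: full deployment velocity"

print(policy_tier(0.2))   # healthy tier
print(policy_tier(0.7))   # warning tier
print(policy_tier(1.1))   # exhausted tier
```

Wiring this into CI as a pre-deploy check is what makes the freeze "automatic, not a debate."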
Choosing the Right SLI
Your error budget is only as meaningful as the SLI (Service Level Indicator) it's based on. The SLI must measure what users actually care about.
| Service Type | Good SLI | Why It Works |
|---|---|---|
| HTTP API | % requests with 2xx status in < 500ms | Captures both errors and latency in one metric |
| Storage service | % reads returning correct data in < 100ms | Durability + latency; don't just measure uptime |
| Background pipeline | % jobs completing successfully within SLA | Measures correctness + timeliness for batch work |
| Database | % queries completing in < 50ms | Threshold ratios and percentiles beat averages for DB latency |
| Pub/Sub queue | % messages delivered in < 30 seconds | End-to-end delivery SLO, not just queue depth |
Avoid SLIs nobody cares about: CPU usage, container restarts, and deployment success rate are operational metrics, not user-facing SLIs. If your service is running at 90% CPU but serving all requests correctly at low latency, your error budget is fine. Measure what users experience, not what your monitoring system sees.
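As an example of a user-facing SLI, here is the HTTP API row computed as a good-events ratio over sample request data (in production this would typically be a PromQL ratio over your request metrics; the data below is made up):

```python
def availability_latency_sli(requests: list[dict]) -> float:
    """Fraction of requests that were both successful (2xx) and fast (< 500 ms)."""
    good = sum(1 for r in requests
               if 200 <= r["status"] < 300 and r["latency_ms"] < 500)
    return good / len(requests)

sample = [
    {"status": 200, "latency_ms": 120},
    {"status": 200, "latency_ms": 650},   # too slow: counts against the budget
    {"status": 503, "latency_ms": 30},    # error: counts against the budget
    {"status": 204, "latency_ms": 90},
]
print(availability_latency_sli(sample))   # 0.5
```

A single ratio like this captures both failure modes users actually notice, which is why the table recommends combining status and latency in one SLI.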
Frequently Asked Questions
What is an error budget in SRE?
An error budget is the maximum unreliability allowed per your SLO. If your SLO is 99.9%, your error budget is 0.1% — about 43 minutes of downtime per month. It makes reliability trade-offs explicit: when you have budget, ship fast; when it's exhausted, stop and fix reliability.
How do you calculate an error budget?
Error budget = 100% - SLO. For 99.9% SLO: 0.1% budget = 43.8 min/month. In requests: 0.1% × total monthly requests = allowed failures. Track burn rate (current_error_rate / budget_error_rate) to know how fast you're consuming it.
What is error budget burn rate?
Burn rate = actual error rate / SLO error budget rate. Burn rate of 1× = on track to use exactly your budget by window end. 14.4× = you'll exhaust your monthly budget in ~2 days. Alert at 14.4× (fast burn, 1h window) and 6× (slow burn, 6h window) to catch both sudden incidents and gradual degradation.
What should happen when you burn through your error budget?
Your pre-agreed error budget policy kicks in: freeze non-critical deployments, allocate engineering time to reliability work (postmortems, testing, hardening), require sign-off for any new releases. This isn't punitive — it's the agreed trade-off between engineering and product for operating at the chosen SLO level.
What is the difference between 99.9% and 99.99% SLO?
99.9% (three nines) = 43 min/month downtime allowed. 99.99% (four nines) = 4.4 min/month. The cost to go from 99.9% to 99.99% is roughly 4× more redundancy, testing, and operational complexity. Most consumer apps are fine at 99.9%. Financial systems, payment APIs, and infrastructure primitives often justify 99.99%.