
Error Budget Guide: SRE Concept, Calculation & Implementation (2026)

Error budgets are the core mechanism in SRE for making reliability vs. velocity trade-offs explicit and data-driven. Instead of arguing about whether a deploy is "safe," you check the budget. This guide explains the math, how to set burn rate alerts, and how to write an error budget policy that engineering and product both agree to.

Updated April 2026 · 12 min read · SRE / Platform Engineering


TL;DR — Error Budget Essentials

  • ✅ Error budget = 100% - SLO (e.g., 99.9% SLO → 0.1% budget = 43.8 min/month)
  • ✅ Burn rate > 1× = consuming budget faster than your SLO allows
  • ✅ Alert on: fast burn (14.4× over 1h) + slow burn (6× over 6h)
  • ✅ When budget is exhausted: freeze deploys, focus engineering on reliability
  • ✅ Error budget policy makes the freeze automatic — not a debate every time
  • ✅ 99.9% (three nines) is right for most products; 99.99% costs roughly 4× more to achieve

What Is an Error Budget?

An error budget is simply the flip side of your SLO (Service Level Objective). If you commit to 99.9% availability, you're saying: "We will be unavailable at most 0.1% of the time." That 0.1% is your error budget.

The genius of error budgets — as described in Google's Site Reliability Engineering book — is that they make an implicit trade-off explicit. Engineers want to ship fast. Operations teams want stability. Error budgets answer both: ship as fast as you want, as long as you're not burning the budget. When the budget runs out, everyone agrees in advance to stop and fix reliability.

"The error budget provides a clear, objective metric that determines how unreliable the service is allowed to be within a single quarter. This metric removes the politics from negotiations between the SREs and the product developers when deciding how much risk to allow."

— Google SRE Book

Error Budget Calculation

The formula is simple. The tricky part is choosing the right SLI (Service Level Indicator) to measure.

# Error budget formula
error_budget = 100% - SLO

# Convert to time (30-day month)
budget_minutes = (100 - slo_percent) / 100 × 30 × 24 × 60

# Convert to allowed failures (request count)
allowed_failures = (100 - slo_percent) / 100 × total_requests
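The formulas above translate directly into code. A minimal sketch (function names are illustrative, not from any library):

```python
def budget_minutes(slo_percent: float, days: float = 30) -> float:
    """Error budget in minutes for a window of `days` days."""
    return (100 - slo_percent) / 100 * days * 24 * 60

def allowed_failures(slo_percent: float, total_requests: int) -> int:
    """Error budget expressed as a count of failed requests."""
    return round((100 - slo_percent) / 100 * total_requests)

print(round(budget_minutes(99.9), 1))      # 43.2 minutes for a 30-day month
print(allowed_failures(99.9, 10_000_000))  # 10000 failed requests allowed
```

Using a 365.25/12-day average month instead of a flat 30 days gives the 43m 49s figure in the table below; both conventions are common.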

SLO Comparison Table

| SLO | Nines | Monthly Budget | Weekly Budget | Daily Budget |
| --- | --- | --- | --- | --- |
| 99% | Two nines | 7h 18m | 1h 40m | 14m 24s |
| 99.9% | Three nines ✓ Most teams | 43m 49s | 10m 4s | 1m 26s |
| 99.95% | Three and a half nines | 21m 54s | 5m 2s | 43.2s |
| 99.99% | Four nines (enterprise APIs) | 4m 22s | 1m 0s | 8.6s |
| 99.999% | Five nines (financial, telco) | 26.3s | 6.1s | 0.86s |

Practical note: A 99.9% SLO gives you 43 minutes of downtime per month. A single botched deploy that takes 20 minutes to roll back uses nearly half your monthly budget. That's exactly the point — it makes the cost of incidents real and motivates better testing and deployment practices.


Error Budget Burn Rate

Burn rate measures how fast you're consuming your error budget. A burn rate of 1× means you're on track to use exactly your budget by the end of the window. 10× means you'd exhaust it in 1/10th the time.

# Burn rate formula
burn_rate = error_rate_over_window / (1 - slo)

# Example: SLO = 99.9%, current error rate = 1%
burn_rate = 0.01 / 0.001 = 10×
# At 10× burn: a 30-day budget exhausts in 3 days

# Example: current error rate = 0.05%
burn_rate = 0.0005 / 0.001 = 0.5×
# At 0.5× burn: budget will outlast the window
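The same arithmetic in runnable form, including the time-to-exhaustion calculation (function names are made up for illustration):

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than 'budget-neutral' the budget is burning."""
    return error_rate / (1 - slo)

def days_to_exhaustion(rate: float, window_days: float = 30) -> float:
    """At a constant burn rate, when does the window's budget run out?"""
    return window_days / rate

print(round(burn_rate(0.01, 0.999), 6))   # 10.0 — the 10x example above
print(days_to_exhaustion(10))             # 3.0 — budget gone in 3 days
```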

Google's Multi-Window Burn Rate Alerting

Alerting on a single burn-rate window has problems: with a long window, fast incidents get caught late; with a short window, slow creeping degradation goes undetected. Google's Site Reliability Workbook recommends pairing burn rates with two window lengths:

| Alert | Burn Rate | Window | Budget Consumed | Urgency |
| --- | --- | --- | --- | --- |
| Page immediately | 14.4× | 1 hour | 2% in 1h (exhausts in ~2 days) | P0 — wake someone up |
| Page immediately | 14.4× | 5 minutes | Short but steep | P0 — very fast incident |
| Ticket (next business day) | 6× | 6 hours | 5% in 6h (slow degradation) | P1 — investigate during business hours |
| Ticket | 6× | 30 minutes | Moderate sustained degradation | P1 — monitor closely |
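The key property of the multi-window scheme is that an alert fires only when both the long and the short window exceed the threshold, so a resolved incident stops paging quickly. A hypothetical sketch of that check:

```python
SLO_BUDGET = 0.001  # 99.9% SLO -> 0.1% error budget

def should_alert(long_ratio: float, short_ratio: float, factor: float) -> bool:
    """Fire only when BOTH windows burn faster than `factor` x the budget rate."""
    threshold = factor * SLO_BUDGET
    return long_ratio > threshold and short_ratio > threshold

# Fast burn: 14.4x threshold over both 1h and 5m -> page
print(should_alert(long_ratio=0.02, short_ratio=0.03, factor=14.4))   # True
# Long window still elevated, but short window has recovered -> no page
print(should_alert(long_ratio=0.02, short_ratio=0.001, factor=14.4))  # False
```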

Prometheus Burn Rate Alert Rules

# Multi-window burn rate alerts for 99.9% SLO
# error_budget = 0.001 (0.1%)

groups:
  - name: error-budget
    rules:
      # Page: fast burn (14.4x over both 1h and 5m windows).
      # sum() aggregates across label sets so the ratio is fleet-wide —
      # without it, vector matching pairs each 5xx series only with the
      # denominator series carrying the same labels, giving a ratio of 1.
      - alert: ErrorBudgetFastBurn
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h])) /
            sum(rate(http_requests_total[1h])) > 14.4 * 0.001
          ) and (
            sum(rate(http_requests_total{status=~"5.."}[5m])) /
            sum(rate(http_requests_total[5m])) > 14.4 * 0.001
          )
        labels:
          severity: critical
          page: "true"
        annotations:
          summary: "Fast error budget burn — 2% of monthly budget consumed in 1h"
          runbook: "https://wiki.example.com/runbooks/error-budget-fast-burn"

      # Ticket: slow burn (6x over 6h and 30m)
      - alert: ErrorBudgetSlowBurn
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[6h])) /
            sum(rate(http_requests_total[6h])) > 6 * 0.001
          ) and (
            sum(rate(http_requests_total{status=~"5.."}[30m])) /
            sum(rate(http_requests_total[30m])) > 6 * 0.001
          )
        labels:
          severity: warning
        annotations:
          summary: "Slow error budget burn — 5% consumed in 6h"
          description: "Investigate during business hours. Not a page."

      # Budget exhausted: mean 5m error ratio over 30 days exceeds the budget.
      # Note this averages ratios unweighted by traffic; divide total errors
      # by total requests if you need a request-weighted number.
      - alert: ErrorBudgetExhausted
        expr: |
          sum_over_time(
            (sum(rate(http_requests_total{status=~"5.."}[5m])) /
             sum(rate(http_requests_total[5m])))[30d:5m]
          ) / (30 * 24 * 12) > 0.001
        labels:
          severity: critical
        annotations:
          summary: "Error budget for this month is exhausted — freeze non-critical deploys"
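To make the exhaustion rule concrete, here is a rough Python analogue of it: average the 5-minute error ratios across the 30-day window and compare against the budget. Like the PromQL subquery above, this is an unweighted average of ratios, not a request-weighted one.

```python
def budget_exhausted(five_min_ratios: list[float], budget: float = 0.001) -> bool:
    """True when the mean 5-minute error ratio exceeds the error budget."""
    return sum(five_min_ratios) / len(five_min_ratios) > budget

# 8,640 five-minute samples in 30 days; 1% errors for one full day (288
# samples), clean otherwise.
samples = [0.01] * 288 + [0.0] * (8640 - 288)
print(budget_exhausted(samples))   # False — that one bad day burned 1/3 of budget
```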

Writing an Error Budget Policy

An error budget without a policy is just a number. The policy defines what happens when the budget is consumed — and it must be agreed upon by engineering and product before an incident, not negotiated during one.

Budget Healthy (< 50% consumed)

  • → Full deployment velocity — ship features freely
  • → Experimentation welcome — A/B tests, risky deploys OK
  • → SRE time allocated to feature work and tooling

⚠️ Budget Warning (50–90% consumed)

  • → Deploy only well-tested, low-risk changes
  • → Require SRE sign-off on deployments to production
  • → Postmortem required for any incident consuming > 5% of budget
  • → Begin reliability sprint planning for next cycle

🚨 Budget Exhausted (100%+ consumed)

  • → Freeze all non-critical deployments until next budget window
  • → 100% of SRE engineering time goes to reliability: postmortems, testing, infrastructure hardening
  • → Product team must agree to defer features in exchange for stability investment
  • → Emergency deploys require VP+ sign-off
  • → SLO review: consider whether the SLO is set at the right level

Key insight: The error budget policy isn't punishment — it's a contract. Engineering and product both agree that when reliability suffers, the team pauses feature work to fix it. This prevents the slow accumulation of technical debt where reliability keeps degrading while everyone ships features.
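Because the policy thresholds are fixed in advance, the current stage can be computed mechanically from budget consumption. An illustrative sketch mirroring the three stages above (names and messages are made up):

```python
def policy_stage(consumed_fraction: float) -> str:
    """Map fraction of error budget consumed to the agreed policy stage."""
    if consumed_fraction >= 1.0:
        return "exhausted: freeze non-critical deploys, reliability work only"
    if consumed_fraction >= 0.5:
        return "warning: low-risk changes only, SRE sign-off required"
    return "healthy: full deployment velocity"

print(policy_stage(0.3))   # healthy: full deployment velocity
print(policy_stage(0.7))   # warning: low-risk changes only, SRE sign-off required
print(policy_stage(1.2))   # exhausted: freeze non-critical deploys, reliability work only
```

Wiring a function like this into a deploy pipeline is what makes the freeze automatic rather than a debate.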


Choosing the Right SLI

Your error budget is only as meaningful as the SLI (Service Level Indicator) it's based on. The SLI must measure what users actually care about.

| Service Type | Good SLI | Why It Works |
| --- | --- | --- |
| HTTP API | % requests with 2xx status in < 500 ms | Captures both errors and latency in one metric |
| Storage service | % reads returning correct data in < 100 ms | Durability + latency; don't just measure uptime |
| Background pipeline | % jobs completing successfully within SLA | Measures correctness + timeliness for batch work |
| Database | % queries completing in < 50 ms at p95 | Latency percentiles beat averages for DB performance |
| Pub/Sub queue | % messages delivered in < 30 seconds | End-to-end delivery SLO, not just queue depth |

Avoid SLIs nobody cares about: CPU usage, container restarts, and deployment success rate are operational metrics, not user-facing SLIs. If your service is running at 90% CPU but serving all requests correctly at low latency, your error budget is fine. Measure what users experience, not what your monitoring system sees.
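As a sketch of the HTTP-API row above: a request counts as "good" only if it returned 2xx in under 500 ms, so slow successes burn budget too. The record shape here is assumed for illustration:

```python
def availability_sli(requests: list[tuple[int, float]]) -> float:
    """requests: (status_code, latency_ms) pairs; returns the good-event ratio."""
    good = sum(1 for status, ms in requests if 200 <= status < 300 and ms < 500)
    return good / len(requests)

reqs = [(200, 120.0), (200, 480.0), (200, 900.0), (500, 50.0)]
print(availability_sli(reqs))   # 0.5 — a slow 200 counts against the SLI too
```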

Frequently Asked Questions

What is an error budget in SRE?

An error budget is the maximum unreliability allowed per your SLO. If your SLO is 99.9%, your error budget is 0.1% — about 43 minutes of downtime per month. It makes reliability trade-offs explicit: when you have budget, ship fast; when it's exhausted, stop and fix reliability.

How do you calculate an error budget?

Error budget = 100% - SLO. For 99.9% SLO: 0.1% budget = 43.8 min/month. In requests: 0.1% × total monthly requests = allowed failures. Track burn rate (current_error_rate / budget_error_rate) to know how fast you're consuming it.

What is error budget burn rate?

Burn rate = actual error rate / SLO error budget rate. Burn rate of 1× = on track to use exactly your budget by window end. 14.4× = you'll exhaust your monthly budget in ~2 days. Alert at 14.4× (fast burn, 1h window) and 6× (slow burn, 6h window) to catch both sudden incidents and gradual degradation.

What should happen when you burn through your error budget?

Your pre-agreed error budget policy kicks in: freeze non-critical deployments, allocate engineering time to reliability work (postmortems, testing, hardening), require sign-off for any new releases. This isn't punitive — it's the agreed trade-off between engineering and product for operating at the chosen SLO level.

What is the difference between 99.9% and 99.99% SLO?

99.9% (three nines) = 43 min/month downtime allowed. 99.99% (four nines) = 4.4 min/month. The cost to go from 99.9% to 99.99% is roughly 4× more redundancy, testing, and operational complexity. Most consumer apps are fine at 99.9%. Financial systems, payment APIs, and infrastructure primitives often justify 99.99%.
