SLA vs SLO vs SLI: The Complete Guide to Service Level Reliability (2026)
You know your API needs to be reliable. Your customers expect it. Your boss demands it. But when someone asks you to define "reliable," things get fuzzy fast. Is 99.9% uptime good enough? Who decides? And what happens when you miss the target?
That's where SLAs, SLOs, and SLIs come in. These three concepts — Service Level Agreements, Service Level Objectives, and Service Level Indicators — form the backbone of how modern engineering teams define, measure, and maintain reliability. They originated in Google's Site Reliability Engineering (SRE) practice and have since become industry standard.
But here's the problem: most teams conflate them, confuse them, or skip straight to SLAs without doing the foundational work of defining SLIs and SLOs first. That's like signing a contract before knowing what you're promising.
This guide breaks down each concept, shows how they work together, and gives you practical frameworks for implementing them — whether you're a solo developer running a SaaS product or an SRE at a Fortune 500 company.
The 30-Second Version
Before we go deep, here's the core distinction:
- SLI (Service Level Indicator): A metric that measures a specific aspect of your service's reliability. The measurement.
- SLO (Service Level Objective): A target value or range for an SLI that your team commits to internally. The goal.
- SLA (Service Level Agreement): A contract with customers that includes consequences (usually financial) if SLOs are missed. The promise.
Think of it as a funnel:
SLI (What you measure) → SLO (What you aim for) → SLA (What you guarantee)
The SLI tells you how things are going. The SLO tells you how things should be going. The SLA tells customers what you'll do if things aren't going well enough.
Most teams work backwards — they negotiate an SLA first and then scramble to figure out what to measure. The right approach is to start from the bottom: define your SLIs, set SLOs around them, and only then formalize SLAs.
SLIs: What You Actually Measure
A Service Level Indicator is a carefully defined quantitative measure of some aspect of the level of service being provided. The key word is "carefully" — a bad SLI leads to bad SLOs, which lead to broken SLAs and unhappy users.
Characteristics of Good SLIs
Good SLIs share several properties:
They measure user experience. An SLI should reflect what your users care about, not what's convenient to measure. CPU utilization is easy to track, but users don't care about your CPU — they care about whether the page loads quickly and correctly.
They're a ratio. The most useful SLIs are expressed as a proportion of good events to total events:
SLI = (Good events / Total events) × 100%
For example: "99.3% of API requests completed successfully in under 300ms." The ratio format makes SLIs directly comparable across time periods and easy to set objectives around.
They're measured over a window. An SLI value is meaningless without a time period. "99.9% availability" could mean many things — over the last hour? Day? Month? Quarter? The window matters enormously.
They exclude noise. Good SLIs filter out events that don't reflect real user experience. If your health check endpoint gets hit 10,000 times per minute by your load balancer, those pings shouldn't dilute your latency SLI.
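As a concrete sketch, here's how an availability SLI with noise filtering might be computed from raw request records. The `Request` shape and the excluded paths are illustrative assumptions, not a standard:

```python
from dataclasses import dataclass

@dataclass
class Request:
    path: str
    status: int
    latency_ms: float

def availability_sli(requests, exclude_paths=("/healthz",)):
    """Good events / total events, ignoring load-balancer health checks."""
    counted = [r for r in requests if r.path not in exclude_paths]
    if not counted:
        return None  # no data is not the same as 100%
    good = sum(1 for r in counted if r.status < 500)
    return 100.0 * good / len(counted)

requests = [
    Request("/api/orders", 200, 120.0),
    Request("/api/orders", 503, 45.0),
    Request("/healthz", 200, 2.0),   # excluded: monitoring noise, not a user
    Request("/api/users", 200, 80.0),
]
print(round(availability_sli(requests), 2))  # 66.67 (2 good out of 3 counted)
```

Note the health-check ping neither helps nor hurts the SLI; only user-facing requests count.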
The Four Golden SLIs
Google's SRE book identifies four categories that cover most services:
1. Availability — Could the user complete their request?
Availability = (Successful requests / Total requests) × 100%
This is the most fundamental SLI. If your API returns a 500 error, the user's request failed. Typically, you'd count HTTP 5xx responses as failures and everything else (including 4xx, which are client errors) as successes.
What to watch out for: partial failures. An API that returns 200 but with incomplete data is technically "available" but not useful. Consider defining success more strictly — a request succeeds only if it returns the complete, correct response.
2. Latency — How long did the request take?
Latency SLI = (Requests < threshold / Total requests) × 100%
For example: "95% of requests completed in under 200ms." Note that average latency is a terrible SLI — it hides tail latency where the real pain lives. Use percentiles instead (p50, p95, p99).
A more sophisticated approach: define multiple latency SLIs at different thresholds. Your SLO might require 99% of requests under 500ms AND 90% under 100ms. This catches both slowdowns and tail-latency spikes.
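That multi-threshold check is simple to compute if you have raw per-request latencies. A minimal sketch (function and variable names are illustrative):

```python
def latency_sli(latencies_ms, threshold_ms):
    """Proportion of requests completing under the threshold, as a percentage."""
    if not latencies_ms:
        return None
    fast = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return 100.0 * fast / len(latencies_ms)

def check_latency_slo(latencies_ms, targets):
    """targets maps threshold_ms -> required percentage; every tier must hold."""
    return all(
        (latency_sli(latencies_ms, threshold) or 0.0) >= required
        for threshold, required in targets.items()
    )

samples = [40, 60, 80, 90, 110, 150, 450, 480, 900, 95]
# SLO: 90% of requests under 500ms AND 50% under 100ms
print(check_latency_slo(samples, {500: 90.0, 100: 50.0}))  # True
```

The single `900` outlier still passes the 500ms tier here (9/10 = 90%), but one more slow request would fail it, which is exactly the tail-latency sensitivity averages hide.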
3. Throughput — Can the system handle the load?
Throughput = Requests processed per second over a time window
Throughput SLIs matter most for batch processing systems, data pipelines, and APIs with rate limits. For request-serving systems, availability and latency usually capture throughput problems indirectly (if the system can't handle load, requests start failing or slowing down).
4. Correctness — Did the response contain the right data?
Correctness = (Correct responses / Total responses) × 100%
This is the hardest SLI to measure but often the most important. An API that returns 200 OK with stale data from cache is "available" and "fast" but not correct. Measuring correctness typically requires application-level checks: checksums, data validation, consistency probes, or synthetic monitoring that verifies response content.
SLI Specification vs. Implementation
There's an important distinction between what you want to measure (the specification) and how you measure it (the implementation):
Specification: "The proportion of valid API requests that return a successful response in under 300ms."
Implementation options:
- Application-level metrics: Your API code records latency and success/failure for every request. Most accurate, but requires instrumentation.
- Load balancer logs: Parse logs from your ALB/NLB/Cloudflare. Easy to set up, but may not capture application-level failures.
- Synthetic monitoring: A probe makes requests from the outside and measures success/latency. Captures the full user experience, but only for the specific requests you test.
- Client-side telemetry: JavaScript/mobile SDKs report real user metrics. The most accurate picture of user experience, but requires client instrumentation and introduces privacy concerns.
The best approach combines multiple implementations. Use application metrics for your primary SLIs, and validate them with synthetic monitoring to catch infrastructure-level issues that application metrics might miss.
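For the synthetic-monitoring piece, the core check is small. This sketch injects the HTTP call as a function so the logic stays testable; in practice you'd pass a real client. The names here are assumptions, not any specific tool's API:

```python
import time

def synthetic_probe(fetch, url, latency_budget_ms=300.0):
    """One synthetic check: success means a 2xx status within the latency budget.
    `fetch` is injected so real network I/O stays out of the probe logic."""
    start = time.monotonic()
    status = fetch(url)
    elapsed_ms = (time.monotonic() - start) * 1000.0
    return 200 <= status < 300 and elapsed_ms < latency_budget_ms

# A stand-in fetcher; swap in a real HTTP client in production.
print(synthetic_probe(lambda url: 200, "https://api.example.com/health"))  # True
```

A scheduler would run this once a minute per region and feed the pass/fail results into the same good-events/total-events ratio as any other SLI.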
Real-World SLI Examples
Here's what SLIs look like for different types of services:
REST API:
- Availability: Proportion of non-5xx responses, excluding health checks
- Latency: Proportion of requests with time-to-first-byte under 200ms
- Correctness: Proportion of responses matching expected schema (validated by synthetic tests)
E-commerce checkout:
- Availability: Proportion of checkout attempts that complete without server error
- Latency: Proportion of checkout flows completing within 3 seconds
- Correctness: Proportion of orders where charged amount matches cart total
Data pipeline:
- Freshness: Proportion of records processed within 5 minutes of creation
- Correctness: Proportion of records passing validation checks
- Coverage: Proportion of expected input records that appear in output
Status page / monitoring service:
- Detection time: Proportion of outages detected within 2 minutes
- Alert delivery: Proportion of alerts delivered within 30 seconds of detection
- Accuracy: Proportion of alerts that correspond to real incidents (vs. false positives)
SLOs: The Targets You Set
A Service Level Objective is a target value or range for a service level measured by an SLI. It's the line in the sand where you say "above this, we're happy; below this, we need to act."
Why SLOs Matter More Than SLAs
Here's a truth that surprises many engineers: your SLOs are more important than your SLAs.
An SLA is a business contract — it tells customers what you'll do if things go wrong. But by the time you're breaching an SLA, you've already failed. Users are angry. Credits are being issued. Trust is eroding.
SLOs, on the other hand, are your early warning system. They're always set tighter than SLAs, and they're what your team actually operates against. When you start burning through your error budget, the SLO alerts you before you breach the SLA.
Think of SLOs as the "yellow zone" and SLA breach as the "red zone." You want to spend most of your time in the green, occasionally dip into yellow, and never hit red.
Setting Good SLOs
Setting SLOs is part science, part art. Here are the principles:
Start with user expectations, not system capabilities. Don't set 99.99% availability just because your system can do it. Set it based on what users actually need. For an internal tool used during business hours, 99.5% might be perfectly fine. For a payment processing API, 99.99% might be the minimum.
Use historical data as a starting point. Look at your last 30-90 days of SLI data. Where does your service naturally sit? If you're currently at 99.95% availability, setting an SLO of 99.99% is aspirational but possibly unrealistic without significant investment. Setting 99.9% gives you a meaningful target with some breathing room.
SLOs should be achievable but aspirational. An SLO you never miss isn't useful — it means you're over-investing in reliability at the expense of feature velocity. An SLO you constantly miss isn't useful either — it means the target is unrealistic and the team will stop caring about it.
The sweet spot: you should miss your SLO occasionally (a few times per year), and each miss should trigger meaningful action.
Set different SLOs for different user segments. Your free-tier users might get 99.5% availability, while enterprise customers get 99.95%. This isn't just cost optimization — it reflects genuine differences in user needs and willingness to pay.
Document why, not just what. An SLO of "99.9% availability" is incomplete without context: Why 99.9%? What user impact does 0.1% downtime represent? What would it cost to improve to 99.95%? Who approved this target? When will it be reviewed? The "why" prevents future teams from blindly tightening SLOs without understanding the trade-offs.
SLO Windows: Rolling vs. Calendar
SLOs are measured over a time window. The two main approaches:
Rolling window (e.g., "last 30 days"): Updates continuously. Every hour, the oldest hour of data drops off and the newest hour is added. This provides a consistent, real-time view of your service's health.
Advantage: No "reset" effect. A bad incident on the 28th of the month doesn't get forgiven when the calendar flips to the 1st.
Disadvantage: A single bad incident haunts you for the entire window duration, even after you've fixed the root cause.
Calendar window (e.g., "this month" or "this quarter"): Resets at a fixed interval. April 1 starts a fresh SLO measurement regardless of what happened in March.
Advantage: Clean breaks. Teams get a fresh start, which can be motivating after a rough period.
Disadvantage: Perverse incentives. If you've already blown your monthly SLO by the 15th, what's the motivation to be careful for the rest of the month?
Recommendation: Use a rolling 30-day window for operational SLOs (what your team watches daily) and a calendar quarter for business reporting (what you report to stakeholders). The rolling window keeps urgency consistent; the quarterly view provides enough data to be statistically meaningful.
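A rolling window is easy to maintain with a fixed-length queue of per-day counts. A minimal sketch, with illustrative class and field names:

```python
from collections import deque

class RollingSLO:
    """Rolling-window availability: keeps per-day (good, total) request counts
    and automatically drops the oldest day as each new one arrives."""
    def __init__(self, window_days=30):
        self.days = deque(maxlen=window_days)

    def record_day(self, good, total):
        self.days.append((good, total))

    def availability(self):
        good = sum(g for g, _ in self.days)
        total = sum(t for _, t in self.days)
        return 100.0 * good / total if total else None

slo = RollingSLO(window_days=30)
for _ in range(29):
    slo.record_day(good=9990, total=10000)  # steady 99.9% days
slo.record_day(good=9000, total=10000)      # one bad day (90%)
print(round(slo.availability(), 2))          # 99.57
```

Because `deque(maxlen=...)` discards the oldest entry automatically, that bad day keeps dragging the window down until 30 more days have passed — the "haunting" disadvantage described above, in code.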
SLO Examples
| Service | SLI | SLO Target | Window |
|---|---|---|---|
| Payment API | Availability (non-5xx) | 99.95% | 30-day rolling |
| Payment API | Latency (p99) | 99% under 500ms | 30-day rolling |
| Search API | Availability | 99.9% | 30-day rolling |
| Search API | Latency (p50) | 95% under 100ms | 30-day rolling |
| Data pipeline | Freshness | 99% within 5 min | 30-day rolling |
| Status page | Detection time | 95% within 2 min | 30-day rolling |
Error Budgets: The Bridge Between Reliability and Velocity
The error budget is the concept that makes SLOs actionable. It answers the question: "How much unreliability can we tolerate before we need to act?"
How Error Budgets Work
If your SLO is 99.9% availability over 30 days, your error budget is 0.1% of total requests (or equivalently, approximately 43 minutes of total downtime). That's the amount of "bad" you're allowed.
Error budget = 100% - SLO target
For a 99.9% SLO over 30 days:
- Time budget: 30 × 24 × 60 × 0.001 = 43.2 minutes
- Request budget: If you serve 1M requests/day, that's 30,000 failed requests allowed over 30 days
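The two budget calculations above reduce to a few lines:

```python
def error_budget_minutes(slo_percent, window_days=30):
    """Allowed downtime in minutes for an availability SLO over the window."""
    return window_days * 24 * 60 * (100.0 - slo_percent) / 100.0

def error_budget_requests(slo_percent, requests_per_day, window_days=30):
    """Allowed failed requests over the window."""
    return requests_per_day * window_days * (100.0 - slo_percent) / 100.0

print(round(error_budget_minutes(99.9), 1))            # 43.2
print(round(error_budget_requests(99.9, 1_000_000)))   # 30000
```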
Error Budget Policies
The real power of error budgets comes from defining what happens at different budget levels:
Budget healthy (>50% remaining):
- Ship features at normal velocity
- Accept reasonable deployment risk
- Standard incident response
Budget caution (25-50% remaining):
- Increase code review rigor for changes affecting reliability
- Run extra testing for deployments
- Prioritize reliability-related tech debt
Budget critical (<25% remaining):
- Freeze non-essential feature deployments
- Dedicate engineering capacity to reliability improvements
- Conduct review of recent incidents to identify systemic issues
Budget exhausted (0% remaining):
- All engineering effort shifts to reliability
- No feature deployments until budget recovers
- Require incident reviews for every error-budget-consuming event
- Escalate to leadership if budget is exhausted for two consecutive periods
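The tiers above reduce to a simple lookup; encoding the policy in code (or config) keeps it unambiguous. The thresholds here mirror this example policy — tune them to your own risk tolerance:

```python
def budget_policy(remaining_percent):
    """Map remaining error budget to the team's operating mode."""
    if remaining_percent <= 0:
        return "exhausted: feature freeze, all effort on reliability"
    if remaining_percent < 25:
        return "critical: freeze non-essential deploys"
    if remaining_percent < 50:
        return "caution: extra review and testing for risky changes"
    return "healthy: ship at normal velocity"

print(budget_policy(80))  # healthy: ship at normal velocity
```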
Why Error Budgets Resolve the DevOps Tension
The classic conflict: developers want to ship features fast; operations wants stability. Without error budgets, this is a never-ending argument.
Error budgets reframe the conversation. Instead of "should we deploy this risky change?", the question becomes "do we have error budget to absorb a potential failure?" If the answer is yes, ship it. If no, invest in reliability first.
This gives both sides what they want. Developers get clear permission to move fast when the system is healthy. Operations gets an automatic brake when reliability drops below acceptable levels. Nobody has to argue — the number decides.
SLAs: The Business Contract
A Service Level Agreement is a formal contract between a service provider and a customer. Unlike SLOs (which are internal targets), SLAs carry legal and financial consequences.
SLA vs SLO: The Key Differences
| Aspect | SLO | SLA |
|---|---|---|
| Audience | Internal engineering team | External customers |
| Consequence of breach | Error budget policies, prioritization shifts | Financial penalties, credits, contract termination |
| Typical target | Tighter (higher bar) | Looser (lower bar) |
| Negotiability | Set by engineering team | Negotiated by business/legal |
| Transparency | Visible to the team | Published to customers |
Why SLAs Should Be Looser Than SLOs
This is critical: your SLA target should always be lower than your SLO target. If your SLO is 99.9%, your SLA should be 99.5% or 99%. This creates a buffer zone.
Why? Because SLAs have real financial consequences. If your SLO and SLA are both 99.9%, any SLO miss immediately becomes an SLA breach with credits and contractual implications. But if your SLA is 99.5% while your SLO is 99.9%, you have room to miss your internal target, fix the issue, and recover — all without triggering business consequences.
The gap between SLO and SLA is your "safety margin." Size it based on your confidence in the system and the financial impact of SLA breaches.
What Goes in an SLA
A well-structured SLA includes:
1. Service scope — What specific services are covered? "The API" is too vague. "REST API endpoints at api.example.com, excluding the /beta namespace" is specific.
2. Performance metrics — Which SLIs and what targets? Usually availability and sometimes latency. Be precise about measurement methodology, exclusion windows (planned maintenance), and what counts as downtime.
3. Measurement method — How is compliance determined? Who measures? What data source? Does the customer's experience or the provider's monitoring prevail in disputes?
4. Remedies and credits — What happens on breach? The most common approach is service credits — a percentage discount on the next invoice proportional to the severity and duration of the breach.
Common credit schedules:
| Availability | Credit |
|---|---|
| 99.0% - 99.5% | 10% |
| 95.0% - 99.0% | 25% |
| < 95.0% | 50% |
5. Exclusions — What doesn't count? Typical exclusions include scheduled maintenance windows, force majeure events, issues caused by the customer's code, and features in beta or preview.
6. Reporting and claims — How does the customer claim credits? Many SLAs require the customer to submit a claim within a specific period (30 days is common). This is not just bureaucracy — it means customers who don't actively monitor their SLAs will never claim credits, which is a financial benefit for the provider.
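The credit schedule from the table above, as a sketch (assuming 99.5% is the SLA target, so anything at or above it earns no credit):

```python
def sla_credit_percent(availability):
    """Service credit owed for the billing period, per the tiered schedule."""
    if availability >= 99.5:
        return 0   # SLA met; no credit
    if availability >= 99.0:
        return 10
    if availability >= 95.0:
        return 25
    return 50

print(sla_credit_percent(99.2))  # 10
```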
SLA Anti-Patterns
The vanity SLA: Promising 99.99% availability when your infrastructure can barely sustain 99.5%. You'll pay for it — literally — in credits.
The unmeasurable SLA: "We guarantee high availability." What does "high" mean? Without a number, neither party can determine compliance.
The fine-print SLA: Excluding so many scenarios that the SLA is effectively meaningless. "99.9% availability, excluding planned maintenance, degraded performance, partial outages, third-party dependencies, and periods of high load." What's left?
The missing-SLO SLA: Signing an SLA without having internal SLOs. If you don't know how your service actually performs, you can't make reliable promises.
Putting It All Together: The Service Level Framework
Here's how SLIs, SLOs, and SLAs work together in practice:
Step 1: Identify User Journeys
Start with your users, not your infrastructure. What are the critical paths through your service? For an API product, these might be:
- Authentication (login, token refresh)
- Core operations (the main API calls your customers make)
- Dashboard/console (web UI for configuration)
- Webhook delivery (outgoing notifications)
Step 2: Define SLIs for Each Journey
For each critical journey, define 2-3 SLIs that capture user experience:
- Authentication: Availability (non-5xx), latency (p99 under 500ms)
- Core API: Availability, latency (p95 under 200ms), correctness
- Dashboard: Availability, page load time (p95 under 2s)
- Webhooks: Delivery success rate, delivery latency
Step 3: Set SLOs Based on User Needs
For each SLI, set an objective:
- Authentication availability: 99.95%
- Core API availability: 99.9%
- Core API latency: 99% of requests under 200ms
- Dashboard availability: 99.5% (less critical — users can use API directly)
- Webhook delivery: 99.9%
Step 4: Define Error Budget Policies
For each SLO, define what happens when the error budget is consumed:
- 50% consumed: Alert the on-call team, add reliability item to sprint
- 75% consumed: Halt risky deployments, dedicate engineer to investigation
- 100% consumed: Feature freeze, all hands on reliability
Step 5: Negotiate SLAs (if applicable)
Based on your SLOs and historical performance, set external SLAs with appropriate safety margins:
- Core API SLA: 99.5% (buffer of 0.4% below the 99.9% SLO)
- SLA credit schedule: 10% for < 99.5%, 25% for < 99.0%, 50% for < 95%
- Measurement: Provider's monitoring system, 5-minute intervals, excluding scheduled maintenance with 72-hour advance notice
Step 6: Monitor and Iterate
This isn't a "set it and forget it" exercise. Review your SLOs quarterly:
- Are the targets still appropriate? User expectations change.
- Are you consistently exceeding SLOs by a wide margin? Maybe you're over-investing in reliability.
- Are you frequently missing SLOs? Maybe the targets are unrealistic, or maybe you need to invest more.
- Have your SLIs drifted from user experience? New features might change what matters.
Common Mistakes Teams Make
Mistake 1: Too Many SLIs
If everything is an SLI, nothing is. Teams that track 50 metrics as SLIs can't focus on what matters. Aim for 3-5 SLIs per service, covering the critical user journeys.
Mistake 2: SLOs That Never Get Missed
An SLO you always meet isn't providing value — it's just telling you something you already know. If your availability SLO is 99.5% and you consistently deliver 99.99%, your SLO is too easy. Tighten it until it occasionally gets missed, because that's when error budget policies kick in and drive meaningful reliability work.
Mistake 3: Treating SLOs as Maximums Instead of Targets
An SLO of 99.9% doesn't mean "achieve at least 99.9%." It means "99.9% is the right level of reliability for this service." Significantly exceeding it (e.g., achieving 99.999%) might mean you're spending too much on reliability and not enough on features. The error budget exists to be spent.
Mistake 4: Not Having Error Budget Policies
SLOs without policies are just numbers on a dashboard. The value of SLOs comes from the decisions they trigger. If burning through your error budget doesn't change anyone's behavior, what's the point?
Mistake 5: Copying Someone Else's SLOs
Google's SLOs aren't your SLOs. A startup serving 100 users has different reliability needs than a platform serving 100 million. Your SLOs should reflect your users, your business, and your capacity — not an aspirational comparison to a tech giant.
Mistake 6: Forgetting About Dependencies
Your SLA can't be stronger than your weakest dependency. If your cloud provider offers 99.95% and your database offers 99.9%, your theoretical maximum availability is approximately 99.85% (0.9995 × 0.999). Set your SLOs with dependency limits in mind.
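The compounding math for services that all sit in the serving path, assuming independent failures, is a straight multiplication of the fractions:

```python
def composite_availability(*dependency_availabilities):
    """Upper bound on availability when every dependency is in the serving
    path and failures are independent: multiply the fractions."""
    result = 1.0
    for availability in dependency_availabilities:
        result *= availability / 100.0
    return 100.0 * result

print(round(composite_availability(99.95, 99.9), 2))  # 99.85
```

Each dependency you add can only lower this number, which is why SLOs set without a dependency inventory tend to be fiction.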
Tools and Implementation
Monitoring Your SLIs
You need infrastructure to measure SLIs continuously. Common approaches:
Application-level instrumentation: Libraries like OpenTelemetry, Prometheus client libraries, or your APM tool (Datadog, New Relic, Grafana) can record request counts, latencies, and error rates directly in your application code.
External synthetic monitoring: Services like API Status Check, Better Stack, or Pingdom make real requests to your endpoints and measure availability and latency from the outside. This catches issues that internal monitoring might miss (DNS problems, CDN issues, network routing).
Log-based SLIs: Parse your access logs (ALB, Cloudflare, nginx) to compute SLIs retroactively. Less real-time than other approaches, but captures every request.
Recommendation: Use a combination. Application metrics for primary SLIs (most accurate), synthetic monitoring for validation (catches infrastructure issues), and log analysis for investigation and forensics.
Tracking Error Budgets
Most modern observability platforms support error budget tracking:
- Google Cloud SLO Monitoring: Native SLO/error budget support in Google Cloud
- Datadog SLO Tracking: Define SLOs, track burn rates, and alert on budget consumption
- Nobl9: Purpose-built SLO platform
- Grafana + Prometheus: Open-source stack with SLO dashboards and alerting
For smaller teams, a spreadsheet works. Seriously. If you're measuring availability with synthetic monitoring, you can calculate your error budget monthly in a simple tracker:
Monthly requests: 1,000,000
SLO: 99.9%
Error budget: 1,000 failed requests
Consumed this month: 347
Remaining: 653 (65.3%)
Alerting on SLO Burn Rate
Don't alert on instantaneous SLI values ("availability dropped below 99.9% in the last minute"). Instead, alert on burn rate — how fast you're consuming your error budget:
- 2% budget consumed in 1 hour → page the on-call (this rate would exhaust the budget in 2 days)
- 5% budget consumed in 6 hours → create a ticket (concerning but not urgent)
- 10% budget consumed in 3 days → review in the next team meeting
Burn-rate alerting reduces noise dramatically compared to threshold-based alerts while catching real problems faster.
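Burn rate is just consumption relative to the "even spend" pace for the window. A sketch for a 30-day window (the alert thresholds above come from policy, not from this formula):

```python
def burn_rate(budget_consumed_percent, hours_elapsed, window_hours=30 * 24):
    """Budget consumed relative to the even-spend pace. A burn rate of 1.0
    exhausts the budget exactly at the end of the window; higher is faster."""
    even_pace = 100.0 * hours_elapsed / window_hours
    return budget_consumed_percent / even_pace

# 2% of budget in 1 hour on a 30-day window:
print(round(burn_rate(2.0, 1), 1))  # 14.4
# At that rate the budget lasts 720 / 14.4 = 50 hours, about 2 days.
```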
Frequently Asked Questions
What's the difference between SLA and SLO?
An SLO is an internal target your engineering team sets for reliability — for example, "99.9% of API requests will succeed." An SLA is an external, legally binding agreement with customers that includes consequences (usually financial credits) for missing targets. SLOs should always be tighter than SLAs, creating a buffer so you can address issues before they become contractual problems.
Do I need SLAs if I have SLOs?
Not necessarily. SLOs are valuable even without SLAs. Many teams — especially those building internal services or early-stage products — operate purely on SLOs and error budgets. SLAs become important when you have enterprise customers who require contractual guarantees, or when you need a formal mechanism for accountability.
How do I choose the right SLO target?
Start with your historical performance data (last 30-90 days). Look at where your service naturally operates. Then consider user expectations: would your users notice the difference between 99.9% and 99.5%? Factor in dependencies — you can't exceed your infrastructure provider's reliability. Set a target that's achievable but challenging: you should miss it a few times per year, not every week and not never.
What is an error budget in SRE?
An error budget is the allowed amount of unreliability over a time window, calculated as 100% minus your SLO target. For a 99.9% availability SLO over 30 days, the error budget is 0.1%, or approximately 43 minutes of downtime. Error budgets make reliability decisions objective — when budget remains, teams can move fast; when it's depleted, reliability work takes priority.
Can I have different SLOs for different customers?
Yes, and you should. Different customer tiers have different reliability needs and different willingness to pay for higher reliability. A free tier might have a 99% SLO, a business tier 99.9%, and an enterprise tier 99.95%. This is reflected in your architecture (dedicated vs. shared infrastructure) and pricing.
How often should I review SLOs?
Quarterly is the most common cadence. Review whether SLOs are still appropriate given changes in user base, traffic patterns, and infrastructure. Annually, do a deeper review that considers whether your SLIs still capture user experience accurately and whether error budget policies are driving the right behaviors.
What's the relationship between SLOs and incident management?
SLOs define when an incident matters. If an issue isn't affecting your SLIs or consuming error budget, it's not impacting users — and may not need urgent response. During incident postmortems, SLO impact quantifies the severity objectively: "This incident consumed 15% of our monthly error budget" is more actionable than "the site was slow for a while."
How do SLAs work with third-party API dependencies?
Your SLA is bounded by your dependencies. If you rely on a payment provider with a 99.95% SLA and a cloud provider with a 99.99% SLA, your maximum achievable availability is roughly 99.94%. Account for this when setting SLAs. Monitor your dependencies' actual performance (not just their SLA promises) using API status monitoring, and build resilience patterns like circuit breakers and retry logic to mitigate dependency failures.
Where to Go From Here
Service levels aren't just a framework — they're a way of thinking about reliability. They shift the conversation from "is the service up?" to "are users happy?" and from "how do we prevent all failures?" to "how much failure is acceptable?"
Start small. Pick your most critical service. Define 2-3 SLIs that reflect user experience. Set SLOs based on historical data and user expectations. Write down what happens when the error budget runs low. Then measure, learn, and iterate.
Your users don't need perfection. They need consistency, transparency, and a team that knows exactly how reliable their service is — and has a plan for when it isn't.
Related Resources
- What 99.9% Uptime Really Means for Your API — Deep dive into the "nines" and their real-world impact
- How to Write an Incident Postmortem — Templates and frameworks for learning from failures
- Incident Response Playbook for Engineering Teams — Step-by-step process from detection to resolution
- Best Incident Management Software 2026 — Tools that help you manage the SLO lifecycle
- API Circuit Breaker Pattern — Build resilience when dependencies fail
- API Health Checks Implementation Guide — The foundation of availability SLIs
- API Monitoring Best Practices — How to instrument and observe your services