Microservices Monitoring: Complete Guide
Microservices trade simplicity for scalability — and monitoring complexity is the price. When a monolith is slow, you check one service. When a microservices request is slow, it might have touched 12 services, and any one of them could be the culprit. This guide covers how to instrument, monitor, and alert on microservices systems effectively.
Why Microservices Monitoring Is Harder
Monitoring a monolith is straightforward: one process, one log file, one stack trace when something goes wrong. Microservices multiply every monitoring challenge:
- Distributed requests: A single API call may fan out across 5-20 services
- Cascading failures: One slow service can degrade all services that depend on it
- Log fragmentation: Logs for a single request are spread across multiple services
- Network as failure surface: Every service-to-service call is a new failure point
- Deployment complexity: 50 services may have 50 independent deployment pipelines
- Alert volume: 50 services × 10 metrics = 500 potential alert sources
Effective microservices monitoring addresses each of these challenges with the right tools and patterns.
The Three Pillars of Observability
Modern observability is built on three complementary data types:
1. Metrics
Numerical measurements over time. Metrics are aggregated and efficient to store — ideal for dashboards and alerting. They answer "what is happening" questions: "Error rate is 5% right now."
2. Logs
Timestamped event records from each service. Logs contain context that metrics can't capture. They answer "why is this happening" questions: "The payment service is failing because the merchant_id field is null."
3. Traces
Records of requests as they flow through multiple services. Traces answer "where is the time going" questions: "This request took 3 seconds — 2.8 seconds were spent waiting for the inventory service."
All three are necessary. Metrics tell you something is wrong. Logs explain why. Traces show you where.
The Four Golden Signals
Google's SRE book introduced the four golden signals as the minimum set of metrics to monitor for any service:
1. Latency
How long it takes to serve requests. Track:
- p50 (median): Typical user experience
- p95: What 1 in 20 users experiences
- p99: What 1 in 100 users experiences (often your best customers or heaviest users)
Always track percentiles, not averages. A spike in p99 latency can indicate a serious problem even when the average looks fine.
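To make the difference concrete, here is a small Python sketch (with invented sample values) showing how a healthy-looking mean can hide a slow tail that p99 exposes:

```python
import statistics

# Invented latency samples in milliseconds: 95% fast, 5% very slow.
latencies_ms = [50] * 95 + [950] * 5

cuts = statistics.quantiles(latencies_ms, n=100)  # 99 percentile cut points
p50, p95, p99 = cuts[49], cuts[94], cuts[98]

print(f"mean={statistics.mean(latencies_ms):.0f}ms "
      f"p50={p50:.0f}ms p95={p95:.0f}ms p99={p99:.0f}ms")
# The mean (~95ms) looks fine while p99 (~950ms) reveals the slow tail.
```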
2. Traffic
How much demand is on your system. For APIs: requests per second. For message queues: messages processed per second. For batch jobs: jobs per minute.
Traffic context is essential for interpreting other metrics. A 5% error rate at 100 req/s (5 errors/s) is very different from a 5% error rate at 10,000 req/s (500 errors/s).
3. Errors
The rate of requests that fail. For HTTP services, track:
- 5xx error rate (server errors — your bugs)
- 4xx error rate (client errors — possibly integration bugs or attacks)
- Downstream error rate (errors calling other services)
4. Saturation
How "full" your service is. CPU utilization, memory usage, thread pool usage, connection pool utilization. Saturation metrics often predict failures before they happen — a service at 90% CPU capacity will start failing before it hits 100%.
Golden Signals Alert Thresholds (Starting Point)
- p99 latency > 1s: Warning | > 3s: Critical
- 5xx error rate > 1%: Warning | > 5%: Critical
- CPU saturation > 80%: Warning | > 95%: Critical
- Memory usage > 80%: Warning | > 95%: Critical
Distributed Tracing
Distributed tracing is the most powerful tool for debugging microservices. Without it, tracing a slow request across 10 services is like debugging code without a stack trace.
How Tracing Works
- Every request receives a unique trace ID at the entry point (API gateway or first service)
- The trace ID is propagated to all downstream services via HTTP headers (X-Trace-ID or the W3C Trace Context standard; a minimal sketch follows this list)
- Each service records a span — the work it did, with start time, duration, and any errors
- The tracing backend collects all spans and assembles them into a complete trace
- Engineers can view a flame graph showing the entire request path and where time was spent
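A minimal sketch of the propagation step, assuming a simple custom X-Trace-ID header rather than the full W3C Trace Context format:

```python
import uuid

TRACE_HEADER = "X-Trace-ID"  # assumed custom header; W3C uses "traceparent"

def extract_or_create_trace_id(incoming_headers: dict) -> str:
    # Reuse the caller's trace ID so all services join the same trace;
    # only the entry point (gateway or first service) generates a new one.
    return incoming_headers.get(TRACE_HEADER) or uuid.uuid4().hex

def outbound_headers(trace_id: str) -> dict:
    # Attach the same trace ID to every downstream call.
    return {TRACE_HEADER: trace_id}
```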
OpenTelemetry: The Standard
OpenTelemetry (OTel) is the vendor-neutral standard for emitting telemetry from applications. It's the right choice for instrumentation because:
- Instrument once, send to any backend (Jaeger, Zipkin, Datadog, Honeycomb, etc.)
- Auto-instrumentation for popular frameworks (Express, FastAPI, Spring, etc.) — often requires zero code changes
- Covers traces, metrics, and logs under a single SDK
- CNCF project with broad industry adoption
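A minimal manual-instrumentation sketch with the OpenTelemetry Python SDK, exporting spans to the console for illustration; a production setup would typically point an OTLP exporter at a collector instead:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Wire up the SDK: a provider, a processor, and an exporter.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")  # hypothetical service name

# Each unit of work becomes a span; nested spans share one trace ID.
with tracer.start_as_current_span("checkout") as span:
    span.set_attribute("order.items", 3)
    with tracer.start_as_current_span("reserve-inventory"):
        ...  # call the inventory service here
```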
What to Look for in Traces
- Long-tail latency: Which service is slow in the p99 traces but not the p50 traces?
- N+1 query patterns: Is a service making 100 database calls where 1 would suffice?
- Synchronous chains: Services calling services calling services — these are latency multipliers
- Error propagation: Where did the error originate in a cascading failure?
Structured Logging for Microservices
Plain text logs are nearly useless in microservices environments. When logs from 20 services are aggregated into one stream, you need machine-parseable structured logs to filter, search, and correlate.
Structured Log Format
Every log entry should be JSON with consistent fields:
Required Fields in Every Log Entry
- timestamp — ISO 8601 format, with milliseconds
- level — debug, info, warn, error
- service — which service emitted this log
- trace_id — so you can find all logs for a specific request
- request_id — the unique ID for this HTTP request
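As an illustration, here is a minimal JSON formatter using Python's standard logging module; the service name is hypothetical, and how trace_id and request_id reach the log record depends on your framework (here they are passed via extra=):

```python
import json
import logging
import sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "timestamp": datetime.now(timezone.utc).isoformat(timespec="milliseconds"),
            "level": record.levelname.lower(),
            "service": "payment-service",  # hypothetical service name
            "trace_id": getattr(record, "trace_id", None),      # set via extra=
            "request_id": getattr(record, "request_id", None),  # set via extra=
            "message": record.getMessage(),
        })

handler = logging.StreamHandler(sys.stdout)  # stdout, for the log aggregator
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("payment-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("charge failed: merchant_id is null",
            extra={"trace_id": "abc123", "request_id": "req-42"})
```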
Centralized Log Aggregation
Each service writes logs to stdout. A log aggregator (Fluentd, Logstash, Vector) collects logs from all services and forwards them to a central storage system (Elasticsearch, Loki, Datadog Logs).
Once centralized, you can search across all services by trace_id to reconstruct the complete log trail for a single user request.
Service Health Checks
Every microservice needs a health check endpoint. In a distributed system, health checks carry extra weight because:
- Load balancers need to know which instances are ready to serve traffic
- Kubernetes uses health checks to decide whether to restart or route traffic to pods
- Service meshes use health status for circuit breaking and traffic management
For microservices, each service should have:
- Liveness endpoint: Is the process alive?
- Readiness endpoint: Is the service ready to handle traffic (dependencies available)?
- Detailed endpoint (internal only): Full dependency status for debugging
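A sketch of the liveness/readiness split using FastAPI (one of the frameworks mentioned above); the endpoint paths and the dependency check are illustrative:

```python
from fastapi import FastAPI, Response, status

app = FastAPI()

async def database_is_reachable() -> bool:
    ...  # hypothetical check, e.g. SELECT 1 against the connection pool
    return True

@app.get("/livez")
async def liveness():
    # Process is up and responding; no dependency checks here.
    return {"status": "alive"}

@app.get("/readyz")
async def readiness(response: Response):
    # Report ready only when the dependencies needed to serve traffic are up.
    if await database_is_reachable():
        return {"status": "ready"}
    response.status_code = status.HTTP_503_SERVICE_UNAVAILABLE
    return {"status": "not ready", "failing": ["database"]}
```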
Monitoring Service Dependencies
In microservices, the dependencies between services create a graph where any edge can fail. You need to monitor the health of inter-service communication, not just individual services.
Circuit Breakers
A circuit breaker monitors calls to a downstream service. When the error rate exceeds a threshold, it "opens" — stopping calls to the failing service and returning errors immediately instead of waiting for timeouts. This prevents cascading failures.
Monitor circuit breaker state transitions as key events. A circuit breaker opening means a downstream service is failing — this is often a faster signal than waiting for the downstream service's own health check to fail.
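A minimal circuit breaker sketch in Python that logs every state transition as an event; the threshold and timeout values are illustrative, and production systems usually use a library (e.g. pybreaker, or resilience4j on the JVM) rather than hand-rolling this:

```python
import logging
import time

logger = logging.getLogger("circuit")

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            if time.monotonic() - self.opened_at < self.reset_timeout_s:
                raise RuntimeError("circuit open: failing fast")
            self._transition("half-open")  # probe with one real call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
                self._transition("open")
            raise
        self.failures = 0
        if self.state == "half-open":
            self._transition("closed")
        return result

    def _transition(self, new_state: str) -> None:
        # State transitions are the monitoring signal: alert when a breaker opens.
        logger.warning("circuit breaker: %s -> %s", self.state, new_state)
        self.state = new_state
```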
Service Dependency Map
Maintain (and automatically generate) a service dependency map. When Service A depends on Services B, C, and D, a failure in Service C affects Service A's reliability. Understanding these dependencies is essential for impact analysis during incidents.
Alerting Strategy for Microservices
The biggest alerting mistake in microservices is alerting on every service individually. With 50 services, you get 50x the alert volume, and most alerts are symptomatic rather than causal.
Alert at the User-Facing Layer First
Start alerts at the outermost layer of your stack — the API gateway or load balancer. If the overall system error rate exceeds 1%, that's your primary alert. The root cause might be any of 50 downstream services, but the symptom is one.
Alert on Symptoms, Not Causes
- Alert: User-facing API error rate > 1% (symptom)
- Don't alert: Database CPU > 70% (cause — only alert if it actually impacts users)
- Alert: Payment processing latency p99 > 3s (symptom users experience)
- Don't alert: Cache miss rate increased 10% (internal metric — only alert if latency degrades)
Use Alert Correlation
When a root cause failure triggers alerts in 10 dependent services, grouping these into a single incident with one pager notification is critical. Modern incident management tools (PagerDuty, Opsgenie, Better Stack) support alert correlation and deduplication.
Ownership and Escalation
Every service needs a clear owner. Alerts for Service A should route to Team A. Use on-call rotation schedules so alerts reach someone who can act on them.
Monitoring Tools for Microservices
Full Observability Platforms
- Datadog: Comprehensive APM, logs, metrics, and distributed tracing. Best-in-class UI. Expensive at scale.
- New Relic: Strong APM and full-stack observability. Per-user pricing can be cost-effective for large teams.
- Dynatrace: AI-powered anomaly detection and automated dependency mapping. Strong for complex enterprise architectures.
Open Source Stack
- Prometheus + Grafana: De facto standard for metrics collection and visualization. High operational overhead.
- Jaeger / Zipkin: Distributed tracing backends. Integrate with OpenTelemetry.
- Loki: Log aggregation built to work with Grafana. Efficient storage model.
- OpenTelemetry Collector: Vendor-neutral pipeline for collecting and routing telemetry.
Uptime and External Monitoring
- Better Stack: Uptime monitoring, incident management, and status pages. Integrates with your alerting stack.
- APIStatusCheck: Monitoring for external API dependencies your services rely on.
SLOs for Individual Services
Each microservice should have its own SLO. When a service's SLO degrades, it signals a reliability problem that affects any consumer of that service.
Standard starting SLOs for internal services:
- Availability: 99.9% is a common starting point. Remember that availability compounds: a request path through five 99.9% services can promise only about 99.5% end to end, so size internal SLOs against your user-facing target.
- Latency p99: 200ms for synchronous services; 500ms for complex operations
Track error budget burn rate per service. When a service is burning through its error budget 10x faster than normal, something unusual is happening — investigate before it cascades.
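Burn rate is simply the observed error rate divided by the error rate the SLO allows; a short sketch:

```python
def burn_rate(observed_error_rate: float, slo_target: float) -> float:
    """How many times faster than allowed we are consuming the error budget."""
    allowed_error_rate = 1.0 - slo_target  # 99.9% SLO -> 0.1% budget
    return observed_error_rate / allowed_error_rate

# A 1% error rate against a 99.9% availability SLO burns budget at 10x:
print(burn_rate(observed_error_rate=0.01, slo_target=0.999))  # -> ~10.0
```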
Monitoring During Deployments
Most incidents in microservices environments occur during or after deployments. Automate deployment monitoring:
- Pre-deployment: Baseline current error rate and latency
- During canary: Compare metrics between new version and old version
- Automatic rollback trigger: If error rate increases >2x or latency increases >50%, roll back automatically (a decision sketch follows this list)
- Post-deployment validation: Monitor for 15 minutes after full rollout before considering the deployment stable
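The rollback trigger above reduces to a comparison against the pre-deployment baseline; a sketch, assuming you can fetch these four numbers from your metrics backend:

```python
def should_roll_back(
    baseline_error_rate: float,
    canary_error_rate: float,
    baseline_p99_s: float,
    canary_p99_s: float,
) -> bool:
    # Mirror the thresholds above: error rate more than 2x the baseline,
    # or p99 latency more than 50% above the baseline.
    errors_regressed = canary_error_rate > 2 * baseline_error_rate
    latency_regressed = canary_p99_s > 1.5 * baseline_p99_s
    return errors_regressed or latency_regressed

# Example: error rate more than doubled, latency is fine -> roll back.
print(should_roll_back(0.005, 0.012, 0.200, 0.210))  # True
```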
Where to Start
If you're starting from scratch, this priority order maximizes value with minimum overhead:
- Add health check endpoints to every service — 1-2 hours per service, immediate value for load balancers and Kubernetes
- Centralize logs with trace IDs — 1 day setup. Makes debugging dramatically faster.
- Instrument the four golden signals per service — 1-2 days. Foundation for alerting.
- Set up distributed tracing with OpenTelemetry — 1-3 days. Essential for latency debugging.
- Configure SLO-based alerts at the API gateway — 1 day. Reduces alert noise, focuses on user impact.
- Add circuit breakers for critical dependencies — 1-2 weeks. Prevents cascading failures.
Key Takeaways
- Microservices monitoring requires all three observability pillars: metrics, logs, and traces
- The four golden signals (latency, traffic, errors, saturation) are the minimum per-service metrics
- Distributed tracing with OpenTelemetry is essential for debugging cross-service latency and errors
- Use structured JSON logs with trace IDs to correlate logs across services
- Alert at the user-facing layer first, not on every internal metric
- Every service needs separate liveness and readiness health check endpoints
- Circuit breakers prevent cascading failures when downstream services degrade
- Define SLOs per service and track error budget burn rates
Microservices monitoring is an investment — but it pays for itself the first time you resolve an incident in 10 minutes instead of 10 hours because you had distributed traces showing exactly which service caused the problem and why.