Microservices Monitoring: Complete Guide

Microservices trade simplicity for scalability — and monitoring complexity is the price. When a monolith is slow, you check one service. When a microservices request is slow, it might have touched 12 services, and any one of them could be the culprit. This guide covers how to instrument, monitor, and alert on microservices systems effectively.

Why Microservices Monitoring Is Harder

Monitoring a monolith is straightforward: one process, one log file, one stack trace when something goes wrong. Microservices multiply every monitoring challenge:

  • Telemetry is scattered: every service emits its own metrics and logs, often in its own format
  • A single user request can cross a dozen services, so slowness and failures have no single obvious owner
  • Failures cascade through service dependencies, turning one root cause into many downstream symptoms
  • Independent teams deploy constantly, and changes are where most incidents begin

Effective microservices monitoring addresses each of these challenges with the right tools and patterns.

The Three Pillars of Observability

Modern observability is built on three complementary data types:

1. Metrics

Numerical measurements over time. Metrics are aggregated and efficient to store — ideal for dashboards and alerting. They answer "what is happening" questions: "Error rate is 5% right now."

2. Logs

Timestamped event records from each service. Logs contain context that metrics can't capture. They answer "why is this happening" questions: "The payment service is failing because the merchant_id field is null."

3. Traces

Records of requests as they flow through multiple services. Traces answer "where is the time going" questions: "This request took 3 seconds — 2.8 seconds were spent waiting for the inventory service."

All three are necessary. Metrics alert you. Logs explain what happened. Traces show you where.

The Four Golden Signals

Google's SRE book introduced the four golden signals as the minimum set of metrics to monitor for any service:

1. Latency

How long it takes to serve requests. Track latency as percentiles (p50, p95, p99) rather than averages, and keep the latency of failed requests separate from successful ones, since fast errors can make averages look deceptively healthy. A spike in p99 latency can indicate a serious problem even when the average looks fine.

2. Traffic

How much demand is on your system. For APIs: requests per second. For message queues: messages processed per second. For batch jobs: jobs per minute.

Traffic context is essential for interpreting other metrics. A 5% error rate at 100 req/s (5 errors/s) is very different from a 5% error rate at 10,000 req/s (500 errors/s).

3. Errors

The rate of requests that fail. For HTTP services, track the rate of 5xx responses as hard failures, along with unexpected spikes in 4xx responses and requests that return 200 but carry an error in the body.

4. Saturation

How "full" your service is. CPU utilization, memory usage, thread pool usage, connection pool utilization. Saturation metrics often predict failures before they happen — a service at 90% CPU capacity will start failing before it hits 100%.

Golden Signals Alert Thresholds (Starting Point)

  • p99 latency > 1s: Warning | > 3s: Critical
  • 5xx error rate > 1%: Warning | > 5%: Critical
  • CPU saturation > 80%: Warning | > 95%: Critical
  • Memory usage > 80%: Warning | > 95%: Critical

Distributed Tracing

Distributed tracing is the most powerful tool for debugging microservices. Without it, tracing a slow request across 10 services is like debugging code without a stack trace.

How Tracing Works

  1. Every request receives a unique trace ID at the entry point (API gateway or first service)
  2. The trace ID is propagated to all downstream services via HTTP headers (X-Trace-ID or W3C Trace Context standard)
  3. Each service records a span — the work it did, with start time, duration, and any errors
  4. The tracing backend collects all spans and assembles them into a complete trace
  5. Engineers can view a flame graph showing the entire request path and where time was spent
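To make step 2 concrete, here is a minimal hand-rolled sketch of propagating an X-Trace-ID header through an Express service. The /checkout route and inventory-service URL are placeholders, the global fetch assumes Node 18+, and in practice the OpenTelemetry setup shown below handles propagation for you using the W3C traceparent header.

// Hand-rolled trace ID propagation (OpenTelemetry automates this; shown only to illustrate the mechanism).
const express = require('express');
const crypto = require('crypto');

const app = express();

// Accept an incoming trace ID, or mint one if this service is the entry point.
app.use((req, res, next) => {
  req.traceId = req.get('X-Trace-ID') || crypto.randomUUID();
  next();
});

app.get('/checkout', async (req, res) => {
  // Forward the same trace ID on every downstream call so spans can be stitched into one trace.
  const inventory = await fetch('http://inventory-service/items/sku-123', {
    headers: { 'X-Trace-ID': req.traceId },
  });
  res.json({ trace_id: req.traceId, inventory_ok: inventory.ok });
});

app.listen(3000);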

OpenTelemetry: The Standard

OpenTelemetry (OTel) is the vendor-neutral standard for emitting telemetry from applications. It's the right choice for instrumentation: it is supported by virtually every observability backend, it provides auto-instrumentation for common frameworks and libraries, and it keeps your telemetry portable if you ever change vendors.

// Node.js auto-instrumentation with OpenTelemetry
// Instrument your app with zero code changes:
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4318/v1/traces',
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
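Save this as, say, tracing.js and load it before the rest of the application (for example with node --require ./tracing.js server.js) so the auto-instrumentation can patch HTTP, Express, and database clients before they are imported. The exporter URL above assumes an OpenTelemetry Collector reachable at otel-collector:4318; point it at your own collector or backend.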

What to Look for in Traces

A handful of patterns account for most of the problems traces reveal: a single span that dominates the total duration, sequential calls that could run in parallel, repeated retries against a struggling dependency, and errors that originate several services deep but surface as a generic failure at the gateway.

Structured Logging for Microservices

Plain text logs are nearly useless in microservices environments. When logs from 20 services are aggregated into one stream, you need machine-parseable structured logs to filter, search, and correlate.

Structured Log Format

Every log entry should be JSON with consistent fields:

{ "timestamp": "2026-04-29T12:00:00.123Z", "level": "error", "service": "payment-service", "version": "1.2.3", "trace_id": "abc123def456", "span_id": "def789", "user_id": "usr_456", "request_id": "req_789", "message": "Payment processing failed", "error": { "code": "card_declined", "message": "Insufficient funds" }, "duration_ms": 234 }

Required Fields in Every Log Entry

  • timestamp: ISO 8601 with millisecond precision
  • level: debug, info, warn, or error
  • service and version: which deployment produced the entry
  • trace_id and span_id: so the entry can be joined with distributed traces
  • request_id and, where relevant, user_id: to group all entries for one request
  • message: a human-readable description of the event

Centralized Log Aggregation

Each service writes logs to stdout. A log aggregator (Fluentd, Logstash, Vector) collects logs from all services and forwards them to a central storage system (Elasticsearch, Loki, Datadog Logs).

Once centralized, you can search across all services by trace_id to reconstruct the complete log trail for a single user request.

Service Health Checks

Every microservice needs a health check endpoint. In a microservices environment this becomes even more critical: load balancers and orchestrators such as Kubernetes rely on health checks to decide where to route traffic and when to restart instances, and with dozens of services nobody is manually verifying that each one is up.

For microservices, each service should have:

  • A liveness check: the process is running and able to respond at all
  • A readiness check: the service can do useful work, meaning its critical dependencies (database, message broker, required downstream services) are reachable
  • Optionally, a deeper diagnostic check that reports per-dependency status for humans, kept separate from the checks that drive automated restarts
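A minimal sketch of what this can look like in an Express service. The /healthz and /readyz paths follow a common convention but are not required, and db.ping() stands in for whatever dependency check your service actually needs.

// Liveness and readiness endpoints for an Express service (illustrative sketch).
const express = require('express');
const app = express();

// Liveness: the process is up and able to respond.
app.get('/healthz', (req, res) => {
  res.status(200).json({ status: 'ok' });
});

// Readiness: critical dependencies are reachable.
// db.ping() is a placeholder for your real dependency check.
app.get('/readyz', async (req, res) => {
  try {
    await db.ping();
    res.status(200).json({ status: 'ready' });
  } catch (err) {
    res.status(503).json({ status: 'not ready', reason: err.message });
  }
});

app.listen(3000);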

Monitoring Service Dependencies

In microservices, the dependencies between services create a graph where any edge can fail. You need to monitor the health of inter-service communication, not just individual services.

Circuit Breakers

A circuit breaker monitors calls to a downstream service. When the error rate exceeds a threshold, it "opens" — stopping calls to the failing service and returning errors immediately instead of waiting for timeouts. This prevents cascading failures.

Circuit breaker states:

  • CLOSED: Normal operation. Calls pass through.
  • OPEN: Downstream is failing. Calls fail fast (no waiting for timeout).
  • HALF-OPEN: Testing whether the downstream has recovered. A few calls pass through to check.

Monitor circuit breaker state transitions as key events. A circuit breaker opening means a downstream service is failing — this is often a faster signal than waiting for the downstream service's own health check to fail.
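In Node.js, one option that provides both the breaker and its state-transition events is the opossum library. The sketch below wraps a call to a hypothetical inventory service; the thresholds are illustrative, and the console logging stands in for whatever emits your monitoring events.

// Circuit breaker around a downstream call using opossum (illustrative thresholds).
// Assumes Node 18+ for the global fetch.
const CircuitBreaker = require('opossum');

// Hypothetical downstream call; replace with your real client.
async function getInventory(sku) {
  const res = await fetch(`http://inventory-service/items/${sku}`);
  if (!res.ok) throw new Error(`inventory returned ${res.status}`);
  return res.json();
}

const breaker = new CircuitBreaker(getInventory, {
  timeout: 2000,                 // treat calls slower than 2s as failures
  errorThresholdPercentage: 50,  // open once half of recent calls fail
  resetTimeout: 10000,           // after 10s, go half-open and let a probe call through
});

// State transitions are the events worth recording and alerting on.
breaker.on('open', () => console.warn('inventory-service circuit opened'));
breaker.on('halfOpen', () => console.info('inventory-service circuit half-open'));
breaker.on('close', () => console.info('inventory-service circuit closed'));

// Callers use breaker.fire('sku-123') instead of calling getInventory directly.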

Service Dependency Map

Maintain (and automatically generate) a service dependency map. When Service A depends on Services B, C, and D, a failure in Service C affects Service A's reliability. Understanding these dependencies is essential for impact analysis during incidents.

Alerting Strategy for Microservices

The biggest alerting mistake in microservices is alerting on every service individually. With 50 services, you get 50x the alert volume, and most alerts are symptomatic rather than causal.

Alert at the User-Facing Layer First

Start alerts at the outermost layer of your stack — the API gateway or load balancer. If the overall system error rate exceeds 1%, that's your primary alert. The root cause might be any of 50 downstream services, but the symptom is one.

Alert on Symptoms, Not Causes

Page on what users actually feel: elevated error rates or latency at the edge of the system. Cause-level signals such as high CPU, a growing queue, or an open circuit breaker belong on dashboards or in lower-priority notifications; they explain the page, but they should rarely be the page.

Use Alert Correlation

When a root cause failure triggers alerts in 10 dependent services, grouping these into a single incident with one pager notification is critical. Modern incident management tools (PagerDuty, Opsgenie, Better Stack) support alert correlation and deduplication.

Ownership and Escalation

Every service needs a clear owner. Alerts for Service A should route to Team A. Use on-call rotation schedules so alerts reach someone who can act on them.

Monitoring Tools for Microservices

Full Observability Platforms

Commercial platforms such as Datadog, New Relic, Grafana Cloud, and Better Stack combine metrics, logs, and traces in one product, which makes cross-signal correlation easy at the cost of usage-based pricing.

Open Source Stack

The usual self-hosted combination is Prometheus and Grafana for metrics, Loki or Elasticsearch for logs, and Jaeger or Tempo for traces, fed through the OpenTelemetry Collector. Licensing is free, but you own the operational burden of running it.

Uptime and External Monitoring

Internal telemetry can't tell you that DNS, your CDN, or a load balancer is down. External uptime monitors (Better Stack, Pingdom, UptimeRobot) probe your public endpoints from outside your infrastructure and alert you when users can't reach you at all.

SLOs for Individual Services

Each microservice should have its own SLO. When a service's SLO degrades, it signals a reliability problem that affects any consumer of that service.

Standard starting SLOs for internal services:

  • Availability: 99.9% of requests succeed over a rolling 30-day window
  • Latency: 99% of requests complete in under 1 second (matching the p99 warning threshold above)

Treat these as defaults to tighten or loosen per service based on what its consumers actually need.

Track error budget burn rate per service. When a service is burning through its error budget 10x faster than normal, something unusual is happening — investigate before it cascades.
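To make burn rate concrete, here is the arithmetic as a small sketch, assuming a 99.9% availability SLO; the request counts are examples, not recommendations.

// Error-budget burn rate: how fast current errors consume the SLO's budget.
// Illustrative numbers only.
const slo = 0.999;                // 99.9% availability target
const errorBudget = 1 - slo;      // 0.1% of requests may fail over the SLO window

function burnRate(failedRequests, totalRequests) {
  const observedErrorRate = failedRequests / totalRequests;
  // 1 means the budget is being spent exactly at the sustainable pace;
  // 10 means the whole window's budget would be gone in a tenth of the window.
  return observedErrorRate / errorBudget;
}

console.log(burnRate(50, 10000)); // 0.5% errors against a 0.1% budget => burn rate of 5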

Monitoring During Deployments

Most incidents in microservices environments occur during or after deployments. Automate deployment monitoring:

  1. Pre-deployment: Baseline current error rate and latency
  2. During canary: Compare metrics between new version and old version
  3. Automatic rollback trigger: If error rate increases >2x or latency increases >50%, roll back automatically
  4. Post-deployment validation: Monitor for 15 minutes after full rollout before considering the deployment stable
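A sketch of the rollback decision in step 3, assuming you can query error rate and p99 latency for both the stable and canary versions from your metrics backend; fetchStats() is a placeholder for that query.

// Canary gate: recommend rollback if the new version regresses versus the baseline.
// fetchStats(version) is a placeholder for a query against your metrics backend,
// returning e.g. { errorRate: 0.004, p99Ms: 320 }.
async function shouldRollBack(fetchStats) {
  const baseline = await fetchStats('stable');
  const canary = await fetchStats('canary');

  const errorRegression = canary.errorRate > baseline.errorRate * 2; // error rate more than doubled
  const latencyRegression = canary.p99Ms > baseline.p99Ms * 1.5;     // p99 latency up more than 50%

  return errorRegression || latencyRegression;
}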

Where to Start

If you're starting from scratch, this priority order maximizes value with minimum overhead:

  1. Add health check endpoints to every service — 1-2 hours per service, immediate value for load balancers and Kubernetes
  2. Centralize logs with trace IDs — 1 day setup. Makes debugging dramatically faster.
  3. Instrument the four golden signals per service — 1-2 days. Foundation for alerting.
  4. Set up distributed tracing with OpenTelemetry — 1-3 days. Essential for latency debugging.
  5. Configure SLO-based alerts at the API gateway — 1 day. Reduces alert noise, focuses on user impact.
  6. Add circuit breakers for critical dependencies — 1-2 weeks. Prevents cascading failures.

Key Takeaways

  • Instrument all three pillars: metrics to alert you, logs to explain what happened, traces to show where the time went
  • Track the four golden signals (latency, traffic, errors, saturation) for every service, using percentiles rather than averages
  • Propagate a trace ID through every request and include it in every structured log entry
  • Alert on user-facing symptoms at the gateway first, and correlate downstream alerts into a single incident
  • Give every service an SLO and a clear owner, and watch deployments closely, since that is when most incidents start

Microservices monitoring is an investment — but it pays for itself the first time you resolve an incident in 10 minutes instead of 10 hours because you had distributed traces showing exactly which service caused the problem and why.

Monitor Your Microservices with Better Stack

Uptime monitoring, log management, and incident alerting for distributed systems. Get correlated visibility across all your services without the operational overhead of running your own observability stack.

Start Monitoring Free →
