Microservices Monitoring: Complete Guide
Microservices trade simplicity for scalability — and monitoring complexity is the price. When a monolith is slow, you check one service. When a microservices request is slow, it might have touched 12 services, and any one of them could be the culprit. This guide covers how to instrument, monitor, and alert on microservices systems effectively.
Why Microservices Monitoring Is Harder
Monitoring a monolith is straightforward: one process, one log file, one stack trace when something goes wrong. Microservices multiply every monitoring challenge:
- Distributed requests: A single API call may fan out across 5-20 services
- Cascading failures: One slow service can degrade all services that depend on it
- Log fragmentation: Logs for a single request are spread across multiple services
- Network as failure surface: Every service-to-service call is a new failure point
- Deployment complexity: 50 services may have 50 independent deployment pipelines
- Alert volume: 50 services × 10 metrics = 500 potential alert sources
Effective microservices monitoring addresses each of these challenges with the right tools and patterns.
The Three Pillars of Observability
Modern observability is built on three complementary data types:
1. Metrics
Numerical measurements over time. Metrics are aggregated and efficient to store — ideal for dashboards and alerting. They answer "what is happening" questions: "Error rate is 5% right now."
2. Logs
Timestamped event records from each service. Logs contain context that metrics can't capture. They answer "why is this happening" questions: "The payment service is failing because the merchant_id field is null."
3. Traces
Records of requests as they flow through multiple services. Traces answer "where is the time going" questions: "This request took 3 seconds — 2.8 seconds were spent waiting for the inventory service."
All three are necessary. Metrics tell you something is wrong. Logs explain why. Traces show you where.
The Four Golden Signals
Google's SRE book introduced the four golden signals as the minimum set of metrics to monitor for any service:
1. Latency
How long it takes to serve requests. Track:
- p50 (median): Typical user experience
- p95: What 1 in 20 users experiences
- p99: What 1 in 100 users experiences (often your best customers or heaviest users)
Always track percentiles, not averages. A spike in p99 latency can indicate a serious problem even when the average looks fine.
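To make the difference concrete, here is a small Python sketch (with invented sample values) showing how a healthy-looking mean can hide a slow tail that p99 exposes:

```python
import statistics

# Invented latency samples in milliseconds: 95% fast, 5% very slow.
latencies_ms = [50] * 95 + [950] * 5

cuts = statistics.quantiles(latencies_ms, n=100)  # 99 percentile cut points
p50, p95, p99 = cuts[49], cuts[94], cuts[98]

print(f"mean={statistics.mean(latencies_ms):.0f}ms "
      f"p50={p50:.0f}ms p95={p95:.0f}ms p99={p99:.0f}ms")
# The mean (~95ms) looks fine while p99 (~950ms) reveals the slow tail.
```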
2. Traffic
How much demand is on your system. For APIs: requests per second. For message queues: messages processed per second. For batch jobs: jobs per minute.
Traffic context is essential for interpreting other metrics. A 5% error rate at 100 req/s (5 errors/s) is very different from a 5% error rate at 10,000 req/s (500 errors/s).
3. Errors
The rate of requests that fail. For HTTP services, track:
- 5xx error rate (server errors — your bugs)
- 4xx error rate (client errors — possibly integration bugs or attacks)
- Downstream error rate (errors calling other services)
4. Saturation
How "full" your service is. CPU utilization, memory usage, thread pool usage, connection pool utilization. Saturation metrics often predict failures before they happen — a service at 90% CPU capacity will start failing before it hits 100%.
Golden Signals Alert Thresholds (Starting Point)
- p99 latency > 1s: Warning | > 3s: Critical
- 5xx error rate > 1%: Warning | > 5%: Critical
- CPU saturation > 80%: Warning | > 95%: Critical
- Memory usage > 80%: Warning | > 95%: Critical
Distributed Tracing
Distributed tracing is the most powerful tool for debugging microservices. Without it, tracing a slow request across 10 services is like debugging code without a stack trace.
How Tracing Works
- Every request receives a unique trace ID at the entry point (API gateway or first service)
- The trace ID is propagated to all downstream services via HTTP headers (X-Trace-ID or the W3C Trace Context standard; a minimal sketch follows this list)
- Each service records a span — the work it did, with start time, duration, and any errors
- The tracing backend collects all spans and assembles them into a complete trace
- Engineers can view a flame graph showing the entire request path and where time was spent
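A minimal sketch of the propagation step, assuming a simple custom X-Trace-ID header rather than the full W3C Trace Context format:

```python
import uuid

TRACE_HEADER = "X-Trace-ID"  # assumed custom header; W3C uses "traceparent"

def extract_or_create_trace_id(incoming_headers: dict) -> str:
    # Reuse the caller's trace ID so all services join the same trace;
    # only the entry point (gateway or first service) generates a new one.
    return incoming_headers.get(TRACE_HEADER) or uuid.uuid4().hex

def outbound_headers(trace_id: str) -> dict:
    # Attach the same trace ID to every downstream call.
    return {TRACE_HEADER: trace_id}
```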
OpenTelemetry: The Standard
OpenTelemetry (OTel) is the vendor-neutral standard for emitting telemetry from applications. It's the right choice for instrumentation because:
- Instrument once, send to any backend (Jaeger, Zipkin, Datadog, Honeycomb, etc.)
- Auto-instrumentation for popular frameworks (Express, FastAPI, Spring, etc.) — often requires zero code changes
- Covers traces, metrics, and logs under a single SDK
- CNCF project with broad industry adoption
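A minimal manual-instrumentation sketch with the OpenTelemetry Python SDK, exporting spans to the console for illustration; a production setup would typically point an OTLP exporter at a collector instead:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Wire up the SDK: a provider, a processor, and an exporter.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")  # hypothetical service name

# Each unit of work becomes a span; nested spans share one trace ID.
with tracer.start_as_current_span("checkout") as span:
    span.set_attribute("order.items", 3)
    with tracer.start_as_current_span("reserve-inventory"):
        ...  # call the inventory service here
```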
What to Look for in Traces
- Long-tail latency: Which service is slow in the p99 traces but not the p50 traces?
- N+1 query patterns: Is a service making 100 database calls where 1 would suffice?
- Synchronous chains: Services calling services calling services — these are latency multipliers
- Error propagation: Where did the error originate in a cascading failure?
Structured Logging for Microservices
Plain text logs are nearly useless in microservices environments. When logs from 20 services are aggregated into one stream, you need machine-parseable structured logs to filter, search, and correlate.
Structured Log Format
Every log entry should be JSON with consistent fields:
Required Fields in Every Log Entry
- timestamp — ISO 8601 format, with milliseconds
- level — debug, info, warn, error
- service — which service emitted this log
- trace_id — so you can find all logs for a specific request
- request_id — the unique ID for this HTTP request
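As an illustration, here is a minimal JSON formatter using Python's standard logging module; the service name is hypothetical, and how trace_id and request_id reach the log record depends on your framework (here they are passed via extra=):

```python
import json
import logging
import sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "timestamp": datetime.now(timezone.utc).isoformat(timespec="milliseconds"),
            "level": record.levelname.lower(),
            "service": "payment-service",  # hypothetical service name
            "trace_id": getattr(record, "trace_id", None),      # set via extra=
            "request_id": getattr(record, "request_id", None),  # set via extra=
            "message": record.getMessage(),
        })

handler = logging.StreamHandler(sys.stdout)  # stdout, for the log aggregator
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("payment-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("charge failed: merchant_id is null",
            extra={"trace_id": "abc123", "request_id": "req-42"})
```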
Centralized Log Aggregation
Each service writes logs to stdout. A log aggregator (Fluentd, Logstash, Vector) collects logs from all services and forwards them to a central storage system (Elasticsearch, Loki, Datadog Logs).
Once centralized, you can search across all services by trace_id to reconstruct the complete log trail for a single user request.
Service Health Checks
Every microservice needs a health check endpoint. In a distributed system, health checks carry extra weight because:
- Load balancers need to know which instances are ready to serve traffic
- Kubernetes uses health checks to decide whether to restart or route traffic to pods
- Service meshes use health status for circuit breaking and traffic management
For microservices, each service should have:
- Liveness endpoint: Is the process alive?
- Readiness endpoint: Is the service ready to handle traffic (dependencies available)?
- Detailed endpoint (internal only): Full dependency status for debugging
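A sketch of the liveness/readiness split using FastAPI (one of the frameworks mentioned above); the endpoint paths and the dependency check are illustrative:

```python
from fastapi import FastAPI, Response, status

app = FastAPI()

async def database_is_reachable() -> bool:
    ...  # hypothetical check, e.g. SELECT 1 against the connection pool
    return True

@app.get("/livez")
async def liveness():
    # Process is up and responding; no dependency checks here.
    return {"status": "alive"}

@app.get("/readyz")
async def readiness(response: Response):
    # Report ready only when the dependencies needed to serve traffic are up.
    if await database_is_reachable():
        return {"status": "ready"}
    response.status_code = status.HTTP_503_SERVICE_UNAVAILABLE
    return {"status": "not ready", "failing": ["database"]}
```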
Monitoring Service Dependencies
In microservices, the dependencies between services create a graph where any edge can fail. You need to monitor the health of inter-service communication, not just individual services.
Circuit Breakers
A circuit breaker monitors calls to a downstream service. When the error rate exceeds a threshold, it "opens" — stopping calls to the failing service and returning errors immediately instead of waiting for timeouts. This prevents cascading failures.
Monitor circuit breaker state transitions as key events. A circuit breaker opening means a downstream service is failing — this is often a faster signal than waiting for the downstream service's own health check to fail.
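A minimal circuit breaker sketch in Python that logs every state transition as an event; the threshold and timeout values are illustrative, and production systems usually use a library (e.g. pybreaker, or resilience4j on the JVM) rather than hand-rolling this:

```python
import logging
import time

logger = logging.getLogger("circuit")

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            if time.monotonic() - self.opened_at < self.reset_timeout_s:
                raise RuntimeError("circuit open: failing fast")
            self._transition("half-open")  # probe with one real call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
                self._transition("open")
            raise
        self.failures = 0
        if self.state == "half-open":
            self._transition("closed")
        return result

    def _transition(self, new_state: str) -> None:
        # State transitions are the monitoring signal: alert when a breaker opens.
        logger.warning("circuit breaker: %s -> %s", self.state, new_state)
        self.state = new_state
```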
Service Dependency Map
Maintain (and automatically generate) a service dependency map. When Service A depends on Services B, C, and D, a failure in Service C affects Service A's reliability. Understanding these dependencies is essential for impact analysis during incidents.
Alerting Strategy for Microservices
The biggest alerting mistake in microservices is alerting on every service individually. With 50 services, you get 50x the alert volume, and most alerts are symptomatic rather than causal.
Alert at the User-Facing Layer First
Start alerts at the outermost layer of your stack — the API gateway or load balancer. If the overall system error rate exceeds 1%, that's your primary alert. The root cause might be any of 50 downstream services, but the symptom is one.
Alert on Symptoms, Not Causes
- Alert: User-facing API error rate > 1% (symptom)
- Don't alert: Database CPU > 70% (cause — only alert if it actually impacts users)
- Alert: Payment processing latency p99 > 3s (symptom users experience)
- Don't alert: Cache miss rate increased 10% (internal metric — only alert if latency degrades)
Use Alert Correlation
When a root cause failure triggers alerts in 10 dependent services, grouping these into a single incident with one pager notification is critical. Modern incident management tools (PagerDuty, Opsgenie, Better Stack) support alert correlation and deduplication.
Ownership and Escalation
Every service needs a clear owner. Alerts for Service A should route to Team A. Use on-call rotation schedules so alerts reach someone who can act on them.
Monitoring Tools for Microservices
Full Observability Platforms
- Datadog: Comprehensive APM, logs, metrics, and distributed tracing. Best-in-class UI. Expensive at scale.
- New Relic: Strong APM and full-stack observability. Per-user pricing can be cost-effective for large teams.
- Dynatrace: AI-powered anomaly detection and automated dependency mapping. Strong for complex enterprise architectures.
Open Source Stack
- Prometheus + Grafana: De facto standard for metrics collection and visualization. High operational overhead.
- Jaeger / Zipkin: Distributed tracing backends. Integrate with OpenTelemetry.
- Loki: Log aggregation built to work with Grafana. Efficient storage model.
- OpenTelemetry Collector: Vendor-neutral pipeline for collecting and routing telemetry.
Uptime and External Monitoring
- Better Stack: Uptime monitoring, incident management, and status pages. Integrates with your alerting stack.
- APIStatusCheck: Monitoring for external API dependencies your services rely on.
SLOs for Individual Services
Each microservice should have its own SLO. When a service's SLO degrades, it signals a reliability problem that affects any consumer of that service.
Standard starting SLOs for internal services:
- Availability: 99.9% is a common starting point. Remember that availability compounds: a request path through five 99.9% services can promise only about 99.5% end to end, so size internal SLOs against your user-facing target.
- Latency p99: 200ms for synchronous services; 500ms for complex operations
Track error budget burn rate per service. When a service is burning through its error budget 10x faster than normal, something unusual is happening — investigate before it cascades.
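Burn rate is simply the observed error rate divided by the error rate the SLO allows; a short sketch:

```python
def burn_rate(observed_error_rate: float, slo_target: float) -> float:
    """How many times faster than allowed we are consuming the error budget."""
    allowed_error_rate = 1.0 - slo_target  # 99.9% SLO -> 0.1% budget
    return observed_error_rate / allowed_error_rate

# A 1% error rate against a 99.9% availability SLO burns budget at 10x:
print(burn_rate(observed_error_rate=0.01, slo_target=0.999))  # -> ~10.0
```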
Monitoring During Deployments
Most incidents in microservices environments occur during or after deployments. Automate deployment monitoring:
- Pre-deployment: Baseline current error rate and latency
- During canary: Compare metrics between new version and old version
- Automatic rollback trigger: If error rate increases >2x or latency increases >50%, roll back automatically (a decision sketch follows this list)
- Post-deployment validation: Monitor for 15 minutes after full rollout before considering the deployment stable
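The rollback trigger above reduces to a comparison against the pre-deployment baseline; a sketch, assuming you can fetch these four numbers from your metrics backend:

```python
def should_roll_back(
    baseline_error_rate: float,
    canary_error_rate: float,
    baseline_p99_s: float,
    canary_p99_s: float,
) -> bool:
    # Mirror the thresholds above: error rate more than 2x the baseline,
    # or p99 latency more than 50% above the baseline.
    errors_regressed = canary_error_rate > 2 * baseline_error_rate
    latency_regressed = canary_p99_s > 1.5 * baseline_p99_s
    return errors_regressed or latency_regressed

# Example: error rate more than doubled, latency is fine -> roll back.
print(should_roll_back(0.005, 0.012, 0.200, 0.210))  # True
```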
Where to Start
If you're starting from scratch, this priority order maximizes value with minimum overhead:
- Add health check endpoints to every service — 1-2 hours per service, immediate value for load balancers and Kubernetes
- Centralize logs with trace IDs — 1 day setup. Makes debugging dramatically faster.
- Instrument the four golden signals per service — 1-2 days. Foundation for alerting.
- Set up distributed tracing with OpenTelemetry — 1-3 days. Essential for latency debugging.
- Configure SLO-based alerts at the API gateway — 1 day. Reduces alert noise, focuses on user impact.
- Add circuit breakers for critical dependencies — 1-2 weeks. Prevents cascading failures.
Key Takeaways
- Microservices monitoring requires all three observability pillars: metrics, logs, and traces
- The four golden signals (latency, traffic, errors, saturation) are the minimum per-service metrics
- Distributed tracing with OpenTelemetry is essential for debugging cross-service latency and errors
- Use structured JSON logs with trace IDs to correlate logs across services
- Alert at the user-facing layer first, not on every internal metric
- Every service needs separate liveness and readiness health check endpoints
- Circuit breakers prevent cascading failures when downstream services degrade
- Define SLOs per service and track error budget burn rates
Microservices monitoring is an investment — but it pays for itself the first time you resolve an incident in 10 minutes instead of 10 hours because you had distributed traces showing exactly which service caused the problem and why.