Distributed Tracing: Complete Guide for Modern Applications (2026)
When a request spans five microservices and takes 800ms, distributed tracing tells you exactly which service is responsible for 600ms of that latency. Without it, you're guessing. This guide covers everything from core concepts to production implementation.
The Three Pillars of Observability
Logs: what happened
Metrics: how healthy is it
Traces: how did it flow
What Is Distributed Tracing?
Distributed tracing tracks a request's journey through multiple services. In a monolith, a single log file (or stack trace) shows you what happened. In a microservices architecture with 10-100 services, there is no single log — requests fork across services, hit databases, and call third-party APIs. Tracing stitches these together into a single coherent story.
The concept originated at Google, described in their 2010 Dapper paper. Jaeger (Uber), Zipkin (Twitter), and eventually the OpenTelemetry project all trace their lineage back to that paper.
Core Concepts
Trace
The complete record of a single request from entry to exit. A trace has a unique trace ID and contains all the spans generated by that request across all services.
Span
A single unit of work within a trace. Each service call, database query, or external API call creates a span. Spans have: name, trace ID, span ID, parent span ID, start/end timestamps, and optional attributes.
Trace ID
A globally unique 128-bit identifier generated when a request first enters the system. It is propagated to every downstream service via HTTP headers (W3C traceparent or B3 format).
Context Propagation
The mechanism for passing trace context between services. The calling service injects trace headers into outgoing requests; the receiving service extracts them and creates child spans linked to the same trace.
Sampling
Tracing every request at scale is expensive. Sampling reduces data volume by only recording a percentage of traces (e.g., 1% head-based) or all traces that match certain criteria (e.g., errors, high latency — tail-based sampling).
How Distributed Tracing Works
Let's walk through what happens when a user makes a purchase on an e-commerce site with microservices:
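Concretely, the spans recorded for that purchase might look like the sketch below. The service names, IDs, and timings are illustrative, but the structure is what every trace shares: one trace ID for the whole request, and a parent span ID linking each span to its caller.
// Illustrative spans for a single purchase request (hypothetical services)
const traceId = '4bf92f3577b34da6a3ce929d0e0e4736'; // shared by every span below
const spans = [
  { service: 'api-gateway',       name: 'POST /checkout',      spanId: 'a1', parentSpanId: null, durationMs: 800 },
  { service: 'order-service',     name: 'create-order',        spanId: 'b2', parentSpanId: 'a1', durationMs: 780 },
  { service: 'inventory-service', name: 'reserve-inventory',   spanId: 'c3', parentSpanId: 'b2', durationMs: 60 },
  { service: 'payment-service',   name: 'charge-payment',      spanId: 'd4', parentSpanId: 'b2', durationMs: 600 },
  { service: 'payment-service',   name: 'POST api.stripe.com', spanId: 'e5', parentSpanId: 'd4', durationMs: 550 },
];
// A trace backend reconstructs these into a waterfall: the 800ms request is
// dominated by charge-payment, which is mostly waiting on the external call.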
Trace Context Headers (W3C traceparent)
traceparent: 00-{traceId}-{spanId}-{flags}
Example: traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
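Auto-instrumentation injects and extracts this header for you, but the mechanics are worth seeing once. Here is a minimal manual sketch using the OpenTelemetry JavaScript API; it assumes a W3C propagator is registered (the Node SDK shown later does this by default), and the service and span names are illustrative.
const { context, propagation, trace } = require('@opentelemetry/api');

// Calling service: inject the active trace context into outgoing headers
const headers = {};
propagation.inject(context.active(), headers);
// headers.traceparent now looks like '00-<traceId>-<spanId>-01'

// Receiving service: extract the context and start a child span linked to it
function handleRequest(req) {
  const parentCtx = propagation.extract(context.active(), req.headers);
  const span = trace.getTracer('order-service').startSpan('create-order', undefined, parentCtx);
  // ... handle the request ...
  span.end();
}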
Distributed Tracing vs Logging vs Metrics
| Dimension | Logs | Metrics | Traces |
|---|---|---|---|
| Question answered | What happened in this service? | Is the system healthy? | How did this request flow? |
| Granularity | Per-event, per-service | Aggregated over time | Per-request, cross-service |
| Best for | Debugging specific errors | Alerting on thresholds | Latency root cause analysis |
| Cardinality | High (unique per event) | Low (pre-aggregated) | High (unique per request) |
| Cost at scale | High (storage) | Low | Medium (sampling helps) |
The three pillars are complementary. Production observability needs all three. Add trace IDs to your structured logs — then when a trace shows a slow span, you can click through to the corresponding log lines.
Top Distributed Tracing Tools (2026)
OpenTelemetry
Instrumentation standard
The CNCF standard for generating traces, metrics, and logs. Vendor-neutral SDKs for 11+ languages. The de facto standard for new instrumentation.
Best for: All new projects — use as the instrumentation layer, then export to any backend
Jaeger
Trace backend (open source)
CNCF graduated project from Uber. Stores and visualizes traces, shows service dependency graphs, supports Cassandra/Elasticsearch/Badger storage.
Best for: Self-hosted distributed tracing with rich visualization
Zipkin
Trace backend (open source)
The original open-source distributed tracing system, from Twitter. Simpler than Jaeger, ideal for smaller deployments. Has broad ecosystem support.
Best for: Small-to-medium deployments, teams already invested in Zipkin
Grafana Tempo
Trace backend (open source)
High-scale, cost-effective trace storage from Grafana Labs. Pairs with Prometheus and Loki for the full Grafana observability stack. Object storage backend (S3-compatible).
Best for: Teams already using Grafana, cost-sensitive large-scale tracing
Honeycomb
Managed observability platform
Column-oriented event storage with a powerful query UI for exploring traces. Built for high-cardinality data. First mover in observability-driven development.
Best for: Engineering teams that need fast, interactive trace exploration
AWS X-Ray
Managed (AWS-native)
AWS-native tracing that integrates with Lambda, ECS, API Gateway, and other AWS services. Automated instrumentation for AWS SDK calls.
Best for: AWS-centric architectures that want zero-config tracing for AWS services
Implementing Distributed Tracing with OpenTelemetry
OpenTelemetry is the recommended instrumentation approach in 2026. Instrument once, export anywhere. Here's the pattern for a Node.js service:
Step 1: Install the SDK
# Install OpenTelemetry packages
npm install @opentelemetry/sdk-node
npm install @opentelemetry/auto-instrumentations-node
npm install @opentelemetry/exporter-trace-otlp-http
Step 2: Initialize tracing (tracing.js — load before your app)
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');

// Configure the SDK: service name, OTLP exporter, and auto-instrumentation
const sdk = new NodeSDK({
  serviceName: 'payment-service',
  traceExporter: new OTLPTraceExporter({
    url: 'http://jaeger:4318/v1/traces',
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
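To see traces, the exporter URL above needs a backend listening for OTLP. For local experiments, one common choice is Jaeger's all-in-one image; the image name and ports below are the commonly documented defaults, so check the Jaeger docs for your version. Remember that tracing.js must load before your application code.
# Run a local Jaeger with OTLP ingest (4318) and the UI (16686)
docker run --rm -p 16686:16686 -p 4318:4318 jaegertracing/all-in-one:latest

# Start your app with tracing initialized first (server.js stands in for your entry point)
node --require ./tracing.js server.js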
Step 3: Add custom spans for business logic
const { trace, SpanStatusCode } = require('@opentelemetry/api');

async function processPayment(orderId, amount) {
  const tracer = trace.getTracer('payment-service');
  const span = tracer.startSpan('processPayment');
  span.setAttributes({ 'order.id': orderId, 'payment.amount': amount });
  try {
    const result = await stripeChargeCard(amount); // your own payment call
    span.setStatus({ code: SpanStatusCode.OK });
    return result;
  } catch (err) {
    span.recordException(err);
    span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
    throw err;
  } finally {
    span.end();
  }
}
Auto-instrumentation covers the basics
OpenTelemetry's auto-instrumentation automatically traces HTTP requests, database calls (pg, mysql2, redis), gRPC, and popular frameworks (Express, Fastify, Koa, NestJS) without any code changes. Add custom spans only for business-level operations that matter for debugging.
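If the default set is noisier than you want, you can tune it where the instrumentations are registered. A sketch, assuming the Step 2 setup; the fs instrumentation shown is one that teams often disable because it emits many low-value spans:
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');

// Keep the defaults but switch off filesystem spans
const instrumentations = [
  getNodeAutoInstrumentations({
    '@opentelemetry/instrumentation-fs': { enabled: false },
  }),
];
// Pass this array as the `instrumentations` option of the NodeSDK from Step 2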
Sampling Strategies
Tracing 100% of requests is expensive. At 10,000 req/s, 100% sampling generates terabytes of trace data daily. You have three options:
Head-based sampling
Decide at the start of a trace whether to record it. Simple (e.g., "sample 1%") but may miss rare slow or error traces. Good for high-volume, lower-criticality services.
Tail-based sampling
Buffer all spans, then decide to keep a trace after it completes. Can retain 100% of error traces and high-latency outliers while dropping normal traffic. Requires a sampling proxy (OTel Collector, Honeycomb Refinery).
Probabilistic + rule-based
Combine sampling rates with rules: always trace errors and anything over 500ms; sample 0.1% of successful fast requests. Best practice for most production systems.
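As a starting point, here is what the head-based case looks like in the Node SDK from Step 2. This is a sketch; tail-based and rule-based sampling are usually configured outside the application, for example in an OpenTelemetry Collector.
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { ParentBasedSampler, TraceIdRatioBasedSampler } = require('@opentelemetry/sdk-trace-base');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');

// Head-based sampling: the root service keeps ~1% of traces, and child services
// follow the parent's decision so a trace is never half-recorded.
const sdk = new NodeSDK({
  serviceName: 'payment-service',
  sampler: new ParentBasedSampler({ root: new TraceIdRatioBasedSampler(0.01) }),
  traceExporter: new OTLPTraceExporter({ url: 'http://jaeger:4318/v1/traces' }),
  instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();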
Distributed Tracing Best Practices
Use W3C traceparent headers
W3C traceparent is the established standard for trace context propagation. Prefer it over legacy formats (B3, X-B3-*) for new services. All major OpenTelemetry SDKs support it by default.
Inject trace IDs into structured logs
Add trace_id and span_id to every log line. This lets you pivot from a trace waterfall to the raw log lines in one click. Most logging libraries (pino, winston, structlog) support this with a middleware.
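With pino, for example, this can be done through its mixin option and the OpenTelemetry API. A sketch; logger auto-instrumentation packages can also add these fields for you:
const pino = require('pino');
const { trace } = require('@opentelemetry/api');

// Merge trace_id / span_id from the currently active span into every log line
const logger = pino({
  mixin() {
    const span = trace.getActiveSpan();
    if (!span) return {};
    const { traceId, spanId } = span.spanContext();
    return { trace_id: traceId, span_id: spanId };
  },
});

logger.info({ order_id: 'ord_123' }, 'payment charge authorized');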
Name spans with operation semantics, not code structure
Use "charge-payment" not "PaymentService.chargeCard". Use "get-user-profile" not "UserRepository.findById". Operational names make traces readable without code context.
Add business-relevant attributes to spans
Tag spans with user.id, order.id, tenant.id, product.sku. These enable filtering traces by business entity — far more useful than filtering by hostname.
Set error status codes explicitly
When your code catches an exception inside a span, call span.recordException(err) and span.setStatus(ERROR). Trace UIs surface error spans differently — you'll find them in seconds instead of grepping logs.
Start with auto-instrumentation, add custom spans incrementally
Auto-instrumentation handles HTTP, DB, and framework spans. Add manual spans for business operations over time as you learn where debugging gaps are. Don't over-instrument upfront.
Frequently Asked Questions
What is distributed tracing?
Distributed tracing is an observability technique that tracks requests as they flow through multiple services in a distributed system. Each request is assigned a unique trace ID, and each service adds a span (a timed operation) to record what it did. You can then visualize the full request path, see where latency occurs, and identify failures across service boundaries.
What is the difference between distributed tracing and logging?
Logging records discrete events within a single service (e.g., "user authenticated", "database query took 50ms"). Distributed tracing correlates those events across multiple services for a single request, using a shared trace ID. Logs answer "what happened?"; tracing answers "how did this request flow through the system and where did it slow down?". The two are complementary — use structured logs with trace IDs injected so you can pivot from trace to logs.
What is OpenTelemetry and how does it relate to distributed tracing?
OpenTelemetry (OTel) is the CNCF standard for instrumenting applications to collect traces, metrics, and logs. It provides vendor-neutral SDKs for 11+ languages that generate trace data in a standardized format. You instrument your code once with OTel, then export to any compatible backend (Jaeger, Zipkin, Tempo, Honeycomb, Datadog, etc.). OpenTelemetry has effectively replaced vendor-specific instrumentation SDKs as the industry default.
Jaeger vs Zipkin — which should I use?
Jaeger is generally the better choice for new projects. It has better scalability (Cassandra/Elasticsearch backends), a richer UI with service dependency graphs, native OpenTelemetry support, and is a CNCF graduated project originally built at Uber. Zipkin is older, simpler, and may be appropriate for smaller deployments already using it. Both accept OpenTelemetry data. If starting fresh, use Jaeger or Grafana Tempo, or a managed observability platform such as Honeycomb.
What is a trace ID and span in distributed tracing?
A trace ID is a unique identifier assigned to a request when it enters the system. It is propagated via HTTP headers (e.g., W3C traceparent) to every downstream service. A span represents a single unit of work within a trace — it has a name, start time, duration, and optional attributes/events. Spans form a parent-child hierarchy: the root span is the initial request, and each downstream call creates a child span. Together, they form a trace — a complete picture of the request lifecycle.
Related Guides
OpenTelemetry Guide
Complete OTel setup — traces, metrics, logs
Kubernetes Monitoring
Monitor k8s clusters and pods
SLA vs SLO vs SLI
Reliability targets explained
MTTR/MTTD/MTBF Guide
Incident recovery metrics
On-Call Management Guide
Incident response best practices
Best APM Tools 2026
Top application performance monitoring tools