
Distributed Tracing: Complete Guide for Modern Applications (2026)

When a request spans five microservices and takes 800ms, distributed tracing tells you exactly which service is responsible for 600ms of that latency. Without it, you're guessing. This guide covers everything from core concepts to production implementation.

The Three Pillars of Observability

Logs: what happened
Metrics: how healthy is it
Traces: how did it flow

What Is Distributed Tracing?

Distributed tracing tracks a request's journey through multiple services. In a monolith, a single log file (or stack trace) shows you what happened. In a microservices architecture with 10-100 services, there is no single log — requests fork across services, hit databases, and call third-party APIs. Tracing stitches these together into a single coherent story.

The concept originated at Google, described in their 2010 Dapper paper. Jaeger (Uber), Zipkin (Twitter), and eventually the OpenTelemetry project all trace their lineage back to that paper.

Core Concepts

Trace

The complete record of a single request from entry to exit. A trace has a unique trace ID and contains all the spans generated by that request across all services.

Span

A single unit of work within a trace. Each service call, database query, or external API call creates a span. Spans have: name, trace ID, span ID, parent span ID, start/end timestamps, and optional attributes.

Trace ID

A globally unique 128-bit identifier generated when a request first enters the system. It is propagated to every downstream service via HTTP headers (W3C traceparent or B3 format).

Context Propagation

The mechanism for passing trace context between services. The calling service injects trace headers into outgoing requests; the receiving service extracts them and creates child spans linked to the same trace.
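Auto-instrumentation normally does this for you. As a rough sketch of what inject and extract look like with the OpenTelemetry JavaScript API (assuming an SDK like the one in the implementation section below has been started; the 'order-service' and 'handle-checkout' names are illustrative, and the receiving side is simulated with the same headers object):

const { context, propagation, trace } = require('@opentelemetry/api');

// Calling side: copy the active trace context into the outgoing request's headers
const outgoingHeaders = {};
propagation.inject(context.active(), outgoingHeaders);

// Receiving side (simulated with the same headers object): rebuild the parent context
const parentCtx = propagation.extract(context.active(), outgoingHeaders);

// Start a child span linked to the caller's trace
const tracer = trace.getTracer('order-service');
const span = tracer.startSpan('handle-checkout', undefined, parentCtx);
span.end();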

Sampling

Tracing every request at scale is expensive. Sampling reduces data volume by only recording a percentage of traces (e.g., 1% head-based) or all traces that match certain criteria (e.g., errors, high latency — tail-based sampling).

How Distributed Tracing Works

Let's walk through what happens when a user makes a purchase on an e-commerce site with microservices:

1. API Gateway: Receives POST /checkout — generates a trace ID (e.g., abc123) and the root span. Injects traceparent: 00-abc123-span001-01 into the request headers before forwarding.
2. Order Service: Extracts the trace ID from the header. Creates a child span (span002, parent: span001). Calls Inventory Service and Payment Service in parallel.
3. Inventory Service: Creates a child span (span003, parent: span002). Queries PostgreSQL — creates a nested span (span004) for the DB call. Completes in 45ms.
4. Payment Service: Creates a child span (span005, parent: span002). Calls the Stripe API — creates a nested span (span006). Stripe takes 380ms — this is your latency culprit.
5. Trace Backend: All spans are sent to Jaeger/Tempo. You can now see the full waterfall: 430ms total, 380ms in Stripe, 45ms in the DB. The bottleneck is obvious.

Trace Context Headers (W3C traceparent)

traceparent: 00-{traceId}-{spanId}-{flags}

Example: traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
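The four dash-separated fields are the version, the trace ID, the parent span ID, and the trace flags. A minimal sketch of pulling the example header above apart in Node.js:

const traceparent = '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01';
// Fields: version, trace ID (32 hex chars / 128-bit), parent span ID (16 hex chars / 64-bit), flags
const [version, traceId, parentSpanId, traceFlags] = traceparent.split('-');
console.log(traceId);     // 4bf92f3577b34da6a3ce929d0e0e4736
console.log(traceFlags);  // '01' means the sampled flag is set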


Distributed Tracing vs Logging vs Metrics

Dimension         | Logs                           | Metrics                | Traces
Question answered | What happened in this service? | Is the system healthy? | How did this request flow?
Granularity       | Per-event, per-service         | Aggregated over time   | Per-request, cross-service
Best for          | Debugging specific errors      | Alerting on thresholds | Latency root cause analysis
Cardinality       | High (unique per event)        | Low (pre-aggregated)   | High (unique per request)
Cost at scale     | High (storage)                 | Low                    | Medium (sampling helps)

The three pillars are complementary. Production observability needs all three. Add trace IDs to your structured logs — then when a trace shows a slow span, you can click through to the corresponding log lines.

Top Distributed Tracing Tools (2026)

OpenTelemetry

Instrumentation standard
Open Source

The CNCF standard for generating traces, metrics, and logs. Vendor-neutral SDKs for 11+ languages. The de facto standard for new instrumentation.

Best for: All new projects — use as the instrumentation layer, then export to any backend

Jaeger

Trace backend (open source)
Open Source

CNCF graduated project from Uber. Stores and visualizes traces, shows service dependency graphs, supports Cassandra/Elasticsearch/Badger storage.

Best for: Self-hosted distributed tracing with rich visualization

Zipkin

Trace backend (open source)
Open Source

Original open-source distributed tracing system from Twitter. Simpler than Jaeger, ideal for smaller deployments. Has broad ecosystem support.

Best for: Small-to-medium deployments, teams already invested in Zipkin

Grafana Tempo

Trace backend (open source)
Open Source

High-scale, cost-effective trace storage from Grafana Labs. Pairs with Prometheus and Loki for the full Grafana observability stack. Object storage backend (S3-compatible).

Best for: Teams already using Grafana, cost-sensitive large-scale tracing

Honeycomb

Managed observability platform
Managed

Column-oriented event storage with powerful query UI for exploring traces. Built for high-cardinality data. First-mover in observability-driven development.

Best for: Engineering teams that need fast, interactive trace exploration

AWS X-Ray

Managed (AWS-native)
Managed

AWS-native tracing that integrates with Lambda, ECS, API Gateway, and other AWS services. Automated instrumentation for AWS SDK calls.

Best for: AWS-centric architectures that want zero-config tracing for AWS services

Implementing Distributed Tracing with OpenTelemetry

OpenTelemetry is the recommended instrumentation approach in 2026. Instrument once, export anywhere. Here's the pattern for a Node.js service:

Step 1: Install the SDK

# Install the OpenTelemetry packages (the api package is needed for the custom spans in Step 3)
npm install @opentelemetry/api
npm install @opentelemetry/sdk-node
npm install @opentelemetry/auto-instrumentations-node
npm install @opentelemetry/exporter-trace-otlp-http

Step 2: Initialize tracing (tracing.js — load before your app)

const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');

const sdk = new NodeSDK({
  serviceName: 'payment-service',
  // Export spans over OTLP/HTTP to the trace backend (Jaeger here)
  traceExporter: new OTLPTraceExporter({
    url: 'http://jaeger:4318/v1/traces',
  }),
  // Auto-instrument HTTP, database clients, and supported frameworks
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

Step 3: Add custom spans for business logic

const { trace, SpanStatusCode } = require('@opentelemetry/api');

async function processPayment(orderId, amount) {
  const tracer = trace.getTracer('payment-service');
  const span = tracer.startSpan('processPayment');
  // Business-level attributes make traces filterable by order, not just by host
  span.setAttributes({ 'order.id': orderId, 'payment.amount': amount });
  try {
    const result = await stripeChargeCard(amount);
    span.setStatus({ code: SpanStatusCode.OK });
    return result;
  } catch (err) {
    span.recordException(err);
    span.setStatus({ code: SpanStatusCode.ERROR });
    throw err;
  } finally {
    span.end();
  }
}

Auto-instrumentation covers the basics

OpenTelemetry's auto-instrumentation automatically traces HTTP requests, database calls (pg, mysql2, redis), gRPC, and popular frameworks (Express, Fastify, Koa, NestJS) without any code changes. Add custom spans only for business-level operations that matter for debugging.

Sampling Strategies

Tracing 100% of requests is expensive. At 10,000 req/s, 100% sampling generates terabytes of trace data daily. You have three options:

Head-based sampling

Decide at trace start whether to record it. Simple (e.g., "sample 1%") but may miss rare slow or error traces. Good for high-volume, lower-criticality services.

Low overhead, simple to configure
May miss important low-frequency events (errors, slow tails)
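With the OpenTelemetry Node SDK, head-based sampling is a sampler setting. A minimal sketch reusing the payment-service setup from the implementation section (the 1% ratio is illustrative; sdk-trace-base ships as a dependency of the Node SDK):

const { NodeSDK } = require('@opentelemetry/sdk-node');
const { ParentBasedSampler, TraceIdRatioBasedSampler } = require('@opentelemetry/sdk-trace-base');

const sdk = new NodeSDK({
  serviceName: 'payment-service',
  // Record roughly 1% of new traces at the root; child services follow the caller's decision
  sampler: new ParentBasedSampler({ root: new TraceIdRatioBasedSampler(0.01) }),
  // traceExporter and instrumentations configured as in the implementation section
});
sdk.start();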

Tail-based sampling

Buffer all spans, then decide to keep a trace after it completes. Can retain 100% of error traces and high-latency outliers while dropping normal traffic. Requires a sampling proxy (OTel Collector, Honeycomb Refinery).

Captures all errors and slow traces regardless of sample rate
Higher infrastructure complexity, needs a buffering layer

Probabilistic + rule-based

Combine sampling rates with rules: always trace errors and anything over 500ms; sample 0.1% of successful fast requests. Best practice for most production systems.

Balanced coverage — catches what matters, controls costs
More configuration work upfront

Distributed Tracing Best Practices

1. Use W3C traceparent headers

W3C traceparent has been the standard since the W3C Trace Context Recommendation in 2020. Prefer it over legacy formats (B3, X-B3-*) for new services. All major OpenTelemetry SDKs support it by default.
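Recent OpenTelemetry SDKs use W3C trace context out of the box; if you want to pin it explicitly, a minimal sketch:

const { propagation } = require('@opentelemetry/api');
const { W3CTraceContextPropagator } = require('@opentelemetry/core');

// Use W3C traceparent/tracestate as the global propagation format
propagation.setGlobalPropagator(new W3CTraceContextPropagator());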

2. Inject trace IDs into structured logs

Add trace_id and span_id to every log line. This lets you pivot from a trace waterfall to the raw log lines in one click. Most logging libraries (pino, winston, structlog) support this via a mixin, formatter, or processor hook.
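For example, with pino this can be done with a mixin that reads the active span. A sketch (the trace_id/span_id field names are a convention, not a requirement):

const pino = require('pino');
const { trace } = require('@opentelemetry/api');

const logger = pino({
  // Merge the active trace context into every log line
  mixin() {
    const span = trace.getActiveSpan();
    if (!span) return {};
    const { traceId, spanId } = span.spanContext();
    return { trace_id: traceId, span_id: spanId };
  },
});

logger.info('payment charged'); // carries trace_id and span_id when logged inside a span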

3. Name spans with operation semantics, not code structure

Use "charge-payment" not "PaymentService.chargeCard". Use "get-user-profile" not "UserRepository.findById". Operational names make traces readable without code context.

4. Add business-relevant attributes to spans

Tag spans with user.id, order.id, tenant.id, product.sku. These enable filtering traces by business entity — far more useful than filtering by hostname.

5. Set error status codes explicitly

When a span catches an exception, call span.recordException(err) and span.setStatus(ERROR). Trace UIs surface error spans differently — you'll find them in seconds instead of grepping logs.

6. Start with auto-instrumentation, add custom spans incrementally

Auto-instrumentation handles HTTP, DB, and framework spans. Add manual spans for business operations over time as you learn where debugging gaps are. Don't over-instrument upfront.


Frequently Asked Questions

What is distributed tracing?

Distributed tracing is an observability technique that tracks requests as they flow through multiple services in a distributed system. Each request is assigned a unique trace ID, and each service adds a span (a timed operation) to record what it did. You can then visualize the full request path, see where latency occurs, and identify failures across service boundaries.

What is the difference between distributed tracing and logging?

Logging records discrete events within a single service (e.g., "user authenticated", "database query took 50ms"). Distributed tracing correlates those events across multiple services for a single request, using a shared trace ID. Logs answer "what happened?"; tracing answers "how did this request flow through the system and where did it slow down?". The two are complementary — use structured logs with trace IDs injected so you can pivot from trace to logs.

What is OpenTelemetry and how does it relate to distributed tracing?

OpenTelemetry (OTel) is the CNCF standard for instrumenting applications to collect traces, metrics, and logs. It provides vendor-neutral SDKs for 11+ languages that generate trace data in a standardized format. You instrument your code once with OTel, then export to any compatible backend (Jaeger, Zipkin, Tempo, Honeycomb, Datadog, etc.). OpenTelemetry has effectively replaced vendor-specific instrumentation SDKs as the industry default.

Jaeger vs Zipkin — which should I use?

Jaeger is generally the better choice for new projects. It has better scalability (Cassandra/Elasticsearch backends), a richer UI with service dependency graphs, native OpenTelemetry support, and is a CNCF graduated project that originated at Uber. Zipkin is older, simpler, and may be appropriate for smaller deployments already using it. Both accept OpenTelemetry data. If starting fresh, use Jaeger, Grafana Tempo, or a managed observability platform.

What is a trace ID and span in distributed tracing?

A trace ID is a unique identifier assigned to a request when it enters the system. It is propagated via HTTP headers (e.g., W3C traceparent) to every downstream service. A span represents a single unit of work within a trace — it has a name, start time, duration, and optional attributes/events. Spans form a parent-child hierarchy: the root span is the initial request, and each downstream call creates a child span. Together, they form a trace — a complete picture of the request lifecycle.
