API Observability & Distributed Tracing: Complete Implementation Guide

by API Status Check Team

Modern APIs rarely operate in isolation. A single user request might flow through dozens of microservices, external APIs, databases, and message queues. When something breaks, traditional monitoring tools that only track "is it up or down?" aren't enough.

You need observability — the ability to understand your system's internal state based on external outputs. This guide covers implementing production-ready API observability with distributed tracing, metrics, and structured logging.

Table of Contents

  1. Observability vs Monitoring
  2. The Three Pillars of Observability
  3. Why Distributed Tracing Matters
  4. OpenTelemetry: The Standard
  5. Implementing Distributed Tracing
  6. Metrics Collection
  7. Structured Logging
  8. Popular Observability Tools
  9. Real-World Examples
  10. Best Practices
  11. Common Mistakes
  12. Production Checklist

Observability vs Monitoring {#observability-vs-monitoring}

Monitoring tells you when something is wrong:

  • "API response time is 2000ms (should be <500ms)"
  • "Error rate is 5% (should be <1%)"
  • "CPU usage is 95%"

Observability tells you why something is wrong:

  • "The slow requests are all waiting on the Stripe API payment processing endpoint"
  • "Errors spike when database connection pool exhausts (max 20 connections)"
  • "High CPU from regex parsing unvalidated user input in search queries"

Key difference: Monitoring requires pre-defining what to watch. Observability lets you ask any question about your system's behavior, even ones you didn't anticipate.

When You Need Observability

  • Microservices architecture — requests span 5+ services
  • External API dependencies — AWS, Stripe, Twilio, SendGrid
  • Asynchronous processing — background jobs, message queues, webhooks
  • Production debugging — "Why is this one user experiencing errors?"
  • Performance optimization — finding the slowest part of a request chain

The Three Pillars of Observability {#three-pillars}

Modern observability is built on three data types that, when correlated, give you complete system visibility:

1. Logs

What: Timestamped text records of discrete events.

{
  "timestamp": "2026-03-11T14:05:23.142Z",
  "level": "error",
  "message": "Payment processing failed",
  "userId": "usr_abc123",
  "traceId": "4bf92f3577b34da6a3ce929d0e0e4736",
  "error": {
    "code": "card_declined",
    "message": "Insufficient funds"
  }
}

Best for: Understanding specific events, debugging individual requests.

2. Metrics

What: Numeric measurements aggregated over time.

api_request_duration_seconds{endpoint="/api/checkout", status="200"} 0.342
api_request_total{endpoint="/api/checkout", status="500"} 47
database_connections_active 18

Best for: Dashboards, alerting on system-wide trends, capacity planning.

3. Traces

What: End-to-end journey of a request across all services.

Request: POST /api/checkout (842ms total)
  ├─ auth-service: verify JWT (12ms)
  ├─ inventory-service: check stock (45ms)
  │   └─ PostgreSQL: SELECT products (38ms)
  ├─ payment-service: process charge (723ms) ⚠️ SLOW
  │   ├─ Stripe API: create payment intent (687ms)
  │   └─ Redis: cache card info (4ms)
  └─ notification-service: send confirmation (62ms)
      └─ SendGrid API: send email (58ms)

Best for: Finding bottlenecks, understanding request flows, debugging distributed systems.

Why All Three?

  • Logs show what happened
  • Metrics show how much/how often
  • Traces show where it happened

The magic happens when you correlate them: Click on a spike in your error rate metric → see all related logs filtered by the trace ID → visualize the exact request path that failed.

Why Distributed Tracing Matters {#why-distributed-tracing}

The Problem: Debugging Distributed Systems

Without tracing, debugging a failed checkout request looks like this:

  1. Check API gateway logs: "Request received, forwarded to payment-service"
  2. Check payment-service logs: "Called Stripe API, got 500 error"
  3. Check Stripe status page: "All systems operational"
  4. Check Stripe API logs (if you have access): "Request succeeded, webhook sent"
  5. Check webhook-handler logs: "No webhook received"
  6. Check message queue: "Webhook delivery failed due to DNS timeout"
  7. Finally find root cause: DNS resolver configuration error

That took 6 tools and 30 minutes.

With Distributed Tracing

The same debugging session:

  1. Open trace for failed request ID
  2. See complete request flow with timing:
    • API Gateway → Payment Service (2ms)
    • Payment Service → Stripe API (120ms) ✅
    • Stripe → Webhook Queue (timeout after 5000ms) ❌
  3. Click on failed span → see error: "DNS resolution timeout"
  4. Root cause found in 30 seconds.

Real-World Impact

Shopify reduced incident resolution time by 75% after implementing distributed tracing.

Uber uses tracing to debug issues across 2,200+ microservices handling millions of requests per second.

Netflix attributes faster incident response and reduced MTTR (Mean Time To Recovery) to comprehensive tracing.

OpenTelemetry: The Standard {#opentelemetry}

OpenTelemetry is the industry-standard observability framework (merged from OpenTracing + OpenCensus). It's supported by every major observability vendor: Datadog, New Relic, Honeycomb, Grafana, AWS X-Ray, Google Cloud Trace.

Why OpenTelemetry?

Vendor-neutral — switch from Datadog to New Relic without changing instrumentation code
Auto-instrumentation — trace HTTP requests, database queries, external APIs automatically
Wide language support — JavaScript, Python, Go, Java, .NET, Ruby, PHP
Active development — CNCF project with contributions from Google, Microsoft, AWS, Datadog

Core Concepts

Span

A single operation with start time, end time, and metadata:

{
  "name": "POST /api/checkout",
  "spanId": "05ce929d0e0e4736",
  "traceId": "4bf92f3577b34da6a3ce929d0e0e4736",
  "parentSpanId": null,
  "startTime": "2026-03-11T14:05:23.000Z",
  "endTime": "2026-03-11T14:05:23.842Z",
  "attributes": {
    "http.method": "POST",
    "http.url": "/api/checkout",
    "http.status_code": 200,
    "user.id": "usr_abc123"
  }
}

Trace

Collection of spans representing one complete request:

Trace ID: 4bf92f3577b34da6a3ce929d0e0e4736
  ├─ Span: API Gateway (842ms)
  │   ├─ Span: Auth Service (12ms)
  │   ├─ Span: Payment Service (723ms)
  │   │   └─ Span: Stripe API call (687ms)
  │   └─ Span: Notification Service (62ms)

Context Propagation

Passing trace context between services (usually via HTTP headers):

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-05ce929d0e0e4736-01
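The header packs four hyphen-separated fields: format version, trace ID (32 hex chars), parent span ID (16 hex chars), and trace flags. A minimal parser — illustrative only, since the SDK's propagator handles this for you — makes the layout concrete:

```typescript
// Parse a W3C traceparent header: version-traceId-parentSpanId-flags.
interface TraceParent {
  version: string;   // "00" — the only version defined today
  traceId: string;   // 32 hex chars identifying the whole request
  spanId: string;    // 16 hex chars identifying the caller's span
  sampled: boolean;  // lowest flag bit: was this request sampled?
}

function parseTraceparent(header: string): TraceParent | null {
  const match = header.match(
    /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/
  );
  if (!match) return null; // malformed header — start a new trace instead
  const [, version, traceId, spanId, flags] = match;
  return { version, traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

const ctx = parseTraceparent(
  '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01'
);
console.log(ctx?.traceId); // 4bf92f3577b34da6a3ce929d0e0e4736
```

A malformed or missing header is not an error — the receiving service simply starts a fresh trace.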

Implementing Distributed Tracing {#implementing-tracing}

Step 1: Install OpenTelemetry SDK

npm install @opentelemetry/sdk-node \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-trace-otlp-http

Step 2: Initialize Tracing (tracing.ts)

import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';

const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'payment-service',
    [SemanticResourceAttributes.SERVICE_VERSION]: '1.2.3',
    [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV,
  }),
  traceExporter: new OTLPTraceExporter({
    url: 'http://localhost:4318/v1/traces', // Jaeger/OTLP collector
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-http': {
        ignoreIncomingPaths: ['/health', '/metrics'], // Don't trace health checks
      },
      '@opentelemetry/instrumentation-express': {},
      '@opentelemetry/instrumentation-pg': {}, // PostgreSQL
      '@opentelemetry/instrumentation-redis': {},
    }),
  ],
});

sdk.start();

process.on('SIGTERM', () => {
  sdk.shutdown()
    .then(() => console.log('Tracing terminated'))
    .catch((error) => console.error('Error terminating tracing', error))
    .finally(() => process.exit(0));
});

Step 3: Import Before App Code

// index.ts
import './tracing'; // MUST be first import
import express from 'express';
import { someRoute } from './routes';

const app = express();
app.use('/api', someRoute);
app.listen(3000);

That's it! Auto-instrumentation now traces:

  • ✅ All HTTP requests/responses
  • ✅ Database queries (PostgreSQL, MongoDB, MySQL)
  • ✅ Redis operations
  • ✅ External HTTP calls (to Stripe, AWS, etc.)

Step 4: Add Custom Spans (Optional)

For business logic that auto-instrumentation doesn't cover:

import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('payment-service');

export async function processPayment(userId: string, amount: number) {
  const span = tracer.startSpan('processPayment', {
    attributes: {
      'user.id': userId,
      'payment.amount': amount,
      'payment.currency': 'USD',
    },
  });

  try {
    // Call Stripe API
    const result = await stripe.charges.create({
      amount: amount * 100,
      currency: 'usd',
      customer: userId,
    });

    span.setAttributes({
      'payment.id': result.id,
      'payment.status': result.status,
    });

    span.setStatus({ code: SpanStatusCode.OK });
    return result;
  } catch (error) {
    span.recordException(error as Error);
    span.setStatus({
      code: SpanStatusCode.ERROR,
      message: (error as Error).message,
    });
    throw error;
  } finally {
    span.end();
  }
}

Step 5: Propagate Context Across Services

When service A calls service B, pass trace context in HTTP headers:

// Service A (caller)
import axios from 'axios';
import { context, propagation } from '@opentelemetry/api';

const headers = {};
propagation.inject(context.active(), headers); // Adds traceparent header

await axios.post('http://payment-service/api/charge', { amount: 100 }, { headers });

Service B's auto-instrumentation extracts the trace context automatically — no manual code needed.

Metrics Collection {#metrics-collection}

Metrics complement traces by showing trends over time.

Key API Metrics to Track

Request Metrics:

  • api_requests_total (counter) — total requests by endpoint, status code
  • api_request_duration_seconds (histogram) — request latency distribution
  • api_request_size_bytes (histogram) — request payload sizes
  • api_response_size_bytes (histogram) — response payload sizes

Error Metrics:

  • api_errors_total (counter) — errors by type (validation, auth, server, external)
  • api_error_rate (gauge) — percentage of requests failing
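An `api_error_rate` gauge can be derived from raw request/error counts over a sliding window. A self-contained sketch (class and window size are illustrative, not a prom-client API):

```typescript
// Rolling error rate: percentage of requests in the last window that failed.
class RollingErrorRate {
  private events: { ts: number; isError: boolean }[] = [];
  constructor(private windowMs: number) {}

  record(isError: boolean, now = Date.now()): void {
    this.events.push({ ts: now, isError });
  }

  // Drop events outside the window, then compute the failure percentage.
  rate(now = Date.now()): number {
    this.events = this.events.filter((e) => now - e.ts <= this.windowMs);
    if (this.events.length === 0) return 0;
    const errors = this.events.filter((e) => e.isError).length;
    return (errors / this.events.length) * 100;
  }
}

const tracker = new RollingErrorRate(60_000); // 1-minute window
for (let i = 0; i < 95; i++) tracker.record(false, 1_000);
for (let i = 0; i < 5; i++) tracker.record(true, 1_000);
console.log(tracker.rate(2_000)); // 5
```

In practice you would export this value via a prom-client `Gauge`, or skip the gauge entirely and compute the rate in PromQL from the counters above.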

External API Metrics:

  • external_api_calls_total (counter) — calls to Stripe, AWS, etc.
  • external_api_duration_seconds (histogram) — latency of external calls
  • external_api_errors_total (counter) — errors from external APIs

System Metrics:

  • nodejs_heap_used_bytes (gauge) — memory usage
  • nodejs_eventloop_lag_seconds (gauge) — event loop lag
  • database_connections_active (gauge) — active DB connections

Implementing Metrics with Prometheus

npm install prom-client

// metrics.ts
import { Registry, Counter, Histogram } from 'prom-client';
import { Request, Response, NextFunction } from 'express';

export const register = new Registry();

// Request counter
export const httpRequestsTotal = new Counter({
  name: 'api_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'endpoint', 'status'],
  registers: [register],
});

// Request duration histogram
export const httpRequestDuration = new Histogram({
  name: 'api_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'endpoint', 'status'],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5], // Response time buckets
  registers: [register],
});

// External API call duration
export const externalApiDuration = new Histogram({
  name: 'external_api_duration_seconds',
  help: 'Duration of external API calls',
  labelNames: ['service', 'endpoint', 'status'],
  buckets: [0.1, 0.5, 1, 2, 5, 10],
  registers: [register],
});

// Metrics middleware
export function metricsMiddleware(req: Request, res: Response, next: NextFunction) {
  const start = Date.now();

  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    const endpoint = req.route?.path || req.path;

    httpRequestsTotal.inc({
      method: req.method,
      endpoint,
      status: res.statusCode,
    });

    httpRequestDuration.observe(
      {
        method: req.method,
        endpoint,
        status: res.statusCode,
      },
      duration
    );
  });

  next();
}

Expose Metrics Endpoint

// app.ts
import express from 'express';
import { register, metricsMiddleware } from './metrics';

const app = express();

app.use(metricsMiddleware);

app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

app.listen(3000);

Track External API Calls

import axios from 'axios';
import { externalApiDuration } from './metrics';

export async function chargeCard(amount: number) {
  const end = externalApiDuration.startTimer();

  try {
    const response = await axios.post('https://api.stripe.com/v1/charges', {
      amount: amount * 100,
      currency: 'usd',
    });

    end({ service: 'stripe', endpoint: '/v1/charges', status: response.status });
    return response.data;
  } catch (error: any) {
    end({ service: 'stripe', endpoint: '/v1/charges', status: error.response?.status || 0 });
    throw error;
  }
}

Structured Logging {#structured-logging}

Logs become 10x more useful when they're structured (JSON) instead of plain text, and include correlation IDs linking them to traces.

Implementing Structured Logging with Winston

npm install winston

// logger.ts
import winston from 'winston';
import { trace, context } from '@opentelemetry/api';

// Custom format that adds trace context
const traceFormat = winston.format((info) => {
  const span = trace.getSpan(context.active());
  if (span) {
    const spanContext = span.spanContext();
    info.traceId = spanContext.traceId;
    info.spanId = spanContext.spanId;
  }
  return info;
});

export const logger = winston.createLogger({
  level: process.env.LOG_LEVEL || 'info',
  format: winston.format.combine(
    winston.format.timestamp(),
    traceFormat(), // Add trace IDs
    winston.format.errors({ stack: true }),
    winston.format.json()
  ),
  defaultMeta: {
    service: 'payment-service',
    environment: process.env.NODE_ENV,
  },
  transports: [
    new winston.transports.Console(),
    new winston.transports.File({ filename: 'error.log', level: 'error' }),
    new winston.transports.File({ filename: 'combined.log' }),
  ],
});

Using the Logger

import { logger } from './logger';

export async function processPayment(userId: string, amount: number) {
  logger.info('Processing payment', {
    userId,
    amount,
    currency: 'USD',
  });

  try {
    const result = await stripe.charges.create({ amount: amount * 100 });

    logger.info('Payment successful', {
      userId,
      paymentId: result.id,
      status: result.status,
    });

    return result;
  } catch (error) {
    logger.error('Payment failed', {
      userId,
      amount,
      error: {
        message: error.message,
        code: error.code,
        type: error.type,
      },
    });
    throw error;
  }
}

Example Structured Log Output

{
  "timestamp": "2026-03-11T14:05:23.142Z",
  "level": "error",
  "message": "Payment failed",
  "service": "payment-service",
  "environment": "production",
  "traceId": "4bf92f3577b34da6a3ce929d0e0e4736",
  "spanId": "05ce929d0e0e4736",
  "userId": "usr_abc123",
  "amount": 100,
  "error": {
    "message": "Your card was declined",
    "code": "card_declined",
    "type": "StripeCardError"
  }
}

Power of correlation: Copy the traceId → paste into your tracing tool (Jaeger, Datadog) → see the complete request flow that generated this error.

Popular Observability Tools {#observability-tools}

Open Source

Jaeger (distributed tracing)

  • Created by Uber, now CNCF project
  • Great for self-hosted tracing
  • Supports OpenTelemetry OTLP format
  • Free, but requires infrastructure setup

Grafana + Loki + Tempo (metrics + logs + traces)

  • Complete observability stack
  • Grafana for dashboards, Loki for logs, Tempo for traces
  • Works with Prometheus for metrics
  • Popular for Kubernetes environments

Zipkin (distributed tracing)

  • Alternative to Jaeger
  • Simpler UI, fewer features
  • Good for smaller teams

Commercial SaaS

Datadog APM

  • All-in-one: metrics, logs, traces, RUM
  • Auto-instrumentation for 15+ languages
  • Great correlation between data sources
  • Pricing: ~$31/host/month + $1.70/million spans

New Relic

  • Full-stack observability platform
  • Strong APM and distributed tracing
  • Good error tracking
  • Pricing: $99/user/month (full platform)

Honeycomb

  • Query-based exploration (not dashboards)
  • Excellent for debugging complex issues
  • Powerful "slice and dice" UI
  • Pricing: $0.03/GB ingested

Elastic APM (part of Elastic Stack)

  • Integrates with Elasticsearch/Kibana
  • Good if you already use ELK stack
  • Strong log correlation
  • Pricing: $95/month for 50GB

Sentry

  • Primarily error tracking with tracing
  • Great developer experience
  • Performance monitoring add-on
  • Pricing: $26/month for 50K errors

AWS X-Ray

  • Native AWS integration
  • Trace requests across Lambda, ECS, EC2
  • Works with API Gateway, DynamoDB
  • Pricing: $5 per million traces

Decision Guide

Choose Jaeger if:

  • You want self-hosted (no SaaS costs)
  • You have Kubernetes infrastructure
  • You need basic distributed tracing only

Choose Datadog if:

  • You want best-in-class correlation (logs ↔ traces ↔ metrics)
  • You monitor infrastructure + applications
  • Budget allows $300-1000+/month

Choose Honeycomb if:

  • You need to debug complex, high-cardinality issues
  • You want query-based exploration vs pre-built dashboards
  • You have unpredictable traffic (pay-per-GB works better)

Choose New Relic if:

  • You want full observability platform
  • You're already invested in New Relic ecosystem

Choose AWS X-Ray if:

  • You run primarily on AWS
  • You want zero-config Lambda tracing

Real-World Examples {#real-world-examples}

Example 1: Finding a Performance Bottleneck

Symptom: API response times increased from 200ms → 1.2s after deploying inventory service v2.4.

Without observability:

  • Check server CPU/memory (normal)
  • Review code changes (200+ line diff, nothing obvious)
  • Add console.log statements, redeploy
  • Wait for logs... still unclear

With distributed tracing:

  1. Open trace for slow request
  2. See breakdown:
    GET /api/products (1,200ms)
      ├─ Auth middleware (8ms)
      ├─ Inventory service (1,150ms) ⚠️
      │   ├─ PostgreSQL: SELECT products (15ms)
      │   └─ Loop: fetch images (1,135ms) ⚠️
      │       ├─ S3 getObject (~142ms each) × 8 ≈ 1,135ms
      └─ Serialize response (42ms)
    
  3. Root cause found: New code fetches product images sequentially (8 products × 142ms each) instead of parallel.

Fix:

// Before (sequential — 1,136ms)
for (const product of products) {
  product.imageUrl = await s3.getObject(product.imageKey);
}

// After (parallel — 145ms)
await Promise.all(
  products.map(async (product) => {
    product.imageUrl = await s3.getObject(product.imageKey);
  })
);

Result: Response time back to 205ms. Issue found in 3 minutes vs 3 hours.

Example 2: Debugging Intermittent Errors

Symptom: 2% of checkout requests fail with "Payment processing timeout" — but Stripe status page shows no issues.

Investigation with observability:

  1. Check metrics: Error rate spikes every ~15 minutes
  2. Query traces: Filter failed checkouts by trace ID
  3. Discover pattern: All failing requests show:
    Payment service → Stripe API (timeout after 30s)
    
  4. Check logs with traceId: Find correlation:
    {
      "traceId": "abc123",
      "message": "Stripe webhook delivery failed",
      "error": "Connection pool exhausted (max 20 connections)"
    }
    
  5. Root cause: Webhook handler holds database connections during Stripe callbacks. When Stripe is slow (30s+), connection pool exhausts, blocking new payment requests.

Fix: Process webhooks asynchronously via message queue instead of blocking on DB writes.
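In sketch form: the handler validates and enqueues, then returns 200 immediately, so no database connection is held while Stripe is slow. The in-memory array stands in for a real queue (BullMQ, SQS, etc.) — names are illustrative:

```typescript
type WebhookEvent = { id: string; type: string; payload: unknown };

const queue: WebhookEvent[] = []; // stand-in for a durable message queue

// Before: the handler held a DB connection for the whole callback,
// exhausting the pool. After: acknowledge fast, defer the work.
function handleStripeWebhook(event: WebhookEvent): { status: number } {
  if (!event.id || !event.type) return { status: 400 }; // reject malformed
  queue.push(event); // a worker drains this and performs the DB writes
  return { status: 200 }; // acknowledged immediately, no DB held
}

// A separate worker processes events at its own pace:
async function drainQueue(process: (e: WebhookEvent) => Promise<void>) {
  while (queue.length > 0) {
    await process(queue.shift()!);
  }
}
```

With a real queue you also gain retries and back-pressure for free, which an inline handler never had.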

Example 3: Identifying External API Impact

Question: "Which external API has the biggest impact on our response times?"

Answer with metrics:

Query Prometheus:

topk(5, 
  sum by (service) (
    rate(external_api_duration_seconds_sum[1h])
  )
)

Results:

1. stripe: 45% of total external API time
2. aws-s3: 28%
3. sendgrid: 15%
4. twilio: 8%
5. github: 4%

Insight: Stripe API is responsible for nearly half of external dependency time. Investigate caching payment methods or pre-authorizations to reduce calls.

Best Practices {#best-practices}

1. Use Sampling for High-Traffic APIs

Tracing every request can be expensive. Use sampling:

// Trace 10% of requests in production, 100% in dev
import { TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-base';

const sampleRate = process.env.NODE_ENV === 'production' ? 0.1 : 1.0;

const sdk = new NodeSDK({
  // ... other config
  sampler: new TraceIdRatioBasedSampler(sampleRate),
});

Advanced sampling: Always trace errors and slow requests:

import {
  Sampler,
  SamplingDecision,
  SamplingResult,
} from '@opentelemetry/sdk-trace-base';
import { Context, SpanKind, Attributes, Link } from '@opentelemetry/api';

// Caveat: a head-based sampler only sees attributes available at span
// *start*. Final status codes and durations are generally only known at
// span end, so rules like these are usually enforced with tail-based
// sampling (e.g. in the OpenTelemetry Collector).
class ErrorAndSlowRequestSampler implements Sampler {
  shouldSample(
    context: Context,
    traceId: string,
    spanName: string,
    spanKind: SpanKind,
    attributes: Attributes,
    links: Link[]
  ): SamplingResult {
    // Always sample errors (when the status code is already known)
    if ((attributes['http.status_code'] as number) >= 400) {
      return { decision: SamplingDecision.RECORD_AND_SAMPLED };
    }

    // Always sample slow requests (>1s)
    if ((attributes['http.response_time'] as number) > 1000) {
      return { decision: SamplingDecision.RECORD_AND_SAMPLED };
    }

    // Sample 10% of everything else
    return Math.random() < 0.1
      ? { decision: SamplingDecision.RECORD_AND_SAMPLED }
      : { decision: SamplingDecision.NOT_RECORD };
  }

  toString(): string {
    return 'ErrorAndSlowRequestSampler';
  }
}

2. Set Span Attributes Generously

More context = faster debugging:

span.setAttributes({
  'user.id': userId,
  'user.email': userEmail,
  'user.subscription_tier': 'pro',
  'payment.amount': 100,
  'payment.currency': 'USD',
  'payment.method': 'credit_card',
  'feature.flag.new_checkout': true,
  'request.user_agent': req.headers['user-agent'],
});

Don't include:

  • Passwords, tokens, API keys
  • Full credit card numbers
  • PII (if regulations prohibit)

3. Correlate Logs with Traces

Always include traceId and spanId in logs:

logger.info('Payment processed', {
  traceId: span.spanContext().traceId,
  spanId: span.spanContext().spanId,
  userId,
  amount,
});

This lets you jump from a dashboard spike → filtered logs → full trace visualization.

4. Monitor Observability System Health

Your observability tools can fail too:

  • Trace export failures: Set up alerts when OpenTelemetry exporter errors exceed 1%
  • Metric scraping failures: Monitor Prometheus scrape failures
  • Log ingestion lag: Alert if logs are >5 minutes behind real-time

One way to surface exporter failures — the NodeSDK has no dedicated event hook; it reports internal errors through OpenTelemetry's diag API:

import { diag, DiagLogLevel } from '@opentelemetry/api';
import { Counter } from 'prom-client';

const otelErrors = new Counter({
  name: 'otel_export_errors_total',
  help: 'OpenTelemetry internal errors (including failed exports)',
});

// Route OpenTelemetry's internal diagnostics through our own logger
diag.setLogger(
  {
    verbose: () => {},
    debug: () => {},
    info: () => {},
    warn: (message) => logger.warn('OpenTelemetry warning', { message }),
    error: (message) => {
      otelErrors.inc();
      logger.error('OpenTelemetry error', { message });
    },
  },
  DiagLogLevel.ERROR
);

5. Use Semantic Conventions

OpenTelemetry defines semantic conventions for consistent attribute naming:

✅ Do:

span.setAttributes({
  'http.method': 'POST',
  'http.status_code': 200,
  'http.url': '/api/checkout',
  'db.system': 'postgresql',
  'db.statement': 'SELECT * FROM users WHERE id = $1',
});

❌ Don't:

span.setAttributes({
  'method': 'POST', // Use http.method
  'statusCode': 200, // Use http.status_code
  'endpoint': '/api/checkout', // Use http.url
  'query': 'SELECT ...', // Use db.statement
});

Semantic conventions ensure:

  • Consistent naming across services
  • Better support from observability tools
  • Easier queries and dashboards

6. Set Alert Thresholds Based on Data

Don't guess — use percentiles from real traffic:

# P95 response time over 7 days
histogram_quantile(0.95, 
  rate(api_request_duration_seconds_bucket[7d])
)

Alert when: Current P95 > baseline P95 + 50%

7. Create Runbooks for Common Traces

Document patterns you see frequently:

Pattern: stripe_api_timeout in traces
Cause: Stripe API slow or down
Action:

  1. Check Stripe status page
  2. If Stripe is down, enable circuit breaker (return cached payment methods)
  3. If Stripe is operational, increase timeout from 5s → 10s
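The circuit breaker mentioned in step 2 can be sketched in a few lines — thresholds and names here are illustrative, not a specific library's API:

```typescript
// After N consecutive failures the breaker opens: callers get the
// fallback (e.g. cached payment methods) instead of waiting on a dead API.
class CircuitBreaker<T> {
  private failures = 0;
  private openUntil = 0;

  constructor(
    private maxFailures: number, // consecutive failures before opening
    private cooldownMs: number   // how long to stay open
  ) {}

  async call(fn: () => Promise<T>, fallback: () => T): Promise<T> {
    if (Date.now() < this.openUntil) return fallback(); // open: skip the call
    try {
      const result = await fn();
      this.failures = 0; // success closes the breaker
      return result;
    } catch {
      if (++this.failures >= this.maxFailures) {
        this.openUntil = Date.now() + this.cooldownMs; // trip open
        this.failures = 0;
      }
      return fallback();
    }
  }
}
```

Production libraries (e.g. opossum for Node.js) add half-open probing and metrics on top of this basic state machine.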

Pattern: database_connection_pool_exhausted
Cause: Long-running queries or leaked connections
Action:

  1. Check active queries: SELECT * FROM pg_stat_activity WHERE state = 'active'
  2. Kill long queries: SELECT pg_terminate_backend(pid)
  3. Review code for missing connection.release() calls
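The leak-proof pattern from step 3 is to release in a `finally` block so the connection returns to the pool on success and on error alike. The pool interfaces below mirror node-postgres but are stubs for illustration:

```typescript
interface PoolClient {
  query(sql: string, params?: unknown[]): Promise<unknown>;
  release(): void;
}
interface Pool {
  connect(): Promise<PoolClient>;
}

async function getUser(pool: Pool, id: string) {
  const client = await pool.connect();
  try {
    return await client.query('SELECT * FROM users WHERE id = $1', [id]);
  } finally {
    client.release(); // runs on success AND on error — no leaked connections
  }
}
```

Any early `return` or thrown error between `connect()` and `release()` that isn't covered by `finally` is exactly the leak this runbook pattern describes.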

Common Mistakes {#common-mistakes}

1. Not Sampling High-Volume Endpoints

Mistake: Tracing every health check request (1000/second).

Impact: Trace storage costs explode, performance degrades.

Fix:

instrumentations: [
  getNodeAutoInstrumentations({
    '@opentelemetry/instrumentation-http': {
      ignoreIncomingPaths: ['/health', '/metrics', '/favicon.ico'],
    },
  }),
],

2. Missing Context Propagation

Mistake: Service A sends traceId to Service B, but Service B creates a new trace instead of continuing the existing one.

Impact: You see two separate traces instead of one complete flow.

Fix: Use OpenTelemetry's propagation.inject() and ensure Service B's auto-instrumentation extracts headers:

// Service A
const headers = {};
propagation.inject(context.active(), headers);
await axios.post('http://service-b/api', data, { headers });

// Service B automatically extracts traceparent header from request

3. Not Recording Exceptions in Spans

Mistake:

try {
  await riskyOperation();
} catch (error) {
  logger.error('Operation failed', { error });
  throw error; // Span status stays UNSET — the error never appears on the trace
}

Fix:

try {
  await riskyOperation();
} catch (error) {
  span.recordException(error);
  span.setStatus({ code: SpanStatusCode.ERROR });
  throw error;
}

4. Over-Instrumenting Hot Paths

Mistake: Creating spans for every iteration of a loop processing 10,000 items.

Impact: Generates 10,000 spans, slows down processing, inflates storage costs.

Fix: Create one span for the entire batch:

// ❌ Don't
for (const item of items) {
  const span = tracer.startSpan('processItem');
  await process(item);
  span.end();
}

// ✅ Do
const span = tracer.startSpan('processBatch', {
  attributes: { 'batch.size': items.length },
});
await Promise.all(items.map(process));
span.end();

5. Logging PII or Secrets

Mistake:

logger.info('User login', {
  email: 'user@example.com',
  password: 'plaintext123', // ❌ NEVER
  creditCard: '4111111111111111', // ❌ NEVER
});

Fix: Redact sensitive data:

logger.info('User login', {
  userId: 'usr_abc123',
  email: 'u***@example.com', // Masked
  authMethod: 'password',
});
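A small redaction helper applied before metadata reaches any transport keeps this from depending on developer discipline. The key list and masking rules here are illustrative:

```typescript
// Mask or drop sensitive keys before logging. Extend for your domain.
const SENSITIVE_KEYS = ['password', 'token', 'apikey', 'creditcard', 'ssn'];

function redact(meta: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(meta)) {
    if (SENSITIVE_KEYS.includes(key.toLowerCase().replace(/[_-]/g, ''))) {
      out[key] = '[REDACTED]'; // never log the value at all
    } else if (key.toLowerCase() === 'email' && typeof value === 'string') {
      out[key] = value.replace(/^(.).*(@.*)$/, '$1***$2'); // u***@example.com
    } else {
      out[key] = value;
    }
  }
  return out;
}

console.log(redact({ email: 'user@example.com', password: 'plaintext123' }));
// { email: 'u***@example.com', password: '[REDACTED]' }
```

With Winston, this logic fits naturally into a custom `winston.format` so every log line passes through it automatically.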

6. Not Monitoring Observability Costs

Mistake: Sending unsampled traces from development/staging to production observability backend.

Impact: Datadog bill goes from $500/month → $5,000/month.

Fix:

  • Use separate environments (dev/staging/prod)
  • Set aggressive sampling in non-prod (1-5%)
  • Monitor ingestion volumes daily

7. Ignoring Cardinality

Mistake: Using high-cardinality values as metric labels:

// ❌ Don't — userId has millions of unique values
httpRequestsTotal.inc({ userId: 'usr_abc123' });

Impact: Metric explosion (millions of time series), Prometheus crashes.

Fix: Use low-cardinality labels only:

// ✅ Do — subscription tier has 3-5 values
httpRequestsTotal.inc({ subscription_tier: 'pro' });

Production Checklist {#production-checklist}

Before deploying observability to production:

Tracing:

  • OpenTelemetry SDK initialized before app code
  • Sampling configured (10-20% for high traffic)
  • Health checks/metrics excluded from tracing
  • Context propagation tested across services
  • Span attributes follow semantic conventions
  • Exceptions recorded in spans
  • Trace exporter tested (Jaeger/Datadog/etc. receiving data)

Metrics:

  • Key metrics instrumented (requests, errors, latency, external API calls)
  • Low-cardinality labels only (avoid userId, traceId in labels)
  • /metrics endpoint exposed for Prometheus scraping
  • Metrics endpoint excluded from tracing (avoid infinite loop)
  • Dashboard created for key metrics

Logging:

  • Structured JSON logging configured
  • Trace IDs included in all logs
  • PII/secrets redacted or excluded
  • Log levels configurable via environment variable
  • Log retention policy set (7-30 days typical)
  • Error logs routed to monitoring (Sentry, PagerDuty)

Alerting:

  • P95 latency alert (threshold: +50% from baseline)
  • Error rate alert (threshold: >1% for 5 minutes)
  • External API error rate (per service)
  • Trace export failure alert
  • Log ingestion lag alert
  • Runbooks documented for each alert

Cost Management:

  • Sampling rates set appropriately
  • Dev/staging environments use separate backends or aggressive sampling
  • Ingestion volume monitored daily
  • Budget alerts configured (Datadog/New Relic)

Testing:

  • Generate test trace and verify end-to-end flow in UI
  • Trigger error, verify exception appears in trace
  • Check log-to-trace correlation (click traceId → see full trace)
  • Load test with tracing enabled (measure performance impact)

Conclusion

Observability transforms debugging from guesswork to science. Instead of "the API is slow," you know:

  • Which endpoint is slow (traces)
  • Why it's slow (span timing breakdown)
  • When it started (metrics timeline)
  • What errors occurred (correlated logs)
  • Who was affected (user attributes in traces)

Start simple:

  1. Add OpenTelemetry auto-instrumentation (15 minutes)
  2. Send traces to Jaeger or Datadog (10 minutes)
  3. Add structured logging with trace IDs (30 minutes)
  4. Instrument key metrics (1 hour)
  5. Create your first dashboard (30 minutes)

Total investment: ~3 hours to go from zero observability to production-ready monitoring.

The next time your API breaks at 3 AM, you'll know exactly where to look.


Track the status of observability tools like Datadog, New Relic, Sentry, and infrastructure providers like AWS, Stripe, and Twilio on API Status Check.
