API Observability & Distributed Tracing: Complete Implementation Guide

by API Status Check Team

Modern APIs rarely operate in isolation. A single user request might flow through dozens of microservices, external APIs, databases, and message queues. When something breaks, traditional monitoring tools that only track "is it up or down?" aren't enough.

You need observability — the ability to understand your system's internal state based on external outputs. This guide covers implementing production-ready API observability with distributed tracing, metrics, and structured logging.

Table of Contents

  1. Observability vs Monitoring
  2. The Three Pillars of Observability
  3. Why Distributed Tracing Matters
  4. OpenTelemetry: The Standard
  5. Implementing Distributed Tracing
  6. Metrics Collection
  7. Structured Logging
  8. Popular Observability Tools
  9. Real-World Examples
  10. Best Practices
  11. Common Mistakes
  12. Production Checklist

Observability vs Monitoring {#observability-vs-monitoring}

Monitoring tells you when something is wrong:

  • "API response time is 2000ms (should be <500ms)"
  • "Error rate is 5% (should be <1%)"
  • "CPU usage is 95%"

Observability tells you why something is wrong:

  • "The slow requests are all waiting on the Stripe API payment processing endpoint"
  • "Errors spike when database connection pool exhausts (max 20 connections)"
  • "High CPU from regex parsing unvalidated user input in search queries"

Key difference: Monitoring requires pre-defining what to watch. Observability lets you ask any question about your system's behavior, even ones you didn't anticipate.

When You Need Observability

  • Microservices architecture — requests span 5+ services
  • External API dependencies — AWS, Stripe, Twilio, SendGrid
  • Asynchronous processing — background jobs, message queues, webhooks
  • Production debugging — "Why is this one user experiencing errors?"
  • Performance optimization — finding the slowest part of a request chain

The Three Pillars of Observability {#three-pillars}

Modern observability is built on three data types that, when correlated, give you complete system visibility:

1. Logs

What: Timestamped text records of discrete events.

{
  "timestamp": "2026-03-11T14:05:23.142Z",
  "level": "error",
  "message": "Payment processing failed",
  "userId": "usr_abc123",
  "traceId": "4bf92f3577b34da6a3ce929d0e0e4736",
  "error": {
    "code": "card_declined",
    "message": "Insufficient funds"
  }
}

Best for: Understanding specific events, debugging individual requests.

2. Metrics

What: Numeric measurements aggregated over time.

api_request_duration_seconds{endpoint="/api/checkout", status="200"} 0.342
api_request_total{endpoint="/api/checkout", status="500"} 47
database_connections_active 18

Best for: Dashboards, alerting on system-wide trends, capacity planning.

3. Traces

What: End-to-end journey of a request across all services.

Request: POST /api/checkout (842ms total)
  ├─ auth-service: verify JWT (12ms)
  ├─ inventory-service: check stock (45ms)
  │   └─ PostgreSQL: SELECT products (38ms)
  ├─ payment-service: process charge (723ms) ⚠️ SLOW
  │   ├─ Stripe API: create payment intent (687ms)
  │   └─ Redis: cache card info (4ms)
  └─ notification-service: send confirmation (62ms)
      └─ SendGrid API: send email (58ms)

Best for: Finding bottlenecks, understanding request flows, debugging distributed systems.

Why All Three?

  • Logs show what happened
  • Metrics show how much/how often
  • Traces show where it happened

The magic happens when you correlate them: Click on a spike in your error rate metric → see all related logs filtered by the trace ID → visualize the exact request path that failed.

Why Distributed Tracing Matters {#why-distributed-tracing}

The Problem: Debugging Distributed Systems

Without tracing, debugging a failed checkout request looks like this:

  1. Check API gateway logs: "Request received, forwarded to payment-service"
  2. Check payment-service logs: "Called Stripe API, got 500 error"
  3. Check Stripe status page: "All systems operational"
  4. Check Stripe API logs (if you have access): "Request succeeded, webhook sent"
  5. Check webhook-handler logs: "No webhook received"
  6. Check message queue: "Webhook delivery failed due to DNS timeout"
  7. Finally find root cause: DNS resolver configuration error

That took 6 tools and 30 minutes.

With Distributed Tracing

The same debugging session:

  1. Open trace for failed request ID
  2. See complete request flow with timing:
    • API Gateway → Payment Service (2ms)
    • Payment Service → Stripe API (120ms) ✅
    • Stripe → Webhook Queue (timeout after 5000ms) ❌
  3. Click on failed span → see error: "DNS resolution timeout"
  4. Root cause found in 30 seconds.

Real-World Impact

Shopify reduced incident resolution time by 75% after implementing distributed tracing.

Uber uses tracing to debug issues across 2,200+ microservices handling millions of requests per second.

Netflix attributes faster incident response and reduced MTTR (Mean Time To Recovery) to comprehensive tracing.

OpenTelemetry: The Standard {#opentelemetry}

OpenTelemetry is the industry-standard observability framework (merged from OpenTracing + OpenCensus). It's supported by every major observability vendor: Datadog, New Relic, Honeycomb, Grafana, AWS X-Ray, Google Cloud Trace.

Why OpenTelemetry?

Vendor-neutral — switch from Datadog to New Relic without changing instrumentation code
Auto-instrumentation — trace HTTP requests, database queries, external APIs automatically
Wide language support — JavaScript, Python, Go, Java, .NET, Ruby, PHP
Active development — CNCF project with contributions from Google, Microsoft, AWS, Datadog

Core Concepts

Span

A single operation with start time, end time, and metadata:

{
  "name": "POST /api/checkout",
  "spanId": "05ce929d0e0e4736",
  "traceId": "4bf92f3577b34da6a3ce929d0e0e4736",
  "parentSpanId": null,
  "startTime": "2026-03-11T14:05:23.000Z",
  "endTime": "2026-03-11T14:05:23.842Z",
  "attributes": {
    "http.method": "POST",
    "http.url": "/api/checkout",
    "http.status_code": 200,
    "user.id": "usr_abc123"
  }
}

Trace

Collection of spans representing one complete request:

Trace ID: 4bf92f3577b34da6a3ce929d0e0e4736
  ├─ Span: API Gateway (842ms)
  │   ├─ Span: Auth Service (12ms)
  │   ├─ Span: Payment Service (723ms)
  │   │   └─ Span: Stripe API call (687ms)
  │   └─ Span: Notification Service (62ms)

Context Propagation

Passing trace context between services (usually via HTTP headers):

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-05ce929d0e0e4736-01
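The header packs four hyphen-separated fields: format version, trace ID (32 hex chars), parent span ID (16 hex chars), and trace flags. A minimal parser — illustrative only, since the SDK's propagator handles this for you — makes the layout concrete:

```typescript
// Parse a W3C traceparent header: version-traceId-parentSpanId-flags.
interface TraceParent {
  version: string;   // "00" — the only version defined today
  traceId: string;   // 32 hex chars identifying the whole request
  spanId: string;    // 16 hex chars identifying the caller's span
  sampled: boolean;  // lowest flag bit: was this request sampled?
}

function parseTraceparent(header: string): TraceParent | null {
  const match = header.match(
    /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/
  );
  if (!match) return null; // malformed header — start a new trace instead
  const [, version, traceId, spanId, flags] = match;
  return { version, traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

const ctx = parseTraceparent(
  '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01'
);
console.log(ctx?.traceId); // 4bf92f3577b34da6a3ce929d0e0e4736
```

A malformed or missing header is not an error — the receiving service simply starts a fresh trace.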

Implementing Distributed Tracing {#implementing-tracing}

Step 1: Install OpenTelemetry SDK

npm install @opentelemetry/sdk-node \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-trace-otlp-http

Step 2: Initialize Tracing (tracing.ts)

import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';

const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'payment-service',
    [SemanticResourceAttributes.SERVICE_VERSION]: '1.2.3',
    [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV,
  }),
  traceExporter: new OTLPTraceExporter({
    url: 'http://localhost:4318/v1/traces', // Jaeger/OTLP collector
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-http': {
        ignoreIncomingPaths: ['/health', '/metrics'], // Don't trace health checks
      },
      '@opentelemetry/instrumentation-express': {},
      '@opentelemetry/instrumentation-pg': {}, // PostgreSQL
      '@opentelemetry/instrumentation-redis': {},
    }),
  ],
});

sdk.start();

process.on('SIGTERM', () => {
  sdk.shutdown()
    .then(() => console.log('Tracing terminated'))
    .catch((error) => console.error('Error terminating tracing', error))
    .finally(() => process.exit(0));
});

Step 3: Import Before App Code

// index.ts
import './tracing'; // MUST be first import
import express from 'express';
import { someRoute } from './routes';

const app = express();
app.use('/api', someRoute);
app.listen(3000);

That's it! Auto-instrumentation now traces:

  • ✅ All HTTP requests/responses
  • ✅ Database queries (PostgreSQL, MongoDB, MySQL)
  • ✅ Redis operations
  • ✅ External HTTP calls (to Stripe, AWS, etc.)

Step 4: Add Custom Spans (Optional)

For business logic that auto-instrumentation doesn't cover:

import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('payment-service');

export async function processPayment(userId: string, amount: number) {
  const span = tracer.startSpan('processPayment', {
    attributes: {
      'user.id': userId,
      'payment.amount': amount,
      'payment.currency': 'USD',
    },
  });

  try {
    // Call Stripe API
    const result = await stripe.charges.create({
      amount: amount * 100,
      currency: 'usd',
      customer: userId,
    });

    span.setAttributes({
      'payment.id': result.id,
      'payment.status': result.status,
    });

    span.setStatus({ code: SpanStatusCode.OK });
    return result;
  } catch (error) {
    span.recordException(error as Error);
    span.setStatus({
      code: SpanStatusCode.ERROR,
      message: (error as Error).message,
    });
    throw error;
  } finally {
    span.end();
  }
}

Step 5: Propagate Context Across Services

When service A calls service B, pass trace context in HTTP headers:

// Service A (caller)
import axios from 'axios';
import { context, propagation } from '@opentelemetry/api';

const headers = {};
propagation.inject(context.active(), headers); // Adds traceparent header

await axios.post('http://payment-service/api/charge', { amount: 100 }, { headers });

Service B's auto-instrumentation extracts the trace context automatically — no manual code needed.

Metrics Collection {#metrics-collection}

Metrics complement traces by showing trends over time.

Key API Metrics to Track

Request Metrics:

  • api_requests_total (counter) — total requests by endpoint, status code
  • api_request_duration_seconds (histogram) — request latency distribution
  • api_request_size_bytes (histogram) — request payload sizes
  • api_response_size_bytes (histogram) — response payload sizes

Error Metrics:

  • api_errors_total (counter) — errors by type (validation, auth, server, external)
  • api_error_rate (gauge) — percentage of requests failing
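An `api_error_rate` gauge can be derived from raw request/error counts over a sliding window. A self-contained sketch (class and window size are illustrative, not a prom-client API):

```typescript
// Rolling error rate: percentage of requests in the last window that failed.
class RollingErrorRate {
  private events: { ts: number; isError: boolean }[] = [];
  constructor(private windowMs: number) {}

  record(isError: boolean, now = Date.now()): void {
    this.events.push({ ts: now, isError });
  }

  // Drop events outside the window, then compute the failure percentage.
  rate(now = Date.now()): number {
    this.events = this.events.filter((e) => now - e.ts <= this.windowMs);
    if (this.events.length === 0) return 0;
    const errors = this.events.filter((e) => e.isError).length;
    return (errors / this.events.length) * 100;
  }
}

const tracker = new RollingErrorRate(60_000); // 1-minute window
for (let i = 0; i < 95; i++) tracker.record(false, 1_000);
for (let i = 0; i < 5; i++) tracker.record(true, 1_000);
console.log(tracker.rate(2_000)); // 5
```

In practice you would export this value via a prom-client `Gauge`, or skip the gauge entirely and compute the rate in PromQL from the counters above.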

External API Metrics:

  • external_api_calls_total (counter) — calls to Stripe, AWS, etc.
  • external_api_duration_seconds (histogram) — latency of external calls
  • external_api_errors_total (counter) — errors from external APIs

System Metrics:

  • nodejs_heap_used_bytes (gauge) — memory usage
  • nodejs_eventloop_lag_seconds (gauge) — event loop lag
  • database_connections_active (gauge) — active DB connections

Implementing Metrics with Prometheus

npm install prom-client

// metrics.ts
import { Registry, Counter, Histogram } from 'prom-client';
import { Request, Response, NextFunction } from 'express';

export const register = new Registry();

// Request counter
export const httpRequestsTotal = new Counter({
  name: 'api_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'endpoint', 'status'],
  registers: [register],
});

// Request duration histogram
export const httpRequestDuration = new Histogram({
  name: 'api_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'endpoint', 'status'],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5], // Response time buckets
  registers: [register],
});

// External API call duration
export const externalApiDuration = new Histogram({
  name: 'external_api_duration_seconds',
  help: 'Duration of external API calls',
  labelNames: ['service', 'endpoint', 'status'],
  buckets: [0.1, 0.5, 1, 2, 5, 10],
  registers: [register],
});

// Metrics middleware
export function metricsMiddleware(req: Request, res: Response, next: NextFunction) {
  const start = Date.now();

  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    const endpoint = req.route?.path || req.path;

    httpRequestsTotal.inc({
      method: req.method,
      endpoint,
      status: res.statusCode,
    });

    httpRequestDuration.observe(
      {
        method: req.method,
        endpoint,
        status: res.statusCode,
      },
      duration
    );
  });

  next();
}

Expose Metrics Endpoint

// app.ts
import express from 'express';
import { register, metricsMiddleware } from './metrics';

const app = express();

app.use(metricsMiddleware);

app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

app.listen(3000);

Track External API Calls

import axios from 'axios';
import { externalApiDuration } from './metrics';

export async function chargeCard(amount: number) {
  const end = externalApiDuration.startTimer();

  try {
    const response = await axios.post('https://api.stripe.com/v1/charges', {
      amount: amount * 100,
      currency: 'usd',
    });

    end({ service: 'stripe', endpoint: '/v1/charges', status: response.status });
    return response.data;
  } catch (error: any) {
    end({ service: 'stripe', endpoint: '/v1/charges', status: error.response?.status || 0 });
    throw error;
  }
}

Structured Logging {#structured-logging}

Logs become 10x more useful when they're structured (JSON) instead of plain text, and include correlation IDs linking them to traces.

Implementing Structured Logging with Winston

npm install winston

// logger.ts
import winston from 'winston';
import { trace, context } from '@opentelemetry/api';

// Custom format that adds trace context
const traceFormat = winston.format((info) => {
  const span = trace.getSpan(context.active());
  if (span) {
    const spanContext = span.spanContext();
    info.traceId = spanContext.traceId;
    info.spanId = spanContext.spanId;
  }
  return info;
});

export const logger = winston.createLogger({
  level: process.env.LOG_LEVEL || 'info',
  format: winston.format.combine(
    winston.format.timestamp(),
    traceFormat(), // Add trace IDs
    winston.format.errors({ stack: true }),
    winston.format.json()
  ),
  defaultMeta: {
    service: 'payment-service',
    environment: process.env.NODE_ENV,
  },
  transports: [
    new winston.transports.Console(),
    new winston.transports.File({ filename: 'error.log', level: 'error' }),
    new winston.transports.File({ filename: 'combined.log' }),
  ],
});

Using the Logger

import { logger } from './logger';

export async function processPayment(userId: string, amount: number) {
  logger.info('Processing payment', {
    userId,
    amount,
    currency: 'USD',
  });

  try {
    const result = await stripe.charges.create({ amount: amount * 100 });

    logger.info('Payment successful', {
      userId,
      paymentId: result.id,
      status: result.status,
    });

    return result;
  } catch (error) {
    logger.error('Payment failed', {
      userId,
      amount,
      error: {
        message: error.message,
        code: error.code,
        type: error.type,
      },
    });
    throw error;
  }
}

Example Structured Log Output

{
  "timestamp": "2026-03-11T14:05:23.142Z",
  "level": "error",
  "message": "Payment failed",
  "service": "payment-service",
  "environment": "production",
  "traceId": "4bf92f3577b34da6a3ce929d0e0e4736",
  "spanId": "05ce929d0e0e4736",
  "userId": "usr_abc123",
  "amount": 100,
  "error": {
    "message": "Your card was declined",
    "code": "card_declined",
    "type": "StripeCardError"
  }
}

Power of correlation: Copy the traceId → paste into your tracing tool (Jaeger, Datadog) → see the complete request flow that generated this error.

Popular Observability Tools {#observability-tools}

Open Source

Jaeger (distributed tracing)

  • Created by Uber, now CNCF project
  • Great for self-hosted tracing
  • Supports OpenTelemetry OTLP format
  • Free, but requires infrastructure setup

Grafana + Loki + Tempo (metrics + logs + traces)

  • Complete observability stack
  • Grafana for dashboards, Loki for logs, Tempo for traces
  • Works with Prometheus for metrics
  • Popular for Kubernetes environments

Zipkin (distributed tracing)

  • Alternative to Jaeger
  • Simpler UI, fewer features
  • Good for smaller teams

Commercial SaaS

Datadog APM

  • All-in-one: metrics, logs, traces, RUM
  • Auto-instrumentation for 15+ languages
  • Great correlation between data sources
  • Pricing: ~$31/host/month + $1.70/million spans

New Relic

  • Full-stack observability platform
  • Strong APM and distributed tracing
  • Good error tracking
  • Pricing: $99/user/month (full platform)

Honeycomb

  • Query-based exploration (not dashboards)
  • Excellent for debugging complex issues
  • Powerful "slice and dice" UI
  • Pricing: $0.03/GB ingested

Elastic APM (part of Elastic Stack)

  • Integrates with Elasticsearch/Kibana
  • Good if you already use ELK stack
  • Strong log correlation
  • Pricing: $95/month for 50GB

Sentry

  • Primarily error tracking with tracing
  • Great developer experience
  • Performance monitoring add-on
  • Pricing: $26/month for 50K errors

AWS X-Ray

  • Native AWS integration
  • Trace requests across Lambda, ECS, EC2
  • Works with API Gateway, DynamoDB
  • Pricing: $5 per million traces

Decision Guide

Choose Jaeger if:

  • You want self-hosted (no SaaS costs)
  • You have Kubernetes infrastructure
  • You need basic distributed tracing only

Choose Datadog if:

  • You want best-in-class correlation (logs ↔ traces ↔ metrics)
  • You monitor infrastructure + applications
  • Budget allows $300-1000+/month

Choose Honeycomb if:

  • You need to debug complex, high-cardinality issues
  • You want query-based exploration vs pre-built dashboards
  • You have unpredictable traffic (pay-per-GB works better)

Choose New Relic if:

  • You want full observability platform
  • You're already invested in New Relic ecosystem

Choose AWS X-Ray if:

  • You run primarily on AWS
  • You want zero-config Lambda tracing

Real-World Examples {#real-world-examples}

Example 1: Finding a Performance Bottleneck

Symptom: API response times increased from 200ms → 1.2s after deploying inventory service v2.4.

Without observability:

  • Check server CPU/memory (normal)
  • Review code changes (200+ line diff, nothing obvious)
  • Add console.log statements, redeploy
  • Wait for logs... still unclear

With distributed tracing:

  1. Open trace for slow request
  2. See breakdown:
    GET /api/products (1,200ms)
      ├─ Auth middleware (8ms)
      ├─ Inventory service (1,150ms) ⚠️
      │   ├─ PostgreSQL: SELECT products (15ms)
      │   └─ Loop: fetch images (1,135ms) ⚠️
      │       ├─ S3 getObject (~142ms each) × 8 ≈ 1,135ms
      └─ Serialize response (42ms)
    
  3. Root cause found: New code fetches product images sequentially (8 products × 142ms each) instead of parallel.

Fix:

// Before (sequential — 1,136ms)
for (const product of products) {
  product.imageUrl = await s3.getObject(product.imageKey);
}

// After (parallel — 145ms)
await Promise.all(
  products.map(async (product) => {
    product.imageUrl = await s3.getObject(product.imageKey);
  })
);

Result: Response time back to 205ms. Issue found in 3 minutes vs 3 hours.

Example 2: Debugging Intermittent Errors

Symptom: 2% of checkout requests fail with "Payment processing timeout" — but Stripe status page shows no issues.

Investigation with observability:

  1. Check metrics: Error rate spikes every ~15 minutes
  2. Query traces: Filter failed checkouts by trace ID
  3. Discover pattern: All failing requests show:
    Payment service → Stripe API (timeout after 30s)
    
  4. Check logs with traceId: Find correlation:
    {
      "traceId": "abc123",
      "message": "Stripe webhook delivery failed",
      "error": "Connection pool exhausted (max 20 connections)"
    }
    
  5. Root cause: Webhook handler holds database connections during Stripe callbacks. When Stripe is slow (30s+), connection pool exhausts, blocking new payment requests.

Fix: Process webhooks asynchronously via message queue instead of blocking on DB writes.
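In sketch form: the handler validates and enqueues, then returns 200 immediately, so no database connection is held while Stripe is slow. The in-memory array stands in for a real queue (BullMQ, SQS, etc.) — names are illustrative:

```typescript
type WebhookEvent = { id: string; type: string; payload: unknown };

const queue: WebhookEvent[] = []; // stand-in for a durable message queue

// Before: the handler held a DB connection for the whole callback,
// exhausting the pool. After: acknowledge fast, defer the work.
function handleStripeWebhook(event: WebhookEvent): { status: number } {
  if (!event.id || !event.type) return { status: 400 }; // reject malformed
  queue.push(event); // a worker drains this and performs the DB writes
  return { status: 200 }; // acknowledged immediately, no DB held
}

// A separate worker processes events at its own pace:
async function drainQueue(process: (e: WebhookEvent) => Promise<void>) {
  while (queue.length > 0) {
    await process(queue.shift()!);
  }
}
```

With a real queue you also gain retries and back-pressure for free, which an inline handler never had.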

Example 3: Identifying External API Impact

Question: "Which external API has the biggest impact on our response times?"

Answer with metrics:

Query Prometheus:

topk(5, 
  sum by (service) (
    rate(external_api_duration_seconds_sum[1h])
  )
)

Results:

1. stripe: 45% of total external API time
2. aws-s3: 28%
3. sendgrid: 15%
4. twilio: 8%
5. github: 4%

Insight: Stripe API is responsible for nearly half of external dependency time. Investigate caching payment methods or pre-authorizations to reduce calls.

Best Practices {#best-practices}

1. Use Sampling for High-Traffic APIs

Tracing every request can be expensive. Use sampling:

// Trace 10% of requests in production, 100% in dev
import { TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-base';

const sampleRate = process.env.NODE_ENV === 'production' ? 0.1 : 1.0;

const sdk = new NodeSDK({
  // ... other config
  sampler: new TraceIdRatioBasedSampler(sampleRate),
});

Advanced sampling: Always trace errors and slow requests:

import {
  Sampler,
  SamplingDecision,
  SamplingResult,
} from '@opentelemetry/sdk-trace-base';
import { Context, SpanKind, Attributes, Link } from '@opentelemetry/api';

// Caveat: a head-based sampler only sees attributes available at span
// *start*. Final status codes and durations are generally only known at
// span end, so rules like these are usually enforced with tail-based
// sampling (e.g. in the OpenTelemetry Collector).
class ErrorAndSlowRequestSampler implements Sampler {
  shouldSample(
    context: Context,
    traceId: string,
    spanName: string,
    spanKind: SpanKind,
    attributes: Attributes,
    links: Link[]
  ): SamplingResult {
    // Always sample errors (when the status code is already known)
    if ((attributes['http.status_code'] as number) >= 400) {
      return { decision: SamplingDecision.RECORD_AND_SAMPLED };
    }

    // Always sample slow requests (>1s)
    if ((attributes['http.response_time'] as number) > 1000) {
      return { decision: SamplingDecision.RECORD_AND_SAMPLED };
    }

    // Sample 10% of everything else
    return Math.random() < 0.1
      ? { decision: SamplingDecision.RECORD_AND_SAMPLED }
      : { decision: SamplingDecision.NOT_RECORD };
  }

  toString(): string {
    return 'ErrorAndSlowRequestSampler';
  }
}

2. Set Span Attributes Generously

More context = faster debugging:

span.setAttributes({
  'user.id': userId,
  'user.email': userEmail,
  'user.subscription_tier': 'pro',
  'payment.amount': 100,
  'payment.currency': 'USD',
  'payment.method': 'credit_card',
  'feature.flag.new_checkout': true,
  'request.user_agent': req.headers['user-agent'],
});

Don't include:

  • Passwords, tokens, API keys
  • Full credit card numbers
  • PII (if regulations prohibit)

3. Correlate Logs with Traces

Always include traceId and spanId in logs:

logger.info('Payment processed', {
  traceId: span.spanContext().traceId,
  spanId: span.spanContext().spanId,
  userId,
  amount,
});

This lets you jump from a dashboard spike → filtered logs → full trace visualization.

4. Monitor Observability System Health

Your observability tools can fail too:

  • Trace export failures: Set up alerts when OpenTelemetry exporter errors exceed 1%
  • Metric scraping failures: Monitor Prometheus scrape failures
  • Log ingestion lag: Alert if logs are >5 minutes behind real-time

One way to surface exporter failures — the NodeSDK has no dedicated event hook; it reports internal errors through OpenTelemetry's diag API:

import { diag, DiagLogLevel } from '@opentelemetry/api';
import { Counter } from 'prom-client';

const otelErrors = new Counter({
  name: 'otel_export_errors_total',
  help: 'OpenTelemetry internal errors (including failed exports)',
});

// Route OpenTelemetry's internal diagnostics through our own logger
diag.setLogger(
  {
    verbose: () => {},
    debug: () => {},
    info: () => {},
    warn: (message) => logger.warn('OpenTelemetry warning', { message }),
    error: (message) => {
      otelErrors.inc();
      logger.error('OpenTelemetry error', { message });
    },
  },
  DiagLogLevel.ERROR
);

5. Use Semantic Conventions

OpenTelemetry defines semantic conventions for consistent attribute naming:

✅ Do:

span.setAttributes({
  'http.method': 'POST',
  'http.status_code': 200,
  'http.url': '/api/checkout',
  'db.system': 'postgresql',
  'db.statement': 'SELECT * FROM users WHERE id = $1',
});

❌ Don't:

span.setAttributes({
  'method': 'POST', // Use http.method
  'statusCode': 200, // Use http.status_code
  'endpoint': '/api/checkout', // Use http.url
  'query': 'SELECT ...', // Use db.statement
});

Semantic conventions ensure:

  • Consistent naming across services
  • Better support from observability tools
  • Easier queries and dashboards

6. Set Alert Thresholds Based on Data

Don't guess — use percentiles from real traffic:

# P95 response time over 7 days
histogram_quantile(0.95, 
  rate(api_request_duration_seconds_bucket[7d])
)

Alert when: Current P95 > baseline P95 + 50%

7. Create Runbooks for Common Traces

Document patterns you see frequently:

Pattern: stripe_api_timeout in traces
Cause: Stripe API slow or down
Action:

  1. Check Stripe status page
  2. If Stripe is down, enable circuit breaker (return cached payment methods)
  3. If Stripe is operational, increase timeout from 5s → 10s
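The circuit breaker mentioned in step 2 can be sketched in a few lines — thresholds and names here are illustrative, not a specific library's API:

```typescript
// After N consecutive failures the breaker opens: callers get the
// fallback (e.g. cached payment methods) instead of waiting on a dead API.
class CircuitBreaker<T> {
  private failures = 0;
  private openUntil = 0;

  constructor(
    private maxFailures: number, // consecutive failures before opening
    private cooldownMs: number   // how long to stay open
  ) {}

  async call(fn: () => Promise<T>, fallback: () => T): Promise<T> {
    if (Date.now() < this.openUntil) return fallback(); // open: skip the call
    try {
      const result = await fn();
      this.failures = 0; // success closes the breaker
      return result;
    } catch {
      if (++this.failures >= this.maxFailures) {
        this.openUntil = Date.now() + this.cooldownMs; // trip open
        this.failures = 0;
      }
      return fallback();
    }
  }
}
```

Production libraries (e.g. opossum for Node.js) add half-open probing and metrics on top of this basic state machine.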

Pattern: database_connection_pool_exhausted
Cause: Long-running queries or leaked connections
Action:

  1. Check active queries: SELECT * FROM pg_stat_activity WHERE state = 'active'
  2. Kill long queries: SELECT pg_terminate_backend(pid)
  3. Review code for missing connection.release() calls
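The leak-proof pattern from step 3 is to release in a `finally` block so the connection returns to the pool on success and on error alike. The pool interfaces below mirror node-postgres but are stubs for illustration:

```typescript
interface PoolClient {
  query(sql: string, params?: unknown[]): Promise<unknown>;
  release(): void;
}
interface Pool {
  connect(): Promise<PoolClient>;
}

async function getUser(pool: Pool, id: string) {
  const client = await pool.connect();
  try {
    return await client.query('SELECT * FROM users WHERE id = $1', [id]);
  } finally {
    client.release(); // runs on success AND on error — no leaked connections
  }
}
```

Any early `return` or thrown error between `connect()` and `release()` that isn't covered by `finally` is exactly the leak this runbook pattern describes.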

Common Mistakes {#common-mistakes}

1. Not Sampling High-Volume Endpoints

Mistake: Tracing every health check request (1000/second).

Impact: Trace storage costs explode, performance degrades.

Fix:

instrumentations: [
  getNodeAutoInstrumentations({
    '@opentelemetry/instrumentation-http': {
      ignoreIncomingPaths: ['/health', '/metrics', '/favicon.ico'],
    },
  }),
],

2. Missing Context Propagation

Mistake: Service A sends traceId to Service B, but Service B creates a new trace instead of continuing the existing one.

Impact: You see two separate traces instead of one complete flow.

Fix: Use OpenTelemetry's propagation.inject() and ensure Service B's auto-instrumentation extracts headers:

// Service A
const headers = {};
propagation.inject(context.active(), headers);
await axios.post('http://service-b/api', data, { headers });

// Service B automatically extracts traceparent header from request

3. Not Recording Exceptions in Spans

Mistake:

try {
  await riskyOperation();
} catch (error) {
  logger.error('Operation failed', { error });
  throw error; // Span status stays UNSET — the error never appears on the trace
}

Fix:

try {
  await riskyOperation();
} catch (error) {
  span.recordException(error);
  span.setStatus({ code: SpanStatusCode.ERROR });
  throw error;
}

4. Over-Instrumenting Hot Paths

Mistake: Creating spans for every iteration of a loop processing 10,000 items.

Impact: Generates 10,000 spans, slows down processing, inflates storage costs.

Fix: Create one span for the entire batch:

// ❌ Don't
for (const item of items) {
  const span = tracer.startSpan('processItem');
  await process(item);
  span.end();
}

// ✅ Do
const span = tracer.startSpan('processBatch', {
  attributes: { 'batch.size': items.length },
});
await Promise.all(items.map(process));
span.end();

5. Logging PII or Secrets

Mistake:

logger.info('User login', {
  email: 'user@example.com',
  password: 'plaintext123', // ❌ NEVER
  creditCard: '4111111111111111', // ❌ NEVER
});

Fix: Redact sensitive data:

logger.info('User login', {
  userId: 'usr_abc123',
  email: 'u***@example.com', // Masked
  authMethod: 'password',
});
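A small redaction helper applied before metadata reaches any transport keeps this from depending on developer discipline. The key list and masking rules here are illustrative:

```typescript
// Mask or drop sensitive keys before logging. Extend for your domain.
const SENSITIVE_KEYS = ['password', 'token', 'apikey', 'creditcard', 'ssn'];

function redact(meta: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(meta)) {
    if (SENSITIVE_KEYS.includes(key.toLowerCase().replace(/[_-]/g, ''))) {
      out[key] = '[REDACTED]'; // never log the value at all
    } else if (key.toLowerCase() === 'email' && typeof value === 'string') {
      out[key] = value.replace(/^(.).*(@.*)$/, '$1***$2'); // u***@example.com
    } else {
      out[key] = value;
    }
  }
  return out;
}

console.log(redact({ email: 'user@example.com', password: 'plaintext123' }));
// { email: 'u***@example.com', password: '[REDACTED]' }
```

With Winston, this logic fits naturally into a custom `winston.format` so every log line passes through it automatically.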

6. Not Monitoring Observability Costs

Mistake: Sending unsampled traces from development/staging to production observability backend.

Impact: Datadog bill goes from $500/month → $5,000/month.

Fix:

  • Use separate environments (dev/staging/prod)
  • Set aggressive sampling in non-prod (1-5%)
  • Monitor ingestion volumes daily

7. Ignoring Cardinality

Mistake: Using high-cardinality values as metric labels:

// ❌ Don't — userId has millions of unique values
httpRequestsTotal.inc({ userId: 'usr_abc123' });

Impact: Metric explosion (millions of time series), Prometheus crashes.

Fix: Use low-cardinality labels only:

// ✅ Do — subscription tier has 3-5 values
httpRequestsTotal.inc({ subscription_tier: 'pro' });

Production Checklist {#production-checklist}

Before deploying observability to production:

Tracing:

  • OpenTelemetry SDK initialized before app code
  • Sampling configured (10-20% for high traffic)
  • Health checks/metrics excluded from tracing
  • Context propagation tested across services
  • Span attributes follow semantic conventions
  • Exceptions recorded in spans
  • Trace exporter tested (Jaeger/Datadog/etc. receiving data)

Metrics:

  • Key metrics instrumented (requests, errors, latency, external API calls)
  • Low-cardinality labels only (avoid userId, traceId in labels)
  • /metrics endpoint exposed for Prometheus scraping
  • Metrics endpoint excluded from tracing (avoid infinite loop)
  • Dashboard created for key metrics

Logging:

  • Structured JSON logging configured
  • Trace IDs included in all logs
  • PII/secrets redacted or excluded
  • Log levels configurable via environment variable
  • Log retention policy set (7-30 days typical)
  • Error logs routed to monitoring (Sentry, PagerDuty)

Alerting:

  • P95 latency alert (threshold: +50% from baseline)
  • Error rate alert (threshold: >1% for 5 minutes)
  • External API error rate (per service)
  • Trace export failure alert
  • Log ingestion lag alert
  • Runbooks documented for each alert

Cost Management:

  • Sampling rates set appropriately
  • Dev/staging environments use separate backends or aggressive sampling
  • Ingestion volume monitored daily
  • Budget alerts configured (Datadog/New Relic)

Testing:

  • Generate test trace and verify end-to-end flow in UI
  • Trigger error, verify exception appears in trace
  • Check log-to-trace correlation (click traceId → see full trace)
  • Load test with tracing enabled (measure performance impact)

Conclusion

Observability transforms debugging from guesswork to science. Instead of "the API is slow," you know:

  • Which endpoint is slow (traces)
  • Why it's slow (span timing breakdown)
  • When it started (metrics timeline)
  • What errors occurred (correlated logs)
  • Who was affected (user attributes in traces)

Start simple:

  1. Add OpenTelemetry auto-instrumentation (15 minutes)
  2. Send traces to Jaeger or Datadog (10 minutes)
  3. Add structured logging with trace IDs (30 minutes)
  4. Instrument key metrics (1 hour)
  5. Create your first dashboard (30 minutes)

Total investment: ~3 hours to go from zero observability to production-ready monitoring.

The next time your API breaks at 3 AM, you'll know exactly where to look.


Track the status of observability tools like Datadog, New Relic, Sentry, and infrastructure providers like AWS, Stripe, and Twilio on API Status Check.
