Log Aggregation Guide: Centralized Logging for DevOps Teams (2026)

How to collect, centralize, and query logs from your entire infrastructure — and actually find what you're looking for during an incident.

By API Status Check·Updated April 2026·12 min read

Logs are the raw material of incident investigation. But logs scattered across dozens of servers, containers, and cloud services are nearly useless when something breaks at 2 AM. Log aggregation centralizes everything into a single searchable system — so you spend minutes debugging instead of hours tracking down where to look.

This guide covers the full log aggregation landscape: structured logging patterns, the ELK stack, Grafana Loki, managed services, and how to build alerting on top of your log data.

Why Centralized Logging Matters

Without log aggregation, incident investigation looks like this: SSH to the web server, grep the application log. SSH to the database, grep its log. SSH to the load balancer, grep its access log. Cross-reference timestamps manually. Miss the Kubernetes pod that died and took its logs with it.

With centralized logging: open one dashboard, filter by time window and service, and see the full sequence of events across every component in milliseconds. Mean time to resolution (MTTR) drops because hunting for the right log file is no longer part of the investigation.

Structured Logging: The Foundation

Before choosing a log aggregation tool, fix your logging format. Unstructured logs — plain text strings — are hard to query and nearly impossible to alert on reliably. Structured logs emit every field as a named key-value pair.

| Format | Example | Queryability |
|---|---|---|
| Unstructured | 2026-04-30 10:15:00 ERROR User 123 checkout failed: payment timeout | grep only |
| Structured (JSON) | {"ts":"2026-04-30T10:15:00Z","level":"error","event":"checkout.failed","user_id":123,"reason":"payment_timeout","duration_ms":5012} | field-level queries |
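To make "field-level queries" concrete, here is a minimal Python sketch that filters JSON log lines by field. The first sample line is the structured example from the table; the other two lines and the `slow_failures` helper are illustrative assumptions, not part of any particular tool.

```python
import json

RAW_LINES = [
    '{"ts":"2026-04-30T10:15:00Z","level":"error","event":"checkout.failed","user_id":123,"reason":"payment_timeout","duration_ms":5012}',
    '{"ts":"2026-04-30T10:15:02Z","level":"info","event":"order.created","user_id":456,"duration_ms":87}',
    "2026-04-30 10:15:03 ERROR plain text line with no fields",
]


def parse_lines(lines):
    """Yield one dict per line that is a valid JSON object; skip the rest."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict):
            yield record


def slow_failures(lines, min_duration_ms):
    """Return error-level records whose duration_ms meets the threshold."""
    return [
        r for r in parse_lines(lines)
        if r.get("level") == "error" and r.get("duration_ms", 0) >= min_duration_ms
    ]


matches = slow_failures(RAW_LINES, min_duration_ms=5000)
```

The unstructured line is silently skipped here, which is the point: a numeric comparison on `duration_ms` has no reliable equivalent with grep.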

Structured Logging Examples

// Node.js: pino (fast JSON logger)
import pino from 'pino';
const log = pino({
  base: { service: 'checkout-api', env: process.env.NODE_ENV },
  redact: ['req.headers.authorization', 'body.card_number'],
});

// Structured event log
log.info({
  event: 'order.created',
  order_id: order.id,
  user_id: req.user.id,
  amount_cents: order.total,
  duration_ms: Date.now() - startTime,
}, 'Order created successfully');

// Error with context
log.error({
  err, // pino's built-in `err` serializer logs the type, message, and stack
  event: 'payment.failed',
  order_id: order.id,
  provider: 'stripe',
  error_code: err.code,
  duration_ms: Date.now() - startTime,
}, 'Payment processing failed');

# Python: structlog
import structlog

# Configure structlog for JSON output before creating loggers
structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer(),
    ],
)

log = structlog.get_logger()

log.info("order.created", order_id=order.id, user_id=request.user.id, amount_cents=order.total)
# Inside an `except Exception as e:` block
log.error("payment.failed", order_id=order.id, provider="stripe", error=str(e))
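If adding a logging dependency is not an option, the same one-JSON-object-per-line output can be approximated with Python's standard `logging` module. This is a rough sketch, not a replacement for pino or structlog; the `JsonFormatter` class, the `build_logger` helper, and the convention of passing fields under `extra={"fields": ...}` are assumptions made for this example.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record):
        payload = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "event": record.getMessage(),
        }
        # Fields passed via extra={"fields": {...}} are merged into the output.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(name, stream=None):
    """Create a logger that writes JSON lines to `stream` (stderr by default)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


log = build_logger("checkout-api")
log.info("order.created", extra={"fields": {"order_id": "ord_1", "amount_cents": 4999}})
```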

Log Aggregation Architecture

Every log aggregation system has the same three layers:

  1. Shipper/collector — runs on each server/pod and ships logs to the aggregator (Filebeat, Fluent Bit, Fluentd, Vector)
  2. Aggregator/processor — receives, parses, transforms, and routes logs (Logstash, Loki, Elasticsearch, Kafka)
  3. Storage and query — stores logs and provides search UI (Elasticsearch/Kibana, Loki/Grafana, Datadog)
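The three layers can be sketched in a few lines of Python. This is a toy in-memory model for orientation only: the `ship`, `process`, and `Store` names are made up for the example, and real shippers and aggregators add buffering, retries, and backpressure that are left out here.

```python
import json
from collections import defaultdict


def ship(host, raw_lines):
    """Shipper: attach the origin host to each raw line."""
    for line in raw_lines:
        yield {"host": host, "raw": line}


def process(envelopes):
    """Processor: parse JSON payloads and drop lines that fail to parse."""
    for env in envelopes:
        try:
            fields = json.loads(env["raw"])
        except json.JSONDecodeError:
            continue
        if isinstance(fields, dict):
            yield {"host": env["host"], **fields}


class Store:
    """Storage and query: keep records grouped by service for filtered lookups."""

    def __init__(self):
        self._by_service = defaultdict(list)

    def ingest(self, records):
        for rec in records:
            self._by_service[rec.get("service", "unknown")].append(rec)

    def query(self, service, level=None):
        return [
            r for r in self._by_service.get(service, [])
            if level is None or r.get("level") == level
        ]


store = Store()
store.ingest(process(ship("web-1", [
    '{"service":"checkout","level":"error","event":"payment.failed"}',
    '{"service":"checkout","level":"info","event":"order.created"}',
    "plain text line that the processor drops",
])))
errors = store.query("checkout", level="error")
```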

ELK Stack vs. Grafana Loki vs. Managed Services

ELK Stack (Elasticsearch + Logstash + Kibana)

The ELK stack is the most capable open-source log aggregation solution. Elasticsearch indexes every token of every log line, enabling full-text search across billions of events. Kibana provides powerful visualizations, dashboards, and ML-based anomaly detection.

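Trade-offs: indexing full log content makes Elasticsearch more expensive and operationally heavier to run than label-indexed systems such as Loki.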

Grafana Loki

Loki takes a different approach: it only indexes metadata labels (like Prometheus), not log content. This makes it 10–100x cheaper to operate than Elasticsearch for the same log volume, at the cost of query flexibility. You can't do arbitrary full-text search — you filter by labels (service, environment, pod) and then grep within results.

This is the right trade-off for most teams — you almost always know which service is producing the logs you care about.
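A toy model helps show why label-only indexing stays cheap. In the sketch below the index holds one entry per unique label set, no matter how many lines are stored, and a content search is a scan over the chunks that the labels select. `LabelIndexedStore` is a hypothetical illustration of the idea, not how Loki is implemented.

```python
from collections import defaultdict


class LabelIndexedStore:
    """Toy store: index only label sets, keep log lines as unindexed chunks."""

    def __init__(self):
        # One chunk (a list of lines) per unique label set.
        self._chunks = defaultdict(list)

    def push(self, labels, line):
        self._chunks[frozenset(labels.items())].append(line)

    def index_size(self):
        """Number of index entries: grows with label sets, not with log volume."""
        return len(self._chunks)

    def query(self, selector, contains=""):
        """Pick chunks whose labels match the selector, then scan their lines."""
        wanted = set(selector.items())
        hits = []
        for label_set, lines in self._chunks.items():
            if wanted <= label_set:
                hits.extend(line for line in lines if contains in line)
        return hits


store = LabelIndexedStore()
for i in range(1000):
    store.push({"service": "checkout", "env": "production"}, f"request {i} ok")
store.push({"service": "checkout", "env": "production"}, "error: payment timeout")
store.push({"service": "search", "env": "production"}, "error: index unavailable")

hits = store.query({"service": "checkout"}, contains="error")
```

After 1,002 pushed lines the index still has two entries, one per label set. The cost shows up at query time instead: the `contains` filter has to read every line in the selected chunks.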

# LogQL: Loki's query language (Prometheus-like)
# Filter logs from checkout service with errors
{service="checkout", env="production"} |= "error"

# Per-second rate of error-level lines by service, over a 5-minute window
sum by (service) (
  rate({env="production"} |= `"level":"error"` [5m])
)

# Parse JSON and filter by field
{service="api"} | json | user_id="12345"

# Alert: more than 10 error lines per minute
# (pair with `for: 5m` in the alert rule to require 5 minutes of sustained errors)
sum(count_over_time({env="production"} |= "error" [1m])) > 10
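The same queries can be run outside Grafana through Loki's HTTP API. The sketch below builds a request for the `/loki/api/v1/query_range` endpoint using only the standard library. The host name is borrowed from the Fluent Bit example in this guide and is an assumption about your cluster, as are the helper names and the sample time window.

```python
import json
import urllib.parse
import urllib.request


def build_query_range_url(base_url, logql, start_ns, end_ns, limit=100):
    """Build a query_range URL; start and end are Unix timestamps in nanoseconds."""
    params = urllib.parse.urlencode({
        "query": logql,
        "start": start_ns,
        "end": end_ns,
        "limit": limit,
        "direction": "backward",
    })
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{params}"


def fetch_lines(url, timeout=10):
    """Run the query and return the raw log lines from every matching stream."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        body = json.load(resp)
    # Each stream result carries [timestamp_ns, line] pairs under "values".
    return [line for stream in body["data"]["result"] for _, line in stream["values"]]


url = build_query_range_url(
    "http://loki.monitoring.svc.cluster.local:3100",  # assumed in-cluster address
    '{service="checkout", env="production"} |= "error"',
    start_ns=1777543200000000000,  # 2026-04-30T10:00:00Z
    end_ns=1777546800000000000,    # 2026-04-30T11:00:00Z
)
```

Calling `fetch_lines(url)` needs a reachable Loki instance, so it is left out of the example run.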

Managed Log Services

| Service | Storage Engine | Retention | Pricing | Best For |
|---|---|---|---|---|
| Datadog Logs | Proprietary | 15 days–1yr | $0.10/GB + $2.04/M events | Datadog users |
| Better Stack Logs | ClickHouse + S3 | 3 days–1yr | $25/mo (50GB) | Cost-efficient teams |
| Papertrail | Proprietary | 1 week–1yr | $7/mo (1GB) | Simple apps |
| Loggly | Elasticsearch | 30 days–1yr | $79/mo (1GB/day) | Teams needing full-text search |
| Grafana Cloud Logs | Loki | 30 days free | Free 50GB/mo | Grafana users |
| Elastic Cloud | Elasticsearch | Configurable | $95+/mo | Full ELK without ops |

Log Shippers: Filebeat, Fluent Bit, and Vector

Fluent Bit (Recommended for Kubernetes)

Fluent Bit is the lightweight choice for container environments — written in C, it uses ~450KB of memory compared to Fluentd's 40MB. The Kubernetes DaemonSet deployment reads logs from /var/log/containers/*.log on every node automatically.

# Fluent Bit DaemonSet for Kubernetes → Loki
helm repo add fluent https://fluent.github.io/helm-charts
helm repo update

# Keep the output config in a values file: passing it through --set breaks on
# the commas, and the shell would expand $kubernetes inside double quotes.
cat > fluent-bit-values.yaml <<'EOF'
config:
  outputs: |
    [OUTPUT]
        Name loki
        Match kube.*
        Host loki.monitoring.svc.cluster.local
        Port 3100
        Labels job=fluentbit, env=production
        label_keys $kubernetes['namespace_name'],$kubernetes['pod_name']
EOF

helm install fluent-bit fluent/fluent-bit \
  --namespace logging \
  --create-namespace \
  --values fluent-bit-values.yaml

# Vector: modern alternative with better performance + more outputs
vector --config /etc/vector/vector.toml
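To see what a shipper actually sends, here is a sketch of the request body for Loki's push endpoint (`/loki/api/v1/push`), built by hand in Python. It is only meant to show the wire format: the helper names are assumptions for this example, and a real shipper such as Fluent Bit adds batching, retries, and file-position tracking on top.

```python
import json
import time
import urllib.request


def build_push_payload(labels, lines, now_ns=None):
    """Build the JSON body for Loki's push API from a batch of lines."""
    base_ns = time.time_ns() if now_ns is None else now_ns
    # Loki expects [timestamp_ns_as_string, line] pairs; offset each entry by
    # 1 ns so the batch keeps its order.
    values = [[str(base_ns + i), line] for i, line in enumerate(lines)]
    return {"streams": [{"stream": labels, "values": values}]}


def push(base_url, payload, timeout=10):
    """POST the payload to Loki and return the HTTP status code."""
    req = urllib.request.Request(
        f"{base_url.rstrip('/')}/loki/api/v1/push",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status


payload = build_push_payload(
    {"job": "fluentbit", "env": "production"},
    ["GET /health 200", "POST /checkout 500"],
    now_ns=1777543200000000000,
)
```

Calling `push(...)` requires a reachable Loki instance, so the example stops at building the payload.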

Log Retention and Cost Management

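Log costs grow without bound unless you set retention policies. A typical mid-sized app generates 50–500GB of logs per month. The main strategy to control cost is tiered retention: keep a short window of hot, queryable logs and archive older logs to cheaper cold storage such as S3.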
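A rough sizing calculation shows how retention length drives the bill. The sketch below estimates how much data sits in hot and cold tiers once the retention windows are full. The per-GB prices are hypothetical placeholders, not quotes from any vendor, and compression is ignored.

```python
def steady_state_gb(monthly_ingest_gb, retention_days):
    """Approximate data held once the retention window is full (30-day months)."""
    return monthly_ingest_gb * retention_days / 30


def monthly_storage_cost(monthly_ingest_gb, hot_days, cold_days,
                         hot_price_per_gb, cold_price_per_gb):
    """Estimate stored volume per tier and the resulting monthly storage cost."""
    hot_gb = steady_state_gb(monthly_ingest_gb, hot_days)
    cold_gb = steady_state_gb(monthly_ingest_gb, cold_days)
    return {
        "hot_gb": hot_gb,
        "cold_gb": cold_gb,
        "cost": hot_gb * hot_price_per_gb + cold_gb * cold_price_per_gb,
    }


# 300 GB/month ingest, 30 days hot, then 335 more days in cold storage.
estimate = monthly_storage_cost(
    monthly_ingest_gb=300,
    hot_days=30,
    cold_days=335,
    hot_price_per_gb=0.50,   # hypothetical hot-tier price
    cold_price_per_gb=0.02,  # hypothetical cold-tier price
)
```

With these placeholder prices the cold tier holds more than ten times the data of the hot tier yet accounts for less than half of its cost, which is the argument for tiering.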

Log-Based Alerting

Once logs are centralized, you can alert on log patterns — not just metrics. Useful log alerts:

| Pattern | Alert Condition | Severity |
|---|---|---|
| Error log spike | Error logs > 100/min (vs. baseline of 5/min) | Warning |
| OOM event | Any log matching "out of memory" or "OOMKilled" | Critical |
| Auth failures | Auth failure logs > 50/min from same IP | Warning (brute force) |
| Database connection errors | Any "connection refused" to DB | Critical |
| Payment failures | event=payment.failed count > 10/min | Critical (revenue impact) |
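Rate-based conditions like "count > 10/min" come down to a sliding-window counter over matching events. Here is a minimal Python sketch of that logic. `RateAlert` is a hypothetical helper, and it assumes records arrive in timestamp order with `ts` given in epoch seconds; in practice your log platform evaluates these rules for you.

```python
from collections import deque


class RateAlert:
    """Fire when more than `threshold` matching events fall inside the window."""

    def __init__(self, event, threshold, window_seconds=60):
        self.event = event
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def observe(self, record):
        """Feed one parsed log record; return True while the condition holds."""
        now = record["ts"]
        if record.get("event") == self.event:
            self._timestamps.append(now)
        # Evict matches that have aged out of the window.
        while self._timestamps and self._timestamps[0] <= now - self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps) > self.threshold


alert = RateAlert("payment.failed", threshold=10, window_seconds=60)
# Eleven failures within eleven seconds: the last one crosses the threshold.
states = [alert.observe({"ts": float(i), "event": "payment.failed"}) for i in range(11)]
```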

FAQ

What is the best open-source log aggregation tool?

For Kubernetes teams: Grafana Loki + Fluent Bit + Grafana. It's the most cost-efficient stack, Loki integrates natively with Grafana dashboards and alerting, and Fluent Bit is the lightest log shipper available. For teams needing full-text search across all log content: ELK stack (Elasticsearch + Logstash/Beats + Kibana). Operationally heavier but more powerful for arbitrary searches.

How long should I keep logs?

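Regulatory minimums vary. PCI-DSS requires at least 1 year of audit log history, with the most recent 3 months immediately available. HIPAA does not set a log-specific period, but its 6-year documentation retention rule is commonly applied to audit logs. SOC 2 prescribes no fixed period; 1 year is a common baseline. Operationally, most teams keep 7–30 days of hot (queryable) logs and archive 1 year to cold storage (S3). Logs older than 30 days are rarely queried in practice, because incidents are investigated within hours or days.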

Is Loki good for production?

Yes — Grafana Labs runs Loki in production at massive scale, and many large organizations use it. The key limitation is that full-text search without label filters is slow (it must scan raw compressed chunks). Design your label schema carefully before deploying at scale: don't use high-cardinality values (user IDs, request IDs) as Loki labels — use log fields instead.
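One way to sanity-check a label schema before rollout is to count the distinct values each label key takes in a sample of your streams. The sketch below does that in plain Python. The `risky_labels` helper and its threshold of 100 distinct values are arbitrary assumptions for illustration, not a limit defined by Loki.

```python
from collections import defaultdict


def label_cardinality(streams):
    """Count distinct values seen for each label key across label sets."""
    values = defaultdict(set)
    for labels in streams:
        for key, value in labels.items():
            values[key].add(value)
    return {key: len(vals) for key, vals in values.items()}


def risky_labels(streams, max_values=100):
    """Return label keys whose distinct-value count exceeds max_values."""
    counts = label_cardinality(streams)
    return sorted(key for key, n in counts.items() if n > max_values)


# 500 sampled label sets where request_id was mistakenly used as a label.
streams = [
    {"service": "checkout", "env": "production", "request_id": f"req-{i}"}
    for i in range(500)
]
flagged = risky_labels(streams)
```

Anything flagged this way is a candidate to move out of the label set and into the log line itself, where it can still be filtered after JSON parsing.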
