Logs are the raw material of incident investigation. But logs scattered across dozens of servers, containers, and cloud services are nearly useless when something breaks at 2 AM. Log aggregation centralizes everything into a single searchable system — so you spend minutes debugging instead of hours tracking down where to look.
This guide covers the full log aggregation landscape: structured logging patterns, the ELK stack, Grafana Loki, managed services, and how to build alerting on top of your log data.
Why Centralized Logging Matters
Without log aggregation, incident investigation looks like this: SSH to the web server, grep the application log. SSH to the database, grep its log. SSH to the load balancer, grep its access log. Cross-reference timestamps manually. Miss the Kubernetes pod that died and took its logs with it.
With centralized logging: open one dashboard, filter by time window and service, see the full sequence of events across every component in milliseconds. MTTR drops dramatically.
Structured Logging: The Foundation
Before choosing a log aggregation tool, fix your logging format. Unstructured logs — plain text strings — are hard to query and nearly impossible to alert on reliably. Structured logs emit every field as a named key-value pair.
| Format | Example | Queryability |
|---|---|---|
| Unstructured | 2026-04-30 10:15:00 ERROR User 123 checkout failed: payment timeout | grep only |
| Structured (JSON) | {"ts":"2026-04-30T10:15:00Z","level":"error","event":"checkout.failed","user_id":123,"reason":"payment_timeout","duration_ms":5012} | field-level queries |
Structured Logging Examples
// Node.js: pino (fast JSON logger)
import pino from 'pino';
const log = pino({
base: { service: 'checkout-api', env: process.env.NODE_ENV },
redact: ['req.headers.authorization', 'body.card_number'],
});
// Structured event log
log.info({
event: 'order.created',
order_id: order.id,
user_id: req.user.id,
amount_cents: order.total,
duration_ms: Date.now() - startTime,
}, 'Order created successfully');
// Error with context
log.error({
event: 'payment.failed',
order_id: order.id,
provider: 'stripe',
error_code: err.code,
duration_ms: Date.now() - startTime,
err,
}, 'Payment processing failed');
# Python: structlog
import structlog
log = structlog.get_logger()
# Configure structlog for JSON output
structlog.configure(
processors=[
structlog.contextvars.merge_contextvars,
structlog.processors.add_log_level,
structlog.processors.TimeStamper(fmt="iso"),
structlog.processors.JSONRenderer(),
],
)
log.info("order.created", order_id=order.id, user_id=request.user.id, amount=order.total)
log.error("payment.failed", order_id=order.id, provider="stripe", error=str(e))
Log Aggregation Architecture
Every log aggregation system has the same three layers:
- Shipper/collector — runs on each server/pod and ships logs to the aggregator (Filebeat, Fluent Bit, Fluentd, Vector)
- Aggregator/processor — receives, parses, transforms, and routes logs (Logstash, Loki, Elasticsearch, Kafka)
- Storage and query — stores logs and provides search UI (Elasticsearch/Kibana, Loki/Grafana, Datadog)
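To make the three layers concrete, here is a minimal Python sketch of the shipper layer pushing to Loki's HTTP push API. The endpoint, file path, and labels are placeholders, and a real shipper (Fluent Bit, Vector) also handles rotation, retries, and backpressure:
# Minimal "shipper" sketch: tail a log file and push new lines to a Loki aggregator
import time
import requests  # third-party HTTP client, assumed installed

LOKI_PUSH_URL = "http://loki:3100/loki/api/v1/push"  # placeholder aggregator endpoint
LOG_PATH = "/var/log/app.log"                        # placeholder log file
LABELS = {"service": "checkout-api", "env": "production"}  # keep labels low-cardinality

def push_lines(lines):
    # Loki's push API expects nanosecond timestamps as strings
    now_ns = str(time.time_ns())
    payload = {"streams": [{"stream": LABELS,
                            "values": [[now_ns, line] for line in lines]}]}
    requests.post(LOKI_PUSH_URL, json=payload, timeout=5).raise_for_status()

with open(LOG_PATH) as f:
    f.seek(0, 2)  # start at the end of the file, like `tail -f`
    batch = []
    while True:
        line = f.readline()
        if line:
            batch.append(line.rstrip("\n"))
        if batch and (len(batch) >= 100 or not line):
            push_lines(batch)
            batch = []
        if not line:
            time.sleep(1)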
ELK Stack vs. Grafana Loki vs. Managed Services
ELK Stack (Elasticsearch + Logstash + Kibana)
The ELK stack is the most capable open-source log aggregation solution. Elasticsearch indexes every token of every log line, enabling full-text search across billions of events. Kibana provides powerful visualizations, dashboards, and ML-based anomaly detection.
Trade-offs:
- ✅ Full-text search on all log content
- ✅ Massive ecosystem, extensive integrations
- ✅ Elastic Cloud managed option
- ❌ High resource consumption (JVM heap, disk I/O)
- ❌ Operationally complex at scale
- ❌ Expensive at high ingest volumes ($95–500+/month managed)
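To make the full-text search point concrete, here is roughly what a query against Elasticsearch's _search API looks like from Python. The index name, host, and field names are assumptions for illustration:
# Full-text search across log messages via Elasticsearch's _search API (sketch)
import requests  # third-party HTTP client, assumed installed

query = {
    "query": {
        "bool": {
            "must": [{"match": {"message": "payment timeout"}}],       # full-text match
            "filter": [{"range": {"@timestamp": {"gte": "now-1h"}}}],  # last hour only
        }
    },
    "size": 20,
    "sort": [{"@timestamp": {"order": "desc"}}],
}
resp = requests.post("http://localhost:9200/logs-production/_search", json=query, timeout=10)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_source"].get("@timestamp"), hit["_source"].get("message"))
Loki cannot run a match clause like this without first narrowing by labels, which is the core trade-off discussed next.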
Grafana Loki
Loki takes a different approach: it only indexes metadata labels (like Prometheus), not log content. This makes it 10–100x cheaper to operate than Elasticsearch for the same log volume, at the cost of query flexibility. You can't do arbitrary full-text search — you filter by labels (service, environment, pod) and then grep within results.
This is the right trade-off for most teams — you almost always know which service is producing the logs you care about.
# LogQL: Loki's query language (Prometheus-like)
# Filter logs from checkout service with errors
{service="checkout", env="production"} |= "error"
# Error rate per second, by service, over a 5-minute window
sum by (service) (
rate({env="production"} |= `"level":"error"` [5m])
)
# Parse JSON and filter by field
{service="api"} | json | user_id="12345"
# Alert: fire when error logs exceed 10 per minute
sum(count_over_time({env="production"} |= "error" [1m])) > 10
Managed Log Services
| Service | Storage Engine | Retention | Pricing | Best For |
|---|---|---|---|---|
| Datadog Logs | Proprietary | 15 days–1yr | $0.10/GB + $2.04/M events | Datadog users |
| Better Stack Logs | ClickHouse + S3 | 3 days–1yr | $25/mo (50GB) | Cost-efficient teams |
| Papertrail | Proprietary | 1 week–1yr | $7/mo (1GB) | Simple apps |
| Loggly | Elasticsearch | 30 days–1yr | $79/mo (1GB/day) | Teams needing full-text search |
| Grafana Cloud Logs | Loki | 30 days free | Free 50GB/mo | Grafana users |
| Elastic Cloud | Elasticsearch | Configurable | $95+/mo | Full ELK without ops |
Log Shippers: Filebeat, Fluent Bit, and Vector
Fluent Bit (Recommended for Kubernetes)
Fluent Bit is the lightweight choice for container environments — written in C, it uses ~450KB of memory compared to Fluentd's 40MB. The Kubernetes DaemonSet deployment reads logs from /var/log/containers/*.log on every node automatically.
# Fluent Bit DaemonSet for Kubernetes → Loki
# (commas inside --set values confuse Helm, so put the output config in a values file)
helm repo add fluent https://fluent.github.io/helm-charts
cat > fluent-bit-values.yaml <<'EOF'
config:
  outputs: |
    [OUTPUT]
        Name        loki
        Match       kube.*
        Host        loki.monitoring.svc.cluster.local
        Port        3100
        Labels      job=fluentbit, env=production
        label_keys  $kubernetes['namespace_name'],$kubernetes['pod_name']
EOF
helm install fluent-bit fluent/fluent-bit \
  --namespace logging \
  --create-namespace \
  --values fluent-bit-values.yaml
# Vector: modern alternative with better performance + more outputs
vector --config /etc/vector/vector.toml
Log Retention and Cost Management
Log costs grow without bound unless you enforce retention policies. A typical mid-sized app generates 50–500GB of logs per month. Strategies to control cost:
- Filter at the shipper level. Drop DEBUG logs in production before they leave the host. A single level=debug filter in Fluent Bit can cut log volume 60–80%.
- Tiered retention. Keep hot logs (queryable, indexed) for 7–30 days. Archive to S3/GCS for 1 year at $0.023/GB — much cheaper than keeping everything hot.
- Sample verbose logs. For endpoints that log every request, sample at 10–20% during normal operation. Sample 100% when error rates are elevated (see the sketch after this list).
- Aggregate instead of log. Instead of logging every cache hit, emit a metric counter. Metrics are far cheaper than logs for high-volume, low-information events.
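The sampling strategy above can live in the logger itself. Here is a sketch as a structlog processor, continuing the earlier Python example; the 10% rate and the processor's position in the chain are illustrative choices, not requirements:
# Probabilistic sampling as a structlog processor: always keep warnings and
# errors, keep only ~10% of info/debug events
import random
import structlog

SAMPLE_RATE = 0.1  # illustrative; raise to 1.0 when error rates are elevated

def sample_verbose_logs(logger, method_name, event_dict):
    if method_name in ("debug", "info") and random.random() > SAMPLE_RATE:
        raise structlog.DropEvent  # silently discard this event
    return event_dict

structlog.configure(
    processors=[
        sample_verbose_logs,  # sample before doing any formatting work
        structlog.contextvars.merge_contextvars,
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer(),
    ],
)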
Log-Based Alerting
Once logs are centralized, you can alert on log patterns — not just metrics. Useful log alerts:
| Pattern | Alert Condition | Severity |
|---|---|---|
| Error log spike | Error logs > 100/min (vs. baseline of 5/min) | Warning |
| OOM event | Any log matching "out of memory" or "OOMKilled" | Critical |
| Auth failures | Auth failure logs > 50/min from same IP | Warning (brute force) |
| Database connection errors | Any "connection refused" to DB | Critical |
| Payment failures | event=payment.failed count > 10/min | Critical (revenue impact) |
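Most platforms let you define these rules in their UI or as Loki/Prometheus-style alerting rules. As a rough sketch of what the first row amounts to, the following Python script runs the LogQL count against Loki's instant-query API and posts to a webhook; the endpoint, webhook URL, and threshold are placeholders:
# DIY check for the "error log spike" pattern against Loki's query API (sketch)
import requests  # third-party HTTP client, assumed installed

LOKI_QUERY_URL = "http://loki:3100/loki/api/v1/query"  # placeholder
WEBHOOK_URL = "https://hooks.example.com/alerts"        # placeholder
THRESHOLD = 100  # error logs per minute, matching the table above

logql = 'sum(count_over_time({env="production"} |= "error" [1m]))'
resp = requests.get(LOKI_QUERY_URL, params={"query": logql}, timeout=10)
result = resp.json()["data"]["result"]
errors_per_min = float(result[0]["value"][1]) if result else 0.0

if errors_per_min > THRESHOLD:
    requests.post(WEBHOOK_URL, json={
        "alert": "error_log_spike",
        "severity": "warning",
        "errors_per_min": errors_per_min,
    }, timeout=10)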
FAQ
What is the best open-source log aggregation tool?
For Kubernetes teams: Grafana Loki + Fluent Bit + Grafana. It's the most cost-efficient stack, Loki integrates natively with Grafana dashboards and alerting, and Fluent Bit is the lightest log shipper available. For teams needing full-text search across all log content: ELK stack (Elasticsearch + Logstash/Beats + Kibana). Operationally heavier but more powerful for arbitrary searches.
How long should I keep logs?
Regulatory minimums vary: PCI-DSS requires 1 year, HIPAA requires 6 years, SOC 2 recommends 1 year minimum. Operationally, most teams keep 7–30 days of hot (queryable) logs and archive 1 year to cold storage (S3). Logs older than 30 days are rarely queried in practice — incidents are investigated within hours or days.
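As a back-of-the-envelope check on why cold storage wins, here is the arithmetic for an assumed 200GB/month of logs (the middle of the range above) retained for a year at the S3 price mentioned earlier:
# Rough cost of satisfying a 1-year retention requirement in cold storage
monthly_volume_gb = 200            # assumption: mid-range of 50-500 GB/month
s3_price_per_gb_month = 0.023      # S3 standard storage

archived_gb = monthly_volume_gb * 12                     # 2,400 GB after a year
steady_state_cost = archived_gb * s3_price_per_gb_month  # ~$55/month to retain
print(f"{archived_gb} GB archived, about ${steady_state_cost:.0f}/month to keep")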
Is Loki good for production?
Yes — Grafana Labs runs Loki in production at massive scale, and many large organizations use it. The key limitation is that full-text search without label filters is slow (it must scan raw compressed chunks). Design your label schema carefully before deploying at scale: don't use high-cardinality values (user IDs, request IDs) as Loki labels — use log fields instead.