Logs are the raw material of incident investigation. But logs scattered across dozens of servers, containers, and cloud services are nearly useless when something breaks at 2 AM. Log aggregation centralizes everything into a single searchable system — so you spend minutes debugging instead of hours tracking down where to look.
This guide covers the full log aggregation landscape: structured logging patterns, the ELK stack, Grafana Loki, managed services, and how to build alerting on top of your log data.
Why Centralized Logging Matters
Without log aggregation, incident investigation looks like this: SSH to the web server, grep the application log. SSH to the database, grep its log. SSH to the load balancer, grep its access log. Cross-reference timestamps manually. Miss the Kubernetes pod that died and took its logs with it.
With centralized logging: open one dashboard, filter by time window and service, see the full sequence of events across every component in milliseconds. MTTR drops dramatically.
Structured Logging: The Foundation
Before choosing a log aggregation tool, fix your logging format. Unstructured logs — plain text strings — are hard to query and nearly impossible to alert on reliably. Structured logs emit every field as a named key-value pair.
| Format | Example | Queryability |
|---|---|---|
| Unstructured | 2026-04-30 10:15:00 ERROR User 123 checkout failed: payment timeout | grep only |
| Structured (JSON) | {"ts":"2026-04-30T10:15:00Z","level":"error","event":"checkout.failed","user_id":123,"reason":"payment_timeout","duration_ms":5012} | field-level queries |
Structured Logging Examples
// Node.js: pino (fast JSON logger)
import pino from 'pino';
const log = pino({
base: { service: 'checkout-api', env: process.env.NODE_ENV },
redact: ['req.headers.authorization', 'body.card_number'],
});
// Structured event log
log.info({
event: 'order.created',
order_id: order.id,
user_id: req.user.id,
amount_cents: order.total,
duration_ms: Date.now() - startTime,
}, 'Order created successfully');
// Error with context
log.error({
event: 'payment.failed',
order_id: order.id,
provider: 'stripe',
error_code: err.code,
duration_ms: Date.now() - startTime,
err,
}, 'Payment processing failed');
# Python: structlog
import structlog
log = structlog.get_logger()
# Configure structlog for JSON output
structlog.configure(
processors=[
structlog.contextvars.merge_contextvars,
structlog.processors.add_log_level,
structlog.processors.TimeStamper(fmt="iso"),
structlog.processors.JSONRenderer(),
],
)
log.info("order.created", order_id=order.id, user_id=request.user.id, amount=order.total)
log.error("payment.failed", order_id=order.id, provider="stripe", error=str(e))
Log Aggregation Architecture
Every log aggregation system has the same three layers:
- Shipper/collector — runs on each server/pod and ships logs to the aggregator (Filebeat, Fluent Bit, Fluentd, Vector)
- Aggregator/processor — receives, parses, transforms, and routes logs (Logstash, Loki, Elasticsearch, Kafka)
- Storage and query — stores logs and provides search UI (Elasticsearch/Kibana, Loki/Grafana, Datadog)
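To make the three layers concrete, here is a minimal Python sketch of the shipper layer pushing to Loki's HTTP push API. The endpoint, file path, and labels are placeholders, and a real shipper (Fluent Bit, Vector) also handles rotation, retries, and backpressure:
# Minimal "shipper" sketch: tail a log file and push new lines to a Loki aggregator
import time
import requests  # third-party HTTP client, assumed installed

LOKI_PUSH_URL = "http://loki:3100/loki/api/v1/push"  # placeholder aggregator endpoint
LOG_PATH = "/var/log/app.log"                        # placeholder log file
LABELS = {"service": "checkout-api", "env": "production"}  # keep labels low-cardinality

def push_lines(lines):
    # Loki's push API expects nanosecond timestamps as strings
    now_ns = str(time.time_ns())
    payload = {"streams": [{"stream": LABELS,
                            "values": [[now_ns, line] for line in lines]}]}
    requests.post(LOKI_PUSH_URL, json=payload, timeout=5).raise_for_status()

with open(LOG_PATH) as f:
    f.seek(0, 2)  # start at the end of the file, like `tail -f`
    batch = []
    while True:
        line = f.readline()
        if line:
            batch.append(line.rstrip("\n"))
        if batch and (len(batch) >= 100 or not line):
            push_lines(batch)
            batch = []
        if not line:
            time.sleep(1)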
ELK Stack vs. Grafana Loki vs. Managed Services
ELK Stack (Elasticsearch + Logstash + Kibana)
The ELK stack is the most capable open-source log aggregation solution. Elasticsearch indexes every token of every log line, enabling full-text search across billions of events. Kibana provides powerful visualizations, dashboards, and ML-based anomaly detection.
Trade-offs:
- ✅ Full-text search on all log content
- ✅ Massive ecosystem, extensive integrations
- ✅ Elastic Cloud managed option
- ❌ High resource consumption (JVM heap, disk I/O)
- ❌ Operationally complex at scale
- ❌ Expensive at high ingest volumes ($95–500+/month managed)
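To make the full-text search point concrete, here is roughly what a query against Elasticsearch's _search API looks like from Python. The index name, host, and field names are assumptions for illustration:
# Full-text search across log messages via Elasticsearch's _search API (sketch)
import requests  # third-party HTTP client, assumed installed

query = {
    "query": {
        "bool": {
            "must": [{"match": {"message": "payment timeout"}}],       # full-text match
            "filter": [{"range": {"@timestamp": {"gte": "now-1h"}}}],  # last hour only
        }
    },
    "size": 20,
    "sort": [{"@timestamp": {"order": "desc"}}],
}
resp = requests.post("http://localhost:9200/logs-production/_search", json=query, timeout=10)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_source"].get("@timestamp"), hit["_source"].get("message"))
Loki cannot run a match clause like this without first narrowing by labels, which is the core trade-off discussed next.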
Grafana Loki
Loki takes a different approach: it only indexes metadata labels (like Prometheus), not log content. This makes it 10–100x cheaper to operate than Elasticsearch for the same log volume, at the cost of query flexibility. You can't do arbitrary full-text search — you filter by labels (service, environment, pod) and then grep within results.
This is the right trade-off for most teams — you almost always know which service is producing the logs you care about.
# LogQL: Loki's query language (Prometheus-like)
# Filter logs from checkout service with errors
{service="checkout", env="production"} |= "error"
# Error rate per second, by service, over a 5-minute window
sum by (service) (
rate({env="production"} |= `"level":"error"` [5m])
)
# Parse JSON and filter by field
{service="api"} | json | user_id="12345"
# Alert: fire when error logs exceed 10 per minute
sum(count_over_time({env="production"} |= "error" [1m])) > 10
Managed Log Services
| Service | Storage Engine | Retention | Pricing | Best For |
|---|---|---|---|---|
| Datadog Logs | Proprietary | 15 days–1yr | $0.10/GB + $2.04/M events | Datadog users |
| Better Stack Logs | ClickHouse + S3 | 3 days–1yr | $25/mo (50GB) | Cost-efficient teams |
| Papertrail | Proprietary | 1 week–1yr | $7/mo (1GB) | Simple apps |
| Loggly | Elasticsearch | 30 days–1yr | $79/mo (1GB/day) | Teams needing full-text search |
| Grafana Cloud Logs | Loki | 30 days free | Free 50GB/mo | Grafana users |
| Elastic Cloud | Elasticsearch | Configurable | $95+/mo | Full ELK without ops |
Log Shippers: Filebeat, Fluent Bit, and Vector
Fluent Bit (Recommended for Kubernetes)
Fluent Bit is the lightweight choice for container environments — written in C, it uses ~450KB of memory compared to Fluentd's 40MB. The Kubernetes DaemonSet deployment reads logs from /var/log/containers/*.log on every node automatically.
# Fluent Bit DaemonSet for Kubernetes → Loki
# (commas inside --set values confuse Helm, so put the output config in a values file)
helm repo add fluent https://fluent.github.io/helm-charts
cat > fluent-bit-values.yaml <<'EOF'
config:
  outputs: |
    [OUTPUT]
        Name        loki
        Match       kube.*
        Host        loki.monitoring.svc.cluster.local
        Port        3100
        Labels      job=fluentbit, env=production
        label_keys  $kubernetes['namespace_name'],$kubernetes['pod_name']
EOF
helm install fluent-bit fluent/fluent-bit \
  --namespace logging \
  --create-namespace \
  --values fluent-bit-values.yaml
# Vector: modern alternative with better performance + more outputs
vector --config /etc/vector/vector.toml
Log Retention and Cost Management
Log costs grow without bound unless you enforce retention policies. A typical mid-sized app generates 50–500GB of logs per month. Strategies to control cost:
- Filter at the shipper level. Drop DEBUG logs in production before they leave the host. A single level=debug filter in Fluent Bit can cut log volume 60–80%.
- Tiered retention. Keep hot logs (queryable, indexed) for 7–30 days. Archive to S3/GCS for 1 year at $0.023/GB — much cheaper than keeping everything hot.
- Sample verbose logs. For endpoints that log every request, sample at 10–20% during normal operation. Sample 100% when error rates are elevated (see the sketch after this list).
- Aggregate instead of log. Instead of logging every cache hit, emit a metric counter. Metrics are far cheaper than logs for high-volume, low-information events.
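The sampling strategy above can live in the logger itself. Here is a sketch as a structlog processor, continuing the earlier Python example; the 10% rate and the processor's position in the chain are illustrative choices, not requirements:
# Probabilistic sampling as a structlog processor: always keep warnings and
# errors, keep only ~10% of info/debug events
import random
import structlog

SAMPLE_RATE = 0.1  # illustrative; raise to 1.0 when error rates are elevated

def sample_verbose_logs(logger, method_name, event_dict):
    if method_name in ("debug", "info") and random.random() > SAMPLE_RATE:
        raise structlog.DropEvent  # silently discard this event
    return event_dict

structlog.configure(
    processors=[
        sample_verbose_logs,  # sample before doing any formatting work
        structlog.contextvars.merge_contextvars,
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer(),
    ],
)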
Log-Based Alerting
Once logs are centralized, you can alert on log patterns — not just metrics. Useful log alerts:
| Pattern | Alert Condition | Severity |
|---|---|---|
| Error log spike | Error logs > 100/min (vs. baseline of 5/min) | Warning |
| OOM event | Any log matching "out of memory" or "OOMKilled" | Critical |
| Auth failures | Auth failure logs > 50/min from same IP | Warning (brute force) |
| Database connection errors | Any "connection refused" to DB | Critical |
| Payment failures | event=payment.failed count > 10/min | Critical (revenue impact) |
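Most platforms let you define these rules in their UI or as Loki/Prometheus-style alerting rules. As a rough sketch of what the first row amounts to, the following Python script runs the LogQL count against Loki's instant-query API and posts to a webhook; the endpoint, webhook URL, and threshold are placeholders:
# DIY check for the "error log spike" pattern against Loki's query API (sketch)
import requests  # third-party HTTP client, assumed installed

LOKI_QUERY_URL = "http://loki:3100/loki/api/v1/query"  # placeholder
WEBHOOK_URL = "https://hooks.example.com/alerts"        # placeholder
THRESHOLD = 100  # error logs per minute, matching the table above

logql = 'sum(count_over_time({env="production"} |= "error" [1m]))'
resp = requests.get(LOKI_QUERY_URL, params={"query": logql}, timeout=10)
result = resp.json()["data"]["result"]
errors_per_min = float(result[0]["value"][1]) if result else 0.0

if errors_per_min > THRESHOLD:
    requests.post(WEBHOOK_URL, json={
        "alert": "error_log_spike",
        "severity": "warning",
        "errors_per_min": errors_per_min,
    }, timeout=10)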
FAQ
What is the best open-source log aggregation tool?
For Kubernetes teams: Grafana Loki + Fluent Bit + Grafana. It's the most cost-efficient stack, Loki integrates natively with Grafana dashboards and alerting, and Fluent Bit is the lightest log shipper available. For teams needing full-text search across all log content: ELK stack (Elasticsearch + Logstash/Beats + Kibana). Operationally heavier but more powerful for arbitrary searches.
How long should I keep logs?
Regulatory minimums vary: PCI-DSS requires 1 year, HIPAA requires 6 years, SOC 2 recommends 1 year minimum. Operationally, most teams keep 7–30 days of hot (queryable) logs and archive 1 year to cold storage (S3). Logs older than 30 days are rarely queried in practice — incidents are investigated within hours or days.
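As a back-of-the-envelope check on why cold storage wins, here is the arithmetic for an assumed 200GB/month of logs (the middle of the range above) retained for a year at the S3 price mentioned earlier:
# Rough cost of satisfying a 1-year retention requirement in cold storage
monthly_volume_gb = 200            # assumption: mid-range of 50-500 GB/month
s3_price_per_gb_month = 0.023      # S3 standard storage

archived_gb = monthly_volume_gb * 12                     # 2,400 GB after a year
steady_state_cost = archived_gb * s3_price_per_gb_month  # ~$55/month to retain
print(f"{archived_gb} GB archived, about ${steady_state_cost:.0f}/month to keep")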
Is Loki good for production?
Yes — Grafana Labs runs Loki in production at massive scale, and many large organizations use it. The key limitation is that full-text search without label filters is slow (it must scan raw compressed chunks). Design your label schema carefully before deploying at scale: don't use high-cardinality values (user IDs, request IDs) as Loki labels — use log fields instead.