Apache Kafka Monitoring Guide 2026
How to monitor Kafka in production — consumer lag, broker health, throughput metrics, and automated alerting before message backlogs cascade into incidents.
TL;DR
- Consumer lag is the #1 metric — alert when it grows over 3 consecutive intervals
- UnderReplicatedPartitions > 0 means data loss risk — page on-call immediately
- ActiveControllerCount must equal exactly 1 at all times
- Monitor all four signal types: consumer lag, broker health, producer errors, network throughput
- Use Prometheus + JMX Exporter (free) or Datadog (managed) — both have 200+ Kafka metrics
Why Kafka Monitoring Is Uniquely Challenging
Apache Kafka exposes hundreds of JMX metrics across brokers, producers, consumers, and topics. The challenge isn't collecting metrics — it's knowing which ones matter and what thresholds to alert on. A misconfigured consumer quietly falls behind while all other metrics look normal. An under-replicated partition means you're one broker crash away from data loss.
Common Kafka incidents in production and their early warning signs:
- records-lag-max grows for 3+ consecutive intervals
- UnderReplicatedPartitions > 0
- record-error-rate > 0 + request-latency-avg spike
- ActiveControllerCount != 1
- OfflinePartitionsCount > 0

Critical Kafka Metrics to Monitor
Kafka exposes hundreds of JMX metrics. These are the ones that actually matter for production reliability, organized by signal category:
Consumer Metrics (Critical)
- kafka.consumer:type=consumer-fetch-manager-metrics,records-lag-max — maximum consumer lag across partitions, the #1 metric to watch. Alert: growing over 3 intervals; Critical: exceeds SLA-based threshold.
- kafka.consumer:type=consumer-fetch-manager-metrics,records-consumed-rate — consumer throughput; a sudden drop signals a consumer health issue. Alert: drops >50% from baseline; Critical: drops to 0.
- kafka.consumer:type=consumer-fetch-manager-metrics,fetch-latency-avg — average time to fetch messages from the broker. Alert: >500ms; Critical: >2s.

Broker Metrics (Critical)
- kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions — partitions without sufficient replicas; data loss risk. Critical: > 0 for more than 30 seconds.
- kafka.controller:type=KafkaController,name=ActiveControllerCount — number of active controllers; should always be exactly 1. Critical: != 1.
- kafka.controller:type=KafkaController,name=OfflinePartitionsCount — partitions with no leader; producers and consumers cannot use these. Critical: > 0.
- kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent — fraction of time request handlers are idle (0–1). Warning: < 0.3; Critical: < 0.1.

Producer Metrics (High)
- kafka.producer:type=producer-metrics,record-error-rate — rate of failed record sends; should be 0 in a healthy system. Warning: > 0; Critical: > 0.01.
- kafka.producer:type=producer-metrics,request-latency-avg — average produce request latency. Warning: > 100ms; Critical: > 500ms.
- kafka.producer:type=producer-metrics,record-queue-time-max — max time records spend in the producer queue before send. Warning: > 50ms; Critical: > 200ms.

Topic/Partition Metrics (Medium)
- kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec — message ingestion rate; use for baseline and trend monitoring. Alert: drops >80% below baseline.
- kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec / BytesOutPerSec — network throughput; watch for saturation. Warning: >80% of network capacity; Critical: >95%.
- kafka.log:type=Log,name=LogStartOffset / LogEndOffset — track partition retention and growth. Alert: disk usage >80% on any broker.

Consumer Lag: The Most Important Kafka Metric
Consumer lag is the difference between the latest offset in a partition (the log end offset) and the consumer's current position (its committed offset). A lag of 0 means the consumer is fully caught up; growing lag means consumers can't keep pace with producers. For example, if the log end offset is 10,500 and the committed offset is 10,200, that partition's lag is 300 messages.
Check consumer lag from CLI
# List all consumer groups and their lag
kafka-consumer-groups.sh \
  --bootstrap-server localhost:9092 \
  --describe \
  --all-groups

# Check specific consumer group
kafka-consumer-groups.sh \
  --bootstrap-server localhost:9092 \
  --describe \
  --group my-consumer-group

# Output columns:
# GROUP | TOPIC | PARTITION | CURRENT-OFFSET | LOG-END-OFFSET | LAG
Prometheus query for consumer lag alerting
# Consumer lag per group/topic/partition
kafka_consumergroup_lag{group="my-group", topic="my-topic"}
# Total lag for a consumer group (sum across all partitions)
sum(kafka_consumergroup_lag{group="my-group"}) by (group, topic)
# Alert: lag has been growing for 5+ minutes
# (lag at time T > lag at time T-5m)
(
sum(kafka_consumergroup_lag{group="my-group"}) by (group)
>
sum(kafka_consumergroup_lag{group="my-group"} offset 5m) by (group)
)

⚠ Alert on Trend, Not Just Value
A lag spike during a burst of traffic is normal — Kafka is designed to buffer messages. What matters is whether lag is growing (the consumer can't keep up) or shrinking (the consumer is catching up). Use Burrow or a time-series comparison in Prometheus to detect sustained growth rather than alerting on absolute lag values, which creates false positives.
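Expressed as a Prometheus alerting rule, that trend check might look like the sketch below. It reuses the kafka_consumergroup_lag metric from the queries above; the 5m comparison window, 10m hold time, and label names are assumptions to tune for your environment.

Prometheus alert rule for sustained lag growth (sketch)
groups:
  - name: kafka-consumer-lag
    rules:
      - alert: KafkaConsumerLagGrowing
        # Fires only when total lag is higher than it was 5 minutes ago,
        # sustained for 10 minutes (a trend, not a traffic burst)
        expr: |
          sum(kafka_consumergroup_lag) by (group)
            > sum(kafka_consumergroup_lag offset 5m) by (group)
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Lag for consumer group {{ $labels.group }} has grown for 10+ minutes"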
Kafka Monitoring Tools Comparison
The right tool depends on whether you're self-hosting Kafka or using a managed service, and your team's operational capacity:
- Prometheus + JMX Exporter + Grafana (open source). Best for: teams with Prometheus expertise running self-hosted Kafka.
- Datadog (managed). Best for: teams already using Datadog for infrastructure.
- Confluent Control Center (managed, Confluent). Best for: Confluent Cloud or Confluent Platform deployments.
- Burrow (LinkedIn, open source). Best for: consumer lag monitoring specifically.
- Better Stack (managed). Best for: teams wanting fast setup with incident management.
Kafka Alerting Runbook
These are the alerts every Kafka cluster should have, ordered by severity; a Prometheus rules sketch for the top broker alerts follows the list:
KafkaUnderReplicatedPartitions
Condition: kafka_server_ReplicaManager_UnderReplicatedPartitions > 0 for 1m
Response: Check broker health immediately — a broker may be down. Identify which topics/partitions are affected. Do NOT perform maintenance until replication is healthy. This is a data loss precursor.
KafkaNoActiveController
Condition: kafka_controller_KafkaController_ActiveControllerCount != 1 for 1m
Response: ZooKeeper/KRaft issue or network partition. Check ZooKeeper/KRaft ensemble health. Review broker logs for controller election failures. May resolve automatically within seconds; if >2 min, escalate.
KafkaOfflinePartitions
Condition: kafka_controller_KafkaController_OfflinePartitionsCount > 0 for 30s
Response: Partitions with no leader — producers/consumers fail for affected topics. Identify offline partitions, restart dead brokers, or trigger leader election with kafka-leader-election.sh.
KafkaConsumerLagGrowing
Condition: consumer lag has grown for 5+ consecutive minutes (trend, not absolute)
Response: Check consumer health — are consumers processing messages? Look for slow consumers, blocking calls, or schema changes. Confirm producers haven't spiked. Scale consumer group if needed.
KafkaBrokerCPUSaturation
Condition: RequestHandlerAvgIdlePercent < 0.20 for 5m
Response: Broker is CPU-saturated. Review recent traffic spikes. Check for consumer group with excessive fetch requests. Consider adding partitions or brokers if sustained.
KafkaDiskHigh
Condition: broker disk usage > 80%
Response: Check retention settings — may need to reduce retention.ms or retention.bytes. Identify high-volume topics. Add capacity or increase log compaction. Critical if >90%.
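A minimal Prometheus rules sketch for the three page-severity broker conditions above, assuming the metric names produced by the JMX Exporter config shown in the next section (names and hold durations are illustrative; adapt them to your exporter's output):

Prometheus alerting rules for broker health (sketch)
groups:
  - name: kafka-broker-alerts
    rules:
      - alert: KafkaUnderReplicatedPartitions
        # Any partition below its replication factor is a data loss precursor
        expr: sum(kafka_server_ReplicaManager_UnderReplicatedPartitions) > 0
        for: 1m
        labels:
          severity: critical
      - alert: KafkaNoActiveController
        # Summed across brokers: exactly one broker should report 1
        expr: sum(kafka_controller_KafkaController_ActiveControllerCount) != 1
        for: 1m
        labels:
          severity: critical
      - alert: KafkaOfflinePartitions
        # Leaderless partitions are unavailable to producers and consumers
        expr: sum(kafka_controller_KafkaController_OfflinePartitionsCount) > 0
        for: 30s
        labels:
          severity: critical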
Enabling JMX for Kafka Monitoring
Kafka exposes all metrics via JMX. You need JMX enabled on your brokers and a compatible exporter to get data into your monitoring stack:
Enable JMX on Kafka brokers
# In kafka-server-start.sh or as an environment variable
export JMX_PORT=9999

# For remote JMX access (production — add auth):
export KAFKA_JMX_OPTS="-Dcom.sun.jndi.rmiregistry.interfaces=127.0.0.1 \
  -Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcom.sun.management.jmxremote.ssl=false \
  -Dcom.sun.management.jmxremote.port=9999 \
  -Dcom.sun.management.jmxremote.rmi.port=9999 \
  -Djava.rmi.server.hostname=<broker-hostname>"
JMX Exporter config for Prometheus (kafka-jmx-exporter.yml)
lowercaseOutputName: true
rules:
  # Under-replicated partitions
  - pattern: 'kafka.server<type=ReplicaManager, name=UnderReplicatedPartitions><>Value'
    name: kafka_server_ReplicaManager_UnderReplicatedPartitions
    type: GAUGE
  # Active controller
  - pattern: 'kafka.controller<type=KafkaController, name=ActiveControllerCount><>Value'
    name: kafka_controller_KafkaController_ActiveControllerCount
    type: GAUGE
  # Offline partitions
  - pattern: 'kafka.controller<type=KafkaController, name=OfflinePartitionsCount><>Value'
    name: kafka_controller_KafkaController_OfflinePartitionsCount
    type: GAUGE
  # Request handler idle ratio
  - pattern: 'kafka.server<type=KafkaRequestHandlerPool, name=RequestHandlerAvgIdlePercent><>OneMinuteRate'
    name: kafka_server_KafkaRequestHandlerPool_RequestHandlerAvgIdlePercent
    type: GAUGE
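With the exporter attached to each broker JVM (for example as a javaagent loading the config above), Prometheus scrapes it like any other target. A minimal scrape config sketch; the job name, broker hostnames, and port 7071 are assumptions, so match them to your deployment:

Prometheus scrape config for Kafka brokers (prometheus.yml excerpt)
scrape_configs:
  - job_name: kafka-brokers
    scrape_interval: 15s
    static_configs:
      - targets:
          # One target per broker running the JMX Exporter javaagent
          - kafka-broker-1:7071
          - kafka-broker-2:7071
          - kafka-broker-3:7071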
Related Guides
Distributed Tracing Guide
Trace requests across microservices and event-driven systems.
Database Monitoring Guide
Monitor PostgreSQL, MySQL, Redis, and other data stores.
Infrastructure Monitoring Guide
Monitor servers, containers, and cloud infrastructure.
OpenTelemetry Guide
Instrument Kafka producers and consumers with OTel.
Alert Fatigue Guide
Build alert policies that page on real issues, not noise.
Prometheus vs Datadog
Choose the right monitoring stack for Kafka.
Frequently Asked Questions
What is the most important Kafka metric to monitor?
Consumer lag (kafka.consumer:type=consumer-fetch-manager-metrics,records-lag-max) is the single most important Kafka metric. It measures how far behind a consumer group is from the latest offset. A growing lag means consumers can't keep up with production — this is the earliest warning sign of a backlog before it becomes an incident. Alert when consumer lag exceeds your acceptable processing delay (typically 1,000–10,000 messages depending on throughput).
How do I monitor Kafka consumer lag?
Monitor consumer lag with: (1) kafka-consumer-groups.sh --describe --all-groups to check all consumer groups manually, (2) Prometheus kafka_consumergroup_lag metric via JMX exporter, (3) Burrow (LinkedIn's open-source lag monitor) for trend-based alerting, (4) Confluent Control Center for managed Kafka. Alert when lag grows over 3 consecutive intervals rather than on absolute value — a spike is normal, persistent growth is not.
What Kafka broker metrics should I alert on?
Critical broker metrics to alert on: (1) UnderReplicatedPartitions > 0 — partitions without enough replicas, data loss risk, (2) ActiveControllerCount != 1 — split brain or no controller, (3) OfflinePartitionsCount > 0 — partitions with no leader, unavailable, (4) RequestHandlerAvgIdlePercent < 20% — broker CPU saturation, (5) NetworkProcessorAvgIdlePercent < 30% — network thread saturation. Any of these should page on-call immediately.
What tools can I use to monitor Kafka?
Top Kafka monitoring tools: (1) Prometheus + Kafka JMX Exporter + Grafana — open source, full visibility, requires setup, (2) Datadog Kafka integration — managed, 200+ pre-built metrics, auto-discovery, (3) Confluent Control Center — best for Confluent Cloud, schema registry integration, (4) Burrow — LinkedIn's open-source consumer lag monitor, trend-based alerting, (5) Conduktor Platform — UI-heavy, good for team visibility, (6) Better Stack — infrastructure monitoring with Kafka metric collection, (7) Kafka UI — open source web UI for cluster management and basic monitoring.
What is acceptable Kafka consumer lag?
Acceptable Kafka consumer lag depends on your use case. For real-time systems (payment processing, fraud detection): alert at >100 messages lag, critical at >1,000. For near-real-time (analytics pipelines): alert at >10,000, critical at >100,000. For batch processing: alert when lag hasn't decreased in 30 minutes. The right threshold is lag that represents more delay than your SLA allows. Calculate: lag_messages / consumer_throughput_per_second = processing_delay_in_seconds.
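That conversion can be automated as a Prometheus recording rule. A sketch, assuming the kafka_consumergroup_lag metric used earlier in this guide; kafka_consumer_records_consumed_rate is a placeholder name, so substitute whatever your exporter publishes for records-consumed-rate:

Prometheus recording rule for lag in seconds (sketch)
groups:
  - name: kafka-lag-seconds
    rules:
      # processing delay in seconds = outstanding messages / consumption rate
      # kafka_consumer_records_consumed_rate is a placeholder metric name
      - record: kafka_consumergroup_lag_seconds
        expr: |
          sum(kafka_consumergroup_lag) by (group, topic)
            / sum(kafka_consumer_records_consumed_rate) by (group, topic)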