Infrastructure Monitoring · Updated May 2026

Apache Kafka Monitoring Guide 2026

How to monitor Kafka in production — consumer lag, broker health, throughput metrics, and automated alerting before message backlogs cascade into incidents.

TL;DR

  • Consumer lag is the #1 metric — alert when it grows over 3 consecutive intervals
  • UnderReplicatedPartitions > 0 means data loss risk — page on-call immediately
  • ActiveControllerCount must equal exactly 1 at all times
  • Monitor all four signal types: consumer lag, broker health, producer errors, network throughput
  • Use Prometheus + JMX Exporter (free) or Datadog (managed) — both have 200+ Kafka metrics

Why Kafka Monitoring Is Uniquely Challenging

Apache Kafka exposes hundreds of JMX metrics across brokers, producers, consumers, and topics. The challenge isn't collecting metrics — it's knowing which ones matter and what thresholds to alert on. A misconfigured consumer quietly falls behind while all other metrics look normal. An under-replicated partition means you're one broker crash away from data loss.

Common Kafka incidents in production and their early warning signs:

  • Consumer lag spike
    Early warning: records-lag-max grows for 3+ consecutive intervals
    Common cause: consumer slowdown, schema change, or dependency bottleneck

  • Broker failure
    Early warning: UnderReplicatedPartitions > 0
    Common cause: disk full, OOM, network partition, or GC pause

  • Producer timeout storm
    Early warning: record-error-rate > 0 plus a request-latency-avg spike
    Common cause: broker overload, network saturation, or wrong producer config

  • No controller elected
    Early warning: ActiveControllerCount != 1
    Common cause: ZooKeeper/KRaft issue, network partition, or broker restart race

  • Topic partition offline
    Early warning: OfflinePartitionsCount > 0
    Common cause: all replicas offline, or an under-min-ISR condition

Critical Kafka Metrics to Monitor

Kafka exposes hundreds of JMX metrics. These are the ones that actually matter for production reliability, organized by signal category:

Consumer Metrics (Critical)

records-lag-max
JMX: kafka.consumer:type=consumer-fetch-manager-metrics,records-lag-max
Maximum consumer lag across partitions — the #1 metric to watch
Alert: growing over 3 consecutive intervals; Critical: exceeds your SLA-based threshold

records-consumed-rate
JMX: kafka.consumer:type=consumer-fetch-manager-metrics,records-consumed-rate
Consumer throughput — a sudden drop signals a consumer health issue
Alert: drops >50% from baseline; Critical: drops to 0

fetch-latency-avg
JMX: kafka.consumer:type=consumer-fetch-manager-metrics,fetch-latency-avg
Average time to fetch messages from the broker
Alert: >500ms; Critical: >2s

Broker Metrics (Critical)

UnderReplicatedPartitions
JMX: kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions
Partitions without sufficient replicas — data loss risk
Critical: > 0 for >30 seconds

ActiveControllerCount
JMX: kafka.controller:type=KafkaController,name=ActiveControllerCount
Number of active controllers — should always be exactly 1
Critical: != 1

OfflinePartitionsCount
JMX: kafka.controller:type=KafkaController,name=OfflinePartitionsCount
Partitions with no leader — producers/consumers cannot use these
Critical: > 0

RequestHandlerAvgIdlePercent
JMX: kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent
Fraction of time request handlers are idle (0–1)
Warning: < 0.3; Critical: < 0.1

Producer Metrics (High)

record-error-rate
JMX: kafka.producer:type=producer-metrics,record-error-rate
Rate of failed record sends — should be 0 in a healthy system
Warning: > 0; Critical: > 0.01

request-latency-avg
JMX: kafka.producer:type=producer-metrics,request-latency-avg
Average produce request latency
Warning: >100ms; Critical: >500ms

record-queue-time-max
JMX: kafka.producer:type=producer-metrics,record-queue-time-max
Maximum time records spend in the producer queue before being sent
Warning: >50ms; Critical: >200ms

Topic/Partition Metrics (Medium)

MessagesInPerSec
JMX: kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec
Message ingestion rate — baseline and trend monitoring
Alert: drops >80% below baseline

BytesInPerSec / BytesOutPerSec
JMX: kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec and BytesOutPerSec
Network throughput — watch for saturation
Warning: >80% of network capacity; Critical: >95%

LogStartOffset / LogEndOffset
JMX: kafka.log:type=Log,name=LogStartOffset and LogEndOffset
Track partition retention and growth
Alert: disk usage >80% on any broker
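
The disk and network thresholds in the last two rows come from host-level metrics rather than Kafka JMX. As a minimal sketch, assuming node_exporter runs on every broker; the mountpoint and network device names below are placeholders for your environment:

# Broker disk usage above 80% on the Kafka data volume
(1 - node_filesystem_avail_bytes{mountpoint="/var/lib/kafka"}
   / node_filesystem_size_bytes{mountpoint="/var/lib/kafka"}) > 0.80

# Per-broker network transmit rate (bytes/sec over 5 minutes), to compare against link capacity
rate(node_network_transmit_bytes_total{device="eth0"}[5m])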

Consumer Lag: The Most Important Kafka Metric

Consumer lag is the difference between the latest offset (log end offset) and the consumer's current position (committed offset) for each partition. A lag of 0 means the consumer is fully caught up. Growing lag means consumers can't keep pace with producers.

Check consumer lag from CLI

# List all consumer groups and their lag
kafka-consumer-groups.sh \
  --bootstrap-server localhost:9092 \
  --describe \
  --all-groups

# Check specific consumer group
kafka-consumer-groups.sh \
  --bootstrap-server localhost:9092 \
  --describe \
  --group my-consumer-group

# Output columns:
# GROUP | TOPIC | PARTITION | CURRENT-OFFSET | LOG-END-OFFSET | LAG

Prometheus query for consumer lag alerting

# Consumer lag per group/topic/partition
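# (metric and label names here assume kafka_exporter or a similar lag exporter; adjust to yours)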
kafka_consumergroup_lag{group="my-group", topic="my-topic"}

# Total lag for a consumer group (sum across all partitions)
sum(kafka_consumergroup_lag{group="my-group"}) by (group, topic)

# Alert: lag has been growing for 5+ minutes
# (lag at time T > lag at time T-5m)
(
  sum(kafka_consumergroup_lag{group="my-group"}) by (group)
  >
  sum(kafka_consumergroup_lag{group="my-group"} offset 5m) by (group)
)

⚠ Alert on Trend, Not Just Value

A lag spike during a burst of traffic is normal — Kafka is designed to buffer messages. What matters is whether lag is growing (the consumer can't keep up) or shrinking (the consumer is catching up). Use Burrow or a time-series comparison in Prometheus to detect sustained growth rather than alerting on absolute lag values, which creates false positives.
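
As a sketch, this trend check can be turned into a Prometheus alerting rule. It assumes the kafka_consumergroup_lag metric and group/topic labels from the query above (names vary by exporter), and the 5-minute offset and 15-minute hold are illustrative values to tune for your traffic:

groups:
  - name: kafka-consumer-lag
    rules:
      - alert: KafkaConsumerLagGrowing
        # Fires when lag is higher than it was 5 minutes ago, sustained for 15 minutes
        expr: |
          sum(kafka_consumergroup_lag) by (group, topic)
            >
          sum(kafka_consumergroup_lag offset 5m) by (group, topic)
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Consumer lag for {{ $labels.group }}/{{ $labels.topic }} has been growing for 15+ minutes"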

Kafka Monitoring Tools Comparison

The right tool depends on whether you're self-hosting Kafka or using a managed service, and your team's operational capacity:

Prometheus + JMX Exporter + Grafana

Open Source

Best for: Teams with Prometheus expertise; self-hosted Kafka

Pros: Free, highly customizable, 200+ dashboard templates, community-supported
Cons: Requires setup and maintenance; no built-in alerting UX
Setup: Install JMX exporter as Kafka agent → scrape with Prometheus → dashboard in Grafana

Datadog

Managed

Best for: Teams already using Datadog for infrastructure

Pros: Auto-discovery, 200+ pre-built Kafka metrics, correlated with infra metrics
Cons: $15–25/host/month; expensive at scale
Setup: Install Datadog agent → enable Kafka integration → import dashboards

Confluent Control Center

Managed (Confluent)

Best for: Confluent Cloud or Confluent Platform deployments

Pros: Native integration, schema registry visibility, consumer group management
Cons: Requires Confluent Cloud or an expensive Confluent Platform license
Setup: Built into Confluent Platform; enable in Confluent Cloud dashboard

Burrow (LinkedIn)

Open Source

Best for: Consumer lag monitoring specifically

Pros: Trend-based lag evaluation (not just absolute values), HTTP API
Cons: Lag monitoring only; not a full observability solution
Setup: Deploy Burrow service → configure Kafka cluster → query HTTP API or integrate with alerting
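
Once Burrow is running, its HTTP API can be polled directly or from an alerting script. A minimal check, assuming Burrow's default HTTP port 8000 and a cluster named local in its configuration (both placeholders):

# Full lag evaluation for one consumer group (status plus per-partition detail)
curl -s http://localhost:8000/v3/kafka/local/consumer/my-consumer-group/lag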

Better Stack

Managed

Best for: Teams wanting fast setup with incident management

Pros: Easy setup, incident timelines, on-call scheduling, status pages
Cons: Less Kafka-specific than Datadog
Setup: Connect infrastructure monitoring → create Kafka-specific uptime checks → configure alerts

Kafka Alerting Runbook

These are the alerts every Kafka cluster should have, ordered by severity:

CRITICAL: KafkaUnderReplicatedPartitions

Condition: kafka_server_ReplicaManager_UnderReplicatedPartitions > 0 for 1m

Response: Check broker health immediately — a broker may be down. Identify which topics/partitions are affected. Do NOT perform maintenance until replication is healthy. This is a data loss precursor.

CRITICAL: KafkaNoActiveController

Condition: kafka_controller_KafkaController_ActiveControllerCount != 1 for 1m

Response: ZooKeeper/KRaft issue or network partition. Check ZooKeeper/KRaft ensemble health. Review broker logs for controller election failures. May resolve automatically within seconds; if >2 min, escalate.

CRITICAL: KafkaOfflinePartitions

Condition: kafka_controller_KafkaController_OfflinePartitionsCount > 0 for 30s

Response: Partitions with no leader — producers/consumers fail for affected topics. Identify offline partitions, restart dead brokers, or trigger leader election with kafka-leader-election.sh.

WARNING: KafkaConsumerLagGrowing

Condition: Consumer lag has grown for 5+ consecutive minutes (trend, not absolute)

Response: Check consumer health — are consumers processing messages? Look for slow consumers, blocking calls, or schema changes. Confirm producers haven't spiked. Scale consumer group if needed.

WARNING: KafkaBrokerCPUSaturation

Condition: RequestHandlerAvgIdlePercent < 0.20 for 5m

Response: Broker is CPU-saturated. Review recent traffic spikes. Check for consumer group with excessive fetch requests. Consider adding partitions or brokers if sustained.

WARNING: KafkaDiskHigh

Condition: Broker disk usage > 80%

Response: Check retention settings — may need to reduce retention.ms or retention.bytes. Identify high-volume topics. Add capacity or increase log compaction. Critical if >90%.
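
If you collect these metrics with the Prometheus + JMX Exporter setup described in the next section, the three critical alerts above can be expressed as alerting rules roughly like this (a sketch; metric names match the exporter rules shown below, and thresholds and durations mirror the runbook):

groups:
  - name: kafka-broker-critical
    rules:
      - alert: KafkaUnderReplicatedPartitions
        expr: kafka_server_ReplicaManager_UnderReplicatedPartitions > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "{{ $value }} under-replicated partitions on {{ $labels.instance }}"
      - alert: KafkaNoActiveController
        # Each broker reports 0 or 1; the sum across the cluster should be exactly 1
        expr: sum(kafka_controller_KafkaController_ActiveControllerCount) != 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Cluster does not have exactly one active controller"
      - alert: KafkaOfflinePartitions
        expr: kafka_controller_KafkaController_OfflinePartitionsCount > 0
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "{{ $value }} partitions are offline (no leader)"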

Enabling JMX for Kafka Monitoring

Kafka exposes all metrics via JMX. You need JMX enabled on your brokers and a compatible exporter to get data into your monitoring stack:

Enable JMX on Kafka brokers

# In kafka-server-start.sh or as environment variable
export JMX_PORT=9999

# For remote JMX access (production — add auth):
export KAFKA_JMX_OPTS="-Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.local.only=false \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcom.sun.management.jmxremote.ssl=false \
  -Dcom.sun.management.jmxremote.port=9999 \
  -Dcom.sun.management.jmxremote.rmi.port=9999 \
  -Djava.rmi.server.hostname=<broker-hostname>"

JMX Exporter config for Prometheus (kafka-jmx-exporter.yml)

lowercaseOutputName: false  # keep mixed-case metric names so they match the alert conditions above
rules:
  # Under-replicated partitions
  - pattern: 'kafka.server<type=ReplicaManager, name=UnderReplicatedPartitions><>Value'
    name: kafka_server_ReplicaManager_UnderReplicatedPartitions
    type: GAUGE
  # Active controller
  - pattern: 'kafka.controller<type=KafkaController, name=ActiveControllerCount><>Value'
    name: kafka_controller_KafkaController_ActiveControllerCount
    type: GAUGE
  # Offline partitions
  - pattern: 'kafka.controller<type=KafkaController, name=OfflinePartitionsCount><>Value'
    name: kafka_controller_KafkaController_OfflinePartitionsCount
    type: GAUGE
  # Request handler idle ratio
  - pattern: 'kafka.server<type=KafkaRequestHandlerPool, name=RequestHandlerAvgIdlePercent><>OneMinuteRate'
    name: kafka_server_KafkaRequestHandlerPool_RequestHandlerAvgIdlePercent
    type: GAUGE
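
To wire the exporter into Prometheus, it is typically attached to each broker as a Java agent and scraped over HTTP. A minimal sketch, where the jar path, port 7071, and broker hostnames are placeholders to adapt:

# Attach the JMX exporter agent before starting the broker
export KAFKA_OPTS="-javaagent:/opt/jmx-exporter/jmx_prometheus_javaagent.jar=7071:/opt/jmx-exporter/kafka-jmx-exporter.yml"
bin/kafka-server-start.sh config/server.properties

# prometheus.yml: scrape each broker's exporter endpoint
scrape_configs:
  - job_name: kafka
    static_configs:
      - targets: ['broker-1:7071', 'broker-2:7071', 'broker-3:7071']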

Frequently Asked Questions

What is the most important Kafka metric to monitor?

Consumer lag (kafka.consumer.fetch-manager-metrics:records-lag-max) is the single most important Kafka metric. It measures how far behind a consumer group is from the latest offset. A growing lag means consumers can't keep up with production — this is the earliest warning sign of a backlog before it becomes an incident. Alert when consumer lag exceeds your acceptable processing delay (typically 1,000–10,000 messages depending on throughput).

How do I monitor Kafka consumer lag?

Monitor consumer lag with: (1) kafka-consumer-groups.sh --describe --all-groups to check all consumer groups manually, (2) the kafka_consumergroup_lag metric in Prometheus, exposed by kafka_exporter or a similar lag exporter, (3) Burrow (LinkedIn's open-source lag monitor) for trend-based alerting, (4) Confluent Control Center for managed Kafka. Alert when lag grows over 3 consecutive intervals rather than on absolute value — a spike is normal, persistent growth is not.

What Kafka broker metrics should I alert on?

Critical broker metrics to alert on: (1) UnderReplicatedPartitions > 0 — partitions without enough replicas, data loss risk, (2) ActiveControllerCount != 1 — split brain or no controller, (3) OfflinePartitionsCount > 0 — partitions with no leader, unavailable, (4) RequestHandlerAvgIdlePercent < 20% — broker CPU saturation, (5) NetworkProcessorAvgIdlePercent < 30% — network thread saturation. Any of these should page on-call immediately.

What tools can I use to monitor Kafka?

Top Kafka monitoring tools: (1) Prometheus + Kafka JMX Exporter + Grafana — open source, full visibility, requires setup, (2) Datadog Kafka integration — managed, 200+ pre-built metrics, auto-discovery, (3) Confluent Control Center — best for Confluent Cloud, schema registry integration, (4) Burrow — LinkedIn's open-source consumer lag monitor, trend-based alerting, (5) Conduktor Platform — UI-heavy, good for team visibility, (6) Better Stack — infrastructure monitoring with Kafka metric collection, (7) Kafka UI — open source web UI for cluster management and basic monitoring.

What is acceptable Kafka consumer lag?

Acceptable Kafka consumer lag depends on your use case. For real-time systems (payment processing, fraud detection): alert at >100 messages lag, critical at >1,000. For near-real-time (analytics pipelines): alert at >10,000, critical at >100,000. For batch processing: alert when lag hasn't decreased in 30 minutes. The right threshold is lag that represents more delay than your SLA allows. Calculate: lag_messages / consumer_throughput_per_second = processing_delay_in_seconds.
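
For example, a lag of 50,000 messages against a consumer group that processes 5,000 messages per second works out to roughly 10 seconds of processing delay.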