Infrastructure Monitoring · Updated May 2026

Apache Kafka Monitoring Guide 2026

How to monitor Kafka in production — consumer lag, broker health, throughput metrics, and automated alerting before message backlogs cascade into incidents.

TL;DR

  • Consumer lag is the #1 metric — alert when it grows over 3 consecutive intervals
  • UnderReplicatedPartitions > 0 means data loss risk — page on-call immediately
  • ActiveControllerCount must equal exactly 1 at all times
  • Monitor all four signal types: consumer lag, broker health, producer errors, network throughput
  • Use Prometheus + JMX Exporter (free) or Datadog (managed) — both have 200+ Kafka metrics

Why Kafka Monitoring Is Uniquely Challenging

Apache Kafka exposes hundreds of JMX metrics across brokers, producers, consumers, and topics. The challenge isn't collecting metrics — it's knowing which ones matter and what thresholds to alert on. A misconfigured consumer quietly falls behind while all other metrics look normal. An under-replicated partition means you're one broker crash away from data loss.

Common Kafka incidents in production and their early warning signs:

  • Consumer lag spike
    Early warning: records-lag-max grows for 3+ consecutive intervals
    Common cause: consumer slowdown, schema change, or dependency bottleneck

  • Broker failure
    Early warning: UnderReplicatedPartitions > 0
    Common cause: disk full, OOM, network partition, or GC pause

  • Producer timeout storm
    Early warning: record-error-rate > 0 plus a request-latency-avg spike
    Common cause: broker overload, network saturation, or wrong producer config

  • No controller elected
    Early warning: ActiveControllerCount != 1
    Common cause: ZooKeeper/KRaft issue, network partition, or broker restart race

  • Topic partition offline
    Early warning: OfflinePartitionsCount > 0
    Common cause: all replicas offline, or an under-min-ISR condition

Critical Kafka Metrics to Monitor

Kafka exposes hundreds of JMX metrics. These are the ones that actually matter for production reliability, organized by signal category:

Consumer Metrics (Critical)

records-lag-max
JMX: kafka.consumer:type=consumer-fetch-manager-metrics,records-lag-max
Maximum consumer lag across partitions — the #1 metric to watch
Alert: growing over 3 consecutive intervals; Critical: exceeds your SLA-based threshold

records-consumed-rate
JMX: kafka.consumer:type=consumer-fetch-manager-metrics,records-consumed-rate
Consumer throughput — a sudden drop signals a consumer health issue
Alert: drops >50% from baseline; Critical: drops to 0

fetch-latency-avg
JMX: kafka.consumer:type=consumer-fetch-manager-metrics,fetch-latency-avg
Average time to fetch messages from the broker
Alert: >500ms; Critical: >2s

Broker Metrics (Critical)

UnderReplicatedPartitions
JMX: kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions
Partitions without sufficient replicas — data loss risk
Critical: > 0 for >30 seconds

ActiveControllerCount
JMX: kafka.controller:type=KafkaController,name=ActiveControllerCount
Number of active controllers — should always be exactly 1
Critical: != 1

OfflinePartitionsCount
JMX: kafka.controller:type=KafkaController,name=OfflinePartitionsCount
Partitions with no leader — producers/consumers cannot use these
Critical: > 0

RequestHandlerAvgIdlePercent
JMX: kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent
Fraction of time request handlers are idle (0–1)
Warning: < 0.3; Critical: < 0.1

Producer Metrics (High)

record-error-rate
JMX: kafka.producer:type=producer-metrics,record-error-rate
Rate of failed record sends — should be 0 in a healthy system
Warning: > 0; Critical: > 0.01

request-latency-avg
JMX: kafka.producer:type=producer-metrics,request-latency-avg
Average produce request latency
Warning: >100ms; Critical: >500ms

record-queue-time-max
JMX: kafka.producer:type=producer-metrics,record-queue-time-max
Maximum time records spend in the producer queue before being sent
Warning: >50ms; Critical: >200ms

Topic/Partition Metrics (Medium)

MessagesInPerSec
JMX: kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec
Message ingestion rate — baseline and trend monitoring
Alert: drops >80% below baseline

BytesInPerSec / BytesOutPerSec
JMX: kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec and BytesOutPerSec
Network throughput — watch for saturation
Warning: >80% of network capacity; Critical: >95%

LogStartOffset / LogEndOffset
JMX: kafka.log:type=Log,name=LogStartOffset and LogEndOffset
Track partition retention and growth
Alert: disk usage >80% on any broker
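
The disk and network thresholds in the last two rows come from host-level metrics rather than Kafka JMX. As a minimal sketch, assuming node_exporter runs on every broker; the mountpoint and network device names below are placeholders for your environment:

# Broker disk usage above 80% on the Kafka data volume
(1 - node_filesystem_avail_bytes{mountpoint="/var/lib/kafka"}
   / node_filesystem_size_bytes{mountpoint="/var/lib/kafka"}) > 0.80

# Per-broker network transmit rate (bytes/sec over 5 minutes), to compare against link capacity
rate(node_network_transmit_bytes_total{device="eth0"}[5m])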

Consumer Lag: The Most Important Kafka Metric

Consumer lag is the difference between the latest offset (log end offset) and the consumer's current position (committed offset) for each partition. A lag of 0 means the consumer is fully caught up. Growing lag means consumers can't keep pace with producers.

Check consumer lag from CLI

# List all consumer groups and their lag
kafka-consumer-groups.sh \
  --bootstrap-server localhost:9092 \
  --describe \
  --all-groups

# Check specific consumer group
kafka-consumer-groups.sh \
  --bootstrap-server localhost:9092 \
  --describe \
  --group my-consumer-group

# Output columns:
# GROUP | TOPIC | PARTITION | CURRENT-OFFSET | LOG-END-OFFSET | LAG

Prometheus query for consumer lag alerting

# Consumer lag per group/topic/partition
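# (metric and label names here assume kafka_exporter or a similar lag exporter; adjust to yours)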
kafka_consumergroup_lag{group="my-group", topic="my-topic"}

# Total lag for a consumer group (sum across all partitions)
sum(kafka_consumergroup_lag{group="my-group"}) by (group, topic)

# Alert: lag has been growing for 5+ minutes
# (lag at time T > lag at time T-5m)
(
  sum(kafka_consumergroup_lag{group="my-group"}) by (group)
  >
  sum(kafka_consumergroup_lag{group="my-group"} offset 5m) by (group)
)

⚠ Alert on Trend, Not Just Value

A lag spike during a burst of traffic is normal — Kafka is designed to buffer messages. What matters is whether lag is growing (the consumer can't keep up) or shrinking (the consumer is catching up). Use Burrow or a time-series comparison in Prometheus to detect sustained growth rather than alerting on absolute lag values, which creates false positives.
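
As a sketch, this trend check can be turned into a Prometheus alerting rule. It assumes the kafka_consumergroup_lag metric and group/topic labels from the query above (names vary by exporter), and the 5-minute offset and 15-minute hold are illustrative values to tune for your traffic:

groups:
  - name: kafka-consumer-lag
    rules:
      - alert: KafkaConsumerLagGrowing
        # Fires when lag is higher than it was 5 minutes ago, sustained for 15 minutes
        expr: |
          sum(kafka_consumergroup_lag) by (group, topic)
            >
          sum(kafka_consumergroup_lag offset 5m) by (group, topic)
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Consumer lag for {{ $labels.group }}/{{ $labels.topic }} has been growing for 15+ minutes"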

Kafka Monitoring Tools Comparison

The right tool depends on whether you're self-hosting Kafka or using a managed service, and your team's operational capacity:

Prometheus + JMX Exporter + Grafana

Open Source

Best for: Teams with Prometheus expertise; self-hosted Kafka

Pros: Free, highly customizable, 200+ dashboard templates, community-supported
Cons: Requires setup and maintenance; no built-in alerting UX
Setup: Install JMX exporter as Kafka agent → scrape with Prometheus → dashboard in Grafana

Datadog

Managed

Best for: Teams already using Datadog for infrastructure

Pros: Auto-discovery, 200+ pre-built Kafka metrics, correlated with infra metrics
Cons: $15–25/host/month; expensive at scale
Setup: Install Datadog agent → enable Kafka integration → import dashboards

Confluent Control Center

Managed (Confluent)

Best for: Confluent Cloud or Confluent Platform deployments

Pros: Native integration, schema registry visibility, consumer group management
Cons: Requires Confluent Cloud or an expensive Confluent Platform license
Setup: Built into Confluent Platform; enable in Confluent Cloud dashboard

Burrow (LinkedIn)

Open Source

Best for: Consumer lag monitoring specifically

Pros: Trend-based lag evaluation (not just absolute values), HTTP API
Cons: Lag monitoring only; not a full observability solution
Setup: Deploy Burrow service → configure Kafka cluster → query HTTP API or integrate with alerting
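
Once Burrow is running, its HTTP API can be polled directly or from an alerting script. A minimal check, assuming Burrow's default HTTP port 8000 and a cluster named local in its configuration (both placeholders):

# Full lag evaluation for one consumer group (status plus per-partition detail)
curl -s http://localhost:8000/v3/kafka/local/consumer/my-consumer-group/lag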

Better Stack

Managed

Best for: Teams wanting fast setup with incident management

Pros: Easy setup, incident timelines, on-call scheduling, status pages
Cons: Less Kafka-specific than Datadog
Setup: Connect infrastructure monitoring → create Kafka-specific uptime checks → configure alerts

Kafka Alerting Runbook

These are the alerts every Kafka cluster should have, ordered by severity:

CRITICAL: KafkaUnderReplicatedPartitions

Condition: kafka_server_ReplicaManager_UnderReplicatedPartitions > 0 for 1m

Response: Check broker health immediately — a broker may be down. Identify which topics/partitions are affected. Do NOT perform maintenance until replication is healthy. This is a data loss precursor.

CRITICAL: KafkaNoActiveController

Condition: kafka_controller_KafkaController_ActiveControllerCount != 1 for 1m

Response: ZooKeeper/KRaft issue or network partition. Check ZooKeeper/KRaft ensemble health. Review broker logs for controller election failures. May resolve automatically within seconds; if >2 min, escalate.

CRITICAL: KafkaOfflinePartitions

Condition: kafka_controller_KafkaController_OfflinePartitionsCount > 0 for 30s

Response: Partitions with no leader — producers/consumers fail for affected topics. Identify offline partitions, restart dead brokers, or trigger leader election with kafka-leader-election.sh.

WARNING: KafkaConsumerLagGrowing

Condition: Consumer lag has grown for 5+ consecutive minutes (trend, not absolute)

Response: Check consumer health — are consumers processing messages? Look for slow consumers, blocking calls, or schema changes. Confirm producers haven't spiked. Scale consumer group if needed.

WARNING: KafkaBrokerCPUSaturation

Condition: RequestHandlerAvgIdlePercent < 0.20 for 5m

Response: Broker is CPU-saturated. Review recent traffic spikes. Check for consumer group with excessive fetch requests. Consider adding partitions or brokers if sustained.

WARNING: KafkaDiskHigh

Condition: Broker disk usage > 80%

Response: Check retention settings — may need to reduce retention.ms or retention.bytes. Identify high-volume topics. Add capacity or increase log compaction. Critical if >90%.
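
If you collect these metrics with the Prometheus + JMX Exporter setup described in the next section, the three critical alerts above can be expressed as alerting rules roughly like this (a sketch; metric names match the exporter rules shown below, and thresholds and durations mirror the runbook):

groups:
  - name: kafka-broker-critical
    rules:
      - alert: KafkaUnderReplicatedPartitions
        expr: kafka_server_ReplicaManager_UnderReplicatedPartitions > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "{{ $value }} under-replicated partitions on {{ $labels.instance }}"
      - alert: KafkaNoActiveController
        # Each broker reports 0 or 1; the sum across the cluster should be exactly 1
        expr: sum(kafka_controller_KafkaController_ActiveControllerCount) != 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Cluster does not have exactly one active controller"
      - alert: KafkaOfflinePartitions
        expr: kafka_controller_KafkaController_OfflinePartitionsCount > 0
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "{{ $value }} partitions are offline (no leader)"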

Enabling JMX for Kafka Monitoring

Kafka exposes all metrics via JMX. You need JMX enabled on your brokers and a compatible exporter to get data into your monitoring stack:

Enable JMX on Kafka brokers

# In kafka-server-start.sh or as environment variable
export JMX_PORT=9999

# For remote JMX access (production — add auth):
export KAFKA_JMX_OPTS="-Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.local.only=false \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcom.sun.management.jmxremote.ssl=false \
  -Dcom.sun.management.jmxremote.port=9999 \
  -Dcom.sun.management.jmxremote.rmi.port=9999 \
  -Djava.rmi.server.hostname=<broker-hostname>"

JMX Exporter config for Prometheus (kafka-jmx-exporter.yml)

lowercaseOutputName: false  # keep mixed-case metric names so they match the alert conditions above
rules:
  # Under-replicated partitions
  - pattern: 'kafka.server<type=ReplicaManager, name=UnderReplicatedPartitions><>Value'
    name: kafka_server_ReplicaManager_UnderReplicatedPartitions
    type: GAUGE
  # Active controller
  - pattern: 'kafka.controller<type=KafkaController, name=ActiveControllerCount><>Value'
    name: kafka_controller_KafkaController_ActiveControllerCount
    type: GAUGE
  # Offline partitions
  - pattern: 'kafka.controller<type=KafkaController, name=OfflinePartitionsCount><>Value'
    name: kafka_controller_KafkaController_OfflinePartitionsCount
    type: GAUGE
  # Request handler idle ratio
  - pattern: 'kafka.server<type=KafkaRequestHandlerPool, name=RequestHandlerAvgIdlePercent><>OneMinuteRate'
    name: kafka_server_KafkaRequestHandlerPool_RequestHandlerAvgIdlePercent
    type: GAUGE
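
To wire the exporter into Prometheus, it is typically attached to each broker as a Java agent and scraped over HTTP. A minimal sketch, where the jar path, port 7071, and broker hostnames are placeholders to adapt:

# Attach the JMX exporter agent before starting the broker
export KAFKA_OPTS="-javaagent:/opt/jmx-exporter/jmx_prometheus_javaagent.jar=7071:/opt/jmx-exporter/kafka-jmx-exporter.yml"
bin/kafka-server-start.sh config/server.properties

# prometheus.yml: scrape each broker's exporter endpoint
scrape_configs:
  - job_name: kafka
    static_configs:
      - targets: ['broker-1:7071', 'broker-2:7071', 'broker-3:7071']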

Frequently Asked Questions

What is the most important Kafka metric to monitor?

Consumer lag (kafka.consumer.fetch-manager-metrics:records-lag-max) is the single most important Kafka metric. It measures how far behind a consumer group is from the latest offset. A growing lag means consumers can't keep up with production — this is the earliest warning sign of a backlog before it becomes an incident. Alert when consumer lag exceeds your acceptable processing delay (typically 1,000–10,000 messages depending on throughput).

How do I monitor Kafka consumer lag?

Monitor consumer lag with: (1) kafka-consumer-groups.sh --describe --all-groups to check all consumer groups manually, (2) the kafka_consumergroup_lag metric in Prometheus, exposed by kafka_exporter or a similar lag exporter, (3) Burrow (LinkedIn's open-source lag monitor) for trend-based alerting, (4) Confluent Control Center for managed Kafka. Alert when lag grows over 3 consecutive intervals rather than on absolute value — a spike is normal, persistent growth is not.

What Kafka broker metrics should I alert on?

Critical broker metrics to alert on: (1) UnderReplicatedPartitions > 0 — partitions without enough replicas, data loss risk, (2) ActiveControllerCount != 1 — split brain or no controller, (3) OfflinePartitionsCount > 0 — partitions with no leader, unavailable, (4) RequestHandlerAvgIdlePercent < 20% — broker CPU saturation, (5) NetworkProcessorAvgIdlePercent < 30% — network thread saturation. Any of these should page on-call immediately.

What tools can I use to monitor Kafka?

Top Kafka monitoring tools: (1) Prometheus + Kafka JMX Exporter + Grafana — open source, full visibility, requires setup, (2) Datadog Kafka integration — managed, 200+ pre-built metrics, auto-discovery, (3) Confluent Control Center — best for Confluent Cloud, schema registry integration, (4) Burrow — LinkedIn's open-source consumer lag monitor, trend-based alerting, (5) Conduktor Platform — UI-heavy, good for team visibility, (6) Better Stack — infrastructure monitoring with Kafka metric collection, (7) Kafka UI — open source web UI for cluster management and basic monitoring.

What is acceptable Kafka consumer lag?

Acceptable Kafka consumer lag depends on your use case. For real-time systems (payment processing, fraud detection): alert at >100 messages lag, critical at >1,000. For near-real-time (analytics pipelines): alert at >10,000, critical at >100,000. For batch processing: alert when lag hasn't decreased in 30 minutes. The right threshold is lag that represents more delay than your SLA allows. Calculate: lag_messages / consumer_throughput_per_second = processing_delay_in_seconds.
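
For example, a lag of 50,000 messages against a consumer group that processes 5,000 messages per second works out to roughly 10 seconds of processing delay.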