
📘 Comprehensive Guide · 12 min read

Kubernetes Monitoring Guide 2026: Tools, Metrics & Best Practices

Everything you need to know about monitoring Kubernetes in production — from critical metrics and alerting to tool comparisons and cost optimization.

Published: April 7, 2026 · By API Status Check

⚡ TL;DR

• Monitor 5 layers: infrastructure → control plane → workloads → application → external endpoints
• Start with kube-prometheus-stack (Helm chart) — installs Prometheus + Grafana in minutes
• Critical alerts: CrashLoopBackOff, node memory, PVC usage, pending pods, API server latency
• Best managed option for cost-conscious teams: Better Stack ($25/mo) or Grafana Cloud (free tier)
• For enterprise: Datadog has the best K8s auto-discovery, but costs $300+/month for 20 nodes

The 5 Layers of Kubernetes Monitoring

Kubernetes monitoring is multi-layered. Missing even one layer leaves blind spots that cause production incidents. Here's the full stack:

1. Infrastructure

What: Nodes, VMs, network, storage

Tools: Prometheus Node Exporter, cloud provider metrics

2. Kubernetes Control Plane

What: API server, etcd, scheduler, controller manager

Tools: Prometheus scraping each component's /metrics endpoint, kube-state-metrics, metrics-server

3. Workloads

What: Pods, deployments, jobs, DaemonSets

Tools: Prometheus, Datadog Agent, New Relic

4. Applications

What: Custom metrics, latency, error rates

Tools: Prometheus client libraries, OpenTelemetry

5. External Endpoints

What: Ingress, LoadBalancer, service URLs

Tools: Better Stack, Pingdom, APIStatusCheck


Critical K8s Metrics & PromQL Queries

These alerts belong in every Kubernetes monitoring setup. Add the expressions to your Prometheus alerting rules (Prometheus evaluates the rules; Alertmanager routes the resulting notifications):

Pod CrashLoopBackOff (Critical)

increase(kube_pod_container_status_restarts_total[1h]) > 5

Node Memory > 85% (Warning)

node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.15

PVC Usage > 80% (Warning)

kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.8
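To show where these expressions live, here is a minimal Python sketch that renders them as a Prometheus rule file. The alert names, the group name, and the `for` durations are illustrative assumptions, not values from this guide. The restart rule wraps the counter in `increase(...[1h])` so it reflects recent restarts rather than the lifetime total; the one-hour window is also an assumption.

```python
import json
from dataclasses import dataclass


@dataclass
class Alert:
    name: str      # alert name (hypothetical, pick your own)
    expr: str      # PromQL expression to evaluate
    severity: str  # value of the "severity" label
    hold: str      # how long the condition must hold before firing


ALERTS = [
    Alert("PodCrashLooping",
          "increase(kube_pod_container_status_restarts_total[1h]) > 5",
          "critical", "5m"),
    Alert("NodeMemoryHigh",
          "node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.15",
          "warning", "10m"),
    Alert("PVCUsageHigh",
          "kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.8",
          "warning", "10m"),
]


def render_rules(alerts, group="kubernetes-critical"):
    """Render alerts as the YAML text of a Prometheus rule file."""
    lines = ["groups:", f"  - name: {group}", "    rules:"]
    for a in alerts:
        lines += [
            f"      - alert: {a.name}",
            # json.dumps yields a double-quoted string, which is also valid YAML
            f"        expr: {json.dumps(a.expr)}",
            f"        for: {a.hold}",
            "        labels:",
            f"          severity: {a.severity}",
        ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_rules(ALERTS), end="")
```

The output can be saved to a file and validated with `promtool check rules <file>` before loading it. With kube-prometheus-stack, rules are usually delivered as PrometheusRule resources instead, so treat this as a picture of the rule structure and not a deployment recipe.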

The kube-prometheus-stack Helm chart is the fastest way to get production-grade K8s monitoring running. One Helm command installs Prometheus, Alertmanager, Grafana, and 20+ pre-built dashboards:

# Add the Prometheus community chart repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install the full monitoring stack
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set grafana.adminPassword=changeme \
  --set prometheus.prometheusSpec.retention=30d

# Access Grafana dashboard
kubectl port-forward -n monitoring svc/monitoring-grafana 3000:80

After installation, open http://localhost:3000 and log in as admin with the password changeme (replace that placeholder password for anything beyond a local test). Grafana comes with pre-built K8s dashboards for nodes, pods, workloads, and control plane components.
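Dashboards are not the only way to read the data. The sketch below queries the Prometheus HTTP API (`/api/v1/query`) directly. It assumes Prometheus has been made reachable at http://localhost:9090, for example through a second port-forward; that address and the helper names are assumptions, and the exact service name depends on your release name.

```python
import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://localhost:9090"  # assumption: a port-forward to Prometheus


def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    params = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


def parse_vector(payload):
    """Turn an instant-query response into a list of (labels, value) pairs."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    # Each sample looks like {"metric": {...}, "value": [timestamp, "value-as-string"]}
    return [(s["metric"], float(s["value"][1])) for s in data["result"]]


def query(promql, base_url=PROMETHEUS_URL, timeout=5.0):
    """Run one instant query and return its parsed samples."""
    url = build_query_url(base_url, promql)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_vector(json.load(resp))
```

Calling `query("node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes")` should then return one pair per node, with the value being the fraction of memory still available.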

External Monitoring: Why In-Cluster Isn't Enough

In-cluster monitoring (Prometheus) only detects issues once traffic reaches your cluster. But what if the cluster itself is unreachable, or your ingress is broken? That's where external uptime monitoring becomes critical:

Why you need external monitoring too

• DNS failures — Prometheus won't catch if your domain stops resolving
• Load balancer issues — ingress controller up, but external traffic blocked
• SSL certificate expiry — cert-manager can fail silently
• DDoS / CDN issues — your cluster is fine but users can't reach it
• Multi-region awareness — monitor from US, EU, Asia to catch regional routing failures

Tools like Better Stack and APIStatusCheck monitor your K8s endpoints from 10+ global locations every 30-60 seconds — catching external failures that in-cluster monitoring misses entirely.
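To make the idea concrete, here is a rough Python sketch of what a single external probe does: one HTTP check plus a certificate-expiry check, run from outside the cluster. The function names are hypothetical, and a hosted service adds scheduling, multiple regions, and alert routing on top of this.

```python
import socket
import ssl
import time
import urllib.request


def days_until_expiry(not_after, now=None):
    """Days left before a certificate's notAfter time, e.g. 'Jun  1 12:00:00 2026 GMT'."""
    expires = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now is None else now
    return (expires - now) / 86400


def fetch_not_after(host, port=443, timeout=5.0):
    """Open a TLS connection and return the peer certificate's notAfter field."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def probe(url, timeout=5.0):
    """Return (ok, detail) for one HTTP GET; DNS, TLS, and connection errors count as failures."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400, f"HTTP {resp.status}"
    except OSError as exc:  # URLError, HTTPError, and socket timeouts all derive from OSError
        return False, str(exc)
```

A scheduler would call `probe()` against each public URL every 30-60 seconds and alert when `days_until_expiry(fetch_not_after(host))` drops below a threshold you choose. Because the check runs outside the cluster, a DNS or load balancer failure shows up as a failed probe instead of silence.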

Frequently Asked Questions

What is the best Kubernetes monitoring tool in 2026?

The best Kubernetes monitoring tool depends on your team size and budget. For open-source: the Prometheus + Grafana stack is the industry standard — widely supported, deeply integrated with K8s. For managed solutions: Better Stack offers affordable uptime monitoring with K8s-aware alerting starting at $25/month. For full enterprise observability: Datadog has the best UX but costs $15-23/host/month. For medium teams: New Relic and Dynatrace offer powerful K8s auto-discovery with auto-instrumentation.

What Kubernetes metrics should I monitor?

Critical Kubernetes metrics to monitor: (1) Node metrics: CPU/memory utilization, disk I/O, network throughput; (2) Pod metrics: restart count (for CrashLoopBackOff detection), CPU/memory usage against requests and limits; (3) Control plane: API server latency, etcd request duration, scheduler queue depth; (4) Application metrics: request rate, error rate, and duration such as p99 latency (the RED method), plus custom metrics via Prometheus instrumentation; (5) Cluster health: PVC usage, eviction events, pending pod count.

How do I set up Prometheus for Kubernetes?

The easiest way to set up Prometheus on Kubernetes is using the kube-prometheus-stack Helm chart: helm repo add prometheus-community https://prometheus-community.github.io/helm-charts && helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack. This installs Prometheus, Alertmanager, Grafana, and all standard K8s dashboards in one command. For production, configure persistent storage for metrics retention and set up remote write to long-term storage (Thanos, Cortex, or Grafana Cloud).

How much does Kubernetes monitoring cost?

Kubernetes monitoring costs vary widely: (1) Prometheus + Grafana self-hosted: free, but requires infrastructure and ops overhead, (2) Grafana Cloud: free tier for small clusters, ~$8/month for medium, (3) Better Stack: from $25/month, covers uptime + alerting, (4) Datadog: $15-23/node/month — a 20-node cluster costs $300-460/month, (5) New Relic: free tier for 1 user, then $99/user/month. Most teams spend $50-500/month depending on cluster size and tool choice.
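The Datadog figure is plain per-node arithmetic, which makes it easy to redo for your own cluster size. A minimal sketch; the function name is hypothetical and the default rates are the ones quoted in this answer:

```python
def monthly_cost_range(nodes, low_per_node=15.0, high_per_node=23.0):
    """Return the (low, high) monthly cost in dollars for per-node pricing."""
    return nodes * low_per_node, nodes * high_per_node


low, high = monthly_cost_range(20)
print(f"20 nodes: ${low:.0f}-{high:.0f}/month")  # 20 nodes: $300-460/month
```

Per-node pricing scales linearly, so doubling the cluster to 40 nodes doubles the range to $600-920/month.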

What is the difference between Prometheus and Datadog for Kubernetes?

Prometheus is open-source and pull-based — you define scrape configs for each service, store metrics locally, and visualize in Grafana. It requires more setup but has no per-host cost. Datadog is a fully managed SaaS that auto-discovers K8s workloads and provides out-of-box dashboards, distributed tracing, and log management in one platform. Prometheus is the default for cost-conscious teams; Datadog is preferred when time-to-value and cross-signal correlation (metrics + logs + traces) justify the cost.
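"Pull-based" means each service exposes a plain-text /metrics endpoint and Prometheus fetches it on a schedule. The sketch below serves two counters in the Prometheus text exposition format using only the Python standard library. The metric name, labels, and values are made up for illustration, and a real service would normally use an official client library (such as prometheus_client for Python) instead of formatting the text by hand.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counters keyed by (method, status)
REQUESTS = {("GET", "200"): 1027, ("GET", "500"): 3}


def render_metrics(counters):
    """Render counters in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests handled.",
        "# TYPE http_requests_total counter",
    ]
    for (method, status), value in sorted(counters.items()):
        lines.append(f'http_requests_total{{method="{method}",status="{status}"}} {value}')
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only /metrics is served; anything else is a 404
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUESTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever, serving /metrics for Prometheus to scrape."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Running `serve()` and pointing a scrape config at port 8000 would make these counters queryable in Prometheus. Datadog's agent, by contrast, discovers workloads and ships their data to the SaaS backend for you.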

