Kubernetes Monitoring Guide 2026: Tools, Metrics & Best Practices
Everything you need to know about monitoring Kubernetes in production — from critical metrics and alerting to tool comparisons and cost optimization.
⚡ TL;DR
- Monitor 5 layers: infrastructure → control plane → workloads → application → external endpoints
- Start with kube-prometheus-stack (Helm chart), which installs Prometheus + Grafana in minutes
- Critical alerts: CrashLoopBackOff, node memory, PVC usage, pending pods, API server latency
- Best managed option for cost-conscious teams: Better Stack ($25/mo) or Grafana Cloud (free tier)
- For enterprise: Datadog has the best K8s auto-discovery, but costs $300+/month for 20 nodes
The 5 Layers of Kubernetes Monitoring
Kubernetes monitoring is multi-layered. Missing even one layer leaves blind spots that cause production incidents. Here's the full stack:
Layer 1: Infrastructure
What: Nodes, VMs, network, storage
Tools: Prometheus Node Exporter, cloud provider metrics
Layer 2: Control plane
What: API server, etcd, scheduler, controller manager
Tools: Prometheus scraping each component's /metrics endpoint, kube-state-metrics, metrics-server
Layer 3: Workloads
What: Pods, deployments, jobs, DaemonSets
Tools: Prometheus, Datadog Agent, New Relic
Layer 4: Application
What: Custom metrics, latency, error rates
Tools: Prometheus client libraries, OpenTelemetry
Layer 5: External endpoints
What: Ingress, LoadBalancer, service URLs
Tools: Better Stack, Pingdom, APIStatusCheck
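As a rough way to spot-check several of these layers by hand before any monitoring stack is installed, the sketch below wraps a few read-only kubectl commands in one function. It assumes kubectl is already pointed at the target cluster and that metrics-server is installed (needed for `kubectl top`); the function name `spot_check_layers` is made up for this example. The application layer is left out because it depends on how each service is instrumented.

```shell
#!/usr/bin/env bash
# Spot-check monitoring layers with read-only kubectl commands.
# Assumption: kubectl targets the right cluster; `kubectl top` needs metrics-server.

spot_check_layers() {
  echo "== Infrastructure: node status and resource usage =="
  kubectl get nodes -o wide
  kubectl top nodes

  echo "== Control plane: API server readiness checks =="
  kubectl get --raw='/readyz?verbose'

  echo "== Workloads: pods that are not Running or Succeeded =="
  kubectl get pods --all-namespaces \
    --field-selector=status.phase!=Running,status.phase!=Succeeded

  echo "== External endpoints: ingresses and LoadBalancer services =="
  kubectl get ingress --all-namespaces
  kubectl get services --all-namespaces | grep LoadBalancer
}

# Run only when asked, so the file can also be sourced without side effects.
if [ "${1:-}" = "run" ]; then
  spot_check_layers
fi
```

Saved as a script, it runs with the single argument `run`. This only shows the state at one moment; it is a sanity check, not a replacement for continuous monitoring.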
Critical K8s Metrics & PromQL Queries
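These 6 alert expressions belong in every Kubernetes monitoring setup. Add them as alerting rules in Prometheus, which evaluates them and hands firing alerts to Alertmanager for routing:
Container restarts (CrashLoopBackOff), more than 5 in the last hour:
    increase(kube_pod_container_status_restarts_total[1h]) > 5
Node memory, less than 15% available:
    node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.15
PVC usage above 80%:
    kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.8
Pods pending (set "for: 5m" on the alert rule so it fires only after 5 minutes):
    kube_pod_status_phase{phase="Pending"} > 0
API server p99 latency above 1 second (the metric is a histogram, so the quantile is computed from its buckets):
    histogram_quantile(0.99, sum by (le, verb) (rate(apiserver_request_duration_seconds_bucket{verb!~"WATCH|CONNECT"}[5m]))) > 1
etcd leader changes in the last 15 minutes:
    changes(etcd_server_leader_changes_seen_total[15m]) > 0
Kubernetes Monitoring Tools Compared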
Prometheus + Grafana
Type: Open source
Price: Free
Pros: Industry standard, vast ecosystem, no vendor lock-in
Cons: Requires ops overhead, no built-in long-term storage
Best for: Teams with DevOps capacity
Better Stack
Type: Managed SaaS
Price: From $25/mo
Pros: Simple setup, uptime + alerting + on-call in one, transparent pricing
Cons: Less deep APM than Datadog
Best for: Startup to mid-size teams
Datadog
Type: Enterprise SaaS
Price: $15-23/node/mo
Pros: Best UX, auto-discovery, unified metrics/logs/traces
Cons: Very expensive at scale, complex billing
Best for: Enterprise teams with large budgets
Grafana Cloud
Type: Managed SaaS
Price: Free → $8/mo
Pros: Managed Prometheus + Loki + Tempo, generous free tier
Cons: Grafana UI learning curve
Best for: Teams already using Prometheus/Grafana
New Relic
Type: Enterprise SaaS
Price: $99/user/mo
Pros: Powerful K8s Pixie auto-instrumentation, good UX
Cons: User-based pricing gets expensive fast
Best for: Teams wanting APM + K8s in one tool
Quick Start: kube-prometheus-stack
The fastest way to get production-grade K8s monitoring running. One Helm command installs Prometheus, Alertmanager, Grafana, and 20+ pre-built dashboards:
    # Add the Prometheus community chart repo
    helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
    helm repo update

    # Install the full monitoring stack
    helm install monitoring prometheus-community/kube-prometheus-stack \
      --namespace monitoring \
      --create-namespace \
      --set grafana.adminPassword=changeme \
      --set prometheus.prometheusSpec.retention=30d

    # Access Grafana dashboard
    kubectl port-forward -n monitoring svc/monitoring-grafana 3000:80
After installation, browse to http://localhost:3000 and log in as admin with the password changeme (pick a real password for anything beyond a local test). Grafana ships with pre-built K8s dashboards for nodes, pods, workloads, and control plane components.
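Once the stack is running, the critical alerts listed earlier in this guide can be loaded as a PrometheusRule resource, which the Prometheus Operator bundled with the chart picks up. The sketch below writes a manifest for two of them. It assumes the release name and namespace `monitoring` from the install command above, and that the chart's default rule selector (matching the `release` label) has not been changed; the file name, alert names, and severity labels are illustrative.

```shell
#!/usr/bin/env bash
# Write a PrometheusRule manifest with two of the critical alerts.
# Assumptions: release "monitoring" in namespace "monitoring";
# the file name and alert names are made up for this example.
cat > k8s-critical-alerts.yaml <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: k8s-critical-alerts
  namespace: monitoring
  labels:
    release: monitoring   # matched by the chart's default rule selector
spec:
  groups:
    - name: k8s-critical
      rules:
        - alert: NodeMemoryLow
          expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.15
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Node {{ $labels.instance }} has less than 15% memory available"
        - alert: PodPendingTooLong
          expr: kube_pod_status_phase{phase="Pending"} > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} has been Pending for 5 minutes"
EOF

echo "Wrote k8s-critical-alerts.yaml"
# To load it into the cluster: kubectl apply -f k8s-critical-alerts.yaml
```

The `for: 5m` field is what turns "pending right now" into "pending for 5 minutes": the expression must stay true for that long before the alert fires.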
External Monitoring: Why In-Cluster Isn't Enough
In-cluster monitoring (Prometheus) only detects issues once traffic reaches your cluster. But what if the cluster itself is unreachable, or your ingress is broken? That's where external uptime monitoring becomes critical:
Why you need external monitoring too
- DNS failures: Prometheus won't catch it if your domain stops resolving
- Load balancer issues: ingress controller up, but external traffic blocked
- SSL certificate expiry: cert-manager can fail silently
- DDoS / CDN issues: your cluster is fine but users can't reach it
- Multi-region awareness: monitor from US, EU, Asia to catch regional routing failures
Tools like Better Stack and APIStatusCheck monitor your K8s endpoints from 10+ global locations every 30-60 seconds — catching external failures that in-cluster monitoring misses entirely.
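To make the idea concrete, here is a minimal sketch of what an external probe does, using only curl and openssl. It is not how any of the tools above are implemented; the function names and the 14-day certificate threshold are assumptions for illustration, and a hosted service adds multiple regions, retries, and alert routing on top.

```shell
#!/usr/bin/env bash
# Minimal external probe: HTTP status plus TLS certificate expiry.
# Meant to run from OUTSIDE the cluster (a laptop, a VM in another region).

# Succeeds for 2xx and 3xx status codes.
is_healthy_status() {
  case "$1" in
    2[0-9][0-9]|3[0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

# Prints the HTTP status code for a URL ("000" if DNS or the connection fails).
http_status() {
  curl --silent --output /dev/null --max-time 10 --write-out '%{http_code}' "$1"
}

# Succeeds if the certificate served on host:443 is valid for at least $2 more days.
cert_valid_for_days() {
  local host="$1" days="$2"
  openssl s_client -servername "$host" -connect "$host:443" </dev/null 2>/dev/null \
    | openssl x509 -noout -checkend "$((days * 86400))" >/dev/null
}

check_endpoint() {
  local host="$1" code
  code="$(http_status "https://$host/")"
  if is_healthy_status "$code"; then
    echo "OK   $host HTTP $code"
  else
    echo "FAIL $host HTTP $code"
  fi
  if cert_valid_for_days "$host" 14; then
    echo "OK   $host certificate"
  else
    echo "WARN $host certificate expires within 14 days or could not be read"
  fi
}

# Usage: ./probe.sh app.example.com
if [ "$#" -gt 0 ]; then
  check_endpoint "$1"
fi
```

A DNS failure or a blocked load balancer shows up here as status 000, which is exactly the class of failure an in-cluster Prometheus never sees.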
Frequently Asked Questions
What is the best Kubernetes monitoring tool in 2026?
The best Kubernetes monitoring tool depends on your team size and budget. For open source, the Prometheus + Grafana stack is the industry standard: widely supported and deeply integrated with K8s. For managed solutions, Better Stack offers affordable uptime monitoring with K8s-aware alerting starting at $25/month. For full enterprise observability, Datadog has the best UX but costs $15-23/host/month. For mid-size teams, New Relic and Dynatrace offer powerful K8s auto-discovery with auto-instrumentation.
What Kubernetes metrics should I monitor?
Critical Kubernetes metrics to monitor: (1) Node metrics: CPU/memory utilization, disk I/O, network throughput; (2) Pod metrics: restart count (CrashLoopBackOff detection), CPU/memory usage against requests and limits; (3) Control plane: API server latency, etcd request duration, scheduler queue depth; (4) Application metrics: request rate, error rate, and p99 latency (the RED method), plus custom metrics via Prometheus instrumentation; (5) Cluster health: PVC usage, eviction events, pending pod count.
How do I set up Prometheus for Kubernetes?
The easiest way to set up Prometheus on Kubernetes is using the kube-prometheus-stack Helm chart: helm repo add prometheus-community https://prometheus-community.github.io/helm-charts && helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack. This installs Prometheus, Alertmanager, Grafana, and all standard K8s dashboards in one command. For production, configure persistent storage for metrics retention and set up remote write to long-term storage (Thanos, Cortex, or Grafana Cloud).
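The production settings mentioned here can be kept in a Helm values file. The sketch below writes one with persistent storage and a remote write target. The 50Gi volume size and the remote write URL are placeholders, not recommendations, and the file name is made up; the cluster's default StorageClass is assumed because none is named.

```shell
#!/usr/bin/env bash
# Write a values file for kube-prometheus-stack with persistent storage
# and remote write. The size and URL below are placeholders.
cat > monitoring-values.yaml <<'EOF'
prometheus:
  prometheusSpec:
    retention: 30d
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi   # placeholder: size for your retention and series count
    remoteWrite:
      - url: https://metrics.example.com/api/v1/push   # placeholder endpoint
EOF

echo "Wrote monitoring-values.yaml"
# Apply with:
#   helm upgrade --install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
#     --namespace monitoring --create-namespace -f monitoring-values.yaml
```

Keeping these settings in a file instead of `--set` flags makes them reviewable and repeatable across clusters.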
How much does Kubernetes monitoring cost?
Kubernetes monitoring costs vary widely: (1) Prometheus + Grafana self-hosted: free, but requires infrastructure and ops overhead, (2) Grafana Cloud: free tier for small clusters, ~$8/month for medium, (3) Better Stack: from $25/month, covers uptime + alerting, (4) Datadog: $15-23/node/month — a 20-node cluster costs $300-460/month, (5) New Relic: free tier for 1 user, then $99/user/month. Most teams spend $50-500/month depending on cluster size and tool choice.
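The per-node arithmetic behind the Datadog figure is simple enough to script for your own cluster size. The function name below is made up for this sketch, and it covers only the per-node list price quoted above, not add-ons or usage-based charges.

```shell
#!/usr/bin/env bash
# monthly_cost_range NODES LOW_PER_NODE HIGH_PER_NODE
# Prints the monthly cost range for per-node pricing.
monthly_cost_range() {
  local nodes="$1" low="$2" high="$3"
  printf '$%d-$%d/month\n' "$((nodes * low))" "$((nodes * high))"
}

# 20 nodes at $15-23 per node
monthly_cost_range 20 15 23   # prints $300-$460/month
```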
What is the difference between Prometheus and Datadog for Kubernetes?
Prometheus is open-source and pull-based — you define scrape configs for each service, store metrics locally, and visualize in Grafana. It requires more setup but has no per-host cost. Datadog is a fully managed SaaS that auto-discovers K8s workloads and provides out-of-box dashboards, distributed tracing, and log management in one platform. Prometheus is the default for cost-conscious teams; Datadog is preferred when time-to-value and cross-signal correlation (metrics + logs + traces) justify the cost.
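With kube-prometheus-stack, "defining a scrape config" usually means creating a ServiceMonitor resource that the Prometheus Operator translates into scrape configuration. The sketch below writes one for a hypothetical service: the name `my-app`, its `app: my-app` label, the `default` namespace, and the port name `http-metrics` are all assumptions, as is the `release: kube-prometheus-stack` label, which must match whatever release name was used at install time.

```shell
#!/usr/bin/env bash
# Write a ServiceMonitor for a hypothetical service labeled app=my-app
# that exposes Prometheus metrics on a port named "http-metrics".
cat > my-app-servicemonitor.yaml <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: monitoring
  labels:
    release: kube-prometheus-stack   # must match your Helm release name
spec:
  namespaceSelector:
    matchNames: ["default"]          # namespace where the Service lives
  selector:
    matchLabels:
      app: my-app                    # label on the Service, not the Pod
  endpoints:
    - port: http-metrics             # name of the Service port
      path: /metrics
      interval: 30s
EOF

echo "Wrote my-app-servicemonitor.yaml"
# To load it into the cluster: kubectl apply -f my-app-servicemonitor.yaml
```

This per-service step is the setup cost the comparison refers to; Datadog's agent discovers the same workload without it.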