Kubernetes Monitoring Guide 2026: Tools, Metrics & Best Practices

Everything you need to know about monitoring Kubernetes in production — from critical metrics and alerting to tool comparisons and cost optimization.

Published: April 7, 2026 · By API Status Check

The 5 Layers of Kubernetes Monitoring

Kubernetes monitoring is multi-layered. Missing even one layer leaves blind spots that cause production incidents. Here's the full stack:

1. Infrastructure

What: Nodes, VMs, network, storage

Tools: Prometheus Node Exporter, cloud provider metrics

2. Kubernetes Control Plane

What: API server, etcd, scheduler, controller manager

Tools: kube-state-metrics, metrics-server

3. Workloads

What: Pods, deployments, jobs, DaemonSets

Tools: Prometheus, Datadog Agent, New Relic

4. Applications

What: Custom metrics, latency, error rates

Tools: Prometheus client libraries, OpenTelemetry

5. External Endpoints

What: Ingress, LoadBalancer, service URLs

Tools: Better Stack, Pingdom, APIStatusCheck

Critical K8s Metrics & PromQL Queries

These six alerts belong in every Kubernetes monitoring setup. Add them as alerting rules in Prometheus, which evaluates the expressions and forwards firing alerts to Alertmanager:

Pod CrashLoopBackOff (Critical)
increase(kube_pod_container_status_restarts_total[1h]) > 5

Node Memory > 85% (Warning)
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.15

PVC Usage > 80% (Warning)
kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.8

Pending Pods > 5 min (Warning)
kube_pod_status_phase{phase="Pending"} > 0, with for: 5m set on the alert rule

API Server Latency > 1s (Critical)
histogram_quantile(0.99, sum by (le, verb) (rate(apiserver_request_duration_seconds_bucket{verb!~"WATCH|CONNECT"}[5m]))) > 1

etcd Leader Changes (Critical)
changes(etcd_server_leader_changes_seen_total[15m]) > 0
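
As a rough sketch, the six alerts above could be written into a Prometheus rule file like this. The file name, group name, alert names, and summary texts are assumptions made for illustration. The restart rule uses increase() because the restart metric is a cumulative counter, the latency rule uses histogram_quantile() because the API server exposes request duration as a histogram, and the 5-minute Pending threshold lives in the rule's for: field rather than in the expression.

```shell
# Hypothetical output path; override by exporting RULES_FILE first.
RULES_FILE="${RULES_FILE:-k8s-alerts.yaml}"

# The quoted 'EOF' keeps the shell from expanding {{ $labels... }} templates.
cat > "$RULES_FILE" <<'EOF'
groups:
  - name: kubernetes-essentials
    rules:
      - alert: PodCrashLooping
        expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
        labels:
          severity: critical
        annotations:
          summary: "Container in {{ $labels.namespace }}/{{ $labels.pod }} restarted more than 5 times in 1h"
      - alert: NodeMemoryHigh
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.15
        labels:
          severity: warning
        annotations:
          summary: "Node {{ $labels.instance }} has less than 15% memory available"
      - alert: PVCUsageHigh
        expr: kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.8
        labels:
          severity: warning
        annotations:
          summary: "PVC {{ $labels.persistentvolumeclaim }} is more than 80% full"
      - alert: PodsPending
        expr: kube_pod_status_phase{phase="Pending"} > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} has been Pending for 5 minutes"
      - alert: APIServerLatencyHigh
        expr: histogram_quantile(0.99, sum by (le, verb) (rate(apiserver_request_duration_seconds_bucket{verb!~"WATCH|CONNECT"}[5m]))) > 1
        labels:
          severity: critical
        annotations:
          summary: "API server p99 latency for {{ $labels.verb }} requests is above 1s"
      - alert: EtcdLeaderChanged
        expr: changes(etcd_server_leader_changes_seen_total[15m]) > 0
        labels:
          severity: critical
        annotations:
          summary: "etcd leader changed in the last 15 minutes"
EOF

echo "wrote $RULES_FILE"
```

If promtool is installed, `promtool check rules k8s-alerts.yaml` validates the file before Prometheus loads it. With the Prometheus Operator, the same groups would typically go under the spec of a PrometheusRule resource instead of a standalone file.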

Kubernetes Monitoring Tools Compared

Prometheus + Grafana

Open Source · Free
✓ Pros

Industry standard, vast ecosystem, no vendor lock-in

✗ Cons

Requires ops overhead, no built-in long-term storage

★ Best For

Teams with DevOps capacity

Better Stack

Managed SaaS · From $25/mo
✓ Pros

Simple setup, uptime + alerting + on-call in one, transparent pricing

✗ Cons

Less deep APM than Datadog

★ Best For

Startup to mid-size teams

Datadog

Enterprise SaaS · $15-23/node/mo
✓ Pros

Best UX, auto-discovery, unified metrics/logs/traces

✗ Cons

Very expensive at scale, complex billing

★ Best For

Enterprise teams with large budgets

Grafana Cloud

Managed SaaS · Free → $8/mo
✓ Pros

Managed Prometheus + Loki + Tempo, generous free tier

✗ Cons

Grafana UI learning curve

★ Best For

Teams already using Prometheus/Grafana

New Relic

Enterprise SaaS · $99/user/mo
✓ Pros

Powerful K8s Pixie auto-instrumentation, good UX

✗ Cons

User-based pricing gets expensive fast

★ Best For

Teams wanting APM + K8s in one tool

Quick Start: kube-prometheus-stack

The fastest way to get production-grade K8s monitoring running. One Helm command installs Prometheus, Alertmanager, Grafana, and 20+ pre-built dashboards:

# Add the Prometheus community chart repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install the full monitoring stack
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set grafana.adminPassword=changeme \
  --set prometheus.prometheusSpec.retention=30d

# Access Grafana dashboard
kubectl port-forward -n monitoring svc/monitoring-grafana 3000:80

After installation, open http://localhost:3000 and log in as admin with the password changeme. Grafana ships with pre-built K8s dashboards for nodes, pods, workloads, and control plane components.

External Monitoring: Why In-Cluster Isn't Enough

In-cluster monitoring (Prometheus) only detects issues once traffic reaches your cluster. But what if the cluster itself is unreachable, or your ingress is broken? That's where external uptime monitoring becomes critical:

Why you need external monitoring too

  • DNS failures — Prometheus won't catch if your domain stops resolving
  • Load balancer issues — ingress controller up, but external traffic blocked
  • SSL certificate expiry — cert-manager can fail silently
  • DDoS / CDN issues — your cluster is fine but users can't reach it
  • Multi-region awareness — monitor from US, EU, Asia to catch regional routing failures

Tools like Better Stack and APIStatusCheck monitor your K8s endpoints from 10+ global locations every 30-60 seconds — catching external failures that in-cluster monitoring misses entirely.
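
To make the idea concrete, here is a minimal sketch of an external check, assuming curl is available and the script runs on a machine outside the cluster. The function names classify_status and probe are hypothetical, and a hosted service would add multi-region probes, retries, and alert routing on top of this.

```shell
# Map an HTTP status code to "up" or "down".
# curl reports 000 when no response arrived at all (DNS failure,
# connection refused, timeout, TLS error), which counts as down.
classify_status() {
  local code="$1"
  if [ "$code" -ge 200 ] 2>/dev/null && [ "$code" -lt 400 ] 2>/dev/null; then
    echo "up"
  else
    echo "down"
  fi
}

# Probe one URL from outside the cluster and print "<url> up|down".
probe() {
  local url="$1" code
  code="$(curl --silent --output /dev/null --max-time 10 \
    --write-out '%{http_code}' "$url")" || true
  echo "$url $(classify_status "${code:-000}")"
}
```

Calling `probe https://example.com/healthz` (a placeholder URL) from a cron job on a separate host covers the DNS, load balancer, and certificate failures listed above, because the request travels the same path your users' traffic does.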

Frequently Asked Questions

What is the best Kubernetes monitoring tool in 2026?

The best Kubernetes monitoring tool depends on your team size and budget. For open-source: the Prometheus + Grafana stack is the industry standard — widely supported, deeply integrated with K8s. For managed solutions: Better Stack offers affordable uptime monitoring with K8s-aware alerting starting at $25/month. For full enterprise observability: Datadog has the best UX but costs $15-23/host/month. For medium teams: New Relic and Dynatrace offer powerful K8s auto-discovery with auto-instrumentation.

What Kubernetes metrics should I monitor?

Critical Kubernetes metrics to monitor: (1) Node metrics: CPU/memory utilization, disk I/O, network throughput. (2) Pod metrics: restart count (for CrashLoopBackOff detection), CPU/memory usage against requests and limits. (3) Control plane: API server latency, etcd request duration, scheduler queue depth. (4) Application metrics: custom metrics via Prometheus instrumentation, plus request rate, error rate, and p99 latency (the RED method). (5) Cluster health: PVC usage, eviction events, pending pod count.

How do I set up Prometheus for Kubernetes?

The easiest way to set up Prometheus on Kubernetes is using the kube-prometheus-stack Helm chart: helm repo add prometheus-community https://prometheus-community.github.io/helm-charts && helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack. This installs Prometheus, Alertmanager, Grafana, and all standard K8s dashboards in one command. For production, configure persistent storage for metrics retention and set up remote write to long-term storage (Thanos, Cortex, or Grafana Cloud).
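
A minimal sketch of those production settings as a Helm values file, assuming the chart's prometheus.prometheusSpec keys. The file name, the 50Gi volume size, and the remote write URL are placeholders, not recommendations.

```shell
# Hypothetical output path; override by exporting VALUES_FILE first.
VALUES_FILE="${VALUES_FILE:-monitoring-values.yaml}"

cat > "$VALUES_FILE" <<'EOF'
prometheus:
  prometheusSpec:
    retention: 30d
    # Persistent volume so metrics survive pod restarts.
    # No storageClassName is set, so the cluster default class is used.
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi   # placeholder: size for your retention and series count
    # Ship samples to long-term storage.
    remoteWrite:
      - url: https://remote-write.example.com/api/v1/push   # placeholder endpoint
EOF

echo "wrote $VALUES_FILE"

# Apply against a real cluster (not executed by this snippet):
#   helm upgrade --install kube-prometheus-stack \
#     prometheus-community/kube-prometheus-stack \
#     --namespace monitoring --create-namespace -f "$VALUES_FILE"
```

Keeping these settings in a file rather than a chain of --set flags makes them reviewable and repeatable across upgrades.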

How much does Kubernetes monitoring cost?

Kubernetes monitoring costs vary widely: (1) Prometheus + Grafana self-hosted: free, but requires infrastructure and ops overhead, (2) Grafana Cloud: free tier for small clusters, ~$8/month for medium, (3) Better Stack: from $25/month, covers uptime + alerting, (4) Datadog: $15-23/node/month — a 20-node cluster costs $300-460/month, (5) New Relic: free tier for 1 user, then $99/user/month. Most teams spend $50-500/month depending on cluster size and tool choice.
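
The per-node figure is plain multiplication, so it is easy to rerun for your own cluster. A small sketch, using a hypothetical helper name and the per-node rates quoted above:

```shell
# Print the monthly cost range for a per-node pricing model.
# Arguments: node count, low rate, high rate (USD per node per month).
estimate_monthly_cost() {
  local nodes="$1" low_rate="$2" high_rate="$3"
  printf '$%d-$%d/month\n' "$((nodes * low_rate))" "$((nodes * high_rate))"
}

estimate_monthly_cost 20 15 23   # prints $300-$460/month
```

This covers only the base per-node charge; any usage-based extras a vendor bills separately are outside this estimate.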

What is the difference between Prometheus and Datadog for Kubernetes?

Prometheus is open-source and pull-based — you define scrape configs for each service, store metrics locally, and visualize in Grafana. It requires more setup but has no per-host cost. Datadog is a fully managed SaaS that auto-discovers K8s workloads and provides out-of-box dashboards, distributed tracing, and log management in one platform. Prometheus is the default for cost-conscious teams; Datadog is preferred when time-to-value and cross-signal correlation (metrics + logs + traces) justify the cost.
