Containers fundamentally change how you monitor infrastructure. Traditional server monitoring — CPU, memory, disk on a fixed host — only covers part of the picture. Container workloads are ephemeral, horizontally scaled, and managed by orchestrators like Kubernetes that make scheduling decisions you need visibility into.
This guide covers the complete container monitoring picture: Docker metrics for single-host setups, Kubernetes metrics for orchestrated clusters, alerting thresholds, and the best tools for each layer.
Why Container Monitoring Is Different
Three things make container monitoring distinct from traditional infrastructure monitoring:
- Ephemerality. Containers start and stop constantly. A metric spike on a container that no longer exists won't show up in your dashboard unless your monitoring pipeline was built for short-lived workloads from the start.
- Resource limits. Unlike VMs, where you'd gradually exhaust resources, containers hit hard limits and get killed (OOMKilled) or throttled. You need to monitor resource headroom, not just current usage; see the query sketch after this list.
- Orchestrator state. Kubernetes knows things your containers don't: which pods are pending (scheduler can't place them), which nodes have pressure, which deployments are rolling out. These orchestrator-level signals are critical for understanding system health.
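To make "headroom, not just usage" concrete, here's a sketch of a Prometheus query that reports each container's working-set memory as a fraction of its configured limit. It assumes cAdvisor and kube-state-metrics are being scraped, and label names can vary slightly between versions:
# Memory used as a fraction of the configured limit, per container (values near 1.0 mean OOM risk)
sum by (namespace, pod, container) (container_memory_working_set_bytes{container!=""})
  /
sum by (namespace, pod, container) (kube_pod_container_resource_limits{resource="memory"})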
Essential Docker Container Metrics
| Metric | What It Measures | Alert Threshold | Page Threshold |
|---|---|---|---|
| CPU % | CPU used vs. allocated limit | > 80% sustained 5 min | > 95% (throttling) |
| Memory usage | Bytes used vs. memory limit | > 80% of limit | > 90% (OOM risk) |
| Restart count | Container restarts since creation | > 1 restart in 15 min | > 5 (CrashLoop) |
| Net I/O | Bytes sent/received | Baseline × 5 | Baseline × 20 |
| Block I/O | Disk read/write bytes | Depends on workload | Sustained saturation |
# Check all running containers at once
docker stats --no-stream
# Output: container metrics snapshot
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O
a1b2c3d4e5f6 web 12.3% 256MiB / 512MiB 50.0% 1.2GB / 890MB
f1e2d3c4b5a6 worker 0.8% 128MiB / 256MiB 50.0% 45MB / 12MB
# Continuous monitoring with formatting
docker stats --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemPerc}}\t{{.MemUsage}}\t{{.NetIO}}"
# Check restart count for specific container
docker inspect --format='{{.RestartCount}}' web
Kubernetes Monitoring: The Full Picture
Kubernetes adds an orchestration layer above individual containers. You need visibility at four levels:
1. Pod-Level Metrics
| Metric | Prometheus Query | Alert On |
|---|---|---|
| Pod restarts | kube_pod_container_status_restarts_total | > 3 in 15 minutes |
| Pod CPU usage | container_cpu_usage_seconds_total | > 80% of request |
| Memory pressure | container_memory_working_set_bytes | > 80% of limit |
| OOMKilled | kube_pod_container_status_last_terminated_reason | Any OOMKilled event |
| Pod phase | kube_pod_status_phase | Pending > 10 min |
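As a sketch of how the restart and OOMKilled rows translate into Prometheus alerting rules (the thresholds mirror the table; tune the windows for your workloads):
# Pod restarting repeatedly; mirrors the "> 3 in 15 minutes" threshold above
alert: PodRestartingFrequently
expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
labels:
  severity: warning
annotations:
  summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} restarted {{ $value }} times in 15 minutes"
# Container killed for exceeding its memory limit
alert: ContainerOOMKilled
expr: kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
labels:
  severity: critical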
2. Deployment-Level Metrics
# Prometheus: Alert when deployment has unavailable replicas
alert: DeploymentUnavailableReplicas
expr: kube_deployment_status_replicas_unavailable > 0
for: 5m
labels:
  severity: warning
annotations:
  summary: "Deployment {{ $labels.deployment }} has {{ $value }} unavailable replicas"
# Alert when rollout stalls (updated replicas != desired replicas)
alert: DeploymentRolloutStuck
expr: |
  kube_deployment_status_replicas_updated != kube_deployment_spec_replicas
  and kube_deployment_spec_replicas > 0
for: 15m
labels:
  severity: critical
3. Node-Level Metrics
- Node CPU/memory pressure — node has insufficient resources to schedule new pods
- Disk pressure — node's disk is filling up (kubelet will evict pods to reclaim space)
- PID pressure — too many processes on the node
- Node readiness — is the node able to accept pods at all
# Check node conditions
kubectl describe nodes | grep -A5 "Conditions:"
# Prometheus: Alert on node memory pressure
alert: NodeMemoryPressure
expr: kube_node_status_condition{condition="MemoryPressure",status="true"} == 1
for: 2m
labels:
  severity: critical
annotations:
  summary: "Node {{ $labels.node }} is under memory pressure — pod evictions may start"
4. HPA (Autoscaler) Metrics
If you use Horizontal Pod Autoscaler, monitor both the current and desired replica counts, and alert when HPA is stuck at its maximum replica count — that means demand is outpacing your configured maximum scale.
# HPA at max replicas — can't scale further
alert: HPAMaxedOut
expr: |
  kube_horizontalpodautoscaler_status_current_replicas
  == kube_horizontalpodautoscaler_spec_max_replicas
for: 10m
labels:
  severity: warning
annotations:
  summary: "HPA {{ $labels.horizontalpodautoscaler }} is at max replicas — increase max or investigate traffic spike"
Monitor your container endpoints with Better Stack
Better Stack runs synthetic checks on your containerized services from 30+ global locations. HTTP, TCP, and keyword checks — with on-call alerting when containers go down.
Try Better Stack Free →
Container Monitoring Tool Comparison
| Tool | Best For | Docker Support | K8s Support | Pricing |
|---|---|---|---|---|
| Prometheus + Grafana | Open-source DIY | Via cAdvisor | Native (kube-state-metrics) | Free (self-hosted) |
| Datadog | Enterprise, multi-cloud | Auto-discovery | Deep K8s integration | $18/host/mo + infra |
| New Relic | Full-stack observability | Via agent | K8s cluster explorer | Free 100GB/mo |
| Grafana Cloud | Managed Prometheus/Loki | Via scrape configs | K8s monitoring bundle | Free tier / $29+/mo |
| Dynatrace | Auto-discovery, AIOps | OneAgent auto | Excellent K8s support | $0.08/hour/host |
| Better Stack | Uptime + endpoint health | HTTP/TCP checks | External checks | Free tier / $20+/mo |
Setting Up Prometheus for Kubernetes
The standard open-source Kubernetes monitoring stack uses Prometheus for metrics collection, kube-state-metrics for cluster state, node-exporter for node metrics, and Grafana for dashboards.
# Install kube-prometheus-stack via Helm (includes Prometheus + Grafana + AlertManager)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install kube-prometheus-stack \
prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--create-namespace \
--set grafana.enabled=true \
--set alertmanager.enabled=true
# Verify installation
kubectl get pods -n monitoring
# Access Grafana (default: admin/prom-operator)
kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80
The kube-prometheus-stack Helm chart includes 40+ pre-built alerting rules and 20+ Grafana dashboards for Kubernetes monitoring out of the box — covering pod health, node resources, API server latency, and etcd metrics.
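To see what the chart actually installed, you can list the rule and dashboard objects it created. The commands below assume the default chart values (release named kube-prometheus-stack, monitoring namespace, Grafana dashboard sidecar enabled):
# Alerting and recording rules are installed as PrometheusRule custom resources
kubectl get prometheusrules -n monitoring
# Pre-built dashboards ship as ConfigMaps labelled for the Grafana sidecar
kubectl get configmaps -n monitoring -l grafana_dashboard=1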
Container Monitoring Best Practices
- Always set resource requests and limits. Without limits, a runaway container starves its neighbors. Without requests, the scheduler can't make good placement decisions. Set both, always (see the example manifest after this list).
- Use liveness and readiness probes. Liveness probes tell Kubernetes when to restart a crashed container. Readiness probes prevent traffic from hitting pods that aren't ready. Most teams configure liveness without readiness — a mistake that causes traffic to hit half-started pods during rolling updates.
- Monitor image pull failures. kube_pod_container_status_waiting_reason{reason="ImagePullBackOff"} is a silent killer — your deployment is broken but no pods are crashing.
- Track container age. Very old containers on a rolling deployment that hasn't fully rolled out indicate a stuck rollout — often a crashlooping pod in canary position.
- Alert on PVC usage, not just node disk. PersistentVolumeClaims filling up cause stateful applications (databases) to crash, not just the node.
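Putting the first two practices into a manifest, a minimal sketch looks like this. The name, image, port, and probe path are placeholders, and the numbers are starting points rather than recommendations:
apiVersion: v1
kind: Pod
metadata:
  name: web                      # placeholder
spec:
  containers:
    - name: web
      image: example/web:1.0     # placeholder
      ports:
        - containerPort: 8080
      resources:
        requests:                # what the scheduler reserves when placing the pod
          cpu: 250m
          memory: 256Mi
        limits:                  # hard caps; exceeding the memory limit means OOMKilled
          cpu: 500m
          memory: 512Mi
      readinessProbe:            # keeps traffic away until the pod can actually serve
        httpGet:
          path: /healthz         # placeholder endpoint
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 10
      livenessProbe:             # restarts the container if it stops responding
        httpGet:
          path: /healthz
          port: 8080
        initialDelaySeconds: 15
        periodSeconds: 20
And for the PVC point, one common expression for "claim is nearly full" (assumes kubelet volume stats are being scraped, as they are with the kube-prometheus-stack above):
# PVCs over 80% full; alert before the database inside runs out of room
kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.8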
FAQ
What is OOMKilled in Kubernetes?
OOMKilled (Out of Memory Killed) means the Linux kernel's OOM killer terminated your container because it exceeded its memory limit. Check kubectl describe pod [name] — the Last State section will show Reason: OOMKilled. Increase the memory limit or investigate memory leaks.
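If you want the reason without scrolling through the full describe output, something like this works (the pod name is a placeholder):
# Prints "OOMKilled" if the last termination was the memory limit being hit
kubectl get pod web-abc123 -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'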
How is container CPU throttling different from high CPU usage?
High CPU % means your container is using a lot of CPU. CPU throttling means the container's CPU usage is being actively capped by cgroups because it hit its CPU limit. Throttled containers respond to requests slowly even though they're not at 100% CPU — they're being held back. Monitor container_cpu_cfs_throttled_periods_total alongside CPU usage.
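A common way to express "how throttled is this container" is the ratio of throttled CFS periods to total periods. A sketch, with the 25% threshold as an arbitrary starting point:
# Fraction of scheduling periods in which the container was throttled over the last 5 minutes
rate(container_cpu_cfs_throttled_periods_total[5m])
  / rate(container_cpu_cfs_periods_total[5m]) > 0.25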
What's the difference between CrashLoopBackOff and OOMKilled?
OOMKilled is one cause of CrashLoopBackOff. CrashLoopBackOff is Kubernetes's state for “this container keeps crashing and I'm applying exponential backoff before retrying.” The underlying cause could be OOMKilled, a runtime error, a missing environment variable, or a failed liveness probe. Check kubectl logs [pod] --previous to see the last crash output.
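A typical triage sequence looks roughly like this (the pod name is a placeholder):
# Spot pods stuck in CrashLoopBackOff
kubectl get pods --all-namespaces | grep CrashLoopBackOff
# Events and Last State show the termination reason (OOMKilled, Error, failed probe, ...)
kubectl describe pod web-abc123
# Read the output from the crashed container, not the current restart
kubectl logs web-abc123 --previous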