Containers fundamentally change how you monitor infrastructure. Traditional server monitoring — CPU, memory, disk on a fixed host — only covers part of the picture. Container workloads are ephemeral, horizontally scaled, and managed by orchestrators like Kubernetes that make scheduling decisions you need visibility into.
This guide covers the complete container monitoring picture: Docker metrics for single-host setups, Kubernetes metrics for orchestrated clusters, alerting thresholds, and the best tools for each layer.
Why Container Monitoring Is Different
Three things make container monitoring distinct from traditional infrastructure monitoring:
- Ephemerality. Containers start and stop constantly. A metric spike on a container that no longer exists won't show up in your dashboard unless your monitoring pipeline was built for short-lived workloads from the start.
- Resource limits. Unlike VMs, where you'd gradually exhaust resources, containers hit hard limits and get killed (OOMKilled) or throttled. You need to monitor resource headroom, not just current usage; see the query sketch after this list.
- Orchestrator state. Kubernetes knows things your containers don't: which pods are pending (scheduler can't place them), which nodes have pressure, which deployments are rolling out. These orchestrator-level signals are critical for understanding system health.
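To make "headroom, not just usage" concrete, here's a sketch of a Prometheus query that reports each container's working-set memory as a fraction of its configured limit. It assumes cAdvisor and kube-state-metrics are being scraped, and label names can vary slightly between versions:
# Memory used as a fraction of the configured limit, per container (values near 1.0 mean OOM risk)
sum by (namespace, pod, container) (container_memory_working_set_bytes{container!=""})
  /
sum by (namespace, pod, container) (kube_pod_container_resource_limits{resource="memory"})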
Essential Docker Container Metrics
| Metric | What It Measures | Alert Threshold | Page Threshold |
|---|---|---|---|
| CPU % | CPU used vs. allocated limit | > 80% sustained 5 min | > 95% (throttling) |
| Memory usage | Bytes used vs. memory limit | > 80% of limit | > 90% (OOM risk) |
| Restart count | Container restarts since creation | > 1 restart in 15 min | > 5 (CrashLoop) |
| Net I/O | Bytes sent/received | Baseline × 5 | Baseline × 20 |
| Block I/O | Disk read/write bytes | Depends on workload | Sustained saturation |
# Check all running containers at once
docker stats --no-stream
# Output: container metrics snapshot
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O
a1b2c3d4e5f6 web 12.3% 256MiB / 512MiB 50.0% 1.2GB / 890MB
f1e2d3c4b5a6 worker 0.8% 128MiB / 256MiB 50.0% 45MB / 12MB
# Continuous monitoring with formatting
docker stats --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemPerc}}\t{{.MemUsage}}\t{{.NetIO}}"
# Check restart count for specific container
docker inspect --format='{{.RestartCount}}' web
Kubernetes Monitoring: The Full Picture
Kubernetes adds an orchestration layer above individual containers. You need visibility at four levels:
1. Pod-Level Metrics
| Metric | Prometheus Query | Alert On |
|---|---|---|
| Pod restarts | kube_pod_container_status_restarts_total | > 3 in 15 minutes |
| Pod CPU usage | container_cpu_usage_seconds_total | > 80% of request |
| Memory pressure | container_memory_working_set_bytes | > 80% of limit |
| OOMKilled | kube_pod_container_status_last_terminated_reason | Any OOMKilled event |
| Pod phase | kube_pod_status_phase | Pending > 10 min |
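As a sketch of how the restart and OOMKilled rows translate into Prometheus alerting rules (the thresholds mirror the table; tune the windows for your workloads):
# Pod restarting repeatedly; mirrors the "> 3 in 15 minutes" threshold above
alert: PodRestartingFrequently
expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
labels:
  severity: warning
annotations:
  summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} restarted {{ $value }} times in 15 minutes"
# Container killed for exceeding its memory limit
alert: ContainerOOMKilled
expr: kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
labels:
  severity: critical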
2. Deployment-Level Metrics
# Prometheus: Alert when deployment has unavailable replicas
alert: DeploymentUnavailableReplicas
expr: kube_deployment_status_replicas_unavailable > 0
for: 5m
labels:
  severity: warning
annotations:
  summary: "Deployment {{ $labels.deployment }} has {{ $value }} unavailable replicas"
# Alert when rollout stalls (updated replicas != desired replicas)
alert: DeploymentRolloutStuck
expr: |
  kube_deployment_status_replicas_updated != kube_deployment_spec_replicas
  and kube_deployment_spec_replicas > 0
for: 15m
labels:
  severity: critical
3. Node-Level Metrics
- Node CPU/memory pressure — node has insufficient resources to schedule new pods
- Disk pressure — node's disk is filling up (kubelet will evict pods to reclaim space)
- PID pressure — too many processes on the node
- Node readiness — is the node able to accept pods at all
# Check node conditions
kubectl describe nodes | grep -A5 "Conditions:"
# Prometheus: Alert on node memory pressure
alert: NodeMemoryPressure
expr: kube_node_status_condition{condition="MemoryPressure",status="true"} == 1
for: 2m
labels:
  severity: critical
annotations:
  summary: "Node {{ $labels.node }} is under memory pressure — pod evictions may start"
4. HPA (Autoscaler) Metrics
If you use Horizontal Pod Autoscaler, monitor both the current and desired replica counts, and alert when HPA is stuck at its maximum replica count — that means demand is outpacing your configured maximum scale.
# HPA at max replicas — can't scale further
alert: HPAMaxedOut
expr: |
  kube_horizontalpodautoscaler_status_current_replicas
  == kube_horizontalpodautoscaler_spec_max_replicas
for: 10m
labels:
  severity: warning
annotations:
  summary: "HPA {{ $labels.horizontalpodautoscaler }} is at max replicas — increase max or investigate traffic spike"
Monitor your container endpoints with Better Stack
Better Stack runs synthetic checks on your containerized services from 30+ global locations. HTTP, TCP, and keyword checks — with on-call alerting when containers go down.
Try Better Stack Free →
Container Monitoring Tool Comparison
| Tool | Best For | Docker Support | K8s Support | Pricing |
|---|---|---|---|---|
| Prometheus + Grafana | Open-source DIY | Via cAdvisor | Native (kube-state-metrics) | Free (self-hosted) |
| Datadog | Enterprise, multi-cloud | Auto-discovery | Deep K8s integration | $18/host/mo + infra |
| New Relic | Full-stack observability | Via agent | K8s cluster explorer | Free 100GB/mo |
| Grafana Cloud | Managed Prometheus/Loki | Via scrape configs | K8s monitoring bundle | Free tier / $29+/mo |
| Dynatrace | Auto-discovery, AIOps | OneAgent auto | Excellent K8s support | $0.08/hour/host |
| Better Stack | Uptime + endpoint health | HTTP/TCP checks | External checks | Free tier / $20+/mo |
Setting Up Prometheus for Kubernetes
The standard open-source Kubernetes monitoring stack uses Prometheus for metrics collection, kube-state-metrics for cluster state, node-exporter for node metrics, and Grafana for dashboards.
# Install kube-prometheus-stack via Helm (includes Prometheus + Grafana + AlertManager)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install kube-prometheus-stack \
prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--create-namespace \
--set grafana.enabled=true \
--set alertmanager.enabled=true
# Verify installation
kubectl get pods -n monitoring
# Access Grafana (default: admin/prom-operator)
kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80
The kube-prometheus-stack Helm chart includes 40+ pre-built alerting rules and 20+ Grafana dashboards for Kubernetes monitoring out of the box — covering pod health, node resources, API server latency, and etcd metrics.
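To see what the chart actually installed, you can list the rule and dashboard objects it created. The commands below assume the default chart values (release named kube-prometheus-stack, monitoring namespace, Grafana dashboard sidecar enabled):
# Alerting and recording rules are installed as PrometheusRule custom resources
kubectl get prometheusrules -n monitoring
# Pre-built dashboards ship as ConfigMaps labelled for the Grafana sidecar
kubectl get configmaps -n monitoring -l grafana_dashboard=1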
Container Monitoring Best Practices
- Always set resource requests and limits. Without limits, a runaway container starves its neighbors. Without requests, the scheduler can't make good placement decisions. Set both, always (see the example manifest after this list).
- Use liveness and readiness probes. Liveness probes tell Kubernetes when to restart a crashed container. Readiness probes prevent traffic from hitting pods that aren't ready. Most teams configure liveness without readiness — a mistake that causes traffic to hit half-started pods during rolling updates.
- Monitor image pull failures. kube_pod_container_status_waiting_reason{reason="ImagePullBackOff"} is a silent killer — your deployment is broken but no pods are crashing.
- Track container age. Very old containers on a rolling deployment that hasn't fully rolled out indicate a stuck rollout — often a crashlooping pod in canary position.
- Alert on PVC usage, not just node disk. PersistentVolumeClaims filling up cause stateful applications (databases) to crash, not just the node.
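Putting the first two practices into a manifest, a minimal sketch looks like this. The name, image, port, and probe path are placeholders, and the numbers are starting points rather than recommendations:
apiVersion: v1
kind: Pod
metadata:
  name: web                      # placeholder
spec:
  containers:
    - name: web
      image: example/web:1.0     # placeholder
      ports:
        - containerPort: 8080
      resources:
        requests:                # what the scheduler reserves when placing the pod
          cpu: 250m
          memory: 256Mi
        limits:                  # hard caps; exceeding the memory limit means OOMKilled
          cpu: 500m
          memory: 512Mi
      readinessProbe:            # keeps traffic away until the pod can actually serve
        httpGet:
          path: /healthz         # placeholder endpoint
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 10
      livenessProbe:             # restarts the container if it stops responding
        httpGet:
          path: /healthz
          port: 8080
        initialDelaySeconds: 15
        periodSeconds: 20
And for the PVC point, one common expression for "claim is nearly full" (assumes kubelet volume stats are being scraped, as they are with the kube-prometheus-stack above):
# PVCs over 80% full; alert before the database inside runs out of room
kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.8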
FAQ
What is OOMKilled in Kubernetes?
OOMKilled (Out of Memory Killed) means the Linux kernel's OOM killer terminated your container because it exceeded its memory limit. Check kubectl describe pod [name] — the Last State section will show Reason: OOMKilled. Increase the memory limit or investigate memory leaks.
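If you want the reason without scrolling through the full describe output, something like this works (the pod name is a placeholder):
# Prints "OOMKilled" if the last termination was the memory limit being hit
kubectl get pod web-abc123 -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'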
How is container CPU throttling different from high CPU usage?
High CPU % means your container is using a lot of CPU. CPU throttling means the container's CPU usage is being actively capped by cgroups because it hit its CPU limit. Throttled containers respond to requests slowly even though they're not at 100% CPU — they're being held back. Monitor container_cpu_cfs_throttled_periods_total alongside CPU usage.
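A common way to express "how throttled is this container" is the ratio of throttled CFS periods to total periods. A sketch, with the 25% threshold as an arbitrary starting point:
# Fraction of scheduling periods in which the container was throttled over the last 5 minutes
rate(container_cpu_cfs_throttled_periods_total[5m])
  / rate(container_cpu_cfs_periods_total[5m]) > 0.25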
What's the difference between CrashLoopBackOff and OOMKilled?
OOMKilled is one cause of CrashLoopBackOff. CrashLoopBackOff is Kubernetes's state for “this container keeps crashing and I'm applying exponential backoff before retrying.” The underlying cause could be OOMKilled, a runtime error, a missing environment variable, or a failed liveness probe. Check kubectl logs [pod] --previous to see the last crash output.
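A typical triage sequence looks roughly like this (the pod name is a placeholder):
# Spot pods stuck in CrashLoopBackOff
kubectl get pods --all-namespaces | grep CrashLoopBackOff
# Events and Last State show the termination reason (OOMKilled, Error, failed probe, ...)
kubectl describe pod web-abc123
# Read the output from the crashed container, not the current restart
kubectl logs web-abc123 --previous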