The Infrastructure Monitoring Stack
Modern infrastructure spans multiple layers. Each requires different metrics and different tools:
| Layer | What to Monitor | Primary Tools |
|---|---|---|
| Servers / VMs | CPU, memory, disk, network, processes | node_exporter, CloudWatch, DataDog Agent |
| Containers | Pod restarts, CPU throttling, OOM kills | cAdvisor, kube-state-metrics, Prometheus |
| Kubernetes | Node health, pod scheduling, resource quotas | Prometheus, Grafana, Lens, DataDog |
| Network | Latency, packet loss, bandwidth, DNS | SNMP, Prometheus blackbox_exporter |
| Cloud (AWS/GCP/Azure) | Service health, billing, quota usage | CloudWatch, Cloud Monitoring, Azure Monitor |
| Uptime / Availability | Endpoint availability, SSL, DNS resolution | Better Stack, UptimeRobot, API Status Check |
Core Server Metrics
These are the foundation of infrastructure monitoring. Set these up for every server before adding anything more sophisticated.
CPU Utilization
Track CPU utilization as a percentage across all cores. High CPU can indicate runaway processes, traffic spikes, or missing caching.
| Condition | Action |
|---|---|
| CPU > 80% for 5+ minutes | Warning alert — investigate, may need to scale |
| CPU > 95% for 2+ minutes | Critical alert — likely impacting application performance |
| CPU steal > 10% | Noisy neighbor on shared infrastructure — consider dedicated hosting |
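The "for N minutes" part of these rules matters as much as the percentage: a single spike should not alert. A minimal sketch of that duration check follows. It assumes samples arrive as hypothetical `(timestamp, cpu_percent)` pairs, oldest first; the function names are illustrative and not taken from any particular tool.

```python
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, CPU percent)


def sustained_above(samples: Sequence[Sample], threshold: float, min_seconds: float) -> bool:
    """True if the newest run of samples above `threshold` spans at least `min_seconds`."""
    breach_start = None
    # Walk backwards from the newest sample while values stay above the threshold.
    for timestamp, value in reversed(samples):
        if value <= threshold:
            break
        breach_start = timestamp
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= min_seconds


def cpu_severity(samples: Sequence[Sample]) -> str:
    """Map samples (oldest first) to the warning and critical rows of the table."""
    if sustained_above(samples, 95.0, 120):   # CPU > 95% for 2+ minutes
        return "critical"
    if sustained_above(samples, 80.0, 300):   # CPU > 80% for 5+ minutes
        return "warning"
    return "ok"
```

In Prometheus-style alerting the same idea is usually expressed with a `for:` duration on the rule rather than hand-written code.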
Track iowait CPU separately — high iowait means CPU is idle waiting for disk operations, which points to disk I/O problems rather than CPU contention.
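To make the busy / iowait / steal split concrete, here is a rough sketch that derives the three percentages from two snapshots of the aggregate `cpu` line in Linux `/proc/stat`. It assumes a Linux host and the standard field order (user, nice, system, idle, iowait, irq, softirq, steal); the function names are hypothetical.

```python
from typing import Dict

# Field order of the aggregate "cpu" line in /proc/stat (assumed: modern Linux kernel).
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line: str) -> Dict[str, int]:
    """Parse a line such as 'cpu 150 0 50 600 100 0 0 100' into named counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(p) for p in parts[1:1 + len(FIELDS)]]
    values += [0] * (len(FIELDS) - len(values))  # pad fields missing on old kernels
    return dict(zip(FIELDS, values))


def cpu_breakdown(before: str, after: str) -> Dict[str, float]:
    """Percentages over the interval between two snapshots of the 'cpu' line."""
    first, second = parse_cpu_line(before), parse_cpu_line(after)
    delta = {name: second[name] - first[name] for name in FIELDS}
    total = sum(delta.values())
    if total <= 0:
        raise ValueError("snapshots must be taken at different times")
    idle = 100.0 * delta["idle"] / total
    iowait = 100.0 * delta["iowait"] / total
    steal = 100.0 * delta["steal"] / total
    # "busy" is everything that is neither idle nor iowait, so it includes steal.
    return {"busy": 100.0 - idle - iowait, "iowait": iowait, "steal": steal}
```

Reporting iowait and steal as their own series keeps a disk bottleneck or a noisy neighbor from being misread as application CPU load.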
Memory Usage
Monitor memory as a percentage of total RAM. CPU load typically spikes and recovers, but memory held by a leaking process tends to keep growing until the kernel kills a process (OOM kill) or the server runs out entirely.
- Used memory: what's actually allocated (not counting cached/buffers)
- Available memory: what can be allocated immediately
- Swap usage: sustained or growing swap usage indicates memory pressure; heavy swapping degrades performance
- OOM kill rate: how often the kernel is killing processes due to memory exhaustion
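As an illustration of the first three metrics, the sketch below parses the text of Linux `/proc/meminfo`. It assumes the `MemAvailable` field is present (recent kernels) and approximates "used" as `MemTotal - MemAvailable`, which excludes reclaimable cache; the function names are hypothetical.

```python
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo content into {field: value}; most values are in kB."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])
    return values


def memory_summary(text: str) -> Dict[str, float]:
    """Used percentage, available memory, and swap in use, from meminfo text."""
    info = parse_meminfo(text)
    total = info["MemTotal"]
    available = info["MemAvailable"]  # assumption: kernel exposes MemAvailable
    swap_used = info.get("SwapTotal", 0) - info.get("SwapFree", 0)
    return {
        "used_percent": 100.0 * (total - available) / total,
        "available_kb": available,
        "swap_used_kb": swap_used,
    }
```

On a real host the input would come from reading `/proc/meminfo`; passing the text in keeps the function easy to test.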
Disk Usage and I/O
Full disks cause immediate, hard-to-recover outages. Most applications fail catastrophically when they can't write to disk.
- Alert at 75% disk usage: gives time to archive, compress, or clean up
- Page at 90% disk usage: needs immediate intervention
- Monitor inode usage separately: you can run out of inodes before disk space (especially problematic with many small files)
- Track disk read/write IOPS and throughput: a sudden drop under steady load, or a jump in I/O latency, can indicate failing hardware or a saturated device
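The space and inode checks can be sketched with the standard library. This assumes a Unix host (`os.statvfs` is not available on Windows) and, as an assumption not stated above, reuses the 75% and 90% thresholds for inodes; the function names are hypothetical.

```python
import os


def percent_used(used: int, available: int) -> float:
    """Usage as a percentage of used + available, similar to the Use% column of df."""
    total = used + available
    return 0.0 if total <= 0 else 100.0 * used / total


def disk_severity(percent: float) -> str:
    """Apply the thresholds from the list: alert at 75%, page at 90%."""
    if percent >= 90.0:
        return "page"
    if percent >= 75.0:
        return "alert"
    return "ok"


def filesystem_report(path: str = "/") -> dict:
    """Space and inode usage for the filesystem that contains `path`."""
    stats = os.statvfs(path)
    space = percent_used(stats.f_blocks - stats.f_bfree, stats.f_bavail)
    # Some filesystems report zero inodes; percent_used then returns 0.0.
    inodes = percent_used(stats.f_files - stats.f_ffree, stats.f_favail)
    return {
        "space_percent": space,
        "space_severity": disk_severity(space),
        "inode_percent": inodes,
        "inode_severity": disk_severity(inodes),
    }
```

Calling `filesystem_report("/var")` on a schedule and forwarding any non-"ok" severity is enough for a first disk alert.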
Network I/O
Key network metrics:
- Bytes in/out: Monitor against your bandwidth capacity; alert when approaching limits
- Packet error rate: Any sustained error rate above 0.1% indicates hardware or network issues
- TCP retransmit rate: High retransmits indicate packet loss between services
- Network latency between services: Especially important for microservices where inter-service latency compounds
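Error and retransmit rates are computed from counter deltas, not from the raw counters. The sketch below assumes two snapshots of cumulative counters supplied as dictionaries; the key names mirror Linux `/proc/net/dev` (packets, errs) and the `Tcp` line of `/proc/net/snmp` (`OutSegs`, `RetransSegs`), and the function names are hypothetical.

```python
from typing import Dict


def counter_rate_percent(bad: int, total: int) -> float:
    """Share of `bad` events among `total` events, as a percentage."""
    return 0.0 if total <= 0 else 100.0 * bad / total


def packet_error_rate(before: Dict[str, int], after: Dict[str, int]) -> float:
    """Combined rx/tx error percentage between two interface counter snapshots."""
    packets = (after["rx_packets"] - before["rx_packets"]) + (
        after["tx_packets"] - before["tx_packets"]
    )
    errors = (after["rx_errs"] - before["rx_errs"]) + (
        after["tx_errs"] - before["tx_errs"]
    )
    return counter_rate_percent(errors, packets)


def tcp_retransmit_rate(before: Dict[str, int], after: Dict[str, int]) -> float:
    """Retransmitted segments as a percentage of segments sent in the interval."""
    return counter_rate_percent(
        after["RetransSegs"] - before["RetransSegs"],
        after["OutSegs"] - before["OutSegs"],
    )


def error_rate_exceeded(rate_percent: float, limit_percent: float = 0.1) -> bool:
    """True when the rate is above the 0.1% guideline from the list."""
    return rate_percent > limit_percent
```

The "sustained" qualifier still applies: evaluate the rate over several consecutive intervals before alerting.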
Container and Kubernetes Monitoring
Containerized infrastructure introduces failure modes that don't exist with bare-metal or VM deployments.
Container-Specific Metrics
- CPU throttling percentage: Containers that hit their CPU limits are throttled (requests slow down) rather than killed. High throttling means your CPU limits are too low.
- OOM kill count: Container memory limit exceeded — the container was killed and restarted. Usually indicates a memory leak or misconfigured memory limits.
- Restart count: Containers restarting frequently indicates crashes. Check logs for the failure reason.
- Image pull latency: Slow container starts can indicate registry performance issues or missing image layers.
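CPU throttling percentage is typically derived from the cgroup `cpu.stat` counters `nr_periods` and `nr_throttled`. The sketch below assumes two snapshots of that file's text are available (the file path depends on the cgroup version and container runtime, so it is left out); the function names are hypothetical.

```python
from typing import Dict


def parse_cpu_stat(text: str) -> Dict[str, int]:
    """Parse 'key value' lines from a cgroup cpu.stat file."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            stats[fields[0]] = int(fields[1])
    return stats


def throttling_percent(before_text: str, after_text: str) -> float:
    """Percentage of CFS enforcement periods in which the container was throttled."""
    before, after = parse_cpu_stat(before_text), parse_cpu_stat(after_text)
    periods = after["nr_periods"] - before["nr_periods"]
    throttled = after["nr_throttled"] - before["nr_throttled"]
    if periods <= 0:
        return 0.0  # no elapsed periods, or no CPU limit set
    return 100.0 * throttled / periods
```

With cAdvisor and Prometheus the same ratio is usually built from the throttled-periods and total-periods container counters instead of reading the file directly.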
Kubernetes-Specific Alerts
| Alert | What It Means | Severity |
|---|---|---|
| Pod CrashLoopBackOff | Pod crashing repeatedly, usually an application or configuration error | Critical |
| Pod Pending > 5 min | Scheduler can't place pod, often because of resource exhaustion | Warning |
| Node NotReady | Kubelet has stopped reporting the node as healthy | Critical |
| Deployment below desired replicas | Expected pods not running | Critical |
| PVC near capacity | Persistent volume filling up | Warning |
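The first two rows can be sketched as a check over a pod object. This assumes the pod is a plain dictionary shaped like the Kubernetes API's Pod JSON (`status.phase`, `status.containerStatuses[].state.waiting.reason`, `metadata.creationTimestamp`); the function name and alert labels are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Tuple


def pod_alerts(
    pod: dict, now: datetime, pending_limit: timedelta = timedelta(minutes=5)
) -> List[Tuple[str, str]]:
    """Return (severity, alert) pairs for CrashLoopBackOff and long-Pending pods."""
    alerts = []
    status = pod.get("status", {})

    # CrashLoopBackOff appears as the waiting reason of a container.
    for container in status.get("containerStatuses", []):
        waiting = container.get("state", {}).get("waiting") or {}
        if waiting.get("reason") == "CrashLoopBackOff":
            alerts.append(("critical", "PodCrashLoopBackOff"))
            break

    # A pod still Pending long after creation could not be scheduled or started.
    if status.get("phase") == "Pending":
        created_text = pod["metadata"]["creationTimestamp"].replace("Z", "+00:00")
        created = datetime.fromisoformat(created_text)
        if now - created > pending_limit:
            alerts.append(("warning", "PodPending"))
    return alerts
```

In practice these conditions are more commonly expressed as Prometheus alert rules over kube-state-metrics series; the code only shows the logic behind them.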
Cloud Infrastructure Monitoring
Cloud infrastructure adds service-level abstractions where the infrastructure itself can fail in ways invisible to traditional server monitoring.
AWS Monitoring Priorities
- EC2/ECS health checks: Auto Scaling Group health, unhealthy instance replacement
- RDS multi-AZ failover: Monitor primary/replica health and failover events
- ALB/NLB target health: Percentage of registered targets healthy; alert below 100%
- SQS queue depth: Growing queues indicate consumer failures or traffic spikes
- Lambda error rate and duration: Function errors and approaching timeout limits
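As one example from this list, the target health percentage can be computed as below. The sketch assumes the input is a list of dictionaries shaped like the `TargetHealthDescriptions` entries returned by the ELBv2 DescribeTargetHealth API, where `TargetHealth.State` is `healthy` for a passing target; the function names are hypothetical and no AWS call is made here.

```python
from typing import List


def healthy_target_percent(descriptions: List[dict]) -> float:
    """Percentage of registered targets whose health state is 'healthy'."""
    if not descriptions:
        return 0.0  # an empty target group serves no traffic, so treat it as unhealthy
    healthy = sum(
        1 for item in descriptions
        if item.get("TargetHealth", {}).get("State") == "healthy"
    )
    return 100.0 * healthy / len(descriptions)


def should_alert(descriptions: List[dict]) -> bool:
    """Alert whenever fewer than 100% of targets are healthy."""
    return healthy_target_percent(descriptions) < 100.0
```

Targets that are still registering or draining also count as not healthy in this sketch, so expect brief alerts during deployments unless those states are filtered out.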
The AWS Health Dashboard
Always monitor AWS service health alongside your own metrics. Your servers may be healthy while AWS is degrading an underlying service (such as EBS or EC2). Route AWS Health events through Amazon EventBridge to a notification target, such as SNS or a webhook, so you are notified when AWS publishes service events affecting your regions.
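Once such events reach your own code, most of the work is filtering them down to the regions you run in. A rough sketch follows. It assumes events arrive in the Amazon EventBridge envelope (top-level `source`, `region`, and `detail` keys, with `source` set to `aws.health`); the `detail` field names used for the summary are assumptions and should be checked against real events, and the function names are hypothetical.

```python
from typing import Set


def is_relevant_health_event(event: dict, watched_regions: Set[str]) -> bool:
    """Keep only AWS Health events for the regions we operate in."""
    if event.get("source") != "aws.health":
        return False
    return event.get("region") in watched_regions


def summarize(event: dict) -> str:
    """One-line summary suitable for a chat or pager notification."""
    detail = event.get("detail", {})
    service = detail.get("service", "unknown service")        # assumed field name
    code = detail.get("eventTypeCode", "unknown event type")  # assumed field name
    return f"{service}: {code} in {event.get('region', 'unknown region')}"
```

A handler would call `is_relevant_health_event` first and forward `summarize(event)` only for matches.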
Best Infrastructure Monitoring Tools in 2026
Open Source Stack
- Prometheus + Grafana: The industry standard open-source monitoring stack. Prometheus scrapes metrics from exporters; Grafana visualizes them. Steep initial setup but massively flexible.
- node_exporter: Prometheus exporter for Linux server metrics. Install on every server to expose CPU, memory, disk, and network metrics.
- cAdvisor: Container resource usage metrics. Works with Docker and Kubernetes. Required for container-level CPU/memory tracking.
- kube-state-metrics: Exposes Kubernetes object state (pod health, deployment replicas) as Prometheus metrics. Essential for Kubernetes monitoring.
Commercial Options
- Better Stack: Unified platform for infrastructure alerts + uptime monitoring + log management. Simpler setup than Prometheus. Good for teams that want signal without extensive configuration.
- DataDog: Full-stack monitoring with deep AWS/GCP/Azure integrations, APM, log management, and security monitoring. Expensive but comprehensive for large teams.
- New Relic: Strong APM and infrastructure monitoring in one platform. Per-user pricing makes it cost-effective for smaller teams.
- Zabbix: Open-source enterprise monitoring platform, listed here as an all-in-one alternative to the commercial suites. More complex than Prometheus/Grafana but includes alerting and visualization in one tool. See our Zabbix alternatives guide.
Uptime and External Monitoring
- API Status Check Alert Pro: Monitors your APIs and endpoints from external locations. Detects when your infrastructure is up but external connectivity is broken.
- Better Stack: External HTTP monitoring + internal infrastructure metrics in one platform.
Infrastructure Monitoring Checklist
- ☐ Server metrics: CPU, memory, disk, network on every host
- ☐ Disk usage alert at 75%, page at 90%, with a separate alert for inodes
- ☐ Memory OOM kill monitoring and swap usage tracking
- ☐ Process availability monitoring for critical services
- ☐ Container restart count and OOM kill alerts
- ☐ Kubernetes: pod health, node health, deployment replicas
- ☐ Network packet error rate and TCP retransmit monitoring
- ☐ Cloud service health dashboard subscribed (AWS Health, GCP Status)
- ☐ Load balancer target health: alert when any target goes unhealthy
- ☐ External uptime monitoring from multiple global locations
- ☐ Runbook for every infrastructure alert type
Frequently Asked Questions
What is infrastructure monitoring?
Infrastructure monitoring is the continuous measurement of the health, availability, and performance of servers, containers, cloud resources, and network devices. It provides early warning of resource exhaustion, hardware failure, and configuration drift before application impact.
What are the most important infrastructure metrics to monitor?
The five core metrics are CPU utilization (alert > 80%), memory usage (alert > 85%), disk usage (alert > 75%), network error rate, and process availability. For containerized workloads, add container restart count and CPU throttling percentage.
What is the best infrastructure monitoring tool in 2026?
For cloud-native teams: DataDog or New Relic for deep integrations. For Kubernetes: Prometheus + Grafana is the standard. For simplicity: Better Stack combines uptime and infrastructure monitoring in one platform.
How is infrastructure monitoring different from application monitoring?
Infrastructure monitoring watches the resources applications run on (servers, memory, disk). Application monitoring watches what your code does (response times, error rates). Both layers are necessary: infrastructure monitoring shows which resource is in trouble, and application monitoring shows which code is causing it.
What is the difference between monitoring and observability?
Monitoring alerts on known failure modes. Observability lets you investigate unknown problems by correlating metrics, logs, and traces. Infrastructure monitoring is primarily metrics-based monitoring. Full observability adds structured logs and distributed traces to explain why infrastructure metrics are abnormal.