What Cloud Monitoring Covers
Cloud monitoring spans several distinct categories:
| Category | What It Covers | Key Tools |
|---|---|---|
| Infrastructure monitoring | VMs, containers, serverless functions | CloudWatch, Datadog, Zabbix |
| Application monitoring (APM) | Response times, error rates, traces | Datadog APM, New Relic, Dynatrace |
| Uptime monitoring | External availability checks | Better Stack, Pingdom, API Status Check |
| Log monitoring | Log aggregation and analysis | CloudWatch Logs, Better Stack Logs, Splunk |
| Cost monitoring | Spend anomalies, budget alerts | AWS Cost Explorer, Infracost, CloudHealth |
AWS Monitoring: Key Metrics by Service
EC2 Instances
| Metric | Warning | Critical |
|---|---|---|
| CPUUtilization | > 70% for 10min | > 90% for 5min |
| StatusCheckFailed | Any value > 0 | Sustained > 0 |
| NetworkIn/Out | > 80% of instance bandwidth limit | > 95% of instance bandwidth limit |
| DiskReadOps/WriteOps | Baseline + 3σ | Baseline + 5σ |
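The "Baseline + 3σ / 5σ" thresholds in the table can be sketched in a few lines of Python. This is only an illustration: the function name, the sample values, and the choice of the sample standard deviation over a short history window are assumptions, not something CloudWatch prescribes.

```python
from statistics import mean, stdev


def classify_against_baseline(history, current, warn_sigmas=3.0, crit_sigmas=5.0):
    """Return 'ok', 'warning', or 'critical' for a new metric sample.

    history: recent samples of one metric (for example DiskReadOps per period).
    The 3 and 5 sigma multipliers mirror the table above.
    """
    if len(history) < 2:
        raise ValueError("need at least two samples to estimate a baseline")
    baseline = mean(history)
    sigma = stdev(history)  # sample standard deviation of the history window
    if current > baseline + crit_sigmas * sigma:
        return "critical"
    if current > baseline + warn_sigmas * sigma:
        return "warning"
    return "ok"


# Hypothetical DiskReadOps samples, one per 5-minute period.
samples = [100, 110, 90, 105, 95, 100, 108, 92]
for value in (104, 130, 160):
    print(value, classify_against_baseline(samples, value))
```

With these made-up samples the baseline is 100 and σ is roughly 7.3, so the warning line sits near 122 and the critical line near 137.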
RDS / Aurora
| Metric | Alert Threshold |
|---|---|
| FreeStorageSpace | Alert at < 20% remaining; page at < 10% |
| DatabaseConnections | Alert at 80% of max_connections parameter |
| ReadLatency / WriteLatency | Alert if P99 exceeds 2x normal baseline |
| ReplicaLag | Alert > 30s; page > 5min |
| CPUUtilization | Alert > 80% sustained; investigate query patterns |
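As a rough sketch of how the RDS thresholds above translate into checks: CloudWatch reports FreeStorageSpace in bytes, so the percentage has to be computed against the allocated storage first. The function names and the example figures below are hypothetical.

```python
GIB = 1024 ** 3


def rds_storage_severity(free_bytes, allocated_bytes):
    """Map FreeStorageSpace (bytes) to 'ok', 'alert' (< 20%), or 'page' (< 10%)."""
    if allocated_bytes <= 0:
        raise ValueError("allocated_bytes must be positive")
    pct_free = 100.0 * free_bytes / allocated_bytes
    if pct_free < 10:
        return "page"
    if pct_free < 20:
        return "alert"
    return "ok"


def rds_connections_alert(current_connections, max_connections, ratio=0.80):
    """True once DatabaseConnections reaches 80% of max_connections."""
    return current_connections >= ratio * max_connections


def replica_lag_severity(lag_seconds):
    """Alert above 30 seconds of ReplicaLag, page above 5 minutes."""
    if lag_seconds > 300:
        return "page"
    if lag_seconds > 30:
        return "alert"
    return "ok"


# Hypothetical instance: 100 GiB allocated, 15 GiB free, 170 of 200 connections.
print(rds_storage_severity(15 * GIB, 100 * GIB))
print(rds_connections_alert(170, 200))
print(replica_lag_severity(45))
```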
Lambda / Serverless
Lambda monitoring requires different thinking — there are no servers to monitor, only function invocations:
- Error rate: Alert when errors > 1% of invocations in a 5-minute window
- Duration P99: Alert if P99 approaches your function timeout (leaves no headroom)
- Throttles: Any throttles mean invocations were rejected at a concurrency limit. Raise the function's reserved concurrency or request a higher account concurrency quota
- Cold starts: A high cold start rate (> 10% of invocations) degrades user-facing latency. Use provisioned concurrency for critical functions
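The first three checks in the list can be expressed as one small function over a 5-minute window of metrics. This is a minimal sketch: the function and parameter names are hypothetical, and treating "approaches the timeout" as 80% of the configured timeout is an assumption you would tune.

```python
def lambda_health(invocations, errors, p99_ms, timeout_ms, throttles,
                  max_error_rate=0.01, timeout_headroom=0.8):
    """Return a list of findings for one 5-minute window of Lambda metrics.

    timeout_headroom=0.8 is an assumed cutoff for "P99 approaches the timeout".
    """
    findings = []
    # Error rate is errors divided by invocations, not a raw error count.
    if invocations > 0 and errors / invocations > max_error_rate:
        findings.append("error_rate")
    if p99_ms >= timeout_headroom * timeout_ms:
        findings.append("duration_near_timeout")
    if throttles > 0:
        findings.append("throttled")
    return findings


# Hypothetical window: 2000 invocations, 30 errors, P99 of 2.5 s, 3 s timeout.
print(lambda_health(2000, 30, 2500, 3000, 0))
```

Computing the rate from Errors and Invocations keeps the check meaningful for both low-traffic and high-traffic functions, which a fixed error count does not.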
# CloudWatch alarm on Lambda errors: fires when the function records more
# than 5 errors in one 5-minute period. This is an absolute count, not the
# 1% rate described above; a rate alarm needs metric math over
# Errors / Invocations. The account ID in the SNS ARN is a placeholder.
aws cloudwatch put-metric-alarm \
  --alarm-name "lambda-high-error-rate" \
  --metric-name Errors \
  --namespace AWS/Lambda \
  --statistic Sum \
  --period 300 \
  --threshold 5 \
  --comparison-operator GreaterThanThreshold \
  --dimensions Name=FunctionName,Value=my-function \
  --evaluation-periods 1 \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:alerts
GCP Cloud Monitoring
Google Cloud Monitoring (formerly Stackdriver) is GCP's native observability platform. Key metrics by service:
| GCP Service | Key Metrics |
|---|---|
| GCE (VM) | instance/cpu/utilization, instance/disk/read_bytes_count, instance/network/received_bytes_count |
| Cloud SQL | database/cpu/utilization, database/memory/utilization, database/disk/utilization, database/replication/replica_lag |
| GKE | container/cpu/core_usage_time, container/memory/used_bytes, pod/volume/used_bytes |
| Cloud Functions | function/execution_count, function/execution_times, function/user_memory_bytes |
| Cloud Run | request_count, request_latencies, container/cpu/utilizations |
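The table lists metric names in shortened form; the Cloud Monitoring API expects the fully qualified metric type, such as `cloudsql.googleapis.com/database/cpu/utilization`. The sketch below builds a filter string of the kind used when querying time series. The helper name, the project, and the instance ID are hypothetical.

```python
def monitoring_filter(metric_type, resource_labels=None):
    """Build a Cloud Monitoring filter string for one metric type.

    resource_labels: optional dict of resource label names to values.
    """
    parts = [f'metric.type = "{metric_type}"']
    # Sort labels so the generated filter is deterministic.
    for key, value in sorted((resource_labels or {}).items()):
        parts.append(f'resource.labels.{key} = "{value}"')
    return " AND ".join(parts)


# Hypothetical Cloud SQL instance "my-instance" in project "my-project".
cpu_filter = monitoring_filter(
    "cloudsql.googleapis.com/database/cpu/utilization",
    {"database_id": "my-project:my-instance"},
)
print(cpu_filter)
```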
Azure Monitor
Azure Monitor is Microsoft's unified monitoring platform for Azure infrastructure. Key concepts:
- Azure Metrics: Time-series data from all Azure resources, stored 93 days, queryable via Metrics Explorer
- Log Analytics (Kusto): Centralized log ingestion and KQL (Kusto Query Language) querying
- Application Insights: APM for web applications — request rates, dependency tracking, exceptions, custom events
- Azure Alerts: Metric, log search, and activity log alerts with Action Groups for notifications
Multi-Cloud Monitoring Strategy
Most organizations run workloads on multiple clouds. Native tools don't span cloud boundaries — a CloudWatch dashboard can't show your GCP Cloud SQL metrics. For multi-cloud teams, the options are:
| Approach | Pros | Cons |
|---|---|---|
| Native tools per cloud | Free, deep integration | Siloed — no cross-cloud correlation |
| Datadog / New Relic | Unified view, powerful alerting | Expensive at scale ($15-25/host/mo) |
| Prometheus + Grafana | Free, flexible, powerful | Self-hosted operational burden |
| Better Stack (uptime layer) | Simple, cloud-agnostic external checks | Uptime focus, not deep infra metrics |
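To make "expensive at scale" concrete, the per-host range quoted in the table can be multiplied out for a given fleet size. This is plain arithmetic on the table's $15-25 figure; the 200-host fleet is a made-up example, and real invoices depend on the vendor's plan and add-ons.

```python
def per_host_cost_range(host_count, low_usd=15, high_usd=25):
    """Return (low, high) monthly cost in USD for per-host pricing."""
    return host_count * low_usd, host_count * high_usd


# Hypothetical fleet of 200 hosts.
monthly_low, monthly_high = per_host_cost_range(200)
print(f"monthly: ${monthly_low}-${monthly_high}")
print(f"yearly:  ${monthly_low * 12}-${monthly_high * 12}")
```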
Cloud Cost Monitoring: The Hidden Dimension
Cloud monitoring isn't just about availability and performance — it's also about cost. A misconfigured auto-scaling group or forgotten development environment can generate thousands in unexpected cloud spend. Best practices:
- Set billing alerts early. Configure AWS Budgets, Google Cloud Billing budgets, and Azure Cost Management budgets as soon as you create an account. Alert at 80% of expected monthly spend.
- Tag everything. Resource tagging enables cost attribution by team, service, and environment. Without tags, you can't diagnose where spend spikes come from.
- Watch data transfer costs. Cross-region and egress data transfer costs are often the surprise line item. Monitor monthly egress volumes.
- Audit reserved capacity. Unused reserved instances waste money. Review utilization monthly and right-size or sell unused reservations.
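Two of these practices, the 80% budget alert and tag-based attribution, reduce to simple logic once billing data is exported. The sketch below assumes a hypothetical list of line items, each with a `cost` and a `tags` dict; real billing exports differ per provider.

```python
from collections import defaultdict


def budget_alert(spend_to_date, monthly_budget, threshold=0.80):
    """True once spend reaches the alert threshold (80% of budget by default)."""
    return spend_to_date >= threshold * monthly_budget


def cost_by_tag(line_items, tag_key="team"):
    """Sum cost per tag value; items missing the tag land in 'untagged'."""
    totals = defaultdict(float)
    for item in line_items:
        owner = item.get("tags", {}).get(tag_key, "untagged")
        totals[owner] += item["cost"]
    return dict(totals)


# Hypothetical billing line items in USD.
items = [
    {"cost": 120.0, "tags": {"team": "payments"}},
    {"cost": 80.0, "tags": {"team": "search"}},
    {"cost": 300.0, "tags": {}},
    {"cost": 40.0, "tags": {"team": "payments"}},
]
print(budget_alert(850.0, 1000.0))
print(cost_by_tag(items))
```

A large `untagged` bucket is itself a signal: it is spend that cannot be attributed to any team or environment.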