Running infrastructure on AWS without proper monitoring is like flying blind. AWS gives you an enormous toolbox (CloudWatch, X-Ray, CloudTrail, AWS Health, and more), but knowing which tools to use, which metrics actually matter, and how to avoid alert fatigue takes real experience. This guide distills production AWS monitoring into actionable best practices.
The AWS Monitoring Stack: What's Available
AWS provides several native monitoring services, each covering a different layer of observability:
| Service | What It Does | Best For |
|---|---|---|
| CloudWatch Metrics | Collect and graph metrics from all AWS services | CPU, memory, latency, error rates |
| CloudWatch Logs | Centralize and search application and service logs | Log aggregation, error searching |
| CloudWatch Alarms | Alert on metric thresholds via SNS | Proactive alerting, auto-scaling triggers |
| AWS X-Ray | Distributed request tracing across services | Latency debugging, microservices |
| CloudTrail | Audit log of all AWS API calls | Security, compliance, change tracking |
| AWS Health | AWS service health events affecting your account | Outage awareness, incident tracking |
| VPC Flow Logs | Network traffic logs for VPC | Security, network debugging |
| AWS Config | Track AWS resource configuration changes | Compliance, drift detection |
AWS CloudWatch: Core Monitoring
Understanding CloudWatch Metrics
CloudWatch metrics are the foundation of AWS monitoring. Most AWS services publish metrics automatically, including EC2, RDS, Lambda, ALB, SQS, and many more. Metrics are organized into namespaces (e.g., AWS/EC2) and have dimensions (e.g., InstanceId).
Key CloudWatch concepts:
- Default metrics: Free metrics published automatically (EC2 CPU, RDS connections, etc.)
- Detailed monitoring: 1-minute granularity (vs. 5-minute default) for EC2, at $3.50/instance/month
- Custom metrics: Publish your own application metrics via the CloudWatch API or SDK
- Metric math: Combine metrics with expressions (e.g., error rate = errors / requests × 100)
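- Metric retention: Data is kept for up to 15 months at decreasing resolution: high-resolution (sub-minute) data points for 3 hours, 1-minute data for 15 days, 5-minute data for 63 days, and 1-hour data for 15 months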
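The metric math idea above (error rate = errors / requests × 100) can be sketched as a GetMetricData request. This is a minimal sketch, not a definitive implementation: it assumes an Application Load Balancer as the metric source, the load balancer identifier is a placeholder, and the helper names are hypothetical. The builder is plain Python; only `fetch_error_rate` needs boto3 and AWS credentials.

```python
from datetime import datetime, timedelta, timezone


def error_rate_queries(load_balancer: str, period_seconds: int = 60) -> list:
    """Return MetricDataQueries that compute 100 * errors / requests for one ALB."""

    def alb_sum(query_id: str, metric_name: str) -> dict:
        return {
            "Id": query_id,  # the expression below refers to this id
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/ApplicationELB",
                    "MetricName": metric_name,
                    "Dimensions": [{"Name": "LoadBalancer", "Value": load_balancer}],
                },
                "Period": period_seconds,
                "Stat": "Sum",
            },
            "ReturnData": False,  # input series only; not returned to the caller
        }

    return [
        alb_sum("errors", "HTTPCode_Target_5XX_Count"),
        alb_sum("requests", "RequestCount"),
        {
            "Id": "error_rate",
            "Expression": "100 * errors / requests",
            "Label": "Target 5XX error rate (%)",
            "ReturnData": True,
        },
    ]


def fetch_error_rate(load_balancer: str, hours: int = 1) -> list:
    """Run the query with boto3 (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder works without boto3 installed

    now = datetime.now(timezone.utc)
    response = boto3.client("cloudwatch").get_metric_data(
        MetricDataQueries=error_rate_queries(load_balancer),
        StartTime=now - timedelta(hours=hours),
        EndTime=now,
    )
    return response["MetricDataResults"]
```

Calling `fetch_error_rate("app/my-alb/0123456789abcdef")` (a made-up identifier) would return a single series, because only the expression has `ReturnData` set to true.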
Critical Metrics by AWS Service
EC2 Instances
- CPUUtilization: Alert at >80% sustained for 5+ minutes
- NetworkIn/NetworkOut: Baseline and alert on anomalies
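- DiskReadOps/DiskWriteOps: I/O on instance store volumes only; for EBS-backed instances, watch EBSReadOps/EBSWriteOps (Nitro-based instances) or the per-volume AWS/EBS metrics to spot I/O saturation
- StatusCheckFailed: Critical; the instance or system status check has failed
- Memory/Disk: Not available by default; install the CloudWatch Agent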
AWS Lambda
- Errors: Alert on any sustained error rate >1%
- Duration: Watch P99 against the function timeout (default 3 seconds, configurable up to a 15-minute maximum)
- Throttles: Indicates you're hitting concurrency limits
- ConcurrentExecutions: Approaching account limit triggers throttles
- IteratorAge (streams): Lag in Kinesis/DynamoDB event processing
Amazon RDS / Aurora
- CPUUtilization: Alert at >70% (RDS is CPU-sensitive)
- FreeStorageSpace: Alert when <20% remaining
- DatabaseConnections: Watch for connection pool exhaustion
- ReadLatency/WriteLatency: Baseline and alert on p99 spikes
- ReplicaLag: For read replicas; alert if lag >30 seconds
Application Load Balancer (ALB)
- TargetResponseTime: P99 latency for all backend requests
- HTTPCode_Target_5XX_Count: Backend error count; divide by RequestCount with metric math and alert when the rate exceeds 1%
- HTTPCode_ELB_5XX_Count: ALB itself returning errors
- UnHealthyHostCount: Hosts failing health checks
- ActiveConnectionCount: Concurrent active connections
Amazon SQS
- ApproximateNumberOfMessagesVisible: Queue depth; indicates consumer backlog
- ApproximateAgeOfOldestMessage: How stale messages are getting
- NumberOfMessagesSent: Production rate
- Dead Letter Queue depth: Critical; failed messages are accumulating
CloudWatch Alarms: Setting Up Effective Alerts
CloudWatch Alarms monitor a metric and trigger actions (SNS notification, Auto Scaling, EC2 action) when a threshold is breached. Getting alarms right is critical: too many create alert fatigue, too few leave you blind.
Alarm Best Practices
- Use anomaly detection instead of static thresholds for metrics with diurnal patterns (traffic naturally higher during business hours).
- Set evaluation periods, not just thresholds: require 3 out of 5 data points to breach before alarming to reduce noise.
- Treat missing data intentionally: For Lambda functions, missing data means no invocations (potentially OK), so configure accordingly.
- Use composite alarms to reduce noise: alert only when CPU AND memory AND latency are all elevated simultaneously.
- Route alarms through SNS → PagerDuty/OpsGenie/Better Stack for proper on-call management rather than email floods.
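The "3 out of 5 data points", missing-data, and composite-alarm practices above map onto parameters of CloudWatch's PutMetricAlarm and PutCompositeAlarm APIs. The following is a minimal sketch under stated assumptions: the function names, alarm names, instance ID, and SNS topic ARN are all hypothetical, and the builders only produce request parameters, which you would pass to boto3's `put_metric_alarm(**params)` or `put_composite_alarm(AlarmName=..., AlarmRule=...)`.

```python
def m_of_n_alarm(
    name: str,
    namespace: str,
    metric: str,
    dimensions: dict,
    threshold: float,
    sns_topic_arn: str,
    datapoints_to_alarm: int = 3,
    evaluation_periods: int = 5,
    period_seconds: int = 60,
    treat_missing_data: str = "notBreaching",
) -> dict:
    """Build PutMetricAlarm parameters that fire when M of the last N periods breach."""
    if not 1 <= datapoints_to_alarm <= evaluation_periods:
        raise ValueError("datapoints_to_alarm must be between 1 and evaluation_periods")
    if treat_missing_data not in {"breaching", "notBreaching", "ignore", "missing"}:
        raise ValueError("unsupported TreatMissingData value")
    return {
        "AlarmName": name,
        "Namespace": namespace,
        "MetricName": metric,
        "Dimensions": [{"Name": k, "Value": v} for k, v in dimensions.items()],
        "Statistic": "Average",
        "Period": period_seconds,
        "EvaluationPeriods": evaluation_periods,    # N: size of the window
        "DatapointsToAlarm": datapoints_to_alarm,   # M: breaches needed inside it
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": treat_missing_data,     # e.g. no invocations is not a breach
        "AlarmActions": [sns_topic_arn],
    }


def composite_rule(*alarm_names: str) -> str:
    """Build an AlarmRule that is true only when every named alarm is in ALARM."""
    if not alarm_names:
        raise ValueError("at least one alarm name is required")
    return " AND ".join(f'ALARM("{name}")' for name in alarm_names)


cpu_alarm = m_of_n_alarm(
    name="web-cpu-high",                      # hypothetical alarm name
    namespace="AWS/EC2",
    metric="CPUUtilization",
    dimensions={"InstanceId": "i-0123456789abcdef0"},  # placeholder instance
    threshold=80.0,
    sns_topic_arn="arn:aws:sns:us-east-1:123456789012:on-call",  # placeholder ARN
)
rule = composite_rule("web-cpu-high", "web-memory-high", "web-latency-high")
```

Keeping the child alarms free of actions and attaching the SNS topic only to the composite alarm is one way to get a single page for a correlated incident; whether that fits depends on how your team triages.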
Essential Alarms to Create First
Minimum viable alarm set for production AWS:
- EC2 CPUUtilization > 80% for 5 consecutive minutes
- EC2 StatusCheckFailed = 1 (any data point)
- ALB TargetResponseTime P99 > 2 seconds
- ALB HTTPCode_Target_5XX_Count > 10 per minute
- RDS FreeStorageSpace < 20% of total
- Lambda Errors > 1% of invocations
- Lambda Throttles > 0 for 10+ minutes
- SQS DLQ depth > 0 (any messages = investigate)
CloudWatch Logs: Centralized Log Management
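CloudWatch Logs is AWS's managed log service. Many AWS services can send logs to it with little setup: Lambda writes to CloudWatch Logs by default, while API Gateway and ECS do so once logging is enabled (for ECS, via the awslogs log driver). EC2 instances need the CloudWatch Agent installed.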
Setting Up the CloudWatch Agent (EC2)
The CloudWatch Agent extends EC2 monitoring with memory, disk, and custom log collection:
# Install CloudWatch Agent
sudo yum install amazon-cloudwatch-agent
# Configure via wizard
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-config-wizard
# Start the agent
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
-a fetch-config -m ec2 -s \
-c ssm:/AmazonCloudWatch-linux
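The wizard produces a JSON configuration file for the agent. As a rough illustration of what that file contains, the sketch below generates a small configuration that collects memory and root-disk usage plus one application log file. The log path and log group name are placeholders, and the function name is hypothetical; compare the output with what the wizard writes on your own instance before relying on it.

```python
import json


def agent_config(log_file: str, log_group: str, interval_seconds: int = 60) -> dict:
    """Build a minimal CloudWatch Agent configuration as a Python dict."""
    return {
        "agent": {"metrics_collection_interval": interval_seconds},
        "metrics": {
            # Tag every metric with the instance it came from.
            "append_dimensions": {"InstanceId": "${aws:InstanceId}"},
            "metrics_collected": {
                "mem": {"measurement": ["mem_used_percent"]},
                "disk": {"measurement": ["used_percent"], "resources": ["/"]},
            },
        },
        "logs": {
            "logs_collected": {
                "files": {
                    "collect_list": [
                        {
                            "file_path": log_file,
                            "log_group_name": log_group,
                            # The agent substitutes the instance ID at runtime.
                            "log_stream_name": "{instance_id}",
                        }
                    ]
                }
            }
        },
    }


config = agent_config("/var/log/my-app/app.log", "/my-app/production")  # placeholders
config_json = json.dumps(config, indent=2)
print(config_json)
```

The printed JSON can be saved to a file and loaded with `-c file:<path>` instead of the `ssm:` source shown above, or stored in Parameter Store so every instance fetches the same configuration.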
CloudWatch Logs Insights
CloudWatch Logs Insights lets you query logs with a pipe-based query language built from commands such as fields, filter, stats, and sort. It's powerful for debugging but can be expensive at scale ($0.005/GB scanned). Key queries:
# Find Lambda errors in last hour
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 100
# P99 latency for API Gateway
fields @timestamp, status, responseLatency
| stats pct(responseLatency, 99) as p99 by bin(5m)
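Queries like the ones above can also be run from code, which is useful for scheduled reports or runbooks. Here is a minimal sketch using the CloudWatch Logs `start_query` and `get_query_results` calls. The helper names are hypothetical, the log group is whatever you pass in, and `logs_client` is assumed to be a boto3 `logs` client (for example `boto3.client("logs")`).

```python
import time

ERROR_QUERY = """fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 100"""


def start_query_args(log_group: str, query: str, window_seconds: int = 3600, now=None) -> dict:
    """Build start_query arguments covering the last `window_seconds` seconds."""
    end = int(time.time() if now is None else now)
    return {
        "logGroupName": log_group,
        "startTime": end - window_seconds,  # epoch seconds
        "endTime": end,
        "queryString": query,
    }


def run_query(logs_client, log_group: str, query: str,
              poll_seconds: float = 1.0, timeout_seconds: float = 60.0) -> list:
    """Start a Logs Insights query and poll until it finishes.

    Returns one dict per result row, mapping field name to value.
    """
    query_id = logs_client.start_query(**start_query_args(log_group, query))["queryId"]
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        response = logs_client.get_query_results(queryId=query_id)
        status = response["status"]
        if status == "Complete":
            return [{f["field"]: f["value"] for f in row} for row in response["results"]]
        if status in ("Failed", "Cancelled", "Timeout"):
            raise RuntimeError(f"query {query_id} ended with status {status}")
        time.sleep(poll_seconds)  # still Scheduled or Running
    raise TimeoutError(f"query {query_id} did not finish in {timeout_seconds}s")
```

Because cost scales with data scanned, narrowing `window_seconds` and targeting a single log group is usually the cheapest way to keep automated queries affordable.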
AWS X-Ray: Distributed Tracing
X-Ray adds distributed tracing to your AWS applications, letting you follow a single request through API Gateway → Lambda → DynamoDB → SQS and see exactly where time is spent or where errors occur.
Enabling X-Ray
X-Ray is enabled per-service:
- Lambda: Enable Active Tracing in function configuration or via IaC.
- API Gateway: Enable X-Ray tracing in Stage settings.
- ECS: Add X-Ray daemon as a sidecar container.
- EC2: Install X-Ray daemon and instrument your application code.
X-Ray Sampling
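X-Ray samples a subset of requests to control cost. By default it records the first request each second plus 5% of any additional requests. For production, customize sampling rules by service, HTTP method, and URL path. Sampling decisions are made when a request starts, so rules match request attributes rather than outcomes: for example, sample 100% of requests to critical or error-prone routes and 1% of health checks.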
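To see what the default rule means for trace volume, the arithmetic can be written out: a reservoir of one request per second is always traced, and a fixed 5% rate applies to the remainder. The sketch below also builds a rule for a hypothetical `/health` endpoint in the shape X-Ray's CreateSamplingRule expects; the rule name, path, and priority are illustrative assumptions.

```python
def expected_traces_per_second(requests_per_second: float,
                               reservoir: float = 1.0,
                               fixed_rate: float = 0.05) -> float:
    """Estimate sampled traces per second for one sampling rule."""
    guaranteed = min(requests_per_second, reservoir)  # reservoir is filled first
    return guaranteed + fixed_rate * (requests_per_second - guaranteed)


def health_check_rule(path: str = "/health", fixed_rate: float = 0.01) -> dict:
    """Build a SamplingRule that traces only a small share of health checks."""
    return {
        "RuleName": "health-checks",  # hypothetical name
        "Priority": 100,              # lower numbers are evaluated first
        "ReservoirSize": 0,           # no guaranteed traces for this path
        "FixedRate": fixed_rate,
        "ServiceName": "*",
        "ServiceType": "*",
        "Host": "*",
        "HTTPMethod": "GET",
        "URLPath": path,
        "ResourceARN": "*",
        "Version": 1,
    }


# At 100 requests/second the default rule keeps 1 + 0.05 * 99 traces per second.
default_volume = expected_traces_per_second(100)
```

With boto3 the rule would be submitted as `boto3.client("xray").create_sampling_rule(SamplingRule=health_check_rule())`; that call is not executed here.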
AWS CloudWatch vs. Third-Party Monitoring Tools
CloudWatch is powerful but has real limitations in production:
| Feature | CloudWatch | Datadog / New Relic | Better Stack |
|---|---|---|---|
| Dashboards | Basic, limited widgets | Rich, customizable | Clean, focused on uptime |
| Multi-cloud | AWS only | AWS + GCP + Azure + K8s | URL/API based (any) |
| On-call management | Via SNS only | Built-in rotation/escalation | Built-in schedules |
| Log analytics | Logs Insights (expensive) | Powerful, ML-based | Logtail (modern UI) |
| APM / Tracing | X-Ray (limited) | Full APM + profiling | Not included |
| Pricing | $0.30/metric/mo | $18+/host/mo | $25/mo for uptime+logs |
| Setup complexity | Low (built into AWS) | Medium (agent install) | Low (URL-based) |
Recommendation: Use CloudWatch for native AWS metric collection and alerting. Add Better Stack or Datadog for user-facing uptime monitoring, cross-service dashboards, and on-call management. Don't pay for both a full Datadog license AND CloudWatch unless you have a large team; the overlap is significant.
AWS Monitoring Architecture: Production Best Practices
1. Use the Four Golden Signals
Google SRE's Four Golden Signals apply directly to AWS: Latency (ALB TargetResponseTime P99), Traffic (RequestCount), Errors (5XX rate), and Saturation (CPU, queue depth, connection pool usage). If you monitor nothing else, monitor these four.
2. Tag Everything for Cost Attribution
AWS resources without tags are unmonitorable at scale. Enforce tagging policies (Environment: prod/staging/dev, Team, Service) and filter CloudWatch metrics and alarms by tag. This lets you drill into per-service costs and per-team incident rates.
3. Use AWS Health Events
Subscribe to AWS Health events in your region to get proactive notifications about infrastructure events affecting your account. Set up an EventBridge rule to forward Health events to Slack or your on-call tool:
# EventBridge rule pattern for AWS Health events
{
"source": ["aws.health"],
"detail-type": ["AWS Health Event"],
"detail": {
"eventTypeCategory": ["issue", "scheduledChange"]
}
}
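Once the rule matches, something has to turn the event into a readable message. The sketch below shows one way a Lambda target might format a Health event for Slack. The function name and the sample event values are made up for illustration; the field names follow the AWS Health event format delivered through EventBridge, and missing fields fall back to placeholders rather than raising.

```python
def format_health_event(event: dict) -> dict:
    """Turn an AWS Health EventBridge event into a Slack webhook payload."""
    detail = event.get("detail", {})
    descriptions = detail.get("eventDescription") or [{}]
    summary = descriptions[0].get("latestDescription", "No description provided.")
    lines = [
        f"*AWS Health: {detail.get('eventTypeCode', 'UNKNOWN_EVENT')}*",
        f"Service: {detail.get('service', 'unknown')} | "
        f"Category: {detail.get('eventTypeCategory', 'unknown')} | "
        f"Region: {event.get('region', 'unknown')}",
        f"Started: {detail.get('startTime', 'unknown')}",
        summary,
    ]
    return {"text": "\n".join(lines)}


# Illustrative event; all values are invented.
sample_event = {
    "source": "aws.health",
    "detail-type": "AWS Health Event",
    "region": "us-east-1",
    "detail": {
        "service": "EC2",
        "eventTypeCode": "AWS_EC2_OPERATIONAL_ISSUE",
        "eventTypeCategory": "issue",
        "startTime": "Mon, 01 Jan 2024 10:00:00 GMT",
        "eventDescription": [
            {"language": "en_US", "latestDescription": "Increased API error rates."}
        ],
    },
}
message = format_health_event(sample_event)
```

The resulting dict can be sent as the JSON body of a Slack incoming-webhook request, or handed to whatever on-call tool the rule targets.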
4. Implement Log Retention Policies
CloudWatch Logs default to infinite retention at $0.03/GB/month stored. For high-volume Lambda functions this adds up fast. Set retention policies:
# Set 30-day retention on a log group (CLI)
aws logs put-retention-policy \
--log-group-name /aws/lambda/my-function \
--retention-in-days 30
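Setting retention one log group at a time does not scale, so it can help to sweep the account for groups that still have no policy. This is a minimal sketch, assuming `logs_client` is a boto3 `logs` client and that a flat 30-day policy suits every unconfigured group, which may not hold for audit or compliance logs. The function names are hypothetical.

```python
def groups_without_retention(log_groups: list) -> list:
    """Return names of log groups that never expire.

    DescribeLogGroups omits `retentionInDays` for groups with no policy.
    """
    return [g["logGroupName"] for g in log_groups if "retentionInDays" not in g]


def apply_default_retention(logs_client, days: int = 30, dry_run: bool = True) -> list:
    """Set a retention policy on every log group that lacks one.

    With dry_run=True (the default) nothing is changed; the names that
    would be updated are returned so they can be reviewed first.
    """
    updated = []
    paginator = logs_client.get_paginator("describe_log_groups")
    for page in paginator.paginate():
        for name in groups_without_retention(page["logGroups"]):
            if not dry_run:
                logs_client.put_retention_policy(logGroupName=name, retentionInDays=days)
            updated.append(name)
    return updated
```

A typical run would call `apply_default_retention(boto3.client("logs"))` first, review the returned names, then repeat with `dry_run=False`.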
5. Infrastructure as Code for Monitoring
Define all CloudWatch alarms, dashboards, and log metrics in CloudFormation or Terraform rather than by hand. Manual monitoring setup is hard to audit, reproduce, or review. See our Monitoring as Code guide for patterns and examples.
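As a small illustration of alarms living in version control, the sketch below generates a CloudFormation template (as JSON) for a single EC2 CPU alarm. It is only one possible approach, not the pattern from the linked guide: the logical ID, instance ID, and SNS topic ARN are placeholders, and a real setup would more likely hand-write the template or use Terraform or the CDK.

```python
import json


def cpu_alarm_template(instance_id: str, sns_topic_arn: str, threshold: int = 80) -> dict:
    """Build a CloudFormation template containing one CPU alarm."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "CPU alarm managed as code",
        "Resources": {
            "HighCpuAlarm": {  # hypothetical logical ID
                "Type": "AWS::CloudWatch::Alarm",
                "Properties": {
                    "AlarmDescription": f"CPU above {threshold}% for 5 minutes",
                    "Namespace": "AWS/EC2",
                    "MetricName": "CPUUtilization",
                    "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
                    "Statistic": "Average",
                    # One 5-minute period matches basic (non-detailed) monitoring.
                    "Period": 300,
                    "EvaluationPeriods": 1,
                    "Threshold": threshold,
                    "ComparisonOperator": "GreaterThanThreshold",
                    "AlarmActions": [sns_topic_arn],
                },
            }
        },
    }


template = cpu_alarm_template(
    "i-0123456789abcdef0",                         # placeholder instance ID
    "arn:aws:sns:us-east-1:123456789012:on-call",  # placeholder topic ARN
)
template_json = json.dumps(template, indent=2)
```

Writing `template_json` to a file and committing it means every threshold change goes through review like any other code change.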
Frequently Asked Questions
What is the best tool for AWS monitoring?
CloudWatch is the default and handles most cases. Add Better Stack for user-facing uptime monitoring and Datadog or Grafana Cloud for cross-service dashboards if your team needs richer visualizations. The right stack depends on team size and budget.
How do I monitor AWS Lambda functions?
Monitor Lambda via CloudWatch Metrics: Errors, Duration (P99), Throttles, ConcurrentExecutions, and IteratorAge for stream triggers. Enable X-Ray for tracing. CloudWatch Logs Insights lets you query function logs for debugging.
How do I reduce AWS CloudWatch costs?
Reduce CloudWatch costs by: 1) Setting log retention policies (30-90 days), 2) Using Logs Insights sparingly and exporting to S3 for archival, 3) Consolidating custom metrics with metric math, 4) Using Contributor Insights only for specific high-traffic APIs, 5) Sampling X-Ray traces at 1-5% for non-error requests.
How do I set up alerting from CloudWatch to Slack?
Create a CloudWatch Alarm → SNS Topic → Lambda function (or EventBridge → Lambda). The Lambda posts to Slack via webhook. Better Stack and Datadog provide simpler Slack integrations with richer formatting out of the box.
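The Lambda in that chain can be quite small. Here is a minimal sketch of such a handler using only the standard library. It assumes the webhook URL is supplied in an environment variable named `SLACK_WEBHOOK_URL` (a name chosen for this example) and that the function is subscribed to the SNS topic the alarm publishes to; error handling and retries are left out.

```python
import json
import os
import urllib.request


def slack_payload(sns_event: dict) -> dict:
    """Convert an SNS-delivered CloudWatch alarm notification into Slack JSON."""
    # CloudWatch publishes the alarm state change as a JSON string in the SNS message.
    alarm = json.loads(sns_event["Records"][0]["Sns"]["Message"])
    state = alarm.get("NewStateValue", "UNKNOWN")
    prefix = {"ALARM": "[ALARM]", "OK": "[OK]"}.get(state, f"[{state}]")
    return {
        "text": (
            f"{prefix} {alarm.get('AlarmName', 'unnamed alarm')}\n"
            f"{alarm.get('NewStateReason', '')}"
        ).strip()
    }


def handler(event, context):
    """Lambda entry point: post the formatted alarm to a Slack incoming webhook."""
    request = urllib.request.Request(
        os.environ["SLACK_WEBHOOK_URL"],  # assumed environment variable
        data=json.dumps(slack_payload(event)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return {"status": response.status}
```

Keeping the formatting in `slack_payload` separate from the network call makes the message shape easy to unit test without touching Slack.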
What AWS monitoring certifications exist?
AWS Certified DevOps Engineer - Professional and AWS Certified SysOps Administrator - Associate both cover monitoring and observability on AWS in depth. The SysOps certification includes heavy CloudWatch coverage.
Further Reading
- Cloud Monitoring Guide: covers GCP and Azure alongside AWS
- Kubernetes Monitoring Guide: for EKS and container workloads
- Distributed Tracing Guide: deep dive on X-Ray and OpenTelemetry
- Monitoring as Code: Terraform and CloudFormation for CloudWatch
- OpenTelemetry Guide: vendor-neutral observability standard