AWS Monitoring Guide 2026: CloudWatch, X-Ray, and Best Practices

A complete guide to monitoring AWS infrastructure in production, covering CloudWatch metrics, logs, alarms, X-Ray tracing, and when to add third-party tools.

Updated: May 2026 · 20 min read

Is AWS down right now? Check the AWS Service Health Dashboard or read our AWS outage guide.

Running infrastructure on AWS without proper monitoring is like flying blind. AWS gives you an enormous toolbox (CloudWatch, X-Ray, CloudTrail, AWS Health, and more), but knowing which tools to use, which metrics actually matter, and how to avoid alert fatigue takes real experience. This guide distills production AWS monitoring into actionable best practices.

The AWS Monitoring Stack: What's Available

AWS provides several native monitoring services, each covering a different layer of observability:

| Service | What It Does | Best For |
|---|---|---|
| CloudWatch Metrics | Collect and graph metrics from all AWS services | CPU, memory, latency, error rates |
| CloudWatch Logs | Centralize and search application and service logs | Log aggregation, error searching |
| CloudWatch Alarms | Alert on metric thresholds via SNS | Proactive alerting, auto-scaling triggers |
| AWS X-Ray | Distributed request tracing across services | Latency debugging, microservices |
| CloudTrail | Audit log of all AWS API calls | Security, compliance, change tracking |
| AWS Health | AWS service health events affecting your account | Outage awareness, incident tracking |
| VPC Flow Logs | Network traffic logs for VPC | Security, network debugging |
| AWS Config | Track AWS resource configuration changes | Compliance, drift detection |

AWS CloudWatch: Core Monitoring

Understanding CloudWatch Metrics

CloudWatch metrics are the foundation of AWS monitoring. Most AWS services publish metrics automatically, including EC2, RDS, Lambda, ALB, and SQS. Metrics are organized into namespaces (e.g., AWS/EC2) and are identified by dimensions (e.g., InstanceId) that narrow a metric down to a specific resource.
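To make the namespace and dimension idea concrete, here is a minimal sketch that builds the arguments for a CloudWatch GetMetricStatistics request. The instance ID is a made-up placeholder, and the helper name `cpu_query_params` is our own; the block only builds the request so it runs without AWS credentials.

```python
from datetime import datetime, timedelta, timezone


def cpu_query_params(instance_id: str, minutes: int = 60) -> dict:
    """Build keyword arguments for a CloudWatch GetMetricStatistics call."""
    end = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/EC2",            # groups all metrics for one service
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": 300,                     # seconds covered by each data point
        "Statistics": ["Average"],
    }


# Hypothetical instance ID, used only for illustration.
params = cpu_query_params("i-0123456789abcdef0")
print(params["Namespace"], params["MetricName"], params["Dimensions"])

# With boto3 installed and credentials configured, the request could be sent as:
#   boto3.client("cloudwatch").get_metric_statistics(**params)
```

The same namespace plus dimension pair is what you reference later when defining alarms and dashboards.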

Key CloudWatch concepts:

Critical Metrics by AWS Service

EC2 Instances

  • CPUUtilization: Alert at >80% sustained for 5+ minutes
  • NetworkIn/NetworkOut: Baseline and alert on anomalies
  • DiskReadOps/DiskWriteOps: I/O saturation on EBS volumes
  • StatusCheckFailed: Critical. Instance or system health check failed
  • Memory/Disk: Not available by default. Install the CloudWatch Agent

AWS Lambda

  • Errors: Alert on any sustained error rate >1%
  • Duration: Watch P99 against the function's configured timeout (default 3 seconds, maximum 15 minutes)
  • Throttles: Indicates you're hitting concurrency limits
  • ConcurrentExecutions: Approaching account limit triggers throttles
  • IteratorAge (streams): Lag in Kinesis/DynamoDB event processing
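Lambda publishes Errors and Invocations as separate counts, so an error-rate alert like the one above needs the two combined. One way to do that is a metric math alarm. The sketch below builds the arguments for CloudWatch's PutMetricAlarm API; the function name, SNS topic ARN, period, and evaluation settings are illustrative assumptions, not values from this guide.

```python
def lambda_error_rate_alarm(function_name: str, sns_topic_arn: str) -> dict:
    """Build PutMetricAlarm arguments for a Lambda error-rate alarm (metric math)."""
    dimensions = [{"Name": "FunctionName", "Value": function_name}]

    def stat(metric_id: str, metric_name: str) -> dict:
        # One input series; ReturnData=False means it feeds the expression only.
        return {
            "Id": metric_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/Lambda",
                    "MetricName": metric_name,
                    "Dimensions": dimensions,
                },
                "Period": 300,
                "Stat": "Sum",
            },
            "ReturnData": False,
        }

    return {
        "AlarmName": f"lambda-error-rate-{function_name}",
        "Metrics": [
            stat("errors", "Errors"),
            stat("invocations", "Invocations"),
            {
                "Id": "error_rate",
                "Expression": "100 * errors / invocations",
                "Label": "Error rate (%)",
                "ReturnData": True,   # the alarm evaluates this series
            },
        ],
        "Threshold": 1.0,                       # percent
        "ComparisonOperator": "GreaterThanThreshold",
        "EvaluationPeriods": 3,                 # assumed: 3 x 5 min counts as "sustained"
        "TreatMissingData": "notBreaching",     # no invocations means no alarm
        "AlarmActions": [sns_topic_arn],
    }


# Hypothetical function name and topic ARN.
alarm = lambda_error_rate_alarm(
    "my-function", "arn:aws:sns:us-east-1:123456789012:ops-alerts"
)
print([metric["Id"] for metric in alarm["Metrics"]])

# With boto3 available: boto3.client("cloudwatch").put_metric_alarm(**alarm)
```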

Amazon RDS / Aurora

  • CPUUtilization: Alert at >70% (RDS is CPU-sensitive)
  • FreeStorageSpace: Alert when <20% remaining
  • DatabaseConnections: Watch for connection pool exhaustion
  • ReadLatency/WriteLatency: Baseline and alert on p99 spikes
  • ReplicaLag: For read replicas โ€” alert if lag >30 seconds
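One practical wrinkle with the storage threshold: RDS reports FreeStorageSpace in bytes, not as a percentage, so a "20% remaining" rule has to be converted using the instance's allocated storage. A small sketch of that arithmetic, with a helper name of our own choosing:

```python
GIB = 1024 ** 3  # bytes per GiB


def free_storage_threshold_bytes(allocated_gib: int, min_free_percent: int = 20) -> int:
    """Convert a 'percent free' rule into the byte threshold an alarm needs."""
    # Integer arithmetic avoids floating-point rounding surprises.
    return allocated_gib * GIB * min_free_percent // 100


# Example: a database with 100 GiB of allocated storage.
threshold = free_storage_threshold_bytes(100)
print(f"Alarm when FreeStorageSpace < {threshold} bytes")
```

Remember to recompute the threshold whenever allocated storage changes, for instance after storage autoscaling kicks in.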

Application Load Balancer (ALB)

  • TargetResponseTime: P99 latency for all backend requests
  • HTTPCode_Target_5XX_Count: Backend errors. Alert when they exceed 1% of requests
  • HTTPCode_ELB_5XX_Count: ALB itself returning errors
  • UnHealthyHostCount: Hosts failing health checks
  • ActiveConnectionCount: Concurrent active connections

Amazon SQS

  • ApproximateNumberOfMessagesVisible: Queue depth, which indicates consumer backlog
  • ApproximateAgeOfOldestMessage: How stale messages are getting
  • NumberOfMessagesSent: Production rate
  • Dead Letter Queue depth: Critical. Failed messages are accumulating
CloudWatch Alarms: Setting Up Effective Alerts

CloudWatch Alarms monitor a metric and trigger actions (SNS notification, Auto Scaling, EC2 action) when a threshold is breached. Getting alarms right is critical: too many alarms create alert fatigue, and too few leave you blind.

Alarm Best Practices

Essential Alarms to Create First

# Minimum viable alarm set for production AWS:
✓ EC2 CPUUtilization > 80% for 5 consecutive minutes
✓ EC2 StatusCheckFailed = 1 (any data point)
✓ ALB TargetResponseTime P99 > 2 seconds
✓ ALB HTTPCode_Target_5XX_Count > 10 per minute
✓ RDS FreeStorageSpace < 20% of total
✓ Lambda Errors > 1% of invocations
✓ Lambda Throttles > 0 for 10+ minutes
✓ SQS DLQ depth > 0 (any messages = investigate)
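As an illustration, the first alarm in that list could be expressed as PutMetricAlarm arguments roughly as follows. This is a sketch under assumptions: the alarm name format, instance ID, and SNS topic ARN are placeholders, and the period depends on whether detailed (1-minute) monitoring is enabled on the instance.

```python
def cpu_alarm_params(instance_id: str, sns_topic_arn: str,
                     detailed_monitoring: bool = False) -> dict:
    """Build PutMetricAlarm arguments for 'CPU > 80% for 5 minutes'."""
    # Basic monitoring publishes one data point every 5 minutes, so a single
    # 300-second period covers the window; detailed monitoring allows 5 x 60 s.
    period, evaluations = (60, 5) if detailed_monitoring else (300, 1)
    return {
        "AlarmName": f"ec2-cpu-high-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": period,
        "EvaluationPeriods": evaluations,
        "DatapointsToAlarm": evaluations,   # every period must breach
        "Threshold": 80.0,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "missing",
        "AlarmActions": [sns_topic_arn],    # notify via SNS when in ALARM
    }


# Hypothetical instance ID and topic ARN.
cpu_alarm = cpu_alarm_params(
    "i-0123456789abcdef0",
    "arn:aws:sns:us-east-1:123456789012:ops-alerts",
    detailed_monitoring=True,
)
print(cpu_alarm["Period"] * cpu_alarm["EvaluationPeriods"], "seconds evaluated")

# With boto3 available: boto3.client("cloudwatch").put_metric_alarm(**cpu_alarm)
```

Requiring every data point in the window to breach is what keeps a brief CPU spike from paging anyone.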

CloudWatch Logs: Centralized Log Management

CloudWatch Logs is AWS's managed log service. Many AWS services can send logs to CloudWatch with little or no setup, including Lambda, API Gateway, and ECS. EC2 instances need the CloudWatch Agent installed.

Setting Up the CloudWatch Agent (EC2)

The CloudWatch Agent extends EC2 monitoring with memory, disk, and custom log collection:

# Install CloudWatch Agent
sudo yum install amazon-cloudwatch-agent

# Configure via wizard
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-config-wizard

# Start the agent, loading its config from the SSM parameter AmazonCloudWatch-linux
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
  -a fetch-config -m ec2 -s \
  -c ssm:/AmazonCloudWatch-linux
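If you would rather skip the wizard, the agent configuration is plain JSON. Below is a minimal sketch that generates a config collecting memory, root-disk usage, and one application log file. The log path and log group name are hypothetical; the key names follow the agent's configuration file format as we understand it, so check them against the reference for your agent version.

```python
import json


def agent_config(log_path: str, log_group: str) -> dict:
    """Build a minimal CloudWatch Agent config: memory, disk, and one log file."""
    return {
        "metrics": {
            # Tag each metric with the instance it came from.
            "append_dimensions": {"InstanceId": "${aws:InstanceId}"},
            "metrics_collected": {
                "mem": {"measurement": ["mem_used_percent"]},
                "disk": {"measurement": ["used_percent"], "resources": ["/"]},
            },
        },
        "logs": {
            "logs_collected": {
                "files": {
                    "collect_list": [
                        {
                            "file_path": log_path,
                            "log_group_name": log_group,
                            "log_stream_name": "{instance_id}",
                        }
                    ]
                }
            }
        },
    }


# Hypothetical application log path and log group name.
config = agent_config("/var/log/myapp/app.log", "/myapp/production")
print(json.dumps(config, indent=2))
```

The printed JSON can be stored in the SSM parameter that the fetch-config command reads, or saved to a local file and loaded with a file: path instead.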

CloudWatch Logs Insights

CloudWatch Logs Insights lets you query logs with a SQL-like syntax. It's powerful for debugging but can be expensive at scale ($0.005/GB queried). Key queries:

# Find Lambda errors in last hour
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 100

# P99 latency for API Gateway
fields @timestamp, status, responseLatency
| stats pct(responseLatency, 99) as p99 by bin(5m)
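Queries like these can also be run programmatically. The sketch below builds the arguments for the CloudWatch Logs StartQuery API and estimates the scan cost using the per-GB price quoted above. The log group name is a placeholder and both helper names are ours.

```python
import time

PRICE_PER_GB_SCANNED = 0.005  # USD, the figure quoted in this guide

ERROR_QUERY = (
    "fields @timestamp, @message "
    "| filter @message like /ERROR/ "
    "| sort @timestamp desc "
    "| limit 100"
)


def insights_query_params(log_group: str, query: str, hours: int = 1, now=None) -> dict:
    """Build StartQuery arguments covering the last `hours` hours."""
    end = int(now if now is not None else time.time())  # epoch seconds
    return {
        "logGroupName": log_group,
        "startTime": end - hours * 3600,
        "endTime": end,
        "queryString": query,
    }


def estimated_query_cost(gb_scanned: float) -> float:
    """Rough cost in USD for scanning `gb_scanned` GB of log data."""
    return round(gb_scanned * PRICE_PER_GB_SCANNED, 4)


# Hypothetical log group name.
query_params = insights_query_params("/aws/lambda/my-function", ERROR_QUERY)
print(query_params["endTime"] - query_params["startTime"], "second window")
print("Scanning 200 GB costs about $", estimated_query_cost(200))

# With boto3 available: boto3.client("logs").start_query(**query_params),
# then poll get_query_results with the returned queryId.
```

Since cost scales with data scanned, narrowing the time range is usually the cheapest optimization available.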

AWS X-Ray: Distributed Tracing

X-Ray adds distributed tracing to your AWS applications, letting you follow a single request through API Gateway → Lambda → DynamoDB → SQS and see exactly where time is spent or where errors occur.

Enabling X-Ray

X-Ray is enabled per-service:

X-Ray Sampling

X-Ray samples a subset of requests to control cost. The default rule records the first request each second plus 5% of any additional requests. For production, customize sampling rules by service name, HTTP method, and URL path: for example, sample error-prone or business-critical endpoints at a higher rate and health checks at 1%.
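To get a feel for what the default rule means in trace volume, here is a back-of-the-envelope model: a reservoir of one trace per second, plus a fixed 5% of the remaining traffic. It is a simplification for estimating cost, not a description of how the X-Ray SDK is implemented.

```python
def expected_sampled_per_second(requests_per_second: float,
                                reservoir: int = 1,
                                fixed_rate: float = 0.05) -> float:
    """Estimate traces recorded per second under a reservoir + fixed-rate rule."""
    guaranteed = min(requests_per_second, reservoir)          # reservoir portion
    remainder = max(requests_per_second - reservoir, 0)       # traffic beyond it
    return guaranteed + remainder * fixed_rate


for rps in (0.5, 1, 101, 1001):
    print(f"{rps} req/s -> ~{expected_sampled_per_second(rps):.1f} traces/s")
```

At low traffic nearly everything is traced, while at high traffic the volume approaches the fixed rate, which is why high-throughput services benefit most from custom rules.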

AWS CloudWatch vs. Third-Party Monitoring Tools

CloudWatch is powerful but has real limitations in production:

| Feature | CloudWatch | Datadog / New Relic | Better Stack |
|---|---|---|---|
| Dashboards | Basic, limited widgets | Rich, customizable | Clean, focused on uptime |
| Multi-cloud | AWS only | AWS + GCP + Azure + K8s | URL/API based (any) |
| On-call management | Via SNS only | Built-in rotation/escalation | Built-in schedules |
| Log analytics | Logs Insights (expensive) | Powerful, ML-based | Logtail (modern UI) |
| APM / Tracing | X-Ray (limited) | Full APM + profiling | Not included |
| Pricing | $0.30/metric/mo | $18+/host/mo | $25/mo for uptime+logs |
| Setup complexity | Low (built into AWS) | Medium (agent install) | Low (URL-based) |

Recommendation: Use CloudWatch for native AWS metric collection and alerting. Add Better Stack or Datadog for user-facing uptime monitoring, cross-service dashboards, and on-call management. Don't pay for both a full Datadog license AND CloudWatch unless you have a large team; the overlap is significant.

AWS Monitoring Architecture: Production Best Practices

1. Use the Four Golden Signals

Google SRE's Four Golden Signals apply directly to AWS: Latency (ALB TargetResponseTime P99), Traffic (RequestCount), Errors (5XX rate), and Saturation (CPU, queue depth, connection pool usage). If you monitor nothing else, monitor these four.

2. Tag Everything for Cost Attribution

AWS resources without tags are unmonitorable at scale. Enforce tagging policies (Environment: prod/staging/dev, Team, Service) and filter CloudWatch metrics and alarms by tag. This lets you drill into per-service costs and per-team incident rates.
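A tagging policy is only useful if something checks it. As a rough sketch, a validation step in CI or a scheduled audit could look like the following. The required keys and allowed environments mirror the example policy above; the function name and sample tag values are our own.

```python
REQUIRED_TAGS = {"Environment", "Team", "Service"}
ALLOWED_ENVIRONMENTS = {"prod", "staging", "dev"}


def tag_violations(tags: dict) -> list:
    """Return a list of human-readable policy violations for one resource."""
    problems = [f"missing tag: {key}" for key in sorted(REQUIRED_TAGS - set(tags))]
    environment = tags.get("Environment")
    if environment is not None and environment not in ALLOWED_ENVIRONMENTS:
        problems.append(f"invalid Environment: {environment}")
    return problems


# Hypothetical resources and their tags.
print(tag_violations({"Environment": "prod", "Team": "payments", "Service": "api"}))
print(tag_violations({"Environment": "qa"}))
```

In practice the `tags` dictionaries would come from your infrastructure-as-code plan or from the Resource Groups Tagging API.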

3. Use AWS Health Events

Subscribe to AWS Health events in your region to get proactive notifications about infrastructure events affecting your account. Set up an EventBridge rule to forward Health events to Slack or your on-call tool:

# EventBridge rule pattern for AWS Health events
{
  "source": ["aws.health"],
  "detail-type": ["AWS Health Event"],
  "detail": {
    "eventTypeCategory": ["issue", "scheduledChange"]
  }
}
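On the receiving end, a small Lambda function can turn a matched event into a chat message. The sketch below only does the formatting step, so it can be tested without a webhook. The sample event is abbreviated and hand-written, and the field names reflect the Health event shape as we understand it; verify them against a real event before relying on this.

```python
def health_event_to_slack(event: dict) -> dict:
    """Format an AWS Health event (as delivered by EventBridge) for a Slack webhook."""
    detail = event.get("detail", {})
    descriptions = detail.get("eventDescription", [])
    summary = descriptions[0].get("latestDescription", "") if descriptions else ""
    text = (
        f"AWS Health {detail.get('eventTypeCategory', 'unknown')}: "
        f"{detail.get('service', 'unknown service')} "
        f"in {event.get('region', 'unknown region')}\n"
        f"{summary}"
    )
    return {"text": text.strip()}


# Abbreviated, hand-written sample event for illustration only.
sample_event = {
    "source": "aws.health",
    "detail-type": "AWS Health Event",
    "region": "us-east-1",
    "detail": {
        "service": "EC2",
        "eventTypeCategory": "issue",
        "eventDescription": [
            {"language": "en_US", "latestDescription": "Increased API error rates."}
        ],
    },
}

print(health_event_to_slack(sample_event)["text"])
```

A Lambda handler would call this function and POST the resulting dictionary as JSON to your Slack incoming-webhook URL.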

4. Implement Log Retention Policies

CloudWatch Logs default to infinite retention at $0.03/GB/month stored. For high-volume Lambda functions this adds up fast. Set retention policies:

# Set 30-day retention on a log group (CLI)
aws logs put-retention-policy \
  --log-group-name /aws/lambda/my-function \
  --retention-in-days 30
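To see why retention matters, consider a rough steady-state estimate: once a log group has been running longer than its retention window, it holds about (daily ingest × retention days) of data. The sketch below uses the storage price quoted above and ignores compression, so treat the output as an upper-bound estimate rather than a bill prediction.

```python
PRICE_PER_GB_MONTH = 0.03  # USD, the storage price quoted in this guide


def steady_state_storage_cost(daily_ingest_gb: float, retention_days: int) -> float:
    """Approximate monthly storage cost once the retention window is full."""
    stored_gb = daily_ingest_gb * retention_days
    return round(stored_gb * PRICE_PER_GB_MONTH, 2)


# Example: a service writing 10 GB of logs per day.
for days in (30, 90, 365):
    cost = steady_state_storage_cost(10, days)
    print(f"{days}-day retention: ~${cost}/month")
```

With no retention policy the stored volume never levels off, so the monthly storage charge keeps growing for as long as the log group exists.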

5. Infrastructure as Code for Monitoring

Define all CloudWatch alarms, dashboards, and log metrics in CloudFormation or Terraform, not manually. Manual monitoring setup is impossible to audit, reproduce, or review. See our Monitoring as Code guide for patterns and examples.

Frequently Asked Questions

What is the best tool for AWS monitoring?

CloudWatch is the default and handles most cases. Add Better Stack for user-facing uptime monitoring and Datadog or Grafana Cloud for cross-service dashboards if your team needs richer visualizations. The right stack depends on team size and budget.

How do I monitor AWS Lambda functions?

Monitor Lambda via CloudWatch Metrics: Errors, Duration (P99), Throttles, ConcurrentExecutions, and IteratorAge for stream triggers. Enable X-Ray for tracing. CloudWatch Logs Insights lets you query function logs for debugging.

How do I reduce AWS CloudWatch costs?

Reduce CloudWatch costs by: 1) setting log retention policies (30-90 days), 2) using Logs Insights sparingly and exporting to S3 for archival, 3) consolidating custom metrics with metric math, 4) using Contributor Insights only for specific high-traffic APIs, 5) sampling X-Ray traces at 1-5% for non-error requests.

How do I set up alerting from CloudWatch to Slack?

Create a CloudWatch Alarm → SNS Topic → Lambda function (or EventBridge → Lambda). The Lambda posts to Slack via webhook. Better Stack and Datadog provide simpler Slack integrations with richer formatting out of the box.
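The Lambda step in that chain is mostly message parsing. Here is a minimal sketch of it, assuming the alarm notification arrives as a JSON string inside the SNS record. The sample event is hand-written and trimmed to the fields used, and the actual webhook call is left as a comment so the block runs offline.

```python
import json


def alarm_to_slack(sns_event: dict) -> dict:
    """Convert an SNS-delivered CloudWatch alarm notification into a Slack payload."""
    # SNS wraps the alarm notification as a JSON string in Records[0].Sns.Message.
    message = json.loads(sns_event["Records"][0]["Sns"]["Message"])
    text = (
        f"{message['NewStateValue']}: {message['AlarmName']}\n"
        f"{message.get('NewStateReason', '')}"
    )
    return {"text": text.strip()}


# Hand-written sample event containing only the fields this sketch reads.
sample_sns_event = {
    "Records": [
        {
            "Sns": {
                "Message": json.dumps(
                    {
                        "AlarmName": "ec2-cpu-high-i-0123456789abcdef0",
                        "NewStateValue": "ALARM",
                        "NewStateReason": "Threshold Crossed: CPU above 80%.",
                    }
                )
            }
        }
    ]
}

print(alarm_to_slack(sample_sns_event)["text"])

# In a real handler, POST json.dumps(payload) to your Slack incoming-webhook URL,
# for example with urllib.request, and keep the URL in a secret store.
```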

What AWS monitoring certifications exist?

AWS Certified DevOps Engineer - Professional and AWS Certified SysOps Administrator - Associate both cover monitoring and observability on AWS in depth. The SysOps certification includes heavy CloudWatch coverage.
