What metrics should I monitor in AWS?

Critical AWS metrics to monitor: EC2 (CPUUtilization, NetworkIn/Out, DiskReadOps), RDS (CPUUtilization, FreeStorageSpace, DatabaseConnections, ReadLatency), Lambda (Errors, Duration, Throttles, ConcurrentExecutions), ALB (TargetResponseTime, HTTPCode_Target_4XX/5XX, UnHealthyHostCount), and SQS (ApproximateNumberOfMessagesVisible for queue depth).

Is AWS CloudWatch free?

AWS CloudWatch has a free tier: 10 custom metrics, 10 alarms, 5GB log ingestion, and 3 dashboards per month. Beyond the free tier, costs include $0.30/custom metric/month, $0.10/alarm/month, and $0.50/GB for log ingestion. Production workloads typically cost $50-500/month for CloudWatch depending on scale.

What is the difference between CloudWatch and X-Ray?

CloudWatch monitors infrastructure metrics and logs (CPU, memory, error rates, latency aggregates). AWS X-Ray provides distributed tracing — it traces individual requests through your application to show exactly where latency or errors occur in a microservices flow. Use both together for complete observability.

AWS Monitoring Guide 2026: CloudWatch, X-Ray, and Beyond

Q: What is the best tool for AWS monitoring?

AWS CloudWatch is the default monitoring tool built into AWS and covers most use cases. For production workloads, teams often add Datadog, New Relic, or Grafana Cloud for better visualization, alerting, and cross-platform monitoring. Better Stack is an excellent choice for uptime monitoring and log management alongside CloudWatch.

Q: How do I monitor AWS Lambda functions?

Monitor AWS Lambda using CloudWatch Metrics (invocations, duration, errors, throttles, concurrent executions), CloudWatch Logs (function logs are automatically streamed), and AWS X-Ray for distributed tracing. Set CloudWatch alarms on error rate and duration P99 to catch issues early.

Running infrastructure on AWS without proper monitoring is like flying blind. AWS gives you an enormous toolbox — CloudWatch, X-Ray, CloudTrail, AWS Health, and more — but knowing which tools to use, which metrics actually matter, and how to avoid alert fatigue takes real experience. This guide distills production AWS monitoring into actionable best practices.

The AWS Monitoring Stack: What's Available

AWS provides several native monitoring services, each covering a different layer of observability:

Service	What It Does	Best For
CloudWatch Metrics	Collect and graph metrics from all AWS services	CPU, memory, latency, error rates
CloudWatch Logs	Centralize and search application and service logs	Log aggregation, error searching
CloudWatch Alarms	Alert on metric thresholds via SNS	Proactive alerting, auto-scaling triggers
AWS X-Ray	Distributed request tracing across services	Latency debugging, microservices
CloudTrail	Audit log of all AWS API calls	Security, compliance, change tracking
AWS Health	AWS service health events affecting your account	Outage awareness, incident tracking
VPC Flow Logs	Network traffic logs for VPC	Security, network debugging
AWS Config	Track AWS resource configuration changes	Compliance, drift detection

AWS CloudWatch: Core Monitoring

Understanding CloudWatch Metrics

CloudWatch metrics are the foundation of AWS monitoring. Most AWS services publish metrics automatically, including EC2, RDS, Lambda, ALB, and SQS. Metrics are organized into namespaces (e.g., AWS/EC2) and have dimensions (e.g., InstanceId).

Key CloudWatch concepts:

Default metrics: Free metrics published automatically (EC2 CPU, RDS connections, etc.)
Detailed monitoring: 1-minute granularity (vs. 5-minute default) for EC2, billed at the custom-metric rate (about $2.10/instance/month for the seven standard metrics at $0.30 each)
Custom metrics: Publish your own application metrics via the CloudWatch API or SDK
Metric math: Combine metrics with expressions (e.g., error rate = errors / requests × 100)
Metric retention: 1-minute data points are kept for 15 days, 5-minute for 63 days, and 1-hour aggregates for 15 months; high-resolution (sub-minute) data points are kept for 3 hours
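
The metric math entry above (error rate = errors / requests × 100) can be sketched as a request body for the CloudWatch GetMetricData API. This is a minimal sketch under assumptions, not a definitive implementation: the helper name `build_error_rate_queries` and the function name `checkout-handler` are hypothetical, and the code only builds the list you would pass as `MetricDataQueries`; it does not call AWS.

```python
def build_error_rate_queries(function_name, period_seconds=300):
    """Build MetricDataQueries that compute a Lambda error rate in percent."""
    dimensions = [{"Name": "FunctionName", "Value": function_name}]

    def metric_query(query_id, metric_name):
        # Raw input metric; ReturnData=False keeps it out of the response.
        return {
            "Id": query_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/Lambda",
                    "MetricName": metric_name,
                    "Dimensions": dimensions,
                },
                "Period": period_seconds,
                "Stat": "Sum",
            },
            "ReturnData": False,
        }

    return [
        metric_query("errors", "Errors"),
        metric_query("invocations", "Invocations"),
        {
            # Metric math expression referencing the two query Ids above.
            "Id": "error_rate",
            "Expression": "100 * errors / invocations",
            "Label": "Error rate (%)",
            "ReturnData": True,
        },
    ]


queries = build_error_rate_queries("checkout-handler")  # hypothetical function name
```

With boto3 installed, the list would typically be passed as `MetricDataQueries=queries` to the CloudWatch client's `get_metric_data`, together with a start and end time.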

Critical Metrics by AWS Service

EC2 Instances

CPUUtilization: Alert at >80% sustained for 5+ minutes
NetworkIn/NetworkOut: Baseline and alert on anomalies
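DiskReadOps/DiskWriteOps: I/O on instance store volumes only; for EBS-backed instances use EBSReadOps/EBSWriteOps (Nitro instances) or the per-volume metrics in the AWS/EBS namespace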
StatusCheckFailed: Critical — instance or system health failed
Memory/Disk: Not available by default — install CloudWatch Agent

AWS Lambda

Errors: Alert on any sustained error rate >1%
Duration: Watch P99 against the function's configured timeout (default 3 seconds, maximum 15 minutes)
Throttles: Indicates you're hitting concurrency limits
ConcurrentExecutions: Approaching account limit triggers throttles
IteratorAge (streams): Lag in Kinesis/DynamoDB event processing

Amazon RDS / Aurora

CPUUtilization: Alert at >70% (RDS is CPU-sensitive)
FreeStorageSpace: Alert when <20% remaining (the metric is reported in bytes, so derive the threshold from allocated storage)
DatabaseConnections: Watch for connection pool exhaustion
ReadLatency/WriteLatency: Baseline and alert on p99 spikes
ReplicaLag: For read replicas — alert if lag >30 seconds

Application Load Balancer (ALB)

TargetResponseTime: P99 latency for all backend requests
HTTPCode_Target_5XX_Count: Backend 5XX responses; divide by RequestCount with metric math and alert when the rate exceeds 1%
HTTPCode_ELB_5XX_Count: ALB itself returning errors
UnHealthyHostCount: Hosts failing health checks
ActiveConnectionCount: Concurrent active connections

Amazon SQS

ApproximateNumberOfMessagesVisible: Queue depth — indicates consumer backlog
ApproximateAgeOfOldestMessage: How stale messages are getting
NumberOfMessagesSent: Production rate
Dead Letter Queue depth: Critical — failed messages accumulating

CloudWatch Alarms: Setting Up Effective Alerts

CloudWatch Alarms monitor a metric and trigger actions (SNS notification, Auto Scaling, EC2 action) when a threshold is breached. Getting alarms right is critical: too many create alert fatigue, too few leave you blind.

Alarm Best Practices

Use anomaly detection instead of static thresholds for metrics with diurnal patterns (traffic naturally higher during business hours).
Set evaluation periods, not just thresholds: require 3 out of 5 data points to breach (EvaluationPeriods = 5, DatapointsToAlarm = 3) before alarming to reduce noise.
Treat missing data intentionally: For Lambda functions, missing data means no invocations (potentially OK), so set TreatMissingData to notBreaching where silence is normal and to breaching where it is not.
Use composite alarms to reduce noise — alert only when CPU AND memory AND latency are all elevated simultaneously.
Route alarms through SNS → PagerDuty/OpsGenie/Better Stack for proper on-call management rather than email floods.
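
The "3 out of 5" and composite-alarm practices above can be sketched as the arguments you would hand to the CloudWatch PutMetricAlarm and PutCompositeAlarm APIs. This is a rough sketch under assumptions: the helper names, alarm names, instance ID, and SNS topic ARN are all hypothetical, and the code only builds the request data rather than calling AWS.

```python
def build_m_of_n_alarm(alarm_name, instance_id, sns_topic_arn):
    """Arguments for put_metric_alarm: alarm when 3 of the last 5 periods breach."""
    return {
        "AlarmName": alarm_name,
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,             # 5-minute data works with basic monitoring
        "EvaluationPeriods": 5,    # N: look at the last 5 data points
        "DatapointsToAlarm": 3,    # M: at least 3 of them must breach
        "Threshold": 80.0,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "missing",  # explicit choice, even if it is the default
        "AlarmActions": [sns_topic_arn],
    }


def build_composite_rule(alarm_names):
    """AlarmRule string that is true only when every child alarm is in ALARM."""
    return " AND ".join(f'ALARM("{name}")' for name in alarm_names)


alarm_args = build_m_of_n_alarm(
    "api-high-cpu",                                   # hypothetical alarm name
    "i-0123456789abcdef0",                            # hypothetical instance ID
    "arn:aws:sns:us-east-1:123456789012:oncall",      # hypothetical topic ARN
)
rule = build_composite_rule(["api-high-cpu", "api-high-latency"])
```

With boto3, `alarm_args` would typically be unpacked into `put_metric_alarm(**alarm_args)`, and `rule` passed as `AlarmRule` to `put_composite_alarm`. Keeping the child alarms free of actions and attaching the notification only to the composite is one way to get the noise reduction described above.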

Essential Alarms to Create First

# Minimum viable alarm set for production AWS:
✓ EC2 CPUUtilization > 80% for 5 consecutive minutes
✓ EC2 StatusCheckFailed >= 1 (any data point)
✓ ALB TargetResponseTime P99 > 2 seconds
✓ ALB HTTPCode_Target_5XX_Count > 10 per minute
✓ RDS FreeStorageSpace < 20% of allocated storage (threshold set in bytes)
✓ Lambda Errors > 1% of Invocations (metric math)
✓ Lambda Throttles > 0 for 10+ minutes
✓ SQS DLQ depth > 0 (any messages = investigate)
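
One item in the list above needs a conversion before it can become an alarm: FreeStorageSpace is reported in bytes, so "20% of total" has to be turned into a byte count. The following is a minimal sketch, assuming allocated storage is known in GiB; the helper names and the instance identifier `orders-db` are hypothetical.

```python
GIB = 1024 ** 3


def free_storage_threshold_bytes(allocated_gib, min_free_fraction=0.20):
    """Convert 'less than 20% free' into a byte threshold for FreeStorageSpace."""
    if not 0 < min_free_fraction < 1:
        raise ValueError("min_free_fraction must be between 0 and 1")
    return round(allocated_gib * GIB * min_free_fraction)


def build_low_storage_alarm(db_instance_id, allocated_gib):
    """Arguments for put_metric_alarm on an RDS instance's free storage."""
    return {
        "AlarmName": f"{db_instance_id}-low-free-storage",
        "Namespace": "AWS/RDS",
        "MetricName": "FreeStorageSpace",
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": db_instance_id}],
        "Statistic": "Minimum",
        "Period": 300,
        "EvaluationPeriods": 1,
        "Threshold": float(free_storage_threshold_bytes(allocated_gib)),
        "ComparisonOperator": "LessThanThreshold",
    }


alarm_args = build_low_storage_alarm("orders-db", 500)  # hypothetical 500 GiB instance
```

If storage autoscaling is enabled, the allocated size changes over time, so the threshold would need to be recomputed when it does.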

CloudWatch Logs: Centralized Log Management

CloudWatch Logs is AWS's managed log service. Many AWS services can send logs to CloudWatch: Lambda does so automatically when its execution role allows it, while API Gateway and ECS need logging enabled in the stage or task definition. EC2 instances need the CloudWatch Agent installed.

Setting Up the CloudWatch Agent (EC2)

The CloudWatch Agent extends EC2 monitoring with memory, disk, and custom log collection:

# Install CloudWatch Agent (Amazon Linux)
sudo yum install -y amazon-cloudwatch-agent

# Configure via wizard
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-config-wizard

# Start the agent with the config the wizard saved to SSM Parameter Store
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
  -a fetch-config -m ec2 -s \
  -c ssm:AmazonCloudWatch-linux
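
As an alternative to the wizard, the agent configuration can be generated and kept in version control. The sketch below builds a small config that collects memory, root-disk usage, and one log file. It is an illustration under assumptions: the helper name, the log path `/var/log/app/app.log`, and the log group `/app/production` are hypothetical.

```python
import json


def build_agent_config(log_path, log_group):
    """Minimal CloudWatch Agent config: memory, root disk usage, one log file."""
    return {
        "agent": {"metrics_collection_interval": 60},
        "metrics": {
            "namespace": "CWAgent",
            # Tag every metric with the instance that produced it.
            "append_dimensions": {"InstanceId": "${aws:InstanceId}"},
            "metrics_collected": {
                "mem": {"measurement": ["mem_used_percent"]},
                "disk": {"measurement": ["used_percent"], "resources": ["/"]},
            },
        },
        "logs": {
            "logs_collected": {
                "files": {
                    "collect_list": [
                        {
                            "file_path": log_path,
                            "log_group_name": log_group,
                            "log_stream_name": "{instance_id}",
                        }
                    ]
                }
            }
        },
    }


config = build_agent_config("/var/log/app/app.log", "/app/production")
print(json.dumps(config, indent=2))
```

The printed JSON can be written to a file and loaded with `-c file:<path>` instead of the `ssm:` source shown above.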

CloudWatch Logs Insights

CloudWatch Logs Insights lets you query logs with a purpose-built, pipe-delimited query language. It's powerful for debugging but can be expensive at scale ($0.005 per GB scanned). Key queries:

# Find Lambda errors in last hour
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 100

# P99 latency for API Gateway
fields @timestamp, status, responseLatency
| stats pct(responseLatency, 99) as p99 by bin(5m)
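
Queries like the ones above can also be started programmatically. The sketch below builds the arguments for the CloudWatch Logs StartQuery API and estimates the scan cost at the $0.005/GB rate quoted earlier. The helper names and the log group are hypothetical, and the cost figure is an estimate only, since the real charge depends on how much data the time range actually covers.

```python
from datetime import datetime, timedelta, timezone

PRICE_PER_GB_SCANNED = 0.005  # USD, the rate quoted in this guide

ERROR_QUERY = (
    "fields @timestamp, @message "
    "| filter @message like /ERROR/ "
    "| sort @timestamp desc "
    "| limit 100"
)


def build_start_query_args(log_group, query, hours=1, now=None):
    """Arguments for start_query covering the last `hours` hours."""
    now = now or datetime.now(timezone.utc)
    start = now - timedelta(hours=hours)
    return {
        "logGroupName": log_group,
        "startTime": int(start.timestamp()),  # epoch seconds
        "endTime": int(now.timestamp()),
        "queryString": query,
    }


def estimate_scan_cost(gb_scanned):
    """Rough query cost in USD for a given amount of scanned data."""
    return gb_scanned * PRICE_PER_GB_SCANNED


args = build_start_query_args("/aws/lambda/my-function", ERROR_QUERY)
```

With boto3, `args` would typically be unpacked into the Logs client's `start_query(**args)`, and the returned query ID polled with `get_query_results`. Narrowing the time range is the simplest way to reduce the amount of data scanned.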

AWS X-Ray: Distributed Tracing

X-Ray adds distributed tracing to your AWS applications, letting you follow a single request through API Gateway → Lambda → DynamoDB → SQS and see exactly where time is spent or where errors occur.

Enabling X-Ray

X-Ray is enabled per-service:

Lambda: Enable Active Tracing in function configuration or via IaC.
API Gateway: Enable X-Ray tracing in Stage settings.
ECS: Add X-Ray daemon as a sidecar container.
EC2: Install X-Ray daemon and instrument your application code.

X-Ray Sampling

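X-Ray samples a subset of requests to control cost. Default: the first request each second, plus 5% of any additional requests. For production, customize sampling rules by service, HTTP method, and URL path: sample 100% of requests to critical or error-prone routes and 1% of health checks. The sampling decision is made when a request starts, so a rule cannot select on whether the request later fails.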
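
To get a feel for what the default rule means in volume terms, the sketch below computes the expected number of traced requests per second under reservoir-plus-rate sampling, and builds a rule body shaped for the X-Ray CreateSamplingRule API. The helper names, rule name, and `/health` path are hypothetical, and the arithmetic is an expectation for a single sampling point, not a guarantee of what X-Ray records.

```python
def expected_traces_per_second(requests_per_second, reservoir=1, fixed_rate=0.05):
    """Expected traced requests per second: the reservoir, plus a share of the rest."""
    guaranteed = min(requests_per_second, reservoir)
    remainder = max(requests_per_second - reservoir, 0)
    return guaranteed + remainder * fixed_rate


def build_sampling_rule(rule_name, url_path, fixed_rate, reservoir, priority):
    """SamplingRule body; lower priority numbers are evaluated first."""
    return {
        "RuleName": rule_name,
        "Priority": priority,
        "FixedRate": fixed_rate,
        "ReservoirSize": reservoir,
        "ServiceName": "*",
        "ServiceType": "*",
        "Host": "*",
        "HTTPMethod": "*",
        "URLPath": url_path,
        "ResourceARN": "*",
        "Version": 1,
    }


# At 100 requests/second the default rule traces about 6 requests per second.
default_volume = expected_traces_per_second(100)

# Hypothetical rule: no reservoir and 1% sampling for health checks.
health_rule = build_sampling_rule("health-checks", "/health", 0.01, 0, 10)
```

With boto3, the rule would typically be passed as `SamplingRule=health_rule` to the X-Ray client's `create_sampling_rule`.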

AWS CloudWatch vs. Third-Party Monitoring Tools

CloudWatch is powerful but has real limitations in production:

Feature	CloudWatch	Datadog / New Relic	Better Stack
Dashboards	Basic, limited widgets	Rich, customizable	Clean, focused on uptime
Multi-cloud	AWS only	AWS + GCP + Azure + K8s	URL/API based (any)
On-call management	Via SNS only	Built-in rotation/escalation	Built-in schedules
Log analytics	Logs Insights (expensive)	Powerful, ML-based	Logtail (modern UI)
APM / Tracing	X-Ray (limited)	Full APM + profiling	Not included
Pricing	$0.30/metric/mo	$18+/host/mo	$25/mo for uptime+logs
Setup complexity	Low (built into AWS)	Medium (agent install)	Low (URL-based)

Recommendation: Use CloudWatch for native AWS metric collection and alerting. Add Better Stack or Datadog for user-facing uptime monitoring, cross-service dashboards, and on-call management. Don't pay for both a full Datadog license AND CloudWatch unless you have a large team — the overlap is significant.

AWS Monitoring Architecture: Production Best Practices

1. Use the Four Golden Signals

Google SRE's Four Golden Signals apply directly to AWS: Latency (ALB TargetResponseTime P99), Traffic (RequestCount), Errors (5XX rate), and Saturation (CPU, queue depth, connection pool usage). If you monitor nothing else, monitor these four.

2. Tag Everything for Cost Attribution

AWS resources without tags are unmonitorable at scale. Enforce tagging policies (Environment: prod/staging/dev, Team, Service) and filter CloudWatch metrics and alarms by tag. This lets you drill into per-service costs and per-team incident rates.

3. Use AWS Health Events

Subscribe to AWS Health events in your region to get proactive notifications about infrastructure events affecting your account. Set up an EventBridge rule to forward Health events to Slack or your on-call tool:

# EventBridge rule pattern for AWS Health events
{
  "source": ["aws.health"],
  "detail-type": ["AWS Health Event"],
  "detail": {
    "eventTypeCategory": ["issue", "scheduledChange"]
  }
}
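
To check which events a pattern like this would select before deploying it, a small local matcher can help. The sketch below implements only the subset of EventBridge matching this pattern uses (a list means "any of these values", a nested object recurses); it is a simplified illustration, not EventBridge's full matching logic. The helper names, the rule name, and the sample event are hypothetical.

```python
import json

HEALTH_PATTERN = {
    "source": ["aws.health"],
    "detail-type": ["AWS Health Event"],
    "detail": {"eventTypeCategory": ["issue", "scheduledChange"]},
}


def matches(pattern, event):
    """Simplified exact-value matching: lists mean 'any of', dicts recurse."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


def build_put_rule_args(rule_name):
    """Arguments for the EventBridge PutRule API; the pattern is sent as a JSON string."""
    return {
        "Name": rule_name,
        "EventPattern": json.dumps(HEALTH_PATTERN),
        "State": "ENABLED",
    }


sample_event = {  # hypothetical, trimmed-down Health event
    "source": "aws.health",
    "detail-type": "AWS Health Event",
    "detail": {"eventTypeCategory": "issue", "service": "EC2"},
}
print(matches(HEALTH_PATTERN, sample_event))
```

The rule still needs a target (an SNS topic, a Lambda function, or similar) before anything is forwarded to Slack or an on-call tool.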

4. Implement Log Retention Policies

CloudWatch Logs default to infinite retention at $0.03/GB/month stored. For high-volume Lambda functions this adds up fast. Set retention policies:

# Set 30-day retention on a log group (CLI)
aws logs put-retention-policy \
  --log-group-name /aws/lambda/my-function \
  --retention-in-days 30
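
A quick back-of-the-envelope calculation shows why retention matters. The sketch below estimates monthly storage cost at the $0.03/GB rate quoted above, once a log group holds a full retention window. It assumes stored size equals ingested size, which overstates the bill if logs are stored compressed; the helper name and the 10 GB/day figure are hypothetical.

```python
STORAGE_PRICE_PER_GB_MONTH = 0.03  # USD, the rate quoted in this guide


def steady_state_storage_cost(daily_ingest_gb, retention_days):
    """Monthly storage cost once the log group holds `retention_days` of data."""
    if retention_days <= 0:
        raise ValueError("retention_days must be positive")
    stored_gb = daily_ingest_gb * retention_days
    return stored_gb * STORAGE_PRICE_PER_GB_MONTH


# Hypothetical workload ingesting 10 GB of logs per day.
with_30_day_retention = steady_state_storage_cost(10, 30)    # 300 GB held
after_one_year_no_expiry = steady_state_storage_cost(10, 365)  # 3,650 GB held
print(round(with_30_day_retention, 2), round(after_one_year_no_expiry, 2))
```

Under these assumptions, a year without a retention policy costs roughly twelve times as much per month as a 30-day window, and the gap keeps growing.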

5. Infrastructure as Code for Monitoring

Define all CloudWatch alarms, dashboards, and log metrics in CloudFormation or Terraform — not manually. Manual monitoring setup is impossible to audit, reproduce, or review. See our Monitoring as Code guide for patterns and examples.
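
One lightweight way to approach this is to keep alarm definitions as data and generate a CloudFormation template from them. The sketch below does that for simple single-metric alarms. It is an illustration under assumptions: the spec format, helper names, alarm name, instance ID, and topic ARN are all hypothetical, and a real setup might use Terraform or the CDK instead.

```python
import json
import re


def logical_id(alarm_name):
    """CloudFormation logical IDs must be alphanumeric, so drop other characters."""
    words = re.sub(r"[^A-Za-z0-9]", " ", alarm_name).split()
    return "".join(word.capitalize() for word in words) + "Alarm"


def alarm_resource(spec):
    """Translate one alarm spec into an AWS::CloudWatch::Alarm resource."""
    return {
        "Type": "AWS::CloudWatch::Alarm",
        "Properties": {
            "AlarmName": spec["name"],
            "Namespace": spec["namespace"],
            "MetricName": spec["metric"],
            "Dimensions": [
                {"Name": key, "Value": value}
                for key, value in spec["dimensions"].items()
            ],
            "Statistic": spec.get("statistic", "Average"),
            "Period": spec.get("period", 300),
            "EvaluationPeriods": spec.get("evaluation_periods", 1),
            "Threshold": spec["threshold"],
            "ComparisonOperator": spec["comparison"],
            "AlarmActions": [spec["sns_topic_arn"]],
        },
    }


def build_template(specs):
    """Assemble a CloudFormation template containing one alarm per spec."""
    resources = {logical_id(spec["name"]): alarm_resource(spec) for spec in specs}
    return {"AWSTemplateFormatVersion": "2010-09-09", "Resources": resources}


SPECS = [
    {
        "name": "api-high-cpu",                                      # hypothetical
        "namespace": "AWS/EC2",
        "metric": "CPUUtilization",
        "dimensions": {"InstanceId": "i-0123456789abcdef0"},         # hypothetical
        "threshold": 80,
        "comparison": "GreaterThanThreshold",
        "sns_topic_arn": "arn:aws:sns:us-east-1:123456789012:oncall",  # hypothetical
    }
]
print(json.dumps(build_template(SPECS), indent=2))
```

Because the specs are plain data, they can be reviewed in pull requests and reused across environments by swapping dimensions and topic ARNs.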

Frequently Asked Questions

What is the best tool for AWS monitoring?

CloudWatch is the default and handles most cases. Add Better Stack for user-facing uptime monitoring and Datadog or Grafana Cloud for cross-service dashboards if your team needs richer visualizations. The right stack depends on team size and budget.

How do I monitor AWS Lambda functions?

Monitor Lambda via CloudWatch Metrics: Errors, Duration (P99), Throttles, ConcurrentExecutions, and IteratorAge for stream triggers. Enable X-Ray for tracing. CloudWatch Logs Insights lets you query function logs for debugging.

How do I reduce AWS CloudWatch costs?

Reduce CloudWatch costs by: 1) Setting log retention policies (30-90 days), 2) Using Logs Insights sparingly and exporting to S3 for archival, 3) Consolidating custom metrics with metric math, 4) Using Contributor Insights only for specific high-traffic APIs, 5) Sampling X-Ray traces at 1-5% on routine, high-volume routes.

How do I set up alerting from CloudWatch to Slack?

Create a CloudWatch Alarm → SNS Topic → Lambda function (or EventBridge → Lambda). The Lambda posts to Slack via webhook. Better Stack and Datadog provide simpler Slack integrations with richer formatting out of the box.
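
The Lambda in that chain can be quite small. The sketch below parses the alarm notification that SNS delivers and posts a short message to a Slack incoming webhook. It is a minimal sketch under assumptions: the environment variable name `SLACK_WEBHOOK_URL` and the message layout are hypothetical, and error handling and retries are left out.

```python
import json
import os
import urllib.request


def format_slack_payload(sns_event):
    """Turn the first SNS record (a CloudWatch alarm notification) into a Slack message."""
    alarm = json.loads(sns_event["Records"][0]["Sns"]["Message"])
    text = (
        f"{alarm['NewStateValue']}: {alarm['AlarmName']}\n"
        f"{alarm['NewStateReason']}"
    )
    return {"text": text}


def lambda_handler(event, context):
    """Lambda entry point: post the formatted alarm to Slack."""
    payload = json.dumps(format_slack_payload(event)).encode("utf-8")
    request = urllib.request.Request(
        os.environ["SLACK_WEBHOOK_URL"],  # assumed env var holding the webhook URL
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return {"statusCode": response.status}
```

Only the standard library is used, so the function can be deployed without packaging extra dependencies. Storing the webhook URL in Secrets Manager or Parameter Store rather than a plain environment variable is a reasonable hardening step.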

What AWS monitoring certifications exist?

AWS Certified DevOps Engineer - Professional and AWS Certified SysOps Administrator - Associate both cover monitoring and observability on AWS in depth. The SysOps certification includes heavy CloudWatch coverage.
