Serverless Monitoring Guide: AWS Lambda, Cold Starts & Observability (2026)
Serverless functions are opaque by default — they crash silently, cold starts add invisible latency, and throttling drops requests without surfacing errors. This guide covers how to achieve full observability for AWS Lambda and other serverless platforms.
TL;DR — Serverless Monitoring Checklist
- ✅ Enable Lambda Enhanced Monitoring (costs $0.01/function/month, worth it)
- ✅ Alert on Throttles > 0 — every throttle is a dropped request
- ✅ Track Init Duration (cold starts) separately from execution duration
- ✅ Monitor p99 duration not average — cold starts are outliers that kill tail latency
- ✅ Enable X-Ray active tracing for distributed service map visibility
- ✅ Add external endpoint check for API Gateway endpoints — Lambda errors don't always surface
Why Serverless Is Hard to Monitor
Traditional monitoring assumes long-running processes with stable memory and CPU metrics. Serverless breaks all these assumptions:
Serverless monitoring challenges
- • Cold starts add 100ms–10s of invisible latency
- • Functions scale to zero — no persistent process to monitor
- • Throttling silently drops requests (async invocations surface no HTTP error to the caller)
- • Log correlation across invocations requires request IDs
- • Ephemeral containers make traditional profiling impractical
- • Cost surprises from runaway retry loops
Serverless monitoring wins
- • Per-invocation billing = built-in cost metrics
- • CloudWatch auto-collects basic metrics for free
- • Structured logs easy with JSON + Lambda Powertools
- • Request isolation makes failures containable
- • Dead letter queues provide automatic failure capture
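The structured-logging win above comes from emitting one JSON object per log line so CloudWatch Logs Insights can filter and join on a request ID. A minimal sketch of the idea (a hypothetical helper, not a specific library — Lambda Powertools does this for you):

```typescript
// Hypothetical helper: emit one JSON object per line so CloudWatch Logs
// Insights can filter and join on requestId across invocations.
type LogRecord = {
  level: 'info' | 'error';
  message: string;
  requestId: string;
  [key: string]: unknown;
};

function logJson(
  level: LogRecord['level'],
  message: string,
  requestId: string,
  extra: Record<string, unknown> = {},
): LogRecord {
  const record: LogRecord = { level, message, requestId, ...extra };
  console.log(JSON.stringify(record)); // one JSON object per log line
  return record;
}

const rec = logJson('info', 'order received', 'req-123', { orderId: 42 });
```

Every invocation that logs the same `requestId` can then be stitched together in a single Insights query.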
Core Lambda Metrics Reference
| Metric | CloudWatch Name | Alert Threshold |
|---|---|---|
| Invocations | Invocations | Anomaly detection vs baseline |
| Errors | Errors | Error rate > 1% for 5m (critical) |
| Throttles | Throttles | > 0 for 1m (warning) — every throttle is a dropped call |
| Duration (p99) | Duration | > 80% of timeout setting (warn before timeout kills requests) |
| Init Duration | InitDuration (Enhanced) | > 1s (warn), > 5s (critical — consider SnapStart) |
| Concurrent Executions | ConcurrentExecutions | > 80% of reserved concurrency limit |
| Memory Used | MaxMemoryUsed (Enhanced) | > 80% of configured memory (OOM risk) |
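As an illustration, the table's alert rules can be encoded as a small evaluation function. This is a hedged sketch: the thresholds mirror the table, but the interface and function names are hypothetical, not an AWS SDK API.

```typescript
// Hypothetical metrics snapshot — in practice these come from CloudWatch.
interface LambdaMetrics {
  errorRate: number;            // errors / invocations over the window
  throttles: number;            // sum over 1 minute
  p99DurationMs: number;        // p99 of Duration
  timeoutMs: number;            // configured function timeout
  initDurationMs: number;       // cold start init time (Enhanced Monitoring)
  concurrentExecutions: number;
  reservedConcurrency: number;
  maxMemoryUsedMb: number;
  configuredMemoryMb: number;
}

// Apply the table's thresholds and return the alerts that would fire.
function evaluateAlerts(m: LambdaMetrics): string[] {
  const alerts: string[] = [];
  if (m.errorRate > 0.01) alerts.push('critical: error rate > 1%');
  if (m.throttles > 0) alerts.push('warning: throttles detected — dropped requests');
  if (m.p99DurationMs > 0.8 * m.timeoutMs) alerts.push('warning: p99 duration near timeout');
  if (m.initDurationMs > 5000) alerts.push('critical: cold start > 5s — consider SnapStart');
  else if (m.initDurationMs > 1000) alerts.push('warning: cold start > 1s');
  if (m.concurrentExecutions > 0.8 * m.reservedConcurrency) alerts.push('warning: nearing concurrency limit');
  if (m.maxMemoryUsedMb > 0.8 * m.configuredMemoryMb) alerts.push('warning: OOM risk');
  return alerts;
}
```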
Cold Start Analysis
Cold starts happen when Lambda allocates a new execution environment — downloading your code, initializing the runtime, and running your initialization code. They add latency that's invisible until you look at p99 duration.
| Runtime | Typical Cold Start | Reduction Strategy |
|---|---|---|
| Node.js 20 | 200–500ms | Reduce package size, lazy imports |
| Python 3.12 | 300–700ms | Reduce import count, use Lambda layers |
| Java 21 (standard) | 4–12 seconds | Use SnapStart (reduces to <1s) |
| Java 21 + SnapStart | 100–500ms | Restore from snapshot; enable in function config |
| Go 1.x | 100–200ms | Already fast; minimize init dependencies |
| Rust (custom runtime) | 5–50ms | Best cold start performance available |
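Because cold starts are rare but large, they show up at p99 while leaving the average almost untouched. A toy calculation (synthetic data, not real measurements) makes the effect concrete:

```typescript
// Build a synthetic latency sample: warm requests take warmMs, cold requests
// take warmMs + initMs, and coldRate is the fraction of cold invocations.
function p99WithColdStarts(
  warmMs: number,
  initMs: number,
  coldRate: number,
  n = 10_000,
): number {
  const latencies: number[] = [];
  const coldCount = Math.round(n * coldRate);
  for (let i = 0; i < n; i++) {
    latencies.push(i < coldCount ? warmMs + initMs : warmMs);
  }
  latencies.sort((a, b) => a - b);
  return latencies[Math.floor(0.99 * n)]; // the 99th-percentile sample
}

// With a 2% cold-start rate, p99 is the cold latency (850ms), not the warm 50ms;
// below 1% cold rate, the cold starts fall outside p99 entirely.
```

This is why the checklist says to monitor p99, not the average: a 2% cold-start rate barely moves the mean but sets the entire p99.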
# CloudWatch Insights query: Identify cold start invocations
fields @timestamp, @requestId, @initDuration, @duration, @memorySize
| filter ispresent(@initDuration)
| sort @timestamp desc
| limit 100
# Lambda Powertools for structured logging with cold start detection
import { Logger } from '@aws-lambda-powertools/logger';
import type { APIGatewayEvent, Context } from 'aws-lambda';

const logger = new Logger({ serviceName: 'order-service' });

// addContext() attaches the request ID, function metadata, and a cold_start
// flag to every log line — filter on cold_start=true in CloudWatch Logs Insights
export const handler = async (event: APIGatewayEvent, context: Context) => {
  logger.addContext(context);
  logger.info('Processing request', { event });
};
CloudWatch Alarms Setup
# Terraform: Essential Lambda CloudWatch alarms
resource "aws_cloudwatch_metric_alarm" "lambda_errors" {
  alarm_name          = "${var.function_name}-errors"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "Errors"
  namespace           = "AWS/Lambda"
  period              = 300
  statistic           = "Sum"
  threshold           = var.invocations_per_period * 0.01 # 1% error rate

  dimensions = {
    FunctionName = var.function_name
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
}

resource "aws_cloudwatch_metric_alarm" "lambda_throttles" {
  alarm_name          = "${var.function_name}-throttles"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "Throttles"
  namespace           = "AWS/Lambda"
  period              = 60
  statistic           = "Sum"
  threshold           = 0 # Alert on any throttle

  dimensions = {
    FunctionName = var.function_name
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
}

resource "aws_cloudwatch_metric_alarm" "lambda_duration_p99" {
  alarm_name          = "${var.function_name}-duration-p99"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 3
  extended_statistic  = "p99"
  metric_name         = "Duration"
  namespace           = "AWS/Lambda"
  period              = 300
  threshold           = var.timeout_ms * 0.8 # 80% of timeout

  dimensions = {
    FunctionName = var.function_name
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
}
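The evaluation logic behind these alarms can be sketched in a few lines. This is a simplification: real CloudWatch also supports "M out of N" datapoints-to-alarm and missing-data handling, both omitted here.

```typescript
// Simplified CloudWatch alarm evaluation: the alarm fires only when the last
// `evaluationPeriods` consecutive datapoints all breach the threshold.
function alarmState(
  datapoints: number[],
  threshold: number,
  evaluationPeriods: number,
): 'ALARM' | 'OK' {
  const recent = datapoints.slice(-evaluationPeriods);
  const breaching =
    recent.length === evaluationPeriods && recent.every((v) => v > threshold);
  return breaching ? 'ALARM' : 'OK';
}

// The throttles alarm above uses threshold 0 with a single evaluation period,
// so one throttled minute is enough to fire it.
```

This is also why the error-rate alarm uses two periods: a single noisy five-minute window won't page anyone, but two in a row will.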
Distributed Tracing with X-Ray
AWS X-Ray traces requests across Lambda functions, API Gateway, DynamoDB, SQS, and other AWS services. Enable it at the Lambda configuration level (zero code changes), then add subsegment annotations for key operations.
# Enable X-Ray in SAM template
Resources:
  OrderFunction:
    Type: AWS::Serverless::Function
    Properties:
      Tracing: Active # Enables X-Ray
# Enable in Terraform
resource "aws_lambda_function" "order" {
  # ... function_name, role, and other required arguments omitted
  tracing_config {
    mode = "Active"
  }
}
# Node.js — instrument AWS SDK calls
import AWSXRay from 'aws-xray-sdk';
import { DynamoDB } from '@aws-sdk/client-dynamodb';
// Wrap SDK client to auto-create X-Ray subsegments
const dynamodb = AWSXRay.captureAWSv3Client(new DynamoDB({}));
// Add custom annotations for business context
export const handler = async (event: any) => {
  const segment = AWSXRay.getSegment();
  const subsegment = segment?.addNewSubsegment('order-validation');
  subsegment?.addAnnotation('orderId', event.orderId);
  subsegment?.addAnnotation('customerId', event.customerId);
  subsegment?.addMetadata('orderDetails', event);
  try {
    await validateOrder(event);
    subsegment?.close();
  } catch (err) {
    subsegment?.close(err as Error);
    throw err;
  }
};
Serverless Monitoring Tools Comparison
| Tool | Best For | Lambda Feature | Pricing |
|---|---|---|---|
| AWS CloudWatch | Native AWS monitoring | All Lambda metrics, Insights query, alarms | First 5GB logs free; $0.50/GB after |
| Better Stack | Log + uptime monitoring | Lambda log ingestion, API Gateway uptime checks | Free + $20/mo |
| Datadog Serverless | Enterprise full-stack | Lambda forwarder, enhanced metrics, flamegraphs | $5/million invocations |
| Lumigo | Serverless-specific | Auto-trace, cost insights, cold start visualization | Free 150K traces/mo + $0.50/1M traces |
| New Relic Serverless | Full-stack APM + Lambda | Lambda layer, distributed traces, free 100GB/mo | Free + $0.35/GB |
FAQ
What metrics should I monitor for AWS Lambda?
The seven critical Lambda metrics: Invocations, Errors, Throttles, Duration (p99), Init Duration (cold starts), ConcurrentExecutions, and MaxMemoryUsed. Throttles and cold start duration are the most Lambda-specific concerns — throttles are silent dropped requests, and cold starts create p99 latency outliers invisible in average duration stats.
How do I reduce Lambda cold start times?
Five effective strategies: (1) Lambda SnapStart for Java (reduces 8-12s to under 1s), (2) Keep package size minimal via tree-shaking and esbuild, (3) Move heavy initialization outside the handler (runs once per container), (4) Provisioned Concurrency for latency-critical functions, (5) Consider Rust or Go custom runtimes for sub-100ms cold starts. SnapStart is the biggest single win for Java Lambda.
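Strategy (3) is worth a concrete sketch: anything at module scope runs once per execution environment (i.e., once per cold start), while the handler body runs on every invocation. The counters and client below are illustrative placeholders for expensive setup like database connections.

```typescript
// Counters to demonstrate when each scope runs (illustrative only).
let initCount = 0;
let invocationCount = 0;

// Expensive setup (DB clients, config fetch) belongs here, at module scope:
// it executes once per cold start and is reused by every warm invocation.
const client = (() => {
  initCount++;
  return { query: (q: string) => `ok:${q}` }; // stand-in for a real client
})();

// In a real function this would be `export const handler = ...`
const handler = async () => {
  invocationCount++; // per-invocation work only
  return client.query('select 1');
};
```

Three invocations against one warm environment run the init block once and the handler three times; moving that setup inside the handler would repeat it on every call.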
What is Lambda throttling and how do I fix it?
Throttling occurs when invocations exceed your concurrency limit (default 1,000/region). Synchronous invocations return 429 to the caller; asynchronous invocations are retried silently and dropped after the retry window expires. Fix: request a limit increase, add SQS as a buffer between trigger and function (SQS handles retries), set reserved concurrency intentionally to protect downstream databases, and add exponential backoff in invokers.
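The exponential backoff mentioned above is straightforward to compute. A hedged sketch (the function name and defaults are illustrative; full jitter is a common practice for avoiding retry storms):

```typescript
// Compute retry delays for throttled (429) invokes: exponential growth from
// baseMs, capped at capMs, with optional full jitter to spread out retries.
function backoffDelays(
  attempts: number,
  baseMs = 100,
  capMs = 10_000,
  jitter = true,
): number[] {
  return Array.from({ length: attempts }, (_, i) => {
    const exp = Math.min(capMs, baseMs * 2 ** i);
    return jitter ? Math.random() * exp : exp; // full jitter: uniform in [0, exp)
  });
}

// backoffDelays(5, 100, 10_000, false) → [100, 200, 400, 800, 1600]
```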
How do I trace serverless requests across multiple Lambda functions?
Enable AWS X-Ray active tracing (zero code changes needed). For a vendor-neutral option, use the OpenTelemetry Lambda layer, which auto-instruments function calls and exports to any OTLP backend. You can also pass correlation IDs through event payloads to link logs across invocations even without formal tracing.
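The correlation-ID approach is simple enough to sketch directly (hypothetical helper names; the ID format is an assumption):

```typescript
// Carry a correlation ID through event payloads so logs from chained Lambda
// invocations can be joined in CloudWatch even without X-Ray.
interface CorrelatedEvent {
  correlationId?: string;
  [key: string]: unknown;
}

function withCorrelationId<T extends CorrelatedEvent>(
  event: T,
  newId: () => string = () => `corr-${Date.now()}`, // assumed ID format
): T & { correlationId: string } {
  // Reuse the upstream ID if present; mint one only at the entry point.
  return { ...event, correlationId: event.correlationId ?? newId() };
}

// The entry-point function enriches the event once, then every downstream
// invoke passes the enriched payload along unchanged.
```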
How do I monitor Lambda costs?
Lambda costs = invocations × $0.0000002 + GB-seconds × $0.0000166667. Key optimizations: right-size memory using AWS Lambda Power Tuning tool (sometimes more memory = lower cost due to less duration), enable Cost Explorer tags per function, alert on cost anomalies. A runaway retry loop can multiply costs 100x in minutes — alert on unusual invocation rate.
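The pricing formula above translates directly into code. A hedged sketch (the constants are the x86 on-demand rates quoted in this answer; check current AWS pricing before relying on them):

```typescript
// Estimate monthly Lambda cost from the formula:
// invocations × $0.0000002 + GB-seconds × $0.0000166667
function lambdaCostUsd(
  invocations: number,
  avgDurationMs: number,
  memoryMb: number,
): number {
  const requestCost = invocations * 0.0000002;
  const gbSeconds = invocations * (avgDurationMs / 1000) * (memoryMb / 1024);
  const computeCost = gbSeconds * 0.0000166667;
  return requestCost + computeCost;
}

// 1M invocations at 100ms average on 512MB ≈ $1.03
```

Plugging in a runaway retry loop makes the danger obvious: 100M invocations at the same settings is roughly $103, reached in minutes if retries fan out unchecked.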
Related Guides
Cloud Monitoring Guide
AWS, GCP, and Azure infrastructure monitoring beyond Lambda.
Distributed Tracing Guide
X-Ray, OpenTelemetry, Jaeger — trace requests across services.
API Monitoring at Scale
Monitor API Gateway + Lambda at production traffic levels.
Best Log Management Tools 2026
Compare CloudWatch, Better Stack, Loki for Lambda logs.