API Gateway Monitoring: Complete Guide 2026

Your API gateway is the front door to your backend services. Every request goes through it. When it degrades, every service degrades. This guide covers the metrics that matter, how to monitor the major gateways (Kong, AWS, Nginx), and how to build alerting that catches issues before your users do.

What is an API Gateway?

An API gateway is a reverse proxy that sits between clients and your backend services. It handles request routing, authentication, rate limiting, SSL termination, caching, and observability. Common gateways: AWS API Gateway, Kong, Nginx, Traefik, Envoy, Azure API Management, and Apigee.

The 8 Essential API Gateway Metrics

Request Rate

Must track

Total requests per second (RPS) passing through the gateway

Alert threshold: Alert on sudden drops (>50% decrease) or unexpected spikes (>3x baseline)

Why it matters: Sudden drops indicate upstream failures or routing issues; spikes indicate a traffic surge or a DDoS attack
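As a rough illustration of the drop and spike thresholds above, the check can be written as a ratio against a baseline. This is a minimal sketch: `classify_request_rate` is a hypothetical helper, and how you compute the baseline (for example, the same hour last week) is an assumption left to your metrics store.

```python
def classify_request_rate(current_rps: float, baseline_rps: float) -> str:
    """Compare current RPS to a baseline and return 'drop', 'spike', or 'ok'."""
    if baseline_rps <= 0:
        return "ok"  # no usable baseline yet, so there is nothing to compare against
    ratio = current_rps / baseline_rps
    if ratio < 0.5:  # more than a 50% decrease from baseline
        return "drop"
    if ratio > 3.0:  # more than 3x baseline
        return "spike"
    return "ok"
```

For example, `classify_request_rate(40, 100)` returns `"drop"` and `classify_request_rate(350, 100)` returns `"spike"`.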

Error Rate (5xx)

Must track

Percentage of requests returning 5xx server errors

Alert threshold: Alert when 5xx rate exceeds 1% over 5 minutes

Why it matters: Gateway-level 5xx errors mean upstream services are failing or the gateway itself is misconfigured

Error Rate (4xx)

Must track

Percentage of requests returning 4xx client errors

Alert threshold: Alert when 4xx rate exceeds 10% (may indicate auth issues or client bugs)

Why it matters: High 401/403 rates suggest auth service problems; high 429 suggests rate limiting is too aggressive
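The two error-rate thresholds above (1% for 5xx, 10% for 4xx) can be evaluated from per-status request counts over a window. The sketch below assumes you already have those counts for the last 5 minutes; `error_rates` and `breached` are illustrative names, not part of any gateway's API.

```python
from collections import Counter


def error_rates(status_counts: dict[int, int]) -> dict[str, float]:
    """Return the 4xx and 5xx share of all requests in the window."""
    total = sum(status_counts.values())
    if total == 0:
        return {"4xx": 0.0, "5xx": 0.0}
    by_class = Counter()
    for status, count in status_counts.items():
        by_class[status // 100] += count  # 404 -> 4, 502 -> 5
    return {"4xx": by_class[4] / total, "5xx": by_class[5] / total}


def breached(rates: dict[str, float]) -> dict[str, bool]:
    """Apply the thresholds from this guide: 5xx > 1%, 4xx > 10%."""
    return {"5xx": rates["5xx"] > 0.01, "4xx": rates["4xx"] > 0.10}


window = {200: 970, 401: 10, 502: 20}  # hypothetical counts for one window
rates = error_rates(window)
print(rates, breached(rates))
```

With these sample counts the 5xx rate is 2%, which breaches the 1% threshold, while the 4xx rate of 1% does not.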

Latency (P95/P99)

Must track

Response time at the 95th and 99th percentile

Alert threshold: Alert when P95 exceeds your SLA target (e.g., >500ms for most APIs)

Why it matters: P99 catches tail latency that averages hide — this is what your worst-case users experience
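To see why averages hide tail latency, here is a small sketch using the nearest-rank percentile method on made-up samples. The `percentile` helper and the sample data are assumptions for illustration; production systems usually derive percentiles from histograms rather than raw samples.

```python
import math


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# 98 fast requests and 2 very slow ones (hypothetical data, in milliseconds)
latencies_ms = [100] * 98 + [4000, 6000]
mean_ms = sum(latencies_ms) / len(latencies_ms)

print(mean_ms)                       # 198.0
print(percentile(latencies_ms, 95))  # 100
print(percentile(latencies_ms, 99))  # 4000
```

The mean of 198 ms sits comfortably under a 500 ms target, and even P95 looks healthy, yet P99 shows that some users wait 4 seconds.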

Upstream Latency

Must track

Time spent waiting for the upstream backend to respond

Alert threshold: Alert when upstream latency consistently exceeds the gateway's own processing latency by more than 200%

Why it matters: Isolates whether slowness is in the gateway layer or the backend service
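One way to attribute slowness to a layer, assuming your gateway reports total request latency and upstream latency separately, is to subtract one from the other. `latency_breakdown` is a hypothetical helper and the field names are illustrative.

```python
def latency_breakdown(request_ms: float, upstream_ms: float) -> dict[str, float]:
    """Split total request latency into gateway overhead and upstream time."""
    gateway_ms = max(request_ms - upstream_ms, 0.0)  # clamp in case of clock skew
    upstream_share = upstream_ms / request_ms if request_ms > 0 else 0.0
    return {"gateway_ms": gateway_ms, "upstream_share": upstream_share}


print(latency_breakdown(500.0, 450.0))
```

Here 450 of 500 ms is spent in the backend, so tuning the gateway would recover at most 50 ms.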

Active Connections

Must track

Number of concurrent connections being handled

Alert threshold: Alert at 80% of configured max connections

Why it matters: Connection pool exhaustion causes new requests to fail immediately with connection refused

Cache Hit Rate

Must track

Percentage of responses served from gateway cache

Alert threshold: Alert if cache hit rate drops >20% from baseline

Why it matters: Sudden cache miss spikes increase load on backend services and increase latency
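The cache hit rate and its drop from baseline can be computed from hit and miss counters. This sketch interprets "drops >20% from baseline" as a relative drop, which is an assumption; if you mean percentage points, compare the raw difference instead. Both function names are illustrative.

```python
def cache_hit_rate(hits: int, misses: int) -> float:
    """Share of cacheable responses served from cache."""
    total = hits + misses
    return hits / total if total else 0.0


def hit_rate_dropped(current: float, baseline: float, max_relative_drop: float = 0.20) -> bool:
    """True when the hit rate has fallen more than max_relative_drop below baseline."""
    if baseline <= 0:
        return False
    return (baseline - current) / baseline > max_relative_drop
```

For example, a fall from 0.8 to 0.6 is a 25% relative drop and would alert, while a fall from 0.8 to 0.7 (12.5%) would not.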

Rate Limit Violations

Must track

Count of requests rejected due to rate limiting (429 responses)

Alert threshold: Alert on unexpected spikes in rate limiting

Why it matters: Excessive rate limiting may indicate misconfigured limits or a client with a bug making too many calls

📡
Recommended

External monitoring to complement your gateway metrics

Internal gateway metrics tell you what's happening inside. Better Stack adds external synthetic checks — verifying your API works end-to-end from the client's perspective.

Try Better Stack Free →

Monitoring by Gateway Type

AWS API Gateway

AWS API Gateway automatically sends metrics to CloudWatch. Enable detailed CloudWatch metrics in your API settings for per-resource and per-method granularity (default is aggregate only).

Key CloudWatch metrics:

Count, Latency, IntegrationLatency

4XXError, 5XXError

CacheHitCount, CacheMissCount

Enable X-Ray tracing to trace requests through API Gateway to Lambda/ECS backends
Set CloudWatch alarms on 5XXError and P99 Latency breaching SLA thresholds
Use API Gateway access logs for per-request detail (log to CloudWatch Logs)
Consider AWS Managed Grafana for dashboarding CloudWatch metrics
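As a sketch of the 5XXError alarm mentioned above, the parameters below could be passed to boto3's CloudWatch `put_metric_alarm`. It assumes a REST API (which reports the `ApiName` and `Stage` dimensions) and relies on the `Average` statistic of `5XXError` representing the error rate. The alarm name, API name, and stage are hypothetical, and `boto3` is imported lazily so the parameter builder can be inspected without AWS credentials.

```python
def five_xx_alarm_params(api_name: str, stage: str, threshold: float = 0.01) -> dict:
    """Build put_metric_alarm arguments for a 5xx rate above `threshold` over 5 minutes."""
    return {
        "AlarmName": f"{api_name}-{stage}-5xx-rate",  # hypothetical naming scheme
        "Namespace": "AWS/ApiGateway",
        "MetricName": "5XXError",
        "Dimensions": [
            {"Name": "ApiName", "Value": api_name},
            {"Name": "Stage", "Value": stage},
        ],
        "Statistic": "Average",       # average of 0/1 samples, i.e. the error rate
        "Period": 300,                # one 5-minute window
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # no traffic should not trigger the alarm
    }


def create_alarm(params: dict) -> None:
    """Create or update the alarm; requires boto3 and AWS credentials."""
    import boto3

    boto3.client("cloudwatch").put_metric_alarm(**params)
```

Calling `create_alarm(five_xx_alarm_params("orders-api", "prod"))` would create the alarm; add `AlarmActions` with an SNS topic ARN if you want notifications.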

Kong Gateway

Kong exposes Prometheus metrics via the Prometheus plugin. Install it globally to get metrics for all services and routes.

# Enable Prometheus plugin globally
curl -X POST http://localhost:8001/plugins \
  --data "name=prometheus"

# Scrape endpoint
curl http://localhost:8001/metrics

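Key metrics: kong_http_requests_total (labeled by status code) and the latency histograms. Kong 3.x exposes these as kong_request_latency_ms, kong_upstream_latency_ms, and kong_kong_latency_ms; 2.x releases used a single kong_latency histogram with a type label (request/upstream/kong) and named the request counter kong_http_status. On 3.x, status-code and latency metrics are disabled by default and are turned on with the plugin's status_code_metrics and latency_metrics options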
Use Grafana with the official Kong dashboard (ID: 7424) for instant visualization
Kong Vitals (Enterprise) adds built-in dashboards without Prometheus setup
Alert on kong_datastore_reachable gauge dropping to 0 — means gateway can't reach its database
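To show how the scraped output can be turned into a per-service 5xx share, here is a minimal parser for the Prometheus text format. The sample payload is hypothetical, and the `service` and `code` label names are assumptions about your Kong version's output; check a real scrape before relying on them. In practice you would let Prometheus do this with a rate query, since these counters are cumulative since startup.

```python
import re
from collections import defaultdict

SAMPLE = """\
# TYPE kong_http_requests_total counter
kong_http_requests_total{service="orders",route="orders-v1",code="200",source="service"} 9500
kong_http_requests_total{service="orders",route="orders-v1",code="502",source="service"} 500
kong_http_requests_total{service="users",route="users-v1",code="200",source="service"} 2000
kong_datastore_reachable 1
"""

LINE = re.compile(r'^kong_http_requests_total\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def five_xx_share_by_service(text: str) -> dict[str, float]:
    """Return the cumulative share of 5xx responses for each service."""
    totals: dict[str, float] = defaultdict(float)
    errors: dict[str, float] = defaultdict(float)
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if not match:
            continue  # comments and other metrics are skipped
        labels = dict(LABEL.findall(match.group("labels")))
        value = float(match.group("value"))
        service = labels.get("service", "unknown")
        totals[service] += value
        if labels.get("code", "").startswith("5"):
            errors[service] += value
    return {name: errors[name] / totals[name] for name in totals if totals[name] > 0}


print(five_xx_share_by_service(SAMPLE))
```

With the sample payload, `orders` comes out at 0.05 (5%) and `users` at 0.0.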

Nginx / Nginx Plus

Open-source Nginx provides basic metrics via stub_status. Nginx Plus adds a full JSON metrics API at /api/ with per-upstream metrics.

# nginx.conf: enable stub_status
location /nginx_status {
    stub_status;
    allow 127.0.0.1;
    deny all;
}

Use nginx-prometheus-exporter to expose stub_status metrics to Prometheus
For access log analysis: ship to Datadog, Grafana Loki, or Elastic for per-route metrics
Nginx Plus /api/ endpoint gives per-upstream server health, active connections, and error counts
Monitor worker process count — sudden drop means Nginx is crashing and restarting
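If you want to read stub_status directly rather than through an exporter, its plain-text output is simple to parse. The sketch below works on a sample response; `parse_stub_status` is an illustrative helper, and fetching the page from `/nginx_status` is left out.

```python
import re

SAMPLE = """Active connections: 291
server accepts handled requests
 16630948 16630948 31070465
Reading: 6 Writing: 179 Waiting: 106
"""


def parse_stub_status(text: str) -> dict[str, int]:
    """Parse the plain-text stub_status page into a dict of integers."""
    active = re.search(r"Active connections:\s+(\d+)", text)
    counters = re.search(r"^[ \t]*(\d+)[ \t]+(\d+)[ \t]+(\d+)[ \t]*$", text, re.MULTILINE)
    states = re.search(r"Reading:\s+(\d+)\s+Writing:\s+(\d+)\s+Waiting:\s+(\d+)", text)
    if not (active and counters and states):
        raise ValueError("unrecognized stub_status output")
    accepts, handled, requests = (int(g) for g in counters.groups())
    reading, writing, waiting = (int(g) for g in states.groups())
    return {
        "active": int(active.group(1)),
        "accepts": accepts,
        "handled": handled,
        "requests": requests,
        "dropped": accepts - handled,  # non-zero means connections were dropped
        "reading": reading,
        "writing": writing,
        "waiting": waiting,
    }


print(parse_stub_status(SAMPLE))
```

A growing `dropped` value is a useful signal alongside active connections, because it indicates Nginx hit a resource limit such as worker_connections.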

Alerting Strategy: What to Alert On

Most teams either under-alert (miss real incidents) or over-alert (alert fatigue kills response times). Use this tiered approach:

P0 — Page immediately

  • 5xx error rate > 5% for 2+ consecutive minutes
  • Gateway uptime check failing (gateway unreachable)
  • P99 latency > 5s for 5+ minutes
  • Active connections > 95% of configured max

P1 — Alert during business hours

  • 5xx error rate > 1% sustained over 10 minutes
  • P95 latency > your SLA threshold (e.g., 500ms)
  • Authentication failure rate > 20%
  • Rate limit violations > 2x baseline

P2 — Daily digest / dashboard review

  • Cache hit rate trending down week-over-week
  • 4xx rate gradually increasing (could indicate client-side bug)
  • Traffic volume anomalies (unexpected drops in off-hours)
  • Upstream latency creeping up (early warning of backend degradation)
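The tiers above can be expressed as a small routing function. This sketch covers only a subset of the P0 and P1 rules and leaves P2 to dashboards; `GatewaySnapshot` and its field names are hypothetical, and the 500 ms SLA default is the example figure used in this guide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GatewaySnapshot:
    error_rate_5xx: float          # fraction, e.g. 0.02 for 2%
    error_5xx_minutes: int         # how long the current 5xx rate has been sustained
    p99_latency_ms: float
    p99_minutes: int               # how long the current P99 has been sustained
    p95_latency_ms: float
    connection_utilization: float  # active connections / configured max
    gateway_reachable: bool = True


def alert_tier(s: GatewaySnapshot, p95_sla_ms: float = 500.0) -> Optional[str]:
    """Return 'P0', 'P1', or None for a snapshot of gateway metrics."""
    if (
        not s.gateway_reachable
        or (s.error_rate_5xx > 0.05 and s.error_5xx_minutes >= 2)
        or (s.p99_latency_ms > 5000 and s.p99_minutes >= 5)
        or s.connection_utilization > 0.95
    ):
        return "P0"  # page immediately
    if (s.error_rate_5xx > 0.01 and s.error_5xx_minutes >= 10) or s.p95_latency_ms > p95_sla_ms:
        return "P1"  # alert during business hours
    return None
```

Checking P0 conditions first means a snapshot that satisfies both tiers pages once rather than raising two alerts.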

API Gateway Monitoring Tools Comparison

| Tool | Best For | Gateway Support | Starting Price |
|---|---|---|---|
| Datadog | Enterprise full-stack observability | AWS, Kong, Nginx, Envoy, HAProxy | $15/host/mo |
| Better Stack | Uptime + log monitoring combo | Any (external health checks) | $24/mo |
| Grafana + Prometheus | Open-source, self-hosted | Kong, Nginx, Envoy, Traefik | Free (infra costs) |
| New Relic | APM + infrastructure monitoring | AWS API GW, Nginx, Kong | Free tier available |
| AWS CloudWatch | AWS API Gateway native monitoring | AWS API Gateway only | Pay per metric/alarm |
| Dynatrace | AI-powered anomaly detection | Nginx, Kong, AWS, Traefik | $69/host/mo |

Frequently Asked Questions

What metrics should I monitor for an API gateway?

The essentials: request rate, 5xx error rate, 4xx error rate, P95/P99 latency, upstream latency, active connections, cache hit rate, and rate limit violations. Start with error rate and P99 latency, the two metrics that most directly reflect user experience.

How do I monitor AWS API Gateway?

Enable detailed CloudWatch metrics in API Gateway settings. Set up alarms on 5XXError (alert >1%) and Latency P99 (alert when exceeding your SLA). Enable X-Ray tracing to see full request traces through to Lambda/backends. Use API Gateway access logs for per-request detail.

What is the difference between API gateway monitoring and API monitoring?

Gateway monitoring covers infrastructure concerns: request routing, rate limiting, auth, and traffic management. API monitoring covers business logic: does the API return correct data, are responses valid, is the service doing what it should? Both are essential — use synthetic API tests (Better Stack, Checkly) alongside gateway metrics.

How do I reduce alert noise in API gateway monitoring?

Use anomaly detection instead of static thresholds for traffic-based alerts. Require 2+ consecutive breach minutes before alerting to filter transient spikes. Tier your alerts (P0/P1/P2) so only critical thresholds page on-call. Group related alerts to prevent alert storms during cascade failures.
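The "2+ consecutive breach minutes" rule can be sketched as a small stateful check that only fires after a streak of breaching samples. `ConsecutiveBreachAlert` is a hypothetical class, and it assumes you feed it one sample per minute.

```python
class ConsecutiveBreachAlert:
    """Fire only after `required` consecutive samples exceed `threshold`."""

    def __init__(self, threshold: float, required: int = 2) -> None:
        self.threshold = threshold
        self.required = required
        self.streak = 0

    def observe(self, value: float) -> bool:
        """Record one sample and return True while the alert should be firing."""
        if value > self.threshold:
            self.streak += 1
        else:
            self.streak = 0  # any healthy sample resets the streak
        return self.streak >= self.required


alert = ConsecutiveBreachAlert(threshold=0.05, required=2)
samples = [0.08, 0.01, 0.07, 0.09, 0.02]  # per-minute 5xx rates (made up)
print([alert.observe(v) for v in samples])  # [False, False, False, True, False]
```

The isolated spike in the first minute never fires; only the two breaching minutes in a row do.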

Should I monitor my API gateway externally as well as internally?

Yes, always. Internal metrics tell you what's happening inside the gateway. External synthetic checks (from a monitoring service outside your network) tell you what your users actually experience. A gateway can report healthy internally while users see timeouts caused by network or DNS issues in front of it.
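A minimal external check, run from a machine outside your network, might look like the sketch below. It uses only the Python standard library; `check_endpoint` and the URL in the usage note are hypothetical, and the `opener` parameter exists only so the function can be exercised without network access.

```python
import time
import urllib.error
import urllib.request


def check_endpoint(url: str, timeout_s: float = 5.0, opener=urllib.request.urlopen) -> dict:
    """Request `url` once and report status, latency, and any connection error."""
    started = time.monotonic()
    status = None
    error = None
    try:
        with opener(url, timeout=timeout_s) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        status = exc.code  # the gateway answered, but with a 4xx/5xx
    except OSError as exc:
        error = str(exc)   # DNS failure, refused connection, timeout, TLS error
    latency_ms = (time.monotonic() - started) * 1000
    ok = status is not None and 200 <= status < 400
    return {"ok": ok, "status": status, "error": error, "latency_ms": latency_ms}
```

Calling `check_endpoint("https://api.example.com/health")` on a schedule and alerting when `ok` is false gives you the user's view that internal metrics cannot provide.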
