LLM API Monitoring Guide 2026: How to Monitor OpenAI, Anthropic, Groq & More

A practical guide to monitoring AI inference APIs in production — covering uptime, latency, cost tracking, rate limit handling, and multi-provider failover strategies.

12 min read

📡 Monitor your APIs — know when they go down before your users do

Better Stack checks uptime every 30 seconds with instant Slack, email & SMS alerts. Free tier available.

Start Free →

Affiliate link — we may earn a commission at no extra cost to you

Two years ago, monitoring your AI stack meant checking whether your GPU server was still alive. Today, most teams have ripped out their own infrastructure in favor of cloud inference APIs — OpenAI, Anthropic Claude, Google Gemini, Groq, and a dozen others. The promise: faster time to market, no GPU management, instant scaling.

The reality: your production SLA now depends on the uptime of five different third-party APIs, any of which can go down without warning.

This guide covers everything engineering teams need to monitor LLM APIs properly — from the right metrics to collect, to setting up automated alerts, to building failover pipelines that survive provider outages.

Why LLM API Monitoring Is Different from Traditional API Monitoring

Standard API monitoring asks: is this endpoint responding in under 200ms? LLM API monitoring has to answer harder questions:

Traditional API Monitoring

  • Is the endpoint up? (200 vs 503)
  • Response time < threshold
  • Error rate < threshold
  • Request throughput

LLM API Monitoring (adds)

  • Time to First Token (TTFT)
  • Tokens per second (streaming speed)
  • Rate limit headroom (TPM/RPM)
  • Cost per request ($)
  • Per-model version tracking
  • Context length utilization

A standard uptime monitor will tell you OpenAI is "up" because the endpoint returns 200. It won't tell you that TTFT spiked from 400ms to 8 seconds — which from a user experience perspective is effectively down.

The 5 Core LLM API Metrics to Track

1. Uptime / Availability

The baseline. Track HTTP 200 vs 4xx vs 5xx responses. For LLM APIs, a 429 Too Many Requests is not an outage — it's a rate limit and should be tracked separately. A 503 Service Unavailable or connection timeout is a true availability failure.

Alert threshold: Page immediately on any 503 or timeout. Alert at 70% rate-limit utilization before you hit hard caps.

2. Time to First Token (TTFT)

For streaming responses, TTFT is how long the user waits before text starts appearing. A p95 TTFT above 3–4 seconds degrades user experience significantly, even if the full response eventually completes successfully.

Track TTFT as a percentile distribution (p50, p95, p99). Sudden spikes in p95 TTFT are often the first signal of upstream provider degradation — before it escalates to full outage.
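A minimal sketch of the percentile aggregation, using only the Python standard library (the sample window and values here are illustrative, not real provider data):

```python
import statistics

def ttft_percentiles(samples_ms):
    """Summarize Time-to-First-Token samples (in ms) into p50/p95/p99.

    statistics.quantiles with n=100 returns the 1st..99th percentile
    cut points, so index 49 -> p50, index 94 -> p95, index 98 -> p99.
    """
    q = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": q[49], "p95": q[94], "p99": q[98]}

# Example window: most requests are fast, but the tail degrades.
# p95 exposes the degradation while p50 still looks healthy.
samples = [400] * 90 + [4000] * 10
stats = ttft_percentiles(samples)
```

This is exactly why the guide recommends percentiles over averages: in the example window the median is still 400 ms while p95 sits at 4 seconds.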

3. Token Throughput (Tokens/Second)

For applications where generation speed matters (real-time chat, code completion), track tokens per second for streaming responses. A significant drop in TPS with no corresponding error increase suggests provider-side throttling or infrastructure degradation.
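One way to compute both TTFT and tokens/second from a stream is to record the arrival time of each chunk and derive the metrics afterward. A sketch with the timing logic kept pure so it works with any streaming client (the timestamps below are synthetic):

```python
def stream_stats(request_start, token_times):
    """Compute TTFT and tokens/second from token arrival timestamps.

    request_start: monotonic time the request was sent
    token_times:   monotonic arrival time of each streamed token/chunk
    """
    if not token_times:
        return {"ttft_s": None, "tps": 0.0}
    ttft = token_times[0] - request_start
    gen_window = token_times[-1] - token_times[0]
    # TPS over the generation window only (excludes TTFT);
    # guard against division by zero on single-token streams.
    tps = (len(token_times) - 1) / gen_window if gen_window > 0 else 0.0
    return {"ttft_s": ttft, "tps": tps}

# Synthetic stream: 0.5 s TTFT, then 100 tokens at 10 ms intervals.
times = [0.5 + 0.01 * i for i in range(100)]
stats = stream_stats(0.0, times)
```

Separating TTFT from the generation window matters: a provider can have a healthy TPS but a degraded queue (high TTFT), and the two failure modes have different causes.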

4. Error Rate by Type

Segment your error rates by status code:

  • 429: Rate limit — monitor TPM/RPM headroom, implement backoff
  • 401: Auth failure — check key expiry and rotation
  • 503/504: Provider outage — trigger failover
  • 400: Bad request — usually a prompt formatting issue, not provider
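The mapping above can be encoded directly, along with a jittered backoff schedule for the 429 case. A sketch (the action names and backoff parameters are illustrative choices, not a standard):

```python
import random

def classify(status_code):
    """Map an LLM API status code to a monitoring action."""
    if status_code == 429:
        return "backoff"        # rate limit: retry with backoff, watch headroom
    if status_code == 401:
        return "check_auth"     # key expiry / rotation problem
    if status_code in (503, 504):
        return "failover"       # provider outage: switch providers
    if status_code == 400:
        return "fix_request"    # prompt/payload issue on our side
    return "ok" if status_code == 200 else "investigate"

def backoff_delays(attempts, base=1.0, cap=30.0, seed=None):
    """Exponential backoff with full jitter: delay in [0, min(cap, base*2^n))."""
    rng = random.Random(seed)
    return [rng.uniform(0, min(cap, base * 2 ** n)) for n in range(attempts)]
```

Full jitter (rather than a fixed doubling schedule) spreads retries out, which helps avoid synchronized retry storms when a rate limit hits many workers at once.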

5. Cost per Request

Every LLM API call consumes tokens that cost money. Track input and output token counts per request and correlate with your billing to detect cost anomalies. A 3x spike in tokens per request often signals a prompt engineering regression or a context window that's growing uncontrolled.

📡
Recommended

All Your LLM API Metrics in One Dashboard

Better Stack provides unified monitoring for OpenAI, Anthropic, Groq, and your own API endpoints. Set up LLM uptime alerts in minutes.

Try Better Stack Free →

How to Monitor Each Major LLM Provider

Monitoring OpenAI API

OpenAI's official status page is status.openai.com. For automated monitoring, probe the /v1/models endpoint — it's lightweight and reflects API availability without consuming tokens.

# Lightweight OpenAI availability check
curl https://api.openai.com/v1/models \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  --max-time 5 -o /dev/null -w "%{http_code}"

For functional monitoring (does inference actually work?), send a minimal 1-token completion with max_tokens: 1 to minimize cost while verifying end-to-end functionality.
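A sketch of such a functional probe using only the standard library. The payload shape follows the OpenAI chat completions API; the probe model name is just an example:

```python
import json
import time
import urllib.error
import urllib.request

def build_probe(model="gpt-4o-mini"):
    """Minimal 1-token chat completion payload -- cheapest end-to-end check."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": "ping"}],
        "max_tokens": 1,
    }

def run_probe(api_key, timeout=5):
    """POST the probe and report status code plus wall-clock latency."""
    req = urllib.request.Request(
        "https://api.openai.com/v1/chat/completions",
        data=json.dumps(build_probe()).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    start = time.monotonic()
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as e:
        status = e.code   # 4xx/5xx responses still carry signal (429 vs 503)
    except (urllib.error.URLError, TimeoutError):
        status = None     # connection failure or timeout: treat as down
    return {"status": status, "latency_s": time.monotonic() - start}
```

Run this on a schedule (every 1-5 minutes) and feed the status/latency pairs into the error-rate and TTFT metrics described earlier.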

Track separately: GPT-4o, GPT-4o-mini, and o3 can have different availability profiles during rollouts. Check the OpenAI status page on API Status Check for real-time tracking.

Monitoring Anthropic API (Claude)

Anthropic's status page is status.anthropic.com. Monitor the Messages API endpoint at https://api.anthropic.com/v1/messages. Claude models are versioned (claude-3-5-sonnet-20241022, claude-opus-4-6) — test the specific model version your application uses, not just a generic endpoint.

Important: Anthropic's rate limits are lower than OpenAI's on most tiers. Monitor your requests-per-minute (RPM) and output-tokens-per-minute (OTPM) headroom, as these tend to be the binding constraints. Track the Anthropic API status here.
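Headroom can be read directly from response headers rather than inferred from 429s. The header names below follow Anthropic's `anthropic-ratelimit-*` convention as documented at the time of writing; verify the exact names and granularity against the current docs:

```python
def ratelimit_headroom(headers):
    """Fraction of rate limit remaining, parsed from API response headers.

    Assumes Anthropic's `anthropic-ratelimit-<kind>-limit/-remaining`
    header naming -- check current docs, as names can change.
    """
    out = {}
    for kind in ("requests", "output-tokens"):
        limit = headers.get(f"anthropic-ratelimit-{kind}-limit")
        remaining = headers.get(f"anthropic-ratelimit-{kind}-remaining")
        if limit and remaining:
            out[kind] = int(remaining) / int(limit)
    return out

# 12 of 50 requests remaining -> 24% headroom, i.e. 76% utilization:
# past the 70% warning threshold recommended earlier.
headroom = ratelimit_headroom({
    "anthropic-ratelimit-requests-limit": "50",
    "anthropic-ratelimit-requests-remaining": "12",
})
```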

Monitoring Groq API

Groq's status page is groqstatus.com. The key metric for Groq is tokens per second — it's their primary differentiator. If TPS drops below 100 t/s for small models, investigate immediately; Groq's LPU normally delivers 500–1000 t/s.

Groq has tighter per-model rate limits than OpenAI. Monitor your TPM utilization per model family (Llama 3.3 70B, Llama 3.1 8B, Gemma 2 9B) separately, as limits don't aggregate across models. See the full Groq outage guide for troubleshooting details.
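Since limits don't aggregate, utilization has to be tracked per model. A minimal sliding-window tokens-per-minute tracker (class name and window handling are illustrative, not a Groq API):

```python
import collections
import time

class TPMTracker:
    """Sliding 60 s window of token usage, kept separately per model."""

    def __init__(self):
        # model name -> deque of (timestamp, tokens) events
        self.events = collections.defaultdict(collections.deque)

    def record(self, model, tokens, now=None):
        now = time.monotonic() if now is None else now
        self.events[model].append((now, tokens))

    def tpm(self, model, now=None):
        """Tokens used by `model` in the last 60 seconds."""
        now = time.monotonic() if now is None else now
        q = self.events[model]
        while q and now - q[0][0] > 60:
            q.popleft()   # evict events older than the window
        return sum(tokens for _, tokens in q)

tracker = TPMTracker()
tracker.record("llama-3.3-70b", 4000, now=0)
tracker.record("llama-3.3-70b", 3000, now=30)
tracker.record("llama-3.1-8b", 9000, now=30)
```

Compare each model's `tpm()` against that model's own limit; a shared counter across model families would hide which limit you're about to hit.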

Monitoring Google Gemini API

Google AI Platform status is available at status.cloud.google.com under the AI Platform services. The Gemini API (generativelanguage.googleapis.com) is separate from Vertex AI — monitor both if you use both. Check the Gemini API status here.



Building a Multi-Provider LLM Failover Strategy

The highest-leverage reliability improvement any AI application team can make: implement multi-provider failover so no single LLM provider outage takes down your service.

Option 1: LiteLLM (Recommended)

LiteLLM provides a unified OpenAI-compatible interface across 100+ LLM providers. A single configuration file defines your provider fallback chain:

# litellm_config.yaml
model_list:
  - model_name: my-chat-model
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
  - model_name: my-chat-model
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: my-chat-model
    litellm_params:
      model: groq/llama-3.3-70b-versatile
      api_key: os.environ/GROQ_API_KEY

router_settings:
  routing_strategy: "latency-based-routing"
  num_retries: 2
  fallbacks: [{"my-chat-model": ["my-chat-model"]}]

With this config, a failed OpenAI call automatically retries with Anthropic, then Groq — all transparently.
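The grouping semantics are worth making concrete: every entry sharing a `model_name` forms one deployment group, and a failed call moves to the next deployment in the group. The sketch below mirrors the YAML as plain Python to illustrate that behavior; it is a simplified model of the routing logic, not LiteLLM's actual implementation:

```python
# The YAML's model_list, as Python data (simplified).
MODEL_LIST = [
    {"model_name": "my-chat-model", "model": "openai/gpt-4o"},
    {"model_name": "my-chat-model", "model": "anthropic/claude-sonnet-4-6"},
    {"model_name": "my-chat-model", "model": "groq/llama-3.3-70b-versatile"},
]

def deployments(alias):
    """All concrete deployments registered under one public model alias."""
    return [d["model"] for d in MODEL_LIST if d["model_name"] == alias]

def call_with_chain(alias, send):
    """Try each deployment in order; `send(model)` raises on failure."""
    last_err = None
    for model in deployments(alias):
        try:
            return send(model)
        except Exception as err:
            last_err = err   # remember the failure, fall through to next
    raise RuntimeError(f"all deployments for {alias!r} failed") from last_err
```

Your application code only ever references the alias (`my-chat-model`); which provider actually serves a given request is a routing decision, not an application concern.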

Option 2: Application-Layer Try/Except

For simpler implementations, wrap your LLM calls in a failover function:

import anthropic
from openai import OpenAI

openai_client = OpenAI()
anthropic_client = anthropic.Anthropic()

def call_with_fallback(prompt: str) -> str:
    try:
        resp = openai_client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            timeout=10,
        )
        return resp.choices[0].message.content
    except Exception:
        # Fallback to Anthropic Claude
        msg = anthropic_client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        return msg.content[0].text

Alerting on Provider Degradation

Failover is reactive. Monitoring is proactive. When your monitoring system detects that OpenAI's p95 TTFT has increased from 600ms to 4s, you want your team to know before users notice — even if requests are technically succeeding.

Set up these alert thresholds for each provider:

  • P95 TTFT > 3s → Warning (degraded performance)
  • Error rate > 5% → Warning
  • Error rate > 20% → Critical (page on-call)
  • 503/504 for 3+ consecutive checks → Critical (switch to backup provider)
  • Rate limit utilization > 70% → Warning (consider request throttling)

LLM API Monitoring Tools Compared

| Tool | LLM-specific metrics | Uptime alerts | Cost tracking | Starting price |
|---|---|---|---|---|
| API Status Check | Uptime + TTFT + error rates | ✓ Slack, email, PagerDuty | Via webhook integration | Free tier available |
| Better Stack | Uptime + custom dashboards | ✓ Multi-channel | Custom logs | $24/mo |
| Datadog LLM Observability | Full LLM telemetry (tokens, cost, quality) | ✓ Enterprise alerting | ✓ Built-in | $15/host + usage |
| Langfuse | Token usage, quality scoring, traces | Via integrations | ✓ Built-in | Open source / $29/mo |
| Helicone | Proxy-based: full request logging | Rate limit alerts | ✓ Built-in | Free tier / $20/mo |

For infrastructure-level uptime monitoring (is the API up?), use API Status Check or Better Stack. For application-level LLM observability (token usage, quality, cost per feature), use Langfuse or Helicone.

Conclusion: Treat LLM APIs Like Critical Infrastructure

Every engineering team that has shipped an AI-powered feature has eventually had to answer an urgent Slack message: "Is the AI broken or is OpenAI down?" The teams who answer that question in 30 seconds are the ones with monitoring in place. The ones who take 20 minutes are the ones reading this guide for the first time during an incident.

The setup is minimal: configure uptime checks for each provider you use, set p95 TTFT thresholds, alert at 70% rate limit utilization, and have a tested failover path. That's the entire playbook.

Monitor All Your LLM APIs in One Place

API Status Check tracks OpenAI, Anthropic, Groq, Gemini, and your own endpoints. Get instant alerts when any AI provider degrades.

Start Monitoring for Free →
