Claude / Anthropic · Updated May 2026

Anthropic Claude API Best Practices 2026

Everything you need to run the Claude API reliably in production — rate limits by tier, the infamous 529 error explained, prompt caching, and how to set up monitoring.

Quick Reference

  • Status page: status.anthropic.com
  • 529 = overloaded (not 503) — unique Anthropic error, retry with backoff
  • Free tier: 5 RPM — not production-ready. Upgrade to Scale Tier 1 (pay-as-you-go)
  • Prompt caching saves up to 90% on repeated context — always use for system prompts >1K tokens
  • Haiku has 2.5× the TPM limits of Sonnet — use it for high-volume, latency-sensitive tasks

Claude API Rate Limits by Model and Tier

Anthropic's rate limits apply per API key and are tracked on a per-minute and per-day basis. Limits are based on token consumption, not just request count. A single request with a 50K-token context window can consume half your TPM budget.

Model | Tier | RPM | TPM | TPD
claude-3-5-haiku-20241022 | Build (free) | 5 | 25K | 300K
claude-3-5-haiku-20241022 | Scale Tier 1 | 50 | 100K | 5M
claude-3-5-sonnet-20241022 | Build (free) | 5 | 25K | 300K
claude-3-5-sonnet-20241022 | Scale Tier 1 | 50 | 40K | 2M
claude-3-opus-20240229 | Scale Tier 1 | 50 | 20K | 1M

Important: These are Tier 1 (first paid tier) limits. Higher tiers (2-4) unlock significantly more capacity — Tier 4 reaches 4,000 RPM and 400K TPM for Sonnet. Contact Anthropic to request tier upgrades.

Limits count combined input and output tokens: a request with 10K input and 2K output consumes 12K tokens of your TPM budget.
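To make that accounting concrete, here is a minimal sketch of client-side TPM budgeting. The `TpmBudget` class and its method names are ours, not part of the Anthropic SDK; the 40K figure assumes Sonnet's Tier 1 limit.

```typescript
// Minimal client-side TPM accounting: combined input + output tokens
// counted against a rolling one-minute window. Illustrative only.
class TpmBudget {
  private used = 0;
  private windowStart = Date.now();

  constructor(private readonly limitTpm: number) {}

  private roll(now: number): void {
    // Reset the budget once the one-minute window has elapsed
    if (now - this.windowStart >= 60_000) {
      this.used = 0;
      this.windowStart = now;
    }
  }

  // Would a request with this many input tokens, plus its max_tokens
  // output ceiling, still fit in the current window?
  canSend(inputTokens: number, maxOutputTokens: number, now = Date.now()): boolean {
    this.roll(now);
    return this.used + inputTokens + maxOutputTokens <= this.limitTpm;
  }

  // Record actual usage from the response's `usage` field
  record(inputTokens: number, outputTokens: number, now = Date.now()): void {
    this.roll(now);
    this.used += inputTokens + outputTokens;
  }
}
```

With `new TpmBudget(40_000)`, a 31K-token request is rejected once 12K tokens have already been spent in the current minute.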


Claude API Error Codes: Complete Reference

Anthropic uses standard HTTP codes plus a unique 529 overloaded_error that you won't find in most other APIs. All errors include a machine-readable type field in the JSON body.

400 invalid_request_error · Client error — fix before retrying

Trigger: Bad request body, exceeding context window, invalid model name

Fix request client-side. Check token count against max_tokens limit. Verify model name matches Anthropic's model list.

401 authentication_error · Client error — fix before retrying

Trigger: Missing or invalid API key

Verify API key in console.anthropic.com. Check environment variable is loaded correctly. Ensure key hasn't been revoked.

403 permission_error · Client error — fix before retrying

Trigger: API key lacks permission for this endpoint or model

Check your tier supports the requested model. claude-3-opus requires higher tier than claude-3-haiku. Contact Anthropic support if you believe this is wrong.

404 not_found_error · Client error — fix before retrying

Trigger: Model or resource doesn't exist

Verify model ID against Anthropic's current model list (e.g., "claude-3-5-sonnet-20241022" not "claude-3-5-sonnet"). Some model versions are deprecated.

429 rate_limit_error · Retryable

Trigger: Exceeded requests per minute (RPM) or tokens per minute (TPM)

Check Retry-After header. Implement exponential backoff. Consider switching to Haiku for better TPM limits. Reduce context size to preserve TPM budget.

500 api_error · Retryable

Trigger: Unexpected Anthropic server error

Retry with exponential backoff (max 3 attempts). Check status.anthropic.com. If persistent, open a ticket with the request ID from the response.

529 overloaded_error · Retryable

Trigger: Anthropic servers are at capacity (unique Anthropic error code)

Retry with backoff — usually resolves in seconds to minutes. Check status.anthropic.com for ongoing incidents. Implement fallback to GPT-4o or Gemini if 529s persist >5 minutes.

The 529 Error: Anthropic's Unique Overload Signal

The 529 status code is non-standard and unique to Anthropic. It means their inference infrastructure is at capacity — different from a 503 (service completely down) or 429 (your rate limit exceeded). During 529 events:

  • Anthropic's infrastructure is healthy but oversubscribed — requests are being shed
  • Typically lasts 1-15 minutes, resolves on its own
  • Most common after model launches (Claude 4 launch in April 2026 caused hours of 529s)
  • Do not flood with retries — this makes it worse for everyone
Recommended 529 handling (TypeScript)
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

async function callClaudeWithRetry(
  messages: Anthropic.MessageParam[],
  maxRetries = 4
) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await client.messages.create({
        model: 'claude-3-5-sonnet-20241022',
        max_tokens: 1024,
        messages,
      });
    } catch (error) {
      if (error instanceof Anthropic.APIError) {
        // Retryable: 429 (rate limit), 500 (server error), 529 (overloaded)
        const retryable = [429, 500, 529].includes(error.status);

        if (!retryable || attempt === maxRetries - 1) throw error;

        // Exponential backoff: 2s, 4s, 8s, 16s + jitter
        const delay = Math.pow(2, attempt + 1) * 1000;
        const jitter = Math.random() * 1000;
        await new Promise((r) => setTimeout(r, delay + jitter));
      } else {
        throw error;
      }
    }
  }
}

Prompt Caching: 90% Cost Reduction on Repeated Context

Anthropic's prompt caching feature is one of the highest-impact optimizations available. If your application sends similar system prompts or large context on every request, you're likely leaving significant cost savings on the table.

  • Cache write cost: 1.25× normal input price
  • Cache read cost: 0.10× normal input price (a 90% discount)
  • Cache TTL: 5 minutes; each read refreshes the TTL before expiry
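Plugging those multipliers into a quick break-even check (`cachedCostRatio` is a hypothetical helper using the 1.25× write and 0.10× read ratios above):

```typescript
// Cost of caching (1 write + N reads) relative to sending the same
// prompt uncached N + 1 times, using the multipliers above.
function cachedCostRatio(reads: number): number {
  return (1.25 + 0.10 * reads) / (1 + reads);
}
```

The ratio drops below 1 after the very first cache hit (1.35× total spend versus 2× uncached), so caching pays off as soon as a prompt is reused once within the TTL.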

When to use prompt caching

  • System prompts longer than 1,000 tokens — cache the system prompt, vary only user messages
  • RAG applications — cache the retrieved context block (documents, code) across similar queries
  • Few-shot examples — cache your 5-10 example input/output pairs in the system prompt
  • Code analysis — cache the full file contents, send different questions about the same code
  • Conversation history — cache early turns in long conversations to avoid re-ingesting them
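A sketch of the request shape for the first case above. The system-prompt text and user message are invented; the `cache_control` block follows Anthropic's documented Messages API format.

```typescript
// Mark the long system prompt as cacheable. Everything up to and
// including the block carrying cache_control is cached for ~5 minutes.
const params = {
  model: "claude-3-5-sonnet-20241022",
  max_tokens: 1024,
  system: [
    {
      type: "text" as const,
      text: "You are a support assistant. <several thousand tokens of policy>",
      cache_control: { type: "ephemeral" as const },
    },
  ],
  messages: [
    { role: "user" as const, content: "How do I reset my password?" },
  ],
};

// Then: await client.messages.create(params); cache hits show up in
// usage.cache_read_input_tokens on the response.
```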

Monitoring the Claude API in Production

Official status page

Anthropic publishes incident reports at status.anthropic.com. You can subscribe to email or RSS notifications. However, Anthropic often reports incidents 30-60 minutes after they begin — which is why external monitoring matters.

Check current Claude API status →

Key metrics to track

529 error rate

Primary health signal — spikes indicate overload events. Alert above 2%.

429 error rate

Rate limit pressure — indicates you're approaching TPM/RPM ceiling.

Time to First Token (TTFT)

Latency until streaming starts. Baseline: 300-800ms. Alert if >3s.

Tokens per request (P95)

Track to predict TPM consumption and catch prompt injection attempts.

Cost per 1,000 requests

Alert on unexpected spikes — could indicate misuse or prompt injection.

Automated alerting setup

For external monitoring (detecting Anthropic-side incidents before your application logs them), use API Status Check for Anthropic or configure synthetic monitoring to ping api.anthropic.com/v1/models every 60 seconds.

  • Alert on non-200 responses from /v1/models
  • Alert when response time exceeds 5 seconds
  • Route alerts to Slack/PagerDuty for on-call engineers
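A sketch of such a probe. The helper names and thresholds are ours (mirroring the bullets above); the endpoint and headers follow Anthropic's documented API.

```typescript
// Classify one synthetic probe of GET /v1/models into an alert decision.
type Probe = { status: number; latencyMs: number };

function evaluateProbe(p: Probe): "ok" | "alert" {
  if (p.status !== 200) return "alert"; // non-200 from /v1/models
  if (p.latencyMs > 5_000) return "alert"; // slower than 5 seconds
  return "ok";
}

// The probe itself; run on a 60-second schedule.
async function probeModels(apiKey: string): Promise<Probe> {
  const start = Date.now();
  const res = await fetch("https://api.anthropic.com/v1/models", {
    headers: {
      "x-api-key": apiKey,
      "anthropic-version": "2023-06-01",
    },
  });
  return { status: res.status, latencyMs: Date.now() - start };
}
```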


Production Checklist

Use the right model for the task

Haiku for classification, routing, and simple extractions. Sonnet for most production tasks. Opus only for complex multi-step reasoning where quality clearly matters.

Always set max_tokens explicitly

Omitting max_tokens causes Claude to generate until a natural stop — unpredictable cost and latency. Set a tight upper bound for your use case.

Implement request timeouts

Set a 30-second timeout. During incidents, requests can hang indefinitely without one, exhausting your connection pool.
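The official SDK accepts a timeout option at construction (`new Anthropic({ timeout: 30_000 })`). For calls outside the SDK, a generic wrapper can enforce the same bound; `withTimeout` is our helper, not a library function.

```typescript
// Abort any request-shaped promise after `ms` milliseconds. The callback
// receives an AbortSignal it should pass to fetch or the SDK call.
function withTimeout<T>(
  run: (signal: AbortSignal) => Promise<T>,
  ms = 30_000
): Promise<T> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), ms);
  return run(controller.signal).finally(() => clearTimeout(timer));
}
```

Typical use: `await withTimeout((signal) => fetch(url, { signal }))`.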

Cache system prompts

Enable prompt caching for any system prompt over 1,000 tokens. The 25% write premium pays for itself after a single cache hit, and every read after that costs 10% of the normal input price.

Monitor 529 rate as a health signal

A rising 529 rate is an early warning of infrastructure issues — often 5-10 minutes before an official incident is posted to status.anthropic.com.

Build a model fallback chain

On persistent 529s or 500s, route to GPT-4o or Gemini Flash. Your users should never know Anthropic is having a bad day.
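A minimal sketch of that routing. The provider names and `call` signature are ours; in practice each `call` would wrap the respective vendor's SDK.

```typescript
// Try providers in order; surface the last error only if all of them fail.
type Provider = {
  name: string;
  call: (prompt: string) => Promise<string>;
};

async function withFallback(providers: Provider[], prompt: string): Promise<string> {
  let lastError: unknown;
  for (const p of providers) {
    try {
      return await p.call(prompt);
    } catch (err) {
      lastError = err; // e.g. persistent 529/500: move on to the next provider
    }
  }
  throw lastError;
}
```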


Frequently Asked Questions

What does Anthropic Claude API error 529 mean?

Error 529 from the Anthropic API means the API is overloaded — Anthropic's servers are at capacity and cannot process your request right now. This is distinct from a 503 (service unavailable). 529 errors typically resolve within seconds to minutes. Implement exponential backoff: wait 1s, then 2s, 4s, 8s with jitter. During high-demand periods (especially after Claude model launches), 529 errors can persist for 5-15 minutes.

What are the Claude API rate limits?

Claude API rate limits depend on your usage tier. Free (Build) tier: 5 RPM, 25K TPM, 300K TPD for claude-3-5-haiku. Scale Tier 1 (default paid): 50 RPM, 100K TPM, 5M TPD for claude-3-5-haiku; 50 RPM, 40K TPM, 2M TPD for claude-3-5-sonnet. Higher tiers (2-4) unlock up to 4,000 RPM and 400K TPM. Limits also apply to input+output tokens combined. Check console.anthropic.com for your current limits.

How do I monitor the Claude API?

Monitor the Claude API by: 1) Checking status.anthropic.com for official incident reports, 2) Setting up synthetic monitoring that pings the Anthropic API list-models endpoint every 60 seconds, 3) Tracking 529 error rate as your primary health signal (spikes indicate overload), 4) Monitoring TTFT (time to first token) — increases >3x from baseline indicate degraded service, 5) Subscribing to alerts via API Status Check for instant notification when Anthropic has an incident.

What is Claude API prompt caching and how does it reduce costs?

Prompt caching is an Anthropic feature that lets you cache large system prompts and reuse them across requests without paying full input token costs. Cached tokens cost 10% of normal input price (a 90% discount). Write operations (first cache) cost 25% more than normal, but subsequent reads are heavily discounted. Ideal for: long system prompts, RAG context, few-shot examples. Cache TTL is 5 minutes — you must send a read request before that to refresh it.

How do I handle Claude API 429 rate limit errors?

For Claude API 429 errors: 1) Check the Retry-After header — Anthropic specifies the exact wait time, 2) Implement token-aware request queuing — large prompts consume more of your TPM budget, 3) Use claude-3-5-haiku for high-volume tasks (2.5× the Tier 1 TPM limit of Sonnet, at lower cost), 4) Enable prompt caching to reduce effective token consumption on repeated prompts, 5) Distribute requests across time to avoid burst limits.
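Step 1 above can be sketched as a small parser; `parseRetryAfterMs` is our helper name, applied to the Retry-After value exposed on the error response.

```typescript
// Turn a Retry-After header (delta-seconds or HTTP-date) into a wait in ms.
function parseRetryAfterMs(header: string | null, fallbackMs = 2_000): number {
  if (!header) return fallbackMs;
  const seconds = Number(header);
  if (Number.isFinite(seconds) && seconds >= 0) return seconds * 1000;
  const date = Date.parse(header); // Retry-After may also be an HTTP-date
  if (!Number.isNaN(date)) return Math.max(0, date - Date.now());
  return fallbackMs;
}
```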
