Anthropic Claude API Best Practices 2026
Everything you need to run the Claude API reliably in production — rate limits by tier, the infamous 529 error explained, prompt caching, and how to set up monitoring.
Quick Reference
- Status page: status.anthropic.com
- 529 = overloaded (not 503) — unique Anthropic error, retry with backoff
- Free tier: 5 RPM — not production-ready. Upgrade to Scale Tier 1 (pay-as-you-go)
- Prompt caching saves up to 90% on repeated context — always use it for system prompts >1K tokens
- Haiku has 2.5× the TPM limit of Sonnet — use it for high-volume, latency-sensitive tasks
Claude API Rate Limits by Model and Tier
Anthropic's rate limits apply per API key and are tracked on a per-minute and per-day basis. Limits are based on token consumption, not just request count. A single request with a 50K-token context window can consume half your TPM budget.
| Model | Tier | RPM | TPM | TPD |
|---|---|---|---|---|
| claude-3-5-haiku-20241022 | Build (free) | 5 | 25K | 300K |
| claude-3-5-haiku-20241022 | Scale Tier 1 | 50 | 100K | 5M |
| claude-3-5-sonnet-20241022 | Build (free) | 5 | 25K | 300K |
| claude-3-5-sonnet-20241022 | Scale Tier 1 | 50 | 40K | 2M |
| claude-3-opus-20240229 | Scale Tier 1 | 50 | 20K | 1M |
Important: These are Tier 1 (first paid tier) limits. Higher tiers (2-4) unlock significantly more capacity — Tier 4 reaches 4,000 RPM and 400K TPM for Sonnet. Contact Anthropic to request tier upgrades.
Limits count combined input + output tokens: a request with 10K input and 2K output consumes 12K tokens of your TPM budget.
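Because limits count input and output together, it is worth computing a request's worst-case token cost before sending it. A minimal sketch (the 40K figure is Sonnet's Scale Tier 1 TPM from the table above; function names are illustrative):

```typescript
// Worst-case tokens a request can consume against the TPM budget:
// input tokens plus the max_tokens output cap.
function worstCaseTokens(inputTokens: number, maxTokens: number): number {
  return inputTokens + maxTokens;
}

// How many such requests fit in one minute under a given TPM limit.
function requestsPerMinute(
  tpmLimit: number,
  inputTokens: number,
  maxTokens: number,
): number {
  return Math.floor(tpmLimit / worstCaseTokens(inputTokens, maxTokens));
}

// Example from the text: 10K input + 2K output = 12K tokens.
const cost = worstCaseTokens(10_000, 2_000); // 12000
// Under Sonnet's 40K Tier 1 TPM, only 3 such requests fit per minute.
const fit = requestsPerMinute(40_000, 10_000, 2_000); // 3
```

This is why a handful of large-context requests can starve an otherwise healthy RPM budget: the TPM ceiling is hit long before the request-count ceiling.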
Claude API Error Codes: Complete Reference
Anthropic uses standard HTTP codes plus a unique 529 overloaded_error that you won't find in most other APIs. All errors include a machine-readable type field in the JSON body.
400 invalid_request_error — client error, fix before retrying. Trigger: bad request body, exceeding the context window, or an invalid model name.
Fix the request client-side. Check the token count against the max_tokens limit. Verify the model name matches Anthropic's model list.
401 authentication_error — client error, fix before retrying. Trigger: missing or invalid API key.
Verify the API key in console.anthropic.com. Check that the environment variable is loaded correctly. Ensure the key hasn't been revoked.
403 permission_error — client error, fix before retrying. Trigger: API key lacks permission for this endpoint or model.
Check that your tier supports the requested model; claude-3-opus requires a higher tier than claude-3-haiku. Contact Anthropic support if you believe this is wrong.
404 not_found_error — client error, fix before retrying. Trigger: model or resource doesn't exist.
Verify the model ID against Anthropic's current model list (e.g., "claude-3-5-sonnet-20241022", not "claude-3-5-sonnet"). Some model versions are deprecated.
429 rate_limit_error — retryable. Trigger: exceeded requests per minute (RPM) or tokens per minute (TPM).
Check the Retry-After header. Implement exponential backoff. Consider switching to Haiku for higher TPM limits. Reduce context size to preserve your TPM budget.
500 api_error — retryable. Trigger: unexpected Anthropic server error.
Retry with exponential backoff (max 3 attempts). Check status.anthropic.com. If persistent, open a ticket with the request ID from the response.
529 overloaded_error — retryable. Trigger: Anthropic servers are at capacity (unique Anthropic error code).
Retry with backoff — usually resolves in seconds to minutes. Check status.anthropic.com for ongoing incidents. Implement a fallback to GPT-4o or Gemini if 529s persist for more than 5 minutes.
The 529 Error: Anthropic's Unique Overload Signal
The 529 status code is non-standard and unique to Anthropic. It means their inference infrastructure is at capacity — different from a 503 (service completely down) or 429 (your rate limit exceeded). During 529 events:
- Anthropic's infrastructure is healthy but oversubscribed — requests are being shed
- Typically lasts 1-15 minutes and resolves on its own
- Most common after model launches (the Claude 4 launch in April 2026 caused hours of 529s)
- Do not flood with retries — this makes it worse for everyone
```typescript
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

async function callClaudeWithRetry(
  messages: Anthropic.MessageParam[],
  maxRetries = 4
) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await client.messages.create({
        model: 'claude-3-5-sonnet-20241022',
        max_tokens: 1024,
        messages,
      });
    } catch (error) {
      if (error instanceof Anthropic.APIError) {
        // 429 = rate limited, 500 = server error, 529 = overloaded — all retryable
        const retryable = [429, 500, 529].includes(error.status);
        if (!retryable || attempt === maxRetries - 1) throw error;
        // Exponential backoff: 2s, 4s, 8s, 16s + jitter
        const delay = Math.pow(2, attempt + 1) * 1000;
        const jitter = Math.random() * 1000;
        await new Promise((r) => setTimeout(r, delay + jitter));
      } else {
        throw error;
      }
    }
  }
}
```
Prompt Caching: 90% Cost Reduction on Repeated Context
Anthropic's prompt caching feature is one of the highest-impact optimizations available. If your application sends similar system prompts or large context on every request, you're likely leaving significant cost savings on the table.
When to use prompt caching
- ✓ System prompts longer than 1,000 tokens — cache the system prompt, vary only user messages
- ✓ RAG applications — cache the retrieved context block (documents, code) across similar queries
- ✓ Few-shot examples — cache your 5-10 example input/output pairs in the system prompt
- ✓ Code analysis — cache the full file contents, send different questions about the same code
- ✓ Conversation history — cache early turns in long conversations to avoid re-ingesting them
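In the Anthropic SDK, caching is enabled by attaching a `cache_control` marker to the content block you want cached. A sketch of the request shape (field names follow Anthropic's Messages API; the prompt text is a placeholder; verify against the current SDK documentation):

```typescript
// Request body shape for caching a large system prompt.
// The cache_control marker tells Anthropic to cache everything
// up to and including this block.
const longSystemPrompt =
  "You are a support agent. <several thousand tokens of policy text>";

const params = {
  model: "claude-3-5-sonnet-20241022",
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: longSystemPrompt,
      cache_control: { type: "ephemeral" }, // enables prompt caching
    },
  ],
  // Only the user message varies between requests; the system
  // block above is read from cache at ~10% of normal input price.
  messages: [{ role: "user", content: "Summarize the refund policy." }],
};
```

Passing `params` to `client.messages.create(...)` sends the cacheable request; the first call pays the cache-write premium and subsequent calls within the TTL pay the discounted read rate.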
Monitoring the Claude API in Production
Official status page
Anthropic publishes incident reports at status.anthropic.com. You can subscribe to email or RSS notifications. However, Anthropic often reports incidents 30-60 minutes after they begin — which is why external monitoring matters.
Key metrics to track
529 error rate — primary health signal; spikes indicate overload events. Alert above 2%.
429 error rate — rate limit pressure; indicates you're approaching your TPM/RPM ceiling.
Time to First Token (TTFT) — latency until streaming starts. Baseline: 300-800ms. Alert if >3s.
Tokens per request (P95) — track to predict TPM consumption and catch prompt injection attempts.
Cost per 1,000 requests — alert on unexpected spikes, which could indicate misuse or prompt injection.
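The 2% alert threshold for 529s can be implemented as a rolling window over recent request outcomes. A minimal sketch (window size and threshold here are illustrative, not prescribed):

```typescript
// Tracks outcomes of the last N requests and flags when the
// 529 rate crosses an alert threshold.
class OverloadMonitor {
  private outcomes: boolean[] = []; // true = request got a 529

  constructor(
    private windowSize = 200,
    private threshold = 0.02, // alert above 2%
  ) {}

  record(status: number): void {
    this.outcomes.push(status === 529);
    if (this.outcomes.length > this.windowSize) this.outcomes.shift();
  }

  rate(): number {
    if (this.outcomes.length === 0) return 0;
    const hits = this.outcomes.filter(Boolean).length;
    return hits / this.outcomes.length;
  }

  shouldAlert(): boolean {
    return this.rate() > this.threshold;
  }
}
```

Feed every response status into `record()` and page on `shouldAlert()`; a per-instance window like this catches overload events minutes before an aggregated dashboard does.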
Automated alerting setup
For external monitoring (detecting Anthropic-side incidents before your application logs them), use API Status Check for Anthropic or configure synthetic monitoring to ping api.anthropic.com/v1/models every 60 seconds.
- ✓ Alert on non-200 responses from /v1/models
- ✓ Alert when response time exceeds 5 seconds
- ✓ Route alerts to Slack/PagerDuty for on-call engineers
Production Checklist
Use the right model for the task
Haiku for classification, routing, and simple extractions. Sonnet for most production tasks. Opus only for complex multi-step reasoning where quality clearly matters.
Always set max_tokens explicitly
The Messages API requires max_tokens, but setting it generously high makes cost and latency unpredictable. Set a tight upper bound for your use case.
Implement request timeouts
Set a 30-second timeout. During incidents, requests can hang indefinitely without one, exhausting your connection pool.
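The official SDK has its own configurable timeout option, so treat this as an illustrative sketch of the idea: a generic wrapper that rejects any promise-returning call that hangs past a deadline.

```typescript
// Rejects if the wrapped promise doesn't settle within timeoutMs,
// so a hung request can't hold a connection-pool slot forever.
function withTimeout<T>(promise: Promise<T>, timeoutMs: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`request timed out after ${timeoutMs}ms`)),
      timeoutMs,
    );
    promise.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

// Usage sketch: withTimeout(client.messages.create({ /* ... */ }), 30_000)
```

Note the `clearTimeout` on both settle paths: without it, every completed request would leave a live timer behind.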
Cache system prompts
Enable prompt caching for any system prompt over 1,000 tokens. The 25% cache-write premium is quickly recouped by the 90% read discount on subsequent requests.
Monitor 529 rate as a health signal
A rising 529 rate is an early warning of infrastructure issues — often 5-10 minutes before an official incident is posted to status.anthropic.com.
Build a model fallback chain
On persistent 529s or 500s, route to GPT-4o or Gemini Flash. Your users should never know Anthropic is having a bad day.
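A fallback chain can be a simple ordered list of provider calls tried in sequence. A minimal sketch; the provider functions named in the usage comment are placeholders, not real SDK calls:

```typescript
type ProviderCall = () => Promise<string>;

// Tries each provider in order; returns the first success,
// rethrows the last error if every provider fails.
async function callWithFallback(providers: ProviderCall[]): Promise<string> {
  let lastError: unknown;
  for (const call of providers) {
    try {
      return await call();
    } catch (err) {
      lastError = err; // e.g. a persistent 529 — fall through to the next provider
    }
  }
  throw lastError;
}

// Usage sketch (callClaude, callGpt4o, callGeminiFlash are hypothetical wrappers):
// const answer = await callWithFallback([callClaude, callGpt4o, callGeminiFlash]);
```

In practice each wrapper should normalize the response shape and only rethrow on errors worth falling back for (529s, repeated 500s), so a client-side 400 doesn't silently hop providers.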
Frequently Asked Questions
What does Anthropic Claude API error 529 mean?
Error 529 from the Anthropic API means the API is overloaded — Anthropic's servers are at capacity and cannot process your request right now. This is distinct from a 503 (service unavailable). 529 errors typically resolve within seconds to minutes. Implement exponential backoff: wait 1s, then 2s, 4s, 8s with jitter. During high-demand periods (especially after Claude model launches), 529 errors can persist for 5-15 minutes.
What are the Claude API rate limits?
Claude API rate limits depend on your usage tier. Free (Build) tier: 5 RPM, 25K TPM, 300K TPD for claude-3-5-haiku. Scale Tier 1 (default paid): 50 RPM, 100K TPM, 5M TPD for claude-3-5-haiku; 50 RPM, 40K TPM, 2M TPD for claude-3-5-sonnet. Higher tiers (2-4) unlock up to 4,000 RPM and 400K TPM. Limits also apply to input+output tokens combined. Check console.anthropic.com for your current limits.
How do I monitor the Claude API?
Monitor the Claude API by: 1) Checking status.anthropic.com for official incident reports, 2) Setting up synthetic monitoring that pings the Anthropic API list-models endpoint every 60 seconds, 3) Tracking 529 error rate as your primary health signal (spikes indicate overload), 4) Monitoring TTFT (time to first token) — increases >3x from baseline indicate degraded service, 5) Subscribing to alerts via API Status Check for instant notification when Anthropic has an incident.
What is Claude API prompt caching and how does it reduce costs?
Prompt caching is an Anthropic feature that lets you cache large system prompts and reuse them across requests without paying full input token costs. Cached tokens cost 10% of normal input price (a 90% discount). Write operations (first cache) cost 25% more than normal, but subsequent reads are heavily discounted. Ideal for: long system prompts, RAG context, few-shot examples. Cache TTL is 5 minutes — you must send a read request before that to refresh it.
How do I handle Claude API 429 rate limit errors?
For Claude API 429 errors: 1) Check the Retry-After header — Anthropic specifies the exact wait time, 2) Implement token-aware request queuing — large prompts consume more of your TPM budget, 3) Use claude-3-5-haiku for high-volume tasks (2.5× the TPM limit of Sonnet at lower cost), 4) Enable prompt caching to reduce effective token consumption on repeated prompts, 5) Distribute requests across time to avoid burst limits.
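Token-aware queuing from point 2 can be sketched as a one-minute sliding ledger of token spend: before sending, ask how long until the new request fits under the TPM limit. An illustrative sketch, not an official client feature:

```typescript
// Token-aware throttle: tracks tokens spent in the last 60s and
// reports how long to wait before a new request fits the budget.
class TpmThrottle {
  private spends: { at: number; tokens: number }[] = [];

  constructor(private tpmLimit: number) {}

  private prune(now: number): void {
    this.spends = this.spends.filter((s) => now - s.at < 60_000);
  }

  // Milliseconds to wait before `tokens` fits under the limit (0 = send now).
  msUntilAllowed(tokens: number, now = Date.now()): number {
    this.prune(now);
    let used = this.spends.reduce((sum, s) => sum + s.tokens, 0);
    if (used + tokens <= this.tpmLimit) return 0;
    // Oldest entries expire first; find when enough budget frees up.
    for (const s of this.spends) {
      used -= s.tokens;
      if (used + tokens <= this.tpmLimit) return s.at + 60_000 - now;
    }
    return 60_000;
  }

  // Call after each request with its actual input + output token count.
  record(tokens: number, now = Date.now()): void {
    this.prune(now);
    this.spends.push({ at: now, tokens });
  }
}
```

A queue built on this delays large requests instead of burning retries on guaranteed 429s, which also keeps the Retry-After backoff path as the exception rather than the norm.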