Know Your API Cost Profile First
Before optimizing, understand where your API spend goes. For most teams it falls into a few buckets:
- LLM APIs (OpenAI, Anthropic, Google): Per-token pricing. Costs scale with prompt length, context window, and request volume. Often the #1 API cost for AI-enabled products.
- Data/enrichment APIs (Clearbit, Hunter, People Data Labs): Per-record pricing. Easy to accidentally over-fetch.
- Communication APIs (Twilio, SendGrid): Per-message pricing. High-volume notification systems can generate surprising bills.
- Payment APIs (Stripe): Per-transaction + percentage. Usually justified by revenue, but optimize for dispute handling.
- Infrastructure APIs (AWS, GCP, Azure): Complex pricing with dozens of service dimensions.
Tag your API calls by feature, endpoint, and user tier before optimizing. You can't optimize what you can't measure.
Strategy 1: Response Caching
The highest-leverage optimization for most teams. Cache API responses and return cached results instead of making repeat API calls.
Exact-Match Caching
For deterministic APIs (same input = same output), cache by request parameters. Effective for:
- Geocoding/reverse geocoding (address → coordinates)
- Currency exchange rates (cache for 1-5 minutes)
- Company enrichment data (Clearbit lookups by domain)
- Static data APIs (country codes, industry lists)
```javascript
// Redis-backed API response cache
const redis = require('redis');
const promClient = require('prom-client');

const client = redis.createClient();
client.connect(); // node-redis v4+ requires an explicit connect()

// Counters so you can track cache effectiveness per key prefix
const cacheHitCounter = new promClient.Counter({
  name: 'api_cache_hits_total',
  help: 'API response cache hits',
  labelNames: ['key_prefix'],
});
const cacheMissCounter = new promClient.Counter({
  name: 'api_cache_misses_total',
  help: 'API response cache misses',
  labelNames: ['key_prefix'],
});

async function cachedAPICall(key, ttl, apiFn) {
  // Check cache first
  const cached = await client.get(key);
  if (cached) {
    cacheHitCounter.inc({ key_prefix: key.split(':')[0] });
    return JSON.parse(cached);
  }

  // Cache miss — call the API
  cacheMissCounter.inc({ key_prefix: key.split(':')[0] });
  const result = await apiFn();

  // Store with TTL
  await client.setEx(key, ttl, JSON.stringify(result));
  return result;
}

// Usage: geocoding with a 24-hour cache (googleMaps is your geocoding client)
const geocode = (address) =>
  cachedAPICall(
    `geocode:${address}`,
    86400,
    () => googleMaps.geocode({ address })
  );
```
Semantic Caching for LLM APIs
LLM responses aren't perfectly deterministic, but semantically similar queries often have reusable responses. Semantic caching uses vector embeddings to find similar previous queries and return cached responses when similarity exceeds a threshold (typically 0.95+).
Tools: GPTCache, LangChain caching, Redis with vector search, Vectara. Effective for: FAQ chatbots, document Q&A, classification tasks. Less effective for: open-ended generation, personalized responses.
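For intuition, here's a minimal in-memory sketch of the pattern using OpenAI embeddings and brute-force cosine similarity; the `text-embedding-3-small` model, the 0.95 threshold, and the array store are illustrative assumptions (production systems would use Redis vector search or a dedicated vector database):

```javascript
// Semantic cache sketch: reuse a cached LLM response when a new query's
// embedding is close enough to a previously seen query's embedding.
const OpenAI = require('openai');
const openai = new OpenAI();

const entries = []; // { embedding: number[], response: string } (in-memory for the sketch)
const SIMILARITY_THRESHOLD = 0.95; // illustrative; tune per workload

function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

async function semanticCachedCompletion(query, llmFn) {
  const { data } = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: query,
  });
  const embedding = data[0].embedding;

  // Return the closest cached response if it's similar enough
  let best = null;
  let bestScore = -1;
  for (const entry of entries) {
    const score = cosineSimilarity(embedding, entry.embedding);
    if (score > bestScore) { best = entry; bestScore = score; }
  }
  if (best && bestScore >= SIMILARITY_THRESHOLD) return best.response;

  // Miss: call the LLM and cache the result
  const response = await llmFn(query);
  entries.push({ embedding, response });
  return response;
}
```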
Strategy 2: LLM-Specific Cost Optimization
LLM APIs have unique cost drivers that require specific strategies:
Model Selection (The Biggest Lever)
Choosing the right model is the single highest-impact cost optimization. Cost ratios between model tiers are enormous:
| Model | Input Cost (per 1M tokens) | Best For |
|---|---|---|
| GPT-4o | $2.50 | Complex reasoning, nuanced responses |
| GPT-4o mini | $0.15 | Classification, extraction, simple QA |
| Claude Sonnet 4.6 | $3.00 | Complex tasks, coding, long documents |
| Claude Haiku 4.5 | $0.08 | High-volume simple tasks |
| Gemini 2.0 Flash | $0.075 | Speed + cost sensitive workloads |
Build a model routing layer that selects the cheapest model capable of the task. Route classification and extraction tasks to mini/haiku models. Reserve expensive models for tasks that genuinely require them.
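A minimal sketch of such a routing layer; the task labels and the task-to-model map are illustrative assumptions, not a standard API:

```javascript
// Route each task type to the cheapest model that handles it well.
// Task labels and the model map below are illustrative assumptions.
const MODEL_FOR_TASK = {
  classification: 'gpt-4o-mini', // cheap tier: simple, high-volume work
  extraction: 'gpt-4o-mini',
  simple_qa: 'gpt-4o-mini',
  reasoning: 'gpt-4o',           // expensive tier: only when genuinely needed
  generation: 'gpt-4o',
};

async function routedCompletion(openai, task, messages) {
  const model = MODEL_FOR_TASK[task] ?? 'gpt-4o-mini'; // default to the cheap tier
  return openai.chat.completions.create({ model, messages });
}
```

Start with heuristics like these, then promote a task to a pricier model only when quality evals show the cheap one failing.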
Prompt Compression
LLM costs scale linearly with token count. Reduce your prompt token count:
- Compress system prompts: Audit your system prompt for redundancy. Strip unnecessary examples, verbose formatting instructions, and repeated context.
- Use shorter examples: Few-shot examples should be concise. 3 short examples > 1 verbose example.
- Summarize conversation history: For long conversations, summarize older messages instead of sending the full history (see the sketch after this list).
- Use LLMLingua / prompt compression libraries: Tools that algorithmically compress prompts while preserving semantics. 4-8x compression with minimal quality loss for retrieval-augmented prompts.
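Here's a sketch of the history-summarization idea; the rough 4-characters-per-token estimate, the 2,000-token budget, and GPT-4o mini as the summarizer are assumptions to adapt:

```javascript
// Sketch: once history exceeds a token budget, keep recent turns verbatim
// and replace older turns with a cheap-model summary.
const HISTORY_TOKEN_BUDGET = 2000; // illustrative budget
const estimateTokens = (text) => Math.ceil(text.length / 4); // rough heuristic

async function compactHistory(openai, messages) {
  const total = messages.reduce((sum, m) => sum + estimateTokens(m.content), 0);
  if (total <= HISTORY_TOKEN_BUDGET) return messages;

  const recent = messages.slice(-4); // last few turns stay verbatim
  const older = messages.slice(0, -4);
  const summary = await openai.chat.completions.create({
    model: 'gpt-4o-mini', // cheap model for the summarization itself
    messages: [{
      role: 'user',
      content: 'Summarize this conversation in under 150 words:\n' +
        older.map((m) => `${m.role}: ${m.content}`).join('\n'),
    }],
  });

  return [
    { role: 'system', content: `Conversation so far: ${summary.choices[0].message.content}` },
    ...recent,
  ];
}
```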
Anthropic Prompt Caching
Anthropic's prompt caching feature caches a stable prefix of your prompt (system prompt, long documents, tool definitions) across requests. Cache reads cost 10% of the base input price, while cache writes cost 25% more than base, so caching pays for itself after the first reuse. For applications with consistent system prompts or long context documents, this reduces costs by 50-90% on the cached portion. It's supported on Claude Haiku, Sonnet, and Opus: add the cache_control parameter to the prompt blocks you want cached.
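A minimal example of opting a long, stable system prompt into caching with the Anthropic Node SDK; the model name and LONG_SYSTEM_PROMPT are placeholders:

```javascript
// Mark a stable prompt prefix as cacheable via cache_control.
// LONG_SYSTEM_PROMPT and the model name are placeholders.
const Anthropic = require('@anthropic-ai/sdk');
const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

async function answerWithCachedPrompt(userQuestion) {
  return anthropic.messages.create({
    model: 'claude-haiku-4-5',
    max_tokens: 1024,
    system: [{
      type: 'text',
      text: LONG_SYSTEM_PROMPT,
      cache_control: { type: 'ephemeral' }, // cache everything up to this block
    }],
    messages: [{ role: 'user', content: userQuestion }],
  });
}
```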
Strategy 3: Request Batching
Batch multiple API requests into single calls to reduce per-request overhead, rate limit consumption, and often cost:
- OpenAI Batch API: Process jobs asynchronously with 50% cost reduction. Ideal for: document processing, bulk classification, overnight enrichment runs.
- Embedding batch requests: Instead of embedding one document at a time, batch up to 2,048 texts per request (see the sketch after this list).
- Stripe batch charges: Combine small charges into one larger charge where your business model allows (reduces per-transaction fees).
- SendGrid batch emails: Use the batch send API for bulk campaigns — more efficient than individual sends.
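For the embedding case, the OpenAI embeddings endpoint accepts an array of inputs, so N texts become roughly N/2,048 requests instead of N. A sketch (verify the per-request input limit for your account):

```javascript
// Batch embeddings: one request per chunk of texts instead of one per text.
const OpenAI = require('openai');
const openai = new OpenAI();

const BATCH_SIZE = 2048; // per-request input limit at time of writing

async function embedAll(texts) {
  const embeddings = [];
  for (let i = 0; i < texts.length; i += BATCH_SIZE) {
    const chunk = texts.slice(i, i + BATCH_SIZE);
    const { data } = await openai.embeddings.create({
      model: 'text-embedding-3-small',
      input: chunk, // an array of strings embeds the whole chunk in one call
    });
    embeddings.push(...data.map((d) => d.embedding));
  }
  return embeddings;
}
```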
Strategy 4: API Cost Monitoring
You can't optimize what you can't see. Build cost dashboards before optimizing:
Cost Attribution by Feature
```javascript
// Tag every LLM API call with feature context and track its dollar cost
const promClient = require('prom-client');

// Accumulates spend per feature and model
const apiCostCounter = new promClient.Counter({
  name: 'llm_api_cost_dollars_total',
  help: 'LLM API spend in dollars',
  labelNames: ['feature', 'model'],
});

// GPT-4o mini per-token prices in dollars; update when pricing changes
const PROMPT_COST_PER_TOKEN = 0.15 / 1e6;
const COMPLETION_COST_PER_TOKEN = 0.60 / 1e6;

async function summarizeEmail(openai, messages) {
  const response = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages,
    user: 'feature:email-summarizer', // Cost attribution tag
  });

  // Track token usage and emit to your metrics system
  const cost =
    response.usage.prompt_tokens * PROMPT_COST_PER_TOKEN +
    response.usage.completion_tokens * COMPLETION_COST_PER_TOKEN;
  apiCostCounter.inc({ feature: 'email-summarizer', model: 'gpt-4o-mini' }, cost);

  return response;
}
```
Budget Alerts
Set hard budget limits that stop API calls before you incur surprise charges (an in-app guard sketch follows this list):
- OpenAI: Settings → Billing → Usage limits. Set a soft limit (warning email) and hard limit (stops API calls).
- Anthropic: Console → Settings → Spend limits.
- AWS: AWS Budgets with SNS alerts. Set alarms at 50%, 80%, 100% of monthly budget.
- Stripe: Monitor your platform's payment processing fees separately — they scale with revenue.
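Provider-side limits are the backstop, but you can also enforce a budget in your own code before each call. A minimal sketch, assuming you already record spend in Redis; the key scheme and the $500 cap are illustrative:

```javascript
// In-app budget guard sketch: refuse API calls once monthly spend hits the cap.
// The Redis key scheme and the $500 limit are illustrative assumptions.
const MONTHLY_BUDGET_DOLLARS = 500;

async function guardedSpend(redisClient, estimatedCost, apiFn) {
  const key = `api-spend:${new Date().toISOString().slice(0, 7)}`; // e.g. api-spend:2025-06
  const spent = parseFloat((await redisClient.get(key)) ?? '0');
  if (spent + estimatedCost > MONTHLY_BUDGET_DOLLARS) {
    throw new Error('Monthly API budget exhausted; call blocked');
  }
  const result = await apiFn();
  await redisClient.incrByFloat(key, estimatedCost); // record spend after success
  return result;
}
```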
Strategy 5: Rate Limit and Retry Optimization
Poorly implemented retry logic can multiply your API costs significantly. Every failed request that's retried is a duplicate cost:
- Use exponential backoff with jitter: Don't retry failed requests at fixed intervals — use exponential backoff to avoid a thundering herd, and add random jitter to prevent synchronized retries (see the sketch after this list).
- Track retry rates: If 10% of your API calls are retries, you're paying 10% more than necessary. Fix the underlying rate limit or error issue.
- Use circuit breakers: When an API is consistently failing, stop retrying immediately and return degraded behavior. See our circuit breaker guide.
- Queue non-urgent requests: Don't exhaust rate limits on real-time requests. Queue background jobs to run during off-peak hours when rate limits reset.
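A minimal sketch of exponential backoff with full jitter; the attempt count and base delay are illustrative defaults:

```javascript
// Exponential backoff with full jitter: sleep a random amount up to an
// exponentially growing cap so retries from many clients spread out.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function withRetries(apiFn, { maxAttempts = 5, baseDelayMs = 250 } = {}) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await apiFn();
    } catch (err) {
      if (attempt === maxAttempts - 1) throw err; // out of attempts: surface the error
      const cap = baseDelayMs * 2 ** attempt;     // 250ms, 500ms, 1s, 2s, ...
      await sleep(Math.random() * cap);           // full jitter
    }
  }
}
```

Pair this with a retry-rate metric: if retries climb above a few percent of total calls, the fix is usually upstream (rate limits, timeouts), not more retries.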
API Cost Optimization Checklist
- ☐ API cost attribution by feature — know which feature is spending what
- ☐ Budget alerts at 50%, 80%, 100% of monthly limit on all providers
- ☐ Response caching for deterministic API calls (geocoding, lookups, enrichment)
- ☐ LLM model selection: cheapest model that meets quality threshold per task
- ☐ System prompt audit — remove redundancy, compress examples
- ☐ Anthropic prompt caching enabled for consistent long prompts
- ☐ Batch API enabled for non-real-time LLM processing (50% cost reduction)
- ☐ Retry rate monitoring — alert if retries exceed 5% of requests
- ☐ Exponential backoff with jitter on all API retry logic
- ☐ Circuit breakers on third-party APIs to prevent cascading retry costs
Frequently Asked Questions
How do I reduce my OpenAI API costs?
Top strategies: (1) Use GPT-4o mini instead of GPT-4o for tasks that don't need maximum capability (15x cheaper); (2) Enable prompt caching for repeated system prompts; (3) Use the Batch API for non-real-time tasks (50% cheaper); (4) Compress prompts and reduce context window; (5) Set hard usage limits to prevent surprise bills.
What is API response caching?
API response caching stores the result of an API call and returns the cached response for identical or similar subsequent requests instead of making new API calls. For deterministic APIs, caching reduces costs proportionally to the cache hit rate. For LLM APIs, semantic caching uses vector similarity to return cached responses for semantically similar queries.
How do I track my API costs in real time?
Tag every API call with feature context and log token counts or response sizes. Emit cost metrics to your monitoring system (DataDog, Better Stack, Prometheus). Set budget alerts at your provider level. Build dashboards showing cost per feature, cost per user tier, and daily/weekly trends.
What is request batching for API cost optimization?
Request batching combines multiple API calls into a single request. OpenAI Batch API offers 50% cost reduction for async processing. Embedding batching processes multiple texts per call instead of one at a time. The tradeoff is latency — batched requests are processed asynchronously and take longer to complete.