LLM Cost Optimization: Cut Your OpenAI & Claude API Spending by 60%
LLM API costs compound fast at scale. A product that costs $200/month at 1,000 users can hit $20,000/month at 100,000 users — without any optimization. This guide covers the six highest-impact cost reduction strategies for OpenAI, Anthropic Claude, and Google Gemini in production.
TL;DR — Top 3 Quick Wins
- Model routing — use GPT-4o-mini / Haiku for classification and simple tasks (60-80% savings)
- Prompt caching — enable caching on system prompts and documents (40-90% on input tokens)
- Batch API — move offline tasks to async batch processing (50% automatic discount)
Why LLM API Costs Spiral Out of Control
LLM costs have three characteristics that make them hard to predict and control:
- 3-4× output token premium: output tokens cost 3-4x more than input tokens on every major provider. Verbose responses kill budgets.
- O(n) context growth: multi-turn chats resend the full conversation each request. A 10-turn conversation sends ~10x the tokens of turn 1.
- 10-50× model price range: GPT-4o costs 16x more than GPT-4o-mini. Using the wrong model for simple tasks is the #1 cost waste.
LLM API Pricing Comparison (May 2026)
Prices per 1M tokens. Cached and batch pricing available where supported.
| Provider | Model | Input | Output | Cached Input | Batch Input |
|---|---|---|---|---|---|
| OpenAI | GPT-4o | $2.50 | $10.00 | $1.25 | $1.25 |
| OpenAI | GPT-4o-mini | $0.15 | $0.60 | $0.075 | $0.075 |
| OpenAI | GPT-4.1 | $2.00 | $8.00 | $0.50 | $1.00 |
| Anthropic | Claude Sonnet 4.5 | $3.00 | $15.00 | $0.30 | $1.50 |
| Anthropic | Claude Haiku 4.5 | $0.80 | $4.00 | $0.08 | $0.40 |
| Google | Gemini 2.5 Flash | $0.15 | $0.60 | $0.0375 | $0.075 |
* Prices are approximate and change frequently. Verify current pricing at platform.openai.com/pricing, anthropic.com/pricing, and cloud.google.com/vertex-ai/generative-ai/pricing.
6 Cost Optimization Strategies (Ranked by Impact)
1. Model Routing (60-80% savings, medium effort): Use GPT-4o-mini or Haiku for tasks that don't require flagship model capability. Classification, extraction, and summarization rarely need GPT-4o.
2. Prompt Caching (40-90% savings on input, low effort): Keep system prompts and documents as cached prefixes. OpenAI caches automatically; Claude requires explicit cache_control markers.
3. Batch API (50% savings, low-medium effort): Route non-real-time requests (embeddings, offline classification, report generation) through the Batch API for an automatic 50% discount.
4. Output Token Control (20-40% savings, low effort): Set max_tokens to a tight limit for structured tasks. Add "Be concise" or JSON-only instructions to system prompts. Output tokens cost 3-4x more than input tokens.
5. Semantic Caching (30-60% savings, high effort): Cache LLM responses by semantic similarity of the input. Near-duplicate questions (same intent, different phrasing) return cached responses at zero LLM cost. A minimal sketch follows this list.
6. Context Window Pruning (15-30% savings, medium effort): For multi-turn conversations, summarize or prune old turns instead of sending the full history with every request; each round trip otherwise sends O(n) tokens, where n grows with conversation length. See the second sketch after this list.
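Strategies 1-4 get dedicated sections below; strategies 5 and 6 don't, so here are two brief sketches. First, semantic caching. This is a minimal in-memory sketch assuming the openai Node SDK for embeddings; production systems typically persist embeddings in a vector store (Redis, pgvector) and tune the similarity threshold against real traffic, so the 0.92 below is only illustrative.
// Semantic cache sketch: reuse responses for near-duplicate questions
import OpenAI from 'openai';
const openai = new OpenAI();
interface CacheEntry { embedding: number[]; response: string; }
const cache: CacheEntry[] = []; // in-memory for illustration only
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
async function answerWithSemanticCache(question: string): Promise<string> {
  // Embed the incoming question (embedding calls cost far less than chat calls)
  const { data } = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: question,
  });
  const embedding = data[0].embedding;
  // Return a cached answer if a semantically similar question was already answered
  const SIMILARITY_THRESHOLD = 0.92; // illustrative; tune against your own traffic
  for (const entry of cache) {
    if (cosineSimilarity(embedding, entry.embedding) >= SIMILARITY_THRESHOLD) {
      return entry.response; // zero LLM cost on a cache hit
    }
  }
  // Cache miss: call the LLM and store the result
  const completion = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [{ role: 'user', content: question }],
    max_tokens: 300,
  });
  const response = completion.choices[0].message.content ?? '';
  cache.push({ embedding, response });
  return response;
}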
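Second, context window pruning. A sketch assuming a plain chat-completions message array; the keepRecent cutoff and the summarization prompt are illustrative choices, not fixed recommendations.
// Context pruning sketch: summarize old turns instead of resending the full history
type Msg = { role: 'system' | 'user' | 'assistant'; content: string };
async function pruneHistory(messages: Msg[], keepRecent = 6): Promise<Msg[]> {
  const [system, ...rest] = messages;
  if (rest.length <= keepRecent) return messages; // nothing to prune yet
  const oldTurns = rest.slice(0, rest.length - keepRecent);
  const recentTurns = rest.slice(rest.length - keepRecent);
  // Summarize the old turns once with a cheap model; the summary replaces them
  const summary = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    max_tokens: 200,
    messages: [
      { role: 'system', content: 'Summarize this conversation in under 150 words, keeping facts the assistant needs later.' },
      { role: 'user', content: oldTurns.map(m => `${m.role}: ${m.content}`).join('\n') },
    ],
  });
  return [
    system,
    { role: 'assistant', content: `Summary of earlier conversation: ${summary.choices[0].message.content}` },
    ...recentTurns,
  ];
}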
Strategy 1: Model Routing in Practice
The most impactful cost reduction is routing simple tasks to cheaper models. A typical production app has 3-5 distinct LLM tasks with very different complexity requirements:
// Model routing by task type
const MODEL_ROUTER = {
// Simple tasks — GPT-4o-mini is 16x cheaper, near-identical quality
classify_intent: 'gpt-4o-mini',
extract_entities: 'gpt-4o-mini',
summarize_document: 'gpt-4o-mini',
sentiment_analysis: 'gpt-4o-mini',
faq_matching: 'gpt-4o-mini',
// Complex tasks — need flagship model capability
complex_reasoning: 'gpt-4o',
code_generation: 'gpt-4o',
multi_step_analysis: 'gpt-4o',
creative_writing: 'gpt-4o',
} as const;
async function callLLM(task: keyof typeof MODEL_ROUTER, prompt: string) {
const model = MODEL_ROUTER[task];
return openai.chat.completions.create({
model,
messages: [{ role: 'user', content: prompt }],
max_tokens: task.startsWith('extract') ? 200 : 1000, // tight limits for simple tasks
});
}

Strategy 2: Prompt Caching
Prompt caching gives you a discount on repeated prefix tokens. It's one of the fastest wins because most production apps have large, consistent system prompts.
OpenAI: Automatic Caching
OpenAI caches prompts automatically when the same prefix (≥1,024 tokens) appears in multiple requests within a short window. Cached input tokens cost 50% less. No code changes required — just keep your system prompt consistent.
// Check if caching is working — look for cached_tokens in usage
const response = await openai.chat.completions.create({
model: 'gpt-4o',
messages: [
{ role: 'system', content: LARGE_SYSTEM_PROMPT }, // 2000+ tokens — will be cached
{ role: 'user', content: userMessage },
],
});
// Check cache hit rate
const { prompt_tokens, completion_tokens } = response.usage;
const cached = response.usage.prompt_tokens_details?.cached_tokens ?? 0;
const cacheRate = cached / prompt_tokens;
console.log(`Cache hit rate: ${(cacheRate * 100).toFixed(1)}%`);
// Target: >60% cache hit rate for production system prompts

Anthropic Claude: Explicit Cache Control
Claude requires explicit cache_control markers to enable caching. Mark your system prompt and any reused documents with the cache type "ephemeral". Cached read tokens cost 90% less than regular input tokens.
import Anthropic from '@anthropic-ai/sdk';
const anthropic = new Anthropic();
const response = await anthropic.messages.create({
model: 'claude-sonnet-4-5',
max_tokens: 1024,
system: [
{
type: 'text',
text: LARGE_SYSTEM_PROMPT, // 2000+ tokens
cache_control: { type: 'ephemeral' }, // Mark for caching
},
],
messages: [
{
role: 'user',
content: [
{
type: 'text',
text: LARGE_DOCUMENT, // Document to analyze — also cache it
cache_control: { type: 'ephemeral' },
},
{ type: 'text', text: userQuestion },
],
},
],
});
// Check cache savings
const { cache_read_input_tokens, cache_creation_input_tokens } = response.usage;
console.log('Cache reads (cheap):', cache_read_input_tokens);
console.log('Cache writes (one-time):', cache_creation_input_tokens);
Strategy 3: OpenAI Batch API
The Batch API processes requests asynchronously with a 24-hour window in exchange for a 50% discount on all models. Ideal for any workflow that doesn't need real-time responses:
✅ Good for Batch API
- Embedding generation for document collections
- Overnight content classification pipelines
- Scheduled report generation
- A/B test result analysis
- Data enrichment workflows
- Bulk content moderation
❌ Not for Batch API
- Real-time chat interfaces
- User-facing features with <5s latency requirements
- Streaming responses
- Interactive coding assistants
- Any user blocking on the response
// Create a batch job (JSONL input)
import fs from 'fs';
// Build batch requests file
const requests = documents.map((doc, i) => ({
custom_id: `doc-${i}`,
method: 'POST',
url: '/v1/chat/completions',
body: {
model: 'gpt-4o-mini',
messages: [
{ role: 'system', content: 'Classify this document. Respond with JSON.' },
{ role: 'user', content: doc.text },
],
max_tokens: 100,
},
}));
fs.writeFileSync('batch_input.jsonl', requests.map(r => JSON.stringify(r)).join('\n'));
// Upload and create batch
const file = await openai.files.create({
file: fs.createReadStream('batch_input.jsonl'),
purpose: 'batch',
});
const batch = await openai.batches.create({
input_file_id: file.id,
endpoint: '/v1/chat/completions',
completion_window: '24h',
});
console.log('Batch ID:', batch.id);
// Poll /v1/batches/{batch_id} for completion
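Once the batch finishes, its status becomes "completed" and results arrive as a JSONL output file matched back to your requests by custom_id. A minimal retrieval sketch, assuming the openai Node SDK's batches.retrieve and files.content methods and a simple polling loop:
// Retrieve batch results once processing finishes
async function waitForBatch(batchId: string) {
  let batch = await openai.batches.retrieve(batchId);
  while (!['completed', 'failed', 'expired', 'cancelled'].includes(batch.status)) {
    await new Promise(resolve => setTimeout(resolve, 60_000)); // poll every minute
    batch = await openai.batches.retrieve(batchId);
  }
  if (batch.status !== 'completed' || !batch.output_file_id) {
    throw new Error(`Batch ${batchId} ended with status ${batch.status}`);
  }
  // Output is JSONL: one line per request, keyed by custom_id
  const content = await openai.files.content(batch.output_file_id);
  const lines = (await content.text()).trim().split('\n');
  return lines.map(line => JSON.parse(line));
}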
Strategy 4: Output Token Control
Output tokens cost 3-4x more than input tokens. Reducing output length is high-leverage, especially for structured data tasks:
Set a tight max_tokens
For classification tasks, set max_tokens: 50. For JSON extraction, max_tokens: 200. For summaries, max_tokens: 300. You pay for every output token whether you use it or not.
Request JSON output
Use response_format: { type: "json_object" } (OpenAI) or tell Claude to respond ONLY in JSON. Eliminates preamble ("Sure, here is the JSON...") that burns tokens.
Add conciseness instructions
Add to system prompt: "Be concise. Do not repeat the question. Answer directly." This can cut output tokens by 20-30% on verbose models.
Use structured outputs for extraction
OpenAI Structured Outputs (response_format: { type: "json_schema" }) enforces exact output shape — no hallucinated fields, no padding text. Predictable token count.
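Combining the last two points, here is a sketch of a strict-schema extraction call with a tight max_tokens cap; the documentText variable and the invoice schema are illustrative assumptions, not part of any real API:
// Structured extraction: strict JSON schema + tight output cap
const extraction = await openai.chat.completions.create({
  model: 'gpt-4o-mini',
  max_tokens: 200, // tight cap for a small, predictable JSON payload
  messages: [
    { role: 'system', content: 'Extract the requested fields. Respond with JSON only.' },
    { role: 'user', content: documentText }, // hypothetical input document
  ],
  response_format: {
    type: 'json_schema',
    json_schema: {
      name: 'invoice_fields',
      strict: true,
      schema: {
        type: 'object',
        properties: {
          vendor: { type: 'string' },
          total_usd: { type: 'number' },
          due_date: { type: 'string' },
        },
        required: ['vendor', 'total_usd', 'due_date'],
        additionalProperties: false,
      },
    },
  },
});
const fields = JSON.parse(extraction.choices[0].message.content ?? '{}');
With strict: true, the model can only emit the declared fields, so output length stays predictable and easy to cap.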
Tracking & Alerting on LLM Spending
You can't optimize what you can't measure. Set up cost tracking before you hit an unexpected bill:
// Track cost per request in your app
function calculateCost(usage: {
prompt_tokens: number;
completion_tokens: number;
cached_tokens?: number;
}, model: string): number {
const PRICING: Record<string, { input: number; output: number; cached?: number }> = {
'gpt-4o': { input: 0.0000025, output: 0.00001, cached: 0.00000125 },
'gpt-4o-mini': { input: 0.00000015, output: 0.0000006, cached: 0.000000075 },
'claude-sonnet-4-5': { input: 0.000003, output: 0.000015, cached: 0.0000003 },
};
const pricing = PRICING[model];
if (!pricing) return 0;
const cachedTokens = usage.cached_tokens ?? 0;
const regularInput = usage.prompt_tokens - cachedTokens;
return (
regularInput * pricing.input +
(cachedTokens * (pricing.cached ?? pricing.input)) +
usage.completion_tokens * pricing.output
);
}
// Log to your analytics/metrics system
const cachedInput = response.usage.prompt_tokens_details?.cached_tokens;
const cost = calculateCost({ ...response.usage, cached_tokens: cachedInput }, model);
metrics.increment('llm.cost.usd', cost, { model, feature: 'chat' });
metrics.increment('llm.tokens.input', response.usage.prompt_tokens, { model });
metrics.increment('llm.tokens.output', response.usage.completion_tokens, { model });
Where to check spend on each provider's dashboard:
- OpenAI (platform.openai.com/usage): per-model daily spend + project breakdowns. Set monthly limits under Billing.
- Anthropic (console.anthropic.com/usage): workspace-level usage. Set spend limits per workspace under Settings.
- Google (cloud.google.com/billing): per-service billing alerts. Set budget alerts at 50%/80%/100% thresholds.
Bonus: Multi-Provider Failover Reduces Costs & Improves Reliability
Using a single LLM provider creates both a cost risk (price hikes) and a reliability risk (outages). Multi-provider routing gives you flexibility on both:
// Simple multi-provider fallback with cost awareness
async function generateWithFallback(prompt: string, feature: string) {
const providers = [
{ client: openai, model: 'gpt-4o-mini', costMultiplier: 1 }, // Primary (cheapest)
{ client: anthropic, model: 'claude-haiku-4-5', costMultiplier: 5 }, // Secondary
{ client: openai, model: 'gpt-4o', costMultiplier: 16 }, // Last resort
];
for (const provider of providers) {
try {
const result = await callProvider(provider, prompt);
metrics.increment('llm.provider.success', 1, {
model: provider.model,
feature
});
return result;
} catch (err) {
if (err.status === 429 || err.status >= 500) {
continue; // Try next provider
}
throw err; // Non-retriable error
}
}
throw new Error('All LLM providers failed');
}
Frequently Asked Questions
How much does the OpenAI API cost per 1M tokens?
As of May 2026: GPT-4o input $2.50/1M tokens, output $10.00/1M tokens. GPT-4o-mini input $0.15/1M tokens, output $0.60/1M tokens. GPT-4.1 input $2.00/1M tokens, output $8.00/1M tokens. o3 input $10.00/1M tokens, output $40.00/1M tokens. Prompt caching reduces input costs by 50% for OpenAI (cached tokens cost $1.25/1M for GPT-4o). Batch API reduces costs by 50% with up to 24-hour turnaround.
What is prompt caching and how much does it save?
Prompt caching stores repeated prefix tokens (system prompts, few-shot examples, documents) so subsequent requests pay a discounted rate on the cached portion instead of the full input price. OpenAI: cached input tokens cost 50% less ($1.25/1M vs $2.50/1M for GPT-4o). Anthropic Claude: cache read tokens cost 90% less than regular input tokens ($0.30/1M vs $3.00/1M for Sonnet 4.5). Google Gemini: implicit caching is enabled by default, with explicit context caching also available. For apps with consistent system prompts, savings of 40-70% on input costs are typical.
Should I use GPT-4o or GPT-4o-mini to save money?
GPT-4o-mini costs ~16x less than GPT-4o ($0.15/1M vs $2.50/1M input). It handles classification, extraction, summarization, and simple Q&A at near-GPT-4o quality. Use GPT-4o-mini for: intent classification, document chunking, structured data extraction, FAQ matching, sentiment analysis. Use GPT-4o for: complex reasoning, code generation, creative writing, nuanced tasks requiring deep context. Model routing — using the right model for each task — is the single highest-impact cost reduction strategy.
What is the OpenAI Batch API and how much does it save?
The OpenAI Batch API processes requests asynchronously with up to 24-hour turnaround in exchange for 50% cost reduction across all models. Input $1.25/1M tokens (vs $2.50 synchronous), output $5.00/1M tokens (vs $10.00). Use it for: embedding generation, document classification, offline report generation, A/B test evaluation, data enrichment pipelines. Not suitable for real-time user interactions requiring <1s response.
How do I track and monitor my LLM API spending?
OpenAI: platform.openai.com/usage shows per-model and per-day spend with project-level breakdowns. Set spending limits under Settings > Billing. Anthropic: console.anthropic.com/usage for Claude API spend with workspace-level tracking. Google: cloud.google.com/billing with per-service cost alerts. For third-party tracking across all providers, tools like Helicone, LangSmith, or Portkey provide unified cost dashboards with per-request cost attribution to features or users.