
LLM Cost Optimization: Cut Your OpenAI & Claude API Spending by 60%

LLM API costs compound fast at scale. A product that costs $200/month at 1,000 users can hit $20,000/month at 100,000 users — without any optimization. This guide covers the six highest-impact cost reduction strategies for OpenAI, Anthropic Claude, and Google Gemini in production.

Tags: OpenAI, Claude, Gemini, Cost Optimization, LLM

TL;DR — Top 3 Quick Wins

  1. Model routing — use GPT-4o-mini / Haiku for classification and simple tasks (60-80% savings)
  2. Prompt caching — enable caching on system prompts and documents (40-90% on input tokens)
  3. Batch API — move offline tasks to async batch processing (50% automatic discount)

Why LLM API Costs Spiral Out of Control

LLM costs have three characteristics that make them hard to predict and control:

  1. Output token premium (3-4×): Output tokens cost 3-4x more than input tokens on every major provider. Verbose responses kill budgets.

  2. Context growth (O(n)): Multi-turn chats resend the full conversation with each request, so a 10-turn conversation sends roughly 10x the tokens of turn 1.

  3. Model price range (10-50×): GPT-4o costs 16x more than GPT-4o-mini. Using the wrong model for simple tasks is the #1 source of wasted spend.

LLM API Pricing Comparison (May 2026)

Prices per 1M tokens. Cached and batch pricing available where supported.

| Provider | Model | Input | Output | Cached Input | Batch Input |
| --- | --- | --- | --- | --- | --- |
| OpenAI | GPT-4o | $2.50 | $10.00 | $1.25 | $1.25 |
| OpenAI | GPT-4o-mini | $0.15 | $0.60 | $0.075 | $0.075 |
| OpenAI | GPT-4.1 | $2.00 | $8.00 | $0.50 | $1.00 |
| Anthropic | Claude Sonnet 4.5 | $3.00 | $15.00 | $0.30 | $1.50 |
| Anthropic | Claude Haiku 4.5 | $0.80 | $4.00 | $0.08 | $0.40 |
| Google | Gemini 2.5 Flash | $0.15 | $0.60 | $0.0375 | $0.075 |

* Prices are approximate and change frequently. Verify current pricing at platform.openai.com/pricing, anthropic.com/pricing, and cloud.google.com/vertex-ai/generative-ai/pricing.
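
To make these prices concrete, here is the arithmetic for a single request (the 2,000/500 token counts are illustrative):

// Cost of one hypothetical GPT-4o request: 2,000 input + 500 output tokens
const inputCost = 2_000 * (2.50 / 1_000_000);  // $0.0050
const outputCost = 500 * (10.00 / 1_000_000);  // $0.0050
const totalCost = inputCost + outputCost;      // $0.0100 per request
// At 1M such requests per month that is ~$10,000 on GPT-4o;
// the same traffic on GPT-4o-mini costs ~$600.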

6 Cost Optimization Strategies (Ranked by Impact)

  1. Model Routing (60-80% savings; effort: medium). Use GPT-4o-mini or Haiku for tasks that don't require flagship model capability. Classification, extraction, and summarization rarely need GPT-4o.

  2. Prompt Caching (40-90% savings on input; effort: low). Keep system prompts and documents as cached prefixes. OpenAI caches automatically; Claude requires explicit cache_control markers.

  3. Batch API (50% savings; effort: low-medium). Route non-real-time requests (embeddings, offline classification, report generation) through the Batch API for an automatic 50% discount.

  4. Output Token Control (20-40% savings; effort: low). Set max_tokens to a tight limit for structured tasks. Add "Be concise" or JSON-only instructions to system prompts. Output tokens cost 3-4x more than input tokens.

  5. Semantic Caching (30-60% savings; effort: high). Cache LLM responses by semantic similarity of the input. Near-duplicate questions (same intent, different phrasing) return cached responses at zero LLM cost. See the sketch after this list.

  6. Context Window Pruning (15-30% savings; effort: medium). For multi-turn conversations, summarize or prune old turns instead of sending the full history with every request; each round trip sends O(n) tokens, where n grows with conversation length. See the sketch after this list.
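
Strategies 1-4 each get a detailed section below; strategies 5 and 6 do not, so here are minimal sketches. First, semantic caching. This sketch keeps the cache in memory and assumes a hypothetical getEmbedding helper (e.g., wrapping OpenAI's text-embedding-3-small); production systems typically use a vector store such as Redis or pgvector instead of a linear scan.

// Strategy 5 sketch: semantic response cache (in-memory, cosine similarity)
type CacheEntry = { embedding: number[]; response: string };
const semanticCache: CacheEntry[] = [];
const SIMILARITY_THRESHOLD = 0.95; // tune per workload

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

async function cachedCompletion(prompt: string): Promise<string> {
  const embedding = await getEmbedding(prompt); // hypothetical embedding helper
  for (const entry of semanticCache) {
    if (cosineSimilarity(embedding, entry.embedding) >= SIMILARITY_THRESHOLD) {
      return entry.response; // cache hit: zero LLM cost
    }
  }
  const completion = await callLLM('faq_matching', prompt); // from Strategy 1 below
  const text = completion.choices[0].message.content ?? '';
  semanticCache.push({ embedding, response: text });
  return text;
}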
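
And context window pruning: keep the system prompt plus the most recent turns verbatim, and fold older turns into a running summary. The summarize helper here is hypothetical; it could itself be a cheap GPT-4o-mini call.

// Strategy 6 sketch: prune old turns, keep a running summary
type Turn = { role: 'user' | 'assistant'; content: string };

const KEEP_RECENT_TURNS = 6; // tune per app

async function buildPrunedMessages(systemPrompt: string, history: Turn[]) {
  if (history.length <= KEEP_RECENT_TURNS) {
    return [{ role: 'system' as const, content: systemPrompt }, ...history];
  }
  const older = history.slice(0, -KEEP_RECENT_TURNS);
  const recent = history.slice(-KEEP_RECENT_TURNS);
  const summary = await summarize(older); // hypothetical helper
  return [
    { role: 'system' as const, content: systemPrompt },
    { role: 'system' as const, content: `Summary of earlier conversation: ${summary}` },
    ...recent,
  ];
}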

Strategy 1: Model Routing in Practice

The most impactful cost reduction is routing simple tasks to cheaper models. A typical production app has 3-5 distinct LLM tasks with very different complexity requirements:

// Model routing by task type
import OpenAI from 'openai';

const openai = new OpenAI(); // assumes OPENAI_API_KEY is set in the environment

const MODEL_ROUTER = {
  // Simple tasks — GPT-4o-mini is 16x cheaper, near-identical quality
  classify_intent: 'gpt-4o-mini',
  extract_entities: 'gpt-4o-mini',
  summarize_document: 'gpt-4o-mini',
  sentiment_analysis: 'gpt-4o-mini',
  faq_matching: 'gpt-4o-mini',

  // Complex tasks — need flagship model capability
  complex_reasoning: 'gpt-4o',
  code_generation: 'gpt-4o',
  multi_step_analysis: 'gpt-4o',
  creative_writing: 'gpt-4o',
} as const;

async function callLLM(task: keyof typeof MODEL_ROUTER, prompt: string) {
  const model = MODEL_ROUTER[task];
  return openai.chat.completions.create({
    model,
    messages: [{ role: 'user', content: prompt }],
    max_tokens: task.startsWith('extract') ? 200 : 1000, // tight limits for simple tasks
  });
}

Real-world example: A SaaS product routing 80% of requests to GPT-4o-mini and 20% to GPT-4o saves ~65% vs sending all requests to GPT-4o, with identical user-facing quality on the simple tasks.
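
As a sanity check on that ~65% figure, using the input prices above and one added assumption of ours (the complex 20% of requests consume roughly 2x the tokens of the simple ones):

// Rough check on the ~65% claim (input prices per 1M tokens)
// Assumption: complex requests use ~2x the tokens of simple ones
const allFlagship = 0.8 * 1 * 2.50 + 0.2 * 2 * 2.50; // everything on GPT-4o
const routed      = 0.8 * 1 * 0.15 + 0.2 * 2 * 2.50; // 80% on GPT-4o-mini
console.log(`${((1 - routed / allFlagship) * 100).toFixed(0)}% saved`); // 63% saved
// With equal token usage per request the savings rise to ~75%.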

Strategy 2: Prompt Caching

Prompt caching gives you a discount on repeated prefix tokens. It's one of the fastest wins because most production apps have large, consistent system prompts.

OpenAI: Automatic Caching

OpenAI caches prompts automatically when the same prefix (≥1,024 tokens) appears in multiple requests within a short window. Cached input tokens cost 50% less. No code changes required — just keep your system prompt consistent.

// Check if caching is working — look for cached_tokens in usage
const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [
    { role: 'system', content: LARGE_SYSTEM_PROMPT }, // 2000+ tokens — will be cached
    { role: 'user', content: userMessage },
  ],
});

// Check cache hit rate
const { prompt_tokens } = response.usage;
const cached = response.usage.prompt_tokens_details?.cached_tokens ?? 0;
const cacheRate = cached / prompt_tokens;
console.log(`Cache hit rate: ${(cacheRate * 100).toFixed(1)}%`);
// Target: >60% cache hit rate for production system prompts

Anthropic Claude: Explicit Cache Control

Claude requires explicit cache_control markers to enable caching. Mark your system prompt and any documents with the ephemeral cache type. Cache read tokens cost 90% less than regular input tokens.

import Anthropic from '@anthropic-ai/sdk';
const anthropic = new Anthropic();

const response = await anthropic.messages.create({
  model: 'claude-sonnet-4-5',
  max_tokens: 1024,
  system: [
    {
      type: 'text',
      text: LARGE_SYSTEM_PROMPT, // 2000+ tokens
      cache_control: { type: 'ephemeral' }, // Mark for caching
    },
  ],
  messages: [
    {
      role: 'user',
      content: [
        {
          type: 'text',
          text: LARGE_DOCUMENT, // Document to analyze — also cache it
          cache_control: { type: 'ephemeral' },
        },
        { type: 'text', text: userQuestion },
      ],
    },
  ],
});

// Check cache savings
const { cache_read_input_tokens, cache_creation_input_tokens } = response.usage;
console.log('Cache reads (cheap):', cache_read_input_tokens);
console.log('Cache writes (one-time):', cache_creation_input_tokens);

Strategy 3: OpenAI Batch API

The Batch API processes requests asynchronously with a 24-hour window in exchange for a 50% discount on all models. Ideal for any workflow that doesn't need real-time responses:

✅ Good for Batch API

  • Embedding generation for document collections
  • Overnight content classification pipelines
  • Scheduled report generation
  • A/B test result analysis
  • Data enrichment workflows
  • Bulk content moderation

❌ Not for Batch API

  • Real-time chat interfaces
  • User-facing features with <5s latency requirements
  • Streaming responses
  • Interactive coding assistants
  • Any user blocking on the response

// Create a batch job (JSONL input)
import fs from 'fs';
import OpenAI from 'openai';

const openai = new OpenAI();

// Build batch requests file; `documents` is your own array of { text: string }
const requests = documents.map((doc, i) => ({
  custom_id: `doc-${i}`,
  method: 'POST',
  url: '/v1/chat/completions',
  body: {
    model: 'gpt-4o-mini',
    messages: [
      { role: 'system', content: 'Classify this document. Respond with JSON.' },
      { role: 'user', content: doc.text },
    ],
    max_tokens: 100,
  },
}));

fs.writeFileSync('batch_input.jsonl', requests.map(r => JSON.stringify(r)).join('\n'));

// Upload and create batch
const file = await openai.files.create({
  file: fs.createReadStream('batch_input.jsonl'),
  purpose: 'batch',
});

const batch = await openai.batches.create({
  input_file_id: file.id,
  endpoint: '/v1/chat/completions',
  completion_window: '24h',
});

console.log('Batch ID:', batch.id);
// Poll /v1/batches/{batch_id} for completion
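
A minimal polling sketch using the Node SDK's batches.retrieve and files.content (production code should back off between checks and handle partial failures via the error_file_id):

// Poll the batch until it completes, then download the JSONL results
async function waitForBatch(batchId: string) {
  while (true) {
    const current = await openai.batches.retrieve(batchId);
    if (current.status === 'completed') {
      const fileResponse = await openai.files.content(current.output_file_id!);
      const raw = await fileResponse.text();
      return raw.split('\n').filter(Boolean).map((line) => JSON.parse(line));
    }
    if (['failed', 'expired', 'cancelled'].includes(current.status)) {
      throw new Error(`Batch ${batchId} ended with status ${current.status}`);
    }
    await new Promise((r) => setTimeout(r, 60_000)); // check once a minute
  }
}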

Strategy 4: Output Token Control

Output tokens cost 3-4x more than input tokens. Reducing output length is high-leverage, especially for structured data tasks:

Set a tight max_tokens

For classification tasks, set max_tokens: 50. For JSON extraction, max_tokens: 200. For summaries, max_tokens: 300. You pay for every token the model generates, whether your app uses it or not.

Request JSON output

Use response_format: { type: "json_object" } (OpenAI) or tell Claude to respond ONLY in JSON. Eliminates preamble ("Sure, here is the JSON...") that burns tokens.

Add conciseness instructions

Add to system prompt: "Be concise. Do not repeat the question. Answer directly." This can cut output tokens by 20-30% on verbose models.

Use structured outputs for extraction

OpenAI Structured Outputs (response_format: { type: "json_schema" }) enforces exact output shape — no hallucinated fields, no padding text. Predictable token count.
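
A minimal sketch combining a tight max_tokens cap with a Structured Outputs schema (the invoice_fields schema and documentText variable are illustrative, not from a real app):

// Tight output cap + enforced JSON shape (schema is illustrative)
const extraction = await openai.chat.completions.create({
  model: 'gpt-4o-mini',
  messages: [
    { role: 'system', content: 'Extract the requested fields. Be concise.' },
    { role: 'user', content: documentText }, // your input document (assumed)
  ],
  max_tokens: 200, // hard cap; you pay for every generated token
  response_format: {
    type: 'json_schema',
    json_schema: {
      name: 'invoice_fields',
      strict: true,
      schema: {
        type: 'object',
        properties: {
          vendor: { type: 'string' },
          total: { type: 'number' },
        },
        required: ['vendor', 'total'],
        additionalProperties: false,
      },
    },
  },
});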

Tracking & Alerting on LLM Spending

You can't optimize what you can't measure. Set up cost tracking before you hit an unexpected bill:

// Track cost per request in your app
function calculateCost(usage: {
  prompt_tokens: number;
  completion_tokens: number;
  cached_tokens?: number;
}, model: string): number {
  const PRICING: Record<string, { input: number; output: number; cached?: number }> = {
    'gpt-4o': { input: 0.0000025, output: 0.00001, cached: 0.00000125 },
    'gpt-4o-mini': { input: 0.00000015, output: 0.0000006, cached: 0.000000075 },
    'claude-sonnet-4-5': { input: 0.000003, output: 0.000015, cached: 0.0000003 },
  };

  const pricing = PRICING[model];
  if (!pricing) return 0;

  const cachedTokens = usage.cached_tokens ?? 0;
  const regularInput = usage.prompt_tokens - cachedTokens;

  return (
    regularInput * pricing.input +
    (cachedTokens * (pricing.cached ?? pricing.input)) +
    usage.completion_tokens * pricing.output
  );
}

// Log to your analytics/metrics system (`metrics` is your own client)
const cost = calculateCost(
  { ...response.usage, cached_tokens: response.usage.prompt_tokens_details?.cached_tokens },
  model
);
metrics.increment('llm.cost.usd', cost, { model, feature: 'chat' });
metrics.increment('llm.tokens.input', response.usage.prompt_tokens, { model });
metrics.increment('llm.tokens.output', response.usage.completion_tokens, { model });

Each provider also exposes a native spend dashboard:

  • OpenAI (platform.openai.com/usage): per-model daily spend plus project breakdowns. Set monthly limits under Billing.
  • Anthropic (console.anthropic.com/usage): workspace-level usage. Set spend limits per workspace under Settings.
  • Google (cloud.google.com/billing): per-service billing alerts. Set budget alerts at 50%/80%/100% thresholds.


Bonus: Multi-Provider Failover Reduces Costs & Improves Reliability

Using a single LLM provider creates both a cost risk (price hikes) and a reliability risk (outages). Multi-provider routing gives you flexibility on both:

// Simple multi-provider fallback with cost awareness.
// callProvider is your own adapter; a sketch follows this block.
async function generateWithFallback(prompt: string, feature: string) {
  const providers = [
    { client: openai, model: 'gpt-4o-mini', costMultiplier: 1 },      // Primary (cheapest)
    { client: anthropic, model: 'claude-haiku-4-5', costMultiplier: 5 }, // Secondary
    { client: openai, model: 'gpt-4o', costMultiplier: 16 },            // Last resort
  ];

  for (const provider of providers) {
    try {
      const result = await callProvider(provider, prompt);
      metrics.increment('llm.provider.success', 1, {
        model: provider.model,
        feature
      });
      return result;
    } catch (err: any) {
      if (err?.status === 429 || err?.status >= 500) {
        continue; // Try next provider
      }
      throw err; // Non-retriable error
    }
  }
  throw new Error('All LLM providers failed');
}
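
The callProvider adapter above is not a library function. A minimal sketch that normalizes the two SDKs to a plain string, assuming the openai and anthropic clients instantiated earlier in this guide:

// Sketch of a provider adapter for the fallback chain above
type ProviderConfig = { client: unknown; model: string; costMultiplier: number };

async function callProvider(provider: ProviderConfig, prompt: string): Promise<string> {
  if (provider.model.startsWith('claude')) {
    const res = await anthropic.messages.create({
      model: provider.model,
      max_tokens: 1024,
      messages: [{ role: 'user', content: prompt }],
    });
    const block = res.content[0];
    return block.type === 'text' ? block.text : '';
  }
  const res = await openai.chat.completions.create({
    model: provider.model,
    messages: [{ role: 'user', content: prompt }],
  });
  return res.choices[0].message.content ?? '';
}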


Frequently Asked Questions

How much does the OpenAI API cost per 1M tokens?

As of May 2026: GPT-4o input $2.50/1M tokens, output $10.00/1M tokens. GPT-4o-mini input $0.15/1M tokens, output $0.60/1M tokens. GPT-4.1 input $2.00/1M tokens, output $8.00/1M tokens. o3 input $10.00/1M tokens, output $40.00/1M tokens. Prompt caching reduces input costs by 50% for OpenAI (cached tokens cost $1.25/1M for GPT-4o). Batch API reduces costs by 50% with up to 24-hour turnaround.

What is prompt caching and how much does it save?

Prompt caching stores repeated prefix tokens (system prompts, few-shot examples, documents) so subsequent requests pay a discounted rate for those tokens instead of the full input price. OpenAI: cached input tokens cost 50% less ($1.25/1M vs $2.50/1M for GPT-4o). Anthropic Claude: cache read tokens cost 90% less than regular input tokens ($0.30/1M vs $3.00/1M for Sonnet 4.5). Google Gemini: implicit caching is enabled by default, with explicit context caching also available. For apps with consistent system prompts, savings of 40-70% on input costs are typical.

Should I use GPT-4o or GPT-4o-mini to save money?

GPT-4o-mini costs ~16x less than GPT-4o ($0.15/1M vs $2.50/1M input). It handles classification, extraction, summarization, and simple Q&A at near-GPT-4o quality. Use GPT-4o-mini for: intent classification, document chunking, structured data extraction, FAQ matching, sentiment analysis. Use GPT-4o for: complex reasoning, code generation, creative writing, nuanced tasks requiring deep context. Model routing — using the right model for each task — is the single highest-impact cost reduction strategy.

What is the OpenAI Batch API and how much does it save?

The OpenAI Batch API processes requests asynchronously with up to 24-hour turnaround in exchange for 50% cost reduction across all models. Input $1.25/1M tokens (vs $2.50 synchronous), output $5.00/1M tokens (vs $10.00). Use it for: embedding generation, document classification, offline report generation, A/B test evaluation, data enrichment pipelines. Not suitable for real-time user interactions requiring <1s response.

How do I track and monitor my LLM API spending?

OpenAI: platform.openai.com/usage shows per-model and per-day spend with project-level breakdowns. Set spending limits under Settings > Billing. Anthropic: console.anthropic.com/usage for Claude API spend with workspace-level tracking. Google: cloud.google.com/billing with per-service cost alerts. For third-party tracking across all providers, tools like Helicone, LangSmith, or Portkey provide unified cost dashboards with per-request cost attribution to features or users.
