
LLM Cost Optimization: Cut Your OpenAI & Claude API Spending by 60%

LLM API costs compound fast at scale. A product that costs $200/month at 1,000 users can hit $20,000/month at 100,000 users — without any optimization. This guide covers the six highest-impact cost reduction strategies for OpenAI, Anthropic Claude, and Google Gemini in production.

Tags: OpenAI, Claude, Gemini, Cost Optimization, LLM

TL;DR — Top 3 Quick Wins

  1. Model routing — use GPT-4o-mini / Haiku for classification and simple tasks (60-80% savings)
  2. Prompt caching — enable caching on system prompts and documents (40-90% on input tokens)
  3. Batch API — move offline tasks to async batch processing (50% automatic discount)

Why LLM API Costs Spiral Out of Control

LLM costs have three characteristics that make them hard to predict and control:

  1. Output token premium (3-4×): Output tokens cost 3-4x more than input tokens on every major provider. Verbose responses kill budgets.

  2. Context growth (O(n)): Multi-turn chats resend the full conversation with each request, so a 10-turn conversation sends roughly 10x the tokens of turn 1.

  3. Model price range (10-50×): GPT-4o costs 16x more than GPT-4o-mini. Using the wrong model for simple tasks is the #1 source of wasted spend.

LLM API Pricing Comparison (May 2026)

Prices per 1M tokens. Cached and batch pricing available where supported.

| Provider | Model | Input | Output | Cached Input | Batch Input |
| --- | --- | --- | --- | --- | --- |
| OpenAI | GPT-4o | $2.50 | $10.00 | $1.25 | $1.25 |
| OpenAI | GPT-4o-mini | $0.15 | $0.60 | $0.075 | $0.075 |
| OpenAI | GPT-4.1 | $2.00 | $8.00 | $0.50 | $1.00 |
| Anthropic | Claude Sonnet 4.5 | $3.00 | $15.00 | $0.30 | $1.50 |
| Anthropic | Claude Haiku 4.5 | $0.80 | $4.00 | $0.08 | $0.40 |
| Google | Gemini 2.5 Flash | $0.15 | $0.60 | $0.0375 | $0.075 |

* Prices are approximate and change frequently. Verify current pricing at platform.openai.com/pricing, anthropic.com/pricing, and cloud.google.com/vertex-ai/generative-ai/pricing.
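
To make these prices concrete, here is the arithmetic for a single request (the 2,000/500 token counts are illustrative):

// Cost of one hypothetical GPT-4o request: 2,000 input + 500 output tokens
const inputCost = 2_000 * (2.50 / 1_000_000);  // $0.0050
const outputCost = 500 * (10.00 / 1_000_000);  // $0.0050
const totalCost = inputCost + outputCost;      // $0.0100 per request
// At 1M such requests per month that is ~$10,000 on GPT-4o;
// the same traffic on GPT-4o-mini costs ~$600.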

6 Cost Optimization Strategies (Ranked by Impact)

  1. Model Routing (60-80% savings; effort: medium). Use GPT-4o-mini or Haiku for tasks that don't require flagship model capability. Classification, extraction, and summarization rarely need GPT-4o.

  2. Prompt Caching (40-90% savings on input; effort: low). Keep system prompts and documents as cached prefixes. OpenAI caches automatically; Claude requires explicit cache_control markers.

  3. Batch API (50% savings; effort: low-medium). Route non-real-time requests (embeddings, offline classification, report generation) through the Batch API for an automatic 50% discount.

  4. Output Token Control (20-40% savings; effort: low). Set max_tokens to a tight limit for structured tasks. Add "Be concise" or JSON-only instructions to system prompts. Output tokens cost 3-4x more than input tokens.

  5. Semantic Caching (30-60% savings; effort: high). Cache LLM responses by semantic similarity of the input. Near-duplicate questions (same intent, different phrasing) return cached responses at zero LLM cost. See the sketch after this list.

  6. Context Window Pruning (15-30% savings; effort: medium). For multi-turn conversations, summarize or prune old turns instead of sending the full history with every request; each round trip sends O(n) tokens, where n grows with conversation length. See the sketch after this list.
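
Strategies 1-4 each get a detailed section below; strategies 5 and 6 do not, so here are minimal sketches. First, semantic caching. This sketch keeps the cache in memory and assumes a hypothetical getEmbedding helper (e.g., wrapping OpenAI's text-embedding-3-small); production systems typically use a vector store such as Redis or pgvector instead of a linear scan.

// Strategy 5 sketch: semantic response cache (in-memory, cosine similarity)
type CacheEntry = { embedding: number[]; response: string };
const semanticCache: CacheEntry[] = [];
const SIMILARITY_THRESHOLD = 0.95; // tune per workload

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

async function cachedCompletion(prompt: string): Promise<string> {
  const embedding = await getEmbedding(prompt); // hypothetical embedding helper
  for (const entry of semanticCache) {
    if (cosineSimilarity(embedding, entry.embedding) >= SIMILARITY_THRESHOLD) {
      return entry.response; // cache hit: zero LLM cost
    }
  }
  const completion = await callLLM('faq_matching', prompt); // from Strategy 1 below
  const text = completion.choices[0].message.content ?? '';
  semanticCache.push({ embedding, response: text });
  return text;
}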
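
And context window pruning: keep the system prompt plus the most recent turns verbatim, and fold older turns into a running summary. The summarize helper here is hypothetical; it could itself be a cheap GPT-4o-mini call.

// Strategy 6 sketch: prune old turns, keep a running summary
type Turn = { role: 'user' | 'assistant'; content: string };

const KEEP_RECENT_TURNS = 6; // tune per app

async function buildPrunedMessages(systemPrompt: string, history: Turn[]) {
  if (history.length <= KEEP_RECENT_TURNS) {
    return [{ role: 'system' as const, content: systemPrompt }, ...history];
  }
  const older = history.slice(0, -KEEP_RECENT_TURNS);
  const recent = history.slice(-KEEP_RECENT_TURNS);
  const summary = await summarize(older); // hypothetical helper
  return [
    { role: 'system' as const, content: systemPrompt },
    { role: 'system' as const, content: `Summary of earlier conversation: ${summary}` },
    ...recent,
  ];
}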

Strategy 1: Model Routing in Practice

The most impactful cost reduction is routing simple tasks to cheaper models. A typical production app has 3-5 distinct LLM tasks with very different complexity requirements:

// Model routing by task type
import OpenAI from 'openai';

const openai = new OpenAI(); // assumes OPENAI_API_KEY is set in the environment

const MODEL_ROUTER = {
  // Simple tasks — GPT-4o-mini is 16x cheaper, near-identical quality
  classify_intent: 'gpt-4o-mini',
  extract_entities: 'gpt-4o-mini',
  summarize_document: 'gpt-4o-mini',
  sentiment_analysis: 'gpt-4o-mini',
  faq_matching: 'gpt-4o-mini',

  // Complex tasks — need flagship model capability
  complex_reasoning: 'gpt-4o',
  code_generation: 'gpt-4o',
  multi_step_analysis: 'gpt-4o',
  creative_writing: 'gpt-4o',
} as const;

async function callLLM(task: keyof typeof MODEL_ROUTER, prompt: string) {
  const model = MODEL_ROUTER[task];
  return openai.chat.completions.create({
    model,
    messages: [{ role: 'user', content: prompt }],
    max_tokens: task.startsWith('extract') ? 200 : 1000, // tight limits for simple tasks
  });
}

Real-world example: A SaaS product routing 80% of requests to GPT-4o-mini and 20% to GPT-4o saves ~65% vs sending all requests to GPT-4o, with identical user-facing quality on the simple tasks.
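
As a sanity check on that ~65% figure, using the input prices above and one added assumption of ours (the complex 20% of requests consume roughly 2x the tokens of the simple ones):

// Rough check on the ~65% claim (input prices per 1M tokens)
// Assumption: complex requests use ~2x the tokens of simple ones
const allFlagship = 0.8 * 1 * 2.50 + 0.2 * 2 * 2.50; // everything on GPT-4o
const routed      = 0.8 * 1 * 0.15 + 0.2 * 2 * 2.50; // 80% on GPT-4o-mini
console.log(`${((1 - routed / allFlagship) * 100).toFixed(0)}% saved`); // 63% saved
// With equal token usage per request the savings rise to ~75%.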

Strategy 2: Prompt Caching

Prompt caching gives you a discount on repeated prefix tokens. It's one of the fastest wins because most production apps have large, consistent system prompts.

OpenAI: Automatic Caching

OpenAI caches prompts automatically when the same prefix (≥1,024 tokens) appears in multiple requests within a short window. Cached input tokens cost 50% less. No code changes required — just keep your system prompt consistent.

// Check if caching is working — look for cached_tokens in usage
const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [
    { role: 'system', content: LARGE_SYSTEM_PROMPT }, // 2000+ tokens — will be cached
    { role: 'user', content: userMessage },
  ],
});

// Check cache hit rate
const { prompt_tokens } = response.usage;
const cached = response.usage.prompt_tokens_details?.cached_tokens ?? 0;
const cacheRate = cached / prompt_tokens;
console.log(`Cache hit rate: ${(cacheRate * 100).toFixed(1)}%`);
// Target: >60% cache hit rate for production system prompts

Anthropic Claude: Explicit Cache Control

Claude requires explicit cache_control markers to enable caching. Mark your system prompt and any documents with the ephemeral cache type. Cache read tokens cost 90% less than regular input tokens.

import Anthropic from '@anthropic-ai/sdk';
const anthropic = new Anthropic();

const response = await anthropic.messages.create({
  model: 'claude-sonnet-4-5',
  max_tokens: 1024,
  system: [
    {
      type: 'text',
      text: LARGE_SYSTEM_PROMPT, // 2000+ tokens
      cache_control: { type: 'ephemeral' }, // Mark for caching
    },
  ],
  messages: [
    {
      role: 'user',
      content: [
        {
          type: 'text',
          text: LARGE_DOCUMENT, // Document to analyze — also cache it
          cache_control: { type: 'ephemeral' },
        },
        { type: 'text', text: userQuestion },
      ],
    },
  ],
});

// Check cache savings
const { cache_read_input_tokens, cache_creation_input_tokens } = response.usage;
console.log('Cache reads (cheap):', cache_read_input_tokens);
console.log('Cache writes (one-time):', cache_creation_input_tokens);

Strategy 3: OpenAI Batch API

The Batch API processes requests asynchronously with a 24-hour window in exchange for a 50% discount on all models. Ideal for any workflow that doesn't need real-time responses:

✅ Good for Batch API

  • Embedding generation for document collections
  • Overnight content classification pipelines
  • Scheduled report generation
  • A/B test result analysis
  • Data enrichment workflows
  • Bulk content moderation

❌ Not for Batch API

  • Real-time chat interfaces
  • User-facing features with <5s latency requirements
  • Streaming responses
  • Interactive coding assistants
  • Any user blocking on the response

// Create a batch job (JSONL input)
import fs from 'fs';
import OpenAI from 'openai';

const openai = new OpenAI();

// Build batch requests file; `documents` is your own array of { text: string }
const requests = documents.map((doc, i) => ({
  custom_id: `doc-${i}`,
  method: 'POST',
  url: '/v1/chat/completions',
  body: {
    model: 'gpt-4o-mini',
    messages: [
      { role: 'system', content: 'Classify this document. Respond with JSON.' },
      { role: 'user', content: doc.text },
    ],
    max_tokens: 100,
  },
}));

fs.writeFileSync('batch_input.jsonl', requests.map(r => JSON.stringify(r)).join('\n'));

// Upload and create batch
const file = await openai.files.create({
  file: fs.createReadStream('batch_input.jsonl'),
  purpose: 'batch',
});

const batch = await openai.batches.create({
  input_file_id: file.id,
  endpoint: '/v1/chat/completions',
  completion_window: '24h',
});

console.log('Batch ID:', batch.id);
// Poll /v1/batches/{batch_id} for completion
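
A minimal polling sketch using the Node SDK's batches.retrieve and files.content (production code should back off between checks and handle partial failures via the error_file_id):

// Poll the batch until it completes, then download the JSONL results
async function waitForBatch(batchId: string) {
  while (true) {
    const current = await openai.batches.retrieve(batchId);
    if (current.status === 'completed') {
      const fileResponse = await openai.files.content(current.output_file_id!);
      const raw = await fileResponse.text();
      return raw.split('\n').filter(Boolean).map((line) => JSON.parse(line));
    }
    if (['failed', 'expired', 'cancelled'].includes(current.status)) {
      throw new Error(`Batch ${batchId} ended with status ${current.status}`);
    }
    await new Promise((r) => setTimeout(r, 60_000)); // check once a minute
  }
}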

Strategy 4: Output Token Control

Output tokens cost 3-4x more than input tokens. Reducing output length is high-leverage, especially for structured data tasks:

Set a tight max_tokens

For classification tasks, set max_tokens: 50. For JSON extraction, max_tokens: 200. For summaries, max_tokens: 300. You pay for every token the model generates, whether your app uses it or not.

Request JSON output

Use response_format: { type: "json_object" } (OpenAI) or tell Claude to respond ONLY in JSON. Eliminates preamble ("Sure, here is the JSON...") that burns tokens.

Add conciseness instructions

Add to system prompt: "Be concise. Do not repeat the question. Answer directly." This can cut output tokens by 20-30% on verbose models.

Use structured outputs for extraction

OpenAI Structured Outputs (response_format: { type: "json_schema" }) enforces exact output shape — no hallucinated fields, no padding text. Predictable token count.
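
A minimal sketch combining a tight max_tokens cap with a Structured Outputs schema (the invoice_fields schema and documentText variable are illustrative, not from a real app):

// Tight output cap + enforced JSON shape (schema is illustrative)
const extraction = await openai.chat.completions.create({
  model: 'gpt-4o-mini',
  messages: [
    { role: 'system', content: 'Extract the requested fields. Be concise.' },
    { role: 'user', content: documentText }, // your input document (assumed)
  ],
  max_tokens: 200, // hard cap; you pay for every generated token
  response_format: {
    type: 'json_schema',
    json_schema: {
      name: 'invoice_fields',
      strict: true,
      schema: {
        type: 'object',
        properties: {
          vendor: { type: 'string' },
          total: { type: 'number' },
        },
        required: ['vendor', 'total'],
        additionalProperties: false,
      },
    },
  },
});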

Tracking & Alerting on LLM Spending

You can't optimize what you can't measure. Set up cost tracking before you hit an unexpected bill:

// Track cost per request in your app
function calculateCost(usage: {
  prompt_tokens: number;
  completion_tokens: number;
  cached_tokens?: number;
}, model: string): number {
  const PRICING: Record<string, { input: number; output: number; cached?: number }> = {
    'gpt-4o': { input: 0.0000025, output: 0.00001, cached: 0.00000125 },
    'gpt-4o-mini': { input: 0.00000015, output: 0.0000006, cached: 0.000000075 },
    'claude-sonnet-4-5': { input: 0.000003, output: 0.000015, cached: 0.0000003 },
  };

  const pricing = PRICING[model];
  if (!pricing) return 0;

  const cachedTokens = usage.cached_tokens ?? 0;
  const regularInput = usage.prompt_tokens - cachedTokens;

  return (
    regularInput * pricing.input +
    (cachedTokens * (pricing.cached ?? pricing.input)) +
    usage.completion_tokens * pricing.output
  );
}

// Log to your analytics/metrics system (`metrics` is your own client)
const cost = calculateCost(
  { ...response.usage, cached_tokens: response.usage.prompt_tokens_details?.cached_tokens },
  model
);
metrics.increment('llm.cost.usd', cost, { model, feature: 'chat' });
metrics.increment('llm.tokens.input', response.usage.prompt_tokens, { model });
metrics.increment('llm.tokens.output', response.usage.completion_tokens, { model });

Each provider also exposes a native spend dashboard:

  • OpenAI (platform.openai.com/usage): per-model daily spend plus project breakdowns. Set monthly limits under Billing.
  • Anthropic (console.anthropic.com/usage): workspace-level usage. Set spend limits per workspace under Settings.
  • Google (cloud.google.com/billing): per-service billing alerts. Set budget alerts at 50%/80%/100% thresholds.


Bonus: Multi-Provider Failover Reduces Costs & Improves Reliability

Using a single LLM provider creates both a cost risk (price hikes) and a reliability risk (outages). Multi-provider routing gives you flexibility on both:

// Simple multi-provider fallback with cost awareness.
// callProvider is your own adapter; a sketch follows this block.
async function generateWithFallback(prompt: string, feature: string) {
  const providers = [
    { client: openai, model: 'gpt-4o-mini', costMultiplier: 1 },      // Primary (cheapest)
    { client: anthropic, model: 'claude-haiku-4-5', costMultiplier: 5 }, // Secondary
    { client: openai, model: 'gpt-4o', costMultiplier: 16 },            // Last resort
  ];

  for (const provider of providers) {
    try {
      const result = await callProvider(provider, prompt);
      metrics.increment('llm.provider.success', 1, {
        model: provider.model,
        feature
      });
      return result;
    } catch (err: any) {
      if (err?.status === 429 || err?.status >= 500) {
        continue; // Try next provider
      }
      throw err; // Non-retriable error
    }
  }
  throw new Error('All LLM providers failed');
}
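
The callProvider adapter above is not a library function. A minimal sketch that normalizes the two SDKs to a plain string, assuming the openai and anthropic clients instantiated earlier in this guide:

// Sketch of a provider adapter for the fallback chain above
type ProviderConfig = { client: unknown; model: string; costMultiplier: number };

async function callProvider(provider: ProviderConfig, prompt: string): Promise<string> {
  if (provider.model.startsWith('claude')) {
    const res = await anthropic.messages.create({
      model: provider.model,
      max_tokens: 1024,
      messages: [{ role: 'user', content: prompt }],
    });
    const block = res.content[0];
    return block.type === 'text' ? block.text : '';
  }
  const res = await openai.chat.completions.create({
    model: provider.model,
    messages: [{ role: 'user', content: prompt }],
  });
  return res.choices[0].message.content ?? '';
}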


Frequently Asked Questions

How much does the OpenAI API cost per 1M tokens?

As of May 2026: GPT-4o input $2.50/1M tokens, output $10.00/1M tokens. GPT-4o-mini input $0.15/1M tokens, output $0.60/1M tokens. GPT-4.1 input $2.00/1M tokens, output $8.00/1M tokens. o3 input $10.00/1M tokens, output $40.00/1M tokens. Prompt caching reduces input costs by 50% for OpenAI (cached tokens cost $1.25/1M for GPT-4o). Batch API reduces costs by 50% with up to 24-hour turnaround.

What is prompt caching and how much does it save?

Prompt caching stores repeated prefix tokens (system prompts, few-shot examples, documents) so subsequent requests pay a discounted rate for those tokens instead of the full input price. OpenAI: cached input tokens cost 50% less ($1.25/1M vs $2.50/1M for GPT-4o). Anthropic Claude: cache read tokens cost 90% less than regular input tokens ($0.30/1M vs $3.00/1M for Sonnet 4.5). Google Gemini: implicit caching is enabled by default, with explicit context caching also available. For apps with consistent system prompts, savings of 40-70% on input costs are typical.

Should I use GPT-4o or GPT-4o-mini to save money?

GPT-4o-mini costs ~16x less than GPT-4o ($0.15/1M vs $2.50/1M input). It handles classification, extraction, summarization, and simple Q&A at near-GPT-4o quality. Use GPT-4o-mini for: intent classification, document chunking, structured data extraction, FAQ matching, sentiment analysis. Use GPT-4o for: complex reasoning, code generation, creative writing, nuanced tasks requiring deep context. Model routing — using the right model for each task — is the single highest-impact cost reduction strategy.

What is the OpenAI Batch API and how much does it save?

The OpenAI Batch API processes requests asynchronously with up to 24-hour turnaround in exchange for 50% cost reduction across all models. Input $1.25/1M tokens (vs $2.50 synchronous), output $5.00/1M tokens (vs $10.00). Use it for: embedding generation, document classification, offline report generation, A/B test evaluation, data enrichment pipelines. Not suitable for real-time user interactions requiring <1s response.

How do I track and monitor my LLM API spending?

OpenAI: platform.openai.com/usage shows per-model and per-day spend with project-level breakdowns. Set spending limits under Settings > Billing. Anthropic: console.anthropic.com/usage for Claude API spend with workspace-level tracking. Google: cloud.google.com/billing with per-service cost alerts. For third-party tracking across all providers, tools like Helicone, LangSmith, or Portkey provide unified cost dashboards with per-request cost attribution to features or users.
