AI Cost Optimization: How to Reduce LLM Spend Without Hurting Quality

Ten specific, measurable tactics to cut AI inference cost by 40–70% — model routing, caching, context compression, batching, and the observability you need to measure each lever.

Updated April 12, 2026 · 11 min read

Introduction

AI inference cost is the new AWS bill — a line item that starts invisible and becomes the second-largest category on your P&L by month twelve if you do not actively manage it. The good news: most teams can cut AI spend by 40–70% with standard engineering discipline and no quality loss. This guide covers the ten tactics that consistently move the needle.

1. Route to smaller models for simple tasks

Not every task needs Claude 3.5 Sonnet or GPT-4.1. Classification, short extraction, keyword detection, and simple routing run perfectly on Haiku, GPT-4.1-mini, or Gemini Flash at 1/10th to 1/20th the cost. A thin router that picks model per task type reduces average inference cost 40–70% on typical mixed workloads.

Implementation: tag every LLM call with a task type at the call site. Pick the cheapest model that passes the task's eval set. Re-evaluate routing decisions every time a model is updated.

  • Route classification/extraction/routing to small models
  • Reserve frontier models for reasoning and agentic decisions
  • Typical cost reduction from routing alone: 40–70%
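A minimal sketch of such a router, assuming a simple task-type lookup table. The task types and model names below are placeholders; swap in whatever models actually pass your eval sets.

```python
# Minimal task-type router: pick the cheapest model approved for each task,
# fall back to a frontier model for anything untagged or unfamiliar.
# Task types and model names are illustrative placeholders.

MODEL_BY_TASK = {
    "classification": "claude-haiku",   # cheap, fast
    "extraction":     "gpt-4.1-mini",   # cheap, fast
    "routing":        "gemini-flash",   # cheap, fast
    "reasoning":      "claude-sonnet",  # frontier, reserved
    "agentic":        "claude-sonnet",
}

DEFAULT_MODEL = "claude-sonnet"

def pick_model(task_type: str) -> str:
    """Return the cheapest model that passes this task type's eval set."""
    return MODEL_BY_TASK.get(task_type, DEFAULT_MODEL)

# At every call site, tag the call with its task type:
# model = pick_model("classification")
# response = llm_client.complete(model=model, prompt=prompt)
```

Because the table lives in one place, re-evaluating routing after a model update is a one-line change plus a rerun of the relevant evals.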

2. Use prompt caching for stable prefixes

Anthropic and OpenAI both support prompt caching for long system prompts and retrieved context. If you send the same 5,000-token system prompt to every user, caching it cuts that cost by roughly 90% after the first call. For RAG, cache the retrieved chunks for the session's short window.

  • Cache long system prompts (~90% cost reduction on the prefix)
  • Cache retrieved context within a session
  • Always measure on your workload — savings vary with prompt shape
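A minimal sketch using the Anthropic Python SDK's cache_control marker on a stable system prompt. The model name and prompt are placeholders; verify minimum cacheable prompt sizes and cache pricing against current provider documentation.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LONG_SYSTEM_PROMPT = "..."  # the same ~5,000-token prompt sent to every user

response = client.messages.create(
    model="claude-3-5-sonnet-latest",
    max_tokens=512,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Mark the stable prefix as cacheable; later calls that reuse the
            # identical prefix are billed at the reduced cache-read rate.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Summarize my open tickets."}],
)
```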

3. Add a content-addressed response cache

Many calls are redundant across users and sessions. A Redis cache keyed by hash(input + model + system prompt + parameters) eliminates repeat work at zero quality cost. 20–50% of production LLM traffic can be served from cache in typical RAG and classification workloads.

  • Hash-keyed Redis cache eliminates duplicate calls
  • 20–50% reduction on typical RAG/classification workloads
  • Never cache user-personalized generative outputs
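A sketch of the hash-keyed cache, assuming a local Redis instance and an injected call_llm function standing in for your provider client.

```python
import hashlib
import json
import redis

r = redis.Redis()  # assumes a local Redis instance

CACHE_TTL_SECONDS = 24 * 3600  # tune per workload; never cache personalized output

def cache_key(model: str, system: str, user_input: str, params: dict) -> str:
    """Deterministic key over everything that affects the completion."""
    payload = json.dumps(
        {"model": model, "system": system, "input": user_input, "params": params},
        sort_keys=True,
    )
    return "llm:" + hashlib.sha256(payload.encode()).hexdigest()

def cached_complete(call_llm, model: str, system: str, user_input: str, params: dict) -> str:
    key = cache_key(model, system, user_input, params)
    hit = r.get(key)
    if hit is not None:
        return hit.decode()                      # cache hit: zero inference cost
    result = call_llm(model=model, system=system, user_input=user_input, **params)
    r.set(key, result, ex=CACHE_TTL_SECONDS)
    return result
```

Measure the hit rate per workload before relying on the 20–50% figure; it depends heavily on how repetitive your traffic is.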

4. Compress long conversation context

Long chat conversations quickly push into high-token territory. Compression strategies: summarize older turns, drop tool-call logs once their results are incorporated, and use a sliding-window context with a summary of the dropped content. A 20-turn conversation can often be compressed by 4x with no perceived quality loss.

  • Summarize turns older than N=10; drop tool logs after integration
  • Sliding window with summary of dropped content
  • 4x token reduction is achievable without quality loss
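A sketch of the sliding-window approach, assuming chat turns stored as {role, content} dicts and a summarize callable backed by a cheap model.

```python
# Sliding-window context: keep recent turns verbatim, collapse older turns
# into one summary turn, and drop tool logs that are already reflected later.
# `summarize` is a placeholder for a cheap-model call.

KEEP_RECENT_TURNS = 10

def compress_history(turns: list[dict], summarize) -> list[dict]:
    """Return a shorter message list that preserves the conversation's gist."""
    if len(turns) <= KEEP_RECENT_TURNS:
        return turns
    older, recent = turns[:-KEEP_RECENT_TURNS], turns[-KEEP_RECENT_TURNS:]
    older = [t for t in older if t.get("role") != "tool"]  # drop integrated tool logs
    summary = summarize("\n".join(t["content"] for t in older))
    return [{"role": "system", "content": f"Summary of earlier conversation: {summary}"}] + recent
```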

5. Batch offline workloads

OpenAI and Anthropic both offer batch APIs at roughly 50% discount. Every workload that does not need real-time response — nightly analyses, overnight summarization jobs, bulk classification — should go through batch. Typical savings: 50% on the portion of workload that is not user-facing.

  • Batch APIs from OpenAI and Anthropic offer ~50% discount
  • Use for any non-real-time workload (overnight jobs, bulk processing)
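A sketch of the OpenAI Batch API flow for a nightly classification job. Here, documents stands in for your offline workload and the model is illustrative; check the exact parameters and completion windows against current API docs.

```python
import json
from openai import OpenAI

client = OpenAI()

# 1. Write one JSON request per line; custom_id lets you join results back later.
with open("nightly_classify.jsonl", "w") as f:
    for i, doc in enumerate(documents):  # `documents` is your offline workload
        f.write(json.dumps({
            "custom_id": f"doc-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4.1-mini",
                "messages": [{"role": "user", "content": f"Classify: {doc}"}],
                "max_tokens": 16,
            },
        }) + "\n")

# 2. Upload the file and create the batch; results arrive within the completion window.
batch_file = client.files.create(file=open("nightly_classify.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
```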

6. Cap output tokens aggressively

Output tokens are 3–5x more expensive than input tokens in most modern models. Every response schema should have a tight max_tokens that matches expected content. Uncapped completions routinely waste 30–40% on model-generated filler.

  • Output tokens cost 3–5x input tokens
  • Cap max_tokens per response type; enforce in schema
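One way to enforce this is a per-response-type cap table at the call-site wrapper. The response types and numbers below are illustrative, not recommendations; derive yours from the observed length of good responses.

```python
# Per-response-type token caps, enforced wherever max_tokens is set.

MAX_TOKENS_BY_RESPONSE_TYPE = {
    "yes_no":         4,
    "classification": 16,
    "short_answer":   150,
    "summary":        400,
    "draft_email":    600,
}

def max_tokens_for(response_type: str) -> int:
    # Fail closed: unknown response types get a conservative cap, not "unlimited".
    return MAX_TOKENS_BY_RESPONSE_TYPE.get(response_type, 256)
```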

7. Improve retrieval to shrink context

Better retrieval means less context needs to be passed to the LLM. A RAG pipeline that retrieves 8 highly relevant chunks beats one that retrieves 40 mediocre chunks — cheaper and higher quality. Invest in hybrid retrieval and reranking, then reduce top-k to the smallest number that preserves eval scores.

  • Hybrid retrieval + reranking lets you reduce top-k
  • Smaller, more relevant context = lower cost and higher quality
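A sketch of the over-retrieve, rerank, then truncate pattern. The vector_search, keyword_search, and rerank callables are assumptions, passed in so the sketch stays self-contained; use whatever retriever and cross-encoder reranker you already run.

```python
def retrieve_context(query: str, vector_search, keyword_search, rerank, final_k: int = 8) -> list[str]:
    """Hybrid-retrieve a large candidate pool, rerank it, return a small top-k."""
    candidates = vector_search(query, k=50) + keyword_search(query, k=50)

    # Deduplicate while preserving order.
    seen, unique = set(), []
    for chunk in candidates:
        if chunk not in seen:
            seen.add(chunk)
            unique.append(chunk)

    ranked = rerank(query, unique)   # most relevant first
    return ranked[:final_k]          # shrink final_k until eval scores start to dip
```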

8. Use structured output to eliminate reprompts

Free-form outputs that need parsing frequently require a second LLM call to clean up. Structured output (JSON mode, tool schemas) eliminates the need for repair prompts, saving one full call per failure. The rate of invalid structured outputs is typically below 1% on modern models.

  • Structured output eliminates repair prompts
  • Invalid rate <1% on modern models with good schemas
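A sketch using the OpenAI Python SDK's structured-output parse helper with a Pydantic schema. The schema and model are illustrative, and the helper's exact location can vary by SDK version, so treat this as a pattern rather than the canonical call.

```python
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class TicketTriage(BaseModel):
    category: str
    priority: int
    needs_human: bool

# The schema constrains decoding, so the model cannot return un-parseable output,
# which removes the repair call you would otherwise spend on malformed JSON.
completion = client.beta.chat.completions.parse(
    model="gpt-4.1-mini",
    messages=[{"role": "user", "content": "Triage this ticket: 'Checkout page returns 500.'"}],
    response_format=TicketTriage,
)
triage = completion.choices[0].message.parsed  # a TicketTriage instance
```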

9. Dashboard cost per feature, not total

Aggregate spend tells you little. Cost per feature (per endpoint, per user action, per workflow) tells you where to focus. Instrument at the call site. Teams that track cost per feature catch runaway features 2–3 weeks before finance does.

  • Tag every call with feature/endpoint/user-action
  • Dashboard cost per feature, not aggregate total
  • Catches runaway features weeks before finance notices
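A minimal call-site instrumentation sketch. The price table and the in-memory counter are placeholders for your providers' current rates and your real metrics backend.

```python
from collections import defaultdict

# Illustrative per-1K-token prices; replace with your providers' current rates.
PRICE_PER_1K_TOKENS = {
    "gpt-4.1-mini":  {"input": 0.0004, "output": 0.0016},
    "claude-sonnet": {"input": 0.003,  "output": 0.015},
}

cost_by_feature = defaultdict(float)  # stand-in for a metrics/dashboard backend

def record_llm_cost(feature: str, model: str, input_tokens: int, output_tokens: int) -> float:
    """Tag one LLM call with its feature and accumulate its dollar cost."""
    price = PRICE_PER_1K_TOKENS[model]
    cost = (input_tokens / 1000) * price["input"] + (output_tokens / 1000) * price["output"]
    cost_by_feature[feature] += cost
    return cost

# At the call site, after the response comes back:
# record_llm_cost("ticket_triage", "gpt-4.1-mini",
#                 response.usage.prompt_tokens, response.usage.completion_tokens)
```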

10. Optimize dollars per successful user action

The right optimization target is not dollars per token but dollars per successful outcome — per converted lead, per resolved ticket, per closed task. This framing prevents micro-optimization that harms quality and surfaces the real economic levers.

  • Measure dollars per successful outcome, not per token
  • Prevents quality-damaging micro-optimizations
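The metric itself is one line; the real work is attributing spend to the action (tactic 9) and agreeing on what counts as success. A minimal sketch:

```python
def cost_per_successful_action(total_llm_spend: float, successful_actions: int) -> float:
    """e.g. $1,200 of spend across 400 resolved tickets -> $3.00 per resolution."""
    return total_llm_spend / max(successful_actions, 1)
```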

Conclusion

AI cost optimization is not one tactic — it is ten. Each of them is a small, measurable engineering investment. Together they produce 40–70% reductions without harming quality. Instrument per-feature, pick the three highest-leverage tactics for your workload, and reassess every quarter. Cost optimization compounds.

FAQ

Related questions

Specific, numeric answers for founders scoping similar work.

Model routing: sending classification, routing, and short-extraction tasks to smaller, cheaper models (Haiku, GPT-4.1-mini, Gemini Flash) instead of frontier models typically reduces average inference cost by 40–70% with no quality loss on those task types.

Related pillar

Read the full AI Development for Startups: The Complete 2026 Guide

This cluster is a deep-dive section of a larger pillar guide. The pillar covers the full decision landscape.
