LLM cost optimization playbook: 7 levers, 30–60% savings
Last verified: June 2026· playbook
Read this first: the bill is almost always 3–10x what it should be
Before any of the levers, do the read. Most teams running production AI are paying 3–10x what they should, and they don't know it — not because they made bad choices, but because the bill grew faster than anyone's ability to see inside it.
The standard instrument-and-audit looks like this:
- Stand up a per-call LLM observability layer (Helicone, LiteLLM + a metrics layer, Maxim Bifrost, or open-source fast-litellm).
- Per-route, per-model, per-user: tokens, latency, cost, and (if you have it) per-business-KPI.
- Sort routes by cost. The top 3 are usually 60–80% of the bill.
- For each top route, ask: could a smaller model do this with a quality gate?
Once you can see it, the rest is the 7 levers below.
Lever 1: Model routing (40–70% of the wins)
Model routing is the single biggest lever — and the one most teams under-use.
The pattern: a small model (Haiku, Flash, GPT-4o-mini, DeepSeek-V3) for high-volume, low-stakes calls; a capable model (Sonnet, Opus, GPT-4o, Claude Fable) for the hard reasoning. A quality gate on the boundary decides which calls escalate.
Common splits:
- Classification, extraction, simple Q&A → small model (often 90% of call volume)
- Summarization, rephrasing, formatting → small model with structured outputs
- Multi-step reasoning, planning, code generation → capable model (often 10% of call volume but 60% of cost)
Typical savings: 40–70% of the bill. The quality gate is the only tricky part: LLM-as-judge on a small held-out set, sampled online, gives you confidence that the small model isn't silently degrading quality.
Lever 2: Prompt caching of stable prefixes
Prompt caching is the second-biggest lever — and the one most teams haven't turned on.
Most LLM APIs (Anthropic, OpenAI, Google, DeepSeek) support caching of the prompt prefix. The rule: anything in the prompt that is stable across requests should be cached. Common cached prefixes:
- System prompt (often 500–2,000 tokens of instructions + tools)
- Retrieved documents / RAG context (often 2,000–20,000 tokens; the same doc is sent to many requests)
- Tool definitions (every tool description, every parameter schema)
- Examples / few-shot prompts (stable across calls)
Typical savings: 30–80% of input-token cost, on cacheable routes. The cost is cache-control plumbing: which part of the prompt is the stable prefix, and which part is the request-specific part.
Lever 3: Response caching for deterministic routes
For deterministic routes, the entire response can be cached.
Safe to fully cache:
- Classification (label the request, cache the label)
- Extraction (same input → same output, unless the model is non-deterministic)
- FAQ / RAG over a known doc set (cache by doc-set version + query)
Not safe to cache (or cache only the prefix):
- Code generation (output is highly variable)
- Multi-step reasoning (the path matters)
- Any route that depends on real-time data
Typical savings: 20–50% of call volume on safe routes — often 5–15% of the bill.
Lever 4: Request batching through the batch API
Most providers offer a batch API at 50% the synchronous price. The trade-off is latency: batches typically complete in minutes to hours, not seconds.
Routes that fit batching:
- Nightly evaluation runs
- Bulk document processing (RAG ingestion, batch summarization)
- Reports that don't need real-time
Routes that don't fit:
- Anything user-facing in a chat flow
- Anything that gates a downstream action in <5 seconds
Typical savings: 50% on the routes that fit, often 10–25% of the bill once you include the volume.
Lever 5: Prompt compression and context trimming
The cheapest prompts to cut are the ones nobody reads.
Common bloat:
- Retrieved documents that don't change the answer. Measure: drop one retrieved doc at a time, measure the answer quality, keep the docs that move the needle.
- Examples that don't generalize. 3 examples > 20 examples if the 3 are sharp.
- Conversation history that doesn't matter. Summarize or drop turns after the 5th; the answer rarely depends on the full thread.
- Output cap that's too high. Most teams set
max_tokensto 4,000 by default. Most routes don't need more than 500–1,000.
Typical savings: 10–30% of input-token cost. Hard to find without an eval that proves the compression isn't degrading quality.
Lever 6: RAG retrieval quality (kill the over-fetch)
The most expensive lever, and the most often missed.
Most RAG systems over-fetch. They send 20 chunks of context to the model, when 3 would have answered the question. The cost is:
- The 20 chunks of input token cost
- The latency (longer prompts = slower responses)
- The quality (too much context confuses the model)
The fix: re-rank with a smaller model, retrieve fewer high-quality chunks, and use compression / summarization on the retrieved docs before they reach the model.
Typical savings: 30–60% on RAG-heavy routes.
Lever 7: Drop high-cost, low-value routes
The hardest lever to pull — and often the biggest absolute saving.
Look at your top 3 routes by cost. For each:
- Is the use case defensible? Could you defend it in front of a board? Or is it a feature nobody uses that someone added in 2024?
- Is the cost per business KPI reasonable? If the route costs $0.50 per call and drives $0.10 of revenue, the unit economics are broken. Either raise the price, lower the cost, or kill it.
- Is it a demo that became production? A demo that became production is a feature that was never reviewed. Review it now.
Most teams find at least one route that is driving 30% of the bill and shouldn't exist. Killing it is the cleanest 30% reduction you'll ever get.
You need cost observability or none of this sticks
None of this sticks without cost observability.
The minimum viable cost observability:
- Per-call observability: tokens, latency, cost, model, route, user.
- Per-business-KPI rollup: cost per resolution, cost per conversion, cost per user per day.
- Alerting on cost anomalies: per-route spend > 2x baseline = page someone.
- Cost in the dashboard next to latency: not in a separate spreadsheet, not in a finance team's monthly close.
Without this, the wins from week 1 creep back by month 3 — because nobody can see them creeping.
Do this yourself vs hire us
When to do this yourself, when to hire:
Do this yourself if…
- You have a senior engineer with LLM cost experience in-house
- Your LLM bill is under $20k/month
- You have 6+ weeks before the wins need to be visible
- You already have per-call observability in place
Hire us if…
- Your LLM bill is $50k+/month and you have no senior LLM engineer on staff
- You want outcome pricing — a number on the line, not a deck
- You want the wins in the next 4–6 weeks, not the next 4–6 months
- You don't yet have per-call observability, and you want it as part of the engagement
- You want the cost observability wired into your existing dashboards, not a separate tool
Frequently asked questions
What is the realistic LLM cost reduction?
A focused 4–6 week engagement consistently finds 30–60% on production traffic. The ceiling depends on the workload: RAG-heavy apps often hit 60–80%; non-RAG chat workloads hit 20–40%.
How long does it take to see results?
Week 1 is the read + instrument. Week 2 is the first cuts (routing, caching). Week 3–4 is the bigger wins. By week 4 you have a documented before/after number.
Do I need to switch LLM providers to get the savings?
Almost never. The wins are in routing, caching, and observability — not in the provider. Migration cost dwarfs the savings on the same workload.
What is the difference between LLM cost optimization and FinOps?
FinOps is the broader cloud-spend discipline. LLM cost optimization is the AI-specific slice: model routing, prompt caching, response caching, RAG retrieval quality. Most FinOps teams don't have the AI-specific expertise.