LLM cost optimization playbook: 7 levers, 30–60% savings

Last verified: June 2026· playbook

Read this first: the bill is almost always 3–10x what it should be

Before any of the levers, do the read. Most teams running production AI are paying 3–10x what they should, and they don't know it — not because they made bad choices, but because the bill grew faster than anyone's ability to see inside it.

The standard instrument-and-audit looks like this:

Stand up a per-call LLM observability layer (Helicone, LiteLLM + a metrics layer, Maxim Bifrost, or open-source fast-litellm).
Per-route, per-model, per-user: tokens, latency, cost, and (if you have it) per-business-KPI.
Sort routes by cost. The top 3 are usually 60–80% of the bill.
For each top route, ask: could a smaller model do this with a quality gate?

Once you can see it, the rest is the 7 levers below.

Lever 1: Model routing (40–70% of the wins)

Model routing is the single biggest lever — and the one most teams under-use.

The pattern: a small model (Haiku, Flash, GPT-4o-mini, DeepSeek-V3) for high-volume, low-stakes calls; a capable model (Sonnet, Opus, GPT-4o, Claude Fable) for the hard reasoning. A quality gate on the boundary decides which calls escalate.

Common splits:

Classification, extraction, simple Q&A → small model (often 90% of call volume)
Summarization, rephrasing, formatting → small model with structured outputs
Multi-step reasoning, planning, code generation → capable model (often 10% of call volume but 60% of cost)

Typical savings: 40–70% of the bill. The quality gate is the only tricky part: LLM-as-judge on a small held-out set, sampled online, gives you confidence that the small model isn't silently degrading quality.

Lever 2: Prompt caching of stable prefixes

Prompt caching is the second-biggest lever — and the one most teams haven't turned on.

Most LLM APIs (Anthropic, OpenAI, Google, DeepSeek) support caching of the prompt prefix. The rule: anything in the prompt that is stable across requests should be cached. Common cached prefixes:

System prompt (often 500–2,000 tokens of instructions + tools)
Retrieved documents / RAG context (often 2,000–20,000 tokens; the same doc is sent to many requests)
Tool definitions (every tool description, every parameter schema)
Examples / few-shot prompts (stable across calls)

Typical savings: 30–80% of input-token cost, on cacheable routes. The cost is cache-control plumbing: which part of the prompt is the stable prefix, and which part is the request-specific part.

Lever 3: Response caching for deterministic routes

For deterministic routes, the entire response can be cached.

Safe to fully cache:

Classification (label the request, cache the label)
Extraction (same input → same output, unless the model is non-deterministic)
FAQ / RAG over a known doc set (cache by doc-set version + query)

Not safe to cache (or cache only the prefix):

Code generation (output is highly variable)
Multi-step reasoning (the path matters)
Any route that depends on real-time data

Typical savings: 20–50% of call volume on safe routes — often 5–15% of the bill.

Lever 4: Request batching through the batch API

Most providers offer a batch API at 50% the synchronous price. The trade-off is latency: batches typically complete in minutes to hours, not seconds.

Routes that fit batching:

Nightly evaluation runs
Bulk document processing (RAG ingestion, batch summarization)
Reports that don't need real-time

Routes that don't fit:

Anything user-facing in a chat flow
Anything that gates a downstream action in <5 seconds

Typical savings: 50% on the routes that fit, often 10–25% of the bill once you include the volume.

Lever 5: Prompt compression and context trimming

The cheapest prompts to cut are the ones nobody reads.

Common bloat:

Retrieved documents that don't change the answer. Measure: drop one retrieved doc at a time, measure the answer quality, keep the docs that move the needle.
Examples that don't generalize. 3 examples > 20 examples if the 3 are sharp.
Conversation history that doesn't matter. Summarize or drop turns after the 5th; the answer rarely depends on the full thread.
Output cap that's too high. Most teams set max_tokens to 4,000 by default. Most routes don't need more than 500–1,000.

Typical savings: 10–30% of input-token cost. Hard to find without an eval that proves the compression isn't degrading quality.

Lever 6: RAG retrieval quality (kill the over-fetch)

The most expensive lever, and the most often missed.

Most RAG systems over-fetch. They send 20 chunks of context to the model, when 3 would have answered the question. The cost is:

The 20 chunks of input token cost
The latency (longer prompts = slower responses)
The quality (too much context confuses the model)

The fix: re-rank with a smaller model, retrieve fewer high-quality chunks, and use compression / summarization on the retrieved docs before they reach the model.

Typical savings: 30–60% on RAG-heavy routes.

Lever 7: Drop high-cost, low-value routes

The hardest lever to pull — and often the biggest absolute saving.

Look at your top 3 routes by cost. For each:

Is the use case defensible? Could you defend it in front of a board? Or is it a feature nobody uses that someone added in 2024?
Is the cost per business KPI reasonable? If the route costs $0.50 per call and drives $0.10 of revenue, the unit economics are broken. Either raise the price, lower the cost, or kill it.
Is it a demo that became production? A demo that became production is a feature that was never reviewed. Review it now.

Most teams find at least one route that is driving 30% of the bill and shouldn't exist. Killing it is the cleanest 30% reduction you'll ever get.

You need cost observability or none of this sticks

None of this sticks without cost observability.

The minimum viable cost observability:

Per-call observability: tokens, latency, cost, model, route, user.
Per-business-KPI rollup: cost per resolution, cost per conversion, cost per user per day.
Alerting on cost anomalies: per-route spend > 2x baseline = page someone.
Cost in the dashboard next to latency: not in a separate spreadsheet, not in a finance team's monthly close.

Without this, the wins from week 1 creep back by month 3 — because nobody can see them creeping.

Do this yourself vs hire us

When to do this yourself, when to hire:

Do this yourself if…

You have a senior engineer with LLM cost experience in-house
Your LLM bill is under $20k/month
You have 6+ weeks before the wins need to be visible
You already have per-call observability in place

Hire us if…

Your LLM bill is $50k+/month and you have no senior LLM engineer on staff
You want outcome pricing — a number on the line, not a deck
You want the wins in the next 4–6 weeks, not the next 4–6 months
You don't yet have per-call observability, and you want it as part of the engagement
You want the cost observability wired into your existing dashboards, not a separate tool

Frequently asked questions

What is the realistic LLM cost reduction?

A focused 4–6 week engagement consistently finds 30–60% on production traffic. The ceiling depends on the workload: RAG-heavy apps often hit 60–80%; non-RAG chat workloads hit 20–40%.

How long does it take to see results?

Week 1 is the read + instrument. Week 2 is the first cuts (routing, caching). Week 3–4 is the bigger wins. By week 4 you have a documented before/after number.

Do I need to switch LLM providers to get the savings?

Almost never. The wins are in routing, caching, and observability — not in the provider. Migration cost dwarfs the savings on the same workload.

What is the difference between LLM cost optimization and FinOps?

FinOps is the broader cloud-spend discipline. LLM cost optimization is the AI-specific slice: model routing, prompt caching, response caching, RAG retrieval quality. Most FinOps teams don't have the AI-specific expertise.