Reference architecture: a 6-week LLM cost engagement

Last verified: June 2026 · reference architecture

Starting state

This is the buyer's situation on day 0. We name it honestly because an engagement can't start from a fantasy.

- Production AI features running for 6–18 months, $30k–$500k/month on LLM APIs
- No per-call observability: the bill is one line item on the cloud invoice
- Engineering team knows AI is "expensive" but can't say which routes, models, or call patterns are driving the spend
- No evals; prompt changes are blind deploys
- Cost-cutting attempts have been ad hoc (e.g. "we'll switch to Haiku for some calls") with no measurement

Week-by-week plan

The work, phase by phase. Every phase has a clear focus, a clear deliverable, and a clear handoff to the next phase.

Week 1
Read + instrument
Deliverables
- Read-only audit of traffic, prompts, models, and routing (no production writes)
- Per-call LLM observability wired in (Helicone, LiteLLM + metrics, or Maxim Bifrost)
- Cost + latency dashboards per route, per model, per user — in the team's existing BI tool where possible
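As a rough illustration of what "per-call observability" feeds into the dashboards, the sketch below turns per-call token counts into cost and rolls it up by route, model, or user. The `CallRecord` fields, the model names, and the prices in `PRICES_PER_MTOK` are all assumptions made for the example, not values from any provider or from the tools named above.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical prices in USD per million tokens. Real prices come from the
# provider's price sheet and change over time.
PRICES_PER_MTOK = {
    "small-model": {"input": 1.00, "output": 5.00},
    "capable-model": {"input": 3.00, "output": 15.00},
}


@dataclass(frozen=True)
class CallRecord:
    """One LLM call as an observability layer might log it (assumed shape)."""
    route: str
    model: str
    user_id: str
    input_tokens: int
    output_tokens: int
    latency_ms: float


def call_cost_usd(rec: CallRecord) -> float:
    """Cost of a single call from its token counts and the price table."""
    price = PRICES_PER_MTOK[rec.model]
    tokens_cost = rec.input_tokens * price["input"] + rec.output_tokens * price["output"]
    return tokens_cost / 1_000_000


def cost_by(records, key: str) -> dict:
    """Total cost grouped by one CallRecord field: 'route', 'model', or 'user_id'."""
    totals = defaultdict(float)
    for rec in records:
        totals[getattr(rec, key)] += call_cost_usd(rec)
    return dict(totals)


# Illustrative records only; token counts are chosen to make the sums easy to check.
records = [
    CallRecord("/summarize", "capable-model", "u1", 2_000_000, 200_000, 900.0),
    CallRecord("/classify", "small-model", "u2", 1_000_000, 0, 120.0),
]

print(cost_by(records, "route"))  # {'/summarize': 9.0, '/classify': 1.0}
```

The same records can be grouped by `model` or `user_id`, which is the breakdown the single invoice line item cannot provide.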
Week 2
Routing + caching — first cuts
Deliverables
- Cost-aware model routing in production (small model for classification, capable model for reasoning, quality gate on the boundary)
- Prompt + response caching for the highest-volume routes (cache by request fingerprint, with TTL per route)
- Dropped or batched 1–3 high-cost, low-value routes the audit surfaced
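A minimal sketch of the "cache by request fingerprint, with TTL per route" idea from the list above. The route names, TTL values, and the `call_model` callable are hypothetical; a production cache would more likely sit in a shared store than in an in-process dict.

```python
import hashlib
import json
import time

# Hypothetical per-route TTLs in seconds. A route absent here is never cached.
ROUTE_TTL_SECONDS = {"/classify": 3600, "/faq": 86400}


def fingerprint(route, model, prompt, params) -> str:
    """Stable hash over everything that can change the response."""
    payload = json.dumps(
        {"route": route, "model": model, "prompt": prompt, "params": params},
        sort_keys=True,  # key order must not change the fingerprint
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class TTLCache:
    """In-memory cache whose entries expire after a per-entry TTL."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._store = {}  # fingerprint -> (expires_at, response)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, response = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired: drop it and report a miss
            return None
        return response

    def put(self, key, response, ttl_seconds):
        self._store[key] = (self._clock() + ttl_seconds, response)


def cached_call(cache, route, model, prompt, params, call_model):
    """Serve from cache when the route is cacheable, else call the model."""
    ttl = ROUTE_TTL_SECONDS.get(route)
    if ttl is None:
        return call_model(model, prompt, params)
    key = fingerprint(route, model, prompt, params)
    hit = cache.get(key)
    if hit is not None:
        return hit
    response = call_model(model, prompt, params)
    cache.put(key, response, ttl)
    return response
```

Including the model and sampling parameters in the fingerprint keeps a routing change from serving a response produced by a different model. Whether a route is safe to cache at all is a per-route judgment the audit would inform.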
Week 3
RAG retrieval quality + compression
Deliverables
- Re-ranked retrieved chunks (smaller model re-ranks before the capable model reads)
- Compression / summarization of long retrieved docs before they hit the model
- Per-route quality gate (LLM-as-judge, sampled online) to prove the cuts aren't degrading quality
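One way the sampled online quality gate could be wired is sketched below. It assumes some judge (an LLM-as-judge call, not shown) returns a pass/fail verdict for each sampled response; the class name, window size, and thresholds are illustrative choices rather than recommended values.

```python
import hashlib
from collections import deque


def is_sampled(request_id: str, sample_rate: float) -> bool:
    """Deterministically pick a fixed fraction of requests by hashing the ID.

    The same request ID always gets the same decision, so retries do not
    skew the sample.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < int(sample_rate * 2**64)


class QualityGate:
    """Rolling pass rate over the most recent judged samples for one route."""

    def __init__(self, min_pass_rate: float, window: int = 200, min_samples: int = 20):
        self.min_pass_rate = min_pass_rate
        self.min_samples = min_samples
        self._results = deque(maxlen=window)  # oldest verdicts fall off

    def record(self, passed: bool) -> None:
        """Store one judge verdict for a sampled response."""
        self._results.append(bool(passed))

    def pass_rate(self):
        """Fraction of recent verdicts that passed, or None with no data."""
        if not self._results:
            return None
        return sum(self._results) / len(self._results)

    def is_healthy(self) -> bool:
        """True until there is enough evidence that quality has dropped."""
        if len(self._results) < self.min_samples:
            return True
        return self.pass_rate() >= self.min_pass_rate
```

A route whose gate reports unhealthy is a candidate for rolling back the most recent cut (re-ranking, compression, or a routing change) on that route.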
Week 4
Documented before/after + runbook
Deliverables
- Documented before/after cost numbers per route, per model, per business KPI
- Runbook so the team can keep tuning without us (routing policy, cache TTL policy, the 'when to escalate a route to a more capable model' rule)
- Handoff doc with the dashboards, the alerts, and the 3 things to watch in the next 30 days
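Before/after numbers are easier to defend when they are normalized by volume or by a business KPI, since traffic rarely stays flat across the engagement. The arithmetic can be sketched as follows; the dollar amounts and call counts are made-up illustrations, not results.

```python
def unit_cost(total_cost_usd: float, units: int) -> float:
    """Cost per unit of work, e.g. per call or per resolved ticket."""
    if units <= 0:
        raise ValueError("units must be positive")
    return total_cost_usd / units


def percent_reduction(before: float, after: float) -> float:
    """Reduction from before to after, as a percentage of before."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100.0


# Illustrative numbers for one route (not measured data).
before_per_call = unit_cost(12_000.0, 400_000)  # $0.03 per call
after_per_call = unit_cost(6_000.0, 500_000)    # $0.012 per call

raw = percent_reduction(12_000.0, 6_000.0)
normalized = percent_reduction(before_per_call, after_per_call)
print(round(raw, 1), round(normalized, 1))  # 50.0 60.0
```

In this made-up case the raw bill fell by half while traffic grew, so the per-call reduction is larger than the invoice alone suggests. The opposite can also happen when traffic drops.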
Weeks 5–6 (optional)
Tuning + the next quarter's roadmap
Deliverables
- Tuning the routing thresholds based on real traffic
- Identifying the next 2–3 levers (evals, RAG quality on a second route, prompt compression at scale)
- A 12-month LLM cost trajectory with the wins, the assumptions, and the headroom
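Tuning a routing threshold from real traffic might look like the sketch below. It assumes the small model yields a confidence score per request and that a sample of its answers has been judged pass/fail; both the score and the `tune_threshold` heuristic are assumptions for illustration, not the engagement's actual policy.

```python
def choose_model(small_confidence: float, threshold: float) -> str:
    """Keep the small model's answer when confident enough, else escalate."""
    return "small-model" if small_confidence >= threshold else "capable-model"


def tune_threshold(samples, target_pass_rate: float, candidates):
    """Return the lowest candidate threshold whose kept answers meet the target.

    samples: (small_confidence, passed_judge) pairs drawn from real traffic.
    A lower threshold keeps more traffic on the small model, so the lowest
    passing candidate is the cheapest one that still meets the quality bar.
    Returns None when no candidate qualifies, meaning always escalate.
    """
    for threshold in sorted(candidates):
        kept = [passed for conf, passed in samples if conf >= threshold]
        if kept and sum(kept) / len(kept) >= target_pass_rate:
            return threshold
    return None


# Illustrative judged samples (not measured data).
samples = [
    (0.95, True), (0.90, True), (0.85, True),
    (0.80, True), (0.70, False), (0.60, False),
]
threshold = tune_threshold(samples, target_pass_rate=0.95, candidates=[0.5, 0.65, 0.75, 0.9])
print(threshold)                     # 0.75
print(choose_model(0.8, threshold))  # small-model
print(choose_model(0.7, threshold))  # capable-model
```

Pass rate is not guaranteed to rise smoothly with the threshold on small samples, so a result like this is a starting point to confirm against the quality gate rather than a final setting.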

End state

What the buyer has when the engagement ends. Quantified where we can quantify it; named where we can name it.

- 30–60% LLM cost reduction, documented with before/after numbers
- Cost-aware routing in production on the team's stack (no vendor lock-in)
- Live cost + latency observability, in the team's existing dashboards
- A quality gate on the routing boundary, so the team can change routing policy without breaking output
- A 12-month LLM cost forecast with the wins and the headroom

What the buyer owns

Everything. The code is in the team's repo. The dashboards are in the team's stack. The runbook is in the team's wiki. The credentials are in the team's secret store. We do not operate managed services and we do not retain access after handoff. The point of the engagement is to leave the team running the system themselves, well enough to hire in-house and transition out cleanly.

- All routing and caching code, in the team's repo, in the team's preferred language
- All dashboards and alerts, wired into the team's existing observability stack
- The runbook and the handoff doc
- The router and the cache layer, operated by the team
