Agent Month

Reference architecture: a 6-week LLM cost engagement

Last verified: June 2026 · reference architecture

Starting state

The buyer is in the following situation on day 0. We name it honestly because the engagement can't start from a fantasy baseline.

  • Production AI features running for 6–18 months, $30k–$500k/month on LLM APIs
  • No per-call observability — the bill is one line item on the cloud invoice
  • Engineering team knows AI is "expensive" but can't say which routes, models, or call patterns are driving the spend
  • No evals; prompt changes are blind deploys
  • Cost-cutting attempts have been ad hoc (e.g. "we'll switch to Haiku for some calls"), with no measurement of the effect

Week-by-week plan

The work, phase by phase. Every phase has a clear focus, a clear deliverable, and a clear handoff to the next phase.

  1. Week 1

    Read + instrument

    Deliverables

    • Read-only audit of traffic, prompts, models, and routing (no production writes)
    • Per-call LLM observability wired in (Helicone, LiteLLM + metrics, or Maxim Bifrost)
    • Cost + latency dashboards per route, per model, per user — in the team's existing BI tool where possible
  2. Week 2

    Routing + caching — first cuts

    Deliverables

    • Cost-aware model routing in production (small model for classification, capable model for reasoning, quality gate on the boundary)
    • Prompt + response caching for the highest-volume routes (cache by request fingerprint, with TTL per route)
    • Dropped or batched 1–3 high-cost, low-value routes the audit surfaced
  3. Week 3

    RAG retrieval quality + compression

    Deliverables

    • Re-ranked retrieved chunks (smaller model re-ranks before the capable model reads)
    • Compression / summarization of long retrieved docs before they hit the model
    • Per-route quality gate (LLM-as-judge, sampled online) to prove the cuts aren't degrading quality
  4. Week 4

    Documented before/after + runbook

    Deliverables

    • Documented before/after cost numbers per route, per model, per business KPI
    • Runbook so the team can keep tuning without us (routing policy, cache TTL policy, the 'when to escalate a route to a more capable model' rule)
    • Handoff doc with the dashboards, the alerts, and the 3 things to watch in the next 30 days
  5. Weeks 5–6 (optional)

    Tuning + the next quarter's roadmap

    Deliverables

    • Tuning the routing thresholds based on real traffic
    • Identifying the next 2–3 levers (evals, RAG quality on a second route, prompt compression at scale)
    • A 12-month LLM cost trajectory with the wins, the assumptions, and the headroom
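
The Week 2 deliverables (cost-aware routing with a quality gate, plus caching by request fingerprint with a TTL per route) can be sketched in a few dozen lines. This is a minimal illustration under stated assumptions, not the engagement's actual code: the route names, TTL values, the 0.8 threshold, and the `small`, `capable`, and `gate` callables are all hypothetical stand-ins for the team's own model clients and judge. It also reads "quality gate on the boundary" as a cascade (try the small model, escalate when the gate score is low), which is one possible interpretation.

```python
import hashlib
import json
import time
from typing import Callable, Dict, Optional, Tuple

ModelFn = Callable[[str], str]          # prompt -> response (hypothetical client wrapper)
GateFn = Callable[[str, str], float]    # (prompt, response) -> quality score in [0, 1]

# Hypothetical per-route cache TTLs in seconds; real values would come from the audit.
ROUTE_TTL_SECONDS: Dict[str, float] = {"faq_answer": 3600.0, "ticket_classify": 300.0}


def fingerprint(route: str, prompt: str) -> str:
    """Stable hash of the fields that determine the response."""
    payload = json.dumps({"route": route, "prompt": prompt}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class TTLCache:
    """In-memory response cache keyed by request fingerprint."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._store: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired: evict and report a miss
            return None
        return value

    def put(self, key: str, value: str, ttl: float) -> None:
        self._store[key] = (self._clock() + ttl, value)


def answer(route: str, prompt: str, small: ModelFn, capable: ModelFn,
           gate: GateFn, cache: TTLCache, threshold: float = 0.8) -> Tuple[str, str]:
    """Return (response, source), where source is 'cache', 'small', or 'capable'."""
    key = fingerprint(route, prompt)
    cached = cache.get(key)
    if cached is not None:
        return cached, "cache"
    # Try the cheap model first; escalate only if the gate rejects its output.
    response, source = small(prompt), "small"
    if gate(prompt, response) < threshold:
        response, source = capable(prompt), "capable"
    ttl = ROUTE_TTL_SECONDS.get(route, 0.0)
    if ttl > 0:  # routes without a configured TTL are never cached
        cache.put(key, response, ttl)
    return response, source
```

Returning the source alongside the response is a deliberate choice in this sketch: it lets the dashboards count cache hits and escalations per route, which is the signal needed later when tuning thresholds against real traffic.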

End state

What the buyer has when the engagement ends. Quantified where we can quantify it; named where we can name it.

  • 30–60% LLM cost reduction, documented, with before/after numbers
  • Cost-aware routing in production on the team's stack (no vendor lock-in)
  • Live cost + latency observability, in the team's existing dashboards
  • A quality gate on the routing boundary, so the team can move routing policy without breaking output
  • A 12-month LLM cost forecast with the wins and the headroom
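
The documented before/after numbers come down to simple arithmetic over per-call records. The sketch below shows one way to roll calls up into cost per route and compute a reduction percentage. The `CallRecord` fields, the model names, and the prices are assumptions made for illustration; they are not real vendor prices, and the sample traffic is invented.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple

# Hypothetical prices in USD per million tokens as (input, output).
PRICE_PER_MTOK: Dict[str, Tuple[float, float]] = {
    "small-model": (1.0, 5.0),
    "capable-model": (3.0, 15.0),
}


@dataclass(frozen=True)
class CallRecord:
    """One logged LLM call, as a per-call observability layer might emit it."""
    route: str
    model: str
    input_tokens: int
    output_tokens: int


def call_cost(rec: CallRecord) -> float:
    """Cost of a single call in USD."""
    in_price, out_price = PRICE_PER_MTOK[rec.model]
    return (rec.input_tokens * in_price + rec.output_tokens * out_price) / 1_000_000


def cost_by_route(records: Iterable[CallRecord]) -> Dict[str, float]:
    """Sum call costs per route."""
    totals: Dict[str, float] = defaultdict(float)
    for rec in records:
        totals[rec.route] += call_cost(rec)
    return dict(totals)


def reduction_pct(before: float, after: float) -> float:
    """Percentage cost reduction from before to after."""
    if before <= 0:
        raise ValueError("before cost must be positive")
    return 100.0 * (before - after) / before


# Invented sample: the same route before and after part of its traffic
# moves to the smaller model.
before = [CallRecord("summarize", "capable-model", 1_000_000, 200_000)]
after = [
    CallRecord("summarize", "small-model", 600_000, 120_000),
    CallRecord("summarize", "capable-model", 400_000, 80_000),
]
before_cost = cost_by_route(before)["summarize"]
after_cost = cost_by_route(after)["summarize"]
print(f"summarize: {reduction_pct(before_cost, after_cost):.1f}% reduction")
```

Grouping by `rec.model` or by a user field instead of `rec.route` gives the per-model and per-user views with the same function shape.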

What the buyer owns

Everything. The code is in the team's repo. The dashboards are in the team's stack. The runbook is in the team's wiki. The credentials are in the team's secret store. We do not operate managed services and we do not retain access after handoff. The point of the engagement is to leave the team running the system themselves, well enough that they can hire in-house and we can transition out cleanly.

  • All routing and caching code, in the team's repo, in the team's preferred language
  • All dashboards and alerts, wired into the team's existing observability stack
  • The runbook and the handoff doc
  • The router and the cache layer, operated by the team