The CTO's first 90 days with an AI mandate

Last verified: June 2026· playbook

Day 0: what you actually have

You just got an "add AI to the product" mandate from the board. The CEO is excited. The board wants dates. Your team is uncertain. You have 12 weeks before "transformation" turns into "the thing we tried last year".

Here is what you actually have on day 0:

AI features already in production. The company is using LLMs somewhere — probably in the product, probably without evals, probably without cost observability. Find them this week.
Engineers already using AI tools. Claude Code, Cursor, Copilot are already in your org, even if you haven't approved them. Survey your team this week.
Cloud bill with one "AI" line item. Nobody can say which prompt, model, or call pattern is driving the spend. That's normal. Fix it in week 2.
No platform team. Most CTOs in this position don't have a senior AI platform hire. You won't hire one in 12 weeks. You can rent one for the same period for the cost of one senior FTE-quarter.

Weeks 1–2: instrument the bill

The first move is not a strategy doc. It is instrumentation.

Stand up a per-call LLM observability layer in week 1–2. You need to know:

Which model, which route, which user is driving the bill
Per-call latency and error rate, by route
Cost per business KPI (per user, per query, per conversion) — not just per token

Most teams use one of: Helicone (drop-in OpenAI proxy), LiteLLM + a metrics layer, or a managed product like Maxim Bifrost or TrueFoundry. Open-source: fast-litellm.

End of week 2 you should be able to answer: where is the money going, and which 3 calls are 80% of the bill?

Weeks 3–4: cut LLM cost 30–60%

Now you can see it. The next 2 weeks are the easiest cost win most teams ever get.

The standard playbook — usually 30–60% reduction in week 4:

Model routing. A small model (Haiku, Flash, GPT-4o-mini) for classification, extraction, simple Q&A. A capable model (Sonnet, Opus, GPT-4o) for hard reasoning. Quality gate on the boundary.
Prompt caching. Cache stable prefixes — system prompts, retrieved documents, RAG context. Often the single biggest lever in a RAG-heavy app.
Response caching. Cache deterministic outputs by request fingerprint. Safe for most extraction and classification routes.
Request batching. Batch non-urgent work through the provider's batch API at half the cost.
Drop high-cost, low-value routes. If a route is driving 30% of the bill and you can't defend it, kill it.

Target: 30–60% reduction in week 4, with before/after numbers you can put in the next board update.

Weeks 5–6: add evals

You shipped AI features with zero evals. Now every prompt change is a blind deploy. The cost of a bad output only shows up after it reaches a customer.

Add an eval layer in week 5–6. The minimum viable version:

20–50 representative production prompts for your top 3 routes, paired with expected behavior (exact match, rubric, or LLM-as-judge).
Run them in CI on every prompt change, model swap, or retrieval tweak. Fail the build when quality regresses past a threshold.
Sample real production traffic for online evals so silent quality drift doesn't ship.

Pair with structured outputs (schema-enforced JSON) wherever you consume model output programmatically. Most teams that ship evals discover their error rate was 3–10x what they assumed.

Weeks 7–8: standardize AI coding

Your team is already using Claude Code, Cursor, and Copilot — in a free-for-all. Juniors are 2x faster, seniors say quality is slipping, and the gains aren't compounding. Fix it in week 7–8.

The standardized workflow:

Project-level rules files. One .claude/commands/ + one .cursor/rules/ baseline in every repo, in version control. The team's priors, captured.
Shared slash commands and agent definitions. The "add tests" command, the "review this PR" agent, the "explain this module" agent — for every workflow every engineer runs more than once a week.
MCP access to the top 3 internal systems. Postgres (read-only), Linear, your CI logs. Read-only by default; the agent can answer its own questions.
Review hooks. AI-aware SAST, secret scanning, format, and a "this PR was opened by an agent, do an extra review" label.

By end of week 8, the team should have a workflow that beats the free-for-all on quality and on speed.

Weeks 9–10: ship the first agent-bearing feature

The first agent-bearing feature. Pick a feature that:

Has a measurable KPI. Cost per resolution, time to first response, conversion rate — something the PM owns.
Is bounded. The agent's authority ends at "draft" or "suggest" or "route to human". Not "ship".
Has a human in the loop for the hard cases. Confidence threshold or escalation rule, not a vibe.
Is behind a feature flag. Roll out 5% → 25% → 50% → 100%, with quality + cost telemetry on the boundary.

Ship it. This is the proof of the platform — a real feature, with real users, that the team can point at when the board asks "where is the AI".

Weeks 11–12: review and sequence next quarter

End of week 12, you have:

A documented 30–60% LLM cost reduction, in production, with the dashboard that keeps it from creeping back
An eval suite in CI that gates every model + prompt change
A standardized AI-coding workflow that the team actually uses
One agent-bearing feature in production behind a flag, with real usage data

What you sequence for next quarter depends on what you have:

More agent features. The platform can support them. The 2nd one is faster than the 1st.
MCP for the rest of the internal stack. Datadog, Snowflake, deploy, internal APIs. The team is asking for these now.
Regulated-data path. If healthcare, finance, or EU is in scope, you need self-hosted inference in the next 6 months. Stand it up in parallel.
AI platform hire. Now you can write a real job spec. The platform is the requirement.

What NOT to do

Common moves that kill the 90 days. Refuse them.

"AI transformation roadmap" before shipping anything. Roadmaps are what you write after the first win, not before. A roadmap with no live numbers is a slide deck.
Big-bang migration to a new LLM provider. Migration cost dwarfs the savings. Route to the right model per call instead.
"We'll build an internal AI team before we ship anything." The internal team is the next step, not the first. Rent capacity for 12 weeks, ship, then hire.
"AI agents will replace [X job title]." AI agents change how work is done, not the org chart. The board doesn't need a reorg; the board needs a number.
"Let me see a 5-year AI strategy." You don't have a 5-year AI strategy. You have a 12-week plan. The plan is the strategy.

The parallel track: regulated data

If your data is in scope for a regulated buyer (healthcare, finance, defense, EU residency), you have a parallel track:

Stand up self-hosted inference in parallel. vLLM, TensorRT-LLM, or SGLang on your own cloud. An 8–16 week engagement, but the dependency only blocks the regulated workload — the rest of the platform ships on hosted APIs.
Document the data-flow boundary. The audit needs a diagram of what data leaves the building, when, and why. Stand this up in week 4 — it informs everything else.
Pick a workload that demonstrates the path. One internal-tool LLM (RAG over internal docs, a developer-tooling assistant) on self-hosted. Not a customer-facing one yet.

Most CTOs in regulated industries deprioritize this. Most CTOs who don't, hit a wall in month 4. Do the parallel track.

Do this yourself vs hire us

When to do this yourself, when to hire:

Do this yourself if…

You already have a senior engineer with LLM cost + eval experience in-house
Your LLM bill is under $20k/month — the savings are real but small in absolute terms
You have 8+ weeks before the mandate becomes urgent
You have executive patience to wait for the wins to compound

Hire us if…

The mandate is urgent (board, CEO, competitive pressure)
Your LLM bill is $50k+/month and you have no senior LLM engineer on staff
You don't have a senior engineer who has shipped evals in CI before
You want outcome pricing — a number on the line, not a deck
You need the wins in the next 4–6 weeks, not the next 4–6 months

Frequently asked questions

What is the single most important first move?

Instrument the bill. You can't cut LLM cost, evaluate AI features, or make a credible mandate plan until you can see what you actually spend on. Week 1–2 is observability; week 3+ is everything else.

How do I get buy-in for the 12-week plan?

Show the board the dollar number you can defend in week 4 (30–60% of the current bill), the time-to-ship for the first agent feature (week 9–10), and the cost of renting the capacity to do it (one senior FTE-quarter, outcome-priced). That is the plan.

Do I need a senior AI platform hire in the first 90 days?

No. You need senior AI engineering capacity for 12 weeks. Hire internally after the platform is standing and you can write a real job spec. Most CTOs who hire first ship nothing for 6 months.

What if my board wants a 'big vision' first?

Give them a 12-week plan that ladders to a 12-month outcome. The plan IS the vision. A vision without a plan is a speech.

How do I sequence the regulated-data work?

Parallel track, not blocking. Stand up self-hosted inference in week 8–16, after the cost + eval + workflow wins are in. The regulated workload is a customer-facing one that ships 2 quarters later, on the same platform.