March 6, 2026
By Andrew Day

LLM Cost Optimization Playbook

Prioritize the engineering tactics that lower AI spend fastest: prompt compression, caching, smaller models, batching, and retrieval optimization, ranked by savings versus effort.


Use this when you know there is waste but are unsure what to change first.

The fast answer: rank tactics by savings potential and effort, then pick two for the next sprint. Start with prompt compression and model tier before touching caching or batching.

What you will get in 12 minutes

  • A prioritization framework (savings vs effort)
  • A ranked list of six tactics with tradeoffs and worked examples
  • A sprint sequencing guide for the first three optimization cycles
  • A simple way to measure whether a change worked

Before you start: establish a baseline

Optimization without measurement is guesswork. Before you change anything, establish a baseline for the metric you are trying to move.

The most useful baseline metrics are:

  • Cost per request for a specific workflow — total inference cost divided by request count for the same period
  • Cost per feature — inference cost attributable to one product surface
  • Input and output token averages — average token counts per request, separated into input and output

You do not need all three. Pick the one that is easiest to track for your most expensive workflow and record the value for the most recent 7-day period. That becomes your before number. After the change ships, compare the same metric for the following 7 days.

Without this baseline, you will make changes and have no way to confirm they helped. With it, you can prove the improvement and explain it to the team.
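As a concrete sketch, the cost-per-request baseline can be computed from a request log. The field names and per-million-token prices below are illustrative assumptions, not a real billing schema:

```python
# Minimal cost-per-request baseline over a logging window.
# Field names and per-token prices are illustrative assumptions.

def cost_per_request(requests, price_in_per_mtok, price_out_per_mtok):
    """Average inference cost per request for a list of logged requests."""
    if not requests:
        return 0.0
    total = sum(
        r["input_tokens"] / 1_000_000 * price_in_per_mtok
        + r["output_tokens"] / 1_000_000 * price_out_per_mtok
        for r in requests
    )
    return total / len(requests)

# Example: a 7-day window for one workflow (hypothetical numbers).
week = [{"input_tokens": 1_200, "output_tokens": 300}] * 5_000
baseline = cost_per_request(week, price_in_per_mtok=2.0, price_out_per_mtok=8.0)
```

Run the same function over the 7 days after a change ships and compare the two numbers; that is the entire measurement loop.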

Prioritization framework

Not all tactics are equal. Use two dimensions:

  • Savings potential: how much cost reduction is realistic for your workload
  • Effort: engineering time, testing, and rollout risk

High savings and low effort wins first. Low savings and high effort goes to the backlog.

One important nuance: savings potential varies significantly by workload. Prompt compression is only high-savings if your prompts are currently long. Model switching is only high-savings if you are using a premium model for tasks that do not require it. The ranking below assumes typical workloads — validate it against your specific situation before committing to a sprint.


Tactic 1: Prompt compression

Savings potential: High for context-heavy workflows
Effort: Low to medium

Long prompts are expensive because every token in the input costs money. Most production prompts have room to be shorter without sacrificing quality — accumulated system prompt additions, redundant instructions, and large retrieved contexts are the most common sources of waste.

What this looks like in practice:

A team running a document Q&A workflow has a system prompt that grew from 120 tokens to 680 tokens over six months as engineers added instructions, examples, and edge-case handling. Nobody removed anything; they only added. The system prompt alone currently costs $0.0014 per request. Trimming it back to 180 tokens targeted at the actual task reduces that to $0.00038 per request, a 73 percent reduction in system prompt cost.
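The arithmetic behind those numbers is straightforward. The per-million-token input price below is an assumption inferred to roughly match the figures in the example, not a quoted rate:

```python
# Reproducing the worked example: system prompt cost before and after
# trimming. The input price is an assumption (~$2 per million input
# tokens), chosen to roughly match the figures above.

PRICE_PER_MTOK = 2.06  # USD per million input tokens (assumed)

def prompt_cost(tokens: int) -> float:
    """Per-request cost of a prompt segment at the assumed input price."""
    return tokens / 1_000_000 * PRICE_PER_MTOK

before = prompt_cost(680)        # ~$0.0014 per request
after = prompt_cost(180)         # ~$0.00037 per request
reduction = 1 - after / before   # ~0.735, i.e. about 73 percent
```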

The most impactful changes are usually: trimming the system prompt to only what is necessary for the specific task, reducing retrieved context from top-10 to top-3 chunks where quality holds, and removing duplicate or near-duplicate instructions that accumulated over time.

  • Shorten system prompts: cut from 500 to 150 tokens where clarity allows
  • Limit RAG context: retrieve top 3 instead of top 10 when quality holds
  • Remove duplicate instructions: consolidate repeated rules
  • Use structured prompts: templates reduce token bloat

Failure mode: Aggressive compression hurts quality. If the system prompt previously included examples for edge cases, removing them can increase error rates on those cases. Test on a held-out sample before rolling to production. If quality drops, find the specific instruction that is carrying weight rather than restoring everything.

How to measure it: Track average input token count per request before and after. Cost per request should move in proportion to the token reduction.

Do this sprint: Identify your top 3 workflows by token cost. Shorten the system prompt for the largest one and A/B test on 20 percent of traffic before full rollout.


Tactic 2: Smaller model fallback

Savings potential: Very high for the right tasks
Effort: Medium

Using a premium model for every task is the most common source of avoidable spend. Many tasks, including classification, extraction, simple summarization, and structured output generation, perform equally well on mini or small model tiers at a fraction of the cost. For a sentiment classification task, the quality gap between GPT-5.2 and GPT-5 Mini is often under 2 percent while the price gap is roughly 7x.

What this looks like in practice:

A team uses GPT-5.2 for all of their AI workflows as a default because it is what they started with and "we know it works." They have eight workflows. After running a simple evaluation, they find that four of them — ticket routing, content tagging, email subject line scoring, and FAQ matching — perform at or above acceptable quality thresholds on GPT-5 Mini. Switching those four workflows reduces their monthly inference spend from $4,200 to $1,800, without any meaningful product impact.

The key is the evaluation step. You cannot assume which workflows tolerate smaller models without testing. The test does not need to be elaborate — 50 to 100 representative examples with expected outputs, run against both models, is usually enough to make the decision with confidence.

  • Route by task type: classification → GPT-5 Mini, complex reasoning → GPT-5.2
  • Use fallback chains: try the cheaper model first, escalate on failure
  • Evaluate per workflow: one workflow may save 10x, another may need premium
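A routing layer with a cheap-first fallback can be sketched in a few lines. The task taxonomy and model identifiers here echo the examples above but are placeholders for whatever tiers you actually run, and `call_model` and `quality_ok` stand in for your inference client and output validator:

```python
# Sketch of task-type routing with a cheap-first fallback chain.
# Model names and the task taxonomy are illustrative, not prescriptive.

CHEAP_TASKS = {"classification", "extraction", "simple_summarization"}

def pick_model(task_type: str) -> str:
    """Route simple tasks to the mini tier, everything else to premium."""
    return "gpt-5-mini" if task_type in CHEAP_TASKS else "gpt-5.2"

def complete_with_fallback(prompt, task_type, call_model, quality_ok):
    """Try the routed (cheaper) model first; escalate once on failure."""
    model = pick_model(task_type)
    response = call_model(model, prompt)
    if model != "gpt-5.2" and not quality_ok(response):
        response = call_model("gpt-5.2", prompt)  # escalate to premium
    return response
```

The escalation branch is what makes the routing safe to ship before your taxonomy is perfect: misrouted requests cost one extra cheap call, not a quality regression.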

Failure mode: Quality risk if task taxonomy is wrong. Switching a workflow that genuinely needs reasoning depth to a smaller model can produce subtle quality degradations that are hard to detect without careful evaluation. Always evaluate per workflow rather than making a blanket routing change.

How to measure it: Track quality score (from your eval harness) and cost per request for each switched workflow independently. Do not measure the aggregate — evaluate each one so you know which ones worked and which ones did not.

Do this sprint: List workflows that are classification, extraction, or simple summarization. Run an evaluation on one with GPT-5 Mini or Claude Haiku. If quality holds, ship it and move to the next.


Tactic 3: Caching

Savings potential: High for repeated inputs
Effort: Medium to high

Caching avoids paying to compute something you have already computed. There are two types worth considering: embedding caching, which avoids re-embedding the same document, and response caching, which avoids re-generating the same completion for identical or near-identical inputs.

What this looks like in practice:

A team builds a customer support assistant that uses the same product documentation corpus for every conversation. The corpus is 400 documents. Without caching, every conversation embeds a fresh set of context chunks. With embedding caching, those 400 documents are embedded once and the embeddings are stored. The team's embedding cost drops from $180 per month to $12 per month because the corpus only changes when documentation is updated.

For completion caching, a different team finds that 18 percent of their incoming support questions are near-duplicates — "how do I cancel my subscription?" appears in dozens of paraphrased forms. Implementing semantic caching for those questions reduces inference calls by 18 percent immediately.
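Exact-match caching, the safer starting point, can be as small as a dict keyed on a hash of the normalized input. `generate` is a stand-in for the real inference call, and the normalization here (strip and lowercase) is an illustrative choice:

```python
import hashlib

# Minimal exact-match response cache. `generate` is a stand-in for the
# real inference call. Use this only for deterministic inputs where a
# repeated question should get the repeated answer.

_cache: dict[str, str] = {}

def cached_generate(prompt: str, generate) -> str:
    """Return a cached response, calling `generate` only on a miss."""
    key = hashlib.sha256(prompt.strip().lower().encode()).hexdigest()
    if key not in _cache:
        _cache[key] = generate(prompt)  # cache miss: pay for inference
    return _cache[key]                  # cache hit: free
```

Instrument the miss branch with a counter so you can report the hit rate, which the measurement section below treats as the primary metric.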

  • Embedding cache: same document, same embedding, so cache it
  • Semantic cache: reuse responses across similar queries when safe
  • Prompt cache (where the provider supports it): reuse a long shared prefix across requests

Failure mode: Cache invalidation and hit-rate tuning add engineering complexity. Semantic caching in particular can return stale or slightly wrong answers if the similarity threshold is too loose. This tactic requires more careful implementation and testing than prompt compression. Do not ship semantic caching without a clear staleness policy and a way to monitor cache hit rates.

How to measure it: Track cache hit rate as a primary metric. Track cost per request before and after, and confirm the reduction is proportional to the hit rate. If hit rate is high but cost per request barely moved, the caching is not covering the expensive cases.

Do this sprint: Measure how many requests have identical or near-identical inputs. If more than 10 percent do, caching is worth investing in. Start with exact-match caching for deterministic inputs before attempting semantic caching.


Tactic 4: Batching

Savings potential: Medium for bulk workloads
Effort: Medium

Batching reduces the overhead cost of many small requests by combining them into fewer larger ones. For workloads that do not need real-time responses, the OpenAI Batch API and similar async endpoints offer 50 percent pricing discounts at the cost of higher latency.

What this looks like in practice:

A team runs nightly re-classification of all support tickets created in the prior 24 hours — typically 800 to 1,200 tickets. Currently they process each ticket as a separate synchronous request. Switching to the OpenAI Batch API reduces their nightly classification cost from $42 to $21 simply by using the batch endpoint. The job takes slightly longer to complete but still finishes before the morning review starts. No product or customer impact.
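The savings in that example are just the discount applied to the synchronous cost. A quick way to sanity-check candidate workloads, assuming the 50 percent batch discount mentioned above:

```python
# Estimate batch-endpoint savings for a background workload, assuming
# a 50 percent batch discount. Numbers mirror the nightly
# ticket-classification example above.

def batch_savings(sync_cost_per_run: float, discount: float = 0.5) -> float:
    """Cost saved per run by moving to a discounted batch endpoint."""
    return sync_cost_per_run * discount

def chunk(items, size):
    """Split a workload into batch-sized chunks for submission."""
    return [items[i:i + size] for i in range(0, len(items), size)]

nightly_saving = batch_savings(42.0)      # $21 saved per night
batches = chunk(list(range(1_000)), 500)  # 2 submissions of 500 tickets
```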

  • Batch embeddings: process 100 documents in one API call instead of 100 calls
  • Batch classification: run an overnight batch instead of real-time where possible
  • Use batch APIs where available: some providers offer batch endpoints at lower effective cost

Failure mode: Latency. This tactic only applies to workloads where the user or the system can wait. A real-time chat assistant cannot be batched. Trying to force batching on latency-sensitive workflows creates user experience problems. Always validate that the workload can tolerate the delay before switching.

How to measure it: Track job completion time and cost per batch run. Confirm that the latency increase is within acceptable bounds for the use case before declaring the tactic successful.

Do this sprint: Identify background or batch workloads — anything that runs overnight, on a schedule, or in the background rather than in response to a live user action. Check whether they are already batched. If not, estimate the savings from switching to a batch endpoint and prioritize if the number is material.


Tactic 5: Retrieval optimization

Savings potential: Medium to high for RAG systems
Effort: Medium

RAG systems are often more expensive than they need to be because they retrieve too much context and pass it all to the model. Retrieving top-10 chunks when top-3 would be sufficient doubles or triples the input token cost for every RAG request. Optimizing retrieval is about finding the minimum context that produces acceptable quality.

What this looks like in practice:

A team runs a product knowledge assistant. Their retrieval currently passes the top-8 chunks to the model, averaging 3,200 tokens of context per request. Running an evaluation with top-3 chunks shows that quality degrades on 4 percent of queries — mostly complex multi-hop questions. They implement a hybrid strategy: top-3 chunks for standard queries, with a fallback to top-8 for queries that contain multiple question marks or specific domain keywords. Average context drops to 1,600 tokens per request, cutting retrieval-related input cost by 50 percent while maintaining quality on complex queries.
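The hybrid strategy in that example reduces to a small routing function. The trigger heuristics (multiple question marks, domain keywords) are the ones named above, but the thresholds and keyword list are illustrative and would need tuning against your own failure cases:

```python
# Hybrid top-K selection: small context by default, wider retrieval for
# queries that look complex. Keywords and thresholds are illustrative.

COMPLEX_KEYWORDS = {"compare", "versus", "difference", "migrate"}

def choose_top_k(query: str, k_default: int = 3, k_fallback: int = 8) -> int:
    """Pick how many chunks to retrieve based on query complexity cues."""
    words = set(query.lower().split())
    looks_complex = query.count("?") > 1 or bool(words & COMPLEX_KEYWORDS)
    return k_fallback if looks_complex else k_default
```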

  • Reduce retrieved chunks: top 3 instead of top 10 when quality holds
  • Improve chunking: smaller, more focused chunks reduce noise
  • Pre-filter before retrieval: exclude irrelevant documents earlier
  • Use cheaper embeddings for pre-filtering: two-stage retrieval, a cheap embedding filter followed by expensive retrieval

Failure mode: Recall can drop if you over-optimize. The most common mistake is reducing top-K based on average quality without testing the tail. Retrieval misses tend to cluster on complex, multi-part, or ambiguous queries. Make sure your evaluation set includes those cases before reducing context aggressively.

How to measure it: Track retrieval quality (recall@k, hit@k) before and after any retrieval change. Track input token averages in parallel. A retrieval change that drops quality by 10 percent is not a cost optimization — it is a regression with a side effect.

Do this sprint: For your main RAG workflow, try reducing from top-K to top-(K-3) and measure quality on a held-out evaluation set. If quality holds, you save tokens. If it drops, investigate whether the failures cluster in a specific query type that could be handled with a fallback strategy.


Tactic 6: Response truncation and structure

Savings potential: Medium for long outputs
Effort: Low

Output tokens are generally more expensive than input tokens, and many workflows generate more output than they need. Setting explicit max_tokens limits, using structured output formats, and writing tighter prompts that constrain response length are all low-effort changes that reduce output token count.

What this looks like in practice:

A team has a workflow that generates daily summaries of customer conversations. The current prompt says "summarize the conversation." The model generates an average of 420 tokens per summary. The team rewrites the prompt to "summarize the conversation in three bullet points of no more than 20 words each" and sets max_tokens to 150. Average output drops to 95 tokens. The summaries are slightly more constrained but actually more useful for the dashboard where they appear.
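In request terms the change is one prompt rewrite plus one parameter. The payload below follows the common chat-completions shape, but the model name and exact field names are assumptions that depend on your provider and SDK:

```python
# Before/after request sketch for the summary workflow. The payload
# shape follows the common chat-completions style; the model name and
# exact field names are assumptions for your provider.

PROMPT = (
    "Summarize the conversation in three bullet points "
    "of no more than 20 words each."
)

request = {
    "model": "gpt-5-mini",  # illustrative model name
    "messages": [{"role": "user", "content": PROMPT}],
    "max_tokens": 150,      # hard cap on output spend
}
```

The prompt constrains the typical output; `max_tokens` is the backstop that bounds the worst case.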

  • Set max_tokens: cap completions when you know the expected length
  • Use JSON mode: structured output reduces verbose prose
  • Truncate summaries: "summarize in 2 sentences" instead of open-ended

Failure mode: User experience impact if outputs feel cut off or lose necessary nuance. This is the easiest tactic to over-apply. Test the truncated outputs with real users or reviewers before declaring success. Some workflows genuinely need long outputs.

How to measure it: Track average output token count per request before and after. Track quality metrics in parallel to confirm that shorter outputs are not degrading usefulness.

Do this sprint: Check your top 3 workflows for max_tokens settings. If any are unset or set very high, add or tighten the limit and test on a small sample first.


Tactic ranking matrix

  • Prompt compression: high savings, low to medium effort. Do first if your workflows are context-heavy.
  • Smaller model fallback: very high savings, medium effort. Do first if simple tasks are running on premium models.
  • Caching: high savings, medium to high effort. Do first if repeated inputs are common.
  • Batching: medium savings, medium effort. Do first if you have bulk or background workloads.
  • Retrieval optimization: medium to high savings, medium effort. Do first if RAG is a significant cost driver.
  • Response truncation: medium savings, low effort. Do first if long outputs are common.

How to sequence your first three sprints

Most teams try to tackle too many tactics at once and see limited results because they cannot isolate what worked. A better approach is to focus each sprint on one tactic, measure it cleanly, and then move to the next.

Sprint 1 — Baseline and quick wins:

  • Establish cost-per-request baselines for your top 3 workflows
  • Audit system prompt length for each workflow
  • Ship prompt compression for the workflow with the most obvious waste
  • Set max_tokens for any workflow that does not already have it

Sprint 2 — Model routing:

  • Run model evaluations on your classification, extraction, and simple summarization workflows
  • Switch the one or two workflows with the clearest quality parity to a smaller model tier
  • Measure cost-per-request before and after for each switched workflow

Sprint 3 — Retrieval or caching:

  • If RAG is a significant cost driver: reduce top-K and measure retrieval quality impact
  • If you have high repeat-query volume: implement exact-match caching for deterministic inputs
  • Measure cache hit rate or retrieval token reduction over 2 weeks

By sprint 4, you typically have enough data to evaluate more complex tactics (semantic caching, batching, advanced retrieval strategies) with a clearer picture of where the remaining waste is.


How StackSpend helps

StackSpend helps you measure optimization work by showing spend by provider, model, service, and category. You can compare before and after optimization work in one view. See AI cost monitoring.

What to do next


Know where your cloud and AI spend stands — every day.

Connect providers in minutes. Get 90 days of visibility and start receiving daily cost updates before the invoice lands.

14-day free trial. No credit card required. Plans from $19/month.