The fastest way to ship a bad LLM change is to evaluate it with taste.
Two people look at six outputs, say one prompt "feels better," and the change goes live. A week later review volume spikes, the wrong cases are being escalated, and nobody can prove which change caused it.
An evaluation system exists to stop that kind of rollout.
## Start with the workflow, not the model
The wrong question is:
- which model is best?
The right question is:
- what output must be correct for this workflow to be safe and useful?
If the workflow is support triage, your eval is not about eloquence. It is about route accuracy, review rate, and downstream resolution quality.
If the workflow is extraction, your eval is not one generic "quality score." It is field accuracy and review burden.
## One workflow, one eval harness
A practical eval system needs four things:
- a clearly defined task
- a representative dataset
- task-specific metrics
- a release threshold
That sounds obvious, but teams often skip straight from "we changed the prompt" to "let's glance at examples."
## A concrete eval harness

Here is a simple TypeScript example for evaluating a support classifier:

```typescript
type EvalExample = {
  input: string;
  expectedLabel: "billing" | "bug" | "feature_request" | "other";
};

type EvalResult = {
  correct: boolean;
  predicted: string;
  expected: string;
};

// `classifyTicket` is the classifier under test; it calls your model
// and returns the predicted label.
declare function classifyTicket(input: string): Promise<{ label: string }>;

export async function runEvalSet(
  examples: EvalExample[],
): Promise<{ accuracy: number; results: EvalResult[] }> {
  const results: EvalResult[] = [];

  for (const example of examples) {
    const prediction = await classifyTicket(example.input);
    results.push({
      correct: prediction.label === example.expectedLabel,
      predicted: prediction.label,
      expected: example.expectedLabel,
    });
  }

  const correctCount = results.filter((result) => result.correct).length;

  return {
    accuracy: correctCount / results.length,
    results,
  };
}
```
This is not fancy. That is the point. A usable eval harness does not need a platform on day one. It needs a repeatable dataset and metrics that actually describe the job.
## Metrics by task type
Different tasks need different metrics. Reusing one generic rubric across every workflow hides problems.
| Workflow type | Primary metrics | Typical guardrails |
|---|---|---|
| Extraction | Field precision, recall, F1 | Review rate, malformed payload rate |
| Classification or routing | Accuracy, precision and recall by label | False-positive cost, false-negative cost |
| RAG or retrieval | Recall@k, Hit@k, citation correctness | Unsupported-claim rate, latency |
| Chat or support | Resolution quality, escalation correctness | Containment rate, repeat-contact rate |
| Tool-use workflows | Task completion rate, tool success rate | Recovery success, average step count |
If you cannot name the task-specific metric, the workflow definition is still too fuzzy.
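For classification and routing, "precision and recall by label" can be computed directly from the harness output. A minimal sketch, reusing the `predicted`/`expected` shape from the eval results above (the `LabeledResult` name is illustrative):

```typescript
type LabeledResult = { predicted: string; expected: string };

// Compute precision and recall for each label that appears in the results.
function perLabelMetrics(
  results: LabeledResult[],
): Record<string, { precision: number; recall: number }> {
  const labels = new Set(results.flatMap((r) => [r.predicted, r.expected]));
  const metrics: Record<string, { precision: number; recall: number }> = {};

  for (const label of labels) {
    // True positives: predicted this label and it was correct.
    const tp = results.filter((r) => r.predicted === label && r.expected === label).length;
    // False positives: predicted this label when another was expected.
    const fp = results.filter((r) => r.predicted === label && r.expected !== label).length;
    // False negatives: this label was expected but not predicted.
    const fn = results.filter((r) => r.predicted !== label && r.expected === label).length;

    metrics[label] = {
      precision: tp + fp === 0 ? 0 : tp / (tp + fp),
      recall: tp + fn === 0 ? 0 : tp / (tp + fn),
    };
  }

  return metrics;
}
```

Per-label breakdowns matter because aggregate accuracy can look fine while one rare-but-costly label is consistently misrouted.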
## Build the smallest useful eval set
Do not wait for a benchmark with thousands of examples. Start with:
- 25 to 50 representative cases
- clear expected outputs
- a few known edge cases
- a few recent production failures
That is enough to catch many regressions if the task is well-defined.
The key habit is not size. It is maintenance. Every repeated production failure should become a future eval case.
## Offline evals and online signals solve different problems
Offline evals tell you whether a candidate change looks safe enough to ship.
Online signals tell you whether the shipped system is staying healthy under real traffic.
You need both.
Offline:
- compare prompt versions
- compare models
- test routing logic
- check retrieval changes before release
Online:
- monitor drift
- watch review rate
- monitor latency
- monitor cost per successful task
A prompt change that keeps quality flat but doubles latency is still a meaningful product change.
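"Cost per successful task" is worth spelling out, because dividing spend by raw request count hides failures. A minimal sketch, assuming a per-task record of spend and outcome (the `TaskRecord` shape is an assumption for illustration):

```typescript
type TaskRecord = { costUsd: number; succeeded: boolean };

// Total spend divided by the number of tasks that actually succeeded,
// not by raw request count. Failed attempts still cost money, so they
// raise the effective price of every success.
function costPerSuccessfulTask(tasks: TaskRecord[]): number {
  const totalCost = tasks.reduce((sum, t) => sum + t.costUsd, 0);
  const successes = tasks.filter((t) => t.succeeded).length;
  return successes === 0 ? Infinity : totalCost / successes;
}
```

A model that halves the per-request price but doubles the failure rate can leave this number unchanged, which is exactly why it belongs on the online dashboard.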
## Use release gates, not vague approval

A useful release gate looks like this:

```
Workflow: support ticket triage
Baseline accuracy: 0.89
Candidate minimum accuracy: 0.89
Guardrail: review rate must not rise by more than 3 points
Guardrail: p95 latency must not rise by more than 15%
Decision: reject candidate if any gate fails
```
That is much more reliable than "the cheaper model looked okay in our spot check."
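A gate like this is easy to express as code, which keeps the decision out of Slack threads. A minimal sketch using the thresholds from the triage example above (the field names are illustrative):

```typescript
type GateInput = {
  baselineAccuracy: number;
  candidateAccuracy: number;
  reviewRateDeltaPoints: number; // candidate minus baseline, in percentage points
  p95LatencyDeltaPct: number; // relative increase, e.g. 0.1 means +10%
};

// Reject the candidate unless every gate passes.
function passesReleaseGate(g: GateInput): boolean {
  return (
    g.candidateAccuracy >= g.baselineAccuracy &&
    g.reviewRateDeltaPoints <= 3 &&
    g.p95LatencyDeltaPct <= 0.15
  );
}
```

Running this in CI against the eval harness output turns "looked okay in our spot check" into a reproducible yes-or-no decision.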
## Turn production failures into assets
A mature eval set is built from real pain:
- edge cases that confused the model
- customer-facing regressions
- cases reviewers repeatedly fix
- retrieval misses that caused fabricated answers
Each one should become:
- a permanent regression case
- a new bucket in your analysis
- or a new rule if the failure should never rely on a model again
That is how the eval system gets more valuable over time.
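The feedback loop can be made mechanical. A minimal sketch of converting a production failure into a tagged eval case (the `ProductionFailure` shape and source names are assumptions for illustration):

```typescript
type ProductionFailure = {
  input: string;
  expectedLabel: string;
  // Where the failure surfaced; useful for bucketing later analysis.
  source: "reviewer_fix" | "customer_report" | "retrieval_miss";
};

// Convert a failure into a permanent regression case with provenance tags,
// so the eval set records both the example and why it exists.
function toEvalCase(failure: ProductionFailure) {
  return {
    input: failure.input,
    expectedLabel: failure.expectedLabel,
    tags: ["regression", failure.source],
  };
}
```

Tagging provenance is what makes the "new bucket in your analysis" step possible: you can later report accuracy on reviewer-fix cases separately from customer-reported ones.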
## The most common mistake
The most common mistake is evaluating only the final answer when the failure actually sits earlier in the pipeline.
Examples:
- the answer is wrong because retrieval missed the right chunk
- the route is wrong because the label set is poorly defined
- the action is unsafe because review thresholds are under-calibrated
If the system has multiple stages, evaluate those stages too.
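For a RAG pipeline, the earliest stage to isolate is retrieval. A minimal sketch of recall@k, assuming each eval case records the retrieved document ids and the one id that should have been found (the names are illustrative):

```typescript
type RetrievalExample = {
  retrievedIds: string[]; // ids returned by the retriever, ranked
  relevantId: string; // the id that should appear in the top k
};

// Fraction of examples where the relevant document appears in the top k.
// If this number is low, no amount of prompt tuning will fix the answers.
function recallAtK(examples: RetrievalExample[], k: number): number {
  const hits = examples.filter((e) =>
    e.retrievedIds.slice(0, k).includes(e.relevantId),
  ).length;
  return hits / examples.length;
}
```

Scoring retrieval separately tells you whether a wrong answer is a generation problem or a "the right chunk was never in the context" problem.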
## How StackSpend helps

An eval result is more useful when it sits next to economics. The Data Explorer lets you segment spend by provider and filter to the date range of a model rollout, so you can compare cost per task before and after and see whether a "quality win" also increased total inference spend or shifted your cost structure in ways the eval harness did not catch. Pairing that with Monitoring anomaly alerts gives you a signal the moment a new model version starts generating unexpected token volume in production, not just when the monthly invoice arrives.
## What to do next

- Pick one workflow and write down the output that must be correct for it to be safe and useful.
- Build a small eval set of 25 to 50 representative cases with clear expected outputs.
- Define task-specific metrics and a release gate with quality, cost, and latency guardrails.
- Feed every repeated production failure back into the dataset as a permanent regression case.

## FAQ
### Do I need a dedicated eval platform to start?
No. A script, a fixed dataset, and clear metrics are enough to start. Dedicated tools become useful when you have many workflows, many prompt versions, or team-wide review processes.
### How big should my eval set be?
Big enough to catch meaningful regressions for one workflow. For many teams that starts around 25 to 50 examples, then grows as new failures appear.
### Should I compare models with one universal score?
Usually no. Compare them on the actual job they are doing. A model that is better at long-form writing may still be worse for routing or extraction.
### How often should I refresh the eval set?
Whenever production teaches you something new. Repeated failures, edge cases, and review disagreements should become additions to the dataset.
### What is the best guardrail besides quality?
Cost and latency are usually the most important operational guardrails. A change that preserves quality but materially worsens economics is not automatically a win.