The fastest way to ship a bad LLM change is to evaluate it with taste.
Two people look at six outputs, say one prompt "feels better," and the change goes live. A week later review volume spikes, the wrong cases are being escalated, and nobody can prove which change caused it.
An evaluation system exists to stop that kind of rollout.
## Start with the workflow, not the model
The wrong question is:
- which model is best?
The right question is:
- what output must be correct for this workflow to be safe and useful?
If the workflow is support triage, your eval is not about eloquence. It is about route accuracy, review rate, and downstream resolution quality.
If the workflow is extraction, your eval is not one generic "quality score." It is field accuracy and review burden.
## One workflow, one eval harness
A practical eval system needs four things:
- a clearly defined task
- a representative dataset
- task-specific metrics
- a release threshold
That sounds obvious, but teams often skip straight from "we changed the prompt" to "let's glance at examples."
## A concrete eval harness

Here is a simple TypeScript example for evaluating a support classifier:

```typescript
type EvalExample = {
  input: string;
  expectedLabel: "billing" | "bug" | "feature_request" | "other";
};

type EvalResult = {
  correct: boolean;
  predicted: string;
  expected: string;
};

// `classifyTicket` is the classifier under test; it calls your model
// and returns the predicted label.
declare function classifyTicket(input: string): Promise<{ label: string }>;

export async function runEvalSet(
  examples: EvalExample[],
): Promise<{ accuracy: number; results: EvalResult[] }> {
  const results: EvalResult[] = [];

  for (const example of examples) {
    const prediction = await classifyTicket(example.input);
    results.push({
      correct: prediction.label === example.expectedLabel,
      predicted: prediction.label,
      expected: example.expectedLabel,
    });
  }

  const correctCount = results.filter((result) => result.correct).length;

  return {
    accuracy: correctCount / results.length,
    results,
  };
}
```
This is not fancy. That is the point. A usable eval harness does not need a platform on day one. It needs a repeatable dataset and metrics that actually describe the job.
## Metrics by task type
Different tasks need different metrics. Reusing one generic rubric across every workflow hides problems.
| Workflow type | Primary metrics | Typical guardrails |
|---|---|---|
| Extraction | Field precision, recall, F1 | Review rate, malformed payload rate |
| Classification or routing | Accuracy, precision and recall by label | False-positive cost, false-negative cost |
| RAG or retrieval | Recall@k, Hit@k, citation correctness | Unsupported-claim rate, latency |
| Chat or support | Resolution quality, escalation correctness | Containment rate, repeat-contact rate |
| Tool-use workflows | Task completion rate, tool success rate | Recovery success, average step count |
If you cannot name the task-specific metric, the workflow definition is still too fuzzy.
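For classification and routing, "precision and recall by label" can be computed directly from the harness output. A minimal sketch, reusing the `predicted`/`expected` shape from the eval results above (the `LabeledResult` name is illustrative):

```typescript
type LabeledResult = { predicted: string; expected: string };

// Compute precision and recall for each label that appears in the results.
function perLabelMetrics(
  results: LabeledResult[],
): Record<string, { precision: number; recall: number }> {
  const labels = new Set(results.flatMap((r) => [r.predicted, r.expected]));
  const metrics: Record<string, { precision: number; recall: number }> = {};

  for (const label of labels) {
    // True positives: predicted this label and it was correct.
    const tp = results.filter((r) => r.predicted === label && r.expected === label).length;
    // False positives: predicted this label when another was expected.
    const fp = results.filter((r) => r.predicted === label && r.expected !== label).length;
    // False negatives: this label was expected but not predicted.
    const fn = results.filter((r) => r.predicted !== label && r.expected === label).length;

    metrics[label] = {
      precision: tp + fp === 0 ? 0 : tp / (tp + fp),
      recall: tp + fn === 0 ? 0 : tp / (tp + fn),
    };
  }

  return metrics;
}
```

Per-label breakdowns matter because aggregate accuracy can look fine while one rare-but-costly label is consistently misrouted.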
## Build the smallest useful eval set
Do not wait for a benchmark with thousands of examples. Start with:
- 25 to 50 representative cases
- clear expected outputs
- a few known edge cases
- a few recent production failures
That is enough to catch many regressions if the task is well-defined.
The key habit is not size. It is maintenance. Every repeated production failure should become a future eval case.
## Offline evals and online signals solve different problems
Offline evals tell you whether a candidate change looks safe enough to ship.
Online signals tell you whether the shipped system is staying healthy under real traffic.
You need both.
Offline:
- compare prompt versions
- compare models
- test routing logic
- check retrieval changes before release
Online:
- monitor drift
- watch review rate
- monitor latency
- monitor cost per successful task
A prompt change that keeps quality flat but doubles latency is still a meaningful product change.
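"Cost per successful task" is worth spelling out, because dividing spend by raw request count hides failures. A minimal sketch, assuming a per-task record of spend and outcome (the `TaskRecord` shape is an assumption for illustration):

```typescript
type TaskRecord = { costUsd: number; succeeded: boolean };

// Total spend divided by the number of tasks that actually succeeded,
// not by raw request count. Failed attempts still cost money, so they
// raise the effective price of every success.
function costPerSuccessfulTask(tasks: TaskRecord[]): number {
  const totalCost = tasks.reduce((sum, t) => sum + t.costUsd, 0);
  const successes = tasks.filter((t) => t.succeeded).length;
  return successes === 0 ? Infinity : totalCost / successes;
}
```

A model that halves the per-request price but doubles the failure rate can leave this number unchanged, which is exactly why it belongs on the online dashboard.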
## Use release gates, not vague approval

A useful release gate looks like this:

```
Workflow: support ticket triage
Baseline accuracy: 0.89
Candidate minimum accuracy: 0.89
Guardrail: review rate must not rise by more than 3 points
Guardrail: p95 latency must not rise by more than 15%
Decision: reject candidate if any gate fails
```
That is much more reliable than "the cheaper model looked okay in our spot check."
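A gate like this is easy to express as code, which keeps the decision out of Slack threads. A minimal sketch using the thresholds from the triage example above (the field names are illustrative):

```typescript
type GateInput = {
  baselineAccuracy: number;
  candidateAccuracy: number;
  reviewRateDeltaPoints: number; // candidate minus baseline, in percentage points
  p95LatencyDeltaPct: number; // relative increase, e.g. 0.1 means +10%
};

// Reject the candidate unless every gate passes.
function passesReleaseGate(g: GateInput): boolean {
  return (
    g.candidateAccuracy >= g.baselineAccuracy &&
    g.reviewRateDeltaPoints <= 3 &&
    g.p95LatencyDeltaPct <= 0.15
  );
}
```

Running this in CI against the eval harness output turns "looked okay in our spot check" into a reproducible yes-or-no decision.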
## Turn production failures into assets
A mature eval set is built from real pain:
- edge cases that confused the model
- customer-facing regressions
- cases reviewers repeatedly fix
- retrieval misses that caused fabricated answers
Each one should become:
- a permanent regression case
- a new bucket in your analysis
- or a new rule if the failure should never rely on a model again
That is how the eval system gets more valuable over time.
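The feedback loop can be made mechanical. A minimal sketch of converting a production failure into a tagged eval case (the `ProductionFailure` shape and source names are assumptions for illustration):

```typescript
type ProductionFailure = {
  input: string;
  expectedLabel: string;
  // Where the failure surfaced; useful for bucketing later analysis.
  source: "reviewer_fix" | "customer_report" | "retrieval_miss";
};

// Convert a failure into a permanent regression case with provenance tags,
// so the eval set records both the example and why it exists.
function toEvalCase(failure: ProductionFailure) {
  return {
    input: failure.input,
    expectedLabel: failure.expectedLabel,
    tags: ["regression", failure.source],
  };
}
```

Tagging provenance is what makes the "new bucket in your analysis" step possible: you can later report accuracy on reviewer-fix cases separately from customer-reported ones.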
## The most common mistake
The most common mistake is evaluating only the final answer when the failure actually sits earlier in the pipeline.
Examples:
- the answer is wrong because retrieval missed the right chunk
- the route is wrong because the label set is poorly defined
- the action is unsafe because review thresholds are under-calibrated
If the system has multiple stages, evaluate those stages too.
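For a RAG pipeline, the earliest stage to isolate is retrieval. A minimal sketch of recall@k, assuming each eval case records the retrieved document ids and the one id that should have been found (the names are illustrative):

```typescript
type RetrievalExample = {
  retrievedIds: string[]; // ids returned by the retriever, ranked
  relevantId: string; // the id that should appear in the top k
};

// Fraction of examples where the relevant document appears in the top k.
// If this number is low, no amount of prompt tuning will fix the answers.
function recallAtK(examples: RetrievalExample[], k: number): number {
  const hits = examples.filter((e) =>
    e.retrievedIds.slice(0, k).includes(e.relevantId),
  ).length;
  return hits / examples.length;
}
```

Scoring retrieval separately tells you whether a wrong answer is a generation problem or a "the right chunk was never in the context" problem.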
## How StackSpend helps

An eval result is more useful when it sits next to economics. The Data Explorer lets you segment spend by provider and filter to the date range of a model rollout, so you can compare cost per task before and after and see whether a "quality win" also increased total inference spend or shifted your cost structure in ways the eval harness did not catch. Pairing that with Monitoring anomaly alerts gives you a signal the moment a new model version starts generating unexpected token volume in production, not just when the monthly invoice arrives.
## What to do next

- Pick one workflow and write down the output that must be correct for it to be safe and useful.
- Build a small eval set of 25 to 50 representative cases with clear expected outputs.
- Define task-specific metrics and a release gate with quality, cost, and latency guardrails.
- Feed every repeated production failure back into the dataset as a permanent regression case.

## FAQ
### Do I need a dedicated eval platform to start?
No. A script, a fixed dataset, and clear metrics are enough to start. Dedicated tools become useful when you have many workflows, many prompt versions, or team-wide review processes.
### How big should my eval set be?
Big enough to catch meaningful regressions for one workflow. For many teams that starts around 25 to 50 examples, then grows as new failures appear.
### Should I compare models with one universal score?
Usually no. Compare them on the actual job they are doing. A model that is better at long-form writing may still be worse for routing or extraction.
### How often should I refresh the eval set?
Whenever production teaches you something new. Repeated failures, edge cases, and review disagreements should become additions to the dataset.
### What is the best guardrail besides quality?
Cost and latency are usually the most important operational guardrails. A change that preserves quality but materially worsens economics is not automatically a win.