Your support team wants to auto-triage inbound tickets into `billing`, `bug`, or `feature_request`. The prototype looks fine in a chat window, but the first production run breaks because the model returns `billing issue`, `payments`, and `refund-related` instead of the three labels your queue expects.
That is the real job for structured outputs. The model is still doing interpretation, but the application owns the contract. If the answer will drive code, routing, or a database write, you should design the schema before you design the prompt.
What structured outputs are actually for
Structured outputs are most useful when the model answer is an intermediate system artifact, not the final user-facing prose.
Good fits:
- extracting fields from messy text or documents
- classifying one item into a small label set
- assigning a bounded score against a rubric
- routing work into a queue or workflow state
Bad fits:
- open-ended drafting
- brainstorming
- tasks where the response is only read by a human and never parsed
The practical rule is simple: if a parser, validator, queue, or write path sits immediately after the model call, use a schema-constrained response.
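That gate can be made concrete in a few lines: parse, check against the contract, and fail loudly instead of passing malformed payloads downstream. A minimal sketch, not tied to any particular SDK; the label set mirrors the triage example and the function name is hypothetical.

```typescript
// A minimal schema gate: the application, not the model, decides
// whether a payload is allowed to reach the queue.
type QueueLabel = "billing" | "bug" | "feature_request";

const ALLOWED_LABELS: ReadonlySet<string> = new Set([
  "billing",
  "bug",
  "feature_request",
]);

export function parseQueueLabel(raw: string): QueueLabel {
  const candidate = raw.trim().toLowerCase();
  if (!ALLOWED_LABELS.has(candidate)) {
    // Fail loudly instead of letting "billing issue" pollute the queue.
    throw new Error(`Label outside contract: "${raw}"`);
  }
  return candidate as QueueLabel;
}
```

Rejecting at this boundary turns a silent data-quality problem into a visible error you can count and alert on.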
Why free-form prompting fails in production
Free-form prompting usually fails in operationally boring ways:
- a required field is omitted
- a label drifts outside the allowed enum
- ambiguity is buried inside prose instead of surfaced as a flag
- safety refusal is mixed into the payload and your code cannot tell what happened
Those failures create more than annoyance. They create retry loops, manual cleanup, queue pollution, and hidden spend.
A concrete example: ticket triage with a real contract
For support triage, the product requirement is not "give me a smart answer." It is "return one valid queue label, explain it briefly, and flag cases that should be reviewed."
Here is what that looks like in TypeScript with the OpenAI SDK and Zod. The example uses gpt-4.1-mini as a concrete low-cost model — substitute the current mini-tier model for your provider; the pattern is model-agnostic.
```typescript
import OpenAI from "openai";
import { zodTextFormat } from "openai/helpers/zod";
import { z } from "zod";

const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

const TicketDecision = z.object({
  label: z.enum(["billing", "bug", "feature_request", "other"]),
  confidenceBand: z.enum(["high", "medium", "low"]),
  needsReview: z.boolean(),
  reason: z.string().min(1),
  evidenceQuote: z.string().min(1),
});

export async function classifyTicket(message: string) {
  const response = await client.responses.parse({
    model: "gpt-4.1-mini",
    input: [
      {
        role: "system",
        content:
          "Classify the support message. Use only the allowed labels. Mark needsReview=true when the message is ambiguous or contains multiple issues.",
      },
      { role: "user", content: message },
    ],
    text: {
      format: zodTextFormat(TicketDecision, "ticket_decision"),
    },
  });

  const decision = response.output_parsed;
  if (!decision) {
    // output_parsed can be null (e.g. on refusal); surface that explicitly.
    throw new Error("Model returned no parsable ticket decision");
  }

  // Hard business rules still live in code, not in the model.
  if (decision.label === "billing" && decision.confidenceBand === "low") {
    return { ...decision, needsReview: true };
  }
  return decision;
}
```
This design is useful because it separates concerns cleanly:
- the model handles interpretation
- the schema constrains the shape
- the application applies deterministic policy after the model responds
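Because the deterministic policy sits after the model call, it can be pulled out as a pure function and unit-tested with no API traffic at all. A sketch of that separation, using a plain interface that mirrors the Zod schema above:

```typescript
interface TicketDecision {
  label: "billing" | "bug" | "feature_request" | "other";
  confidenceBand: "high" | "medium" | "low";
  needsReview: boolean;
  reason: string;
  evidenceQuote: string;
}

// Deterministic post-processing: testable without calling any model.
export function applyTriagePolicy(decision: TicketDecision): TicketDecision {
  if (decision.label === "billing" && decision.confidenceBand === "low") {
    return { ...decision, needsReview: true };
  }
  return decision;
}
```

Keeping policy out of the prompt means a rule change is a one-line code diff, not a re-evaluation of the whole model behavior.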
Extraction, classification, and scoring are different contracts
The mistake many teams make is using one giant schema for all three jobs. The better approach is to keep each contract narrow.
| Task | Best schema shape | What to avoid |
|---|---|---|
| Extraction | Typed fields, nullable values, evidence text, review flag | Forcing missing values to be filled |
| Classification | One enum, brief reason, confidence band, review flag | Large uncontrolled label sets |
| Scoring | Numeric score, fixed range, rubric version, explanation | Changing scales without versioning |
If you need all three, use separate stages. Do not create one mega-schema that tries to be an ETL job, classifier, and analyst at the same time.
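One way to enforce that separation is to give each job its own narrow type and never merge them. The shapes below are illustrative, matching the table; field names are hypothetical:

```typescript
// Three narrow contracts, one per job. None of these should be merged.
interface ExtractionResult {
  contractValue: number | null; // nullable: missing beats guessed
  effectiveDate: string | null;
  evidenceQuote: string | null;
  needsReview: boolean;
}

interface ClassificationResult {
  label: "billing" | "bug" | "feature_request" | "other"; // one small enum
  reason: string;
  confidenceBand: "high" | "medium" | "low";
  needsReview: boolean;
}

interface ScoringResult {
  score: 1 | 2 | 3 | 4 | 5; // fixed range
  rubricVersion: string; // versioned so drift is visible
  explanation: string;
}

// Runtime counterpart to the fixed scoring range.
export function isInScoringRange(n: number): n is 1 | 2 | 3 | 4 | 5 {
  return Number.isInteger(n) && n >= 1 && n <= 5;
}
```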
Extraction needs nullable fields and evidence
Extraction is where teams most often force the model to invent values. If the source text does not include a contract value or a date, the right output is usually null, not a guess.
An extraction contract should usually include:
- required vs nullable fields
- an evidence quote for critical fields
- a needsReview flag
- a clear rule for insufficient evidence
That keeps the workflow honest. It also makes downstream review much faster because the reviewer can see which quote justified the field.
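The "no evidence, no value" rule can be enforced deterministically after the model responds. A sketch with hypothetical field names: if a critical field arrives without a supporting quote, null it out and flag the record for review rather than trusting it.

```typescript
interface ExtractedField {
  value: string | null;
  evidenceQuote: string | null;
}

interface ExtractionPayload {
  contractValue: ExtractedField;
  renewalDate: ExtractedField;
  needsReview: boolean;
}

// Enforce "no evidence, no value" in code, after the model responds.
export function enforceEvidence(payload: ExtractionPayload): ExtractionPayload {
  const out = structuredClone(payload);
  let flagged = payload.needsReview;
  for (const field of [out.contractValue, out.renewalDate]) {
    if (field.value !== null && !field.evidenceQuote?.trim()) {
      field.value = null; // a missing value beats an unsupported one
      flagged = true;
    }
  }
  return { ...out, needsReview: flagged };
}
```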
Scoring needs calibration, not just a number
Scoring feels tidy because it produces a number, but this is where teams often over-trust the model. A score without a versioned rubric is not stable enough to use operationally.
If you score leads from 1 to 5, you need:
- a fixed scale
- a rubric version
- examples of what a 1, 3, and 5 mean
- a review process for threshold decisions
Otherwise the model becomes a moving target and your downstream automation quietly drifts.
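The deterministic half of that contract belongs in code: validate the scale and the rubric version, and refuse scores you cannot interpret. The version tag and function name below are hypothetical.

```typescript
const CURRENT_RUBRIC = "lead-quality-v3"; // hypothetical rubric version tag

interface LeadScore {
  score: number;
  rubricVersion: string;
}

// Automation thresholds only make sense against a known rubric, so
// reject anything off-scale or scored against a stale rubric version.
export function validateLeadScore(raw: LeadScore): number {
  if (raw.rubricVersion !== CURRENT_RUBRIC) {
    throw new Error(`Stale rubric: ${raw.rubricVersion}`);
  }
  if (!Number.isInteger(raw.score) || raw.score < 1 || raw.score > 5) {
    throw new Error(`Score outside fixed 1-5 scale: ${raw.score}`);
  }
  return raw.score;
}
```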
Function calling vs structured outputs
Use structured outputs when the next step is "interpret this input into a shaped answer."
Use tool or function calling when the next step is "take an action."
A strong production pattern is:
- classify or extract with a schema
- validate the payload in code
- call a tool only if the payload is valid and allowed
That is safer than letting the same step both interpret the text and decide whether to execute something sensitive.
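In code, that gate can be a pure decision function sitting between the schema step and the tool step. The refund scenario, limit, and names here are hypothetical illustrations of the pattern:

```typescript
interface RefundRequest {
  label: "billing" | "bug" | "feature_request" | "other";
  needsReview: boolean;
  amountCents: number;
}

const AUTO_REFUND_LIMIT_CENTS = 5_000; // hypothetical policy limit

// The model interpreted the text; this function alone decides whether
// an action is allowed. The tool call happens only if this returns true.
export function mayAutoRefund(req: RefundRequest): boolean {
  return (
    req.label === "billing" &&
    !req.needsReview &&
    Number.isInteger(req.amountCents) &&
    req.amountCents > 0 &&
    req.amountCents <= AUTO_REFUND_LIMIT_CENTS
  );
}
```

Because the gate is deterministic, the riskiest path in the system is covered by ordinary unit tests rather than by prompt behavior.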
A fill-in design spec you can actually use
Before shipping a structured-output workflow, fill in this spec:
Workflow:
Downstream system using the output:
Required fields:
Nullable fields:
Allowed enums:
Definition of insufficient evidence:
When needsReview must be true:
Validation rules enforced in code:
Primary metric:
Guardrail metric:
That document is more useful than a vague prompt note because a product manager, engineer, and reviewer can all inspect the same contract.
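The spec also translates directly into a machine-checkable shape, so CI can reject a workflow whose contract has unfilled sections. A hypothetical sketch:

```typescript
interface WorkflowSpec {
  workflow: string;
  downstreamSystem: string;
  requiredFields: string[];
  nullableFields: string[];
  allowedEnums: Record<string, string[]>;
  insufficientEvidenceRule: string;
  needsReviewRule: string;
  codeValidationRules: string[];
  primaryMetric: string;
  guardrailMetric: string;
}

// Reject specs with empty sections before the workflow ships.
export function specIsComplete(spec: WorkflowSpec): boolean {
  return Object.values(spec).every((v) =>
    Array.isArray(v)
      ? v.length > 0
      : typeof v === "string"
        ? v.trim().length > 0
        : Object.keys(v).length > 0
  );
}
```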
The most common failure mode
The most common failure is not "the model hallucinated." It is "the team treated the schema like a safety blanket."
A schema alone does not solve:
- bad label design
- weak rubric definitions
- missing validation rules
- incorrect automation thresholds
Structured outputs reduce format risk. They do not remove the need for evaluation and deterministic checks.
How StackSpend helps
Structured-output systems are easier to measure at the workflow level because the boundary is explicit. In the Data Explorer, you can filter by provider and service category to compare cost per completed classification run before and after a schema change, spot retry inflation showing up as increased token volume on the same service, and see whether a "more reliable" structured workflow is actually reducing review volume or just adding more model calls. Setting a Budget alert on your classification service gives you a concrete signal when schema complexity is growing token spend faster than request volume.
What to do next
FAQ
Should I use native structured outputs or Instructor?
Use native structured outputs first when your provider supports them well enough for your workflow. Use a wrapper such as Instructor when you need portability across providers or a stronger typed developer experience across multiple SDKs.
Is JSON mode enough?
Usually no. JSON mode can give you valid JSON while still returning the wrong keys, wrong enum values, or the wrong shape. Schema-constrained outputs are more useful when the response drives code.
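A short illustration of the gap: `JSON.parse` happily accepts a payload whose keys and values are outside the contract, while even a minimal schema check catches it. The payload below is a made-up example.

```typescript
// Valid JSON, wrong contract: JSON mode would accept this payload.
const raw = '{"category": "billing issue", "note": "refund please"}';
const parsed = JSON.parse(raw) as Record<string, unknown>;

const ALLOWED = new Set(["billing", "bug", "feature_request", "other"]);

// A contract-level check the JSON parser alone never performs.
export const passesContract =
  typeof parsed.label === "string" && ALLOWED.has(parsed.label);
```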
Should I include the explanation field in production?
Usually yes, but keep it short. A brief reason is helpful for review queues and debugging. It should not replace deterministic validation or evidence capture.
How many labels are too many for a classification schema?
Once the label set becomes large, fuzzy, or frequently changing, accuracy usually falls and review becomes harder. If the labels are numerous, consider a router followed by a narrower classifier per route.
Can I use structured outputs for user-facing copy generation?
Sometimes, but that is not the highest-leverage use. Structured outputs are most valuable when the payload feeds a system, not when the output is just prose for a human reader.