Your support team wants to auto-triage inbound tickets into `billing`, `bug`, or `feature_request`. The prototype looks fine in a chat window, but the first production run breaks because the model returns `billing issue`, `payments`, and `refund-related` instead of the three labels your queue expects.
That is the real job for structured outputs. The model is still doing interpretation, but the application owns the contract. If the answer will drive code, routing, or a database write, you should design the schema before you design the prompt.
What structured outputs are actually for
Structured outputs are most useful when the model answer is an intermediate system artifact, not the final user-facing prose.
Good fits:
- extracting fields from messy text or documents
- classifying one item into a small label set
- assigning a bounded score against a rubric
- routing work into a queue or workflow state
Bad fits:
- open-ended drafting
- brainstorming
- tasks where the response is only read by a human and never parsed
The practical rule is simple: if a parser, validator, queue, or write path sits immediately after the model call, use a schema-constrained response.
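That gate can be made concrete in a few lines: parse, check against the contract, and fail loudly instead of passing malformed payloads downstream. A minimal sketch, not tied to any particular SDK; the label set mirrors the triage example and the function name is hypothetical.

```typescript
// A minimal schema gate: the application, not the model, decides
// whether a payload is allowed to reach the queue.
type QueueLabel = "billing" | "bug" | "feature_request";

const ALLOWED_LABELS: ReadonlySet<string> = new Set([
  "billing",
  "bug",
  "feature_request",
]);

export function parseQueueLabel(raw: string): QueueLabel {
  const candidate = raw.trim().toLowerCase();
  if (!ALLOWED_LABELS.has(candidate)) {
    // Fail loudly instead of letting "billing issue" pollute the queue.
    throw new Error(`Label outside contract: "${raw}"`);
  }
  return candidate as QueueLabel;
}
```

Rejecting at this boundary turns a silent data-quality problem into a visible error you can count and alert on.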
Why free-form prompting fails in production
Free-form prompting usually fails in operationally boring ways:
- a required field is omitted
- a label drifts outside the allowed enum
- ambiguity is buried inside prose instead of surfaced as a flag
- safety refusal is mixed into the payload and your code cannot tell what happened
Those failures create more than annoyance. They create retry loops, manual cleanup, queue pollution, and hidden spend.
A concrete example: ticket triage with a real contract
For support triage, the product requirement is not "give me a smart answer." It is "return one valid queue label, explain it briefly, and flag cases that should be reviewed."
Here is what that looks like in TypeScript with the OpenAI SDK and Zod. The example uses gpt-4.1-mini as a concrete low-cost model — substitute the current mini-tier model for your provider; the pattern is model-agnostic.
```typescript
import OpenAI from "openai";
import { zodTextFormat } from "openai/helpers/zod";
import { z } from "zod";

const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

const TicketDecision = z.object({
  label: z.enum(["billing", "bug", "feature_request", "other"]),
  confidenceBand: z.enum(["high", "medium", "low"]),
  needsReview: z.boolean(),
  reason: z.string().min(1),
  evidenceQuote: z.string().min(1),
});

export async function classifyTicket(message: string) {
  const response = await client.responses.parse({
    model: "gpt-4.1-mini",
    input: [
      {
        role: "system",
        content:
          "Classify the support message. Use only the allowed labels. Mark needsReview=true when the message is ambiguous or contains multiple issues.",
      },
      { role: "user", content: message },
    ],
    text: {
      format: zodTextFormat(TicketDecision, "ticket_decision"),
    },
  });

  const decision = response.output_parsed;
  if (!decision) {
    // output_parsed can be null (e.g. on refusal); surface that explicitly.
    throw new Error("Model returned no parsable ticket decision");
  }

  // Hard business rules still live in code, not in the model.
  if (decision.label === "billing" && decision.confidenceBand === "low") {
    return { ...decision, needsReview: true };
  }
  return decision;
}
```
This design is useful because it separates concerns cleanly:
- the model handles interpretation
- the schema constrains the shape
- the application applies deterministic policy after the model responds
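Because the deterministic policy sits after the model call, it can be pulled out as a pure function and unit-tested with no API traffic at all. A sketch of that separation, using a plain interface that mirrors the Zod schema above:

```typescript
interface TicketDecision {
  label: "billing" | "bug" | "feature_request" | "other";
  confidenceBand: "high" | "medium" | "low";
  needsReview: boolean;
  reason: string;
  evidenceQuote: string;
}

// Deterministic post-processing: testable without calling any model.
export function applyTriagePolicy(decision: TicketDecision): TicketDecision {
  if (decision.label === "billing" && decision.confidenceBand === "low") {
    return { ...decision, needsReview: true };
  }
  return decision;
}
```

Keeping policy out of the prompt means a rule change is a one-line code diff, not a re-evaluation of the whole model behavior.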
Extraction, classification, and scoring are different contracts
The mistake many teams make is using one giant schema for all three jobs. The better approach is to keep each contract narrow.
| Task | Best schema shape | What to avoid |
|---|---|---|
| Extraction | Typed fields, nullable values, evidence text, review flag | Forcing missing values to be filled |
| Classification | One enum, brief reason, confidence band, review flag | Large uncontrolled label sets |
| Scoring | Numeric score, fixed range, rubric version, explanation | Changing scales without versioning |
If you need all three, use separate stages. Do not create one mega-schema that tries to be an ETL job, classifier, and analyst at the same time.
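One way to enforce that separation is to give each job its own narrow type and never merge them. The shapes below are illustrative, matching the table; field names are hypothetical:

```typescript
// Three narrow contracts, one per job. None of these should be merged.
interface ExtractionResult {
  contractValue: number | null; // nullable: missing beats guessed
  effectiveDate: string | null;
  evidenceQuote: string | null;
  needsReview: boolean;
}

interface ClassificationResult {
  label: "billing" | "bug" | "feature_request" | "other"; // one small enum
  reason: string;
  confidenceBand: "high" | "medium" | "low";
  needsReview: boolean;
}

interface ScoringResult {
  score: 1 | 2 | 3 | 4 | 5; // fixed range
  rubricVersion: string; // versioned so drift is visible
  explanation: string;
}

// Runtime counterpart to the fixed scoring range.
export function isInScoringRange(n: number): n is 1 | 2 | 3 | 4 | 5 {
  return Number.isInteger(n) && n >= 1 && n <= 5;
}
```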
Extraction needs nullable fields and evidence
Extraction is where teams most often force the model to invent values. If the source text does not include a contract value or a date, the right output is usually null, not a guess.
An extraction contract should usually include:
- required vs nullable fields
- an evidence quote for critical fields
- a needsReview flag
- a clear rule for insufficient evidence
That keeps the workflow honest. It also makes downstream review much faster because the reviewer can see which quote justified the field.
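The "no evidence, no value" rule can be enforced deterministically after the model responds. A sketch with hypothetical field names: if a critical field arrives without a supporting quote, null it out and flag the record for review rather than trusting it.

```typescript
interface ExtractedField {
  value: string | null;
  evidenceQuote: string | null;
}

interface ExtractionPayload {
  contractValue: ExtractedField;
  renewalDate: ExtractedField;
  needsReview: boolean;
}

// Enforce "no evidence, no value" in code, after the model responds.
export function enforceEvidence(payload: ExtractionPayload): ExtractionPayload {
  const out = structuredClone(payload);
  let flagged = payload.needsReview;
  for (const field of [out.contractValue, out.renewalDate]) {
    if (field.value !== null && !field.evidenceQuote?.trim()) {
      field.value = null; // a missing value beats an unsupported one
      flagged = true;
    }
  }
  return { ...out, needsReview: flagged };
}
```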
Scoring needs calibration, not just a number
Scoring feels tidy because it produces a number, but this is where teams often over-trust the model. A score without a versioned rubric is not stable enough to use operationally.
If you score leads from 1 to 5, you need:
- a fixed scale
- a rubric version
- examples of what a 1, 3, and 5 mean
- a review process for threshold decisions
Otherwise the model becomes a moving target and your downstream automation quietly drifts.
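The deterministic half of that contract belongs in code: validate the scale and the rubric version, and refuse scores you cannot interpret. The version tag and function name below are hypothetical.

```typescript
const CURRENT_RUBRIC = "lead-quality-v3"; // hypothetical rubric version tag

interface LeadScore {
  score: number;
  rubricVersion: string;
}

// Automation thresholds only make sense against a known rubric, so
// reject anything off-scale or scored against a stale rubric version.
export function validateLeadScore(raw: LeadScore): number {
  if (raw.rubricVersion !== CURRENT_RUBRIC) {
    throw new Error(`Stale rubric: ${raw.rubricVersion}`);
  }
  if (!Number.isInteger(raw.score) || raw.score < 1 || raw.score > 5) {
    throw new Error(`Score outside fixed 1-5 scale: ${raw.score}`);
  }
  return raw.score;
}
```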
Function calling vs structured outputs
Use structured outputs when the next step is "interpret this input into a shaped answer."
Use tool or function calling when the next step is "take an action."
A strong production pattern is:
- classify or extract with a schema
- validate the payload in code
- call a tool only if the payload is valid and allowed
That is safer than letting the same step both interpret the text and decide whether to execute something sensitive.
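In code, that gate can be a pure decision function sitting between the schema step and the tool step. The refund scenario, limit, and names here are hypothetical illustrations of the pattern:

```typescript
interface RefundRequest {
  label: "billing" | "bug" | "feature_request" | "other";
  needsReview: boolean;
  amountCents: number;
}

const AUTO_REFUND_LIMIT_CENTS = 5_000; // hypothetical policy limit

// The model interpreted the text; this function alone decides whether
// an action is allowed. The tool call happens only if this returns true.
export function mayAutoRefund(req: RefundRequest): boolean {
  return (
    req.label === "billing" &&
    !req.needsReview &&
    Number.isInteger(req.amountCents) &&
    req.amountCents > 0 &&
    req.amountCents <= AUTO_REFUND_LIMIT_CENTS
  );
}
```

Because the gate is deterministic, the riskiest path in the system is covered by ordinary unit tests rather than by prompt behavior.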
A fill-in design spec you can actually use
Before shipping a structured-output workflow, fill in this spec:
Workflow:
Downstream system using the output:
Required fields:
Nullable fields:
Allowed enums:
Definition of insufficient evidence:
When needsReview must be true:
Validation rules enforced in code:
Primary metric:
Guardrail metric:
That document is more useful than a vague prompt note because a product manager, engineer, and reviewer can all inspect the same contract.
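The spec also translates directly into a machine-checkable shape, so CI can reject a workflow whose contract has unfilled sections. A hypothetical sketch:

```typescript
interface WorkflowSpec {
  workflow: string;
  downstreamSystem: string;
  requiredFields: string[];
  nullableFields: string[];
  allowedEnums: Record<string, string[]>;
  insufficientEvidenceRule: string;
  needsReviewRule: string;
  codeValidationRules: string[];
  primaryMetric: string;
  guardrailMetric: string;
}

// Reject specs with empty sections before the workflow ships.
export function specIsComplete(spec: WorkflowSpec): boolean {
  return Object.values(spec).every((v) =>
    Array.isArray(v)
      ? v.length > 0
      : typeof v === "string"
        ? v.trim().length > 0
        : Object.keys(v).length > 0
  );
}
```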
The most common failure mode
The most common failure is not "the model hallucinated." It is "the team treated the schema like a safety blanket."
A schema alone does not solve:
- bad label design
- weak rubric definitions
- missing validation rules
- incorrect automation thresholds
Structured outputs reduce format risk. They do not remove the need for evaluation and deterministic checks.
How StackSpend helps
Structured-output systems are easier to measure at the workflow level because the boundary is explicit. In the Data Explorer, you can filter by provider and service category to compare cost per completed classification run before and after a schema change, spot retry inflation showing up as increased token volume on the same service, and see whether a "more reliable" structured workflow is actually reducing review volume or just adding more model calls. Setting a Budget alert on your classification service gives you a concrete signal when schema complexity is growing token spend faster than request volume.
What to do next
FAQ
Should I use native structured outputs or Instructor?
Use native structured outputs first when your provider supports them well enough for your workflow. Use a wrapper such as Instructor when you need portability across providers or a stronger typed developer experience across multiple SDKs.
Is JSON mode enough?
Usually no. JSON mode can give you valid JSON while still returning the wrong keys, wrong enum values, or the wrong shape. Schema-constrained outputs are more useful when the response drives code.
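A short illustration of the gap: `JSON.parse` happily accepts a payload whose keys and values are outside the contract, while even a minimal schema check catches it. The payload below is a made-up example.

```typescript
// Valid JSON, wrong contract: JSON mode would accept this payload.
const raw = '{"category": "billing issue", "note": "refund please"}';
const parsed = JSON.parse(raw) as Record<string, unknown>;

const ALLOWED = new Set(["billing", "bug", "feature_request", "other"]);

// A contract-level check the JSON parser alone never performs.
export const passesContract =
  typeof parsed.label === "string" && ALLOWED.has(parsed.label);
```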
Should I include the explanation field in production?
Usually yes, but keep it short. A brief reason is helpful for review queues and debugging. It should not replace deterministic validation or evidence capture.
How many labels are too many for a classification schema?
Once the label set becomes large, fuzzy, or frequently changing, accuracy usually falls and review becomes harder. If the labels are numerous, consider a router followed by a narrower classifier per route.
Can I use structured outputs for user-facing copy generation?
Sometimes, but that is not the highest-leverage use. Structured outputs are most valuable when the payload feeds a system, not when the output is just prose for a human reader.