Binary and constrained-choice tasks are where LLMs can look more reliable than they really are.
Because the answer space is small, teams often assume the risk is small too. But a bounded choice still needs a real contract, calibration, and a review path. "Pick one of three actions" is only safe when the allowed outputs, automation rules, and fallback behavior are explicit.
## What this pattern is actually good for
Useful cases include:
- approve or escalate
- pass or fail
- route to one of a few queues
- choose one approved template
These are good LLM candidates when the input is messy language but the output space is narrow.
They are bad LLM candidates when the decision is already deterministic.
## When code or classic ML is the better answer
| Decision type | Best default | Why |
|---|---|---|
| Exact threshold check | Code | The logic is explicit and auditable |
| Stable high-volume label prediction | Traditional ML | Usually cheaper and faster at scale |
| Messy text to small label set | LLM with constrained output | The interpretation is semantic but the result is bounded |
| High-risk approval with hard policy rules | Code plus review | The model should not be final authority |
## A concrete constrained-choice contract
Suppose you are triaging whether an extracted vendor record is ready to move forward. A useful response shape is:
```typescript
type Decision = {
  choice: "approve" | "reject" | "escalate";
  confidenceBand: "high" | "medium" | "low";
  reason: string;
  needsReview: boolean;
};
```
Here is how that looks in a real flow:
```typescript
export async function decideRecordReadiness(recordText: string) {
  const decision = await classifyRecord(recordText);

  // The automation rule lives in code: never auto-approve on weak confidence.
  if (decision.choice === "approve" && decision.confidenceBand !== "high") {
    return {
      ...decision,
      choice: "escalate",
      needsReview: true,
      reason: "Approval requires high-confidence evidence",
    };
  }

  return decision;
}
```
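The flow above assumes a `classifyRecord` helper. A minimal sketch of the parsing half of that helper, with the model call itself left out and all names treated as assumptions: whatever the model returns, the output is validated against the allowed set, and anything outside the contract takes the safe fallback path.

```typescript
// Hypothetical sketch: validate raw model output against the Decision contract.
type Decision = {
  choice: "approve" | "reject" | "escalate";
  confidenceBand: "high" | "medium" | "low";
  reason: string;
  needsReview: boolean;
};

const CHOICES = ["approve", "reject", "escalate"] as const;
const BANDS = ["high", "medium", "low"] as const;

function parseDecision(raw: string): Decision {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    // Unparseable output never reaches downstream automation.
    return {
      choice: "escalate",
      confidenceBand: "low",
      reason: "Unparseable model output",
      needsReview: true,
    };
  }

  const d = parsed as Partial<Decision>;
  const valid =
    CHOICES.includes(d.choice as (typeof CHOICES)[number]) &&
    BANDS.includes(d.confidenceBand as (typeof BANDS)[number]) &&
    typeof d.reason === "string";

  if (!valid) {
    // Output outside the allowed set is escalated, not interpreted.
    return {
      choice: "escalate",
      confidenceBand: "low",
      reason: "Output outside allowed contract",
      needsReview: true,
    };
  }

  return {
    choice: d.choice!,
    confidenceBand: d.confidenceBand!,
    reason: d.reason!,
    needsReview: d.needsReview ?? false,
  };
}
```

The design choice worth noting is that validation failures map to `escalate` rather than throwing: ambiguity gets a routing destination instead of an exception.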
The important part is not the schema itself. It is that automation rules still live in code.
## Confidence is only a routing signal
Treat confidence as a traffic light, not a truth score.
It can help answer:
- should this be automated?
- should this be reviewed?
- should this take a fallback path?
It should not answer:
- is this definitely correct?
If you want it to drive routing, calibrate it against real examples and monitor the false-accept cost.
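One way to do that calibration check, sketched with assumed names: group labeled outcomes by the band the model reported and measure how often each band was actually correct. If the "high" band does not clear your automation bar, it should not drive automation.

```typescript
// Hypothetical calibration check: accuracy per reported confidence band.
// Band names and the LabeledCase shape are illustrative assumptions.
type LabeledCase = { band: "high" | "medium" | "low"; correct: boolean };

function accuracyByBand(cases: LabeledCase[]): Record<string, number> {
  const totals: Record<string, { correct: number; total: number }> = {};
  for (const c of cases) {
    const t = (totals[c.band] ??= { correct: 0, total: 0 });
    t.total += 1;
    if (c.correct) t.correct += 1;
  }

  // Convert counts to per-band accuracy.
  const out: Record<string, number> = {};
  for (const [band, t] of Object.entries(totals)) {
    out[band] = t.correct / t.total;
  }
  return out;
}
```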
## Good and bad use cases
Good:
- support triage into a few queues
- choosing one of several allowed outreach templates
- deciding whether an extracted payload is complete enough for review or escalation
Bad:
- approval thresholds already defined in policy
- tax, legal, or financial calculations
- repeated well-labeled prediction tasks where traditional ML is mature and cheap
The litmus test is simple: is the ambiguity in the language, or in the rule?
If the rule is explicit, use code.
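To make that concrete: a policy threshold belongs in plain code, not in a prompt. The limit below is an illustrative assumption, not a real policy value.

```typescript
// An explicit policy rule needs no model. Threshold is illustrative.
const APPROVAL_LIMIT_CENTS = 10_000_00;

function withinPolicyLimit(amountCents: number): boolean {
  return amountCents <= APPROVAL_LIMIT_CENTS;
}
```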
## A practical threshold spec
Write this before you automate:
- Workflow:
- Allowed outputs:
- Which outputs can be automated:
- Which outputs require review:
- What false-positive cost is unacceptable:
- What false-negative cost is acceptable:
- Primary metric:
- Guardrail metric:
If the team cannot answer these questions clearly, it is not ready to automate the decision.
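Once the spec is written down, its routing rules can be enforced directly. A minimal sketch, where the workflow name and field values are illustrative assumptions:

```typescript
// Hypothetical sketch: the filled-in spec as a typed config that routing
// code enforces. All values here are illustrative assumptions.
type ThresholdSpec = {
  workflow: string;
  allowedOutputs: string[];
  automatable: string[];
  requiresReview: string[];
};

function routeOutput(
  spec: ThresholdSpec,
  output: string
): "automate" | "review" | "reject" {
  // Anything outside the allowed set is rejected outright.
  if (!spec.allowedOutputs.includes(output)) return "reject";
  if (spec.automatable.includes(output)) return "automate";
  return "review";
}

const vendorTriageSpec: ThresholdSpec = {
  workflow: "vendor record triage",
  allowedOutputs: ["approve", "reject", "escalate"],
  automatable: ["reject"],
  requiresReview: ["approve", "escalate"],
};
```

For example, `routeOutput(vendorTriageSpec, "approve")` routes to review because auto-approval was deliberately left off the automatable list.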
## The common failure mode
The most common failure is letting a bounded decision return free-form prose, then pretending that text can be interpreted safely downstream.
The second most common failure is treating confidence as if it were calibrated probability when nobody has tested that assumption.
Both mistakes create the illusion of control without the actual control.
## How StackSpend helps
Bounded-decision systems are easy to measure as discrete workflows. In StackSpend, that lets you compare automation rate, review rate, and cost per completed decision across model tiers, then see whether a "cheaper" choice really lowered cost without pushing too many cases into review or rework.
## FAQ
### How many allowed choices is too many?
Once the option set gets large, overlapping, or frequently changing, the task often stops being a clean constrained-choice problem.
### Can I automate low-confidence outputs if the task is low risk?
Sometimes, but that should be a deliberate product decision backed by evaluation and acceptable error cost.
### Should I always include an escalate option?
Usually yes when the workflow has meaningful risk. It gives the system a safe place to send ambiguous or weak-evidence cases.
### When is classic ML a better fit?
When the label space is stable, you have good training data, and unit economics or latency matter at scale.
### What is the best early warning that the decision design is weak?
Reviewers repeatedly overturn the same automated choice or the system keeps returning medium-confidence cases with no useful routing rule attached.