Binary and constrained-choice tasks are where LLMs can look more reliable than they really are.
Because the answer space is small, teams often assume the risk is small too. But a bounded choice still needs a real contract, calibration, and a review path. "Pick one of three actions" is only safe when the allowed outputs, automation rules, and fallback behavior are explicit.
## What this pattern is actually good for
Useful cases include:
- approve or escalate
- pass or fail
- route to one of a few queues
- choose one approved template
These are good LLM candidates when the input is messy language but the output space is narrow.
They are bad LLM candidates when the decision is already deterministic.
## When code or classic ML is the better answer
| Decision type | Best default | Why |
|---|---|---|
| Exact threshold check | Code | The logic is explicit and auditable |
| Stable high-volume label prediction | Traditional ML | Usually cheaper and faster at scale |
| Messy text to small label set | LLM with constrained output | The interpretation is semantic but the result is bounded |
| High-risk approval with hard policy rules | Code plus review | The model should not be final authority |
## A concrete constrained-choice contract
Suppose you are triaging whether an extracted vendor record is ready to move forward. A useful response shape is:
```typescript
type Decision = {
  choice: "approve" | "reject" | "escalate";
  confidenceBand: "high" | "medium" | "low";
  reason: string;
  needsReview: boolean;
};
```
Here is how that looks in a real flow:
```typescript
export async function decideRecordReadiness(recordText: string) {
  const decision = await classifyRecord(recordText);

  // The automation rule lives in code: never auto-approve on weak confidence.
  if (decision.choice === "approve" && decision.confidenceBand !== "high") {
    return {
      ...decision,
      choice: "escalate",
      needsReview: true,
      reason: "Approval requires high-confidence evidence",
    };
  }

  return decision;
}
```
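The flow above assumes a `classifyRecord` helper. A minimal sketch of the parsing half of that helper, with the model call itself left out and all names treated as assumptions: whatever the model returns, the output is validated against the allowed set, and anything outside the contract takes the safe fallback path.

```typescript
// Hypothetical sketch: validate raw model output against the Decision contract.
type Decision = {
  choice: "approve" | "reject" | "escalate";
  confidenceBand: "high" | "medium" | "low";
  reason: string;
  needsReview: boolean;
};

const CHOICES = ["approve", "reject", "escalate"] as const;
const BANDS = ["high", "medium", "low"] as const;

function parseDecision(raw: string): Decision {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    // Unparseable output never reaches downstream automation.
    return {
      choice: "escalate",
      confidenceBand: "low",
      reason: "Unparseable model output",
      needsReview: true,
    };
  }

  const d = parsed as Partial<Decision>;
  const valid =
    CHOICES.includes(d.choice as (typeof CHOICES)[number]) &&
    BANDS.includes(d.confidenceBand as (typeof BANDS)[number]) &&
    typeof d.reason === "string";

  if (!valid) {
    // Output outside the allowed set is escalated, not interpreted.
    return {
      choice: "escalate",
      confidenceBand: "low",
      reason: "Output outside allowed contract",
      needsReview: true,
    };
  }

  return {
    choice: d.choice!,
    confidenceBand: d.confidenceBand!,
    reason: d.reason!,
    needsReview: d.needsReview ?? false,
  };
}
```

The design choice worth noting is that validation failures map to `escalate` rather than throwing: ambiguity gets a routing destination instead of an exception.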
The important part is not the schema itself. It is that automation rules still live in code.
## Confidence is only a routing signal
Treat confidence as a traffic light, not a truth score.
It can help answer:
- should this be automated?
- should this be reviewed?
- should this take a fallback path?
It should not answer:
- is this definitely correct?
If you want it to drive routing, calibrate it against real examples and monitor the false-accept cost.
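One way to do that calibration check, sketched with assumed names: group labeled outcomes by the band the model reported and measure how often each band was actually correct. If the "high" band does not clear your automation bar, it should not drive automation.

```typescript
// Hypothetical calibration check: accuracy per reported confidence band.
// Band names and the LabeledCase shape are illustrative assumptions.
type LabeledCase = { band: "high" | "medium" | "low"; correct: boolean };

function accuracyByBand(cases: LabeledCase[]): Record<string, number> {
  const totals: Record<string, { correct: number; total: number }> = {};
  for (const c of cases) {
    const t = (totals[c.band] ??= { correct: 0, total: 0 });
    t.total += 1;
    if (c.correct) t.correct += 1;
  }

  // Convert counts to per-band accuracy.
  const out: Record<string, number> = {};
  for (const [band, t] of Object.entries(totals)) {
    out[band] = t.correct / t.total;
  }
  return out;
}
```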
## Good and bad use cases
Good:
- support triage into a few queues
- choosing one of several allowed outreach templates
- deciding whether an extracted payload is complete enough for review or escalation
Bad:
- approval thresholds already defined in policy
- tax, legal, or financial calculations
- repeated well-labeled prediction tasks where traditional ML is mature and cheap
The litmus test is simple: is the ambiguity in the language, or in the rule?
If the rule is explicit, use code.
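To make that concrete: a policy threshold belongs in plain code, not in a prompt. The limit below is an illustrative assumption, not a real policy value.

```typescript
// An explicit policy rule needs no model. Threshold is illustrative.
const APPROVAL_LIMIT_CENTS = 10_000_00;

function withinPolicyLimit(amountCents: number): boolean {
  return amountCents <= APPROVAL_LIMIT_CENTS;
}
```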
## A practical threshold spec
Write this before you automate:
- Workflow:
- Allowed outputs:
- Which outputs can be automated:
- Which outputs require review:
- What false-positive cost is unacceptable:
- What false-negative cost is acceptable:
- Primary metric:
- Guardrail metric:
If the team cannot answer these questions clearly, it is not ready to automate the decision.
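Once the spec is written down, its routing rules can be enforced directly. A minimal sketch, where the workflow name and field values are illustrative assumptions:

```typescript
// Hypothetical sketch: the filled-in spec as a typed config that routing
// code enforces. All values here are illustrative assumptions.
type ThresholdSpec = {
  workflow: string;
  allowedOutputs: string[];
  automatable: string[];
  requiresReview: string[];
};

function routeOutput(
  spec: ThresholdSpec,
  output: string
): "automate" | "review" | "reject" {
  // Anything outside the allowed set is rejected outright.
  if (!spec.allowedOutputs.includes(output)) return "reject";
  if (spec.automatable.includes(output)) return "automate";
  return "review";
}

const vendorTriageSpec: ThresholdSpec = {
  workflow: "vendor record triage",
  allowedOutputs: ["approve", "reject", "escalate"],
  automatable: ["reject"],
  requiresReview: ["approve", "escalate"],
};
```

For example, `routeOutput(vendorTriageSpec, "approve")` routes to review because auto-approval was deliberately left off the automatable list.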
## The common failure mode
The most common failure is letting a bounded decision return free-form prose, then pretending that text can be interpreted safely downstream.
The second most common failure is treating confidence as if it were calibrated probability when nobody has tested that assumption.
Both mistakes create the illusion of control without the actual control.
## How StackSpend helps
Bounded-decision systems are easy to measure as discrete workflows. In StackSpend, that lets you compare automation rate, review rate, and cost per completed decision across model tiers, then see whether a "cheaper" choice really lowered cost without pushing too many cases into review or rework.
## FAQ
### How many allowed choices is too many?
Once the option set gets large, overlapping, or frequently changing, the task often stops being a clean constrained-choice problem.
### Can I automate low-confidence outputs if the task is low risk?
Sometimes, but that should be a deliberate product decision backed by evaluation and acceptable error cost.
### Should I always include an escalate option?
Usually yes when the workflow has meaningful risk. It gives the system a safe place to send ambiguous or weak-evidence cases.
### When is classic ML a better fit?
When the label space is stable, you have good training data, and unit economics or latency matter at scale.
### What is the best early warning that the decision design is weak?
Reviewers repeatedly overturn the same automated choice or the system keeps returning medium-confidence cases with no useful routing rule attached.