One of the best uses for LLMs is not on the hot path at all.
If your team keeps asking whether a live request really needs a model call, the answer is often no. The better design is to use the model offline to create features, then let cheaper downstream models, rules, or ranking systems use those features at serving time.
What this pattern actually is
An LLM-generated feature is any model-produced artifact that becomes input to a cheaper system later.
Examples:
- a support-conversation intent label used by a routing model
- a normalized topic tag used by search ranking
- a risk score used by a rules engine
- a short account summary used by a churn model
The model is not the product in these cases. It is part of the enrichment pipeline.
Why this often beats direct inference
Direct inference charges you on the live path forever.
Offline feature generation moves that cost into:
- nightly batches
- ingest-time enrichment
- periodic backfills
That matters when the same expensive interpretation would otherwise happen on every request.
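The arithmetic behind that claim can be sketched in a few lines. All prices and volumes below are illustrative assumptions, not benchmarks:

```typescript
// Illustrative comparison: per-request inference vs. one-time enrichment.
// Every number here is an assumption chosen for the example.
function livePathCost(costPerCall: number, readsPerRecord: number): number {
  // Direct inference pays for a fresh model call on every live read.
  return costPerCall * readsPerRecord;
}

function enrichmentCost(costPerCall: number, refreshesPerRecord: number): number {
  // Offline enrichment pays once per refresh, then the result is reused.
  return costPerCall * refreshesPerRecord;
}

// A record read 50 times but enriched twice pays for 2 calls, not 50.
const live = livePathCost(0.002, 50);   // 0.002 * 50 = 0.1
const batch = enrichmentCost(0.002, 2); // 0.002 * 2  = 0.004
```

The ratio of reads to refreshes is the whole story: the more often the same interpretation is reused, the more the offline version wins.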
A concrete batch-enrichment example
Here is a simple TypeScript sketch that enriches support conversations with controlled features:
type ConversationFeatures = {
  intent: "billing" | "bug" | "feature_request" | "other";
  sentiment: "positive" | "neutral" | "negative";
  escalationRisk: 1 | 2 | 3 | 4 | 5;
  summary: string;
};

// Assumed helpers: generateFeatures wraps the LLM call,
// saveConversationFeatures writes to the feature store.
declare function generateFeatures(transcript: string): Promise<ConversationFeatures>;
declare function saveConversationFeatures(record: {
  conversationId: string;
  features: ConversationFeatures;
  generatedAt: string;
  featureVersion: string;
}): Promise<void>;

export async function enrichConversation(
  conversationId: string,
  transcript: string,
): Promise<void> {
  const features = await generateFeatures(transcript);
  await saveConversationFeatures({
    conversationId,
    features,
    generatedAt: new Date().toISOString(),
    featureVersion: "v1", // version the contract so downstream readers can migrate
  });
}
The live product can then use those stored features for:
- queue routing
- prioritization
- reporting
- downstream classification
without calling an LLM for every user interaction.
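On the live path, serving becomes a cheap lookup plus a rule. A minimal routing sketch, where the in-memory store and queue names are assumptions standing in for a real feature store:

```typescript
type StoredFeatures = {
  intent: "billing" | "bug" | "feature_request" | "other";
  escalationRisk: 1 | 2 | 3 | 4 | 5;
};

// Stand-in for the feature store the batch job writes to.
const featureStore = new Map<string, StoredFeatures>();
featureStore.set("conv-42", { intent: "billing", escalationRisk: 4 });

// Routing reads precomputed features; no model call on the hot path.
function routeConversation(conversationId: string): string {
  const features = featureStore.get(conversationId);
  if (!features) return "default-queue"; // not yet enriched
  if (features.escalationRisk >= 4) return "priority-queue";
  return `${features.intent}-queue`;
}
```

The fallback branch matters: records the batch job has not reached yet should degrade to a default, not trigger a live model call.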
When this pattern is a strong fit
It works best when:
- the feature can be computed in batch
- the output will be reused many times
- the live path needs predictable latency
- a downstream model or rule system can actually use the feature
This is why it often works well for ranking, prioritization, recommendation, and operational triage.
When it is the wrong fit
It is weaker when:
- the feature goes stale quickly
- each request needs fresh context
- the downstream system cannot consume the signal well
- the feature is too subjective to validate
The important question is not "can the model generate something interesting?" It is "does the generated artifact improve the downstream system enough to justify its cost and maintenance?"
Keep the feature contract narrow
A good feature contract is controlled and boring:
- intent = billing | bug | feature_request | other
- risk_score = 1..5
- topic_tags = controlled label array
- summary = short normalized summary
A bad feature contract is vague free text that no downstream system can reliably use.
If the live system needs a stable input, give it one.
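One way to keep the contract narrow in practice is to validate generated features before storing them. A sketch using plain TypeScript guards (no schema library assumed; the field names mirror the contract above):

```typescript
const INTENTS = ["billing", "bug", "feature_request", "other"] as const;
type Intent = (typeof INTENTS)[number];

type FeatureRecord = {
  intent: Intent;
  riskScore: number; // integer 1..5
  summary: string;
};

// Reject anything outside the narrow contract instead of storing it.
function validateFeatures(raw: unknown): FeatureRecord | null {
  if (typeof raw !== "object" || raw === null) return null;
  const r = raw as Record<string, unknown>;
  const intentOk =
    typeof r.intent === "string" && (INTENTS as readonly string[]).includes(r.intent);
  const riskOk =
    typeof r.riskScore === "number" &&
    Number.isInteger(r.riskScore) &&
    r.riskScore >= 1 &&
    r.riskScore <= 5;
  const summaryOk = typeof r.summary === "string" && r.summary.length <= 280;
  if (!intentOk || !riskOk || !summaryOk) return null;
  return { intent: r.intent as Intent, riskScore: r.riskScore as number, summary: r.summary as string };
}
```

Dropping or flagging invalid records at write time is what keeps the downstream consumers boring and reliable.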
Evaluate the downstream lift, not just the model output
This is where teams often get fooled. The feature "looks good" in a sample review, but nobody checks whether it helps the downstream system.
Measure:
- agreement with human labels
- downstream model lift
- stability over time
- batch cost per record
- refresh cadence
If the churn model, ranker, or rules engine does not improve, the feature is not creating enough value.
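The first metric on that list, agreement with human labels, is a simple rate over a labeled sample. A sketch (the sample data is made up):

```typescript
// Fraction of records where the generated label matches the human label.
function labelAgreement<T>(generated: T[], human: T[]): number {
  if (generated.length !== human.length || generated.length === 0) {
    throw new Error("samples must be the same non-zero length");
  }
  const matches = generated.filter((g, i) => g === human[i]).length;
  return matches / generated.length;
}

// Hypothetical labeled sample of five conversations.
const modelLabels = ["billing", "bug", "bug", "other", "billing"];
const humanLabels = ["billing", "bug", "feature_request", "other", "billing"];
const agreement = labelAgreement(modelLabels, humanLabels); // 4/5 = 0.8
```

Tracking this rate over time also covers the stability metric: a falling agreement score is an early sign the feature or the underlying data has drifted.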
A practical implementation flow
Use this order:
1. define the feature contract
2. generate the feature on a sample set
3. compare it with human labels or business outcomes
4. backfill in batch
5. plug it into the cheaper downstream system
The pattern fails when teams do step 5 first because the feature feels promising.
A fill-in opportunity worksheet
Use this before building:
Workflow:
What interpretation is expensive on the live path?
Can that interpretation be precomputed?
Which downstream system will consume it?
How often does it need refresh?
What metric proves the feature helped?
What metric proves the feature became stale?
If the answer to "can it be precomputed?" is yes and the refresh cadence is not every request, this pattern is worth testing.
The economic trade-off that matters
This pattern is not "free" just because it runs offline. Batch jobs can still become expensive if:
- you enrich too many records
- you refresh too often
- you store verbose free-text outputs instead of controlled fields
- the downstream system never uses half the generated features
The real win is moving cost to a place where it is smaller, more predictable, and reused more times.
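The refresh-cadence question from the worksheet can be made concrete: enrichment only wins while live-path calls saved exceed refresh calls spent. A back-of-the-envelope sketch, with all volumes assumed:

```typescript
// Net model calls saved per record over a period:
// reads that would have triggered inference, minus refreshes performed.
function netCallsSaved(readsPerPeriod: number, refreshesPerPeriod: number): number {
  return readsPerPeriod - refreshesPerPeriod;
}

// A record read 30 times a month but refreshed daily saves nothing;
// the same record refreshed weekly saves 26 calls.
const overRefreshed = netCallsSaved(30, 30); // 0
const weeklyRefresh = netCallsSaved(30, 4);  // 26
```

When the net drops to zero or below, the batch job has quietly become a second live path with extra storage on top.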
How StackSpend helps
LLM-generated features change spend shape from live inference to enrichment jobs. In StackSpend, you can compare batch-enrichment cost against saved live-path inference, see which feature-generation job is growing fastest, and measure whether the shift actually lowered cost per serving event instead of quietly adding a second layer of AI spend.
What to do next
- Structured outputs for extraction, classification, and scoring
- Evaluation playbook for LLM applications
FAQ
What is the best kind of feature to generate with an LLM?
A feature that is expensive to infer repeatedly, stable enough to reuse, and useful to a cheaper downstream model or rule system.
Should the generated feature be free text or controlled labels?
Prefer controlled labels or bounded fields when possible. They are easier to validate and more useful to downstream systems.
How often should I refresh LLM-generated features?
Only as often as the underlying signal changes. Over-refreshing is one of the easiest ways to erase the economic benefit of this pattern.
Can this pattern replace traditional ML?
More often it complements traditional ML than replaces it. The common win is using LLMs for enrichment and traditional ML for high-volume serving.
How do I know whether the feature is worth keeping?
If it does not improve the downstream metric you care about, or if the improvement does not justify generation cost and maintenance, it should be simplified or removed.