March 11, 2026
By Andrew Day

Production chat systems: memory, handoffs, and escalation

Production chat is a workflow system, not just a prompt. Design memory, retrieval, containment, and human handoff together.


Production chat stops being "just a chatbot" the moment users come back for a second turn.

Now you need to decide what stays in session state, what belongs in durable account context, what should be fetched from retrieval, and when the assistant should stop and hand off to a human. If you do not separate those layers, the system turns into one long prompt with unclear boundaries and rising cost.

Chat is a workflow system

A production chat assistant usually has four jobs:

  1. understand the incoming turn
  2. fetch only the right context
  3. either answer or take a bounded action
  4. escalate cleanly when the system should stop

That is a better mental model than "make the system prompt better."

Memory is not one thing

When teams say "memory," they often mean three different data sources:

| Memory type | What belongs there | What does not |
| --- | --- | --- |
| Session memory | Current turn state, short summary, recent clarifications | Entire transcript forever |
| Durable user or account memory | Preferences, account attributes, known permissions | Ad hoc conversational guesses |
| Retrieval context | Policies, docs, product knowledge, approved references | Pretending docs are "memory" |

Once you name these separately, the architecture gets simpler.
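One way to keep the layers from blurring is to give each one its own type and make the prompt builder take all three explicitly. A minimal TypeScript sketch; the field names are illustrative, not a prescribed schema:

```typescript
// Session memory: small, rebuilt or trimmed every turn.
type SessionMemory = {
  sessionSummary: string; // a few sentences, never the transcript
  recentTurns: string[]; // e.g. the last three user messages
  unresolvedIssue?: string;
};

// Durable account memory: read from a store, changes rarely.
type AccountMemory = {
  plan: string;
  locale: string;
  knownPermissions: string[];
};

// Retrieval context: fetched per turn, never persisted as "memory".
type RetrievedDoc = {
  docId: string;
  snippet: string;
  score: number;
};

// Taking the three layers as separate arguments makes it obvious
// which layer each piece of prompt context came from.
function buildPromptContext(
  session: SessionMemory,
  account: AccountMemory,
  docs: RetrievedDoc[]
): string {
  return [
    `Session: ${session.sessionSummary}`,
    `Account: plan=${account.plan}, locale=${account.locale}`,
    ...docs.map((d) => `Doc ${d.docId}: ${d.snippet}`),
  ].join("\n");
}
```

Because each layer arrives through its own parameter, "add the transcript to memory" stops being possible by accident; it would show up as a visible design change.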

A concrete routing and memory pattern

Here is a compact TypeScript sketch for routing a turn and building the prompt context:

type TurnType =
  | "info_question"
  | "action_request"
  | "account_issue"
  | "unsupported"
  | "escalation_candidate";

type SessionState = {
  sessionSummary: string;
  recentActions: string[];
  unresolvedIssue?: string;
};

export async function handleTurn(input: {
  message: string;
  session: SessionState;
  accountId: string;
}) {
  // Route first: everything downstream depends on the turn type.
  const turnType = await classifyTurn(input.message);

  const accountFacts = await getAccountFacts(input.accountId);
  // Retrieval is conditional: only turn types that need grounded
  // knowledge pay for a knowledge-base search.
  const retrievalContext =
    turnType === "info_question" || turnType === "account_issue"
      ? await searchKnowledgeBase(input.message)
      : [];

  // Bounded actions get their own flow (validation, tool calls),
  // separate from answer generation.
  if (turnType === "action_request") {
    return routeToActionFlow({
      message: input.message,
      session: input.session,
      accountFacts,
    });
  }

  // Stop explicitly and hand the human enough context to continue.
  if (turnType === "unsupported" || turnType === "escalation_candidate") {
    return escalateToHuman({
      reason: turnType,
      sessionSummary: input.session.sessionSummary,
      accountFacts,
      message: input.message,
    });
  }

  // Default path: answer with summary, account facts, and retrieved docs.
  return answerQuestion({
    message: input.message,
    sessionSummary: input.session.sessionSummary,
    accountFacts,
    retrievalContext,
  });
}

The main design choices are visible here:

  • routing happens first
  • retrieval is conditional, not automatic
  • action and answer paths are separated
  • escalation is explicit

That is how you keep the system from turning into one expensive prompt blob.
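The sketch leaves classifyTurn undefined. A toy keyword-based stand-in is enough to make the flow runnable; a real system would use a small model call or a trained classifier, and the patterns below are illustrative only:

```typescript
type TurnType =
  | "info_question"
  | "action_request"
  | "account_issue"
  | "unsupported"
  | "escalation_candidate";

// Toy classifier: keyword heuristics stand in for a model call.
// Order matters: action and escalation signals win over generic questions.
function classifyTurn(message: string): TurnType {
  const m = message.toLowerCase();
  if (/refund|cancel|reset my|change my/.test(m)) return "action_request";
  if (/speak to a human|manager|this is ridiculous/.test(m))
    return "escalation_candidate";
  if (/charged|billing|account|login/.test(m)) return "account_issue";
  if (/how |what |when |where |policy/.test(m)) return "info_question";
  return "unsupported";
}
```

Even at this fidelity, the router makes the cost structure explicit: only two of the five turn types ever trigger retrieval, and two never reach the model's answer path at all.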

Do not carry the whole transcript forever

The most common chat anti-pattern is appending every prior message to every new turn.

That creates three problems:

  • prompt cost keeps rising
  • irrelevant history pollutes the current task
  • it becomes unclear which context actually mattered

A short session summary is usually more useful than raw transcript carry-forward. If something matters for future turns, summarize it deliberately.
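One cheap way to implement that: keep the last few turns verbatim and fold anything older into the running summary. A production version would summarize with a small model call; in this sketch a truncated line stands in so the shape of the pattern is visible:

```typescript
type SessionState = {
  sessionSummary: string;
  recentTurns: string[];
};

// Hypothetical cap: how many turns to carry verbatim.
const MAX_RECENT_TURNS = 3;

// Record a new user turn. When the verbatim window overflows, the
// oldest turn is folded into the summary instead of being carried.
function recordTurn(state: SessionState, userMessage: string): SessionState {
  const recent = [...state.recentTurns, userMessage];
  let summary = state.sessionSummary;
  while (recent.length > MAX_RECENT_TURNS) {
    const oldest = recent.shift()!;
    // Stand-in for a real summarization call.
    summary = `${summary} User said: ${oldest.slice(0, 80)}.`.trim();
  }
  return { sessionSummary: summary, recentTurns: recent };
}
```

The prompt cost per turn is now bounded by the cap plus the summary length, instead of growing linearly with conversation length.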

Action requests should not share the same path as answers

If the user wants to know a policy, retrieval and answer generation make sense.

If the user wants to change billing settings, reset access, or request a refund, that should become a bounded action flow with validation, tool calls, and escalation rules.

Combining both into one prompt makes it harder to debug failures and easier for the assistant to overstep.
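A bounded action flow can be as plain as validate, then execute, then escalate on anything outside policy. A hypothetical refund flow; the limit and the input shape are assumptions, not a real API:

```typescript
type ActionResult =
  | { status: "done"; detail: string }
  | { status: "needs_human"; reason: string };

// Hypothetical policy limit: refunds above this amount always escalate.
const MAX_AUTO_REFUND_USD = 50;

function handleRefundRequest(input: {
  amountUsd: number;
  orderExists: boolean;
}): ActionResult {
  // Validation comes before any tool call.
  if (!input.orderExists) {
    return { status: "needs_human", reason: "no matching order" };
  }
  if (input.amountUsd > MAX_AUTO_REFUND_USD) {
    return { status: "needs_human", reason: "amount above auto-refund limit" };
  }
  // In a real system this is where the refund API (a tool call) runs.
  return { status: "done", detail: `refunded $${input.amountUsd}` };
}
```

The point is that the escalation rules live in code, not in prompt text the model may or may not follow.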

A good handoff payload

A handoff should help the human continue the work, not restart it from zero.

At minimum, include:

  • reason for escalation
  • user intent
  • session summary
  • actions already attempted
  • relevant retrieved evidence
  • account facts that matter

Here is a simple structure:

type HandoffPayload = {
  reason: "policy_risk" | "missing_evidence" | "frustrated_user" | "high_value_action";
  customerIntent: string;
  sessionSummary: string;
  attemptedActions: string[];
  evidenceSnippets: string[];
  accountFacts: Record<string, string>;
};

If the reviewer has to re-read a long transcript to understand what happened, the handoff design is weak.
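Building that payload can be a pure function over state the system already holds. A sketch reusing the same HandoffPayload shape; the input shape and the evidence cap are assumptions:

```typescript
// Same shape as the HandoffPayload structure above.
type HandoffPayload = {
  reason: "policy_risk" | "missing_evidence" | "frustrated_user" | "high_value_action";
  customerIntent: string;
  sessionSummary: string;
  attemptedActions: string[];
  evidenceSnippets: string[];
  accountFacts: Record<string, string>;
};

function buildHandoff(input: {
  reason: HandoffPayload["reason"];
  intent: string;
  session: { sessionSummary: string; recentActions: string[] };
  evidence: string[];
  accountFacts: Record<string, string>;
}): HandoffPayload {
  return {
    reason: input.reason,
    customerIntent: input.intent,
    sessionSummary: input.session.sessionSummary,
    attemptedActions: input.session.recentActions,
    // Cap evidence so the reviewer sees the most relevant snippets,
    // not a transcript dump.
    evidenceSnippets: input.evidence.slice(0, 5),
    accountFacts: input.accountFacts,
  };
}
```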

What to measure in production chat

Containment is not enough. A chat assistant can keep users inside the bot while still doing a bad job.

Track:

  • resolution quality
  • escalation accuracy
  • repeat-contact rate
  • average cost per resolved conversation
  • review or handoff rate

If containment rises while repeat-contact rate also rises, the system is probably over-answering instead of resolving issues well.
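Most of these metrics fall out of a per-conversation record with outcome tags and cost attached. A sketch with an assumed record shape:

```typescript
type ConversationRecord = {
  resolved: boolean;
  repeatContact: boolean; // user returned about the same issue
  costUsd: number;
};

// Total spend divided by conversations that were actually resolved,
// so unresolved conversations still count toward the numerator.
function costPerResolved(records: ConversationRecord[]): number {
  const resolvedCount = records.filter((r) => r.resolved).length;
  const totalCost = records.reduce((sum, r) => sum + r.costUsd, 0);
  return resolvedCount === 0 ? Infinity : totalCost / resolvedCount;
}

// The guardrail metric: share of conversations that came back.
function repeatContactRate(records: ConversationRecord[]): number {
  if (records.length === 0) return 0;
  return records.filter((r) => r.repeatContact).length / records.length;
}
```

Tracking the two together is the point: a cheap assistant with a rising repeat-contact rate is deferring cost, not saving it.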

A practical design worksheet

Use this for one chat workflow:

Workflow:
Turn types:
What belongs in session memory:
What belongs in durable account context:
Which knowledge must come from retrieval:
Which requests require tools:
Which requests require escalation:
What the handoff payload must include:
Primary metric:
Guardrail metric:

This forces the team to separate memory, retrieval, action, and escalation before implementation.

The failure mode that causes the most pain

The failure mode is not usually "the model forgot something." It is "the system treated every turn like the same kind of problem."

Support chat becomes much more reliable when it can distinguish:

  • questions
  • account issues
  • actions
  • unsupported cases
  • escalation candidates

Routing is boring, but boring is good when you are trying to run a dependable support workflow.

How StackSpend helps

Chat systems hide cost growth in long sessions, unnecessary retrieval on every turn, and escalations that happen too late. In StackSpend, you can compare cost per resolved conversation, see whether one chat flow is driving excess token volume, and spot when a memory or routing change made the assistant more expensive without improving handoff quality.

FAQ

What is the difference between session memory and retrieval?

Session memory is conversation state from the live interaction. Retrieval is grounded external knowledge such as policies or documentation fetched on demand.

Should I store the full transcript as memory?

Usually no. A compact session summary plus the last few relevant turns is often more useful and much cheaper.

When should chat escalate to a human?

Escalate when the assistant lacks enough evidence, the requested action is high risk, the user is frustrated, or the workflow has real business or compliance consequences.

Should action requests and informational questions share one prompt?

Usually no. Action requests need validation and bounded execution rules. Informational answers need grounded context and answer generation.

What is the best leading indicator that chat quality is slipping?

Repeat-contact rate is one of the best signals. If users come back because the first interaction did not truly resolve the issue, containment alone can mislead you.

