March 11, 2026
By Andrew Day

Production chat systems: memory, handoffs, and escalation

Production chat is a workflow system, not just a prompt. Design memory, retrieval, containment, and human handoff together.


Production chat stops being "just a chatbot" the moment users come back for a second turn.

Now you need to decide what stays in session state, what belongs in durable account context, what should be fetched from retrieval, and when the assistant should stop and hand off to a human. If you do not separate those layers, the system turns into one long prompt with unclear boundaries and rising cost.

Chat is a workflow system

A production chat assistant usually has four jobs:

  1. understand the incoming turn
  2. fetch only the right context
  3. either answer or take a bounded action
  4. escalate cleanly when the system should stop

That is a better mental model than "make the system prompt better."

Memory is not one thing

When teams say "memory," they often mean three different data sources:

| Memory type | What belongs there | What does not |
| --- | --- | --- |
| Session memory | Current turn state, short summary, recent clarifications | Entire transcript forever |
| Durable user or account memory | Preferences, account attributes, known permissions | Ad hoc conversational guesses |
| Retrieval context | Policies, docs, product knowledge, approved references | Pretending docs are "memory" |

Once you name these separately, the architecture gets simpler.
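One way to keep the layers from blurring is to give each one its own type and make the prompt builder take all three explicitly. A minimal TypeScript sketch; the field names are illustrative, not a prescribed schema:

```typescript
// Session memory: small, rebuilt or trimmed every turn.
type SessionMemory = {
  sessionSummary: string; // a few sentences, never the transcript
  recentTurns: string[]; // e.g. the last three user messages
  unresolvedIssue?: string;
};

// Durable account memory: read from a store, changes rarely.
type AccountMemory = {
  plan: string;
  locale: string;
  knownPermissions: string[];
};

// Retrieval context: fetched per turn, never persisted as "memory".
type RetrievedDoc = {
  docId: string;
  snippet: string;
  score: number;
};

// Taking the three layers as separate arguments makes it obvious
// which layer each piece of prompt context came from.
function buildPromptContext(
  session: SessionMemory,
  account: AccountMemory,
  docs: RetrievedDoc[]
): string {
  return [
    `Session: ${session.sessionSummary}`,
    `Account: plan=${account.plan}, locale=${account.locale}`,
    ...docs.map((d) => `Doc ${d.docId}: ${d.snippet}`),
  ].join("\n");
}
```

Because each layer arrives through its own parameter, "add the transcript to memory" stops being possible by accident; it would show up as a visible design change.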

A concrete routing and memory pattern

Here is a compact TypeScript sketch for routing a turn and building the prompt context:

type TurnType =
  | "info_question"
  | "action_request"
  | "account_issue"
  | "unsupported"
  | "escalation_candidate";

type SessionState = {
  sessionSummary: string;
  recentActions: string[];
  unresolvedIssue?: string;
};

export async function handleTurn(input: {
  message: string;
  session: SessionState;
  accountId: string;
}) {
  // Route first: everything downstream depends on the turn type.
  const turnType = await classifyTurn(input.message);

  const accountFacts = await getAccountFacts(input.accountId);
  // Retrieval is conditional: only turn types that need grounded
  // knowledge pay for a knowledge-base search.
  const retrievalContext =
    turnType === "info_question" || turnType === "account_issue"
      ? await searchKnowledgeBase(input.message)
      : [];

  // Bounded actions get their own flow (validation, tool calls),
  // separate from answer generation.
  if (turnType === "action_request") {
    return routeToActionFlow({
      message: input.message,
      session: input.session,
      accountFacts,
    });
  }

  // Stop explicitly and hand the human enough context to continue.
  if (turnType === "unsupported" || turnType === "escalation_candidate") {
    return escalateToHuman({
      reason: turnType,
      sessionSummary: input.session.sessionSummary,
      accountFacts,
      message: input.message,
    });
  }

  // Default path: answer with summary, account facts, and retrieved docs.
  return answerQuestion({
    message: input.message,
    sessionSummary: input.session.sessionSummary,
    accountFacts,
    retrievalContext,
  });
}

The main design choices are visible here:

  • routing happens first
  • retrieval is conditional, not automatic
  • action and answer paths are separated
  • escalation is explicit

That is how you keep the system from turning into one expensive prompt blob.
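The sketch leaves classifyTurn undefined. A toy keyword-based stand-in is enough to make the flow runnable; a real system would use a small model call or a trained classifier, and the patterns below are illustrative only:

```typescript
type TurnType =
  | "info_question"
  | "action_request"
  | "account_issue"
  | "unsupported"
  | "escalation_candidate";

// Toy classifier: keyword heuristics stand in for a model call.
// Order matters: action and escalation signals win over generic questions.
function classifyTurn(message: string): TurnType {
  const m = message.toLowerCase();
  if (/refund|cancel|reset my|change my/.test(m)) return "action_request";
  if (/speak to a human|manager|this is ridiculous/.test(m))
    return "escalation_candidate";
  if (/charged|billing|account|login/.test(m)) return "account_issue";
  if (/how |what |when |where |policy/.test(m)) return "info_question";
  return "unsupported";
}
```

Even at this fidelity, the router makes the cost structure explicit: only two of the five turn types ever trigger retrieval, and two never reach the model's answer path at all.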

Do not carry the whole transcript forever

The most common chat anti-pattern is appending every prior message to every new turn.

That creates three problems:

  • prompt cost keeps rising
  • irrelevant history pollutes the current task
  • it becomes unclear which context actually mattered

A short session summary is usually more useful than raw transcript carry-forward. If something matters for future turns, summarize it deliberately.
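One cheap way to implement that: keep the last few turns verbatim and fold anything older into the running summary. A production version would summarize with a small model call; in this sketch a truncated line stands in so the shape of the pattern is visible:

```typescript
type SessionState = {
  sessionSummary: string;
  recentTurns: string[];
};

// Hypothetical cap: how many turns to carry verbatim.
const MAX_RECENT_TURNS = 3;

// Record a new user turn. When the verbatim window overflows, the
// oldest turn is folded into the summary instead of being carried.
function recordTurn(state: SessionState, userMessage: string): SessionState {
  const recent = [...state.recentTurns, userMessage];
  let summary = state.sessionSummary;
  while (recent.length > MAX_RECENT_TURNS) {
    const oldest = recent.shift()!;
    // Stand-in for a real summarization call.
    summary = `${summary} User said: ${oldest.slice(0, 80)}.`.trim();
  }
  return { sessionSummary: summary, recentTurns: recent };
}
```

The prompt cost per turn is now bounded by the cap plus the summary length, instead of growing linearly with conversation length.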

Action requests should not share the same path as answers

If the user wants to know a policy, retrieval and answer generation make sense.

If the user wants to change billing settings, reset access, or request a refund, that should become a bounded action flow with validation, tool calls, and escalation rules.

Combining both into one prompt makes it harder to debug failures and easier for the assistant to overstep.
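A bounded action flow can be as plain as validate, then execute, then escalate on anything outside policy. A hypothetical refund flow; the limit and the input shape are assumptions, not a real API:

```typescript
type ActionResult =
  | { status: "done"; detail: string }
  | { status: "needs_human"; reason: string };

// Hypothetical policy limit: refunds above this amount always escalate.
const MAX_AUTO_REFUND_USD = 50;

function handleRefundRequest(input: {
  amountUsd: number;
  orderExists: boolean;
}): ActionResult {
  // Validation comes before any tool call.
  if (!input.orderExists) {
    return { status: "needs_human", reason: "no matching order" };
  }
  if (input.amountUsd > MAX_AUTO_REFUND_USD) {
    return { status: "needs_human", reason: "amount above auto-refund limit" };
  }
  // In a real system this is where the refund API (a tool call) runs.
  return { status: "done", detail: `refunded $${input.amountUsd}` };
}
```

The point is that the escalation rules live in code, not in prompt text the model may or may not follow.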

A good handoff payload

A handoff should help the human continue the work, not restart it from zero.

At minimum, include:

  • reason for escalation
  • user intent
  • session summary
  • actions already attempted
  • relevant retrieved evidence
  • account facts that matter

Here is a simple structure:

type HandoffPayload = {
  reason: "policy_risk" | "missing_evidence" | "frustrated_user" | "high_value_action";
  customerIntent: string;
  sessionSummary: string;
  attemptedActions: string[];
  evidenceSnippets: string[];
  accountFacts: Record<string, string>;
};

If the reviewer has to re-read a long transcript to understand what happened, the handoff design is weak.
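Building that payload can be a pure function over state the system already holds. A sketch reusing the same HandoffPayload shape; the input shape and the evidence cap are assumptions:

```typescript
// Same shape as the HandoffPayload structure above.
type HandoffPayload = {
  reason: "policy_risk" | "missing_evidence" | "frustrated_user" | "high_value_action";
  customerIntent: string;
  sessionSummary: string;
  attemptedActions: string[];
  evidenceSnippets: string[];
  accountFacts: Record<string, string>;
};

function buildHandoff(input: {
  reason: HandoffPayload["reason"];
  intent: string;
  session: { sessionSummary: string; recentActions: string[] };
  evidence: string[];
  accountFacts: Record<string, string>;
}): HandoffPayload {
  return {
    reason: input.reason,
    customerIntent: input.intent,
    sessionSummary: input.session.sessionSummary,
    attemptedActions: input.session.recentActions,
    // Cap evidence so the reviewer sees the most relevant snippets,
    // not a transcript dump.
    evidenceSnippets: input.evidence.slice(0, 5),
    accountFacts: input.accountFacts,
  };
}
```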

What to measure in production chat

Containment is not enough. A chat assistant can keep users inside the bot while still doing a bad job.

Track:

  • resolution quality
  • escalation accuracy
  • repeat-contact rate
  • average cost per resolved conversation
  • review or handoff rate

If containment rises while repeat-contact rate also rises, the system is probably over-answering instead of resolving issues well.
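Most of these metrics fall out of a per-conversation record with outcome tags and cost attached. A sketch with an assumed record shape:

```typescript
type ConversationRecord = {
  resolved: boolean;
  repeatContact: boolean; // user returned about the same issue
  costUsd: number;
};

// Total spend divided by conversations that were actually resolved,
// so unresolved conversations still count toward the numerator.
function costPerResolved(records: ConversationRecord[]): number {
  const resolvedCount = records.filter((r) => r.resolved).length;
  const totalCost = records.reduce((sum, r) => sum + r.costUsd, 0);
  return resolvedCount === 0 ? Infinity : totalCost / resolvedCount;
}

// The guardrail metric: share of conversations that came back.
function repeatContactRate(records: ConversationRecord[]): number {
  if (records.length === 0) return 0;
  return records.filter((r) => r.repeatContact).length / records.length;
}
```

Tracking the two together is the point: a cheap assistant with a rising repeat-contact rate is deferring cost, not saving it.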

A practical design worksheet

Use this for one chat workflow:

Workflow:
Turn types:
What belongs in session memory:
What belongs in durable account context:
Which knowledge must come from retrieval:
Which requests require tools:
Which requests require escalation:
What the handoff payload must include:
Primary metric:
Guardrail metric:

This forces the team to separate memory, retrieval, action, and escalation before implementation.

The failure mode that causes the most pain

The failure mode is not usually "the model forgot something." It is "the system treated every turn like the same kind of problem."

Support chat becomes much more reliable when it can distinguish:

  • questions
  • account issues
  • actions
  • unsupported cases
  • escalation candidates

Routing is boring, but boring is good when you are trying to run a dependable support workflow.

How StackSpend helps

Chat systems hide cost growth in long sessions, unnecessary retrieval on every turn, and escalations that happen too late. In StackSpend, you can compare cost per resolved conversation, see whether one chat flow is driving excess token volume, and spot when a memory or routing change made the assistant more expensive without improving handoff quality.

FAQ

What is the difference between session memory and retrieval?

Session memory is conversation state from the live interaction. Retrieval is grounded external knowledge such as policies or documentation fetched on demand.

Should I store the full transcript as memory?

Usually no. A compact session summary plus the last few relevant turns is often more useful and much cheaper.

When should chat escalate to a human?

Escalate when the assistant lacks enough evidence, the requested action is high risk, the user is frustrated, or the workflow has real business or compliance consequences.

Should action requests and informational questions share one prompt?

Usually no. Action requests need validation and bounded execution rules. Informational answers need grounded context and answer generation.

What is the best leading indicator that chat quality is slipping?

Repeat-contact rate is one of the best signals. If users come back because the first interaction did not truly resolve the issue, containment alone can mislead you.

