Production chat stops being "just a chatbot" the moment users come back for a second turn.
Now you need to decide what stays in session state, what belongs in durable account context, what should be fetched from retrieval, and when the assistant should stop and hand off to a human. If you do not separate those layers, the system turns into one long prompt with unclear boundaries and rising cost.
Chat is a workflow system
A production chat assistant usually has four jobs:
- understand the incoming turn
- fetch only the right context
- either answer or take a bounded action
- escalate cleanly when the system should stop
That is a better mental model than "make the system prompt better."
Memory is not one thing
When teams say "memory," they usually mean one of three different data sources:
| Memory type | What belongs there | What does not |
|---|---|---|
| Session memory | Current turn state, short summary, recent clarifications | Entire transcript forever |
| Durable user or account memory | Preferences, account attributes, known permissions | Ad hoc conversational guesses |
| Retrieval context | Policies, docs, product knowledge, approved references | Pretending docs are "memory" |
Once you name these separately, the architecture gets simpler.
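One way to keep the separation honest is to give each layer its own type and assemble the prompt context from all three explicitly. This is an illustrative sketch; the type and field names are assumptions, not a fixed API:

```typescript
// Three distinct layers of "memory"; names here are illustrative.
type SessionMemory = {
  sessionSummary: string;          // short rolling summary, not the transcript
  recentClarifications: string[];
};

type AccountMemory = {
  preferences: Record<string, string>; // durable, verified account attributes
  permissions: string[];
};

type RetrievalContext = {
  snippets: string[];              // fetched on demand, never stored as "memory"
};

// Assemble prompt context while keeping each layer's origin visible.
function buildPromptContext(
  session: SessionMemory,
  account: AccountMemory,
  retrieval: RetrievalContext
): string {
  return [
    `Session summary: ${session.sessionSummary}`,
    `Known preferences: ${JSON.stringify(account.preferences)}`,
    ...retrieval.snippets.map((s) => `Reference: ${s}`),
  ].join("\n");
}
```

Because each layer arrives through its own parameter, it stays obvious which facts are durable, which are session-scoped, and which were fetched for this turn only.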
A concrete routing and memory pattern
Here is a compact TypeScript sketch for routing a turn and building the prompt context:
```typescript
type TurnType =
  | "info_question"
  | "action_request"
  | "account_issue"
  | "unsupported"
  | "escalation_candidate";

type SessionState = {
  sessionSummary: string;
  recentActions: string[];
  unresolvedIssue?: string;
};

export async function handleTurn(input: {
  message: string;
  session: SessionState;
  accountId: string;
}) {
  const turnType = await classifyTurn(input.message);
  const accountFacts = await getAccountFacts(input.accountId);

  // Retrieval is conditional: only question-like turns pay for a search.
  const retrievalContext =
    turnType === "info_question" || turnType === "account_issue"
      ? await searchKnowledgeBase(input.message)
      : [];

  // Actions get their own bounded flow instead of the answer path.
  if (turnType === "action_request") {
    return routeToActionFlow({
      message: input.message,
      session: input.session,
      accountFacts,
    });
  }

  // Escalation is an explicit outcome, not a fallback buried in a prompt.
  if (turnType === "unsupported" || turnType === "escalation_candidate") {
    return escalateToHuman({
      reason: turnType,
      sessionSummary: input.session.sessionSummary,
      accountFacts,
      message: input.message,
    });
  }

  return answerQuestion({
    message: input.message,
    sessionSummary: input.session.sessionSummary,
    accountFacts,
    retrievalContext,
  });
}
```
The main design choices are visible here:
- routing happens first
- retrieval is conditional, not automatic
- action and answer paths are separated
- escalation is explicit
That is how you keep the system from turning into one expensive prompt blob.
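The sketch above leaves `classifyTurn` abstract. In practice that is often a model call whose output is validated against the allowed turn types, backed by a deterministic fallback. The keyword rules below are a hedged illustration of that fallback only, not a recommended classifier:

```typescript
// Deterministic fallback classifier. A production system would typically put a
// model call in front of this and validate its output against the same union.
// The keyword patterns here are illustrative assumptions.
function classifyTurnFallback(
  message: string
): "info_question" | "action_request" | "account_issue" | "unsupported" | "escalation_candidate" {
  const m = message.toLowerCase();
  if (/refund|cancel|reset|change my/.test(m)) return "action_request";
  if (/agent|human|supervisor/.test(m)) return "escalation_candidate";
  if (/my account|my bill|my invoice/.test(m)) return "account_issue";
  if (/how|what|when|where|why|\?/.test(m)) return "info_question";
  return "unsupported";
}
```

The useful property is not the keyword list; it is that every turn lands in exactly one bucket, so the routing logic downstream never has to guess.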
Do not carry the whole transcript forever
The most common chat anti-pattern is appending every prior message to every new turn.
That creates three problems:
- prompt cost keeps rising
- irrelevant history pollutes the current task
- it becomes unclear which context actually mattered
A short session summary is usually more useful than raw transcript carry-forward. If something matters for future turns, summarize it deliberately.
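The mechanics can be as simple as keeping the last few turns verbatim and folding everything older into the summary. In a real system the folding step would be a model summarization call; this sketch concatenates naively just to show the shape, and the size limits are illustrative assumptions:

```typescript
type Turn = { role: "user" | "assistant"; text: string };

// Keep a short summary plus the last few turns; older turns are folded into
// the summary instead of being carried forward verbatim. A production system
// would summarize with a model call rather than concatenating.
function compactSession(
  summary: string,
  turns: Turn[],
  keepLast = 4
): { summary: string; recentTurns: Turn[] } {
  const older = turns.slice(0, -keepLast);
  const recentTurns = turns.slice(-keepLast);
  const folded = older.map((t) => `${t.role}: ${t.text}`).join("; ");
  const nextSummary = folded
    ? `${summary} ${folded}`.trim().slice(0, 500) // hard cap keeps prompt cost flat
    : summary;
  return { summary: nextSummary, recentTurns };
}
```

The important property is that prompt size stays roughly constant per turn instead of growing with conversation length.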
Action requests should not share the same path as answers
If the user wants to know a policy, retrieval and answer generation make sense.
If the user wants to change billing settings, reset access, or request a refund, that should become a bounded action flow with validation, tool calls, and escalation rules.
Combining both into one prompt makes it harder to debug failures and easier for the assistant to overstep.
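A bounded action flow can be very small: validate the request, cap what the assistant may do alone, and escalate everything above the cap. This refund sketch is a hedged illustration; the field names, the permission check, and the 5000-cent threshold are all assumptions:

```typescript
type ActionRequest = { kind: "refund"; amountCents: number; orderId: string };

type ActionResult =
  | { status: "executed"; detail: string }
  | { status: "escalated"; reason: string }
  | { status: "rejected"; reason: string };

// Validate first, bound what can run unattended, escalate the rest.
// The 5000-cent cap is an illustrative assumption, not a recommendation.
function runRefundFlow(req: ActionRequest, hasRefundPermission: boolean): ActionResult {
  if (!hasRefundPermission) return { status: "rejected", reason: "missing permission" };
  if (req.amountCents <= 0) return { status: "rejected", reason: "invalid amount" };
  if (req.amountCents > 5000) {
    return { status: "escalated", reason: "high_value_action" };
  }
  // In production this would call a payments tool; stubbed here.
  return { status: "executed", detail: `refunded ${req.amountCents} on ${req.orderId}` };
}
```

Because every outcome is an explicit status, failures are debuggable and the assistant cannot quietly overstep the cap.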
A good handoff payload
A handoff should help the human continue the work, not restart it from zero.
At minimum, include:
- reason for escalation
- user intent
- session summary
- actions already attempted
- relevant retrieved evidence
- account facts that matter
Here is a simple structure:
```typescript
type HandoffPayload = {
  reason:
    | "policy_risk"
    | "missing_evidence"
    | "frustrated_user"
    | "high_value_action";
  customerIntent: string;
  sessionSummary: string;
  attemptedActions: string[];
  evidenceSnippets: string[];
  accountFacts: Record<string, string>;
};
```
If the reviewer has to re-read a long transcript to understand what happened, the handoff design is weak.
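A small builder can enforce that discipline by trimming the payload at construction time. The type is repeated here so the sketch is self-contained, and the size caps are illustrative assumptions:

```typescript
type HandoffPayload = {
  reason: "policy_risk" | "missing_evidence" | "frustrated_user" | "high_value_action";
  customerIntent: string;
  sessionSummary: string;
  attemptedActions: string[];
  evidenceSnippets: string[];
  accountFacts: Record<string, string>;
};

// Build a handoff the reviewer can act on in one read: trimmed summary,
// capped evidence. The 800-char and 5-snippet limits are assumptions.
function buildHandoff(input: {
  reason: HandoffPayload["reason"];
  intent: string;
  sessionSummary: string;
  attemptedActions: string[];
  evidence: string[];
  accountFacts: Record<string, string>;
}): HandoffPayload {
  return {
    reason: input.reason,
    customerIntent: input.intent,
    sessionSummary: input.sessionSummary.slice(0, 800), // digest, not transcript
    attemptedActions: input.attemptedActions,
    evidenceSnippets: input.evidence.slice(0, 5),
    accountFacts: input.accountFacts,
  };
}
```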
What to measure in production chat
Containment is not enough. A chat assistant can keep users inside the bot while still doing a bad job.
Track:
- resolution quality
- escalation accuracy
- repeat-contact rate
- average cost per resolved conversation
- review or handoff rate
If containment rises while repeat-contact rate also rises, the system is probably over-answering instead of resolving issues well.
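These metrics are straightforward to compute per conversation once each one carries a few outcome fields. The record shape below is an assumption about what your logging captures, not a prescribed schema:

```typescript
// Assumed per-conversation outcome record; field names are illustrative.
type Conversation = {
  resolved: boolean;
  escalated: boolean;
  costUsd: number;
  repeatWithin7d: boolean; // user came back about the same issue within 7 days
};

// Aggregate the metrics discussed above. Note that containment and
// repeat-contact rate must be read together, not in isolation.
function chatMetrics(convos: Conversation[]) {
  const resolved = convos.filter((c) => c.resolved);
  const totalCost = convos.reduce((sum, c) => sum + c.costUsd, 0);
  return {
    containment: convos.filter((c) => !c.escalated).length / convos.length,
    repeatContactRate: convos.filter((c) => c.repeatWithin7d).length / convos.length,
    costPerResolved: resolved.length ? totalCost / resolved.length : Infinity,
  };
}
```

Charging the cost of unresolved conversations against the resolved ones is deliberate: a flow that answers cheaply but rarely resolves still shows up as expensive.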
A practical design worksheet
Use this for one chat workflow:
- Workflow:
- Turn types:
- What belongs in session memory:
- What belongs in durable account context:
- Which knowledge must come from retrieval:
- Which requests require tools:
- Which requests require escalation:
- What the handoff payload must include:
- Primary metric:
- Guardrail metric:
This forces the team to separate memory, retrieval, action, and escalation before implementation.
The failure mode that causes the most pain
The failure mode is not usually "the model forgot something." It is "the system treated every turn like the same kind of problem."
Support chat becomes much more reliable when it can distinguish:
- questions
- account issues
- actions
- unsupported cases
- escalation candidates
Routing is boring, but boring is good when you are trying to run a dependable support workflow.
How StackSpend helps
Chat systems hide cost growth in long sessions, unnecessary retrieval on every turn, and escalations that happen too late. In StackSpend, you can compare cost per resolved conversation, see whether one chat flow is driving excess token volume, and spot when a memory or routing change made the assistant more expensive without improving handoff quality.
FAQ
What is the difference between session memory and retrieval?
Session memory is conversation state from the live interaction. Retrieval is grounded external knowledge such as policies or documentation fetched on demand.
Should I store the full transcript as memory?
Usually no. A compact session summary plus the last few relevant turns is often more useful and much cheaper.
When should chat escalate to a human?
Escalate when the assistant lacks enough evidence, the requested action is high risk, the user is frustrated, or the workflow has real business or compliance consequences.
Should action requests and informational questions share one prompt?
Usually no. Action requests need validation and bounded execution rules. Informational answers need grounded context and answer generation.
What is the best leading indicator that chat quality is slipping?
Repeat-contact rate is one of the best signals. If users come back because the first interaction did not truly resolve the issue, containment alone can mislead you.