Your team has a working RAG assistant for internal documentation. The answers sound polished, but users keep saying some variation of: "it answered confidently, but it missed the actual policy page."
That usually means the problem is retrieval, not generation. A stronger model often just makes the wrong answer sound better. The higher-leverage fix is to treat retrieval like a funnel: gather candidates from more than one channel, rerank aggressively, and only then send evidence to the generator.
## The retrieval mistake most teams make
Many RAG systems start as:
- embed every chunk
- run one vector search
- send the top results to the model
That works for paraphrases, but it breaks on real enterprise queries that depend on exact terms such as:
- product names
- error codes
- internal acronyms
- contract language
- policy dates
Dense retrieval is good at semantic similarity. It is not enough by itself for mixed semantic plus exact-match workloads.
## What a strong default pipeline looks like
For most production RAG systems, the best default is:
| Stage | Purpose | Typical failure if omitted |
|---|---|---|
| Query normalization | Clean up slang, noise, or multi-part phrasing | Good content is never searched for directly |
| BM25 or lexical search | Catch exact words, IDs, and rare phrases | Codes and exact entities are missed |
| Dense retrieval | Catch paraphrases and semantic matches | Conceptual matches are missed |
| Reranking | Promote the best evidence and drop noisy chunks | The prompt gets bloated with irrelevant context |
| Generation | Answer from a smaller, better evidence set | The model has to reason over noisy context |
The win is not elegance. The win is that better retrieval usually reduces hallucinations and token waste at the same time.
## A concrete implementation sketch
Here is a simple TypeScript example that merges BM25 and dense candidates, then reranks before generation:
```typescript
type Chunk = {
  id: string;
  text: string;
  source: string;
};

// searchBm25, searchDense, and rerank are stand-ins for your lexical
// index, vector store, and reranker clients.
async function retrieveEvidence(query: string): Promise<Chunk[]> {
  // Gather candidates from both channels in parallel.
  const [bm25Candidates, denseCandidates] = await Promise.all([
    searchBm25(query, { topK: 20 }),
    searchDense(query, { topK: 20 }),
  ]);

  // Deduplicate overlap between the two result sets.
  const merged = dedupeById([...bm25Candidates, ...denseCandidates]);

  // Rerank the merged pool and keep only the strongest evidence.
  const reranked = await rerank({
    query,
    documents: merged.map((chunk) => chunk.text),
    topN: 6,
  });

  // Map reranker results (indices into `documents`) back to chunks.
  return reranked.map((result) => merged[result.index]);
}

function dedupeById(chunks: Chunk[]): Chunk[] {
  return [...new Map(chunks.map((chunk) => [chunk.id, chunk])).values()];
}
```
This example matters because it mirrors the actual design choice you need to make:
- candidate generation is broad
- reranking is selective
- generation only sees the narrowed set
If you send 20 raw chunks straight to the model, you are paying generation costs to compensate for weak retrieval.
## When BM25 helps more than embeddings
Lexical search is usually the missing layer when your corpus includes brittle terms:
- SKU names
- provider IDs
- legal clauses
- internal feature flags
- error messages copied from logs
Those queries often look low-quality to a pure semantic retriever because the exact token is the signal. BM25 is not old-fashioned here. It is the right tool for the job.
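To see why the exact token is the signal, here is a minimal BM25 scorer. This is an in-memory sketch for illustration only, with deliberately naive tokenization; a real system would use a lexical search engine rather than scoring documents one by one:

```typescript
// Minimal in-memory BM25 scorer. k1 and b use the common defaults;
// the corpus representation is deliberately simplistic.
type Doc = { id: string; tokens: string[] };

function bm25Score(
  queryTerms: string[],
  doc: Doc,
  corpus: Doc[],
  k1 = 1.2,
  b = 0.75
): number {
  const n = corpus.length;
  const avgLen = corpus.reduce((sum, d) => sum + d.tokens.length, 0) / n;
  let score = 0;
  for (const term of queryTerms) {
    // Document frequency: how many documents contain the term at all.
    const df = corpus.filter((d) => d.tokens.includes(term)).length;
    if (df === 0) continue; // a term absent from the corpus contributes nothing
    const idf = Math.log(1 + (n - df + 0.5) / (df + 0.5));
    // Term frequency within this document, with length normalization.
    const tf = doc.tokens.filter((t) => t === term).length;
    const norm = tf + k1 * (1 - b + (b * doc.tokens.length) / avgLen);
    score += (idf * tf * (k1 + 1)) / norm;
  }
  return score;
}
```

A query like `err-4012` scores the chunk containing that exact token well above a chunk that is merely about billing errors, which is precisely the behavior a pure embedding search can miss.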
## When dense retrieval matters more
Dense retrieval is strongest when the wording varies but the meaning is close:
- "how do I revoke access" vs "remove a user from the workspace"
- "billing export" vs "cost usage dataset"
- "wrong invoice amount" vs "charge discrepancy"
If your corpus is heavy on paraphrase and natural language, dense retrieval usually carries more weight.
The point is not to pick one winner. The point is to let both channels bring candidates to the same merge stage.
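One common way to implement that merge stage is reciprocal rank fusion (RRF), which combines ranked id lists from both channels without needing their raw scores to be comparable. This is a sketch of one merge strategy, not the only option; `k = 60` is the constant from the original RRF paper, used here as a default:

```typescript
// Reciprocal rank fusion: each list contributes 1 / (k + rank) per id,
// so an id ranked well by both channels rises to the top of the merge.
function reciprocalRankFusion(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, rank) => {
      // rank is 0-based here, so add 1 to match the usual formulation.
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```

Calling `reciprocalRankFusion([bm25Ids, denseIds])` yields one merged candidate list to hand to the reranker, with ids found by both channels promoted.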
## What Anthropic's contextual retrieval research showed
Anthropic published benchmark results from their Contextual Retrieval work that are worth knowing when making architecture decisions. All figures use top-20 retrieval failure rate (1 minus recall@20) — lower is better. Tests ran across codebases, fiction, ArXiv papers, and science papers using Gemini Text 004 embeddings and a Cohere reranker (as of September 2024).
| Retrieval approach | Top-20 failure rate | Reduction vs baseline |
|---|---|---|
| Embeddings only | 5.7% | — |
| Contextual embeddings alone | 3.7% | −35% |
| Contextual embeddings + contextual BM25 | 2.9% | −49% |
| Contextual embeddings + contextual BM25 + reranking | 1.9% | −67% |
The improvements stack. Doing all three is meaningfully better than doing two.
The practical lesson is not "copy one exact pipeline." It is:
- enrich chunks when raw chunks lose meaning without surrounding context
- combine contextual embeddings with BM25 rather than choosing one
- rerank before generation — it adds the most incremental gain
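As a sketch of the chunk-enrichment idea: before indexing, each chunk is prefixed with a short description of where it sits in its source document. In Anthropic's write-up that context is generated by an LLM per chunk; the `context` argument below stands in for that model call, and the exact prefix format is an assumption:

```typescript
// Prefix each chunk with document-aware context before it is embedded
// and BM25-indexed, so statements that lose meaning out of context
// stay searchable. `context` would come from an LLM in practice.
type RawChunk = { docTitle: string; text: string };

function contextualize(chunk: RawChunk, context: string): string {
  return `Document: ${chunk.docTitle}\nContext: ${context}\n\n${chunk.text}`;
}
```

A chunk that says only "the limit changed in Q3" becomes retrievable for the product and policy it actually belongs to.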
## Reranking is the step teams skip too early
Reranking is where you trade a little extra retrieval work for a much cleaner prompt. In practice, that usually improves three things:
- citation quality
- answer precision
- total generation tokens
If your retriever returns ten okay chunks and only three are truly relevant, reranking is the cheapest place to fix the problem. It is usually cheaper than increasing context windows or switching to a larger generation model.
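Applying the reranker's output can be more than a fixed topN cut. Here is a sketch (the score shape is an assumption; real reranker APIs differ) that also drops candidates below a relevance floor, so a weak candidate set shrinks further instead of padding the prompt:

```typescript
// Keep at most topN candidates, but also drop anything under a
// relevance floor so low-confidence evidence never reaches the prompt.
type Scored = { index: number; score: number };

function selectEvidence(
  scored: Scored[],
  topN: number,
  minScore: number
): Scored[] {
  return [...scored]
    .sort((a, b) => b.score - a.score) // highest relevance first
    .slice(0, topN)
    .filter((r) => r.score >= minScore);
}
```

The floor value is something to tune against your own labeled queries; set too high, it starves the generator of evidence, set too low, it is a no-op.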
## Query rewriting only helps when the input is the problem
Query rewriting is useful when the user asks:
- vague conversational questions
- multi-part questions
- shorthand that does not match indexed language
It is not a cure for:
- weak chunking
- missing metadata filters
- outdated or incomplete indexes
- poor corpus structure
If a query says "what changed in enterprise billing last quarter," a rewrite can help. If the right document is absent or badly chunked, rewriting will not save you.
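A gate like the following keeps rewriting targeted at the inputs that need it. The heuristics here are illustrative only, not a recommended production rule set; tune them against your own query logs, and treat exact tokens such as error codes as a signal to leave the query alone:

```typescript
// Decide whether a query is worth rewriting before retrieval.
// Exact tokens (error codes, quoted phrases) veto the rewrite, since
// rewriting risks destroying the very signal lexical search needs.
function shouldRewrite(query: string): boolean {
  const words = query.trim().split(/\s+/);
  // e.g. ERR-4012, SKU123, or a "quoted phrase" the user copied verbatim
  const hasExactToken = /[A-Z]{2,}-?\d+|"[^"]+"/.test(query);
  const looksConversational =
    /^(how|what|why|can|could|please)\b/i.test(query) && words.length > 8;
  const isMultiPart = /\band\b|\?.*\?/i.test(query);
  return !hasExactToken && (looksConversational || isMultiPart);
}
```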
## What to measure before touching generation prompts
If you only measure final-answer quality, you cannot tell whether the real issue is retrieval, ranking, or generation. Measure the retrieval stack directly.
Track:
- Recall@k
- Hit@k
- MRR or nDCG
- reranker lift compared with raw candidates
- citation correctness in the final answer
The best operational habit is to keep a small test set of queries where you already know the correct supporting chunk or document.
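With that labeled test set in place, the core metrics are a few lines each. This sketch assumes the simple case where each query has a single known relevant chunk id:

```typescript
// Retrieval metrics over a labeled test set: each entry records what
// the retriever returned and which chunk is known to be relevant.
type Labeled = { retrieved: string[]; relevantId: string };

// Fraction of queries whose relevant chunk appears in the top k results.
function recallAtK(results: Labeled[], k: number): number {
  const hits = results.filter((r) =>
    r.retrieved.slice(0, k).includes(r.relevantId)
  );
  return hits.length / results.length;
}

// Mean reciprocal rank: rewards putting the relevant chunk near the top.
function meanReciprocalRank(results: Labeled[]): number {
  const rr = results.map((r) => {
    const rank = r.retrieved.indexOf(r.relevantId);
    return rank === -1 ? 0 : 1 / (rank + 1);
  });
  return rr.reduce((a, b) => a + b, 0) / rr.length;
}
```

Run these before and after each retrieval change; a reranker that improves MRR while Recall@20 holds steady is doing exactly its job.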
## A practical design worksheet
Use this before changing your RAG stack:
- Primary query types:
- Corpus types:
- Exact-match terms that matter:
- Metadata filters available:
- Lexical candidate count:
- Dense candidate count:
- Reranked topN:
- Final chunks sent to generation:
- Primary retrieval metric:
- Guardrail metric:
This gives you something testable. "We improved RAG quality" is too vague. "Recall@10 improved from 0.62 to 0.79 while prompt tokens dropped 18%" is a real engineering result.
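One way to make the worksheet operational is to capture it as a typed config checked into the repo, so retrieval settings are reviewable in one place instead of scattered across call sites. Field names and defaults here are illustrative assumptions:

```typescript
// Retrieval funnel settings as a single reviewable config object.
type RetrievalConfig = {
  lexicalTopK: number; // BM25 candidates gathered
  denseTopK: number; // dense candidates gathered
  rerankTopN: number; // survivors after reranking
  maxChunksToGenerator: number; // hard cap on prompt evidence
  primaryMetric: "recall@10" | "mrr" | "ndcg";
  guardrailMetric: "prompt_tokens" | "latency_p95";
};

const defaultConfig: RetrievalConfig = {
  lexicalTopK: 20,
  denseTopK: 20,
  rerankTopN: 6,
  maxChunksToGenerator: 6,
  primaryMetric: "recall@10",
  guardrailMetric: "prompt_tokens",
};
```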
## The trade-off most teams miss
A better retrieval funnel often looks more complex on paper while being cheaper in production.
Why:
- fewer irrelevant chunks reach the model
- fewer users retry because the first answer was weak
- fewer support escalations happen after grounded answers fail
The cheapest RAG system is rarely the one with the fewest components. It is the one that wastes the fewest tokens and the fewest user turns.
## How StackSpend helps
Retrieval upgrades change spend shape across embeddings, rerankers, and generation. The Data Explorer lets you filter by provider and service to compare token volume before and after a reranker launch, see which specific workflow is driving embedding cost growth, and check whether a retrieval architecture change lowered cost per successful answer or only added one more billable step. The Monitoring view surfaces spending anomalies — useful for catching unexpected spikes when a reranker adds extra API calls you did not anticipate in load testing.
## What to do next

## FAQ
### Should I always use hybrid retrieval?
No. If your corpus is tiny, your queries are uniform, or exact-match terms barely matter, dense retrieval alone may be enough. Hybrid retrieval is most useful when both semantic similarity and exact language matter.
### How many chunks should I send to the generator?
Usually fewer than you think. Start by reranking down to a small set that really supports the answer. Sending more chunks often increases contradiction and token waste.
### Do I need a reranker if my vector search looks good?
Maybe not at first. But once your candidate set grows noisy, reranking is often the most efficient way to improve prompt precision without changing the generation model.
### When should I rewrite the query?
Rewrite only when the input is vague, conversational, or mismatched to corpus language. Do not rewrite every query automatically.
### Is Anthropic's contextual retrieval paper something I should copy exactly?
No. The useful lesson is architectural, not literal: enrich weak chunks, mix lexical and semantic retrieval, and rerank before generation when the candidate set is noisy.