Your team has a working RAG assistant for internal documentation. The answers sound polished, but users keep saying some variation of: "it answered confidently, but it missed the actual policy page."
That usually means the problem is retrieval, not generation. A stronger model often just makes the wrong answer sound better. The higher-leverage fix is to treat retrieval like a funnel: gather candidates from more than one channel, rerank aggressively, and only then send evidence to the generator.
## The retrieval mistake most teams make
Many RAG systems start as:
- embed every chunk
- run one vector search
- send the top results to the model
That works for paraphrases, but it breaks on real enterprise queries that depend on exact terms such as:
- product names
- error codes
- internal acronyms
- contract language
- policy dates
Dense retrieval is good at semantic similarity. It is not enough by itself for mixed semantic plus exact-match workloads.
## What a strong default pipeline looks like
For most production RAG systems, the best default is:
| Stage | Purpose | Typical failure if omitted |
|---|---|---|
| Query normalization | Clean up slang, noise, or multi-part phrasing | Good content is never searched for directly |
| BM25 or lexical search | Catch exact words, IDs, and rare phrases | Codes and exact entities are missed |
| Dense retrieval | Catch paraphrases and semantic matches | Conceptual matches are missed |
| Reranking | Promote the best evidence and drop noisy chunks | The prompt gets bloated with irrelevant context |
| Generation | Answer from a smaller, better evidence set | The model has to reason over noisy context |
The win is not elegance. The win is that better retrieval usually reduces hallucinations and token waste at the same time.
## A concrete implementation sketch
Here is a simple TypeScript example that merges BM25 and dense candidates, then reranks before generation:
```typescript
type Chunk = {
  id: string;
  text: string;
  source: string;
};

// searchBm25, searchDense, and rerank are stand-ins for your lexical
// index, vector store, and reranker clients.
async function retrieveEvidence(query: string): Promise<Chunk[]> {
  // Gather candidates from both channels in parallel.
  const [bm25Candidates, denseCandidates] = await Promise.all([
    searchBm25(query, { topK: 20 }),
    searchDense(query, { topK: 20 }),
  ]);

  // Deduplicate overlap between the two result sets.
  const merged = dedupeById([...bm25Candidates, ...denseCandidates]);

  // Rerank the merged pool and keep only the strongest evidence.
  const reranked = await rerank({
    query,
    documents: merged.map((chunk) => chunk.text),
    topN: 6,
  });

  // Map reranker results (indices into `documents`) back to chunks.
  return reranked.map((result) => merged[result.index]);
}

function dedupeById(chunks: Chunk[]): Chunk[] {
  return [...new Map(chunks.map((chunk) => [chunk.id, chunk])).values()];
}
```
This example matters because it mirrors the actual design choice you need to make:
- candidate generation is broad
- reranking is selective
- generation only sees the narrowed set
If you send 20 raw chunks straight to the model, you are paying generation costs to compensate for weak retrieval.
## When BM25 helps more than embeddings
Lexical search is usually the missing layer when your corpus includes brittle terms:
- SKU names
- provider IDs
- legal clauses
- internal feature flags
- error messages copied from logs
Those queries often look low-quality to a pure semantic retriever because the exact token is the signal. BM25 is not old-fashioned here. It is the right tool for the job.
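To see why the exact token is the signal, here is a minimal BM25 scorer. This is an in-memory sketch for illustration only, with deliberately naive tokenization; a real system would use a lexical search engine rather than scoring documents one by one:

```typescript
// Minimal in-memory BM25 scorer. k1 and b use the common defaults;
// the corpus representation is deliberately simplistic.
type Doc = { id: string; tokens: string[] };

function bm25Score(
  queryTerms: string[],
  doc: Doc,
  corpus: Doc[],
  k1 = 1.2,
  b = 0.75
): number {
  const n = corpus.length;
  const avgLen = corpus.reduce((sum, d) => sum + d.tokens.length, 0) / n;
  let score = 0;
  for (const term of queryTerms) {
    // Document frequency: how many documents contain the term at all.
    const df = corpus.filter((d) => d.tokens.includes(term)).length;
    if (df === 0) continue; // a term absent from the corpus contributes nothing
    const idf = Math.log(1 + (n - df + 0.5) / (df + 0.5));
    // Term frequency within this document, with length normalization.
    const tf = doc.tokens.filter((t) => t === term).length;
    const norm = tf + k1 * (1 - b + (b * doc.tokens.length) / avgLen);
    score += (idf * tf * (k1 + 1)) / norm;
  }
  return score;
}
```

A query like `err-4012` scores the chunk containing that exact token well above a chunk that is merely about billing errors, which is precisely the behavior a pure embedding search can miss.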
## When dense retrieval matters more
Dense retrieval is strongest when the wording varies but the meaning is close:
- "how do I revoke access" vs "remove a user from the workspace"
- "billing export" vs "cost usage dataset"
- "wrong invoice amount" vs "charge discrepancy"
If your corpus is heavy on paraphrase and natural language, dense retrieval usually carries more weight.
The point is not to pick one winner. The point is to let both channels bring candidates to the same merge stage.
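One common way to implement that merge stage is reciprocal rank fusion (RRF), which combines ranked id lists from both channels without needing their raw scores to be comparable. This is a sketch of one merge strategy, not the only option; `k = 60` is the constant from the original RRF paper, used here as a default:

```typescript
// Reciprocal rank fusion: each list contributes 1 / (k + rank) per id,
// so an id ranked well by both channels rises to the top of the merge.
function reciprocalRankFusion(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, rank) => {
      // rank is 0-based here, so add 1 to match the usual formulation.
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```

Calling `reciprocalRankFusion([bm25Ids, denseIds])` yields one merged candidate list to hand to the reranker, with ids found by both channels promoted.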
## What Anthropic's contextual retrieval research showed
Anthropic published benchmark results from their Contextual Retrieval work that are worth knowing when making architecture decisions. All figures use top-20 retrieval failure rate (1 minus recall@20) — lower is better. Tests ran across codebases, fiction, ArXiv papers, and science papers using Gemini Text 004 embeddings and a Cohere reranker (as of September 2024).
| Retrieval approach | Top-20 failure rate | Reduction vs baseline |
|---|---|---|
| Embeddings only | 5.7% | — |
| Contextual embeddings alone | 3.7% | −35% |
| Contextual embeddings + contextual BM25 | 2.9% | −49% |
| Contextual embeddings + contextual BM25 + reranking | 1.9% | −67% |
The improvements stack. Doing all three is meaningfully better than doing two.
The practical lesson is not "copy one exact pipeline." It is:
- enrich chunks when raw chunks lose meaning without surrounding context
- combine contextual embeddings with BM25 rather than choosing one
- rerank before generation — it adds the most incremental gain
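As a sketch of the chunk-enrichment idea: before indexing, each chunk is prefixed with a short description of where it sits in its source document. In Anthropic's write-up that context is generated by an LLM per chunk; the `context` argument below stands in for that model call, and the exact prefix format is an assumption:

```typescript
// Prefix each chunk with document-aware context before it is embedded
// and BM25-indexed, so statements that lose meaning out of context
// stay searchable. `context` would come from an LLM in practice.
type RawChunk = { docTitle: string; text: string };

function contextualize(chunk: RawChunk, context: string): string {
  return `Document: ${chunk.docTitle}\nContext: ${context}\n\n${chunk.text}`;
}
```

A chunk that says only "the limit changed in Q3" becomes retrievable for the product and policy it actually belongs to.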
## Reranking is the step teams skip too early
Reranking is where you trade a little extra retrieval work for a much cleaner prompt. In practice, that usually improves three things:
- citation quality
- answer precision
- total generation tokens
If your retriever returns ten okay chunks and only three are truly relevant, reranking is the cheapest place to fix the problem. It is usually cheaper than increasing context windows or switching to a larger generation model.
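Applying the reranker's output can be more than a fixed topN cut. Here is a sketch (the score shape is an assumption; real reranker APIs differ) that also drops candidates below a relevance floor, so a weak candidate set shrinks further instead of padding the prompt:

```typescript
// Keep at most topN candidates, but also drop anything under a
// relevance floor so low-confidence evidence never reaches the prompt.
type Scored = { index: number; score: number };

function selectEvidence(
  scored: Scored[],
  topN: number,
  minScore: number
): Scored[] {
  return [...scored]
    .sort((a, b) => b.score - a.score) // highest relevance first
    .slice(0, topN)
    .filter((r) => r.score >= minScore);
}
```

The floor value is something to tune against your own labeled queries; set too high, it starves the generator of evidence, set too low, it is a no-op.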
## Query rewriting only helps when the input is the problem
Query rewriting is useful when the user asks:
- vague conversational questions
- multi-part questions
- shorthand that does not match indexed language
It is not a cure for:
- weak chunking
- missing metadata filters
- outdated or incomplete indexes
- poor corpus structure
If a query says "what changed in enterprise billing last quarter," a rewrite can help. If the right document is absent or badly chunked, rewriting will not save you.
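A gate like the following keeps rewriting targeted at the inputs that need it. The heuristics here are illustrative only, not a recommended production rule set; tune them against your own query logs, and treat exact tokens such as error codes as a signal to leave the query alone:

```typescript
// Decide whether a query is worth rewriting before retrieval.
// Exact tokens (error codes, quoted phrases) veto the rewrite, since
// rewriting risks destroying the very signal lexical search needs.
function shouldRewrite(query: string): boolean {
  const words = query.trim().split(/\s+/);
  // e.g. ERR-4012, SKU123, or a "quoted phrase" the user copied verbatim
  const hasExactToken = /[A-Z]{2,}-?\d+|"[^"]+"/.test(query);
  const looksConversational =
    /^(how|what|why|can|could|please)\b/i.test(query) && words.length > 8;
  const isMultiPart = /\band\b|\?.*\?/i.test(query);
  return !hasExactToken && (looksConversational || isMultiPart);
}
```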
## What to measure before touching generation prompts
If you only measure final-answer quality, you cannot tell whether the real issue is retrieval, ranking, or generation. Measure the retrieval stack directly.
Track:
- Recall@k
- Hit@k
- MRR or nDCG
- reranker lift compared with raw candidates
- citation correctness in the final answer
The best operational habit is to keep a small test set of queries where you already know the correct supporting chunk or document.
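With that labeled test set in place, the core metrics are a few lines each. This sketch assumes the simple case where each query has a single known relevant chunk id:

```typescript
// Retrieval metrics over a labeled test set: each entry records what
// the retriever returned and which chunk is known to be relevant.
type Labeled = { retrieved: string[]; relevantId: string };

// Fraction of queries whose relevant chunk appears in the top k results.
function recallAtK(results: Labeled[], k: number): number {
  const hits = results.filter((r) =>
    r.retrieved.slice(0, k).includes(r.relevantId)
  );
  return hits.length / results.length;
}

// Mean reciprocal rank: rewards putting the relevant chunk near the top.
function meanReciprocalRank(results: Labeled[]): number {
  const rr = results.map((r) => {
    const rank = r.retrieved.indexOf(r.relevantId);
    return rank === -1 ? 0 : 1 / (rank + 1);
  });
  return rr.reduce((a, b) => a + b, 0) / rr.length;
}
```

Run these before and after each retrieval change; a reranker that improves MRR while Recall@20 holds steady is doing exactly its job.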
## A practical design worksheet
Use this before changing your RAG stack:
- Primary query types:
- Corpus types:
- Exact-match terms that matter:
- Metadata filters available:
- Lexical candidate count:
- Dense candidate count:
- Reranked topN:
- Final chunks sent to generation:
- Primary retrieval metric:
- Guardrail metric:
This gives you something testable. "We improved RAG quality" is too vague. "Recall@10 improved from 0.62 to 0.79 while prompt tokens dropped 18%" is a real engineering result.
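One way to make the worksheet operational is to capture it as a typed config checked into the repo, so retrieval settings are reviewable in one place instead of scattered across call sites. Field names and defaults here are illustrative assumptions:

```typescript
// Retrieval funnel settings as a single reviewable config object.
type RetrievalConfig = {
  lexicalTopK: number; // BM25 candidates gathered
  denseTopK: number; // dense candidates gathered
  rerankTopN: number; // survivors after reranking
  maxChunksToGenerator: number; // hard cap on prompt evidence
  primaryMetric: "recall@10" | "mrr" | "ndcg";
  guardrailMetric: "prompt_tokens" | "latency_p95";
};

const defaultConfig: RetrievalConfig = {
  lexicalTopK: 20,
  denseTopK: 20,
  rerankTopN: 6,
  maxChunksToGenerator: 6,
  primaryMetric: "recall@10",
  guardrailMetric: "prompt_tokens",
};
```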
## The trade-off most teams miss
A better retrieval funnel often looks more complex on paper while being cheaper in production.
Why:
- fewer irrelevant chunks reach the model
- fewer users retry because the first answer was weak
- fewer support escalations happen after grounded answers fail
The cheapest RAG system is rarely the one with the fewest components. It is the one that wastes the fewest tokens and the fewest user turns.
## How StackSpend helps
Retrieval upgrades change spend shape across embeddings, rerankers, and generation. The Data Explorer lets you filter by provider and service to compare token volume before and after a reranker launch, see which specific workflow is driving embedding cost growth, and check whether a retrieval architecture change lowered cost per successful answer or only added one more billable step. The Monitoring view surfaces spending anomalies — useful for catching unexpected spikes when a reranker adds extra API calls you did not anticipate in load testing.
## What to do next

## FAQ
### Should I always use hybrid retrieval?
No. If your corpus is tiny, your queries are uniform, or exact-match terms barely matter, dense retrieval alone may be enough. Hybrid retrieval is most useful when both semantic similarity and exact language matter.
### How many chunks should I send to the generator?
Usually fewer than you think. Start by reranking down to a small set that really supports the answer. Sending more chunks often increases contradiction and token waste.
### Do I need a reranker if my vector search looks good?
Maybe not at first. But once your candidate set grows noisy, reranking is often the most efficient way to improve prompt precision without changing the generation model.
### When should I rewrite the query?
Rewrite only when the input is vague, conversational, or mismatched to corpus language. Do not rewrite every query automatically.
### Is Anthropic's contextual retrieval paper something I should copy exactly?
No. The useful lesson is architectural, not literal: enrich weak chunks, mix lexical and semantic retrieval, and rerank before generation when the candidate set is noisy.