Guides
March 6, 2026
By Andrew Day

Embeddings vs full context cost efficiency

Decide when to retrieve with embeddings and when to stuff more content into the prompt. Embeddings plus retrieval has a different cost shape than long-context prompting; this guide shows which wins for your workload.


Use this when you're deciding whether to retrieve with embeddings or stuff more content into the prompt.

The fast answer: for small knowledge bases (<200K tokens), full-context with prompt caching often wins. For larger corpora or frequently changing data, retrieval with embeddings scales better.

What you will get in 10 minutes

  • Cost shape of embeddings + retrieval
  • Cost shape of long-context prompting
  • Decision rules by corpus size and update cadence
  • A checklist to pick the right approach for your use case

Use this when

  • You're building RAG or document-Q&A and unsure whether to retrieve or prompt with more context
  • Long-context models are available and you're tempted to skip retrieval
  • Retrieval or inference costs are growing and you want to understand the tradeoff
  • You're choosing between embedding providers and wondering if retrieval is worth the infra

The 60-second answer

| Your situation | Prefer |
|---|---|
| Corpus under ~200K tokens (~500 pages) | Full-context + prompt caching |
| Corpus over ~200K tokens | Embeddings + retrieval (RAG) |
| Same documents reused across many requests | Full-context (caching pays off fast) |
| Documents change weekly or daily | Retrieval (re-embedding is cheaper than re-caching) |
| Queries need only a small slice of the corpus | Retrieval (targeted context beats a full dump) |
| Budget tight, need to ship in days | Full-context first, if the corpus fits |

Anthropic recommends full-context with caching for knowledge bases under 200K tokens before building retrieval. See long-context AI pricing above 200K for provider thresholds.

Cost shape: Embeddings + retrieval

Costs come from embedding, retrieval infra, and inference with retrieved chunks.

Embedding cost

  • One-time per document change (or per ingestion run)
  • Voyage AI: ~$0.02–$0.06 per million tokens (pricing); 200M free tokens/month
  • OpenAI text-embedding-3-small/large: per-token pricing
  • Contextual retrieval (add context before embedding): ~$1.02 per million document tokens one-time with Claude Haiku + caching (Anthropic)
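To make the shape concrete, here's a minimal sketch of the one-time embedding cost, using the ballpark rates above as assumed defaults (the function and its parameters are illustrative, not any provider's API):

```python
# Rough one-time embedding cost. Prices are illustrative assumptions
# drawn from the ballparks above, not current provider rate cards.

def embedding_cost(corpus_tokens: int,
                   embed_price_per_mtok: float = 0.06,
                   contextualize: bool = False,
                   context_price_per_mtok: float = 1.02) -> float:
    """One-time cost (USD) to embed a corpus of `corpus_tokens` tokens."""
    cost = corpus_tokens / 1e6 * embed_price_per_mtok
    if contextualize:  # optional contextual-retrieval preprocessing pass
        cost += corpus_tokens / 1e6 * context_price_per_mtok
    return cost

# 10M-token corpus, plain embedding: 10 x $0.06
print(round(embedding_cost(10_000_000), 2))                      # 0.6
# With contextual retrieval: 10 x ($0.06 + $1.02)
print(round(embedding_cost(10_000_000, contextualize=True), 2))  # 10.8
```

The takeaway: even a 10M-token corpus embeds for well under the cost of a handful of full-context requests, which is why embedding cost is rarely the deciding factor.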

Retrieval infra

  • Vector DB (Pinecone, Weaviate, pgvector, etc.): roughly $350–$2,850/month managed, or engineering cost if self-hosted
  • Reranking (Cohere, Voyage): adds ~$0.02–$0.05 per million tokens

Inference

  • Per request: system prompt + retrieved chunks (typically 2–5K tokens) + user query + output
  • Retrieval keeps input smaller than full-context because you only send the top-K chunks
  • Example: 3K retrieved tokens at Claude Haiku 4.5 ($1/MTok input) ≈ $0.003 per request in context cost
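The per-request arithmetic above can be written out directly. This is a sketch; `overhead_tokens` is an assumed allowance for the system prompt and user query, not a figure from any provider:

```python
# Per-request input cost with retrieval: only the top-K retrieved chunks
# plus prompt overhead are sent. Prices are illustrative assumptions.

def retrieval_request_cost(retrieved_tokens: int = 3_000,
                           overhead_tokens: int = 500,  # assumed system prompt + query
                           input_price_per_mtok: float = 1.0) -> float:
    """Input-token cost (USD) of one retrieval-augmented request."""
    return (retrieved_tokens + overhead_tokens) / 1e6 * input_price_per_mtok

print(round(retrieval_request_cost(), 4))  # 0.0035 -- matches the ~$0.003 example above
```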

When retrieval wins: Large corpus, selective queries (only a slice needed), documents change often, or you need to scale beyond what fits in context.

Cost shape: Full-context prompting

Costs come from input tokens (and output). Prompt caching changes the math.

Without caching

  • You pay full input price for the entire corpus on every request
  • 200K tokens at Claude Sonnet 4.6 ($3/MTok) ≈ $0.60 per request
  • At scale this dominates; retrieval sends only 2–5K tokens per request

With prompt caching

  • Cache write: 1.25x–2x base input price (one-time per cache window)
  • Cache read: ~10% of base input price (Anthropic pricing)
  • After one or two cache hits, cached content is much cheaper than re-processing
  • Same corpus reused across many requests → caching pays off quickly
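A quick way to see when caching pays off is to compare cumulative input cost with and without it. This sketch assumes Anthropic-style multipliers (1.25x write, 0.1x read) and that every request lands within the cache window; check current provider pricing before relying on the numbers:

```python
# Cumulative input cost of N requests over the same corpus, with and
# without prompt caching. Multipliers are assumed, Anthropic-style.

def full_context_cost(requests: int, corpus_tokens: int = 200_000,
                      price_per_mtok: float = 3.0,
                      write_mult: float = 1.25, read_mult: float = 0.10,
                      cached: bool = True) -> float:
    per_req = corpus_tokens / 1e6 * price_per_mtok  # $0.60 uncached per request
    if not cached:
        return requests * per_req
    # First request writes the cache at a premium; the rest read it cheaply.
    return per_req * write_mult + (requests - 1) * per_req * read_mult

print(round(full_context_cost(100, cached=False), 2))  # 60.0
print(round(full_context_cost(100, cached=True), 2))   # 6.69
```

Under these assumptions, 100 requests over a 200K-token corpus cost roughly 9x less with caching, and the premium on the first cache write is recovered by the second or third request.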

When full-context wins: Corpus fits in context (<200K tokens), same content reused often, and you can use a provider with caching (Anthropic, others).

Decision rules

Rule 1: Corpus size

| Corpus size | Recommendation |
|---|---|
| < 100K tokens | Full-context + caching. Skip retrieval. |
| 100K–200K tokens | Try full-context first. If latency or cost is high, move to retrieval. |
| > 200K tokens | Retrieval. Full-context will hit higher pricing tiers and/or exceed the context window. |

Rule 2: Cache reuse

  • High reuse (same docs, many similar queries) → full-context + caching
  • Low reuse (each query touches different docs) → retrieval
  • Mixed → consider hybrid: cache a shared prefix, retrieve the rest

Rule 3: Update cadence

  • Documents change rarely → full-context is fine; re-cache on change
  • Documents change weekly/daily → retrieval; re-embed is cheaper than re-cache at scale
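Back-of-envelope, the asymmetry looks like this (prices are the illustrative figures used earlier in this guide):

```python
# Cost of one document update under each approach (illustrative prices).
# Retrieval re-embeds only the changed docs; full-context must rewrite
# the whole cached corpus, paying the cache-write multiplier again.

def update_cost_retrieval(changed_tokens: int, embed_price: float = 0.06) -> float:
    return changed_tokens / 1e6 * embed_price

def update_cost_full_context(corpus_tokens: int, input_price: float = 3.0,
                             write_mult: float = 1.25) -> float:
    return corpus_tokens / 1e6 * input_price * write_mult

# 10K tokens changed in a 200K-token corpus:
print(round(update_cost_retrieval(10_000), 4))      # 0.0006 -- re-embed the delta
print(round(update_cost_full_context(200_000), 2))  # 0.75   -- rewrite the whole cache
```

The per-change cost of retrieval scales with the size of the delta; the per-change cost of full-context scales with the size of the whole corpus, which is why frequent updates tilt the math toward retrieval.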

Rule 4: Query selectivity

  • Queries need a small slice of the corpus → retrieval wins (targeted 2–5K vs full 200K)
  • Queries often need most of the corpus → full-context can work if it fits
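The four rules above can be condensed into a single heuristic. The thresholds mirror the guide; the function and its return labels are illustrative, and real decisions should also weigh latency and infra cost:

```python
# Rules 1-4 condensed into one decision sketch. Treat as a heuristic,
# not a definitive policy.

def choose_approach(corpus_tokens: int,
                    high_cache_reuse: bool,
                    frequent_updates: bool,
                    selective_queries: bool) -> str:
    if corpus_tokens > 200_000:
        return "retrieval"                    # Rule 1: won't fit / higher pricing tiers
    if frequent_updates:
        return "retrieval"                    # Rule 3: re-embedding beats re-caching
    if corpus_tokens < 100_000 and high_cache_reuse:
        return "full-context + caching"       # Rules 1 + 2
    if selective_queries and not high_cache_reuse:
        return "retrieval"                    # Rules 2 + 4
    return "try full-context first, fall back to retrieval"

print(choose_approach(80_000, True, False, False))    # full-context + caching
print(choose_approach(5_000_000, True, False, True))  # retrieval
```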

Checklist by use case

| Use case | Full-context | Retrieval |
|---|---|---|
| Internal wiki Q&A, <100 docs | ✓ | |
| Customer support KB, 500+ articles | | ✓ |
| Codebase search, 10K+ files | | ✓ |
| Legal/contract search, 1000s of docs | | ✓ |
| Chat over a few PDFs, same each session | ✓ | |
| Product docs, updated weekly | | ✓ |

How to see which is costing you

Category-level analysis helps. If retrieval is a big cost driver, you'll see spend on embedding APIs and possibly a separate infra line. If inference dominates, check whether long context or retrieved context is the cause. AI cost monitoring surfaces spend by provider, model, and category so you can see whether embeddings, inference, or infra is moving the needle.

What to do next


Know where your cloud and AI spend stands — every day.

Connect providers in minutes. Get 90 days of visibility and start receiving daily cost updates before the invoice lands.

14-day free trial. No credit card required. Plans from $19/month.