Guides
March 6, 2026
By Andrew Day

Embeddings vs full context cost efficiency

Decide when to retrieve with embeddings and when to stuff more content into the prompt. Embeddings plus retrieval has a different cost shape than long-context prompting; this guide shows which wins for your workload.


Use this when you're deciding whether to retrieve with embeddings or stuff more content into the prompt.

The fast answer: for small knowledge bases (<200K tokens), full-context with prompt caching often wins. For larger corpora or frequently changing data, retrieval with embeddings scales better.

What you will get in 10 minutes

  • Cost shape of embeddings + retrieval
  • Cost shape of long-context prompting
  • Decision rules by corpus size and update cadence
  • A checklist to pick the right approach for your use case

Use this when

  • You're building RAG or document-Q&A and unsure whether to retrieve or prompt with more context
  • Long-context models are available and you're tempted to skip retrieval
  • Retrieval or inference costs are growing and you want to understand the tradeoff
  • You're choosing between embedding providers and wondering if retrieval is worth the infra

The 60-second answer

| Your situation | Prefer |
|---|---|
| Corpus under ~200K tokens (~500 pages) | Full-context + prompt caching |
| Corpus over ~200K tokens | Embeddings + retrieval (RAG) |
| Same documents reused across many requests | Full-context (caching pays off fast) |
| Documents change weekly or daily | Retrieval (re-embedding is cheaper than re-caching) |
| Queries need only a small slice of the corpus | Retrieval (targeted context beats a full dump) |
| Budget tight, need to ship in days | Full-context first, if the corpus fits |

Anthropic recommends full-context with caching for knowledge bases under 200K tokens before building retrieval. See long-context AI pricing above 200K for provider thresholds.

Cost shape: Embeddings + retrieval

Costs come from embedding, retrieval infra, and inference with retrieved chunks.

Embedding cost

  • One-time per document change (or per ingestion run)
  • Voyage AI: ~$0.02–$0.06 per million tokens (pricing); 200M free tokens/month
  • OpenAI text-embedding-3-small/large: per-token pricing
  • Contextual retrieval (add context before embedding): ~$1.02 per million document tokens one-time with Claude Haiku + caching (Anthropic)
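To make the shape concrete, here's a minimal sketch of the one-time embedding cost, using the ballpark rates above as assumed defaults (the function and its parameters are illustrative, not any provider's API):

```python
# Rough one-time embedding cost. Prices are illustrative assumptions
# drawn from the ballparks above, not current provider rate cards.

def embedding_cost(corpus_tokens: int,
                   embed_price_per_mtok: float = 0.06,
                   contextualize: bool = False,
                   context_price_per_mtok: float = 1.02) -> float:
    """One-time cost (USD) to embed a corpus of `corpus_tokens` tokens."""
    cost = corpus_tokens / 1e6 * embed_price_per_mtok
    if contextualize:  # optional contextual-retrieval preprocessing pass
        cost += corpus_tokens / 1e6 * context_price_per_mtok
    return cost

# 10M-token corpus, plain embedding: 10 x $0.06
print(round(embedding_cost(10_000_000), 2))                      # 0.6
# With contextual retrieval: 10 x ($0.06 + $1.02)
print(round(embedding_cost(10_000_000, contextualize=True), 2))  # 10.8
```

The takeaway: even a 10M-token corpus embeds for well under the cost of a handful of full-context requests, which is why embedding cost is rarely the deciding factor.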

Retrieval infra

  • Vector DB (Pinecone, Weaviate, pgvector, etc.): roughly $350–$2,850/month managed, or engineering cost if self-hosted
  • Reranking (Cohere, Voyage): adds ~$0.02–$0.05 per million tokens

Inference

  • Per request: system prompt + retrieved chunks (typically 2–5K tokens) + user query + output
  • Retrieval keeps input smaller than full-context because you only send the top-K chunks
  • Example: 3K retrieved tokens at Claude Haiku 4.5 ($1/MTok input) ≈ $0.003 per request in context cost
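The per-request arithmetic above can be written out directly. This is a sketch; `overhead_tokens` is an assumed allowance for the system prompt and user query, not a figure from any provider:

```python
# Per-request input cost with retrieval: only the top-K retrieved chunks
# plus prompt overhead are sent. Prices are illustrative assumptions.

def retrieval_request_cost(retrieved_tokens: int = 3_000,
                           overhead_tokens: int = 500,  # assumed system prompt + query
                           input_price_per_mtok: float = 1.0) -> float:
    """Input-token cost (USD) of one retrieval-augmented request."""
    return (retrieved_tokens + overhead_tokens) / 1e6 * input_price_per_mtok

print(round(retrieval_request_cost(), 4))  # 0.0035 -- matches the ~$0.003 example above
```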

When retrieval wins: Large corpus, selective queries (only a slice needed), documents change often, or you need to scale beyond what fits in context.

Cost shape: Full-context prompting

Costs come from input tokens (and output). Prompt caching changes the math.

Without caching

  • You pay full input price for the entire corpus on every request
  • 200K tokens at Claude Sonnet 4.6 ($3/MTok) ≈ $0.60 per request
  • At scale this dominates; retrieval sends only 2–5K tokens per request

With prompt caching

  • Cache write: 1.25x–2x base input price (one-time per cache window)
  • Cache read: ~10% of base input price (Anthropic pricing)
  • After one or two cache hits, cached content is much cheaper than re-processing
  • Same corpus reused across many requests → caching pays off quickly
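A quick way to see when caching pays off is to compare cumulative input cost with and without it. This sketch assumes Anthropic-style multipliers (1.25x write, 0.1x read) and that every request lands within the cache window; check current provider pricing before relying on the numbers:

```python
# Cumulative input cost of N requests over the same corpus, with and
# without prompt caching. Multipliers are assumed, Anthropic-style.

def full_context_cost(requests: int, corpus_tokens: int = 200_000,
                      price_per_mtok: float = 3.0,
                      write_mult: float = 1.25, read_mult: float = 0.10,
                      cached: bool = True) -> float:
    per_req = corpus_tokens / 1e6 * price_per_mtok  # $0.60 uncached per request
    if not cached:
        return requests * per_req
    # First request writes the cache at a premium; the rest read it cheaply.
    return per_req * write_mult + (requests - 1) * per_req * read_mult

print(round(full_context_cost(100, cached=False), 2))  # 60.0
print(round(full_context_cost(100, cached=True), 2))   # 6.69
```

Under these assumptions, 100 requests over a 200K-token corpus cost roughly 9x less with caching, and the premium on the first cache write is recovered by the second or third request.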

When full-context wins: Corpus fits in context (<200K tokens), same content reused often, and you can use a provider with caching (Anthropic, others).

Decision rules

Rule 1: Corpus size

| Corpus size | Recommendation |
|---|---|
| < 100K tokens | Full-context + caching. Skip retrieval. |
| 100K–200K tokens | Try full-context first. If latency or cost is high, move to retrieval. |
| > 200K tokens | Retrieval. Full-context will hit higher pricing tiers and/or exceed the context window. |

Rule 2: Cache reuse

  • High reuse (same docs, many similar queries) → full-context + caching
  • Low reuse (each query touches different docs) → retrieval
  • Mixed → consider hybrid: cache a shared prefix, retrieve the rest

Rule 3: Update cadence

  • Documents change rarely → full-context is fine; re-cache on change
  • Documents change weekly/daily → retrieval; re-embed is cheaper than re-cache at scale
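Back-of-envelope, the asymmetry looks like this (prices are the illustrative figures used earlier in this guide):

```python
# Cost of one document update under each approach (illustrative prices).
# Retrieval re-embeds only the changed docs; full-context must rewrite
# the whole cached corpus, paying the cache-write multiplier again.

def update_cost_retrieval(changed_tokens: int, embed_price: float = 0.06) -> float:
    return changed_tokens / 1e6 * embed_price

def update_cost_full_context(corpus_tokens: int, input_price: float = 3.0,
                             write_mult: float = 1.25) -> float:
    return corpus_tokens / 1e6 * input_price * write_mult

# 10K tokens changed in a 200K-token corpus:
print(round(update_cost_retrieval(10_000), 4))      # 0.0006 -- re-embed the delta
print(round(update_cost_full_context(200_000), 2))  # 0.75   -- rewrite the whole cache
```

The per-change cost of retrieval scales with the size of the delta; the per-change cost of full-context scales with the size of the whole corpus, which is why frequent updates tilt the math toward retrieval.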

Rule 4: Query selectivity

  • Queries need a small slice of the corpus → retrieval wins (targeted 2–5K vs full 200K)
  • Queries often need most of the corpus → full-context can work if it fits
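The four rules above can be condensed into a single heuristic. The thresholds mirror the guide; the function and its return labels are illustrative, and real decisions should also weigh latency and infra cost:

```python
# Rules 1-4 condensed into one decision sketch. Treat as a heuristic,
# not a definitive policy.

def choose_approach(corpus_tokens: int,
                    high_cache_reuse: bool,
                    frequent_updates: bool,
                    selective_queries: bool) -> str:
    if corpus_tokens > 200_000:
        return "retrieval"                    # Rule 1: won't fit / higher pricing tiers
    if frequent_updates:
        return "retrieval"                    # Rule 3: re-embedding beats re-caching
    if corpus_tokens < 100_000 and high_cache_reuse:
        return "full-context + caching"       # Rules 1 + 2
    if selective_queries and not high_cache_reuse:
        return "retrieval"                    # Rules 2 + 4
    return "try full-context first, fall back to retrieval"

print(choose_approach(80_000, True, False, False))    # full-context + caching
print(choose_approach(5_000_000, True, False, True))  # retrieval
```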

Checklist by use case

| Use case | Full-context | Retrieval |
|---|---|---|
| Internal wiki Q&A, <100 docs | ✓ | |
| Customer support KB, 500+ articles | | ✓ |
| Codebase search, 10K+ files | | ✓ |
| Legal/contract search, 1000s of docs | | ✓ |
| Chat over a few PDFs, same each session | ✓ | |
| Product docs, updated weekly | | ✓ |

How to see which is costing you

Category-level analysis helps. If retrieval is a big cost driver, you'll see spend on embedding APIs and possibly a separate infra line. If inference dominates, check whether long context or retrieved context is the cause. AI cost monitoring surfaces spend by provider, model, and category so you can see whether embeddings, inference, or infra is moving the needle.

What to do next


Know where your cloud and AI spend stands — every day.

Connect providers in minutes. Get 90 days of visibility and start receiving daily cost updates before the invoice lands.

14-day free trial. No credit card required. Plans from $19/month.