RAG in 2026: what actually changed

Apr 4, 2026

For two years the loudest prediction about retrieval-augmented generation was that long-context models would make it obsolete. Drop the whole corpus into a 2M-token window and skip the embedding pipeline. The reality in production is more interesting and a lot less tidy.

What long context actually solved

Long context did kill one specific RAG use case: the small, stable knowledge base. If your entire reference material is under ~400k tokens and changes monthly, you can stop pretending you need a vector store. Paste it into the system prompt, set a long cache TTL, and move on. Costs are fine because the prompt cache hits on every call.

What it didn't solve:

  • Hot, high-churn corpora. Documentation that changes hourly invalidates the prompt cache. You're now paying full input cost every request.
  • Multi-tenant isolation. You cannot put customer A's documents in customer B's context window. Retrieval enforces the boundary cleanly.
  • Citations. A model fed a 1.5M-token soup will hallucinate citations faster than you can audit them. Retrieval gives you the source IDs back.
  • Cost at scale. Even with 90% cache discounts, sending the full corpus on every request is roughly 5–8× the cost of retrieving 10 chunks.
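The cost bullet is easy to sanity-check with back-of-envelope arithmetic. The sketch below compares input-token cost only; the 600k-token corpus and 1,000-token chunks are illustrative assumptions rather than measurements, and it ignores output tokens, embedding, and reranking costs.

```python
def cost_ratio(corpus_tokens, chunk_tokens, n_chunks, cache_discount, query_tokens=0):
    """Input-token cost of full-corpus prompting relative to retrieval.

    Assumes the whole corpus is a cache hit billed at (1 - cache_discount)
    of the normal input price, and that retrieved chunks are billed at
    the full input price.
    """
    long_context = corpus_tokens * (1 - cache_discount) + query_tokens
    retrieval = n_chunks * chunk_tokens + query_tokens
    return long_context / retrieval


# Illustrative numbers only (assumptions, not benchmarks).
ratio = cost_ratio(
    corpus_tokens=600_000,
    chunk_tokens=1_000,
    n_chunks=10,
    cache_discount=0.9,
)
print(f"full corpus costs about {ratio:.1f}x retrieval")
```

Under these assumptions the ratio scales linearly with corpus size, so the gap widens as the corpus grows and shrinks toward parity for small knowledge bases.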

The current pattern

The shape that works in 2026 is hybrid retrieval, a cross-encoder rerank, then generation against a small context. Concretely:

  1. Embed once at ingest with a strong embedding model (Voyage-3 if a hosted API is acceptable; BGE-M3 or nomic-embed-v2 if you need open weights or on-prem). Store dense vectors alongside a BM25 index of the same chunks.
  2. Query with both. Dense retrieval catches semantic matches; BM25 catches exact-phrase and acronym matches. Merge the two result lists with reciprocal rank fusion (k=60 is the boring, correct default).
  3. Rerank the top 40 → 8 with a cross-encoder. Cohere Rerank v3 and the open bge-reranker-v2-m3 both work; the cross-encoder sees the query and chunk together, so it can promote relevant chunks that both retrievers ranked too low. It cannot recover a chunk that never made the top 40.
  4. Generate against the 8 chunks, with explicit citation tags so the model is forced to say which chunk it used.
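Steps 2 and 4 above are small enough to sketch without any model dependencies. The following is a minimal illustration, not a reference implementation: the function names, the `<chunk id="...">` tag format, and the `[chunk_id]` citation syntax are all hypothetical choices made for the example. The rerank step is omitted because it needs a cross-encoder model.

```python
import re
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into one list.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


def build_prompt(question, chunks):
    """chunks is a list of (chunk_id, text); each is wrapped in an ID tag."""
    context = "\n".join(f'<chunk id="{cid}">{text}</chunk>' for cid, text in chunks)
    return (
        "Answer using only the chunks below. After each claim, cite the "
        "supporting chunk as [chunk_id].\n\n"
        f"{context}\n\nQuestion: {question}"
    )


def unknown_citations(answer, chunks):
    """Return cited IDs that were not in the retrieved set."""
    allowed = {cid for cid, _ in chunks}
    cited = set(re.findall(r"\[([A-Za-z0-9_-]+)\]", answer))
    return sorted(cited - allowed)


dense_hits = ["c3", "c1", "c7"]   # hypothetical dense-retrieval order
bm25_hits = ["c1", "c9", "c3"]    # hypothetical BM25 order
fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
print(fused)
```

A chunk that appears in both lists outranks one that tops a single list, which is the behaviour you want from fusion. Checking the answer with `unknown_citations` is a cheap guard: any cited ID outside the retrieved set is a fabricated citation.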

Chunking is still the largest lever

Every team I've audited in the last year had chunking wrong before they had anything else wrong. The pattern that holds up:

| Document type | Chunk strategy | Why |
| --- | --- | --- |
| Long-form prose (docs, wikis) | 800–1200 tokens, 100-token overlap, split on h2/h3 boundaries | Preserves enough context for the embedder to capture topic |
| Code | Function-scoped, never split mid-function | Embeddings degrade fast on partial syntax |
| Tables | One chunk per row + one chunk per table header | Lets the model find both "the row" and "what the columns mean" |
| Transcripts | Speaker-turn boundaries, capped at 600 tokens | Mid-sentence splits destroy retrieval |

Naive fixed-size splitting at 512 tokens is the failure mode that explains 80% of "our RAG is bad" complaints.
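For the long-form prose case, a heading-aware chunker might look like the sketch below. It assumes Markdown-style `##`/`###` headings and approximates tokens by whitespace-separated words; a real pipeline would count with the embedder's own tokenizer. The function names and defaults are illustrative.

```python
import re


def split_sections(markdown_text):
    """Split on h2/h3 headings, keeping each heading with its body."""
    parts = re.split(r"(?m)^(?=#{2,3} )", markdown_text)
    return [p.strip() for p in parts if p.strip()]


def chunk_section(section, max_tokens=1000, overlap=100):
    """Window one section into chunks of at most max_tokens words.

    Consecutive chunks share `overlap` words. Whitespace is normalised,
    so line breaks inside the section are not preserved.
    """
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be >= 0 and smaller than max_tokens")
    words = section.split()
    if len(words) <= max_tokens:
        return [" ".join(words)]
    chunks, start = [], 0
    step = max_tokens - overlap
    while True:
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break  # this window reached the end of the section
        start += step
    return chunks


def chunk_document(markdown_text, max_tokens=1000, overlap=100):
    """Chunk every section; no chunk crosses an h2/h3 boundary."""
    chunks = []
    for section in split_sections(markdown_text):
        chunks.extend(chunk_section(section, max_tokens, overlap))
    return chunks
```

The difference from fixed-size splitting is that a window never straddles two sections, so each chunk stays on one topic, and short sections are kept whole instead of being merged with their neighbours.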

Where embeddings still beat long context

Three places where the advantage is durable:

  • Permission-scoped search, where each user sees a different document set. With long context you'd need a separate cached prompt per user, which destroys the cache economics.
  • Time-decaying corpora — news, ticket queues, monitoring alerts — where you want the model to weight recent over old. Vector stores let you bias by recency cheaply.
  • Anything you need to audit. Regulated industries don't accept "the model read everything and decided." They want the retrieved chunk IDs in a log line.
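All three points can be shown in one small post-retrieval step. The sketch below is hypothetical throughout: the `Hit` fields, the 30-day half-life, and the log format are assumptions for illustration. In a real deployment the tenant filter would normally be pushed into the vector store query as a metadata filter instead of being applied afterwards.

```python
import json
from dataclasses import dataclass


@dataclass
class Hit:
    chunk_id: str
    tenant_id: str
    score: float       # similarity from the retriever, higher is better
    updated_at: float  # Unix timestamp of the source document


def rank_for_tenant(hits, tenant_id, now, half_life_days=30.0, top_k=8):
    """Keep one tenant's hits, then down-weight old chunks exponentially."""
    half_life_s = half_life_days * 86_400
    scored = []
    for hit in hits:
        if hit.tenant_id != tenant_id:
            continue  # isolation boundary: other tenants' chunks are never scored
        age_s = max(0.0, now - hit.updated_at)
        decay = 0.5 ** (age_s / half_life_s)  # halves every half_life_days
        scored.append((hit.score * decay, hit))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [hit for _, hit in scored[:top_k]]


def audit_line(request_id, tenant_id, hits):
    """One JSON log line recording exactly which chunks reached the model."""
    return json.dumps({
        "request_id": request_id,
        "tenant_id": tenant_id,
        "chunk_ids": [hit.chunk_id for hit in hits],
    })


now = 1_770_000_000.0  # fixed timestamp so the example is reproducible
day = 86_400
hits = [
    Hit("a1", "tenant-a", 0.90, now - 60 * day),  # strong match, two months old
    Hit("a2", "tenant-a", 0.50, now),             # weaker match, fresh
    Hit("b1", "tenant-b", 0.99, now),             # another tenant's document
]
selected = rank_for_tenant(hits, "tenant-a", now)
print(audit_line("req-001", "tenant-a", selected))
```

With a 30-day half-life the two-month-old hit is weighted at a quarter of its raw score, so the fresh chunk ranks first, and the other tenant's chunk never appears in the log line.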

The honest takeaway

If you're building a chatbot over a 50k-token product manual, skip RAG and use long context. If you're building anything with multi-tenant data, churn, citations, or scale, retrieval still wins — but the retrieval pipeline is no longer "embed and pray." Hybrid retrieval and a cross-encoder rerank are the table stakes, and chunking is the lever that decides whether any of it works.