Introduction
Retrieval-Augmented Generation (RAG) is the default pattern for grounding LLM outputs in your proprietary data. It is also the pattern teams most often over-apply — reaching for vector search when a simple prompt with the full document would have worked better. This article walks through when RAG is the right answer, when it is not, and how to build one that survives past the first demo.
When RAG genuinely wins
RAG wins when your corpus is large enough that you cannot fit everything relevant into a context window (roughly: more than 20–40 pages of text per query), when your content changes frequently enough that fine-tuning would go stale, when you need citations back to source documents, and when different users should see different subsets of the corpus.
Typical winning use cases: internal knowledge bases, customer support with evolving product docs, policy compliance Q&A, legal and medical reference tools, content personalized to user permissions.
- Corpus too large for context window (>20–40 pages per query)
- Content updates frequently; fine-tune would go stale
- Citations back to source are required
- Per-user or per-tenant permission boundaries on content
When RAG is the wrong choice
If your entire corpus fits in a context window (Gemini 1.5 Pro accepts 2M tokens — roughly 3,000 pages), just include the whole corpus. 'Context stuffing' is often both cheaper and more accurate than RAG for small corpora.
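To make the decision concrete, here is a minimal sketch of the stuff-vs-retrieve check, assuming tiktoken for token counting; the context budget, delimiter, and prompt framing are illustrative assumptions, not any model's spec.

```python
# Minimal stuff-vs-retrieve check. Assumes tiktoken; the budget constant
# and prompt framing are illustrative, not a model spec.
import tiktoken

ENC = tiktoken.get_encoding("cl100k_base")
CONTEXT_BUDGET = 200_000  # illustrative: leave headroom under the model's limit

def build_stuffed_prompt(question: str, corpus: list[str]) -> str | None:
    """Return a context-stuffed prompt if the corpus fits, else None (use RAG)."""
    corpus_text = "\n\n---\n\n".join(corpus)
    if len(ENC.encode(corpus_text)) > CONTEXT_BUDGET:
        return None  # corpus too large: retrieval is warranted
    return f"Answer using only the documents below.\n\n{corpus_text}\n\nQuestion: {question}"
```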
If your problem is structured query ('how many customers upgraded last month'), use SQL, not RAG. RAG over structured data is an antipattern that produces slow, inaccurate answers where a JOIN would have been correct and instant.
If your problem is tone, style, or domain jargon calibration, prefer few-shot prompting or fine-tuning over RAG. RAG retrieves facts; it does not teach the model how to sound.
- Small corpus → just stuff the context; do not retrieve
- Structured query → SQL, not vector search
- Tone/style calibration → few-shot or fine-tune
Chunking strategy — the 70% of RAG quality
RAG quality is mostly chunking quality. The naive 'split every 1,000 characters' approach produces fragments that lose context. Modern chunking respects document structure: section headers, paragraph boundaries, table integrity, code block atomicity.
Practical defaults that hold up: semantic chunking (split at paragraph boundaries with a 150–250 token overlap), store parent document context with every chunk (doc title, section path, a short summary), and maintain a document-level index so the model can escalate to 'give me more of this document' when a chunk is clearly relevant but insufficient. A minimal sketch of these defaults follows the list below.
- Semantic chunking beats character-based splitting every time
- 150–250 token overlap preserves continuity
- Store parent context (title, section, summary) with every chunk
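A minimal sketch, assuming tiktoken for token counting; the chunk size, field names, and overlap mechanics are illustrative starting points, not a library API.

```python
# Paragraph-boundary chunking with token overlap and parent context.
# A sketch: sizes and field names are illustrative.
import tiktoken

ENC = tiktoken.get_encoding("cl100k_base")

def chunk_document(text: str, doc_title: str, section_path: str,
                   max_tokens: int = 512, overlap_tokens: int = 200) -> list[dict]:
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[dict] = []
    current: list[str] = []
    token_count = 0

    def flush() -> None:
        if current:
            chunks.append({
                "text": "\n\n".join(current),
                "doc_title": doc_title,       # parent context travels with the chunk
                "section_path": section_path,
            })

    for para in paragraphs:
        para_tokens = len(ENC.encode(para))
        if current and token_count + para_tokens > max_tokens:
            flush()
            # Carry the tail of the previous chunk forward as overlap.
            tail_ids = ENC.encode("\n\n".join(current))[-overlap_tokens:]
            current = [ENC.decode(tail_ids)]
            token_count = len(tail_ids)
        current.append(para)
        token_count += para_tokens
    flush()
    return chunks
```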
Retrieval — hybrid beats pure vector
Pure vector similarity search has known quality gaps on exact terms, numbers, and acronyms. In production, hybrid retrieval (vector + keyword/BM25, combined with reciprocal rank fusion) consistently outperforms either alone; the fusion step is sketched after the list below.
Re-ranking with a cross-encoder (Cohere Rerank, Voyage Rerank) after the initial retrieval further improves precision at minimal latency cost. A typical pipeline: retrieve top 50 with hybrid search → re-rank to top 8 with cross-encoder → send to LLM.
- Hybrid retrieval (vector + keyword, reciprocal rank fusion) is the default
- Re-rank with cross-encoder (Cohere Rerank, Voyage) for precision
- Pipeline: hybrid top-50 → rerank top-8 → LLM
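Reciprocal rank fusion itself is only a few lines. The sketch below fuses ranked id lists from the vector and keyword retrievers; `cross_encoder_rerank` in the usage comment is a hypothetical stand-in for whichever re-ranker you call.

```python
# Reciprocal rank fusion over ranked result lists (e.g. vector top-k and
# BM25 top-k). k=60 is the constant used in the original RRF paper.
from collections import defaultdict

def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of document ids; highest fused score first."""
    scores: dict[str, float] = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Typical pipeline shape: fuse hybrid candidates, then re-rank the head.
# candidates = reciprocal_rank_fusion([vector_top_50, bm25_top_50])[:50]
# top_8 = cross_encoder_rerank(query, candidates)[:8]  # hypothetical helper
```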
Infrastructure — pgvector until you cannot
Start with pgvector on Postgres. Most startup-stage corpora (2–5M embeddings) fit comfortably with acceptable latency. You already run Postgres; you already have backups, access control, and auditability. Fancier vector DBs buy you nothing until you cross roughly 20M vectors or need strict sub-50ms retrieval at scale.
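A minimal sketch of the retrieval query, assuming psycopg2 and an existing `chunks` table with an indexed `embedding vector(1536)` column and the `vector` extension installed; table and column names are illustrative.

```python
# Nearest-neighbor query against pgvector. Assumes `CREATE EXTENSION vector`
# and an indexed `embedding vector(1536)` column; names are illustrative.
import psycopg2

def top_k_chunks(conn, query_embedding: list[float], k: int = 10):
    vec = "[" + ",".join(str(x) for x in query_embedding) + "]"  # pgvector text format
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT id, text, doc_title
            FROM chunks
            ORDER BY embedding <=> %s::vector  -- cosine distance operator
            LIMIT %s
            """,
            (vec, k),
        )
        return cur.fetchall()
```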
Once you cross those thresholds, graduate to Qdrant (self-hosted, strong filtering) or Pinecone (managed, simple ops). Weaviate is also credible, especially if you want built-in hybrid search. Cost per million vectors at scale typically runs $50–$300/month depending on recall and latency requirements.
- pgvector until ~20M vectors or sub-50ms latency SLA
- Qdrant, Pinecone, Weaviate are credible next steps
- Cost per million vectors at scale: $50–$300/month
Freshness — the hidden operational burden
Production RAG systems live and die on data freshness. Every content source needs an ingestion pipeline: detect changes, reprocess affected chunks, update embeddings, invalidate caches. Teams that build RAG without a freshness pipeline discover their answers are three weeks stale after their first enterprise support ticket.
Incremental updates beat full reindexes. Event-driven pipelines (webhook → queue → embed → upsert) handle most content types cleanly; for sources without change events, a diff-based nightly job is a reasonable fallback. The event-driven path is sketched after the list below.
- Every source needs a change detection + reprocessing pipeline
- Prefer event-driven incremental updates to full reindexes
- Missing the freshness story is the #1 enterprise RAG failure mode
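The event-driven path reduces to a small queue consumer. In the sketch below, `chunk_document` is the chunking sketch from earlier; every other helper (`fetch_document`, `embed`, `upsert_chunk`, `delete_chunks`, `invalidate_answer_cache`) is a hypothetical stand-in for your own source connector, embedding call, and vector store client.

```python
# One consumer iteration of the webhook -> queue -> embed -> upsert path.
# All helpers below except chunk_document are hypothetical stand-ins.
def handle_change_event(event: dict) -> None:
    """Refresh every chunk belonging to one changed document."""
    doc = fetch_document(event["doc_id"])      # hypothetical: pull latest source
    delete_chunks(doc_id=event["doc_id"])      # drop stale chunks first
    for chunk in chunk_document(doc.text, doc.title, doc.section_path):
        chunk["embedding"] = embed(chunk["text"])  # hypothetical embedding call
        upsert_chunk(chunk)                        # hypothetical vector-store upsert
    invalidate_answer_cache(doc_id=event["doc_id"])  # hypothetical cache bust
```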
Conclusion
RAG is a powerful pattern when applied to problems it genuinely fits. The craft is in the details — chunking, hybrid retrieval, reranking, infrastructure choice, and freshness — not in the choice to use RAG at all. Build the simple version first, measure retrieval quality with recall@k and precision@k, and invest in the upgrades that your eval data actually justifies.
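Both metrics are a few lines each. A minimal sketch, assuming a labeled eval set where each query comes with a set of known-relevant chunk ids and the pipeline under test returns a ranked id list:

```python
# recall@k and precision@k over one query's ranked retrieval results.
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant) if relevant else 0.0

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    hits = len(set(retrieved[:k]) & relevant)
    return hits / k
```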
