RAG Architecture: When to Use It, When to Skip It

A practical decision guide for Retrieval-Augmented Generation — when it beats fine-tuning, when a simple context window wins, and how to architect a RAG system that scales past the demo.

Updated April 12, 2026 · 11 min read

Introduction

Retrieval-Augmented Generation (RAG) is the default pattern for grounding LLM outputs in your proprietary data. It is also the pattern teams most often over-apply — reaching for vector search when a simple prompt with the full document would have worked better. This article walks through when RAG is the right answer, when it is not, and how to build one that survives past the first demo.

When RAG genuinely wins

RAG wins when your corpus is large enough that you cannot fit everything relevant into a context window (roughly: more than 20–40 pages of text per query), when your content changes frequently enough that fine-tuning would go stale, when you need citations back to source documents, and when different users should see different subsets of the corpus.

Typical winning use cases: internal knowledge bases, customer support with evolving product docs, policy compliance Q&A, legal and medical reference tools, content personalized to user permissions.

  • Corpus too large for context window (>20–40 pages per query)
  • Content updates frequently; fine-tune would go stale
  • Citations back to source are required
  • Per-user or per-tenant permission boundaries on content

When RAG is the wrong choice

If your entire corpus fits in a context window (Gemini 1.5 Pro accepts 2M tokens — roughly 3,000 pages), just include the whole corpus. 'Context stuffing' is often both cheaper and more accurate than RAG for small corpora.

If your problem is structured query ('how many customers upgraded last month'), use SQL, not RAG. RAG over structured data is an antipattern that produces slow, inaccurate answers where a JOIN would have been correct and instant.

If your problem is tone, style, or domain jargon calibration, prefer few-shot prompting or fine-tuning over RAG. RAG retrieves facts; it does not teach the model how to sound.

  • Small corpus → just stuff the context; do not retrieve
  • Structured query → SQL, not vector search
  • Tone/style calibration → few-shot or fine-tune
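The first branch of that decision can be mechanized. The sketch below is a back-of-envelope check, assuming a ~4-characters-per-token heuristic for English prose; the function name `should_use_rag` and the default 2M-token budget (matching the Gemini 1.5 Pro figure above) are illustrative, not a standard API.

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English prose.
    # Use a real tokenizer for anything load-bearing.
    return len(text) // 4

def should_use_rag(corpus: list[str], context_budget: int = 2_000_000) -> bool:
    """True if the corpus will not fit in the context window,
    i.e. context stuffing is off the table and retrieval is warranted."""
    total = sum(estimate_tokens(doc) for doc in corpus)
    return total > context_budget
```

If this returns False, skip the vector database entirely and send the whole corpus with each prompt.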

Chunking strategy — the 70% of RAG quality

RAG quality is mostly chunking quality. The naive 'split every 1,000 characters' approach produces fragments that lose context. Modern chunking respects document structure: section headers, paragraph boundaries, table integrity, code block atomicity.

Practical defaults that hold up: semantic chunking (split at paragraph boundaries with 150–250 token overlap), store parent document context with every chunk (doc title, section path, a short summary), and maintain a document-level index so the model can escalate 'give me more of this document' when a chunk is clearly relevant but insufficient.

  • Semantic chunking beats character-based splitting every time
  • 150–250 token overlap preserves continuity
  • Store parent context (title, section, summary) with every chunk
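A minimal sketch of the paragraph-boundary chunker described above, assuming the same 4-chars-per-token estimate; the function name and the specific budgets are illustrative. Production chunkers would also attach the parent metadata (title, section path, summary) to each chunk.

```python
def chunk_by_paragraphs(text: str, max_tokens: int = 500,
                        overlap_tokens: int = 200) -> list[str]:
    """Split at paragraph boundaries, carrying trailing paragraphs
    forward as overlap so no chunk starts mid-thought."""
    est = lambda s: max(1, len(s) // 4)  # rough token estimate
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, size = [], [], 0
    for para in paragraphs:
        if current and size + est(para) > max_tokens:
            chunks.append("\n\n".join(current))
            # Carry the last paragraphs forward as overlap.
            carried, carried_size = [], 0
            for prev in reversed(current):
                if carried_size + est(prev) > overlap_tokens:
                    break
                carried.insert(0, prev)
                carried_size += est(prev)
            current, size = carried, carried_size
        current.append(para)
        size += est(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Because overlap is measured in whole paragraphs, adjacent chunks share complete thoughts rather than arbitrary character runs.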

Retrieval — hybrid beats pure vector

Pure vector similarity search has quality gaps on specific terms, numbers, and acronyms. In production, hybrid retrieval (vector + keyword/BM25, combined with reciprocal rank fusion) consistently outperforms either alone.

Re-ranking with a cross-encoder (Cohere Rerank, Voyage Rerank) after the initial retrieval further improves precision at minimal latency cost. A typical pipeline: retrieve top 50 with hybrid search → re-rank to top 8 with cross-encoder → send to LLM.

  • Hybrid retrieval (vector + keyword, reciprocal rank fusion) is the default
  • Re-rank with cross-encoder (Cohere Rerank, Voyage) for precision
  • Pipeline: hybrid top-50 → rerank top-8 → LLM
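Reciprocal rank fusion itself is a few lines. This sketch fuses any number of ranked ID lists (e.g. one from vector search, one from BM25); `k=60` is the damping constant commonly used in the RRF literature.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse multiple ranked lists of doc IDs.
    Each list contributes 1 / (k + rank) to a document's score,
    so agreement across retrievers outweighs a single high rank."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document ranked well by both retrievers beats one ranked first by only one of them, which is exactly the behavior you want for acronyms and part numbers that vector search alone mishandles.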

Infrastructure — pgvector until you cannot

Start with pgvector on Postgres. Most startup-stage corpora (2–5M embeddings) fit comfortably with acceptable latency. You already run Postgres; you already have backups, access control, and auditability. Fancier vector DBs buy you nothing until you cross roughly 20M vectors or need strict sub-50ms retrieval at scale.

Once you cross those thresholds, graduate to Qdrant (self-hosted, strong filtering) or Pinecone (managed, simple ops). Weaviate is also credible, especially if you want built-in hybrid search. Cost per million vectors at scale typically runs $50–$300/month depending on recall and latency requirements.

  • pgvector until ~20M vectors or sub-50ms latency SLA
  • Qdrant, Pinecone, Weaviate are credible next steps
  • Cost per million vectors at scale: $50–$300/month
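On pgvector, the retrieval step is a single SQL query. The sketch below builds a cosine-distance query against a hypothetical `chunks(id, content, embedding, tenant_id)` table; `<=>` is pgvector's cosine-distance operator, and the `tenant_id` filter is how per-tenant permission boundaries stay in the database rather than in application code.

```python
def build_ann_query(top_k: int = 50) -> str:
    """Build a pgvector cosine-distance query with a per-tenant filter.
    Assumes an hnsw or ivfflat index on `embedding` for approximate
    nearest-neighbour scans; parameters are psycopg-style placeholders."""
    return f"""
        SELECT id, content,
               embedding <=> %(query_vec)s::vector AS distance
        FROM chunks
        WHERE tenant_id = %(tenant_id)s
        ORDER BY embedding <=> %(query_vec)s::vector
        LIMIT {top_k};
    """
```

Because this is ordinary Postgres, the same query composes with joins, row-level security, and whatever metadata filters your schema already has.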

Freshness — the hidden operational burden

Production RAG systems live and die on data freshness. Every content source needs an ingestion pipeline: detect changes, reprocess affected chunks, update embeddings, invalidate caches. Teams that build RAG without a freshness pipeline discover their answers are three weeks stale after their first enterprise support ticket.

Incremental updates beat full reindexes. Event-driven pipelines (webhook → queue → embed → upsert) handle most content types cleanly. For sources without change events, a diff-based nightly job is a reasonable fallback.

  • Every source needs a change detection + reprocessing pipeline
  • Prefer event-driven incremental updates to full reindexes
  • Missing the freshness story is the #1 enterprise RAG failure mode
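The webhook → queue → embed → upsert loop can be sketched in miniature. `FreshnessPipeline`, `on_webhook`, and `drain` are illustrative names, and the in-memory queue and dict stand in for a real message queue and vector store; the point is the content-hash check that skips unchanged documents.

```python
import hashlib
from queue import Queue

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()

class FreshnessPipeline:
    """Sketch of event-driven incremental ingestion with change detection."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.seen: dict[str, str] = {}    # doc_id -> last ingested hash
        self.index: dict[str, list] = {}  # doc_id -> embedding (stand-in store)
        self.queue: Queue = Queue()

    def on_webhook(self, doc_id: str, text: str) -> None:
        # Only enqueue real edits; identical content is a no-op.
        if self.seen.get(doc_id) == content_hash(text):
            return
        self.queue.put((doc_id, text))

    def drain(self) -> int:
        """Process queued changes: re-embed and upsert. Returns count."""
        processed = 0
        while not self.queue.empty():
            doc_id, text = self.queue.get()
            self.index[doc_id] = self.embed_fn(text)
            self.seen[doc_id] = content_hash(text)
            processed += 1
        return processed
```

The same structure works for the diff-based nightly fallback: the nightly job simply calls `on_webhook` for every document and lets the hash check discard the unchanged ones.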

Conclusion

RAG is a powerful pattern when applied to problems it genuinely fits. The craft is in the details — chunking, hybrid retrieval, reranking, infrastructure choice, and freshness — not in the choice to use RAG at all. Build the simple version first, measure retrieval quality with recall@k and precision@k, and invest in the upgrades that your eval data actually justifies.
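The two metrics named above are cheap to compute once you have a labeled set of relevant chunks per query; this is the standard definition, sketched directly:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant docs that appear in the top-k retrieved."""
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant) if relevant else 0.0

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved docs that are relevant."""
    hits = len(set(retrieved[:k]) & relevant)
    return hits / k
```

Measure both at your pipeline's handoff point (k = 8 in the pipeline above): recall tells you whether the answer is reachable at all; precision tells you how much noise the LLM has to ignore.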

FAQ

When should I use RAG instead of fine-tuning?

Use RAG when your content changes, when you need citations, or when different users see different corpora. Use fine-tuning when you need tone, style, or format calibration the model cannot learn from few-shot examples. Most startups need RAG; very few actually need fine-tuning in 2026.

