Introduction
Retrieval-Augmented Generation (RAG) is the default pattern for grounding LLM outputs in your proprietary data. It is also the pattern teams most often over-apply — reaching for vector search when a simple prompt with the full document would have worked better. This article walks through when RAG is the right answer, when it is not, and how to build one that survives past the first demo.
When RAG genuinely wins
RAG wins when your corpus is large enough that you cannot fit everything relevant into a context window (roughly: more than 20–40 pages of text per query), when your content changes frequently enough that fine-tuning would go stale, when you need citations back to source documents, and when different users should see different subsets of the corpus.
Typical winning use cases: internal knowledge bases, customer support with evolving product docs, policy compliance Q&A, legal and medical reference tools, content personalized to user permissions.
- Corpus too large for context window (>20–40 pages per query)
- Content updates frequently; fine-tune would go stale
- Citations back to source are required
- Per-user or per-tenant permission boundaries on content
When RAG is the wrong choice
If your entire corpus fits in a context window (Gemini 1.5 Pro accepts 2M tokens — roughly 3,000 pages), just include the whole corpus. 'Context stuffing' is often both cheaper and more accurate than RAG for small corpora.
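To make the decision concrete, here is a minimal sketch of the stuff-vs-retrieve check, assuming tiktoken for token counting; the context budget, delimiter, and prompt framing are illustrative assumptions, not any model's spec.

```python
# Minimal stuff-vs-retrieve check. Assumes tiktoken; the budget constant
# and prompt framing are illustrative, not a model spec.
import tiktoken

ENC = tiktoken.get_encoding("cl100k_base")
CONTEXT_BUDGET = 200_000  # illustrative: leave headroom under the model's limit

def build_stuffed_prompt(question: str, corpus: list[str]) -> str | None:
    """Return a context-stuffed prompt if the corpus fits, else None (use RAG)."""
    corpus_text = "\n\n---\n\n".join(corpus)
    if len(ENC.encode(corpus_text)) > CONTEXT_BUDGET:
        return None  # corpus too large: retrieval is warranted
    return f"Answer using only the documents below.\n\n{corpus_text}\n\nQuestion: {question}"
```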
If your problem is structured query ('how many customers upgraded last month'), use SQL, not RAG. RAG over structured data is an antipattern that produces slow, inaccurate answers where a JOIN would have been correct and instant.
If your problem is tone, style, or domain jargon calibration, prefer few-shot prompting or fine-tuning over RAG. RAG retrieves facts; it does not teach the model how to sound.
- Small corpus → just stuff the context; do not retrieve
- Structured query → SQL, not vector search
- Tone/style calibration → few-shot or fine-tune
Chunking strategy — the 70% of RAG quality
RAG quality is mostly chunking quality. The naive 'split every 1,000 characters' approach produces fragments that lose context. Modern chunking respects document structure: section headers, paragraph boundaries, table integrity, code block atomicity.
Practical defaults that hold up: semantic chunking (split at paragraph boundaries with a 150–250 token overlap), store parent document context with every chunk (doc title, section path, a short summary), and maintain a document-level index so the model can escalate to 'give me more of this document' when a chunk is clearly relevant but insufficient. A minimal sketch of these defaults follows the list below.
- Semantic chunking beats character-based splitting every time
- 150–250 token overlap preserves continuity
- Store parent context (title, section, summary) with every chunk
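A minimal sketch, assuming tiktoken for token counting; the chunk size, field names, and overlap mechanics are illustrative starting points, not a library API.

```python
# Paragraph-boundary chunking with token overlap and parent context.
# A sketch: sizes and field names are illustrative.
import tiktoken

ENC = tiktoken.get_encoding("cl100k_base")

def chunk_document(text: str, doc_title: str, section_path: str,
                   max_tokens: int = 512, overlap_tokens: int = 200) -> list[dict]:
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[dict] = []
    current: list[str] = []
    token_count = 0

    def flush() -> None:
        if current:
            chunks.append({
                "text": "\n\n".join(current),
                "doc_title": doc_title,       # parent context travels with the chunk
                "section_path": section_path,
            })

    for para in paragraphs:
        para_tokens = len(ENC.encode(para))
        if current and token_count + para_tokens > max_tokens:
            flush()
            # Carry the tail of the previous chunk forward as overlap.
            tail_ids = ENC.encode("\n\n".join(current))[-overlap_tokens:]
            current = [ENC.decode(tail_ids)]
            token_count = len(tail_ids)
        current.append(para)
        token_count += para_tokens
    flush()
    return chunks
```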
Retrieval — hybrid beats pure vector
Pure vector similarity search has known quality gaps on exact terms, numbers, and acronyms. In production, hybrid retrieval (vector + keyword/BM25, combined with reciprocal rank fusion) consistently outperforms either alone; the fusion step is sketched after the list below.
Re-ranking with a cross-encoder (Cohere Rerank, Voyage Rerank) after the initial retrieval further improves precision at minimal latency cost. A typical pipeline: retrieve top 50 with hybrid search → re-rank to top 8 with cross-encoder → send to LLM.
- Hybrid retrieval (vector + keyword, reciprocal rank fusion) is the default
- Re-rank with cross-encoder (Cohere Rerank, Voyage) for precision
- Pipeline: hybrid top-50 → rerank top-8 → LLM
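Reciprocal rank fusion itself is only a few lines. The sketch below fuses ranked id lists from the vector and keyword retrievers; `cross_encoder_rerank` in the usage comment is a hypothetical stand-in for whichever re-ranker you call.

```python
# Reciprocal rank fusion over ranked result lists (e.g. vector top-k and
# BM25 top-k). k=60 is the constant used in the original RRF paper.
from collections import defaultdict

def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of document ids; highest fused score first."""
    scores: dict[str, float] = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Typical pipeline shape: fuse hybrid candidates, then re-rank the head.
# candidates = reciprocal_rank_fusion([vector_top_50, bm25_top_50])[:50]
# top_8 = cross_encoder_rerank(query, candidates)[:8]  # hypothetical helper
```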
Infrastructure — pgvector until you cannot
Start with pgvector on Postgres. Most startup-stage corpora (2–5M embeddings) fit comfortably with acceptable latency. You already run Postgres; you already have backups, access control, and auditability. Fancier vector DBs buy you nothing until you cross roughly 20M vectors or need strict sub-50ms retrieval at scale.
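A minimal sketch of the retrieval query, assuming psycopg2 and an existing `chunks` table with an indexed `embedding vector(1536)` column and the `vector` extension installed; table and column names are illustrative.

```python
# Nearest-neighbor query against pgvector. Assumes `CREATE EXTENSION vector`
# and an indexed `embedding vector(1536)` column; names are illustrative.
import psycopg2

def top_k_chunks(conn, query_embedding: list[float], k: int = 10):
    vec = "[" + ",".join(str(x) for x in query_embedding) + "]"  # pgvector text format
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT id, text, doc_title
            FROM chunks
            ORDER BY embedding <=> %s::vector  -- cosine distance operator
            LIMIT %s
            """,
            (vec, k),
        )
        return cur.fetchall()
```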
Once you cross those thresholds, graduate to Qdrant (self-hosted, strong filtering) or Pinecone (managed, simple ops). Weaviate is also credible, especially if you want built-in hybrid search. Cost per million vectors at scale typically runs $50–$300/month depending on recall and latency requirements.
- pgvector until ~20M vectors or sub-50ms latency SLA
- Qdrant, Pinecone, Weaviate are credible next steps
- Cost per million vectors at scale: $50–$300/month
Freshness — the hidden operational burden
Production RAG systems live and die on data freshness. Every content source needs an ingestion pipeline: detect changes, reprocess affected chunks, update embeddings, invalidate caches. Teams that build RAG without a freshness pipeline discover their answers are three weeks stale after their first enterprise support ticket.
Incremental updates beat full reindexes. Event-driven pipelines (webhook → queue → embed → upsert) handle most content types cleanly; for sources without change events, a diff-based nightly job is a reasonable fallback. The event-driven path is sketched after the list below.
- Every source needs a change detection + reprocessing pipeline
- Prefer event-driven incremental updates to full reindexes
- Missing the freshness story is the #1 enterprise RAG failure mode
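The event-driven path reduces to a small queue consumer. In the sketch below, `chunk_document` is the chunking sketch from earlier; every other helper (`fetch_document`, `embed`, `upsert_chunk`, `delete_chunks`, `invalidate_answer_cache`) is a hypothetical stand-in for your own source connector, embedding call, and vector store client.

```python
# One consumer iteration of the webhook -> queue -> embed -> upsert path.
# All helpers below except chunk_document are hypothetical stand-ins.
def handle_change_event(event: dict) -> None:
    """Refresh every chunk belonging to one changed document."""
    doc = fetch_document(event["doc_id"])      # hypothetical: pull latest source
    delete_chunks(doc_id=event["doc_id"])      # drop stale chunks first
    for chunk in chunk_document(doc.text, doc.title, doc.section_path):
        chunk["embedding"] = embed(chunk["text"])  # hypothetical embedding call
        upsert_chunk(chunk)                        # hypothetical vector-store upsert
    invalidate_answer_cache(doc_id=event["doc_id"])  # hypothetical cache bust
```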
Conclusion
RAG is a powerful pattern when applied to problems it genuinely fits. The craft is in the details — chunking, hybrid retrieval, reranking, infrastructure choice, and freshness — not in the choice to use RAG at all. Build the simple version first, measure retrieval quality with recall@k and precision@k, and invest in the upgrades that your eval data actually justifies.
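Both metrics are a few lines each. A minimal sketch, assuming a labeled eval set where each query comes with a set of known-relevant chunk ids and the pipeline under test returns a ranked id list:

```python
# recall@k and precision@k over one query's ranked retrieval results.
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant) if relevant else 0.0

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    hits = len(set(retrieved[:k]) & relevant)
    return hits / k
```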
