RAG extends LLMs with external knowledge by retrieving relevant documents and including them in the generation context. This addresses two fundamental LLM limitations: knowledge cutoff (the model doesn’t know about recent information) and hallucination (the model generates plausible but incorrect facts).
User Query
↓
┌──────────────┐ ┌──────────────┐
│ 1. Embed │ │ Vector │
│ Query │────→│ Database │
└──────────────┘ │ (similarity │
│ search) │
└──────┬───────┘
↓
Top-k Documents
↓
┌──────────────────────────────────┐
│ 2. Construct Augmented Prompt │
│ [System] + [Retrieved Docs] │
│ + [User Query] │
└──────────────┬───────────────────┘
↓
┌──────────────────────────────────┐
│ 3. Generate Answer │
│ (LLM with retrieved context) │
└──────────────────────────────────┘
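The three steps above can be sketched end to end. This is a minimal toy: the bag-of-words `embed` and in-memory `retrieve` stand in for a real embedding model and vector database, and `build_prompt` mirrors the augmented-prompt template shown later in the section.

```python
import re
from collections import Counter
from math import sqrt

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real pipeline uses a dense model.
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int) -> list[str]:
    # Step 1: embed the query and rank documents by similarity.
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_prompt(query: str, contexts: list[str]) -> str:
    # Step 2: construct the augmented prompt from retrieved chunks.
    ctx = "\n".join(f"[Retrieved Document {i}] {c}" for i, c in enumerate(contexts, 1))
    return ("System: Answer based on the provided context. "
            'If the context doesn\'t contain the answer, say "I don\'t know."\n'
            f"Context:\n{ctx}\nUser: {query}")

docs = [
    "Milvus is a distributed vector database.",
    "BM25 is a sparse keyword-matching ranking function.",
    "Chunking splits documents into pieces.",
]
top = retrieve("what is a vector database", docs, k=1)
prompt = build_prompt("what is a vector database", top)
```

Step 3 would pass `prompt` to the LLM; everything else here is real retrieval logic, just with toy components.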
Chunking splits documents into pieces that fit within the embedding model's context window and define the granularity of retrieval:
| Strategy | Description | Best For |
|---|---|---|
| Fixed-size | Split every N tokens with M overlap | Simple documents |
| Recursive | Split by paragraphs → sentences → tokens | Structured text |
| Semantic | Split at topic boundaries using embeddings | Long-form content |
| Document-aware | Split respecting headers, sections, tables | Technical docs |
Typical parameters: 256-512 tokens per chunk, 10-20% overlap between adjacent chunks.
Metadata preservation: Store source document, page number, section heading, and timestamp with each chunk for citation and filtering.
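A fixed-size chunker with overlap and per-chunk metadata can be sketched as follows; whitespace words stand in for real tokenizer tokens, and the `source` value is a hypothetical filename.

```python
def chunk_fixed(tokens: list[str], source: str, size: int = 256, overlap: int = 32) -> list[dict]:
    # Sliding window with stride (size - overlap), so adjacent chunks
    # share `overlap` tokens of context.
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append({
            "text": " ".join(tokens[start:start + size]),
            "source": source,        # metadata kept for citation/filtering
            "start_token": start,
        })
        if start + size >= len(tokens):   # last window reached the end
            break
    return chunks

words = [f"w{i}" for i in range(600)]
chunks = chunk_fixed(words, source="guide.pdf", size=256, overlap=32)
```

With 600 tokens, size 256, and overlap 32 this yields three chunks starting at tokens 0, 224, and 448, with the last 32 tokens of each chunk repeated at the start of the next.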
Convert text chunks and queries into dense vectors for similarity search. See vector database embeddings for model choices.
Key considerations:
- Use the same embedding model (and preprocessing) at indexing time and query time.
- Embedding dimensionality trades retrieval quality against storage and ANN search cost.
- Match the model to the domain; general-purpose embeddings can underperform on specialized text.
- Keep chunk sizes within the embedding model's maximum input length.
Store and index embeddings for fast approximate nearest neighbor (ANN) search:
| Database | Type | Strengths |
|---|---|---|
| Milvus | Open-source, distributed | Scalable, GPU-accelerated search, rich filtering |
| Weaviate | Open-source | Hybrid search, built-in vectorization |
| Pinecone | Managed service | Fully managed, low operational overhead |
| FAISS | Library | Fast, in-memory, GPU support, research-grade |
| Chroma | Open-source | Simple API, good for prototyping |
| pgvector | PostgreSQL extension | Integrates with existing Postgres infrastructure |
Similarity metrics: cosine similarity (scale-invariant; the most common choice), dot product (equivalent to cosine for unit-normalized vectors), and Euclidean (L2) distance.
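The usual metrics (cosine, dot product, Euclidean distance) can be sketched in pure Python. A useful identity: for unit-normalized vectors, squared Euclidean distance equals 2 − 2·cosine, so all three produce the same ranking.

```python
from math import sqrt

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    # Dot product normalized by vector lengths; in [-1, 1].
    return dot(a, b) / (sqrt(dot(a, a)) * sqrt(dot(b, b)))

def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

Most vector databases let you pick the metric at index creation time; it must match how the embedding model was trained.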
Basic retrieval: Embed query → find top-k nearest chunks → return.
Re-ranking: After initial retrieval, use a cross-encoder to score query-document pairs more accurately:
1. Retrieve top-50 chunks via ANN search (fast, approximate)
2. Re-rank with cross-encoder to get top-5 (slow, accurate)
3. Pass top-5 to LLM
Cross-encoders jointly encode the query and document, capturing fine-grained interactions that bi-encoder similarity misses.
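The two-stage retrieve-then-rerank flow can be sketched generically; `ann_score` and `ce_score` below are toy stand-ins for a bi-encoder similarity and a cross-encoder model.

```python
def two_stage_retrieve(query, docs, ann_score, ce_score, k_ann=50, k_final=5):
    # Stage 1: cheap score over all documents (fast, approximate).
    shortlist = sorted(docs, key=lambda d: ann_score(query, d), reverse=True)[:k_ann]
    # Stage 2: expensive cross-encoder score over the shortlist only.
    return sorted(shortlist, key=lambda d: ce_score(query, d), reverse=True)[:k_final]

# Toy scorers: word overlap for stage 1, a "better judge" for stage 2.
ann = lambda q, d: len(set(q.split()) & set(d.split()))
ce = lambda q, d: 2.0 if "vector database" in d else float(ann(q, d))

docs = ["a vector store", "a vector database engine", "keyword search notes"]
best = two_stage_retrieve("vector database", docs, ann, ce, k_ann=2, k_final=1)
```

The key property is cost asymmetry: the cross-encoder runs on only `k_ann` candidates, not the whole corpus.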
Hybrid search: Combine dense retrieval (semantic similarity) with sparse retrieval (BM25 keyword matching):
score = α · dense_score + (1-α) · sparse_score
Hybrid search catches cases where semantic search fails (exact names, codes, numbers) and where keyword search fails (paraphrases, synonyms).
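A minimal sketch of the score fusion above. One practical detail it includes: BM25 scores and cosine similarities live on different scales, so each side is min-max normalized before mixing (one common convention; reciprocal rank fusion is an alternative).

```python
def minmax(scores: dict) -> dict:
    # Rescale scores to [0, 1] so dense and sparse are comparable.
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {d: (s - lo) / span for d, s in scores.items()}

def hybrid_scores(dense: dict, sparse: dict, alpha: float = 0.5) -> dict:
    # score = alpha * dense_score + (1 - alpha) * sparse_score
    dn, sn = minmax(dense), minmax(sparse)
    return {d: alpha * dn.get(d, 0.0) + (1 - alpha) * sn.get(d, 0.0)
            for d in set(dn) | set(sn)}

scores = hybrid_scores({"doc1": 0.92, "doc2": 0.15},
                       {"doc1": 3.1, "doc2": 11.4}, alpha=0.7)
```

With α = 0.7 the dense side dominates; α is typically tuned on a held-out query set.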
Construct the final prompt with retrieved context:
System: You are a helpful assistant. Answer based on the provided context.
If the context doesn't contain the answer, say "I don't know."
Context:
[Retrieved Document 1]
[Retrieved Document 2]
[Retrieved Document 3]
User: {original query}
Context window management: With limited context length, prioritize the highest-scoring chunks, deduplicate near-identical retrievals, and reserve tokens for the model's answer.
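A greedy token-budget packer is one simple way to manage the context window; approximating token count by whitespace word count is an assumption here (a real system uses the tokenizer).

```python
def pack_context(scored_chunks, budget_tokens):
    # Take chunks in descending score order; skip any that would
    # exceed the remaining token budget.
    packed, used = [], 0
    for text, score in sorted(scored_chunks, key=lambda x: x[1], reverse=True):
        n = len(text.split())
        if used + n <= budget_tokens:
            packed.append(text)
            used += n
    return packed

chunks = [("one two three", 0.9), ("four five", 0.8),
          ("six seven eight nine", 0.7)]
packed = pack_context(chunks, budget_tokens=6)
```

Here the third chunk is dropped: it scores lowest and would push the total past the 6-token budget.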
HyDE (Hypothetical Document Embeddings): Generate a hypothetical answer first, embed it, then retrieve.
HyDE improves retrieval for complex queries where the query embedding is far from relevant document embeddings.
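A sketch of the HyDE flow; `generate_hypothetical`, `embed`, and `similarity` are injected stand-ins for an LLM call, an embedding model, and a vector-similarity function.

```python
def hyde_retrieve(query, docs, generate_hypothetical, embed, similarity, k=3):
    # Embed the hypothetical *answer*, not the raw query: answers tend to
    # lie closer to relevant documents in embedding space than short queries.
    hypo = generate_hypothetical(query)      # stands in for an LLM call
    h = embed(hypo)
    return sorted(docs, key=lambda d: similarity(h, embed(d)), reverse=True)[:k]

# Toy stand-ins so the sketch runs end to end:
fake_llm = lambda q: "paris is the capital of france"
embed = lambda t: set(t.lower().split())
overlap = lambda a, b: len(a & b)

docs = ["paris is the capital of france", "bm25 ranks by term frequency"]
top = hyde_retrieve("capital?", docs, fake_llm, embed, overlap, k=1)
```

Note the query `"capital?"` alone shares no tokens with either document; only the hypothetical answer bridges the gap, which is the point of the technique.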
Multi-query retrieval: Generate multiple query variations, retrieve for each, and merge the results.
Captures different aspects of the query that a single embedding might miss.
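Reciprocal Rank Fusion (RRF) is one common way to merge the per-variation ranked lists; the constant `k = 60` is a conventional default, not something prescribed here.

```python
def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Each list is one query variation's ranked results; a document's
    # fused score is the sum of 1/(k + rank) over every list it appears in.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

merged = rrf_merge([["d1", "d2", "d3"],   # results for variation 1
                    ["d2", "d1"],          # results for variation 2
                    ["d2", "d4"]])         # results for variation 3
```

RRF only needs ranks, not raw scores, so it sidesteps the scale-mismatch problem that score averaging has.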
Self-RAG: The model decides when to retrieve and critiques its own responses.
Contextual compression: Rewrite retrieved chunks to remove irrelevant content.
Reduces noise and fits more relevant information in the context window.
Parent-document retrieval (small-to-big): Embed small chunks (children) for precise retrieval, but return larger chunks (parents) for context.
Provides both retrieval precision and generation context.
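The parent/child indirection can be sketched with a child-to-parent mapping; fixed word windows and a word-overlap `score` stand in for real chunking and embedding similarity.

```python
def build_child_index(parents: dict, child_size: int = 2) -> list[dict]:
    # Split each parent into small child chunks; each child records
    # its parent's id so we can map back at query time.
    children = []
    for pid, text in parents.items():
        words = text.split()
        for i in range(0, len(words), child_size):
            children.append({"parent": pid,
                             "text": " ".join(words[i:i + child_size])})
    return children

def retrieve_parent(query, parents, children, score):
    # Match against precise children, but return the full parent chunk.
    best = max(children, key=lambda c: score(query, c["text"]))
    return parents[best["parent"]]

parents = {"p1": "milvus vector database gpu search",
           "p2": "cooking pasta with tomato sauce"}
children = build_child_index(parents)
overlap = lambda q, t: len(set(q.split()) & set(t.split()))
hit = retrieve_parent("vector database", parents, children, overlap)
```

The small child gives retrieval precision; the returned parent gives the generator enough surrounding context.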
| Component | NVIDIA Tool | Purpose |
|---|---|---|
| Embedding model | NeMo + TensorRT | Encode queries and documents |
| Embedding serving | NIM (embedding) | Production embedding API |
| Re-ranking | NIM (reranking) | Cross-encoder scoring |
| Vector search | Milvus (GPU-accelerated) | Fast ANN search |
| Generator | NIM (LLM) | Answer generation |
| Safety | NeMo Guardrails | Input/output/retrieval rails |
| Metric | Measures | How |
|---|---|---|
| Recall@k | Retrieval quality | % of relevant docs in top-k |
| MRR (Mean Reciprocal Rank) | Retrieval ranking | Average 1/rank of first relevant doc |
| Faithfulness | Answer groundedness | Does the answer follow from retrieved context? |
| Answer relevancy | Response quality | Does the answer address the original question? |
| Context precision | Retrieval precision | Are retrieved docs actually relevant? |
Frameworks like RAGAS and TruLens automate RAG evaluation across these metrics.
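The two retrieval metrics from the table are simple enough to compute directly; a minimal sketch over ranked result lists and sets of relevant document ids:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    # Fraction of all relevant docs that appear in the top-k results.
    hits = sum(1 for d in retrieved[:k] if d in relevant)
    return hits / len(relevant)

def mrr(queries: list[tuple[list[str], set[str]]]) -> float:
    # Mean over queries of 1/rank of the first relevant doc
    # (a query with no relevant doc retrieved contributes 0).
    total = 0.0
    for retrieved, relevant in queries:
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(queries)
```

Faithfulness, answer relevancy, and context precision are LLM- or annotation-judged rather than closed-form, which is where frameworks like RAGAS come in.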