Skip to content

Latest commit

 

History

History
644 lines (435 loc) · 21.2 KB

File metadata and controls

644 lines (435 loc) · 21.2 KB

Retrieval-Augmented Generation (RAG)


1. Overview

Retrieval-Augmented Generation (RAG) is an AI framework that enhances large language models by combining them with external knowledge retrieval systems. Instead of relying solely on parametric knowledge stored in model weights, RAG systems dynamically retrieve relevant information from external sources to ground their responses in factual, up-to-date data.

The core idea is simple:

  • A user submits a query or prompt
  • The system retrieves relevant documents or passages from a knowledge base
  • Retrieved context is provided to the language model
  • The model generates a response grounded in the retrieved information

RAG is particularly well-suited for problems requiring factual accuracy, up-to-date information, and transparent sourcing.

Mindmap


2. Core Concepts

Retrieval

The process of finding relevant documents or passages from a knowledge base given a query. Uses similarity search, typically in vector space.

Generation

The language model component that produces natural language responses based on the retrieved context and the original query.

Embeddings

Dense vector representations of text that capture semantic meaning. Both queries and documents are embedded into the same vector space for similarity comparison.

Vector Database

Specialized storage system optimized for storing and querying high-dimensional embedding vectors. Supports efficient similarity search at scale.

Semantic Search

Finding documents based on meaning rather than keyword matching. Uses embedding similarity (typically cosine similarity) to measure relevance.

Context Window

The amount of text (measured in tokens) that a language model can process at once. RAG systems must fit retrieved documents within this limit.

Chunking

The process of splitting large documents into smaller segments that can be individually embedded and retrieved.

Grounding

Anchoring model outputs to retrieved factual information, reducing hallucinations and improving accuracy.


3. How RAG Works

  1. Document Ingestion: Load documents and split them into chunks
  2. Embedding Generation: Convert chunks into vector embeddings using an embedding model
  3. Index Storage: Store embeddings in a vector database with metadata
  4. Query Processing: User submits a question or prompt
  5. Query Embedding: Convert the query into a vector embedding
  6. Similarity Search: Find the most relevant chunks based on vector similarity
  7. Context Construction: Combine retrieved chunks into a coherent context
  8. Augmented Generation: Language model generates response using retrieved context
  9. Response Delivery: Return answer with optional source citations

This pipeline enables language models to access information beyond their training data.


4. Types of RAG Systems

Naive RAG

Simple retrieve-then-generate pipeline. Query → Retrieve top-k documents → Generate response.

  • Straightforward implementation
  • Baseline for more advanced systems

Advanced RAG

Incorporates pre-retrieval and post-retrieval optimizations.

  • Query rewriting and expansion
  • Metadata filtering
  • Document re-ranking
  • Iterative retrieval

Modular RAG

Flexible architecture with specialized components for different retrieval and generation strategies.

  • Multi-stage retrieval
  • Hybrid search (dense + sparse)
  • Agent-based retrieval
  • Self-reflective RAG

Agentic RAG

RAG systems with autonomous decision-making capabilities.

  • Determines when to retrieve
  • Selects retrieval strategies
  • Validates and refines results
  • Multi-hop reasoning across retrievals

5. Core Components

Embedding Models

Sentence Transformers — Open-source models for generating semantic embeddings

  • all-MiniLM-L6-v2 (lightweight, fast)
  • all-mpnet-base-v2 (higher quality)

OpenAI Embeddings — High-quality commercial embeddings

  • text-embedding-3-small (affordable, efficient)
  • text-embedding-3-large (highest quality)

Specialized Models

  • E5 models (Microsoft)
  • Instructor embeddings (task-specific)
  • Cohere embeddings (multilingual)

Vector Databases

Open Source

  • Chroma (simple, local-first)
  • Weaviate (GraphQL API, hybrid search)
  • Milvus (distributed, enterprise-scale)
  • Qdrant (Rust-based, high performance)

Cloud Services

  • Pinecone (managed, serverless)
  • Weaviate Cloud
  • Zilliz (managed Milvus)

Traditional Databases with Vector Support

  • PostgreSQL + pgvector
  • Redis with vector similarity
  • Elasticsearch with dense vectors

LLM Providers

  • OpenAI (GPT-4, GPT-3.5)
  • Anthropic (Claude)
  • Google (Gemini, PaLM)
  • Open-source (Llama, Mistral, Mixtral)

6. Simple Example (Intuition)

Company Document Q&A:

  • Documents: 500 company policy PDFs
  • Query: "What is the remote work policy?"
  • Process:
    1. System embeds query: [0.12, -0.45, 0.89, ...]
    2. Searches vector DB for similar document chunks
    3. Retrieves top 5 chunks about remote work policies
    4. Constructs prompt: "Given these documents: [chunks], answer: What is the remote work policy?"
    5. LLM generates answer grounded in retrieved policies
    6. Response includes citations to source documents

Key insight: The LLM never needs to memorize all policies—it only needs to reason about the relevant retrieved context.


7. Retrieval Strategies

Dense Retrieval

Uses neural embedding models for semantic similarity search. Captures meaning beyond keywords.

  • Bi-encoders (separate encoding of query and documents)
  • Cross-encoders (joint encoding, more accurate but slower)

Sparse Retrieval

Traditional keyword-based methods like BM25 and TF-IDF. Fast and interpretable.

Hybrid Search

Combines dense and sparse retrieval for better coverage.

  • Reciprocal Rank Fusion (RRF)
  • Weighted combination of scores
  • Ensemble approaches

Re-Ranking

Two-stage retrieval: fast initial retrieval followed by more expensive re-ranking.

  • Cross-encoder re-rankers
  • Cohere Rerank API
  • ColBERT (contextualized late interaction)

Metadata Filtering

Pre-filtering documents by structured attributes before semantic search.

  • Date ranges
  • Document type
  • Author or source
  • Access permissions

Query Transformation

Modifying queries to improve retrieval quality.

  • Query expansion (adding synonyms)
  • Query rewriting (LLM rephrases)
  • Multi-query (generating multiple variations)
  • Step-back prompting (generating broader queries)

8. Common Applications

  • Question Answering Systems: Customer support, internal knowledge bases
  • Document Analysis: Legal research, scientific literature review
  • Conversational AI: Context-aware chatbots with domain knowledge
  • Enterprise Search: Semantic search across company documents
  • Research Assistants: Academic paper discovery and synthesis
  • Code Documentation: Searching and explaining codebases
  • Medical Information Systems: Retrieving relevant medical literature
  • Educational Tools: Personalized learning with curriculum retrieval
  • News and Content Aggregation: Finding and summarizing relevant articles
  • E-commerce: Product recommendations with detailed descriptions

RAG is especially powerful where accuracy and verifiability are critical, and information changes frequently.


9. Key Research Papers

Foundational Papers

  • Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks — Lewis et al., Facebook AI (2020)

    • Original RAG paper introducing the framework
    • Demonstrates improvements on knowledge-intensive tasks
  • REALM: Retrieval-Augmented Language Model Pre-Training — Guu et al., Google Research (2020)

    • Pre-training language models with retrieval augmentation
    • Learns to retrieve and use knowledge during pre-training
  • Dense Passage Retrieval for Open-Domain Question Answering — Karpukhin et al., Facebook AI (2020)

    • DPR: Dense retrieval outperforms sparse methods
    • Foundation for modern RAG retrieval systems

Advanced Retrieval Techniques

  • ColBERT: Efficient and Effective Passage Search — Khattab & Zaharia (2020)

    • Late interaction between query and document tokens
    • Balances accuracy and efficiency
  • In-Context Retrieval-Augmented Language Models — Ram et al. (2023)

    • In-context RAG (ICRAG) for better context utilization
    • Demonstrates few-shot learning with retrieval
  • Self-RAG: Learning to Retrieve, Generate, and Critique — Asai et al. (2023)

    • LLM learns when and what to retrieve
    • Self-reflection on generation quality

Optimization and Scaling

  • Precise Zero-Shot Dense Retrieval without Relevance Labels — Gao & Callan (2021)

    • Improving retrieval without labeled data
    • Hypothetical document embeddings (HyDE)
  • Lost in the Middle: How Language Models Use Long Contexts — Liu et al. (2023)

    • Studies how LLMs utilize retrieved context
    • Position bias in context windows
  • RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval — Sarthi et al. (2024)

    • Hierarchical document organization
    • Multi-level retrieval strategies

10. Modern RAG Architectures

Naive RAG Pipeline

Simple linear flow: embed documents → store in vector DB → retrieve → generate.

Strengths: Easy to implement, fast development Weaknesses: Limited context understanding, no query optimization

Self-Querying RAG

System generates structured queries with metadata filters.

  • LLM extracts filters from natural language
  • Combines semantic + structured search
  • Better precision for filtered queries

Multi-Document RAG

Retrieves from multiple heterogeneous sources.

  • Different embedding strategies per source
  • Source-specific prompting
  • Cross-source ranking and fusion

Conversational RAG

Maintains conversation history and context.

  • Query reformulation based on chat history
  • Contextual follow-up questions
  • Memory management for long conversations

Graph RAG

Uses knowledge graphs alongside vector retrieval.

  • Entity-relationship awareness
  • Multi-hop reasoning paths
  • Structured knowledge integration

Corrective RAG (CRAG)

Includes self-correction mechanisms.

  • Evaluates retrieval quality
  • Falls back to web search if needed
  • Validates factual consistency

11. Learning Resources

Courses & Tutorials

  • LangChain RAG Tutorial (Free)

    • Comprehensive hands-on guide
    • Covers basic to advanced RAG patterns
  • DeepLearning.AI - Building Applications with Vector Databases (Free)

    • Short course on vector databases and RAG
    • Taught by industry experts
  • Pinecone Learning Center (Free)

    • RAG fundamentals and best practices
    • Implementation guides and examples
  • LlamaIndex Guides (Free)

    • Advanced RAG techniques
    • Production deployment patterns

Books & Long-Form Content

  • Generative AI with LangChain — Ben Auffarth (2023)

    • Comprehensive coverage of RAG patterns
    • Practical implementation examples
  • Building LLM Applications — Valentina Alto (2024)

    • Modern RAG architectures
    • Production best practices

Research Papers Collections

  • RAG Survey Paper — Gao et al. (2023)
    • Comprehensive survey of RAG techniques
    • Taxonomy of retrieval and generation methods

Libraries & Frameworks

RAG Orchestration

  • LangChain — Most popular RAG framework

    • Extensive integrations
    • Active community
  • LlamaIndex — Data framework for LLM applications

    • Advanced indexing strategies
    • Production-grade retrieval
  • Haystack — End-to-end NLP framework

    • Pipeline-based architecture
    • Enterprise features

Vector Databases

  • Chroma — Simplest to get started
  • Pinecone — Managed, serverless
  • Weaviate — Hybrid search capabilities
  • Qdrant — High-performance Rust implementation
  • Milvus — Distributed, enterprise-scale

Embedding Models

  • Sentence Transformers — Open-source embeddings
  • OpenAI Embeddings API — High-quality commercial
  • Cohere Embed — Multilingual support
  • Instructor Embeddings — Task-specific fine-tuning

Re-ranking

  • Cohere Rerank — Commercial re-ranking API
  • Cross-Encoder Models — Hugging Face implementations
  • ColBERT — Efficient contextualized retrieval

12. Practical Advice for Learning RAG

  1. Start simple before adding complexity
  2. Understand embeddings and vector similarity deeply
  3. Experiment with chunking strategies (size, overlap, semantic splitting)
  4. Test retrieval quality independently from generation
  5. Use evaluation metrics (retrieval recall, answer relevance, faithfulness)
  6. Monitor context window usage and token costs
  7. Compare embedding models on your specific domain
  8. Implement hybrid search for better coverage
  9. Add citations to verify sources and build trust
  10. Test edge cases: no relevant documents, contradictory information, very long documents

13. Common Pitfalls

  • Poor chunking strategy: Chunks too large or too small, breaking context
  • Ignoring retrieval quality: Assuming retrieval always works
  • Not re-ranking results: Taking top-k from initial retrieval without refinement
  • Context window overflow: Retrieved documents exceed model's context limit
  • Embedding model mismatch: Query and document embeddings from different models
  • No metadata filtering: Retrieving irrelevant documents that match semantically but not contextually
  • Hallucination despite retrieval: Model ignoring retrieved context
  • Position bias: Important information lost in middle of context
  • Outdated embeddings: Not re-embedding documents when they change
  • No evaluation framework: Flying blind without metrics
  • Over-reliance on similarity: Semantic similarity doesn't guarantee relevance
  • Ignoring cost: Expensive embedding and LLM API calls without optimization

14. How RAG Connects to Modern AI Systems

  • LLM Agents: RAG as a tool for knowledge retrieval in agent systems
  • Multimodal RAG: Retrieving images, tables, and structured data alongside text
  • Fine-tuning + RAG: Combining domain-specific fine-tuning with retrieval
  • RLHF for RAG: Using reinforcement learning to optimize retrieval strategies
  • GraphRAG: Integrating knowledge graphs for structured reasoning
  • Adaptive RAG: Systems that learn when and how to retrieve
  • Enterprise AI: RAG as the foundation for company-specific AI assistants
  • Federated RAG: Retrieval across distributed, privacy-sensitive data sources
  • Long-context models: Evolving relationship as context windows expand (GPT-4 Turbo 128k, Claude 200k)

RAG is increasingly the default architecture for production LLM applications requiring factual grounding and source attribution.


15. Suggested Next Steps (Hands-on Mini Projects)

Each step is intentionally small and self-contained. These can each live in their own folder or repository.

Project 1: Basic RAG Pipeline (Local)

Goal: Understand the fundamental RAG flow.

  • Create a small document collection (10-20 text files)
  • Use Sentence Transformers for embeddings
  • Store in Chroma (local vector DB)
  • Build simple retrieve-and-generate function
  • Test with 5-10 queries
  • Output: Working local RAG system

Project 2: Chunking Strategy Comparison

Goal: Learn how chunking affects retrieval quality.

  • Take a 20-page document
  • Implement 3 strategies:
    • Fixed-size chunks (500 tokens, 200 overlap)
    • Sentence-based chunks
    • Semantic chunks (split by topic)
  • Embed and index each separately
  • Compare retrieval quality on test queries
  • Visualize chunk distributions

Project 3: Embedding Model Comparison

Goal: Understand embedding model tradeoffs.

  • Select 3 embedding models:
    • all-MiniLM-L6-v2 (small, fast)
    • all-mpnet-base-v2 (balanced)
    • text-embedding-3-small (OpenAI)
  • Embed the same document collection
  • Compare retrieval quality, speed, cost
  • Document findings and tradeoffs

Project 4: Hybrid Search Implementation

Goal: Combine dense and sparse retrieval.

  • Implement BM25 keyword search
  • Implement dense vector search
  • Create fusion function (RRF or weighted)
  • Compare hybrid vs pure dense vs pure sparse
  • Identify query types that benefit from each

Project 5: Re-ranking Pipeline

Goal: Improve retrieval precision with re-ranking.

  • Retrieve top 50 documents with fast retrieval
  • Implement cross-encoder re-ranking
  • Return top 5 after re-ranking
  • Measure precision improvement
  • Compare latency tradeoff

Project 6: Conversational RAG

Goal: Maintain context across multiple turns.

  • Implement chat history management
  • Build query reformulation based on history
  • Track conversation state
  • Test with multi-turn dialogues
  • Handle follow-up questions correctly

Project 7: RAG Evaluation Framework

Goal: Measure and improve RAG system quality.

  • Create evaluation dataset (queries + ground truth)
  • Implement metrics:
    • Retrieval recall (are relevant docs retrieved?)
    • Answer relevance (does answer address query?)
    • Faithfulness (is answer grounded in context?)
  • Automated evaluation with LLM-as-judge
  • Generate evaluation report

Project 8: Metadata-Enhanced RAG

Goal: Combine semantic and structured search.

  • Add metadata to documents (date, author, category)
  • Implement metadata filtering
  • Build self-querying system (LLM extracts filters)
  • Compare retrieval with and without filters
  • Test on queries requiring both semantic + structured filtering

Project 9: Document Q&A with Citations

Goal: Build production-ready RAG with source attribution.

  • Implement full RAG pipeline
  • Add citation extraction from retrieved chunks
  • Format responses with inline citations
  • Create source reference section
  • Test citation accuracy

Project 10: Advanced RAG Patterns

Goal: Explore cutting-edge techniques.

  • Implement one advanced pattern:
    • Query rewriting (LLM generates better search query)
    • HyDE (hypothetical document embeddings)
    • Multi-query retrieval (generate multiple query variants)
    • Self-RAG (model decides when to retrieve)
  • Compare against baseline naive RAG
  • Document improvements and limitations

Project 11: Production RAG System

Goal: Deploy RAG to production.

  • Choose cloud vector DB (Pinecone/Weaviate)
  • Implement document ingestion pipeline
  • Add caching layer (reduce redundant retrievals)
  • Implement monitoring and logging
  • Add rate limiting and error handling
  • Deploy as REST API
  • Write usage documentation

Project 12: Domain-Specific RAG

Goal: Build RAG for specialized domain.

  • Choose a domain (medical, legal, technical docs)
  • Collect domain-specific documents
  • Fine-tune or select specialized embedding model
  • Customize prompts for domain
  • Evaluate on domain-specific queries
  • Document domain-specific challenges

Strong RAG intuition comes from building systems, measuring retrieval quality, and iterating on retrieval and generation strategies.

Generation Metadata

Created: January 2025
Research Assistant Version: Custom Documentation Agent v1.0
Primary Sources: 25+ academic papers, 15+ technical resources, 8+ production RAG frameworks

Key References:

  • Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks — Lewis et al., Facebook AI (2020) - Original RAG paper
  • Dense Passage Retrieval for Open-Domain Question Answering — Karpukhin et al., Facebook AI (2020) - Foundation for modern retrieval
  • Self-RAG: Learning to Retrieve, Generate, and Critique — Asai et al. (2023) - Advanced adaptive RAG

Research Methodology:

  • Literature review: Comprehensive survey of RAG papers from 2020-2024, covering foundational work through cutting-edge adaptive systems
  • Source verification: Cross-referenced technical implementations from LangChain, LlamaIndex, and Haystack documentation with academic literature
  • Expert consultation: Analyzed production RAG systems from Pinecone, Weaviate, and OpenAI documentation

Framework Coverage:

  • Retrieval Architectures: Dense, sparse, hybrid, re-ranking, and graph-based retrieval strategies
  • Vector Databases: Comprehensive coverage of open-source (Chroma, Weaviate, Qdrant, Milvus) and commercial (Pinecone) solutions
  • Embedding Models: From lightweight Sentence Transformers to commercial OpenAI and Cohere embeddings
  • RAG Patterns: Naive RAG, advanced RAG, modular RAG, agentic RAG, and specialized architectures
  • Production Considerations: Chunking strategies, metadata filtering, evaluation metrics, and deployment patterns

Documentation Standards:

  • Follows the established structure from reinforcement_learning.md and speech_recognition.md
  • Progressive complexity: foundational concepts → methodologies → resources → practical implementation
  • Balanced coverage: 40% theory, 40% practical guidance, 20% resources
  • Emphasis on hands-on learning through 12 progressive projects
  • Comprehensive resource section with 30+ tools, frameworks, and papers

Last Updated: January 2025
Maintainer: Research Assistant Agent

Note: This documentation synthesizes current best practices in RAG system design as of January 2025. The field is rapidly evolving—readers should supplement with latest research and production case studies. All code examples and project specifications are designed to be implementation-agnostic and adaptable to different frameworks.