Prerequisite: ../03_Engineering/04_RAG/.
This document focuses on architectural decisions when building RAG systems for specific domains. For RAG fundamentals, see 03_Engineering/04_RAG. Here we address: given a domain requirement, how do you design the retrieval pipeline?
The term "RAG" covers a wide spectrum of architectures. Choosing the right variant is the first design decision.
```
Simple ──────────────────────────────────────────────────────► Complex

Naive RAG            Advanced RAG        Modular RAG          Agentic RAG
(retrieve→generate)  (query rewrite,     (routing, multi-     (autonomous
                     re-rank, hybrid)    source, iterative)   tool use)
```
| Variant | Description | When to Use |
|---|---|---|
| Naive RAG | Single retrieval step, top-k chunks → LLM | Prototyping, simple Q&A over small corpus |
| Advanced RAG | Query rewriting, hybrid search, re-ranking | Production Q&A systems, most domain applications |
| Modular RAG | Multiple retrievers, routing logic, iterative retrieval | Multi-source knowledge bases, complex queries |
| Agentic RAG | LLM decides when/what/how to retrieve | Open-ended research tasks, multi-step reasoning |
Default recommendation: start with Advanced RAG. It covers roughly 80% of domain use cases with manageable complexity.
Chunking is the most underrated decision in RAG design. Bad chunking ruins everything downstream.
| Strategy | How It Works | Pros | Cons | Best For |
|---|---|---|---|---|
| Fixed-size | Split every N tokens with overlap | Simple, predictable | Breaks mid-sentence, mid-paragraph | Homogeneous text (news, articles) |
| Recursive | Split by paragraph → sentence → token | Respects natural boundaries | Uneven chunk sizes | General-purpose, good default |
| Semantic | Use embedding similarity to find topic boundaries | Coherent chunks | Slower, model-dependent | Long documents with topic shifts |
| Document-structure | Split by headings, sections, tables | Preserves logical units | Requires structured input | Technical documents, specifications |
| Hierarchical | Parent chunks (large) + child chunks (small) | Multi-granularity retrieval | Complex indexing | When both overview and detail matter |
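The recursive strategy is the recommended default above, so it is worth seeing its shape. Below is a minimal sketch, assuming plain-text input and approximating token counts by whitespace-separated words; production splitters (e.g., in common RAG frameworks) use real tokenizers and more separators.

```python
def recursive_chunks(text, max_tokens=512, seps=("\n\n", ". ", " ")):
    """Recursively split text, preferring the coarsest separator that works.

    Tries paragraph breaks first, then sentences, then single words.
    Token counts are approximated by whitespace words for illustration.
    """
    if len(text.split()) <= max_tokens:
        return [text] if text.strip() else []
    sep = seps[0]
    parts = text.split(sep) if sep in text else [text]
    if len(parts) == 1:  # separator absent: fall through to the next, finer one
        return recursive_chunks(text, max_tokens, seps[1:]) if len(seps) > 1 else [text]
    chunks, buf = [], ""
    for part in parts:
        candidate = (buf + sep + part) if buf else part
        if len(candidate.split()) <= max_tokens:
            buf = candidate  # still fits: keep accumulating
        else:
            if buf:
                chunks.append(buf)
            if len(part.split()) > max_tokens and len(seps) > 1:
                # the part alone is too big: recurse with finer separators
                chunks.extend(recursive_chunks(part, max_tokens, seps[1:]))
                buf = ""
            else:
                buf = part
    if buf:
        chunks.append(buf)
    return chunks
```

Note how chunk sizes come out uneven (the "cons" column above): the splitter never breaks a paragraph or sentence unless it must.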
For technical/engineering domains:
- Tables: Extract tables as separate chunks with their captions. Tables in construction specs contain critical data that gets destroyed by naive text splitting.
- Figures/diagrams: Store figure captions and surrounding text as chunks. Reference the figure ID for traceability.
- Cross-references: "See Section 3.2" is meaningless in a chunk. Either resolve references during chunking or store section metadata.
- Numbered lists/procedures: Keep procedural steps together. Splitting step 3 from step 4 of a safety procedure is dangerous.
| Use Case | Recommended Size | Reasoning |
|---|---|---|
| Factual Q&A | 256-512 tokens | Small chunks = precise retrieval |
| Analytical questions | 512-1024 tokens | Need more context for reasoning |
| Document summarization | 1024-2048 tokens | Larger context per chunk |
| Code retrieval | Function/class level | Semantic units, not token counts |
Overlap: 10-20% of chunk size, so that information near a boundary does not fall into the gap between adjacent chunks.
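As a concrete instance of the overlap rule, here is a fixed-size windowed chunker with 15% overlap (mid-range of the 10-20% guideline), operating on an already-tokenized sequence:

```python
def fixed_chunks(tokens, size=512, overlap_frac=0.15):
    """Fixed-size windows with fractional overlap between neighbours.

    With size=512 and overlap_frac=0.15, consecutive windows share
    roughly 77 tokens, so no sentence is lost at a boundary.
    """
    if not tokens:
        return []
    step = max(1, int(size * (1 - overlap_frac)))  # stride between window starts
    return [tokens[i:i + size]
            for i in range(0, max(len(tokens) - size, 0) + step, step)]
```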
| Model Category | Examples | Strengths | Weaknesses |
|---|---|---|---|
| General-purpose | BGE-large, E5-large, GTE | Good baseline, multilingual | May miss domain-specific semantics |
| Instruction-tuned | BGE-M3, E5-mistral | Better at asymmetric queries | Slightly slower |
| Domain-fine-tuned | Custom trained on domain pairs | Best domain relevance | Requires training data and effort |
Decision path:
- Start with a strong general-purpose model (BGE-large-zh-v1.5 for Chinese, BGE-M3 for multilingual)
- Evaluate retrieval quality on domain queries
- If recall@10 < 80%, consider fine-tuning the embedding model on domain query-document pairs
- Fine-tuning data: 1K-10K (query, positive_doc, negative_doc) triples
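For reference, fine-tuning triples are typically stored one per line as JSONL. The field names below follow the convention used by FlagEmbedding-style trainers (query plus lists of positives and negatives), but check the exact schema of whichever trainer you use; the example documents are hypothetical:

```python
import json

# One (query, positive, negative) training triple, serialized as a JSONL line.
triple = {
    "query": "concrete curing time requirements",
    "pos": ["Section 7.4: Concrete shall be cured for a minimum of 7 days ..."],
    "neg": ["Section 2.1: Scope of works covers earthworks and drainage ..."],
}
line = json.dumps(triple, ensure_ascii=False)
```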
Dense retrieval (embeddings) captures semantic similarity but misses exact keyword matches. Sparse retrieval (BM25) captures exact terms but misses paraphrases. Combining both is almost always better.
```
User Query ─┬─► Dense Retrieval (vector similarity) → Results A
            └─► Sparse Retrieval (BM25 keyword)     → Results B
                          ↓
              Reciprocal Rank Fusion (RRF)
                          ↓
         Merged Results → Re-ranker → Top-k → LLM
```
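The RRF step in the diagram is simple enough to show in full: each document scores the sum of 1/(k + rank) over the ranked lists it appears in, with k=60 being the constant from the original RRF paper.

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank).

    `rankings` is a list of ranked doc-id lists (e.g., [dense_ids, bm25_ids]).
    Returns doc ids sorted by fused score, best first.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Documents ranked highly by both retrievers float to the top; documents found by only one retriever still survive, which is exactly the behaviour hybrid search needs.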
Why hybrid matters for domain applications:
- Domain terminology is often precise. "FIDIC Red Book" must match exactly — semantic similarity alone might retrieve "construction contract templates" instead.
- Abbreviations and codes (e.g., "GB 50300-2013") are essentially keywords. BM25 handles these perfectly; embeddings struggle.
The retriever casts a wide net (top-50 to top-100). The re-ranker narrows it down (top-3 to top-5) with higher precision.
| Re-ranker Type | Examples | Latency | Quality |
|---|---|---|---|
| Cross-encoder | BGE-reranker, Cohere Rerank | 50-200ms | High |
| LLM-based | GPT-4 as judge | 500ms-2s | Highest |
| Lightweight | ColBERT, late interaction | 10-50ms | Medium-high |
Default: Cross-encoder re-ranker. Best quality-latency trade-off for most applications.
Raw user queries are often poor retrieval queries. Processing them before retrieval significantly improves results.
| Technique | What It Does | When to Use |
|---|---|---|
| Query rewriting | LLM rephrases query for better retrieval | Always (low cost, high impact) |
| HyDE | Generate hypothetical answer, use it as query | When queries are short/vague |
| Query decomposition | Split complex query into sub-queries | Multi-aspect questions |
| Query expansion | Add synonyms, related terms | Domain with rich terminology |
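Query decomposition can be sketched with a single LLM call. The `llm` argument here is a stand-in for any prompt-to-text callable (your client of choice); the prompt wording is illustrative, not prescriptive:

```python
def decompose(query, llm):
    """Split a multi-aspect question into independent sub-queries via an LLM.

    `llm` is any callable mapping a prompt string to a text completion.
    """
    prompt = (
        "Split the question into self-contained sub-questions, one per line. "
        "If it is already a single question, return it unchanged.\n"
        f"Question: {query}"
    )
    return [line.strip("- ").strip()
            for line in llm(prompt).splitlines() if line.strip()]
```

Each sub-query is then retrieved independently and the results merged (e.g., with RRF) before generation.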
```
[System instruction: role, constraints, output format]

[Retrieved context]
Document 1: {title} ({source}, {date})
{content}

Document 2: {title} ({source}, {date})
{content}

[User question]

[Output instructions: cite sources, admit uncertainty, format requirements]
```
Key design decisions:
- Context ordering: Most relevant first? Chronological? By source type? Research on the "lost in the middle" effect shows models attend most to the beginning and end of long contexts.
- Source attribution: Include source metadata so the model can cite. "According to [Document Title, Section X]..."
- Uncertainty handling: Explicitly instruct: "If the provided documents do not contain sufficient information to answer, say so. Do not fabricate."
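A minimal assembler for this skeleton, assuming each retrieved document is a dict with `title`, `source`, `date`, and `content` fields (the exact schema and wording are yours to adapt):

```python
def build_prompt(question, docs,
                 system="You are a domain assistant. Answer only from the provided documents."):
    """Assemble system instruction, attributed context, question, and output rules."""
    context = "\n\n".join(
        f"Document {i}: {d['title']} ({d['source']}, {d['date']})\n{d['content']}"
        for i, d in enumerate(docs, start=1)
    )
    rules = ("Cite sources as [Document Title, Section]. If the documents do not "
             "contain sufficient information to answer, say so. Do not fabricate.")
    return f"{system}\n\n{context}\n\nQuestion: {question}\n\n{rules}"
```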
When retrieved content exceeds the context window:
- Truncation: Simple but loses information. Only acceptable for naive RAG.
- Map-reduce: Summarize each chunk independently, then synthesize. Good for analytical questions.
- Iterative refinement: Process chunks sequentially, refining the answer. Good for comprehensive answers.
- Selective inclusion: Use the re-ranker score to decide how many chunks to include. Dynamic context sizing.
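Selective inclusion is easy to implement once re-ranker scores are available. A sketch, approximating token counts by whitespace words and using an illustrative score threshold:

```python
def select_chunks(scored_chunks, budget_tokens=3000, min_score=0.2):
    """Dynamic context sizing: take chunks in descending re-ranker score
    until the token budget is exhausted; stop at low-confidence chunks.

    `scored_chunks` is a list of (score, chunk_text) pairs.
    """
    picked, used = [], 0
    for score, chunk in sorted(scored_chunks, reverse=True):
        if score < min_score:
            break  # everything below this is noise; stop early
        cost = len(chunk.split())  # crude token estimate
        if used + cost <= budget_tokens:
            picked.append(chunk)
            used += cost
    return picked
```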
Domain applications often need to query multiple knowledge sources and handle complex, multi-step reasoning.
| Pattern | Control Flow | Key Capability | Use Case |
|---|---|---|---|
| Modular RAG | Linear/DAG | Pre-defined routing between sources | Customer support across docs + orders |
| Self-RAG | Loop | Model critiques its own retrieval and relevance | Fact-critical technical analysis |
| Agentic RAG | Dynamic Graph | Model plans tool calls (search, SQL, calculate) | Financial research, comparative analysis |
| GraphRAG | Map-Reduce | Global summarization over entity clusters | "What are the common risks in all projects from 2023?" |
```mermaid
graph TD
    A[User Query] --> B{Need Search?}
    B -- Yes --> C[Query Rewriter]
    C --> D[Multi-Source Retriever]
    D --> E[Relevance Grader]
    E -- Irrelevant --> C
    E -- Relevant --> F{Need SQL/API?}
    F -- Yes --> G[Tool Executor]
    G --> H[Final Generator]
    F -- No --> H
    B -- No --> H
```
Implementation Components:
- Query Decomposer: Splits complex questions into sub-questions (e.g., "Compare Project A and B" → "Retrieve A", "Retrieve B").
- State Manager: Keeps track of what has been found and what is missing.
- Corrective RAG (CRAG): If the local knowledge base fails, automatically triggers a web search (e.g., Tavily or Google Search) as a fallback.
- Self-Correction: If the generated answer is not grounded in the context, the model re-retrieves or re-writes.
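The CRAG-style control flow above reduces to a small loop. In this sketch every argument except `query` is a caller-supplied callable (stand-ins for your retriever, grader, web-search client, and generator), so the skeleton stays independent of any particular framework:

```python
def corrective_rag(query, retrieve, grade, web_search, generate, max_rounds=2):
    """CRAG-style loop: retrieve locally, grade relevance, and fall back to
    web search when nothing relevant is found."""
    docs = retrieve(query)
    for _ in range(max_rounds):
        relevant = [d for d in docs if grade(query, d)]
        if relevant:
            return generate(query, relevant)
        docs = web_search(query)  # fallback source for the next round
    return generate(query, docs)  # best effort with whatever we have
```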
| Metric | What It Measures | Target |
|---|---|---|
| Recall@k | % of relevant docs in top-k results | > 80% at k=10 |
| MRR | Rank of first relevant result | > 0.7 |
| NDCG@k | Ranking quality considering relevance grades | > 0.6 |
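Recall@k and MRR are cheap to compute offline against a labeled query set; a minimal implementation of both:

```python
def recall_at_k(retrieved, relevant, k=10):
    """Fraction of relevant doc ids that appear in the top-k retrieved list."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def mrr(queries):
    """Mean reciprocal rank of the first relevant hit per query.

    `queries` is a list of (retrieved_ids, relevant_ids) pairs;
    queries with no relevant hit contribute 0.
    """
    total = 0.0
    for retrieved, relevant in queries:
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(queries)
```

Run these against a few hundred labeled domain queries before and after each pipeline change; the "recall@10 < 80%" fine-tuning trigger mentioned earlier uses exactly this measurement.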
| Metric | What It Measures | How to Evaluate |
|---|---|---|
| Faithfulness | Does the answer stick to retrieved context? | LLM-as-judge or human review |
| Relevance | Does the answer address the question? | LLM-as-judge or human review |
| Completeness | Are all aspects of the question covered? | Human review |
| Hallucination rate | % of claims not supported by context | Automated claim verification |
| Metric | What It Measures |
|---|---|
| Answer accuracy | Compared against gold-standard answers |
| User satisfaction | Thumbs up/down, explicit feedback |
| Task completion rate | Did the user get what they needed? |
| Failure | Symptom | Root Cause | Mitigation |
|---|---|---|---|
| Retrieval miss | Correct answer exists but wasn't retrieved | Poor embedding, wrong chunk size | Hybrid search, query rewriting, chunk optimization |
| Context poisoning | Irrelevant chunks mislead the model | Low retrieval precision | Re-ranking, stricter top-k cutoff |
| Lost in the middle | Model ignores relevant context in the middle | LLM attention bias | Reorder context, use map-reduce |
| Hallucination despite context | Model generates plausible but unsupported claims | Weak grounding instruction | Stronger system prompt, citation requirement |
| Stale knowledge | Answer based on outdated information | Index not updated | Automated ingestion pipeline, metadata filtering by date |
| Cross-document contradiction | Sources disagree, model picks one arbitrarily | No conflict resolution logic | Surface contradictions explicitly, let user decide |
- Lewis et al. (2020): Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.
- Gao et al. (2024): Retrieval-Augmented Generation for Large Language Models: A Survey.