Commit 520a6c0

committed
Incremental check in
1 parent 8e2cab0 commit 520a6c0

File tree

2 files changed

+57
-28
lines changed


learn-pr/wwl-data-ai/retrieval-augmented-generation-azure-databricks/includes/2-workflow.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -41,7 +41,7 @@ Before a RAG system can find relevant information, it needs to convert all text
4141

4242
An embedding model is a specialized AI tool that converts text into numerical vectors (lists of numbers) that represent the meaning of the text. Think of it as a translator that turns words and sentences into a mathematical language that computers can understand and compare.
4343

44-
Document embedding, as shown in the diagram below, is part of a preparation phase. This is done once to set up a knowledge base. Before your RAG system can work, you need to prepare your documents. An embedding model takes all your text documents and transforms them in to mathematical vectors called embeddings, that capture their semantic meaning. This preprocessing step creates a searchable knowledge base.
44+
Document embedding, as shown in the diagram below, is part of a preparation phase. This is done once to set up a knowledge base. Before your RAG system can work, you need to prepare your documents. An embedding model takes all your text documents and transforms them into mathematical vectors called embeddings that capture their semantic meaning. This preprocessing step creates a searchable knowledge base.
4545

4646
:::image type="content" source="../media/document-embedding.png" alt-text="Diagram of embeddings model converting documents to vectors.":::
4747

Lines changed: 56 additions & 27 deletions
Original file line numberDiff line numberDiff line change
@@ -1,61 +1,90 @@
1-
Before you implement Retrieval Augmented Generation (RAG), it's important to prepare your data. Proper data preparation ensures that your RAG system functions effectively and your Large Language Model (LLM) delivers accurate results. When data is prepared improperly, some potential issues are:
1+
Before you implement Retrieval Augmented Generation (RAG), you need to prepare your data properly. This involves three essential steps: storing your data securely with proper governance, breaking documents into appropriately sized chunks for processing, and converting everything into a searchable format that the system can quickly access when needed.
22

3-
- **Poor quality model output**: If data is inaccurate, incomplete, or biased, the RAG system is more likely to produce misleading or incorrect responses.
4-
- **"Lost in the middle"**: In long context, LLMs tend to overlook the documents placed in the middle. [Explore the research paper on how language models use long context](https://arxiv.org/pdf/2307.03172?azure-portal=true).
5-
- **Inefficient retrieval**: Poorly prepared data would decrease the accuracy and precision of retrieving relevant information from knowledge base.
6-
- **Exposing data**: Poor data governance could lead to exposing data during the retrieval process.
7-
- **Wrong embedding model**: Wrong embedding model would decrease the quality of embeddings and retrieval accuracy.
3+
Proper data preparation ensures that your RAG system functions effectively and your Large Language Model (LLM) delivers accurate results. When data is prepared improperly, some potential issues are:
84

9-
## Explore the data prep process
5+
- **Poor quality responses**: If your data is inaccurate, incomplete, or biased, the RAG system will produce misleading or incorrect responses because it can only work with the information you provide.
106

11-
A simple data prep process with Azure Databricks consists **data storage and governance** (1), **data extraction and chunking** (2), and **embedding your data** (3).
7+
- **Missing relevant information**: Poorly structured data makes it difficult for the system to find the right content when users ask questions, leading to incomplete or irrelevant answers.
8+
9+
- **Security and privacy risks**: Without proper data governance, sensitive information might be exposed during the retrieval process, creating compliance and security issues.
10+
11+
- **Ineffective search**: Using the wrong embedding model or chunking strategy decreases the quality of semantic search, making it harder to find relevant information.
12+
13+
- **"Lost in the middle" problem**: In long contexts, LLMs tend to overlook documents placed in the middle, so how you organize and present information matters for accuracy.
14+
15+
The good news is that proper data preparation solves these problems and sets your RAG system up for success. Let's explore how to do this effectively.
16+
17+
## Explore the data preparation process
18+
19+
There are various approaches to preparing data for RAG. This module covers a methodology that works with Azure Databricks.
20+
21+
The diagram below illustrates this workflow:
22+
1. Data storage and governance establishes the foundation with secure data storage and proper access controls.
23+
2. Data extraction and chunking transforms large documents into smaller, manageable pieces that are optimized for processing and search.
24+
3. Embedding your data converts these text chunks into numerical representations that enable semantic search capabilities.
1225

1326
:::image type="content" source="../media/data-prep-process.png" alt-text="Diagram of the data prep process overview." lightbox="../media/data-prep-process.png":::
1427

1528
Let's explore each of these components in more detail.
1629

1730
### Store and govern your data storage
1831

32+
Before you can search your data, you need to store it securely and control who can access it. This foundational step ensures your RAG system only retrieves authorized data when responding to user queries, maintaining security and compliance.
33+
1934
To store data, you can use **Delta Lake**, which is a unified data management layer. It's the optimized storage layer that provides the foundation for storing data and tables in Azure Databricks.
2035

2136
You can use Delta Lake together with **Unity Catalog**, which provides a unified governance solution for all your data and AI assets, including tables, files, notebooks, and models.
2237

2338
When used together, Unity Catalog and Delta Lake create a robust data management framework. Unity Catalog governs the data stored in Delta Lake, ensuring that access controls and metadata are consistently applied.
2439

25-
### Extract and chunk your data
40+
### Break data into chunks
2641

27-
When you **chunk** your data, you ingest text documents, split the documents up in chunks, and embed the chunks to make them searchable. How you chunk your data depends on your use case.
42+
**Chunking** is the process of breaking large documents into smaller, manageable pieces that can be processed individually. This step is necessary because language models have token limits—they can only process a limited amount of text at once. When someone asks a question, your RAG system retrieves relevant chunks and includes them in the prompt sent to the language model. If your chunks are too large, you'll exceed the model's token limit and won't be able to include all the relevant information.
2843

29-
For example, you need to consider the relationship between the amount of context you want to provide to the prompt and how much context can fit into the model's token limit.
44+
Language models work with tokens—basic units of text that can be words, parts of words, or punctuation. Different models have different token limits: some handle 4,000 tokens, others can process 128,000 tokens or more. The token limit includes everything in your prompt: the user's question, the retrieved chunks, and any instructions for the model.
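To make this budgeting concrete, here's a minimal sketch of how a RAG system might decide how many retrieved chunks fit into a prompt. The ~4-characters-per-token heuristic and the function names are assumptions for illustration only; a real system should count tokens with the target model's own tokenizer.

```python
def estimate_tokens(text):
    # Rough heuristic: ~4 characters per token for English text.
    # Real systems should use the model's own tokenizer instead.
    return max(1, len(text) // 4)

def chunks_that_fit(chunks, question, instructions,
                    token_limit=4000, reserve_for_answer=500):
    """Greedily keep retrieved chunks until the prompt budget is spent.

    The budget is the model's token limit minus room reserved for the
    answer, the user's question, and the system instructions.
    """
    budget = (token_limit - reserve_for_answer
              - estimate_tokens(question) - estimate_tokens(instructions))
    selected = []
    for chunk in chunks:
        cost = estimate_tokens(chunk)
        if cost > budget:
            break  # this chunk would overflow the prompt
        selected.append(chunk)
        budget -= cost
    return selected
```

Chunks are assumed to arrive sorted by relevance, so the most relevant ones are kept when the budget runs out.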
3045

31-
You can chunk your data in two ways:
46+
Without proper chunking, you face two main problems: exceeding token limits and reduced precision. Large documents can exceed the number of tokens the model can process, causing errors or truncation. And even if a document contains the right answer, when it's buried in lots of unrelated text the model may struggle to find and use it effectively, reducing precision.
3247

33-
- **Context-aware**: Divide by sentence, paragraph, or section by using special punctuation. You can also include metadata, tags, or titles.
34-
- **Fixed-size**: Divide by a specific number of tokens. This approach is simple and computationally cheap.
48+
You can chunk your data using two main strategies:
3549

36-
To find the most optimal chunking strategy, you need to experiment with different chunk sizes and approaches.
50+
- **Context-aware chunking**: Divide documents based on their natural structure, such as sentences, paragraphs, or sections. This preserves the logical flow of information but creates variable-sized chunks. You can also include metadata like titles or section headers to provide additional context.
3751

38-
Some chunking approaches you can experiment with are:
52+
- **Fixed-size chunking**: Divide documents into chunks of a predetermined size (for example, 500 tokens each). This approach is simple and computationally efficient, but may split content at awkward places.
3953

40-
- **One chunk consists of one sentence**: Embeddings focus on a specific meaning.
54+
To find the optimal chunking strategy, experiment with different approaches and chunk sizes.
55+
56+
### Common chunking strategies
57+
58+
Here are practical approaches you can experiment with:
59+
60+
- **Sentence-based chunks**: Each chunk contains one or a few sentences. This creates focused embeddings that capture specific meanings, making them effective for precise question-answering. The diagram shows how a document is divided at sentence boundaries, creating small, focused chunks that each contain a complete thought or idea.
4161
:::image type="content" source="../media/chunk-sentence.png" alt-text="Diagram showing a chunk representing a sentence.":::
42-
- **One chunk includes multiple paragraphs**: Embeddings tend to capture a broader theme. For example, you can achieve the latter by splitting a document by headers.
62+
- **Paragraph-based chunks**: Each chunk contains multiple paragraphs or sections. This captures broader themes and provides more context, which works well for complex questions requiring comprehensive explanations. The diagram illustrates how larger sections of text are kept together, preserving the logical flow and complete context of related ideas within each chunk.
4363
:::image type="content" source="../media/chunk-paragraph.png" alt-text="Diagram showing a chunk representing a paragraph.":::
44-
- **Chunks overlap**: To ensure no contextual information is lost between chunks, you can define the amount of overlap between consecutive chunks.
64+
- **Overlapping chunks**: Adjacent chunks share some content (typically 10-20% overlap). This ensures that information spanning chunk boundaries isn't lost, though it does create some redundancy in your data. The diagram demonstrates how consecutive chunks share common text at their boundaries, ensuring that important information that might span multiple chunks isn't accidentally separated.
4565
:::image type="content" source="../media/chunk-overlap.png" alt-text="Diagram showing overlapping chunks.":::
46-
- **Windowed summarization**: Method where each chunk includes a windowed summary of previous chunks.
66+
- **Windowed summarization**: Each chunk includes a summary of previous chunks, providing additional context. This advanced technique can improve coherence across related chunks. The diagram shows how each new chunk contains not only its own content but also a condensed summary of what came before, creating a rolling context window that helps maintain continuity across the entire document.
4767
:::image type="content" source="../media/chunk-window.png" alt-text="Diagram showing windowed summarization.":::
4868

49-
When you experiment with different chunk sizes and approaches, it also helps to know the user's query patterns. When your end-users tend to have long queries, the queries tend to have better aligned embeddings to the returned chunks, while shorter queries can be more precise.
69+
Your chunking strategy should reflect how users will interact with your system.
70+
71+
Smaller chunks work well for short, specific questions because they provide focused, precise answers. Larger chunks that preserve complete explanations and context are often more effective for complex, detailed questions. Consider using overlapping chunks or multiple chunk sizes to handle a mix of specific and detailed questions. The key is to experiment with different approaches and measure which produces the best results for your specific use case and user patterns.
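As a starting point for experimentation, here's a minimal sketch of fixed-size chunking with overlap. It splits on words rather than tokens to stay self-contained; the function name and parameters are illustrative, and a production system would typically count tokens with the embedding model's tokenizer.

```python
def chunk_text(text, chunk_size=100, overlap=20):
    """Split text into fixed-size word chunks, with consecutive
    chunks sharing `overlap` words so information spanning a
    boundary isn't lost."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    words = text.split()
    chunks = []
    step = chunk_size - overlap  # how far each chunk advances
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk reached the end of the document
    return chunks
```

With `chunk_size=4` and `overlap=2`, each chunk repeats the last two words of the previous one, which is the overlapping-chunks strategy described above.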
5072

5173
### Embed your data
74+
After chunking your data, you need to convert your text chunks into **embeddings**—numerical representations that computers can understand and compare. Embeddings are a way to translate human language into mathematical coordinates that capture the meaning and relationships between different pieces of text.
75+
76+
Computers can't directly understand the meaning of words like "dog" or "puppy" or recognize that these words are related. Embeddings solve this problem by converting text into vectors (lists of numbers) where similar concepts are positioned close together in mathematical space. This allows your RAG system to find chunks that are semantically similar to a user's question, even if they don't use the exact same words.
77+
78+
For example, if someone asks "How do I care for a canine?" your system can find relevant chunks about "dog care" because the embeddings recognize that "canine" and "dog" have similar meanings.
79+
80+
When you create embeddings, text chunks and user queries are converted into vectors. Similarity search finds the chunks whose vectors are closest to the query vector. Then the relevant chunks are retrieved and sent to the language model for response generation.
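To illustrate the similarity search step, here's a minimal sketch using cosine similarity over toy vectors. In practice the vectors come from an embedding model and the search runs in a vector index, but the underlying comparison looks like this:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means
    identical direction (similar meaning), 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = (math.sqrt(sum(x * x for x in a))
            * math.sqrt(sum(y * y for y in b)))
    return dot / norm

def most_similar(query_vec, chunk_vecs):
    """Return the index of the chunk vector closest to the query."""
    scores = [cosine_similarity(query_vec, v) for v in chunk_vecs]
    return max(range(len(scores)), key=scores.__getitem__)
```

In the "canine" example above, a query vector near the "dog care" chunk's vector scores highest even though the words differ, because the embedding model placed the two phrases close together in vector space.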
81+
82+
There are various embedding models available, each with different strengths. To choose the best model for your application, consider:
5283

53-
To create numerical representations of your content, you can generate **embeddings**.
84+
- **Data and text properties**: Match the model to your content type. Some models work better with technical documents, others with conversational text. Consider your vocabulary size, domain-specific terms, and typical text length.
5485

55-
There are various embedding models you can use to represent your words with vectors. To choose the most optimal model for your application, you can consider:
86+
- **Model capabilities**: Evaluate whether you need support for multiple languages, different content types (text, images, etc.), or specific embedding dimensions. Larger models often provide better quality but require more computational resources.
5687

57-
- **Data and text properties** like the vocabulary size, domain, and text length.
58-
- **Model capabilities** like whether a model supports multiple languages or modalities, and the supported embedding dimensions and size.
59-
- **Practical considerations** like privacy, cost, and licensing. Also be aware of the context window limitations as many models ignore text beyond their context window limit.
88+
- **Practical considerations**: Consider factors like privacy requirements, cost, and licensing. Also be aware of context window limitations—many models ignore text beyond their maximum input length, which could affect embedding quality for longer chunks.
6089

61-
The recommended approach is to benchmark multiple models on both your queries and documents, and to choose the one that strikes the best balance.
90+
The recommended approach is to benchmark multiple models on both your queries and documents. Test different models and measure which produces the best retrieval results for your specific use case.
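A benchmark of this kind can be as simple as measuring recall@1: the fraction of test queries whose top-ranked chunk is the one a human labeled as relevant. The sketch below assumes `embed` is a placeholder for whichever embedding model you're evaluating; the toy keyword-count embedding in the usage example stands in for a real model.

```python
import math

def recall_at_1(embed, queries, docs, relevant):
    """Fraction of queries whose closest document (by cosine
    similarity of embeddings) matches the labeled relevant index."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)

    doc_vecs = [embed(d) for d in docs]
    hits = 0
    for qi, q in enumerate(queries):
        qv = embed(q)
        best = max(range(len(doc_vecs)), key=lambda i: cos(qv, doc_vecs[i]))
        if best == relevant[qi]:
            hits += 1
    return hits / len(queries)
```

Running this with each candidate model's `embed` function over the same labeled query/document pairs gives a directly comparable score per model.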
