Commit b9b6037

Merge pull request #146 from HeidiSteen/heidist-rag
[azure search] RAG tutorial
2 parents 42bb17c + e37704a commit b9b6037

13 files changed: +1063 −3 lines
(5 binary image files changed: 70.6 KB, 137 KB, 129 KB, 42.9 KB, 114 KB)

articles/search/retrieval-augmented-generation-overview.md

Lines changed: 3 additions & 1 deletion
```diff
@@ -30,7 +30,7 @@ The decision about which information retrieval system to use is critical because
 Azure AI Search is a [proven solution for information retrieval](/azure/developer/python/get-started-app-chat-template?tabs=github-codespaces) in a RAG architecture. It provides indexing and query capabilities, with the infrastructure and security of the Azure cloud. Through code and other components, you can design a comprehensive RAG solution that includes all of the elements for generative AI over your proprietary content.
 
 > [!NOTE]
-> New to copilot and RAG concepts? Watch [Vector search and state of the art retrieval for Generative AI apps](https://ignite.microsoft.com/sessions/18618ca9-0e4d-4f9d-9a28-0bc3ef5cf54e?source=sessions).
+> New to copilot and RAG concepts? Watch [Vector search and state of the art retrieval for Generative AI apps](https://www.youtube.com/watch?v=lSzc1MJktAo).
 
 ## Approaches for RAG with Azure AI Search
 
@@ -222,6 +222,8 @@ A RAG solution that includes Azure AI Search can leverage [built-in data chunkin
 + [Try this RAG quickstart](search-get-started-rag.md) for a demonstration of query integration with chat models over a search index.
+
++ [Tutorial: How to build a RAG solution in Azure AI Search](tutorial-rag-build-solution.md) for focused coverage on the features and pattern for RAG solutions that obtain grounding data from a search index.
 + Start with solution accelerators:
 
 + ["Chat with your data" solution accelerator](https://github.com/Azure-Samples/chat-with-your-data-solution-accelerator) helps you create a custom RAG solution over your content.
```

articles/search/search-get-started-rag.md

Lines changed: 2 additions & 2 deletions
```diff
@@ -276,6 +276,6 @@ When you're working in your own subscription, it's a good idea at the end of a p
 You can find and manage resources in the portal by using the **All resources** or **Resource groups** link in the leftmost pane.
 
-## Next steps
+## See also
 
-As a next step, we recommend that you review the demo code for [Python](https://github.com/Azure/azure-search-vector-samples/tree/main/demo-python), [C#](https://github.com/Azure/azure-search-vector-samples/tree/main/demo-dotnet), or [JavaScript](https://github.com/Azure/azure-search-vector-samples/tree/main/demo-javascript) on the azure-search-vector-samples repository.
+- [Tutorial: How to build a RAG solution in Azure AI Search](tutorial-rag-build-solution.md)
```

articles/search/toc.yml

Lines changed: 13 additions & 0 deletions
```diff
@@ -1,3 +1,4 @@
+items:
 - name: Azure AI Search Documentation
   href: index.yml
 - name: Overview
@@ -89,6 +90,18 @@
       href: search-howto-index-encrypted-blobs.md
     - name: Create a custom analyzer
       href: tutorial-create-custom-analyzer.md
+    - name: RAG tutorials
+      items:
+        - name: Build a RAG solution
+          href: tutorial-rag-build-solution.md
+        - name: Choose models
+          href: tutorial-rag-build-solution-models.md
+        - name: Design an index
+          href: tutorial-rag-build-solution-index-schema.md
+        - name: Build an indexing pipeline
+          href: tutorial-rag-build-solution-pipeline.md
+        - name: Search and generate answers
+          href: tutorial-rag-build-solution-query.md
     - name: Skills tutorials
       items:
       - name: C#
```
Lines changed: 201 additions & 0 deletions
@@ -0,0 +1,201 @@
---
title: 'RAG Tutorial: Design an index'
titleSuffix: Azure AI Search
description: Design an index for RAG patterns in Azure AI Search.

manager: nitinme
author: HeidiSteen
ms.author: heidist
ms.service: cognitive-search
ms.topic: tutorial
ms.date: 09/12/2024

---

# Tutorial: Design an index for RAG in Azure AI Search

An index contains searchable text and vector content, plus configurations. In a RAG pattern that uses a chat model for responses, you want an index that contains chunks of content that can be passed to an LLM at query time.

In this tutorial, you:

> [!div class="checklist"]
> - Learn the characteristics of an index schema built for RAG
> - Create an index that accommodates vectors and hybrid queries
> - Add vector profiles and configurations
> - Add structured data
> - Add filtering

## Prerequisites

[Visual Studio Code](https://code.visualstudio.com/download) with the [Python extension](https://marketplace.visualstudio.com/items?itemName=ms-python.python) and the [Jupyter package](https://pypi.org/project/jupyter/). For more information, see [Python in Visual Studio Code](https://code.visualstudio.com/docs/languages/python).

The output of this exercise is an index definition in JSON. At this point, it isn't uploaded to Azure AI Search, so there are no requirements for cloud services or permissions in this exercise.

## Review schema considerations for RAG

In conversational search, LLMs, not the search engine, compose the response that the user sees, so you don't need to think about which fields to show in your search results, or whether the representations of individual search documents are coherent to the user. Depending on the question, the LLM might return verbatim content from your index or, more likely, repackage the content for a better answer.

### Organized around chunks

When LLMs generate a response, they operate on chunks of content for message inputs. While they need to know where a chunk came from for citation purposes, what matters most is the quality of the message inputs and their relevance to the user's question. Whether the chunks come from one document or a thousand, the LLM ingests the information, or *grounding data*, and formulates the response using instructions provided in a system prompt.

Chunks are the focus of the schema, and each chunk is the defining element of a search document in a RAG pattern. You can think of your index as a large collection of chunks, as opposed to traditional search documents that probably have more structure, such as fields containing uniform content for a name, descriptions, categories, and addresses.
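To make the idea concrete, here's a minimal sketch of fixed-size chunking with overlap, so that text spanning a boundary appears in two adjacent chunks. This is illustrative only; the tutorial's indexing pipeline uses built-in chunking instead, and the sizes here are arbitrary.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping fixed-size chunks (illustrative sketch only)."""
    step = chunk_size - overlap  # each chunk starts 'step' characters after the last
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]
```

Each resulting string would become one search document (one chunk) in the index.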
### Content centricity and structured data

In addition to structural considerations, like chunked content, you also want to consider the substance of your content, because it informs which fields are indexed.

In this tutorial, sample data consists of PDFs and content from the NASA Earth Book. This content is descriptive and informative, with numerous references to geographies, countries, and areas across the world. To capture this information in our index and potentially use it in queries, we include skills in our indexing pipeline that recognize and extract this information, loading it into a searchable and filterable `locations` field. Adding structured content to your index gives you more options for filtering, relevance tuning, and richer answers.

The original ebook is large, over 100 pages and 35 MB in size. We broke it up into smaller PDFs, one per page of text, to stay under the REST API payload limit of 16 MB per API call.

For simplicity, we omit image vectorization for this exercise.

### Parent-child fields in one or two indexes

Chunked content typically derives from a larger document. And although the schema is organized around chunks, you also want to capture properties and content at the parent level. Examples of these properties might include the parent file path, title, authors, publication date, or a summary.

An inflection point in schema design is whether to have two indexes for parent and child/chunked content, or a single index that repeats parent elements for each chunk.

In this tutorial, because all of the chunks of text originate from a single parent (NASA Earth Book), you don't need a separate index dedicated to parent-level fields. However, if you're indexing from multiple parent PDFs, you might want a parent-child index pair to capture level-specific fields and then send [lookup queries](/rest/api/searchservice/documents/get) to the parent index to retrieve the fields relevant to each chunk.

### Checklist of schema considerations

In Azure AI Search, an index that works best for RAG workloads has these qualities:

- Returns chunks that are relevant to the query and readable to the LLM. LLMs can handle a certain level of dirty data in chunks, such as markup, redundancy, and incomplete strings. While chunks need to be readable and relevant to the question, they don't need to be pristine.

- Maintains a parent-child relationship between the chunks of a document and the properties of the parent document, such as the file name, file type, title, and author. To answer a query, chunks could be pulled from anywhere in the index. Association with the parent document that provides the chunk is useful for context, citations, and follow-up queries.

- Accommodates the queries you want to create. You should have fields for vector and hybrid content, and those fields should be attributed to support specific query behaviors. You can only query one index at a time (no joins), so your fields collection should define all of your searchable content.

- Is flat (no complex types or structures). This requirement is specific to the RAG pattern in Azure AI Search.

Although Azure AI Search can't join indexes, you can create indexes that preserve a parent-child relationship, and then use sequential or parallel queries in your search logic to pull from both. This exercise includes templates for parent-child elements in the same index and in separate indexes, where information from the parent index is retrieved using a lookup query.
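As a sketch of the sequential-query approach: query the chunk index first, then issue a lookup query per chunk to fetch parent-level fields for citations. The function below is an illustrative assumption, not tutorial code; in practice the two clients would be `azure.search.documents.SearchClient` instances pointed at hypothetical chunk and parent indexes, and the field names are examples.

```python
def search_with_parent_lookup(chunk_client, parent_client, query_text):
    """Query the chunk (child) index, then look up parent fields per chunk.

    chunk_client/parent_client: SearchClient-like objects for two indexes
    (illustrative; index and field names are assumptions).
    """
    answers = []
    for result in chunk_client.search(search_text=query_text, select=["chunk", "parent_id"], top=5):
        # Lookup query: retrieve the parent document by its key.
        parent = parent_client.get_document(key=result["parent_id"])
        answers.append({"chunk": result["chunk"], "title": parent.get("title")})
    return answers
```

The same shape works for parallel queries: issue both requests concurrently and join the results on `parent_id` in your application code.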
<!-- > [!NOTE]
> Schema design affects storage and costs. This exercise is focused on schema fundamentals. In the [Minimize storage and costs](tutorial-rag-build-solution-minimize-storage.md) tutorial, you revisit schema design to consider narrow data types, attribution, and vector configurations that offer more efficiency. -->

## Create an index for RAG workloads

A minimal index for an LLM is designed to store chunks of content. It typically includes vector fields if you want similarity search for highly relevant results. It also includes nonvector fields for human-readable inputs to the LLM for conversational search. Nonvector chunked content in the search results becomes the grounding data sent to the LLM.

1. Open Visual Studio Code and create a new file. It doesn't have to be a Python file type for this exercise.

1. Here's a minimal index definition for RAG solutions that support vector and hybrid search. Review it for an introduction to the required elements: index name, fields, and a configuration section for vector fields.

    ```json
    {
      "name": "example-minimal-index",
      "fields": [
        { "name": "id", "type": "Edm.String", "key": true },
        { "name": "chunked_content", "type": "Edm.String", "searchable": true, "retrievable": true },
        { "name": "chunked_content_vectorized", "type": "Collection(Edm.Single)", "dimensions": 1536, "vectorSearchProfile": "my-vector-profile", "searchable": true, "retrievable": false, "stored": false },
        { "name": "metadata", "type": "Edm.String", "retrievable": true, "searchable": true, "filterable": true }
      ],
      "vectorSearch": {
        "algorithms": [
          { "name": "my-algo-config", "kind": "hnsw", "hnswParameters": { } }
        ],
        "profiles": [
          { "name": "my-vector-profile", "algorithm": "my-algo-config" }
        ]
      }
    }
    ```

    Fields must include a key field (`"id"`) and should include vector chunks for similarity search and nonvector chunks for inputs to the LLM.

    Vector fields have [specific types](/rest/api/searchservice/supported-data-types#edm-data-types-for-vector-fields) and extra attributes for embedding model dimensions and configuration. `Collection(Edm.Single)` is a data type that works for commonly used embedding models. For more information about vector fields, see [Create a vector index](vector-search-how-to-create-index.md).

    Metadata fields might be file path, creation date, or content type and are useful for [filters](vector-search-filters.md).
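To show where the metadata field pays off at query time, here's a sketch of a hybrid query request body (the JSON you'd POST to the index's `docs/search` REST endpoint) that combines a keyword query, a vector query against `chunked_content_vectorized`, and an OData filter on `metadata`. The helper function and the example filter value are illustrative assumptions, not tutorial code.

```python
def build_hybrid_query(text, query_vector, metadata_filter=None, top=5):
    """Build a REST request body for a hybrid query over the minimal index above.

    Helper and filter value are illustrative sketches, not part of the tutorial.
    """
    body = {
        "search": text,                       # keyword side of the hybrid query
        "select": "chunked_content, metadata",
        "top": top,
        "vectorQueries": [{
            "kind": "vector",
            "vector": query_vector,           # embedding of the user question
            "fields": "chunked_content_vectorized",
            "k": top
        }]
    }
    if metadata_filter:
        body["filter"] = metadata_filter      # OData syntax, e.g. "metadata eq 'pdf'"
    return body
```

Because `metadata` is marked `filterable` in the schema, the filter narrows the candidate set before results are ranked and returned as grounding data.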
1. Here's the index schema for the [tutorial source code](https://github.com/Azure-Samples/azure-search-python-samples/blob/main/Tutorial-RAG/Tutorial-rag.ipynb) and the [Earth Book content](https://github.com/Azure-Samples/azure-search-sample-data/tree/main/nasa-e-book/earth_book_2019_text_pages).

    Like the basic schema, it's organized around chunks. The `chunk_id` uniquely identifies each chunk. The `text_vector` field is an embedding of the chunk. The nonvector `chunk` field is a readable string. The `title` maps to a unique metadata storage path for the blobs. The `parent_id` is the only parent-level field, and it's a base64-encoded version of the parent file URI.

    The schema also includes a `locations` field for storing generated content that's created by the [indexing pipeline](tutorial-rag-build-solution-pipeline.md).

    ```python
    index_name = "py-rag-tutorial-idx"
    index_client = SearchIndexClient(endpoint=AZURE_SEARCH_SERVICE, credential=AZURE_SEARCH_CREDENTIAL)
    fields = [
        SearchField(name="parent_id", type=SearchFieldDataType.String),
        SearchField(name="title", type=SearchFieldDataType.String),
        SearchField(name="locations", type=SearchFieldDataType.Collection(SearchFieldDataType.String), filterable=True),
        SearchField(name="chunk_id", type=SearchFieldDataType.String, key=True, sortable=True, filterable=True, facetable=True, analyzer_name="keyword"),
        SearchField(name="chunk", type=SearchFieldDataType.String, sortable=False, filterable=False, facetable=False),
        SearchField(name="text_vector", type=SearchFieldDataType.Collection(SearchFieldDataType.Single), vector_search_dimensions=1536, vector_search_profile_name="myHnswProfile")
    ]

    # Configure the vector search configuration
    vector_search = VectorSearch(
        algorithms=[
            HnswAlgorithmConfiguration(name="myHnsw"),
        ],
        profiles=[
            VectorSearchProfile(
                name="myHnswProfile",
                algorithm_configuration_name="myHnsw",
                vectorizer="myOpenAI",
            )
        ],
        vectorizers=[
            AzureOpenAIVectorizer(
                name="myOpenAI",
                kind="azureOpenAI",
                azure_open_ai_parameters=AzureOpenAIParameters(
                    resource_uri=AZURE_OPENAI_ACCOUNT,
                    deployment_id="text-embedding-ada-002",
                    model_name="text-embedding-ada-002"
                ),
            ),
        ],
    )

    # Create the search index
    index = SearchIndex(name=index_name, fields=fields, vector_search=vector_search)
    result = index_client.create_or_update_index(index)
    print(f"{result.name} created")
    ```

1. For an index schema that more closely mimics structured content, you would have separate indexes for the parent and child (chunked) fields. You would need index projections to coordinate the indexing of the two indexes simultaneously. Queries execute against the child index. Query logic includes a lookup query that uses the `parent_id` to retrieve content from the parent index.

    Fields in the child index:

    - ID
    - chunk
    - chunkVector
    - parent_id

    Fields in the parent index (everything that you want "one of"):

    - parent_id
    - parent-level fields (name, title, category)
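To make the split concrete, here's a sketch of the paired index definitions as REST-style JSON expressed as Python dicts. Field names follow the lists above, but the index names and attribute assignments are illustrative assumptions, not the tutorial's schema.

```python
# Sketch of paired parent-child index definitions (illustrative assumptions).
child_index = {
    "name": "example-chunk-index",
    "fields": [
        {"name": "id", "type": "Edm.String", "key": True},
        {"name": "chunk", "type": "Edm.String", "searchable": True},
        {"name": "chunkVector", "type": "Collection(Edm.Single)", "dimensions": 1536,
         "vectorSearchProfile": "my-vector-profile", "searchable": True, "retrievable": False},
        {"name": "parent_id", "type": "Edm.String", "filterable": True}  # join key for lookups
    ]
}

parent_index = {
    "name": "example-parent-index",
    "fields": [
        {"name": "parent_id", "type": "Edm.String", "key": True},  # lookup target
        {"name": "name", "type": "Edm.String"},
        {"name": "title", "type": "Edm.String"},
        {"name": "category", "type": "Edm.String", "filterable": True}
    ]
}
```

Queries run against `child_index`; each result's `parent_id` is the key for a lookup query against `parent_index` to fetch the "one of" fields.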
<!-- Objective:

- Design an index schema that generates results in a format that works for LLMs.

Key points:

- schema for rag is designed for producing chunks of content
- schema should be flat (no complex types or structures)
- schema determines what queries you can create (be generous in attribute assignments)
- schema must cover all the queries you want to run. You can only query one index at a time (no joins), but you can create indexes that preserve parent-child relationship, and then use nested queries or parallel queries in your search logic to pull from both.
- schema has impact on storage/size. Consider narrow data types, attribution, vector configuration.
- show schema patterns: one for parent-child all-up, one for paired indexes via index projections
- note metadata for filters
- TBD: add fields for location and use entity recognition to pull these values out of the PDFs? Not sure how the extraction will work on chunked documents or how it will query, but the goal would be to show that you can add structured data to the schema.

Tasks:

- H2 How to create an index for chunked and vectorized data (show examples for parent-child variants)
- H2 How to define vector profiles and configuration (discuss pros and cons, shouldn't be a rehash of existing how-to)
- H2 How to add filters
- H2 How to add structured data (example is "location", top-level field, data acquisition is through the pipeline) -->

## Next step

> [!div class="nextstepaction"]
> [Create an indexing pipeline](tutorial-rag-build-solution-pipeline.md)
