Commit 0c0d9e8 ("checkpoint")
1 parent 9756ee8

5 files changed: +349 -39 lines changed

articles/search/tutorial-rag-build-solution-index-schema.md

Lines changed: 81 additions & 33 deletions
@@ -12,7 +12,7 @@ ms.date: 09/12/2024
---

-# Tutorial: Design an index (RAG in Azure AI Search)
+# Tutorial: Design an index for RAG in Azure AI Search

An index contains searchable text and vector content, plus configurations. In a RAG pattern that uses a chat model for responses, you want an index that contains chunks of content that can be passed to an LLM at query time.

@@ -35,29 +35,29 @@ The output of this exercise is an index definition in JSON. At this point, it's

In conversational search, LLMs compose the response that the user sees, not the search engine, so you don't need to think about which fields to show in your search results, or whether the representations of individual search documents are coherent to the user. Depending on the question, the LLM might return verbatim content from your index or, more likely, repackage the content for a better answer.

-### Focus on chunks
+### Organized around chunks

When LLMs generate a response, they operate on chunks of content for message inputs. Although they need to know where a chunk came from for citation purposes, what matters most is the quality of the message inputs and their relevance to the user's question. Whether the chunks come from one document or a thousand, the LLM ingests the information, or *grounding data*, and formulates the response using instructions provided in a system prompt.

Chunks are the focus of the schema, and each chunk is the defining element of a search document in a RAG pattern. You can think of your index as a large collection of chunks, as opposed to traditional search documents that probably have more structure, such as fields containing uniform content for a name, descriptions, categories, and addresses.

-### Focus on content
+### Content-aware

In addition to structural considerations, like chunked content, you also want to consider the substance of your content, because it informs which fields are indexed.

-In this tutorial, we use PDFs and content from the NASA Earth Book. This content is descriptive and informative, with numerous references to geographies, countries, and areas across the world. To capture this information in our index and potentially use it in queries, we can include skills in our indexing pipeline that recognize and extract this information, loading it into a searchable and filterable `locations` field.
+In this tutorial, sample data consists of PDFs and content from the NASA Earth Book. This content is descriptive and informative, with numerous references to geographies, countries, and areas across the world. To capture this information in our index and potentially use it in queries, we include skills in our indexing pipeline that recognize and extract this information, loading it into a searchable and filterable `locations` field.

The original ebook is large, over 100 pages and 35 MB in size. We broke it up into smaller PDFs, one per page of text, to stay under the REST API payload limit of 16 MB per API call.

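The 16 MB payload limit mentioned above can also be respected programmatically when you push documents yourself. The following helper is an illustrative sketch, not part of the tutorial's code: it measures each document's serialized JSON size and starts a new batch before the running total would exceed the limit.

```python
import json

# REST API payload limit per indexing request, as noted above.
MAX_PAYLOAD_BYTES = 16 * 1024 * 1024

def batch_documents(docs, limit=MAX_PAYLOAD_BYTES):
    """Yield lists of documents whose serialized JSON stays under the limit.

    Hypothetical helper for illustration: measures each document's JSON
    size and starts a new batch before the running total exceeds `limit`.
    """
    batch, size = [], 0
    for doc in docs:
        doc_size = len(json.dumps(doc).encode("utf-8"))
        if batch and size + doc_size > limit:
            yield batch
            batch, size = [], 0
        batch.append(doc)
        size += doc_size
    if batch:
        yield batch

# Example: three ~6 MB chunks produce two requests instead of one oversized call.
docs = [{"id": str(i), "chunk": "x" * (6 * 1024 * 1024)} for i in range(3)]
batches = list(batch_documents(docs))
print([len(b) for b in batches])  # [2, 1]
```

The same idea applies whether you split source PDFs up front, as this tutorial does, or batch documents at upload time.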
For simplicity, we omit image vectorization for this exercise.

-### Focus on parent-child indexes
+### Parent-child fields in one or two indexes

-Chunked content typically derives from a larger document. And although the schema is organized around chunks, you also want to capture properties and content at the parent level. Examples of these properties might include the parent file path, title, authors, publication date, summary.
+Chunked content typically derives from a larger document. And although the schema is organized around chunks, you also want to capture properties and content at the parent level. Examples of these properties might include the parent file path, title, authors, publication date, or a summary.

An inflection point in schema design is whether to have two indexes for parent and child/chunked content, or a single index that repeats parent elements for each chunk.

-In this tutorial, because all of the chunks of text originate from a single parent (NASA Earth Book), you don't need a separate index dedicated to up level parent fields. If you index from multiple parent PDFs, you might want a parent-child index pair to capture level-specific fields and then send lookup queries to the parent index to retrieve those fields relevant to each chunk. We include an example of that parent-child index template in this exercise for comparison.
+In this tutorial, because all of the chunks of text originate from a single parent (the NASA Earth Book), you don't need a separate index dedicated to parent-level fields. However, if you index from multiple parent PDFs, you might want a parent-child index pair to capture level-specific fields and then send lookup queries to the parent index to retrieve the fields relevant to each chunk.

### Checklist of schema considerations

@@ -76,13 +76,13 @@ Although Azure AI Search can't join indexes, you can create indexes that preserv
> [!NOTE]
> Schema design affects storage and costs. This exercise is focused on schema fundamentals. In the [Minimize storage and costs](tutorial-rag-build-solution-minimize-storage.md) tutorial, you revisit schema design to consider narrow data types, attributes, and vector configurations that are more efficient.

-## Create a basic index
+## Create an index for RAG workloads

-A minimal index for LLM is designed to store chunks of content. It includes vector fields if you want similarity search for highly relevant results, and nonvector fields for human-readable inputs to the LLM for conversational search. Nonvector chunked content in the search results becomes the grounding data sent to the LLM.
+A minimal index for an LLM is designed to store chunks of content. It typically includes vector fields if you want similarity search for highly relevant results. It also includes nonvector fields for human-readable inputs to the LLM for conversational search. Nonvector chunked content in the search results becomes the grounding data sent to the LLM.

1. Open Visual Studio Code and create a new file. It doesn't have to be a Python file type for this exercise.

-1. Here's a minimal index definition for RAG solutions that support vector and hybrid search. Review it for an introduction to required elements: name, fields, and a `vectorSearch` configuration for the vector fields.
+1. Here's a minimal index definition for RAG solutions that support vector and hybrid search. Review it for an introduction to required elements: index name, fields, and a configuration section for vector fields.

   ```json
   {
@@ -91,7 +91,7 @@ A minimal index for LLM is designed to store chunks of content. It includes vect
     "fields": [
       { "name": "id", "type": "Edm.String", "key": true },
       { "name": "chunked_content", "type": "Edm.String", "searchable": true, "retrievable": true },
       { "name": "chunked_content_vectorized", "type": "Collection(Edm.Single)", "dimensions": 1536, "vectorSearchProfile": "my-vector-profile", "searchable": true, "retrievable": false, "stored": false },
-      { "name": "metadata", "type": "Edm.String", "retrievable": true, "searchable": true }
+      { "name": "metadata", "type": "Edm.String", "retrievable": true, "searchable": true, "filterable": true }
     ],
     "vectorSearch": {
       "algorithms": [
@@ -104,13 +104,62 @@ A minimal index for LLM is designed to store chunks of content. It includes vect
     }
   }
   ```

-   Fields must include key field (`"id"`) and should include vector chunks for similarity search, and nonvector chunks for the LLM. Metadata about the source file might be file path, creation date, or content type.
-
-   Vector fields have [specific types](/rest/api/searchservice/supported-data-types#edm-data-types-for-vector-fields) and extra attributes for embedding model dimensions and configuration. `Edm.Single` is a data type that works for the more commonly used LLMs. For more information about vector fields, see [Create a vector index](vector-search-how-to-create-index.md).
+   Fields must include a key field (`"id"`) and should include vector chunks for similarity search and nonvector chunks for inputs to the LLM.
+
+   Vector fields have [specific types](/rest/api/searchservice/supported-data-types#edm-data-types-for-vector-fields) and extra attributes for embedding model dimensions and configuration. `Collection(Edm.Single)` is a data type that works for commonly used LLMs. For more information about vector fields, see [Create a vector index](vector-search-how-to-create-index.md).
+
+   Metadata fields might be file path, creation date, or content type, and they're useful for [filters](vector-search-filters.md).
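As a hedged illustration of how filterable fields are used at query time, the following sketch builds OData `$filter` expressions. The field names (`metadata`, `locations`) come from this tutorial's schemas; the helper functions and sample values are invented for the example.

```python
# Illustrative helpers (not part of the tutorial's code) that build OData
# filter expressions for filterable fields in the schemas shown here.

def equality_filter(field: str, value: str) -> str:
    """Exact match on a filterable string field."""
    escaped = value.replace("'", "''")  # escape single quotes per OData rules
    return f"{field} eq '{escaped}'"

def collection_filter(field: str, values: list[str]) -> str:
    """Match any element of a string collection using search.in."""
    joined = ", ".join(values)
    return f"{field}/any(item: search.in(item, '{joined}'))"

print(equality_filter("metadata", "application/pdf"))
# metadata eq 'application/pdf'
print(collection_filter("locations", ["Arctic", "Greenland"]))
# locations/any(item: search.in(item, 'Arctic, Greenland'))
```

You would pass expressions like these as the `filter` parameter of a search request.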
+1. Here's the index schema for the [tutorial source code](https://github.com/Azure-Samples/azure-search-python-samples/blob/main/Tutorial-RAG/Tutorial-rag.ipynb) and the [Earth Book content](https://github.com/Azure-Samples/azure-search-sample-data/tree/main/nasa-e-book/earth_book_2019_text_pages).
+
+   Like the basic schema, it's organized around chunks. The `chunk_id` uniquely identifies each chunk. The `text_vector` field is an embedding of the chunk. The nonvector `chunk` field is a readable string. The `title` maps to the unique metadata storage path for each blob. The `parent_id` is the only parent-level field, and it's a base64-encoded version of the parent file URI.
+
+   The schema also includes a `locations` field for storing generated content that's created by the [indexing pipeline](tutorial-rag-build-solution-pipeline.md).

   ```python
   index_name = "py-rag-tutorial-idx"
   index_client = SearchIndexClient(endpoint=AZURE_SEARCH_SERVICE, credential=AZURE_SEARCH_CREDENTIAL)
   fields = [
       SearchField(name="parent_id", type=SearchFieldDataType.String),
       SearchField(name="title", type=SearchFieldDataType.String),
       SearchField(name="locations", type=SearchFieldDataType.Collection(SearchFieldDataType.String), filterable=True),
       SearchField(name="chunk_id", type=SearchFieldDataType.String, key=True, sortable=True, filterable=True, facetable=True, analyzer_name="keyword"),
       SearchField(name="chunk", type=SearchFieldDataType.String, sortable=False, filterable=False, facetable=False),
       SearchField(name="text_vector", type=SearchFieldDataType.Collection(SearchFieldDataType.Single), vector_search_dimensions=1536, vector_search_profile_name="myHnswProfile")
   ]

   # Configure the vector search algorithms and profiles
   vector_search = VectorSearch(
       algorithms=[
           HnswAlgorithmConfiguration(name="myHnsw"),
       ],
       profiles=[
           VectorSearchProfile(
               name="myHnswProfile",
               algorithm_configuration_name="myHnsw",
               vectorizer="myOpenAI",
           )
       ],
       vectorizers=[
           AzureOpenAIVectorizer(
               name="myOpenAI",
               kind="azureOpenAI",
               azure_open_ai_parameters=AzureOpenAIParameters(
                   resource_uri=AZURE_OPENAI_ACCOUNT,
                   deployment_id="text-embedding-ada-002",
                   model_name="text-embedding-ada-002"
               ),
           ),
       ],
   )

   # Create the search index
   index = SearchIndex(name=index_name, fields=fields, vector_search=vector_search)
   result = index_client.create_or_update_index(index)
   print(f"{result.name} created")
   ```
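The base64-encoded `parent_id` described above can be illustrated with standard-library Python. This is a sketch of the general idea, assuming URL-safe Base64; the exact encoding applied by an indexer's field-mapping function may differ, and the sample URI is invented.

```python
import base64

def encode_parent_id(uri: str) -> str:
    """Encode a parent file URI as URL-safe Base64 so it can serve as a key.

    Sketch only: indexers usually configure this through a base64
    field-mapping function, whose exact variant (padding, alphabet)
    may differ from this standard-library encoding.
    """
    return base64.urlsafe_b64encode(uri.encode("utf-8")).decode("ascii")

def decode_parent_id(parent_id: str) -> str:
    """Recover the original parent URI from an encoded parent_id."""
    return base64.urlsafe_b64decode(parent_id.encode("ascii")).decode("utf-8")

# Hypothetical blob URI for one page of the ebook.
uri = "https://myaccount.blob.core.windows.net/nasa-ebook/page-001.pdf"
parent_id = encode_parent_id(uri)
assert decode_parent_id(parent_id) == uri  # round-trips cleanly
```

Encoding makes the URI safe to use as a document key, which can't contain characters like `/`.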

-   ```json
+   <!-- ```json
   {
     "name": "rag-tutorial-earth-book",
     "defaultScoringProfile": null,
@@ -223,7 +272,21 @@ A minimal index for LLM is designed to store chunks of content. It includes vect
         "compressions": []
       }
   }
+   ``` -->
+
+1. For an index schema that more closely mimics structured content, you would have separate indexes for parent and child (chunked) fields. You would need index projections to coordinate the indexing of the two indexes simultaneously. Queries execute against the child index. Query logic includes a lookup query that uses the `parent_id` to retrieve content from the parent index.
+
+   Fields in the child index:
+
+   - id
+   - chunk
+   - chunkVector
+   - parent_id
+
+   Fields in the parent index (everything that you want "one of"):
+
+   - parent_id
+   - parent-level fields (name, title, category)

<!-- Objective:

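The parent-child split above can be sketched as two REST-style index definitions plus the lookup filter that joins them at query time. This is a hypothetical sketch, not the tutorial's code: the index names and the extra parent fields are invented, and the field lists mirror the bullets above.

```python
# Hypothetical parent-child index pair (names invented for illustration).
# Queries run against the child index; a follow-up lookup on parent_id
# fetches the one-per-document fields from the parent index.

child_index = {
    "name": "earthbook-chunks",
    "fields": [
        {"name": "id", "type": "Edm.String", "key": True},
        {"name": "chunk", "type": "Edm.String", "searchable": True},
        {"name": "chunkVector", "type": "Collection(Edm.Single)",
         "dimensions": 1536, "vectorSearchProfile": "my-vector-profile",
         "searchable": True, "retrievable": False},
        {"name": "parent_id", "type": "Edm.String", "filterable": True},
    ],
}

parent_index = {
    "name": "earthbook-parents",
    "fields": [
        {"name": "parent_id", "type": "Edm.String", "key": True},
        {"name": "name", "type": "Edm.String", "searchable": True},
        {"name": "title", "type": "Edm.String", "searchable": True},
        {"name": "category", "type": "Edm.String", "filterable": True},
    ],
}

def parent_lookup_filter(parent_id: str) -> str:
    """Build the OData filter for the lookup query against the parent index."""
    return f"parent_id eq '{parent_id}'"

print(parent_lookup_filter("aHR0cHM6Ly9leGFtcGxl"))
# parent_id eq 'aHR0cHM6Ly9leGFtcGxl'
```

In practice, index projections populate both indexes from one data source, so the two definitions stay in sync during indexing.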
@@ -251,22 +314,7 @@ Tasks:
<!--

ps 2: A richer index has more fields and configurations, and is often better because extra fields support richer queries and more opportunities for relevance tuning. Filters and scoring profiles for boosting apply to nonvector fields. If you have content that should be matched precisely and not similarly, such as a name or employee number, then create fields to contain that information.

-## BLOCKED: Index for hybrid queries and relevance tuning
-
-Per Carey, you would want a couple of indexes for this scenario - parent index, chunked/child index linked to parent - with queries that include lookup to access fields at the parent level. You would need index projections to coordinate the indexing of the two indexes simultaneously.
-
-child index:
-
-- id
-- chunk
-- chunkVector
-- parentId
-
-parent index (everything that you want "one of"):
-
-- fields for verbatim matching (name, title, category)
-- fields for filters or boosting (dates, geo coordinates)
-
-This is probably out of scope for this tutorial, but could be an extension. -->
+-->

## Next step

articles/search/tutorial-rag-build-solution-minimize-storage.md

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@ ms.date: 09/12/2024
---

-# Tutorial: Minimize storage and costs using vector compression and narrow data types (RAG in Azure AI Search)
+# Tutorial: Minimize storage and costs (RAG in Azure AI Search)

In this tutorial, learn the techniques for reducing index size, with a focus on vector compression and storage.

articles/search/tutorial-rag-build-solution-models.md

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@ ms.date: 09/12/2024
---

-# Tutorial: Choose embedding and chat models (RAG in Azure AI Search)
+# Tutorial: Choose embedding and chat models for RAG in Azure AI Search

A RAG solution built on Azure AI Search takes a dependency on embedding models for vectorization, and on chat models for conversational search over your data.
