
Commit 579160c
tidy up commented text
1 parent: aaa2b4f

File tree: 4 files changed (+32, −297 lines)

articles/search/tutorial-rag-build-solution-index-schema.md

Lines changed: 0 additions & 121 deletions
@@ -159,121 +159,6 @@ A minimal index for LLM is designed to store chunks of content. It typically inc
     print(f"{result.name} created")
     ```
 
-<!-- ```json
-{
-  "name": "rag-tutorial-earth-book",
-  "defaultScoringProfile": null,
-  "fields": [
-    {
-      "name": "chunk_id",
-      "type": "Edm.String",
-      "key": true,
-      "searchable": true,
-      "filterable": true,
-      "retrievable": true,
-      "stored": true,
-      "sortable": true,
-      "facetable": true,
-      "analyzer": "keyword",
-    },
-    {
-      "name": "parent_id",
-      "type": "Edm.String",
-      "searchable": true,
-      "filterable": true,
-      "retrievable": true,
-      "stored": true,
-      "sortable": true,
-      "facetable": true,
-      "analyzer": null,
-    },
-    {
-      "name": "chunk",
-      "type": "Edm.String",
-      "searchable": true,
-      "filterable": false,
-      "retrievable": true,
-      "stored": true,
-      "sortable": false,
-      "facetable": false,
-      "analyzer": null,
-    },
-    {
-      "name": "title",
-      "type": "Edm.String",
-      "searchable": true,
-      "filterable": true,
-      "retrievable": true,
-      "stored": true,
-      "sortable": false,
-      "facetable": false,
-      "analyzer": null,
-    },
-    {
-      "name": "text_vector",
-      "type": "Collection(Edm.Single)",
-      "searchable": true,
-      "retrievable": true,
-      "stored": true,
-      "dimensions": 1536,
-      "vectorSearchProfile": "rag-tutorial-earth-book-azureOpenAi-text-profile",
-      "vectorEncoding": null,
-    },
-    {
-      "name": "locations",
-      "type": "Collection(Edm.String)",
-      "searchable": true,
-      "filterable": true,
-      "retrievable": true,
-      "stored": true,
-      "sortable": false,
-      "facetable": false,
-      "analyzer": "standard.lucene",
-    }
-  ],
-  "vectorSearch": {
-    "algorithms": [
-      {
-        "name": "rag-tutorial-earth-book-algorithm",
-        "kind": "hnsw",
-        "hnswParameters": {
-          "metric": "cosine",
-          "m": 4,
-          "efConstruction": 400,
-          "efSearch": 500
-        },
-        "exhaustiveKnnParameters": null
-      }
-    ],
-    "profiles": [
-      {
-        "name": "rag-tutorial-earth-book-azureOpenAi-text-profile",
-        "algorithm": "rag-tutorial-earth-book-algorithm",
-        "vectorizer": "rag-tutorial-earth-book-azureOpenAi-text-vectorizer",
-        "compression": null
-      }
-    ],
-    "vectorizers": [
-      {
-        "name": "rag-tutorial-earth-book-azureOpenAi-text-vectorizer",
-        "kind": "azureOpenAI",
-        "azureOpenAIParameters": {
-          "resourceUri": "https://heidistazureopenaieastus.openai.azure.com",
-          "deploymentId": "text-embedding-ada-002",
-          "apiKey": null,
-          "modelName": "text-embedding-ada-002",
-          "authIdentity": null
-        },
-        "customWebApiParameters": null,
-        "aiServicesVisionParameters": null,
-        "amlParameters": null
-      }
-    ],
-    "compressions": []
-  }
-}
-``` -->
-
 1. For an index schema that more closely mimics structured content, you would have separate indexes for parent and child (chunked) fields. You would need index projections to coordinate the indexing of the two indexes simultaneously. Queries execute against the child index. Query logic includes a lookup query, using the parent_id to retrieve content from the parent index.
 
    Fields in the child index:
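The commented-out JSON removed in the hunk above captured the full generated schema. As a rough offline sketch of the same chunk-index shape (plain Python, no service calls; `validate_schema` is a hypothetical helper, not an Azure AI Search API), the two rules the tutorial relies on can be sanity-checked like this:

```python
# Minimal sketch of the chunk-index schema from the deleted JSON, as plain
# Python dicts. Field names come from the tutorial; validate_schema is a
# hypothetical illustration, not part of the azure-search-documents SDK.

fields = [
    {"name": "chunk_id", "type": "Edm.String", "key": True, "analyzer": "keyword"},
    {"name": "parent_id", "type": "Edm.String", "key": False},
    {"name": "chunk", "type": "Edm.String", "key": False},
    {"name": "title", "type": "Edm.String", "key": False},
    {"name": "text_vector", "type": "Collection(Edm.Single)", "key": False,
     "dimensions": 1536,
     "vectorSearchProfile": "rag-tutorial-earth-book-azureOpenAi-text-profile"},
]

def validate_schema(fields):
    """Check two rules a chunk index must satisfy: exactly one key field,
    and every vector field declares dimensions plus a vector search profile."""
    keys = [f for f in fields if f.get("key")]
    assert len(keys) == 1, "an index needs exactly one key field"
    for f in fields:
        if f["type"] == "Collection(Edm.Single)":
            assert "dimensions" in f and "vectorSearchProfile" in f
    return True

print(validate_schema(fields))  # True when the schema is well formed
```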
@@ -294,7 +179,6 @@ A minimal index for LLM is designed to store chunks of content. It typically inc
 
 Key points:
 
--
 - schema for rag is designed for producing chunks of content
 - schema should be flat (no complex types or structures)
 - schema determines what queries you can create (be generous in attribute assignments)
@@ -311,11 +195,6 @@ Tasks:
 - H2 How to add filters
 - H2 How to add structured data (example is "location", top-level field, data acquisition is through the pipeline) -->
 
-<!--
-
-ps 2: A richer index has more fields and configurations, and is often better because extra fields support richer queries and more opportunities for relevance tuning. Filters and scoring profiles for boosting apply to nonvector fields. If you have content that should be matched precisely and not similarly, such as a name or employee number, then create fields to contain that information.*
--->
-
 ## Next step
 
 > [!div class="nextstepaction"]

articles/search/tutorial-rag-build-solution-models.md

Lines changed: 0 additions & 7 deletions

@@ -175,13 +175,6 @@ Tasks:
 - H2: How to use other models (create a custom skill, create a custom vectorizer).
 - H2: How to configure access. Set up an Azure AI Search managed identity, give it permissions on Azure-hosted models. -->
 
-<!--
-The GPT-35-Turbo and GPT-4 models are optimized to work with inputs formatted as a conversation.
-
-The messages variable passes an array of dictionaries with different roles in the conversation delineated by system, user, and assistant.
-
-The system message can be used to prime the model by including context or instructions on how the model should respond. -->
-
 ## Next step
 
 > [!div class="nextstepaction"]
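The comment removed in the hunk above described the chat-completion input format. A minimal sketch of that messages array (the message contents here are illustrative, not from the tutorial):

```python
# Sketch of the chat-format input described in the deleted comment: a list of
# dictionaries whose roles are system, user, and assistant. The system message
# primes the model; user turns carry questions; assistant turns carry prior
# model replies. The strings below are invented for illustration.
messages = [
    {"role": "system", "content": "You answer questions using only the provided sources."},
    {"role": "user", "content": "where are the nasa headquarters located?"},
]

roles = [m["role"] for m in messages]
print(roles)  # ['system', 'user']
```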

articles/search/tutorial-rag-build-solution-pipeline.md

Lines changed: 18 additions & 112 deletions
@@ -34,11 +34,13 @@ If you don't have an Azure subscription, create a [free account](https://azure.m
 
 - [Visual Studio Code](https://code.visualstudio.com/download) with the [Python extension](https://marketplace.visualstudio.com/items?itemName=ms-python.python) and the [Jupyter package](https://pypi.org/project/jupyter/). For more information, see [Python in Visual Studio Code](https://code.visualstudio.com/docs/languages/python).
 
-- Azure Storage general purpose account. This exercise uploads PDF files into blob storage for automated indexing.
+- [Azure Storage](/azure/storage/common/storage-account-create) general purpose account. This exercise uploads PDF files into blob storage for automated indexing.
 
-- Azure AI Search, Basic tier or above for managed identity and semantic ranking. Choose a region that's shared with Azure OpenAI.
+- [Azure AI Search](search-create-service-portal.md), Basic tier or above for managed identity and semantic ranking. Choose a region that's shared with Azure OpenAI and Azure AI Services.
 
-- Azure OpenAI, with a deployment of text-embedding-002. For more information about embedding models used in RAG solutions, see [Choose embedding models for RAG in Azure AI Search](tutorial-rag-build-solution-models.md)
+- [Azure OpenAI](/azure/ai-services/openai/how-to/create-resource), with a deployment of text-embedding-002, in the same region as Azure AI Search. For more information about embedding models used in RAG solutions, see [Choose embedding models for RAG in Azure AI Search](tutorial-rag-build-solution-models.md).
+
+- [Azure AI Services multiservice account](/azure/ai-services/multi-service-resource), in the same region as Azure AI Search. This resource is used for the Entity Recognition skill that detects locations in your content.
 
 ## Download the sample
 
@@ -200,7 +202,8 @@ index_projections = SearchIndexerIndexProjections(
         source_context="/document/pages/*",
         mappings=[
             InputFieldMappingEntry(name="chunk", source="/document/pages/*"),
-            InputFieldMappingEntry(name="text_vector", source="/document/pages/*/text_vector"),
+            InputFieldMappingEntry(name="text_vector", source="/document/pages/*/text_vector"),
+            InputFieldMappingEntry(name="locations", source="/document/pages/*/locations"),
             InputFieldMappingEntry(name="title", source="/document/metadata_storage_name"),
         ],
     ),
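The index projection in the hunk above fans each parent document out into one search document per page chunk, copying parent-level values onto every chunk. A minimal offline sketch of that behavior, assuming invented sample data and a hypothetical `project` helper (not the SDK's implementation):

```python
# Offline sketch of what an index projection does: each page chunk of a parent
# document becomes its own search document, with the parent's title duplicated
# onto every chunk. project() and the sample data are hypothetical illustrations.

def project(parent):
    docs = []
    for i, page in enumerate(parent["pages"]):
        docs.append({
            "chunk_id": f'{parent["id"]}_{i}',      # one document per chunk
            "parent_id": parent["id"],
            "title": parent["metadata_storage_name"],  # parent value, copied
            "chunk": page["text"],
            "locations": page.get("locations", []),    # skill output, per chunk
        })
    return docs

parent = {
    "id": "doc1",
    "metadata_storage_name": "page-178.pdf",
    "pages": [
        {"text": "chunk one", "locations": ["Washington"]},
        {"text": "chunk two"},
    ],
}

docs = project(parent)
print(len(docs))         # 2
print(docs[0]["title"])  # page-178.pdf
```

The duplication of parent fields onto every chunk is deliberate: the document grain of the index is the chunk, so each chunk must be self-describing.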
@@ -255,17 +258,16 @@ indexer = SearchIndexer(
     parameters=indexer_parameters
 )
 
+# Create and run the indexer
 indexer_client = SearchIndexerClient(endpoint=AZURE_SEARCH_SERVICE, credential=AZURE_SEARCH_CREDENTIAL)
 indexer_result = indexer_client.create_or_update_indexer(indexer)
-
-# Run the indexer
-indexer_client.run_indexer(indexer_name)
+
 print(f' {indexer_name} is created and running. Give the indexer a few minutes before running a query.')
 ```
 
-## Run hybrid search to check results
+## Run a query to check results
 
-Send a query to confirm your index is operational. A hybrid query is useful for verifying text and vector search.
+Send a query to confirm your index is operational. This request converts the text string "`where are the nasa headquarters located?`" into a vector for a vector search. Results consist of the fields in the select statement, some of which are printed as output.
 
 ```python
 from azure.search.documents import SearchClient
@@ -286,14 +288,17 @@ results = search_client.search(
 
 for result in results:
     print(f"Score: {result['@search.score']}")
-    print(f"Title: {result['title']}")
+    print(f"Title: {result['title']}")
+    print(f"Locations: {result['locations']}")
     print(f"Content: {result['chunk']}")
 ```
 
 This query returns a single match (`top=1`) consisting of the one chunk determined by the search engine to be the most relevant. Results from the query should look similar to the following example:
 
 ```
 Score: 0.03306011110544205
+Title: page-178.pdf
+Locations: ['Headquarters', 'Washington']
 Content: national Aeronautics and Space Administration
 
 earth Science
@@ -309,12 +314,10 @@ www.nasa.gov
 np-2018-05-2546-hQ
 ```
 
-Try a few more queries to get a sense of what the search engine returns directly so that you can compare it with an LLM-enabled response. Rerun the previous script with this query: "how much of the earth is covered in water"?
+Try a few more queries to get a sense of what the search engine returns directly so that you can compare it with an LLM-enabled response. Rerun the previous script with this query: `"how much of the earth is covered in water"`?
 
 Results from this second query should look similar to the following results, which are lightly edited for concision.
 
-With this example, it's easier to spot how chunks are returned verbatim, and how keyword and similarity search identify top matches. This specific chunk definitely has information about water and coverage over the earth, but it's not exactly relevant to the query. Semantic ranking would find a better answer, but as a next step, let's see how to connect Azure AI Search to an LLM for conversational search.
-
 ```
 Score: 0.03333333507180214
 Content:
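Scores like `0.033` in the sample output above are small by construction: Azure AI Search merges the keyword and vector rankings of a hybrid query with Reciprocal Rank Fusion (RRF), summing `1/(k + rank)` across the ranked lists with `k` = 60. A sketch of that arithmetic (the exact rank numbering the service uses is an assumption here):

```python
# Reciprocal Rank Fusion as used for hybrid ranking: each ranked list
# contributes 1/(k + rank) for a document, with k = 60. Whether ranks start
# at 0 or 1 internally is an assumption for this illustration.

def rrf(ranks, k=60):
    """Fuse one document's ranks from several result lists into one score."""
    return sum(1.0 / (k + r) for r in ranks)

# A document ranked first in both the keyword list and the vector list:
score = rrf([1, 1])
print(round(score, 4))  # 0.0328
```

Because the fused score depends only on ranks, not raw BM25 or cosine values, hybrid scores from different queries are directly comparable.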
@@ -339,6 +342,8 @@ summer thaw. Nonetheless, this July 2001 image shows plenty of surface vegetatio
 shrubs, and grasses. The abundant fresh water also means the area is teeming with flies and mosquitoes.
 ```
 
+With this example, it's easier to spot how chunks are returned verbatim, and how keyword and similarity search identify top matches. This specific chunk definitely has information about water and coverage over the earth, but it's not exactly relevant to the query. Semantic ranking would find a better answer, but as a next step, let's see how to connect Azure AI Search to an LLM for conversational search.
+
 <!-- Objective:
 
 - Create objects and run the indexer to produce an operational search index with chunked and vectorized content.
@@ -362,105 +367,6 @@ Tasks:
 - H2: Create and run the indexer
 - H2: Check your data in the search index (hide vectors) -->
 
-<!--
-## Prerequisites
-
-TBD
-
-## Create a blob data source
-
-1. Create a baseline data source definition with required elements. Provide a valid connection string to your Azure Storage account. Provide the name of the container that has the sample data.
-
-   ```http
-   ### Create a data source
-   POST {{baseUrl}}/datasources?api-version=2024-05-01-preview HTTP/1.1
-   Content-Type: application/json
-   Authorization: Bearer {{token}}
-
-   {
-     "name": "demo-rag-ds",
-     "description": null,
-     "type": "azureblob",
-     "subtype": null,
-     "credentials": {
-       "connectionString": "{{storageConnectionString}}"
-     },
-     "container": {
-       "name": "{{blobContainer}}",
-       "query": null
-     },
-     "dataChangeDetectionPolicy": null,
-     "dataDeletionDetectionPolicy": null
-   }
-   ```
-
-1. Review the [Datasource REST API](/rest/api/searchservice/data-sources/create) for information about other properties. For more information about blob indexers, see [Index data from Azure Blob Storage](search-howto-indexing-azure-blob-storage.md).
-
-1. Send the request to save the data source to Azure AI Search.
-
-## Create an indexer
-
-1. Create a baseline indexer definition with required elements. In this example, the indexer is disabled so that it doesn't immediately run when it's saved to the search service. In later steps, you'll add a skillset and output field mappings, and run the indexer once it's fully specified.
-
-   ```http
-   ### Create and run an indexer
-   POST {{baseUrl}}/indexers?api-version=2023-11-01 HTTP/1.1
-   Content-Type: application/json
-   Authorization: Bearer {{token}}
-
-   {
-     "name" : "demo-rag-idxr",
-     "dataSourceName" : "demo-rag-ds",
-     "targetIndexName" : "demo-rag-index",
-     "skillsetName" : null,
-     "disabled" : true,
-     "fieldMappings" : null,
-     "outputFieldMappings" : null
-   }
-   ```
-
-1. Review the [Indexer REST API](/rest/api/searchservice/indexers/create) for information about other properties. For more information about indexers, see [Create an indexer](search-howto-create-indexers.md).
-
-1. Send the request to save the data source to Azure AI Search.
-
-## About indexer execution
-
-An indexer connects to a supported data source, retrieves data, serializes it into JSON, calls a skillset, and populates a predefined index with raw content from the source and generated content from a skillset.
-
-An indexer requires a data source and an index, and accepts a skillset definition. All of these objects are distinct.
-
-- An indexer object provides configuration information and field mappings.
-- A data source has connection information.
-- An index is the destination of an indexer pipeline and it defines the physical structure of your data in Azure AI Search.
-- A skillset is optional, but necessary for RAG workloads if you want integrated data chunking and vectorization.
-
-If you're already familiar with indexers and data sources, the definitions don't change in a RAG solution.
-
-## Checklist for indexer execution
-
-Before you run an indexer, review this checklist to avoid problems during indexing. This checklist applies equally to RAG and non-RAG scenarios:
-
-- Is the data source accessible to Azure AI Search? Check network configuration and permissions. Indexers connect under a search service identity. Consider configuring your search service for a managed identity and then granting it read permissions.
-- Does the data source support change tracking? Enable it so that your search service can keep your index up to date.
-- Is the data ready for indexing? Indexers consume a single table (or view), or a collection of documents from a single directory. You can either consolidate files into one location, or you could create multiple data sources and indexers that send data to the same index.
-- Do you need vectorization? Most RAG apps built on Azure AI Search include vector content in the index to support similarity search and hybrid queries. If you need vectorization and chunking, create a skillset and add it to your indexer.
-- Do you need field mappings? If source and destination field names or types are different, add field mappings.
-- If you have a skillset that generates content that you need to store in your index, add output field mappings. Data chunks fall into this category. More information about output field mappings is covered in the skillset exercise.
-
-## Check index
-
-duplicated content in the index
-
-chunks aren't intended for classic search experience. Chunks might start or end mid-sentence or contain duplicated content if you specified an overlap.
-
-Combined index means duplicated parent fields. Document grain is the chunk so each chunk has its copy of parent fields.
-Overlapping text also duplicates content.
-
-All of this duplicated content is acceptable for LLMS because they aren't returning verbatim results.
-
-if you're sending search results directly to a search page, it's a poor experience.
--->
-
 ## Next step
 
 > [!div class="nextstepaction"]
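The "Check index" notes removed in the hunk above observe that chunk overlap duplicates content across adjacent chunks. A toy fixed-size chunker (an illustration only, not the Text Split skill) makes that duplication visible:

```python
# Toy fixed-size chunker with overlap: when overlap > 0, the tail of each
# chunk reappears at the head of the next one. This is an illustration of the
# duplication described in the removed notes, not the Text Split skill.

def chunk(text, size=10, overlap=4):
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

pieces = chunk("the earth is mostly covered in water", size=12, overlap=4)
print(pieces[0])  # 'the earth is'
print(pieces[1])  # 'h is mostly '
```

The last four characters of the first chunk ("h is") are also the first four of the second, which is fine for LLM grounding but reads poorly if chunks are shown verbatim on a search page.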
