1. For an index schema that more closely mimics structured content, you would have separate indexes for parent and child (chunked) fields. You would need index projections to coordinate the indexing of the two indexes simultaneously. Queries execute against the child index. Query logic includes a lookup query, using the `parent_id` to retrieve content from the parent index.
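The two-index lookup pattern in step 1 can be sketched in plain Python, simulating both indexes with in-memory data. All names and documents here are hypothetical; in a real solution, the two collections would be Azure AI Search indexes and the lookups would be queries against them.

```python
# Simulated parent-child lookup. In production, child_index and parent_index
# would be two Azure AI Search indexes, and the loops below would be
# search and document-lookup calls against the service.

child_index = [  # one entry per chunk; parent_id points back at the parent document
    {"chunk_id": "c1", "parent_id": "p1", "chunk": "NASA headquarters is in Washington, D.C."},
    {"chunk_id": "c2", "parent_id": "p2", "chunk": "Oceans cover most of the planet."},
]

parent_index = {  # unchunked parent documents, keyed by document ID
    "p1": {"title": "page-178.pdf"},
    "p2": {"title": "page-44.pdf"},
}

def search_with_parent_lookup(term):
    """Query the child index, then join parent fields through parent_id."""
    hits = []
    for doc in child_index:
        if term.lower() in doc["chunk"].lower():
            parent = parent_index[doc["parent_id"]]  # the lookup query
            hits.append({"chunk": doc["chunk"], "title": parent["title"]})
    return hits

print(search_with_parent_lookup("nasa"))
```

The chunk grain stays in the child index, while fields that apply to the whole document live once in the parent index instead of being copied into every chunk.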
Fields in the child index:
A minimal index for LLM consumption is designed to store chunks of content.
Key points:
- A schema for RAG is designed for producing chunks of content
- The schema should be flat (no complex types or structures)
- The schema determines what queries you can create (be generous in attribute assignments)
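As a concrete illustration of these key points, here's a minimal chunk-oriented schema sketched as a Python dict mirroring the Create Index REST body. The field names, vector dimensions, and vector profile name are assumptions for illustration, not the tutorial's exact schema.

```python
# Minimal, flat RAG index schema, expressed as a dict mirroring the
# Create Index REST body. All names are illustrative.
minimal_rag_index = {
    "name": "demo-rag-index",
    "fields": [
        # Key field: the search document grain is the chunk, not the parent file.
        {"name": "chunk_id", "type": "Edm.String", "key": True, "filterable": True},
        # Parent identity, kept filterable so chunks can be traced or deleted per file.
        {"name": "parent_id", "type": "Edm.String", "filterable": True},
        # Human-readable text handed to the LLM.
        {"name": "chunk", "type": "Edm.String", "searchable": True, "retrievable": True},
        # Vector form of the chunk; dimensions must match your embedding model
        # (1536 is assumed here, matching text-embedding-ada-002).
        {"name": "chunk_vector", "type": "Collection(Edm.Single)", "searchable": True,
         "dimensions": 1536, "vectorSearchProfile": "my-vector-profile"},
    ],
}

# Flat schema check: no complex types anywhere in the field list.
flat = all("ComplexType" not in f["type"] for f in minimal_rag_index["fields"])
print(flat)
```

Being generous with attributes like `filterable` and `retrievable` up front matters because attribute changes usually require rebuilding the index.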
Tasks:
- H2 How to add filters
- H2 How to add structured data (example is "location", top-level field, data acquisition is through the pipeline) -->
<!--
ps 2: A richer index has more fields and configurations, and is often better because extra fields support richer queries and more opportunities for relevance tuning. Filters and scoring profiles for boosting apply to nonvector fields. If you have content that should be matched precisely and not similarly, such as a name or employee number, then create fields to contain that information. -->
If you don't have an Azure subscription, create a free account before you begin.

- [Visual Studio Code](https://code.visualstudio.com/download) with the [Python extension](https://marketplace.visualstudio.com/items?itemName=ms-python.python) and the [Jupyter package](https://pypi.org/project/jupyter/). For more information, see [Python in Visual Studio Code](https://code.visualstudio.com/docs/languages/python).

- [Azure Storage](/azure/storage/common/storage-account-create) general purpose account. This exercise uploads PDF files into blob storage for automated indexing.

- [Azure AI Search](search-create-service-portal.md), Basic tier or above for managed identity and semantic ranking. Choose a region that's shared with Azure OpenAI and Azure AI Services.

- [Azure OpenAI](/azure/ai-services/openai/how-to/create-resource), with a deployment of text-embedding-ada-002, in the same region as Azure AI Search. For more information about embedding models used in RAG solutions, see [Choose embedding models for RAG in Azure AI Search](tutorial-rag-build-solution-models.md).

- [Azure AI Services multiservice account](/azure/ai-services/multi-service-resource), in the same region as Azure AI Search. This resource is used for the Entity Recognition skill that detects locations in your content.
print(f'{indexer_name} is created and running. Give the indexer a few minutes before running a query.')
```
## Run a query to check results
Send a query to confirm your index is operational. This request converts the text string "`where are the nasa headquarters located?`" into a vector for a vector search. Results consist of the fields in the select statement, some of which are printed as output.
This query returns a single match (`top=1`) consisting of the one chunk determined by the search engine to be the most relevant. Results from the query should look similar to the following example:
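If you're exploring with the REST APIs instead of the notebook, the same verification request can be sketched as the JSON body of a Search POST call. The field names (`chunk_vector`, `title`, `chunk`, `locations`) are assumptions; adjust them to match your index.

```python
# Query body combining full-text search with a text-to-vector query that the
# service vectorizes at query time (requires a vectorizer on the index).
query_text = "where are the nasa headquarters located?"

search_request = {
    "search": query_text,              # keyword half of the hybrid query
    "vectorQueries": [{
        "kind": "text",                # let the service embed the text
        "text": query_text,
        "fields": "chunk_vector",      # assumed vector field name
        "k": 1,
    }],
    "select": "title, chunk, locations",
    "top": 1,                          # return only the best chunk
}
print(search_request["top"])
```

Dropping the `search` property turns this into a pure vector query; keeping both lets the engine fuse keyword and similarity rankings.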
```
Score: 0.03306011110544205
Title: page-178.pdf
Locations: ['Headquarters', 'Washington']
Content: national Aeronautics and Space Administration

earth Science

www.nasa.gov

np-2018-05-2546-hQ
```
Try a few more queries to get a sense of what the search engine returns directly so that you can compare it with an LLM-enabled response. Rerun the previous script with this query: `"how much of the earth is covered in water?"`
Results from this second query should look similar to the following results, which are lightly edited for concision.
```
Score: 0.03333333507180214
Content:
summer thaw. Nonetheless, this July 2001 image shows plenty of surface vegetation,
shrubs, and grasses. The abundant fresh water also means the area is teeming with flies and mosquitoes.
```
With this example, it's easier to spot how chunks are returned verbatim, and how keyword and similarity search identify top matches. This specific chunk definitely has information about water and coverage over the earth, but it's not exactly relevant to the query. Semantic ranking would find a better answer, but as a next step, let's see how to connect Azure AI Search to an LLM for conversational search.
<!-- Objective:
- Create objects and run the indexer to produce an operational search index with chunked and vectorized content.
Tasks:
- H2: Create and run the indexer
- H2: Check your data in the search index (hide vectors) -->
<!--
## Prerequisites
TBD
## Create a blob data source
1. Create a baseline data source definition with required elements. Provide a valid connection string to your Azure Storage account. Provide the name of the container that has the sample data.
```http
### Create a data source
POST {{baseUrl}}/datasources?api-version=2024-05-01-preview  HTTP/1.1
Content-Type: application/json
Authorization: Bearer {{token}}

{
    "name": "demo-rag-ds",
    "description": null,
    "type": "azureblob",
    "subtype": null,
    "credentials": {
        "connectionString": "{{storageConnectionString}}"
    },
    "container": {
        "name": "{{blobContainer}}",
        "query": null
    },
    "dataChangeDetectionPolicy": null,
    "dataDeletionDetectionPolicy": null
}
```
1. Review the [Datasource REST API](/rest/api/searchservice/data-sources/create) for information about other properties. For more information about blob indexers, see [Index data from Azure Blob Storage](search-howto-indexing-azure-blob-storage.md).
1. Send the request to save the data source to Azure AI Search.
## Create an indexer
1. Create a baseline indexer definition with required elements. In this example, the indexer is disabled so that it doesn't immediately run when it's saved to the search service. In later steps, you'll add a skillset and output field mappings, and run the indexer once it's fully specified.
```http
### Create and run an indexer
POST {{baseUrl}}/indexers?api-version=2023-11-01  HTTP/1.1
Content-Type: application/json
Authorization: Bearer {{token}}

{
    "name" : "demo-rag-idxr",
    "dataSourceName" : "demo-rag-ds",
    "targetIndexName" : "demo-rag-index",
    "skillsetName" : null,
    "disabled" : true,
    "fieldMappings" : null,
    "outputFieldMappings" : null
}
```
1. Review the [Indexer REST API](/rest/api/searchservice/indexers/create) for information about other properties. For more information about indexers, see [Create an indexer](search-howto-create-indexers.md).
1. Send the request to save the indexer to Azure AI Search.
## About indexer execution
An indexer connects to a supported data source, retrieves data, serializes it into JSON, calls a skillset, and populates a predefined index with raw content from the source and generated content from a skillset.
An indexer requires a data source and an index, and accepts a skillset definition. All of these objects are distinct.
- An indexer object provides configuration information and field mappings.
- A data source has connection information.
- An index is the destination of an indexer pipeline, and it defines the physical structure of your data in Azure AI Search.
- A skillset is optional, but necessary for RAG workloads if you want integrated data chunking and vectorization.
If you're already familiar with indexers and data sources, the definitions don't change in a RAG solution.
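The relationships among these objects can be sketched with minimal definitions that show how the indexer references the others by name (all names are illustrative and match the examples above):

```python
# Four distinct objects; the indexer ties the other three together by name.
data_source = {"name": "demo-rag-ds", "type": "azureblob"}   # where to read
index = {"name": "demo-rag-index"}                           # where to write
skillset = {"name": "demo-rag-ss"}                           # optional enrichment

indexer = {
    "name": "demo-rag-idxr",
    "dataSourceName": data_source["name"],
    "targetIndexName": index["name"],
    "skillsetName": skillset["name"],   # set to None if you skip enrichment
}
print(indexer["targetIndexName"])
```

Because the references are by name, each object can be created, updated, or deleted independently of the others.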
## Checklist for indexer execution
Before you run an indexer, review this checklist to avoid problems during indexing. This checklist applies equally to RAG and non-RAG scenarios:
- Is the data source accessible to Azure AI Search? Check network configuration and permissions. Indexers connect under a search service identity. Consider configuring your search service for a managed identity and then granting it read permissions.
- Does the data source support change tracking? Enable it so that your search service can keep your index up to date.
- Is the data ready for indexing? Indexers consume a single table (or view), or a collection of documents from a single directory. You can either consolidate files into one location, or create multiple data sources and indexers that send data to the same index.
- Do you need vectorization? Most RAG apps built on Azure AI Search include vector content in the index to support similarity search and hybrid queries. If you need vectorization and chunking, create a skillset and add it to your indexer.
- Do you need field mappings? If source and destination field names or types are different, add field mappings.
- Do you need output field mappings? If you have a skillset that generates content that you need to store in your index, add output field mappings. Data chunks fall into this category. More information about output field mappings is covered in the skillset exercise.
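The last two checklist items are expressed as properties on the indexer itself. A hypothetical example, sketched as the JSON fragments you'd set in the indexer definition (`metadata_storage_name` is a built-in blob property; the skillset output path and target field names are assumptions):

```python
# fieldMappings: rename source fields so they land in differently named index fields.
field_mappings = [
    {"sourceFieldName": "metadata_storage_name", "targetFieldName": "title"},
]

# outputFieldMappings: route skillset-generated content from the enriched
# document tree into the index (assumed skill output path).
output_field_mappings = [
    {"sourceFieldName": "/document/locations", "targetFieldName": "locations"},
]
print(output_field_mappings[0]["targetFieldName"])
```

Field mappings read from the raw source document, while output field mappings read from the enriched document that the skillset produces; the two lists are separate for that reason.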
## Check index
Duplicated content in the index:

Chunks aren't intended for a classic search experience. Chunks might start or end mid-sentence, or contain duplicated content if you specified an overlap.

A combined index means duplicated parent fields. The document grain is the chunk, so each chunk has its own copy of the parent fields. Overlapping text also duplicates content.

All of this duplicated content is acceptable for LLMs because they aren't returning verbatim results. If you're sending search results directly to a search page, it's a poor experience.
0 commit comments