Commit cb723a8

srbalakr and pamelafox authored
Add integrated vectorizer support (#1159)
* initial checkin
* s
* s
* use workaround to use latest models
* use workaround to use latest models
* s
* s
* s
* s
* s
* s
* s
* format black
* fix uts
* fix uts
* Update README.md
  Co-authored-by: Pamela Fox <[email protected]>
* Update README.md
  Co-authored-by: Pamela Fox <[email protected]>
* Update README.md
  Co-authored-by: Pamela Fox <[email protected]>
* Update README.md
  Co-authored-by: Pamela Fox <[email protected]>
* Update README.md
  Co-authored-by: Pamela Fox <[email protected]>
* Update README.md
  Co-authored-by: Pamela Fox <[email protected]>
* Update infra/main.bicep
  Co-authored-by: Pamela Fox <[email protected]>
* Update infra/main.bicep
  Co-authored-by: Pamela Fox <[email protected]>
* Update scripts/prepdocs.py
  Co-authored-by: Pamela Fox <[email protected]>
* Update scripts/prepdocslib/integratedvectorizerstrategy.py
  Co-authored-by: Pamela Fox <[email protected]>
* Update scripts/prepdocslib/integratedvectorizerstrategy.py
  Co-authored-by: Pamela Fox <[email protected]>
* Update scripts/prepdocslib/integratedvectorizerstrategy.py
  Co-authored-by: Pamela Fox <[email protected]>
* s
* s
* fixut
* s
* s
* s
* fix black format
* fix UT
* add blob test'
* s
* s
* s
* add new ut
* s
* s
* s
* Docs tweaks
* Rewords
* Add close, fix typo
* s
* s
* fix formatting

---------

Co-authored-by: Pamela Fox <[email protected]>
Co-authored-by: Pamela Fox <[email protected]>
1 parent 690ab5d commit cb723a8

16 files changed, with 505 additions and 31 deletions

README.md

Lines changed: 12 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -37,6 +37,7 @@ urlFragment: azure-search-openai-demo
3737
- [Deploying again](#deploying-again)
3838
- [Sharing environments](#sharing-environments)
3939
- [Enabling optional features](#enabling-optional-features)
40+
- [Enabling Integrated Vectorization](#enabling-integrated-vectorization)
4041
- [Enabling authentication](#enabling-authentication)
4142
- [Enabling login and document level access control](#enabling-login-and-document-level-access-control)
4243
- [Enabling CORS for an alternate frontend](#enabling-cors-for-an-alternate-frontend)
@@ -246,6 +247,17 @@ either you or they can follow these steps:
246247

247248
This section covers the integration of GPT-4 Vision with Azure AI Search. Learn how to enhance your search capabilities with the power of image and text indexing, enabling advanced search functionalities over diverse document types. For a detailed guide on setup and usage, visit our [Enabling GPT-4 Turbo with Vision](docs/gpt4v.md) page.
248249

250+
### Enabling Integrated Vectorization
251+
252+
Azure AI Search recently introduced an [integrated vectorization feature in preview mode](https://techcommunity.microsoft.com/t5/ai-azure-ai-services-blog/announcing-the-public-preview-of-integrated-vectorization-in/ba-p/3960809#:~:text=Integrated%20vectorization%20is%20a%20new%20feature%20of%20Azure,pull-indexers%2C%20and%20vectorization%20of%20text%20queries%20through%20vectorizers). This feature is a cloud-based approach to data ingestion, which takes care of document format cracking, data extraction, chunking, vectorization, and indexing, all with Azure technologies.
253+
254+
To enable integrated vectorization with this sample:
255+
256+
1. If you've previously deployed, delete the existing search index.
257+
2. Run `azd env set USE_FEATURE_INT_VECTORIZATION true`
258+
3. Run `azd up` to update system and user roles
259+
4. In the Azure portal, you can view resources such as the indexer and skillset, and monitor the status of the vectorization process.
260+
249261
### Enabling authentication
250262

251263
By default, the deployed Azure web app will have no authentication or access restrictions enabled, meaning anyone with routable network access to the web app can chat with your indexed data. You can require authentication to your Azure Active Directory by following the [Add app authentication](https://learn.microsoft.com/azure/app-service/scenario-secure-app-authentication-app-service) tutorial and set it up against the deployed web app.

docs/data_ingestion.md

Lines changed: 39 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -2,6 +2,15 @@
22

33
This guide provides more details for using the `prepdocs` script to index documents for the Chat App.
44

5+
- [Overview of the manual indexing process](#overview-of-the-manual-indexing-process)
6+
- [Chunking](#chunking)
7+
- [Indexing additional documents](#indexing-additional-documents)
8+
- [Removing documents](#removing-documents)
9+
- [Overview of Integrated Vectorization](#overview-of-integrated-vectorization)
10+
- [Indexing additional documents](#indexing-additional-documents-1)
11+
- [Removing documents](#removing-documents-1)
12+
- [Scheduled indexing](#scheduled-indexing)
13+
514
## Overview of the manual indexing process
615

716
The `scripts/prepdocs.py` script is responsible for both uploading and indexing documents. The typical usage is to call it using `scripts/prepdocs.sh` (Mac/Linux) or `scripts/prepdocs.ps1` (Windows), as these scripts will set up a Python virtual environment and pass in the required parameters based on the current `azd` environment. Whenever `azd up` or `azd provision` is run, the script is called automatically.
@@ -23,16 +32,44 @@ Chunking allows us to limit the amount of information we send to OpenAI due to t
2332

2433
If needed, you can modify the chunking algorithm in `scripts/prepdocslib/textsplitter.py`.
2534

26-
## Indexing additional documents
35+
### Indexing additional documents
2736

2837
To upload more PDFs, put them in the data/ folder and run `./scripts/prepdocs.sh` or `./scripts/prepdocs.ps1`.
2938

3039
A [recent change](https://github.com/Azure-Samples/azure-search-openai-demo/pull/835) added checks to see what's been uploaded before. The prepdocs script now writes an .md5 file with an MD5 hash of each file that gets uploaded. Whenever the prepdocs script is re-run, that hash is checked against the current hash and the file is skipped if it hasn't changed.
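
The skip-if-unchanged check described above can be sketched as follows. This is a minimal illustration, not the actual prepdocs implementation: the helper names `file_md5` and `needs_upload`, and the choice to store the hash in a sidecar file named `<filename>.md5`, are assumptions made for the example.

```python
import hashlib
from pathlib import Path


def file_md5(path: Path) -> str:
    """Return the hex MD5 digest of the file's bytes."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def needs_upload(path: Path) -> bool:
    """Return True if the file is new or has changed since its hash was last recorded.

    Side effect: writes or updates a sidecar file `<filename>.md5` (assumed naming)
    so that the next call with unchanged content returns False.
    """
    hash_path = path.with_name(path.name + ".md5")
    current = file_md5(path)
    if hash_path.exists() and hash_path.read_text(encoding="utf-8").strip() == current:
        return False  # content unchanged, so the upload can be skipped
    hash_path.write_text(current, encoding="utf-8")
    return True
```

Calling `needs_upload` twice on an unchanged file returns `True` and then `False`; editing the file makes the next call return `True` again.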
3140

32-
## Removing documents
41+
### Removing documents
3342

3443
You may want to remove documents from the index. For example, if you're using the sample data, you may want to remove the documents that are already in the index before adding your own.
3544

3645
To remove all documents, use the `--removeall` flag. Open either `scripts/prepdocs.sh` or `scripts/prepdocs.ps1` and add `--removeall` to the command at the bottom of the file. Then run the script as usual.
3746

3847
You can also remove individual documents by using the `--remove` flag. Open either `scripts/prepdocs.sh` or `scripts/prepdocs.ps1`, add `--remove` to the command at the bottom of the file, and replace `/data/*` with `/data/YOUR-DOCUMENT-FILENAME-GOES-HERE.pdf`. Then run the script as usual.
48+
49+
## Overview of Integrated Vectorization
50+
51+
Azure AI Search recently introduced an [integrated vectorization feature in preview mode](https://techcommunity.microsoft.com/t5/ai-azure-ai-services-blog/announcing-the-public-preview-of-integrated-vectorization-in/ba-p/3960809#:~:text=Integrated%20vectorization%20is%20a%20new%20feature%20of%20Azure,pull-indexers%2C%20and%20vectorization%20of%20text%20queries%20through%20vectorizers). This feature is a cloud-based approach to data ingestion, which takes care of document format cracking, data extraction, chunking, vectorization, and indexing, all with Azure technologies.
52+
53+
See [this notebook](https://github.com/Azure/azure-search-vector-samples/blob/main/demo-python/code/azure-search-integrated-vectorization-sample.ipynb) to understand the process of setting up integrated vectorization.
54+
We have integrated that code into our `prepdocs` script, so you can use it without needing to understand the details.
55+
56+
This feature cannot be used on an existing index. You need to create a new index, or drop and recreate an existing one.
57+
The newly created index schema includes a new field, `parent_id`, which the indexer uses internally to manage the life cycle of chunks.
58+
59+
This feature is not supported in the free SKU for Azure AI Search.
60+
61+
### Indexing additional documents
62+
63+
To add additional documents to the index, first upload them to your data source (Blob storage, by default).
64+
Then navigate to the Azure portal, find the indexer, and run it.
65+
The Azure AI Search indexer will identify the new documents and ingest them into the index.
66+
67+
### Removing documents
68+
69+
To remove documents from the index, remove them from your data source (Blob storage, by default).
70+
Then navigate to the Azure portal, find the indexer, and run it.
71+
The Azure AI Search indexer will take care of removing those documents from the index.
72+
73+
### Scheduled indexing
74+
75+
If you would like the indexer to run automatically, you can set it up to [run on a schedule](https://learn.microsoft.com/azure/search/search-howto-schedule-indexers).
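
As a rough illustration, a schedule is a fragment of the indexer definition whose `interval` is an ISO 8601 duration. The sketch below builds that fragment from a `timedelta`. The helper name `indexer_schedule` is hypothetical, and the 5-minute and 24-hour bounds are the limits described in the linked documentation at the time of writing, an assumption worth verifying there before relying on it.

```python
from datetime import timedelta


def indexer_schedule(interval: timedelta) -> dict:
    """Build the `schedule` fragment of an indexer definition (hypothetical helper)."""
    # Assumed service limits: no more often than every 5 minutes, at least once a day.
    if not timedelta(minutes=5) <= interval <= timedelta(days=1):
        raise ValueError("interval must be between 5 minutes and 24 hours")
    total_minutes = int(interval.total_seconds()) // 60  # leftover seconds are dropped
    days, rem = divmod(total_minutes, 24 * 60)
    hours, minutes = divmod(rem, 60)
    date_part = f"{days}D" if days else ""
    time_part = (f"{hours}H" if hours else "") + (f"{minutes}M" if minutes else "")
    duration = "P" + date_part + ("T" + time_part if time_part else "")
    return {"schedule": {"interval": duration}}
```

For example, `indexer_schedule(timedelta(hours=2))` produces an interval of `PT2H`, which could be merged into the indexer definition through the portal's JSON editor or the REST API.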

infra/core/search/search-services.bicep

Lines changed: 1 addition & 0 deletions
@@ -62,3 +62,4 @@ resource search 'Microsoft.Search/searchServices@2021-04-01-preview' = {
6262
output id string = search.id
6363
output endpoint string = 'https://${name}.search.windows.net/'
6464
output name string = search.name
65+
output principalId string = search.identity.principalId

infra/main.bicep

Lines changed: 24 additions & 0 deletions
@@ -110,6 +110,8 @@ param useApplicationInsights bool = false
110110

111111
@description('Show options to use vector embeddings for searching in the app UI')
112112
param useVectors bool = false
113+
@description('Use the built-in integrated vectorization feature of Azure AI Search to vectorize and ingest documents')
114+
param useIntegratedVectorization bool = false
113115

114116
var abbrs = loadJsonContent('abbreviations.json')
115117
var resourceToken = toLower(uniqueString(subscription().id, environmentName, location))
@@ -504,6 +506,17 @@ module openAiRoleBackend 'core/security/role.bicep' = if (openAiHost == 'azure')
504506
}
505507
}
506508

509+
module openAiRoleSearchService 'core/security/role.bicep' = if (openAiHost == 'azure' && useIntegratedVectorization) {
510+
scope: openAiResourceGroup
511+
name: 'openai-role-searchservice'
512+
params: {
513+
principalId: searchService.outputs.principalId
514+
roleDefinitionId: '5e0bd9bd-7b93-4f28-af87-19fc36ad61bd'
515+
principalType: 'ServicePrincipal'
516+
}
517+
}
518+
519+
507520
module storageRoleBackend 'core/security/role.bicep' = {
508521
scope: storageResourceGroup
509522
name: 'storage-role-backend'
@@ -514,6 +527,16 @@ module storageRoleBackend 'core/security/role.bicep' = {
514527
}
515528
}
516529

530+
module storageRoleSearchService 'core/security/role.bicep' = if (useIntegratedVectorization) {
531+
scope: storageResourceGroup
532+
name: 'storage-role-searchservice'
533+
params: {
534+
principalId: searchService.outputs.principalId
535+
roleDefinitionId: '2a2b9908-6ea1-4ae2-8e65-a410df84e7d1'
536+
principalType: 'ServicePrincipal'
537+
}
538+
}
539+
517540
// Used to issue search queries
518541
// https://learn.microsoft.com/azure/search/search-security-rbac
519542
module searchRoleBackend 'core/security/role.bicep' = if (!useSearchServiceKey) {
@@ -572,6 +595,7 @@ output AZURE_SEARCH_SERVICE string = searchService.outputs.name
572595
output AZURE_SEARCH_SECRET_NAME string = useSearchServiceKey ? searchServiceSecretName : ''
573596
output AZURE_SEARCH_SERVICE_RESOURCE_GROUP string = searchServiceResourceGroup.name
574597
output AZURE_SEARCH_SEMANTIC_RANKER string = actualSearchServiceSemanticRankerLevel
598+
output AZURE_SEARCH_SERVICE_ASSIGNED_USERID string = searchService.outputs.principalId
575599

576600
output AZURE_STORAGE_ACCOUNT string = storage.outputs.name
577601
output AZURE_STORAGE_CONTAINER string = storageContainerName

infra/main.parameters.json

Lines changed: 3 additions & 0 deletions
@@ -118,6 +118,9 @@
118118
},
119119
"allowedOrigin": {
120120
"value": "${ALLOWED_ORIGIN}"
121+
},
122+
"useIntegratedVectorization" :{
123+
"value": "${USE_FEATURE_INT_VECTORIZATION}"
121124
}
122125
}
123126
}

scripts/prepdocs.ps1

Lines changed: 11 additions & 2 deletions
@@ -61,11 +61,16 @@ if ($env:AZURE_TENANT_ID) {
6161
$tenantArg = "--tenantid $env:AZURE_TENANT_ID"
6262
}
6363

64+
if ($env:USE_FEATURE_INT_VECTORIZATION) {
65+
$integratedVectorizationArg = "--useintvectorization $env:USE_FEATURE_INT_VECTORIZATION"
66+
}
67+
6468
$cwd = (Get-Location)
6569
$dataArg = "`"$cwd/data/*`""
6670

6771
$argumentList = "./scripts/prepdocs.py $dataArg --verbose " + `
68-
"--storageaccount $env:AZURE_STORAGE_ACCOUNT --container $env:AZURE_STORAGE_CONTAINER " + `
72+
"--subscriptionid $env:AZURE_SUBSCRIPTION_ID " + `
73+
"--storageaccount $env:AZURE_STORAGE_ACCOUNT --container $env:AZURE_STORAGE_CONTAINER --storageresourcegroup $env:AZURE_STORAGE_RESOURCE_GROUP " + `
6974
"--searchservice $env:AZURE_SEARCH_SERVICE --index $env:AZURE_SEARCH_INDEX " + `
7075
"$searchAnalyzerNameArg $searchSecretNameArg " + `
7176
"--openaihost `"$env:OPENAI_HOST`" --openaimodelname `"$env:AZURE_OPENAI_EMB_MODEL_NAME`" " + `
@@ -76,5 +81,9 @@ $argumentList = "./scripts/prepdocs.py $dataArg --verbose " + `
7681
"$adlsGen2StorageAccountArg $adlsGen2FilesystemArg $adlsGen2FilesystemPathArg " + `
7782
"$tenantArg $aclArg " + `
7883
"$disableVectorsArg $localPdfParserArg " + `
79-
"$keyVaultName "
84+
"$keyVaultName " + `
85+
"$integratedVectorizationArg "
86+
87+
$argumentList
88+
8089
Start-Process -FilePath $venvPythonPath -ArgumentList $argumentList -Wait -NoNewWindow

scripts/prepdocs.py

Lines changed: 96 additions & 5 deletions
@@ -15,7 +15,10 @@
1515
OpenAIEmbeddingService,
1616
)
1717
from prepdocslib.fileprocessor import FileProcessor
18-
from prepdocslib.filestrategy import DocumentAction, FileStrategy
18+
from prepdocslib.filestrategy import FileStrategy
19+
from prepdocslib.integratedvectorizerstrategy import (
20+
IntegratedVectorizerStrategy,
21+
)
1922
from prepdocslib.jsonparser import JsonParser
2023
from prepdocslib.listfilestrategy import (
2124
ADLSGen2ListFileStrategy,
@@ -24,7 +27,7 @@
2427
)
2528
from prepdocslib.parser import Parser
2629
from prepdocslib.pdfparser import DocumentAnalysisParser, LocalPdfParser
27-
from prepdocslib.strategy import SearchInfo, Strategy
30+
from prepdocslib.strategy import DocumentAction, SearchInfo, Strategy
2831
from prepdocslib.textsplitter import SentenceTextSplitter, SimpleTextSplitter
2932

3033

@@ -45,12 +48,15 @@ async def get_vision_key(credential: AsyncTokenCredential) -> Optional[str]:
4548
exit(1)
4649

4750

48-
async def setup_file_strategy(credential: AsyncTokenCredential, args: Any) -> FileStrategy:
51+
async def setup_file_strategy(credential: AsyncTokenCredential, args: Any) -> Strategy:
4952
storage_creds = credential if is_key_empty(args.storagekey) else args.storagekey
5053
blob_manager = BlobManager(
5154
endpoint=f"https://{args.storageaccount}.blob.core.windows.net",
5255
container=args.container,
56+
account=args.storageaccount,
5357
credential=storage_creds,
58+
resourceGroup=args.storageresourcegroup,
59+
subscriptionId=args.subscriptionid,
5460
store_page_images=args.searchimages,
5561
verbose=args.verbose,
5662
)
@@ -145,6 +151,70 @@ async def setup_file_strategy(credential: AsyncTokenCredential, args: Any) -> Fi
145151
)
146152

147153

154+
async def setup_intvectorizer_strategy(credential: AsyncTokenCredential, args: Any) -> Strategy:
155+
storage_creds = credential if is_key_empty(args.storagekey) else args.storagekey
156+
blob_manager = BlobManager(
157+
endpoint=f"https://{args.storageaccount}.blob.core.windows.net",
158+
container=args.container,
159+
account=args.storageaccount,
160+
credential=storage_creds,
161+
resourceGroup=args.storageresourcegroup,
162+
subscriptionId=args.subscriptionid,
163+
store_page_images=args.searchimages,
164+
verbose=args.verbose,
165+
)
166+
167+
use_vectors = not args.novectors
168+
embeddings: Union[AzureOpenAIEmbeddingService, None] = None
169+
if use_vectors and args.openaihost != "openai":
170+
azure_open_ai_credential: Union[AsyncTokenCredential, AzureKeyCredential] = (
171+
credential if is_key_empty(args.openaikey) else AzureKeyCredential(args.openaikey)
172+
)
173+
embeddings = AzureOpenAIEmbeddingService(
174+
open_ai_service=args.openaiservice,
175+
open_ai_deployment=args.openaideployment,
176+
open_ai_model_name=args.openaimodelname,
177+
credential=azure_open_ai_credential,
178+
disable_batch=args.disablebatchvectors,
179+
verbose=args.verbose,
180+
)
181+
182+
print("Processing files...")
183+
list_file_strategy: ListFileStrategy
184+
if args.datalakestorageaccount:
185+
adls_gen2_creds = credential if is_key_empty(args.datalakekey) else args.datalakekey
186+
print(f"Using Data Lake Gen2 Storage Account {args.datalakestorageaccount}")
187+
list_file_strategy = ADLSGen2ListFileStrategy(
188+
data_lake_storage_account=args.datalakestorageaccount,
189+
data_lake_filesystem=args.datalakefilesystem,
190+
data_lake_path=args.datalakepath,
191+
credential=adls_gen2_creds,
192+
verbose=args.verbose,
193+
)
194+
else:
195+
print(f"Using local files in {args.files}")
196+
list_file_strategy = LocalListFileStrategy(path_pattern=args.files, verbose=args.verbose)
197+
198+
if args.removeall:
199+
document_action = DocumentAction.RemoveAll
200+
elif args.remove:
201+
document_action = DocumentAction.Remove
202+
else:
203+
document_action = DocumentAction.Add
204+
205+
return IntegratedVectorizerStrategy(
206+
list_file_strategy=list_file_strategy,
207+
blob_manager=blob_manager,
208+
document_action=document_action,
209+
embeddings=embeddings,
210+
subscription_id=args.subscriptionid,
211+
search_service_user_assigned_id=args.searchserviceassignedid,
212+
search_analyzer_name=args.searchanalyzername,
213+
use_acls=args.useacls,
214+
category=args.category,
215+
)
216+
217+
148218
async def main(strategy: Strategy, credential: AsyncTokenCredential, args: Any):
149219
search_key = args.searchkey
150220
if args.keyvaultname and args.searchsecretname:
@@ -203,6 +273,7 @@ async def main(strategy: Strategy, credential: AsyncTokenCredential, args: Any):
203273
)
204274
parser.add_argument("--storageaccount", help="Azure Blob Storage account name")
205275
parser.add_argument("--container", help="Azure Blob Storage container name")
276+
parser.add_argument("--storageresourcegroup", help="Azure blob storage resource group")
206277
parser.add_argument(
207278
"--storagekey",
208279
required=False,
@@ -211,10 +282,20 @@ async def main(strategy: Strategy, credential: AsyncTokenCredential, args: Any):
211282
parser.add_argument(
212283
"--tenantid", required=False, help="Optional. Use this to define the Azure directory where to authenticate)"
213284
)
285+
parser.add_argument(
286+
"--subscriptionid",
287+
required=False,
288+
help="Optional. Use this to define managed identity connection string in integrated vectorization",
289+
)
214290
parser.add_argument(
215291
"--searchservice",
216292
help="Name of the Azure AI Search service where content should be indexed (must exist already)",
217293
)
294+
parser.add_argument(
295+
"--searchserviceassignedid",
296+
required=False,
297+
help="Search service system assigned Identity (Managed identity) (used for integrated vectorization)",
298+
)
218299
parser.add_argument(
219300
"--index",
220301
help="Name of the Azure AI Search index where content should be indexed (will be created if it doesn't exist)",
@@ -309,8 +390,14 @@ async def main(strategy: Strategy, credential: AsyncTokenCredential, args: Any):
309390
required=False,
310391
help="Required if --searchimages is specified and --keyvaultname is provided. Fetch the Azure AI Vision key from this key vault instead of using the current user identity to login.",
311392
)
393+
parser.add_argument(
394+
"--useintvectorization",
395+
required=False,
396+
help="Optional. Set to 'true' to enable integrated vectorizer indexer support (in preview)",
397+
)
312398
parser.add_argument("--verbose", "-v", action="store_true", help="Verbose output")
313399
args = parser.parse_args()
400+
use_int_vectorization = args.useintvectorization and args.useintvectorization.lower() == "true"
314401

315402
# Use the current user identity to connect to Azure services unless a key is explicitly set for any of them
316403
azd_credential = (
@@ -320,6 +407,10 @@ async def main(strategy: Strategy, credential: AsyncTokenCredential, args: Any):
320407
)
321408

322409
loop = asyncio.get_event_loop()
323-
file_strategy = loop.run_until_complete(setup_file_strategy(azd_credential, args))
324-
loop.run_until_complete(main(file_strategy, azd_credential, args))
410+
ingestion_strategy = None
411+
if use_int_vectorization:
412+
ingestion_strategy = loop.run_until_complete(setup_intvectorizer_strategy(azd_credential, args))
413+
else:
414+
ingestion_strategy = loop.run_until_complete(setup_file_strategy(azd_credential, args))
415+
loop.run_until_complete(main(ingestion_strategy, azd_credential, args))
325416
loop.close()
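
Note that `--useintvectorization` takes a string value rather than acting as a `store_true` switch, presumably because the wrapper scripts forward the raw value of `USE_FEATURE_INT_VECTORIZATION`. The following standalone sketch reproduces just that parsing step so its behavior can be checked in isolation. The function name `use_integrated_vectorization` is hypothetical, and the result is wrapped in `bool()` so that a missing flag yields `False` rather than `None`.

```python
import argparse
from typing import List, Optional


def use_integrated_vectorization(argv: List[str]) -> bool:
    """Return True only when --useintvectorization is given with the value 'true' (any case)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--useintvectorization", required=False)
    args = parser.parse_args(argv)
    value: Optional[str] = args.useintvectorization
    # Same comparison as in the diff above, coerced to a strict bool.
    return bool(value) and value.lower() == "true"
```

With this rule, any value other than a case-insensitive `true` (including `1` or `yes`) leaves the manual file strategy in effect.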

scripts/prepdocs.sh

Lines changed: 8 additions & 2 deletions
@@ -66,8 +66,13 @@ if [ -n "$AZURE_TENANT_ID" ]; then
6666
tenantArg="--tenantid $AZURE_TENANT_ID"
6767
fi
6868

69+
if [ -n "$USE_FEATURE_INT_VECTORIZATION" ]; then
70+
integratedVectorizationArg="--useintvectorization $USE_FEATURE_INT_VECTORIZATION"
71+
fi
72+
6973
./scripts/.venv/bin/python ./scripts/prepdocs.py './data/*' --verbose \
70-
--storageaccount "$AZURE_STORAGE_ACCOUNT" --container "$AZURE_STORAGE_CONTAINER" \
74+
--subscriptionid "$AZURE_SUBSCRIPTION_ID" \
75+
--storageaccount "$AZURE_STORAGE_ACCOUNT" --container "$AZURE_STORAGE_CONTAINER" --storageresourcegroup "$AZURE_STORAGE_RESOURCE_GROUP" \
7176
--searchservice "$AZURE_SEARCH_SERVICE" --index "$AZURE_SEARCH_INDEX" \
7277
$searchAnalyzerNameArg $searchSecretNameArg \
7378
--openaihost "$OPENAI_HOST" --openaimodelname "$AZURE_OPENAI_EMB_MODEL_NAME" \
@@ -78,4 +83,5 @@ $searchImagesArg $visionEndpointArg $visionKeyArg $visionSecretNameArg \
7883
$adlsGen2StorageAccountArg $adlsGen2FilesystemArg $adlsGen2FilesystemPathArg \
7984
$tenantArg $aclArg \
8085
$disableVectorsArg $localPdfParserArg \
81-
$keyVaultName
86+
$keyVaultName \
87+
$integratedVectorizationArg
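
Both wrapper scripts now pass `--subscriptionid` and `--storageresourcegroup`, and the help text for `--subscriptionid` says it is used to define a managed identity connection string. An Azure AI Search data source can reference a storage account by its resource ID instead of an account key, and a sketch of such a string is shown below. The function name is hypothetical, and the exact string that `BlobManager` builds is not visible in this diff, so treat the format as an assumption based on the Azure AI Search documentation for managed identity connections.

```python
def storage_resource_id_connection_string(subscription_id: str, resource_group: str, account: str) -> str:
    """Build a key-less 'ResourceId=' connection string for a storage account (assumed format)."""
    return (
        f"ResourceId=/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Storage/storageAccounts/{account};"
    )
```

A connection string of this form only works if the search service's identity has been granted read access to the storage account, which is what the `storageRoleSearchService` role assignment in `infra/main.bicep` appears to provide.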
