Azure-Samples
diff --git a/README.md b/README.md
Lines changed: 12 additions & 0 deletions
diff --git a/docs/data_ingestion.md b/docs/data_ingestion.md
Lines changed: 39 additions & 2 deletions
diff --git a/infra/core/search/search-services.bicep b/infra/core/search/search-services.bicep
Lines changed: 1 addition & 0 deletions
diff --git a/infra/main.bicep b/infra/main.bicep
Lines changed: 24 additions & 0 deletions
diff --git a/infra/main.parameters.json b/infra/main.parameters.json
Lines changed: 3 additions & 0 deletions
diff --git a/scripts/prepdocs.ps1 b/scripts/prepdocs.ps1
Lines changed: 11 additions & 2 deletions
diff --git a/scripts/prepdocs.py b/scripts/prepdocs.py
Lines changed: 96 additions & 5 deletions
diff --git a/scripts/prepdocs.sh b/scripts/prepdocs.sh
Lines changed: 8 additions & 2 deletions
@@ -37,6 +37,7 @@ urlFragment: azure-search-openai-demo
   - [Deploying again](#deploying-again)
 - [Sharing environments](#sharing-environments)
 - [Enabling optional features](#enabling-optional-features)
+  - [Enabling Integrated Vectorization](#enabling-integrated-vectorization)
   - [Enabling authentication](#enabling-authentication)
   - [Enabling login and document level access control](#enabling-login-and-document-level-access-control)
   - [Enabling CORS for an alternate frontend](#enabling-cors-for-an-alternate-frontend)
@@ -246,6 +247,17 @@ either you or they can follow these steps:
 
 This section covers the integration of GPT-4 Vision with Azure AI Search. Learn how to enhance your search capabilities with the power of image and text indexing, enabling advanced search functionalities over diverse document types. For a detailed guide on setup and usage, visit our [Enabling GPT-4 Turbo with Vision](docs/gpt4v.md) page.
 
+### Enabling Integrated Vectorization
+
+Azure AI Search recently introduced an [integrated vectorization feature in preview mode](https://techcommunity.microsoft.com/t5/ai-azure-ai-services-blog/announcing-the-public-preview-of-integrated-vectorization-in/ba-p/3960809#:~:text=Integrated%20vectorization%20is%20a%20new%20feature%20of%20Azure,pull-indexers%2C%20and%20vectorization%20of%20text%20queries%20through%20vectorizers). This feature is a cloud-based approach to data ingestion that handles document format cracking, data extraction, chunking, vectorization, and indexing, all with Azure technologies.
+
+To enable integrated vectorization with this sample:
+
+1. If you've previously deployed, delete the existing search index.
+2. Run `azd env set USE_FEATURE_INT_VECTORIZATION true`.
+3. Run `azd up` to update system and user roles.
+4. View the indexer and skillset in the Azure portal to monitor the status of the vectorization process.
+
 ### Enabling authentication
 
 By default, the deployed Azure web app will have no authentication or access restrictions enabled, meaning anyone with routable network access to the web app can chat with your indexed data.  You can require authentication to your Azure Active Directory by following the [Add app authentication](https://learn.microsoft.com/azure/app-service/scenario-secure-app-authentication-app-service) tutorial and set it up against the deployed web app.
 
@@ -2,6 +2,15 @@
 
 This guide provides more details for using the `prepdocs` script to index documents for the Chat App.
 
+- [Overview of the manual indexing process](#overview-of-the-manual-indexing-process)
+  - [Chunking](#chunking)
+  - [Indexing additional documents](#indexing-additional-documents)
+  - [Removing documents](#removing-documents)
+- [Overview of Integrated Vectorization](#overview-of-integrated-vectorization)
+  - [Indexing additional documents](#indexing-additional-documents-1)
+  - [Removing documents](#removing-documents-1)
+  - [Scheduled indexing](#scheduled-indexing)
+
 ## Overview of the manual indexing process
 
 The `scripts/prepdocs.py` script is responsible for both uploading and indexing documents. The typical usage is to call it using `scripts/prepdocs.sh` (Mac/Linux) or `scripts/prepdocs.ps1` (Windows), as these scripts will set up a Python virtual environment and pass in the required parameters based on the current `azd` environment. Whenever `azd up` or `azd provision` is run, the script is called automatically.
@@ -23,16 +32,44 @@ Chunking allows us to limit the amount of information we send to OpenAI due to t
 
 If needed, you can modify the chunking algorithm in `scripts/prepdocslib/textsplitter.py`.
 
-## Indexing additional documents
+### Indexing additional documents
 
 To upload more PDFs, put them in the data/ folder and run `./scripts/prepdocs.sh` or `./scripts/prepdocs.ps1`.
 
 A [recent change](https://github.com/Azure-Samples/azure-search-openai-demo/pull/835) added checks to see what's been uploaded before. The prepdocs script now writes an .md5 file with an MD5 hash of each file that gets uploaded. Whenever the prepdocs script is re-run, that hash is checked against the current hash and the file is skipped if it hasn't changed.
 
-## Removing documents
+### Removing documents
 
 You may want to remove documents from the index. For example, if you're using the sample data, you may want to remove the documents that are already in the index before adding your own.
 
 To remove all documents, use the `--removeall` flag. Open either `scripts/prepdocs.sh` or `scripts/prepdocs.ps1` and add `--removeall` to the command at the bottom of the file. Then run the script as usual.
 
 You can also remove individual documents by using the `--remove` flag. Open either `scripts/prepdocs.sh` or `scripts/prepdocs.ps1`, add `--remove` to the command at the bottom of the file, and replace `/data/*` with `/data/YOUR-DOCUMENT-FILENAME-GOES-HERE.pdf`. Then run the script as usual.
+
+## Overview of Integrated Vectorization
+
+Azure AI Search recently introduced an [integrated vectorization feature in preview mode](https://techcommunity.microsoft.com/t5/ai-azure-ai-services-blog/announcing-the-public-preview-of-integrated-vectorization-in/ba-p/3960809#:~:text=Integrated%20vectorization%20is%20a%20new%20feature%20of%20Azure,pull-indexers%2C%20and%20vectorization%20of%20text%20queries%20through%20vectorizers). This feature is a cloud-based approach to data ingestion that handles document format cracking, data extraction, chunking, vectorization, and indexing, all with Azure technologies.
+
+See [this notebook](https://github.com/Azure/azure-search-vector-samples/blob/main/demo-python/code/azure-search-integrated-vectorization-sample.ipynb) to understand the process of setting up integrated vectorization.
+We have integrated that code into our `prepdocs` script, so you can use it without needing to understand the details.
+
+This feature cannot be used on an existing index. You need to create a new index or drop and recreate an existing index.
+In the newly created index schema, a new field `parent_id` is added. The indexer uses it internally to manage the life cycle of chunks.
+
+This feature is not supported in the free SKU for Azure AI Search.
+
+### Indexing additional documents
+
+To add additional documents to the index, first upload them to your data source (Blob storage, by default).
+Then navigate to the Azure portal, find the indexer, and run it.
+The Azure AI Search indexer will identify the new documents and ingest them into the index.
+
+### Removing documents
+
+To remove documents from the index, remove them from your data source (Blob storage, by default).
+Then navigate to the Azure portal, find the indexer, and run it.
+The Azure AI Search indexer will take care of removing those documents from the index.
+
+### Scheduled indexing
+
+If you would like the indexer to run automatically, you can set it up to [run on a schedule](https://learn.microsoft.com/azure/search/search-howto-schedule-indexers).
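
As a rough illustration of the scheduled-indexing option described in the hunk above: the schedule is a property of the indexer definition, with the interval written as an ISO 8601 duration. The sketch below only builds that JSON fragment with the standard library and does not call Azure. The helper names `to_iso8601_duration` and `indexer_schedule` are hypothetical, and the 5-minute and 24-hour bounds are taken from the linked scheduling documentation, so confirm them there before relying on them.

```python
import json
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumed bounds, per the linked "run on a schedule" documentation.
MIN_INTERVAL = timedelta(minutes=5)
MAX_INTERVAL = timedelta(days=1)


def to_iso8601_duration(interval: timedelta) -> str:
    """Format a timedelta as a duration string such as 'PT2H' or 'P1D'."""
    total_seconds = int(interval.total_seconds())
    days, remainder = divmod(total_seconds, 86400)
    hours, remainder = divmod(remainder, 3600)
    minutes, seconds = divmod(remainder, 60)
    date_part = f"{days}D" if days else ""
    time_part = "".join(
        f"{value}{unit}"
        for value, unit in ((hours, "H"), (minutes, "M"), (seconds, "S"))
        if value
    )
    return "P" + date_part + ("T" + time_part if time_part else "")


def indexer_schedule(interval: timedelta, start_time: Optional[datetime] = None) -> dict:
    """Build the 'schedule' fragment of an indexer definition (hypothetical helper)."""
    if not MIN_INTERVAL <= interval <= MAX_INTERVAL:
        raise ValueError("interval must be between 5 minutes and 24 hours")
    schedule = {"interval": to_iso8601_duration(interval)}
    if start_time is not None:
        # Express the start time in UTC with a trailing 'Z'.
        utc_start = start_time.astimezone(timezone.utc)
        schedule["startTime"] = utc_start.strftime("%Y-%m-%dT%H:%M:%SZ")
    return {"schedule": schedule}


if __name__ == "__main__":
    print(json.dumps(indexer_schedule(timedelta(hours=2)), indent=2))
```

The resulting fragment would be merged into the indexer definition, whether that is edited in the portal or sent through the REST API or an SDK.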
@@ -62,3 +62,4 @@ resource search 'Microsoft.Search/searchServices@2021-04-01-preview' = {
 output id string = search.id
 output endpoint string = 'https://${name}.search.windows.net/'
 output name string = search.name
+output principalId string = search.identity.principalId
@@ -110,6 +110,8 @@ param useApplicationInsights bool = false
 
 @description('Show options to use vector embeddings for searching in the app UI')
 param useVectors bool = false
+@description('Use the built-in integrated vectorization feature of Azure AI Search to vectorize and ingest documents')
+param useIntegratedVectorization bool = false
 
 var abbrs = loadJsonContent('abbreviations.json')
 var resourceToken = toLower(uniqueString(subscription().id, environmentName, location))
@@ -504,6 +506,17 @@ module openAiRoleBackend 'core/security/role.bicep' = if (openAiHost == 'azure')
   }
 }
 
+module openAiRoleSearchService 'core/security/role.bicep' = if (openAiHost == 'azure' && useIntegratedVectorization) {
+  scope: openAiResourceGroup
+  name: 'openai-role-searchservice'
+  params: {
+    principalId: searchService.outputs.principalId
+    roleDefinitionId: '5e0bd9bd-7b93-4f28-af87-19fc36ad61bd'
+    principalType: 'ServicePrincipal'
+  }
+}
+
+
 module storageRoleBackend 'core/security/role.bicep' = {
   scope: storageResourceGroup
   name: 'storage-role-backend'
@@ -514,6 +527,16 @@ module storageRoleBackend 'core/security/role.bicep' = {
   }
 }
 
+module storageRoleSearchService 'core/security/role.bicep' = if (useIntegratedVectorization) {
+  scope: storageResourceGroup
+  name: 'storage-role-searchservice'
+  params: {
+    principalId: searchService.outputs.principalId
+    roleDefinitionId: '2a2b9908-6ea1-4ae2-8e65-a410df84e7d1'
+    principalType: 'ServicePrincipal'
+  }
+}
+
 // Used to issue search queries
 // https://learn.microsoft.com/azure/search/search-security-rbac
 module searchRoleBackend 'core/security/role.bicep' = if (!useSearchServiceKey) {
@@ -572,6 +595,7 @@ output AZURE_SEARCH_SERVICE string = searchService.outputs.name
 output AZURE_SEARCH_SECRET_NAME string = useSearchServiceKey ? searchServiceSecretName : ''
 output AZURE_SEARCH_SERVICE_RESOURCE_GROUP string = searchServiceResourceGroup.name
 output AZURE_SEARCH_SEMANTIC_RANKER string = actualSearchServiceSemanticRankerLevel
+output AZURE_SEARCH_SERVICE_ASSIGNED_USERID string = searchService.outputs.principalId
 
 output AZURE_STORAGE_ACCOUNT string = storage.outputs.name
 output AZURE_STORAGE_CONTAINER string = storageContainerName
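
For readers unfamiliar with the GUIDs in the two new role modules above: they are Azure built-in role definition IDs. The lookup below records the role names they correspond to in the Azure built-in roles list (worth confirming there). `role_definition_resource_id` is a hypothetical helper, not part of this PR, and the subscription ID in the usage line is a placeholder.

```python
# Built-in role definition IDs used by the new role modules in infra/main.bicep.
BUILT_IN_ROLES = {
    "5e0bd9bd-7b93-4f28-af87-19fc36ad61bd": "Cognitive Services OpenAI User",
    "2a2b9908-6ea1-4ae2-8e65-a410df84e7d1": "Storage Blob Data Reader",
}


def role_definition_resource_id(subscription_id: str, role_guid: str) -> str:
    """Return the subscription-scoped resource ID of a known built-in role."""
    if role_guid not in BUILT_IN_ROLES:
        raise KeyError(f"unknown role definition ID: {role_guid}")
    return (
        f"/subscriptions/{subscription_id}"
        f"/providers/Microsoft.Authorization/roleDefinitions/{role_guid}"
    )


if __name__ == "__main__":
    placeholder_subscription = "00000000-0000-0000-0000-000000000000"
    for guid, name in BUILT_IN_ROLES.items():
        print(name, "->", role_definition_resource_id(placeholder_subscription, guid))
```

In other words, the search service's managed identity is granted permission to call the Azure OpenAI embedding deployment and read access to the blobs it indexes.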
 
@@ -118,6 +118,9 @@
     },
     "allowedOrigin": {
       "value": "${ALLOWED_ORIGIN}"
+    },
+    "useIntegratedVectorization": {
+      "value": "${USE_FEATURE_INT_VECTORIZATION}"
     }
   }
 }
@@ -61,11 +61,16 @@ if ($env:AZURE_TENANT_ID) {
   $tenantArg = "--tenantid $env:AZURE_TENANT_ID"
 }
 
+if ($env:USE_FEATURE_INT_VECTORIZATION) {
+  $integratedVectorizationArg = "--useintvectorization $env:USE_FEATURE_INT_VECTORIZATION"
+}
+
 $cwd = (Get-Location)
 $dataArg = "`"$cwd/data/*`""
 
 $argumentList = "./scripts/prepdocs.py $dataArg --verbose " + `
-"--storageaccount $env:AZURE_STORAGE_ACCOUNT --container $env:AZURE_STORAGE_CONTAINER " + `
+"--subscriptionid $env:AZURE_SUBSCRIPTION_ID " + `
+"--storageaccount $env:AZURE_STORAGE_ACCOUNT --container $env:AZURE_STORAGE_CONTAINER --storageresourcegroup $env:AZURE_STORAGE_RESOURCE_GROUP " + `
 "--searchservice $env:AZURE_SEARCH_SERVICE --index $env:AZURE_SEARCH_INDEX " + `
 "$searchAnalyzerNameArg $searchSecretNameArg " + `
 "--openaihost `"$env:OPENAI_HOST`" --openaimodelname `"$env:AZURE_OPENAI_EMB_MODEL_NAME`" " + `
@@ -76,5 +81,9 @@ $argumentList = "./scripts/prepdocs.py $dataArg --verbose " + `
 "$adlsGen2StorageAccountArg $adlsGen2FilesystemArg $adlsGen2FilesystemPathArg  " + `
 "$tenantArg $aclArg " + `
 "$disableVectorsArg $localPdfParserArg " + `
-"$keyVaultName "
+"$keyVaultName " + `
+"$integratedVectorizationArg "
+
+$argumentList
+
 Start-Process -FilePath $venvPythonPath -ArgumentList $argumentList -Wait -NoNewWindow
@@ -15,7 +15,10 @@
     OpenAIEmbeddingService,
 )
 from prepdocslib.fileprocessor import FileProcessor
-from prepdocslib.filestrategy import DocumentAction, FileStrategy
+from prepdocslib.filestrategy import FileStrategy
+from prepdocslib.integratedvectorizerstrategy import (
+    IntegratedVectorizerStrategy,
+)
 from prepdocslib.jsonparser import JsonParser
 from prepdocslib.listfilestrategy import (
     ADLSGen2ListFileStrategy,
@@ -24,7 +27,7 @@
 )
 from prepdocslib.parser import Parser
 from prepdocslib.pdfparser import DocumentAnalysisParser, LocalPdfParser
-from prepdocslib.strategy import SearchInfo, Strategy
+from prepdocslib.strategy import DocumentAction, SearchInfo, Strategy
 from prepdocslib.textsplitter import SentenceTextSplitter, SimpleTextSplitter
 
 
@@ -45,12 +48,15 @@ async def get_vision_key(credential: AsyncTokenCredential) -> Optional[str]:
         exit(1)
 
 
-async def setup_file_strategy(credential: AsyncTokenCredential, args: Any) -> FileStrategy:
+async def setup_file_strategy(credential: AsyncTokenCredential, args: Any) -> Strategy:
     storage_creds = credential if is_key_empty(args.storagekey) else args.storagekey
     blob_manager = BlobManager(
         endpoint=f"https://{args.storageaccount}.blob.core.windows.net",
         container=args.container,
+        account=args.storageaccount,
         credential=storage_creds,
+        resourceGroup=args.storageresourcegroup,
+        subscriptionId=args.subscriptionid,
         store_page_images=args.searchimages,
         verbose=args.verbose,
     )
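
The new `account`, `resourceGroup`, and `subscriptionId` arguments are not used within this hunk. Going by the `--subscriptionid` help text later in the diff, they feed a managed-identity connection string for the indexer's data source. A minimal sketch of such a string follows, assuming the `ResourceId=...;` format from the Azure AI Search managed-identity documentation. The function name is hypothetical, and how `BlobManager` actually builds the string is not shown in this diff.

```python
def managed_identity_connection_string(
    subscription_id: str, resource_group: str, account: str
) -> str:
    """Build a keyless storage connection string for a search data source (assumed format)."""
    parts = {
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "account": account,
    }
    for label, value in parts.items():
        # An empty or missing value would produce a malformed resource ID.
        if not value:
            raise ValueError(f"{label} is required")
    return (
        f"ResourceId=/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Storage/storageAccounts/{account};"
    )


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(managed_identity_connection_string("my-sub-id", "my-rg", "mystorageaccount"))
```

This is why the storage resource group and subscription ID have to be passed down from the shell scripts: a key-based connection needs only the account and key, while the keyless form needs the full resource path.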
@@ -145,6 +151,70 @@ async def setup_file_strategy(credential: AsyncTokenCredential, args: Any) -> Fi
     )
 
 
+async def setup_intvectorizer_strategy(credential: AsyncTokenCredential, args: Any) -> Strategy:
+    storage_creds = credential if is_key_empty(args.storagekey) else args.storagekey
+    blob_manager = BlobManager(
+        endpoint=f"https://{args.storageaccount}.blob.core.windows.net",
+        container=args.container,
+        account=args.storageaccount,
+        credential=storage_creds,
+        resourceGroup=args.storageresourcegroup,
+        subscriptionId=args.subscriptionid,
+        store_page_images=args.searchimages,
+        verbose=args.verbose,
+    )
+
+    use_vectors = not args.novectors
+    embeddings: Union[AzureOpenAIEmbeddingService, None] = None
+    if use_vectors and args.openaihost != "openai":
+        azure_open_ai_credential: Union[AsyncTokenCredential, AzureKeyCredential] = (
+            credential if is_key_empty(args.openaikey) else AzureKeyCredential(args.openaikey)
+        )
+        embeddings = AzureOpenAIEmbeddingService(
+            open_ai_service=args.openaiservice,
+            open_ai_deployment=args.openaideployment,
+            open_ai_model_name=args.openaimodelname,
+            credential=azure_open_ai_credential,
+            disable_batch=args.disablebatchvectors,
+            verbose=args.verbose,
+        )
+
+    print("Processing files...")
+    list_file_strategy: ListFileStrategy
+    if args.datalakestorageaccount:
+        adls_gen2_creds = credential if is_key_empty(args.datalakekey) else args.datalakekey
+        print(f"Using Data Lake Gen2 Storage Account {args.datalakestorageaccount}")
+        list_file_strategy = ADLSGen2ListFileStrategy(
+            data_lake_storage_account=args.datalakestorageaccount,
+            data_lake_filesystem=args.datalakefilesystem,
+            data_lake_path=args.datalakepath,
+            credential=adls_gen2_creds,
+            verbose=args.verbose,
+        )
+    else:
+        print(f"Using local files in {args.files}")
+        list_file_strategy = LocalListFileStrategy(path_pattern=args.files, verbose=args.verbose)
+
+    if args.removeall:
+        document_action = DocumentAction.RemoveAll
+    elif args.remove:
+        document_action = DocumentAction.Remove
+    else:
+        document_action = DocumentAction.Add
+
+    return IntegratedVectorizerStrategy(
+        list_file_strategy=list_file_strategy,
+        blob_manager=blob_manager,
+        document_action=document_action,
+        embeddings=embeddings,
+        subscription_id=args.subscriptionid,
+        search_service_user_assigned_id=args.searchserviceassignedid,
+        search_analyzer_name=args.searchanalyzername,
+        use_acls=args.useacls,
+        category=args.category,
+    )
+
+
 async def main(strategy: Strategy, credential: AsyncTokenCredential, args: Any):
     search_key = args.searchkey
     if args.keyvaultname and args.searchsecretname:
@@ -203,6 +273,7 @@ async def main(strategy: Strategy, credential: AsyncTokenCredential, args: Any):
     )
     parser.add_argument("--storageaccount", help="Azure Blob Storage account name")
     parser.add_argument("--container", help="Azure Blob Storage container name")
+    parser.add_argument("--storageresourcegroup", help="Azure Blob Storage resource group")
     parser.add_argument(
         "--storagekey",
         required=False,
@@ -211,10 +282,20 @@ async def main(strategy: Strategy, credential: AsyncTokenCredential, args: Any):
     parser.add_argument(
         "--tenantid", required=False, help="Optional. Use this to define the Azure directory where to authenticate)"
     )
+    parser.add_argument(
+        "--subscriptionid",
+        required=False,
+        help="Optional. Subscription ID used to build the managed identity connection string for integrated vectorization",
+    )
     parser.add_argument(
         "--searchservice",
         help="Name of the Azure AI Search service where content should be indexed (must exist already)",
     )
+    parser.add_argument(
+        "--searchserviceassignedid",
+        required=False,
+        help="Optional. System-assigned managed identity of the search service (used for integrated vectorization)",
+    )
     parser.add_argument(
         "--index",
         help="Name of the Azure AI Search index where content should be indexed (will be created if it doesn't exist)",
@@ -309,8 +390,14 @@ async def main(strategy: Strategy, credential: AsyncTokenCredential, args: Any):
         required=False,
         help="Required if --searchimages is specified and --keyvaultname is provided. Fetch the Azure AI Vision key from this key vault instead of using the current user identity to login.",
     )
+    parser.add_argument(
+        "--useintvectorization",
+        required=False,
+        help="Optional. Set to 'true' to enable integrated vectorization indexer support (in preview)",
+    )
     parser.add_argument("--verbose", "-v", action="store_true", help="Verbose output")
     args = parser.parse_args()
+    use_int_vectorization = args.useintvectorization and args.useintvectorization.lower() == "true"
 
     # Use the current user identity to connect to Azure services unless a key is explicitly set for any of them
     azd_credential = (
@@ -320,6 +407,10 @@ async def main(strategy: Strategy, credential: AsyncTokenCredential, args: Any):
     )
 
     loop = asyncio.get_event_loop()
-    file_strategy = loop.run_until_complete(setup_file_strategy(azd_credential, args))
-    loop.run_until_complete(main(file_strategy, azd_credential, args))
+    ingestion_strategy = None
+    if use_int_vectorization:
+        ingestion_strategy = loop.run_until_complete(setup_intvectorizer_strategy(azd_credential, args))
+    else:
+        ingestion_strategy = loop.run_until_complete(setup_file_strategy(azd_credential, args))
+    loop.run_until_complete(main(ingestion_strategy, azd_credential, args))
     loop.close()
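
The flag handling added in the last two hunks boils down to parsing a string-valued option and picking a setup coroutine. Below is a stripped-down sketch of that flow. The two `*_stub` coroutines stand in for the real `setup_file_strategy` and `setup_intvectorizer_strategy`, `choose_strategy` is a hypothetical name, and `asyncio.run` is swapped in for the explicit event-loop calls used in the diff.

```python
import argparse
import asyncio
from typing import List


def parse_use_int_vectorization(argv: List[str]) -> bool:
    """Parse --useintvectorization the same way the diff does, as a string compared to 'true'."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--useintvectorization", required=False)
    args = parser.parse_args(argv)
    # bool() makes a missing flag yield False rather than None.
    return bool(args.useintvectorization and args.useintvectorization.lower() == "true")


async def setup_file_strategy_stub() -> str:
    # Stand-in for the manual ingestion strategy.
    return "file"


async def setup_intvectorizer_strategy_stub() -> str:
    # Stand-in for the integrated vectorization strategy.
    return "integrated-vectorizer"


def choose_strategy(argv: List[str]) -> str:
    """Select and run the setup coroutine that matches the flag."""
    if parse_use_int_vectorization(argv):
        return asyncio.run(setup_intvectorizer_strategy_stub())
    return asyncio.run(setup_file_strategy_stub())


if __name__ == "__main__":
    print(choose_strategy(["--useintvectorization", "true"]))
```

Because the option takes a string rather than using `action="store_true"`, the shell scripts can forward the `USE_FEATURE_INT_VECTORIZATION` value unchanged, and any value other than a case-insensitive `true` falls back to the manual strategy.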
@@ -66,8 +66,13 @@ if [ -n "$AZURE_TENANT_ID" ]; then
   tenantArg="--tenantid $AZURE_TENANT_ID"
 fi
 
+if [ -n "$USE_FEATURE_INT_VECTORIZATION" ]; then
+  integratedVectorizationArg="--useintvectorization $USE_FEATURE_INT_VECTORIZATION"
+fi
+
 ./scripts/.venv/bin/python ./scripts/prepdocs.py './data/*' --verbose \
---storageaccount "$AZURE_STORAGE_ACCOUNT" --container "$AZURE_STORAGE_CONTAINER"  \
+--subscriptionid "$AZURE_SUBSCRIPTION_ID" \
+--storageaccount "$AZURE_STORAGE_ACCOUNT" --container "$AZURE_STORAGE_CONTAINER" --storageresourcegroup "$AZURE_STORAGE_RESOURCE_GROUP" \
 --searchservice "$AZURE_SEARCH_SERVICE" --index "$AZURE_SEARCH_INDEX" \
 $searchAnalyzerNameArg $searchSecretNameArg \
 --openaihost "$OPENAI_HOST" --openaimodelname "$AZURE_OPENAI_EMB_MODEL_NAME" \
@@ -78,4 +83,5 @@ $searchImagesArg $visionEndpointArg $visionKeyArg $visionSecretNameArg \
 $adlsGen2StorageAccountArg $adlsGen2FilesystemArg $adlsGen2FilesystemPathArg \
 $tenantArg $aclArg \
 $disableVectorsArg $localPdfParserArg \
-$keyVaultName
+$keyVaultName \
+$integratedVectorizationArg