Commit f694996

Merge pull request #1376 from HeidiSteen/heidist-ignite
[release-azure-search] List portal data support for structured data sources
2 parents 579e2fd + 6b185ec commit f694996

5 files changed: +96 -46 lines changed

articles/search/search-file-storage-integration.md

Lines changed: 17 additions & 2 deletions
@@ -9,7 +9,7 @@ ms.service: azure-ai-search
 ms.custom:
 - ignite-2023
 ms.topic: how-to
-ms.date: 08/23/2024
+ms.date: 11/19/2024
 ---

 # Index data from Azure Files
@@ -19,7 +19,12 @@ ms.date: 08/23/2024

 In this article, learn how to configure an [**indexer**](search-indexer-overview.md) that imports content from Azure Files and makes it searchable in Azure AI Search. Inputs to the indexer are your files in a single share. Output is a search index with searchable content and metadata stored in individual fields.

-This article supplements [**Create an indexer**](search-howto-create-indexers.md) with information that's specific to indexing files in Azure Storage. It uses the REST APIs to demonstrate a three-part workflow common to all indexers: create a data source, create an index, create an indexer. Data extraction occurs when you submit the Create Indexer request.
+To configure and run the indexer, you can use:
+
++ [Search Service preview REST APIs](/rest/api/searchservice), any preview version.
++ An Azure SDK package, any version.
++ [Import data](search-get-started-portal.md) wizard in the Azure portal.
++ [Import and vectorize data](search-get-started-portal-import-vectors.md) wizard in the Azure portal.

 ## Prerequisites

@@ -33,6 +38,16 @@ This article supplements [**Create an indexer**](search-howto-create-indexers.md

 + Use a [REST client](search-get-started-rest.md) to formulate REST calls similar to the ones shown in this article.

+## Supported tasks
+
+You can use this indexer for the following tasks:
+
++ **Data indexing and incremental indexing:** The indexer can index files and associated metadata from tables. It detects new and updated files and metadata through built-in change detection. You can configure data refresh on a schedule or on demand.
++ **Deletion detection:** The indexer can [detect deletions through custom metadata](search-howto-index-changed-deleted-blobs.md).
++ **Applied AI through skillsets:** [Skillsets](cognitive-search-concept-intro.md) are fully supported by the indexer. This includes key features like [integrated vectorization](vector-search-integrated-vectorization.md) that adds data chunking and embedding steps.
++ **Parsing modes:** The indexer supports [JSON parsing modes](search-howto-index-json-blobs.md) if you want to parse JSON arrays or lines into individual search documents. It also supports [Markdown parsing mode](search-how-to-index-markdown-blobs.md).
++ **Compatibility with other features:** The indexer is designed to work seamlessly with other indexer features, such as [debug sessions](cognitive-search-debug-session.md), [indexer cache for incremental enrichments](search-howto-incremental-index.md), and [knowledge store](knowledge-store-concept-intro.md).
+
 ## Supported document formats

 The Azure Files indexer can extract text from the following document formats:
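
Note on the three-part workflow mentioned in the paragraph this commit removes (create a data source, create an index, create an indexer): the sketch below only builds the three request payloads in order and sends nothing. The service URL, object names, and index fields are hypothetical placeholders. The `azurefile` data source type, the `2024-05-01-preview` API version, and the `base64Encode` field mapping are assumptions based on the Azure AI Search REST conventions, so check them against the REST reference before use.

```python
# Hypothetical placeholders; replace with real values before sending anything.
SERVICE = "https://my-search-service.search.windows.net"
API_VERSION = "2024-05-01-preview"  # assumption: any preview version that supports this indexer


def build_requests(share_name, connection_string, index_name="files-index"):
    """Return (method, url, body) tuples in the order they would be submitted."""
    data_source = {
        "name": f"{index_name}-ds",
        "type": "azurefile",  # assumption: data source type for Azure Files
        "credentials": {"connectionString": connection_string},
        "container": {"name": share_name},  # a single file share
    }
    index = {
        "name": index_name,
        "fields": [
            {"name": "id", "type": "Edm.String", "key": True},
            {"name": "content", "type": "Edm.String", "searchable": True},
        ],
    }
    indexer = {
        "name": f"{index_name}-indexer",
        "dataSourceName": data_source["name"],
        "targetIndexName": index["name"],
        # Assumption: the file path is used as the document key, base64-encoded
        # so that it only contains characters allowed in a key.
        "fieldMappings": [
            {
                "sourceFieldName": "metadata_storage_path",
                "targetFieldName": "id",
                "mappingFunction": {"name": "base64Encode"},
            }
        ],
    }

    def url(collection):
        return f"{SERVICE}/{collection}?api-version={API_VERSION}"

    # Data extraction starts when the last request (Create Indexer) is submitted.
    return [
        ("POST", url("datasources"), data_source),
        ("POST", url("indexes"), index),
        ("POST", url("indexers"), indexer),
    ]
```

Calling `build_requests("my-share", "<connection string>")` returns the three requests in submission order; each body can be serialized with `json.dumps` and sent from any REST client.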

articles/search/search-get-started-portal-import-vectors.md

Lines changed: 43 additions & 30 deletions
@@ -8,47 +8,44 @@ ms.service: azure-ai-search
 ms.custom:
 - build-2024
 ms.topic: quickstart
-ms.date: 10/18/2024
+ms.date: 11/19/2024
 ---

 # Quickstart: Vectorize text and images by using the Azure portal

 This quickstart helps you get started with [integrated vectorization](vector-search-integrated-vectorization.md) by using the **Import and vectorize data** wizard in the Azure portal. The wizard chunks your content and calls an embedding model to vectorize content during indexing and for queries.

-Key points about the wizard:
+## Prerequisites

-+ Supported data sources are Azure Blob Storage, Azure Data Lake Storage (ADLS) Gen2, or OneLake files and shortcuts.
-+ Supported embedding models are hosted on Azure OpenAI, Azure AI Studio model catalog, Azure AI Vision multimodal.
-+ Index schema provides vector and nonvector fields for chunked data.
-+ You can add fields, but you can't delete or modify generated fields.
-+ Document parsing mode creates chunks (one search document per chunk).
-+ Chunking is nonconfigurable. The effective settings are:
++ An Azure subscription. [Create one for free](https://azure.microsoft.com/free/).

-```json
-"textSplitMode": "pages",
-"maximumPageLength": 2000,
-"pageOverlapLength": 500,
-"maximumPagesToTake": 0, #unlimited
-"unit": "characters",
-```
++ [An Azure AI Search service](search-create-service-portal.md) in the same region as Azure AI. We recommend the Basic tier or higher.

-## Prerequisites
++ [A supported data source](#supported-data-sources).

-+ An Azure subscription. [Create one for free](https://azure.microsoft.com/free/).
++ [A supported embedding model](#supported-embedding-models).
+
+### Supported data sources
+
++ [Azure Data Lake Storage (ADLS) Gen2](/azure/storage/blobs/create-data-lake-storage-account) (a storage account with a hierarchical namespace).

-+ [Azure AI Search service](search-create-service-portal.md) in the same region as Azure AI. We recommend the Basic tier or higher.
++ [Azure Storage](/azure/storage/common/storage-account-create) for blobs, files, and tables. Azure Storage must be a standard performance (general-purpose v2) account. Access tiers can be hot, cool, and cold.

-+ [Azure Blob Storage](/azure/storage/common/storage-account-create), [Azure Data Lake Storage (ADLS) Gen2](/azure/storage/blobs/create-data-lake-storage-account) (a storage account with a hierarchical namespace), or a [OneLake lakehouse](search-how-to-index-onelake-files.md).
++ [Azure Cosmos DB](/azure/cosmos-db/nosql/quickstart-portal) for NoSQL, Mongo DB, and Apache Gremlin.

-Azure Storage must be a standard performance (general-purpose v2) account. Access tiers can be hot, cool, and cold.
++ [Azure SQL Database](/azure/azure-sql/database/single-database-create-quickstart), [Azure SQL Managed Instance](/azure/azure-sql/managed-instance/instance-create-quickstart), and Azure SQL Server virtual machines.

-+ An embedding model on an Azure AI platform in the [same region as Azure AI Search](search-create-service-portal.md#regions-with-the-most-overlap). [Deployment instructions](#set-up-embedding-models) are in this article.
++ [OneLake lakehouse](search-how-to-index-onelake-files.md).

-| Provider | Supported models |
-|---|---|
-| [Azure OpenAI Service](https://aka.ms/oai/access) | text-embedding-ada-002, text-embedding-3-large, or text-embedding-3-small. |
-| [Azure AI Studio model catalog](/azure/ai-studio/what-is-ai-studio) | Azure, Cohere, and Facebook embedding models. |
-| [Azure AI services multi-service account](/azure/ai-services/multi-service-resource) | [Azure AI Vision multimodal](/azure/ai-services/computer-vision/how-to/image-retrieval) for image and text vectorization. Azure AI Vision multimodal is available in selected regions. [Check the documentation](/azure/ai-services/computer-vision/how-to/image-retrieval?tabs=csharp) for an updated list. **To use this resource, the account must be in an available region and in the same region as Azure AI Search**. |
+### Supported embedding models
+
+Use an embedding model on an Azure AI platform in the [same region as Azure AI Search](search-create-service-portal.md#regions-with-the-most-overlap). [Deployment instructions](#set-up-embedding-models) are in this article.
+
+| Provider | Supported models |
+|---|---|
+| [Azure OpenAI Service](https://aka.ms/oai/access) | text-embedding-ada-002, text-embedding-3-large, or text-embedding-3-small. |
+| [Azure AI Studio model catalog](/azure/ai-studio/what-is-ai-studio) | Azure, Cohere, and Facebook embedding models. |
+| [Azure AI services multi-service account](/azure/ai-services/multi-service-resource) | [Azure AI Vision multimodal](/azure/ai-services/computer-vision/how-to/image-retrieval) for image and text vectorization. Azure AI Vision multimodal is available in selected regions. [Check the documentation](/azure/ai-services/computer-vision/how-to/image-retrieval?tabs=csharp) for an updated list. **To use this resource, the account must be in an available region and in the same region as Azure AI Search**. |

 If using the Azure OpenAI Service, it must have an associated [custom subdomain](/azure/ai-services/cognitive-services-custom-subdomains). If the service was created through the Azure portal, this subdomain is automatically generated as part of your service setup. Ensure that your service includes a custom subdomain before using it with the Azure AI Search integration.

@@ -60,17 +57,17 @@ For the purposes of this quickstart, all of the preceding resources must have pu

 If private endpoints are already present and you can't disable them, the alternative option is to run the respective end-to-end flow from a script or program on a virtual machine. The virtual machine must be on the same virtual network as the private endpoint. [Here's a Python code sample](https://github.com/Azure/azure-search-vector-samples/tree/main/demo-python/code/integrated-vectorization) for integrated vectorization. The same [GitHub repo](https://github.com/Azure/azure-search-vector-samples/tree/main) has samples in other programming languages.

-### Role-based access control requirements
+### Role requirements

 We recommend role assignments for search service connections to other resources.

 1. On Azure AI Search, [enable roles](search-security-enable-roles.md).

 1. Configure your search service to [use a managed identity](search-howto-managed-identities-data-sources.md#create-a-system-managed-identity).

-1. On your data source platform and embedding model provider, create role assignments that allow search service to access data and models. [Prepare sample data](#prepare-sample-data) provides instructions for setting up roles.
+1. On your data source platform and embedding model provider, create role assignments that allow search service to access data and models. [Prepare sample data](#prepare-sample-data) provides instructions for setting up roles for each supported data source.

-A free search service supports RBAC on connections to Azure AI Search, but it doesn't support managed identities on outbound connections to Azure Storage or Azure AI Vision. This level of support means you must use key-based authentication on connections between a free search service and other Azure services.
+A free search service supports role-based connections to Azure AI Search, but it doesn't support managed identities on outbound connections to Azure Storage or Azure AI Vision. This level of support means you must use key-based authentication on connections between a free search service and other Azure services.

 For more secure connections:

@@ -216,7 +213,7 @@ The wizard supports Azure, Cohere, and Facebook embedding models in the Azure AI

 1. Deploy a supported embedding model to the model catalog in your project.

-1. For RBAC, create two role assignments: one for Azure AI Search, and another for the AI Studio project. Assign the [Cognitive Services OpenAI User](/azure/ai-services/openai/how-to/role-based-access-control) role for embeddings and vectorization.
+1. For role-based connections, create two role assignments: one for Azure AI Search, and another for the AI Studio project. Assign the [Cognitive Services OpenAI User](/azure/ai-services/openai/how-to/role-based-access-control) role for embeddings and vectorization.

 ---

@@ -307,6 +304,16 @@ Support for OneLake indexing is in preview. For more information about supported

 In this step, specify the embedding model for vectorizing chunked data.

+Chunking is built-in and nonconfigurable. The effective settings are:
+
+```json
+"textSplitMode": "pages",
+"maximumPageLength": 2000,
+"pageOverlapLength": 500,
+"maximumPagesToTake": 0, #unlimited
+"unit": "characters"
+```
+
 1. On the **Vectorize your text** page, choose the source of the embedding model:

 + Azure OpenAI
@@ -361,6 +368,12 @@ On the **Advanced settings** page, you can optionally add [semantic ranking](sem

 ## Map new fields

+Key points about this step:
+
++ Index schema provides vector and nonvector fields for chunked data.
++ You can add fields, but you can't delete or modify generated fields.
++ Document parsing mode creates chunks (one search document per chunk).
+
 On the **Advanced settings** page, you can optionally add new fields. By default, the wizard generates the following fields with these attributes:

 | Field | Applies to | Description |
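
Note on the chunking settings this file's diff moves into the vectorization step (`maximumPageLength` of 2000, `pageOverlapLength` of 500, unit of characters): the sketch below is a simplified fixed-window approximation of what those numbers imply, not the wizard's implementation. The function name `chunk_pages` is hypothetical, and the real text split step may place page boundaries differently, for example at sentence ends.

```python
def chunk_pages(text, maximum_page_length=2000, page_overlap_length=500):
    """Split text into character windows where consecutive windows share an overlap.

    Simplified illustration: each page starts
    (maximum_page_length - page_overlap_length) characters after the previous one.
    """
    if page_overlap_length >= maximum_page_length:
        raise ValueError("overlap must be smaller than the page length")

    step = maximum_page_length - page_overlap_length
    pages = []
    start = 0
    while start < len(text):
        pages.append(text[start:start + maximum_page_length])
        # Stop once the current page already reaches the end of the text.
        if start + maximum_page_length >= len(text):
            break
        start += step
    return pages
```

With the settings from the diff, a 5,000-character document yields pages starting at offsets 0, 1500, and 3000, and each page repeats the last 500 characters of the previous one. Because document parsing creates one search document per chunk, that input would produce three search documents under this approximation.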

articles/search/search-get-started-portal.md

Lines changed: 1 addition & 1 deletion
@@ -7,7 +7,7 @@ author: HeidiSteen
 ms.author: heidist
 ms.service: azure-ai-search
 ms.topic: quickstart
-ms.date: 05/30/2024
+ms.date: 11/19/2024
 ms.custom:
 - mode-ui
 - ignite-2023

articles/search/search-how-to-index-onelake-files.md

Lines changed: 17 additions & 10 deletions
@@ -9,22 +9,19 @@ ms.service: azure-ai-search
 ms.custom:
 - build-2024
 ms.topic: how-to
-ms.date: 06/12/2024
+ms.date: 11/19/2024
 ---

 # Index data from OneLake files and shortcuts

 In this article, learn how to configure a OneLake files indexer for extracting searchable data and metadata data from a [lakehouse](/fabric/onelake/create-lakehouse-onelake) on top of [OneLake](/fabric/onelake/onelake-overview).

-Use this indexer for the following tasks:
-
-+ **Data indexing and incremental indexing:** The indexer can index files and associated metadata from data paths within a lakehouse. It detects new and updated files and metadata through built-in change detection. You can configure data refresh on a schedule or on demand.
-+ **Deletion detection:** The indexer can [detect deletions via custom metadata](#detect-deletions-via-custom-metadata) for most files and shortcuts. This requires adding metadata to files to signify that they have been "soft deleted", enabling their removal from the search index. Currently, it's not possible to detect deletions in Google Cloud Storage or Amazon S3 shortcut files because custom metadata isn't supported for those data sources.
-+ **Applied AI through skillsets:** [Skillsets](cognitive-search-concept-intro.md) are fully supported by the OneLake files indexer. This includes key features like [integrated vectorization](vector-search-integrated-vectorization.md) that adds data chunking and embedding steps.
-+ **Parsing modes:** The indexer supports [JSON parsing modes](search-howto-index-json-blobs.md) if you want to parse JSON arrays or lines into individual search documents.
-+ **Compatibility with other features:** The OneLake indexer is designed to work seamlessly with other indexer features, such as [debug sessions](cognitive-search-debug-session.md), [indexer cache for incremental enrichments](search-howto-incremental-index.md), and [knowledge store](knowledge-store-concept-intro.md).
+To configure and run the indexer, you can use:

-Use the [2024-05-01-preview REST API](/rest/api/searchservice/data-sources/create-or-update?view=rest-searchservice-2024-05-01-preview&tabs=HTTP&preserve-view=true), a beta Azure SDK package, or [Import and vectorize data](search-get-started-portal-import-vectors.md) in the Azure portal to index from OneLake.
++ [2024-05-01-preview REST API](/rest/api/searchservice/data-sources/create-or-update?view=rest-searchservice-2024-05-01-preview&tabs=HTTP&preserve-view=true) or a newer preview REST API.
++ An Azure SDK beta package that provides the feature.
++ [Import data](search-get-started-portal.md) wizard in the Azure portal.
++ [Import and vectorize data](search-get-started-portal-import-vectors.md) wizard in the Azure portal.

 This article uses the REST APIs to illustrate each step.

@@ -47,7 +44,17 @@ This article uses the REST APIs to illustrate each step.
 + A Contributor role assignment in the Microsoft Fabric workspace where the lakehouse is located. Steps are outlined in the [Grant permissions](#assign-service-permissions) section of this article.

 + [A REST client](search-get-started-rest.md) to formulate REST calls similar to the ones shown in this article.
+
+## Supported tasks
+
+You can use this indexer for the following tasks:

++ **Data indexing and incremental indexing:** The indexer can index files and associated metadata from data paths within a lakehouse. It detects new and updated files and metadata through built-in change detection. You can configure data refresh on a schedule or on demand.
++ **Deletion detection:** The indexer can [detect deletions via custom metadata](#detect-deletions-via-custom-metadata) for most files and shortcuts. This requires adding metadata to files to signify that they have been "soft deleted", enabling their removal from the search index. Currently, it's not possible to detect deletions in Google Cloud Storage or Amazon S3 shortcut files because custom metadata isn't supported for those data sources.
++ **Applied AI through skillsets:** [Skillsets](cognitive-search-concept-intro.md) are fully supported by the OneLake files indexer. This includes key features like [integrated vectorization](vector-search-integrated-vectorization.md) that adds data chunking and embedding steps.
++ **Parsing modes:** The indexer supports [JSON parsing modes](search-howto-index-json-blobs.md) if you want to parse JSON arrays or lines into individual search documents. It also supports [Markdown parsing mode](search-how-to-index-markdown-blobs.md).
++ **Compatibility with other features:** The OneLake indexer is designed to work seamlessly with other indexer features, such as [debug sessions](cognitive-search-debug-session.md), [indexer cache for incremental enrichments](search-howto-incremental-index.md), and [knowledge store](knowledge-store-concept-intro.md).
+
 <a name="SupportedFormats"></a>

 ## Supported document formats
@@ -69,7 +76,7 @@ The following OneLake shortcuts are supported by the OneLake files indexer:
 + [Google Cloud Storage shortcut](/fabric/onelake/create-gcs-shortcut)

 ## Limitations in this preview
-
+
 + Parquet (including delta parquet) file types aren't currently supported.

 + File deletion isn't supported for Amazon S3 and Google Cloud Storage shortcuts.
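
Note on the deletion detection bullet in this file's diff (files flagged as "soft deleted" through custom metadata are removed from the search index): the sketch below illustrates the idea locally and calls no service. The policy property names follow the soft-delete policy shape as I understand it from the Azure AI Search REST reference, and the metadata name `IsDeleted`, the marker value, and the `files_to_remove` helper are all hypothetical choices for illustration.

```python
def soft_delete_policy(metadata_name="IsDeleted", marker_value="true"):
    """Build a deletion detection policy fragment for a data source definition.

    Assumption: property names match the soft-delete column policy used by
    Azure AI Search data sources; verify against the REST reference.
    """
    return {
        "@odata.type": "#Microsoft.Azure.Search.SoftDeleteColumnDeletionDetectionPolicy",
        "softDeleteColumnName": metadata_name,
        "softDeleteMarkerValue": marker_value,
    }


def files_to_remove(files, policy):
    """Return paths whose custom metadata carries the soft-delete marker.

    `files` maps a file path to its custom metadata dictionary. Files with no
    custom metadata (as with shortcut sources that don't support it) can never
    match, which is why deletions there go undetected.
    """
    name = policy["softDeleteColumnName"]
    marker = policy["softDeleteMarkerValue"]
    return [path for path, metadata in files.items() if metadata.get(name) == marker]
```

For example, given `{"a.pdf": {"IsDeleted": "true"}, "b.pdf": {}}`, only `a.pdf` is selected for removal. A file reached through an Amazon S3 or Google Cloud Storage shortcut behaves like `b.pdf` here, because no custom metadata is available to flag it.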
