articles/search/search-file-storage-integration.md
Lines changed: 17 additions & 2 deletions
@@ -9,7 +9,7 @@ ms.service: azure-ai-search
9
9
ms.custom:
10
10
- ignite-2023
11
11
ms.topic: how-to
12
-
ms.date: 08/23/2024
12
+
ms.date: 11/19/2024
13
13
---
14
14
15
15
# Index data from Azure Files
@@ -19,7 +19,12 @@ ms.date: 08/23/2024
19
19
20
20
In this article, learn how to configure an [**indexer**](search-indexer-overview.md) that imports content from Azure Files and makes it searchable in Azure AI Search. Inputs to the indexer are your files in a single share. Output is a search index with searchable content and metadata stored in individual fields.
21
21
22
-
This article supplements [**Create an indexer**](search-howto-create-indexers.md) with information that's specific to indexing files in Azure Storage. It uses the REST APIs to demonstrate a three-part workflow common to all indexers: create a data source, create an index, create an indexer. Data extraction occurs when you submit the Create Indexer request.
22
+
To configure and run the indexer, you can use:
23
+
24
+
+ [Search Service preview REST APIs](/rest/api/searchservice), any preview version.
25
+
+ An Azure SDK package, any version.
26
+
+ [Import data](search-get-started-portal.md) wizard in the Azure portal.
27
+
+ [Import and vectorize data](search-get-started-portal-import-vectors.md) wizard in the Azure portal.
23
28
24
29
## Prerequisites
25
30
@@ -33,6 +38,16 @@ This article supplements [**Create an indexer**](search-howto-create-indexers.md
33
38
34
39
+ Use a [REST client](search-get-started-rest.md) to formulate REST calls similar to the ones shown in this article.
35
40
41
+
## Supported tasks
42
+
43
+
You can use this indexer for the following tasks:
44
+
45
+
+ **Data indexing and incremental indexing:** The indexer can index files and associated metadata from a file share. It detects new and updated files and metadata through built-in change detection. You can configure data refresh on a schedule or on demand.
46
+
+ **Deletion detection:** The indexer can [detect deletions through custom metadata](search-howto-index-changed-deleted-blobs.md).
47
+
+ **Applied AI through skillsets:** [Skillsets](cognitive-search-concept-intro.md) are fully supported by the indexer. This includes key features like [integrated vectorization](vector-search-integrated-vectorization.md) that adds data chunking and embedding steps.
48
+
+ **Parsing modes:** The indexer supports [JSON parsing modes](search-howto-index-json-blobs.md) if you want to parse JSON arrays or lines into individual search documents. It also supports [Markdown parsing mode](search-how-to-index-markdown-blobs.md).
49
+
+ **Compatibility with other features:** The indexer is designed to work seamlessly with other indexer features, such as [debug sessions](cognitive-search-debug-session.md), [indexer cache for incremental enrichments](search-howto-incremental-index.md), and [knowledge store](knowledge-store-concept-intro.md).
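
The incremental indexing and deletion detection tasks listed above can be sketched as request bodies for a data source and an indexer. This is a minimal illustration, not the article's own sample: the field names follow the Search Service REST API as commonly documented, while the resource names, the placeholder connection string, the `IsDeleted` metadata property, and the two-hour interval are all hypothetical values chosen for the example.

```python
import json

# OData type for soft-delete detection, as used in Search Service REST payloads.
SOFT_DELETE_POLICY_TYPE = (
    "#Microsoft.Azure.Search.SoftDeleteColumnDeletionDetectionPolicy"
)


def build_data_source(name, share_name, connection_string,
                      marker_property="IsDeleted", marker_value="true"):
    """Build a data source body for a single Azure Files share.

    marker_property and marker_value are hypothetical: they name the custom
    metadata property that flags a file as soft deleted.
    """
    return {
        "name": name,
        "type": "azurefile",
        "credentials": {"connectionString": connection_string},
        "container": {"name": share_name},
        "dataDeletionDetectionPolicy": {
            "@odata.type": SOFT_DELETE_POLICY_TYPE,
            "softDeleteColumnName": marker_property,
            "softDeleteMarkerValue": marker_value,
        },
    }


def build_indexer(name, data_source_name, index_name, interval="PT2H"):
    """Build an indexer body that refreshes on a schedule (ISO 8601 duration)."""
    return {
        "name": name,
        "dataSourceName": data_source_name,
        "targetIndexName": index_name,
        "schedule": {"interval": interval},
    }


if __name__ == "__main__":
    # All names and the connection string below are placeholders.
    data_source = build_data_source(
        "my-file-datasource", "my-file-share", "<storage-connection-string>"
    )
    indexer = build_indexer("my-file-indexer", data_source["name"], "my-index")
    print(json.dumps(data_source, indent=2))
    print(json.dumps(indexer, indent=2))
```

Omitting `schedule` from the indexer body leaves the indexer to run on demand only.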
50
+
36
51
## Supported document formats
37
52
38
53
The Azure Files indexer can extract text from the following document formats:
articles/search/search-get-started-portal-import-vectors.md
Lines changed: 43 additions & 30 deletions
@@ -8,47 +8,44 @@ ms.service: azure-ai-search
8
8
ms.custom:
9
9
- build-2024
10
10
ms.topic: quickstart
11
-
ms.date: 10/18/2024
11
+
ms.date: 11/19/2024
12
12
---
13
13
14
14
# Quickstart: Vectorize text and images by using the Azure portal
15
15
16
16
This quickstart helps you get started with [integrated vectorization](vector-search-integrated-vectorization.md) by using the **Import and vectorize data** wizard in the Azure portal. The wizard chunks your content and calls an embedding model to vectorize content during indexing and for queries.
17
17
18
-
Key points about the wizard:
18
+
## Prerequisites
19
19
20
-
+ Supported data sources are Azure Blob Storage, Azure Data Lake Storage (ADLS) Gen2, or OneLake files and shortcuts.
21
-
+ Supported embedding models are hosted on Azure OpenAI, Azure AI Studio model catalog, Azure AI Vision multimodal.
22
-
+ Index schema provides vector and nonvector fields for chunked data.
23
-
+ You can add fields, but you can't delete or modify generated fields.
+ [Azure Data Lake Storage (ADLS) Gen2](/azure/storage/blobs/create-data-lake-storage-account) (a storage account with a hierarchical namespace).
38
31
39
-
+ [Azure AI Search service](search-create-service-portal.md) in the same region as Azure AI. We recommend the Basic tier or higher.
32
+
+ [Azure Storage](/azure/storage/common/storage-account-create) for blobs, files, and tables. Azure Storage must be a standard performance (general-purpose v2) account. Access tiers can be hot, cool, or cold.
40
33
41
-
+ [Azure Blob Storage](/azure/storage/common/storage-account-create), [Azure Data Lake Storage (ADLS) Gen2](/azure/storage/blobs/create-data-lake-storage-account) (a storage account with a hierarchical namespace), or a [OneLake lakehouse](search-how-to-index-onelake-files.md).
34
+
+ [Azure Cosmos DB](/azure/cosmos-db/nosql/quickstart-portal) for NoSQL, MongoDB, and Apache Gremlin.
42
35
43
-
Azure Storage must be a standard performance (general-purpose v2) account. Access tiers can be hot, cool, and cold.
36
+
+ [Azure SQL Database](/azure/azure-sql/database/single-database-create-quickstart), [Azure SQL Managed Instance](/azure/azure-sql/managed-instance/instance-create-quickstart), and SQL Server on Azure virtual machines.
44
37
45
-
+ An embedding model on an Azure AI platform in the [same region as Azure AI Search](search-create-service-portal.md#regions-with-the-most-overlap). [Deployment instructions](#set-up-embedding-models) are in this article.
|[Azure OpenAI Service](https://aka.ms/oai/access)| text-embedding-ada-002, text-embedding-3-large, or text-embedding-3-small. |
50
-
|[Azure AI Studio model catalog](/azure/ai-studio/what-is-ai-studio)| Azure, Cohere, and Facebook embedding models. |
51
-
|[Azure AI services multi-service account](/azure/ai-services/multi-service-resource)|[Azure AI Vision multimodal](/azure/ai-services/computer-vision/how-to/image-retrieval) for image and text vectorization. Azure AI Vision multimodal is available in selected regions. [Check the documentation](/azure/ai-services/computer-vision/how-to/image-retrieval?tabs=csharp) for an updated list. **To use this resource, the account must be in an available region and in the same region as Azure AI Search**. |
40
+
### Supported embedding models
41
+
42
+
Use an embedding model on an Azure AI platform in the [same region as Azure AI Search](search-create-service-portal.md#regions-with-the-most-overlap). [Deployment instructions](#set-up-embedding-models) are in this article.
43
+
44
+
| Provider | Supported models |
45
+
|---|---|
46
+
|[Azure OpenAI Service](https://aka.ms/oai/access)| text-embedding-ada-002, text-embedding-3-large, or text-embedding-3-small. |
47
+
|[Azure AI Studio model catalog](/azure/ai-studio/what-is-ai-studio)| Azure, Cohere, and Facebook embedding models. |
48
+
|[Azure AI services multi-service account](/azure/ai-services/multi-service-resource)|[Azure AI Vision multimodal](/azure/ai-services/computer-vision/how-to/image-retrieval) for image and text vectorization. Azure AI Vision multimodal is available in selected regions. [Check the documentation](/azure/ai-services/computer-vision/how-to/image-retrieval?tabs=csharp) for an updated list. **To use this resource, the account must be in an available region and in the same region as Azure AI Search**. |
52
49
53
50
If you use Azure OpenAI Service, the resource must have an associated [custom subdomain](/azure/ai-services/cognitive-services-custom-subdomains). If the service was created through the Azure portal, this subdomain is automatically generated as part of your service setup. Ensure that your service includes a custom subdomain before using it with the Azure AI Search integration.
54
51
@@ -60,17 +57,17 @@ For the purposes of this quickstart, all of the preceding resources must have pu
60
57
61
58
If private endpoints are already present and you can't disable them, the alternative option is to run the respective end-to-end flow from a script or program on a virtual machine. The virtual machine must be on the same virtual network as the private endpoint. [Here's a Python code sample](https://github.com/Azure/azure-search-vector-samples/tree/main/demo-python/code/integrated-vectorization) for integrated vectorization. The same [GitHub repo](https://github.com/Azure/azure-search-vector-samples/tree/main) has samples in other programming languages.
62
59
63
-
### Role-based access control requirements
60
+
### Role requirements
64
61
65
62
We recommend role assignments for search service connections to other resources.
66
63
67
64
1. On Azure AI Search, [enable roles](search-security-enable-roles.md).
68
65
69
66
1. Configure your search service to [use a managed identity](search-howto-managed-identities-data-sources.md#create-a-system-managed-identity).
70
67
71
-
1. On your data source platform and embedding model provider, create role assignments that allow search service to access data and models. [Prepare sample data](#prepare-sample-data) provides instructions for setting up roles.
68
+
1. On your data source platform and embedding model provider, create role assignments that allow search service to access data and models. [Prepare sample data](#prepare-sample-data) provides instructions for setting up roles for each supported data source.
72
69
73
-
A free search service supports RBAC on connections to Azure AI Search, but it doesn't support managed identities on outbound connections to Azure Storage or Azure AI Vision. This level of support means you must use key-based authentication on connections between a free search service and other Azure services.
70
+
A free search service supports role-based connections to Azure AI Search, but it doesn't support managed identities on outbound connections to Azure Storage or Azure AI Vision. This level of support means you must use key-based authentication on connections between a free search service and other Azure services.
74
71
75
72
For more secure connections:
76
73
@@ -216,7 +213,7 @@ The wizard supports Azure, Cohere, and Facebook embedding models in the Azure AI
216
213
217
214
1. Deploy a supported embedding model to the model catalog in your project.
218
215
219
-
1. For RBAC, create two role assignments: one for Azure AI Search, and another for the AI Studio project. Assign the [Cognitive Services OpenAI User](/azure/ai-services/openai/how-to/role-based-access-control) role for embeddings and vectorization.
216
+
1. For role-based connections, create two role assignments: one for Azure AI Search, and another for the AI Studio project. Assign the [Cognitive Services OpenAI User](/azure/ai-services/openai/how-to/role-based-access-control) role for embeddings and vectorization.
220
217
221
218
---
222
219
@@ -307,6 +304,16 @@ Support for OneLake indexing is in preview. For more information about supported
307
304
308
305
In this step, specify the embedding model for vectorizing chunked data.
309
306
307
+
Chunking is built-in and nonconfigurable. The effective settings are:
308
+
309
+
```json
"textSplitMode": "pages",
"maximumPageLength": 2000,
"pageOverlapLength": 500,
"maximumPagesToTake": 0, #unlimited
"unit": "characters"
```
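
To make the effect of these settings concrete, here is a rough sketch of fixed-size chunking with overlap, measured in characters. It is an approximation under stated assumptions rather than the wizard's actual implementation: the function name `split_into_pages` is hypothetical, and the sketch cuts at exact character offsets and does not attempt to respect sentence or word boundaries.

```python
def split_into_pages(text, maximum_page_length=2000, page_overlap_length=500):
    """Split text into overlapping character windows.

    Defaults mirror the settings shown above. Each page starts
    (maximum_page_length - page_overlap_length) characters after the
    previous one, so consecutive pages share page_overlap_length characters.
    """
    if page_overlap_length >= maximum_page_length:
        raise ValueError("page_overlap_length must be smaller than maximum_page_length")

    step = maximum_page_length - page_overlap_length
    pages = []
    start = 0
    while start < len(text):
        pages.append(text[start:start + maximum_page_length])
        # Stop once the current window reaches the end of the text.
        if start + maximum_page_length >= len(text):
            break
        start += step
    return pages


if __name__ == "__main__":
    # Synthetic 5,000-character input for illustration only.
    sample = "".join(chr(65 + i % 26) for i in range(5000))
    for number, page in enumerate(split_into_pages(sample), start=1):
        print(number, len(page))
```

With these defaults, each new page begins 1,500 characters after the previous one, so a longer overlap produces more pages for the same input.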
316
+
310
317
1. On the **Vectorize your text** page, choose the source of the embedding model:
311
318
312
319
+ Azure OpenAI
@@ -361,6 +368,12 @@ On the **Advanced settings** page, you can optionally add [semantic ranking](sem
361
368
362
369
## Map new fields
363
370
371
+
Key points about this step:
372
+
373
+
+ Index schema provides vector and nonvector fields for chunked data.
374
+
+ You can add fields, but you can't delete or modify generated fields.
articles/search/search-how-to-index-onelake-files.md
Lines changed: 17 additions & 10 deletions
@@ -9,22 +9,19 @@ ms.service: azure-ai-search
9
9
ms.custom:
10
10
- build-2024
11
11
ms.topic: how-to
12
-
ms.date: 06/12/2024
12
+
ms.date: 11/19/2024
13
13
---
14
14
15
15
# Index data from OneLake files and shortcuts
16
16
17
17
In this article, learn how to configure a OneLake files indexer for extracting searchable data and metadata from a [lakehouse](/fabric/onelake/create-lakehouse-onelake) on top of [OneLake](/fabric/onelake/onelake-overview).
18
18
19
-
Use this indexer for the following tasks:
20
-
21
-
+ **Data indexing and incremental indexing:** The indexer can index files and associated metadata from data paths within a lakehouse. It detects new and updated files and metadata through built-in change detection. You can configure data refresh on a schedule or on demand.
22
-
+ **Deletion detection:** The indexer can [detect deletions via custom metadata](#detect-deletions-via-custom-metadata) for most files and shortcuts. This requires adding metadata to files to signify that they have been "soft deleted", enabling their removal from the search index. Currently, it's not possible to detect deletions in Google Cloud Storage or Amazon S3 shortcut files because custom metadata isn't supported for those data sources.
23
-
+ **Applied AI through skillsets:** [Skillsets](cognitive-search-concept-intro.md) are fully supported by the OneLake files indexer. This includes key features like [integrated vectorization](vector-search-integrated-vectorization.md) that adds data chunking and embedding steps.
24
-
+ **Parsing modes:** The indexer supports [JSON parsing modes](search-howto-index-json-blobs.md) if you want to parse JSON arrays or lines into individual search documents.
25
-
+ **Compatibility with other features:** The OneLake indexer is designed to work seamlessly with other indexer features, such as [debug sessions](cognitive-search-debug-session.md), [indexer cache for incremental enrichments](search-howto-incremental-index.md), and [knowledge store](knowledge-store-concept-intro.md).
19
+
To configure and run the indexer, you can use:
26
20
27
-
Use the [2024-05-01-preview REST API](/rest/api/searchservice/data-sources/create-or-update?view=rest-searchservice-2024-05-01-preview&tabs=HTTP&preserve-view=true), a beta Azure SDK package, or [Import and vectorize data](search-get-started-portal-import-vectors.md) in the Azure portal to index from OneLake.
21
+
+ [2024-05-01-preview REST API](/rest/api/searchservice/data-sources/create-or-update?view=rest-searchservice-2024-05-01-preview&tabs=HTTP&preserve-view=true) or a newer preview REST API.
22
+
+ An Azure SDK beta package that provides the feature.
23
+
+ [Import data](search-get-started-portal.md) wizard in the Azure portal.
24
+
+ [Import and vectorize data](search-get-started-portal-import-vectors.md) wizard in the Azure portal.
28
25
29
26
This article uses the REST APIs to illustrate each step.
30
27
@@ -47,7 +44,17 @@ This article uses the REST APIs to illustrate each step.
47
44
+ A Contributor role assignment in the Microsoft Fabric workspace where the lakehouse is located. Steps are outlined in the [Grant permissions](#assign-service-permissions) section of this article.
48
45
49
46
+ [A REST client](search-get-started-rest.md) to formulate REST calls similar to the ones shown in this article.
47
+
48
+
## Supported tasks
49
+
50
+
You can use this indexer for the following tasks:
50
51
52
+
+ **Data indexing and incremental indexing:** The indexer can index files and associated metadata from data paths within a lakehouse. It detects new and updated files and metadata through built-in change detection. You can configure data refresh on a schedule or on demand.
53
+
+ **Deletion detection:** The indexer can [detect deletions via custom metadata](#detect-deletions-via-custom-metadata) for most files and shortcuts. This requires adding metadata to files to signify that they have been "soft deleted", enabling their removal from the search index. Currently, it's not possible to detect deletions in Google Cloud Storage or Amazon S3 shortcut files because custom metadata isn't supported for those data sources.
54
+
+ **Applied AI through skillsets:** [Skillsets](cognitive-search-concept-intro.md) are fully supported by the OneLake files indexer. This includes key features like [integrated vectorization](vector-search-integrated-vectorization.md) that adds data chunking and embedding steps.
55
+
+ **Parsing modes:** The indexer supports [JSON parsing modes](search-howto-index-json-blobs.md) if you want to parse JSON arrays or lines into individual search documents. It also supports [Markdown parsing mode](search-how-to-index-markdown-blobs.md).
56
+
+ **Compatibility with other features:** The OneLake indexer is designed to work seamlessly with other indexer features, such as [debug sessions](cognitive-search-debug-session.md), [indexer cache for incremental enrichments](search-howto-incremental-index.md), and [knowledge store](knowledge-store-concept-intro.md).
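
As an illustration of the parsing modes mentioned in the list above, the following sketch builds an indexer request body that selects one. It rests on assumptions: the `parsingMode` values shown (`jsonArray`, `jsonLines`, `markdown`) reflect the Search Service REST API as commonly documented and should be checked against the API version you target, and the helper name and all resource names are hypothetical.

```python
import json

# Parsing modes named in the task list: JSON arrays, JSON lines, and Markdown.
# The exact string values are an assumption to verify against the REST API reference.
SUPPORTED_PARSING_MODES = {"jsonArray", "jsonLines", "markdown"}


def build_onelake_indexer(name, data_source_name, index_name, parsing_mode):
    """Build an indexer body whose configuration selects a parsing mode."""
    if parsing_mode not in SUPPORTED_PARSING_MODES:
        raise ValueError(f"Unsupported parsing mode: {parsing_mode}")
    return {
        "name": name,
        "dataSourceName": data_source_name,
        "targetIndexName": index_name,
        "parameters": {"configuration": {"parsingMode": parsing_mode}},
    }


if __name__ == "__main__":
    # Placeholder names for illustration only.
    body = build_onelake_indexer(
        "my-onelake-indexer", "my-onelake-datasource", "my-index", "jsonLines"
    )
    print(json.dumps(body, indent=2))
```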
57
+
51
58
<a name="SupportedFormats"></a>
52
59
53
60
## Supported document formats
@@ -69,7 +76,7 @@ The following OneLake shortcuts are supported by the OneLake files indexer: