articles/search/samples-dotnet.md
Lines changed: 0 additions & 1 deletion
Original file line number
Diff line number
Diff line change
@@ -58,7 +58,6 @@ Code samples from the Azure AI Search team demonstrate features and workflows. A
58
58
| [multiple-data-sources](https://github.com/Azure-Samples/azure-search-dotnet-scale/tree/main/multiple-data-sources) | [Tutorial: Index from multiple data sources](tutorial-multiple-data-sources.md). | Merges content from two data sources into one search index.
59
59
|[Optimize-data-indexing](https://github.com/Azure-Samples/azure-search-dotnet-scale/tree/main/optimize-data-indexing)|[Tutorial: Optimize indexing with the push API](tutorial-optimize-indexing-push-api.md).| Demonstrates optimization techniques for pushing data into a search index. |
60
60
|[DotNetHowTo](https://github.com/Azure-Samples/search-dotnet-getting-started/tree/master/DotNetHowTo)|[How to use the .NET client library](search-howto-dotnet-sdk.md)| Steps through the basic workflow, but in more detail and with discussion of API usage. |
61
-
|[DotNetHowToSynonyms](https://github.com/Azure-Samples/search-dotnet-getting-started/tree/master/DotNetHowToSynonyms)|[Example: Add synonyms in C#](search-synonyms-tutorial-sdk.md)| Synonym lists are used for query expansion, providing matchable terms that are external to an index. |
62
61
|[DotNetHowToIndexers](https://github.com/Azure-Samples/search-dotnet-getting-started/tree/master/DotNetHowToIndexers)|[Tutorial: Index Azure SQL data](search-indexer-tutorial.md)| Shows how to configure an Azure SQL indexer that has a schedule, field mappings, and parameters. |
63
62
|[DotNetHowToEncryptionUsingCMK](https://github.com/Azure-Samples/search-dotnet-getting-started/tree/master/DotNetHowToEncryptionUsingCMK)|[How to configure customer-managed keys for data encryption](search-security-manage-encryption-keys.md)| Shows how to create objects that are encrypted with a customer-managed key. |
64
63
|[DotNetVectorDemo](https://github.com/Azure/azure-search-vector-samples/tree/main/demo-dotnet/DotNetVectorDemo)|[readme](https://github.com/Azure/azure-search-vector-samples/tree/main/demo-dotnet/DotNetVectorDemo/readme.md)| Create, load, and query a vector index. |
articles/search/search-synapseml-cognitive-services.md
Lines changed: 29 additions & 27 deletions
Original file line number
Diff line number
Diff line change
@@ -1,7 +1,7 @@
1
1
---
2
2
title: 'Tutorial: Index at scale (Spark)'
3
3
titleSuffix: Azure AI Search
4
-
description: Search big data from Apache Spark that's been transformed by SynapseML. You'll load invoices into data frames, apply machine learning, and then send output to a generated search index.
4
+
description: Search big data from Apache Spark that's been transformed by SynapseML. Load invoices into data frames, apply machine learning, and then send output to a generated search index.
5
5
6
6
manager: nitinme
7
7
author: HeidiSteen
@@ -10,12 +10,12 @@ ms.service: cognitive-search
10
10
ms.custom:
11
11
- ignite-2023
12
12
ms.topic: tutorial
13
-
ms.date: 02/01/2023
13
+
ms.date: 04/22/2024
14
14
---
15
15
16
16
# Tutorial: Index large data from Apache Spark using SynapseML and Azure AI Search
17
17
18
-
In this Azure AI Search tutorial, learn how to index and query large data loaded from a Spark cluster. You'll set up a Jupyter Notebook that performs the following actions:
18
+
In this Azure AI Search tutorial, learn how to index and query large data loaded from a Spark cluster. Set up a Jupyter Notebook that performs the following actions:
19
19
20
20
> [!div class="checklist"]
21
21
> + Load various forms (invoices) into a data frame in an Apache Spark session
@@ -24,7 +24,7 @@ In this Azure AI Search tutorial, learn how to index and query large data loaded
24
24
> + Write the output to a search index hosted in Azure AI Search
25
25
> + Explore and query over the content you created
26
26
27
-
This tutorial takes a dependency on [SynapseML](https://www.microsoft.com/research/blog/synapseml-a-simple-multilingual-and-massively-parallel-machine-learning-library/), an open source library that supports massively parallel machine learning over big data. In SynapseML, search indexing and machine learning are exposed through *transformers* that perform specialized tasks. Transformers tap into a wide range of AI capabilities. In this exercise, you'll use the **AzureSearchWriter** APIs for analysis and AI enrichment.
27
+
This tutorial takes a dependency on [SynapseML](https://www.microsoft.com/research/blog/synapseml-a-simple-multilingual-and-massively-parallel-machine-learning-library/), an open source library that supports massively parallel machine learning over big data. In SynapseML, search indexing and machine learning are exposed through *transformers* that perform specialized tasks. Transformers tap into a wide range of AI capabilities. In this exercise, use the **AzureSearchWriter** APIs for analysis and AI enrichment.
28
28
29
29
Although Azure AI Search has native [AI enrichment](cognitive-search-concept-intro.md), this tutorial shows you how to access AI capabilities outside of Azure AI Search. By using SynapseML instead of indexers or skills, you're not subject to data limits or other constraints associated with those objects.
30
30
@@ -33,7 +33,7 @@ Although Azure AI Search has native [AI enrichment](cognitive-search-concept-int
33
33
34
34
## Prerequisites
35
35
36
-
You'll need the `synapseml` library and several Azure resources. If possible, use the same subscription and region for your Azure resources and put everything into one resource group for simple cleanup later. The following links are for portal installs. The sample data is imported from a public site.
36
+
You need the `synapseml` library and several Azure resources. If possible, use the same subscription and region for your Azure resources and put everything into one resource group for simple cleanup later. The following links are for portal installs. The sample data is imported from a public site.
+ [Azure AI Search](search-create-service-portal.md) (any tier) <sup>2</sup>
@@ -42,38 +42,38 @@ You'll need the `synapseml` library and several Azure resources. If possible, us
42
42
43
43
<sup>1</sup> This link resolves to a tutorial for loading the package.
44
44
45
-
<sup>2</sup> You can use the free search tier to index the sample data, but [choose a higher tier](search-sku-tier.md) if your data volumes are large. For non-free tiers, you'll need to provide the [search API key](search-security-api-keys.md#find-existing-keys) in the [Set up dependencies](#2---set-up-dependencies) step further on.
45
+
<sup>2</sup> You can use the free search tier to index the sample data, but [choose a higher tier](search-sku-tier.md) if your data volumes are large. For billable tiers, provide the [search API key](search-security-api-keys.md#find-existing-keys) in the [Set up dependencies](#step-2-set-up-dependencies) step further on.
46
46
47
-
<sup>3</sup> This tutorial uses Azure AI Document Intelligence and Azure AI Translator. In the instructions that follow, you'll provide a [multi-service key](../ai-services/multi-service-resource.md?pivots=azportal#get-the-keys-for-your-resource) and the region, and it will work for both services.
47
+
<sup>3</sup> This tutorial uses Azure AI Document Intelligence and Azure AI Translator. In the instructions that follow, provide a [multi-service key](../ai-services/multi-service-resource.md?pivots=azportal#get-the-keys-for-your-resource) and the region. The same key works for both services.
48
48
49
-
<sup>4</sup> In this tutorial, Azure Databricks provides the Spark computing platform and the instructions in the link will tell you how to set up the workspace. For this tutorial, we used the portal steps in "Create a workspace".
49
+
<sup>4</sup> In this tutorial, Azure Databricks provides the Spark computing platform. We used the [portal instructions](/azure/databricks/scenarios/quickstart-create-databricks-workspace-portal?tabs=azure-portal) to set up the workspace.
50
50
51
51
> [!NOTE]
52
52
> All of the above Azure resources support security features in the Microsoft Identity platform. For simplicity, this tutorial assumes key-based authentication, using endpoints and keys copied from the portal pages of each service. If you implement this workflow in a production environment, or share the solution with others, remember to replace hard-coded keys with integrated security or encrypted keys.
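
One low-effort way to avoid hard-coded keys is to read them from environment variables (or a secret store) when the notebook starts. The following is a minimal sketch, not part of the tutorial code: the function name and all of the variable names are hypothetical, so substitute whatever naming your environment uses.

```python
import os


def load_service_settings(env=os.environ):
    """Collect endpoints and keys from environment variables.

    All variable names below are hypothetical placeholders.
    """
    required = [
        "AI_SERVICES_KEY",      # multi-service key (assumed name)
        "AI_SERVICES_REGION",   # region of the multi-service resource
        "SEARCH_SERVICE_NAME",  # name of the search service
        "SEARCH_API_KEY",       # API key for the search service
    ]
    # Fail early, and name every missing setting, instead of failing mid-run.
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {name: env[name] for name in required}
```

Passing the mapping in as a parameter keeps the function easy to exercise with a plain dictionary before any real secrets are involved.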
53
53
54
-
## 1 - Create a Spark cluster and notebook
54
+
## Step 1: Create a Spark cluster and notebook
55
55
56
-
In this section, you'll create a cluster, install the `synapseml` library, and create a notebook to run the code.
56
+
In this section, create a cluster, install the `synapseml` library, and create a notebook to run the code.
57
57
58
58
1. In the Azure portal, find your Azure Databricks workspace and select **Launch workspace**.
59
59
60
60
1. On the left menu, select **Compute**.
61
61
62
-
1. Select **Create cluster**.
62
+
1. Select **Create compute**.
63
63
64
-
1. Give the cluster a name, accept the default configuration, and then create the cluster. It takes several minutes to create the cluster.
64
+
1. Accept the default configuration. It takes several minutes to create the cluster.
65
65
66
66
1. Install the `synapseml` library after the cluster is created:
67
67
68
-
1. Select **Library** from the tabs at the top of the cluster's page.
68
+
1. Select **Libraries** from the tabs at the top of the cluster's page.
69
69
70
70
1. Select **Install new**.
71
71
72
72
:::image type="content" source="media/search-synapseml-cognitive-services/install-library.png" alt-text="Screenshot of the Install New command." border="true":::
73
73
74
74
1. Select **Maven**.
75
75
76
-
1. In Coordinates, enter `com.microsoft.azure:synapseml_2.12:0.10.0`
76
+
1. In Coordinates, enter `com.microsoft.azure:synapseml_2.12:1.0.4`
77
77
78
78
1. Select **Install**.
79
79
@@ -85,13 +85,15 @@ In this section, you'll create a cluster, install the `synapseml` library, and c
85
85
86
86
1. Give the notebook a name, select **Python** as the default language, and select the cluster that has the `synapseml` library.
87
87
88
-
1. Create seven consecutive cells. You'll paste code into each one.
88
+
1. Create seven consecutive cells. Paste code into each one.
89
89
90
90
:::image type="content" source="media/search-synapseml-cognitive-services/create-seven-cells.png" alt-text="Screenshot of the notebook with placeholder cells." border="true":::
91
91
92
-
## 2 - Set up dependencies
92
+
## Step 2: Set up dependencies
93
93
94
-
Paste the following code into the first cell of your notebook. Replace the placeholders with endpoints and access keys for each resource. No other modifications are required, so run the code when you're ready.
94
+
Paste the following code into the first cell of your notebook.
95
+
96
+
Replace the placeholders with endpoints and access keys for each resource. Provide a name for a new search index. No other modifications are required, so run the code when you're ready.
95
97
96
98
This code imports multiple packages and sets up access to the Azure resources used in this workflow.
Paste the following code into the second cell. No modifications are required, so run the code when you're ready.
115
117
116
-
This code loads a few external files from an Azure storage account that's used for demo purposes. The files are various invoices, and they're read into a data frame.
118
+
This code loads a few external files from an Azure storage account. The files are various invoices, and they're read into a data frame.
Paste the following code into the third cell. No modifications are required, so run the code when you're ready.
141
143
142
-
This code loads the [AnalyzeInvoices transformer](https://mmlspark.blob.core.windows.net/docs/0.11.2/pyspark/synapse.ml.cognitive.form.html#module-synapse.ml.cognitive.form.AnalyzeInvoices) and passes a reference to the data frame containing the invoices. It calls the pre-built [invoice model](../ai-services/document-intelligence/concept-invoice.md) of Azure AI Document Intelligence to extract information from the invoices.
144
+
This code loads the [AnalyzeInvoices transformer](https://mmlspark.blob.core.windows.net/docs/0.11.2/pyspark/synapse.ml.cognitive.form.html#module-synapse.ml.cognitive.form.AnalyzeInvoices) and passes a reference to the data frame containing the invoices. It calls the prebuilt [invoice model](../ai-services/document-intelligence/concept-invoice.md) of Azure AI Document Intelligence to extract information from the invoices.
143
145
144
146
```python
145
147
from synapse.ml.cognitive import AnalyzeInvoices
@@ -161,7 +163,7 @@ The output from this step should look similar to the next screenshot. Notice how
161
163
162
164
:::image type="content" source="media/search-synapseml-cognitive-services/analyze-forms-output.png" alt-text="Screenshot of the AnalyzeInvoices output." border="true":::
Notice how this transformation recasts the nested fields into a table, which enables the next two transformations. This screenshot is trimmed for brevity. If you're following along in your own notebook, you'll have 19 columns and 26 rows.
188
+
Notice how this transformation recasts the nested fields into a table, which enables the next two transformations. This screenshot is trimmed for brevity. If you're following along in your own notebook, you have 19 columns and 26 rows.
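
To make the idea of recasting nested fields into a table concrete, here is a plain-Python sketch of flattening one nested record into a single row of columns. It only illustrates the concept and is not how the SynapseML transformer is implemented; the function name and the invoice shape are invented for this example.

```python
def flatten_record(record, parent_key="", sep="_"):
    """Flatten nested dicts into one flat dict of column name -> value."""
    flat = {}
    for key, value in record.items():
        # Build a column name from the path to this value.
        column = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested structures and merge their columns.
            flat.update(flatten_record(value, column, sep))
        else:
            flat[column] = value
    return flat


# Hypothetical nested invoice, invented for illustration only.
invoice = {
    "CustomerName": {"valueString": "Contoso"},
    "InvoiceTotal": {"valueNumber": 110.0},
}
row = flatten_record(invoice)
```

Each leaf value ends up in its own column, which is the property that lets later, column-oriented transformations operate on the data.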
187
189
188
190
:::image type="content" source="media/search-synapseml-cognitive-services/form-ontology-learner-output.png" alt-text="Screenshot of the FormOntologyLearner output." border="true":::
189
191
190
-
## 6 - Add translations
192
+
## Step 6: Add translations
191
193
192
194
Paste the following code into the fifth cell. No modifications are required, so run the code when you're ready.
193
195
@@ -217,11 +219,11 @@ display(translated_df)
217
219
>
218
220
> :::image type="content" source="media/search-synapseml-cognitive-services/translated-strings.png" alt-text="Screenshot of table output, showing the Translations column." border="true":::
219
221
220
-
## 7 - Add a search index with AzureSearchWriter
222
+
## Step 7: Add a search index with AzureSearchWriter
221
223
222
224
Paste the following code in the sixth cell and then run it. No modifications are required.
223
225
224
-
This code loads [AzureSearchWriter](https://microsoft.github.io/SynapseML/docs/Explore%20Algorithms/AI%20Services/Overview/#azure-cognitive-search-sample). It consumes a tabular dataset and infers a search index schema that defines one field for each column. The translations structure is an array, so it's articulated in the index as a complex collection with subfields for each language translation. The generated index will have a document key and use the default values for fields created using the [Create Index REST API](/rest/api/searchservice/create-index).
226
+
This code loads [AzureSearchWriter](https://microsoft.github.io/SynapseML/docs/Explore%20Algorithms/AI%20Services/Overview/#azure-cognitive-search-sample). It consumes a tabular dataset and infers a search index schema that defines one field for each column. Because the translations structure is an array, it's articulated in the index as a complex collection with subfields for each language translation. The generated index has a document key and uses the default values for fields created using the [Create Index REST API](/rest/api/searchservice/create-index).
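
As a rough illustration of what "one field for each column" means, the sketch below maps a sample row to a simplified index definition, turning an array of objects into a complex collection with subfields. This is an assumption-laden stand-in, not AzureSearchWriter's actual inference logic: the helper names, the sample column names, and the type mapping are all invented for this example.

```python
def infer_field(name, value):
    """Map one sample value to a simplified search field definition."""
    if isinstance(value, bool):  # check bool before int: bool is an int subclass
        return {"name": name, "type": "Edm.Boolean"}
    if isinstance(value, int):
        return {"name": name, "type": "Edm.Int64"}
    if isinstance(value, float):
        return {"name": name, "type": "Edm.Double"}
    if isinstance(value, list) and value and isinstance(value[0], dict):
        # An array of objects becomes a complex collection with subfields.
        subfields = [infer_field(k, v) for k, v in value[0].items()]
        return {"name": name, "type": "Collection(Edm.ComplexType)", "fields": subfields}
    return {"name": name, "type": "Edm.String"}


def infer_index(index_name, sample_row, key_column):
    """Build a simplified index definition with one field per column."""
    fields = [infer_field(name, value) for name, value in sample_row.items()]
    for field in fields:
        if field["name"] == key_column:
            field["key"] = True  # mark the document key
    return {"name": index_name, "fields": fields}


# Hypothetical row, invented for illustration only.
sample = {"DocID": "1", "Price": 9.5, "Translations": [{"en": "bolt", "fr": "boulon"}]}
index = infer_index("demo-index", sample, key_column="DocID")
```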
225
227
226
228
```python
227
229
from synapse.ml.cognitive import *
@@ -242,7 +244,7 @@ You can check the search service pages in Azure portal to explore the index defi
242
244
> [!NOTE]
243
245
> If you can't use the default search index, you can provide an external custom definition in JSON, passing its URI as a string in the "indexJson" property. Generate the default index first so that you know which fields to specify, and then follow with customized properties if you need specific analyzers, for example.
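
For orientation, a custom definition of the kind the note describes might look like the following sketch, which adds a language analyzer to one field. The index name and field names are hypothetical, and how the definition is handed to AzureSearchWriter is not shown here; this only illustrates the JSON shape used by the Create Index REST API.

```python
import json

# Hypothetical custom index definition; names are placeholders.
custom_index = {
    "name": "my-invoice-index",
    "fields": [
        # Exactly one string field must be the document key.
        {"name": "DocID", "type": "Edm.String", "key": True},
        # A searchable field that overrides the default analyzer.
        {
            "name": "Description",
            "type": "Edm.String",
            "searchable": True,
            "analyzer": "en.microsoft",
        },
    ],
}

# Serialize to a JSON string suitable for saving to a file.
index_json = json.dumps(custom_index, indent=2)
```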
244
246
245
-
## 8 - Query the index
247
+
## Step 8: Query the index
246
248
247
249
Paste the following code into the seventh cell and then run it. No modifications are required, except that you might want to vary the syntax or try more examples to further explore your content:
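
The cell contents fall outside this diff. As a rough sketch of what a query against the generated index involves, the following builds (but does not send) a Search Documents POST request using only the standard library. The function name, service name, index name, search text, and API version are assumptions chosen for illustration.

```python
import json
import urllib.request


def build_search_request(service_name, index_name, api_key, search_text,
                         api_version="2020-06-30"):
    """Build a POST request for the Search Documents REST API."""
    url = (
        f"https://{service_name}.search.windows.net/indexes/{index_name}"
        f"/docs/search?api-version={api_version}"
    )
    # The request body carries the query; "search" is the full-text query string.
    body = json.dumps({"search": search_text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json", "api-key": api_key},
    )


# Placeholder values; replace with your own service, index, and key.
request = build_search_request("my-search-service", "my-index", "<api-key>", "door")
# To send it: urllib.request.urlopen(request), which needs network access and a valid key.
```

Varying the `search` value, or adding other body properties supported by the REST API, is the simplest way to explore the indexed content further.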