articles/search/search-synapseml-cognitive-services.md
Lines changed: 41 additions & 18 deletions
@@ -1,14 +1,14 @@
 ---
 title: Use Search with SynapseML
 titleSuffix: Azure Cognitive Search
-description: Add full text search to big data on Apache Spark that's been loaded and transformed through the opensource SynapseML library. In this walkthrough, you'll load invoice files into data frames, apply machine learning through SynapseML, then send it into a generated search index.
+description: Add full text search to big data on Apache Spark that's been loaded and transformed through the open-source library, SynapseML. In this walkthrough, you'll load invoice files into data frames, apply machine learning through SynapseML, then send it into a generated search index.

 manager: nitinme
 author: HeidiSteen
 ms.author: heidist
 ms.service: cognitive-search
 ms.topic: how-to
-ms.date: 08/09/2022
+ms.date: 08/23/2022
 ---

 # Add search to AI-enriched data from Apache Spark using SynapseML
@@ -17,7 +17,7 @@ In this Azure Cognitive Search article, learn how to add data exploration and fu

 [SynapseML](https://www.microsoft.com/research/blog/synapseml-a-simple-multilingual-and-massively-parallel-machine-learning-library/) is an open source library that supports massively parallel machine learning over big data. In SynapseML, one of the ways in which machine learning is exposed is through *transformers* that perform specialized tasks. Transformers tap into a wide range of AI capabilities. In this article, we'll focus on just those that call Cognitive Services and Cognitive Search.

-In this walkthrough, you'll set up a workbook that does the following:
+In this walkthrough, you'll set up a workbook that includes the following actions:

 > [!div class="checklist"]
 > + Load various forms (invoices) into a data frame in an Apache Spark session
@@ -35,18 +35,21 @@ Although Azure Cognitive Search has native [AI enrichment](cognitive-search-conc

 You'll need the `synapseml` library and several Azure resources. If possible, use the same subscription and region for your Azure resources and put everything into one resource group for simple cleanup later. The following links are for portal installs. The sample data is imported from a public site.
-<sup>1</sup> You can use the free tier for this walkthrough but [choose a higher tier](search-sku-tier.md) if data volumes are large. You'll need the [API key](search-security-api-keys.md#find-existing-keys) for this resource.
+<sup>1</sup> This article includes instructions for loading the package.

-<sup>2</sup> This walkthrough uses Azure Forms Recognizer and Azure Translator. In the instructions below, you'll provide a [Cognitive Services multi-service key](../cognitive-services/cognitive-services-apis-create-account.md?tabs=multiservice%2cwindows#get-the-keys-for-your-resource) and the region, and it'll work for both services.
+<sup>2</sup> You can use the free tier for this walkthrough but [choose a higher tier](search-sku-tier.md) if data volumes are large. You'll need the [API key](search-security-api-keys.md#find-existing-keys) for this resource.

-<sup>3</sup> In this walkthrough, Azure Databricks provides the computing platform. You could also use Azure Synapse Analytics or any other computing platform supported by `synapseml`. The Azure Databricks article listed in the prerequisites includes multiple steps. For this walkthrough, follow only the instructions in "Create a workspace".
+<sup>3</sup> This walkthrough uses Azure Forms Recognizer and Azure Translator. In the instructions below, you'll provide a [Cognitive Services multi-service key](../cognitive-services/cognitive-services-apis-create-account.md?tabs=multiservice%2cwindows#get-the-keys-for-your-resource) and the region, and it will work for both services.
+
+<sup>4</sup> In this walkthrough, Azure Databricks provides the computing platform. You could also use Azure Synapse Analytics or any other computing platform supported by `synapseml`. The Azure Databricks article listed in the prerequisites includes multiple steps. For this walkthrough, follow only the instructions in "Create a workspace".

 > [!NOTE]
-> All of the above resources support security features in the Microsoft Identity platform. For simplicity, this walkthrough assumes key-based authentication, using endpoints and keys copied from the portal pages of each service. If you implement this workflow in a production environment, or share the solution with others, remember to replace hard-coded keys with integrated security or encrypted keys.
+> All of the above Azure resources support security features in the Microsoft Identity platform. For simplicity, this walkthrough assumes key-based authentication, using endpoints and keys copied from the portal pages of each service. If you implement this workflow in a production environment, or share the solution with others, remember to replace hard-coded keys with integrated security or encrypted keys.

 ## Create a Spark cluster and notebook
@@ -74,6 +77,8 @@ In this section, you'll create a cluster, install the `synapseml` library, and c

 1. Select **Install**.

+:::image type="content" source="media/search-synapseml-cognitive-services/install-library-from-maven.png" alt-text="Screenshot of Maven package specification." border="true":::
+
 1. On the left menu, select **Create** > **Notebook**.

 :::image type="content" source="media/search-synapseml-cognitive-services/create-notebook.png" alt-text="Screenshot of the Create Notebook command." border="true":::
Paste the following code into the second cell. No modifications are required, so run the code when you're ready.

-This code loads a small number of external files from an Azure storage account that's used for demo purposes. The files are various invoices, and they're read into a data frame.
+This code loads a few external files from an Azure storage account that's used for demo purposes. The files are various invoices, and they're read into a data frame.
The output from this step should look similar to the next screenshot. Notice how the forms analysis is packed into a densely structured column, which is difficult to work with. The next transformation resolves this issue by parsing the column into rows and columns.
+
+:::image type="content" source="media/search-synapseml-cognitive-services/analyze-forms-output.png" alt-text="Screenshot of the AnalyzeInvoices output." border="true":::
+
+## Restructure form recognition output

 Paste the following code into the fourth cell and run it. No modifications are required.

-This code loads [FormOntologyLearner](https://mmlspark.blob.core.windows.net/docs/0.10.0/pyspark/synapse.ml.cognitive.html#module-synapse.ml.cognitive.FormOntologyTransformer), a transformer that analyzes the output of Form Recognizer transformers and infers a tabular data structure. The output of AnalyzeInvoices is dynamic and varies based on the features detected in your content. Furthermore, the AnalyzeInvoices transformer consolidates output into a single column. Because the output is dynamic and consolidated, it's difficult to use in downstream transformations that require more structure.
+This code loads [FormOntologyLearner](https://mmlspark.blob.core.windows.net/docs/0.10.0/pyspark/synapse.ml.cognitive.html#module-synapse.ml.cognitive.FormOntologyTransformer), a transformer that analyzes the output of Form Recognizer transformers and infers a tabular data structure. The output of AnalyzeInvoices is dynamic and varies based on the features detected in your content. Furthermore, the transformer consolidates output into a single column. Because the output is dynamic and consolidated, it's difficult to use in downstream transformations that require more structure.

 FormOntologyLearner extends the utility of the AnalyzeInvoices transformer by looking for patterns that can be used to create a tabular data structure. Organizing the output into multiple columns and rows makes the content consumable in other transformers, like AzureSearchWriter.
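
The general idea of turning nested, variable output into a fixed set of columns can be illustrated in plain Python. The following is only a rough sketch of the concept, not how FormOntologyLearner is implemented: the `flatten` and `to_table` helpers and the sample records are hypothetical, and no SynapseML or Spark APIs are used.

```python
from typing import Any, Dict, List


def flatten(record: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts into one level, joining key names with dots."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def to_table(records: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Give every row the same columns; fields a record lacks become None."""
    flat_rows = [flatten(r) for r in records]
    columns = sorted({col for row in flat_rows for col in row})
    return [{col: row.get(col) for col in columns} for row in flat_rows]


# Hypothetical invoice-like records with different shapes.
records = [
    {"CustomerName": "Contoso", "Total": {"amount": 110.0, "currency": "USD"}},
    {"CustomerName": "Fabrikam", "DueDate": "2022-09-01"},
]
table = to_table(records)
```

Each row in `table` has the same set of keys, which is the property that makes tabular output easier to pass to downstream steps than a single densely nested column.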
Notice how this transformation recasts the nested fields into a table, which enables the next two transformations. This screenshot is trimmed for brevity. If you're following along in your own notebook, you'll have 19 columns and 26 rows.
+
+:::image type="content" source="media/search-synapseml-cognitive-services/form-ontology-learner-output.png" alt-text="Screenshot of the FormOntologyLearner output." border="true":::
+
+## Add translations

 Paste the following code into the fifth cell. No modifications are required, so run the code when you're ready.
@@ -204,7 +217,7 @@ display(translated_df)
 >
 > :::image type="content" source="media/search-synapseml-cognitive-services/translated-strings.png" alt-text="Screenshot of table output, showing the Translations column." border="true":::

-## Apply search indexing
+## Add a search index with AzureSearchWriter

 Paste the following code in the sixth cell and then run it. No modifications are required.
@@ -224,11 +237,21 @@ from synapse.ml.cognitive import *
 ))
 ```

+You can check the search service pages in the Azure portal to explore the index definition created by AzureSearchWriter.
+
+<!-- > [!NOTE]
+> If you can't use the default search index, you can provide an external custom definition in JSON, passing its URI as a string in the "indexJson" property. Generate the default index first so that you know which fields to specify, and then follow with customized properties if you need specific analyzers, for example. -->
+
 ## Query the index

-Paste the following code into the seventh cell and then run it. No modifications are required, except that you might want to vary the [query syntax](query-simple-syntax.md) or [review these query examples](search-query-simple-examples.md) to further explore your content.
+Paste the following code into the seventh cell and then run it. No modifications are required, except that you might want to vary the syntax or try more examples to further explore your content:
+There's no transformer or module that issues queries. This cell is a simple call to the [Search Documents REST API](/rest/api/searchservice/search-documents).

-This code calls the [Search Documents REST API](/rest/api/searchservice/search-documents) that queries an index. This particular example is searching for the word "door". This query returns a count of the number of matching documents. It also returns just the contents of the "Description' and "Translations" fields. If you want to see the full list of fields, remove the "select" parameter.
+This particular example is searching for the word "door" (`"search": "door"`). It also returns a "count" of the number of matching documents, and selects just the contents of the "Description" and "Translations" fields for the results. If you want to see the full list of fields, remove the "select" parameter.
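
For reference, a request of the kind described above might be assembled as in the following sketch. It only builds the URL, headers, and JSON body and doesn't send anything. The `build_search_request` helper, the service name, the index name, and the key placeholder are hypothetical, and the `2020-06-30` API version is an assumption; use whichever version your search service supports.

```python
import json
from typing import Dict, Tuple


def build_search_request(
    service_name: str,
    index_name: str,
    api_key: str,
    api_version: str = "2020-06-30",  # assumed API version
) -> Tuple[str, Dict[str, str], str]:
    """Return the URL, headers, and JSON body for a Search Documents POST."""
    url = (
        f"https://{service_name}.search.windows.net"
        f"/indexes/{index_name}/docs/search?api-version={api_version}"
    )
    headers = {"Content-Type": "application/json", "api-key": api_key}
    body = {
        "search": "door",                       # term to search for
        "count": True,                          # include the number of matches
        "select": "Description, Translations",  # remove to return all fields
    }
    return url, headers, json.dumps(body)


# Hypothetical names; replace with your own service, index, and key.
url, headers, payload = build_search_request(
    "my-search-service", "my-index", "<your-api-key>"
)
```

To issue the query, you could pass these values to an HTTP client, for example `requests.post(url, headers=headers, data=payload)`.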