
Commit b37cb16

Incorporated recent feedback from PMs/Devs

1 parent 7c25fea

File tree: 5 files changed (+41 −18 lines)


articles/search/search-synapseml-cognitive-services.md

41 additions, 18 deletions
@@ -1,14 +1,14 @@
 ---
 title: Use Search with SynapseML
 titleSuffix: Azure Cognitive Search
-description: Add full text search to big data on Apache Spark that's been loaded and transformed through the open source SynapseML library. In this walkthrough, you'll load invoice files into data frames, apply machine learning through SynapseML, then send it into a generated search index.
+description: Add full text search to big data on Apache Spark that's been loaded and transformed through the open-source library, SynapseML. In this walkthrough, you'll load invoice files into data frames, apply machine learning through SynapseML, then send it into a generated search index.

 manager: nitinme
 author: HeidiSteen
 ms.author: heidist
 ms.service: cognitive-search
 ms.topic: how-to
-ms.date: 08/09/2022
+ms.date: 08/23/2022
 ---

 # Add search to AI-enriched data from Apache Spark using SynapseML
@@ -17,7 +17,7 @@ In this Azure Cognitive Search article, learn how to add data exploration and fu

 [SynapseML](https://www.microsoft.com/research/blog/synapseml-a-simple-multilingual-and-massively-parallel-machine-learning-library/) is an open source library that supports massively parallel machine learning over big data. In SynapseML, one of the ways in which machine learning is exposed is through *transformers* that perform specialized tasks. Transformers tap into a wide range of AI capabilities. In this article, we'll focus on just those that call Cognitive Services and Cognitive Search.

-In this walkthrough, you'll set up a workbook that does the following:
+In this walkthrough, you'll set up a workbook that includes the following actions:

 > [!div class="checklist"]
 > + Load various forms (invoices) into a data frame in an Apache Spark session
@@ -35,18 +35,21 @@ Although Azure Cognitive Search has native [AI enrichment](cognitive-search-conc

 You'll need the `synapseml` library and several Azure resources. If possible, use the same subscription and region for your Azure resources and put everything into one resource group for simple cleanup later. The following links are for portal installs. The sample data is imported from a public site.

-+ [Azure Cognitive Search](search-create-service-portal.md) (any tier) <sup>1</sup>
-+ [Azure Cognitive Services](../cognitive-services/cognitive-services-apis-create-account.md?tabs=multiservice%2cwindows#create-a-new-azure-cognitive-services-resource) (any tier) <sup>2</sup>
-+ [Azure Databricks](/azure/databricks/scenarios/quickstart-create-databricks-workspace-portal?tabs=azure-portal) (any tier) <sup>3</sup>
++ [SynapseML package](https://microsoft.github.io/SynapseML/docs/getting_started/installation/#python) <sup>1</sup>
++ [Azure Cognitive Search](search-create-service-portal.md) (any tier) <sup>2</sup>
++ [Azure Cognitive Services](../cognitive-services/cognitive-services-apis-create-account.md?tabs=multiservice%2cwindows#create-a-new-azure-cognitive-services-resource) (any tier) <sup>3</sup>
++ [Azure Databricks](/azure/databricks/scenarios/quickstart-create-databricks-workspace-portal?tabs=azure-portal) (any tier) <sup>4</sup>

-<sup>1</sup> You can use the free tier for this walkthrough but [choose a higher tier](search-sku-tier.md) if data volumes are large. You'll need the [API key](search-security-api-keys.md#find-existing-keys) for this resource.
+<sup>1</sup> This article includes instructions for loading the package.

-<sup>2</sup> This walkthrough uses Azure Forms Recognizer and Azure Translator. In the instructions below, you'll provide a [Cognitive Services multi-service key](../cognitive-services/cognitive-services-apis-create-account.md?tabs=multiservice%2cwindows#get-the-keys-for-your-resource) and the region, and it'll work for both services.
+<sup>2</sup> You can use the free tier for this walkthrough but [choose a higher tier](search-sku-tier.md) if data volumes are large. You'll need the [API key](search-security-api-keys.md#find-existing-keys) for this resource.

-<sup>3</sup> In this walkthrough, Azure Databricks provides the computing platform. You could also use Azure Synapse Analytics or any other computing platform supported by `synapseml`. The Azure Databricks article listed in the prerequisites includes multiple steps. For this walkthrough, follow only the instructions in "Create a workspace".
+<sup>3</sup> This walkthrough uses Azure Forms Recognizer and Azure Translator. In the instructions below, you'll provide a [Cognitive Services multi-service key](../cognitive-services/cognitive-services-apis-create-account.md?tabs=multiservice%2cwindows#get-the-keys-for-your-resource) and the region, and it will work for both services.
+
+<sup>4</sup> In this walkthrough, Azure Databricks provides the computing platform. You could also use Azure Synapse Analytics or any other computing platform supported by `synapseml`. The Azure Databricks article listed in the prerequisites includes multiple steps. For this walkthrough, follow only the instructions in "Create a workspace".

 > [!NOTE]
-> All of the above resources support security features in the Microsoft Identity platform. For simplicity, this walkthrough assumes key-based authentication, using endpoints and keys copied from the portal pages of each service. If you implement this workflow in a production environment, or share the solution with others, remember to replace hard-coded keys with integrated security or encrypted keys.
+> All of the above Azure resources support security features in the Microsoft Identity platform. For simplicity, this walkthrough assumes key-based authentication, using endpoints and keys copied from the portal pages of each service. If you implement this workflow in a production environment, or share the solution with others, remember to replace hard-coded keys with integrated security or encrypted keys.

 ## Create a Spark cluster and notebook

@@ -74,6 +77,8 @@ In this section, you'll create a cluster, install the `synapseml` library, and c

 1. Select **Install**.

+   :::image type="content" source="media/search-synapseml-cognitive-services/install-library-from-maven.png" alt-text="Screenshot of Maven package specification." border="true":::
+
 1. On the left menu, select **Create** > **Notebook**.

    :::image type="content" source="media/search-synapseml-cognitive-services/create-notebook.png" alt-text="Screenshot of the Create Notebook command." border="true":::
@@ -108,7 +113,7 @@ search_index = "placeholder-search-index-name"

 Paste the following code into the second cell. No modifications are required, so run the code when you're ready.

-This code loads a small number of external files from an Azure storage account that's used for demo purposes. The files are various invoices, and they're read into a data frame.
+This code loads a few external files from an Azure storage account that's used for demo purposes. The files are various invoices, and they're read into a data frame.

 ```python
 def blob_to_url(blob):
@@ -130,7 +135,7 @@ df2 = (spark.read.format("binaryFile")
 display(df2)
 ```

-## Apply form recognition
+## Add form recognition

 Paste the following code into the third cell. No modifications are required, so run the code when you're ready.
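The data-loading hunk above truncates the `blob_to_url` helper that the second cell uses to turn blob paths into URLs before `spark.read.format("binaryFile")` fetches them. As a hedged sketch of what such a helper looks like (the storage account and container names here are placeholders, not the demo account the walkthrough actually uses):

```python
# Hypothetical sketch: compose a public blob URL from a blob path.
# "placeholderstorageaccount" and "invoices" are illustrative names only.
def blob_to_url(blob: str) -> str:
    account = "placeholderstorageaccount"
    container = "invoices"
    return f"https://{account}.blob.core.windows.net/{container}/{blob}"

print(blob_to_url("2017/Invoice115991.pdf"))
```

A list of such URLs is what the binary-file reader consumes to build the data frame.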

@@ -152,11 +157,15 @@ analyzed_df = (AnalyzeInvoices()
 display(analyzed_df)
 ```

-## Apply data restructuring
+The output from this step should look similar to the next screenshot. Notice how the forms analysis is packed into a densely structured column, which is difficult to work with. The next transformation resolves this issue by parsing the column into rows and columns.
+
+:::image type="content" source="media/search-synapseml-cognitive-services/analyze-forms-output.png" alt-text="Screenshot of the AnalyzeInvoices output." border="true":::
+
+## Restructure form recognition output

 Paste the following code into the fourth cell and run it. No modifications are required.

-This code loads [FormOntologyLearner](https://mmlspark.blob.core.windows.net/docs/0.10.0/pyspark/synapse.ml.cognitive.html#module-synapse.ml.cognitive.FormOntologyTransformer), a transformer that analyzes the output of Form Recognizer transformers and infers a tabular data structure. The output of AnalyzeInvoices is dynamic and varies based on the features detected in your content. Furthermore, the AnalyzeInvoices transformer consolidates output into a single column. Because the output is dynamic and consolidated, it's difficult to use in downstream transformations that require more structure.
+This code loads [FormOntologyLearner](https://mmlspark.blob.core.windows.net/docs/0.10.0/pyspark/synapse.ml.cognitive.html#module-synapse.ml.cognitive.FormOntologyTransformer), a transformer that analyzes the output of Form Recognizer transformers and infers a tabular data structure. The output of AnalyzeInvoices is dynamic and varies based on the features detected in your content. Furthermore, the transformer consolidates output into a single column. Because the output is dynamic and consolidated, it's difficult to use in downstream transformations that require more structure.

 FormOntologyLearner extends the utility of the AnalyzeInvoices transformer by looking for patterns that can be used to create a tabular data structure. Organizing the output into multiple columns and rows makes the content consumable in other transformers, like AzureSearchWriter.
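The recast that FormOntologyLearner performs can be pictured with a plain-Python analogy (a conceptual illustration only, not the library's implementation, and the field names are invented): a nested invoice record is exploded so that each line item becomes its own row alongside the invoice-level fields.

```python
# Conceptual illustration: explode one nested invoice record into flat
# rows, the way a consolidated analysis column becomes rows and columns.
invoice = {
    "VendorName": "Contoso",
    "InvoiceTotal": 110.0,
    "Items": [
        {"Description": "door", "Amount": 60.0},
        {"Description": "window", "Amount": 50.0},
    ],
}

# Repeat the invoice-level fields on every line-item row.
rows = [
    {"VendorName": invoice["VendorName"],
     "InvoiceTotal": invoice["InvoiceTotal"],
     **item}
    for item in invoice["Items"]
]

for row in rows:
    print(row)
```

Once the data has this row-and-column shape, downstream transformers such as AzureSearchWriter can consume it.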

@@ -174,7 +183,11 @@ itemized_df = (FormOntologyLearner()
 display(itemized_df)
 ```

-## Apply translations
+Notice how this transformation recasts the nested fields into a table, which enables the next two transformations. This screenshot is trimmed for brevity. If you're following along in your own notebook, you'll have 19 columns and 26 rows.
+
+:::image type="content" source="media/search-synapseml-cognitive-services/form-ontology-learner-output.png" alt-text="Screenshot of the FormOntologyLearner output." border="true":::
+
+## Add translations

 Paste the following code into the fifth cell. No modifications are required, so run the code when you're ready.
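The translation cell relies on the SynapseML Translate transformer, which maps to the Translator v3 REST API under the hood. For orientation, here's a hedged sketch of an equivalent direct request (the key, region, target language, and sample strings are placeholders, not values from the walkthrough); the multi-service Cognitive Services key mentioned in the prerequisites is what would fill the key slot:

```python
import json

cognitive_key = "placeholder-cognitive-services-key"   # multi-service key
cognitive_region = "placeholder-region"

# Translator v3 "translate" operation: target language goes in the query
# string, and the body is a list of {"Text": ...} items.
url = "https://api.cognitive.microsofttranslator.com/translate"
params = {"api-version": "3.0", "to": "es"}
headers = {
    "Ocp-Apim-Subscription-Key": cognitive_key,
    "Ocp-Apim-Subscription-Region": cognitive_region,
    "Content-Type": "application/json",
}
body = [{"Text": "door"}, {"Text": "window"}]

# To actually send it (requires a live key):
# import requests
# print(requests.post(url, params=params, headers=headers, json=body).json())
print(json.dumps(body))
```

The transformer handles this plumbing per row, which is why the notebook only needs the key and region.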

@@ -204,7 +217,7 @@ display(translated_df)
 >
 > :::image type="content" source="media/search-synapseml-cognitive-services/translated-strings.png" alt-text="Screenshot of table output, showing the Translations column." border="true":::

-## Apply search indexing
+## Add a search index with AzureSearchWriter

 Paste the following code in the sixth cell and then run it. No modifications are required.
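AzureSearchWriter generates an index and pushes the data-frame rows into it; the underlying REST operation is the Index Documents call, where each document carries an `@search.action`. As a hedged sketch of that payload (service, index, key, API version, and fields are placeholders, not taken from the walkthrough):

```python
import json

search_service = "placeholder-search-service-name"
search_index = "placeholder-search-index-name"
search_api_key = "placeholder-search-api-key"

# Index Documents endpoint for a given index.
url = (f"https://{search_service}.search.windows.net/indexes/{search_index}"
       "/docs/index?api-version=2020-06-30")
headers = {"Content-Type": "application/json", "api-key": search_api_key}

# "upload" inserts a document or replaces it if the key already exists.
payload = {
    "value": [
        {"@search.action": "upload", "id": "1", "Description": "door"},
        {"@search.action": "upload", "id": "2", "Description": "window"},
    ]
}

# To actually send it (requires a live service and index):
# import requests
# print(requests.post(url, headers=headers, data=json.dumps(payload)).json())
print(json.dumps(payload)[:60])
```

The writer batches this per partition, so the notebook never constructs the payload by hand.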

@@ -224,11 +237,21 @@ from synapse.ml.cognitive import *
 ))
 ```

+You can check the search service pages in the Azure portal to explore the index definition created by AzureSearchWriter.
+
+<!-- > [!NOTE]
+> If you can't use the default search index, you can provide an external custom definition in JSON, passing its URI as a string in the "indexJson" property. Generate the default index first so that you know which fields to specify, and then follow with customized properties if you need specific analyzers, for example. -->
+
 ## Query the index

-Paste the following code into the seventh cell and then run it. No modifications are required, except that you might want to vary the [query syntax](query-simple-syntax.md) or [review these query examples](search-query-simple-examples.md) to further explore your content.
+Paste the following code into the seventh cell and then run it. No modifications are required, except that you might want to vary the syntax or try more examples to further explore your content:
+
++ [Query syntax](query-simple-syntax.md)
++ [Query examples](search-query-simple-examples.md)
+
+There's no transformer or module that issues queries. This cell is a simple call to the [Search Documents REST API](/rest/api/searchservice/search-documents).

-This code calls the [Search Documents REST API](/rest/api/searchservice/search-documents) that queries an index. This particular example is searching for the word "door". This query returns a count of the number of matching documents. It also returns just the contents of the "Description" and "Translations" fields. If you want to see the full list of fields, remove the "select" parameter.
+This particular example searches for the word "door" (`"search": "door"`). It also returns a "count" of the number of matching documents, and selects just the contents of the "Description" and "Translations" fields for the results. If you want to see the full list of fields, remove the "select" parameter.
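The seventh cell itself is truncated in this diff view. As a hedged, self-contained sketch of the kind of request the paragraph describes (the service name, index name, key, and API version below are placeholders, not values from the commit):

```python
import json

search_service = "placeholder-search-service-name"
search_index = "placeholder-search-index-name"
search_api_key = "placeholder-search-api-key"

# Search Documents endpoint for the generated index.
url = (f"https://{search_service}.search.windows.net/indexes/{search_index}"
       "/docs/search?api-version=2020-06-30")
headers = {"Content-Type": "application/json", "api-key": search_api_key}
body = {
    "search": "door",                       # full text query term
    "count": True,                          # include total match count
    "select": "Description, Translations",  # trim the fields returned
}

# To actually run the query (requires a live service):
# import requests
# print(requests.post(url, headers=headers, data=json.dumps(body)).json())
print(json.dumps(body))
```

Removing the `"select"` entry returns every field, as the paragraph above notes.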
232255

233256
```python
234257
import requests
