
Commit 1c2573c

Merge pull request #206045 from HeidiSteen/heidist-synapse
[azure search] Synapse-search integration doc
2 parents b4b691f + ea59468 commit 1c2573c

File tree

7 files changed (+273 -14 lines)

articles/search/TOC.yml

Lines changed: 16 additions & 14 deletions
@@ -77,19 +77,7 @@
   - name: Explore the code
     href: tutorial-python-search-query-integration.md
-  - name: Create a C# app
-    items:
-    - name: 1 - Basic search page
-      href: tutorial-csharp-create-first-app.md
-    - name: 2 - Add results paging
-      href: tutorial-csharp-paging.md
-    - name: 3 - Add type-ahead
-      href: tutorial-csharp-type-ahead-and-suggestions.md
-    - name: 4 - Add facets
-      href: tutorial-csharp-facets.md
-    - name: 5 - Add results ordering
-      href: tutorial-csharp-orders.md
-  - name: Index Azure data
+  - name: Index with indexers
     items:
     - name: Index Azure SQL Database
       href: search-indexer-tutorial.md
@@ -101,6 +89,8 @@
       href: search-howto-index-encrypted-blobs.md
     - name: Index any data
       href: tutorial-optimize-indexing-push-api.md
+    - name: Enrich with SynapseML
+      href: search-synapseml-cognitive-services.md
     - name: Enrich with AI (skills)
       items:
       - name: C#
@@ -114,7 +104,19 @@
   - name: Create a custom analyzer
     href: tutorial-create-custom-analyzer.md
   - name: Query from Power Apps
-    href: search-howto-powerapps.md
+    href: search-howto-powerapps.md
+  - name: Create a C# app
+    items:
+    - name: 1 - Basic search page
+      href: tutorial-csharp-create-first-app.md
+    - name: 2 - Add results paging
+      href: tutorial-csharp-paging.md
+    - name: 3 - Add type-ahead
+      href: tutorial-csharp-type-ahead-and-suggestions.md
+    - name: 4 - Add facets
+      href: tutorial-csharp-facets.md
+    - name: 5 - Add results ordering
+      href: tutorial-csharp-orders.md
   - name: Samples
     items:
     - name: C# samples
articles/search/search-synapseml-cognitive-services.md

Lines changed: 257 additions & 0 deletions
@@ -0,0 +1,257 @@
---
title: Use Search with SynapseML
titleSuffix: Azure Cognitive Search
description: Add full text search to big data on Apache Spark that's been loaded and transformed through the open-source SynapseML library. In this walkthrough, you'll load invoice files into data frames, apply machine learning through SynapseML, and then send the output to a generated search index.

manager: nitinme
author: HeidiSteen
ms.author: heidist
ms.service: cognitive-search
ms.topic: how-to
ms.date: 08/09/2022
---

# Add search to AI-enriched data from Apache Spark using SynapseML

In this Azure Cognitive Search article, learn how to add data exploration and full text search to a SynapseML solution.

[SynapseML](/research/blog/synapseml-a-simple-multilingual-and-massively-parallel-machine-learning-library/) is an open-source library that supports massively parallel machine learning over big data. One of the ways in which machine learning is exposed is through *transformers* that perform specialized tasks. Transformers tap into a wide range of AI capabilities, but in this article, we'll focus on just those that call Cognitive Services and Cognitive Search.

In this walkthrough, you'll set up a notebook that does the following:

> [!div class="checklist"]
> + Load various forms (invoices) into a data frame in an Apache Spark session
> + Analyze them to determine their features
> + Assemble the resulting output into a tabular data structure
> + Write the output to a search index in Azure Cognitive Search
> + Explore and search over the content you created

Although Azure Cognitive Search has native [AI enrichment](cognitive-search-concept-intro.md), this walkthrough shows you how to access AI capabilities outside of Cognitive Search. By using SynapseML instead of indexers or skills, you're not subject to data limits or other constraints associated with those objects.

> [!TIP]
> Watch a demo at [https://www.youtube.com/watch?v=iXnBLwp7f88](https://www.youtube.com/watch?v=iXnBLwp7f88). The demo expands on this walkthrough with more steps and visuals.

## Prerequisites

You'll need the `synapseml` library and several Azure resources. If possible, use the same subscription and region for your Azure resources, and put everything into one resource group for simple cleanup later. The following links are for portal installs. The sample data is imported from a public site.

+ [Azure Cognitive Search](search-create-service-portal.md) (any tier) <sup>1</sup>
+ [Azure Cognitive Services](/azure/cognitive-services/cognitive-services-apis-create-account?tabs=multiservice%2Cwindows#create-a-new-azure-cognitive-services-resource) (any tier) <sup>2</sup>
+ [Azure Databricks](/azure/databricks/scenarios/quickstart-create-databricks-workspace-portal?tabs=azure-portal) (any tier) <sup>3</sup>

<sup>1</sup> You can use the free tier for this walkthrough, but [choose a higher tier](search-sku-tier.md) if data volumes are large. You'll need the [API key](search-security-api-keys.md#find-existing-keys) for this resource.

<sup>2</sup> This walkthrough uses Azure Form Recognizer and Azure Translator. In the instructions below, you'll provide a [Cognitive Services multi-service key](/azure/cognitive-services/cognitive-services-apis-create-account?tabs=multiservice%2Cwindows#get-the-keys-for-your-resource) and the region, and it'll work for both services.

<sup>3</sup> In this walkthrough, Azure Databricks provides the computing platform. You could also use Azure Synapse Analytics or any other computing platform supported by `synapseml`. The Azure Databricks article listed in the prerequisites includes multiple steps. For this walkthrough, follow only the instructions in "Create a workspace".

> [!NOTE]
> All of the above resources support security features in the Microsoft identity platform. For simplicity, this walkthrough assumes key-based authentication, using endpoints and keys copied from the portal pages of each service. If you implement this workflow in a production environment, or share the solution with others, remember to replace hard-coded keys with integrated security or encrypted keys.

## Create a Spark cluster and notebook

In this section, you'll create a cluster, install the `synapseml` library, and create a notebook to run the code.

1. In the Azure portal, find your Azure Databricks workspace and select **Launch workspace**.

1. On the left menu, select **Compute**.

1. Select **Create cluster**.

1. Give the cluster a name, accept the default configuration, and then create the cluster. It takes several minutes to create the cluster.

1. Install the `synapseml` library after the cluster is created:

   1. Select **Library** from the tabs at the top of the cluster's page.

   1. Select **Install new**.

      :::image type="content" source="media/search-synapseml-cognitive-services/install-library.png" alt-text="Screenshot of the Install New command." border="true":::

   1. Select **Maven**.

   1. In **Coordinates**, enter `com.microsoft.azure:synapseml_2.12:0.10.0`.

   1. Select **Install**.

1. On the left menu, select **Create** > **Notebook**.

   :::image type="content" source="media/search-synapseml-cognitive-services/create-notebook.png" alt-text="Screenshot of the Create Notebook command." border="true":::

1. Give the notebook a name, select **Python** as the default language, and select the cluster that has the `synapseml` library.

1. Create seven consecutive cells. You'll paste code into each one.

   :::image type="content" source="media/search-synapseml-cognitive-services/create-seven-cells.png" alt-text="Screenshot of the notebook with placeholder cells." border="true":::

## Set up dependencies

Paste the following code into the first cell of your notebook. Replace the placeholders with endpoints and access keys for each resource. No other modifications are required, so run the code when you're ready.

This code imports packages and sets up access to the Azure resources used in this workflow.

```python
import os
from pyspark.sql.functions import udf, trim, split, explode, col, monotonically_increasing_id, lit
from pyspark.sql.types import StringType
from synapse.ml.core.spark import FluentAPI

cognitive_services_key = "placeholder-cognitive-services-multi-service-key"
cognitive_services_region = "placeholder-cognitive-services-region"

search_service = "placeholder-search-service-name"
search_key = "placeholder-search-service-api-key"
search_index = "placeholder-search-index-name"
```
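
As the note in the prerequisites suggests, avoid leaving keys hard-coded in a notebook you share. One option is to read them from environment variables instead. This is a minimal sketch: the variable names are hypothetical, and in Azure Databricks you'd more likely read the values from a secret scope.

```python
import os

# Hypothetical variable names -- set these in your own environment, or
# swap this out for your platform's secret store.  The placeholders are
# only fallbacks so the cell still runs before you configure anything.
cognitive_services_key = os.environ.get(
    "COGNITIVE_SERVICES_KEY", "placeholder-cognitive-services-multi-service-key")
search_key = os.environ.get(
    "SEARCH_SERVICE_API_KEY", "placeholder-search-service-api-key")
```

This keeps secrets out of the notebook source while leaving the rest of the cells unchanged, since they only reference the variable names.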

## Load data into Spark

Paste the following code into the second cell. No modifications are required, so run the code when you're ready.

This code loads a small number of external files from an Azure storage account that's used for demo purposes. The files are various invoices, and they're read into a data frame.

```python
def blob_to_url(blob):
    [prefix, postfix] = blob.split("@")
    container = prefix.split("/")[-1]
    split_postfix = postfix.split("/")
    account = split_postfix[0]
    filepath = "/".join(split_postfix[1:])
    return "https://{}/{}/{}".format(account, container, filepath)


df2 = (spark.read.format("binaryFile")
    .load("wasbs://[email protected]/form_subset/*")
    .select("path")
    .limit(10)
    .select(udf(blob_to_url, StringType())("path").alias("url"))
    .cache())

display(df2)
```
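
To see what the `blob_to_url` helper does, you can run it on a single path. The container, account, and file name below are hypothetical stand-ins for the demo storage paths:

```python
def blob_to_url(blob):
    # "wasbs://container@account/filepath" -> "https://account/container/filepath"
    [prefix, postfix] = blob.split("@")
    container = prefix.split("/")[-1]
    split_postfix = postfix.split("/")
    account = split_postfix[0]
    filepath = "/".join(split_postfix[1:])
    return "https://{}/{}/{}".format(account, container, filepath)


print(blob_to_url("wasbs://mycontainer@myaccount.blob.core.windows.net/form_subset/invoice.pdf"))
# → https://myaccount.blob.core.windows.net/mycontainer/form_subset/invoice.pdf
```

The conversion matters because the transformers in later cells take public `https` URLs, not `wasbs://` paths.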

## Apply form recognition

Paste the following code into the third cell. No modifications are required, so run the code when you're ready.

This code loads the [AnalyzeInvoices transformer](https://microsoft.github.io/SynapseML/docs/documentation/transformers/transformers_cognitive/#analyzeinvoices) and passes a reference to the data frame containing the invoices. It calls the pre-built [invoice model](/azure/applied-ai-services/form-recognizer/concept-invoice) of Azure Form Recognizer.

```python
from synapse.ml.cognitive import AnalyzeInvoices

analyzed_df = (AnalyzeInvoices()
    .setSubscriptionKey(cognitive_services_key)
    .setLocation(cognitive_services_region)
    .setImageUrlCol("url")
    .setOutputCol("invoices")
    .setErrorCol("errors")
    .setConcurrency(5)
    .transform(df2)
    .cache())

display(analyzed_df)
```

## Apply data restructuring

Paste the following code into the fourth cell and run it. No modifications are required.

This code loads [FormOntologyLearner](https://mmlspark.blob.windows.net/docs/0.10.0/pyspark/synapse.ml.cognitive.html?highlight=formontologylearner#module-synapse.ml.cognitive.FormOntologyLearner), a transformer that analyzes the output of Form Recognizer transformers and infers a tabular data structure. The output of AnalyzeInvoices is dynamic and varies based on the features detected in your content. Furthermore, the AnalyzeInvoices transformer consolidates output into a single column. Because the output is dynamic and consolidated, it's difficult to use in downstream transformations that require more structure.

FormOntologyLearner extends the utility of the AnalyzeInvoices transformer by looking for patterns that can be used to create a tabular data structure. Organizing the output into multiple columns and rows makes the content consumable in other transformers, like AzureSearchWriter.

```python
from synapse.ml.cognitive import FormOntologyLearner

itemized_df = (FormOntologyLearner()
    .setInputCol("invoices")
    .setOutputCol("extracted")
    .fit(analyzed_df)
    .transform(analyzed_df)
    .select("url", "extracted.*").select("*", explode(col("Items")).alias("Item"))
    .drop("Items").select("Item.*", "*").drop("Item"))

display(itemized_df)
```
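
If the select/explode/drop chain is hard to follow, here's the same reshaping sketched in plain Python on hypothetical data: each invoice row carrying an `Items` array becomes one output row per line item, with the invoice-level columns repeated.

```python
# Hypothetical rows standing in for the FormOntologyLearner output;
# real field names depend on what the invoice model detects.
invoice_rows = [
    {"url": "https://example/invoice1.pdf", "VendorName": "Contoso",
     "Items": [{"Description": "Door", "Quantity": 2.0},
               {"Description": "Hinge", "Quantity": 8.0}]},
]

# One output row per line item; invoice-level fields (url, VendorName)
# repeat on every row, and the "Items" array itself is dropped.
itemized = [
    {**item, **{k: v for k, v in row.items() if k != "Items"}}
    for row in invoice_rows
    for item in row["Items"]
]

print(len(itemized))  # → 2
```

That flattened, one-row-per-item shape is what lets AzureSearchWriter later map each row to one search document.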

## Apply translations

Paste the following code into the fifth cell. No modifications are required, so run the code when you're ready.

This code loads [Translate](https://microsoft.github.io/SynapseML/docs/documentation/transformers/transformers_cognitive/#translate), a transformer that calls the Azure Translator service in Cognitive Services. The original text, which is in English in the "Description" column, is machine-translated into various languages. All of the output is consolidated into the "output.translations" array.

```python
from synapse.ml.cognitive import Translate

translated_df = (Translate()
    .setSubscriptionKey(cognitive_services_key)
    .setLocation(cognitive_services_region)
    .setTextCol("Description")
    .setErrorCol("TranslationError")
    .setOutputCol("output")
    .setToLanguage(["zh-Hans", "fr", "ru", "cy"])
    .setConcurrency(5)
    .transform(itemized_df)
    .withColumn("Translations", col("output.translations")[0])
    .drop("output", "TranslationError")
    .cache())

display(translated_df)
```
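
For reference, each element of "output.translations" follows the shape of the Translator v3 response: an array of `{text, to}` pairs, one per target language. A sketch with placeholder strings standing in for the translated text:

```python
# Placeholder strings stand in for the machine-translated text.
sample_translations = [
    {"text": "placeholder-zh-Hans-text", "to": "zh-Hans"},
    {"text": "placeholder-fr-text", "to": "fr"},
    {"text": "placeholder-ru-text", "to": "ru"},
    {"text": "placeholder-cy-text", "to": "cy"},
]

# col("output.translations")[0] in the cell above pulls an array like
# this out of the single Translator result attached to each row.
languages = [t["to"] for t in sample_translations]
print(languages)  # → ['zh-Hans', 'fr', 'ru', 'cy']
```

Keeping this shape in mind helps when reading the generated index, where "Translations" becomes a complex collection with these subfields.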

> [!TIP]
> To check for translated strings, scroll to the end of the rows.
>
> :::image type="content" source="media/search-synapseml-cognitive-services/translated-strings.png" alt-text="Screenshot of table output, showing the Translations column." border="true":::

## Apply search indexing

Paste the following code into the sixth cell and then run it. No modifications are required.

This code loads [AzureSearchWriter](https://microsoft.github.io/SynapseML/docs/documentation/transformers/transformers_cognitive/#azuresearch). It consumes a tabular dataset and infers a search index schema that defines one field for each column. Because the translations structure is an array, it's articulated in the index as a complex collection with subfields for each language translation. The generated index will have a document key and use the default values for fields created using the [Create Index REST API](/rest/api/searchservice/create-index).

```python
from synapse.ml.cognitive import *

(translated_df.withColumn("DocID", monotonically_increasing_id().cast("string"))
    .withColumn("SearchAction", lit("upload"))
    .writeToAzureSearch(
        subscriptionKey=search_key,
        actionCol="SearchAction",
        serviceName=search_service,
        indexName=search_index,
        keyCol="DocID",
    ))
```
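
To confirm that documents arrived, one option is the `$count` endpoint of the REST API. This sketch only builds the request URL; the placeholder names match the variables set in the first cell, so substitute your own values, then send the request with `requests.get` in the notebook.

```python
# Build the document-count URL; placeholders match the variables set
# in the first cell (search_service, search_index).
search_service = "placeholder-search-service-name"
search_index = "placeholder-search-index-name"

count_url = "https://{}.search.windows.net/indexes/{}/docs/$count?api-version=2020-06-30".format(
    search_service, search_index)

print(count_url)
# To run it: requests.get(count_url, headers={"api-key": search_key}).text
```

The count should match the number of rows in `translated_df` once indexing completes; indexing is near-immediate for small loads but can lag slightly.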
226+
227+
## Query the index
228+
229+
Paste the following code into the seventh cell and then run it. No modifications are required, except that you might want to vary the [query syntax](query-simple-syntax.md) or [review these query examples](search-query-simple-examples.md) to further explore your content.
230+
231+
This code calls the [Search Documents REST API](/rest/api/searchservice/search-documents) that queries an index. This particular example is searching for the word "door". This query returns a count of the number of matching documents. It also returns just the contents of the "Description' and "Translations" fields. If you want to see the full list of fields, remove the "select" parameter.
232+
233+
```python
234+
import requests
235+
236+
url = "https://{}.search.windows.net/indexes/{}/docs/search?api-version=2020-06-30".format(search_service, search_index)
237+
requests.post(url, json={"search": "door", "count": "true", "select": "Description, Translations"}, headers={"api-key": search_key}).json()
238+
```
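
As one example of varying the query, the body below requires both terms and caps the number of results. This is a sketch: the field names assume the generated index, and you'd substitute `variant_query` for the `json=` payload in the cell above.

```python
# "+door +window" requires both terms when searchMode is "all";
# "top" limits how many documents come back.
variant_query = {
    "search": "+door +window",
    "queryType": "simple",
    "searchMode": "all",
    "select": "Description, Translations",
    "top": 5,
    "count": True,
}

print(variant_query["search"])  # → +door +window
```

The simple syntax also supports `*` suffix wildcards and `-term` exclusions, so small edits to the "search" string go a long way when exploring the content.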
239+
240+
The following screenshot shows the cell output for above script.
241+
242+
:::image type="content" source="media/search-synapseml-cognitive-services/query-results.png" alt-text="Screenshot of query results showing the count, search string, and return fields." border="true":::
243+
244+
## Clean up resources
245+
246+
When you're working in your own subscription, at the end of a project, it's a good idea to remove the resources that you no longer need. Resources left running can cost you money. You can delete resources individually or delete the resource group to delete the entire set of resources.
247+
248+
You can find and manage resources in the portal, using the **All resources** or **Resource groups** link in the left-navigation pane.
249+
250+
## Next steps

In this walkthrough, you learned about the [AzureSearchWriter](https://microsoft.github.io/SynapseML/docs/documentation/transformers/transformers_cognitive/#azuresearch) transformer in SynapseML, which is a new way of creating and loading search indexes in Azure Cognitive Search. The transformer takes structured JSON as an input. The FormOntologyLearner can provide the necessary structure for output produced by the Form Recognizer transformers in SynapseML.

As a next step, review the other SynapseML tutorials that produce transformed content you might want to explore through Azure Cognitive Search:

> [!div class="nextstepaction"]
> [Tutorial: Text Analytics with Cognitive Service](/azure/synapse-analytics/machine-learning/tutorial-text-analytics-use-mmlspark)
