Commit 7e25d36 ("refresh"), 1 parent: 8726da3
1 file changed: 57 additions, 40 deletions

articles/ai-services/openai/how-to/integrate-synapseml.md

---
title: 'Use Azure OpenAI Service with large datasets'
titleSuffix: Azure OpenAI
description: Learn how to integrate Azure OpenAI Service with SynapseML and Apache Spark to apply large language models at a distributed scale.
services: cognitive-services
manager: nitinme
ms.service: cognitive-services
ms.subservice: openai
ms.custom: build-2023, build-2023-dataai
ms.topic: how-to
ms.date: 08/29/2023
author: ChrisHMSFT
ms.author: chrhoder
recommendations: false
---

# Use Azure OpenAI with large datasets

Azure OpenAI can be used to solve a large number of natural language tasks through prompting the completion API. To make it easier to scale your prompting workflows from a few examples to large datasets of examples, Azure OpenAI Service is integrated with the distributed machine learning library [SynapseML](https://www.microsoft.com/research/blog/synapseml-a-simple-multilingual-and-massively-parallel-machine-learning-library/). This integration makes it easy to use the [Apache Spark](https://spark.apache.org/) distributed computing framework to process millions of prompts with Azure OpenAI Service. This tutorial shows how to apply large language models at a distributed scale by using Azure OpenAI and Azure Synapse Analytics.

## Prerequisites

- An Azure subscription. <a href="https://azure.microsoft.com/free/cognitive-services" target="_blank">Create one for free</a>.
- Access granted to Azure OpenAI in the desired Azure subscription.
- An Azure OpenAI resource. [Create a resource](create-resource.md?pivots=web-portal#create-a-resource).
- An Apache Spark cluster with SynapseML installed. Create a [serverless Apache Spark pool](../../../synapse-analytics/get-started-analyze-spark.md#create-a-serverless-apache-spark-pool).

> [!NOTE]
> Currently, you must submit an application to access Azure OpenAI Service. To apply for access, complete <a href="https://aka.ms/oai/access" target="_blank">this form</a>. If you need assistance, open an issue on this repo to contact Microsoft.

Microsoft recommends that you [create an Azure Synapse workspace](../../../synapse-analytics/get-started-create-workspace.md). However, you can also use Azure Databricks, Azure HDInsight, Spark on Kubernetes, or the Python environment with the `pyspark` package.

## Import example code as a notebook

To use the example code in this article with your Spark cluster, you have two options:

- Create a notebook in your Spark platform and copy the code into this notebook to run the demo.
- Download the notebook and import it into Azure Synapse.

1. [Download this demo as a notebook](https://github.com/microsoft/SynapseML/blob/master/docs/Explore%20Algorithms/OpenAI/OpenAI.ipynb). During the download process, select **Raw**, and then save the file.
1. Import the notebook [into the Synapse Workspace](../../../synapse-analytics/spark/apache-spark-development-using-notebooks.md#create-a-notebook) or, if you're using Azure Databricks, [into the Azure Databricks Workspace](/azure/databricks/notebooks/notebooks-manage#create-a-notebook).
1. Install SynapseML on your cluster. See the installation instructions for Azure Synapse at the bottom of [the SynapseML website](https://microsoft.github.io/SynapseML/). This task requires pasting another cell at the top of the notebook you imported.
1. Connect your notebook to a cluster and follow along with editing and running the cells later in this article.

## Fill in your service information

When the notebook is ready, edit the cells in your notebook to point to your service. Set the `resource_name`, `deployment_name`, `location`, and `key` variables to the corresponding values for your Azure OpenAI resource.

> [!IMPORTANT]
> Remember to remove the key from your code when you're done, and never post it publicly. For production, use a secure way of storing and accessing your credentials like [Azure Key Vault](../../../key-vault/general/overview.md). For more information, see [Azure AI services security](../../security-features.md).

```python
import os

# Replace these placeholder values with your service information.
resource_name = "RESOURCE_NAME"        # The name of your Azure OpenAI resource
deployment_name = "DEPLOYMENT_NAME"    # The name of your model deployment
location = "RESOURCE_LOCATION"         # The region of your resource, such as eastus
key = "RESOURCE_API_KEY"               # The key for your resource

assert key is not None and resource_name is not None
```
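Building on the Key Vault guidance above, a common alternative to hard-coding the key is to read the values from environment variables set on the cluster. This is a sketch, not part of the original notebook, and the `AZURE_OPENAI_*` variable names are assumptions:

```python
import os

# Hypothetical environment variable names; set them in your cluster or
# session configuration instead of pasting secrets into the notebook.
resource_name = os.environ.get("AZURE_OPENAI_RESOURCE_NAME", "RESOURCE_NAME")
deployment_name = os.environ.get("AZURE_OPENAI_DEPLOYMENT_NAME", "DEPLOYMENT_NAME")
location = os.environ.get("AZURE_OPENAI_LOCATION", "eastus")
key = os.environ.get("AZURE_OPENAI_KEY")  # None when unset; check before use

assert resource_name is not None
```

The `assert` from the original cell still applies: verify that `key` is set before making any requests.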

## Create a dataset of prompts

The next step is to create a dataframe consisting of a series of rows, with one prompt per row.

You can also load data directly from Azure Data Lake Storage or other databases. For more information about loading and preparing Spark dataframes, see the [Apache Spark data sources guide](https://spark.apache.org/docs/latest/sql-data-sources.html).

```python
df = spark.createDataFrame(
    [
        ("Hello my name is",),
        ("The best code is code that's",),
        ("SynapseML is ",),
        ("Spark is ",),
    ]
).toDF("prompt")
```

## Create the OpenAICompletion Apache Spark client

To apply the Azure OpenAI Completion service to the dataframe, create an `OpenAICompletion` object that serves as a distributed client. Parameters of the service can be set either with a single value, or by a column of the dataframe with the appropriate setters on the `OpenAICompletion` object. In this example, you set the `maxTokens` parameter to 200. A token is around four characters, and this limit applies to the sum of the prompt and the result. You also set the `promptCol` parameter with the name of the prompt column in the dataframe.

```python
from synapse.ml.cognitive import OpenAICompletion

completion = (
    OpenAICompletion()
    .setSubscriptionKey(key)
    .setDeploymentName(deployment_name)
    .setUrl("https://{}.openai.azure.com/".format(resource_name))
    .setMaxTokens(200)
    .setPromptCol("prompt")
    .setErrorCol("error")
    .setOutputCol("completions")
)
```
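Because `maxTokens` applies to the sum of the prompt and the completion, it can help to estimate prompt length up front with the four-characters-per-token rule of thumb from the paragraph above. The helper below is a hypothetical illustration, not part of SynapseML:

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: a token is around four characters."""
    return max(1, len(text) // 4)

def leaves_room_for_completion(prompt: str, max_tokens: int = 200) -> bool:
    """True if the prompt uses less than the maxTokens budget, leaving
    tokens available for the generated completion."""
    return estimate_tokens(prompt) < max_tokens

print(estimate_tokens("Hello my name is"))             # 4
print(leaves_room_for_completion("Hello my name is"))  # True
```

A prompt that already consumes the whole budget leaves no tokens for output, so its completion would be empty or truncated.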

## Transform the dataframe with the OpenAICompletion client

After you have the dataframe and the completion client, you can transform your input dataset and add a column called `completions` with all of the information the service adds. In this example, you select only the text for simplicity.

```python
from pyspark.sql.functions import col

completed_df = completion.transform(df).cache()
display(completed_df.select(
    col("prompt"), col("error"), col("completions.choices.text").getItem(0).alias("text")))
```
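The column path `completions.choices.text` in the `select` above mirrors the nested structure of the service response. Extracting the first choice's text looks like the following plain-Python sketch; the exact response shape is an assumption inferred from that column path:

```python
# A simplified stand-in for one row's "completions" struct.
completion_row = {
    "choices": [
        {"text": " Makaveli I'm eighteen years old and I want to be a rapper"},
        # More choices appear here if more than one completion is requested.
    ]
}

def first_choice_text(completion: dict) -> str:
    """Plain-Python equivalent of col("completions.choices.text").getItem(0)."""
    return completion["choices"][0]["text"]

print(first_choice_text(completion_row))
```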

Your output should look something like the following example. Keep in mind that the completion text can vary, so your output might look different.

```output
prompt                        error      text
------------------------------------------------------------------------------------------------------------------------------------------------------
Hello my name is              undefined  Makaveli
                                         I'm eighteen years old and I want to be a rapper when I grow up
                                         I love writing and making music
                                         I'm from Los Angeles, CA

The best code is code that's  undefined  understandable
                                         This is a subjective statement, and there is no definitive answer.

SynapseML is                  undefined  A machine learning algorithm that is able to learn how to predict the future outcome of events.
```

## Explore other usage scenarios

Let's review some other scenarios for working with Azure OpenAI Service and large datasets.

### Improve throughput with request batching

The previous example makes several requests to the service, one for each prompt. To complete multiple prompts in a single request, you can use batch mode.

In the `OpenAICompletion` object, instead of setting the `prompt` column, specify a `batchPrompt` column. To support this method, create a dataframe with a list of prompts per row.

> [!NOTE]
> There's currently a limit of 20 prompts in a single request and a limit of 2048 tokens, or approximately 1500 words.

```python
batch_df = spark.createDataFrame(
    [
        # Each row holds a list of prompts that are sent together in one request.
        (["The time has come", "Pleased to", "Today stocks", "Here's to"],),
        (["The only thing", "Ask not what", "Every litter", "I am"],),
    ]
).toDF("batchPrompt")
```
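Because a single batch request accepts at most 20 prompts, a longer list of prompts needs to be split into rows of at most 20 before it's turned into `batch_df`. The `chunk_prompts` helper below is hypothetical, not part of SynapseML:

```python
def chunk_prompts(prompts: list, batch_size: int = 20) -> list:
    """Split a flat list of prompts into sublists of at most batch_size,
    one sublist per dataframe row, to stay under the 20-prompt limit."""
    return [prompts[i:i + batch_size] for i in range(0, len(prompts), batch_size)]

rows = chunk_prompts([f"prompt {i}" for i in range(45)])
print([len(r) for r in rows])  # [20, 20, 5]
```

Each sublist would then become one `batchPrompt` row of the dataframe.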

Next, you create the `OpenAICompletion` object. Rather than setting the `prompt` column, set the `batchPrompt` column if your column is of type `Array[String]`.

```python
batch_completion = (
    OpenAICompletion()
    .setSubscriptionKey(key)
    .setDeploymentName(deployment_name)
    .setUrl("https://{}.openai.azure.com/".format(resource_name))
    .setMaxTokens(200)
    .setBatchPromptCol("batchPrompt")
    .setErrorCol("error")
    .setOutputCol("completions")
)
```

In the call to `transform`, one request is made per row. Because there are multiple prompts in a single row, each request is sent with all prompts in that row. The results contain a row for each row in the request.

```python
completed_batch_df = batch_completion.transform(batch_df).cache()
display(completed_batch_df)
```

> [!NOTE]
> There's currently a limit of 20 prompts in a single request and a limit of 2048 tokens, or approximately 1500 words.

### Use an automatic mini-batcher

If your data is in column format, you can transpose it to row format by using the SynapseML `FixedMiniBatchTransformer` object.

```python
from pyspark.sql.types import StringType
from synapse.ml.stages import FixedMiniBatchTransformer
from synapse.ml.core.spark import FluentAPI

completed_autobatch_df = (df
    .coalesce(1)  # Force a single partition so the 4-row dataframe makes one batch of size 4. You can remove this step for large datasets.
    .mlTransform(FixedMiniBatchTransformer(batchSize=4))
    .withColumnRenamed("prompt", "batchPrompt")
    .mlTransform(batch_completion))
display(completed_autobatch_df)
```
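The mini-batcher groups consecutive rows within each partition, which is why the example calls `coalesce(1)` first. The plain-Python sketch below illustrates that behavior outside Spark; the `batch_partition` helper is hypothetical:

```python
def batch_partition(rows: list, batch_size: int = 4) -> list:
    """Group consecutive rows of a single partition into batches of at
    most batch_size, mirroring what the mini-batcher does per partition."""
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]

prompts = ["Hello my name is", "The best code is code that's",
           "SynapseML is ", "Spark is "]

# One partition (the effect of coalesce(1)): a single batch of 4.
print([len(b) for b in batch_partition(prompts)])  # [4]

# Two partitions of two rows each would yield two batches of 2 instead.
two_partitions = [prompts[:2], prompts[2:]]
print([len(b) for part in two_partitions for b in batch_partition(part)])  # [2, 2]
```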

### Prompt engineering for translation

Azure OpenAI can solve many different natural language tasks through [prompt engineering](completions.md). In this example, you prompt the model for language translation:

```python
translate_df = spark.createDataFrame(
    [
        # Few-shot prompts: translated example pairs, then an open-ended line.
        ("Japanese: Ookina hako \nEnglish: Big box \nJapanese: Midori tako\nEnglish:",),
        ("French: Quelle heure est-il à Montréal? \nEnglish: What time is it in Montreal? \nFrench: Où est le poulet? \nEnglish:",),
    ]
).toDF("prompt")

display(completion.transform(translate_df))
```
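The translation rows follow a few-shot pattern: translated example pairs followed by a final source line whose translation is left for the model to complete. A hypothetical helper for building such prompts:

```python
def translation_prompt(examples, query, src="French", dst="English"):
    """Build a few-shot translation prompt: labeled example pairs, then the
    query with the target-language label left open for the model."""
    lines = [f"{src}: {source}\n{dst}: {target}" for source, target in examples]
    lines.append(f"{src}: {query}\n{dst}:")
    return "\n".join(lines)

prompt = translation_prompt(
    [("Quelle heure est-il à Montréal?", "What time is it in Montreal?")],
    "Où est le poulet?",
)
print(prompt)
```

Each generated string would become one row of the prompt dataframe.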

### Prompt for question answering

Azure OpenAI also supports prompting the GPT-3 model for general-knowledge question answering:

```python
qa_df = spark.createDataFrame(
    [
        (
            "Q: Where is the Grand Canyon?\nA: The Grand Canyon is in Arizona.\n\nQ: What is the weight of the Burj Khalifa in kilograms?\nA:",
        )
    ]
).toDF("prompt")

display(completion.transform(qa_df))
```
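The question-answering prompt uses the same few-shot idea in a Q/A format: an answered example, then the new question with an empty answer for the model to fill in. A hypothetical builder for such prompts:

```python
def qa_prompt(examples, question):
    """Build a general-knowledge QA prompt: answered Q/A pairs, then the
    new question with an empty answer for the model to complete."""
    parts = [f"Q: {q}\nA: {a}" for q, a in examples]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

p = qa_prompt(
    [("Where is the Grand Canyon?", "The Grand Canyon is in Arizona.")],
    "What is the weight of the Burj Khalifa in kilograms?",
)
print(p)
```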
