Commit 563fe7d

Merge pull request #249474 from GitHubber17/141588-refresh-6
Azure OpenAI Freshness Pass - User Story: 141588
2 parents dde2604 + 9c6b4d4 commit 563fe7d

2 files changed: 126 additions, 46 deletions
@@ -1,67 +1,141 @@
---
title: 'Use Azure OpenAI Service with large datasets'
titleSuffix: Azure OpenAI
description: Learn how to integrate Azure OpenAI Service with SynapseML and Apache Spark to apply large language models at a distributed scale.
services: cognitive-services
manager: nitinme
ms.service: cognitive-services
ms.subservice: openai
ms.custom: build-2023, build-2023-dataai
ms.topic: how-to
ms.date: 09/01/2023
author: ChrisHMSFT
ms.author: chrhoder
recommendations: false
---

# Use Azure OpenAI with large datasets

Azure OpenAI can be used to solve a large number of natural language tasks through prompting the completion API. To make it easier to scale your prompting workflows from a few examples to large datasets of examples, Azure OpenAI Service is integrated with the distributed machine learning library [SynapseML](https://www.microsoft.com/research/blog/synapseml-a-simple-multilingual-and-massively-parallel-machine-learning-library/). This integration makes it easy to use the [Apache Spark](https://spark.apache.org/) distributed computing framework to process millions of prompts with Azure OpenAI Service.

This tutorial shows how to apply large language models at a distributed scale by using Azure OpenAI and Azure Synapse Analytics.

## Prerequisites

- An Azure subscription. <a href="https://azure.microsoft.com/free/cognitive-services" target="_blank">Create one for free</a>.

- Access granted to Azure OpenAI in your Azure subscription.

  Currently, you must submit an application to access Azure OpenAI Service. To apply for access, complete <a href="https://aka.ms/oai/access" target="_blank">this form</a>. If you need assistance, open an issue on this repo to contact Microsoft.

- An Azure OpenAI resource. [Create a resource](create-resource.md?pivots=web-portal#create-a-resource).

- An Apache Spark cluster with SynapseML installed.

  - Create a [serverless Apache Spark pool](../../../synapse-analytics/get-started-analyze-spark.md#create-a-serverless-apache-spark-pool).
  - To install SynapseML for your Apache Spark cluster, see [Install SynapseML](#install-synapseml).

> [!NOTE]
> This article is designed to work with the [Azure OpenAI Service legacy models](/azure/ai-services/openai/concepts/legacy-models) like `Text-Davinci-003`, which support prompt-based completions. Newer models like the current `GPT-3.5 Turbo` and `GPT-4` model series are designed to work with the new chat completion API that expects a specially formatted array of messages as input.
>
> The Azure OpenAI SynapseML integration supports the latest models via the [OpenAIChatCompletion()](https://github.com/microsoft/SynapseML/blob/0836e40efd9c48424e91aa10c8aa3fbf0de39f31/cognitive/src/main/scala/com/microsoft/azure/synapse/ml/cognitive/openai/OpenAIChatCompletion.scala#L24) transformer, which isn't demonstrated in this article. After the [release of the GPT-3.5 Turbo Instruct model](https://techcommunity.microsoft.com/t5/azure-ai-services-blog/announcing-updates-to-azure-openai-service-models/ba-p/3866757), the newer model will be the preferred model to use with this article.

We recommend that you [create an Azure Synapse workspace](../../../synapse-analytics/get-started-create-workspace.md). However, you can also use Azure Databricks, Azure HDInsight, Spark on Kubernetes, or the Python environment with the `pyspark` package.

## Use example code as a notebook

To use the example code in this article with your Apache Spark cluster, complete the following steps:

1. Prepare a new or existing notebook.

1. Connect your Apache Spark cluster with your notebook.

1. Install SynapseML for your Apache Spark cluster in your notebook.

1. Configure the notebook to work with your Azure OpenAI service resource.

### Prepare your notebook

You can create a new notebook in your Apache Spark platform, or you can import an existing notebook. After you have a notebook in place, you can add each snippet of example code in this article as a new cell in your notebook.

- To use a notebook in Azure Synapse Analytics, see [Create, develop, and maintain Synapse notebooks in Azure Synapse Analytics](../../../synapse-analytics/spark/apache-spark-development-using-notebooks.md).

- To use a notebook in Azure Databricks, see [Manage notebooks for Azure Databricks](/azure/databricks/notebooks/notebooks-manage).

- (Optional) Download [this demonstration notebook](https://github.com/microsoft/SynapseML/blob/master/docs/Explore%20Algorithms/OpenAI/OpenAI.ipynb) and connect it with your workspace. During the download process, select **Raw**, and then save the file.

### Connect your cluster

When you have a notebook ready, connect or _attach_ your notebook to an Apache Spark cluster.

### Install SynapseML

To run the exercises, you need to install SynapseML on your Apache Spark cluster. For more information, see [Install SynapseML](https://microsoft.github.io/SynapseML/docs/Get%20Started/Install%20SynapseML/) on the [SynapseML website](https://microsoft.github.io/SynapseML/).

To install SynapseML, create a new cell at the top of your notebook and run the following code.

- For a **Spark3.2 pool**, use the following code:

  ```python
  %%configure -f
  {
    "name": "synapseml",
    "conf": {
      "spark.jars.packages": "com.microsoft.azure:synapseml_2.12:0.11.2,org.apache.spark:spark-avro_2.12:3.3.1",
      "spark.jars.repositories": "https://mmlspark.azureedge.net/maven",
      "spark.jars.excludes": "org.scala-lang:scala-reflect,org.apache.spark:spark-tags_2.12,org.scalactic:scalactic_2.12,org.scalatest:scalatest_2.12,com.fasterxml.jackson.core:jackson-databind",
      "spark.yarn.user.classpath.first": "true",
      "spark.sql.parquet.enableVectorizedReader": "false",
      "spark.sql.legacy.replaceDatabricksSparkAvro.enabled": "true"
    }
  }
  ```

- For a **Spark3.3 pool**, use the following code:

  ```python
  %%configure -f
  {
    "name": "synapseml",
    "conf": {
      "spark.jars.packages": "com.microsoft.azure:synapseml_2.12:0.11.2-spark3.3",
      "spark.jars.repositories": "https://mmlspark.azureedge.net/maven",
      "spark.jars.excludes": "org.scala-lang:scala-reflect,org.apache.spark:spark-tags_2.12,org.scalactic:scalactic_2.12,org.scalatest:scalatest_2.12,com.fasterxml.jackson.core:jackson-databind",
      "spark.yarn.user.classpath.first": "true",
      "spark.sql.parquet.enableVectorizedReader": "false"
    }
  }
  ```

The connection process can take several minutes.

### Configure the notebook

Create a new code cell and run the following code to configure the notebook for your service. Set the `resource_name`, `deployment_name`, `location`, and `key` variables to the corresponding values for your Azure OpenAI resource.

```python
import os

# Replace the following values with your Azure OpenAI resource information
resource_name = "<RESOURCE_NAME>"      # The name of your Azure OpenAI resource.
deployment_name = "<DEPLOYMENT_NAME>"  # The name of your Azure OpenAI deployment.
location = "<RESOURCE_LOCATION>"       # The location or region ID for your resource.
key = "<RESOURCE_API_KEY>"             # The key for your resource.

assert key is not None and resource_name is not None
```

Now you're ready to start running the example code.

> [!IMPORTANT]
> Remember to remove the key from your code when you're done, and never post it publicly. For production, use a secure way of storing and accessing your credentials like [Azure Key Vault](../../../key-vault/general/overview.md). For more information, see [Azure AI services security](../../security-features.md).
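
As an alternative to pasting the key into the notebook, you can read it from an environment variable; a minimal sketch (the variable name `AZURE_OPENAI_KEY` is an assumption for illustration, not a documented convention):

```python
import os

def get_openai_key() -> str:
    # Read the key from the environment instead of hardcoding it in a cell;
    # for production, Azure Key Vault is the recommended store.
    key = os.environ.get("AZURE_OPENAI_KEY")
    if key is None:
        raise RuntimeError("Set AZURE_OPENAI_KEY before running this notebook.")
    return key
```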

## Create a dataset of prompts

The first step is to create a dataframe consisting of a series of rows, with one prompt per row.

You can also load data directly from Azure Data Lake Storage or other databases. For more information about loading and preparing Spark dataframes, see the [Apache Spark data sources guide](https://spark.apache.org/docs/latest/sql-data-sources.html).

```python
df = spark.createDataFrame(
```
@@ -75,7 +149,9 @@ df = spark.createDataFrame(
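
The rest of that cell is elided in the diff above. As a plain-Python sketch, the prompt rows mirror the example prompts from this article's earlier sample-output table (no Spark session needed to follow the shape of the data):

```python
# One tuple per row; these example prompts appear in the article's sample output.
prompts = [
    ("Hello my name is",),
    ("The best code is code that's",),
    ("SynapseML is ",),
]

# In the notebook, this list would become a one-column Spark dataframe:
#   df = spark.createDataFrame(prompts).toDF("prompt")
rows = [{"prompt": p[0]} for p in prompts]
print(len(rows))  # → 3
```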

## Create the OpenAICompletion Apache Spark client

To apply Azure OpenAI Completion generation to the dataframe, create an `OpenAICompletion` object that serves as a distributed client. Parameters can be set either with a single value, or by a column of the dataframe with the appropriate setters on the `OpenAICompletion` object.

In this example, you set the `maxTokens` parameter to 200. A token is around four characters, and this limit applies to the sum of the prompt and the result. You also set the `promptCol` parameter with the name of the prompt column in the dataframe, such as **prompt**.

```python
from synapse.ml.cognitive import OpenAICompletion
@@ -94,7 +170,7 @@ completion = (
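
The four-characters-per-token rule of thumb behind the `maxTokens` setting can be sketched in plain Python (the helper name is illustrative, not part of SynapseML):

```python
def estimate_tokens(text: str) -> int:
    # Rule of thumb from this article: a token is around four characters.
    return max(1, len(text) // 4)

max_tokens = 200  # Limit on prompt plus completion combined.
prompt = "Hello my name is"
completion_budget = max_tokens - estimate_tokens(prompt)
print(completion_budget)  # → 196
```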

## Transform the dataframe with the OpenAICompletion client

After you have the dataframe and completion client, you can transform your input dataset and add a column called `completions` with all of the text generated from the Azure OpenAI completion API. In this example, select only the text for simplicity.

```python
from pyspark.sql.functions import col
@@ -104,23 +180,22 @@ display(completed_df.select(
    col("prompt"), col("error"), col("completions.choices.text").getItem(0).alias("text")))
```

The following image shows example output with completions in Azure Synapse Analytics Studio. Keep in mind that completions text can vary. Your output might look different.

:::image type="content" source="../media/how-to/synapse-studio-transform-dataframe-output.png" alt-text="Screenshot that shows sample completions in Azure Synapse Analytics Studio." border="false":::

## Explore other usage scenarios

Here are some other use cases for working with Azure OpenAI Service and large datasets.

### Improve throughput with request batching

You can use Azure OpenAI Service with large datasets to improve throughput with request batching. In the previous example, you make several requests to the service, one for each prompt. To complete multiple prompts in a single request, you can use batch mode.

In the `OpenAICompletion` object definition, you specify the `"batchPrompt"` value to configure the dataframe to use a **batchPrompt** column. Create the dataframe with a list of prompts for each row.

> [!NOTE]
> There's currently a limit of 20 prompts in a single request and a limit of 2048 tokens, or approximately 1500 words.

```python
batch_df = spark.createDataFrame(
@@ -131,7 +206,7 @@ batch_df = spark.createDataFrame(
).toDF("batchPrompt")
```
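
The 20-prompt limit means a long prompt list has to be split into request-sized rows before building the batch dataframe; a minimal sketch in plain Python (the helper is illustrative):

```python
def chunk_prompts(prompts, limit=20):
    # Azure OpenAI currently accepts at most 20 prompts per request,
    # so split a long list into request-sized batches (one batch per row).
    return [prompts[i:i + limit] for i in range(0, len(prompts), limit)]

batches = chunk_prompts([f"prompt {i}" for i in range(45)])
print([len(b) for b in batches])  # → [20, 20, 5]
```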

Next, create the `OpenAICompletion` object. If your column is of type `Array[String]`, set the `batchPromptCol` value for the column heading, rather than the `promptCol` value.

```python
batch_completion = (
@@ -146,27 +221,27 @@ batch_completion = (
)
```

In the call to `transform`, one request is made per row. Because there are multiple prompts in a single row, each request is sent with all prompts in that row. The results contain a row for each row in the request.

```python
completed_batch_df = batch_completion.transform(batch_df).cache()
display(completed_batch_df)
```
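
The one-request-per-row behavior can be illustrated without a cluster; a plain-Python sketch with mocked responses (no real service calls are made):

```python
# Each row carries a list of prompts; one request is made per row,
# and the result set has one row per input row.
batch_rows = [["p1", "p2", "p3"], ["p4", "p5"]]

mock_responses = [
    {"choices": [{"text": f"completion of {p}"} for p in prompts]}
    for prompts in batch_rows  # one mocked request per row
]
print(len(mock_responses))  # → 2
```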

> [!NOTE]
> There's currently a limit of 20 prompts in a single request and a limit of 2048 tokens, or approximately 1500 words.

### Use an automatic mini-batcher

If your data is in column format, you can transpose it to row format by using the SynapseML `FixedMiniBatchTransformer` object.

```python
from pyspark.sql.types import StringType
from synapse.ml.stages import FixedMiniBatchTransformer
from synapse.ml.core.spark import FluentAPI

completed_autobatch_df = (df
    .coalesce(1)  # Force a single partition so your little 4-row dataframe makes a batch of size 4 - you can remove this step for large datasets.
    .mlTransform(FixedMiniBatchTransformer(batchSize=4))
    .withColumnRenamed("prompt", "batchPrompt")
    .mlTransform(batch_completion))
@@ -176,7 +251,7 @@ display(completed_autobatch_df)

### Prompt engineering for translation

Azure OpenAI can solve many different natural language tasks through _prompt engineering_. For more information, see [Learn how to generate or manipulate text](completions.md). In this example, you can prompt for language translation:

```python
translate_df = spark.createDataFrame(
@@ -191,7 +266,7 @@ display(completion.transform(translate_df))
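
The elided `translate_df` cell builds few-shot translation prompts. A hedged sketch of the prompt shape, with hypothetical language pairs (the notebook's actual prompts may differ):

```python
def translation_prompt(pairs, query):
    # Few-shot prompt: completed translation pairs, then an open-ended line
    # the model is expected to complete. Purely illustrative.
    lines = [f"French: {fr}\nEnglish: {en}" for fr, en in pairs]
    lines.append(f"French: {query}\nEnglish:")
    return "\n".join(lines)

prompt = translation_prompt([("Bonjour", "Hello")], "Merci beaucoup")
print(prompt)
```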

### Prompt for question answering

Azure OpenAI also supports prompting the `Text-Davinci-003` model for general-knowledge question answering:

```python
qa_df = spark.createDataFrame(
@@ -204,3 +279,8 @@ qa_df = spark.createDataFrame(

display(completion.transform(qa_df))
```
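
Like the translation example, question answering relies on prompt formatting. A sketch of a simple Q/A template (illustrative, not the notebook's exact prompt):

```python
def qa_prompt(question: str) -> str:
    # Frame a general-knowledge question so the completion supplies the answer.
    return f"Q: {question}\nA:"

prompts = [(qa_prompt("Which painter created the Mona Lisa?"),)]
# In the notebook: qa_df = spark.createDataFrame(prompts).toDF("prompt")
print(prompts[0][0])
```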

## Next steps

- Learn how to work with the [GPT-35 Turbo and GPT-4 models](/azure/ai-services/openai/how-to/chatgpt?pivots=programming-language-chat-completions).
- Learn more about the [Azure OpenAI Service models](../concepts/models.md).