Batch Endpoints can be used to deploy expensive models, like language models, over text data. In this tutorial, you learn how to deploy a model that can perform text summarization of long sequences of text using a model from HuggingFace. The tutorial also shows how to optimize inference by using the HuggingFace `optimum` and `accelerate` libraries.
## About this sample
The model we are going to work with was built using the popular `transformers` library from HuggingFace, along with [a pre-trained model from Facebook with the BART architecture](https://huggingface.co/facebook/bart-large-cnn). It was introduced in the paper [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation](https://arxiv.org/abs/1910.13461). This model has the following constraints, which are important to keep in mind for deployment:
* It can work with sequences up to 1024 tokens (see the sketch after this list).
* It is trained for summarization of text in English.
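
A minimal sketch of how you might check the first constraint before deployment, assuming the `transformers` library is installed locally; `my_document.txt` is a placeholder for one of your input texts:

```python
# Minimal sketch (not part of the deployment): count the tokens a document
# produces with the model's tokenizer to see whether it exceeds the
# 1024-token limit of facebook/bart-large-cnn.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")

with open("my_document.txt", encoding="utf-8") as f:  # placeholder file
    document = f.read()

n_tokens = len(tokenizer(document)["input_ids"])
print(f"{n_tokens} tokens; anything beyond 1024 has to be truncated or chunked")
```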
We are going to create a batch endpoint named `text-summarization-batch` where we deploy the HuggingFace model to run text summarization on text files in English.
1. Decide on the name of the endpoint. The name of the endpoint ends up in the URI associated with your endpoint. Because of that, __batch endpoint names need to be unique within an Azure region__. For example, there can be only one batch endpoint with the name `mybatchendpoint` in `westus2`.
# [Azure CLI](#tab/cli)
## Creating the deployment
Let's create the deployment that hosts the model:
1. We need to create a scoring script that can read the CSV files provided by the batch deployment and return the summaries produced by the model. The following script performs these actions (a simplified sketch of the script follows the list):
> [!div class="checklist"]
> * Indicates an `init` function that detects the hardware configuration (CPU vs GPU) and loads the model accordingly. Both the model and the tokenizer are loaded in global variables. We are not using a `pipeline` object from HuggingFace to account for the limitation in the sequence lengths of the model we are currently using.
> * Notice that we are performing **model optimizations** to improve the performance using the `optimum` and `accelerate` libraries. If the model or hardware doesn't support them, we run the deployment without such optimizations.
> * Indicates a `run` function that is executed for each mini-batch the batch deployment provides.
> * The `run` function reads the entire batch using the `datasets` library. The text we need to summarize is in the column `text`.
> * The `run` method iterates over each of the rows of the text and runs the prediction. Since this is a very expensive model, running the prediction over entire files would result in an out-of-memory exception. Notice that the model is not executed with the `pipeline` object from `transformers`. This is done to account for long sequences of text and the limitation of 1024 tokens in the underlying model we are using.
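
The following is a simplified, hypothetical sketch of such a scoring script. It follows the structure described in the checklist, but the model source, optimization fallback, and generation parameters are assumptions rather than the exact script shipped with the sample:

```python
# Simplified sketch of a batch scoring script for this model. For brevity the
# model is loaded from the HuggingFace Hub; a real deployment would load it
# from the registered model path (AZUREML_MODEL_DIR).
import datasets
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer


def init():
    global model, tokenizer, device

    # Detect the hardware configuration and load the model accordingly.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    checkpoint = "facebook/bart-large-cnn"
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint).to(device)

    # Optional optimization with `optimum`; fall back if the model or
    # hardware doesn't support it.
    try:
        from optimum.bettertransformer import BetterTransformer
        model = BetterTransformer.transform(model)
    except Exception:
        pass


def run(mini_batch):
    results = []
    for file_path in mini_batch:
        # Read the whole file with `datasets`; the text to summarize is in `text`.
        data = datasets.load_dataset("csv", data_files=file_path, split="train")
        for text in data["text"]:
            # Summarize row by row to respect the 1024-token limit of the model.
            inputs = tokenizer(
                text, truncation=True, max_length=1024, return_tensors="pt"
            ).to(device)
            outputs = model.generate(**inputs, max_new_tokens=130)
            results.append(tokenizer.decode(outputs[0], skip_special_tokens=True))
    return results
```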
# [Azure CLI](#tab/cli)
The environment definition is included in the deployment file.
__deployment.yml__
> [!IMPORTANT]
> The environment `torch200-transformers-gpu` we've created requires CUDA 11.8 compatible hardware and Ubuntu 20.04 to run Torch 2.0. If your GPU device doesn't support this version of CUDA, you can check the alternative `torch113-conda.yaml` conda environment (also available in the repository), which runs Torch 1.3 over Ubuntu 18.04 with CUDA 10.1. However, acceleration using the `optimum` and `accelerate` libraries isn't supported on this configuration.
1. Each deployment runs on compute clusters. They support both [Azure Machine Learning Compute clusters (AmlCompute)](./how-to-create-attach-compute-cluster.md) and [Kubernetes clusters](./how-to-attach-kubernetes-anywhere.md). In this example, our model can benefit from GPU acceleration, which is why we use a GPU cluster.
# [Azure CLI](#tab/cli)
---
> [!NOTE]
> You are not charged for compute at this point as the cluster remains at 0 nodes until a batch endpoint is invoked and a batch scoring job is submitted. Learn more about [manage and optimize cost for AmlCompute](./how-to-manage-optimize-cost.md#use-azure-machine-learning-compute-cluster-amlcompute).
1. Now, let's create the deployment.
> [!IMPORTANT]
> You will notice a high value for `timeout` in the `retry_settings` parameter of this deployment. The reason is the nature of the model we are running: this is a very expensive model, and inference on a single row may take up to 60 seconds. The `timeout` parameter controls how much time the batch deployment should wait for the scoring script to finish processing each mini-batch. Since our model runs predictions row by row, processing a long file may take time. Also notice that the number of files per batch is set to 1 (`mini_batch_size=1`). This is again related to the nature of the work we are doing: processing one file at a time per batch is expensive enough to justify it. You will notice this pattern in NLP processing.
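
As a reference, here is a hedged sketch of how these settings might be expressed with the Azure ML Python SDK v2 (`azure-ai-ml`). The deployment name, code paths, compute name, and `timeout` value are illustrative placeholders, and `model`, `environment`, and `ml_client` are assumed to come from the previous steps:

```python
# Sketch only: what matters for this model is the single-file mini-batch
# (mini_batch_size=1) and a generous per-mini-batch timeout (in seconds).
from azure.ai.ml.entities import BatchDeployment, BatchRetrySettings, CodeConfiguration

deployment = BatchDeployment(
    name="text-summarization-hfbart",      # placeholder deployment name
    endpoint_name="text-summarization-batch",
    model=model,                            # registered model from earlier
    code_configuration=CodeConfiguration(
        code="code", scoring_script="batch_driver.py"  # placeholder paths
    ),
    environment=environment,                # GPU environment created earlier
    compute="gpu-cluster",                  # placeholder compute cluster name
    instance_count=2,
    max_concurrency_per_instance=1,
    mini_batch_size=1,                      # one file per mini-batch
    retry_settings=BatchRetrySettings(max_retries=1, timeout=3000),
    output_file_name="predictions.csv",
)
ml_client.batch_deployments.begin_create_or_update(deployment).result()
```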
1. Although you can invoke a specific deployment inside of an endpoint, you usually want to invoke the endpoint itself and let the endpoint decide which deployment to use. Such a deployment is called the "default" deployment. This lets you change the default deployment, and hence the model serving it, without changing the contract with the user invoking the endpoint. Use the following instruction to update the default deployment:
# [Azure CLI](#tab/cli)
---
> [!TIP]
> Notice that by indicating a local path as an input, the data is uploaded to the Azure Machine Learning default storage account.
4. A batch job is started as soon as the command returns. You can monitor the status of the job until it finishes:
> [!div class="checklist"]
> * Some NLP models may be very expensive in terms of memory and compute time. If this is the case, consider decreasing the number of files included in each mini-batch. In the example above, the number was taken to the minimum, 1 file per batch. While this may not be your case, take into consideration how many files your model can score at a time. Keep in mind that the relationship between the size of the input and the memory footprint of your model may not be linear for deep learning models.
> * If your model can't even handle one file at a time (like in this example), consider reading the input data in rows/chunks. Implement batching at the row level if you need to achieve higher throughput or hardware utilization (see the sketch after this list).
> * Set the `timeout` value of your deployment according to how expensive your model is and how much data you expect to process. Remember that the `timeout` indicates how long the batch deployment waits for your scoring script to run for a given mini-batch. If your batch has many files or files with many rows, this affects the right value for this parameter.
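
For the row-level batching suggestion above, here is a hedged sketch. It assumes the `model`, `tokenizer`, and `device` objects from the scoring script and a list `texts` already read from the file; the chunk size is an arbitrary example:

```python
# Sketch: score rows in small chunks instead of one by one to improve
# throughput while keeping the memory footprint bounded.
def summarize_in_chunks(texts, chunk_size=8):
    summaries = []
    for start in range(0, len(texts), chunk_size):
        chunk = texts[start:start + chunk_size]
        inputs = tokenizer(
            chunk, truncation=True, max_length=1024,
            padding=True, return_tensors="pt",
        ).to(device)
        outputs = model.generate(**inputs, max_new_tokens=130)
        summaries.extend(tokenizer.batch_decode(outputs, skip_special_tokens=True))
    return summaries
```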
## Considerations for MLflow models that process text
The same considerations mentioned above apply to MLflow models. However, since you are not required to provide a scoring script for your MLflow model deployment, some of the recommendations mentioned may require a different approach.
* MLflow models in Batch Endpoints support reading tabular data as input, which may contain long sequences of text. See [File types support](how-to-mlflow-batch.md#files-types-support) for details about which file types are supported.
* Batch deployments call your MLflow model's predict function with the content of an entire file as a Pandas dataframe. If your input data contains many rows, chances are that running a complex model (like the one presented in this tutorial) results in an out-of-memory exception. If this is your case, you can consider:
* Customize how your model runs predictions and implement batching, as in the sketch after this list. To learn how to customize an MLflow model's inference, see [Logging custom models](how-to-log-mlflow-models.md?#logging-custom-models).
* Author a scoring script and load your model using `mlflow.<flavor>.load_model()`. See [Using MLflow models with a scoring script](how-to-mlflow-batch.md#customizing-mlflow-models-deployments-with-a-scoring-script) for details.
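
For the first of those options, here is a hedged sketch of a custom MLflow model that batches rows inside `predict`; the class name, column name, and chunk size are illustrative:

```python
# Sketch: wrap the summarization logic in a custom MLflow pyfunc model so that
# `predict` processes the input dataframe in bounded chunks instead of all at once.
import mlflow
import pandas as pd
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer


class ChunkedSummarizer(mlflow.pyfunc.PythonModel):
    def load_context(self, context):
        checkpoint = "facebook/bart-large-cnn"
        self.tokenizer = AutoTokenizer.from_pretrained(checkpoint)
        self.model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

    def predict(self, context, model_input: pd.DataFrame) -> pd.DataFrame:
        summaries = []
        for start in range(0, len(model_input), 8):  # 8 rows at a time
            chunk = model_input["text"].iloc[start:start + 8].tolist()
            inputs = self.tokenizer(
                chunk, truncation=True, max_length=1024,
                padding=True, return_tensors="pt",
            )
            outputs = self.model.generate(**inputs, max_new_tokens=130)
            summaries.extend(
                self.tokenizer.batch_decode(outputs, skip_special_tokens=True)
            )
        return pd.DataFrame({"summary": summaries})
```

A model like this could then be logged with `mlflow.pyfunc.log_model` and deployed to the batch endpoint without a scoring script.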