Batch Endpoints can be used to deploy expensive models, like language models, over text data. In this tutorial, you'll learn how to deploy a model from HuggingFace that can summarize long sequences of text. It also shows how to optimize inference using the HuggingFace `optimum` and `accelerate` libraries.
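As a rough illustration of what these optimizations can look like, here is a minimal sketch. The checkpoint name and the fallback logic are illustrative assumptions, not taken from this sample's actual code:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Illustrative checkpoint only; the tutorial's actual model may differ.
MODEL_NAME = "facebook/bart-large-cnn"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# device_map="auto" relies on `accelerate` to place weights on GPU when available.
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME, device_map="auto")

try:
    # `optimum` can swap in faster attention kernels via BetterTransformer.
    from optimum.bettertransformer import BetterTransformer

    model = BetterTransformer.transform(model)
except Exception:
    # If the model or hardware doesn't support the optimization, run without it.
    pass
```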
## About this sample
Let's create the deployment that will host the model:
> [!div class="checklist"]
> * Indicates an `init` function that detects the hardware configuration (CPU vs GPU) and loads the model accordingly. Both the model and the tokenizer are loaded into global variables. We are not using a `pipeline` object from HuggingFace, to account for the sequence-length limitation of the model we are currently using.
> * Notice that we are performing **model optimizations** to improve performance using the `optimum` and `accelerate` libraries. If the model or hardware doesn't support them, we run the deployment without such optimizations.
> * Indicates a `run` function that is executed for each mini-batch the batch deployment provides.
> * The `run` function reads the entire batch using the `datasets` library. The text we need to summarize is in the column `text`.
> * The `run` method iterates over the rows of the text and runs the prediction row by row. Since this is a very expensive model, running the prediction over entire files would result in an out-of-memory exception. Notice that the model is not executed with the `pipeline` object from `transformers`. This is done to account for long sequences of text and the 1024-token limit of the underlying model we are using. A minimal sketch of such a scoring script follows this list.
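The sketch below shows the general shape of such a scoring script. It is not this sample's actual code: the CSV input format, the `model` subfolder under `AZUREML_MODEL_DIR`, and the generation parameters are all illustrative assumptions.

```python
import os

import torch
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = None
tokenizer = None


def init():
    """Load the model and tokenizer once, into global variables."""
    global model, tokenizer
    # Azure ML mounts the registered model under AZUREML_MODEL_DIR;
    # the "model" subfolder is an assumption for this sketch.
    model_path = os.path.join(os.environ["AZUREML_MODEL_DIR"], "model")
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    # device_map="auto" (backed by `accelerate`) uses the GPU when present.
    model = AutoModelForSeq2SeqLM.from_pretrained(model_path, device_map="auto")


def run(mini_batch):
    """Summarize the `text` column of each file in the mini-batch."""
    results = []
    for file_path in mini_batch:
        # Read the whole file with the `datasets` library (CSV assumed here).
        ds = load_dataset("csv", data_files={"data": file_path})["data"]
        # Predict row by row: batching an entire file into one call can
        # exhaust memory with a model this large.
        for row in ds:
            inputs = tokenizer(
                row["text"],
                truncation=True,
                max_length=1024,  # the underlying model's sequence limit
                return_tensors="pt",
            ).to(model.device)
            with torch.no_grad():
                summary_ids = model.generate(**inputs, max_new_tokens=128)
            results.append(
                tokenizer.decode(summary_ids[0], skip_special_tokens=True)
            )
    return results
```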