articles/machine-learning/how-to-deploy-models-from-huggingface.md (27 additions & 16 deletions)
@@ -10,7 +10,7 @@ ms.topic: how-to
 ms.reviewer: None
 author: s-polly
 ms.author: scottpolly
-ms.date: 12/11/2024
+ms.date: 07/17/2025
 ms.collection: ce-skilling-ai-copilot
 ---
@@ -24,7 +24,7 @@ Microsoft has partnered with Hugging Face to bring open-source models from Huggi
 
 ## Benefits of using online endpoints for real-time inference
 
-Managed online endpoints in Azure Machine Learning help you deploy models to powerful CPU and GPU machines in Azure in a turnkey manner. Managed online endpoints take care of serving, scaling, securing, and monitoring your models, freeing you from the overhead of setting up and managing the underlying infrastructure. The virtual machines are provisioned on your behalf when you deploy models. You can have multiple deployments behind and [split traffic or mirror traffic](./how-to-safely-rollout-online-endpoints.md) to those deployments. Mirror traffic helps you to test new versions of models on production traffic without releasing them production environments. Splitting traffic lets you gradually increase production traffic to new model versions while observing performance. [Auto scale](./how-to-autoscale-endpoints.md) lets you dynamically ramp up or ramp down resources based on workloads. You can configure scaling based on utilization metrics, a specific schedule or a combination of both. An example of scaling based on utilization metrics is to add nodes if CPU utilization goes higher than 70%. An example of schedule-based scaling is to add nodes based on peak business hours.
+Managed online endpoints in Azure Machine Learning help you deploy models to powerful CPU and GPU machines in Azure in a turnkey manner. Managed online endpoints take care of serving, scaling, securing, and monitoring your models, freeing you from the overhead of setting up and managing the underlying infrastructure. The virtual machines are provisioned on your behalf when you deploy models. You can have multiple deployments and [split traffic or mirror traffic](./how-to-safely-rollout-online-endpoints.md) to those deployments. Mirror traffic helps you test new versions of models on production traffic without releasing them to production environments. Splitting traffic lets you gradually increase production traffic to new model versions while observing performance. [Auto scale](./how-to-autoscale-endpoints.md) lets you dynamically ramp resources up or down based on workloads. You can configure scaling based on utilization metrics, a specific schedule, or a combination of both. An example of scaling based on utilization metrics is to add nodes if CPU utilization goes higher than 70%. An example of schedule-based scaling is to add nodes during peak business hours.
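As a rough illustration of traffic splitting and mirroring, here's a minimal sketch using the `azure-ai-ml` Python SDK. The workspace details and the endpoint and deployment names are assumptions, not values from this article:

```python
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# Hypothetical workspace details.
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

endpoint = ml_client.online_endpoints.get("my-hf-endpoint")  # hypothetical endpoint name

# Split: send 90% of live traffic to "blue" and 10% to "green".
endpoint.traffic = {"blue": 90, "green": 10}

# Or mirror: keep all live traffic on "blue" and copy 10% of it to "green"
# for testing; mirrored responses aren't returned to callers.
# endpoint.traffic = {"blue": 100}
# endpoint.mirror_traffic = {"green": 10}

ml_client.online_endpoints.begin_create_or_update(endpoint).result()
```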
 
 ## Deploy HuggingFace hub models using Studio
 
@@ -37,13 +37,13 @@ Choose the real-time deployment option to open the quick deploy dialog. Specify
 * Select the instance type. This list of instances is filtered down to the ones on which the model is expected to deploy without running out of memory.
 * Select the number of instances. One instance is sufficient for testing, but we recommend considering two or more instances for production.
 * Optionally specify an endpoint and deployment name.
-* Select deploy. You're then navigated to the endpoint page which, might take a few seconds. The deployment takes several minutes to complete based on the model size and instance type.
+* Select deploy. You're then navigated to the endpoint page, which might take a few seconds. The deployment takes several minutes to complete based on the model size and instance type.
 
 Note: If you want to deploy to an existing endpoint, select `More options` from the quick deploy dialog and use the full deployment wizard.
 
 ### Test the deployed model
 
-Once the deployment completes, you can find the REST endpoint for the model in the endpoints page, which can be used to score the model. You find options to add more deployments, manage traffic and scaling the Endpoints hub. You also use the Test tab on the endpoint page to test the model with sample inputs. Sample inputs are available on the model page. You can find input format, parameters and sample inputs on the [Hugging Face hub inference API documentation](https://huggingface.co/docs/api-inference/detailed_parameters).
+Once the deployment completes, you can find the REST endpoint for the model on the endpoints page, which can be used to score the model. You find options to add more deployments and to manage traffic and scaling in the Endpoints hub. You can also use the Test tab on the endpoint page to test the model with sample inputs. Sample inputs are available on the model page. You can find input format, parameters, and sample inputs on the [Hugging Face hub inference API documentation](https://huggingface.co/docs/api-inference/detailed_parameters).
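For illustration, a minimal sketch of scoring that REST endpoint from Python (assuming the `requests` package is installed; the scoring URI, key, and fill-mask payload are placeholders, so take the real values from the endpoint's **Consume** tab):

```python
import json

import requests

# Placeholders: copy the real values from the endpoint's Consume tab.
scoring_uri = "https://<endpoint-name>.<region>.inference.ml.azure.com/score"
api_key = "<endpoint-key>"

# Sample fill-mask input for bert-base-uncased; see the Hugging Face
# inference API documentation linked above for the exact format.
payload = {"inputs": "Paris is the [MASK] of France."}

response = requests.post(
    scoring_uri,
    headers={
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    },
    data=json.dumps(payload),
)
print(response.json())
```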
 
 ## Deploy HuggingFace hub models using Python SDK
 
@@ -62,9 +62,19 @@ from azure.ai.ml.entities import (
-Create a file with inputs that can be submitted to the online endpoint for scoring. The code sample in this section allows an input for the `fill-mask` type since we deployed the `bert-base-uncased` model. You can find input format, parameters and sample inputs on the [Hugging Face hub inference API documentation](https://huggingface.co/docs/api-inference/detailed_parameters).
+Create a file with inputs that can be submitted to the online endpoint for scoring. The code sample in this section allows an input for the `fill-mask` type since we deployed the `bert-base-uncased` model. You can find input format, parameters, and sample inputs on the [Hugging Face hub inference API documentation](https://huggingface.co/docs/api-inference/detailed_parameters).
 
 ```python
 import json
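# A minimal sketch of how such an input file might be created and submitted
# (assumptions: an authenticated MLClient named ml_client and an endpoint
# name variable online_endpoint_name, neither shown in this excerpt):
sample_input = {"inputs": "Paris is the [MASK] of France."}
with open("sample_score.json", "w") as f:
    json.dump(sample_input, f)

response = ml_client.online_endpoints.invoke(
    endpoint_name=online_endpoint_name,
    request_file="sample_score.json",
)
print(response)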
@@ -115,20 +125,21 @@ Browse the model catalog in Azure Machine Learning studio and find the model you
 
 You need the `model` and `instance_type` to deploy the model. You can find the optimal CPU or GPU `instance_type` for a model by opening the quick deployment dialog from the model page in the model catalog. Make sure you use an `instance_type` for which you have quota.
 
-The models shown in the catalog are listed from the `HuggingFace` registry. You deploy the `bert_base_uncased` model with the latest version in this example. The fully qualified `model` asset id based on the model name and registry is `azureml://registries/HuggingFace/models/bert-base-uncased/labels/latest`. We create the `deploy.yml` file used for the `az ml online-deployment create` command inline.
+The models shown in the catalog are listed from the `HuggingFace` registry. You deploy the latest version of the `bert-base-uncased` model in this example. The fully qualified `model` asset ID based on the model name and registry is `azureml://registries/HuggingFace/models/bert-base-uncased/labels/latest`. We create the `deploy.yml` file used for the `az ml online-deployment create` command inline.
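For comparison, a minimal sketch of the same deployment through the Python SDK rather than the CLI (assuming an authenticated `MLClient` named `ml_client`; the deployment name and instance type are illustrative):

```python
from azure.ai.ml.entities import ManagedOnlineDeployment

# The model asset ID points at the HuggingFace community registry;
# the weights aren't copied into your workspace.
deployment = ManagedOnlineDeployment(
    name="demo",  # illustrative deployment name
    endpoint_name="<endpoint-name>",  # the endpoint created below
    model="azureml://registries/HuggingFace/models/bert-base-uncased/labels/latest",
    instance_type="Standard_DS3_v2",  # pick an instance type you have quota for
    instance_count=1,
)
ml_client.online_deployments.begin_create_or_update(deployment).result()
```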
 
 Create an online endpoint. Next, create the deployment.
 
 ```shell
 # create endpoint
 endpoint_name="hf-ep-"$(date +%s)
 model_name="bert-base-uncased"
+model_version="25"
 az ml online-endpoint create --name $endpoint_name
@@ -139,7 +150,7 @@ az ml online-deployment create --file ./deploy.yml --workspace-name $workspace_n
 
 ### Test the deployed model
 
-Create a file with inputs that can be submitted to the online endpoint for scoring. Hugging Face as a code sample input for the `fill-mask` type for our deployed model the `bert-base-uncased` model. You can find input format, parameters and sample inputs on the [Hugging Face hub inference API documentation](https://huggingface.co/docs/api-inference/detailed_parameters).
+Create a file with inputs that can be submitted to the online endpoint for scoring. Hugging Face has a sample input for the `fill-mask` type for our deployed `bert-base-uncased` model. You can find input format, parameters, and sample inputs on the [Hugging Face hub inference API documentation](https://huggingface.co/docs/api-inference/detailed_parameters).
 
 ```shell
 scoring_file="./sample_score.json"
@@ -163,16 +174,16 @@ Follow this link to find [hugging face model example code](https://github.com/Az
 HuggingFace hub has thousands of models, with hundreds updated each day. Only the most popular models in the collection are tested; others may fail with one of the errors below.
 
 ### Gated models
-[Gated models](https://huggingface.co/docs/hub/models-gated) require users to agree to share their contact information and accept the model owners' terms and conditions in order to access the model. Attempting to deploy such models will fail with a `KeyError`.
+[Gated models](https://huggingface.co/docs/hub/models-gated) require users to agree to share their contact information and accept the model owners' terms and conditions in order to access the model. Attempting to deploy such models fails with a `KeyError`.
 
 ### Models that need to run remote code
-Models typically use code from the transformers SDK but some models run code from the model repo. Such models need to set the parameter `trust_remote_code` to `True`. Follow this link to learn more about using [remote code](https://huggingface.co/docs/transformers/custom_models#using-a-model-with-custom-code). Such models are not supported from keeping security in mind. Attempting to deploy such models will fail with the following error: `ValueError: Loading <model> requires you to execute the configuration file in that repo on your local machine. Make sure you have read the code there to avoid malicious use, then set the option trust_remote_code=True to remove this error.`
+Models typically use code from the transformers SDK, but some models run code from the model repo. Such models need to set the parameter `trust_remote_code` to `True`. To learn more, see [using remote code](https://huggingface.co/docs/transformers/custom_models#using-a-model-with-custom-code). For security reasons, such models aren't supported. Attempting to deploy such models fails with the following error: `ValueError: Loading <model> requires you to execute the configuration file in that repo on your local machine. Make sure you have read the code there to avoid malicious use, then set the option trust_remote_code=True to remove this error.`
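If you choose to run such a model locally after reviewing its code, a minimal sketch with the transformers library looks like this (the model ID is hypothetical; `trust_remote_code=True` executes code from the repo on your machine, so read it first):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "some-org/model-with-custom-code"  # hypothetical repo that ships its own modeling code

# Only set trust_remote_code=True after reading the code in the model repo.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
```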
 
 ### Models with incorrect tokenizers
 An incorrectly specified or missing tokenizer in the model package can result in an `OSError: Can't load tokenizer for <model>` error.
 
 ### Missing libraries
-Some models need additional python libraries. You can install missing libraries when running models locally. Models that need special libraries beyond the standard transformers libraries will fail with `ModuleNotFoundError` or `ImportError` error.
+Some models need additional Python libraries. You can install missing libraries when running models locally. Models that need special libraries beyond the standard transformers libraries fail with a `ModuleNotFoundError` or `ImportError` error.
 
 ### Insufficient memory
 If you see the `OutOfQuota: Container terminated due to insufficient memory` error, try using an `instance_type` with more memory.
@@ -181,16 +192,16 @@ If you see the `OutOfQuota: Container terminated due to insufficient memory`, tr
 
 **Where are the model weights stored?**
 
-Hugging Face models are featured in the Azure Machine Learning model catalog through the `HuggingFace` registry. Hugging Face creates and manages this registry and is made available to Azure Machine Learning as a Community Registry. The model weights aren't hosted on Azure. The weights are downloaded directly from Hugging Face hub to the online endpoints in your workspace when these models deploy. `HuggingFace` registry in AzureML works as a catalog to help discover and deploy HuggingFace hub models in Azure Machine Learning.
+Hugging Face models are featured in the Azure Machine Learning model catalog through the `HuggingFace` registry. Hugging Face creates and manages this registry, which is made available to Azure Machine Learning as a community registry. The model weights aren't hosted on Azure. The weights are downloaded directly from the Hugging Face hub to the online endpoints in your workspace when these models are deployed. The `HuggingFace` registry in Azure Machine Learning works as a catalog to help discover and deploy HuggingFace hub models in Azure Machine Learning.
 
 **How to deploy the models for batch inference?**
 Deploying these models to batch endpoints for batch inference is currently not supported.
 
-**Can I use models from the `HuggingFace` registry as input to jobs so that I can finetune these models using transformers SDK?**
-Since the model weights aren't stored in the `HuggingFace` registry, you cannot access model weights by using these models as inputs to jobs.
+**Can I use models from the `HuggingFace` registry as input to jobs so that I can fine-tune these models using transformers SDK?**
+Since the model weights aren't stored in the `HuggingFace` registry, you can't access model weights by using these models as inputs to jobs.
 
 **How do I get support if my deployments fail or inference doesn't work as expected?**
-`HuggingFace` is a community registry and that is not covered by Microsoft support. Review the deployment logs and find out if the issue is related to Azure Machine Learning platform or specific to HuggingFace transformers. Contact Microsoft support for any platform issues. Example, not being able to create online endpoint or authentication to endpoint REST API doesn't work. For transformers specific issues, use the [HuggingFace forum](https://discuss.huggingface.co/) or [HuggingFace support](https://huggingface.co/support).
+`HuggingFace` is a community registry that isn't covered by Microsoft support. Review the deployment logs to find out whether the issue is related to the Azure Machine Learning platform or specific to HuggingFace transformers. Contact Microsoft support for any platform issue, for example, not being able to create an online endpoint, or authentication to the endpoint REST API not working. For transformers-specific issues, use the [HuggingFace forum](https://discuss.huggingface.co/) or [HuggingFace support](https://huggingface.co/support).
 
 **What is a community registry?**
 Community registries are Azure Machine Learning registries created by trusted Azure Machine Learning partners and available to all Azure Machine Learning users.
articles/machine-learning/prompt-flow/get-started-prompt-flow.md (7 additions & 4 deletions)
@@ -5,11 +5,11 @@ description: Learn how to set up, create, evaluate, and deploy a prompt flow in
 services: machine-learning
 ms.service: azure-machine-learning
 ms.subservice: prompt-flow
-ms.topic: tutorial
+ms.topic: how-to
 author: s-polly
 ms.author: scottpolly
-ms.reviewer: yijunzhang
-ms.date: 10/03/2024
+ms.reviewer: sooryar
+ms.date: 07/17/2025
 ms.custom:
 - ignite-2023
 - build-2024
@@ -32,6 +32,9 @@ This article walks you through the main user journey of using prompt flow in Azu
 
 A connection helps securely store and manage secret keys or other sensitive credentials required for interacting with Large Language Models (LLM) and other external tools such as Azure Content Safety. Connection resources are shared with all members in the workspace.
 
+> [!NOTE]
+> The LLM tool in prompt flow doesn't support reasoning models (such as OpenAI o1 or o3). For reasoning model integration, use the Python tool to call the model APIs directly. For more information, see [Call a reasoning model from the Python tool](tools-reference/python-tool.md#call-a-reasoning-model-from-the-python-tool).
+
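A minimal sketch of that pattern (assuming the `promptflow` and `openai` packages; the deployment name `o1` and the API version are illustrative, and the connection is passed in as a tool input):

```python
from openai import AzureOpenAI
from promptflow import tool
from promptflow.connections import AzureOpenAIConnection


@tool
def call_reasoning_model(connection: AzureOpenAIConnection, question: str) -> str:
    # Build a client from the prompt flow connection's credentials.
    client = AzureOpenAI(
        api_key=connection.api_key,
        azure_endpoint=connection.api_base,
        api_version="2024-12-01-preview",  # illustrative; use a version that supports your model
    )
    # "o1" is a hypothetical deployment name for a reasoning model.
    response = client.chat.completions.create(
        model="o1",
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content
```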
 
 1. To check if you already have an Azure OpenAI connection, select **Prompt flow** from the Azure Machine Learning studio left menu and then select the **Connections** tab on the **Prompt flow** screen.
 
 :::image type="content" source="./media/get-started-prompt-flow/connection-creation-entry-point.png" alt-text="Screenshot of the connections tab with create highlighted." lightbox = "./media/get-started-prompt-flow/connection-creation-entry-point.png":::
@@ -64,7 +67,7 @@ In the **Flows** tab of the **Prompt flow** home page, select **Create** to crea
 
 In the **Explore gallery**, you can browse the built-in samples and select **View detail** on any tile to preview whether it's suitable for your scenario.
 
-This tutorial uses the **Web Classification** sample to walk through the main user journey. Web Classification is a flow demonstrating multiclass classification with a LLM. Given a URL, the flow classifies the URL into a web category with just a few shots, simple summarization, and classification prompts. For example, given a URL `https://www.imdb.com`, it classifies the URL into `Movie`.
+This tutorial uses the **Web Classification** sample to walk through the main user journey. Web Classification is a flow demonstrating multiclass classification with an LLM. Given a URL, the flow classifies the URL into a web category with just a few shots, simple summarization, and classification prompts. For example, given the URL `https://www.imdb.com`, it classifies the URL into `Movie`.
 
 To clone the sample, select **Clone** on the **Web Classification** tile.