Commit cb44768

Merge pull request #6111 from s-polly/stp_ml_freshness-7-16
PF and HuggingFace
2 parents b8383ac + ccff83a commit cb44768

5 files changed, +175 -113 lines
articles/machine-learning/how-to-deploy-models-from-huggingface.md

Lines changed: 27 additions & 16 deletions
@@ -10,7 +10,7 @@ ms.topic: how-to
 ms.reviewer: None
 author: s-polly
 ms.author: scottpolly
-ms.date: 12/11/2024
+ms.date: 07/17/2025
 ms.collection: ce-skilling-ai-copilot
 ---

@@ -24,7 +24,7 @@ Microsoft has partnered with Hugging Face to bring open-source models from Huggi
 ## Benefits of using online endpoints for real-time inference

-Managed online endpoints in Azure Machine Learning help you deploy models to powerful CPU and GPU machines in Azure in a turnkey manner. Managed online endpoints take care of serving, scaling, securing, and monitoring your models, freeing you from the overhead of setting up and managing the underlying infrastructure. The virtual machines are provisioned on your behalf when you deploy models. You can have multiple deployments behind and [split traffic or mirror traffic](./how-to-safely-rollout-online-endpoints.md) to those deployments. Mirror traffic helps you to test new versions of models on production traffic without releasing them production environments. Splitting traffic lets you gradually increase production traffic to new model versions while observing performance. [Auto scale](./how-to-autoscale-endpoints.md) lets you dynamically ramp up or ramp down resources based on workloads. You can configure scaling based on utilization metrics, a specific schedule or a combination of both. An example of scaling based on utilization metrics is to add nodes if CPU utilization goes higher than 70%. An example of schedule-based scaling is to add nodes based on peak business hours.
+Managed online endpoints in Azure Machine Learning help you deploy models to powerful CPU and GPU machines in Azure in a turnkey manner. Managed online endpoints take care of serving, scaling, securing, and monitoring your models, freeing you from the overhead of setting up and managing the underlying infrastructure. The virtual machines are provisioned on your behalf when you deploy models. You can have multiple deployments and [split traffic or mirror traffic](./how-to-safely-rollout-online-endpoints.md) to those deployments. Mirror traffic helps you test new versions of models on production traffic without releasing them to production environments. Splitting traffic lets you gradually increase production traffic to new model versions while observing performance. [Auto scale](./how-to-autoscale-endpoints.md) lets you dynamically ramp up or ramp down resources based on workloads. You can configure scaling based on utilization metrics, a specific schedule, or a combination of both. An example of scaling based on utilization metrics is to add nodes if CPU utilization goes higher than 70%. An example of schedule-based scaling is to add nodes based on peak business hours.
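The split and mirror mechanics described above reduce to a mapping from deployment name to a percentage of traffic. A minimal local sketch (deployment names `blue` and `green` are placeholders, not from this article):

```python
# Sketch: computing a blue/green traffic split for gradual rollout.
# Deployment names are placeholders; the dict maps deployment name ->
# percent of production traffic, and must always total 100.
def traffic_split(new_share):
    """Route new_share% of traffic to the new deployment, the rest to the old one."""
    if not 0 <= new_share <= 100:
        raise ValueError("new_share must be between 0 and 100")
    return {"blue": 100 - new_share, "green": new_share}

# Gradually shift traffic to the new model version while observing performance.
rollout_steps = [traffic_split(share) for share in (10, 50, 100)]
```

With the Azure ML SDK or CLI, a mapping like this is assigned to the endpoint's traffic settings; mirror traffic is expressed the same way, but copies requests to a deployment instead of diverting them.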

 ## Deploy HuggingFace hub models using Studio

@@ -37,13 +37,13 @@ Choose the real-time deployment option to open the quick deploy dialog. Specify
 * Select the instance type. This list of instances is filtered down to the ones that the model is expected to deploy without running out of memory.
 * Select the number of instances. One instance is sufficient for testing, but we recommend considering two or more instances for production.
 * Optionally specify an endpoint and deployment name.
-* Select deploy. You're then navigated to the endpoint page which, might take a few seconds. The deployment takes several minutes to complete based on the model size and instance type.
+* Select deploy. You're then navigated to the endpoint page, which might take a few seconds. The deployment takes several minutes to complete based on the model size and instance type.

 Note: If you want to deploy to an existing endpoint, select `More options` from the quick deploy dialog and use the full deployment wizard.

 ### Test the deployed model

-Once the deployment completes, you can find the REST endpoint for the model in the endpoints page, which can be used to score the model. You find options to add more deployments, manage traffic and scaling the Endpoints hub. You also use the Test tab on the endpoint page to test the model with sample inputs. Sample inputs are available on the model page. You can find input format, parameters and sample inputs on the [Hugging Face hub inference API documentation](https://huggingface.co/docs/api-inference/detailed_parameters).
+Once the deployment completes, you can find the REST endpoint for the model on the endpoints page; you can use this endpoint to score the model. In the Endpoints hub, you find options to add more deployments and to manage traffic and scaling. You can also use the Test tab on the endpoint page to test the model with sample inputs. Sample inputs are available on the model page. You can find input format, parameters, and sample inputs in the [Hugging Face hub inference API documentation](https://huggingface.co/docs/api-inference/detailed_parameters).

 ## Deploy HuggingFace hub models using Python SDK

@@ -62,9 +62,19 @@ from azure.ai.ml.entities import (
     Environment,
     CodeConfiguration,
 )
+
+ml_client = MLClient(
+    credential=DefaultAzureCredential(),
+    subscription_id="<your-subscription-id>",
+    resource_group_name="<your-resource-group>",
+    workspace_name="<your-workspace-name>"
+)
+
 registry_name = "HuggingFace"
 model_name = "bert_base_uncased"
-model_id = f"azureml://registries/{registry_name}/models/{model_name}/labels/latest"
+model_version = "25"
+model_id = f"azureml://registries/{registry_name}/models/{model_name}/versions/{model_version}"
 ```
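The change above swaps a floating `labels/latest` reference for a pinned `versions/<n>` reference. A small sketch of how the two asset ID forms are composed (the helper name is hypothetical):

```python
# Sketch: composing AzureML registry model asset IDs.
# labels/latest resolves to the newest model version at deploy time,
# while versions/<n> pins an exact, reproducible version.
def model_asset_id(registry, model, version=None):
    base = f"azureml://registries/{registry}/models/{model}"
    if version is None:
        return f"{base}/labels/latest"
    return f"{base}/versions/{version}"

pinned = model_asset_id("HuggingFace", "bert_base_uncased", "25")
floating = model_asset_id("HuggingFace", "bert_base_uncased")
```

Pinning a version keeps deployments reproducible even as new model versions land in the registry.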
 ### Deploy the model

@@ -88,7 +98,7 @@ ml_client.begin_create_or_update(endpoint).result()
 ### Test the deployed model

-Create a file with inputs that can be submitted to the online endpoint for scoring. The code sample in this section allows an input for the `fill-mask` type since we deployed the `bert-base-uncased` model. You can find input format, parameters and sample inputs on the [Hugging Face hub inference API documentation](https://huggingface.co/docs/api-inference/detailed_parameters).
+Create a file with inputs that can be submitted to the online endpoint for scoring. The code sample in this section allows an input for the `fill-mask` type since we deployed the `bert-base-uncased` model. You can find input format, parameters, and sample inputs on the [Hugging Face hub inference API documentation](https://huggingface.co/docs/api-inference/detailed_parameters).
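As a hedged sketch of what such an input file can look like for `fill-mask` (the `inputs` field name follows the Hugging Face inference API convention; confirm the exact schema in the linked documentation):

```python
# Sketch: writing a sample fill-mask scoring file for bert-base-uncased.
# The "inputs" field follows the Hugging Face inference API convention;
# verify the exact schema against the linked task documentation.
import json

sample = {"inputs": "Paris is the [MASK] of France."}

with open("sample_score.json", "w") as f:
    json.dump(sample, f, indent=2)
```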

 ```python
 import json
@@ -115,20 +125,21 @@ Browse the model catalog in Azure Machine Learning studio and find the model you
 You need the `model` and `instance_type` to deploy the model. You can find the optimal CPU or GPU `instance_type` for a model by opening the quick deployment dialog from the model page in the model catalog. Make sure you use an `instance_type` for which you have quota.

-The models shown in the catalog are listed from the `HuggingFace` registry. You deploy the `bert_base_uncased` model with the latest version in this example. The fully qualified `model` asset id based on the model name and registry is `azureml://registries/HuggingFace/models/bert-base-uncased/labels/latest`. We create the `deploy.yml` file used for the `az ml online-deployment create` command inline.
+The models shown in the catalog are listed from the `HuggingFace` registry. You deploy version 25 of the `bert_base_uncased` model in this example. The fully qualified `model` asset ID based on the model name, version, and registry is `azureml://registries/HuggingFace/models/bert-base-uncased/versions/25`. We create the `deploy.yml` file used for the `az ml online-deployment create` command inline.

 Create an online endpoint. Next, create the deployment.

 ```shell
 # create endpoint
 endpoint_name="hf-ep-"$(date +%s)
 model_name="bert-base-uncased"
+model_version="25"
 az ml online-endpoint create --name $endpoint_name

 # create deployment file.
 cat <<EOF > ./deploy.yml
 name: demo
-model: azureml://registries/HuggingFace/models/$model_name/labels/latest
+model: azureml://registries/HuggingFace/models/$model_name/versions/$model_version
 endpoint_name: $endpoint_name
 instance_type: Standard_DS3_v2
 instance_count: 1
@@ -139,7 +150,7 @@ az ml online-deployment create --file ./deploy.yml --workspace-name $workspace_n
 ### Test the deployed model

-Create a file with inputs that can be submitted to the online endpoint for scoring. Hugging Face as a code sample input for the `fill-mask` type for our deployed model the `bert-base-uncased` model. You can find input format, parameters and sample inputs on the [Hugging Face hub inference API documentation](https://huggingface.co/docs/api-inference/detailed_parameters).
+Create a file with inputs that can be submitted to the online endpoint for scoring. Hugging Face provides a sample input for the `fill-mask` type for the deployed `bert-base-uncased` model. You can find input format, parameters, and sample inputs in the [Hugging Face hub inference API documentation](https://huggingface.co/docs/api-inference/detailed_parameters).

 ```shell
 scoring_file="./sample_score.json"
@@ -163,16 +174,16 @@ Follow this link to find [hugging face model example code](https://github.com/Az
 HuggingFace hub has thousands of models, with hundreds being updated each day. Only the most popular models in the collection are tested; others may fail with one of the following errors.

 ### Gated models
-[Gated models](https://huggingface.co/docs/hub/models-gated) require users to agree to share their contact information and accept the model owners' terms and conditions in order to access the model. Attempting to deploy such models will fail with a `KeyError`.
+[Gated models](https://huggingface.co/docs/hub/models-gated) require users to agree to share their contact information and accept the model owners' terms and conditions in order to access the model. Attempting to deploy such models fails with a `KeyError`.

 ### Models that need to run remote code
-Models typically use code from the transformers SDK but some models run code from the model repo. Such models need to set the parameter `trust_remote_code` to `True`. Follow this link to learn more about using [remote code](https://huggingface.co/docs/transformers/custom_models#using-a-model-with-custom-code). Such models are not supported from keeping security in mind. Attempting to deploy such models will fail with the following error: `ValueError: Loading <model> requires you to execute the configuration file in that repo on your local machine. Make sure you have read the code there to avoid malicious use, then set the option trust_remote_code=True to remove this error.`
+Models typically use code from the transformers SDK, but some models run code from the model repo. Such models need to set the parameter `trust_remote_code` to `True`. To learn more, see [remote code](https://huggingface.co/docs/transformers/custom_models#using-a-model-with-custom-code). Such models aren't supported, for security reasons. Attempting to deploy such models fails with the following error: `ValueError: Loading <model> requires you to execute the configuration file in that repo on your local machine. Make sure you have read the code there to avoid malicious use, then set the option trust_remote_code=True to remove this error.`

 ### Models with incorrect tokenizers
 An incorrectly specified or missing tokenizer in the model package can result in an `OSError: Can't load tokenizer for <model>` error.

 ### Missing libraries
-Some models need additional python libraries. You can install missing libraries when running models locally. Models that need special libraries beyond the standard transformers libraries will fail with `ModuleNotFoundError` or `ImportError` error.
+Some models need additional Python libraries. You can install missing libraries when running models locally. Models that need special libraries beyond the standard transformers libraries fail with a `ModuleNotFoundError` or `ImportError` error.

 ### Insufficient memory
 If you see the `OutOfQuota: Container terminated due to insufficient memory` error, try using an `instance_type` with more memory.
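The failure modes above can be condensed into a quick error-signature lookup. This is an illustrative sketch (the names and abridged signatures are hypothetical), not part of the article:

```python
# Sketch: map error signatures from deployment logs to the likely causes
# described in the troubleshooting sections above. Signatures are abridged.
KNOWN_FAILURES = {
    "KeyError": "gated model: accept the owner's terms on Hugging Face hub first",
    "trust_remote_code": "model runs remote code, which isn't supported for security reasons",
    "Can't load tokenizer": "incorrect or missing tokenizer in the model package",
    "ModuleNotFoundError": "model needs extra Python libraries beyond transformers",
    "ImportError": "model needs extra Python libraries beyond transformers",
    "insufficient memory": "use an instance_type with more memory",
}

def diagnose(log_text):
    """Return likely causes for known error signatures found in a log."""
    return [cause for signature, cause in KNOWN_FAILURES.items() if signature in log_text]
```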
@@ -181,16 +192,16 @@ If you see the `OutOfQuota: Container terminated due to insufficient memory`, tr
 **Where are the model weights stored?**

-Hugging Face models are featured in the Azure Machine Learning model catalog through the `HuggingFace` registry. Hugging Face creates and manages this registry and is made available to Azure Machine Learning as a Community Registry. The model weights aren't hosted on Azure. The weights are downloaded directly from Hugging Face hub to the online endpoints in your workspace when these models deploy. `HuggingFace` registry in AzureML works as a catalog to help discover and deploy HuggingFace hub models in Azure Machine Learning.
+Hugging Face models are featured in the Azure Machine Learning model catalog through the `HuggingFace` registry. Hugging Face creates and manages this registry, which is made available to Azure Machine Learning as a community registry. The model weights aren't hosted on Azure. The weights are downloaded directly from Hugging Face hub to the online endpoints in your workspace when you deploy these models. The `HuggingFace` registry in Azure Machine Learning works as a catalog to help you discover and deploy HuggingFace hub models in Azure Machine Learning.

 **How to deploy the models for batch inference?**
 Deploying these models to batch endpoints for batch inference is currently not supported.

-**Can I use models from the `HuggingFace` registry as input to jobs so that I can finetune these models using transformers SDK?**
-Since the model weights aren't stored in the `HuggingFace` registry, you cannot access model weights by using these models as inputs to jobs.
+**Can I use models from the `HuggingFace` registry as input to jobs so that I can fine-tune these models using the transformers SDK?**
+Since the model weights aren't stored in the `HuggingFace` registry, you can't access model weights by using these models as inputs to jobs.

 **How do I get support if my deployments fail or inference doesn't work as expected?**
-`HuggingFace` is a community registry and that is not covered by Microsoft support. Review the deployment logs and find out if the issue is related to Azure Machine Learning platform or specific to HuggingFace transformers. Contact Microsoft support for any platform issues. Example, not being able to create online endpoint or authentication to endpoint REST API doesn't work. For transformers specific issues, use the [HuggingFace forum](https://discuss.huggingface.co/) or [HuggingFace support](https://huggingface.co/support).
+`HuggingFace` is a community registry that isn't covered by Microsoft support. Review the deployment logs to find out whether the issue is related to the Azure Machine Learning platform or specific to HuggingFace transformers. Contact Microsoft support for any platform issues, for example, not being able to create an online endpoint, or authentication to the endpoint REST API not working. For transformers-specific issues, use the [HuggingFace forum](https://discuss.huggingface.co/) or [HuggingFace support](https://huggingface.co/support).

 **What is a community registry?**
 Community registries are Azure Machine Learning registries created by trusted Azure Machine Learning partners and available to all Azure Machine Learning users.

articles/machine-learning/prompt-flow/get-started-prompt-flow.md

Lines changed: 7 additions & 4 deletions
@@ -5,11 +5,11 @@ description: Learn how to set up, create, evaluate, and deploy a prompt flow in
 services: machine-learning
 ms.service: azure-machine-learning
 ms.subservice: prompt-flow
-ms.topic: tutorial
+ms.topic: how-to
 author: s-polly
 ms.author: scottpolly
-ms.reviewer: yijunzhang
-ms.date: 10/03/2024
+ms.reviewer: sooryar
+ms.date: 07/17/2025
 ms.custom:
 - ignite-2023
 - build-2024
@@ -32,6 +32,9 @@ This article walks you through the main user journey of using prompt flow in Azu
 A connection helps securely store and manage secret keys or other sensitive credentials required for interacting with Large Language Models (LLM) and other external tools such as Azure Content Safety. Connection resources are shared with all members in the workspace.

+> [!NOTE]
+> The LLM tool in prompt flow doesn't support reasoning models (such as OpenAI o1 or o3). For reasoning model integration, use the Python tool to call the model APIs directly. For more information, see [Call a reasoning model from the Python tool](tools-reference/python-tool.md#call-a-reasoning-model-from-the-python-tool).
+
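Because the LLM tool can't call reasoning models, a Python tool has to assemble the request itself. A hedged sketch of the request body such a tool might build (the deployment name and token limit are placeholders; reasoning models take `max_completion_tokens` rather than `max_tokens` and ignore sampling parameters such as `temperature`):

```python
# Sketch: request body a prompt flow Python tool might build for a reasoning
# model. The deployment name is a placeholder; reasoning models (such as
# o1/o3) use max_completion_tokens instead of max_tokens.
def build_reasoning_request(question, deployment="o3-mini"):
    return {
        "model": deployment,
        "messages": [{"role": "user", "content": question}],
        "max_completion_tokens": 2000,
    }

request = build_reasoning_request("Classify https://www.imdb.com into a web category.")
```

The resulting dict would be passed to a chat-completions client inside the Python tool; see the linked article for the supported pattern.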
 1. To check if you already have an Azure OpenAI connection, select **Prompt flow** from the Azure Machine Learning studio left menu and then select the **Connections** tab on the **Prompt flow** screen.

 :::image type="content" source="./media/get-started-prompt-flow/connection-creation-entry-point.png" alt-text="Screenshot of the connections tab with create highlighted." lightbox = "./media/get-started-prompt-flow/connection-creation-entry-point.png":::
@@ -64,7 +67,7 @@ In the **Flows** tab of the **Prompt flow** home page, select **Create** to crea
 In the **Explore gallery**, you can browse the built-in samples and select **View detail** on any tile to preview whether it's suitable for your scenario.

-This tutorial uses the **Web Classification** sample to walk through the main user journey. Web Classification is a flow demonstrating multiclass classification with a LLM. Given a URL, the flow classifies the URL into a web category with just a few shots, simple summarization, and classification prompts. For example, given a URL `https://www.imdb.com`, it classifies the URL into `Movie`.
+This tutorial uses the **Web Classification** sample to walk through the main user journey. Web Classification is a flow demonstrating multiclass classification with an LLM. Given a URL, the flow classifies the URL into a web category with just a few shots, simple summarization, and classification prompts. For example, given a URL `https://www.imdb.com`, it classifies the URL into `Movie`.

 To clone the sample, select **Clone** on the **Web Classification** tile.
