
Commit 506a88a

Refreshes how-to-compile-hugging-face-models.md
1 parent d2da5a2 commit 506a88a

1 file changed

+28 -28 lines changed

articles/ai-foundry/foundry-local/how-to/how-to-compile-hugging-face-models.md

Lines changed: 28 additions & 28 deletions
````diff
@@ -10,31 +10,32 @@ ms.author: jburchel
 ms.reviewer: samkemp
 author: jonburchel
 reviewer: samuel100
-ms.date: 07/03/2025
+ms.date: 10/01/2025
+ai-usage: ai-assisted
 ---

 # Compile Hugging Face models to run on Foundry Local

 [!INCLUDE [foundry-local-preview](./../includes/foundry-local-preview.md)]

-Foundry Local runs ONNX models on your device with high performance. While the model catalog offers _out-of-the-box_ precompiled options, you can use any model in the ONNX format.
+Foundry Local runs ONNX models on your device with high performance. Although the model catalog offers precompiled options out of the box, any model in the ONNX format works.

-To compile existing models in Safetensor or PyTorch format into the ONNX format, you can use [Olive](https://microsoft.github.io/Olive). Olive is a tool that optimizes models to ONNX format, making them suitable for deployment in Foundry Local. It uses techniques like _quantization_ and _graph optimization_ to improve performance.
+Use [Olive](https://microsoft.github.io/Olive) to compile models in Safetensor or PyTorch format to ONNX. Olive optimizes models for ONNX, making them suitable for deployment in Foundry Local. It uses techniques like quantization and graph optimization to improve performance.

-This guide shows you how to:
+This guide shows how to:

 > [!div class="checklist"]
 >
-> - **Convert and optimize** models from Hugging Face to run in Foundry Local. You'll use the `Llama-3.2-1B-Instruct` model as an example, but you can use any generative AI model from Hugging Face.
-> - **Run** your optimized models with Foundry Local
+> - Convert and optimize models from Hugging Face to run in Foundry Local. The examples use the `Llama-3.2-1B-Instruct` model, but any generative AI model from Hugging Face works.
+> - Run your optimized models with Foundry Local.

 ## Prerequisites

 - Python 3.10 or later

 ## Install Olive

-[Olive](https://github.com/microsoft/olive) is a tool that optimizes models to ONNX format.
+[Olive](https://github.com/microsoft/olive) optimizes models and converts them to the ONNX format.

 ### [Bash](#tab/Bash)

````
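The install command under the Bash tab is elided from this hunk; the next hunk's header shows `pip install olive-ai[auto-opt]`. A minimal sketch of that step, assuming a venv-based environment as the tip in the next hunk recommends:

```bash
# Create and activate a virtual environment (assumed setup; conda works too)
python -m venv .olive-env
source .olive-env/bin/activate

# Install Olive with the auto-optimization extras
pip install olive-ai[auto-opt]
```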

````diff
@@ -51,11 +52,11 @@ pip install olive-ai[auto-opt]
 ---

 > [!TIP]
-> For best results, install Olive in a virtual environment using [venv](https://docs.python.org/3/library/venv.html) or [conda](https://www.anaconda.com/docs/getting-started/miniconda/main).
+> Install Olive in a virtual environment with [venv](https://docs.python.org/3/library/venv.html) or [conda](https://www.anaconda.com/docs/getting-started/miniconda/main).

 ## Sign in to Hugging Face

-You optimize the `Llama-3.2-1B-Instruct` model, which requires Hugging Face authentication:
+The `Llama-3.2-1B-Instruct` model requires Hugging Face authentication.

 ### [Bash](#tab/Bash)

````
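The sign-in command itself is elided here; the next hunk's header shows `huggingface-cli login`. A hedged sketch of the Bash step, where the `--token` flag and the `HF_TOKEN` variable are illustrative assumptions:

```bash
# Interactive login: prompts for the Hugging Face token created earlier
huggingface-cli login

# Non-interactive alternative (assumes the token is stored in an environment variable)
huggingface-cli login --token "$HF_TOKEN"
```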

````diff
@@ -72,7 +73,7 @@ huggingface-cli login
 ---

 > [!NOTE]
-> You must first [create a Hugging Face token](https://huggingface.co/docs/hub/security-tokens) and [request model access](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct) before proceeding.
+> [Create a Hugging Face token](https://huggingface.co/docs/hub/security-tokens) and [request model access](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct) before proceeding.

 ## Compile the model

````
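The compile command is elided from the diff; the next hunk's header shows only its opening line, `olive auto-opt`, with a PowerShell line continuation. A hedged sketch of what a Bash invocation might look like; the flags and output path here are assumptions, not the article's verbatim command:

```bash
# Compile and quantize the Hugging Face model to ONNX (illustrative flags)
olive auto-opt \
    --model_name_or_path meta-llama/Llama-3.2-1B-Instruct \
    --output_path model \
    --device cpu \
    --provider CPUExecutionProvider \
    --precision int4 \
    --use_ort_genai
```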

````diff
@@ -111,7 +112,7 @@ olive auto-opt `
 ---

 > [!NOTE]
-> The compilation process takes approximately 60 seconds, plus extra time for model download.
+> The compilation process takes about 60 seconds, plus download time.

 The command uses the following parameters:

````

````diff
@@ -129,7 +130,7 @@ The command uses the following parameters:

 ### Step 2: Rename the output model

-Olive places files in a generic `model` directory. Rename it to make it easier to use:
+Olive creates a generic `model` directory. Rename it for easier reuse:

 ### [Bash](#tab/Bash)

````
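The Bash rename command is elided; the next hunk's header shows the PowerShell equivalent (`Rename-Item -Path "model" -NewName "llama-3.2"`). The Bash form would presumably be:

```bash
# Rename Olive's generic output directory to something memorable
mv model llama-3.2
```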

````diff
@@ -151,7 +152,7 @@ Rename-Item -Path "model" -NewName "llama-3.2"

 A chat template is a structured format that defines how input and output messages are processed for a conversational AI model. It specifies the roles (for example, system, user, assistant) and the structure of the conversation, ensuring that the model understands the context and generates appropriate responses.

-Foundry Local requires a chat template JSON file called `inference_model.json` in order to generate the appropriate responses. The template properties are the model name and a `PromptTemplate` object, which contains a `{Content}` placeholder that Foundry Local injects at runtime with the user prompt.
+Foundry Local requires a chat template JSON file named `inference_model.json` to generate responses. The template includes the model name and a `PromptTemplate` object. The object contains a `{Content}` placeholder that Foundry Local injects at runtime with the user prompt.

 ```json
 {
````
````diff
@@ -163,10 +164,10 @@ Foundry Local requires a chat template JSON file called `inference_model.json` i
 }
 ```

-To create the chat template file, you can use the `apply_chat_template` method from the Hugging Face library:
+Create the chat template file with the `apply_chat_template` method from the Hugging Face library:

 > [!NOTE]
-> The following example uses the Python Hugging Face library to create a chat template. The Hugging Face library is a dependency for Olive, so if you're using the same Python virtual environment you don't need to install. If you're using a different environment, install the library with `pip install transformers`.
+> This example uses the Hugging Face library (a dependency of Olive) to create a chat template. If you're using the same Python virtual environment, you don't need to install it. In a different environment, install it with `pip install transformers`.

 ```python
 # generate_inference_model.py
````
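The body of `generate_inference_model.py` is elided between this hunk and the next. A hedged sketch of what such a script might contain, based on the surrounding description; `apply_chat_template` is the documented Transformers API, while the JSON key names and paths below are assumptions rather than the article's exact file:

```python
# generate_inference_model.py (illustrative sketch, not the article's verbatim script)
import json
import os

from transformers import AutoTokenizer

model_dir = "llama-3.2"  # assumed: the renamed Olive output directory

# Apply the model's chat template to a single user turn carrying the
# {Content} placeholder that Foundry Local replaces with the user prompt at runtime.
tokenizer = AutoTokenizer.from_pretrained(model_dir)
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "{Content}"}],
    tokenize=False,
    add_generation_prompt=True,
)

# Write the chat template file Foundry Local expects next to the model files.
inference_model = {
    "Name": "llama-3.2",   # assumed key for the model name
    "PromptTemplate": {
        "prompt": prompt,  # assumed field name; the rendered template embeds {Content}
    },
}

with open(os.path.join(model_dir, "inference_model.json"), "w") as f:
    json.dump(inference_model, f, indent=2)
```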
````diff
@@ -208,7 +209,7 @@ python generate_inference_model.py

 ## Run the model

-You can run your compiled model using the Foundry Local CLI, REST API, or OpenAI Python SDK. First, change the model cache directory to the models directory you created in the previous step:
+Run your compiled model with the Foundry Local CLI, REST API, or OpenAI Python SDK. First, change the model cache directory to the models directory you created in the previous step:

 ### [Bash](#tab/Bash)

````
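The cache and run commands for this step are elided; the next two hunk headers show `foundry cache ls # should show llama-3.2` and `foundry model run llama-3.2 --verbose`. A hedged sketch of the sequence, assuming the `llama-3.2` directory sits in the current working directory:

```bash
# Point the Foundry Local model cache at the directory that contains llama-3.2/
# (the exact path is elided from the diff; "." is an assumption)
foundry cache cd .

# Confirm the model is visible in the cache
foundry cache ls   # should show llama-3.2

# Run the model from the CLI
foundry model run llama-3.2 --verbose
```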

````diff
@@ -226,10 +227,10 @@ foundry cache ls # should show llama-3.2
 ---

 > [!CAUTION]
-> Remember to change the model cache back to the default directory when you're done by running:
+> Change the model cache back to the default directory when you're done:
 >
-> ```bash
-> foundry cache cd ./foundry/cache/models.
+> ```bash
+> foundry cache cd ./foundry/cache/models
 > ```


````
````diff
@@ -250,26 +251,25 @@ foundry model run llama-3.2 --verbose

 ### Using the OpenAI Python SDK

-The OpenAI Python SDK is a convenient way to interact with the Foundry Local REST API. You can install it using:
+Use the OpenAI Python SDK to interact with the Foundry Local REST API. Install it with:

 ```bash
 pip install openai
 pip install foundry-local-sdk
 ```

-Then, you can use the following code to run the model:
+Then run the model with the following code:

 ```python
 import openai
 from foundry_local import FoundryLocalManager

 modelId = "llama-3.2"

-# Create a FoundryLocalManager instance. This will start the Foundry
-# Local service if it is not already running and load the specified model.
+# Create a FoundryLocalManager instance. This starts the Foundry Local service if it's not already running and loads the specified model.
 manager = FoundryLocalManager(modelId)

-# The remaining code us es the OpenAI Python SDK to interact with the local model.
+# The remaining code uses the OpenAI Python SDK to interact with the local model.

 # Configure the client to use the local Foundry service
 client = openai.OpenAI(
````
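The client configuration and the streaming call between `client = openai.OpenAI(` and `for chunk in stream:` are elided across this hunk and the next. A hedged sketch of how that gap is commonly filled with the OpenAI SDK; the `manager.endpoint`, `manager.api_key`, and `get_model_info` members are assumptions about the `foundry-local-sdk` surface:

```python
# Configure the client to use the local Foundry service (continuing the snippet above)
client = openai.OpenAI(
    base_url=manager.endpoint,  # assumed: the manager exposes the local REST endpoint
    api_key=manager.api_key,    # assumed: a placeholder key for the local service
)

# Stream a chat completion from the locally hosted model
stream = client.chat.completions.create(
    model=manager.get_model_info(modelId).id,  # assumed helper that resolves the model ID
    messages=[{"role": "user", "content": "What is the golden ratio?"}],
    stream=True,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```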
````diff
@@ -291,17 +291,17 @@ for chunk in stream:
 ```

 > [!TIP]
-> You can use any language that supports HTTP requests. For more information, read the [Integrated inferencing SDKs with Foundry Local](../how-to/how-to-integrate-with-inference-sdks.md) article.
+> Use any language that supports HTTP requests. For more information, see [Integrated inferencing SDKs with Foundry Local](../how-to/how-to-integrate-with-inference-sdks.md).

-## Finishing up
+## Reset the model cache

-After you're done using the custom model, you should reset the model cache to the default directory using:
+After you finish using the custom model, reset the model cache to the default directory:

 ```bash
 foundry cache cd ./foundry/cache/models
 ```

 ## Next steps

-- [Learn more about Olive](https://microsoft.github.io/Olive/)
+- [Olive documentation](https://microsoft.github.io/Olive/)
 - [Integrate inferencing SDKs with Foundry Local](how-to-integrate-with-inference-sdks.md)
````
