Foundry Local runs ONNX models on your device with high performance. Although the model catalog offers precompiled options out of the box, any model in the ONNX format works.
Use [Olive](https://microsoft.github.io/Olive) to compile models in Safetensor or PyTorch format to ONNX. Olive optimizes models for ONNX, making them suitable for deployment in Foundry Local. It uses techniques like quantization and graph optimization to improve performance.
This guide shows how to:
> [!div class="checklist"]
>
> - Convert and optimize models from Hugging Face to run in Foundry Local. The examples use the `Llama-3.2-1B-Instruct` model, but any generative AI model from Hugging Face works.
> - Run your optimized models with Foundry Local.
## Prerequisites
- Python 3.10 or later
## Install Olive
[Olive](https://github.com/microsoft/olive) optimizes models and converts them to the ONNX format.
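
Olive is published on PyPI as `olive-ai`. A minimal install sketch follows; the `auto-opt` extra is an assumption based on Olive's packaging and pulls in the dependencies used by the conversion command later in this guide:

```bash
pip install "olive-ai[auto-opt]"
```
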
> [!TIP]
> Install Olive in a virtual environment with [venv](https://docs.python.org/3/library/venv.html) or [conda](https://www.anaconda.com/docs/getting-started/miniconda/main).
## Sign in to Hugging Face
The `Llama-3.2-1B-Instruct` model requires Hugging Face authentication:
### [Bash](#tab/Bash)
```bash
huggingface-cli login
```
---
> [!NOTE]
> [Create a Hugging Face token](https://huggingface.co/docs/hub/security-tokens) and [request model access](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct) before proceeding.
## Compile the model
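
The following is a minimal sketch of the conversion step, based on the Olive CLI's `auto-opt` options; the output path and the `int4` precision are example choices that you can adjust for your target hardware. The command downloads the model from Hugging Face, converts it to ONNX, quantizes it, and writes the result to the output path.

### [Bash](#tab/Bash)

```bash
# Download, convert, optimize, and quantize the model (example values)
olive auto-opt \
    --model_name_or_path meta-llama/Llama-3.2-1B-Instruct \
    --trust_remote_code \
    --output_path models/llama \
    --device cpu \
    --provider CPUExecutionProvider \
    --use_ort_genai \
    --precision int4
```
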
---
> [!NOTE]
> The compilation process takes about 60 seconds, plus download time.

The command uses the following parameters:

- `--model_name_or_path`: the Hugging Face model to convert
- `--output_path`: the directory where Olive writes the optimized ONNX output
- `--device`: the target device (for example, `cpu`)
- `--provider`: the ONNX Runtime execution provider (for example, `CPUExecutionProvider`)
- `--use_ort_genai`: emit the ONNX Runtime GenAI configuration that Foundry Local expects
- `--precision`: the quantization precision (for example, `int4`)
### Step 2: Rename the output model
Olive creates a generic `model` directory. Rename it for easier reuse:
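
For example, assuming the `models/llama` output path from the compile sketch above, the rename might look like:

```bash
# Rename the generic output folder to the alias used in the rest of this guide
mv models/llama/model models/llama/llama-3.2
```

### Step 3: Create the chat template
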
A chat template is a structured format that defines how input and output messages are processed for a conversational AI model. It specifies the roles (for example, system, user, assistant) and the structure of the conversation, ensuring that the model understands the context and generates appropriate responses.
Foundry Local requires a chat template JSON file named `inference_model.json` to generate responses. The template includes the model name and a `PromptTemplate` object. The object contains a `{Content}` placeholder that Foundry Local injects at runtime with the user prompt.
```json
{
  "Name": "llama-3.2",
  "PromptTemplate": {
    "assistant": "{Content}"
  }
}
```
Create the chat template file with the `apply_chat_template` method from the Hugging Face library:
> [!NOTE]
> This example uses the Hugging Face library (a dependency of Olive) to create a chat template. If you're using the same Python virtual environment, you don't need to install it. In a different environment, install it with `pip install transformers`.
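
A minimal sketch of such a script follows; the model directory and output file location are assumptions based on the rename step above, and the `Name`/`PromptTemplate` fields mirror the template structure shown earlier:

```python
# create_chat_template.py - writes inference_model.json next to the ONNX model
import json
import os

from transformers import AutoTokenizer

model_path = "models/llama/llama-3.2"  # assumed location of the renamed model

# Render the model's own chat template around a single user turn, keeping the
# {Content} placeholder for Foundry Local to fill in at runtime.
tokenizer = AutoTokenizer.from_pretrained(model_path)
chat = [{"role": "user", "content": "{Content}"}]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

template = {
    "Name": "llama-3.2",
    "PromptTemplate": {
        "assistant": prompt,
    },
}

with open(os.path.join(model_path, "inference_model.json"), "w") as f:
    json.dump(template, f, indent=2)
```
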
## Run the model

Run your compiled model with the Foundry Local CLI, REST API, or OpenAI Python SDK. First, change the model cache directory to the models directory you created in the previous step:
### [Bash](#tab/Bash)
```bash
# Point the cache at the directory that contains your renamed model folder;
# the path is an assumption based on the rename step above.
foundry cache cd models/llama
foundry cache ls # should show llama-3.2
```
226
227
---
> [!CAUTION]
> Change the model cache back to the default directory when you're done:
>
> ```bash
> foundry cache cd ./foundry/cache/models
> ```

### Using the Foundry Local CLI

Run the model from the command line:

```bash
foundry model run llama-3.2 --verbose
```

### Using the OpenAI Python SDK
Use the OpenAI Python SDK to interact with the Foundry Local REST API. Install it with:
```bash
pip install openai
pip install foundry-local-sdk
```
Then run the model with the following code:
```python
import openai
from foundry_local import FoundryLocalManager
modelId = "llama-3.2"
# Create a FoundryLocalManager instance. This starts the Foundry Local service if it's not already running and loads the specified model.
manager = FoundryLocalManager(modelId)
# The remaining code uses the OpenAI Python SDK to interact with the local model.
# Configure the client to use the local Foundry service
client = openai.OpenAI(
    base_url=manager.endpoint,
    api_key=manager.api_key,  # not required for local inference
)

# Stream a chat completion from the local model and print tokens as they arrive
stream = client.chat.completions.create(
    model=manager.get_model_info(modelId).id,
    messages=[{"role": "user", "content": "What is the golden ratio?"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)
```
> [!TIP]
> Use any language that supports HTTP requests. For more information, see [Integrated inferencing SDKs with Foundry Local](../how-to/how-to-integrate-with-inference-sdks.md).
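
For example, a raw HTTP request against the local OpenAI-compatible endpoint might look like the following sketch; the port is an assumption, so use the endpoint reported by `foundry service status`:

```bash
# The port below is an assumption; check `foundry service status` for the real endpoint
curl http://localhost:5273/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama-3.2",
        "messages": [{"role": "user", "content": "What is the golden ratio?"}]
      }'
```
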
## Reset the model cache
After you finish using the custom model, reset the model cache to the default directory:
```bash
foundry cache cd ./foundry/cache/models
```
## Next steps
- [Learn more about Olive](https://microsoft.github.io/Olive/)