README.md: 7 additions & 14 deletions
@@ -182,7 +182,7 @@ python3 torchchat.py generate llama3.1 --prompt "write me a story about a boy an
[skip default]: end
### Server
-This mode exposes a REST API for interacting with a model.
+This mode exposes a REST API for interacting with a model.
The server follows the [OpenAI API specification](https://platform.openai.com/docs/api-reference/chat) for chat completions.
To test out the REST API, **you'll need 2 terminals**: one to host the server, and one to send the request.
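
As a sketch of that two-terminal flow (the `server` subcommand comes from torchchat's CLI; the port 5000 and `/v1/chat/completions` path are assumptions modeled on the OpenAI spec, so check the server's startup output for the actual address):

```bash
# Terminal 1: host the model behind the REST API
python3 torchchat.py server llama3.1

# Terminal 2: send an OpenAI-style chat completion request
# (port and endpoint path are assumptions; see the server's startup output)
curl http://127.0.0.1:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama3.1",
        "messages": [{"role": "user", "content": "Hello, who are you?"}],
        "max_tokens": 200
      }'
```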
@@ -255,14 +255,14 @@ Use the "Max Response Tokens" slider to limit the maximum number of tokens gener
## Desktop/Server Execution
### AOTI (AOT Inductor)
-[AOTI](https://pytorch.org/blog/pytorch2-2/) compiles models before execution for faster inference. The process creates a [DSO](https://en.wikipedia.org/wiki/Shared_library) model (represented by a file with extension `.so`)
+[AOTI](https://pytorch.org/blog/pytorch2-2/) compiles models before execution for faster inference. The process creates a zipped PT2 file containing all the artifacts generated by AOTInductor, and a [.so](https://en.wikipedia.org/wiki/Shared_library) file with the runnable contents
that is then loaded for inference. This can be done in both Python and C++ environments.
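
Since the PT2 package is a zip archive, its contents can be inspected with standard tools once a model has been exported (a quick sketch; the path assumes the export location used later in this README):

```bash
# List the AOTInductor artifacts bundled inside the exported PT2 package
unzip -l exportedModels/llama3_1_artifacts.pt2
```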
The following example exports and executes the Llama3.1 8B Instruct
model. The first command compiles and performs the actual export.
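
A sketch of that export step under the new PT2 flow (the `--output-aoti-package-path` flag is an assumption, inferred from the `--aoti-package-path` flag used with `generate` below):

```bash
# Compile and export Llama3.1 to a PT2 package
# (flag name assumed; see `python3 torchchat.py export --help`)
python3 torchchat.py export llama3.1 --output-aoti-package-path exportedModels/llama3_1_artifacts.pt2
```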
@@ -274,12 +274,11 @@ case visit our [customization guide](docs/model_customization.md).
### Run in a Python Environment
-To run in a python enviroment, use the generate subcommand like before, but include the dso file.
+To run in a Python environment, use the generate subcommand like before, but include the PT2 file.
```bash
-python3 torchchat.py generate llama3.1 --dso-path exportedModels/llama3.1.so --prompt "Hello my name is"
+python3 torchchat.py generate llama3.1 --aoti-package-path exportedModels/llama3_1_artifacts.pt2 --prompt "Hello my name is"
```
-**Note:** Depending on which accelerator is used to generate the .dso file, the command may need the device specified: `--device (cuda | cpu)`.
### Run using our C++ Runner
@@ -289,17 +288,11 @@ To run in a C++ enviroment, we need to build the runner binary.
torchchat/utils/scripts/build_native.sh aoti
```
-To compile the AOTI generated artifacts into a `.so`:
+Then run the compiled executable, passing it the PT2 file.
```bash
-make -C exportedModels/llama3_1_artifacts
+cmake-out/aoti_run exportedModels/llama3_1_artifacts.pt2 -z `python3 torchchat.py where llama3.1`/tokenizer.model -l 3 -i "Once upon a time"
```
-Then run the compiled executable, with the compiled DSO.
-```bash
-cmake-out/aoti_run exportedModels/llama3_1_artifacts/llama3_1_artifacts.so -z `python3 torchchat.py where llama3.1`/tokenizer.model -l 3 -i "Once upon a time"
-```
-**Note:** Depending on which accelerator is used to generate the .dso file, the runner may need the device specified: `-d (CUDA | CPU)`.
-
## Mobile Execution
[ExecuTorch](https://github.com/pytorch/executorch) enables you to optimize your model for execution on a