
Commit bde427d

Merge branch 'main' into pinbump1111

2 parents: 5b91d46 + de2507b

File tree: 5 files changed (+32 / -78 lines)

.gitignore (4 additions, 0 deletions)

@@ -19,6 +19,10 @@ runner-et/cmake-out/*
 runner-aoti/cmake-out/*
 cmake-out/
 
+# Example project Android Studio ignore
+torchchat/edge/android/torchchat/.idea/*
+
+
 # pte files
 *.pte

docs/ADVANCED-USERS.md (18 additions, 72 deletions)
@@ -18,10 +18,10 @@ Torchchat is currently in a pre-release state and under extensive development.
 [shell default]: TORCHCHAT_ROOT=${PWD} ./torchchat/utils/scripts/install_et.sh
 
 
-This is the advanced users guide, if you're looking to get started
+This is the advanced users' guide, if you're looking to get started
 with LLMs, please refer to the README at the root directory of the
 torchchat distro. This is an advanced user guide, so we will have
-many more concepts and options to discuss and taking advantage of them
+many more concepts and options to discuss and take advantage of them
 may take some effort.
 
 We welcome community contributions of all kinds. If you find
@@ -41,7 +41,7 @@ While we strive to support a broad range of models, we can't test them
 all. We classify supported models as tested ✅, work in progress 🚧 or
 some restrictions ❹.
 
-We invite community contributions of new model suport and test results!
+We invite community contributions of new model support and test results!
 
 | Model | Tested | Eager | torch.compile | AOT Inductor | ExecuTorch | Fits on Mobile |
 |-----|--------|-------|-----|-----|-----|-----|
@@ -86,7 +86,7 @@ Server C++ runtime | n/a | run.cpp model.pte | ✅ |
 Mobile C++ runtime | n/a | app model.pte | ✅ |
 Mobile C++ runtime | n/a | app + AOTI | 🚧 |
 
-**Getting help:** Each command implements the --help option to give addititonal information about available options:
+**Getting help:** Each command implements the --help option to give additional information about available options:
 
 [skip default]: begin
 ```
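
A quick sketch of the `--help` behavior this hunk documents (subcommand names come from the surrounding file; output omitted):

```
# Each torchchat subcommand prints its available options with --help.
python3 torchchat.py export --help
python3 torchchat.py generate --help
```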
@@ -96,8 +96,8 @@ python3 torchchat.py [ export | generate | chat | eval | ... ] --help
 
 Exported models can be loaded back into torchchat for chat or text
 generation, letting you experiment with the exported model and valid
-model quality. The python interface is the same in all cases and is
-used for testing nad test harnesses too.
+model quality. The Python interface is the same in all cases and is
+used for testing and test harnesses, too.
 
 Torchchat comes with server C++ runtimes to execute AOT Inductor and
 ExecuTorch models. A mobile C++ runtimes allow you to deploy
@@ -115,7 +115,7 @@ Some common models are recognized by torchchat based on their filename
 through `Model.from_name()` to perform a fuzzy match against a
 table of known model architectures. Alternatively, you can specify the
 index into that table with the option `--params-table ${INDEX}` where
-the index is the lookup key key in the [the list of known
+the index is the lookup key in the [the list of known
 pconfigurations](https://github.com/pytorch/torchchat/tree/main/torchchat/model_params)
 For example, for the stories15M model, this would be expressed as
 `--params-table stories15M`. (We use the model constructor
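
As a concrete instance of the `--params-table` override described in this hunk — the checkpoint filename below is a placeholder, not taken from the commit:

```
# Force the stories15M entry of the known-configurations table instead of
# relying on the fuzzy filename match done by Model.from_name().
python3 torchchat.py generate \
  --checkpoint-path stories15M.pt \
  --params-table stories15M \
  --prompt "Once upon a time"
```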
@@ -237,7 +237,7 @@ which chooses the best 16-bit floating point type.
 
 The virtual device fast and virtual floating point data types fast and
 fast16 are best used for eager/torch.compiled execution. For export,
-specify the your device choice for the target system with --device for
+specify your device choice for the target system with --device for
 AOTI-exported DSO models, and using ExecuTorch delegate selection for
 ExecuTorch-exported PTE models.
 
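A minimal sketch of the virtual device and dtype selection described above, assuming they are spelled `--device fast` and `--dtype fast16` (the hunk names the virtual values but not this exact invocation):

```
# Let torchchat pick the best device and 16-bit floating-point type for
# this host; intended for eager/torch.compile execution, not export.
python3 torchchat.py generate --device fast --dtype fast16 \
  --checkpoint-path ${MODEL_PATH} --prompt "Hello, my name is"
```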

@@ -250,8 +250,7 @@ python3 torchchat.py generate [--compile] --checkpoint-path ${MODEL_PATH} --prom
 To improve performance, you can compile the model with `--compile`
 trading off the time to first token processed with time per token. To
 improve performance further, you may also compile the prefill with
-`--compile_prefill`. This will increase further compilation times though. The
-`--compile-prefill` option is not compatible with `--prefill-prefill`.
+`--compile-prefill`. This will increase further compilation times though.
 
 Parallel prefill is not yet supported by exported models, and may be
 supported in a future release.
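
Combining the two flags from this hunk into one invocation (assembled from the surrounding docs, not shown verbatim in the commit):

```
# Compile both decode and prefill: slower startup (compilation time),
# faster steady-state tokens/sec.
python3 torchchat.py generate --compile --compile-prefill \
  --checkpoint-path ${MODEL_PATH} --prompt "Hello, my name is"
```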
@@ -265,7 +264,7 @@ the introductory README.
 In addition to running eval on models in eager mode and JIT-compiled
 mode with `torch.compile()`, you can also load dso and pte models back
 into the PyTorch to evaluate the accuracy of exported model objects
-(e.g., after applying quantization or other traqnsformations to
+(e.g., after applying quantization or other transformations to
 improve speed or reduce model size).
 
 Loading exported models back into a Python-based Pytorch allows you to
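
A sketch of the round trip this paragraph describes, assuming `eval` accepts the same `--dso-path`/`--pte-path` loaders that `generate` uses (an assumption; the commit does not show the eval invocations):

```
# Evaluate the eager checkpoint, then the exported artifacts, to confirm
# accuracy is preserved after export (and any quantization).
python3 torchchat.py eval --checkpoint-path ${MODEL_PATH}
python3 torchchat.py eval --checkpoint-path ${MODEL_PATH} --pte-path ${MODEL_NAME}.pte
python3 torchchat.py eval --checkpoint-path ${MODEL_PATH} --dso-path ${MODEL_NAME}.so
```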
@@ -297,14 +296,14 @@ for ExecuTorch.)
 
 We export the stories15M model with the following command for
 execution with the ExecuTorch runtime (and enabling execution on a
-wide range of community and vendor supported backends):
+wide range of community and vendor-supported backends):
 
 ```
 python3 torchchat.py export --checkpoint-path ${MODEL_PATH} --output-pte-path ${MODEL_NAME}.pte
 ```
 
 Alternatively, we may generate a native instruction stream binary
-using AOT Inductor for CPU oor GPUs (the latter using Triton for
+using AOT Inductor for CPU or GPUs (the latter using Triton for
 optimizations such as operator fusion):
 
 ```
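
The AOTI command body falls just outside this hunk; a plausible counterpart to the PTE export above, where the flag name `--output-dso-path` is an assumption mirroring the `--dso-path` loader flag used elsewhere in these docs:

```
# AOT Inductor export to a native shared object for the host target.
python3 torchchat.py export --checkpoint-path ${MODEL_PATH} --output-dso-path ${MODEL_NAME}.so
```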
@@ -319,10 +318,10 @@ the exported model artifact back into a model container with a
 compatible API surface for the `model.forward()` function. This
 enables users to test, evaluate and exercise the exported model
 artifact with familiar interfaces, and in conjunction with
-pre-exiisting Python model unit tests and common environments such as
+pre-existing Python model unit tests and common environments such as
 Jupyter notebooks and/or Google colab.
 
-Here is how to load an exported model into the python environment on the example of using an exported model with `generate.oy`.
+Here is how to load an exported model into the Python environment using an exported model with the `generate` command.
 
 ```
 python3 torchchat.py generate --checkpoint-path ${MODEL_PATH} --pte-path ${MODEL_NAME}.pte --device cpu --prompt "Once upon a time"
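
For symmetry with the PTE example above, an AOTI-exported artifact should load the same way via `--dso-path` (per docs/native-execution.md in this same commit; this exact line is not part of the diff):

```
python3 torchchat.py generate --checkpoint-path ${MODEL_PATH} \
  --dso-path ${MODEL_NAME}.so --device cpu --prompt "Once upon a time"
```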
@@ -452,7 +451,7 @@ strategies:
 You can find instructions for quantizing models in
 [docs/quantization.md](file:///./quantization.md). Advantageously,
 quantization is available in eager mode as well as during export,
-enabling you to do an early exploration of your quantization setttings
+enabling you to do an early exploration of your quantization settings
 in eager mode. However, final accuracy should always be confirmed on
 the actual execution target, since all targets have different build
 processes, compilers, and kernel implementations with potentially
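
A sketch of the eager-mode exploration this paragraph recommends, assuming quantization recipes are passed with a `--quantize` JSON config as described in docs/quantization.md (the flag spelling and config filename here are assumptions):

```
# Try a quantization recipe in eager mode first, then re-confirm accuracy
# on the real execution target after export.
python3 torchchat.py generate --quantize quant_config.json \
  --checkpoint-path ${MODEL_PATH} --prompt "Hello, my name is"
```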
@@ -464,9 +463,8 @@ significant impact on accuracy.
 
 ## Native (Stand-Alone) Execution of Exported Models
 
-Refer to the [README](README.md] for an introduction toNative
-execution on servers, desktops and laptops is described under
-[runner-build.md]. Mobile and Edge executipon for Android and iOS are
+Refer to the [README](README.md] for an introduction to native
+execution on servers, desktops, and laptops. Mobile and Edge execution for Android and iOS are
 described under [torchchat/edge/docs/Android.md] and [torchchat/edge/docs/iOS.md], respectively.
 
 
@@ -475,7 +473,7 @@ described under [torchchat/edge/docs/Android.md] and [torchchat/edge/docs/iOS.md
 
 PyTorch and ExecuTorch support a broad range of devices for running
 PyTorch with python (using either eager or eager + `torch.compile`) or
-in a python-free environment with AOT Inductor and ExecuTorch.
+in a Python-free environment with AOT Inductor and ExecuTorch.
 
 
 | Hardware | OS | Eager | Eager + Compile | AOT Compile | ET Runtime |
@@ -499,58 +497,6 @@ in a python-free environment with AOT Inductor and ExecuTorch.
 *Key*: n/t -- not tested
 
 
-## Runtime performance with Llama 7B, in tokens per second (4b quantization)
-
-| Hardware | OS | eager | eager + compile | AOT compile | ET Runtime |
-|-----|------|-----|-----|-----|-----|
-| x86 | Linux | ? | ? | ? | ? |
-| x86 | macOS | ? | ? | ? | ? |
-| aarch64 | Linux | ? | ? | ? | ? |
-| aarch64 | macOS | ? | ? | ? | ? |
-| AMD GPU | Linux | ? | ? | ? | ? |
-| Nvidia GPU | Linux | ? | ? | ? | ? |
-| MPS | macOS | ? | ? | ? | ? |
-| MPS | iOS | ? | ? | ? | ? |
-| aarch64 | Android | ? | ? | ? | ? |
-| Mobile GPU (Vulkan) | Android | ? | ? | ? | ? |
-| CoreML | iOS | | ? | ? | ? | ? |
-| Hexagon DSP | Android | | ? | ? | ? | ? |
-| Raspberry Pi 4/5 | Raspbian | ? | ? | ? | ? |
-| Raspberry Pi 4/5 | Android | ? | ? | ? | ? |
-| ARM 32b (up to v7) | any | | ? | ? | ? | ? |
-
-
-## Runtime performance with Llama3, in tokens per second (4b quantization)
-
-| Hardware | OS | eager | eager + compile | AOT compile | ET Runtime |
-|-----|------|-----|-----|-----|-----|
-| x86 | Linux | ? | ? | ? | ? |
-| x86 | macOS | ? | ? | ? | ? |
-| aarch64 | Linux | ? | ? | ? | ? |
-| aarch64 | macOS | ? | ? | ? | ? |
-| AMD GPU | Linux | ? | ? | ? | ? |
-| Nvidia GPU | Linux | ? | ? | ? | ? |
-| MPS | macOS | ? | ? | ? | ? |
-| MPS | iOS | ? | ? | ? | ? |
-| aarch64 | Android | ? | ? | ? | ? |
-| Mobile GPU (Vulkan) | Android | ? | ? | ? | ? |
-| CoreML | iOS | | ? | ? | ? | ? |
-| Hexagon DSP | Android | | ? | ? | ? | ? |
-| Raspberry Pi 4/5 | Raspbian | ? | ? | ? | ? |
-| Raspberry Pi 4/5 | Android | ? | ? | ? | ? |
-| ARM 32b (up to v7) | any | | ? | ? | ? | ? |
-
-
-
-# CONTRIBUTING to torchchat
-
-We welcome any feature requests, bug reports, or pull requests from
-the community. See the [CONTRIBUTING](CONTRIBUTING.md) for
-instructions how to contribute to torchchat.
-
-
-
 # LICENSE
 
 Torchchat is released under the [BSD 3 license](./LICENSE). However

docs/multimodal.md (6 additions, 2 deletions)
@@ -41,6 +41,9 @@ python3 torchchat.py server llama3.2-11B
 ```
 [skip default]: end
 
+[shell default]: python3 torchchat.py server llama3.2-11B & server_pid=$!
+
+
 In another terminal, query the server using `curl`. This query might take a few minutes to respond.
 
 <details>
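
For readers unfamiliar with the `[shell default]` test annotations, the added line simply runs the server in the background and records its PID so a later annotation can stop it; expanded as plain shell:

```
# Start the OpenAI-compatible server in the background and remember its PID.
python3 torchchat.py server llama3.2-11B &
server_pid=$!
# ...query http://127.0.0.1:5000 with curl as shown below...
kill ${server_pid}   # the matching [shell default] line later in the file
```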
@@ -50,7 +53,6 @@ Setting `stream` to "true" in the request emits a response in chunks. If `stream
 
 **Example Input + Output**
 
-[skip default]: begin
 ```
 curl http://127.0.0.1:5000/v1/chat/completions \
 -H "Content-Type: application/json" \
@@ -74,12 +76,14 @@ curl http://127.0.0.1:5000/v1/chat/completions \
 "max_tokens": 300
 }'
 ```
-
+[skip default]: begin
 ```
 {"id": "chatcmpl-cb7b39af-a22e-4f71-94a8-17753fa0d00c", "choices": [{"message": {"role": "assistant", "content": "The image depicts a simple black and white cartoon-style drawing of an animal face. It features a profile view, complete with two ears, expressive eyes, and a partial snout. The animal looks to the left, with its eye and mouth implied, suggesting that the drawn face might belong to a rabbit, dog, or pig. The graphic face has a bold black outline and a smaller, solid black nose. A small circle, forming part of the face, has a white background with two black quirkly short and long curved lines forming an outline of what was likely a mouth, complete with two teeth. The presence of the curve lines give the impression that the animal is smiling or speaking. Grey and black shadows behind the right ear and mouth suggest that this face is looking left and upwards. Given the prominent outline of the head and the outline of the nose, it appears that the depicted face is most likely from the side profile of a pig, although the ears make it seem like a dog and the shape of the nose makes it seem like a rabbit. Overall, it seems that this image, possibly part of a character illustration, is conveying a playful or expressive mood through its design and positioning."}, "finish_reason": "stop"}], "created": 1727487574, "model": "llama3.2", "system_fingerprint": "cpu_torch.float16", "object": "chat.completion"}%
 ```
 [skip default]: end
 
+[shell default]: kill ${server_pid}
+
 </details>
 
 ## Browser

docs/native-execution.md (3 additions, 3 deletions)
@@ -16,14 +16,14 @@ The 'llama runner' is a native standalone application capable of
 running a model exported and compiled ahead-of-time with either
 Executorch (ET) or AOT Inductor (AOTI). Which model format to use
 depends on your requirements and preferences. Executorch models are
-optimized for portability across a range of decices, including mobile
+optimized for portability across a range of devices, including mobile
 and edge devices. AOT Inductor models are optimized for a particular
 target architecture, which may result in better performance and
 efficiency.
 
 Building the runners is straightforward with the included cmake build
 files and is covered in the next sections. We will showcase the
-runners using ~~stories15M~~ llama2 7B and llama3.
+runners using llama2 7B and llama3.
 
 ## What can you do with torchchat's llama runner for native execution?
 
@@ -160,7 +160,7 @@ and native execution environments, respectively.
 
 After exporting a model, you will want to verify that the model
 delivers output of high quality, and works as expected. Both can be
-achieved with the Python environment. All torchchat Python comands
+achieved with the Python environment. All torchchat Python commands
 can work with exported models. Instead of loading the model from a
 checkpoint or GGUF file, use the `--dso-path model.so` and
 `--pte-path model.pte` for loading both types of exported models. This
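
Spelled out, the verification flow this hunk documents looks like the following (same prompt across all three runs; whether a tokenizer must be supplied explicitly is outside this diff):

```
# Compare eager output against both kinds of exported artifacts.
python3 torchchat.py generate --checkpoint-path ${MODEL_PATH} --prompt "Once upon a time"
python3 torchchat.py generate --dso-path model.so --prompt "Once upon a time"
python3 torchchat.py generate --pte-path model.pte --prompt "Once upon a time"
```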

torchchat/edge/android/torchchat/app/build.gradle.kts (1 addition, 1 deletion)
@@ -57,7 +57,7 @@ dependencies {
     implementation("androidx.constraintlayout:constraintlayout:2.2.0-alpha12")
     implementation("com.facebook.fbjni:fbjni:0.5.1")
     implementation("com.google.code.gson:gson:2.8.6")
-    implementation(files("libs/executorch-llama.aar"))
+    implementation(files("libs/executorch.aar"))
     implementation("com.google.android.material:material:1.12.0")
     implementation("androidx.activity:activity:1.9.0")
     testImplementation("junit:junit:4.13.2")
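
The dependency rename implies the archive on disk changed name as well; a quick sanity check (the app-relative `libs/` location is inferred from the module path, so treat it as an assumption):

```
# Gradle resolves files("libs/...") relative to the app module, so the
# renamed AAR is expected here:
ls torchchat/edge/android/torchchat/app/libs/executorch.aar
```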
