Update TorchAO README inference section before PTC #3206
Open
jerryzh168 wants to merge 1 commit into main from update-readme-10-2025
+54 −84
@@ -24,6 +24,7 @@

## 📣 Latest News

- [Sept 19] [TorchAO Quantized Model and Quantization Recipes Now Available on Huggingface Hub](https://pytorch.org/blog/torchao-quantized-models-and-quantization-recipes-now-available-on-huggingface-hub/)!
- [Jun 25] Our [TorchAO paper](https://openreview.net/attachment?id=HpqH0JakHf&name=pdf) was accepted to CodeML @ ICML 2025!
- [May 25] QAT is now integrated into [Axolotl](https://github.com/axolotl-ai-cloud/axolotl) for fine-tuning ([docs](https://docs.axolotl.ai/docs/qat.html))!
- [Apr 25] Float8 rowwise training yielded [1.34-1.43x training speedup](https://pytorch.org/blog/accelerating-large-scale-training-and-convergence-with-pytorch-float8-rowwise-on-crusoe-2k-h200s/) at 2k H100 GPU scale

@@ -56,13 +57,6 @@ TorchAO is an easy to use quantization library for native PyTorch. TorchAO works

Check out our [docs](https://docs.pytorch.org/ao/main/) for more details!

From the team that brought you the fast series:
* 9.5x inference speedups for Image segmentation models with [sam-fast](https://pytorch.org/blog/accelerating-generative-ai)
* 10x inference speedups for Language models with [gpt-fast](https://pytorch.org/blog/accelerating-generative-ai-2)
* 3x inference speedup for Diffusion models with [sd-fast](https://pytorch.org/blog/accelerating-generative-ai-3) (new: [flux-fast](https://pytorch.org/blog/presenting-flux-fast-making-flux-go-brrr-on-h100s/))
* 2.7x inference speedup for FAIR’s Seamless M4T-v2 model with [seamlessv2-fast](https://pytorch.org/blog/accelerating-generative-ai-4/)

## 🚀 Quick Start

First, install TorchAO. We recommend installing the latest stable version:

@@ -73,20 +67,9 @@ pip install torchao
Quantize your model weights to int4!
```python
from torchao.quantization import Int4WeightOnlyConfig, quantize_
quantize_(model, Int4WeightOnlyConfig(group_size=32, version=1))
```
Compared to a `torch.compiled` bf16 baseline, your quantized model should be significantly smaller and faster on a single A100 GPU:
```bash
int4 model size: 1.25 MB
bfloat16 model size: 4.00 MB
compression ratio: 3.2

bf16 mean time: 30.393 ms
int4 mean time: 4.410 ms
speedup: 6.9x
quantize_(model, Int4WeightOnlyConfig(group_size=32, int4_packing_format="tile_packed_to_4d", int4_choose_qparams_algorithm="hqq"))
```
See our [quick start guide](https://docs.pytorch.org/ao/stable/quick_start.html) for more details. Alternatively, try quantizing your favorite model using our [HuggingFace space](https://huggingface.co/spaces/pytorch/torchao-my-repo)!

See our [quick start guide](https://docs.pytorch.org/ao/stable/quick_start.html) for more details.
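
For readers who want to try the quick start end to end, here is a minimal, hedged sketch on a toy model; the layer sizes, dtype, and `group_size` are illustrative and not part of this PR, and kernel availability depends on your GPU and torchao version:

```python
# Minimal quick-start sketch on a toy model (assumes a CUDA GPU and bf16 weights;
# layer sizes and group_size are illustrative only).
import torch
from torchao.quantization import Int4WeightOnlyConfig, quantize_

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 1024),
).to(torch.bfloat16).cuda()

# Quantize all nn.Linear weights to int4 in place
quantize_(model, Int4WeightOnlyConfig(group_size=32))

x = torch.randn(16, 1024, dtype=torch.bfloat16, device="cuda")
with torch.inference_mode():
    out = torch.compile(model)(x)  # torch.compile selects the fast int4 kernels
print(out.shape)
```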

## 🛠 Installation

@@ -103,13 +86,14 @@ pip install torchao
pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126

# Different CUDA versions
pip install torchao --index-url https://download.pytorch.org/whl/cu126 # CUDA 12.6
pip install torchao --index-url https://download.pytorch.org/whl/cu128 # CUDA 12.8
pip install torchao --index-url https://download.pytorch.org/whl/cpu # CPU only

# For developers
USE_CUDA=1 python setup.py develop
USE_CPP=0 python setup.py develop
```

</details>

Please see the [torchao compatibility table](https://github.com/pytorch/ao/issues/2919) for version requirements for dependencies.
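
A quick way to confirm the install picked up the build you expect (a small sketch, not part of this PR):

```python
# Sanity check after installation: report torchao/torch versions and whether
# a CUDA device is visible.
import torch
import torchao

print("torchao:", torchao.__version__)
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
```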

@@ -120,57 +104,64 @@ TorchAO is integrated into some of the leading open-source libraries including:

* HuggingFace transformers with a [builtin inference backend](https://huggingface.co/docs/transformers/main/quantization/torchao) and [low bit optimizers](https://github.com/huggingface/transformers/pull/31865)

Review comment: reordered a bit to put more commonly used ones earlier

* HuggingFace diffusers best practices with `torch.compile` and TorchAO in a standalone repo [diffusers-torchao](https://github.com/huggingface/diffusers/blob/main/docs/source/en/quantization/torchao.md)
* vLLM for LLM serving: [usage](https://docs.vllm.ai/en/latest/features/quantization/torchao.html), [detailed docs](https://docs.pytorch.org/ao/main/torchao_vllm_integration.html)
* Integration with [FBGEMM](https://github.com/pytorch/FBGEMM/tree/main/fbgemm_gpu/experimental/gen_ai) for SOTA kernels on server GPUs
* Integration with [ExecuTorch](https://github.com/pytorch/executorch/) for edge device deployment
* Axolotl for [QAT](https://docs.axolotl.ai/docs/qat.html) and [PTQ](https://docs.axolotl.ai/docs/quantize.html)
* TorchTitan for [float8 pre-training](https://github.com/pytorch/torchtitan/blob/main/docs/float8.md)
* HuggingFace PEFT for LoRA using TorchAO as their [quantization backend](https://huggingface.co/docs/peft/en/developer_guides/quantization#torchao-pytorch-architecture-optimization)
* Mobius HQQ backend leveraged our int4 kernels to get [195 tok/s on a 4090](https://github.com/mobiusml/hqq#faster-inference)
* TorchTune for our NF4 [QLoRA](https://docs.pytorch.org/torchtune/main/tutorials/qlora_finetune.html), [QAT](https://docs.pytorch.org/torchtune/main/recipes/qat_distributed.html), and [float8 quantized fine-tuning](https://github.com/pytorch/torchtune/pull/2546) recipes
* TorchTitan for [float8 pre-training](https://github.com/pytorch/torchtitan/blob/main/docs/float8.md)
* VLLM for LLM serving: [usage](https://docs.vllm.ai/en/latest/features/quantization/torchao.html), [detailed docs](https://docs.pytorch.org/ao/main/torchao_vllm_integration.html)
* SGLang for LLM serving: [usage](https://docs.sglang.ai/backend/server_arguments.html#server-arguments) and the major [PR](https://github.com/sgl-project/sglang/pull/1341).
* Axolotl for [QAT](https://docs.axolotl.ai/docs/qat.html) and [PTQ](https://docs.axolotl.ai/docs/quantize.html)

* SGLang for LLM serving: [usage](https://docs.sglang.ai/advanced_features/quantization.html#online-quantization)

## 🔎 Inference

TorchAO delivers substantial performance gains with minimal code changes:

- **Int4 weight-only**: [1.89x throughput with 58.1% less memory](torchao/quantization/README.md) on Llama-3-8B
- **Float8 dynamic quantization**: [1.54x and 1.27x speedup on Flux.1-Dev* and CogVideoX-5b respectively](https://github.com/sayakpaul/diffusers-torchao) on H100 with preserved quality
- **Int4 weight-only**: [1.73x speedup with 65% less memory](https://huggingface.co/pytorch/gemma-3-12b-it-INT4) for Gemma3-12b-it on H100 with slight impact on accuracy
- **Float8 dynamic quantization**: [1.5-1.6x speedup on gemma-3-27b-it](https://huggingface.co/pytorch/gemma-3-27b-it-FP8/blob/main/README.md#results-h100-machine) and [1.54x and 1.27x speedup on Flux.1-Dev* and CogVideoX-5b respectively](https://github.com/sayakpaul/diffusers-torchao) on H100 with preserved quality
- **Int8 activation quantization and int4 weight quantization**: Quantized Qwen3-4B running with 14.8 tokens/s with 3379 MB memory usage on iPhone 15 Pro through [ExecuTorch](https://huggingface.co/pytorch/Qwen3-4B-INT8-INT4#running-in-a-mobile-app)
- **Int4 + 2:4 Sparsity**: [2.37x throughput with 67.7% memory reduction](torchao/sparsity/README.md) on Llama-3-8B

Quantize any model with `nn.Linear` layers in just one line (Option 1), or load the quantized model directly from HuggingFace using our integration with HuggingFace transformers (Option 2):

#### Option 1: Direct TorchAO API

```python
from torchao.quantization.quant_api import quantize_, Int4WeightOnlyConfig
quantize_(model, Int4WeightOnlyConfig(group_size=128, use_hqq=True, version=1))
```
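
To make Option 1 concrete, here is a hedged end-to-end sketch; the checkpoint name, prompt, and generation settings are illustrative and not from this PR, and any model built from `nn.Linear` layers works the same way:

```python
# Sketch: load a Hugging Face model in its native precision, quantize its
# nn.Linear weights to int4 with TorchAO, then run a short generation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from torchao.quantization.quant_api import quantize_, Int4WeightOnlyConfig

model_id = "microsoft/Phi-4-mini-instruct"  # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained(model_id, dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# One-line quantization, same config as the snippet above
quantize_(model, Int4WeightOnlyConfig(group_size=128, use_hqq=True, version=1))

inputs = tokenizer("Give me a short introduction to large language models.",
                   return_tensors="pt").to(model.device)
with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```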

#### Option 2: HuggingFace Integration

Following is our recommended flow for quantization and deployment:
```python
from transformers import TorchAoConfig, AutoModelForCausalLM
from torchao.quantization.quant_api import Int4WeightOnlyConfig
from torchao.quantization import Float8DynamicActivationFloat8WeightConfig, PerRow

# Create quantization configuration
quantization_config = TorchAoConfig(quant_type=Int4WeightOnlyConfig(group_size=128, use_hqq=True, version=1))
quantization_config = TorchAoConfig(quant_type=Float8DynamicActivationFloat8WeightConfig(granularity=PerRow()))

# Load and automatically quantize
quantized_model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-4-mini-instruct",
    "Qwen/Qwen3-32B",
    dtype="auto",
    device_map="auto",
    quantization_config=quantization_config
)
```
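
As a hedged follow-up to Option 2, the quantized model behaves like any other transformers model, so you can generate with it directly and, if you plan to serve it later, save or push the quantized weights. The repo name below is a placeholder, and `safe_serialization=False` is an assumption for torchao tensor subclasses:

```python
# Continues the Option 2 snippet above: run generation with quantized_model,
# then optionally publish the quantized checkpoint for later serving.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-32B")
inputs = tokenizer("Give me a short introduction to large language models.",
                   return_tensors="pt").to(quantized_model.device)
output = quantized_model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))

# Optional: push the quantized weights to the Hub (repo name is hypothetical)
# quantized_model.push_to_hub("your-org/Qwen3-32B-FP8", safe_serialization=False)
# tokenizer.push_to_hub("your-org/Qwen3-32B-FP8")
```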

#### Deploy quantized models in vLLM with one command:
If the flow above does not work for your model, an alternative is the `quantize_` API described in the [quick start guide](https://docs.pytorch.org/ao/main/quick_start.html).

Serving with vLLM on a 1xH100 machine:
```shell
# Server
VLLM_DISABLE_COMPILE_CACHE=1 vllm serve pytorch/Qwen3-32B-FP8 --tokenizer Qwen/Qwen3-32B -O3
```

```shell
vllm serve pytorch/Phi-4-mini-instruct-int4wo-hqq --tokenizer microsoft/Phi-4-mini-instruct -O3
# Client
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "pytorch/Qwen3-32B-FP8",
  "messages": [
    {"role": "user", "content": "Give me a short introduction to large language models."}
  ],
  "temperature": 0.6,
  "top_p": 0.95,
  "top_k": 20,
  "max_tokens": 32768
}'
```
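
Since vLLM exposes an OpenAI-compatible endpoint, the same request can be made from Python. This is a hedged sketch using the `openai` client package; the sampling parameters follow the curl example except for a shorter `max_tokens` for a quick test, and `top_k` is passed through `extra_body` because it is a vLLM extension:

```python
# Sketch: query the vLLM server started above via its OpenAI-compatible API.
# Requires `pip install openai`; the api_key value is a dummy for local serving.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="pytorch/Qwen3-32B-FP8",
    messages=[
        {"role": "user", "content": "Give me a short introduction to large language models."}
    ],
    temperature=0.6,
    top_p=0.95,
    max_tokens=1024,
    extra_body={"top_k": 20},  # vLLM-specific sampling knob
)
print(response.choices[0].message.content)
```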

With this quantization flow, we achieve **67% VRAM reduction and 12-20% speedup** on A100 GPUs while maintaining model quality. For more detail, see this [step-by-step quantization guide](https://huggingface.co/pytorch/Phi-4-mini-instruct-int4wo-hqq#quantization-recipe). We also release some pre-quantized models [here](https://huggingface.co/pytorch).
We also support deployment to edge devices through ExecuTorch; for more detail, see the [quantization and serving guide](https://docs.pytorch.org/ao/main/serving.html). We also release pre-quantized models [here](https://huggingface.co/pytorch).

## 🚅 Training
Review comment: removing these since toy model memory/latency is not meaningful, to make our README shorter