diff --git a/README.md b/README.md
index 248193d415..3b410d48dc 100644
--- a/README.md
+++ b/README.md
@@ -26,6 +26,7 @@
- [Oct 20] MXFP8 MoE training prototype achieved **~1.45x speedup** for MoE layer in Llama4 Scout, and **~1.25x** speedup for MoE layer in DeepSeekV3 671b - with comparable numerics to bfloat16! Check out the [docs](./torchao/prototype/moe_training/) to try it out.
- [Sept 25] MXFP8 training achieved [1.28x speedup on Crusoe B200 cluster](https://pytorch.org/blog/accelerating-2k-scale-pre-training-up-to-1-28x-with-torchao-mxfp8-and-torchtitan-on-crusoe-b200-cluster/) with virtually identical loss curve to bfloat16!
+- [Sept 19] [TorchAO Quantized Model and Quantization Recipes Now Available on Huggingface Hub](https://pytorch.org/blog/torchao-quantized-models-and-quantization-recipes-now-available-on-huggingface-hub/)!
- [Jun 25] Our [TorchAO paper](https://openreview.net/attachment?id=HpqH0JakHf&name=pdf) was accepted to CodeML @ ICML 2025!
- [May 25] QAT is now integrated into [Axolotl](https://github.com/axolotl-ai-cloud/axolotl) for fine-tuning ([docs](https://docs.axolotl.ai/docs/qat.html))!
- [Apr 25] Float8 rowwise training yielded [1.34-1.43x training speedup](https://pytorch.org/blog/accelerating-large-scale-training-and-convergence-with-pytorch-float8-rowwise-on-crusoe-2k-h200s/) at 2k H100 GPU scale
@@ -59,13 +60,6 @@ TorchAO is an easy to use quantization library for native PyTorch. TorchAO works
Check out our [docs](https://docs.pytorch.org/ao/main/) for more details!
-From the team that brought you the fast series:
-* 9.5x inference speedups for Image segmentation models with [sam-fast](https://pytorch.org/blog/accelerating-generative-ai)
-* 10x inference speedups for Language models with [gpt-fast](https://pytorch.org/blog/accelerating-generative-ai-2)
-* 3x inference speedup for Diffusion models with [sd-fast](https://pytorch.org/blog/accelerating-generative-ai-3) (new: [flux-fast](https://pytorch.org/blog/presenting-flux-fast-making-flux-go-brrr-on-h100s/))
-* 2.7x inference speedup for FAIR’s Seamless M4T-v2 model with [seamlessv2-fast](https://pytorch.org/blog/accelerating-generative-ai-4/)
-
-
## 🚀 Quick Start
First, install TorchAO. We recommend installing the latest stable version:
@@ -76,20 +70,9 @@ pip install torchao
Quantize your model weights to int4!
```python
from torchao.quantization import Int4WeightOnlyConfig, quantize_
-quantize_(model, Int4WeightOnlyConfig(group_size=32, version=1))
-```
-Compared to a `torch.compiled` bf16 baseline, your quantized model should be significantly smaller and faster on a single A100 GPU:
-```bash
-int4 model size: 1.25 MB
-bfloat16 model size: 4.00 MB
-compression ratio: 3.2
-
-bf16 mean time: 30.393 ms
-int4 mean time: 4.410 ms
-speedup: 6.9x
+quantize_(model, Int4WeightOnlyConfig(group_size=32, int4_packing_format="tile_packed_to_4d", int4_choose_qparams_algorithm="hqq"))
```
-See our [quick start guide](https://docs.pytorch.org/ao/stable/quick_start.html) for more details. Alternatively, try quantizing your favorite model using our [HuggingFace space](https://huggingface.co/spaces/pytorch/torchao-my-repo)!
-
+See our [quick start guide](https://docs.pytorch.org/ao/stable/quick_start.html) for more details.
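+
+To check the memory savings yourself, one option is to compare serialized checkpoint sizes before and after calling `quantize_`. Here is a minimal, hedged sketch (illustrative only, not the script behind any published numbers):
+```python
+import io
+import torch
+
+def checkpoint_mb(m: torch.nn.Module) -> float:
+    # torchao quantized weights are tensor subclasses and serialize via torch.save
+    buf = io.BytesIO()
+    torch.save(m.state_dict(), buf)
+    return buf.getbuffer().nbytes / 1e6
+```
+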
## 🛠 Installation
@@ -103,16 +86,18 @@ pip install torchao
```
# Nightly
- pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126
+ pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu128
# Different CUDA versions
pip install torchao --index-url https://download.pytorch.org/whl/cu126 # CUDA 12.6
+ pip install torchao --index-url https://download.pytorch.org/whl/cu129 # CUDA 12.9
pip install torchao --index-url https://download.pytorch.org/whl/cpu # CPU only
# For developers
USE_CUDA=1 python setup.py develop
USE_CPP=0 python setup.py develop
```
+
Please see the [torchao compatibility table](https://github.com/pytorch/ao/issues/2919) for dependency version requirements.
@@ -123,57 +108,64 @@ TorchAO is integrated into some of the leading open-source libraries including:
* HuggingFace transformers with a [builtin inference backend](https://huggingface.co/docs/transformers/main/quantization/torchao) and [low bit optimizers](https://github.com/huggingface/transformers/pull/31865)
* HuggingFace diffusers best practices with `torch.compile` and TorchAO in a standalone repo [diffusers-torchao](https://github.com/huggingface/diffusers/blob/main/docs/source/en/quantization/torchao.md)
+* vLLM for LLM serving: [usage](https://docs.vllm.ai/en/latest/features/quantization/torchao.html), [detailed docs](https://docs.pytorch.org/ao/main/torchao_vllm_integration.html)
+* Integration with [FBGEMM](https://github.com/pytorch/FBGEMM/tree/main/fbgemm_gpu/experimental/gen_ai) for SOTA kernels on server GPUs
+* Integration with [ExecuTorch](https://github.com/pytorch/executorch/) for edge device deployment
+* Axolotl for [QAT](https://docs.axolotl.ai/docs/qat.html) and [PTQ](https://docs.axolotl.ai/docs/quantize.html)
+* TorchTitan for [float8 pre-training](https://github.com/pytorch/torchtitan/blob/main/docs/float8.md)
* HuggingFace PEFT for LoRA using TorchAO as their [quantization backend](https://huggingface.co/docs/peft/en/developer_guides/quantization#torchao-pytorch-architecture-optimization)
-* Mobius HQQ backend leveraged our int4 kernels to get [195 tok/s on a 4090](https://github.com/mobiusml/hqq#faster-inference)
* TorchTune for our NF4 [QLoRA](https://docs.pytorch.org/torchtune/main/tutorials/qlora_finetune.html), [QAT](https://docs.pytorch.org/torchtune/main/recipes/qat_distributed.html), and [float8 quantized fine-tuning](https://github.com/pytorch/torchtune/pull/2546) recipes
-* TorchTitan for [float8 pre-training](https://github.com/pytorch/torchtitan/blob/main/docs/float8.md)
-* VLLM for LLM serving: [usage](https://docs.vllm.ai/en/latest/features/quantization/torchao.html), [detailed docs](https://docs.pytorch.org/ao/main/torchao_vllm_integration.html)
-* SGLang for LLM serving: [usage](https://docs.sglang.ai/backend/server_arguments.html#server-arguments) and the major [PR](https://github.com/sgl-project/sglang/pull/1341).
-* Axolotl for [QAT](https://docs.axolotl.ai/docs/qat.html) and [PTQ](https://docs.axolotl.ai/docs/quantize.html)
-
+* SGLang for LLM serving: [usage](https://docs.sglang.ai/advanced_features/quantization.html#online-quantization)
## 🔎 Inference
TorchAO delivers substantial performance gains with minimal code changes:
-- **Int4 weight-only**: [1.89x throughput with 58.1% less memory](torchao/quantization/README.md) on Llama-3-8B
-- **Float8 dynamic quantization**: [1.54x and 1.27x speedup on Flux.1-Dev* and CogVideoX-5b respectively](https://github.com/sayakpaul/diffusers-torchao) on H100 with preserved quality
+- **Int4 weight-only**: [1.73x speedup with 65% less memory](https://huggingface.co/pytorch/gemma-3-12b-it-INT4) for Gemma3-12b-it on H100, with a slight impact on accuracy
+- **Float8 dynamic quantization**: [1.5-1.6x speedup on gemma-3-27b-it](https://huggingface.co/pytorch/gemma-3-27b-it-FP8/blob/main/README.md#results-h100-machine) and [1.54x and 1.27x speedup on Flux.1-Dev* and CogVideoX-5b respectively](https://github.com/sayakpaul/diffusers-torchao) on H100 with preserved quality
+- **Int8 activation + int4 weight quantization**: Quantized Qwen3-4B runs at 14.8 tokens/s with 3379 MB of memory on an iPhone 15 Pro through [ExecuTorch](https://huggingface.co/pytorch/Qwen3-4B-INT8-INT4#running-in-a-mobile-app)
- **Int4 + 2:4 Sparsity**: [2.37x throughput with 67.7% memory reduction](torchao/sparsity/README.md) on Llama-3-8B
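+
+For reference, the recipes above map roughly onto torchao configs as follows (an illustrative sketch added here with default-ish parameters; the exact recipes used for each benchmark are in the linked model cards and READMEs):
+```python
+from torchao.quantization import (
+    Float8DynamicActivationFloat8WeightConfig,
+    Int4WeightOnlyConfig,
+    Int8DynamicActivationInt4WeightConfig,
+    PerRow,
+    quantize_,
+)
+
+int4_weight_only = Int4WeightOnlyConfig(group_size=128)                            # int4 weight-only
+float8_dynamic = Float8DynamicActivationFloat8WeightConfig(granularity=PerRow())   # float8 dynamic activation + weight
+int8_act_int4_weight = Int8DynamicActivationInt4WeightConfig()                     # int8 activation + int4 weight
+
+# Apply any of these in place, e.g.:
+# quantize_(model, float8_dynamic)
+```
+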
-Quantize any model with `nn.Linear` layers in just one line (Option 1), or load the quantized model directly from HuggingFace using our integration with HuggingFace transformers (Option 2):
-
-#### Option 1: Direct TorchAO API
-
-```python
-from torchao.quantization.quant_api import quantize_, Int4WeightOnlyConfig
-quantize_(model, Int4WeightOnlyConfig(group_size=128, use_hqq=True, version=1))
-```
-
-#### Option 2: HuggingFace Integration
-
+The following is our recommended flow for quantization and deployment:
```python
from transformers import TorchAoConfig, AutoModelForCausalLM
-from torchao.quantization.quant_api import Int4WeightOnlyConfig
+from torchao.quantization import Float8DynamicActivationFloat8WeightConfig, PerRow
# Create quantization configuration
-quantization_config = TorchAoConfig(quant_type=Int4WeightOnlyConfig(group_size=128, use_hqq=True, version=1))
+quantization_config = TorchAoConfig(quant_type=Float8DynamicActivationFloat8WeightConfig(granularity=PerRow()))
# Load and automatically quantize
quantized_model = AutoModelForCausalLM.from_pretrained(
- "microsoft/Phi-4-mini-instruct",
+ "Qwen/Qwen3-32B",
dtype="auto",
device_map="auto",
quantization_config=quantization_config
)
```
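+
+To reuse the quantized checkpoint later (for example with the vLLM command below), you can push it to the Hugging Face Hub. A hedged sketch, where `"your-username/Qwen3-32B-FP8"` is a placeholder repo id to replace with your own; torchao checkpoints are saved with regular PyTorch serialization rather than safetensors, hence `safe_serialization=False`:
+```python
+from transformers import AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-32B")
+
+# placeholder repo id -- replace with your own
+quantized_model.push_to_hub("your-username/Qwen3-32B-FP8", safe_serialization=False)
+tokenizer.push_to_hub("your-username/Qwen3-32B-FP8")
+```
+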
-#### Deploy quantized models in vLLM with one command:
+If the flow above doesn't work for your model, you can fall back to the `quantize_` API described in the [quick start guide](https://docs.pytorch.org/ao/main/quick_start.html).
+
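+For example, applying the same float8 recipe directly with `quantize_` (a minimal sketch; `model` is any `torch.nn.Module` containing `nn.Linear` layers):
+```python
+from torchao.quantization import Float8DynamicActivationFloat8WeightConfig, PerRow, quantize_
+
+# quantizes the model in place: float8 weights with rowwise scales,
+# activations quantized dynamically at runtime
+quantize_(model, Float8DynamicActivationFloat8WeightConfig(granularity=PerRow()))
+```
+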
+Serving with vLLM on a 1xH100 machine:
+```shell
+# Server
+VLLM_DISABLE_COMPILE_CACHE=1 vllm serve pytorch/Qwen3-32B-FP8 --tokenizer Qwen/Qwen3-32B -O3
+```
```shell
-vllm serve pytorch/Phi-4-mini-instruct-int4wo-hqq --tokenizer microsoft/Phi-4-mini-instruct -O3
+# Client
+curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
+ "model": "pytorch/Qwen3-32B-FP8",
+ "messages": [
+ {"role": "user", "content": "Give me a short introduction to large language models."}
+ ],
+ "temperature": 0.6,
+ "top_p": 0.95,
+ "top_k": 20,
+ "max_tokens": 32768
+}'
```
-With this quantization flow, we achieve **67% VRAM reduction and 12-20% speedup** on A100 GPUs while maintaining model quality. For more detail, see this [step-by-step quantization guide](https://huggingface.co/pytorch/Phi-4-mini-instruct-int4wo-hqq#quantization-recipe). We also release some pre-quantized models [here](https://huggingface.co/pytorch).
+We also support deployment to edge devices through ExecuTorch; for more details, see the [quantization and serving guide](https://docs.pytorch.org/ao/main/serving.html). Pre-quantized models are released [here](https://huggingface.co/pytorch).
## 🚅 Training
diff --git a/docs/source/api_ref_quantization.rst b/docs/source/api_ref_quantization.rst
index c163a4b06a..d5a041b504 100644
--- a/docs/source/api_ref_quantization.rst
+++ b/docs/source/api_ref_quantization.rst
@@ -14,7 +14,6 @@ Main Quantization APIs
:nosignatures:
quantize_
- autoquant
Inference APIs for quantize\_
-------------------------------
@@ -27,13 +26,9 @@ Inference APIs for quantize\_
Float8DynamicActivationInt4WeightConfig
Float8DynamicActivationFloat8WeightConfig
Float8WeightOnlyConfig
- Float8StaticActivationFloat8WeightConfig
Int8DynamicActivationInt4WeightConfig
- GemliteUIntXWeightOnlyConfig
Int8WeightOnlyConfig
Int8DynamicActivationInt8WeightConfig
- UIntXWeightOnlyConfig
- FPXWeightOnlyConfig
.. currentmodule:: torchao.quantization
@@ -51,19 +46,4 @@ Quantization Primitives
safe_int_mm
int_scaled_matmul
MappingType
- ZeroPointDomain
TorchAODType
-
-..
- TODO: delete these?
-
-Other
------
-
-.. autosummary::
- :toctree: generated/
- :nosignatures:
-
- to_linear_activation_quantized
- swap_linear_with_smooth_fq_linear
- smooth_fq_linear_to_inference
diff --git a/docs/source/quick_start.rst b/docs/source/quick_start.rst
index 52947b7622..e08a95194b 100644
--- a/docs/source/quick_start.rst
+++ b/docs/source/quick_start.rst
@@ -2,20 +2,8 @@ Quick Start Guide
-----------------
In this quick start guide, we will explore how to perform basic quantization using torchao.
-First, install the latest stable torchao release::
-
- pip install torchao
-
-If you prefer to use the nightly release, you can install torchao using the following
-command instead::
-
- pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu121
-
-torchao is compatible with the latest 3 major versions of PyTorch, which you will also
-need to install (`detailed instructions `__)::
-
- pip install torch
+Follow the `torchao installation and compatibility guide `__ to install torchao and a compatible PyTorch.
First Quantization Example
==========================
@@ -55,9 +43,8 @@ for efficient mixed dtype matrix multiplication:
.. code:: py
- # torch 2.4+ only
from torchao.quantization import Int4WeightOnlyConfig, quantize_
- quantize_(model, Int4WeightOnlyConfig(group_size=32, version=1))
+ quantize_(model, Int4WeightOnlyConfig(group_size=32, int4_packing_format="tile_packed_to_4d", int4_choose_qparams_algorithm="hqq"))
The quantized model is now ready to use! Note that the quantization
logic is inserted through tensor subclasses, so there is no change
diff --git a/docs/source/serving.rst b/docs/source/serving.rst
index d95132ded7..f97eb01bac 100644
--- a/docs/source/serving.rst
+++ b/docs/source/serving.rst
@@ -15,7 +15,7 @@ Post-training Quantization with HuggingFace
-------------------------------------------
HuggingFace Transformers provides seamless integration with torchao quantization. The ``TorchAoConfig`` automatically applies torchao's optimized quantization algorithms during model loading.
-Please check out our `HF Integration Docs `_ for examples on how to use quantization and sparsity in Transformers and Diffusers.
+Please check out our `HF Integration Docs `_ for examples of how to use quantization and sparsity in Transformers and Diffusers, and the `TorchAOConfig Reference `_ for all available torchao configs.
Serving and Inference
--------------------
@@ -29,19 +29,19 @@ First, install vLLM with torchao support:
.. code-block:: bash
- pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
- pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126
+ pip install vllm --pre --extra-index-url https://download.pytorch.org/whl/nightly/vllm/
+ pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu128
To serve in vLLM, we're using the model we quantized and pushed to Hugging Face hub in the previous step :ref:`Post-training Quantization with HuggingFace`.
.. code-block:: bash
# Server
- vllm serve pytorch/Phi-4-mini-instruct-float8dq --tokenizer microsoft/Phi-4-mini-instruct -O3
+ vllm serve pytorch/Phi-4-mini-instruct-FP8 --tokenizer microsoft/Phi-4-mini-instruct -O3
# Client
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
- "model": "pytorch/Phi-4-mini-instruct-float8dq",
+ "model": "pytorch/Phi-4-mini-instruct-FP8",
"messages": [
{"role": "user", "content": "Give me a short introduction to large language models."}
],
@@ -271,8 +271,8 @@ Evaluate quantized models using lm-evaluation-harness:
# Evaluate baseline model
lm_eval --model hf --model_args pretrained=microsoft/Phi-4-mini-instruct --tasks hellaswag --device cuda:0 --batch_size 8
- # Evaluate torchao-quantized model (float8dq)
- lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-float8dq --tasks hellaswag --device cuda:0 --batch_size 8
+ # Evaluate torchao-quantized model (FP8)
+ lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-FP8 --tasks hellaswag --device cuda:0 --batch_size 8
Memory Benchmarking
^^^^^^^^^^^^^^^^^
@@ -283,8 +283,8 @@ For Phi-4-mini-instruct, when quantized with float8 dynamic quant, we can reduce
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
- # use "microsoft/Phi-4-mini-instruct" or "pytorch/Phi-4-mini-instruct-float8dq"
- model_id = "pytorch/Phi-4-mini-instruct-float8dq"
+ # use "microsoft/Phi-4-mini-instruct" or "pytorch/Phi-4-mini-instruct-FP8"
+ model_id = "pytorch/Phi-4-mini-instruct-FP8"
quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_id)
@@ -328,7 +328,7 @@ Output:
Peak Memory Usage: 5.70 GB
+-------------------+---------------------+------------------------------+
-| Benchmark         | Phi-4 mini-instruct | Phi-4-mini-instruct-float8dq |
+| Benchmark         | Phi-4 mini-instruct | Phi-4-mini-instruct-FP8      |
+===================+=====================+==============================+
| Peak Memory (GB)  | 8.91                | 5.70 (36% reduction)         |
+-------------------+---------------------+------------------------------+
@@ -342,10 +342,10 @@ Latency Benchmarking
.. code-block:: bash
# baseline
- python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model microsoft/Phi-4-mini-instruct --batch-size 1
+ vllm bench latency --input-len 256 --output-len 256 --model microsoft/Phi-4-mini-instruct --batch-size 1
- # float8dq
- VLLM_DISABLE_COMPILE_CACHE=1 python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model pytorch/Phi-4-mini-instruct-float8dq --batch-size 1
+ # FP8
+ VLLM_DISABLE_COMPILE_CACHE=1 vllm bench latency --input-len 256 --output-len 256 --model pytorch/Phi-4-mini-instruct-FP8 --batch-size 1
Serving Benchmarking
"""""""""""""""""""""
@@ -372,13 +372,13 @@ We benchmarked the throughput in a serving environment.
# Server:
vllm serve microsoft/Phi-4-mini-instruct --tokenizer microsoft/Phi-4-mini-instruct -O3
# Client:
- python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model microsoft/Phi-4-mini-instruct --num-prompts 1
+ vllm bench serve --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model microsoft/Phi-4-mini-instruct --num-prompts 1
- # For float8dq
+ # For FP8
# Server:
- VLLM_DISABLE_COMPILE_CACHE=1 vllm serve pytorch/Phi-4-mini-instruct-float8dq --tokenizer microsoft/Phi-4-mini-instruct -O3
+ VLLM_DISABLE_COMPILE_CACHE=1 vllm serve pytorch/Phi-4-mini-instruct-FP8 --tokenizer microsoft/Phi-4-mini-instruct -O3
# Client:
- python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model pytorch/Phi-4-mini-instruct-float8dq --num-prompts 1
+ vllm bench serve --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model pytorch/Phi-4-mini-instruct-FP8 --num-prompts 1
Results (H100 machine)
"""""""""""""""""""""