83 changes: 37 additions & 46 deletions README.md
@@ -24,6 +24,7 @@

## 📣 Latest News

- [Sept 19] [TorchAO Quantized Model and Quantization Recipes Now Available on Huggingface Hub](https://pytorch.org/blog/torchao-quantized-models-and-quantization-recipes-now-available-on-huggingface-hub/)!
- [Jun 25] Our [TorchAO paper](https://openreview.net/attachment?id=HpqH0JakHf&name=pdf) was accepted to CodeML @ ICML 2025!
- [May 25] QAT is now integrated into [Axolotl](https://github.com/axolotl-ai-cloud/axolotl) for fine-tuning ([docs](https://docs.axolotl.ai/docs/qat.html))!
- [Apr 25] Float8 rowwise training yielded [1.34-1.43x training speedup](https://pytorch.org/blog/accelerating-large-scale-training-and-convergence-with-pytorch-float8-rowwise-on-crusoe-2k-h200s/) at 2k H100 GPU scale
@@ -56,13 +57,6 @@ TorchAO is an easy to use quantization library for native PyTorch. TorchAO works

Check out our [docs](https://docs.pytorch.org/ao/main/) for more details!

From the team that brought you the fast series:
* 9.5x inference speedups for Image segmentation models with [sam-fast](https://pytorch.org/blog/accelerating-generative-ai)
* 10x inference speedups for Language models with [gpt-fast](https://pytorch.org/blog/accelerating-generative-ai-2)
* 3x inference speedup for Diffusion models with [sd-fast](https://pytorch.org/blog/accelerating-generative-ai-3) (new: [flux-fast](https://pytorch.org/blog/presenting-flux-fast-making-flux-go-brrr-on-h100s/))
* 2.7x inference speedup for FAIR’s Seamless M4T-v2 model with [seamlessv2-fast](https://pytorch.org/blog/accelerating-generative-ai-4/)


## 🚀 Quick Start

First, install TorchAO. We recommend installing the latest stable version:
@@ -73,20 +67,9 @@ pip install torchao
Quantize your model weights to int4!
```python
from torchao.quantization import Int4WeightOnlyConfig, quantize_
quantize_(model, Int4WeightOnlyConfig(group_size=32, version=1))
```
Compared to a `torch.compiled` bf16 baseline, your quantized model should be significantly smaller and faster on a single A100 GPU:
@jerryzh168 (Contributor, PR author) commented on Oct 17, 2025:
removing these since toy model memory/latency is not meaningful, to make our README shorter

```bash
int4 model size: 1.25 MB
bfloat16 model size: 4.00 MB
compression ratio: 3.2

bf16 mean time: 30.393 ms
int4 mean time: 4.410 ms
speedup: 6.9x
quantize_(model, Int4WeightOnlyConfig(group_size=32, int4_packing_format="tile_packed_to_4d", int4_choose_qparams_algorithm="hqq"))
```
See our [quick start guide](https://docs.pytorch.org/ao/stable/quick_start.html) for more details. Alternatively, try quantizing your favorite model using our [HuggingFace space](https://huggingface.co/spaces/pytorch/torchao-my-repo)!

See our [quick start guide](https://docs.pytorch.org/ao/stable/quick_start.html) for more details.
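
To try this end to end, here is a minimal self-contained sketch (the toy model, tensor shapes, and the CUDA/bfloat16 setup are illustrative assumptions, not part of the snippet above):
```python
import torch
from torchao.quantization import Int4WeightOnlyConfig, quantize_

# Illustrative toy model; the int4 tile-packed kernels assume a CUDA GPU and bf16 weights
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 1024),
).to(device="cuda", dtype=torch.bfloat16)

quantize_(
    model,
    Int4WeightOnlyConfig(
        group_size=32,
        int4_packing_format="tile_packed_to_4d",
        int4_choose_qparams_algorithm="hqq",
    ),
)

# Run inference; torch.compile is optional but typically needed for best performance
x = torch.randn(16, 1024, device="cuda", dtype=torch.bfloat16)
model = torch.compile(model)
out = model(x)
```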

## 🛠 Installation

@@ -103,13 +86,14 @@ pip install torchao
pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126

# Different CUDA versions
pip install torchao --index-url https://download.pytorch.org/whl/cu126 # CUDA 12.6
pip install torchao --index-url https://download.pytorch.org/whl/cu128 # CUDA 12.8
pip install torchao --index-url https://download.pytorch.org/whl/cpu # CPU only

# For developers
USE_CUDA=1 python setup.py develop
USE_CPP=0 python setup.py develop
```

</details>

Please see the [torchao compatibility table](https://github.com/pytorch/ao/issues/2919) for dependency version requirements.
@@ -120,57 +104,64 @@ TorchAO is integrated into some of the leading open-source libraries including:

* HuggingFace transformers with a [builtin inference backend](https://huggingface.co/docs/transformers/main/quantization/torchao) and [low bit optimizers](https://github.com/huggingface/transformers/pull/31865)
@jerryzh168 (Contributor, PR author) commented:
reordered a bit to put more commonly used ones earlier

* HuggingFace diffusers best practices with `torch.compile` and TorchAO in a standalone repo [diffusers-torchao](https://github.com/huggingface/diffusers/blob/main/docs/source/en/quantization/torchao.md)
* vLLM for LLM serving: [usage](https://docs.vllm.ai/en/latest/features/quantization/torchao.html), [detailed docs](https://docs.pytorch.org/ao/main/torchao_vllm_integration.html)
* Integration with [FBGEMM](https://github.com/pytorch/FBGEMM/tree/main/fbgemm_gpu/experimental/gen_ai) for SOTA kernels on server GPUs
* Integration with [ExecuTorch](https://github.com/pytorch/executorch/) for edge device deployment
* Axolotl for [QAT](https://docs.axolotl.ai/docs/qat.html) and [PTQ](https://docs.axolotl.ai/docs/quantize.html)
* TorchTitan for [float8 pre-training](https://github.com/pytorch/torchtitan/blob/main/docs/float8.md)
* HuggingFace PEFT for LoRA using TorchAO as their [quantization backend](https://huggingface.co/docs/peft/en/developer_guides/quantization#torchao-pytorch-architecture-optimization)
* Mobius HQQ backend leveraged our int4 kernels to get [195 tok/s on a 4090](https://github.com/mobiusml/hqq#faster-inference)
* TorchTune for our NF4 [QLoRA](https://docs.pytorch.org/torchtune/main/tutorials/qlora_finetune.html), [QAT](https://docs.pytorch.org/torchtune/main/recipes/qat_distributed.html), and [float8 quantized fine-tuning](https://github.com/pytorch/torchtune/pull/2546) recipes
* TorchTitan for [float8 pre-training](https://github.com/pytorch/torchtitan/blob/main/docs/float8.md)
* VLLM for LLM serving: [usage](https://docs.vllm.ai/en/latest/features/quantization/torchao.html), [detailed docs](https://docs.pytorch.org/ao/main/torchao_vllm_integration.html)
* SGLang for LLM serving: [usage](https://docs.sglang.ai/backend/server_arguments.html#server-arguments) and the major [PR](https://github.com/sgl-project/sglang/pull/1341).
* Axolotl for [QAT](https://docs.axolotl.ai/docs/qat.html) and [PTQ](https://docs.axolotl.ai/docs/quantize.html)

* SGLang for LLM serving: [usage](https://docs.sglang.ai/advanced_features/quantization.html#online-quantization)

## 🔎 Inference

TorchAO delivers substantial performance gains with minimal code changes:

- **Int4 weight-only**: [1.89x throughput with 58.1% less memory](torchao/quantization/README.md) on Llama-3-8B
- **Float8 dynamic quantization**: [1.54x and 1.27x speedup on Flux.1-Dev* and CogVideoX-5b respectively](https://github.com/sayakpaul/diffusers-torchao) on H100 with preserved quality
- **Int4 weight-only**: [1.73x speedup with 65% less memory](https://huggingface.co/pytorch/gemma-3-12b-it-INT4) for Gemma3-12b-it on H100 with slight impact on accuracy
- **Float8 dynamic quantization**: [1.5-1.6x speedup on gemma-3-27b-it](https://huggingface.co/pytorch/gemma-3-27b-it-FP8/blob/main/README.md#results-h100-machine) and [1.54x and 1.27x speedup on Flux.1-Dev* and CogVideoX-5b respectively](https://github.com/sayakpaul/diffusers-torchao) on H100 with preserved quality
- **Int8 activation quantization and int4 weight quantization**: Quantized Qwen3-4B runs at 14.8 tokens/s with 3379 MB memory usage on iPhone 15 Pro through [ExecuTorch](https://huggingface.co/pytorch/Qwen3-4B-INT8-INT4#running-in-a-mobile-app)
- **Int4 + 2:4 Sparsity**: [2.37x throughput with 67.7% memory reduction](torchao/sparsity/README.md) on Llama-3-8B

Quantize any model with `nn.Linear` layers in just one line (Option 1), or load the quantized model directly from HuggingFace using our integration with HuggingFace transformers (Option 2):

#### Option 1: Direct TorchAO API

```python
from torchao.quantization.quant_api import quantize_, Int4WeightOnlyConfig
quantize_(model, Int4WeightOnlyConfig(group_size=128, use_hqq=True, version=1))
```

#### Option 2: HuggingFace Integration

The following is our recommended flow for quantization and deployment:
```python
from transformers import TorchAoConfig, AutoModelForCausalLM
from torchao.quantization.quant_api import Int4WeightOnlyConfig
from torchao.quantization import Float8DynamicActivationFloat8WeightConfig, PerRow

# Create quantization configuration
quantization_config = TorchAoConfig(quant_type=Int4WeightOnlyConfig(group_size=128, use_hqq=True, version=1))
quantization_config = TorchAoConfig(quant_type=Float8DynamicActivationFloat8WeightConfig(granularity=PerRow()))

# Load and automatically quantize
quantized_model = AutoModelForCausalLM.from_pretrained(
"microsoft/Phi-4-mini-instruct",
"Qwen/Qwen3-32B",
dtype="auto",
device_map="auto",
quantization_config=quantization_config
)
```
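
Once loaded, the quantized model can be used like any other transformers model. A minimal generation sketch (the prompt and generation settings here are illustrative assumptions):
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-32B")
messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(quantized_model.device)

# Generate and decode only the newly produced tokens
outputs = quantized_model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```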

#### Deploy quantized models in vLLM with one command:
If the above doesn't work for your model, an alternative is the `quantize_` API described in the [quick start guide](https://docs.pytorch.org/ao/main/quick_start.html).

Serving with vLLM on a 1xH100 machine:
```shell
# Server
VLLM_DISABLE_COMPILE_CACHE=1 vllm serve pytorch/Qwen3-32B-FP8 --tokenizer Qwen/Qwen3-32B -O3
```

```shell
vllm serve pytorch/Phi-4-mini-instruct-int4wo-hqq --tokenizer microsoft/Phi-4-mini-instruct -O3
# Client
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "pytorch/Qwen3-32B-FP8",
"messages": [
{"role": "user", "content": "Give me a short introduction to large language models."}
],
"temperature": 0.6,
"top_p": 0.95,
"top_k": 20,
"max_tokens": 32768
}'
```
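
Equivalently, the running server can be queried from Python with any OpenAI-compatible client. A minimal sketch using the `openai` package (the client settings and token limit are assumptions):
```python
from openai import OpenAI

# vLLM exposes an OpenAI-compatible endpoint; the api_key is required by the client but unused
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="pytorch/Qwen3-32B-FP8",
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
    temperature=0.6,
    top_p=0.95,
    max_tokens=1024,
)
print(response.choices[0].message.content)
```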

With this quantization flow, we achieve **67% VRAM reduction and 12-20% speedup** on A100 GPUs while maintaining model quality. For more detail, see this [step-by-step quantization guide](https://huggingface.co/pytorch/Phi-4-mini-instruct-int4wo-hqq#quantization-recipe). We also release some pre-quantized models [here](https://huggingface.co/pytorch).
We also support deployment to edge devices through ExecuTorch; for more details, see the [quantization and serving guide](https://docs.pytorch.org/ao/main/serving.html). We also release pre-quantized models [here](https://huggingface.co/pytorch).

## 🚅 Training

20 changes: 0 additions & 20 deletions docs/source/api_ref_quantization.rst
@@ -14,7 +14,6 @@ Main Quantization APIs
:nosignatures:

quantize_
autoquant

Inference APIs for quantize\_
-------------------------------
@@ -27,13 +26,9 @@ Inference APIs for quantize\_
Float8DynamicActivationInt4WeightConfig
Float8DynamicActivationFloat8WeightConfig
Float8WeightOnlyConfig
Float8StaticActivationFloat8WeightConfig
Int8DynamicActivationInt4WeightConfig
GemliteUIntXWeightOnlyConfig
Int8WeightOnlyConfig
Int8DynamicActivationInt8WeightConfig
UIntXWeightOnlyConfig
FPXWeightOnlyConfig

.. currentmodule:: torchao.quantization

@@ -51,19 +46,4 @@ Quantization Primitives
safe_int_mm
int_scaled_matmul
MappingType
ZeroPointDomain
TorchAODType

..
TODO: delete these?

Other
-----

.. autosummary::
:toctree: generated/
:nosignatures:

to_linear_activation_quantized
swap_linear_with_smooth_fq_linear
smooth_fq_linear_to_inference
5 changes: 2 additions & 3 deletions docs/source/quick_start.rst
@@ -9,7 +9,7 @@ First, install the latest stable torchao release::
If you prefer to use the nightly release, you can install torchao using the following
command instead::

pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu121
pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126

torchao is compatible with the latest 3 major versions of PyTorch, which you will also
need to install (`detailed instructions <https://pytorch.org/get-started/locally/>`__)::
@@ -55,9 +55,8 @@ for efficient mixed dtype matrix multiplication:

.. code:: py

# torch 2.4+ only
from torchao.quantization import Int4WeightOnlyConfig, quantize_
quantize_(model, Int4WeightOnlyConfig(group_size=32, version=1))
quantize_(model, Int4WeightOnlyConfig(group_size=32, int4_packing_format="tile_packed_to_4d", int4_choose_qparams_algorithm="hqq"))

The quantized model is now ready to use! Note that the quantization
logic is inserted through tensor subclasses, so there is no change
30 changes: 15 additions & 15 deletions docs/source/serving.rst
@@ -15,7 +15,7 @@ Post-training Quantization with HuggingFace
-------------------------------------------

HuggingFace Transformers provides seamless integration with torchao quantization. The ``TorchAoConfig`` automatically applies torchao's optimized quantization algorithms during model loading.
Please check out our `HF Integration Docs <torchao_hf_integration.html>`_ for examples on how to use quantization and sparsity in Transformers and Diffusers.
Please check out our `HF Integration Docs <torchao_hf_integration.html>`_ for examples of how to use quantization and sparsity in Transformers and Diffusers, and the `TorchAOConfig Reference <api_ref_quantization.html#inference-apis-for-quantize>`_ for all available torchao configs.
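
As a quick illustration, here is a minimal sketch of this flow (mirroring the README example; the base model and the float8 config choice are illustrative assumptions):

.. code-block:: python

    from transformers import TorchAoConfig, AutoModelForCausalLM
    from torchao.quantization import Float8DynamicActivationFloat8WeightConfig, PerRow

    # The torchao config is applied to the weights as the checkpoint is loaded
    quantization_config = TorchAoConfig(
        quant_type=Float8DynamicActivationFloat8WeightConfig(granularity=PerRow())
    )
    quantized_model = AutoModelForCausalLM.from_pretrained(
        "microsoft/Phi-4-mini-instruct",
        dtype="auto",
        device_map="auto",
        quantization_config=quantization_config,
    )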

Serving and Inference
--------------------
@@ -37,11 +37,11 @@ To serve in vLLM, we're using the model we quantized and pushed to Hugging Face
.. code-block:: bash

# Server
vllm serve pytorch/Phi-4-mini-instruct-float8dq --tokenizer microsoft/Phi-4-mini-instruct -O3
vllm serve pytorch/Phi-4-mini-instruct-FP8 --tokenizer microsoft/Phi-4-mini-instruct -O3

# Client
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "pytorch/Phi-4-mini-instruct-float8dq",
"model": "pytorch/Phi-4-mini-instruct-FP8",
"messages": [
{"role": "user", "content": "Give me a short introduction to large language models."}
],
@@ -271,8 +271,8 @@ Evaluate quantized models using lm-evaluation-harness:
# Evaluate baseline model
lm_eval --model hf --model_args pretrained=microsoft/Phi-4-mini-instruct --tasks hellaswag --device cuda:0 --batch_size 8

# Evaluate torchao-quantized model (float8dq)
lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-float8dq --tasks hellaswag --device cuda:0 --batch_size 8
# Evaluate torchao-quantized model (FP8)
lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-FP8 --tasks hellaswag --device cuda:0 --batch_size 8

Memory Benchmarking
^^^^^^^^^^^^^^^^^^^
@@ -283,8 +283,8 @@ For Phi-4-mini-instruct, when quantized with float8 dynamic quant, we can reduce
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# use "microsoft/Phi-4-mini-instruct" or "pytorch/Phi-4-mini-instruct-float8dq"
model_id = "pytorch/Phi-4-mini-instruct-float8dq"
# use "microsoft/Phi-4-mini-instruct" or "pytorch/Phi-4-mini-instruct-FP8"
model_id = "pytorch/Phi-4-mini-instruct-FP8"
quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

@@ -328,7 +328,7 @@ Output:
Peak Memory Usage: 5.70 GB

+-------------------+---------------------+------------------------------+
| Benchmark | Phi-4 mini-instruct | Phi-4-mini-instruct-float8dq |
| Benchmark | Phi-4 mini-instruct | Phi-4-mini-instruct-FP8 |
+===================+=====================+==============================+
| Peak Memory (GB) | 8.91 | 5.70 (36% reduction) |
+-------------------+---------------------+------------------------------+
@@ -342,10 +342,10 @@ Latency Benchmarking
.. code-block:: bash

# baseline
python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model microsoft/Phi-4-mini-instruct --batch-size 1
vllm bench latency --input-len 256 --output-len 256 --model microsoft/Phi-4-mini-instruct --batch-size 1

# float8dq
VLLM_DISABLE_COMPILE_CACHE=1 python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model pytorch/Phi-4-mini-instruct-float8dq --batch-size 1
# FP8
VLLM_DISABLE_COMPILE_CACHE=1 vllm bench latency --input-len 256 --output-len 256 --model pytorch/Phi-4-mini-instruct-FP8 --batch-size 1

Serving Benchmarking
"""""""""""""""""""""
@@ -372,13 +372,13 @@ We benchmarked the throughput in a serving environment.
# Server:
vllm serve microsoft/Phi-4-mini-instruct --tokenizer microsoft/Phi-4-mini-instruct -O3
# Client:
python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model microsoft/Phi-4-mini-instruct --num-prompts 1
vllm bench serve --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model microsoft/Phi-4-mini-instruct --num-prompts 1

# For float8dq
# For FP8
# Server:
VLLM_DISABLE_COMPILE_CACHE=1 vllm serve pytorch/Phi-4-mini-instruct-float8dq --tokenizer microsoft/Phi-4-mini-instruct -O3
VLLM_DISABLE_COMPILE_CACHE=1 vllm serve pytorch/Phi-4-mini-instruct-FP8 --tokenizer microsoft/Phi-4-mini-instruct -O3
# Client:
python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model pytorch/Phi-4-mini-instruct-float8dq --num-prompts 1
vllm bench serve --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model pytorch/Phi-4-mini-instruct-FP8 --num-prompts 1

Results (H100 machine)
"""""""""""""""""""""