From f7762ad1c52cbfa593b0a5d4076fe1ac6f3da88f Mon Sep 17 00:00:00 2001 From: Jerry Zhang Date: Fri, 17 Oct 2025 15:11:54 -0700 Subject: [PATCH] Update TorchAO README inference section before PTC Summary: att Test Plan: visual inspection Reviewers: Subscribers: Tasks: Tags: --- README.md | 83 +++++++++++++--------------- docs/source/api_ref_quantization.rst | 20 ------- docs/source/quick_start.rst | 5 +- docs/source/serving.rst | 30 +++++----- 4 files changed, 54 insertions(+), 84 deletions(-) diff --git a/README.md b/README.md index ad3e0b6f97..36325b6669 100644 --- a/README.md +++ b/README.md @@ -24,6 +24,7 @@ ## 📣 Latest News +- [Sept 19] [TorchAO Quantized Model and Quantization Recipes Now Available on Huggingface Hub](https://pytorch.org/blog/torchao-quantized-models-and-quantization-recipes-now-available-on-huggingface-hub/)! - [Jun 25] Our [TorchAO paper](https://openreview.net/attachment?id=HpqH0JakHf&name=pdf) was accepted to CodeML @ ICML 2025! - [May 25] QAT is now integrated into [Axolotl](https://github.com/axolotl-ai-cloud/axolotl) for fine-tuning ([docs](https://docs.axolotl.ai/docs/qat.html))! - [Apr 25] Float8 rowwise training yielded [1.34-1.43x training speedup](https://pytorch.org/blog/accelerating-large-scale-training-and-convergence-with-pytorch-float8-rowwise-on-crusoe-2k-h200s/) at 2k H100 GPU scale @@ -56,13 +57,6 @@ TorchAO is an easy to use quantization library for native PyTorch. TorchAO works Check out our [docs](https://docs.pytorch.org/ao/main/) for more details! -From the team that brought you the fast series: -* 9.5x inference speedups for Image segmentation models with [sam-fast](https://pytorch.org/blog/accelerating-generative-ai) -* 10x inference speedups for Language models with [gpt-fast](https://pytorch.org/blog/accelerating-generative-ai-2) -* 3x inference speedup for Diffusion models with [sd-fast](https://pytorch.org/blog/accelerating-generative-ai-3) (new: [flux-fast](https://pytorch.org/blog/presenting-flux-fast-making-flux-go-brrr-on-h100s/)) -* 2.7x inference speedup for FAIR’s Seamless M4T-v2 model with [seamlessv2-fast](https://pytorch.org/blog/accelerating-generative-ai-4/) - - ## 🚀 Quick Start First, install TorchAO. We recommend installing the latest stable version: @@ -73,20 +67,9 @@ pip install torchao Quantize your model weights to int4! ```python from torchao.quantization import Int4WeightOnlyConfig, quantize_ -quantize_(model, Int4WeightOnlyConfig(group_size=32, version=1)) -``` -Compared to a `torch.compiled` bf16 baseline, your quantized model should be significantly smaller and faster on a single A100 GPU: -```bash -int4 model size: 1.25 MB -bfloat16 model size: 4.00 MB -compression ratio: 3.2 - -bf16 mean time: 30.393 ms -int4 mean time: 4.410 ms -speedup: 6.9x +quantize_(model, Int4WeightOnlyConfig(group_size=32, int4_packing_format="tile_packed_to_4d", int4_choose_qparams_algorithm="hqq")) ``` -See our [quick start guide](https://docs.pytorch.org/ao/stable/quick_start.html) for more details. Alternatively, try quantizing your favorite model using our [HuggingFace space](https://huggingface.co/spaces/pytorch/torchao-my-repo)! - +See our [quick start guide](https://docs.pytorch.org/ao/stable/quick_start.html) for more details. 
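+For example, here is a minimal end-to-end sketch of the flow above (the toy model, layer sizes, and CUDA device are illustrative assumptions; any model with `nn.Linear` layers works):
+```python
+import torch
+from torchao.quantization import Int4WeightOnlyConfig, quantize_
+
+# toy bf16 model used only for illustration; substitute your own model
+model = torch.nn.Sequential(
+    torch.nn.Linear(1024, 1024),
+    torch.nn.ReLU(),
+    torch.nn.Linear(1024, 1024),
+).to(torch.bfloat16).cuda()
+
+# quantize the nn.Linear weights to int4 in place
+quantize_(model, Int4WeightOnlyConfig(group_size=32, int4_packing_format="tile_packed_to_4d", int4_choose_qparams_algorithm="hqq"))
+
+# run inference as usual
+out = model(torch.randn(16, 1024, dtype=torch.bfloat16, device="cuda"))
+```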
## 🛠 Installation

@@ -103,13 +86,14 @@
pip install torchao
pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126

# Different CUDA versions
- pip install torchao --index-url https://download.pytorch.org/whl/cu126 # CUDA 12.6
+ pip install torchao --index-url https://download.pytorch.org/whl/cu128 # CUDA 12.8
pip install torchao --index-url https://download.pytorch.org/whl/cpu # CPU only

# For developers
USE_CUDA=1 python setup.py develop
USE_CPP=0 python setup.py develop
```
+
Please see the [torchao compatibility table](https://github.com/pytorch/ao/issues/2919) for version requirements for dependencies.

@@ -120,57 +104,64 @@
TorchAO is integrated into some of the leading open-source libraries including:

* HuggingFace transformers with a [builtin inference backend](https://huggingface.co/docs/transformers/main/quantization/torchao) and [low bit optimizers](https://github.com/huggingface/transformers/pull/31865)
* HuggingFace diffusers best practices with `torch.compile` and TorchAO in a standalone repo [diffusers-torchao](https://github.com/huggingface/diffusers/blob/main/docs/source/en/quantization/torchao.md)
+* vLLM for LLM serving: [usage](https://docs.vllm.ai/en/latest/features/quantization/torchao.html), [detailed docs](https://docs.pytorch.org/ao/main/torchao_vllm_integration.html)
+* Integration with [FBGEMM](https://github.com/pytorch/FBGEMM/tree/main/fbgemm_gpu/experimental/gen_ai) for SOTA kernels on server GPUs
+* Integration with [ExecuTorch](https://github.com/pytorch/executorch/) for edge device deployment
+* Axolotl for [QAT](https://docs.axolotl.ai/docs/qat.html) and [PTQ](https://docs.axolotl.ai/docs/quantize.html)
+* TorchTitan for [float8 pre-training](https://github.com/pytorch/torchtitan/blob/main/docs/float8.md)
* HuggingFace PEFT for LoRA using TorchAO as their [quantization backend](https://huggingface.co/docs/peft/en/developer_guides/quantization#torchao-pytorch-architecture-optimization)
-* Mobius HQQ backend leveraged our int4 kernels to get [195 tok/s on a 4090](https://github.com/mobiusml/hqq#faster-inference)
* TorchTune for our NF4 [QLoRA](https://docs.pytorch.org/torchtune/main/tutorials/qlora_finetune.html), [QAT](https://docs.pytorch.org/torchtune/main/recipes/qat_distributed.html), and [float8 quantized fine-tuning](https://github.com/pytorch/torchtune/pull/2546) recipes
-* TorchTitan for [float8 pre-training](https://github.com/pytorch/torchtitan/blob/main/docs/float8.md)
-* VLLM for LLM serving: [usage](https://docs.vllm.ai/en/latest/features/quantization/torchao.html), [detailed docs](https://docs.pytorch.org/ao/main/torchao_vllm_integration.html)
-* SGLang for LLM serving: [usage](https://docs.sglang.ai/backend/server_arguments.html#server-arguments) and the major [PR](https://github.com/sgl-project/sglang/pull/1341).
-* Axolotl for [QAT](https://docs.axolotl.ai/docs/qat.html) and [PTQ](https://docs.axolotl.ai/docs/quantize.html)
-
+* SGLang for LLM serving: [usage](https://docs.sglang.ai/advanced_features/quantization.html#online-quantization)

## 🔎 Inference

TorchAO delivers substantial performance gains with minimal code changes:

-- **Int4 weight-only**: [1.89x throughput with 58.1% less memory](torchao/quantization/README.md) on Llama-3-8B
-- **Float8 dynamic quantization**: [1.54x and 1.27x speedup on Flux.1-Dev* and CogVideoX-5b respectively](https://github.com/sayakpaul/diffusers-torchao) on H100 with preserved quality
+- **Int4 weight-only**: [1.73x speedup with 65% less memory](https://huggingface.co/pytorch/gemma-3-12b-it-INT4) for Gemma3-12b-it on H100, with a slight impact on accuracy
+- **Float8 dynamic quantization**: [1.5-1.6x speedup on gemma-3-27b-it](https://huggingface.co/pytorch/gemma-3-27b-it-FP8/blob/main/README.md#results-h100-machine) and [1.54x and 1.27x speedup on Flux.1-Dev* and CogVideoX-5b respectively](https://github.com/sayakpaul/diffusers-torchao) on H100 with preserved quality
+- **Int8 activation quantization and int4 weight quantization**: Quantized Qwen3-4B running at 14.8 tokens/s with 3379 MB of memory on an iPhone 15 Pro through [ExecuTorch](https://huggingface.co/pytorch/Qwen3-4B-INT8-INT4#running-in-a-mobile-app)
- **Int4 + 2:4 Sparsity**: [2.37x throughput with 67.7% memory reduction](torchao/sparsity/README.md) on Llama-3-8B

-Quantize any model with `nn.Linear` layers in just one line (Option 1), or load the quantized model directly from HuggingFace using our integration with HuggingFace transformers (Option 2):
-
-#### Option 1: Direct TorchAO API
-
-```python
-from torchao.quantization.quant_api import quantize_, Int4WeightOnlyConfig
-quantize_(model, Int4WeightOnlyConfig(group_size=128, use_hqq=True, version=1))
-```
-
-#### Option 2: HuggingFace Integration
-
+The following is our recommended flow for quantization and deployment:
```python
from transformers import TorchAoConfig, AutoModelForCausalLM
-from torchao.quantization.quant_api import Int4WeightOnlyConfig
+from torchao.quantization import Float8DynamicActivationFloat8WeightConfig, PerRow

# Create quantization configuration
-quantization_config = TorchAoConfig(quant_type=Int4WeightOnlyConfig(group_size=128, use_hqq=True, version=1))
+quantization_config = TorchAoConfig(quant_type=Float8DynamicActivationFloat8WeightConfig(granularity=PerRow()))

# Load and automatically quantize
quantized_model = AutoModelForCausalLM.from_pretrained(
- "microsoft/Phi-4-mini-instruct",
+ "Qwen/Qwen3-32B",
dtype="auto",
device_map="auto",
quantization_config=quantization_config
)
```

-#### Deploy quantized models in vLLM with one command:
+If the integration above does not work for your model, you can instead use the `quantize_` API described in the [quick start guide](https://docs.pytorch.org/ao/main/quick_start.html), as sketched below.
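+As a rough sketch of that alternative flow (assuming an H100-class GPU with enough memory for the bf16 checkpoint; any model with `nn.Linear` layers can be quantized the same way):
+```python
+import torch
+from transformers import AutoModelForCausalLM
+from torchao.quantization import Float8DynamicActivationFloat8WeightConfig, PerRow, quantize_
+
+# load the model in high precision first
+model = AutoModelForCausalLM.from_pretrained(
+    "Qwen/Qwen3-32B", dtype=torch.bfloat16, device_map="auto"
+)
+
+# quantize_ replaces the nn.Linear weights with float8-quantized tensors in place
+quantize_(model, Float8DynamicActivationFloat8WeightConfig(granularity=PerRow()))
+```
+The resulting model can then be used for inference or serving just like the Transformers-quantized model above.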
+
+Serving with vLLM on a 1xH100 machine:
+```shell
+# Server
+VLLM_DISABLE_COMPILE_CACHE=1 vllm serve pytorch/Qwen3-32B-FP8 --tokenizer Qwen/Qwen3-32B -O3
+```

```shell
-vllm serve pytorch/Phi-4-mini-instruct-int4wo-hqq --tokenizer microsoft/Phi-4-mini-instruct -O3
+# Client
+curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
+ "model": "pytorch/Qwen3-32B-FP8",
+ "messages": [
+ {"role": "user", "content": "Give me a short introduction to large language models."}
+ ],
+ "temperature": 0.6,
+ "top_p": 0.95,
+ "top_k": 20,
+ "max_tokens": 32768
+}'
```

-With this quantization flow, we achieve **67% VRAM reduction and 12-20% speedup** on A100 GPUs while maintaining model quality. For more detail, see this [step-by-step quantization guide](https://huggingface.co/pytorch/Phi-4-mini-instruct-int4wo-hqq#quantization-recipe). We also release some pre-quantized models [here](https://huggingface.co/pytorch).
+We also support deployment to edge devices through ExecuTorch; for more details, see the [quantization and serving guide](https://docs.pytorch.org/ao/main/serving.html). We also release pre-quantized models [here](https://huggingface.co/pytorch).

## 🚅 Training

diff --git a/docs/source/api_ref_quantization.rst b/docs/source/api_ref_quantization.rst
index c163a4b06a..d5a041b504 100644
--- a/docs/source/api_ref_quantization.rst
+++ b/docs/source/api_ref_quantization.rst
@@ -14,7 +14,6 @@ Main Quantization APIs
:nosignatures:
quantize_
- autoquant

Inference APIs for quantize\_
-------------------------------
@@ -27,13 +26,9 @@ Inference APIs for quantize\_
Float8DynamicActivationInt4WeightConfig
Float8DynamicActivationFloat8WeightConfig
Float8WeightOnlyConfig
- Float8StaticActivationFloat8WeightConfig
Int8DynamicActivationInt4WeightConfig
- GemliteUIntXWeightOnlyConfig
Int8WeightOnlyConfig
Int8DynamicActivationInt8WeightConfig
- UIntXWeightOnlyConfig
- FPXWeightOnlyConfig

.. currentmodule:: torchao.quantization

@@ -51,19 +46,4 @@ Quantization Primitives
safe_int_mm
int_scaled_matmul
MappingType
- ZeroPointDomain
TorchAODType
-
-..
-  TODO: delete these?
-
-Other
------
-
-.. autosummary::
- :toctree: generated/
- :nosignatures:
-
- to_linear_activation_quantized
- swap_linear_with_smooth_fq_linear
- smooth_fq_linear_to_inference
diff --git a/docs/source/quick_start.rst b/docs/source/quick_start.rst
index 52947b7622..fc644acf4c 100644
--- a/docs/source/quick_start.rst
+++ b/docs/source/quick_start.rst
@@ -9,7 +9,7 @@ First, install the latest stable torchao release::

If you prefer to use the nightly release, you can install torchao using the following command instead::

- pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu121
+ pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126

torchao is compatible with the latest 3 major versions of PyTorch, which you will also need to install (`detailed instructions `__)::

@@ -55,9 +55,8 @@ for efficient mixed dtype matrix multiplication:

.. code:: py

- # torch 2.4+ only
from torchao.quantization import Int4WeightOnlyConfig, quantize_
- quantize_(model, Int4WeightOnlyConfig(group_size=32, version=1))
+ quantize_(model, Int4WeightOnlyConfig(group_size=32, int4_packing_format="tile_packed_to_4d", int4_choose_qparams_algorithm="hqq"))

The quantized model is now ready to use!
Note that the quantization logic is inserted through tensor subclasses, so there is no change
diff --git a/docs/source/serving.rst b/docs/source/serving.rst
index d95132ded7..1e4805e0b1 100644
--- a/docs/source/serving.rst
+++ b/docs/source/serving.rst
@@ -15,7 +15,7 @@ Post-training Quantization with HuggingFace
-------------------------------------------

HuggingFace Transformers provides seamless integration with torchao quantization. The ``TorchAoConfig`` automatically applies torchao's optimized quantization algorithms during model loading.

-Please check out our `HF Integration Docs `_ for examples on how to use quantization and sparsity in Transformers and Diffusers.
+Please check out our `HF Integration Docs `_ for examples of how to use quantization and sparsity in Transformers and Diffusers, and the `TorchAOConfig Reference `_ for all available torchao configs.

Serving and Inference
---------------------

@@ -37,11 +37,11 @@ To serve in vLLM, we're using the model we quantized and pushed to Hugging Face

.. code-block:: bash

# Server
- vllm serve pytorch/Phi-4-mini-instruct-float8dq --tokenizer microsoft/Phi-4-mini-instruct -O3
+ vllm serve pytorch/Phi-4-mini-instruct-FP8 --tokenizer microsoft/Phi-4-mini-instruct -O3

# Client
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
- "model": "pytorch/Phi-4-mini-instruct-float8dq",
+ "model": "pytorch/Phi-4-mini-instruct-FP8",
"messages": [
{"role": "user", "content": "Give me a short introduction to large language models."}
],

@@ -271,8 +271,8 @@ Evaluate quantized models using lm-evaluation-harness:

# Evaluate baseline model
lm_eval --model hf --model_args pretrained=microsoft/Phi-4-mini-instruct --tasks hellaswag --device cuda:0 --batch_size 8

- # Evaluate torchao-quantized model (float8dq)
- lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-float8dq --tasks hellaswag --device cuda:0 --batch_size 8
+ # Evaluate torchao-quantized model (FP8)
+ lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-FP8 --tasks hellaswag --device cuda:0 --batch_size 8

Memory Benchmarking
^^^^^^^^^^^^^^^^^^^

@@ -283,8 +283,8 @@ For Phi-4-mini-instruct, when quantized with float8 dynamic quant, we can reduce

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

- # use "microsoft/Phi-4-mini-instruct" or "pytorch/Phi-4-mini-instruct-float8dq"
- model_id = "pytorch/Phi-4-mini-instruct-float8dq"
+ # use "microsoft/Phi-4-mini-instruct" or "pytorch/Phi-4-mini-instruct-FP8"
+ model_id = "pytorch/Phi-4-mini-instruct-FP8"
quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

@@ -328,7 +328,7 @@ Output:

Peak Memory Usage: 5.70 GB

+-------------------+---------------------+------------------------------+
-| Benchmark         | Phi-4 mini-instruct | Phi-4-mini-instruct-float8dq |
+| Benchmark         | Phi-4 mini-instruct | Phi-4-mini-instruct-FP8      |
+===================+=====================+==============================+
| Peak Memory (GB)  | 8.91                | 5.70 (36% reduction)         |
+-------------------+---------------------+------------------------------+

@@ -342,10 +342,10 @@ Latency Benchmarking

..
code-block:: bash # baseline - python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model microsoft/Phi-4-mini-instruct --batch-size 1 + vllm bench latency --input-len 256 --output-len 256 --model microsoft/Phi-4-mini-instruct --batch-size 1 - # float8dq - VLLM_DISABLE_COMPILE_CACHE=1 python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model pytorch/Phi-4-mini-instruct-float8dq --batch-size 1 + # FP8 + VLLM_DISABLE_COMPILE_CACHE=1 vllm bench latency --input-len 256 --output-len 256 --model pytorch/Phi-4-mini-instruct-FP8 --batch-size 1 Serving Benchmarking """"""""""""""""""""" @@ -372,13 +372,13 @@ We benchmarked the throughput in a serving environment. # Server: vllm serve microsoft/Phi-4-mini-instruct --tokenizer microsoft/Phi-4-mini-instruct -O3 # Client: - python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model microsoft/Phi-4-mini-instruct --num-prompts 1 + vllm bench serve --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model microsoft/Phi-4-mini-instruct --num-prompts 1 - # For float8dq + # For FP8 # Server: - VLLM_DISABLE_COMPILE_CACHE=1 vllm serve pytorch/Phi-4-mini-instruct-float8dq --tokenizer microsoft/Phi-4-mini-instruct -O3 + VLLM_DISABLE_COMPILE_CACHE=1 vllm serve pytorch/Phi-4-mini-instruct-FP8 --tokenizer microsoft/Phi-4-mini-instruct -O3 # Client: - python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model pytorch/Phi-4-mini-instruct-float8dq --num-prompts 1 + vllm bench serve --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model pytorch/Phi-4-mini-instruct-FP8 --num-prompts 1 Results (H100 machine) """""""""""""""""""""