pip install torchao --index-url https://download.pytorch.org/whl/cu126 # CUDA 12.6
pip install torchao --index-url https://download.pytorch.org/whl/cpu   # CPU only

# For developers
USE_CUDA=1 python setup.py develop
USE_CPP=0 python setup.py develop
```

</details>
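Whichever route you take, a quick sanity check (a convenience snippet, not part of the official instructions) is to import the package and confirm which build is active:

```
# Convenience check only: confirms torchao imports and whether a CUDA-enabled wheel is in use.
import torch
import torchao

print(torchao.__version__, torch.__version__, torch.cuda.is_available())
```
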
Quantize your model weights to int4!

```
from torchao.quantization import Int4WeightOnlyConfig, quantize_
quantize_(model, Int4WeightOnlyConfig(group_size=32))
```

In the quick start benchmark this corresponds to a 6.9x inference speedup.

For the full model setup and benchmark details, check out our [quick start guide](https://docs.pytorch.org/ao/stable/quick_start.html). Alternatively, try quantizing your favorite model using our [HuggingFace space](https://huggingface.co/spaces/pytorch/torchao-my-repo)!
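
If you want a self-contained version of the snippet above to paste and run, here is a rough sketch with a toy model; the layer sizes, batch size, and `group_size` are arbitrary choices for illustration, and the int4 path assumes a CUDA device with bfloat16 weights:

```
import torch
from torchao.quantization import Int4WeightOnlyConfig, quantize_

# Toy model for illustration only; any nn.Module with Linear layers works the same way.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 1024),
).to(torch.bfloat16).cuda()

# One call, in place: Linear weights are replaced with int4 weight-only quantized tensors.
quantize_(model, Int4WeightOnlyConfig(group_size=32))

x = torch.randn(16, 1024, dtype=torch.bfloat16, device="cuda")
with torch.no_grad():
    print(model(x).shape)  # torch.Size([16, 1024])
```
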
## 🔗 Integrations

TorchAO is integrated into some of the leading open-source libraries, including:

* HuggingFace transformers with a [builtin inference backend](https://huggingface.co/docs/transformers/main/quantization/torchao) and [low bit optimizers](https://github.com/huggingface/transformers/pull/31865) (a minimal usage sketch follows this list)
* HuggingFace diffusers best practices with `torch.compile` and TorchAO in a standalone repo [diffusers-torchao](https://github.com/huggingface/diffusers/blob/main/docs/source/en/quantization/torchao.md)
* HuggingFace PEFT for LoRA using TorchAO as their [quantization backend](https://huggingface.co/docs/peft/en/developer_guides/quantization#torchao-pytorch-architecture-optimization)
* Mobius HQQ backend leveraged our int4 kernels to get [195 tok/s on a 4090](https://github.com/mobiusml/hqq#faster-inference)
* TorchTune for our NF4 [QLoRA](https://docs.pytorch.org/torchtune/main/tutorials/qlora_finetune.html), [QAT](https://docs.pytorch.org/torchtune/main/recipes/qat_distributed.html), and [float8 quantized fine-tuning](https://github.com/pytorch/torchtune/pull/2546) recipes
* TorchTitan for [float8 pre-training](https://github.com/pytorch/torchtitan/blob/main/docs/float8.md)
* vLLM for LLM serving: [usage](https://docs.vllm.ai/en/latest/features/quantization/torchao.html), [detailed docs](https://docs.pytorch.org/ao/main/torchao_vllm_integration.html)
* SGLang for LLM serving: [usage](https://docs.sglang.ai/backend/server_arguments.html#server-arguments) and the major [PR](https://github.com/sgl-project/sglang/pull/1341).
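
For the transformers integration in the first bullet, the load path looks roughly like the sketch below. This is an illustrative example rather than the canonical snippet from the transformers docs: the checkpoint name is a placeholder, and the exact arguments `TorchAoConfig` accepts depend on your transformers version.

```
# Illustrative only: placeholder checkpoint; TorchAoConfig arguments vary by transformers version.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
from torchao.quantization import Int4WeightOnlyConfig

model_id = "facebook/opt-125m"  # placeholder checkpoint
quantization_config = TorchAoConfig(quant_type=Int4WeightOnlyConfig(group_size=128))

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=quantization_config,  # weights are quantized at load time
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("TorchAO makes it easy to", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```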