1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -45,6 +45,7 @@ for an overall introduction to the library and recent highlights and updates.
finetuning
serving
torchao_vllm_integration
torchao_hf_integration
serialization
static_quantization
subclass_basic
Binary file added docs/source/output.png
33 changes: 1 addition & 32 deletions docs/source/serving.rst
@@ -15,38 +15,7 @@ Post-training Quantization with HuggingFace
-------------------------------------------

HuggingFace Transformers provides seamless integration with torchao quantization. The ``TorchAoConfig`` automatically applies torchao's optimized quantization algorithms during model loading.

.. code-block:: bash

   pip install git+https://github.com/huggingface/transformers@main
   pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126
   pip install torch
   pip install accelerate

For this example, we'll use ``Float8DynamicActivationFloat8WeightConfig`` on the Phi-4 mini-instruct model.

.. code-block:: python

   import torch
   from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
   from torchao.quantization import Float8DynamicActivationFloat8WeightConfig, PerRow

   model_id = "microsoft/Phi-4-mini-instruct"

   quant_config = Float8DynamicActivationFloat8WeightConfig(granularity=PerRow())
   quantization_config = TorchAoConfig(quant_type=quant_config)
   quantized_model = AutoModelForCausalLM.from_pretrained(
       model_id,
       device_map="auto",
       torch_dtype=torch.bfloat16,
       quantization_config=quantization_config,
   )
   tokenizer = AutoTokenizer.from_pretrained(model_id)

   # Push the model to hub
   USER_ID = "YOUR_USER_ID"
   MODEL_NAME = model_id.split("/")[-1]
   save_to = f"{USER_ID}/{MODEL_NAME}-float8dq"
   quantized_model.push_to_hub(save_to, safe_serialization=False)
   tokenizer.push_to_hub(save_to)

.. note::
   For more information on supported quantization and sparsity configurations, see `HF-Torchao Docs <https://huggingface.co/docs/transformers/main/en/quantization/torchao>`_.

   Please check out our `HF Integration Docs <torchao_hf_integration.html>`_ for examples on how to use quantization and sparsity in Transformers and Diffusers.

Serving and Inference
--------------------
128 changes: 128 additions & 0 deletions docs/source/torchao_hf_integration.md
@@ -0,0 +1,128 @@
(torchao_hf_integration)=
# Hugging Face Integration

```{contents}
:local:
:depth: 2
```

(usage-examples)=
## Quick Start: Usage Example

First, install the required packages.

```bash
pip install git+https://github.com/huggingface/transformers@main
pip install git+https://github.com/huggingface/diffusers@main
pip install torchao
pip install torch
pip install accelerate
```

(quantizing-models-transformers)=
### 1. Quantizing Models with Transformers

Below is an example of using `Float8DynamicActivationInt4WeightConfig` on the Llama-3.2-1B model.

```python
from transformers import TorchAoConfig, AutoModelForCausalLM
from torchao.quantization import Float8DynamicActivationInt4WeightConfig

# Create quantization configuration
quantization_config = TorchAoConfig(
    quant_type=Float8DynamicActivationInt4WeightConfig(group_size=128, use_hqq=True)
)

# Load and automatically quantize the model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",
    torch_dtype="auto",
    device_map="auto",
    quantization_config=quantization_config,
)
```
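With the model loaded, you can run a quick generation pass to sanity-check the quantized weights. A minimal sketch; the prompt and `max_new_tokens` value are illustrative:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")

# Tokenize an example prompt and generate from the quantized model
inputs = tokenizer("What are we having for dinner?", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```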
```{seealso}
For inference examples and recommended quantization methods for different hardware (e.g. A100 GPU, H100 GPU, CPU), see [HF-Torchao Docs (Quantization Examples)](https://huggingface.co/docs/transformers/main/en/quantization/torchao#quantization-examples).

For inference using vLLM, please see [(Part 3) Serving on vLLM, SGLang, ExecuTorch](https://docs.pytorch.org/ao/main/serving.html) for a full end-to-end tutorial.
```

(quantizing-models-diffusers)=
### 2. Quantizing Models with Diffusers

Below is an example of quantizing the Flux transformer with Diffusers.

```python
import torch
from diffusers import FluxPipeline, FluxTransformer2DModel, TorchAoConfig

model_id = "black-forest-labs/FLUX.1-dev"
dtype = torch.bfloat16

quantization_config = TorchAoConfig("int8wo")
transformer = FluxTransformer2DModel.from_pretrained(
    model_id,
    subfolder="transformer",
    quantization_config=quantization_config,
    torch_dtype=dtype,
)
pipe = FluxPipeline.from_pretrained(
    model_id,
    transformer=transformer,
    torch_dtype=dtype,
)
pipe.to("cuda")

prompt = "A cat holding a sign that says hello world"
image = pipe(prompt, num_inference_steps=4, guidance_scale=0.0).images[0]
image.save("output.png")
```

> **Reviewer (Contributor):** Should we eventually update this example to use `Int8WeightOnlyConfig`? (see PR here huggingface/diffusers#12275)
>
> **Reply (Contributor):** yeah I think so, probably after the PR is merged

```{note}
Example output:
![A cat holding a sign that says hello world](output.png "Model Output")
```

> **Reviewer (Contributor):** love this!

```{seealso}
Please refer to the [HF-TorchAO-Diffusers Docs](https://huggingface.co/docs/diffusers/en/quantization/torchao) for more examples and benchmarking results.
```

(saving-models)=
## Saving the Model

After quantizing the model, we can save it locally or push it to the Hugging Face Hub.

```python
import tempfile

# Save the quantized model locally (see below for safe_serialization enablement progress)
with tempfile.TemporaryDirectory() as tmp_dir:
    model.save_pretrained(tmp_dir, safe_serialization=False)

# Optional: push the model and tokenizer to the Hub (uncomment the following lines)
# from transformers import AutoTokenizer
# save_to = "your-username/Llama-3.2-1B-int4"
# model.push_to_hub(save_to, safe_serialization=False)
# tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
# tokenizer.push_to_hub(save_to)
```

**Current status of safetensors support**: TorchAO quantized models cannot yet be serialized with safetensors due to tensor subclass limitations. When saving quantized models, you must use `safe_serialization=False`.

```python
# Don't serialize the model with safetensors
output_dir = "llama3-8b-int4wo-128"
quantized_model.save_pretrained(output_dir, safe_serialization=False)
```

**Workaround**: For production use, save models with `safe_serialization=False` when saving locally or pushing to the Hugging Face Hub.
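A model saved this way can be reloaded with the usual `from_pretrained` call, which picks up the non-safetensors weights together with the stored quantization config. A minimal sketch, assuming the hypothetical Hub repo from the saving example above:

```python
from transformers import AutoModelForCausalLM

# "your-username/Llama-3.2-1B-int4" is the hypothetical repo pushed earlier
reloaded = AutoModelForCausalLM.from_pretrained(
    "your-username/Llama-3.2-1B-int4",
    torch_dtype="auto",
    device_map="auto",
)
```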

**Future Work**: The TorchAO team is actively working on safetensors support for tensor subclasses. Track progress [here](https://github.com/pytorch/ao/issues/2338) and [here](https://github.com/pytorch/ao/pull/2881).

(supported-quantization-types)=
## Supported Quantization Types

Weight-only quantization stores the model weights in a specific low-bit data type but performs computation with a higher-precision data type, like `bfloat16`. This lowers the memory requirements from model weights but retains the memory peaks for activation computation.
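For example, a weight-only config changes only how the weights are stored. A minimal sketch using `Int8WeightOnlyConfig` (one of the configs listed in the torchao docs); the model ID follows the earlier examples:

```python
from transformers import AutoModelForCausalLM, TorchAoConfig
from torchao.quantization import Int8WeightOnlyConfig

# Weights are stored in int8; computation still runs in the model's
# higher-precision dtype (e.g. bfloat16)
quantization_config = TorchAoConfig(quant_type=Int8WeightOnlyConfig())
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",
    torch_dtype="auto",
    device_map="auto",
    quantization_config=quantization_config,
)
```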

Dynamic activation quantization stores the model weights in a low-bit dtype and additionally quantizes the activations on the fly, lowering both the weight memory footprint and the memory overhead of activation computation. However, this can sometimes cost model quality, so it is recommended to evaluate quantized models thoroughly. See the sketch below.
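By contrast, a dynamic-activation config quantizes activations as well as weights. A sketch using `Int8DynamicActivationInt8WeightConfig`, with the same loading pattern as the weight-only example above:

```python
from transformers import TorchAoConfig
from torchao.quantization import Int8DynamicActivationInt8WeightConfig

# Weights stored in int8; activations quantized to int8 on the fly at runtime
quantization_config = TorchAoConfig(
    quant_type=Int8DynamicActivationInt8WeightConfig()
)
```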

```{note}
Please refer to the [torchao docs](https://docs.pytorch.org/ao/main/api_ref_quantization.html) for supported quantization types.
```
Binary file added output.png