diff --git a/docs/source/index.rst b/docs/source/index.rst
index d05f2bd60a..0a96600b70 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -45,6 +45,7 @@ for an overall introduction to the library and recent highlight and updates.
    finetuning
    serving
    torchao_vllm_integration
+   torchao_hf_integration
    serialization
    static_quantization
    subclass_basic
diff --git a/docs/source/output.png b/docs/source/output.png
new file mode 100644
index 0000000000..cf7ebfeccd
Binary files /dev/null and b/docs/source/output.png differ
diff --git a/docs/source/serving.rst b/docs/source/serving.rst
index 9efa905b0d..d639a78093 100644
--- a/docs/source/serving.rst
+++ b/docs/source/serving.rst
@@ -15,38 +15,7 @@ Post-training Quantization with HuggingFace
 -------------------------------------------
 
 HuggingFace Transformers provides seamless integration with torchao quantization. The ``TorchAoConfig`` automatically applies torchao's optimized quantization algorithms during model loading.
-
-.. code-block:: bash
-
-   pip install git+https://github.com/huggingface/transformers@main
-   pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126
-   pip install torch
-   pip install accelerate
-
-For this example, we'll use ``Float8DynamicActivationFloat8WeightConfig`` on the Phi-4 mini-instruct model.
-
-.. code-block:: python
-
-   import torch
-   from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
-   from torchao.quantization import Float8DynamicActivationFloat8WeightConfig, PerRow
-
-   model_id = "microsoft/Phi-4-mini-instruct"
-
-   quant_config = Float8DynamicActivationFloat8WeightConfig(granularity=PerRow())
-   quantization_config = TorchAoConfig(quant_type=quant_config)
-   quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16, quantization_config=quantization_config)
-   tokenizer = AutoTokenizer.from_pretrained(model_id)
-
-   # Push the model to hub
-   USER_ID = "YOUR_USER_ID"
-   MODEL_NAME = model_id.split("/")[-1]
-   save_to = f"{USER_ID}/{MODEL_NAME}-float8dq"
-   quantized_model.push_to_hub(save_to, safe_serialization=False)
-   tokenizer.push_to_hub(save_to)
-
-.. note::
-   For more information on supported quantization and sparsity configurations, see `HF-Torchao Docs `_.
+Please check out our `HF Integration Docs `_ for examples on how to use quantization and sparsity in Transformers and Diffusers.
 
 Serving and Inference
 ---------------------
diff --git a/docs/source/torchao_hf_integration.md b/docs/source/torchao_hf_integration.md
new file mode 100644
index 0000000000..8ab5020133
--- /dev/null
+++ b/docs/source/torchao_hf_integration.md
@@ -0,0 +1,128 @@
+(torchao_hf_integration)=
+# Hugging Face Integration
+
+```{contents}
+:local:
+:depth: 2
+```
+
+(usage-examples)=
+## Quick Start: Usage Example
+
+First, install the required packages.
+
+```bash
+pip install git+https://github.com/huggingface/transformers@main
+pip install git+https://github.com/huggingface/diffusers@main
+pip install torchao
+pip install torch
+pip install accelerate
+```
+
+(quantizing-models-transformers)=
+### 1. Quantizing Models with Transformers
+
+Below is an example of using `Float8DynamicActivationInt4WeightConfig` on the Llama-3.2-1B model.
+
+```python
+from transformers import TorchAoConfig, AutoModelForCausalLM
+from torchao.quantization import Float8DynamicActivationInt4WeightConfig
+
+# Create quantization configuration
+quantization_config = TorchAoConfig(
+    quant_type=Float8DynamicActivationInt4WeightConfig(group_size=128, use_hqq=True)
+)
+
+# Load and automatically quantize the model
+model = AutoModelForCausalLM.from_pretrained(
+    "meta-llama/Llama-3.2-1B",
+    torch_dtype="auto",
+    device_map="auto",
+    quantization_config=quantization_config
+)
+```
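+
+To quickly sanity-check the quantized model, you can run a short generation pass. The snippet below is a minimal sketch rather than part of the example above: the prompt and generation settings are illustrative, and it assumes the quantized `model` from the previous block fits on a single device.
+
+```python
+from transformers import AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
+
+# Run a short generation with the quantized model (illustrative prompt and settings)
+inputs = tokenizer("What are we having for dinner?", return_tensors="pt").to(model.device)
+outputs = model.generate(**inputs, max_new_tokens=64)
+print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+```
+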
+```{seealso}
+For inference examples and recommended quantization methods for different hardware (e.g. A100 GPU, H100 GPU, CPU), see [HF-Torchao Docs (Quantization Examples)](https://huggingface.co/docs/transformers/main/en/quantization/torchao#quantization-examples).
+
+For inference using vLLM, please see [(Part 3) Serving on vLLM, SGLang, ExecuTorch](https://docs.pytorch.org/ao/main/serving.html) for a full end-to-end tutorial.
+```
+
+(quantizing-models-diffusers)=
+### 2. Quantizing Models with Diffusers
+
+Below is an example of quantizing the Flux transformer with Diffusers.
+
+```python
+import torch
+from diffusers import FluxPipeline, FluxTransformer2DModel, TorchAoConfig
+
+model_id = "black-forest-labs/Flux.1-Dev"
+dtype = torch.bfloat16
+
+quantization_config = TorchAoConfig("int8wo")
+transformer = FluxTransformer2DModel.from_pretrained(
+    model_id,
+    subfolder="transformer",
+    quantization_config=quantization_config,
+    torch_dtype=dtype,
+)
+pipe = FluxPipeline.from_pretrained(
+    model_id,
+    transformer=transformer,
+    torch_dtype=dtype,
+)
+pipe.to("cuda")
+
+prompt = "A cat holding a sign that says hello world"
+image = pipe(prompt, num_inference_steps=4, guidance_scale=0.0).images[0]
+image.save("output.png")
+```
+
+```{note}
+Example Output:
+![A cat holding a sign that says hello world](output.png "Model Output")
+```
+
+```{seealso}
+Please refer to the [HF-TorchAO-Diffusers Docs](https://huggingface.co/docs/diffusers/en/quantization/torchao) for more examples and benchmarking results.
+```
+
+(saving-models)=
+## Saving the Model
+
+After quantizing the model, we can save it locally or push it to the Hugging Face Hub.
+
+```python
+import tempfile
+
+# Save the quantized model locally (see below for the current status of safetensors support)
+with tempfile.TemporaryDirectory() as tmp_dir:
+    model.save_pretrained(tmp_dir, safe_serialization=False)
+
+# Optional: push to the Hugging Face Hub instead (uncomment the following lines)
+# from transformers import AutoTokenizer
+# save_to = "your-username/Llama-3.2-1B-int4"
+# tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
+# model.push_to_hub(save_to, safe_serialization=False)
+# tokenizer.push_to_hub(save_to)
+```
+
+**Current Status of Safetensors support**: TorchAO quantized models cannot yet be serialized with safetensors due to tensor subclass limitations. When saving quantized models, you must use `safe_serialization=False`.
+
+```python
+# Don't serialize the model with safetensors
+output_dir = "llama3-8b-int4wo-128"
+quantized_model.save_pretrained(output_dir, safe_serialization=False)
+```
+
+**Workaround**: For production use, save and push models with `safe_serialization=False` when uploading to the Hugging Face Hub.
+
+**Future Work**: The TorchAO team is actively working on safetensors support for tensor subclasses. Track progress [here](https://github.com/pytorch/ao/issues/2338) and [here](https://github.com/pytorch/ao/pull/2881).
+
+(supported-quantization-types)=
+## Supported Quantization Types
+
+Weight-only quantization stores the model weights in a specific low-bit data type but performs computation with a higher-precision data type, like `bfloat16`. This lowers the memory requirements from model weights but retains the memory peaks for activation computation.
+
+Dynamic activation quantization stores the model weights in a low-bit dtype, while also quantizing the activations on-the-fly to save additional memory. This lowers the memory requirements from model weights, while also lowering the memory overhead from activation computations. However, this may come at a quality tradeoff, so it is recommended to test the quantized model thoroughly.
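+
+To make the tradeoff concrete, the sketch below builds one weight-only config and one dynamic-activation config for use with `TorchAoConfig`. The specific configs are illustrative examples, assuming a recent torchao release; pick the scheme that matches your hardware and accuracy requirements.
+
+```python
+from transformers import TorchAoConfig
+from torchao.quantization import (
+    Int8WeightOnlyConfig,
+    Int8DynamicActivationInt8WeightConfig,
+)
+
+# Weight-only: int8 weights, activations stay in the higher-precision compute dtype
+weight_only_config = TorchAoConfig(quant_type=Int8WeightOnlyConfig())
+
+# Dynamic activation quantization: int8 weights, activations quantized to int8 on the fly
+dynamic_activation_config = TorchAoConfig(quant_type=Int8DynamicActivationInt8WeightConfig())
+
+# Pass either config as `quantization_config=` to `from_pretrained`, as in the examples above.
+```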
+
+```{note}
+Please refer to the [torchao docs](https://docs.pytorch.org/ao/main/api_ref_quantization.html) for supported quantization types.
+```
diff --git a/output.png b/output.png
new file mode 100644
index 0000000000..cf7ebfeccd
Binary files /dev/null and b/output.png differ