
Commit fef12b9

committed changes

1 parent d01aafb

4 files changed (+38 / -320 lines)


docs/source/output.png (1.32 MB)

docs/source/serving.rst

Lines changed: 1 addition & 32 deletions
@@ -15,38 +15,7 @@ Post-training Quantization with HuggingFace

Unchanged context (lines 15-17):

-------------------------------------------

HuggingFace Transformers provides seamless integration with torchao quantization. The ``TorchAoConfig`` automatically applies torchao's optimized quantization algorithms during model loading.
Removed (old lines 18-49):

.. code-block:: bash

    pip install git+https://github.com/huggingface/transformers@main
    pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126
    pip install torch
    pip install accelerate

For this example, we'll use ``Float8DynamicActivationFloat8WeightConfig`` on the Phi-4 mini-instruct model.

.. code-block:: python

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
    from torchao.quantization import Float8DynamicActivationFloat8WeightConfig, PerRow

    model_id = "microsoft/Phi-4-mini-instruct"

    quant_config = Float8DynamicActivationFloat8WeightConfig(granularity=PerRow())
    quantization_config = TorchAoConfig(quant_type=quant_config)
    quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16, quantization_config=quantization_config)
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # Push the model to hub
    USER_ID = "YOUR_USER_ID"
    MODEL_NAME = model_id.split("/")[-1]
    save_to = f"{USER_ID}/{MODEL_NAME}-float8dq"
    quantized_model.push_to_hub(save_to, safe_serialization=False)
    tokenizer.push_to_hub(save_to)

.. note::
    For more information on supported quantization and sparsity configurations, see `HF-Torchao Docs <https://huggingface.co/docs/transformers/main/en/quantization/torchao>`_.
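The removed example quantizes Phi-4-mini with ``Float8DynamicActivationFloat8WeightConfig(granularity=PerRow())``. To make the per-row float8 recipe concrete, here is a minimal pure-Python sketch. It assumes the float8 e4m3fn format (largest finite value 448) and a scale of row abs-max / 448 for every weight row and every activation row (token). It illustrates the general technique only and is not torchao's implementation; `round_to_e4m3`, `quantize_rows`, and `float8_linear` are hypothetical helper names.

```python
import math

# Hypothetical helpers for illustration only; not a torchao API.
# Largest finite value of the float8 e4m3fn format (assumed target dtype).
E4M3_MAX = 448.0


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest float8 e4m3fn value (3 mantissa bits).

    Assumes the caller has already clamped x to [-E4M3_MAX, E4M3_MAX].
    """
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = abs(x)
    # frexp gives a = m * 2**e with 0.5 <= m < 1, so the exponent for a
    # 1.xxx mantissa is e - 1; values below 2**-6 are subnormals that
    # share the minimum exponent -6.
    _, e = math.frexp(a)
    exp = max(e - 1, -6)
    step = 2.0 ** (exp - 3)  # spacing between neighbouring e4m3 values here
    return sign * min(round(a / step) * step, E4M3_MAX)


def quantize_rows(matrix):
    """Rowwise float8 quantization: one scale per row = row abs-max / E4M3_MAX."""
    quantized, scales = [], []
    for row in matrix:
        amax = max(abs(v) for v in row)
        scale = amax / E4M3_MAX if amax > 0 else 1.0
        quantized.append(
            [round_to_e4m3(max(-E4M3_MAX, min(E4M3_MAX, v / scale))) for v in row]
        )
        scales.append(scale)
    return quantized, scales


def float8_linear(x, qw, w_scales):
    """y = x @ w.T using pre-quantized weights and dynamically quantized activations.

    x is quantized per row (per token) on every call; each float8 dot product
    is rescaled by the activation-row scale and the weight-row scale.
    """
    qx, x_scales = quantize_rows(x)
    return [
        [
            sum(a * b for a, b in zip(qx_row, qw_row)) * sx * sw
            for qw_row, sw in zip(qw, w_scales)
        ]
        for qx_row, sx in zip(qx, x_scales)
    ]


# Weights are quantized once (analogous to quantizing at model-load time).
w = [[0.5, -1.25, 2.0], [1.0, 0.25, 0.5]]
qw, w_scales = quantize_rows(w)

x = [[1.0, 2.0, 3.0]]
# Roughly [[4.003, 2.946]]; the exact product x @ w.T is [[4.0, 3.0]].
print(float8_linear(x, qw, w_scales))
```

Weight scales are computed once, while activation scales are recomputed from each input, which is what "dynamic activation" refers to. Giving every row its own scale keeps a single large value from reducing the precision of all other rows.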
Added (new line 18):

Please check out our `HF Integration Docs <torchao_hf_integration.html>`_ for examples on how to use quantization and sparsity in Transformers and Diffusers.
Unchanged context (old lines 50-52, new lines 19-21):

Serving and Inference
--------------------
