
Commit b1eb4b7

Authored by dsikka, gemini-code-assist[bot], and brian-dellabetta
[Examples] Add NVFP4 Example README.md (#1731)
SUMMARY: - Summarize what the examples do --------- Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Brian Dellabetta <[email protected]>
1 parent f92d98a commit b1eb4b7

File tree: 2 files changed (+98, -1 lines)
Lines changed: 94 additions & 0 deletions
@@ -0,0 +1,94 @@

# `fp4` Quantization

`llm-compressor` supports quantizing weights and activations to `fp4` for memory savings and inference acceleration with `vLLM`. In particular, `nvfp4` is supported: a 4-bit floating-point encoding format introduced with the NVIDIA Blackwell GPU architecture.

## Installation

To get started, install `llm-compressor` from source:

```bash
git clone https://github.com/vllm-project/llm-compressor.git
cd llm-compressor
pip install -e .
```

## Quickstart

The example includes an end-to-end script for applying the quantization algorithm.

```bash
python3 llama3_example.py
```

The resulting model, `Meta-Llama-3-8B-Instruct-NVFP4`, is ready to be loaded into vLLM.

Note: if running inference on a GPU older than SM100 (pre-Blackwell), vLLM will not run activation quantization and will instead apply weight-only quantization.
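
Once saved, the checkpoint can be loaded directly by vLLM. Below is a minimal sketch of offline inference against the saved directory; the prompt and sampling settings are illustrative, not taken from the example.

```python
from vllm import LLM, SamplingParams

# Load the compressed NVFP4 checkpoint produced by llama3_example.py.
llm = LLM(model="Meta-Llama-3-8B-Instruct-NVFP4")

# Run a short generation to confirm the checkpoint loads and serves.
outputs = llm.generate(
    ["What is fp4 quantization?"],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```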

## Code Walkthrough

Now, we will step through the code in the example:

1) Load model
2) Prepare calibration data
3) Apply quantization

### 1) Load Model

Load the model using `AutoModelForCausalLM` for handling quantized saving and loading.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
```

### 2) Prepare Calibration Data

Prepare the calibration data. `nvfp4` quantization generates per-tensor global scales and per-group (size 16) local quantization scales for the weights, as well as per-tensor global scales for the activations. Per-group local activation quantization scales are generated dynamically at inference time. We need some sample data to calibrate the global activation scales. Typically, a small number of samples is sufficient; in this example, we use a sample size of 20.

It is useful to use calibration data that closely matches the type of data used in deployment. If you have fine-tuned a model, using a sample of your training data is a good idea. In our case, we are quantizing an instruction-tuned generic model, so we will use the `ultrachat` dataset.
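
For reference, here is a sketch of how the calibration set might be prepared. The dataset id, split, and preprocessing follow the pattern used in other `llm-compressor` examples and are assumptions here; the names `ds`, `NUM_CALIBRATION_SAMPLES`, and `MAX_SEQUENCE_LENGTH` match those used in the quantization snippet below, and `tokenizer` comes from step 1.

```python
from datasets import load_dataset

NUM_CALIBRATION_SAMPLES = 20
MAX_SEQUENCE_LENGTH = 2048

# Load and shuffle a small slice of ultrachat for calibration.
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))

# Render each chat with the model's chat template, then tokenize.
def preprocess(example):
    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}

def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )

ds = ds.map(preprocess)
ds = ds.map(tokenize, remove_columns=ds.column_names)
```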

### 3) Apply Quantization

With the dataset ready, we will now apply quantization.

We first select the quantization algorithm. In our case, we will apply the default `QuantizationModifier` recipe for `nvfp4` to all linear layers.

> See the `Recipes` documentation for more information on making complex recipes.

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Configure the quantization algorithm to run.
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])

# Apply quantization.
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

# Save the compressed model to disk.
SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-NVFP4"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```

We have successfully created an `nvfp4` model!

# Quantizing MoEs

To quantize MoEs, a few additional steps are required. An example quantizing Llama4 can be found under `llama4_example.py`. Here, we replace all `Llama4TextMoe` modules by calling `replace_modules_for_calibration` (see the sketch after this list). This replacement allows us to:

1. Linearize the model to enable quantization and execution in vLLM. This is required because the native model definition does not include `torch.nn.Linear` layers in its MoE blocks, which LLM Compressor needs in order to run quantization.
2. Ensure experts are quantized correctly, as not all experts are activated during calibration.
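
A minimal sketch of this replacement step, assuming `replace_modules_for_calibration` is importable from `llmcompressor.modeling` (see `llama4_example.py` for the exact model loading and calibration flow):

```python
from llmcompressor.modeling import replace_modules_for_calibration

# `model` is the loaded Llama4 model. Swap its Llama4TextMoe modules for
# linearized equivalents so the MoE blocks expose torch.nn.Linear layers
# that LLM Compressor can quantize and vLLM can execute.
model = replace_modules_for_calibration(model)
```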

Similarly, an example quantizing the Qwen3-30B-A3B model can be found under `qwen_30b_a3b.py`. This model does not require the additional linearization needed by Llama4. However, similar to Llama4, in order to ensure the experts are quantized correctly, we can pass in `calibrate_moe_context=True`, which temporarily updates the `Qwen3MoeSparseMoeBlock` definition to change how the forward pass is handled in the MoE block during calibration. Feel free to update the definition under `llm-compressor/src/llmcompressor/modeling/qwen3_moe.py` to play around with this behavior and evaluate its impact on quantization performance.
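
As a sketch, the flag is passed directly to `oneshot`; the other arguments mirror the walkthrough above, and the exact values used are in `qwen_30b_a3b.py`:

```python
# Temporarily update the MoE forward pass during calibration so that
# every expert receives calibration data.
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    calibrate_moe_context=True,
)
```
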
examples/quantization_w4a4_fp4/qwen_30b_a3b.py

Lines changed: 4 additions & 1 deletion

@@ -60,7 +60,10 @@ def tokenize(sample):
 
 # Apply quantization.
 # We see `calibrate_moe_context` to True to update all `Qwen3MoeSparseMoeBlock`
-# during calibration
+# during calibration.
+# Feel free to update the definition under
+# llm-compressor/src/llmcompressor/modeling/qwen3_moe.py` to play around with
+# this behaviour and evaluate its impact on quantization performance
 oneshot(
     model=model,
     dataset=ds,
