# `fp4` Quantization

`llm-compressor` supports quantizing weights and activations to `fp4` for memory savings and inference acceleration with `vLLM`. In particular, `nvfp4` is supported: a 4-bit floating-point encoding format introduced with the NVIDIA Blackwell GPU architecture.

## Installation

To get started, install:

```bash
git clone https://github.com/vllm-project/llm-compressor.git
cd llm-compressor
pip install -e .
```

## Quickstart

The example includes an end-to-end script for applying the quantization algorithm:

```bash
python3 llama3_example.py
```

The resulting model, `Meta-Llama-3-8B-Instruct-NVFP4`, is ready to be loaded into vLLM.

Note: on hardware older than SM100 (pre-Blackwell), vLLM does not run activation quantization and falls back to weight-only quantization.
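
Once the example script has run, the saved checkpoint can be loaded directly by vLLM. A minimal sketch, assuming vLLM is installed; the prompt and sampling settings are illustrative:

```python
from vllm import LLM, SamplingParams

# Load the compressed NVFP4 checkpoint produced by the example script.
llm = LLM(model="Meta-Llama-3-8B-Instruct-NVFP4")

# Generate a short completion to confirm the model serves correctly.
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```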

## Code Walkthrough

Now, we will step through the code in the example:

1) Load the model
2) Prepare the calibration data
3) Apply quantization

### 1) Load Model

Load the model using `AutoModelForCausalLM`, which handles quantized saving and loading.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
```

### 2) Prepare Calibration Data

Prepare the calibration data. `nvfp4` quantization generates per-tensor global scales and per-group (size 16) local quantization scales for the weights, as well as per-tensor global scales for the activations. Per-group local activation quantization scales are generated dynamically at inference time. We need some sample data to calibrate the global activation scales. Typically, a small number of samples is sufficient. In this example, we use a sample size of 20.

It is useful to use calibration data that closely matches the type of data used in deployment. If you have fine-tuned a model, using a sample of your training data is a good idea. In our case, we are quantizing an instruction-tuned generic model, so we will use the `ultrachat` dataset.
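
A minimal sketch of this step, assuming the `HuggingFaceH4/ultrachat_200k` dataset and a maximum sequence length of 2048 (both assumptions beyond the description above); the exact preprocessing in `llama3_example.py` may differ:

```python
from datasets import load_dataset

NUM_CALIBRATION_SAMPLES = 20
MAX_SEQUENCE_LENGTH = 2048

# Load a small calibration slice and shuffle it.
ds = load_dataset(
    "HuggingFaceH4/ultrachat_200k",
    split=f"train_sft[:{NUM_CALIBRATION_SAMPLES}]",
)
ds = ds.shuffle(seed=42)

# Render each conversation with the chat template of the tokenizer from step 1,
# then tokenize to fixed-length sequences for calibration.
def preprocess(example):
    text = tokenizer.apply_chat_template(example["messages"], tokenize=False)
    return tokenizer(
        text,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        padding=False,
        add_special_tokens=False,
    )

ds = ds.map(preprocess, remove_columns=ds.column_names)
```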

### 3) Apply Quantization

With the dataset ready, we will now apply quantization.

We first select the quantization algorithm. In our case, we apply the default `QuantizationModifier` recipe for `nvfp4` to all linear layers, leaving the `lm_head` unquantized.

> See the `Recipes` documentation for more information on building complex recipes.

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Configure the quantization algorithm to run.
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])

# Apply quantization.
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

# Save the compressed model to disk.
SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-NVFP4"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```
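
Optionally, you can run a quick generation with the quantized model (ideally before saving the compressed checkpoint) as a sanity check. A rough sketch using standard `transformers` APIs; `nvfp4` execution is emulated here, so expect it to be slow:

```python
# Quick sanity check: generate a short completion with the quantized model.
sample = tokenizer("Hello my name is", return_tensors="pt").to(model.device)
output = model.generate(**sample, max_new_tokens=50)
print(tokenizer.decode(output[0]))
```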

We have successfully created an `nvfp4` model!

## Quantizing MoEs

To quantize MoEs, a few additional steps are required. An example quantizing Llama4 can be found under `llama4_example.py`. There, we replace all `Llama4TextMoe` modules by calling `replace_modules_for_calibration` (see the sketch after this list). This replacement allows us to:

1. Linearize the model to enable quantization and execution in vLLM. This is required because the native model definition does not include `torch.nn.Linear` layers in its MoE blocks, which LLM Compressor needs in order to run quantization.
2. Ensure the experts are quantized correctly, since not all experts are activated during calibration.
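
A minimal sketch of the replacement step, assuming the Llama4 model has already been loaded; the import path is an assumption, so see `llama4_example.py` for the exact usage:

```python
# Assumed import path for the MoE replacement helper described above.
from llmcompressor.modeling import replace_modules_for_calibration

# `model` is the loaded Llama4 checkpoint. Swap its `Llama4TextMoe` modules
# for a linearized, calibration-friendly equivalent before calling `oneshot`.
model = replace_modules_for_calibration(model)
```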

Similarly, an example quantizing the Qwen3-30B-A3B model can be found under `qwen_30b_a3b.py`. This model does not require the additional linearization that Llama4 does. However, similar to Llama4, to ensure the experts are quantized correctly, we can pass in `calibrate_moe_context`, which temporarily updates the model definition to use `Qwen3MoeSparseMoeBlock` and changes how the MoE block's forward pass is handled during calibration (see the sketch below). Feel free to update the definition under `llm-compressor/src/llmcompressor/modeling/qwen3_moe.py` to experiment with this behavior and evaluate its impact on quantization performance.
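
A minimal sketch of this option, assuming the flag is passed to `oneshot` (see `qwen_30b_a3b.py` for the exact invocation):

```python
# Assumed: `calibrate_moe_context` is forwarded to `oneshot` alongside the
# usual arguments. It temporarily swaps in a calibration-friendly MoE block so
# that the experts are exercised during calibration.
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    calibrate_moe_context=True,
)
```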