Commit c892006

Merge branch 'main' into kylesayrs/cache-util

2 parents: eadcb73 + dbc4bc5

File tree

27 files changed: +754 -39 lines

README.md

Lines changed: 1 addition & 0 deletions
@@ -18,6 +18,7 @@ Big updates have landed in LLM Compressor! To get a more in-depth look, check ou

Some of the exciting new features include:

+* **DeepSeekV3-style Block Quantization Support**: This allows for more efficient compression of large language models without needing a calibration dataset. Quantize a Qwen3 model to [W8A8](examples/quantization_w8a8_fp8/fp8_block_example.py).
* **Llama4 Quantization Support**: Quantize a Llama4 model to [W4A16](examples/multimodal_vision/llama4_example.py) or [NVFP4](examples/quantization_w4a4_fp4/llama4_example.py). The checkpoint produced can seamlessly run in vLLM.
* **Large Model Support with Sequential Onloading**: As of llm-compressor>=0.6.0, you can now quantize very large language models on a single GPU. Models are broken into disjoint layers which are then onloaded to the GPU one layer at a time. For more information on sequential onloading, see [Big Modeling with Sequential Onloading](examples/big_models_with_sequential_onloading/README.md) as well as the [DeepSeek-R1 Example](examples/quantizing_moe/deepseek_r1_example.py).
* **Preliminary FP4 Quantization Support:** Quantize weights and activations to FP4 and seamlessly run the compressed model in vLLM. Model weights and activations are quantized following the NVFP4 [configuration](https://github.com/neuralmagic/compressed-tensors/blob/f5dbfc336b9c9c361b9fe7ae085d5cb0673e56eb/src/compressed_tensors/quantization/quant_scheme.py#L104). See examples of [weight-only quantization](examples/quantization_w4a16_fp4/llama3_example.py) and [fp4 activation support](examples/quantization_w4a4_fp4/llama3_example.py). Support is currently preliminary and additional support will be added for MoEs.

docs/getting-started/compress.md

Lines changed: 55 additions & 1 deletion
@@ -59,4 +59,58 @@ oneshot(
)
```

When you run the above code, the compressed model is saved to the specified output directory: `TinyLlama-1.1B-Chat-v1.0-INT8`. You can then load this model using the Hugging Face Transformers library or vLLM for inference and testing.
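As a quick sanity check, reloading the compressed directory with Transformers might look like the sketch below. The prompt and generation settings are arbitrary, and `device_map="auto"` assumes `accelerate` is installed; this is an illustrative snippet rather than part of the documented example.

```python
# Minimal sketch: reload the compressed checkpoint and run a short generation.
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "TinyLlama-1.1B-Chat-v1.0-INT8", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("TinyLlama-1.1B-Chat-v1.0-INT8")

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0]))
```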

## Memory requirements for LLM Compressor

When compressing a model, be aware that the memory requirements depend on both the model size and the algorithm used, such as GPTQ/SparseGPT.

This section walks through how to calculate the CPU and GPU memory requirements for each algorithm using several popular models as examples: an 8B model, a 684B model, and a model with vision capabilities.

GPTQ/SparseGPT requires a large amount of auxiliary memory: it allocates an auxiliary Hessian matrix for each layer that is onloaded to the GPU, and these Hessian matrices are approximately as large as the weights they describe.

Also, larger models such as DeepSeek R1 use a large amount of CPU memory, and models with large vision towers, such as Command A, may use large amounts of GPU memory.

### Things to note when calculating memory requirements for LLM Compressor

1. A 1B-parameter model uses about 2 GB of memory to load:

    ```
    mem(1B parameters) ~= (1B parameters) * (2 bytes / parameter) = 2B bytes ~= 2 GB
    ```

2. Text decoder layers and vision tower layers are loaded onto the GPU very differently.

    For text decoder layers, LLM Compressor dynamically loads one layer at a time onto the GPU for computation. The rest of the model remains in CPU memory.

    Vision tower layers, however, are loaded onto the GPU all at once. Unlike the text model, vision towers are not split into individual layers before onloading, which can create a GPU memory bottleneck for models whose vision towers are larger than their text layers.

    At this time LLM Compressor does not quantize the vision tower, as quantizing it is generally not worth the trade-off between latency/throughput gains and accuracy loss.

3. LLM Compressor does not currently support tensor parallelism for compression. Supporting this feature would allow layers to be sharded across GPUs, reducing memory usage per GPU and speeding up compression.

### QuantizationModifier or Round-To-Nearest (RTN)

The QuantizationModifier (RTN) does not require any additional memory beyond the storage needed for its quantization parameters (scales and zero points).

If we ignore these scales and zero points in our calculation, we can estimate the following memory requirements:

| Model | CPU requirements | GPU requirements |
|-------|------------------|------------------|
| **Meta-Llama-3-8B-Instruct** | mem(8B params) ~= 16 GB | mem(1 layer) ~= 0.5 GB |
| **DeepSeek-R1-0528-BF16** | mem(684B params) ~= 1368 GB | mem(1 layer) ~= 22.4 GB |
| **Qwen2.5-VL-7B-Instruct** | mem(7B params) ~= 14 GB | max(mem(1 text layer) ~= 0.4 GB, mem(vision tower) ~= 1.3 GB) ~= 1.3 GB |
### GPTQ / SparseGPT

The GPTQ and SparseGPT algorithms differ from RTN in that they must also allocate auxiliary Hessian matrices for the layers that are onloaded to the GPU.

Each Hessian matrix is used to improve the accuracy recovery of the algorithm and is approximately the same size as the original weights of the layer.

| Model | CPU requirements | GPU requirements |
|-------|------------------|------------------|
| **Meta-Llama-3-8B-Instruct** | mem(8B params) ~= 16 GB | mem(1 layer) * 2 ~= 1 GB |
| **DeepSeek-R1-0528-BF16** | mem(684B params) ~= 1368 GB | mem(1 layer) * 2 ~= 44.8 GB |
| **Qwen2.5-VL-7B-Instruct** | mem(7B params) ~= 14 GB | max(mem(1 text layer) ~= 0.4 GB, mem(vision tower) ~= 1.3 GB) * 2 ~= 2.6 GB |
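To make the arithmetic behind these tables concrete, here is a minimal sketch of the estimate. The helper below is illustrative only (it is not an LLM Compressor API), and it assumes 16-bit weights while ignoring the small overhead of scales and zero points:

```python
# Hypothetical helper mirroring the tables above; not part of LLM Compressor.
def estimate_memory_gb(total_params_b: float, layer_params_b: float, algorithm: str = "rtn"):
    """Estimate CPU and GPU memory (in GB) for compressing a model.

    total_params_b: total parameters, in billions (e.g. 8 for an 8B model)
    layer_params_b: parameters in the largest onloaded block, in billions
    algorithm: "rtn" (no auxiliary memory) or "gptq"/"sparsegpt" (adds a
               Hessian roughly the size of the onloaded weights)
    """
    bytes_per_param = 2  # bf16/fp16 weights
    cpu_gb = total_params_b * bytes_per_param  # whole model held in CPU RAM
    gpu_gb = layer_params_b * bytes_per_param  # one onloaded layer or vision tower
    if algorithm in ("gptq", "sparsegpt"):
        gpu_gb *= 2  # add the auxiliary Hessian
    return cpu_gb, gpu_gb


# Roughly reproduces the Meta-Llama-3-8B-Instruct rows (32 layers of ~0.25B params):
# ~16 GB CPU, ~0.5 GB GPU for RTN and ~1 GB GPU for GPTQ/SparseGPT.
print(estimate_memory_gb(8, 0.25, "rtn"))
print(estimate_memory_gb(8, 0.25, "gptq"))
```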

docs/guides/compression_formats.md

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
# Compression Formats

The following table outlines the possible quantization and sparsity compression formats that are applied to a model during compression. The formats are determined according to the quantization scheme and sparsity type. For more details on the quantization schemes, see `guides/compression_schemes.md`.

| Quantization | Sparsity | Quant Compressor | Sparsity Compressor |
|---------------|----------|----------------------|---------------------|
| W8A8 - int | None | int_quantized | Dense |
| W8A8 - float | None | float_quantized | Dense |
| W4A16 - float | None | nvfp4_pack_quantized | Dense |
| W4A4 - float | None | nvfp4_pack_quantized | Dense |
| W4A16 - int | None | pack_quantized | Dense |
| W8A16 - int | None | pack_quantized | Dense |
| W8A16 - float | None | naive_quantized | Dense |
| W8A8 - int | 2:4 | int_quantized | Sparse24 |
| W8A8 - float | 2:4 | float_quantized | Sparse24 |
| W4A16 - int | 2:4 | marlin_24 | Dense |
| W8A16 - int | 2:4 | marlin_24 | Dense |
| W8A16 - float | 2:4 | naive_quantized | Dense |
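For context, a recipe that lands in the `int_quantized` + `Sparse24` row might combine a 2:4 pruning pass with W8A8 int quantization. The sketch below is hedged: the modifier arguments are modeled on LLM Compressor's 2:4 examples and are not a verified configuration.

```python
# Hedged sketch: prune to a 2:4 pattern, then quantize weights/activations to int8,
# which the table above maps to the int_quantized + Sparse24 formats.
from llmcompressor.modifiers.obcq import SparseGPTModifier
from llmcompressor.modifiers.quantization import GPTQModifier

recipe = [
    SparseGPTModifier(sparsity=0.5, mask_structure="2:4"),
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]
```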

examples/big_models_with_sequential_onloading/llama3.3_70b.py

Lines changed: 1 addition & 1 deletion
@@ -1,9 +1,9 @@
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

+from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
-from llmcompressor.transformers import oneshot
from llmcompressor.utils import dispatch_for_generation

# Select model and load it.

examples/multimodal_vision/qwen_2_5_vl_example.py

Lines changed: 1 addition & 1 deletion
@@ -6,8 +6,8 @@
from qwen_vl_utils import process_vision_info
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

+from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
-from llmcompressor.transformers import oneshot
from llmcompressor.utils import dispatch_for_generation

# Load model.

examples/quantization_w4a16/llama3_example.py

Lines changed: 1 addition & 1 deletion
@@ -1,8 +1,8 @@
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

+from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
-from llmcompressor.transformers import oneshot
from llmcompressor.utils import dispatch_for_generation

# Select model and load it.
Lines changed: 94 additions & 0 deletions
@@ -0,0 +1,94 @@
# `fp4` Quantization

`llm-compressor` supports quantizing weights and activations to `fp4` for memory savings and inference acceleration with `vLLM`. In particular, `nvfp4` is supported - a 4-bit floating point encoding format introduced with the NVIDIA Blackwell GPU architecture.

## Installation

To get started, install:

```bash
git clone https://github.com/vllm-project/llm-compressor.git
cd llm-compressor
pip install -e .
```

## Quickstart

The example includes an end-to-end script for applying the quantization algorithm.

```bash
python3 llama3_example.py
```

The resulting model `Meta-Llama-3-8B-Instruct-NVFP4` is ready to be loaded into vLLM.
Note: when running inference on a GPU older than SM100 (pre-Blackwell), vLLM will not run activation quantization and falls back to weight-only quantization.
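If you want to sanity-check the produced checkpoint, a minimal vLLM load might look like the following sketch. The prompt and sampling settings are arbitrary, and the checkpoint path assumes the default output of `llama3_example.py`.

```python
# Hedged sketch: load the compressed NVFP4 checkpoint and run a short generation.
from vllm import LLM, SamplingParams

llm = LLM(model="Meta-Llama-3-8B-Instruct-NVFP4")
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```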
## Code Walkthrough

Now, we will step through the code in the example:

1) Load model
2) Prepare calibration data
3) Apply quantization

### 1) Load Model

Load the model using `AutoModelForCausalLM`, which handles quantized saving and loading.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
```

### 2) Prepare Calibration Data

Prepare the calibration data. `nvfp4` quantization generates per-tensor global scales and per-group (size 16) local quantization scales for the weights, as well as per-tensor global scales for the activations. Per-group local activation quantization scales are generated dynamically at inference time. We need some sample data to calibrate the global activation scales; typically, a small number of samples is sufficient. In this example, we use a sample size of 20.

It is useful to use calibration data that closely matches the type of data used in deployment. If you have fine-tuned a model, using a sample of your training data is a good idea. In our case, we are quantizing an instruction-tuned generic model, so we will use the `ultrachat` dataset.
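A minimal preparation sketch is shown below, mirroring the pattern used in the example scripts. The dataset split, sequence length, and preprocessing details are assumptions taken from `llama3_example.py` rather than requirements.

```python
from datasets import load_dataset

NUM_CALIBRATION_SAMPLES = 20
MAX_SEQUENCE_LENGTH = 2048

# Load and shuffle a small slice of ultrachat for calibration.
ds = load_dataset(
    "HuggingFaceH4/ultrachat_200k", split=f"train_sft[:{NUM_CALIBRATION_SAMPLES}]"
)
ds = ds.shuffle(seed=42)


def preprocess(example):
    # Render the chat messages into a single prompt string.
    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}


ds = ds.map(preprocess)


def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )


ds = ds.map(tokenize, remove_columns=ds.column_names)
```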

### 3) Apply Quantization

With the dataset ready, we will now apply quantization.

We first select the quantization algorithm.

In our case, we will apply the default QuantizationModifier recipe for `nvfp4` to all linear layers.
> See the `Recipes` documentation for more information on making complex recipes.

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Configure the quantization algorithm to run.
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])

# Apply quantization.
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

# Save to disk compressed.
SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-NVFP4"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```

We have successfully created an `nvfp4` model!
# Quantizing MoEs

To quantize MoEs, a few additional steps are required. An example quantizing Llama4 can be found under `llama4_example.py`. Here, we replace all `Llama4TextMoe` modules by calling `replace_modules_for_calibration`. This replacement allows us to:

1. Linearize the model to enable quantization and execution in vLLM. This is required because the native model definition does not include `torch.nn.Linear` layers in its MoE blocks, which LLM Compressor needs in order to run quantization.
2. Ensure the experts are quantized correctly, since not all experts are activated during calibration.

Similarly, an example quantizing the Qwen3-30B-A3B model can be found under `qwen_30b_a3b.py`. This model does not require the additional linearization needed for Llama4. However, to ensure the experts are quantized correctly, we can pass in `calibrate_moe_context`, which temporarily updates the model definition to use `Qwen3MoeSparseMoeBlock` and changes how the forward pass is handled in the MoE block during calibration, as shown in the sketch below. Feel free to update the definition under `llm-compressor/src/llmcompressor/modeling/qwen3_moe.py` to experiment with this behavior and evaluate its impact on quantization performance.
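As a minimal sketch of the Qwen3-30B-A3B flow described above, the flag is passed straight to `oneshot`. This assumes `model`, `ds`, `recipe`, and the calibration constants have been prepared for Qwen3-30B-A3B as in `qwen_30b_a3b.py`.

```python
# Hedged sketch: calibrate_moe_context temporarily updates the MoE block
# definition during calibration so that the experts are quantized correctly.
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    calibrate_moe_context=True,
)
```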

examples/quantization_w4a4_fp4/qwen_30b_a3b.py

Lines changed: 4 additions & 1 deletion
@@ -60,7 +60,10 @@ def tokenize(sample):

# Apply quantization.
# We set `calibrate_moe_context` to True to update all `Qwen3MoeSparseMoeBlock`
-# during calibration
+# during calibration.
+# Feel free to update the definition under
+# `llm-compressor/src/llmcompressor/modeling/qwen3_moe.py` to play around with
+# this behavior and evaluate its impact on quantization performance
oneshot(
    model=model,
    dataset=ds,

examples/quantization_w8a8_fp8/fp8_block_example.py

Lines changed: 3 additions & 1 deletion
@@ -16,7 +16,9 @@
# * quantize the weights to fp8 with per channel via ptq
# * quantize the activations to fp8 with dynamic per token
recipe = QuantizationModifier(
-    targets="Linear", scheme="FP8_BLOCK", ignore=["lm_head", "re:.*mlp.gate$"],
+    targets="Linear",
+    scheme="FP8_BLOCK",
+    ignore=["lm_head", "re:.*mlp.gate$"],
)

# Apply quantization.

examples/quantizing_moe/deepseek_r1_example.py

Lines changed: 1 addition & 1 deletion
@@ -1,9 +1,9 @@
from datasets import load_dataset
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

+from llmcompressor import oneshot
from llmcompressor.modeling import replace_modules_for_calibration
from llmcompressor.modifiers.quantization import GPTQModifier
-from llmcompressor.transformers import oneshot

# Select model and load it.
