
Commit 06d5967

Merge branch 'main' into kylesayrs/transform-quip-modifier
2 parents f86e3ac + 29f4d56 commit 06d5967

3 files changed (+25, -18 lines)


README.md

Lines changed: 1 addition & 0 deletions
@@ -18,6 +18,7 @@ Big updates have landed in LLM Compressor! To get a more in-depth look, check ou
 
 Some of the exciting new features include:
 
+* **DeepSeekV3-style Block Quantization Support**: This allows for more efficient compression of large language models without needing a calibration dataset. Quantize a Qwen3 model to [W8A8](examples/quantization_w8a8_fp8/fp8_block_example.py).
 * **Llama4 Quantization Support**: Quantize a Llama4 model to [W4A16](examples/multimodal_vision/llama4_example.py) or [NVFP4](examples/quantization_w4a4_fp4/llama4_example.py). The checkpoint produced can seamlessly run in vLLM.
 * **Large Model Support with Sequential Onloading**: As of llm-compressor>=0.6.0, you can now quantize very large language models on a single GPU. Models are broken into disjoint layers which are then onloaded to the GPU one layer at a time. For more information on sequential onloading, see [Big Modeling with Sequential Onloading](examples/big_models_with_sequential_onloading/README.md) as well as the [DeepSeek-R1 Example](examples/quantizing_moe/deepseek_r1_example.py).
 * **Preliminary FP4 Quantization Support:** Quantize weights and activations to FP4 and seamlessly run the compressed model in vLLM. Model weights and activations are quantized following the NVFP4 [configuration](https://github.com/neuralmagic/compressed-tensors/blob/f5dbfc336b9c9c361b9fe7ae085d5cb0673e56eb/src/compressed_tensors/quantization/quant_scheme.py#L104). See examples of [weight-only quantization](examples/quantization_w4a16_fp4/llama3_example.py) and [fp4 activation support](examples/quantization_w4a4_fp4/llama3_example.py). Support is currently preliminary and additional support will be added for MoEs.
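For context, a minimal data-free sketch of how the new block-quantization path might be invoked. The authoritative recipe is examples/quantization_w8a8_fp8/fp8_block_example.py; the model choice and the "FP8_BLOCK" scheme name here are assumptions, hedged in the comments.

```python
# Minimal sketch (assumed API surface; see the linked example for the
# authoritative recipe).
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Qwen/Qwen3-8B"  # illustrative model choice

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# DeepSeekV3-style block quantization is data-free, so no calibration
# dataset is passed to oneshot. "FP8_BLOCK" is the assumed preset name.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_BLOCK", ignore=["lm_head"])
oneshot(model=model, recipe=recipe)

# Save the compressed checkpoint for use in vLLM.
save_dir = MODEL_ID.split("/")[-1] + "-FP8-BLOCK"
model.save_pretrained(save_dir, save_compressed=True)
tokenizer.save_pretrained(save_dir)
```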

docs/guides/compression_formats.md

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
+# Compression Formats
+
+The following table outlines the possible quantization and sparsity
+compression formats that are applied to a model during compression.
+The formats are determined according to the quantization scheme and
+sparsity type. For more details on the quantization schemes, see
+`guides/compression_schemes.md`.
+
+
+| Quantization  | Sparsity | Quant Compressor     | Sparsity Compressor |
+|---------------|----------|----------------------|---------------------|
+| W8A8 - int    | None     | int_quantized        | Dense               |
+| W8A8 - float  | None     | float_quantized      | Dense               |
+| W4A16 - float | None     | nvfp4_pack_quantized | Dense               |
+| W4A4 - float  | None     | nvfp4_pack_quantized | Dense               |
+| W4A16 - int   | None     | pack_quantized       | Dense               |
+| W8A16 - int   | None     | pack_quantized       | Dense               |
+| W8A16 - float | None     | naive_quantized      | Dense               |
+| W8A8 - int    | 2:4      | int_quantized        | Sparse24            |
+| W8A8 - float  | 2:4      | float_quantized      | Sparse24            |
+| W4A16 - int   | 2:4      | marlin_24            | Dense               |
+| W8A16 - int   | 2:4      | marlin_24            | Dense               |
+| W8A16 - float | 2:4      | naive_quantized      | Dense               |
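The new guide is effectively a lookup from (quantization scheme, sparsity structure) to a pair of compressor formats. A small illustrative sketch of that mapping as plain Python follows; the dictionary mirrors the table above and is not the library's actual selection logic, which lives in src/llmcompressor/transformers/compression/quantization_format.py.

```python
# Illustrative lookup only -- mirrors docs/guides/compression_formats.md.
FORMAT_BY_SCHEME_AND_SPARSITY = {
    ("W8A8 - int", None): ("int_quantized", "Dense"),
    ("W8A8 - float", None): ("float_quantized", "Dense"),
    ("W4A16 - float", None): ("nvfp4_pack_quantized", "Dense"),
    ("W4A4 - float", None): ("nvfp4_pack_quantized", "Dense"),
    ("W4A16 - int", None): ("pack_quantized", "Dense"),
    ("W8A16 - int", None): ("pack_quantized", "Dense"),
    ("W8A16 - float", None): ("naive_quantized", "Dense"),
    ("W8A8 - int", "2:4"): ("int_quantized", "Sparse24"),
    ("W8A8 - float", "2:4"): ("float_quantized", "Sparse24"),
    ("W4A16 - int", "2:4"): ("marlin_24", "Dense"),
    ("W8A16 - int", "2:4"): ("marlin_24", "Dense"),
    ("W8A16 - float", "2:4"): ("naive_quantized", "Dense"),
}


def lookup_formats(scheme: str, sparsity: str | None) -> tuple[str, str]:
    """Return (quant compressor, sparsity compressor) for a scheme/sparsity pair."""
    return FORMAT_BY_SCHEME_AND_SPARSITY[(scheme, sparsity)]


# Example: a 2:4-sparse W8A8 float model pairs float_quantized with Sparse24.
print(lookup_formats("W8A8 - float", "2:4"))
```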

src/llmcompressor/transformers/compression/quantization_format.py

Lines changed: 1 addition & 18 deletions
@@ -18,24 +18,7 @@ def infer_quantization_format(
     Infers the quantization format for a model based on its state and provided
     compression arguments.
 
-    The following table outlines the possible quantization and sparsity formats
-    along with their corresponding compressor formats:
-
-    +---------------+----------+----------------------+---------------------+
-    | Quantization  | Sparsity | Quant Compressor     | Sparsity Compressor |
-    |               |          | Format               | Format              |
-    +---------------+----------+----------------------+---------------------+
-    | W8A8 - int    | None     | int_quantized        | Dense               |
-    | W8A8 - float  | None     | float_quantized      | Dense               |
-    | W4A16 - int   | None     | pack_quantized       | Dense               |
-    | W8A16 - int   | None     | pack_quantized       | Dense               |
-    | W8A16 - float | None     | naive_quantized      | Dense               |
-    | W8A8 - int    | 2:4      | int_quantized        | Sparse24            |
-    | W8A8 - float  | 2:4      | float_quantized      | Sparse24            |
-    | W4A16 - int   | 2:4      | marlin_24            | Dense               |
-    | W8A16 - int   | 2:4      | marlin_24            | Dense               |
-    | W8A16 - float | 2:4      | naive_quantized      | Dense               |
-    +---------------+----------+----------------------+---------------------+
+    For a summary of the formats, see `docs/guides/compression_formats.md`.
 
     :param model: model to check for quantization, if the model is not quantized no
         quantization format is returned
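The format inferred here ends up recorded in the compressed checkpoint's config.json under quantization_config. A small sketch of inspecting it after a run like the one above; the exact key layout and the hyphen/underscore spelling of the format names follow compressed-tensors conventions and should be treated as assumptions.

```python
import json
from pathlib import Path

# Directory produced by the earlier (hypothetical) oneshot run.
checkpoint = Path("Qwen3-8B-FP8-BLOCK")

config = json.loads((checkpoint / "config.json").read_text())
quant_config = config.get("quantization_config", {})

# "format" should correspond to one of the quant compressor formats in the
# table above (e.g. "float-quantized"); the exact spelling may differ from
# the underscore style used in the docs table.
print(quant_config.get("format"))
```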
