Skip to content

Commit 0c0ead3

Browse files
authored
1 parent a9847e0 commit 0c0ead3

File tree

2 files changed

+13
-16
lines changed

2 files changed

+13
-16
lines changed

README.md

Lines changed: 4 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -37,13 +37,12 @@ Big updates have landed in LLM Compressor! To get a more in-depth look, check ou
3737

3838
Some of the exciting new features include:
3939

40-
* **Batched Calibration Support**: LLM Compressor now supports calibration with batch sizes > 1. A new [`batch_size`](src/llmcompressor/args/dataset_arguments.py#L70) argument has been added to the `dataset_arguments` enabling the option to improve quantization speed. Default `batch_size` is currently set to 1
40+
* **Updated offloading and model loading support**: Loading transformers models that are offloaded to disk and/or offloaded across distributed process ranks is now supported. Disk offloading allows users to load and compress very large models which normally would not fit in CPU memory. Offloading functionality is no longer supported through accelerate but through model loading utilities added to compressed-tensors. For a full summary of updated loading and offloading functionality, for both single-process and distributed flows, see the [Big Models and Distributed Support guide](docs/guides/big_models_and_distributed/model_loading.md).
41+
* **Distributed GPTQ Support**: GPTQ now supports Distributed Data Parallel (DDP) functionality to significantly improve calibration runtime. An example using DDP with GPTQ can be found [here](examples/quantization_w4a16/llama3_ddp_example.py).
42+
* **Updated FP4 Microscale Support**: GPTQ now supports FP4 quantization schemes, including both [MXFP4](examples/quantization_w4a16_fp4/mxfp4/llama3_example.py) and [NVFP4](examples/quantization_w4a4_fp4/llama3_gptq_example.py). MXFP4 support has also been improved with updated weight scale generation. Models with weight-only quantization in the MXFP4 format can now run in vLLM as of vLLM v0.14.0. MXFP4 models with activation quantization are not yet supported in vLLM for compressed-tensors models
4143
* **New Model-Free PTQ Pathway**: A new model-free PTQ pathway has been added to LLM Compressor, called [`model_free_ptq`](src/llmcompressor/entrypoints/model_free/__init__.py#L36). This pathway allows you to quantize your model without the requirement of Hugging Face model definition and is especially useful in cases where `oneshot` may fail. This pathway is currently supported for data-free pathways only i.e FP8 quantization and was leveraged to quantize the [Mistral Large 3 model](https://huggingface.co/mistralai/Mistral-Large-3-675B-Instruct-2512). Additional [examples](examples/model_free_ptq) have been added illustrating how LLM Compressor can be used for Kimi K2
4244
* **Extended KV Cache and Attention Quantization Support**: LLM Compressor now supports attention quantization. KV Cache quantization, which previously only supported per-tensor scales, has been extended to support any quantization scheme including a new `per-head` quantization scheme. Support for these checkpoints is on-going in vLLM and scripts to get started have been added to the [experimental folder](experimental/attention)
43-
* **Generalized AWQ Support**: The AWQModifier has been updated to support quantization schemes beyond W4A16 (e.g W4AFp8). In particular, AWQ no longer constrains that the quantization config needs to have the same settings for `group_size`, `symmetric`, and `num_bits` for each config_group
44-
* **AutoRound Quantization Support**: Added [`AutoRoundModifier`](examples/autoround) for quantization using [AutoRound](https://aclanthology.org/2024.findings-emnlp.662.pdf), an advanced post-training algorithm that optimizes rounding and clipping ranges through sign-gradient descent. This approach combines the efficiency of post-training quantization with the adaptability of parameter tuning, delivering robust compression for large language models while maintaining strong performance
45-
* **Experimental MXFP4 Support**: Models can now be quantized using an [`MXFP4`](https://github.com/vllm-project/compressed-tensors/blob/main/src/compressed_tensors/quantization/quant_scheme.py#L208) pre-set scheme. Examples can be found under the [experimental folder](experimental/mxfp4/llama3_mxfp4.py). This pathway is still experimental as support and validation with vLLM is still a WIP.
46-
* **R3 Transform Support**: LLM Compressor now supports applying transforms to attention in the style of SpinQuant's R3 rotation. Note: this feature is currently not yet supported in vLLM. An example applying R3 can be found in the [experimental folder](experimental/attention/llama3_attention_r3_nvfp4.py)
45+
4746

4847
### Supported Formats
4948
* Activation Quantization: W8A8 (int8 and fp8)

docs/index.md

Lines changed: 9 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -23,24 +23,22 @@ For more information, see [Why use LLM Compressor?](./steps/why-llmcompressor.md
2323

2424
Review the [LLM Compressor v0.9.0 release notes](https://github.com/vllm-project/llm-compressor/releases/tag/0.9.0) for details about new features. Highlights include:
2525

26-
!!! info "Batched Calibration Support"
27-
LLM Compressor now supports calibration with batch sizes > 1. A new batch_size argument has been added to the dataset_arguments enabling the option to improve quantization speed. Default batch_size is currently set to 1
26+
!!! info "Updated offloading and model loading support"
27+
Loading transformers models that are offloaded to disk and/or offloaded across distributed process ranks is now supported. Disk offloading allows users to load and compress very large models which normally would not fit in CPU memory. Offloading functionality is no longer supported through accelerate but through model loading utilities added to compressed-tensors. For a full summary of updated loading and offloading functionality, for both single-process and distributed flows, see the [Big Models and Distributed Support guide](guides/big_models_and_distributed/model_loading.md)
28+
29+
!!! info "Distributed GPTQ Support"
30+
GPTQ now supports Distributed Data Parallel (DDP) functionality to significantly improve calibration runtime. An example using DDP with GPTQ can be found [here](https://github.com/vllm-project/llm-compressor/blob/main/examples/quantization_w4a16/llama3_ddp_example.py)
31+
32+
!!! info "Updated FP4 Microscale Support"
33+
GPTQ now supports FP4 quantization schemes, including both [MXFP4](https://github.com/vllm-project/llm-compressor/blob/main/examples/quantization_w4a16_fp4/mxfp4/llama3_example.py) and [NVFP4](https://github.com/vllm-project/llm-compressor/blob/main/examples/quantization_w4a4_fp4/llama3_gptq_example.py). MXFP4 support has also been improved with updated weight scale generation. Models with weight-only quantization in the MXFP4 format can now run in vLLM as of vLLM v0.14.0. MXFP4 models with activation quantization are not yet supported in vLLM for compressed-tensors models
34+
2835

2936
!!! info "New Model-Free PTQ Pathway"
3037
A new model-free PTQ pathway has been added to LLM Compressor, called model_free_ptq. This pathway allows you to quantize your model without the requirement of Hugging Face model definition and is especially useful in cases where oneshot may fail. This pathway is currently supported for data-free pathways only, such as FP8 quantization and was leveraged to quantize the Mistral Large 3 model. Additional examples have been added illustrating how LLM Compressor can be used for Kimi K2
3138

3239
!!! info "Extended KV Cache and Attention Quantization Support"
3340
LLM Compressor now supports attention quantization. KV Cache quantization, which previously only supported per-tensor scales, has been extended to support any quantization scheme including a new per-head quantization scheme. Support for these checkpoints is ongoing in vLLM and scripts to get started have been added to the [experimental](https://github.com/vllm-project/llm-compressor/tree/main/experimental) folder
3441

35-
!!! info "Generalized AWQ Support"
36-
The `AWQModifier` has been updated to support quantization schemes beyond W4A16 (e.g., W4AFp8). In particular, AWQ no longer constrains that the quantization config needs to have the same settings for group_size, symmetric, and num_bits for each config_group
37-
38-
!!! info "AutoRound Quantization Support"
39-
Added AutoRoundModifier for quantization using AutoRound, an advanced post-training algorithm that optimizes rounding and clipping ranges through sign-gradient descent. This approach combines the efficiency of post-training quantization with the adaptability of parameter tuning, delivering robust compression for large language models while maintaining strong performance
40-
41-
!!! info "Experimental MXFP4 Support"
42-
Models can now be quantized using an MXFP4 pre-set scheme. Examples can be found under the experimental folder. This pathway is still experimental as support and validation with vLLM is still a WIP.
43-
4442
## Supported algorithms and techniques
4543

4644
| Algorithm | Description | Use Case |

0 commit comments

Comments
 (0)