Releases: vllm-project/llm-compressor

v0.7.1

21 Aug 21:37
6304ecf

Full Changelog: 0.7.0...0.7.1

v0.7.0

20 Aug 21:40
679a704

LLM Compressor v0.7.0 release notes

LLM Compressor v0.7.0 introduces the following new features and enhancements:

  • Transforms support, including QuIP and SpinQuant algorithms
  • Apply multiple compressors to a single model for mixed-precision quantization
  • Support for DeepSeekV3-style block FP8 quantization
  • Expanded Mixture of Experts (MoE) calibration support, including support for NVFP4 quantization
  • Llama4 quantization support with vLLM compatibility
  • Configurable observer arguments
  • Simplified and unified Recipe classes for easier usage and debugging

Introducing Transforms ✨

LLM Compressor now supports transforms. Transforms inject additional matrix operations into a model to improve accuracy recovery after quantization: they rotate weights or activations into spaces with smaller dynamic ranges, reducing quantization error.

Two algorithms are supported in this release:

  • QuIP transforms inject rotations before and after weights to assist with weight-only quantization.
  • SpinQuant transforms inject rotations whose inverses span multiple weights, assisting with both weight and activation quantization. In this release, fused R1 and R2 (i.e., offline) transforms are available. The full lifecycle has been validated to confirm that models produced by LLM Compressor match the performance reported in the original SpinQuant paper. Learned rotations and online R3 and R4 rotations will be added in a future release.

The functionality for both algorithms is available through the new QuIPModifier and SpinQuantModifier classes.
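
Below is a minimal sketch of applying QuIP transforms ahead of weight-only quantization. The model ID, the transform_type value, and the save call follow the patterns used in LLM Compressor's example scripts and should be treated as assumptions rather than the exact API:

```python
from transformers import AutoModelForCausalLM

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.modifiers.transform import QuIPModifier

# Example model; any Hugging Face causal LM should work.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct", torch_dtype="auto"
)

# Rotate weights into a lower-dynamic-range space, then quantize weights to 4 bits.
recipe = [
    QuIPModifier(transform_type="random-hadamard", targets="Linear"),  # transform_type is an assumption
    QuantizationModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
]

oneshot(model=model, recipe=recipe)
model.save_pretrained("Llama-3.2-1B-Instruct-quip-w4a16", save_compressed=True)
```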

Applying multiple compressors to a single model

LLM Compressor now supports applying multiple compressors to a single model. This extends support for non-uniform quantization recipes, such as combining NVFP4 and FP8 quantization. This provides finer control over per-layer quantization, allowing more precise handling of layers that are especially sensitive to certain quantization types.

Models with more than one compressor applied have their format set to mixed-precision in the config.json file. Additionally, each config_group now includes a format key that specifies the format used for the layers targeted by that group.
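
As a sketch of what a non-uniform recipe might look like, the snippet below stacks two QuantizationModifier instances, one per scheme. The regex targets and the scheme and dataset names are illustrative assumptions:

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Hypothetical split: keep sensitive attention layers at FP8,
# compress the less sensitive MLP layers down to NVFP4.
recipe = [
    QuantizationModifier(targets=["re:.*self_attn.*"], scheme="FP8_DYNAMIC", ignore=["lm_head"]),
    QuantizationModifier(targets=["re:.*mlp.*"], scheme="NVFP4", ignore=["lm_head"]),
]

oneshot(
    model=model,  # a previously loaded Hugging Face model
    recipe=recipe,
    dataset="open_platypus",  # NVFP4 scales are calibrated on sample data
    num_calibration_samples=512,
)
```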

Support for DeepSeekV3-style block FP8 quantization

You can now apply DeepSeekV3-style block FP8 quantization during model compression, a technique designed to further compress large language models for more efficient inference. This release includes the core block-wise quantization implementation, robust handling of quantization parameters, updated documentation, and a practical example to guide you through applying the new scheme.
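
A minimal sketch of what applying the scheme might look like; the "FP8_BLOCK" preset name is an assumption rather than confirmed API:

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# "FP8_BLOCK" is assumed to name the DeepSeekV3-style block-wise FP8 preset
# (weights quantized per block rather than per tensor or per channel).
recipe = QuantizationModifier(targets="Linear", scheme="FP8_BLOCK", ignore=["lm_head"])

oneshot(model=model, recipe=recipe)
```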

Mixture of Experts support

LLM Compressor now includes enhanced general Mixture of Experts (MoE) calibration support, including support for MoEs quantized with NVFP4. Forward passes of MoE models can be controlled during calibration in two ways: custom modules added to the replace_modules_for_calibration function permanently replace the MoE modules, while those added to the moe_calibration_context function temporarily update the modules for the duration of calibration.
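
A sketch of the permanent-replacement path; the import path is an assumption, and the recipe is whatever quantization modifier you would otherwise use:

```python
from llmcompressor import oneshot
from llmcompressor.modeling import replace_modules_for_calibration  # import path assumed

# Permanently swap MoE blocks for calibration-friendly versions so that
# calibration data reaches every expert during the forward pass.
model = replace_modules_for_calibration(model)

oneshot(
    model=model,
    recipe=recipe,  # e.g. an NVFP4 QuantizationModifier
    dataset="open_platypus",
    num_calibration_samples=512,
)
```

The moe_calibration_context alternative applies only while calibration runs, leaving the original modules in place afterwards.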

Llama4 quantization

Llama4 quantization is now supported in LLM Compressor. To be quantized and runnable in vLLM, Llama4TextMoe modules are permanently replaced using the replace_modules_for_calibration function, which linearizes them. This allows the model to be quantized to schemes including WNA16 (with GPTQ) and NVFP4.
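
A sketch of the Llama4 flow under the same assumptions as above (import path, example model ID, and dataset name are illustrative):

```python
from transformers import Llama4ForConditionalGeneration

from llmcompressor import oneshot
from llmcompressor.modeling import replace_modules_for_calibration  # import path assumed
from llmcompressor.modifiers.quantization import GPTQModifier

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"  # example model ID
model = Llama4ForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto")

# Linearize Llama4TextMoe modules so they can be quantized and later run in vLLM.
model = replace_modules_for_calibration(model)

# Weight-only 4-bit quantization via GPTQ.
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(model=model, recipe=recipe, dataset="open_platypus", num_calibration_samples=512)
```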

Simplified and updated Recipe classes

Recipe classes have been updated with the following features:

  • Merged multiple recipe-related classes into a single, unified Recipe class
  • Simplified modifier creation, lifecycle management, and parsing logic
  • Improved serialization and deserialization for clarity and maintainability
  • Reduced redundant stages and arguments handling for easier debugging and usage

Configurable Observer arguments

Observer arguments can now be configured as a dict through the observer_kwargs quantization argument, which can be set through oneshot recipes.
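
For example, a recipe might select the built-in MSE observer and tune its search parameters. The config_groups layout follows LLM Compressor's existing quantization arguments, but the specific observer_kwargs keys shown are assumptions:

```python
from llmcompressor.modifiers.quantization import QuantizationModifier

recipe = QuantizationModifier(
    targets="Linear",
    ignore=["lm_head"],
    config_groups={
        "group_0": {
            "targets": ["Linear"],
            "weights": {
                "num_bits": 4,
                "type": "int",
                "symmetric": True,
                "strategy": "group",
                "group_size": 128,
                "observer": "mse",
                # observer_kwargs is forwarded to the observer; these keys are assumed
                "observer_kwargs": {"maxshrink": 0.2, "patience": 5},
            },
        }
    },
)
```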

v0.6.0.1

28 Jul 19:05
0461bf9

Full Changelog: 0.6.0...0.6.0.1

v0.6.0

24 Jun 15:22
c052d2c

Full Changelog: 0.5.2...0.6.0

v0.5.2

24 Jun 01:47
c1c8541

v0.5.1

29 Apr 01:34
ef175d7

Full Changelog: 0.5.0...0.5.1

v0.5.0

03 Apr 13:23
25b1138

Full Changelog: 0.4.1...0.5.0

v0.4.1

20 Feb 13:21
6a1ba3c

Full Changelog: 0.4.0...0.4.1

v0.4.0

16 Jan 03:12
829af5b

Full Changelog: 0.3.1...0.4.0

v0.3.1

12 Dec 13:25
c3608a0

Full Changelog: 0.3.0...0.3.1