<img alt="LLM Compressor Flow" src="assets/llmcompressor-user-flows.png" width="100%" style="max-width: 100%;"/>
</p>

## New in this release

Review the [LLM Compressor v0.8.0 release notes](https://github.com/vllm-project/llm-compressor/releases/tag/0.8.0) for details about new features. Highlights include:

!!! info "Support for multiple modifiers in oneshot compression runs"
    LLM Compressor now supports using multiple modifiers in a single oneshot compression run, such as applying both AWQ and GPTQ to the same model.

    Using multiple modifiers is an advanced use of LLM Compressor and an active area of research. See [Non-uniform Quantization](examples/quantization_non_uniform/) for more detail and example usage.
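
    As a rough sketch of what a multi-modifier recipe can look like (the model ID, layer targets, and schemes below are illustrative assumptions, not the official example):

    ```python
    from transformers import AutoModelForCausalLM, AutoTokenizer

    from llmcompressor import oneshot
    from llmcompressor.modifiers.awq import AWQModifier
    from llmcompressor.modifiers.quantization import GPTQModifier

    MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

    # Two modifiers in one recipe: AWQ quantizes most Linear layers, while GPTQ
    # handles the down_proj layers that AWQ is told to ignore (illustrative split).
    recipe = [
        AWQModifier(targets=["Linear"], scheme="W4A16", ignore=["lm_head", "re:.*down_proj"]),
        GPTQModifier(targets=["re:.*down_proj"], scheme="W4A16", ignore=["lm_head"]),
    ]

    oneshot(
        model=model,
        recipe=recipe,
        dataset="open_platypus",  # both modifiers need calibration data
        max_seq_length=2048,
        num_calibration_samples=256,
    )
    ```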

!!! info "Quantization and calibration support for Qwen3 models"
    LLM Compressor now supports quantization and calibration for Qwen3 Next and Qwen3 VL MoE models.

    For Qwen3 Next models, you can use data-free pathways such as FP8 channel-wise and block-wise quantization. Examples for NVFP4 and FP8 quantization have been added for the Qwen3-Next-80B-A3B-Instruct model.

    For the Qwen3 VL MoE model, support has been added for the data-free pathway, which applies FP8 quantization such as channel-wise and block-wise quantization. Pathways that require calibration data, such as W4A16 and NVFP4, are planned for a future release.

    **NOTE**: These models are not supported in transformers<=4.56.2. You may need to install transformers from source.
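
    As a minimal data-free sketch (the MoE gate/router ignore pattern is an assumption; see the released Qwen3-Next examples for the exact configuration):

    ```python
    from transformers import AutoModelForCausalLM, AutoTokenizer

    from llmcompressor import oneshot
    from llmcompressor.modifiers.quantization import QuantizationModifier

    MODEL_ID = "Qwen/Qwen3-Next-80B-A3B-Instruct"
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

    # FP8 channel-wise weights with dynamic per-token activations: no
    # calibration dataset is required, so oneshot is called without one.
    recipe = QuantizationModifier(
        targets="Linear",
        scheme="FP8_DYNAMIC",
        ignore=["lm_head", "re:.*mlp.gate$"],  # assumed: keep router/gate layers unquantized
    )

    oneshot(model=model, recipe=recipe)

    SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-dynamic"
    model.save_pretrained(SAVE_DIR)
    tokenizer.save_pretrained(SAVE_DIR)
    ```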

!!! info "Transforms support for non-full-size rotation sizes"
    You can now set a `transform_block_size` field on the transform-based modifier classes `SpinQuantModifier` and `QuIPModifier`. This field lets you configure transforms of variable size, so Hadamard rotations no longer need to match the full size of the weight.
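
    For example, a sketch pairing a QuIP-style rotation with W4A16 quantization (import paths follow the existing transform examples; the block size and model are illustrative assumptions):

    ```python
    from transformers import AutoModelForCausalLM

    from llmcompressor import oneshot
    from llmcompressor.modifiers.quantization import QuantizationModifier
    from llmcompressor.modifiers.transform import QuIPModifier

    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-3.2-1B-Instruct", torch_dtype="auto"  # placeholder model
    )

    recipe = [
        # Rotate weights in 64x64 blocks rather than one full-size Hadamard.
        QuIPModifier(transform_type="random-hadamard", transform_block_size=64),
        # Weight-only quantization applied after the rotation.
        QuantizationModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
    ]

    oneshot(model=model, recipe=recipe)
    ```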

## Recent Updates

!!! info "QuIP and SpinQuant-style Transforms"

!!! info "Llama4 Quantization Support"
    Quantize a Llama4 model to [W4A16](examples/quantization_w4a16.md) or [NVFP4](examples/quantization_w4a16.md). The checkpoint produced can seamlessly run in vLLM.
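
    As a rough sketch of the NVFP4 recipe shape (generic and illustrative, not the Llama4-specific script; Llama4's multimodal and MoE layers may need additional, model-specific preparation):

    ```python
    from llmcompressor import oneshot
    from llmcompressor.modifiers.quantization import QuantizationModifier

    # NVFP4: FP4 weights with per-group scales; a small calibration set is
    # typically used to compute the global scales.
    recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])

    # `model` stands in for a loaded Llama4 checkpoint (see the linked example).
    # oneshot(model=model, recipe=recipe, dataset="open_platypus", num_calibration_samples=20)
    ```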

!!! info "Large Model Support with Sequential Onloading"
    As of llm-compressor>=0.6.0, you can now quantize very large language models on a single GPU. Models are broken into disjoint layers, which are then onloaded to the GPU one layer at a time. For more information on sequential onloading, see [Big Modeling with Sequential Onloading](examples/big_models_with_sequential_onloading.md) as well as the [DeepSeek-R1 Example](examples/quantizing_moe.md).
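
    A minimal sketch under these assumptions (the model ID and scheme are placeholders; the model stays in CPU memory and layers are onloaded to the single visible GPU during calibration):

    ```python
    from transformers import AutoModelForCausalLM, AutoTokenizer

    from llmcompressor import oneshot
    from llmcompressor.modifiers.quantization import GPTQModifier

    MODEL_ID = "meta-llama/Llama-3.3-70B-Instruct"  # placeholder large model
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")  # loads to CPU
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

    recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

    # Layers are compressed one at a time, so peak GPU memory stays roughly
    # one decoder layer plus its calibration activations.
    oneshot(
        model=model,
        recipe=recipe,
        dataset="open_platypus",
        max_seq_length=2048,
        num_calibration_samples=512,
    )
    ```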

!!! info "Axolotl Sparse Finetuning Integration"
    Seamlessly finetune sparse LLMs with our Axolotl integration. Learn how to create [fast sparse open-source models with Axolotl and LLM Compressor](https://developers.redhat.com/articles/2025/06/17/axolotl-meets-llm-compressor-fast-sparse-open). See also the [Axolotl integration docs](https://docs.axolotl.ai/docs/custom_integrations.html#llmcompressor).

For more information, check out the [latest release on GitHub](https://github.com/vllm-project/llm-compressor/releases/latest).

## Key Features