Overview
The near-term focus for LLM Compressor will center on performance improvements across core workflows, targeted enhancements to NVFP4, stabilizing and hardening MXFP4 support, and broad improvements to modifier functionality. These efforts aim to improve efficiency, robustness, and overall usability, enabling more reliable and scalable model compression workflows.
In addition, we will continue to expand quantization support for the latest model releases to ensure timely compatibility with newly introduced architectures and checkpoints, including adopting transformers v5.0.
We will also focus on improving the quality of our documentation, examples, and CI/CD for easier access and understanding.
Q1 Roadmap
Performance Refactor - Enable Distributed Quantization Support
Status: In Progress
RFC: #2180
Issues:
- [Performance Refactor] Replace accelerate’s offloading with distributed-friendly implementations #2215
- [Performance Refactor] Implement a user interface for loading distributed/offloaded models #2216
- [Performance Refactor] Add data parallelism support to LLM Compressor #2217
Enable Modifier-Specific Support:
- [Performance Refactor] Extend modifiers to support weight-parallel optimization - GPTQModifier #2218
- [Performance Refactor] Extend modifiers to support weight-parallel optimization - AWQModifier #2219
- [Performance Refactor] Extend modifiers to support weight-parallel optimization - QuantizationModifier #2220
- [AutoRound] Add DDP Support and Example #2411
MXFP4 vLLM Integration / Validation
Status: In Progress
- MXFP4A16 Support
- MXFP4 Support - Activation Quantization Validation (move examples out of the experimental folder)
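For context on what MXFP4 support entails, the sketch below illustrates the MX block format per the OCP Microscaling spec: elements are grouped into blocks of 32, each block shares a power-of-two (E8M0-style) scale, and elements are stored in FP4 E2M1 (representable magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6). This is an illustrative toy, not LLM Compressor's implementation; function names are hypothetical.

```python
import math

# Representable magnitudes of FP4 E2M1, the MXFP4 element format.
FP4_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_mxfp4_block(block):
    """Quantize one block (nominally 32 floats) to MXFP4:
    a shared power-of-two scale plus FP4 E2M1 elements.
    Illustrative only; real kernels pack bits and handle edge cases."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    # Shared exponent per the OCP MX recipe: floor(log2(amax)) minus the
    # element format's max exponent (2 for E2M1, whose max normal is 6).
    scale = 2.0 ** (math.floor(math.log2(amax)) - 2)
    quantized = []
    for v in block:
        mag = min(abs(v) / scale, 6.0)  # clip into the FP4 range
        nearest = min(FP4_VALUES, key=lambda q: abs(q - mag))
        quantized.append(math.copysign(nearest, v))
    return scale, quantized

def dequantize_mxfp4_block(scale, quantized):
    """Reconstruct approximate values from scale + FP4 elements."""
    return [scale * q for q in quantized]
```

Values that already sit on the FP4 grid round-trip exactly; everything else lands on the nearest representable point, which is the error GPTQ-style modifiers try to compensate for.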
MXFP8 Support
Status: In Progress
AWQ, GPTQ Improvements and Benchmarking
Status: In Progress
- [AWQ] Modifier Speedups #2265
- [AWQ] Option to disable quantization #2206
- [AWQ] Add Token Masking Support for Calibration #2250
- [NVFP4][GPTQ] Fix GPTQModifier + NVFP4 Support #2294
- [MXFP4][GPTQ] Extend rounding to support FP32 (compressed-tensors#551)
- [MXFP4][GPTQ] Add GPTQ + MXFP4A16 Example #2304
NVFP4 Improvements
Status: Not Yet Started
- vLLM Support for NVFP4 + micro-rotations: [RFC]: MR-GPTQ (GPTQ+NVFP4) #2006
- General benchmarking with AutoRound, QuantizationModifier, and AWQ
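For contrast with MXFP4 above: NVFP4 keeps FP4 E2M1 elements but uses smaller blocks of 16 with an FP8 E4M3 per-block scale (plus a per-tensor FP32 scale), giving finer-grained scaling than MXFP4's power-of-two exponents. The sketch below shows just the scale computation; names are hypothetical, and E4M3 range/denormal handling is omitted for brevity.

```python
import math

def round_to_e4m3(x):
    """Round a nonnegative float to the nearest FP8 E4M3 value
    (3 mantissa bits; saturation and denormals omitted for brevity)."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)       # x = m * 2**e, with 0.5 <= m < 1
    m = round(m * 16) / 16     # keep 3 mantissa bits after the leading 1
    return math.ldexp(m, e)

def nvfp4_block_scale(block, max_fp4=6.0):
    """Per-block NVFP4 scale: map the block's amax onto the largest FP4
    magnitude (6.0), then round the ratio to an FP8 E4M3 value. Unlike
    MXFP4, the scale is not restricted to powers of two."""
    amax = max(abs(v) for v in block)
    return round_to_e4m3(amax / max_fp4) if amax > 0 else 1.0
```

The finer scale granularity is one reason NVFP4 benchmarking against AutoRound, QuantizationModifier, and AWQ (above) is of interest.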
Transformers v5 Support
Status: Not Yet Started
- Support for updated MoE Calibration: [upstream] Expecting future huggingface/transformer incompatibility #2036
Quantized Model Support
Status: In Progress
- FP8 Block + NVFP4 DSR1 Support
- GLM Support: Added GLM Modeling #2170
- MiniMax-M2 Support: Minimax-M2 / M2.1 calibration #2171
- Qwen3.5 Support:
- Calibration Support: [Qwen3.5] Calibration support and NVFP4 Example #2482
- VL Examples: [Examples] Add Qwen3.5-27B NVFP4A16 and MXFP4A16 quantization examples #2467
CI/CD Buildkite Migration
Status: In Progress
- Migrate CI/CD to Buildkite