|
1 | | -# Getting started with LLM Compressor docs |
| 1 | +# What is LLM Compressor? |
2 | 2 |
|
3 | | -```bash |
4 | | -cd docs |
5 | | -``` |
| 3 | +**LLM Compressor** is an easy-to-use library for optimizing large language models for deployment with vLLM. It provides a comprehensive toolkit for applying state-of-the-art compression algorithms to reduce model size, lower hardware requirements, and improve inference performance. |
6 | 4 |
|
7 | | -- Install the dependencies: |
| 5 | +<p align="center"> |
| 6 | + <img alt="LLM Compressor Flow" src="assets/llmcompressor-user-flows.png" width="100%" style="max-width: 100%;"/> |
| 7 | +</p> |
8 | 8 |
|
9 | | -```bash |
10 | | -make install |
11 | | -``` |
| 9 | +## Which challenges does LLM Compressor address? |
12 | 10 |
|
13 | | -- Clean the previous build (optional but recommended): |
| 11 | +Model optimization through quantization and pruning addresses the key challenges of deploying AI at scale: |
14 | 12 |
|
15 | | -```bash |
16 | | -make clean |
17 | | -``` |
| 13 | +| Challenge | How LLM Compressor helps | |
| 14 | +|-----------|--------------------------| |
| 15 | +| GPU and infrastructure costs | Reduces memory requirements by 50-75%, enabling deployment on fewer GPUs | |
| 16 | +| Response latency | Reduces data movement overhead because quantized weights load faster | |
| 17 | +| Request throughput | Utilizes lower-precision tensor cores for faster computation | |
| 18 | +| Energy consumption | Smaller models consume less power during inference | |
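
As a rough check on the 50-75% memory figure in the table above, the following sketch estimates weight storage at different precisions relative to 16-bit weights. It is not part of LLM Compressor. The 8-billion parameter count is an illustrative assumption, and the estimate ignores quantization scales, zero points, activations, and the KV cache.

```python
def weight_memory_gib(num_params: int, bits_per_weight: int) -> float:
    """Approximate weight storage in GiB, ignoring scales and zero points."""
    return num_params * bits_per_weight / 8 / 2**30


def reduction_vs_fp16(bits_per_weight: int) -> float:
    """Fractional memory saving relative to 16-bit weights."""
    return 1 - bits_per_weight / 16


# Hypothetical model size, chosen only for illustration.
NUM_PARAMS = 8_000_000_000

if __name__ == "__main__":
    for bits in (16, 8, 4):
        size = weight_memory_gib(NUM_PARAMS, bits)
        saving = reduction_vs_fp16(bits)
        print(f"{bits:>2}-bit weights: {size:6.2f} GiB (saving {saving:.0%})")
```

Under these assumptions, 8-bit weights correspond to the lower end of the quoted range and 4-bit weights to the upper end; real checkpoints are slightly larger because scales and any unquantized layers are stored as well.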
18 | 19 |
|
19 | | -- Serve the docs: |
| 20 | +For more information, see [Why use LLM Compressor?](./steps/why-llmcompressor.md) |
20 | 21 |
|
21 | | -```bash |
22 | | -make serve |
23 | | -``` |
| 22 | +## New in this release |
24 | 23 |
|
25 | | -This will start a local server at http://localhost:8000. You can now open your browser and view the documentation. |
| 24 | +Review the [LLM Compressor v0.9.0 release notes](https://github.com/vllm-project/llm-compressor/releases/tag/0.9.0) for details about new features. Highlights include: |
| 25 | + |
| 26 | +!!! info "Updated offloading and model loading support" |
| 27 | + Loading transformers models that are offloaded to disk or across distributed process ranks is now supported. Disk offloading lets users load and compress very large models that would not otherwise fit in CPU memory. Offloading is no longer handled through accelerate; it now uses model loading utilities added to compressed-tensors. For a full summary of the updated loading and offloading functionality, for both single-process and distributed flows, see the [Big Models and Distributed Support guide](guides/big_models_and_distributed/model_loading.md). |
| 28 | + |
| 29 | +!!! info "Distributed GPTQ Support" |
| 30 | + GPTQ now supports Distributed Data Parallel (DDP) to significantly reduce calibration runtime. For an example, see the [DDP GPTQ script](https://github.com/vllm-project/llm-compressor/blob/main/examples/quantization_w4a16/llama3_ddp_example.py). |
| 31 | + |
| 32 | +!!! info "Updated FP4 Microscale Support" |
| 33 | + GPTQ now supports FP4 quantization schemes, including both [MXFP4](https://github.com/vllm-project/llm-compressor/blob/main/examples/quantization_w4a16_fp4/mxfp4/llama3_example.py) and [NVFP4](https://github.com/vllm-project/llm-compressor/blob/main/examples/quantization_w4a4_fp4/llama3_gptq_example.py). MXFP4 support has also been improved with updated weight scale generation. As of vLLM v0.14.0, models with weight-only MXFP4 quantization can run in vLLM. MXFP4 models with activation quantization are not yet supported in vLLM for compressed-tensors models. |
| 34 | + |
| 35 | + |
| 36 | +!!! info "New Model-Free PTQ Pathway" |
| 37 | + LLM Compressor adds a new model-free PTQ pathway, `model_free_ptq`. This pathway quantizes a model without requiring a Hugging Face model definition and is especially useful in cases where `oneshot` may fail. It currently supports data-free schemes only, such as FP8 quantization, and was used to quantize the Mistral Large 3 model. Additional examples illustrate how LLM Compressor can be used for Kimi K2. |
| 38 | + |
| 39 | +!!! info "Extended KV Cache and Attention Quantization Support" |
| 40 | + LLM Compressor now supports attention quantization. KV cache quantization, which previously supported only per-tensor scales, has been extended to support any quantization scheme, including a new per-head scheme. Support for these checkpoints in vLLM is ongoing, and scripts to get started have been added to the [experimental](https://github.com/vllm-project/llm-compressor/tree/main/experimental) folder. |
| 41 | + |
| 42 | +## Supported algorithms and techniques |
| 43 | + |
| 44 | +| Algorithm | Description | Use Case | |
| 45 | +|-----------|-------------|----------| |
| 46 | +| **RTN** (Round-to-Nearest) | Fast baseline quantization | Quick compression with minimal setup | |
| 47 | +| **GPTQ** | Weighted quantization with calibration | High-accuracy 4-bit and 8-bit weight quantization |
| 48 | +| **AWQ** | Activation-aware weight quantization | Preserves accuracy for important weights | |
| 49 | +| **SmoothQuant** | Outlier handling for W8A8 | Improved activation quantization | |
| 50 | +| **SparseGPT** | Pruning with quantization | 2:4 sparsity patterns | |
| 51 | +| **SpinQuant** | Rotation-based transforms | Improved low-bit accuracy | |
| 52 | +| **QuIP** | Incoherence processing | Advanced quantization preprocessing | |
| 53 | +| **FP8 KV Cache** | KV cache quantization | Long context inference on Hopper-class and newer GPUs | |
| 54 | +| **AutoRound** | Optimizes rounding and clipping ranges via sign-gradient descent | Broad compatibility | |
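
To make the RTN baseline in the table above concrete, here is a minimal pure-Python sketch of symmetric round-to-nearest quantization with a single per-tensor scale. It is an illustration of the general technique, not LLM Compressor's implementation; the function names `rtn_quantize` and `dequantize` are hypothetical, and production schemes typically use per-channel or per-group scales instead.

```python
def rtn_quantize(weights, num_bits=4):
    """Symmetric round-to-nearest quantization with one scale per tensor.

    Returns (integer codes, scale). Illustrative sketch only.
    """
    qmax = 2 ** (num_bits - 1) - 1  # largest positive code, e.g. 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round each weight to the nearest representable level, then clamp
    # to the signed integer range [-qmax - 1, qmax].
    codes = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    weights = [1.4, -0.6, 0.2, 0.0]
    codes, scale = rtn_quantize(weights, num_bits=4)
    restored = dequantize(codes, scale)
    worst = max(abs(w - r) for w, r in zip(weights, restored))
    print("codes:", codes)
    print("max reconstruction error:", worst)
```

Because RTN rounds each weight independently, the per-weight error is bounded by half the scale. Calibration-based methods such as GPTQ and AWQ aim to reduce the resulting error in layer outputs rather than in individual weights.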
| 55 | + |
| 56 | +## Supported quantization schemes |
| 57 | + |
| 58 | +LLM Compressor supports applying multiple formats in a given model. |
| 59 | + |
| 60 | +| Format | Targets | Compute Capability | Use Case | |
| 61 | +|--------|---------|-------------------|----------| |
| 62 | +| **W4A16/W8A16** | Weights | 8.0 (Ampere and up) | Optimize for latency on older hardware | |
| 63 | +| **W8A8-INT8** | Weights and activations | 7.5 (Turing and up) | Balanced performance and compatibility | |
| 64 | +| **W8A8-FP8** | Weights and activations | 8.9 (Ada Lovelace and up) | High throughput on modern GPUs |
| 65 | +| **NVFP4/MXFP4** | Weights and activations | 10.0 (Blackwell) | Maximum compression on latest hardware | |
| 66 | +| **W4AFP8** | Weights and activations | 8.9 (Ada Lovelace and up) | Low-bit weights with dynamic FP8 activations |
| 67 | +| **W4AINT8** | Weights and activations | 7.5 (Turing and up) | Low-bit weights with dynamic INT8 activations | |
| 68 | +| **2:4 Sparse** | Weights | 8.0 (Ampere and up) | Sparsity-accelerated inference | |
| 69 | + |
| 70 | +!!! note |
| 71 | + Listed compute capability indicates the minimum architecture required for hardware acceleration. |
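
The minimum compute capabilities in the table above can be turned into a small lookup that lists which formats a given GPU can accelerate. This is a sketch built only from the table values; the dictionary and the `accelerated_formats` helper are hypothetical names and not part of LLM Compressor or vLLM.

```python
# Minimum (major, minor) compute capability per format, copied from the table.
MIN_COMPUTE_CAPABILITY = {
    "W4A16/W8A16": (8, 0),
    "W8A8-INT8": (7, 5),
    "W8A8-FP8": (8, 9),
    "NVFP4/MXFP4": (10, 0),
    "W4AFP8": (8, 9),
    "W4AINT8": (7, 5),
    "2:4 Sparse": (8, 0),
}


def accelerated_formats(capability):
    """Return the formats whose minimum capability is met, sorted by name.

    `capability` is a (major, minor) tuple such as (8, 9).
    """
    return sorted(
        name
        for name, minimum in MIN_COMPUTE_CAPABILITY.items()
        if capability >= minimum  # tuples compare major first, then minor
    )


if __name__ == "__main__":
    for cap in [(7, 5), (8, 0), (8, 9), (10, 0)]:
        print(cap, "->", accelerated_formats(cap))
```

On a machine with PyTorch and a CUDA device, `torch.cuda.get_device_capability()` returns a `(major, minor)` tuple that could be passed to the helper directly.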