Commit c6e409f

dsikka and brian-dellabetta authored and committed

[Docs] Reorganize + Additional Guides (vllm-project#2379)

SUMMARY:
- Add choosing a model
- Add choosing a dataset
- Re-organize to set up a step-by-step compression guide
- Additional clean-up and organization

Sample Doc Generation: https://vllm--2379.org.readthedocs.build/projects/llm-compressor/en/2379/

Signed-off-by: Dipika Sikka <ds3822@columbia.edu>
Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
Signed-off-by: yiliu30 <yi4.liu@intel.com>

1 parent d0ca503 · commit c6e409f

File tree: 16 files changed, +246 -156 lines


docs/.nav.yml

Lines changed: 13 additions & 7 deletions

@@ -1,14 +1,16 @@
 nav:
   - Home: index.md
-  - Why use LLM Compressor?: getting-started/why-llmcompressor.md
-  - Choosing the right compression scheme: getting-started/choosing-scheme.md
-  - Choosing the right compression algorithm: getting-started/choosing-algo.md
+  - Why use LLM Compressor?: steps/why-llmcompressor.md
+  - Compressing your model, step-by-step:
+    - Choosing your model: steps/choosing-model.md
+    - Choosing the right compression scheme: steps/choosing-scheme.md
+    - Choosing the right compression algorithm: steps/choosing-algo.md
+    - Choosing a dataset: steps/choosing-dataset.md
+    - Compressing your model: steps/compress.md
+    - Deploying with vLLM: steps/deploy.md
   - Getting started:
     - getting-started/index.md
     - Installing LLM Compressor: getting-started/install.md
-    - Compressing your Model: getting-started/compress.md
-    - Deploying with vLLM: getting-started/deploy.md
-    - FAQ: getting-started/faq.md
   - Key Models:
     - key-models/index.md
     - Llama 4:
@@ -26,7 +28,9 @@ nav:
   - Guides:
     - Compression Schemes: guides/compression_schemes.md
    - Saving a Model: guides/saving_a_model.md
-    - Observers: observers.md
+    - Observers: guides/observers.md
+    - Memory Requirements: guides/memory.md
+    - Runtime Performance: guides/runtime.md
   - Examples:
     - examples/index.md
     - examples/*
@@ -35,3 +39,5 @@ nav:
     - developer/*
   - API Reference:
     - api/*
+  - FAQ:
+    - faq/faq.md
Lines changed: 1 addition & 1 deletion

@@ -16,7 +16,7 @@ This involves understanding your hardware availability and inference requirements
 
 **4. What are the memory requirements for compression?**
 
-Refer to [Memory Requirements for LLM Compressor](compress.md#memory-requirements-for-llm-compressor).
+Refer to [Memory Requirements for LLM Compressor](../guides/memory.md).
 
 **5. Which model layers should be quantized?**
 

docs/getting-started/compress.md

Lines changed: 0 additions & 127 deletions
This file was deleted.

docs/getting-started/index.md

Lines changed: 6 additions & 6 deletions

@@ -1,6 +1,6 @@
 # Getting Started
 
-Welcome to LLM Compressor! This section will guide you through the process of installing the library, compressing your first model, and deploying it with vLLM for faster, more efficient inference.
+This section will guide you through the process of installing the library, compressing your first model, and deploying it with vLLM for faster, more efficient inference.
 
 LLM Compressor makes it simple to optimize large language models for deployment, offering various quantization techniques that help you find the perfect balance between model quality, performance, and resource efficiency.
 
@@ -16,7 +16,7 @@ Follow the guides below to get started with LLM Compressor and optimize your models
 
 Learn about the benefits of model optimization and how LLM Compressor helps reduce costs and improve performance.
 
-[:octicons-arrow-right-24: Why LLM Compressor](why-llmcompressor.md)
+[:octicons-arrow-right-24: Why LLM Compressor](../steps/why-llmcompressor.md)
 
 - :material-package-variant:{ .lg .middle } Installation
 
@@ -30,24 +30,24 @@ Follow the guides below to get started with LLM Compressor and optimize your models
 
 ---
 
-Learn how to apply quantization to your models using different algorithms and formats.
+Learn how to compress your model using different algorithms and formats with a step-by-step walkthrough.
 
-[:octicons-arrow-right-24: Compression Guide](compress.md)
+[:octicons-arrow-right-24: Compression Guide](../steps/choosing-model.md)
 
 - :material-rocket-launch:{ .lg .middle } Deploy with vLLM
 
 ---
 
 Deploy your compressed model for efficient inference using vLLM.
 
-[:octicons-arrow-right-24: Deployment Guide](deploy.md)
+[:octicons-arrow-right-24: Deployment Guide](../steps/deploy.md)
 
 - :material-rocket-launch:{ .lg .middle } FAQ
 
 ---
 
 View the most frequently asked questions for LLM Compressor.
 
-[:octicons-arrow-right-24: FAQ](faq.md)
+[:octicons-arrow-right-24: FAQ](../faq/faq.md)
 
 </div>

docs/guides/README.md

Lines changed: 16 additions & 0 deletions

@@ -22,4 +22,20 @@ Welcome to the LLM Compressor guides section! Here you'll find comprehensive documentation
 
 [:octicons-arrow-right-24: Saving a Model](saving_a_model.md)
 
+- :material-content-save:{ .lg .middle } Memory requirements
+
+  ---
+
+  Learn about LLM Compressor's memory requirements for various supported algorithms.
+
+  [:octicons-arrow-right-24: LLM Compressor Memory Requirements](./memory.md)
+
+- :material-content-save:{ .lg .middle } Runtime requirements
+
+  ---
+
+  Learn about LLM Compressor's runtime requirements for various supported algorithms.
+
+  [:octicons-arrow-right-24: LLM Compressor Runtime Requirements](./runtime.md)
+
 </div>

docs/guides/memory.md

Lines changed: 51 additions & 0 deletions

@@ -0,0 +1,51 @@
+# Memory requirements for LLM Compressor
+
+When compressing a model, be aware that memory requirements depend on both the model size and the algorithm used, such as GPTQ/SparseGPT.
+
+This section shows how to calculate the CPU and GPU memory requirements for each algorithm, using several popular models as examples: an 8B model, a 684B model, and a model with vision capabilities.
+
+GPTQ/SparseGPT requires a large amount of auxiliary memory: it allocates an auxiliary Hessian matrix for each layer that is onloaded to the GPU, and these Hessian matrices are almost as large as the weights they represent.
+
+In addition, larger models such as DeepSeek R1 use a large amount of CPU memory, and models with large vision towers, such as Command A, may use large amounts of GPU memory.
+
+## Things to note when calculating memory requirements for LLM Compressor
+
+1. A 1B-parameter model uses about 2 GB of memory to load:
+    ```
+    mem(1B parameters) ~= (1B parameters) * (2 bytes / parameter) = 2B bytes ~= 2 GB
+    ```
+
+2. How text decoder layers and vision tower layers are loaded onto the GPU differs significantly.
+
+    In the case of text decoder layers, LLM Compressor dynamically loads one layer at a time into the GPU for computation. The rest of the model remains in CPU memory.
+
+    However, vision tower layers are loaded onto the GPU all at once. Unlike the text model, vision towers are not split up into individual layers before onloading to the GPU. This can create a GPU memory bottleneck for models whose vision towers are larger than their text layers.
+
+    At this time, LLM Compressor does not quantize the vision tower, as quantization is generally not worth the tradeoff between latency/throughput and accuracy loss.
+
+3. LLM Compressor does not currently support tensor parallelism for compression. Supporting this feature would allow layers to be sharded across GPUs, leading to reduced memory usage per GPU and faster compression.
+
+## QuantizationModifier or Round-To-Nearest (RTN)
+
+The quantization modifier, RTN, does not require any additional memory beyond the storage needed for its quantization parameters (scales/zero points).
+
+If we ignore these scales and zero points in our calculation, we can estimate the following memory requirements:
+
+| Model | CPU requirements | GPU requirements |
+|-------|------------------|------------------|
+| **Meta-Llama-3-8B-Instruct** | mem(8B params) ~= 16 GB | mem(1 Layer) ~= 0.5 GB |
+| **DeepSeek-R1-0528-BF16** | mem(684B params) ~= 1368 GB | mem(1 Layer) ~= 22.4 GB |
+| **Qwen2.5-VL-7B-Instruct** | mem(7B params) ~= 14 GB | max(mem(1 Text Layer) ~= 0.4 GB, mem(Vision tower) ~= 1.3 GB) ~= 1.3 GB |
+
+## GPT Quantization (GPTQ) / SparseGPT
+
+The GPTQ/SparseGPT algorithms differ from RTN in that they must also allocate auxiliary Hessian matrices for any layers that are onloaded to the GPU.
+
+The Hessian matrix is used to improve the accuracy recovery of the algorithm, and is approximately the same size as the original weights.
+
+| Model | CPU requirements | GPU requirements |
+|-------|------------------|------------------|
+| **Meta-Llama-3-8B-Instruct** | mem(8B params) ~= 16 GB | mem(1 Layer) * 2 ~= 1 GB |
+| **DeepSeek-R1-0528-BF16** | mem(684B params) ~= 1368 GB | mem(1 Layer) * 2 ~= 44.8 GB |
+| **Qwen2.5-VL-7B-Instruct** | mem(7B params) ~= 14 GB | max(mem(1 Text Layer) ~= 0.4 GB, mem(Vision tower) ~= 1.3 GB) * 2 ~= 2.6 GB |
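The arithmetic behind the two tables above can be sketched in a few lines. This is a hedged illustration only: the helper names are hypothetical (not part of the LLM Compressor API), and it assumes bf16/fp16 weights (2 bytes per parameter) with GPTQ/SparseGPT roughly doubling per-layer GPU memory for the Hessian.

```python
# Hypothetical helpers sketching the memory estimates above
# (not part of the LLM Compressor API).

BYTES_PER_PARAM = 2  # bf16 / fp16 weights


def model_cpu_gb(params_billion: float) -> float:
    """CPU memory (GB) to hold the full model: params * 2 bytes."""
    return params_billion * BYTES_PER_PARAM


def layer_gpu_gb(layer_params_billion: float, hessian: bool = False) -> float:
    """GPU memory (GB) for one onloaded layer; GPTQ/SparseGPT add a
    Hessian roughly the size of the weights, doubling the requirement."""
    gb = layer_params_billion * BYTES_PER_PARAM
    return 2 * gb if hessian else gb


# Meta-Llama-3-8B-Instruct, matching the tables above:
print(model_cpu_gb(8))                    # 16 GB on CPU
print(layer_gpu_gb(0.25))                 # 0.5 GB per layer (RTN)
print(layer_gpu_gb(0.25, hessian=True))   # 1.0 GB per layer (GPTQ/SparseGPT)
```

The same helpers reproduce the other rows, e.g. `layer_gpu_gb(11.2, hessian=True)` for a 22.4 GB DeepSeek layer under GPTQ.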
Lines changed: 3 additions & 3 deletions

@@ -6,7 +6,7 @@ Observers are designed to be flexible and support a variety of quantization strategies
 
 ## Base Class
 
-### [Observer](../src/llmcompressor/observers/base.py)
+### [Observer](../../src/llmcompressor/observers/base.py)
 Base class for all observers. Subclasses must implement the `calculate_qparams` method to define how quantization parameters are computed.
 
 The base class handles:
@@ -20,14 +20,14 @@ This class is not used directly but provides the scaffolding for all custom observers
 
 ## Implemented Observers
 
-### [MinMax](../src/llmcompressor/observers/min_max.py)
+### [MinMax](../../src/llmcompressor/observers/min_max.py)
 Computes `scale` and `zero_point` by tracking the minimum and maximum of the observed tensor. This is the simplest and most common observer. Works well for symmetric and asymmetric quantization.
 
 Best used for:
 - Int8 or Int4 symmetric quantization
 - Channel-wise or group-wise strategies
 
-### [MSE](../src/llmcompressor/observers/mse.py)
+### [MSE](../../src/llmcompressor/observers/mse.py)
 Computes quantization parameters by minimizing the Mean Squared Error (MSE) between the original and quantized tensor. Optionally maintains a moving average of min/max values for smoother convergence.
 
 Best used when:
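The MinMax description above lends itself to a small worked example. Here is a hedged plain-Python sketch of what a min-max observer computes, assuming standard int8 conventions; it is not the actual `calculate_qparams` implementation in llmcompressor, and edge cases (such as an all-constant tensor) are ignored.

```python
# Sketch of min-max quantization parameter calculation (illustrative only,
# not llmcompressor's implementation).

def minmax_qparams(values, num_bits=8, symmetric=True):
    """Return (scale, zero_point) for a list of observed floats."""
    lo, hi = min(values), max(values)
    if symmetric:
        # Symmetric: center the range on zero, so zero_point is 0.
        amax = max(abs(lo), abs(hi))
        scale = amax / (2 ** (num_bits - 1) - 1)  # 127 for int8
        return scale, 0
    # Asymmetric: map [lo, hi] onto the full unsigned range, e.g. [0, 255].
    qmax = 2 ** num_bits - 1
    scale = (hi - lo) / qmax
    zero_point = round(-lo / scale)
    return scale, zero_point


scale, zero_point = minmax_qparams([-1.27, 0.5, 1.0])
print(scale, zero_point)  # scale ~= 0.01, zero_point = 0
```

The MSE observer would instead search over candidate clipping ranges and keep the one minimizing reconstruction error, rather than taking the raw min/max directly.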

docs/guides/runtime.md

Lines changed: 10 additions & 0 deletions

@@ -0,0 +1,10 @@
+# Runtime requirements for LLM Compressor
+
+The following are typical runtimes for each LLM Compressor algorithm, based on runs using Meta-Llama-3-8B-Instruct on an NVIDIA A100 Tensor Core GPU.
+
+| Algorithm | Estimated Time |
+|-----------|----------------|
+| **RTN (QuantizationModifier)** <br> Weights only (no activation quantization) | ~1 minute |
+| **RTN (QuantizationModifier)** <br> Weights and activations | ~20 minutes |
+| **GPTQ** (weights only) | ~30 minutes |
+| **AWQ** (weights only) | ~30 minutes |

docs/index.md

Lines changed: 1 addition & 1 deletion

@@ -17,7 +17,7 @@ Model optimization through quantization and pruning addresses the key challenges
 | Request throughput | Utilizes lower-precision tensor cores for faster computation |
 | Energy consumption | Smaller models consume less power during inference |
 
-For more information, see [Why use LLM Compressor?](./getting-started/why-llmcompressor.md)
+For more information, see [Why use LLM Compressor?](./steps/why-llmcompressor.md)
 
 ## New in this release
 

Lines changed: 1 addition & 10 deletions

@@ -56,16 +56,6 @@ Use the table below to select the algorithm that best matches your deployment requirements
 | SpinQuant or QuIP + GPTQ | Best low-bit accuracy |
 | FP8 KV Cache | Target KV Cache or attention activations |
 
-## Supported model types
-
-The following model architectures are fully supported in LLM Compressor:
-
-| Model type | Notes |
-|------------|-------|
-| Standard language models | Llama, Mistral, Qwen, and more |
-| Multimodal/Vision models | Vision-language models |
-| Mixture of Experts (MoE) models | DeepSeek, Qwen MoE, Mistral |
-| Large multi-GPU models | Multi-GPU and CPU offloading support |
 
 ### Mixed-precision quantization for accuracy recovery
 
@@ -88,5 +78,6 @@ See [the non-uniform quantization examples](https://github.com/vllm-project/llm-
 
 ## Next steps
 
+- [Choosing your dataset](./choosing-dataset.md)
 - [Compress your first model](compress.md)
 - [Deploy with vLLM](deploy.md)
