
Commit ff526d7

[Docs] Add Sequential Onloading, Disk Offloading, and Distributed Oneshot Docs (#2396)
## Purpose ##

* Add documentation for new features in v0.10.0
* Add up-to-date documentation on sequential onloading

## Changes ##

* Add docs page for Sequential Onloading
* Add docs page for Model Loading
* Add docs page for Distributed Oneshot
* Fix the path of observers.md
* Slightly change wording on docs home page
* Add redirect to model loading docs in disk offloading examples folder

---------

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
1 parent 1e4d3c5 commit ff526d7

File tree

11 files changed: +191 -2 lines changed

docs/.nav.yml

Lines changed: 3 additions & 0 deletions
```diff
@@ -27,6 +27,9 @@ nav:
       - FP8 Example: key-models/mistral-large-3/fp8-example.md
   - Guides:
       - Compression Schemes: guides/compression_schemes.md
+      - Sequential Onloading: guides/sequential_onloading.md
+      - Model Loading: guides/model_loading.md
+      - Distributed Oneshot: guides/distributed_oneshot.md
       - Saving a Model: guides/saving_a_model.md
       - Observers: guides/observers.md
       - Memory Requirements: guides/memory.md
```

docs/assets/model_graph.jpg

463 KB (binary image added)

docs/assets/seq_targets.jpg

116 KB (binary image added)

Two further image assets added (153 KB and 138 KB).

docs/guides/distributed_oneshot.md

Lines changed: 71 additions & 0 deletions
# Distributed Oneshot #

As an experimental feature, LLM Compressor supports distributed oneshot to greatly speed up the runtime of model calibration and compression. For more information on the implementation, see [[RFC] [Performance Refactor][Distributed] Sequential Onloading with Data-Parallel Calibration and Weight-Parallel Optimization](https://github.com/vllm-project/llm-compressor/issues/2180) as well as [[GPTQ][ddp] enabling DDP for GPTQ](https://github.com/vllm-project/llm-compressor/pull/2333).

## Usage ##

To convert a script written for single-process compression into one that performs distributed compression, make the following changes:

### 1. Initialize the Distributed Context ###

To use the `torch.distributed` module, each rank must initialize the distributed process group and assign itself a separate GPU device. This can be done by calling the `init_dist` utility provided by `compressed_tensors`.

```python
from compressed_tensors.offload import init_dist

init_dist()
```

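
For intuition, a minimal distributed setup amounts to creating a process group and pinning each rank to its own GPU. The sketch below is a hypothetical illustration assuming a `torchrun` launch and the NCCL backend; it is not the `init_dist` implementation itself, which may differ.

```python
# Hypothetical sketch of what a minimal distributed setup does; the actual
# `init_dist` utility from compressed_tensors may differ in its details.
import os

import torch
import torch.distributed as dist


def init_dist_sketch() -> None:
    # torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK for each process
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))

    # create one process group spanning all ranks (NCCL for GPU collectives)
    dist.init_process_group(backend="nccl")

    # pin each rank to its own GPU so ranks do not contend for device 0
    torch.cuda.set_device(local_rank)
```
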
### 2. Modify Model Loading ###

To prevent separate processes from loading the model multiple times and creating excess work and memory usage, load the model using the `load_offloaded_model` context. For more information, see [Model Loading](./model_loading.md#distributed-oneshot).

Before:
```python
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype="auto"
)
```

After:
```python
from compressed_tensors.offload import load_offloaded_model

with load_offloaded_model():
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        dtype="auto",
        device_map="auto_offload",
    )
```

### 3. Modify Dataset Loading ###

To prevent separate processes from loading the entire dataset and creating excess work and memory usage, partition the dataset into disjoint subsets. For a dataset of *N* samples and *R* ranks, each rank loads only *N/R* samples.

Before:
```python
ds = load_dataset(
    DATASET_ID, split=f"{DATASET_SPLIT}[:{NUM_CALIBRATION_SAMPLES}]"
)
```

After:
```python
from llmcompressor.datasets.utils import get_rank_partition

ds = load_dataset(
    DATASET_ID, split=get_rank_partition(DATASET_SPLIT, NUM_CALIBRATION_SAMPLES)
)
```

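
For intuition, the per-rank split amounts to slicing the calibration set into *R* contiguous, disjoint ranges. The helper below is a hypothetical sketch of that arithmetic, not the actual `get_rank_partition` implementation, which may differ (for example, in how it handles remainders).

```python
# Hypothetical illustration of the N/R partitioning idea; the real
# `get_rank_partition` helper in llmcompressor may behave differently.
import torch.distributed as dist


def rank_partition_sketch(split: str, num_samples: int) -> str:
    rank = dist.get_rank()
    world_size = dist.get_world_size()

    # each rank gets a disjoint, contiguous slice of roughly N/R samples
    per_rank = num_samples // world_size
    start = rank * per_rank
    end = num_samples if rank == world_size - 1 else start + per_rank

    # returns a split string in the `datasets` slicing syntax, e.g. "train_sft[256:512]"
    return f"{split}[{start}:{end}]"
```
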
### 4. Call your script with `torchrun` ###

Your script is now ready to run across distributed processes. To start, run your script with `torchrun --nproc_per_node=2 YOUR_EXAMPLE.py` to use two GPU devices. For a complete example script, see [llama3_ddp_example.py](/examples/quantization_w4a16/llama3_ddp_example.py). The table below shows results and speedups as of LLM Compressor v0.10.0; future changes will bring these numbers closer to linear speedups.

| model_id | world_size | max_time | max_memory | save_time | flex_extract | eval_time |
|----------|------------|----------|------------|-----------|--------------|-----------|
| Meta-Llama-3-8B-Instruct | 1 | 745.03 | 5.82 | 19.57 | 0.7066 | 95.28 |
| Meta-Llama-3-8B-Instruct | 2 | 372.20 | 5.57 | 49.10 | 0.7089 | 95.24 |
| Meta-Llama-3-8B-Instruct | 4 | 264.07 | 5.82 | 52.50 | 0.7180 | 96.74 |
| Qwen3-30B-A3B | 1 | 14207.53 | 6.56 | 748.23 | 0.8704 | 209.93 |
| Qwen3-30B-A3B | 2 | 7018.25 | 6.36 | 696.65 | 0.8810 | 205.89 |
| Qwen3-30B-A3B | 4 | 3694.46 | 6.36 | 723.05 | 0.8832 | 217.62 |
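
For reference, the `max_time` column corresponds to speedups of roughly 2.0x (745.03 / 372.20) and 2.8x (745.03 / 264.07) for Meta-Llama-3-8B-Instruct at 2 and 4 ranks, and roughly 2.0x (14207.53 / 7018.25) and 3.8x (14207.53 / 3694.46) for Qwen3-30B-A3B.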

docs/guides/memory.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -17,7 +17,7 @@ Also, larger models, like DeepSeek R1 use a large amount of CPU memory, and mode
 2. How text decoder layers and vision tower layers are loaded on to GPU differs significantly.
 
-In the case of text decoder layers, LLM Compressor dynamically loads one layer at a time into the GPU for computation. The rest of the model remains in CPU memory.
+In the case of text decoder layers, LLM Compressor typically loads one layer at a time into the GPU for computation, while the rest remains offloaded in CPU/Disk memory. For more information, see [Sequential Onloading](./sequential_onloading.md).
 
 However, vision tower layers are loaded onto GPU all at once. Unlike the text model, vision towers are not split up into individual layers before onloading to the GPU. This can create a GPU memory bottleneck for models whose vision towers are larger than their text layers.
```

docs/guides/model_loading.md

Lines changed: 71 additions & 0 deletions
# Model Loading #

LLM Compressor uses the [Compressed Tensors](https://github.com/vllm-project/compressed-tensors) library to handle model offloading. In nearly all cases, it is recommended to compress your model using the [sequential pipeline](./sequential_onloading.md), which enables quantization of large models without requiring significant VRAM.

!!! tip
    For more information on when to use the *basic* pipeline rather than the *sequential* pipeline, see [Basic Pipeline](./model_loading.md#basic-pipeline). In those cases, it is recommended to load your model onto the GPU first, rather than CPU/disk.

Loading your model directly onto CPU is simple using `transformers`:

```python
from transformers import AutoModelForCausalLM

# model is on cpu
model = AutoModelForCausalLM.from_pretrained(model_stub, dtype="auto")
```

However, there are cases where this logic must be changed to handle more advanced loading. The tables below show the behavior of the different model loading configurations.

| Distributed=False | device_map="auto" | device_map="cuda" | device_map="cpu" | device_map="auto_offload" |
| -- | -- | -- | -- | -- |
| `load_offloaded_model` context required? | No | No | No | Yes |
| Behavior | Tries to load the model onto all visible cuda devices. Falls back to cpu and disk if the model is too large | Tries to load the model onto the first cuda device only. Errors if the model is too large | Tries to load the model onto cpu. Errors if the model is too large | Tries to load the model onto cpu. Falls back to disk if the model is too large |
| LLM Compressor Examples | This is the recommended load option when using the "basic" pipeline | | | This is the recommended load option when using the "sequential" pipeline |

| Distributed=True | device_map="auto" | device_map="cuda" | device_map="cpu" | device_map="auto_offload" |
| -- | -- | -- | -- | -- |
| `load_offloaded_model` context required? | Yes | Yes | Yes | Yes |
| Behavior | Tries to load the model onto device 0, then broadcasts replicas to the other devices. Falls back to cpu and disk if the model is too large | Tries to load the model onto device 0 only, then broadcasts replicas to the other devices. Errors if the model is too large | Tries to load the model onto cpu. Errors if the model is too large | Tries to load the model onto cpu. Falls back to disk if the model is too large |
| LLM Compressor Examples | This is the recommended load option when using the "basic" pipeline | | | This is the recommended load option when using the "sequential" pipeline |

## Disk Offloading ##

When compressing models that are larger than the available CPU memory, it is recommended to use disk offloading for any weights that cannot fit on the cpu. To enable disk offloading, load your model with the `load_offloaded_model` context from `compressed_tensors` along with `device_map="auto_offload"`.

```python
from compressed_tensors.offload import load_offloaded_model
from transformers import AutoModelForCausalLM

with load_offloaded_model():
    model_id = "Qwen/Qwen3-0.6B"
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        dtype="auto",
        device_map="auto_offload",  # fit as much as possible on cpu, rest goes on disk
        max_memory={"cpu": 6e8},  # optional argument to specify how much cpu memory to use
        offload_folder="./offload_folder",  # file system with lots of storage
    )
```

To control where disk-offloaded weights are stored, pass the `offload_folder` argument.

You can then call `oneshot` as usual to perform calibration and compression. Some operations may be slower due to disk offloading.

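
For example, a subsequent `oneshot` call might look like the following sketch; the recipe and dataset here are illustrative placeholders rather than required settings.

```python
# Illustrative oneshot call after loading with disk offloading; the recipe
# and dataset below are example placeholders, not required settings.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

oneshot(
    model=model,  # loaded above with device_map="auto_offload"
    dataset="ultrachat_200k",  # example calibration dataset
    recipe=GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
    max_seq_length=2048,
    num_calibration_samples=512,
)
```
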
## Distributed Oneshot ##

When performing `oneshot` with distributed computing, you will need to ensure that your model does not replicate offloaded values across ranks; otherwise, this creates excess work and memory usage. Coordinated loading between ranks is handled automatically by the `load_offloaded_model` context, so long as it is entered after `torch.distributed` has been initialized.

```python
from compressed_tensors.offload import init_dist, load_offloaded_model
from transformers import AutoModelForCausalLM

init_dist()
with load_offloaded_model():
    model = AutoModelForCausalLM.from_pretrained(
        model_id, dtype="auto", device_map="auto_offload"
    )
```

## Basic Pipeline ##

It is recommended to use the basic pipeline only when your model is small enough to fit into the available VRAM, including any auxiliary memory requirements of algorithms such as GPTQ Hessians. The basic pipeline can provide compression runtime speedups compared to the sequential pipeline.

In these cases, you can load the model directly onto your GPU devices and call `oneshot` with `pipeline="basic"`.

```python
model = AutoModelForCausalLM.from_pretrained(model_stub, device_map="auto")  # model is on gpu devices
...
oneshot(model, ..., pipeline="basic")
```
docs/guides/sequential_onloading.md

Lines changed: 42 additions & 0 deletions
# Sequential Onloading #

## Introduction ##

LLM Compressor is capable of compressing models much larger than the available VRAM. This is achieved through a technique called **sequential onloading**, whereby only a fraction of the model weights is moved to GPU memory for calibration while the rest of the weights remain offloaded to CPU or disk. When performing calibration, the entire dataset is offloaded to CPU, then onloaded one batch at a time to reduce peak activation memory usage.

![sequential_onloading](../assets/sequential_onloading.jpg)

If basic calibration/inference is represented with the following pseudocode...

```python
for i in range(len(activations)):
    for layer in model.layers:
        activations[i] = layer(activations[i])
```

...then sequential onloading is the technique by which the order of the two for loops is swapped.

```python
for layer in model.layers:
    for i in range(len(activations)):
        activations[i] = layer(activations[i])
```

## Implementation ##

Before a model can be sequentially onloaded, it must first be broken up into disjoint parts which can be individually onloaded. This is achieved through the [torch.fx.Tracer](https://github.com/pytorch/pytorch/blob/main/torch/fx/README.md#tracing) module, which allows a model to be represented as a graph of operations (nodes) and data inputs (edges). Once the model has been traced into a valid graph representation, the graph is cut (partitioned) into disjoint subgraphs, each of which is onloaded individually as a layer. This implementation can be found [here](/src/llmcompressor/pipelines/sequential/helpers.py).

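As a toy illustration of this graph representation (not LLM Compressor's actual tracing code, which handles far more complex architectures), a small module can be traced into a graph of operations as follows:

```python
# Toy illustration of representing a model as a torch.fx graph; LLM Compressor's
# actual sequential tracing and partitioning logic is more involved.
import torch
from torch import fx, nn


class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(8, 8) for _ in range(3)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            x = torch.relu(layer(x))
        return x


graph_module = fx.symbolic_trace(TinyModel())
# each call_module node in the printed graph corresponds to a layer
# that could be onloaded separately
print(graph_module.graph)
```
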
![sequential_onloading](../assets/model_graph.jpg)
*This image depicts some of the operations performed when executing the Llama3.2-Vision model*

![sequential_onloading](../assets/sequential_decoder_layers.jpg)
*This image depicts the sequential text decoder layers of the Llama3.2-Vision model. Each of the individual decoder layers is onloaded separately*

## Sequential Targets and Usage ##

You can use sequential onloading by calling `oneshot` with the `pipeline="sequential"` argument. Note that this pipeline is the default for all oneshot calls that require calibration data. If the sequential pipeline proves problematic, you can specify `pipeline="basic"` to use a basic pipeline which does not use sequential onloading, but which only performs well when the model is small enough to fit into the available VRAM.

If you are compressing a model using a GPU with a small amount of memory, you may need to change your sequential targets. Sequential targets control how many weights are onloaded to the GPU at a time. By default, the sequential targets are decoder layers, which may include large MoE layers. In these cases, setting the `sequential_targets="Linear"` argument in `oneshot` will result in lower VRAM usage, but a longer runtime.

![sequential_onloading](../assets/seq_targets.jpg)

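A minimal sketch of such a call is shown below; the recipe and dataset are illustrative placeholders.

```python
# Minimal sketch of trading runtime for lower VRAM via finer sequential targets;
# the recipe and dataset below are illustrative placeholders.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

oneshot(
    model=model,
    dataset="ultrachat_200k",
    recipe=GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
    pipeline="sequential",  # default whenever calibration data is required
    sequential_targets="Linear",  # onload individual Linear modules instead of whole decoder layers
)
```
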
## More information ##

For more information, see the [RedHat AI blog post](https://developers.redhat.com/articles/2025/05/09/llm-compressor-optimize-llms-low-latency-deployments#generalizing_to_multimodal_and_moe_architectures) or the [LLM Compressor Office Hours Recording](https://www.youtube.com/watch?v=GrhuqQDmBk8).

docs/index.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -6,7 +6,7 @@
 <img alt="LLM Compressor Flow" src="assets/llmcompressor-user-flows.png" width="100%" style="max-width: 100%;"/>
 </p>
 
-## What challenges does LLM Compressor address?
+## Which challenges does LLM Compressor address?
 
 Model optimization through quantization and pruning addresses the key challenges of deploying AI at scale:
```
