Commit a9847e0

[Docs] Updates (#2416)

SUMMARY:
- Fix torchrun command
- Add reference to guides in compress.md
- Update model loading table

1 parent fe51272 commit a9847e0

File tree

7 files changed: +18 −14 lines


docs/.nav.yml

Lines changed: 4 additions & 3 deletions

@@ -26,10 +26,11 @@ nav:
   - key-models/mistral-large-3/index.md
   - FP8 Example: key-models/mistral-large-3/fp8-example.md
   - Guides:
+    - Big Models and Distributed Support:
+      - Model Loading: guides/big_models_and_distributed/model_loading.md
+      - Sequential Onloading: guides/big_models_and_distributed/sequential_onloading.md
+      - Distributed Oneshot: guides/big_models_and_distributed/distributed_oneshot.md
     - Compression Schemes: guides/compression_schemes.md
-    - Sequential Onloading: guides/sequential_onloading.md
-    - Model Loading: guides/model_loading.md
-    - Distributed Oneshot: guides/distributed_oneshot.md
     - Saving a Model: guides/saving_a_model.md
     - Observers: guides/observers.md
     - Memory Requirements: guides/memory.md

docs/guides/distributed_oneshot.md renamed to docs/guides/big_models_and_distributed/distributed_oneshot.md

Lines changed: 1 addition & 1 deletion

@@ -59,7 +59,7 @@ ds = load_dataset(

 ### 4. Call your script with `torchrun` ###

-Now, your script is ready to run using distributed processes. To start, simply run your script using `python3 -m torchrun --nproc_per_node=2 YOUR_EXAMPLE.py` to run with two GPU devices. For a complete example script, see [llama_ddp_example.py](/examples/quantization_w4a16/llama3_ddp_example.py). The below table shows results and speedups as of LLM Compressor v0.10.0, future changes will bring these numbers closer to linear speedups.
+Now your script is ready to run using distributed processes. To start, run it with `torchrun --nproc_per_node=2 YOUR_EXAMPLE.py` to use two GPU devices. For a complete example script, see [llama_ddp_example.py](https://github.com/vllm-project/llm-compressor/blob/main/examples/quantization_w4a16/llama3_ddp_example.py). The table below shows results and speedups as of LLM Compressor v0.10.0; future changes will bring these numbers closer to linear speedups.

 | model_id | world_size | max_time | max_memory | save_time | flex_extract | eval_time |
 |----------|-------------|----------|------------|-----------|--------------|-----------|
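The corrected launch step can be sketched as a shell invocation (`YOUR_EXAMPLE.py` is a placeholder for your own script, and 2 is just an example GPU count):

```shell
# torchrun spawns one worker process per GPU and sets RANK, LOCAL_RANK and
# WORLD_SIZE in each worker's environment; --nproc_per_node should match the
# number of local GPUs you want to use.
torchrun --nproc_per_node=2 YOUR_EXAMPLE.py
```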

docs/guides/model_loading.md renamed to docs/guides/big_models_and_distributed/model_loading.md

Lines changed: 2 additions & 2 deletions

@@ -17,13 +17,13 @@

 Distributed=False | device_map="auto" | device_map="cuda" | device_map="cpu" | device_map="auto_offload"
 -- | -- | -- | -- | --
 `load_offloaded_model` context required? | No | No | No | Yes
 Behavior | Try to load model onto all visible cuda devices. Fallback to cpu and disk if model too large | Try to load model onto first cuda device only. Error if model is too large | Try to load model onto cpu. Error if the model is too large | Try to load model onto cpu. Fallback to disk if model is too large
-LLM Compressor Examples | This is the recommended load option when using the "basic" pipeline |   |   | This is the recommended load option when using the "sequential" pipeline
+LLM Compressor Examples | This is the recommended load option when using the "basic" or "data_free" pipeline |   |   | This is the recommended load option when using the "sequential" pipeline

 Distributed=True | device_map="auto" | device_map="cuda" | device_map="cpu" | device_map="auto_offload"
 -- | -- | -- | -- | --
 `load_offloaded_model` context required? | Yes | Yes | Yes | Yes
 Behavior | Try to load model onto device 0, then broadcast replicas to other devices. Fallback to cpu and disk if model is too large | Try to load model onto device 0 only, then broadcast replicas to other devices. Error if model is too large | Try to load model onto cpu. Error if the model is too large | Try to load model onto cpu. Fallback to disk if model is too large
-LLM Compressor Examples | This is the recommended load option when using the "basic" pipeline |   |   | This is the recommended load option when using the "sequential" pipeline
+LLM Compressor Examples | This is the recommended load option when using the "basic" or "data_free" pipeline |   |   | This is the recommended load option when using the "sequential" pipeline

 ## Disk Offloading ##
 When compressing models which are larger than the available CPU memory, it is recommended to utilize disk offloading for any weights which cannot fit on the cpu. To enable disk offloading, use the `load_offloaded_model` context from `compressed_tensors` to load your model, along with `device_map="auto_offload"`.
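The single-process rows of the table above can be summarized as a small decision function. This is a hypothetical helper illustrating the documented fallback behavior, not LLM Compressor's actual implementation:

```python
# Illustrative sketch of the Distributed=False fallback rules from the table
# above. `placement` is a made-up name; the real loading logic lives in
# transformers/accelerate and compressed_tensors.

def placement(device_map, fits_gpu, fits_cpu):
    """Return where the weights end up, or raise if loading would error."""
    if device_map == "auto":
        # All visible cuda devices; fall back to cpu, then disk.
        if fits_gpu:
            return "cuda"
        return "cpu" if fits_cpu else "disk"
    if device_map == "cuda":
        # First cuda device only; error if the model is too large.
        if fits_gpu:
            return "cuda"
        raise MemoryError("model too large for a single cuda device")
    if device_map == "cpu":
        # CPU only; error if the model is too large.
        if fits_cpu:
            return "cpu"
        raise MemoryError("model too large for cpu")
    if device_map == "auto_offload":
        # CPU first; fall back to disk (requires load_offloaded_model context).
        return "cpu" if fits_cpu else "disk"
    raise ValueError(f"unknown device_map: {device_map}")
```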

docs/guides/sequential_onloading.md renamed to docs/guides/big_models_and_distributed/sequential_onloading.md

Lines changed: 5 additions & 5 deletions

@@ -4,7 +4,7 @@

 LLM Compressor is capable of compressing models much larger than the amount of memory available as VRAM. This is achieved through a technique called **sequential onloading** whereby only a fraction of the model weights are moved to GPU memory for calibration while the rest of the weights remain offloaded to CPU or disk. When performing calibration, the entire dataset is offloaded to CPU, then onloaded one batch at a time to reduce peak activations memory usage.

-![sequential_onloading](../assets/sequential_onloading.jpg)
+![sequential_onloading](../../assets/sequential_onloading.jpg)

 If basic calibration/inference is represented with the following pseudo code...
 ```python

@@ -22,20 +22,20 @@ for layer in model.layers:

 ## Implementation ##

-Before a model can be sequentially onloaded, it must first be broken up into disjoint parts which can be individually onloaded. This is achieved through the [torch.fx.Tracer](https://github.com/pytorch/pytorch/blob/main/torch/fx/README.md#tracing) module, which allows a model to be represented as a graph of operations (nodes) and data inputs (edges). Once the model has been traced into a valid graph representation, the graph is cut (partitioned) into disjoint subgraphs, each of which is onloaded individually as a layer. This implementation can be found [here](/src/llmcompressor/pipelines/sequential/helpers.py).
+Before a model can be sequentially onloaded, it must first be broken up into disjoint parts which can be individually onloaded. This is achieved through the [torch.fx.Tracer](https://github.com/pytorch/pytorch/blob/main/torch/fx/README.md#tracing) module, which allows a model to be represented as a graph of operations (nodes) and data inputs (edges). Once the model has been traced into a valid graph representation, the graph is cut (partitioned) into disjoint subgraphs, each of which is onloaded individually as a layer. This implementation can be found [here](https://github.com/vllm-project/llm-compressor/blob/main/src/llmcompressor/pipelines/sequential/helpers.py).

-![sequential_onloading](../assets/model_graph.jpg)
+![sequential_onloading](../../assets/model_graph.jpg)
 *This image depicts some of the operations performed when executing the Llama3.2-Vision model*

-![sequential_onloading](../assets/sequential_decoder_layers.jpg)
+![sequential_onloading](../../assets/sequential_decoder_layers.jpg)
 *This image depicts the sequential text decoder layers of the Llama3.2-Vision model. Each of the individual decoder layers is onloaded separately*

 ## Sequential Targets and Usage ##
 You can use sequential onloading by calling `oneshot` with the `pipeline="sequential"` argument. Note that this pipeline is the default for all oneshot calls which require calibration data. If the sequential pipeline proves to be problematic, you can specify `pipeline="basic"` to use a basic pipeline which does not require sequential onloading, but only works performantly when the model is small enough to fit into the available VRAM.

 If you are compressing a model using a GPU with a small amount of memory, you may need to change your sequential targets. Sequential targets control how many weights to onload to the GPU at a time. By default, the sequential targets are decoder layers which may include large MoE layers. In these cases, setting the `sequential_targets="Linear"` argument in `oneshot` will result in lower VRAM usage, but a longer runtime.

-![sequential_onloading](../assets/seq_targets.jpg)
+![sequential_onloading](../../assets/seq_targets.jpg)

 ## More information ##

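The onloading loop the guide describes can be sketched in plain Python, with simple objects standing in for torch modules. This is illustrative only; the real implementation operates on traced subgraphs rather than a list of layers:

```python
# Runnable sketch of sequential onloading: each layer's weights are moved to
# the "gpu" only while that layer runs over all calibration batches, then
# offloaded back to "cpu" before the next layer is touched.

class Layer:
    def __init__(self, name):
        self.name = name
        self.device = "cpu"  # weights start offloaded

    def forward(self, x):
        assert self.device == "gpu", "layer must be onloaded before it runs"
        return x + 1  # stand-in for the layer's real computation

def sequential_calibrate(layers, batches):
    # Activations for the whole dataset stay offloaded; only one layer's
    # weights occupy "GPU" memory at any moment.
    activations = list(batches)
    for layer in layers:
        layer.device = "gpu"   # onload this layer's weights only
        activations = [layer.forward(a) for a in activations]
        layer.device = "cpu"   # offload before moving on
    return activations
```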
docs/guides/memory.md

Lines changed: 1 addition & 1 deletion

@@ -17,7 +17,7 @@ Also, larger models, like DeepSeek R1 use a large amount of CPU memory, and mode

 2. How text decoder layers and vision tower layers are loaded on to GPU differs significantly.

-In the case of text decoder layers, LLM Compressor typically loads one layer at a time into the GPU for computation, while the rest remains offloaded in CPU/Disk memory. For more information, see [Sequential Onloading](./sequential_onloading.md).
+In the case of text decoder layers, LLM Compressor typically loads one layer at a time into the GPU for computation, while the rest remains offloaded in CPU/Disk memory. For more information, see [Sequential Onloading](./big_models_and_distributed/sequential_onloading.md).

 However, vision tower layers are loaded onto GPU all at once. Unlike the text model, vision towers are not split up into individual layers before onloading to the GPU. This can create a GPU memory bottleneck for models whose vision towers are larger than their text layers.

docs/steps/compress.md

Lines changed: 4 additions & 1 deletion

@@ -14,6 +14,9 @@ Before you begin, ensure that your environment meets the following prerequisites

 LLM Compressor provides the `oneshot` API for simple and straightforward model compression. This API allows you to apply a recipe, which defines your chosen quantization scheme and quantization algorithm, to your selected model.
 We'll import the `QuantizationModifier` modifier, which applies the RTN quantization algorithm, and create a recipe to apply FP8 Block quantization to our model. The final model is compressed in the compressed-tensors format and ready to deploy in vLLM.

+!!! info
+    The following script is for single-process quantization. The model is loaded onto any available GPUs and then offloaded onto the CPU if it is too large. For distributed support or support for very large models (such as certain MoEs, including Kimi-K2), see the [Big Models and Distributed Support guide](../guides/big_models_and_distributed/model_loading.md).
+
 ```python
 from transformers import AutoModelForCausalLM, AutoTokenizer

@@ -42,7 +45,7 @@ oneshot(model=model, recipe=recipe)

 # Confirm generations of the quantized model look sane.
 print("========== SAMPLE GENERATION ==============")
-dispatch_for_generation(model)
+dispatch_model(model)
 input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to(
     model.device
 )
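The RTN (round-to-nearest) algorithm that `QuantizationModifier` applies can be illustrated with a tiny self-contained sketch. For clarity this uses symmetric per-tensor int8 rather than the FP8 Block scheme from the recipe; the rounding idea is the same:

```python
# Simplified illustration of round-to-nearest (RTN) quantization: pick one
# scale from the weight range, round each weight to the nearest grid point,
# and clamp to the representable integer range. Not LLM Compressor's code.

def rtn_quantize(weights, num_bits=8):
    """Quantize a list of floats to signed integers with a single scale."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for int8
    scale = max(abs(w) for w in weights) / qmax
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def rtn_dequantize(q, scale):
    """Map quantized integers back to approximate float weights."""
    return [v * scale for v in q]
```

The round trip introduces at most half a quantization step of error per weight, which is why calibration-free RTN works well at 8 bits.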

examples/disk_offloading/README.md

Lines changed: 1 addition & 1 deletion

@@ -1,2 +1,2 @@
 ## Disk Offloading ##
-For more information on disk offloading, see [Model Loading](/docs/guides/model_loading.md).
+For more information on disk offloading, see [Model Loading](/docs/guides/big_models_and_distributed/model_loading.md).
