Merged

Changes from 2 commits
7 changes: 4 additions & 3 deletions docs/.nav.yml
```diff
@@ -26,10 +26,11 @@ nav:
     - key-models/mistral-large-3/index.md
     - FP8 Example: key-models/mistral-large-3/fp8-example.md
   - Guides:
+    - Big Models and Distributed Support:
+      - Model Loading: guides/big_models_and_distributed/model_loading.md
+      - Sequential Onloading: guides/big_models_and_distributed/sequential_onloading.md
+      - Distributed Oneshot: guides/big_models_and_distributed/distributed_oneshot.md
     - Compression Schemes: guides/compression_schemes.md
-    - Sequential Onloading: guides/sequential_onloading.md
-    - Model Loading: guides/model_loading.md
-    - Distributed Oneshot: guides/distributed_oneshot.md
     - Saving a Model: guides/saving_a_model.md
     - Observers: guides/observers.md
     - Memory Requirements: guides/memory.md
```
```diff
@@ -59,7 +59,7 @@ ds = load_dataset(

 ### 4. Call your script with `torchrun` ###

-Now, your script is ready to run using distributed processes. To start, simply run your script using `python3 -m torchrun --nproc_per_node=2 YOUR_EXAMPLE.py` to run with two GPU devices. For a complete example script, see [llama_ddp_example.py](/examples/quantization_w4a16/llama3_ddp_example.py). The table below shows results and speedups as of LLM Compressor v0.10.0; future changes will bring these numbers closer to linear speedups.
+Now, your script is ready to run using distributed processes. To start, simply run your script using `torchrun --nproc_per_node=2 YOUR_EXAMPLE.py` to run with two GPU devices. For a complete example script, see [llama_ddp_example.py](/examples/quantization_w4a16/llama3_ddp_example.py). The table below shows results and speedups as of LLM Compressor v0.10.0; future changes will bring these numbers closer to linear speedups.

 | model_id | world_size | max_time | max_memory | save_time | flex_extract | eval_time |
 |----------|-------------|----------|------------|-----------|--------------|-----------|
```
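As background for the `torchrun` command above: `torchrun` launches one worker process per device and communicates each worker's identity through environment variables. A minimal sketch of reading them, independent of LLM Compressor (`get_dist_info` is a hypothetical helper, not part of any library):

```python
import os


def get_dist_info():
    """Read the environment variables that torchrun sets on each worker.

    Falls back to single-process defaults when the script is launched
    directly with `python` instead of `torchrun`.
    """
    rank = int(os.environ.get("RANK", "0"))            # global worker index
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))  # index on this node
    world_size = int(os.environ.get("WORLD_SIZE", "1"))  # total worker count
    return rank, local_rank, world_size


if __name__ == "__main__":
    rank, local_rank, world_size = get_dist_info()
    print(f"rank={rank} local_rank={local_rank} world_size={world_size}")
```

With `torchrun --nproc_per_node=2 YOUR_EXAMPLE.py`, each of the two workers would see `world_size == 2` and its own `rank`.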
```diff
@@ -17,13 +17,13 @@
 Distributed=False | device_map="auto" | device_map="cuda" | device_map="cpu" | device_map="auto_offload"
 -- | -- | -- | -- | --
 `load_offloaded_model` context required? | No | No | No | Yes
 Behavior | Try to load model onto all visible cuda devices. Fallback to cpu and disk if model too large | Try to load model onto first cuda device only. Error if model is too large | Try to load model onto cpu. Error if the model is too large | Try to load model onto cpu. Fallback to disk if model is too large
-LLM Compressor Examples | This is the recommended load option when using the "basic" pipeline |   |   | This is the recommended load option when using the "sequential" pipeline
+LLM Compressor Examples | This is the recommended load option when using the "basic" or "data_free" pipeline |   |   | This is the recommended load option when using the "sequential" pipeline

 Distributed=True | device_map="auto" | device_map="cuda" | device_map="cpu" | device_map="auto_offload"
 -- | -- | -- | -- | --
 `load_offloaded_model` context required? | Yes | Yes | Yes | Yes
 Behavior | Try to load model onto device 0, then broadcast replicas to other devices. Fallback to cpu and disk if model is too large | Try to load model onto device 0 only, then broadcast replicas to other devices. Error if model is too large | Try to load model onto cpu. Error if the model is too large | Try to load model onto cpu. Fallback to disk if model is too large
-LLM Compressor Examples | This is the recommended load option when using the "basic" pipeline |   |   | This is the recommended load option when using the "sequential" pipeline
+LLM Compressor Examples | This is the recommended load option when using the "basic" or "data_free" pipeline |   |   | This is the recommended load option when using the "sequential" pipeline
```
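The recommendations in the two tables above can be condensed into a small lookup. This `recommended_device_map` function is a hypothetical helper written for illustration, not part of the LLM Compressor API; it only encodes the "LLM Compressor Examples" row of the tables:

```python
def recommended_device_map(pipeline: str) -> str:
    """Return the device_map recommended by the tables above for a pipeline.

    Only "basic", "data_free", and "sequential" have a stated recommendation.
    """
    if pipeline in ("basic", "data_free"):
        # Load onto all visible cuda devices, falling back to cpu/disk.
        return "auto"
    if pipeline == "sequential":
        # Load onto cpu, falling back to disk for oversized weights.
        return "auto_offload"
    raise ValueError(f"no recommended device_map for pipeline {pipeline!r}")
```

For example, `recommended_device_map("sequential")` returns `"auto_offload"`, matching the table's recommendation for the "sequential" pipeline.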

## Disk Offloading ##
When compressing models larger than the available CPU memory, it is recommended to use disk offloading for any weights that cannot fit in CPU memory. To enable disk offloading, use the `load_offloaded_model` context from `compressed_tensors` to load your model, along with `device_map="auto_offload"`.
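A minimal sketch of loading with disk offload enabled might look like the following. The exact signature of `load_offloaded_model` is an assumption here and may differ between `compressed_tensors` versions, and the model id is hypothetical:

```python
def load_large_model(model_id: str):
    """Load a model too large for CPU memory, spilling extra weights to disk.

    Sketch only: assumes ``load_offloaded_model`` is a no-argument context
    manager; check the compressed-tensors docs for the actual signature.
    """
    # Imports are deferred so this sketch only requires the libraries
    # when it is actually called.
    from compressed_tensors import load_offloaded_model  # assumed import path
    from transformers import AutoModelForCausalLM

    with load_offloaded_model():
        return AutoModelForCausalLM.from_pretrained(
            model_id,
            device_map="auto_offload",  # fall back to disk when CPU is full
            torch_dtype="auto",
        )
```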
File renamed without changes.
5 changes: 4 additions & 1 deletion docs/steps/compress.md
````diff
@@ -14,6 +14,9 @@ Before you begin, ensure that your environment meets the following prerequisites
 LLM Compressor provides the `oneshot` API for simple and straightforward model compression. This API allows you to apply a recipe, which defines your chosen quantization scheme and quantization algorithm, to your selected model.
 We'll import the `QuantizationModifier` modifier, which applies the RTN quantization algorithm, and create a recipe to apply FP8 Block quantization to our model. The final model is compressed in the compressed-tensors format and ready to deploy in vLLM.

+!!! info
+    Note: The following script is for single-process quantization. The model is loaded onto any available GPUs and then offloaded onto the CPU if it is too large. For distributed support or support for very large models (such as certain MoEs, including Kimi-K2), see the [Big Models and Distributed Support guide](../guides/big_models_and_distributed/model_loading.md).
+
 ```python
 from transformers import AutoModelForCausalLM, AutoTokenizer

@@ -42,7 +45,7 @@ oneshot(model=model, recipe=recipe)

 # Confirm generations of the quantized model look sane.
 print("========== SAMPLE GENERATION ==============")
-dispatch_for_generation(model)
+dispatch_model(model)
 input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to(
     model.device
 )
````