Merged
4 changes: 2 additions & 2 deletions docs/.nav.yml
@@ -29,13 +29,13 @@ nav:
- Mistral Large 3:
- key-models/mistral-large-3/index.md
- FP8 Example: key-models/mistral-large-3/fp8-example.md
-- Guides:
+- User Guides:
- Big Models and Distributed Support:
- Model Loading: guides/big_models_and_distributed/model_loading.md
- Sequential Onloading: guides/big_models_and_distributed/sequential_onloading.md
- Distributed Oneshot: guides/big_models_and_distributed/distributed_oneshot.md
- Compression Schemes: guides/compression_schemes.md
-- Saving a Model: guides/saving_a_model.md
+- Saving a Compressed Model: guides/saving_a_model.md
- Observers: guides/observers.md
- Memory Requirements: guides/memory.md
- Runtime Performance: guides/runtime.md
41 changes: 33 additions & 8 deletions docs/guides/saving_a_model.md
@@ -1,6 +1,6 @@
-# Saving a Model
+# Saving a Compressed Model

-The `llmcompressor` library extends Hugging Face's `save_pretrained` method with additional arguments to support model compression functionality. This document explains these extra arguments and how to use them effectively.
+The `llmcompressor` library extends Hugging Face's `save_pretrained` method with additional arguments to support model compression functionality. Serialization is handled by [compressed-tensors](https://github.com/neuralmagic/compressed-tensors), which manages the on-disk format for quantized and sparse models. This document explains these extra arguments and how to use them effectively.

## How It Works

@@ -17,7 +17,7 @@ When saving your compressed models, you can use the following extra arguments with

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
-| `quantization_format` | `Optional[str]` | `None` | Optional format string for quantization. If not provided, it will be inferred from the model. |
+| `quantization_format` | `Optional[str]` | `None` | The on-disk serialization format for quantized weights, defined by `compressed_tensors.QuantizationFormat`. If not provided, it is inferred from the model's quantization scheme. See the compressed-tensors documentation for available formats. |
| `save_compressed` | `bool` | `True` | Controls whether to save the model in a compressed format. Set to `False` to save in the original frozen state. |
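The kwarg-forwarding pattern behind these extra arguments can be pictured with a minimal sketch. This is illustrative only, not llmcompressor's actual implementation; `BaseModel` and `add_compression_kwargs` are hypothetical stand-ins:

```python
class BaseModel:
    def save_pretrained(self, save_directory, **kwargs):
        # Stand-in for Hugging Face's own serialization logic.
        return f"saved to {save_directory}"

def add_compression_kwargs(model):
    # Wrap the existing method so it accepts the extra arguments.
    original = model.save_pretrained

    def save_pretrained(save_directory, save_compressed=True,
                        quantization_format=None, **kwargs):
        if save_compressed:
            # Real code would compress the state dict here (via
            # compressed-tensors) before delegating to the original.
            pass
        return original(save_directory, **kwargs)

    model.save_pretrained = save_pretrained
    return model

model = add_compression_kwargs(BaseModel())
result = model.save_pretrained("out-dir", save_compressed=True)
```

The compression-specific arguments are consumed by the wrapper; everything else flows through to the unmodified Hugging Face method, which is why the extended `save_pretrained` remains a drop-in replacement.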

## Examples
@@ -46,7 +46,34 @@ oneshot(
SAVE_DIR = "your-model-W8A8-compressed"
model.save_pretrained(
SAVE_DIR,
-save_compressed=True # Use the enhanced functionality
+save_compressed=True
)
tokenizer.save_pretrained(SAVE_DIR)
```

### Setting quantization_format Explicitly

You can override the inferred format by passing `quantization_format` directly using `compressed_tensors.QuantizationFormat`. This is useful when you want to control exactly how weights are serialized on disk:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from compressed_tensors import QuantizationFormat
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

model = AutoModelForCausalLM.from_pretrained("your-model")
tokenizer = AutoTokenizer.from_pretrained("your-model")

oneshot(
model=model,
recipe=[QuantizationModifier(targets="Linear", scheme="W4AFP8", ignore=["lm_head"])],
)

SAVE_DIR = "your-model-W4AFP8"
model.save_pretrained(
SAVE_DIR,
save_compressed=True,
quantization_format=QuantizationFormat.pack_quantized,
)
tokenizer.save_pretrained(SAVE_DIR)
```
@@ -56,12 +83,10 @@ tokenizer.save_pretrained(SAVE_DIR)
!!! warning
Sparse compression (including 2of4 sparsity) is no longer supported by LLM Compressor due to a lack of hardware support and user interest. Please see https://github.com/vllm-project/vllm/pull/36799 for more information.

-- When loading compressed models with `from_pretrained`, the compression format is automatically detected.
+- When loading compressed models with `from_pretrained`, the compression format is automatically detected by `compressed-tensors`.
- To use compressed models with vLLM, simply load them as you would any model:
```python
from vllm import LLM
model = LLM("./your-model-compressed")
```
-- Compression configurations are saved in the model's config file and are automatically applied when loading.
+- Compression configurations are saved in the model's `config.json` and are automatically applied when loading.

-For more information about compression algorithms and formats, please refer to the documentation and examples in the llmcompressor repository.
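The automatic detection described in the notes above can be pictured as reading the compression metadata saved alongside the weights. This is an illustrative sketch only; `detect_format` and the config keys shown are simplified stand-ins for what compressed-tensors actually writes and parses:

```python
import json

def detect_format(config_text):
    # Read the saved config and pull out the serialization format,
    # falling back to plain dense weights when none is recorded.
    config = json.loads(config_text)
    quant = config.get("quantization_config", {})
    return quant.get("format", "dense")

# A simplified stand-in for a saved model's config.json.
saved_config = json.dumps({
    "model_type": "llama",
    "quantization_config": {"format": "pack-quantized"},
})
fmt = detect_format(saved_config)
```

Because the format travels with the checkpoint, neither `from_pretrained` nor vLLM needs to be told how the weights were compressed.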
2 changes: 1 addition & 1 deletion docs/index.md
@@ -21,7 +21,7 @@ For more information, see [Why use LLM Compressor?](./steps/why-llmcompressor.md)

## New in this release

-Review the [LLM Compressor v0.9.0 release notes](https://github.com/vllm-project/llm-compressor/releases/tag/0.9.0) for details about new features. Highlights include:
+Review the [LLM Compressor v0.10.0 release notes](https://github.com/vllm-project/llm-compressor/releases/tag/0.10.0) for details about new features. Highlights include:

!!! info "Updated offloading and model loading support"
Loading transformers models that are offloaded to disk and/or offloaded across distributed process ranks is now supported. Disk offloading allows users to load and compress very large models that would not otherwise fit in CPU memory. Offloading is no longer handled through accelerate but through model-loading utilities added to compressed-tensors. For a full summary of the updated loading and offloading functionality, for both single-process and distributed flows, see the [Big Models and Distributed Support guide](guides/big_models_and_distributed/model_loading.md).