## Summary
- Updates release notes link in `docs/index.md` from v0.9.0 to v0.10.0
- Renames "Guides" to "User Guides" in `docs/.nav.yml`
- Updates `docs/guides/saving_a_model.md` to reference
`compressed-tensors` for serialization and adds an explicit
`quantization_format` example using `W4AFP8` with
`QuantizationFormat.pack_quantized`
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
## `docs/guides/saving_a_model.md` (+33 −8)

````diff
@@ -1,6 +1,6 @@
-# Saving a Model
+# Saving a Compressed Model
 
-The `llmcompressor` library extends Hugging Face's `save_pretrained` method with additional arguments to support model compression functionality. This document explains these extra arguments and how to use them effectively.
+The `llmcompressor` library extends Hugging Face's `save_pretrained` method with additional arguments to support model compression functionality. Serialization is handled by [compressed-tensors](https://github.com/neuralmagic/compressed-tensors), which manages the on-disk format for quantized and sparse models. This document explains these extra arguments and how to use them effectively.
 
 ## How It Works
````
````diff
@@ -17,7 +17,7 @@ When saving your compressed models, you can use the following extra arguments wi
 | Parameter | Type | Default | Description |
 |-----------|------|---------|-------------|
-|`quantization_format`|`Optional[str]`|`None`|Optional format string for quantization. If not provided, it will be inferred from the model. |
+|`quantization_format`|`Optional[str]`|`None`|The on-disk serialization format for quantized weights, defined by `compressed_tensors.QuantizationFormat`. If not provided, it is inferred from the model's quantization scheme. See the compressed-tensors documentation for available formats. |
 |`save_compressed`|`bool`|`True`| Controls whether to save the model in a compressed format. Set to `False` to save in the original frozen state. |
 
 ## Examples
````
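As the table notes, an explicitly passed `quantization_format` takes precedence over the inferred one. A minimal pure-Python sketch of that fallback pattern (illustrative only; `resolve_format` and `infer_format_from_model` are hypothetical names, not the llmcompressor API):

```python
from typing import Optional

def infer_format_from_model(weight_bits: int) -> str:
    """Hypothetical stand-in for format inference: pick a packed
    format for sub-8-bit weights, a dense int format otherwise."""
    return "pack-quantized" if weight_bits < 8 else "int-quantized"

def resolve_format(explicit: Optional[str], weight_bits: int) -> str:
    """Explicit format wins; otherwise fall back to inference,
    mirroring the `quantization_format=None` default in the table."""
    if explicit is not None:
        return explicit
    return infer_format_from_model(weight_bits)

print(resolve_format(None, 4))               # inferred from the model
print(resolve_format("float-quantized", 4))  # explicit override
```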
````diff
@@ -46,7 +46,34 @@ oneshot(
 SAVE_DIR = "your-model-W8A8-compressed"
 model.save_pretrained(
     SAVE_DIR,
-    save_compressed=True  # Use the enhanced functionality
+    save_compressed=True
+)
+tokenizer.save_pretrained(SAVE_DIR)
+```
+
+### Setting quantization_format Explicitly
+
+You can override the inferred format by passing `quantization_format` directly using `compressed_tensors.QuantizationFormat`. This is useful when you want to control exactly how weights are serialized on disk:
+
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+from compressed_tensors import QuantizationFormat
+from llmcompressor import oneshot
+from llmcompressor.modifiers.quantization import QuantizationModifier
+
+model = AutoModelForCausalLM.from_pretrained("your-model")
````

(The remainder of this hunk is truncated in the rendered diff.)
> **Note:** Sparse compression (including 2of4 sparsity) is no longer supported by LLM Compressor due to lack of hardware support and user interest. Please see https://github.com/vllm-project/vllm/pull/36799 for more information.
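The `pack_quantized` format named in the summary stores low-bit integer weights densely rather than one value per storage byte. A minimal pure-Python sketch of the packing idea, assuming unsigned 4-bit values packed two per byte (illustrative only, not the compressed-tensors implementation, which packs into int32 tensors):

```python
def pack_int4(values):
    """Pack unsigned 4-bit integers (0..15) two per byte,
    low nibble first. Pads with a zero nibble if the count is odd."""
    if any(not 0 <= v <= 15 for v in values):
        raise ValueError("values must fit in 4 bits")
    padded = list(values) + [0] * (len(values) % 2)
    return bytes(padded[i] | (padded[i + 1] << 4)
                 for i in range(0, len(padded), 2))

def unpack_int4(packed, count):
    """Inverse of pack_int4: recover `count` 4-bit values."""
    out = []
    for b in packed:
        out.append(b & 0xF)   # low nibble first
        out.append(b >> 4)    # then high nibble
    return out[:count]

weights = [3, 15, 0, 7, 9]
packed = pack_int4(weights)
assert unpack_int4(packed, len(weights)) == weights
assert len(packed) == 3  # 5 nibbles fit in 3 bytes
```

The roundtrip shows why a packed format halves storage for 4-bit weights compared to one-value-per-byte layouts.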
````diff
@@ -59,9 +86,7 @@
-- When loading compressed models with `from_pretrained`, the compression format is automatically detected.
+- When loading compressed models with `from_pretrained`, the compression format is automatically detected by `compressed-tensors`.
 - To use compressed models with vLLM, simply load them as you would any model:
 ```python
 from vllm import LLM
 model = LLM("./your-model-compressed")
 ```
-- Compression configurations are saved in the model's config file and are automatically applied when loading.
-
-For more information about compression algorithms and formats, please refer to the documentation and examples in the llmcompressor repository.
+- Compression configurations are saved in the model's `config.json` and are automatically applied when loading.
````
## `docs/index.md` (+1 −1)
````diff
@@ -21,7 +21,7 @@ For more information, see [Why use LLM Compressor?](./steps/why-llmcompressor.md
 ## New in this release
 
-Review the [LLM Compressor v0.9.0 release notes](https://github.com/vllm-project/llm-compressor/releases/tag/0.9.0) for details about new features. Highlights include:
+Review the [LLM Compressor v0.10.0 release notes](https://github.com/vllm-project/llm-compressor/releases/tag/0.10.0) for details about new features. Highlights include:
 
 !!! info "Updated offloading and model loading support"
     Loading transformers models that are offloaded to disk and/or offloaded across distributed process ranks is now supported. Disk offloading allows users to load and compress very large models which normally would not fit in CPU memory. Offloading functionality is no longer supported through accelerate but through model loading utilities added to compressed-tensors. For a full summary of updated loading and offloading functionality, for both single-process and distributed flows, see the [Big Models and Distributed Support guide](guides/big_models_and_distributed/model_loading.md)
````
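The disk-offloading idea mentioned in the release highlights can be sketched in pure Python: parameters live on disk and are materialized only on access, so resident memory stays small (a toy illustration only; the real compressed-tensors loading utilities are far more involved):

```python
import pickle
import tempfile
from pathlib import Path

class DiskOffloadedParams:
    """Toy disk offload: each parameter is pickled to its own file
    and read back only when accessed, keeping RAM usage low."""

    def __init__(self, folder):
        self.folder = Path(folder)
        self.folder.mkdir(parents=True, exist_ok=True)

    def offload(self, name, tensor):
        # Write the parameter to disk and let the in-memory copy go.
        with open(self.folder / f"{name}.pkl", "wb") as f:
            pickle.dump(tensor, f)

    def load(self, name):
        # Materialize the parameter only when it is actually needed.
        with open(self.folder / f"{name}.pkl", "rb") as f:
            return pickle.load(f)

store = DiskOffloadedParams(tempfile.mkdtemp())
store.offload("layer0.weight", [[0.1, 0.2], [0.3, 0.4]])
print(store.load("layer0.weight"))
```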