
Commit 4e6aa76

dsikka and claude authored
[Docs] Update docs for v0.10.0 release (#2516)
## Summary - Updates release notes link in `docs/index.md` from v0.9.0 to v0.10.0 - Renames "Guides" to "User Guides" in `docs/.nav.yml` - Updates `docs/guides/saving_a_model.md` to reference `compressed-tensors` for serialization and adds an explicit `quantization_format` example using `W4AFP8` with `QuantizationFormat.pack_quantized` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
1 parent 5ae2e14 commit 4e6aa76

File tree

3 files changed: +36 −11 lines


docs/.nav.yml

Lines changed: 2 additions & 2 deletions

@@ -29,13 +29,13 @@ nav:
   - Mistral Large 3:
     - key-models/mistral-large-3/index.md
     - FP8 Example: key-models/mistral-large-3/fp8-example.md
-  - Guides:
+  - User Guides:
     - Big Models and Distributed Support:
      - Model Loading: guides/big_models_and_distributed/model_loading.md
      - Sequential Onloading: guides/big_models_and_distributed/sequential_onloading.md
      - Distributed Oneshot: guides/big_models_and_distributed/distributed_oneshot.md
    - Compression Schemes: guides/compression_schemes.md
-    - Saving a Model: guides/saving_a_model.md
+    - Saving a Compressed Model: guides/saving_a_model.md
    - Observers: guides/observers.md
    - Memory Requirements: guides/memory.md
    - Runtime Performance: guides/runtime.md

docs/guides/saving_a_model.md

Lines changed: 33 additions & 8 deletions

@@ -1,6 +1,6 @@
-# Saving a Model
+# Saving a Compressed Model
 
-The `llmcompressor` library extends Hugging Face's `save_pretrained` method with additional arguments to support model compression functionality. This document explains these extra arguments and how to use them effectively.
+The `llmcompressor` library extends Hugging Face's `save_pretrained` method with additional arguments to support model compression functionality. Serialization is handled by [compressed-tensors](https://github.com/neuralmagic/compressed-tensors), which manages the on-disk format for quantized and sparse models. This document explains these extra arguments and how to use them effectively.
 
 ## How It Works
 
@@ -17,7 +17,7 @@ When saving your compressed models, you can use the following extra arguments wi
 
 | Parameter | Type | Default | Description |
 |-----------|------|---------|-------------|
-| `quantization_format` | `Optional[str]` | `None` | Optional format string for quantization. If not provided, it will be inferred from the model. |
+| `quantization_format` | `Optional[str]` | `None` | The on-disk serialization format for quantized weights, defined by `compressed_tensors.QuantizationFormat`. If not provided, it is inferred from the model's quantization scheme. See the compressed-tensors documentation for available formats. |
 | `save_compressed` | `bool` | `True` | Controls whether to save the model in a compressed format. Set to `False` to save in the original frozen state. |
 
 ## Examples
@@ -46,7 +46,34 @@ oneshot(
 SAVE_DIR = "your-model-W8A8-compressed"
 model.save_pretrained(
     SAVE_DIR,
-    save_compressed=True  # Use the enhanced functionality
+    save_compressed=True
+)
+tokenizer.save_pretrained(SAVE_DIR)
+```
+
+### Setting quantization_format Explicitly
+
+You can override the inferred format by passing `quantization_format` directly using `compressed_tensors.QuantizationFormat`. This is useful when you want to control exactly how weights are serialized on disk:
+
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+from compressed_tensors import QuantizationFormat
+from llmcompressor import oneshot
+from llmcompressor.modifiers.quantization import QuantizationModifier
+
+model = AutoModelForCausalLM.from_pretrained("your-model")
+tokenizer = AutoTokenizer.from_pretrained("your-model")
+
+oneshot(
+    model=model,
+    recipe=[QuantizationModifier(targets="Linear", scheme="W4AFP8", ignore=["lm_head"])],
+)
+
+SAVE_DIR = "your-model-W4AFP8"
+model.save_pretrained(
+    SAVE_DIR,
+    save_compressed=True,
+    quantization_format=QuantizationFormat.pack_quantized,
 )
 tokenizer.save_pretrained(SAVE_DIR)
 ```
@@ -56,12 +83,10 @@ tokenizer.save_pretrained(SAVE_DIR)
 !!! warning
     Sparse compression (including 2of4 sparsity) is no longer supported by LLM Compressor due to lack of hardware support and user interest. Please see https://github.com/vllm-project/vllm/pull/36799 for more information.
 
-- When loading compressed models with `from_pretrained`, the compression format is automatically detected.
+- When loading compressed models with `from_pretrained`, the compression format is automatically detected by `compressed-tensors`.
 - To use compressed models with vLLM, simply load them as you would any model:
   ```python
   from vllm import LLM
   model = LLM("./your-model-compressed")
   ```
-- Compression configurations are saved in the model's config file and are automatically applied when loading.
-
-For more information about compression algorithms and formats, please refer to the documentation and examples in the llmcompressor repository.
+- Compression configurations are saved in the model's `config.json` and are automatically applied when loading.
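The `pack_quantized` format named in the new docs example stores two 4-bit weights per byte on disk. As a rough conceptual sketch of that packing idea (an illustration only — the function names here are hypothetical, and this is not the compressed-tensors implementation):

```python
# Conceptual sketch of 4-bit weight packing, the idea behind a format
# like pack_quantized: two int4 values (range -8..7) share one byte.
# Not the compressed-tensors implementation; names are illustrative.

def pack_int4(values):
    """Pack an even-length list of int4 values into bytes, low nibble first."""
    assert len(values) % 2 == 0, "need an even number of values"
    return bytes(
        (lo & 0x0F) | ((hi & 0x0F) << 4)
        for lo, hi in zip(values[0::2], values[1::2])
    )

def unpack_int4(packed):
    """Recover the signed int4 values from packed bytes."""
    values = []
    for byte in packed:
        for nibble in (byte & 0x0F, byte >> 4):
            # sign-extend the two's-complement nibble back to a Python int
            values.append(nibble - 16 if nibble > 7 else nibble)
    return values

weights = [-8, 7, 0, -1, 3, -4]
packed = pack_int4(weights)
assert len(packed) == len(weights) // 2   # half the storage
assert unpack_int4(packed) == weights     # lossless round trip
```

The real format also stores the quantization scales and zero points alongside the packed tensor so the weights can be dequantized at load time.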

docs/index.md

Lines changed: 1 addition & 1 deletion

@@ -21,7 +21,7 @@ For more information, see [Why use LLM Compressor?](./steps/why-llmcompressor.md
 
 ## New in this release
 
-Review the [LLM Compressor v0.9.0 release notes](https://github.com/vllm-project/llm-compressor/releases/tag/0.9.0) for details about new features. Highlights include:
+Review the [LLM Compressor v0.10.0 release notes](https://github.com/vllm-project/llm-compressor/releases/tag/0.10.0) for details about new features. Highlights include:
 
 !!! info "Updated offloading and model loading support"
     Loading transformers models that are offloaded to disk and/or offloaded across distributed process ranks is now supported. Disk offloading allows users to load and compress very large models which normally would not fit in CPU memory. Offloading functionality is no longer supported through accelerate but through model loading utilities added to compressed-tensors. For a full summary of updated loading and offloading functionality, for both single-process and distributed flows, see the [Big Models and Distributed Support guide](guides/big_models_and_distributed/model_loading.md)
