Merged
4 changes: 2 additions & 2 deletions docs/.nav.yml
@@ -29,13 +29,13 @@ nav:
- Mistral Large 3:
- key-models/mistral-large-3/index.md
- FP8 Example: key-models/mistral-large-3/fp8-example.md
-- Guides:
+- User Guides:
- Big Models and Distributed Support:
- Model Loading: guides/big_models_and_distributed/model_loading.md
- Sequential Onloading: guides/big_models_and_distributed/sequential_onloading.md
- Distributed Oneshot: guides/big_models_and_distributed/distributed_oneshot.md
- Compression Schemes: guides/compression_schemes.md
-- Saving a Model: guides/saving_a_model.md
+- Saving a Compressed Model: guides/saving_a_model.md
- Observers: guides/observers.md
- Memory Requirements: guides/memory.md
- Runtime Performance: guides/runtime.md
41 changes: 33 additions & 8 deletions docs/guides/saving_a_model.md
@@ -1,6 +1,6 @@
-# Saving a Model
+# Saving a Compressed Model

-The `llmcompressor` library extends Hugging Face's `save_pretrained` method with additional arguments to support model compression functionality. This document explains these extra arguments and how to use them effectively.
+The `llmcompressor` library extends Hugging Face's `save_pretrained` method with additional arguments to support model compression functionality. Serialization is handled by [compressed-tensors](https://github.com/neuralmagic/compressed-tensors), which manages the on-disk format for quantized and sparse models. This document explains these extra arguments and how to use them effectively.

## How It Works

@@ -17,7 +17,7 @@ When saving your compressed models, you can use the following extra arguments with

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
-| `quantization_format` | `Optional[str]` | `None` | Optional format string for quantization. If not provided, it will be inferred from the model. |
+| `quantization_format` | `Optional[str]` | `None` | The on-disk serialization format for quantized weights, defined by `compressed_tensors.QuantizationFormat`. If not provided, it is inferred from the model's quantization scheme. See the compressed-tensors documentation for available formats. |
| `save_compressed` | `bool` | `True` | Controls whether to save the model in a compressed format. Set to `False` to save in the original frozen state. |
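The kwarg-forwarding pattern behind these extra arguments can be pictured with a minimal sketch. This is illustrative only, not llmcompressor's actual implementation; `BaseModel` and `add_compression_kwargs` are hypothetical stand-ins:

```python
class BaseModel:
    def save_pretrained(self, save_directory, **kwargs):
        # Stand-in for Hugging Face's own serialization logic.
        return f"saved to {save_directory}"

def add_compression_kwargs(model):
    # Wrap the existing method so it accepts the extra arguments.
    original = model.save_pretrained

    def save_pretrained(save_directory, save_compressed=True,
                        quantization_format=None, **kwargs):
        if save_compressed:
            # Real code would compress the state dict here (via
            # compressed-tensors) before delegating to the original.
            pass
        return original(save_directory, **kwargs)

    model.save_pretrained = save_pretrained
    return model

model = add_compression_kwargs(BaseModel())
result = model.save_pretrained("out-dir", save_compressed=True)
```

The compression-specific arguments are consumed by the wrapper; everything else flows through to the unmodified Hugging Face method, which is why the extended `save_pretrained` remains a drop-in replacement.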

## Examples
@@ -46,7 +46,34 @@ oneshot(
SAVE_DIR = "your-model-W8A8-compressed"
model.save_pretrained(
SAVE_DIR,
-save_compressed=True # Use the enhanced functionality
+save_compressed=True
)
tokenizer.save_pretrained(SAVE_DIR)
```

### Setting quantization_format Explicitly

You can override the inferred format by passing `quantization_format` directly using `compressed_tensors.QuantizationFormat`. This is useful when you want to control exactly how weights are serialized on disk:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from compressed_tensors import QuantizationFormat
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

model = AutoModelForCausalLM.from_pretrained("your-model")
tokenizer = AutoTokenizer.from_pretrained("your-model")

oneshot(
model=model,
recipe=[QuantizationModifier(targets="Linear", scheme="W4AFP8", ignore=["lm_head"])],
)

SAVE_DIR = "your-model-W4AFP8"
model.save_pretrained(
SAVE_DIR,
save_compressed=True,
quantization_format=QuantizationFormat.pack_quantized,
)
tokenizer.save_pretrained(SAVE_DIR)
```
@@ -56,12 +83,10 @@ tokenizer.save_pretrained(SAVE_DIR)
!!! warning
Sparse compression (including 2of4 sparsity) is no longer supported by LLM Compressor due to a lack of hardware support and user interest. Please see https://github.com/vllm-project/vllm/pull/36799 for more information.

-- When loading compressed models with `from_pretrained`, the compression format is automatically detected.
+- When loading compressed models with `from_pretrained`, the compression format is automatically detected by `compressed-tensors`.
- To use compressed models with vLLM, simply load them as you would any model:
```python
from vllm import LLM
model = LLM("./your-model-compressed")
```
-- Compression configurations are saved in the model's config file and are automatically applied when loading.
+- Compression configurations are saved in the model's `config.json` and are automatically applied when loading.

-For more information about compression algorithms and formats, please refer to the documentation and examples in the llmcompressor repository.
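The automatic detection described in the notes above can be pictured as reading the compression metadata saved alongside the weights. This is an illustrative sketch only; `detect_format` and the config keys shown are simplified stand-ins for what compressed-tensors actually writes and parses:

```python
import json

def detect_format(config_text):
    # Read the saved config and pull out the serialization format,
    # falling back to plain dense weights when none is recorded.
    config = json.loads(config_text)
    quant = config.get("quantization_config", {})
    return quant.get("format", "dense")

# A simplified stand-in for a saved model's config.json.
saved_config = json.dumps({
    "model_type": "llama",
    "quantization_config": {"format": "pack-quantized"},
})
fmt = detect_format(saved_config)
```

Because the format travels with the checkpoint, neither `from_pretrained` nor vLLM needs to be told how the weights were compressed.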
2 changes: 1 addition & 1 deletion docs/index.md
@@ -21,7 +21,7 @@ For more information, see [Why use LLM Compressor?](./steps/why-llmcompressor.md)

## New in this release

-Review the [LLM Compressor v0.9.0 release notes](https://github.com/vllm-project/llm-compressor/releases/tag/0.9.0) for details about new features. Highlights include:
+Review the [LLM Compressor v0.10.0 release notes](https://github.com/vllm-project/llm-compressor/releases/tag/0.10.0) for details about new features. Highlights include:

!!! info "Updated offloading and model loading support"
Loading transformers models that are offloaded to disk and/or offloaded across distributed process ranks is now supported. Disk offloading allows users to load and compress very large models that would not otherwise fit in CPU memory. Offloading is no longer handled through accelerate but through model-loading utilities added to compressed-tensors. For a full summary of the updated loading and offloading functionality, for both single-process and distributed flows, see the [Big Models and Distributed Support guide](guides/big_models_and_distributed/model_loading.md).