diff --git a/docs/.nav.yml b/docs/.nav.yml index e15bbe4553..fe03240b0f 100644 --- a/docs/.nav.yml +++ b/docs/.nav.yml @@ -29,13 +29,13 @@ nav: - Mistral Large 3: - key-models/mistral-large-3/index.md - FP8 Example: key-models/mistral-large-3/fp8-example.md - - Guides: + - User Guides: - Big Models and Distributed Support: - Model Loading: guides/big_models_and_distributed/model_loading.md - Sequential Onloading: guides/big_models_and_distributed/sequential_onloading.md - Distributed Oneshot: guides/big_models_and_distributed/distributed_oneshot.md - Compression Schemes: guides/compression_schemes.md - - Saving a Model: guides/saving_a_model.md + - Saving a Compressed Model: guides/saving_a_model.md - Observers: guides/observers.md - Memory Requirements: guides/memory.md - Runtime Performance: guides/runtime.md diff --git a/docs/guides/saving_a_model.md b/docs/guides/saving_a_model.md index a3cf6be9c7..e9e8325938 100644 --- a/docs/guides/saving_a_model.md +++ b/docs/guides/saving_a_model.md @@ -1,6 +1,6 @@ -# Saving a Model +# Saving a Compressed Model -The `llmcompressor` library extends Hugging Face's `save_pretrained` method with additional arguments to support model compression functionality. This document explains these extra arguments and how to use them effectively. +The `llmcompressor` library extends Hugging Face's `save_pretrained` method with additional arguments to support model compression functionality. Serialization is handled by [compressed-tensors](https://github.com/neuralmagic/compressed-tensors), which manages the on-disk format for quantized and sparse models. This document explains these extra arguments and how to use them effectively. ## How It Works @@ -17,7 +17,7 @@ When saving your compressed models, you can use the following extra arguments wi | Parameter | Type | Default | Description | |-----------|------|---------|-------------| -| `quantization_format` | `Optional[str]` | `None` | Optional format string for quantization. 
If not provided, it will be inferred from the model. | +| `quantization_format` | `Optional[str]` | `None` | The on-disk serialization format for quantized weights, defined by `compressed_tensors.QuantizationFormat`. If not provided, it is inferred from the model's quantization scheme. See the compressed-tensors documentation for available formats. | | `save_compressed` | `bool` | `True` | Controls whether to save the model in a compressed format. Set to `False` to save in the original frozen state. | ## Examples @@ -46,7 +46,34 @@ oneshot( SAVE_DIR = "your-model-W8A8-compressed" model.save_pretrained( SAVE_DIR, - save_compressed=True # Use the enhanced functionality + save_compressed=True ) +tokenizer.save_pretrained(SAVE_DIR) +``` + +### Setting quantization_format Explicitly + +You can override the inferred format by passing `quantization_format` directly using `compressed_tensors.QuantizationFormat`. This is useful when you want to control exactly how weights are serialized on disk: + +```python +from transformers import AutoModelForCausalLM, AutoTokenizer +from compressed_tensors import QuantizationFormat +from llmcompressor import oneshot +from llmcompressor.modifiers.quantization import QuantizationModifier + +model = AutoModelForCausalLM.from_pretrained("your-model") +tokenizer = AutoTokenizer.from_pretrained("your-model") + +oneshot( + model=model, + recipe=[QuantizationModifier(targets="Linear", scheme="W4AFP8", ignore=["lm_head"])], +) + +SAVE_DIR = "your-model-W4AFP8" +model.save_pretrained( + SAVE_DIR, + save_compressed=True, + quantization_format=QuantizationFormat.pack_quantized, ) tokenizer.save_pretrained(SAVE_DIR) ``` @@ -56,12 +83,10 @@ tokenizer.save_pretrained(SAVE_DIR) !!! warning Sparse compression (including 2of4 sparsity) is no longer supported by LLM Compressor due to lack of hardware support and user interest. Please see https://github.com/vllm-project/vllm/pull/36799 for more information. 
-- When loading compressed models with `from_pretrained`, the compression format is automatically detected. +- When loading compressed models with `from_pretrained`, the compression format is automatically detected by `compressed-tensors`. - To use compressed models with vLLM, simply load them as you would any model: ```python from vllm import LLM model = LLM("./your-model-compressed") ``` -- Compression configurations are saved in the model's config file and are automatically applied when loading. - -For more information about compression algorithms and formats, please refer to the documentation and examples in the llmcompressor repository. \ No newline at end of file +- Compression configurations are saved in the model's `config.json` and are automatically applied when loading. \ No newline at end of file diff --git a/docs/index.md b/docs/index.md index 636e3776db..42242f82b8 100644 --- a/docs/index.md +++ b/docs/index.md @@ -21,7 +21,7 @@ For more information, see [Why use LLM Compressor?](./steps/why-llmcompressor.md ## New in this release -Review the [LLM Compressor v0.9.0 release notes](https://github.com/vllm-project/llm-compressor/releases/tag/0.9.0) for details about new features. Highlights include: +Review the [LLM Compressor v0.10.0 release notes](https://github.com/vllm-project/llm-compressor/releases/tag/0.10.0) for details about new features. Highlights include: !!! info "Updated offloading and model loading support" Loading transformers models that are offloaded to disk and/or offloaded across distributed process ranks is now supported. Disk offloading allows users to load and compress very large models which normally would not fit in CPU memory. Offloading functionality is no longer provided through accelerate; it is now handled by model loading utilities added to compressed-tensors. 
For a full summary of updated loading and offloading functionality, covering both single-process and distributed flows, see the [Big Models and Distributed Support guide](guides/big_models_and_distributed/model_loading.md).
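
As a rough illustration of the note in the saving guide above that compression configurations are saved in the model's `config.json` and applied automatically on load, the sketch below parses a hypothetical `quantization_config` block. The exact fields written by `save_pretrained` vary by scheme and compressed-tensors version, so treat the JSON here as an assumed minimal example, not the authoritative format:

```python
import json

# Hypothetical minimal excerpt of a config.json written by
# model.save_pretrained(..., save_compressed=True); real files
# contain many more fields, including per-group quantization args.
config_text = """
{
  "model_type": "llama",
  "quantization_config": {
    "quant_method": "compressed-tensors",
    "format": "pack-quantized"
  }
}
"""

config = json.loads(config_text)
# Loaders read this block to decide how to decompress the stored weights,
# which is why no extra arguments are needed at load time.
method = config["quantization_config"]["quant_method"]
fmt = config["quantization_config"]["format"]
print(method, fmt)  # compressed-tensors pack-quantized
```

This is why `from_pretrained` and vLLM's `LLM(...)` need no extra arguments for compressed checkpoints: the format is recorded alongside the weights rather than supplied by the caller.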