diff --git a/docs/.nav.yml b/docs/.nav.yml
index 2546bd9d75..2d0236b248 100644
--- a/docs/.nav.yml
+++ b/docs/.nav.yml
@@ -32,8 +32,11 @@ nav:
     - Memory Requirements: guides/memory.md
     - Runtime Performance: guides/runtime.md
   - Examples:
-    - examples/index.md
+    - examples/README.md
     - examples/*
+  - Experimental:
+    - experimental/README.md
+    - experimental/*
   - Developer:
     - developer/index.md
     - developer/*
diff --git a/docs/api/index.md b/docs/api/index.md
index 22812a1b49..6a0ffa1ed7 100644
--- a/docs/api/index.md
+++ b/docs/api/index.md
@@ -19,4 +19,4 @@ oneshot(
 ```

 For advanced usage, you can configure individual modifiers and apply them directly to models.
-See the [Examples](../examples/index.md) section for detailed usage patterns.
+See the [Examples](https://github.com/vllm-project/llm-compressor/tree/main/examples) section for detailed usage patterns.
diff --git a/docs/examples/index.md b/docs/examples/index.md
deleted file mode 100644
index 27d8aaad44..0000000000
--- a/docs/examples/index.md
+++ /dev/null
@@ -1,5 +0,0 @@
-# LLM Compressor examples
-
-This section provides practical demonstrations showing how to use LLM Compressor to optimize large language models for faster and more efficient deployment with vLLM. These examples will help you understand the various compression techniques and functionalities available in LLM Compressor, making it easier to apply them to your own models.
-
-Each example is designed to be self-contained, with clear instructions and code snippets that you can run directly.
diff --git a/docs/scripts/gen_files.py b/docs/scripts/gen_files.py
index af1b6cf0be..8daf6374f0 100644
--- a/docs/scripts/gen_files.py
+++ b/docs/scripts/gen_files.py
@@ -82,6 +82,16 @@ def migrate_examples():
     examples_path = project_root / "examples"
     files = []

+    # Add the main examples README.md
+    main_readme = examples_path / "README.md"
+    if main_readme.exists():
+        files.append(
+            ProcessFile(
+                root_path=main_readme.relative_to(project_root),
+                docs_path=Path("examples/README.md"),
+            )
+        )
+
     # Find all README.md files 2 levels down (examples/EXAMPLE_NAME/README.md)
     for example_dir in examples_path.iterdir():
         if (
@@ -101,6 +111,40 @@ def migrate_examples():

     process_files(files, project_root)


+def migrate_experimental():
+    project_root = find_project_root()
+    experimental_path = project_root / "experimental"
+    files = []
+
+    # Add the main experimental README.md
+    main_readme = experimental_path / "README.md"
+    if main_readme.exists():
+        files.append(
+            ProcessFile(
+                root_path=main_readme.relative_to(project_root),
+                docs_path=Path("experimental/README.md"),
+            )
+        )
+
+    # Find all README.md files 2 levels down (experimental/EXPERIMENTAL_NAME/README.md)
+    for experimental_dir in experimental_path.iterdir():
+        if (
+            not experimental_dir.is_dir()
+            or not (readme_path := experimental_dir / "README.md").exists()
+        ):
+            continue
+
+        experimental_name = experimental_dir.name
+        files.append(
+            ProcessFile(
+                root_path=readme_path.relative_to(project_root),
+                docs_path=Path(f"experimental/{experimental_name}.md"),
+            )
+        )
+
+    process_files(files, project_root)
+
+
 def migrate_readme_to_index():
     """Copy README.md files to index.md for MkDocs compatibility.
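The two migration helpers above differ only in the directory they walk; the shared README-collection logic could be factored into one generic function. A sketch only — `ProcessFile` here is a minimal stand-in mirroring the script's helper, not the actual import:

```python
from dataclasses import dataclass
from pathlib import Path


# Minimal stand-in for the script's ProcessFile helper, so the sketch is self-contained.
@dataclass
class ProcessFile:
    root_path: Path
    docs_path: Path


def collect_section_readmes(project_root: Path, section: str) -> list[ProcessFile]:
    """Collect <section>/README.md plus <section>/<name>/README.md files,
    mirroring migrate_examples() / migrate_experimental() in gen_files.py."""
    section_path = project_root / section
    files = []

    # Top-level README for the section
    main_readme = section_path / "README.md"
    if main_readme.exists():
        files.append(
            ProcessFile(
                root_path=main_readme.relative_to(project_root),
                docs_path=Path(f"{section}/README.md"),
            )
        )

    # README.md files one directory level down
    for child in section_path.iterdir():
        readme = child / "README.md"
        if child.is_dir() and readme.exists():
            files.append(
                ProcessFile(
                    root_path=readme.relative_to(project_root),
                    docs_path=Path(f"{section}/{child.name}.md"),
                )
            )
    return files
```

With this helper, both migrations reduce to `collect_section_readmes(root, "examples")` and `collect_section_readmes(root, "experimental")` followed by `process_files`.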
@@ -127,4 +171,5 @@ def migrate_readme_to_index():

     migrate_developer_docs()
     migrate_examples()
+    migrate_experimental()
     migrate_readme_to_index()
diff --git a/examples/README.md b/examples/README.md
new file mode 100644
index 0000000000..7cdcc20f6c
--- /dev/null
+++ b/examples/README.md
@@ -0,0 +1,33 @@
+---
+weight: -4
+---
+
+# LLM Compressor Examples
+
+The LLM Compressor examples are organized primarily by quantization scheme. Each folder contains model-specific examples showing how to apply that quantization scheme to a particular model.
+
+Some examples are additionally grouped by model type, such as:
+- `multimodal_audio`
+- `multimodal_vision`
+- `quantizing_moe`
+
+Other examples are grouped by algorithm, such as:
+- `awq`
+- `autoround`
+
+## How to find the right example
+
+- If you are interested in quantizing a specific model, start by browsing the model-type folders (for example, `multimodal_audio`, `multimodal_vision`, or `quantizing_moe`).
+- If you don’t see your model there, decide which quantization scheme you want to use (e.g., FP8, FP4, INT4, INT8, or KV cache / attention quantization) and look in the corresponding `quantization_***` folder.
+- Each quantization scheme folder contains at least one LLaMA 3 example, which can be used as a general reference for other models.
+
+## Where to start if you’re unsure
+
+If you’re unsure which quantization scheme to use, a good starting point is a data-free pathway, such as `w8a8_fp8`, found under `quantization_w8a8_fp8`. For more details on available schemes and when to use them, see the Compression Schemes [guide](https://docs.vllm.ai/projects/llm-compressor/en/latest/guides/compression_schemes/).
+
+## Need help?
+
+If you don’t see your model or aren’t sure which quantization scheme applies, feel free to open an issue and someone from the community will be happy to help.
+
+!!! note
+    We are currently updating and improving our documentation and examples structure. Feedback is very welcome during this transition.
\ No newline at end of file
diff --git a/examples/awq/README.md b/examples/awq/README.md
index ee7f00e602..321d77a960 100644
--- a/examples/awq/README.md
+++ b/examples/awq/README.md
@@ -1,4 +1,4 @@
-# Quantizing Models with Activation-Aware Quantization (AWQ) #
+# AWQ Quantization #

 Activation Aware Quantization (AWQ) is a state-of-the-art technique to quantize the weights of large language models which involves using a small calibration dataset to calibrate the model. The AWQ algorithm utilizes calibration data to derive scaling factors which reduce the dynamic range of weights while minimizing accuracy loss to the most salient weight values.
diff --git a/examples/big_models_with_sequential_onloading/README.md b/examples/big_models_with_sequential_onloading/README.md
index 60ec557ad1..0ebd550b4c 100644
--- a/examples/big_models_with_sequential_onloading/README.md
+++ b/examples/big_models_with_sequential_onloading/README.md
@@ -1,4 +1,5 @@
-# Big Modeling with Sequential Onloading #
+# Big Model Quantization with Sequential Onloading
+
 ## What is Sequential Onloading? ##

 Sequential onloading is a memory-efficient approach for compressing large language models (LLMs) using only a single GPU. Instead of loading the entire model into memory—which can easily require hundreds of gigabytes—this method loads and compresses one layer at a time. The outputs are offloaded before the next layer is processed, dramatically reducing peak memory usage while maintaining high compression fidelity.
diff --git a/examples/model_free_ptq/README.md b/examples/model_free_ptq/README.md
index 28b0e75c63..a33481405e 100644
--- a/examples/model_free_ptq/README.md
+++ b/examples/model_free_ptq/README.md
@@ -1,4 +1,4 @@
-# Quantizing models without a model definition
+# Model-free Quantization

 `model_free_ptq` provides a PTQ pathway for data-free schemes (such as FP8 Dynamic Per Token or FP8 Block).
 Specifically, this pathway removes the requirement for a model definition or the need to load the model through transformers. If you are interested in applying a data-free scheme, there are two key scenarios in which applying this pathway may make sense for your model:
diff --git a/examples/multimodal_audio/README.md b/examples/multimodal_audio/README.md
index 0f8250d8df..6aeb36d30e 100644
--- a/examples/multimodal_audio/README.md
+++ b/examples/multimodal_audio/README.md
@@ -1,4 +1,4 @@
-# Quantizing Multimodal Audio Models #
+# Multimodal Audio Model Quantization

 https://github.com/user-attachments/assets/6732c60b-1ebe-4bed-b409-c16c4415dff5
diff --git a/examples/multimodal_vision/README.md b/examples/multimodal_vision/README.md
index fc6f75c1d5..22fe907063 100644
--- a/examples/multimodal_vision/README.md
+++ b/examples/multimodal_vision/README.md
@@ -1,4 +1,4 @@
-# Quantizing Multimodal Vision-Language Models #
+# Multimodal Vision-Language Quantization #
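The data-free `w8a8_fp8` starting point recommended in the new examples README typically reduces to a single-modifier recipe. The following is only a sketch of what such a recipe looks like, with names following the `quantization_w8a8_fp8` example; consult that example for the authoritative version:

```yaml
# Illustrative data-free FP8 dynamic recipe (assumed structure)
quant_stage:
  quant_modifiers:
    QuantizationModifier:
      targets: ["Linear"]
      scheme: "FP8_DYNAMIC"
      ignore: ["lm_head"]
```

Because `FP8_DYNAMIC` uses dynamic activation quantization, no calibration dataset is required.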
diff --git a/examples/quantization_2of4_sparse_w4a16/2of4_w4a16_group-128_recipe.yaml b/examples/quantization_2of4_sparse_w4a16/2of4_w4a16_group-128_recipe.yaml
deleted file mode 100644
index bb76f11015..0000000000
--- a/examples/quantization_2of4_sparse_w4a16/2of4_w4a16_group-128_recipe.yaml
+++ /dev/null
@@ -1,20 +0,0 @@
-sparsity_stage:
- sparsity_modifiers:
- SparseGPTModifier:
- sparsity: 0.5
- mask_structure: "2:4"
- targets: ["Linear"]
- ignore: ["re:.*lm_head"]
-quantization_stage:
- quantization_modifiers:
- GPTQModifier:
- ignore: ["lm_head"]
- config_groups:
- group_0:
- weights:
- num_bits: 4
- type: "int"
- symmetric: true
- strategy: "group"
- group_size: 128
- targets: ["Linear"]
diff --git a/examples/quantization_2of4_sparse_w4a16/2of4_w4a16_recipe.yaml b/examples/quantization_2of4_sparse_w4a16/2of4_w4a16_recipe.yaml
deleted file mode 100644
index a5c40228a9..0000000000
--- a/examples/quantization_2of4_sparse_w4a16/2of4_w4a16_recipe.yaml
+++ /dev/null
@@ -1,32 +0,0 @@
-sparsity_stage:
- sparsity_modifiers:
- SparseGPTModifier:
- sparsity: 0.5
- mask_structure: "2:4"
- targets: ["Linear"]
- ignore: ["re:.*lm_head"]
-finetuning_stage:
- finetuning_modifiers:
- ConstantPruningModifier:
- targets: [
- 're:.*q_proj.weight',
- 're:.*k_proj.weight',
- 're:.*v_proj.weight',
- 're:.*o_proj.weight',
- 're:.*gate_proj.weight',
- 're:.*up_proj.weight',
- 're:.*down_proj.weight',
- ]
- start: 0
-quantization_stage:
- quantization_modifiers:
- GPTQModifier:
- ignore: ["lm_head"]
- config_groups:
- group_0:
- weights:
- num_bits: 4
- type: "int"
- symmetric: true
- strategy: "channel"
- targets: ["Linear"]
diff --git a/examples/quantization_2of4_sparse_w4a16/README.md b/examples/quantization_2of4_sparse_w4a16/README.md
deleted file mode 100644
index c72bc97d2b..0000000000
--- a/examples/quantization_2of4_sparse_w4a16/README.md
+++ /dev/null
@@ -1,131 +0,0 @@
-# `int4` Weight Quantization of a 2:4 Sparse Model
-
-> **DEPRECATION WARNING:** The `marlin_24` compression format is deprecated and will be removed in a future release, as vLLM no longer supports marlin_24 models. See [issue #2267](https://github.com/vllm-project/llm-compressor/issues/2267) for more details.
-
-`llm-compressor` supports quantizing weights while maintaining sparsity patterns for memory savings and inference acceleration with `vLLM`
-
-> `2:4 sparisty + int4/int8` mixed precision computation is supported in vLLM on Nvidia capability > 8.0 (Ampere, Ada Lovelace, Hopper).
-
-## NOTE: The following example no longer includes finetuning as training
-Training support has been deprecated as of v0.9.0. To apply finetuning
-to your sparse model, see the Axolotl integration blog post for best
-fine tuning practices
-https://developers.redhat.com/articles/2025/06/17/axolotl-meets-llm-compressor-fast-sparse-open
-
-
-## Installation
-
-To get started, install:
-
-```bash
-git clone https://github.com/vllm-project/llm-compressor.git
-cd llm-compressor
-pip install -e .
-```
-
-## Quickstart
-
-The example includes an end-to-end script for applying the quantization algorithm.
-
-```bash
-python3 llama7b_sparse_w4a16.py
-```
-
-
-# Creating a Sparse Quantized Llama7b Model
-
-This example uses LLMCompressor and Compressed-Tensors to create a 2:4 sparse and quantized Llama2-7b model.
-The model is calibrated and trained with the ultachat200k dataset.
-At least 75GB of GPU memory is required to run this example.
-
-Follow the steps below, or to run the example as `python examples/quantization_2of4_sparse_w4a16/llama7b_sparse_w4a16.py`
-
-## Step 1: Select a model, dataset, and recipe
-In this step, we select which model to use as a baseline for sparsification, a dataset to
-use for calibration and finetuning, and a recipe.
-
-Models can reference a local directory, or a model in the huggingface hub.
-
-Datasets can be from a local compatible directory or the huggingface hub.
-
-Recipes are YAML files that describe how a model should be optimized during or after training.
-The recipe used for this flow is located in [2of4_w4a16_recipe.yaml](./2of4_w4a16_recipe.yaml).
-It contains instructions to prune the model to 2:4 sparsity, run one epoch of recovery finetuning,
-and quantize to 4 bits in one show using GPTQ.
-
-```python
-from pathlib import Path
-
-import torch
-from loguru import logger
-from transformers import AutoModelForCausalLM, AutoTokenizer
-
-from llmcompressor import oneshot, train
-
-# load the model in as bfloat16 to save on memory and compute
-model_stub = "neuralmagic/Llama-2-7b-ultrachat200k"
-model = AutoModelForCausalLM.from_pretrained(model_stub, dtype=torch.bfloat16)
-tokenizer = AutoTokenizer.from_pretrained(model_stub)
-
-# uses LLM Compressor's built-in preprocessing for ultra chat
-dataset = "ultrachat-200k"
-
-# Select the recipe for 2 of 4 sparsity and 4-bit activation quantization
-recipe = "2of4_w4a16_recipe.yaml"
-
-# save location of quantized model
-output_dir = "output_llama7b_2of4_w4a16_channel"
-output_path = Path(output_dir)
-
-# set dataset config parameters
-splits = {"calibration": "train_gen[:5%]", "train": "train_gen"}
-max_seq_length = 512
-num_calibration_samples = 512
-preprocessing_num_workers = 8
-```
-
-## Step 2: Run `sparsification` and `quantization`
-The compression process now runs in two stages: sparsification and quantization.
-Each stage saves the intermediate model outputs to the `output_llama7b_2of4_w4a16_channel` directory.
-
-```python
-from llmcompressor import oneshot, train
-from pathlib import Path
-
-output_dir = "output_llama7b_2of4_w4a16_channel"
-output_path = Path(output_dir)
-
-# 1. Oneshot sparsification: apply pruning
-oneshot(
- model=model,
- **oneshot_kwargs,
- output_dir=output_dir,
- stage="sparsity_stage",
-)
-
-
-# 2. Oneshot quantization: compress model weights to lower precision
-quantized_model = oneshot(
- model=(output_path / "sparsity_stage"),
- **oneshot_kwargs,
- stage="quantization_stage",
-)
-
-# skip_sparsity_compression_stats is set to False
-# to account for sparsity in the model when compressing
-quantized_model.save_pretrained(
- f"{output_dir}/quantization_stage", skip_sparsity_compression_stats=False
-)
-tokenizer.save_pretrained(f"{output_dir}/quantization_stage")
-
-```
-
-### Custom Quantization
-The current repo supports multiple quantization techniques configured using a recipe. Supported strategies are tensor, group, and channel.
-
-The recipe (`2of4_w4a16_recipe.yaml`) uses channel-wise quantization (`strategy: "channel"`).
-To change the quantization strategy, edit the recipe file accordingly:
-
-Use `tensor` for per-tensor quantization
-Use `group` for group-wise quantization and specify the group_size parameter (e.g., 128)
-See `2of4_w4a16_group-128_recipe.yaml` for a group-size example
diff --git a/examples/quantization_2of4_sparse_w4a16/llama7b_sparse_w4a16.py b/examples/quantization_2of4_sparse_w4a16/llama7b_sparse_w4a16.py
deleted file mode 100644
index 918fb2a793..0000000000
--- a/examples/quantization_2of4_sparse_w4a16/llama7b_sparse_w4a16.py
+++ /dev/null
@@ -1,77 +0,0 @@
-# NOTE: The following example no longer includes finetuning as training.
-
-# Training support has been deprecated as of v0.9.0. To apply finetuning
-# to your sparse model, see the Axolotl integration blog post for best
-# fine tuning practices
-# https://developers.redhat.com/articles/2025/06/17/axolotl-meets-llm-compressor-fast-sparse-open
-
-# DEPRECATION WARNING: The marlin24 compression format is deprecated and will
-# be removed in a future release, as vLLM no longer supports marlin24 models.
-# See https://github.com/vllm-project/llm-compressor/issues/2267 for details.
-
-import warnings
-from pathlib import Path
-
-import torch
-from transformers import AutoModelForCausalLM, AutoTokenizer
-
-from llmcompressor import oneshot
-
-# load the model in as bfloat16 to save on memory and compute
-model_stub = "neuralmagic/Llama-2-7b-ultrachat200k"
-model = AutoModelForCausalLM.from_pretrained(model_stub, dtype=torch.bfloat16)
-tokenizer = AutoTokenizer.from_pretrained(model_stub)
-
-# uses LLM Compressor's built-in preprocessing for ultra chat
-dataset = "ultrachat-200k"
-
-# Select the recipe for 2 of 4 sparsity and 4-bit activation quantization
-recipe = "2of4_w4a16_recipe.yaml"
-
-# save location of quantized model
-output_dir = "output_llama7b_2of4_w4a16_channel"
-output_path = Path(output_dir)
-
-# set dataset config parameters
-splits = {"calibration": "train_gen[:5%]"}
-max_seq_length = 512
-num_calibration_samples = 10
-preprocessing_num_workers = 64
-
-oneshot_kwargs = dict(
- dataset=dataset,
- recipe=recipe,
- num_calibration_samples=num_calibration_samples,
- preprocessing_num_workers=preprocessing_num_workers,
- splits=splits,
-)
-
-# Models are automatically saved in
-# ./output_llama7b_2of4_w4a16_channel/ + (sparsity/quantization)_stage
-
-# Oneshot sparsification
-oneshot(
- model=model,
- **oneshot_kwargs,
- output_dir=output_dir,
- stage="sparsity_stage",
-)
-
-# Oneshot quantization
-quantized_model = oneshot(
- model=(output_path / "sparsity_stage"),
- **oneshot_kwargs,
- stage="quantization_stage",
-)
-quantized_model.save_pretrained(
- f"{output_dir}/quantization_stage", skip_sparsity_compression_stats=False
-)
-tokenizer.save_pretrained(f"{output_dir}/quantization_stage")
-
-warnings.warn(
- "The marlin24 compression format is deprecated and will be removed in a future "
- "release, as vLLM no longer supports marlin24 models. "
- "See https://github.com/vllm-project/llm-compressor/issues/2267 for details.",
- DeprecationWarning,
- stacklevel=2,
-)
diff --git a/examples/quantization_kv_cache/README.md b/examples/quantization_kv_cache/README.md
index c1f4d0421a..07d15a205f 100644
--- a/examples/quantization_kv_cache/README.md
+++ b/examples/quantization_kv_cache/README.md
@@ -1,6 +1,6 @@
-# `fp8` Weight, Activation, and KV Cache Quantization
+# KV Cache Quantization
-`llmcompressor` now supports quantizing weights, activations, and KV cache to `fp8` for memory savings and inference acceleration with `vllm`.
+`llmcompressor` supports quantizing the KV cache to `fp8` for memory savings and inference acceleration with `vllm`.
> `fp8` computation is supported on NVIDIA GPUs with compute capability > 8.9 (Ada Lovelace, Hopper).
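KV cache quantization is enabled through a `kv_cache_scheme` entry in the recipe. A sketch of such a recipe, modeled on the structure used in this example (treat field names as illustrative):

```yaml
# Illustrative fp8 KV cache recipe (assumed structure)
quant_stage:
  quant_modifiers:
    QuantizationModifier:
      kv_cache_scheme:
        num_bits: 8
        type: float
        strategy: tensor
        dynamic: false
        symmetric: true
```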
diff --git a/examples/quantization_w4a4_fp4/README.md b/examples/quantization_w4a4_fp4/README.md
index 5f22dad4b7..5410ebb9d9 100644
--- a/examples/quantization_w4a4_fp4/README.md
+++ b/examples/quantization_w4a4_fp4/README.md
@@ -1,4 +1,6 @@
-# `fp4` Quantization
+# `fp4` Quantization with NVFP4
+
+For weight-only FP4 quantization (e.g., MXFP4A16, NVFP4A16), see the examples [here](https://github.com/vllm-project/llm-compressor/tree/main/examples/quantization_w4a16_fp4).
`llm-compressor` supports quantizing weights and activations to `fp4` for memory savings and inference acceleration with `vLLM`. In particular, `nvfp4` is supported - a 4-bit floating point encoding format introduced with the NVIDIA Blackwell GPU architecture.
@@ -81,14 +83,3 @@ tokenizer.save_pretrained(SAVE_DIR)
```
We have successfully created an `nvfp4` model!
-
-# Quantizing MoEs
-
-To quantize MoEs, MoE calibration is now handled automatically by the pipeline. An example quantizing Llama4 can be found under `llama4_example.py`. The pipeline automatically applies the appropriate MoE calibration context which:
-
-1. Linearizes the model to enable quantization and execution in vLLM. This is required as the native model definition does not include `torch.nn.Linear` layers in its MoE blocks, a requirement for LLM Compressor to run quantization.
-2. Ensures experts are quantized correctly as not all experts are activated during calibration
-
-Similarly, an example quantizing the Qwen3-30B-A3B model can be found under `qwen_30b_a3b.py`. This model uses contextual MoE calibration which temporarily updates the model definition to use `Qwen3MoeSparseMoeBlock` which updates how the forward pass is handled in the MoE block during calibration. Feel free to update the definition under `llm-compressor/src/llmcompressor/modeling/qwen3_moe.py` to play around with this behavior and evaluate its impact on quantization performance.
-
-
diff --git a/examples/quantizing_moe/README.md b/examples/quantizing_moe/README.md
index 89c47fbfc4..c8ee22d13d 100644
--- a/examples/quantizing_moe/README.md
+++ b/examples/quantizing_moe/README.md
@@ -1,101 +1,139 @@
-# Quantizing Mixtral-8x7B-Instruct-v0.1 Model with FP8
+# Quantizing Mixture-of-Experts (MoE) Models
-This directory contains example scripts for quantizing LLMs using the static per-tensor FP8 quantization scheme.
+These examples demonstrate how to quantize MoE models using `llm-compressor`. We'll walk through the GLM-4.7 example, which applies AWQ quantization to create a W4A16 (4-bit weights, 16-bit activations) model.
-## Installation
+## End-to-End Example: Quantizing GLM-4.7
-To get started, install the necessary dependencies by executing the following commands:
+You can run the complete example with:
```bash
-git clone https://github.com/vllm-project/llm-compressor.git
-cd llm-compressor
-pip install -e .
+python3 glm4_7_example.py
```
-## Quickstart
+This example demonstrates quantizing the `zai-org/GLM-4.7` MoE model using AWQ (Activation-aware Weight Quantization) to 4-bit precision. The process automatically handles MoE-specific calibration requirements.
-The provided example script demonstrates an end-to-end process for applying the quantization algorithm:
+### Step 1: Load the Model and Tokenizer
-```bash
-python3 mixtral_example.py
+First, load the GLM-4.7 model and its tokenizer from the Hugging Face Hub:
+
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+from llmcompressor.modeling.glm4_moe import CalibrationGlm4MoeMoE # noqa: F401
+
+model_id = "zai-org/GLM-4.7"
+model = AutoModelForCausalLM.from_pretrained(model_id, dtype="auto")
+tokenizer = AutoTokenizer.from_pretrained(model_id)
```
-## Creating a Quantized MoE Model
+**Important**: The import of `CalibrationGlm4MoeMoE` is crucial for proper MoE calibration. This custom module automatically replaces the original `Glm4MoeMoE` class during calibration to ensure all experts are properly calibrated, even those that wouldn't normally be activated for certain tokens. More details on this can be found in [Quantizing MoEs with a custom definition](#quantizing-moes-with-a-custom-definition).
-This example leverages `llm-compressor` and `compressed-tensors` to create an FP8-quantized `Mixtral-8x7B-Instruct-v0.1` model. The model is calibrated and trained using the `ultrachat_200k` dataset.
+### Step 2: Prepare the Calibration Dataset
-You can follow the detailed steps below or simply run the example script with:
+Load and preprocess a calibration dataset. In this example, we use `ultrachat_200k`:
-```bash
-python mixtral_example.py
+```python
+from datasets import load_dataset
+
+DATASET_ID = "HuggingFaceH4/ultrachat_200k"
+DATASET_SPLIT = "train_sft"
+NUM_CALIBRATION_SAMPLES = 512
+MAX_SEQUENCE_LENGTH = 2048
+
+# Load and shuffle the dataset
+ds = load_dataset(DATASET_ID, split=f"{DATASET_SPLIT}[:{NUM_CALIBRATION_SAMPLES}]")
+ds = ds.shuffle(seed=42)
+
+# Apply chat template
+def preprocess(example):
+ return {
+ "text": tokenizer.apply_chat_template(
+ example["messages"],
+ tokenize=False,
+ )
+ }
+
+ds = ds.map(preprocess)
+
+# Tokenize
+def tokenize(sample):
+ return tokenizer(
+ sample["text"],
+ padding=False,
+ max_length=MAX_SEQUENCE_LENGTH,
+ truncation=True,
+ add_special_tokens=False,
+ )
+
+ds = ds.map(tokenize, remove_columns=ds.column_names)
```
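The two mapping steps above implement a simple contract: render each conversation to text with the chat template, then tokenize with truncation and no padding. A dependency-free sketch of that contract with a stub tokenizer (illustrative only; the real example uses the Hugging Face tokenizer):

```python
MAX_SEQUENCE_LENGTH = 8  # tiny value for illustration


def apply_chat_template(messages):
    # Stub for tokenizer.apply_chat_template(..., tokenize=False)
    return "\n".join(f"{m['role']}: {m['content']}" for m in messages)


def tokenize(text, max_length=MAX_SEQUENCE_LENGTH):
    # Stub tokenizer: whitespace tokens, truncated, no padding
    tokens = text.split()
    return {"input_ids": tokens[:max_length]}


sample = {"messages": [{"role": "user", "content": "hello there general kenobi " * 4}]}
encoded = tokenize(apply_chat_template(sample["messages"]))
```

The key point is that truncation caps every calibration sample at `MAX_SEQUENCE_LENGTH`, bounding calibration memory and runtime.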
-### Step 1: Select a Model, Dataset, and Recipe
+**Note**: 512 calibration samples is a good starting point. Increasing the number of samples can improve quantization accuracy.
-In this step, you'll choose a base model for quantization, a dataset for calibration, and a quantization recipe.
+### Step 3: Configure the Quantization Recipe
-- **Models**: Can be referenced from a local directory or retrieved from the Hugging Face Hub.
-- **Datasets**: Can also be from a local directory or the Hugging Face Hub.
-- **Recipes**: These are YAML files or Python modifier objects that describe how a model should be optimized during or after training. In this example, we use a `QuantizationModifier` object with the scheme set to `FP8`.
+Define which layers to quantize and which to ignore. GLM-4.7 has dense layers at the beginning that should be excluded:
```python
-from llmcompressor.modifiers.quantization import QuantizationModifier
-
-recipe = QuantizationModifier(scheme="FP8", targets="Linear", ignore=["lm_head", "re:.*block_sparse_moe.gate"])
+from llmcompressor.modifiers.awq import AWQModifier
+
+moe_ignores = [
+ # Layers 0-2: Dense layers - ignore entire layers
+ "model.layers.0.*",
+ "model.layers.1.*",
+ "model.layers.2.*",
+ # Ignore the output head
+ "lm_head",
+]
+
+# Configure AWQ with W4A16 (4-bit weights, 16-bit activations)
+recipe = AWQModifier(targets="Linear", scheme="W4A16", ignore=moe_ignores)
```
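The glob-style patterns in `moe_ignores` are matched against fully qualified module names. A rough pure-Python illustration of that intent using `fnmatch` — the library's actual matcher lives in compressed-tensors and also supports `re:` regex prefixes, so treat this only as a sketch:

```python
from fnmatch import fnmatch

moe_ignores = [
    "model.layers.0.*",
    "model.layers.1.*",
    "model.layers.2.*",
    "lm_head",
]


def is_ignored(module_name: str) -> bool:
    # Illustrates the intent of the ignore list with shell-style globs;
    # not the matching implementation llm-compressor actually uses.
    return any(module_name == pat or fnmatch(module_name, pat) for pat in moe_ignores)
```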
-NOTE: `.*block_sparse_moe.gate` layers do not quantize well, hence they are ignored!
+**Why ignore these layers?**
+- Layers 0-2 are dense (non-MoE) layers that may be sensitive to aggressive quantization
+- The `lm_head` (language model head) is typically kept at higher precision for better output quality
-### Step 2: Run Quantization Using Oneshot
+### Step 4: Run Quantization with `oneshot`
-The `oneshot` method applies the selected recipe to your model and dataset without requiring any fine-tuning. The model will be sparsified and saved to `Mixtral-8x7B-Instruct-v0.1-FP8`.
+Apply the quantization recipe using the `oneshot` method:
```python
from llmcompressor import oneshot
-output_dir = "Mixtral-8x7B-Instruct-v0.1-FP8"
-
oneshot(
model=model,
- dataset=dataset,
+ dataset=ds,
recipe=recipe,
- save_compressed=True,
- output_dir=output_dir,
- max_seq_length=2048,
- num_calibration_samples=512,
+ max_seq_length=MAX_SEQUENCE_LENGTH,
+ num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)
+```
+
+The `oneshot` method:
+- Calibrates the quantization parameters using the provided dataset
+- Applies AWQ to quantize weights to 4-bit precision
+- Automatically uses the calibration-friendly MoE definition to ensure all experts are properly calibrated
+
+### Step 5: Save the Quantized Model
+Save the compressed model to disk:
+
+```python
+SAVE_DIR = model_id.rstrip("/").split("/")[-1] + "-W4A16-G128"
+model.save_pretrained(SAVE_DIR, save_compressed=True)
+tokenizer.save_pretrained(SAVE_DIR)
```
-### Custom Quantization
+The model will be saved in a compressed format with 4-bit weights, ready for vLLM inference.
-NOTE: Only per-tensor quantization is supported in vLLM as of now (`vllm==0.6.1`)
+# Quantizing MoEs with a custom definition
+Quantizing MoE models with a scheme that requires calibration data (for example, schemes where activations are not dynamic, such as FP8 or INT8 per-tensor activations, or NVFP4), or with an algorithm that requires data (such as GPTQ, AWQ, or AutoRound), requires a calibration-friendly MoE block definition for the model being quantized.
-The repository supports multiple quantization techniques configured via a recipe. Supported strategies include `tensor`, `group`, and `channel` quantization.
+Examples of calibration-friendly definitions can be found in the [modeling folder](https://github.com/vllm-project/llm-compressor/tree/main/src/llmcompressor/modeling). Each definition enables an MoE calibration context by inheriting from the [`MoECalibrationModule` class](https://github.com/vllm-project/llm-compressor/blob/main/src/llmcompressor/modeling/moe_context.py) and registering the MoE block that should be replaced with a custom definition.
-In the above example, quantization is specified by the `FP8` scheme. For other preset schemes, refer to the [quantization schemes](https://github.com/neuralmagic/compressed-tensors/blob/main/src/compressed_tensors/quantization/quant_scheme.py) in the `compressed-tensors` library.
+In particular, each model-specific definition includes an updated forward pass that ensures all tokens are routed through all experts during calibration, including experts that would not normally be activated. Only the activated experts contribute to the final output of the MoE block. This behavior ensures proper calibration of all expert layers.
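The routing behavior described above can be illustrated with a toy forward pass (a pure-Python sketch, not the library's implementation): during calibration every expert sees every token, but only the router-selected experts contribute to the output.

```python
def moe_forward(token, experts, router_scores, top_k=1, calibrate=False):
    # Rank experts by router score and pick the top-k for this token.
    ranked = sorted(range(len(experts)), key=lambda i: router_scores[i], reverse=True)
    selected = set(ranked[:top_k])

    output = 0.0
    for i, expert in enumerate(experts):
        if calibrate or i in selected:
            y = expert(token)  # expert runs, so observers see its activations
            if i in selected:
                output += router_scores[i] * y  # only selected experts contribute
    return output
```

In calibration mode every expert executes (so its quantization observers collect statistics), yet the numerical output matches the normal top-k forward pass.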
-A custom scheme can also be specified using `config_groups`:
+These custom definitions replace the existing MoE implementations during `oneshot` processing. The replacement can be either temporary or permanent; in the temporary case, the original definition is restored after calibration. In the GLM-4.7 example above, the `CalibrationGlm4MoeMoE` custom definition registers a replacement of all `Glm4MoeMoE` instances from the transformers library with the calibration-friendly version. You can see this definition replacement applied in [llmcompressor/modeling/glm4_moe.py](https://github.com/vllm-project/llm-compressor/blob/main/src/llmcompressor/modeling/glm4_moe.py).
-```python
-# Example of defining a custom quantization scheme
-
-from llmcompressor.modifiers.quantization.gptq import GPTQModifier
-
-config_groups = {
- "group_0": {
- "targets": ["Linear"],
- "input_activations": None,
- "output_activations": None,
- "weights": {
- "num_bits": 8,
- "type": "int",
- "symmetric": True,
- "strategy": "group",
- "group_size": 128,
- }
- }
-}
+Without a custom calibration-friendly definition, MoE experts may be calibrated incorrectly, which can result in numerical instability or NaNs.
-recipe = GPTQModifier(config_groups=config_groups)
-```
diff --git a/examples/sparse_2of4_quantization_fp8/README.md b/examples/sparse_2of4_quantization_fp8/README.md
index 1655fee3a7..a72ea0d7fd 100644
--- a/examples/sparse_2of4_quantization_fp8/README.md
+++ b/examples/sparse_2of4_quantization_fp8/README.md
@@ -1,4 +1,4 @@
-# Applying 2:4 Sparsity with Optional FP8 Quantization
+# 2:4 Sparsity with FP8 Quantization
This script demonstrates how to apply **2:4 structured sparsity** with and without **FP8 quantization** to the `Meta-Llama-3-8B-Instruct` model using the `llm-compressor` library. The compressed model is optimized for memory efficiency and faster inference on supported GPUs.
diff --git a/experimental/README.md b/experimental/README.md
index 30fc1866d6..605b404ce2 100644
--- a/experimental/README.md
+++ b/experimental/README.md
@@ -1,3 +1,3 @@
# Experimental Features
-This folder aims to highlight features that are a work-in-progress or are supported in LLM Compressor and/or Compressed-Tensors but lack full support in downstream libraries like vLLM.
+Experimental features are works in progress, or features that are supported in LLM Compressor and/or Compressed-Tensors but lack full support in downstream libraries like vLLM.
diff --git a/experimental/mistral/README.md b/experimental/mistral/README.md
index b42b673d90..1d97a13ce7 100644
--- a/experimental/mistral/README.md
+++ b/experimental/mistral/README.md
@@ -1,2 +1,3 @@
# Mistral-format model compression (experimental)
-For quantizing mistral models which do not have a huggingface model definition such as `mistralai/Devstral-Small-2505`, `mistralai/Magistral-Small-2506`, and `mistralai/mistral-large-3`, please use the [`model_free_ptq`](/src/llmcompressor/entrypoints/model_free/) entrypoint.
\ No newline at end of file
+
+To quantize Mistral models that do not have a Hugging Face model definition, such as `mistralai/Devstral-Small-2505`, `mistralai/Magistral-Small-2506`, and `mistralai/mistral-large-3`, use the [`model_free_ptq`](/src/llmcompressor/entrypoints/model_free/) entrypoint.
\ No newline at end of file
diff --git a/experimental/mxfp4/README.md b/experimental/mxfp4/README.md
new file mode 100644
index 0000000000..f0455ca335
--- /dev/null
+++ b/experimental/mxfp4/README.md
@@ -0,0 +1,5 @@
+# MXFP4 Quantization
+
+vLLM currently supports MXFP4A16 quantization, i.e., weight-only MXFP4 quantization. Examples can be found [here](https://github.com/vllm-project/llm-compressor/tree/main/examples/quantization_w4a16_fp4/mxfp4).
+
+However, you can still generate MXFP4 models with LLM Compressor. These models have fully dynamic activations, and this pathway has not yet been enabled for compressed-tensors models in vLLM.
\ No newline at end of file
diff --git a/src/llmcompressor/entrypoints/README.md b/src/llmcompressor/entrypoints/README.md
index 7694d7aa81..6058ab72b0 100644
--- a/src/llmcompressor/entrypoints/README.md
+++ b/src/llmcompressor/entrypoints/README.md
@@ -9,8 +9,7 @@ A complete list of formats can be found here: https://docs.vllm.ai/projects/llm-
### Sparsification
Sparsification reduces model complexity by pruning selected weight values to zero while retaining essential weights in a subset of parameters. Supported formats include:
-- [2:4-Sparsity with FP4 Weight](../../../examples/quantization_2of4_sparse_w4a16/README.md)
-- [2:4-Sparsity with FP8 Weight, FP8 Input Activation](../../../examples/sparse_2of4_quantization_fp8/README.md)
+- [2:4 Sparsity with FP8 Weight and Activation Quantization](../../../examples/sparse_2of4_quantization_fp8/README.md)
### Example