Merged
Changes from 1 commit
2 changes: 1 addition & 1 deletion examples/awq/README.md
@@ -1,4 +1,4 @@
# Quantizing Models with Activation-Aware Quantization (AWQ) #
# AWQ Quantization #

Activation-Aware Quantization (AWQ) is a state-of-the-art technique for quantizing the weights of large language models using a small calibration dataset. The AWQ algorithm uses the calibration data to derive scaling factors that reduce the dynamic range of the weights while minimizing accuracy loss for the most salient weight values.
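The core trick behind AWQ can be illustrated with a toy numeric sketch (this is a conceptual illustration, not the llm-compressor implementation; the helper names and the `alpha=0.5` exponent are illustrative): salient input channels, identified by large calibration activations, get their weight columns scaled up while the matching activations are scaled down, leaving the layer output mathematically unchanged but making the weights friendlier to quantize.

```python
# Toy sketch of the AWQ scaling idea (illustrative, not the llm-compressor API).

def awq_scales(act_magnitudes, alpha=0.5):
    """Derive one scale per input channel from calibration activation magnitudes."""
    return [m ** alpha for m in act_magnitudes]

def apply_scales(weights, acts, scales):
    # W' = W * s (per input channel), x' = x / s  =>  W' @ x' == W @ x
    w2 = [[w * s for w, s in zip(row, scales)] for row in weights]
    x2 = [a / s for a, s in zip(acts, scales)]
    return w2, x2

W = [[0.1, 2.0], [0.3, -1.5]]
x = [4.0, 0.25]                     # channel 0 carries large activations
scales = awq_scales([abs(v) for v in x])
W2, x2 = apply_scales(W, x, scales)

# the layer output is mathematically unchanged by the rescaling
orig = [sum(w * a for w, a in zip(row, x)) for row in W]
new = [sum(w * a for w, a in zip(row, x2)) for row in W2]
```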

3 changes: 2 additions & 1 deletion examples/big_models_with_sequential_onloading/README.md
@@ -1,4 +1,5 @@
# Big Modeling with Sequential Onloading #
# Big Model Quantization with Sequential Onloading #

## What is Sequential Onloading? ##
Sequential onloading is a memory-efficient approach for compressing large language models (LLMs) using only a single GPU. Instead of loading the entire model into memory—which can easily require hundreds of gigabytes—this method loads and compresses one layer at a time. The outputs are offloaded before the next layer is processed, dramatically reducing peak memory usage while maintaining high compression fidelity.
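The loop described above can be sketched in miniature (a conceptual toy, not the llm-compressor pipeline; `compress_layer` and the dictionary-based "layers" are stand-ins): each layer is brought onto the device, compressed, and released before the next one, so peak on-device usage stays at one layer.

```python
# Minimal sketch of a sequential-onloading loop (hypothetical helpers).

def compress_layer(layer):
    # stand-in for real per-layer compression (e.g. quantization)
    return {"name": layer["name"], "compressed": True}

def sequential_onload(layers, device_budget=1):
    on_device = []
    compressed = []
    for layer in layers:
        on_device.append(layer)                 # onload this layer
        compressed.append(compress_layer(layer))
        on_device.pop()                         # offload before the next one
        assert len(on_device) <= device_budget  # peak memory stays bounded
    return compressed

model = [{"name": f"layer{i}"} for i in range(4)]
out = sequential_onload(model)
```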

2 changes: 1 addition & 1 deletion examples/model_free_ptq/README.md
@@ -1,4 +1,4 @@
# Quantizing models without a model definition
# Model-free Quantization

`model_free_ptq` provides a PTQ pathway for data-free schemes (such as FP8 Dynamic Per Token or FP8 Block). Specifically, this pathway removes the requirement for a model definition and the need to load the model through transformers. If you are interested in applying a data-free scheme, there are two key scenarios in which this pathway may make sense for your model:
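What makes a scheme like FP8 Dynamic Per Token data-free is that each scale is computed from the tensor itself at runtime, with no calibration pass. A toy numeric sketch of the per-token scaling (illustrative only; real kernels cast to the FP8 E4M3 type rather than just rescaling):

```python
# Sketch of dynamic per-token FP8 scaling: each token row gets its own scale
# so that its max magnitude maps onto the FP8 E4M3 maximum (448.0).

FP8_E4M3_MAX = 448.0

def per_token_scales(activations):
    return [max(abs(v) for v in row) / FP8_E4M3_MAX for row in activations]

tokens = [
    [0.5, -2.0, 1.0],     # small-range token
    [100.0, -448.0, 7.0], # large-range token
]
scales = per_token_scales(tokens)
# dividing by its own scale puts every token into the representable FP8 range
quantized = [[v / s for v in row] for row, s in zip(tokens, scales)]
```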

2 changes: 1 addition & 1 deletion examples/multimodal_audio/README.md
@@ -1,4 +1,4 @@
# Quantizing Multimodal Audio Models #
# Multimodal Audio Model Quantization #

https://github.com/user-attachments/assets/6732c60b-1ebe-4bed-b409-c16c4415dff5

2 changes: 1 addition & 1 deletion examples/multimodal_vision/README.md
@@ -1,4 +1,4 @@
# Quantizing Multimodal Vision-Language Models #
# Multimodal Vision-Language Quantization #

<p align="center" style="text-align: center;">
<img src=http://images.cocodataset.org/train2017/000000231895.jpg alt="sample image from MS COCO dataset"/>

This file was deleted.

32 changes: 0 additions & 32 deletions examples/quantization_2of4_sparse_w4a16/2of4_w4a16_recipe.yaml

This file was deleted.

131 changes: 0 additions & 131 deletions examples/quantization_2of4_sparse_w4a16/README.md

This file was deleted.

77 changes: 0 additions & 77 deletions examples/quantization_2of4_sparse_w4a16/llama7b_sparse_w4a16.py

This file was deleted.

4 changes: 2 additions & 2 deletions examples/quantization_kv_cache/README.md
@@ -1,6 +1,6 @@
# `fp8` Weight, Activation, and KV Cache Quantization
# KV Cache Quantization

`llmcompressor` now supports quantizing weights, activations, and KV cache to `fp8` for memory savings and inference acceleration with `vllm`.
`llmcompressor` supports quantizing the KV cache to `fp8` for memory savings and inference acceleration with `vllm`.
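Unlike dynamic activation quantization, KV cache scales are typically static: calibrated once and then reused for every new cache entry. A toy sketch of the idea (illustrative only; the helper names are hypothetical and real kernels cast to the FP8 E4M3 type):

```python
# Toy sketch of static per-tensor KV-cache scales: one scale each for keys and
# values, derived from calibration data, then reused with clipping at runtime.

FP8_E4M3_MAX = 448.0

def calibrate_scale(calib_values):
    return max(abs(v) for v in calib_values) / FP8_E4M3_MAX

def quantize(value, scale):
    q = value / scale
    return max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, q))  # clip out-of-range values

k_scale = calibrate_scale([0.1, -3.0, 2.2])
v_scale = calibrate_scale([10.0, -44.8, 5.0])
```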

> `fp8` computation is supported on NVIDIA GPUs with compute capability >= 8.9 (Ada Lovelace, Hopper).

17 changes: 4 additions & 13 deletions examples/quantization_w4a4_fp4/README.md
@@ -1,4 +1,6 @@
# `fp4` Quantization
# `fp4` Quantization with NVFP4

For weight-only FP4 quantization (e.g., MXFP4A16, NVFP4A16), see the examples [here](../quantization_w4a16_fp4/).

`llm-compressor` supports quantizing weights and activations to `fp4` for memory savings and inference acceleration with `vLLM`. In particular, `nvfp4` is supported: a 4-bit floating point encoding format introduced with the NVIDIA Blackwell GPU architecture.
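The core of the format can be shown with a toy sketch (a conceptual simplification: real NVFP4 stores an FP8 scale per 16-element block plus a tensor-level scale, whereas this toy uses a plain float scale): values in each block of 16 snap to the nearest FP4 (E2M1) representable magnitude.

```python
# Conceptual sketch of NVFP4 block quantization (not the real kernel).

FP4_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # E2M1 magnitudes

def quantize_block(block):
    scale = max(abs(v) for v in block) / 6.0   # map block max onto the FP4 max
    def snap(v):
        mag = min(FP4_VALUES, key=lambda c: abs(abs(v) / scale - c))
        return (mag if v >= 0 else -mag) * scale
    return [snap(v) for v in block], scale

block = [0.0, 0.3, -1.2, 6.0] + [0.1] * 12     # toy block of 16 values
deq, scale = quantize_block(block)
```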

@@ -80,15 +82,4 @@ model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```

We have successfully created an `nvfp4` model!

# Quantizing MoEs

To quantize MoEs, MoE calibration is now handled automatically by the pipeline. An example quantizing Llama4 can be found under `llama4_example.py`. The pipeline automatically applies the appropriate MoE calibration context, which:

1. Linearizes the model to enable quantization and execution in vLLM. This is required because the native model definition does not include `torch.nn.Linear` layers in its MoE blocks, which LLM Compressor requires to run quantization.
2. Ensures experts are quantized correctly, since not all experts are activated during calibration

Similarly, an example quantizing the Qwen3-30B-A3B model can be found under `qwen_30b_a3b.py`. This model uses contextual MoE calibration, which temporarily updates the model definition to use `Qwen3MoeSparseMoeBlock`, changing how the forward pass is handled in the MoE block during calibration. Feel free to update the definition under `llm-compressor/src/llmcompressor/modeling/qwen3_moe.py` to experiment with this behavior and evaluate its impact on quantization performance.
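Why experts can go uncalibrated is easy to see with a toy top-2 router (an illustrative sketch, not the Llama4 or Qwen3 routing code): any expert that never lands in the top-k for any calibration token receives no calibration data at all.

```python
# Toy top-2 MoE router: experts outside the top-k for every calibration
# token get no calibration data, which is why special handling is needed.

def top2_experts(logits):
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return ranked[:2]

calib_token_logits = [
    [3.0, 2.0, 0.1, 0.0],
    [2.5, 3.5, 0.2, 0.1],
]
hit = set()
for logits in calib_token_logits:
    hit.update(top2_experts(logits))
unseen = {0, 1, 2, 3} - hit   # experts that saw no calibration data
```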


6 changes: 2 additions & 4 deletions examples/quantizing_moe/README.md
@@ -1,6 +1,6 @@
# Quantizing Mixtral-8x7B-Instruct-v0.1 Model with FP8
# Quantizing MoEs

This directory contains example scripts for quantizing LLMs using the static per-tensor FP8 quantization scheme.
This directory contains example scripts for quantizing MoEs.

## Installation

@@ -69,8 +69,6 @@ oneshot(

### Custom Quantization

NOTE: Only per-tensor quantization is supported in vLLM as of now (`vllm==0.6.1`)

The repository supports multiple quantization techniques configured via a recipe. Supported strategies include `tensor`, `group`, and `channel` quantization.
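The difference between the three strategies is just the granularity at which scales are computed. A toy absmax sketch (illustrative only; real schemes are configured via the recipe, not computed by hand like this):

```python
# Sketch of scale granularity: one scale for the whole tensor, one per
# output channel (row), or one per fixed-size group within each row.

def absmax(vals):
    return max(abs(v) for v in vals)

W = [[1.0, -8.0, 0.5, 2.0],
     [0.2, 0.1, -0.4, 0.3]]

tensor_scale = absmax([v for row in W for v in row])           # 1 scale
channel_scales = [absmax(row) for row in W]                    # 1 per row
group_scales = [[absmax(row[i:i + 2]) for i in range(0, 4, 2)]
                for row in W]                                   # group_size=2
```

Finer granularity (channel, then group) tracks local ranges more tightly, at the cost of storing more scales.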

In the above example, quantization is specified by the `FP8` scheme. For other preset schemes, refer to the [quantization schemes](https://github.com/neuralmagic/compressed-tensors/blob/main/src/compressed_tensors/quantization/quant_scheme.py) in the `compressed-tensors` library.
2 changes: 1 addition & 1 deletion examples/sparse_2of4_quantization_fp8/README.md
@@ -1,4 +1,4 @@
# Applying 2:4 Sparsity with Optional FP8 Quantization
# 2:4 Sparsity with FP8 Quantization

This script demonstrates how to apply **2:4 structured sparsity** with and without **FP8 quantization** to the `Meta-Llama-3-8B-Instruct` model using the `llm-compressor` library. The compressed model is optimized for memory efficiency and faster inference on supported GPUs.
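The 2:4 pattern itself is simple to state: in every contiguous group of four weights, the two with the largest magnitude are kept and the other two are zeroed. A toy sketch of magnitude-based 2:4 pruning (illustrative only, not the library's pruning code):

```python
# Sketch of 2:4 structured sparsity: keep the 2 largest-magnitude weights
# out of every contiguous group of 4, zero the rest.

def prune_2of4(weights):
    out = []
    for i in range(0, len(weights), 4):
        group = weights[i:i + 4]
        keep = sorted(range(len(group)),
                      key=lambda j: abs(group[j]), reverse=True)[:2]
        out.extend(v if j in keep else 0.0 for j, v in enumerate(group))
    return out

w = [0.9, -0.1, 0.05, -1.2, 0.3, 0.2, -0.8, 0.7]
sparse = prune_2of4(w)
```

Because exactly two of every four values are zero, supported GPUs can skip half the multiplications with structured-sparse kernels.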
