Commit c892006

Merge branch 'main' into kylesayrs/cache-util

2 parents: eadcb73 + dbc4bc5

File tree

27 files changed: +754 -39 lines

README.md

Lines changed: 1 addition & 0 deletions
@@ -18,6 +18,7 @@ Big updates have landed in LLM Compressor! To get a more in-depth look, check ou

Some of the exciting new features include:

+* **DeepSeekV3-style Block Quantization Support**: This allows for more efficient compression of large language models without needing a calibration dataset. Quantize a Qwen3 model to [W8A8](examples/quantization_w8a8_fp8/fp8_block_example.py).
* **Llama4 Quantization Support**: Quantize a Llama4 model to [W4A16](examples/multimodal_vision/llama4_example.py) or [NVFP4](examples/quantization_w4a4_fp4/llama4_example.py). The checkpoint produced can seamlessly run in vLLM.
* **Large Model Support with Sequential Onloading**: As of llm-compressor>=0.6.0, you can now quantize very large language models on a single GPU. Models are broken into disjoint layers which are then onloaded to the GPU one layer at a time. For more information on sequential onloading, see [Big Modeling with Sequential Onloading](examples/big_models_with_sequential_onloading/README.md) as well as the [DeepSeek-R1 Example](examples/quantizing_moe/deepseek_r1_example.py).
* **Preliminary FP4 Quantization Support:** Quantize weights and activations to FP4 and seamlessly run the compressed model in vLLM. Model weights and activations are quantized following the NVFP4 [configuration](https://github.com/neuralmagic/compressed-tensors/blob/f5dbfc336b9c9c361b9fe7ae085d5cb0673e56eb/src/compressed_tensors/quantization/quant_scheme.py#L104). See examples of [weight-only quantization](examples/quantization_w4a16_fp4/llama3_example.py) and [fp4 activation support](examples/quantization_w4a4_fp4/llama3_example.py). Support is currently preliminary and additional support will be added for MoEs.

docs/getting-started/compress.md

Lines changed: 55 additions & 1 deletion
@@ -59,4 +59,58 @@ oneshot(
)
```

When you run the above code, the compressed model is saved to the specified output directory: `TinyLlama-1.1B-Chat-v1.0-INT8`. You can then load this model using the Hugging Face Transformers library or vLLM for inference and testing.
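As a quick sanity check, reloading the compressed directory with Transformers might look like the sketch below. The prompt and generation settings are arbitrary, and `device_map="auto"` assumes `accelerate` is installed; this is an illustrative snippet rather than part of the documented example.

```python
# Minimal sketch: reload the compressed checkpoint and run a short generation.
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "TinyLlama-1.1B-Chat-v1.0-INT8", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("TinyLlama-1.1B-Chat-v1.0-INT8")

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0]))
```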

## Memory requirements for LLM Compressor

When compressing a model, be aware that the memory requirements depend on both the model size and the algorithm used, such as GPTQ/SparseGPT.

This section walks through how to calculate the CPU and GPU memory requirements for each algorithm using several popular models as examples: an 8B model, a 684B model, and a model with vision capabilities.

GPTQ/SparseGPT requires a large amount of auxiliary memory: it allocates an auxiliary Hessian matrix for each layer that is onloaded to the GPU, and these Hessian matrices are approximately as large as the weights they describe.

Also, larger models such as DeepSeek R1 use a large amount of CPU memory, and models with large vision towers, such as Command A, may use large amounts of GPU memory.

### Things to note when calculating memory requirements for LLM Compressor

1. A 1B-parameter model uses about 2 GB of memory to load:

    ```
    mem(1B parameters) ~= (1B parameters) * (2 bytes / parameter) = 2B bytes ~= 2 GB
    ```

2. Text decoder layers and vision tower layers are loaded onto the GPU very differently.

    For text decoder layers, LLM Compressor dynamically loads one layer at a time onto the GPU for computation. The rest of the model remains in CPU memory.

    Vision tower layers, however, are loaded onto the GPU all at once. Unlike the text model, vision towers are not split into individual layers before onloading, which can create a GPU memory bottleneck for models whose vision towers are larger than their text layers.

    At this time LLM Compressor does not quantize the vision tower, as quantizing it is generally not worth the trade-off between latency/throughput gains and accuracy loss.

3. LLM Compressor does not currently support tensor parallelism for compression. Supporting this feature would allow layers to be sharded across GPUs, reducing memory usage per GPU and speeding up compression.

### QuantizationModifier or Round-To-Nearest (RTN)

The QuantizationModifier (RTN) does not require any additional memory beyond the storage needed for its quantization parameters (scales and zero points).

If we ignore these scales and zero points in our calculation, we can estimate the following memory requirements:

| Model | CPU requirements | GPU requirements |
|-------|------------------|------------------|
| **Meta-Llama-3-8B-Instruct** | mem(8B params) ~= 16 GB | mem(1 layer) ~= 0.5 GB |
| **DeepSeek-R1-0528-BF16** | mem(684B params) ~= 1368 GB | mem(1 layer) ~= 22.4 GB |
| **Qwen2.5-VL-7B-Instruct** | mem(7B params) ~= 14 GB | max(mem(1 text layer) ~= 0.4 GB, mem(vision tower) ~= 1.3 GB) ~= 1.3 GB |
### GPTQ / SparseGPT

The GPTQ and SparseGPT algorithms differ from RTN in that they must also allocate auxiliary Hessian matrices for the layers that are onloaded to the GPU.

Each Hessian matrix is used to improve the accuracy recovery of the algorithm and is approximately the same size as the original weights of the layer.

| Model | CPU requirements | GPU requirements |
|-------|------------------|------------------|
| **Meta-Llama-3-8B-Instruct** | mem(8B params) ~= 16 GB | mem(1 layer) * 2 ~= 1 GB |
| **DeepSeek-R1-0528-BF16** | mem(684B params) ~= 1368 GB | mem(1 layer) * 2 ~= 44.8 GB |
| **Qwen2.5-VL-7B-Instruct** | mem(7B params) ~= 14 GB | max(mem(1 text layer) ~= 0.4 GB, mem(vision tower) ~= 1.3 GB) * 2 ~= 2.6 GB |
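To make the arithmetic behind these tables concrete, here is a minimal sketch of the estimate. The helper below is illustrative only (it is not an LLM Compressor API), and it assumes 16-bit weights while ignoring the small overhead of scales and zero points:

```python
# Hypothetical helper mirroring the tables above; not part of LLM Compressor.
def estimate_memory_gb(total_params_b: float, layer_params_b: float, algorithm: str = "rtn"):
    """Estimate CPU and GPU memory (in GB) for compressing a model.

    total_params_b: total parameters, in billions (e.g. 8 for an 8B model)
    layer_params_b: parameters in the largest onloaded block, in billions
    algorithm: "rtn" (no auxiliary memory) or "gptq"/"sparsegpt" (adds a
               Hessian roughly the size of the onloaded weights)
    """
    bytes_per_param = 2  # bf16/fp16 weights
    cpu_gb = total_params_b * bytes_per_param  # whole model held in CPU RAM
    gpu_gb = layer_params_b * bytes_per_param  # one onloaded layer or vision tower
    if algorithm in ("gptq", "sparsegpt"):
        gpu_gb *= 2  # add the auxiliary Hessian
    return cpu_gb, gpu_gb


# Roughly reproduces the Meta-Llama-3-8B-Instruct rows (32 layers of ~0.25B params):
# ~16 GB CPU, ~0.5 GB GPU for RTN and ~1 GB GPU for GPTQ/SparseGPT.
print(estimate_memory_gb(8, 0.25, "rtn"))
print(estimate_memory_gb(8, 0.25, "gptq"))
```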

docs/guides/compression_formats.md

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
# Compression Formats

The following table outlines the possible quantization and sparsity compression formats that are applied to a model during compression. The formats are determined according to the quantization scheme and sparsity type. For more details on the quantization schemes, see `guides/compression_schemes.md`.

| Quantization | Sparsity | Quant Compressor | Sparsity Compressor |
|---------------|----------|----------------------|---------------------|
| W8A8 - int | None | int_quantized | Dense |
| W8A8 - float | None | float_quantized | Dense |
| W4A16 - float | None | nvfp4_pack_quantized | Dense |
| W4A4 - float | None | nvfp4_pack_quantized | Dense |
| W4A16 - int | None | pack_quantized | Dense |
| W8A16 - int | None | pack_quantized | Dense |
| W8A16 - float | None | naive_quantized | Dense |
| W8A8 - int | 2:4 | int_quantized | Sparse24 |
| W8A8 - float | 2:4 | float_quantized | Sparse24 |
| W4A16 - int | 2:4 | marlin_24 | Dense |
| W8A16 - int | 2:4 | marlin_24 | Dense |
| W8A16 - float | 2:4 | naive_quantized | Dense |
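For context, a recipe that lands in the `int_quantized` + `Sparse24` row might combine a 2:4 pruning pass with W8A8 int quantization. The sketch below is hedged: the modifier arguments are modeled on LLM Compressor's 2:4 examples and are not a verified configuration.

```python
# Hedged sketch: prune to a 2:4 pattern, then quantize weights/activations to int8,
# which the table above maps to the int_quantized + Sparse24 formats.
from llmcompressor.modifiers.obcq import SparseGPTModifier
from llmcompressor.modifiers.quantization import GPTQModifier

recipe = [
    SparseGPTModifier(sparsity=0.5, mask_structure="2:4"),
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]
```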

examples/big_models_with_sequential_onloading/llama3.3_70b.py

Lines changed: 1 addition & 1 deletion
@@ -1,9 +1,9 @@
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

+from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
-from llmcompressor.transformers import oneshot
from llmcompressor.utils import dispatch_for_generation

# Select model and load it.

examples/multimodal_vision/qwen_2_5_vl_example.py

Lines changed: 1 addition & 1 deletion
@@ -6,8 +6,8 @@
from qwen_vl_utils import process_vision_info
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

+from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
-from llmcompressor.transformers import oneshot
from llmcompressor.utils import dispatch_for_generation

# Load model.

examples/quantization_w4a16/llama3_example.py

Lines changed: 1 addition & 1 deletion
@@ -1,8 +1,8 @@
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

+from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
-from llmcompressor.transformers import oneshot
from llmcompressor.utils import dispatch_for_generation

# Select model and load it.
Lines changed: 94 additions & 0 deletions
@@ -0,0 +1,94 @@
# `fp4` Quantization

`llm-compressor` supports quantizing weights and activations to `fp4` for memory savings and inference acceleration with `vLLM`. In particular, `nvfp4` is supported - a 4-bit floating point encoding format introduced with the NVIDIA Blackwell GPU architecture.

## Installation

To get started, install:

```bash
git clone https://github.com/vllm-project/llm-compressor.git
cd llm-compressor
pip install -e .
```

## Quickstart

The example includes an end-to-end script for applying the quantization algorithm.

```bash
python3 llama3_example.py
```

The resulting model `Meta-Llama-3-8B-Instruct-NVFP4` is ready to be loaded into vLLM.
Note: when running inference on a GPU older than SM100 (pre-Blackwell), vLLM will not run activation quantization and falls back to weight-only quantization.
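If you want to sanity-check the produced checkpoint, a minimal vLLM load might look like the following sketch. The prompt and sampling settings are arbitrary, and the checkpoint path assumes the default output of `llama3_example.py`.

```python
# Hedged sketch: load the compressed NVFP4 checkpoint and run a short generation.
from vllm import LLM, SamplingParams

llm = LLM(model="Meta-Llama-3-8B-Instruct-NVFP4")
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```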
## Code Walkthrough

Now, we will step through the code in the example:

1) Load model
2) Prepare calibration data
3) Apply quantization

### 1) Load Model

Load the model using `AutoModelForCausalLM`, which handles quantized saving and loading.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
```

### 2) Prepare Calibration Data

Prepare the calibration data. `nvfp4` quantization generates per-tensor global scales and per-group (size 16) local quantization scales for the weights, as well as per-tensor global scales for the activations. Per-group local activation quantization scales are generated dynamically at inference time. We need some sample data to calibrate the global activation scales; typically, a small number of samples is sufficient. In this example, we use a sample size of 20.

It is useful to use calibration data that closely matches the type of data used in deployment. If you have fine-tuned a model, using a sample of your training data is a good idea. In our case, we are quantizing an instruction-tuned generic model, so we will use the `ultrachat` dataset.
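A minimal preparation sketch is shown below, mirroring the pattern used in the example scripts. The dataset split, sequence length, and preprocessing details are assumptions taken from `llama3_example.py` rather than requirements.

```python
from datasets import load_dataset

NUM_CALIBRATION_SAMPLES = 20
MAX_SEQUENCE_LENGTH = 2048

# Load and shuffle a small slice of ultrachat for calibration.
ds = load_dataset(
    "HuggingFaceH4/ultrachat_200k", split=f"train_sft[:{NUM_CALIBRATION_SAMPLES}]"
)
ds = ds.shuffle(seed=42)


def preprocess(example):
    # Render the chat messages into a single prompt string.
    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}


ds = ds.map(preprocess)


def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )


ds = ds.map(tokenize, remove_columns=ds.column_names)
```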

### 3) Apply Quantization

With the dataset ready, we will now apply quantization.

We first select the quantization algorithm.

In our case, we will apply the default QuantizationModifier recipe for `nvfp4` to all linear layers.
> See the `Recipes` documentation for more information on making complex recipes.

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Configure the quantization algorithm to run.
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])

# Apply quantization.
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

# Save to disk compressed.
SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-NVFP4"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```

We have successfully created an `nvfp4` model!
# Quantizing MoEs

To quantize MoEs, a few additional steps are required. An example quantizing Llama4 can be found under `llama4_example.py`. Here, we replace all `Llama4TextMoe` modules by calling `replace_modules_for_calibration`. This replacement allows us to:

1. Linearize the model to enable quantization and execution in vLLM. This is required because the native model definition does not include `torch.nn.Linear` layers in its MoE blocks, which LLM Compressor needs in order to run quantization.
2. Ensure the experts are quantized correctly, since not all experts are activated during calibration.

Similarly, an example quantizing the Qwen3-30B-A3B model can be found under `qwen_30b_a3b.py`. This model does not require the additional linearization needed for Llama4. However, to ensure the experts are quantized correctly, we can pass in `calibrate_moe_context`, which temporarily updates the model definition to use `Qwen3MoeSparseMoeBlock` and changes how the forward pass is handled in the MoE block during calibration, as shown in the sketch below. Feel free to update the definition under `llm-compressor/src/llmcompressor/modeling/qwen3_moe.py` to experiment with this behavior and evaluate its impact on quantization performance.
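As a minimal sketch of the Qwen3-30B-A3B flow described above, the flag is passed straight to `oneshot`. This assumes `model`, `ds`, `recipe`, and the calibration constants have been prepared for Qwen3-30B-A3B as in `qwen_30b_a3b.py`.

```python
# Hedged sketch: calibrate_moe_context temporarily updates the MoE block
# definition during calibration so that the experts are quantized correctly.
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    calibrate_moe_context=True,
)
```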

examples/quantization_w4a4_fp4/qwen_30b_a3b.py

Lines changed: 4 additions & 1 deletion
@@ -60,7 +60,10 @@ def tokenize(sample):

# Apply quantization.
# We set `calibrate_moe_context` to True to update all `Qwen3MoeSparseMoeBlock`
-# during calibration
+# during calibration.
+# Feel free to update the definition under
+# `llm-compressor/src/llmcompressor/modeling/qwen3_moe.py` to play around with
+# this behavior and evaluate its impact on quantization performance
oneshot(
    model=model,
    dataset=ds,

examples/quantization_w8a8_fp8/fp8_block_example.py

Lines changed: 3 additions & 1 deletion
@@ -16,7 +16,9 @@
# * quantize the weights to fp8 with per channel via ptq
# * quantize the activations to fp8 with dynamic per token
recipe = QuantizationModifier(
-    targets="Linear", scheme="FP8_BLOCK", ignore=["lm_head", "re:.*mlp.gate$"],
+    targets="Linear",
+    scheme="FP8_BLOCK",
+    ignore=["lm_head", "re:.*mlp.gate$"],
)

# Apply quantization.

examples/quantizing_moe/deepseek_r1_example.py

Lines changed: 1 addition & 1 deletion
@@ -1,9 +1,9 @@
from datasets import load_dataset
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

+from llmcompressor import oneshot
from llmcompressor.modeling import replace_modules_for_calibration
from llmcompressor.modifiers.quantization import GPTQModifier
-from llmcompressor.transformers import oneshot

# Select model and load it.
