
Commit 06d5967

Merge branch 'main' into kylesayrs/transform-quip-modifier
2 parents f86e3ac + 29f4d56 commit 06d5967

3 files changed (+25, -18 lines)


README.md

Lines changed: 1 addition & 0 deletions
@@ -18,6 +18,7 @@ Big updates have landed in LLM Compressor! To get a more in-depth look, check ou
 
 Some of the exciting new features include:
 
+* **DeepSeekV3-style Block Quantization Support**: This allows for more efficient compression of large language models without needing a calibration dataset. Quantize a Qwen3 model to [W8A8](examples/quantization_w8a8_fp8/fp8_block_example.py).
 * **Llama4 Quantization Support**: Quantize a Llama4 model to [W4A16](examples/multimodal_vision/llama4_example.py) or [NVFP4](examples/quantization_w4a4_fp4/llama4_example.py). The checkpoint produced can seamlessly run in vLLM.
 * **Large Model Support with Sequential Onloading**: As of llm-compressor>=0.6.0, you can now quantize very large language models on a single GPU. Models are broken into disjoint layers which are then onloaded to the GPU one layer at a time. For more information on sequential onloading, see [Big Modeling with Sequential Onloading](examples/big_models_with_sequential_onloading/README.md) as well as the [DeepSeek-R1 Example](examples/quantizing_moe/deepseek_r1_example.py).
 * **Preliminary FP4 Quantization Support:** Quantize weights and activations to FP4 and seamlessly run the compressed model in vLLM. Model weights and activations are quantized following the NVFP4 [configuration](https://github.com/neuralmagic/compressed-tensors/blob/f5dbfc336b9c9c361b9fe7ae085d5cb0673e56eb/src/compressed_tensors/quantization/quant_scheme.py#L104). See examples of [weight-only quantization](examples/quantization_w4a16_fp4/llama3_example.py) and [fp4 activation support](examples/quantization_w4a4_fp4/llama3_example.py). Support is currently preliminary and additional support will be added for MoEs.
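For context, a minimal data-free sketch of how the new block-quantization path might be invoked. The authoritative recipe is examples/quantization_w8a8_fp8/fp8_block_example.py; the model choice and the "FP8_BLOCK" scheme name here are assumptions, hedged in the comments.

```python
# Minimal sketch (assumed API surface; see the linked example for the
# authoritative recipe).
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Qwen/Qwen3-8B"  # illustrative model choice

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# DeepSeekV3-style block quantization is data-free, so no calibration
# dataset is passed to oneshot. "FP8_BLOCK" is the assumed preset name.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_BLOCK", ignore=["lm_head"])
oneshot(model=model, recipe=recipe)

# Save the compressed checkpoint for use in vLLM.
save_dir = MODEL_ID.split("/")[-1] + "-FP8-BLOCK"
model.save_pretrained(save_dir, save_compressed=True)
tokenizer.save_pretrained(save_dir)
```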

docs/guides/compression_formats.md

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
+# Compression Formats
+
+The following table outlines the possible quantization and sparsity
+compression formats that are applied to a model during compression.
+The formats are determined according to the quantization scheme and
+sparsity type. For more details on the quantization schemes, see
+`guides/compression_schemes.md`.
+
+
+| Quantization  | Sparsity | Quant Compressor     | Sparsity Compressor |
+|---------------|----------|----------------------|---------------------|
+| W8A8 - int    | None     | int_quantized        | Dense               |
+| W8A8 - float  | None     | float_quantized      | Dense               |
+| W4A16 - float | None     | nvfp4_pack_quantized | Dense               |
+| W4A4 - float  | None     | nvfp4_pack_quantized | Dense               |
+| W4A16 - int   | None     | pack_quantized       | Dense               |
+| W8A16 - int   | None     | pack_quantized       | Dense               |
+| W8A16 - float | None     | naive_quantized      | Dense               |
+| W8A8 - int    | 2:4      | int_quantized        | Sparse24            |
+| W8A8 - float  | 2:4      | float_quantized      | Sparse24            |
+| W4A16 - int   | 2:4      | marlin_24            | Dense               |
+| W8A16 - int   | 2:4      | marlin_24            | Dense               |
+| W8A16 - float | 2:4      | naive_quantized      | Dense               |
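The new guide is effectively a lookup from (quantization scheme, sparsity structure) to a pair of compressor formats. A small illustrative sketch of that mapping as plain Python follows; the dictionary mirrors the table above and is not the library's actual selection logic, which lives in src/llmcompressor/transformers/compression/quantization_format.py.

```python
# Illustrative lookup only -- mirrors docs/guides/compression_formats.md.
FORMAT_BY_SCHEME_AND_SPARSITY = {
    ("W8A8 - int", None): ("int_quantized", "Dense"),
    ("W8A8 - float", None): ("float_quantized", "Dense"),
    ("W4A16 - float", None): ("nvfp4_pack_quantized", "Dense"),
    ("W4A4 - float", None): ("nvfp4_pack_quantized", "Dense"),
    ("W4A16 - int", None): ("pack_quantized", "Dense"),
    ("W8A16 - int", None): ("pack_quantized", "Dense"),
    ("W8A16 - float", None): ("naive_quantized", "Dense"),
    ("W8A8 - int", "2:4"): ("int_quantized", "Sparse24"),
    ("W8A8 - float", "2:4"): ("float_quantized", "Sparse24"),
    ("W4A16 - int", "2:4"): ("marlin_24", "Dense"),
    ("W8A16 - int", "2:4"): ("marlin_24", "Dense"),
    ("W8A16 - float", "2:4"): ("naive_quantized", "Dense"),
}


def lookup_formats(scheme: str, sparsity: str | None) -> tuple[str, str]:
    """Return (quant compressor, sparsity compressor) for a scheme/sparsity pair."""
    return FORMAT_BY_SCHEME_AND_SPARSITY[(scheme, sparsity)]


# Example: a 2:4-sparse W8A8 float model pairs float_quantized with Sparse24.
print(lookup_formats("W8A8 - float", "2:4"))
```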

src/llmcompressor/transformers/compression/quantization_format.py

Lines changed: 1 addition & 18 deletions
@@ -18,24 +18,7 @@ def infer_quantization_format(
     Infers the quantization format for a model based on its state and provided
     compression arguments.
 
-    The following table outlines the possible quantization and sparsity formats
-    along with their corresponding compressor formats:
-
-    +---------------+----------+----------------------+---------------------+
-    | Quantization  | Sparsity | Quant Compressor     | Sparsity Compressor |
-    |               |          | Format               | Format              |
-    +---------------+----------+----------------------+---------------------+
-    | W8A8 - int    | None     | int_quantized        | Dense               |
-    | W8A8 - float  | None     | float_quantized      | Dense               |
-    | W4A16 - int   | None     | pack_quantized       | Dense               |
-    | W8A16 - int   | None     | pack_quantized       | Dense               |
-    | W8A16 - float | None     | naive_quantized      | Dense               |
-    | W8A8 - int    | 2:4      | int_quantized        | Sparse24            |
-    | W8A8 - float  | 2:4      | float_quantized      | Sparse24            |
-    | W4A16 - int   | 2:4      | marlin_24            | Dense               |
-    | W8A16 - int   | 2:4      | marlin_24            | Dense               |
-    | W8A16 - float | 2:4      | naive_quantized      | Dense               |
-    +---------------+----------+----------------------+---------------------+
+    For a summary of the formats, see `docs/guides/compression_formats.md`.
 
     :param model: model to check for quantization, if the model is not quantized no
         quantization format is returned
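The format inferred here ends up recorded in the compressed checkpoint's config.json under quantization_config. A small sketch of inspecting it after a run like the one above; the exact key layout and the hyphen/underscore spelling of the format names follow compressed-tensors conventions and should be treated as assumptions.

```python
import json
from pathlib import Path

# Directory produced by the earlier (hypothetical) oneshot run.
checkpoint = Path("Qwen3-8B-FP8-BLOCK")

config = json.loads((checkpoint / "config.json").read_text())
quant_config = config.get("quantization_config", {})

# "format" should correspond to one of the quant compressor formats in the
# table above (e.g. "float-quantized"); the exact spelling may differ from
# the underscore style used in the docs table.
print(quant_config.get("format"))
```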
