
Commit 4a4f0be

[Multi-Quant] follow-up to README and example (#1855)

SUMMARY:
- [x] Add multi-quant information to top-level README
- [x] Add `--independent` flag to example so both sequential and independent pipelines can be run in automated testing

TEST PLAN:
- Example runs; freshly uploaded model checkpoint [here](https://huggingface.co/nm-testing/Meta-Llama-3-8B-Instruct-selfattn-w8a8-mlp-w4a16-sequential)

Signed-off-by: Brian Dellabetta <[email protected]>

1 parent 66efff5 · commit 4a4f0be

File tree

3 files changed: +50 −39 lines changed

- README.md
- examples/quantization_non_uniform/README.md
- examples/quantization_non_uniform/quantization_multiple_modifiers.py


README.md

Lines changed: 1 addition & 2 deletions

@@ -28,12 +28,11 @@ Big updates have landed in LLM Compressor! To get a more in-depth look, check ou
 
 Some of the exciting new features include:
 
+* **Quantization with Multiple Modifiers**: Multiple quantization modifiers can now be applied to the same model for mixed-precision quantization, for example applying AWQ W4A16 to a model's `self_attn` layers and GPTQ W8A8 to its `mlp` layers. This is an advanced usage of `llm-compressor` and an active area of research. See the [non-uniform quantization support](examples/quantization_non_uniform) section for more detail and [example usage](examples/quantization_non_uniform/quantization_multiple_modifiers.py).
 * **QuIP and SpinQuant-style Transforms**: The newly added [`QuIPModifier`](examples/transform/quip_example.py) and [`SpinQuantModifier`](examples/transform/spinquant_example.py) allow users to quantize their models after injecting hadamard weights into the computation graph, reducing quantization error and greatly improving accuracy recovery for low bit weight and activation quantization.
 * **DeepSeekV3-style Block Quantization Support**: This allows for more efficient compression of large language models without needing a calibration dataset. Quantize a Qwen3 model to [W8A8](examples/quantization_w8a8_fp8/fp8_block_example.py).
 * **Llama4 Quantization Support**: Quantize a Llama4 model to [W4A16](examples/multimodal_vision/llama4_example.py) or [NVFP4](examples/quantization_w4a4_fp4/llama4_example.py). The checkpoint produced can seamlessly run in vLLM.
 * **FP4 Quantization - now with MoE and non-uniform support:** Quantize weights and activations to FP4 and seamlessly run the compressed model in vLLM. Model weights and activations are quantized following the NVFP4 [configuration](https://github.com/neuralmagic/compressed-tensors/blob/f5dbfc336b9c9c361b9fe7ae085d5cb0673e56eb/src/compressed_tensors/quantization/quant_scheme.py#L104). See examples of [fp4 activation support](examples/quantization_w4a4_fp4/llama3_example.py), [MoE support](examples/quantization_w4a4_fp4/qwen_30b_a3b.py), and [Non-uniform quantization support](examples/quantization_non_uniform) where some layers are selectively quantized to fp8 for better recovery. You can also mix other quantization schemes, such as int8 and int4.
-* **Large Model Support with Sequential Onloading**: As of llm-compressor>=0.6.0, you can now quantize very large language models on a single GPU. Models are broken into disjoint layers which are then onloaded to the GPU one layer at a time. For more information on sequential onloading, see [Big Modeling with Sequential Onloading](examples/big_models_with_sequential_onloading/README.md) as well as the [DeepSeek-R1 Example](examples/quantizing_moe/deepseek_r1_example.py).
-* **Axolotl Sparse Finetuning Integration:** Seamlessly finetune sparse LLMs with our Axolotl integration. Learn how to create [fast sparse open-source models with Axolotl and LLM Compressor](https://developers.redhat.com/articles/2025/06/17/axolotl-meets-llm-compressor-fast-sparse-open). See also the [Axolotl integration docs](https://docs.axolotl.ai/docs/custom_integrations.html#llmcompressor).
 
 ### Supported Formats
 * Activation Quantization: W8A8 (int8 and fp8)
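The new multi-quant bullet above describes combining two quantization modifiers in a single recipe. As a rough illustration, here is a minimal, hedged sketch of what such a recipe can look like; the regex targets, `ignore` list, scheme strings, and the `llmcompressor.modifiers.awq` import path are assumptions on our part, and the exact recipe this commit exercises is the one in `examples/quantization_non_uniform/quantization_multiple_modifiers.py` (shown in the diff further below).

```python
# Hedged sketch: a mixed-precision recipe pairing GPTQ (W8A8 on attention)
# with AWQ (W4A16 on MLP), mirroring the layer split used by the example.
from llmcompressor.modifiers.awq import AWQModifier
from llmcompressor.modifiers.quantization import GPTQModifier

recipe = [
    # int8 weights and activations on the self-attention projections
    GPTQModifier(targets=["re:.*self_attn.*"], scheme="W8A8", ignore=["lm_head"]),
    # int4 weights (fp16 activations) on the MLP projections
    AWQModifier(targets=["re:.*mlp.*"], scheme="W4A16", ignore=["lm_head"]),
]

# The recipe is then passed to oneshot along with a calibration dataset, e.g.
# oneshot(model=model, dataset=ds, recipe=recipe, pipeline="sequential")
```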

examples/quantization_non_uniform/README.md

Lines changed: 1 addition & 1 deletion

@@ -15,6 +15,6 @@ It may also be interesting to quantize a model with two different [quantization
 This section outlines how multiple quantization modifiers can be applied to the same model for mixed-precision quantization, for example applying AWQ W4A16 to a model's `self_attn` layers and GPTQ W8A8 to its `mlp` layers. This heterogeneous application of multiple modifiers comes in 2 flavors:
 
 1. Run every modifier in a single, sequential pipeline, performing a single calibrated run. See `./quantization_multiple_modifiers.py` for an example.
-2. Run each modifier in its own, independent pipeline, performing a calibrated run for each modifier. To run each modifier independently, run `./quantization_multiple_modifiers.py` with `oneshot(..., pipeline="independent")` instead of `pipeline="sequential"`.
+2. Run each modifier in its own, independent pipeline, performing a calibrated run for each modifier. To run each modifier independently, run the example with the `--independent` flag set (`python ./quantization_multiple_modifiers.py --independent`).
 
 This is an advanced usage of `llm-compressor` and an active area of research. Best practices will be provided in a future release, after further research and sensitivity analysis.
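In effect, this README change documents that the two flavors now differ only in a CLI flag. A minimal sketch of the switch, mirroring the example diff below (the surrounding `oneshot` arguments are elided here):

```python
import argparse

# Parse the new --independent flag added by this commit.
parser = argparse.ArgumentParser()
parser.add_argument("--independent", action="store_true")
args = parser.parse_args()

# "sequential": all modifiers share one calibrated pass (the default).
# "independent": each modifier gets its own calibrated pass.
pipeline = "independent" if args.independent else "sequential"
# oneshot(model=model, dataset=ds, recipe=recipe, pipeline=pipeline)
```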
examples/quantization_non_uniform/quantization_multiple_modifiers.py

Lines changed: 48 additions & 36 deletions

@@ -1,3 +1,5 @@
+import argparse
+
 from datasets import load_dataset
 from transformers import AutoModelForCausalLM, AutoTokenizer
 
@@ -6,6 +8,18 @@
 from llmcompressor.modifiers.quantization import GPTQModifier
 from llmcompressor.utils import dispatch_for_generation
 
+
+def parse_args():
+    parser = argparse.ArgumentParser(description="Quantization with multiple modifiers")
+    parser.add_argument(
+        "--independent",
+        action="store_true",
+        help="Add this flag if you'd like to run each modifier "
+        "independently instead of in the same sequence",
+    )
+    return parser.parse_args()
+
+
 # Select model and load it.
 model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
 model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
@@ -20,10 +34,6 @@
 NUM_CALIBRATION_SAMPLES = 512
 MAX_SEQUENCE_LENGTH = 2048
 
-# Load dataset and preprocess.
-ds = load_dataset(DATASET_ID, split=f"{DATASET_SPLIT}[:{NUM_CALIBRATION_SAMPLES}]")
-ds = ds.shuffle(seed=42)
-
 
 def preprocess(example):
     return {
@@ -34,9 +44,6 @@ def preprocess(example):
     }
 
 
-ds = ds.map(preprocess)
-
-
 # Tokenize inputs.
 def tokenize(sample):
     return tokenizer(
@@ -48,8 +55,6 @@ def tokenize(sample):
     )
 
 
-ds = ds.map(tokenize, remove_columns=ds.column_names)
-
 # Configure the quantization algorithm to run.
 # * quantize self_attn layers to W8A8 with GPTQ
 # * quantize mlp layers to W4A16 with AWQ
@@ -72,30 +77,37 @@ def tokenize(sample):
     ),
 ]
 
-# Apply algorithms.
-oneshot(
-    model=model,
-    dataset=ds,
-    recipe=recipe,
-    max_seq_length=MAX_SEQUENCE_LENGTH,
-    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
-    # Option 1) run both modifiers in a single calibrated run
-    pipeline="sequential",
-    # Option 2) run each modifier in its own separate pipeline
-    # pipeline="independent",
-)
-
-# Confirm generations of the quantized model look sane.
-print("\n\n")
-print("========== SAMPLE GENERATION ==============")
-dispatch_for_generation(model)
-sample = tokenizer("Hello my name is", return_tensors="pt")
-sample = {key: value.to(model.device) for key, value in sample.items()}
-output = model.generate(**sample, max_new_tokens=100)
-print(tokenizer.decode(output[0]))
-print("==========================================\n\n")
-
-# Save to disk compressed.
-SAVE_DIR = model_id.rstrip("/").split("/")[-1] + "-gptq-w8a8-self_attn-awq-w4a16-mlp"
-model.save_pretrained(SAVE_DIR, save_compressed=True)
-tokenizer.save_pretrained(SAVE_DIR)
+if __name__ == "__main__":
+    args = parse_args()
+    # Load dataset and preprocess.
+    ds = load_dataset(DATASET_ID, split=f"{DATASET_SPLIT}[:{NUM_CALIBRATION_SAMPLES}]")
+    ds = ds.shuffle(seed=42)
+    ds = ds.map(preprocess)
+    ds = ds.map(tokenize, remove_columns=ds.column_names)
+
+    # Apply algorithms.
+    oneshot(
+        model=model,
+        dataset=ds,
+        recipe=recipe,
+        max_seq_length=MAX_SEQUENCE_LENGTH,
+        num_calibration_samples=NUM_CALIBRATION_SAMPLES,
+        pipeline="independent" if args.independent else "sequential",
+    )
+
+    # Confirm generations of the quantized model look sane.
+    print("\n\n")
+    print("========== SAMPLE GENERATION ==============")
+    dispatch_for_generation(model)
+    sample = tokenizer("Hello my name is", return_tensors="pt")
+    sample = {key: value.to(model.device) for key, value in sample.items()}
+    output = model.generate(**sample, max_new_tokens=100)
+    print(tokenizer.decode(output[0]))
+    print("==========================================\n\n")
+
+    # Save to disk compressed.
+    SAVE_DIR = (
+        model_id.rstrip("/").split("/")[-1] + "-gptq-w8a8-self_attn-awq-w4a16-mlp"
+    )
+    model.save_pretrained(SAVE_DIR, save_compressed=True)
+    tokenizer.save_pretrained(SAVE_DIR)
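The top-level README notes that checkpoints produced this way can be served with vLLM. As a hedged follow-up sketch (not part of this commit; it assumes a local vLLM installation, that your vLLM version supports this mixed-precision scheme, and a hypothetical local directory name produced by the example above):

```python
# Hedged sketch: load the compressed checkpoint saved by the example and run a
# quick generation sanity check with vLLM.
from vllm import LLM, SamplingParams

# Directory written by model.save_pretrained(...) in the example (assumed name).
save_dir = "Meta-Llama-3-8B-Instruct-gptq-w8a8-self_attn-awq-w4a16-mlp"

llm = LLM(model=save_dir)
outputs = llm.generate(["Hello my name is"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```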
