
Commit 4a4f0be

[Multi-Quant] follow-up to README and example (#1855)

SUMMARY:
- [x] Add multi-quant information to top-level README
- [x] Add `--independent` flag to example so both sequential and independent pipelines can be run in automated testing

TEST PLAN:
- Example runs; freshly uploaded model checkpoint [here](https://huggingface.co/nm-testing/Meta-Llama-3-8B-Instruct-selfattn-w8a8-mlp-w4a16-sequential)

Signed-off-by: Brian Dellabetta <[email protected]>

1 parent 66efff5 · commit 4a4f0be

File tree

3 files changed: +50 −39 lines changed

- README.md
- examples/quantization_non_uniform/README.md
- examples/quantization_non_uniform/quantization_multiple_modifiers.py


README.md

Lines changed: 1 addition & 2 deletions

@@ -28,12 +28,11 @@ Big updates have landed in LLM Compressor! To get a more in-depth look, check ou
 
 Some of the exciting new features include:
 
+* **Quantization with Multiple Modifiers**: Multiple quantization modifiers can now be applied to the same model for mixed-precision quantization, for example applying AWQ W4A16 to a model's `self_attn` layers and GPTQ W8A8 to its `mlp` layers. This is an advanced usage of `llm-compressor` and an active area of research. See the [non-uniform quantization support](examples/quantization_non_uniform) section for more detail and [example usage](examples/quantization_non_uniform/quantization_multiple_modifiers.py).
 * **QuIP and SpinQuant-style Transforms**: The newly added [`QuIPModifier`](examples/transform/quip_example.py) and [`SpinQuantModifier`](examples/transform/spinquant_example.py) allow users to quantize their models after injecting hadamard weights into the computation graph, reducing quantization error and greatly improving accuracy recovery for low bit weight and activation quantization.
 * **DeepSeekV3-style Block Quantization Support**: This allows for more efficient compression of large language models without needing a calibration dataset. Quantize a Qwen3 model to [W8A8](examples/quantization_w8a8_fp8/fp8_block_example.py).
 * **Llama4 Quantization Support**: Quantize a Llama4 model to [W4A16](examples/multimodal_vision/llama4_example.py) or [NVFP4](examples/quantization_w4a4_fp4/llama4_example.py). The checkpoint produced can seamlessly run in vLLM.
 * **FP4 Quantization - now with MoE and non-uniform support:** Quantize weights and activations to FP4 and seamlessly run the compressed model in vLLM. Model weights and activations are quantized following the NVFP4 [configuration](https://github.com/neuralmagic/compressed-tensors/blob/f5dbfc336b9c9c361b9fe7ae085d5cb0673e56eb/src/compressed_tensors/quantization/quant_scheme.py#L104). See examples of [fp4 activation support](examples/quantization_w4a4_fp4/llama3_example.py), [MoE support](examples/quantization_w4a4_fp4/qwen_30b_a3b.py), and [Non-uniform quantization support](examples/quantization_non_uniform) where some layers are selectively quantized to fp8 for better recovery. You can also mix other quantization schemes, such as int8 and int4.
-* **Large Model Support with Sequential Onloading**: As of llm-compressor>=0.6.0, you can now quantize very large language models on a single GPU. Models are broken into disjoint layers which are then onloaded to the GPU one layer at a time. For more information on sequential onloading, see [Big Modeling with Sequential Onloading](examples/big_models_with_sequential_onloading/README.md) as well as the [DeepSeek-R1 Example](examples/quantizing_moe/deepseek_r1_example.py).
-* **Axolotl Sparse Finetuning Integration:** Seamlessly finetune sparse LLMs with our Axolotl integration. Learn how to create [fast sparse open-source models with Axolotl and LLM Compressor](https://developers.redhat.com/articles/2025/06/17/axolotl-meets-llm-compressor-fast-sparse-open). See also the [Axolotl integration docs](https://docs.axolotl.ai/docs/custom_integrations.html#llmcompressor).
 
 ### Supported Formats
 * Activation Quantization: W8A8 (int8 and fp8)
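The new multi-quant bullet above describes combining two quantization modifiers in a single recipe. As a rough illustration, here is a minimal, hedged sketch of what such a recipe can look like; the regex targets, `ignore` list, scheme strings, and the `llmcompressor.modifiers.awq` import path are assumptions on our part, and the exact recipe this commit exercises is the one in `examples/quantization_non_uniform/quantization_multiple_modifiers.py` (shown in the diff further below).

```python
# Hedged sketch: a mixed-precision recipe pairing GPTQ (W8A8 on attention)
# with AWQ (W4A16 on MLP), mirroring the layer split used by the example.
from llmcompressor.modifiers.awq import AWQModifier
from llmcompressor.modifiers.quantization import GPTQModifier

recipe = [
    # int8 weights and activations on the self-attention projections
    GPTQModifier(targets=["re:.*self_attn.*"], scheme="W8A8", ignore=["lm_head"]),
    # int4 weights (fp16 activations) on the MLP projections
    AWQModifier(targets=["re:.*mlp.*"], scheme="W4A16", ignore=["lm_head"]),
]

# The recipe is then passed to oneshot along with a calibration dataset, e.g.
# oneshot(model=model, dataset=ds, recipe=recipe, pipeline="sequential")
```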

examples/quantization_non_uniform/README.md

Lines changed: 1 addition & 1 deletion

@@ -15,6 +15,6 @@ It may also be interesting to quantize a model with two different [quantization
 This section outlines how multiple quantization modifiers can be applied to the same model for mixed-precision quantization, for example applying AWQ W4A16 to a model's `self_attn` layers and GPTQ W8A8 to its `mlp` layers. This heterogeneous application of multiple modifiers comes in 2 flavors:
 
 1. Run every modifier in a single, sequential pipeline, performing a single calibrated run. See `./quantization_multiple_modifiers.py` for an example.
-2. Run each modifier in its own, independent pipeline, performing a calibrated run for each modifier. To run each modifier independently, run `./quantization_multiple_modifiers.py` with `oneshot(..., pipeline="independent")` instead of `pipeline="sequential"`.
+2. Run each modifier in its own, independent pipeline, performing a calibrated run for each modifier. To run each modifier independently, run the example with the `--independent` flag set (`python ./quantization_multiple_modifiers.py --independent`).
 
 This is an advanced usage of `llm-compressor` and an active area of research. Best practices will be provided in a future release, after further research and sensitivity analysis.
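In effect, this README change documents that the two flavors now differ only in a CLI flag. A minimal sketch of the switch, mirroring the example diff below (the surrounding `oneshot` arguments are elided here):

```python
import argparse

# Parse the new --independent flag added by this commit.
parser = argparse.ArgumentParser()
parser.add_argument("--independent", action="store_true")
args = parser.parse_args()

# "sequential": all modifiers share one calibrated pass (the default).
# "independent": each modifier gets its own calibrated pass.
pipeline = "independent" if args.independent else "sequential"
# oneshot(model=model, dataset=ds, recipe=recipe, pipeline=pipeline)
```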
examples/quantization_non_uniform/quantization_multiple_modifiers.py

Lines changed: 48 additions & 36 deletions

@@ -1,3 +1,5 @@
+import argparse
+
 from datasets import load_dataset
 from transformers import AutoModelForCausalLM, AutoTokenizer
 
@@ -6,6 +8,18 @@
 from llmcompressor.modifiers.quantization import GPTQModifier
 from llmcompressor.utils import dispatch_for_generation
 
+
+def parse_args():
+    parser = argparse.ArgumentParser(description="Quantization with multiple modifiers")
+    parser.add_argument(
+        "--independent",
+        action="store_true",
+        help="Add this flag if you'd like to run each modifier "
+        "independently instead of in the same sequence",
+    )
+    return parser.parse_args()
+
+
 # Select model and load it.
 model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
 model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
@@ -20,10 +34,6 @@
 NUM_CALIBRATION_SAMPLES = 512
 MAX_SEQUENCE_LENGTH = 2048
 
-# Load dataset and preprocess.
-ds = load_dataset(DATASET_ID, split=f"{DATASET_SPLIT}[:{NUM_CALIBRATION_SAMPLES}]")
-ds = ds.shuffle(seed=42)
-
 
 def preprocess(example):
     return {
@@ -34,9 +44,6 @@ def preprocess(example):
     }
 
 
-ds = ds.map(preprocess)
-
-
 # Tokenize inputs.
 def tokenize(sample):
     return tokenizer(
@@ -48,8 +55,6 @@ def tokenize(sample):
     )
 
 
-ds = ds.map(tokenize, remove_columns=ds.column_names)
-
 # Configure the quantization algorithm to run.
 # * quantize self_attn layers to W8A8 with GPTQ
 # * quantize mlp layers to W4A16 with AWQ
@@ -72,30 +77,37 @@ def tokenize(sample):
     ),
 ]
 
-# Apply algorithms.
-oneshot(
-    model=model,
-    dataset=ds,
-    recipe=recipe,
-    max_seq_length=MAX_SEQUENCE_LENGTH,
-    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
-    # Option 1) run both modifiers in a single calibrated run
-    pipeline="sequential",
-    # Option 2) run each modifier in its own separate pipeline
-    # pipeline="independent",
-)
-
-# Confirm generations of the quantized model look sane.
-print("\n\n")
-print("========== SAMPLE GENERATION ==============")
-dispatch_for_generation(model)
-sample = tokenizer("Hello my name is", return_tensors="pt")
-sample = {key: value.to(model.device) for key, value in sample.items()}
-output = model.generate(**sample, max_new_tokens=100)
-print(tokenizer.decode(output[0]))
-print("==========================================\n\n")
-
-# Save to disk compressed.
-SAVE_DIR = model_id.rstrip("/").split("/")[-1] + "-gptq-w8a8-self_attn-awq-w4a16-mlp"
-model.save_pretrained(SAVE_DIR, save_compressed=True)
-tokenizer.save_pretrained(SAVE_DIR)
+if __name__ == "__main__":
+    args = parse_args()
+    # Load dataset and preprocess.
+    ds = load_dataset(DATASET_ID, split=f"{DATASET_SPLIT}[:{NUM_CALIBRATION_SAMPLES}]")
+    ds = ds.shuffle(seed=42)
+    ds = ds.map(preprocess)
+    ds = ds.map(tokenize, remove_columns=ds.column_names)
+
+    # Apply algorithms.
+    oneshot(
+        model=model,
+        dataset=ds,
+        recipe=recipe,
+        max_seq_length=MAX_SEQUENCE_LENGTH,
+        num_calibration_samples=NUM_CALIBRATION_SAMPLES,
+        pipeline="independent" if args.independent else "sequential",
+    )
+
+    # Confirm generations of the quantized model look sane.
+    print("\n\n")
+    print("========== SAMPLE GENERATION ==============")
+    dispatch_for_generation(model)
+    sample = tokenizer("Hello my name is", return_tensors="pt")
+    sample = {key: value.to(model.device) for key, value in sample.items()}
+    output = model.generate(**sample, max_new_tokens=100)
+    print(tokenizer.decode(output[0]))
+    print("==========================================\n\n")
+
+    # Save to disk compressed.
+    SAVE_DIR = (
+        model_id.rstrip("/").split("/")[-1] + "-gptq-w8a8-self_attn-awq-w4a16-mlp"
+    )
+    model.save_pretrained(SAVE_DIR, save_compressed=True)
+    tokenizer.save_pretrained(SAVE_DIR)
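The top-level README notes that checkpoints produced this way can be served with vLLM. As a hedged follow-up sketch (not part of this commit; it assumes a local vLLM installation, that your vLLM version supports this mixed-precision scheme, and a hypothetical local directory name produced by the example above):

```python
# Hedged sketch: load the compressed checkpoint saved by the example and run a
# quick generation sanity check with vLLM.
from vllm import LLM, SamplingParams

# Directory written by model.save_pretrained(...) in the example (assumed name).
save_dir = "Meta-Llama-3-8B-Instruct-gptq-w8a8-self_attn-awq-w4a16-mlp"

llm = LLM(model=save_dir)
outputs = llm.generate(["Hello my name is"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```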
