
Commit 8aee6ce

kylesayrs and dsikka authored
[MoE] Cleanup MoE examples (#1576)
## Purpose ##

* Update MoE examples to reflect latest MoE models
* Remove redundant moe examples, standardize examples around FP8

## Prerequisites ##

* #1572
* #1535

## Changes ##

* Just three examples
  * `deepseek_r1_example.py`
  * `mixtral_example.py`
  * `qwen_example.py`
* Update examples tests to run mixtral and qwen but not deepseek examples

---------

Signed-off-by: Kyle Sayers <[email protected]>
Co-authored-by: Dipika Sikka <[email protected]>
1 parent 0a20392 commit 8aee6ce

File tree: 8 files changed, +46 -354 lines changed

examples/quantizing_moe/README.md

Lines changed: 16 additions & 17 deletions
@@ -17,17 +17,17 @@ pip install -e .
 The provided example script demonstrates an end-to-end process for applying the quantization algorithm:
 
 ```bash
-python3 mixtral_moe_w8a8_fp8.py
+python3 mixtral_example.py
 ```
 
 ## Creating a Quantized MoE Model
 
-This example leverages `llm-compressor` and `compressed-tensors` to create an FP8-quantized `Mixtral-8x7B-Instruct-v0.1` model. The model is calibrated and trained using the `open_platypus` dataset.
+This example leverages `llm-compressor` and `compressed-tensors` to create an FP8-quantized `Mixtral-8x7B-Instruct-v0.1` model. The model is calibrated and trained using the `ultrachat_200k` dataset.
 
 You can follow the detailed steps below or simply run the example script with:
 
 ```bash
-python mixtral_moe_w8a8_fp8.py
+python mixtral_example.py
 ```
 
 ### Step 1: Select a Model, Dataset, and Recipe
@@ -61,7 +61,6 @@ oneshot(
     recipe=recipe,
     save_compressed=True,
     output_dir=output_dir,
-
     max_seq_length=2048,
     num_calibration_samples=512,
 )
@@ -74,7 +73,7 @@ NOTE: Only per-tensor quantization is supported in vLLM as of now (`vllm==0.6.1`
 
 The repository supports multiple quantization techniques configured via a recipe. Supported strategies include `tensor`, `group`, and `channel` quantization.
 
-In the above example, FP8 per-tensor quantization is used as specified by the `FP8` scheme. For other preset schemes, refer to the [quantization schemes](https://github.com/neuralmagic/compressed-tensors/blob/main/src/compressed_tensors/quantization/quant_scheme.py) in the `compressed-tensors` library.
+In the above example, quantization is specified by the `FP8` scheme. For other preset schemes, refer to the [quantization schemes](https://github.com/neuralmagic/compressed-tensors/blob/main/src/compressed_tensors/quantization/quant_scheme.py) in the `compressed-tensors` library.
 
 A custom scheme can also be specified using `config_groups`:
 
@@ -84,18 +83,18 @@ A custom scheme can also be specified using `config_groups`:
 from llmcompressor.modifiers.quantization.gptq import GPTQModifier
 
 config_groups = {
-  "group_0": {
-    "targets": ["Linear"],
-    "input_activations": None,
-    "output_activations": None,
-    "weights": {
-      "num_bits": 8,
-      "type": "int",
-      "symmetric": true,
-      "strategy": "group",
-      "group_size": 128,
-    }
-  }
+    "group_0": {
+        "targets": ["Linear"],
+        "input_activations": None,
+        "output_activations": None,
+        "weights": {
+            "num_bits": 8,
+            "type": "int",
+            "symmetric": True,
+            "strategy": "group",
+            "group_size": 128,
+        }
+    }
 }
 
 recipe = GPTQModifier(config_groups=config_groups)
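
The README's custom-scheme section stops at the recipe object. As an illustration only (not part of this commit), the sketch below wires that `config_groups` recipe into the same `oneshot` call shown earlier in the README; the model ID, calibration dataset, output directory, and sample counts are assumed placeholders borrowed from the surrounding examples.

```python
# Hypothetical end-to-end use of the custom config_groups recipe above.
# The model, dataset, and output_dir values are illustrative assumptions,
# not part of this commit; the oneshot arguments mirror the README snippet.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization.gptq import GPTQModifier

config_groups = {
    "group_0": {
        "targets": ["Linear"],
        "input_activations": None,
        "output_activations": None,
        "weights": {
            "num_bits": 8,
            "type": "int",
            "symmetric": True,
            "strategy": "group",
            "group_size": 128,
        },
    }
}

# Build a GPTQ recipe from the custom scheme and run one-shot calibration.
recipe = GPTQModifier(config_groups=config_groups)
oneshot(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    dataset="open_platypus",
    recipe=recipe,
    save_compressed=True,
    output_dir="Mixtral-8x7B-Instruct-v0.1-W8A16-G128",
    max_seq_length=2048,
    num_calibration_samples=512,
)
```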

examples/quantizing_moe/deepseek_moe_w4a16.py

Lines changed: 0 additions & 125 deletions
This file was deleted.

examples/quantizing_moe/deepseek_moe_w8a8_int8.py

Lines changed: 0 additions & 101 deletions
This file was deleted.

examples/quantizing_moe/deepseek_recipe_w4a16.yaml

Lines changed: 0 additions & 8 deletions
This file was deleted.

examples/quantizing_moe/deepseek_moe_w8a8_fp8.py renamed to examples/quantizing_moe/mixtral_example.py

Lines changed: 21 additions & 34 deletions
@@ -1,28 +1,23 @@
+import torch
 from datasets import load_dataset
-from packaging.version import Version
-from transformers import AutoModelForCausalLM, AutoTokenizer, __version__
+from transformers import AutoModelForCausalLM, AutoTokenizer
 
 from llmcompressor import oneshot
 from llmcompressor.modifiers.quantization import QuantizationModifier
 from llmcompressor.utils import dispatch_for_generation
 
-# NOTE: transformers 4.49.0 has an attribute error with DeepSeek.
-# Please consider either downgrading your transformers version to a
-# previous version or upgrading to a version where this bug is fixed
-
 # select a Mixture of Experts model for quantization
-MODEL_ID = "deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct"
+MODEL_ID = "mistralai/Mixtral-8x7B-Instruct-v0.1"
 
 model = AutoModelForCausalLM.from_pretrained(
-    MODEL_ID, torch_dtype="auto", trust_remote_code=True
+    MODEL_ID, torch_dtype=torch.bfloat16, trust_remote_code=True
 )
 tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
 
 # Select calibration dataset.
-# its recommended to use more calibration samples for MoE models so each expert is hit
 DATASET_ID = "HuggingFaceH4/ultrachat_200k"
 DATASET_SPLIT = "train_sft"
-NUM_CALIBRATION_SAMPLES = 2048
+NUM_CALIBRATION_SAMPLES = 512
 MAX_SEQUENCE_LENGTH = 2048
 
 
@@ -56,16 +51,17 @@ def tokenize(sample):
 
 ds = ds.map(tokenize, remove_columns=ds.column_names)
 
-# define a llmcompressor recipe for FP8 W8A8 quantization
+# Configure the quantization algorithm to run.
 # since the MoE gate layers are sensitive to quantization, we add them to the ignore
 # list so they remain at full precision
-recipe = [
-    QuantizationModifier(
-        targets="Linear",
-        scheme="FP8",
-        ignore=["lm_head", "re:.*mlp.gate$"],
-    ),
-]
+recipe = QuantizationModifier(
+    scheme="FP8",
+    targets="Linear",
+    ignore=[
+        "lm_head",
+        "re:.*block_sparse_moe.gate",  # does not quantize well
+    ],
+)
 
 oneshot(
     model=model,
@@ -76,22 +72,13 @@ def tokenize(sample):
     trust_remote_code_model=True,
 )
 
-# Confirm generations of the quantized model look sane.
-# Generation is broken for deepseek models when using the latest transformers package
-if Version(__version__) < Version("4.48"):
-    print("========== SAMPLE GENERATION ==============")
-    dispatch_for_generation(model)
-    SAMPLE_INPUT = ["I love quantization because"]
-    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
-    inputs = tokenizer(SAMPLE_INPUT, return_tensors="pt", padding=True).to(model.device)
-    output = model.generate(**inputs, max_length=50)
-    text_output = tokenizer.batch_decode(output)
-    print(text_output)
-else:
-    print(
-        "WARNING: cannot perform sample generation of "
-        "deepseek models with transformers >= 4.48"
-    )
+print("========== SAMPLE GENERATION ==============")
+dispatch_for_generation(model)
+sample = tokenizer("Hello my name is", return_tensors="pt")
+sample = {key: value.to("cuda") for key, value in sample.items()}
+output = model.generate(**sample, max_new_tokens=100)
+print(tokenizer.decode(output[0]))
+print("==========================================")
 
 # Save to disk in compressed-tensors format.
 SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-FP8"
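
The renamed example is truncated here at `SAVE_DIR`, so the save and serve steps do not appear in this diff. As a hedged sketch of how such an FP8 compressed-tensors checkpoint is typically consumed (the README notes vLLM supports per-tensor FP8), the snippet below loads the saved directory with vLLM; the directory name and sampling settings are assumptions, not lines from this commit.

```python
# Hypothetical inference against the saved FP8 checkpoint using vLLM.
# The directory name follows the "<model>-FP8" SAVE_DIR convention above.
from vllm import LLM, SamplingParams

# vLLM reads the compressed-tensors quantization config from the checkpoint,
# so no extra quantization flags are required.
llm = LLM(model="Mixtral-8x7B-Instruct-v0.1-FP8")
params = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(["Hello my name is"], params)
print(outputs[0].outputs[0].text)
```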
