Commit 2e0035f

Update MoE examples (#192)
* Update MoE examples
* Add top-level link
* Fix deepseek_moe_w8a8_int8.py
* Add deepseek_moe_w8a8_fp8.py
* Quality
* Quality
1 parent 23c499a commit 2e0035f

5 files changed: +98 -14 lines changed

README.md

Lines changed: 1 addition & 0 deletions
@@ -38,6 +38,7 @@ Applying quantization with `llmcompressor`:
 * [Activation quantization to `int8`](examples/quantization_w8a8_int8)
 * [Activation quantization to `fp8`](examples/quantization_w8a8_fp8)
 * [Weight only quantization to `int4`](examples/quantization_w4a16)
+* [Quantizing MoE LLMs](examples/quantizing_moe)
 
 ### User Guides
 Deep dives into advanced usage of `llmcompressor`:

examples/quantizing_moe/README.md

Lines changed: 11 additions & 9 deletions
@@ -1,6 +1,6 @@
-# Quantizing TinyMixtral-4x248M-MoE Model with FP8
+# Quantizing Mixtral-8x7B-Instruct-v0.1 Model with FP8
 
-This directory contains an example script for quantizing the `TinyMixtral-4x248M-MoE` model using the FP8 quantization scheme.
+This directory contains an example script for quantizing the `Mixtral-8x7B-Instruct-v0.1` model using the static per-tensor FP8 quantization scheme.
 
 ## Installation
 
@@ -17,7 +17,7 @@ pip install -e .
 The provided example script demonstrates an end-to-end process for applying the quantization algorithm:
 
 ```bash
-python3 mixtral_moe_fp8.py
+python3 mixtral_moe_w8a8_fp8.py
 ```
 
 ## Creating a Quantized MoE Model
@@ -27,7 +27,7 @@ This example leverages `llm-compressor` and `compressed-tensors` to create an FP
 You can follow the detailed steps below or simply run the example script with:
 
 ```bash
-python examples/quantizing_moe/mixtral_moe_fp8.py
+python mixtral_moe_w8a8_fp8.py
 ```
 
 ### Step 1: Select a Model, Dataset, and Recipe
@@ -36,12 +36,12 @@ In this step, you'll choose a baseline model for quantization, a dataset for cal
 
 - **Models**: Can be referenced from a local directory or retrieved from the Hugging Face Hub.
 - **Datasets**: Can also be from a local directory or the Hugging Face Hub.
-- **Recipes**: These are YAML files or Python modifier objects that describe how a model should be optimized during or after training. In this example, we use a `GPTQModifier` object with the scheme set to `FP8`.
+- **Recipes**: These are YAML files or Python modifier objects that describe how a model should be optimized during or after training. In this example, we use a `QuantizationModifier` object with the scheme set to `FP8`.
 
 ```python
-from llmcompressor.modifiers.quantization.gptq import GPTQModifier
+from llmcompressor.modifiers.quantization import QuantizationModifier
 
-recipe = GPTQModifier(scheme="FP8", targets="Linear", ignore=["lm_head", "re:.*block_sparse_moe.gate"], sequential_update=True)
+recipe = QuantizationModifier(scheme="FP8", targets="Linear", ignore=["lm_head", "re:.*block_sparse_moe.gate"])
 ```
 
 NOTE: `.*block_sparse_moe.gate` layers do not quantize well, hence they are ignored!
@@ -69,9 +69,11 @@ oneshot(
 
 ### Custom Quantization
 
+NOTE: Only per-tensor quantization is supported in vLLM as of now (`vllm==0.6.1`)
+
 The repository supports multiple quantization techniques configured via a recipe. Supported strategies include `tensor`, `group`, and `channel` quantization.
 
-In the above example, FP8 channel-wise quantization is used as specified by the `FP8` scheme. For other preset schemes, refer to the [quantization schemes](https://github.com/neuralmagic/compressed-tensors/blob/main/src/compressed_tensors/quantization/quant_scheme.py) in the `Compressed-Tensors` library.
+In the above example, FP8 per-tensor quantization is used as specified by the `FP8` scheme. For other preset schemes, refer to the [quantization schemes](https://github.com/neuralmagic/compressed-tensors/blob/main/src/compressed_tensors/quantization/quant_scheme.py) in the `compressed-tensors` library.
 
 A custom scheme can also be specified using `config_groups`:
 
@@ -89,7 +91,7 @@ config_groups = {
             "num_bits": 8,
             "type": "int",
             "symmetric": true,
-            "strategy": "tensor",
+            "strategy": "group",
             "group_size": 128,
         }
     }
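As a minimal sketch (not part of this diff), such a `config_groups` definition might be wired into a recipe roughly as follows; the group name, targets, and ignore list are illustrative assumptions:

```python
# Illustrative sketch only: passing a custom config_groups definition to QuantizationModifier.
# Group name, targets, and ignore list are assumptions, not lines from this commit.
from llmcompressor.modifiers.quantization import QuantizationModifier

config_groups = {
    "group_0": {
        "targets": ["Linear"],
        "weights": {
            "num_bits": 8,
            "type": "int",
            "symmetric": True,
            "strategy": "group",
            "group_size": 128,
        },
    },
}

recipe = QuantizationModifier(
    config_groups=config_groups,
    ignore=["lm_head", "re:.*block_sparse_moe.gate"],
)
```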
examples/quantizing_moe/deepseek_moe_w8a8_fp8.py

Lines changed: 83 additions & 0 deletions
@@ -0,0 +1,83 @@
+from datasets import load_dataset
+from transformers import AutoTokenizer
+
+from llmcompressor.modifiers.quantization import QuantizationModifier
+from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
+
+# select a Mixture of Experts model for quantization
+MODEL_ID = "deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct"
+
+model = SparseAutoModelForCausalLM.from_pretrained(
+    MODEL_ID, device_map="auto", torch_dtype="auto", trust_remote_code=True
+)
+tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
+
+# Select calibration dataset.
+# its recommended to use more calibration samples for MoE models so each expert is hit
+DATASET_ID = "HuggingFaceH4/ultrachat_200k"
+DATASET_SPLIT = "train_sft"
+NUM_CALIBRATION_SAMPLES = 2048
+MAX_SEQUENCE_LENGTH = 2048
+
+
+# Load dataset and preprocess.
+ds = load_dataset(DATASET_ID, split=DATASET_SPLIT)
+ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))
+
+
+def preprocess(example):
+    return {
+        "text": tokenizer.apply_chat_template(
+            example["messages"],
+            tokenize=False,
+        )
+    }
+
+
+ds = ds.map(preprocess)
+
+
+# Tokenize inputs.
+def tokenize(sample):
+    return tokenizer(
+        sample["text"],
+        padding=False,
+        max_length=MAX_SEQUENCE_LENGTH,
+        truncation=True,
+        add_special_tokens=False,
+    )
+
+
+ds = ds.map(tokenize, remove_columns=ds.column_names)
+
+# define a llmcompressor recipe for FP8 W8A8 quantization
+# since the MoE gate layers are sensitive to quantization, we add them to the ignore
+# list so they remain at full precision
+recipe = [
+    QuantizationModifier(
+        targets="Linear",
+        scheme="FP8",
+        ignore=["lm_head", "re:.*mlp.gate$"],
+    ),
+]
+
+SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8"
+
+oneshot(
+    model=model,
+    dataset=ds,
+    recipe=recipe,
+    max_seq_length=MAX_SEQUENCE_LENGTH,
+    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
+    save_compressed=True,
+    output_dir=SAVE_DIR,
+)
+
+
+print("========== SAMPLE GENERATION ==============")
+SAMPLE_INPUT = ["I love quantization because"]
+tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
+inputs = tokenizer(SAMPLE_INPUT, return_tensors="pt", padding=True).to(model.device)
+output = model.generate(**inputs, max_length=50)
+text_output = tokenizer.batch_decode(output)
+print(text_output)
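The new script saves a compressed FP8 checkpoint to `SAVE_DIR`. As a minimal sketch (not part of this commit), such a checkpoint could then be served with vLLM roughly as follows; the checkpoint path and sampling settings are assumptions based on the script's defaults:

```python
# Hypothetical follow-up sketch: serve the FP8 checkpoint produced by the script with vLLM.
# The model path assumes the script's default SAVE_DIR; adjust to wherever the model was saved.
from vllm import LLM, SamplingParams

llm = LLM(model="DeepSeek-Coder-V2-Lite-Instruct-FP8", trust_remote_code=True)
sampling = SamplingParams(temperature=0.0, max_tokens=64)

outputs = llm.generate(["I love quantization because"], sampling)
print(outputs[0].outputs[0].text)
```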

examples/quantizing_moe/deepseek_moe_w8a8.py renamed to examples/quantizing_moe/deepseek_moe_w8a8_int8.py

Lines changed: 1 addition & 1 deletion
@@ -62,7 +62,7 @@ def tokenize(sample):
 
 ds = ds.map(tokenize, remove_columns=ds.column_names)
 
-# define a llmcompressor recipe for W416 quantization
+# define a llmcompressor recipe for INT8 W8A8 quantization
 # since the MoE gate layers are sensitive to quantization, we add them to the ignore
 # list so they remain at full precision
 recipe = [

examples/quantizing_moe/mixtral_moe_fp8.py renamed to examples/quantizing_moe/mixtral_moe_w8a8_fp8.py

Lines changed: 2 additions & 4 deletions
@@ -2,7 +2,7 @@
 
 from transformers import AutoTokenizer
 
-from llmcompressor.modifiers.quantization.gptq import GPTQModifier
+from llmcompressor.modifiers.quantization import QuantizationModifier
 from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
 from llmcompressor.transformers.compression.helpers import calculate_offload_device_map
 
@@ -34,9 +34,7 @@
     "re:.*block_sparse_moe.gate",  # does not quantize well
 ]
 
-recipe = GPTQModifier(
-    scheme="FP8", targets="Linear", ignore=layers_to_ignore, sequential_update=True
-)
+recipe = QuantizationModifier(scheme="FP8", targets="Linear", ignore=layers_to_ignore)
 
 
 oneshot(
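For context, here is a rough sketch of how the updated recipe fits into the renamed script; the model stub, dataset, and `oneshot` arguments below are illustrative placeholders rather than lines from the commit:

```python
# Illustrative sketch only: the updated FP8 recipe applied end to end.
# MODEL_ID, dataset, and output_dir are placeholders, not taken from the commit.
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot

MODEL_ID = "mistralai/Mixtral-8x7B-Instruct-v0.1"
model = SparseAutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto"
)

# the MoE gate layers do not quantize well, so they are kept at full precision
layers_to_ignore = ["lm_head", "re:.*block_sparse_moe.gate"]
recipe = QuantizationModifier(scheme="FP8", targets="Linear", ignore=layers_to_ignore)

oneshot(
    model=model,
    dataset="open_platypus",  # assumed calibration dataset
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
    save_compressed=True,
    output_dir="Mixtral-8x7B-Instruct-v0.1-FP8",
)
```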
