# Quantizing Mixtral-8x7B-Instruct-v0.1 Model with FP8

This directory contains an example script for quantizing the `Mixtral-8x7B-Instruct-v0.1` model using the static per-tensor FP8 quantization scheme.
## Installation
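
One way to set up the environment, assuming a source checkout of `llm-compressor` (the repository URL below is shown for illustration), is an editable install:

```bash
# Clone the repository (URL assumed) and install it in editable mode
git clone https://github.com/vllm-project/llm-compressor.git
cd llm-compressor
pip install -e .
```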
The provided example script demonstrates an end-to-end process for applying the quantization algorithm:
```bash
python3 mixtral_moe_w8a8_fp8.py
```
## Creating a Quantized MoE Model
This example leverages `llm-compressor` and `compressed-tensors` to create an FP8-quantized `Mixtral-8x7B-Instruct-v0.1` model.
You can follow the detailed steps below or simply run the example script with:
```bash
python mixtral_moe_w8a8_fp8.py
```
### Step 1: Select a Model, Dataset, and Recipe
In this step, you'll choose a baseline model for quantization, a dataset for calibration, and a quantization recipe.

- **Models**: Can be referenced from a local directory or retrieved from the Hugging Face Hub.
- **Datasets**: Can also be from a local directory or the Hugging Face Hub.
- **Recipes**: These are YAML files or Python modifier objects that describe how a model should be optimized during or after training. In this example, we use a `QuantizationModifier` object with the scheme set to `FP8`.

```python
from llmcompressor.modifiers.quantization import QuantizationModifier
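
# A sketch of the recipe, assuming QuantizationModifier's `targets`, `scheme`,
# and `ignore` arguments (the "re:" prefix marks a regex pattern): quantize all
# Linear layers to FP8 while skipping the lm_head and the MoE router gates.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8",
    ignore=["lm_head", "re:.*block_sparse_moe.gate"],
)
```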
NOTE: `.*block_sparse_moe.gate` layers do not quantize well, hence they are ignored!
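
The recipe is then applied in a one-shot calibration pass. The sketch below is illustrative: the import path and argument names follow `llm-compressor`'s `oneshot` API, while the calibration dataset, sequence length, and sample count are assumptions rather than prescribed values.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor.transformers import oneshot  # `from llmcompressor import oneshot` in newer releases

MODEL_ID = "mistralai/Mixtral-8x7B-Instruct-v0.1"
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Apply calibrated, static per-tensor FP8 quantization using the recipe above
oneshot(
    model=model,
    dataset="open_platypus",  # calibration dataset (illustrative choice)
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

# Save the compressed weights and tokenizer for serving (e.g. with vLLM)
SAVE_DIR = MODEL_ID.split("/")[-1] + "-FP8"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```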
### Custom Quantization
NOTE: Only per-tensor quantization is supported in vLLM as of now (`vllm==0.6.1`).

The repository supports multiple quantization techniques configured via a recipe. Supported strategies include `tensor`, `group`, and `channel` quantization.
In the above example, FP8 per-tensor quantization is used as specified by the `FP8` scheme. For other preset schemes, refer to the [quantization schemes](https://github.com/neuralmagic/compressed-tensors/blob/main/src/compressed_tensors/quantization/quant_scheme.py) in the `compressed-tensors` library.
A custom scheme can also be specified using `config_groups`:
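
For example, the sketch below defines 8-bit, symmetric, group-wise weight quantization for all Linear layers; the field names follow the `compressed-tensors` quantization-args schema, and the group name, bit width, and group size are illustrative rather than values taken from this example:

```python
from llmcompressor.modifiers.quantization import QuantizationModifier

# Illustrative custom scheme: INT8 symmetric weights quantized in groups of 128
# columns, applied to every Linear layer except the lm_head and MoE router gates.
recipe = QuantizationModifier(
    config_groups={
        "group_0": {
            "targets": ["Linear"],
            "weights": {
                "num_bits": 8,
                "type": "int",
                "symmetric": True,
                "strategy": "group",
                "group_size": 128,
            },
        },
    },
    ignore=["lm_head", "re:.*block_sparse_moe.gate"],
)
```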