
Commit 3446413

markdown update is still WIP
Signed-off-by: cliu-us <[email protected]>
1 parent d7dfc27 commit 3446413

File tree

2 files changed: +17, −120 lines

examples/MX/README.md

Lines changed: 17 additions & 3 deletions
@@ -1,7 +1,9 @@
-# Direct Quantization (DQ) Using `microscaling`
-This is the same example as in the [DQ](../DQ_SQ/README.md) folder, except using [microscaling](https://arxiv.org/abs/2310.10537) format.
+# `microscaling` Examples Using a Toy Model and Direct Quantization (DQ)
+Here, we provide two simple examples of using the MX format in `fms-mo`.
+"MX formats", such as `MXFP8`, are different from the typical IEEE formats, e.g. PyTorch's FP8s (`e4m3` or `e5m2`; see our other [FP8 example](../FP8_QUANT/README.md)). All `mx` formats are group-based: each member of a group uses the specified element format (e.g. FP8 for MXFP8), while the group shares a single (usually 8-bit) "scale". The group size can be as small as 32 or 16, depending on the hardware design.
+> [!NOTE]
+> It is important to keep in mind that `mx` is not natively supported by Hopper GPUs yet (some formats will be supported by Blackwell), which means the quantization configurations and the corresponding behavior are simulated, i.e. no real "speed up" should be expected.

-Here, we provide an example of direct quantization. In this case, we demonstrate DQ of `llama3-8b` model into MXINT8, MXFP8, MXFP6, MXFP4 for weights, activations, and/or KV-cache. Note that `MXFP8` is a different format compared to typical PyTorch FP8s (e4m3 or e5m2), see our other [FP8 example](../FP8_QUANT/README.md). Mainly all the `mx` format are not natively supported by Hopper yet (some will be supported by Blackwell), which means the quantization configurations and corresponding behavior are simulated, no "speed up" should be expected.

## Requirements
- [FMS Model Optimizer requirements](../../README.md#requirements)
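
To make the group-based scaling described in the new README text concrete, here is a minimal, illustrative PyTorch sketch of MXFP8-style fake quantization (groups of 32 `e4m3` elements sharing one power-of-two scale). It is only a sketch of the concept under those assumptions, not the `fms-mo` implementation.

```python
# Illustrative sketch only -- NOT the fms-mo implementation.
# Simulates MXFP8-style "fake quant": groups of 32 e4m3 elements
# sharing one power-of-two scale, as described in the README text above.
import torch

def fake_quant_mxfp8(x: torch.Tensor, group_size: int = 32) -> torch.Tensor:
    g = x.reshape(-1, group_size)                        # assumes numel % group_size == 0
    amax = g.abs().amax(dim=-1, keepdim=True).clamp(min=2**-126)
    # shared scale per group: a power of two chosen so the largest element
    # lands near the top of e4m3's range (max normal = 448 = 1.75 * 2**8)
    scale = torch.exp2(torch.floor(torch.log2(amax)) - 8)
    q = (g / scale).clamp(-448.0, 448.0).to(torch.float8_e4m3fn)  # simulated element cast
    return (q.to(g.dtype) * scale).reshape(x.shape)               # dequantize back

x = torch.randn(4, 64)
err = (x - fake_quant_mxfp8(x)).abs().max()
print(f"max abs quantization error: {err.item():.4f}")
```

Because the shared scale is a power of two (an `e8m0`-style exponent in the MX spec), rescaling a group amounts to an exponent adjustment, which is what makes such small group sizes practical in hardware.
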
@@ -11,6 +13,18 @@ Here, we provide an example of direct quantization. In this case, we demonstrate

## QuickStart

+The first example is based on a toy model with only a few Linear layers, of which only one Linear layer will be quantized with the MX versions of `int8`, `int4`, `fp8`, and `fp4`. The example can simply be run as follows:
+
+```bash
+>>> python simple_mx_example.py
+```
+Expected output includes:
+```bash
+
+```
+
+The second example is the same as in the [DQ](../DQ_SQ/README.md) folder, except using the [microscaling](https://arxiv.org/abs/2310.10537) format. We demonstrate the effect of MXINT8, MXFP8, MXFP6, and MXFP4 for weights, activations, and/or KV-cache.
+
**1. Prepare Data** for calibration process by converting into its tokenized form. An example of tokenization using `LLAMA-3-8B`'s tokenizer is below.

```python
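
The tokenization example itself lies beyond this hunk (the diff ends at the opening of its code block). For orientation only, a minimal sketch of what such calibration-data tokenization typically looks like with a Hugging Face tokenizer is given below; the dataset choice, sequence length, and output path are assumptions, not the code from the README.

```python
# Hedged sketch only: dataset, sequence length, and save path are
# illustrative assumptions, not the code from the README.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
raw = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:128]")

def tokenize(batch):
    # truncation keeps each calibration sample to a manageable length
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = raw.map(tokenize, batched=True, remove_columns=raw.column_names)
tokenized.save_to_disk("llama3_calib_tokenized")  # hypothetical output path
```
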

examples/MX/ffn_tmp.py

Lines changed: 0 additions & 117 deletions
This file was deleted.

0 commit comments
