Commit 2603fc1

Authored by rtj1, gemini-code-assist[bot], HDCharles, and brian-dellabetta

Add GSM8K evaluation script and AWQ+FP8 results (#2330)
This PR adds GSM8K evaluation results for AWQ+FP8 quantization as requested in #2305.

## What's included

**RESULTS.md** - Evaluation results comparing FP8_DYNAMIC vs FP8_BLOCK quantization schemes on Meta-Llama-3-8B-Instruct

## Results

| Scheme | Strict Match | Flexible Extract |
|--------|-------------|------------------|
| **FP8_DYNAMIC** | **76.42%** | **76.19%** |
| **FP8_BLOCK** | 75.21% | 74.98% |

- **Model:** Meta-Llama-3-8B-Instruct
- **Hardware:** 8x NVIDIA A100-SXM4-80GB
- FP8_DYNAMIC outperforms FP8_BLOCK by ~1.2% on strict match

## Discussion

FP8_BLOCK underperforming FP8_DYNAMIC contradicts our expectation, since for RTN quantization FP8_BLOCK outperforms FP8_DYNAMIC. However, there are two important things to notice:

1) FP8_BLOCK quantization creates quantization `groups` whose size equals the number of elements in a block, whereas FP8_DYNAMIC quantization creates quantization `groups` whose size equals `in_features`. Thus, as long as `in_features` is less than the block size (128x128 = 16384 elements), per-channel quantization actually produces more weight scales. For Meta-Llama-3-8B-Instruct, the per-channel weight quantization of the FP8_DYNAMIC scheme has more scales than FP8_BLOCK for every weight.

2) It is also noteworthy that the scale factors searched for during AWQ align directly with the quantization scales of per-channel weight quantization, which is likely why AWQ yields such a large improvement for FP8_DYNAMIC.

## Evaluation command

```bash
lm_eval \
  --model hf \
  --model_args pretrained=<model_path>,dtype=auto \
  --tasks gsm8k \
  --batch_size 16 \
  --output_path <output_dir>
```

**Note:** `batch_size=16` is important; the default `auto` picks 1, significantly increasing evaluation time.
## Model Checkpoints (from @HDCharles)

- FP8_DYNAMIC: https://huggingface.co/nm-testing/Meta-Llama-3-8B-Instruct-awq-asym-fp8-dynamic
- FP8_BLOCK: https://huggingface.co/nm-testing/Meta-Llama-3-8B-Instruct-awq-asym-fp8-block

---------

Signed-off-by: rtj1 <tharunjagarlamudi@gmail.com>
Signed-off-by: Jagarlamudi <76727507+rtj1@users.noreply.github.com>
Signed-off-by: HDCharles <charlesdavidhernandez@gmail.com>
Signed-off-by: HDCharles <39544797+HDCharles@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: HDCharles <39544797+HDCharles@users.noreply.github.com>
Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
Co-authored-by: HDCharles <charlesdavidhernandez@gmail.com>
1 parent 370c04c commit 2603fc1

File tree

1 file changed

+61
-0
lines changed


examples/awq/RESULTS.md

Lines changed: 61 additions & 0 deletions
# AWQ + FP8 Quantization Results

**Model:** Meta-Llama-3-8B-Instruct
**Hardware:** 8x NVIDIA A100-SXM4-80GB
**Date:** Feb 10, 2026
## Summary

Ran the example scripts with both FP8 schemes (FP8_DYNAMIC and FP8_BLOCK) on Meta-Llama-3-8B-Instruct, then evaluated on GSM8K.

This PR adds `RESULTS.md` with a reproducible workflow for evaluating AWQ+FP8 quantization schemes on GSM8K.
## GSM8K Results
14+
15+
| Scheme | Strict Match | Flexible Extract |
16+
|--------|-------------|------------------|
17+
| **FP8_DYNAMIC** | 76.42% | 76.19% |
18+
| **FP8_BLOCK** | 75.21% | 74.98% |
19+
20+
**Evaluation details:**
21+
- 1,319 test samples
22+
- Batch size: 16
23+
- Model: Meta-Llama-3-8B-Instruct
24+
## Discussion

FP8_BLOCK underperforming FP8_DYNAMIC contradicts our expectation, since for RTN quantization FP8_BLOCK outperforms FP8_DYNAMIC. However, there are two important things to notice:

1) FP8_BLOCK quantization creates quantization `groups` whose size equals the number of elements in a block, whereas FP8_DYNAMIC quantization creates quantization `groups` whose size equals `in_features`. Thus, as long as `in_features` is less than the block size (128x128 = 16384 elements), per-channel quantization actually produces more weight scales. For Meta-Llama-3-8B-Instruct, the per-channel weight quantization of the FP8_DYNAMIC scheme has more scales than FP8_BLOCK for every weight.

2) It is also noteworthy that the scale factors searched for during AWQ align directly with the quantization scales of per-channel weight quantization, which is likely why AWQ yields such a large improvement for FP8_DYNAMIC.
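The scale-count argument above can be checked with arithmetic. The sketch below (function names are illustrative, and the layer shapes are the standard Llama-3-8B linear-layer dimensions) counts per-channel scales, one per output channel, versus 128x128 block scales for each projection:

```python
import math

def per_channel_scale_count(out_features: int, in_features: int) -> int:
    # Per-channel weight quantization (as in FP8_DYNAMIC): one scale per
    # output channel, i.e. group size == in_features.
    return out_features

def block_scale_count(out_features: int, in_features: int, block: int = 128) -> int:
    # Block weight quantization (as in FP8_BLOCK): one scale per 128x128 block.
    return math.ceil(out_features / block) * math.ceil(in_features / block)

# Standard Llama-3-8B linear-layer shapes: (out_features, in_features)
layers = {
    "q_proj": (4096, 4096),
    "k_proj": (1024, 4096),
    "v_proj": (1024, 4096),
    "o_proj": (4096, 4096),
    "gate_proj": (14336, 4096),
    "up_proj": (14336, 4096),
    "down_proj": (4096, 14336),
}

for name, (out_f, in_f) in layers.items():
    pc = per_channel_scale_count(out_f, in_f)
    blk = block_scale_count(out_f, in_f)
    print(f"{name}: per-channel={pc} scales, block={blk} scales")
```

Since every `in_features` here is at most 14336 < 16384, per-channel quantization yields more scales than block quantization for every layer, consistent with the discussion.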
## Model Checkpoints

- FP8_DYNAMIC: https://huggingface.co/nm-testing/Meta-Llama-3-8B-Instruct-awq-asym-fp8-dynamic
- FP8_BLOCK: https://huggingface.co/nm-testing/Meta-Llama-3-8B-Instruct-awq-asym-fp8-block
## Setup

Use the existing example scripts from the repo:

```bash
cd examples/awq
python fp8_dynamic_llama_example.py
python fp8_block_llama_example.py
```
48+
## Evaluation
49+
50+
Run GSM8K evaluation using lm-eval:
51+
52+
```bash
53+
lm_eval \
54+
--model vllm \
55+
--model_args pretrained=<model_path>,dtype=auto \
56+
--tasks gsm8k \
57+
--batch_size 16 \
58+
--output_path <output_dir>
59+
```
60+
61+
**Important:** Setting `batch_size=16` is critical. The default `auto` picks 1, which significantly increases evaluation time.
