
Commit 891863b

mengniwang95, kylesayrs, HDCharles authored
add fp8 block example (#2489)
SUMMARY: add fp8 block example; depends on auto-round's main branch.

TEST PLAN: output of the quantized model:

```text
<|begin_of_text|>Hello my name is Ashley and I'm a 21 year old university student. I'm studying a Bachelor of Education (Primary) with a focus on special education. I'm passionate about helping others and making a positive impact on the world. I have experience working with children and young people, including volunteering at a local school, working as a youth leader at my church, and participating in a community outreach program. I'm confident in my ability to work with children of all ages and abilities, and I'm excited to start
```

Signed-off-by: Mengni Wang <mengni.wang@intel.com>
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Co-authored-by: Kyle Sayers <kylesayrs@gmail.com>
Co-authored-by: HDCharles <39544797+HDCharles@users.noreply.github.com>
1 parent cc6a964 commit 891863b

File tree: 2 files changed, +57 -0 lines changed


examples/autoround/README.md

Lines changed: 1 addition & 0 deletions

```diff
@@ -68,6 +68,7 @@ The accuracy of the quantized model is configured by tuning-related parameters.
 | `wNa16` + `FP8KV` | [llama3_example](./quantization_kv_cache/llama3_example.py) | |
 | `W8A8-FP8` Static | [llama4_example](./quantization_w8a8_fp8/llama4_static_quant_example.py) | |
 | `W8A8-FP8` Dynamic | [llama4_example](./quantization_w8a8_fp8/llama4_dynamic_quant_example.py) | |
+| `W8A8-FP8` Block | [llama3.1_example](./quantization_w8a8_fp8/llama3.1_block_quant_example.py) | |
 | `NVFP4` | [llama3.1_example](./quantization_w4a4_fp4/llama3.1_example.py) | |
 | `MXFP4` | [qwen3_example](../../experimental/mxfp4/autoround_qwen3_example.py) | |
```

examples/autoround/quantization_w8a8_fp8/llama3.1_block_quant_example.py

Lines changed: 56 additions & 0 deletions (new file)

```python
from auto_round.calib_dataset import get_dataset
from compressed_tensors.offload import dispatch_model
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.autoround import AutoRoundModifier

# Select model and load it.
MODEL_ID = "meta-llama/Meta-Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Select calibration dataset.
NUM_CALIBRATION_SAMPLES = 128
MAX_SEQUENCE_LENGTH = 2048

# Get aligned calibration dataset.
ds = get_dataset(
    tokenizer=tokenizer,
    seqlen=MAX_SEQUENCE_LENGTH,
    nsamples=NUM_CALIBRATION_SAMPLES,
)

# Configure the quantization algorithm to run.
# NOTE: AutoRoundModifier with iters=0 is equivalent to RTN
recipe = AutoRoundModifier(
    targets="Linear", scheme="FP8_BLOCK", ignore=["lm_head"], iters=0
)

# Apply algorithms.
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    shuffle_calibration_samples=False,
)

# Confirm generations of the quantized model look sane.
print("\n\n")
print("========== SAMPLE GENERATION ==============")
dispatch_model(model)
input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to(
    model.device
)
output = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output[0]))
print("==========================================\n\n")

# Save to disk compressed.
SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-FP8-BLOCK-AutoRound"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```
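For intuition about what the `FP8_BLOCK` scheme does, here is a minimal NumPy sketch (not part of this commit) of block-wise scale computation: one scale is derived per weight tile so that each tile fits the FP8 dynamic range. The block size and the `float8_e4m3` max value of 448 are illustrative assumptions; real block schemes commonly use 128x128 tiles.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # assumed max representable magnitude of float8_e4m3
BLOCK = 4             # toy block size for illustration; production schemes often use 128

def block_scales(w: np.ndarray, block: int = BLOCK) -> np.ndarray:
    """Compute one scale per (block x block) tile of a 2-D weight matrix."""
    rows, cols = w.shape
    assert rows % block == 0 and cols % block == 0
    # Reshape so axes 1 and 3 index positions inside each tile.
    tiles = w.reshape(rows // block, block, cols // block, block)
    amax = np.abs(tiles).max(axis=(1, 3))  # per-tile absolute max
    return amax / FP8_E4M3_MAX             # scale maps each tile into FP8 range

w = np.arange(64, dtype=np.float32).reshape(8, 8) / 64.0
scales = block_scales(w)
print(scales.shape)  # one scale per 4x4 tile -> (2, 2)
```

Dividing each tile by its scale before casting to FP8 (and multiplying back at dequantization) bounds the per-tile quantization error, which is the motivation for block-wise over per-tensor scaling.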
