
Commit e3f40fa

xin3he, HDCharles, brian-dellabetta, and gemini-code-assist[bot] authored and committed

add qwen3 vl autoround example (vllm-project#2357)

SUMMARY: AutoRound quantization example: qwen3-vl nvfp4

TEST PLAN: `python qwen3_vl_example.py`

Output:

```
Hello my name is Mihai, I am a 30 year old male, and I am currently a software engineer working in a company that develops software for the financial sector. I am a very passionate person, and I am always eager to learn new things. I have a strong interest in AI, machine learning, and data science. I am also very interested in the intersection of these fields with finance. I am currently working on a project that involves building a machine learning model to predict stock prices. I am
```

Signed-off-by: Xin He <xin3.he@intel.com>
Co-authored-by: HDCharles <39544797+HDCharles@users.noreply.github.com>
Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: yiliu30 <yi4.liu@intel.com>

1 parent e8ab049 commit e3f40fa
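For context on what the `AutoRoundModifier` used in this commit tunes: AutoRound learns a small per-weight rounding offset `v` in `[-0.5, 0.5]`, so quantization can round a weight up or down per element instead of always rounding to nearest. Below is a toy one-weight sketch of that idea only; it is not the library's implementation, and the function name is illustrative:

```python
def quantize(w: float, scale: float, v: float = 0.0) -> float:
    """Round-to-nearest with a learnable rounding offset v in [-0.5, 0.5],
    the per-weight knob AutoRound tunes (toy sketch, not library code)."""
    assert -0.5 <= v <= 0.5
    q = round(w / scale + v)  # the offset can flip the rounding decision
    return q * scale

w, s = 0.26, 0.5
print(quantize(w, s))          # 0.5: round(0.52) = 1, then 1 * 0.5
print(quantize(w, s, v=-0.1))  # 0.0: round(0.42) = 0, then 0 * 0.5
```

A learned `v = -0.1` flips this weight from rounding up to rounding down; AutoRound picks such offsets by minimizing the layer's output error on calibration data.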

File tree

2 files changed: +121 −7 lines changed


examples/autoround/quantization_w4a4_fp4/README.md
(mode changed 100644 → 100755; 61 additions, 7 deletions)
````diff
@@ -16,15 +16,17 @@ pip install -e .
 
 ## Quickstart
 
-The example includes an end-to-end script for applying the AutoRound quantization algorithm.
+The example includes end-to-end scripts for applying the AutoRound quantization algorithm.
+
+### Llama 3.1 Example
 
 ```bash
 python3 llama3.1_example.py
 ```
 
 The resulting model `Meta-Llama-3.1-8B-Instruct-NVFP4-AutoRound` is ready to be loaded into vLLM.
 
-### Evaluate Accuracy
+#### Evaluate Accuracy
 
 With the model created, we can now load and run in vLLM (after installing).
 
@@ -33,7 +35,6 @@ from vllm import LLM
 model = LLM("./Meta-Llama-3.1-8B-Instruct-NVFP4-AutoRound")
 ```
 
-We can evaluate accuracy with `lm_eval` (`pip install lm-eval==0.4.9.1`):
 > Note: quantized models can be sensitive to the presence of the `bos` token. `lm_eval` does not add a `bos` token by default, so make sure to include the `add_bos_token=True` argument when running your evaluations.
 
 Run the following to test accuracy on GSM-8K:
@@ -46,33 +47,86 @@ lm_eval --model vllm \
   --batch_size 'auto'
 ```
 
-#### meta-llama/Meta-Llama-3.1-8B-Instruct
+##### meta-llama/Meta-Llama-3.1-8B-Instruct
 |Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
 |-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
 |gsm8k| 3|flexible-extract| 5|exact_match||0.7710|± |0.0116|
 | | |strict-match | 5|exact_match||0.7043|± |0.0126|
 
-#### Meta-Llama-3.1-8B-Instruct-NVFP4 (QuantizationModifier)
+##### Meta-Llama-3.1-8B-Instruct-NVFP4 (QuantizationModifier)
 |Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
 |-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
 |gsm8k| 3|flexible-extract| 5|exact_match||0.7248|± |0.0123|
 | | |strict-match | 5|exact_match||0.6611|± |0.0130|
 
 
-#### Meta-Llama-3.1-8B-Instruct-NVFP4-AutoRound (AutoRoundModifier, iters=0)
+##### Meta-Llama-3.1-8B-Instruct-NVFP4-AutoRound (AutoRoundModifier, iters=0)
 |Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
 |-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
 |gsm8k| 3|flexible-extract| 5|exact_match||0.7362|± |0.0121|
 | | |strict-match | 5|exact_match||0.6702|± |0.0129|
 
-#### Meta-Llama-3.1-8B-Instruct-NVFP4-AutoRound (AutoRoundModifier, iters=200)
+##### Meta-Llama-3.1-8B-Instruct-NVFP4-AutoRound (AutoRoundModifier, iters=200)
 |Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
 |-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
 |gsm8k| 3|flexible-extract| 5|exact_match||0.7210|± |0.0124|
 | | |strict-match | 5|exact_match||0.6945|± |0.0127|
 
 > Note: quantized model accuracy may vary slightly due to nondeterminism.
 
+### Qwen3-VL Example
+
+```bash
+python3 qwen3_vl_example.py
+```
+
+The resulting model `Qwen3-VL-8B-Instruct-NVFP4-AutoRound` is ready to be loaded into vLLM.
+
+#### Evaluate Accuracy
+
+Run the following to test accuracy on GSM-8K and ChartQA:
+
+```bash
+lm_eval --model vllm-vlm \
+  --model_args pretrained="./Qwen3-VL-8B-Instruct-NVFP4-AutoRound",add_bos_token=true \
+  --tasks gsm8k \
+  --num_fewshot 5 \
+  --batch_size 'auto'
+
+lm_eval --model vllm-vlm \
+  --model_args pretrained="./Qwen3-VL-8B-Instruct-NVFP4-AutoRound",add_bos_token=true \
+  --tasks chartqa \
+  --batch_size 'auto' \
+  --apply_chat_template
+```
+
+##### Qwen/Qwen3-VL-8B-Instruct (Baseline)
+|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
+|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
+|gsm8k| 3|flexible-extract| 5|exact_match||0.8628|± |0.0095|
+| | |strict-match | 5|exact_match||0.8453|± |0.0100|
+
+| Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
+|-------|------:|------|-----:|-----------------|---|-----:|---|-----:|
+|chartqa| 0|none | 0|anywhere_accuracy||0.7908|± |0.0081|
+| | |none | 0|exact_match ||0.5592|± |0.0099|
+| | |none | 0|relaxed_accuracy ||0.7696|± |0.0084|
+
+
+##### Qwen3-VL-8B-Instruct-NVFP4-AutoRound (AutoRoundModifier, iters=200)
+|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
+|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
+|gsm8k| 3|flexible-extract| 5|exact_match||0.8415|± |0.0101|
+| | |strict-match | 5|exact_match||0.8408|± |0.0101|
+
+| Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
+|-------|------:|------|-----:|-----------------|---|-----:|---|-----:|
+|chartqa| 0|none | 0|anywhere_accuracy||0.8220|± |0.0077|
+| | |none | 0|exact_match ||0.5748|± |0.0099|
+| | |none | 0|relaxed_accuracy ||0.8044|± |0.0079|
+
+> Note: quantized model accuracy may vary slightly due to nondeterminism.
+
 ### Questions or Feature Request?
 
 Please open up an issue on [vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor) or [intel/auto-round](https://github.com/intel/auto-round).
````
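As a quick sanity check on the accuracy tables above, one can compute how much strict-match GSM-8K accuracy each NVFP4 AutoRound checkpoint retains relative to its unquantized baseline. The values below are copied from the README tables; the helper function is illustrative only:

```python
# Strict-match gsm8k scores copied from the README tables above.
scores = {
    "llama_baseline": 0.7043,   # meta-llama/Meta-Llama-3.1-8B-Instruct
    "llama_autoround": 0.6945,  # NVFP4-AutoRound, iters=200
    "qwen_baseline": 0.8453,    # Qwen/Qwen3-VL-8B-Instruct
    "qwen_autoround": 0.8408,   # NVFP4-AutoRound, iters=200
}

def retention(quantized: float, baseline: float) -> float:
    """Fraction of baseline accuracy the quantized model retains."""
    return quantized / baseline

llama = retention(scores["llama_autoround"], scores["llama_baseline"])
qwen = retention(scores["qwen_autoround"], scores["qwen_baseline"])
print(f"Llama 3.1 retention: {llama:.1%}")  # ~98.6%
print(f"Qwen3-VL retention:  {qwen:.1%}")   # ~99.5%
```

Both checkpoints stay within about 1.5 points of their BF16 baselines on this task, consistent with the README's note that small run-to-run variation is expected.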
examples/autoround/quantization_w4a4_fp4/qwen3_vl_example.py
(new file; 60 additions, 0 deletions)
```python
from auto_round.calib_dataset import get_dataset
from transformers import AutoProcessor, Qwen3VLForConditionalGeneration

from llmcompressor import oneshot
from llmcompressor.modifiers.autoround import AutoRoundModifier
from llmcompressor.utils import dispatch_for_generation

# Load model.
MODEL_ID = "Qwen/Qwen3-VL-8B-Instruct"
model = Qwen3VLForConditionalGeneration.from_pretrained(MODEL_ID, torch_dtype="auto")
processor = AutoProcessor.from_pretrained(MODEL_ID)
tokenizer = processor.tokenizer

# Select calibration dataset.
NUM_CALIBRATION_SAMPLES = 128
MAX_SEQUENCE_LENGTH = 2048

# Get aligned calibration dataset.
ds = get_dataset(
    tokenizer=tokenizer,
    seqlen=MAX_SEQUENCE_LENGTH,
    nsamples=NUM_CALIBRATION_SAMPLES,
)

# Configure the quantization algorithm to run.
# * quantize the weights to 4 bit (NVFP4 scheme) with AutoRound,
#   skipping the lm_head and the vision tower
recipe = AutoRoundModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=["re:.*lm_head", "re:.*visual.*"],
    iters=200,
)

# Apply algorithms.
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    # disable shuffling to get slightly better mmlu score
    shuffle_calibration_samples=False,
)

# Confirm generations of the quantized model look sane.
print("\n\n")
print("========== SAMPLE GENERATION ==============")
dispatch_for_generation(model)
input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to(
    model.device
)
output = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output[0]))
print("==========================================\n\n")

# Save to disk in compressed-tensors format.
SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-NVFP4-AutoRound"
model.save_pretrained(SAVE_DIR, save_compressed=True)
processor.save_pretrained(SAVE_DIR)
```
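The `SAVE_DIR` name the script saves to is derived purely from the model ID string, which is why the README can refer to the output as `Qwen3-VL-8B-Instruct-NVFP4-AutoRound`. A minimal sketch of that derivation, runnable without any model weights (the helper name is illustrative):

```python
def save_dir_for(model_id: str, suffix: str = "-NVFP4-AutoRound") -> str:
    """Mirror the script's SAVE_DIR logic: keep the last path component
    of the Hugging Face model ID and append the scheme suffix."""
    return model_id.rstrip("/").split("/")[-1] + suffix

print(save_dir_for("Qwen/Qwen3-VL-8B-Instruct"))
# Qwen3-VL-8B-Instruct-NVFP4-AutoRound
print(save_dir_for("meta-llama/Meta-Llama-3.1-8B-Instruct"))
# Meta-Llama-3.1-8B-Instruct-NVFP4-AutoRound
```

The `rstrip("/")` guards against a trailing slash in the model ID, so a local path like `"./models/qwen/"` would still yield a non-empty directory name.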
