Commit 4f7d594

Qualcomm AI Engine Direct - Support simple_eval in calibration, perplexity, and UT
1 parent 45846c8 commit 4f7d594

17 files changed: +901 −295 lines

backends/qualcomm/tests/test_qnn_delegate.py

Lines changed: 15 additions & 13 deletions
@@ -4249,7 +4249,7 @@ def test_llama_stories_110m(self):
         if not self.compile_only and not self.enable_x86_64:
             self.assertGreaterEqual(msg["inference_speed"], 220)  # Lanai

-    def test_qwen2_5(self):
+    def test_static_qwen2_5(self):
         if not self.required_envs():
             self.skipTest("missing required envs")

@@ -4274,11 +4274,14 @@ def test_qwen2_5(self):
             "--decoder_model",
             "qwen2_5",
             "--model_mode",
-            "hybrid",
-            "--prefill_ar_len",
-            "32",
+            "kv",
             "--max_seq_len",
-            "128",
+            "1024",
+            "--eval_perplexity",
+            "--tasks",
+            "wikitext",
+            "--limit",
+            "1",
         ]
         if self.compile_only:
             cmds.extend(["--compile_only"])

@@ -4291,8 +4294,6 @@ def test_qwen2_5(self):
         if self.pre_gen_pte:
             cmds.extend(["--pre_gen_pte", self.pre_gen_pte])

-        # Accuracy is bad for now. Just check user's prompt is returned.
-        golden_start_with = "My favourite condiment is "
         p = subprocess.Popen(cmds, stdout=subprocess.DEVNULL)
         with Listener((self.ip, self.port)) as listener:
             conn = listener.accept()

@@ -4301,12 +4302,13 @@ def test_qwen2_5(self):
             if "Error" in msg:
                 self.fail(msg["Error"])
             else:
-                model_out = msg["result"][0]
-                self.assertTrue(
-                    model_out.startswith(golden_start_with),
-                    f"Expected Output: {golden_start_with}. Actual Output: {model_out}",
-                )
-                self.assertGreaterEqual(msg["inference_speed"], 95)  # Lanai
+                inference_speed_ref = {"SM8650": 110, "SM8750": 130}
+                self.assertLessEqual(msg["wiki_ppl"], 25)
+                self.assertLessEqual(msg["pte_size"], 800000000)  # 800mb
+                if self.model in inference_speed_ref:
+                    self.assertGreaterEqual(
+                        msg["inference_speed"], inference_speed_ref[self.model]
+                    )


 class TestExampleOssScript(TestQNN):
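
The assertions above read `wiki_ppl`, `pte_size`, and `inference_speed` from a JSON message the example script sends back over a local socket (the test side opens the `Listener` and blocks in `accept()`). A minimal sketch of the reporting side of that pattern is shown below; the metric values and the port are placeholders, and the exact sending code in `llama.py` may differ:

```python
import json
from multiprocessing.connection import Client

# Placeholder values; in the real script these would come from the
# perplexity evaluation, the compiled artifact, and the on-device runner.
metrics = {
    "wiki_ppl": 18.4,          # wikitext perplexity measured on device
    "pte_size": 650_000_000,   # size of the compiled .pte in bytes
    "inference_speed": 120.5,  # decode throughput in tokens/sec
}

# The unit test listens on (ip, port); the script connects back as a
# client and sends the metrics as JSON text, which the test json.loads().
with Client(("localhost", 5000)) as conn:
    conn.send(json.dumps(metrics))
```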

examples/qualcomm/oss_scripts/llama/README.md

Lines changed: 30 additions & 0 deletions
@@ -114,16 +114,21 @@ We have two distinct mechanisms for updating the key-value (KV) cache, which can
 </table>

 ### Additional Configs when running the script
+
+#### Compile Only
 If you would like to compile the model only, we have provided the flag `--compile_only`. Taking LLAMA3.2 as an example:
 ```bash
 python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -m ${SOC_MODEL} --ptq 16a4w --checkpoint consolidated.00.pth --params params.json --tokenizer_model tokenizer.model --llama_model llama3_2 --model_mode hybrid --prefill_ar_len 32 --max_seq_len 128 --prompt "what is 1+1" --compile_only
 ```

+#### Pre Generated PTE
 On the other hand, if you already have a pre-compiled .pte model, you can perform inference by providing the flag `--pre_gen_pte` and specifying the folder that contains the .pte model. Taking LLAMA3.2 as an example:
 ```bash
 python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --ptq 16a4w --checkpoint consolidated.00.pth --params params.json --tokenizer_model tokenizer.model --llama_model llama3_2 --model_mode hybrid --prefill_ar_len 32 --max_seq_len 128 --prompt "what is 1+1" --pre_gen_pte ${FOLDER_TO_PRE_GEN_PTE}
 ```

+
+#### KV Cache Updater
 You can select the KV Cache update mechanism at runtime by setting the `KV_UPDATER` variable to either "shift_pointer" or "smart_mask". By default, it is set to "smart_mask".
 `KV_UPDATER` = "shift_pointer"
 ```bash

@@ -140,3 +145,28 @@ For more details, please refer to the paper ["Break the Sequential Dependency of
 ```bash
 python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --ptq 16a4w --checkpoint consolidated.00.pth --params params.json --tokenizer_model tokenizer.model --llama_model llama3_2 --model_mode lookahead --prefill_ar_len 32 --max_seq_len 128 --prompt "what is 1+1" --ngram 3 --window 2 --gcap 2
 ```
+
+#### Perplexity Evaluation
+This script supports perplexity evaluation and can assess perplexity scores across 3 phases: prepare_pt2e (CPU FP), convert_pt2e (CPU QDQ), and QNN on device.
+
+To evaluate perplexity across all 3 phases, users should provide the `--eval_perplexity` flag and specify the evaluation task. Please note that when this flag is provided, `--prompt ${PROMPT}` will be ignored.
+
+For example, using the Qwen model and 1 wikitext sample as the evaluation task, users can assess the perplexity score of all 3 phases in a single run by including the appropriate configuration:
+```bash
+python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --prompt "What is 1+1?" --temperature 0 --model_mode kv --max_seq_len 1024 --ptq 16a8w --decoder_model qwen2_5 --eval_perplexity --tasks wikitext --limit 1
+```
+
+In the example above, 1 wikitext sample is used to evaluate all 3 phases. However, there are cases where a user may want to use one sample for quantization calibration and multiple samples for perplexity evaluation. In that case, the process should be split into two runs: in the 1st run, the model is compiled using one sample; in the 2nd run, the user provides a different configuration for QNN device execution.
+Example:
+```bash
+# 1st run to compile with --limit 1
+python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --prompt "What is 1+1?" --temperature 0 --model_mode kv --max_seq_len 1024 --ptq 16a8w --decoder_model qwen2_5 --eval_perplexity --tasks wikitext --limit 1 --compile_only
+```
+```bash
+# 2nd run to perform QNN device execution with --limit 3
+python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --prompt "What is 1+1?" --temperature 0 --model_mode kv --max_seq_len 1024 --ptq 16a8w --decoder_model qwen2_5 --eval_perplexity --tasks wikitext --limit 3 --pre_gen_pte ${PATH_TO_ARTIFACT_IN_1ST_RUN} --quant_attrs_path ${PATH_TO_ARTIFACT_IN_1ST_RUN}/kv_llama_qnn_quant_attrs.json
+```
+
+#### Tasks quantization calibration
+If `--tasks ${TASK}` is not provided, the program will use `--prompt ${PROMPT}` as the dataset for quantization calibration.
+Regardless of whether `--eval_perplexity` is provided, as long as `--tasks ${TASK}` is specified, the specified tasks will be used for model quantization calibration instead of the prompt.
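
To make the three phases concrete, the flow the new README section describes maps roughly onto the pt2e quantization pipeline sketched below. This is a minimal sketch under assumptions, not the script's actual code: `model`, `example_inputs`, `quantizer`, and `eval_ppl` are stand-ins for the decoder model, its sample input tuple, the QNN quantizer, and an lm-eval style perplexity harness.

```python
import torch
from torch.ao.quantization.quantize_pt2e import convert_pt2e, prepare_pt2e

def three_phase_perplexity(model, example_inputs, quantizer, eval_ppl):
    # Capture the decoder as a graph module for pt2e quantization.
    graph = torch.export.export_for_training(model, example_inputs).module()

    # Phase 1: prepare_pt2e (CPU FP). Observers are inserted but values
    # still flow in floating point; running the task here doubles as
    # calibration (--tasks when given, otherwise --prompt).
    prepared = prepare_pt2e(graph, quantizer)
    fp_ppl = eval_ppl(prepared)

    # Phase 2: convert_pt2e (CPU QDQ). The graph now carries
    # quantize/dequantize pairs, so this measures quantization loss on CPU.
    converted = convert_pt2e(prepared)
    qdq_ppl = eval_ppl(converted)

    # Phase 3: QNN on device. The converted graph is lowered to a .pte and
    # the same task is replayed on device by the runner (not shown here).
    return fp_ppl, qdq_ppl
```
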
Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
+# Copyright (c) Qualcomm Innovation Center, Inc.
+# All rights reserved
+#
+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree.
+
+HUGGING_FACE_REPO_IDS = {"qwen2_5": "Qwen/Qwen2.5-0.5B"}
+
+EVAL_MODE = {
+    "kv": 0,
+    "hybrid": 1,
+    "lookahead": 2,
+}
+
+DECODER_MODEL_VERSION = {
+    "stories110m": "llama2",
+    "llama3_2": "llama3",
+    "qwen2_5": "qwen2_5",
+}
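
These tables are presumably keyed on the CLI flags shown earlier (`--decoder_model`, `--model_mode`). A hypothetical wiring sketch follows; the function name is illustrative, and since the extracted page dropped this new file's path, the module name is unknown:

```python
# Hypothetical wiring; the real llama.py may look different.
HUGGING_FACE_REPO_IDS = {"qwen2_5": "Qwen/Qwen2.5-0.5B"}
EVAL_MODE = {"kv": 0, "hybrid": 1, "lookahead": 2}
DECODER_MODEL_VERSION = {"stories110m": "llama2", "llama3_2": "llama3", "qwen2_5": "qwen2_5"}

def resolve(decoder_model: str, model_mode: str):
    # --decoder_model picks the HF checkpoint (if any) and the decoder
    # family; --model_mode maps to the runner's eval-mode enum.
    repo_id = HUGGING_FACE_REPO_IDS.get(decoder_model)  # None for local ckpts
    return repo_id, EVAL_MODE[model_mode], DECODER_MODEL_VERSION[decoder_model]

print(resolve("qwen2_5", "kv"))  # ('Qwen/Qwen2.5-0.5B', 0, 'qwen2_5')
```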
