Qualcomm AI Engine Direct - Support simple_eval in calibration, perpl… (#12958)
### Summary
- Enable perplexity evaluation on device with `llama.py`.
- Evaluate perplexity after QDQ on CPU.
- Enable quantization to use simple_eval as the calibration dataset.
- Enable a unit test that checks perplexity for Qwen, which should be more reliable than checking the string output.

A follow-up PR will address:
- External CI enablement for Qwen on x86 (if it does not take too long).
- Hiding the logits scale/offset in the model metadata.
#### Script
```bash
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s $DEVICE -m SM8750 \
  --prompt "What is 1+1?" --temperature 0 --model_mode kv --max_seq_len 1024 \
  --ptq 16a8w --decoder_model qwen2_5 --eval_perplexity --tasks wikitext
```
### Test plan
```bash
python backends/qualcomm/tests/test_qnn_delegate.py -k TestExampleLLMScript.test_static_qwen2_5 \
  --model SM8650 --build_folder build-android/ --executorch_root . -s $DEVICE
```
Authors: @shewu-quic, @winskuo-quic
On the other hand, if you already have a pre-compiled .pte model, you can perform inference by providing the flag `--pre_gen_pte` and specifying the folder that contains the .pte model. Taking LLAMA3.2 as an example:
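A minimal sketch of such a run, assuming `llama3_2` is the decoder model name accepted by the script and that the pre-compiled artifacts live in `${PRE_GEN_DIR}` (both are placeholders for your setup):

```bash
# Sketch: reuse a previously compiled .pte instead of re-exporting the model.
# ${PRE_GEN_DIR} is assumed to be the folder produced by an earlier compile run.
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s $DEVICE -m SM8750 \
  --prompt "What is 1+1?" --temperature 0 --model_mode kv --max_seq_len 1024 \
  --decoder_model llama3_2 --pre_gen_pte ${PRE_GEN_DIR}
```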
You can enable the MaskedSoftmax feature by providing the `--enable_masked_softmax` flag. It is designed to improve the accuracy and performance of LLMs executed on the HTP backend. During backend optimization, MaskedSoftmax replaces the Softmax(Add(In, Mask)) structure in the attention blocks of LLMs. For more details, please refer to the QNN documentation.
Note that MaskedSoftmax is only supported starting from QNN 2.35.
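As a sketch, the flag can simply be appended to the compile command used earlier in this guide:

```bash
# Sketch: the same Qwen command as above, with MaskedSoftmax enabled
# (requires QNN 2.35 or later).
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s $DEVICE -m SM8750 \
  --prompt "What is 1+1?" --temperature 0 --model_mode kv --max_seq_len 1024 \
  --ptq 16a8w --decoder_model qwen2_5 --enable_masked_softmax
```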
#### Perplexity Evaluation
This script supports perplexity evaluation and can assess perplexity scores across 3 phases: prepare_pt2e (CPU FP), convert_pt2e (CPU QDQ), and QNN on device.
To evaluate perplexity across all 3 phases, users should provide the `--eval_perplexity` flag and specify the evaluation task. Note that when this flag is provided, `--prompt ${PROMPT}` is ignored.
For example, using the Qwen model and 1 wikitext sample as the evaluation task, users can assess the perplexity scores of all 3 phases in a single run by including the appropriate configuration, as shown below.
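The command below follows the one in the Script section above (with the prompt omitted, since it is ignored when `--eval_perplexity` is set); `$DEVICE` is the device serial and `SM8750` the target SoC:

```bash
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s $DEVICE -m SM8750 \
  --model_mode kv --max_seq_len 1024 --ptq 16a8w --decoder_model qwen2_5 \
  --eval_perplexity --tasks wikitext
```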
For the example script above, 1 wikitext sample is used to evaluate all 3 phases. However, a user may want to use one sample for quantization calibration and multiple samples for perplexity evaluation. In that case, the process should be split into two runs: in the first run, the model is compiled and calibrated using one sample; in the second run, the user provides a different configuration for QNN device execution, as sketched below.
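A hypothetical sketch of the two-run flow, assuming the first run writes its artifacts (including the .pte) to `${ARTIFACT_DIR}`; how the number of evaluation samples is selected in the second run is not covered in this excerpt:

```bash
# Run 1 (sketch): compile and calibrate the model with one wikitext sample.
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s $DEVICE -m SM8750 \
  --model_mode kv --max_seq_len 1024 --ptq 16a8w --decoder_model qwen2_5 \
  --eval_perplexity --tasks wikitext

# Run 2 (sketch): reuse the pre-compiled .pte from run 1 and evaluate on device
# with a different evaluation configuration (e.g. more samples).
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s $DEVICE -m SM8750 \
  --model_mode kv --max_seq_len 1024 --ptq 16a8w --decoder_model qwen2_5 \
  --eval_perplexity --tasks wikitext --pre_gen_pte ${ARTIFACT_DIR}
```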
If `--tasks ${TASK}` is not provided, the program will use `--prompt ${PROMPT}` as the dataset for quantization calibration.
Regardless of whether `--eval_perplexity` is provided, as long as `--tasks ${TASK}` is specified, the specified tasks will be used for model quantization calibration instead of the prompt.
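For instance, the following sketch omits `--eval_perplexity` but still passes `--tasks wikitext`, so the wikitext task (rather than the prompt) is used as the calibration dataset:

```bash
# Sketch: --tasks is provided without --eval_perplexity, so the wikitext task
# is used for quantization calibration instead of the prompt.
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s $DEVICE -m SM8750 \
  --prompt "What is 1+1?" --temperature 0 --model_mode kv --max_seq_len 1024 \
  --ptq 16a8w --decoder_model qwen2_5 --tasks wikitext
```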