Qualcomm AI Engine Direct - Static Decoder Runner Support 16bit KV IO (#13127)
### Summary
- Support 16-bit KV IO in the runner (it can run with either an 8-bit or a 16-bit KV cache).
- Add a README for the script that runs Qwen2.5 0.5B.
- Improve the perplexity (PPL) of Qwen2.5 0.5B from 18 to 12.
- Fix a BC CI bug.
Sample script:
```bash
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s $DEVICE -m SM8750 --prompt "What is 1+1?" --temperature 0 --model_mode kv --max_seq_len 1024 --ptq 16a8w --decoder_model qwen2_5 --eval_perplexity --tasks wikitext --limit 1 --artifact ./16bit_qwen_1024 --enable_masked_softmax --r3
```
#### Stats with QNN 2.37.0 on SM8750
- Accuracy: 12 PPL (aligns with the `prepare_pt2e` and `convert_pt2e` results)
- Token rate: ~130 tok/sec, depending on seq_len
### Test plan
Added an E2E test to `test_qnn_delegate.py`:
```bash
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --temperature 0 --model_mode hybrid --max_seq_len 1024 --prefill_ar_len 128 --ptq 16a8w --enable_masked_softmax --r3 --decoder_model qwen2_5 --prompt "I would like to learn python, could you teach me with a simple example?"
```
### KV Cache update mechanism
We have two distinct mechanisms for updating the key-value (KV) cache, which can be selected at runtime: shift pointer and smart mask.
#### Compile Only
If you would like to compile the model only, we have provided the flag `--compile_only`. Taking LLAMA3.2 as an example:
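Below is a minimal sketch of what a compile-only invocation could look like. It reuses the flags from the hybrid-mode command above rather than a real LLAMA3.2 configuration, the artifact folder name is a placeholder, and a LLAMA3.2 run would additionally need its own checkpoint and tokenizer arguments:

```bash
# Sketch only: compile to a .pte without running on-device inference.
# Flags mirror the hybrid-mode command above; ./compile_only_artifacts is a
# placeholder output folder, and -s <device serial> is omitted on the
# assumption that nothing is executed on-device during compilation.
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -m ${SOC_MODEL} \
  --temperature 0 --model_mode hybrid --max_seq_len 1024 --prefill_ar_len 128 \
  --ptq 16a8w --decoder_model qwen2_5 --prompt "What is 1+1?" \
  --artifact ./compile_only_artifacts --compile_only
```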
On the other hand, if you already have a pre-compiled .pte model, you can perform inference by providing the flag `--pre_gen_pte` and specifying the folder that contains the .pte model. Taking LLAMA3.2 as an example:
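A matching sketch for the pre-compiled path, again reusing the flags from the command above; `./compile_only_artifacts` is the hypothetical folder produced by the compile-only step:

```bash
# Sketch only: run inference from a previously compiled .pte.
# --pre_gen_pte points at the folder that already contains the .pte model.
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} \
  --temperature 0 --model_mode hybrid --max_seq_len 1024 --prefill_ar_len 128 \
  --ptq 16a8w --decoder_model qwen2_5 --prompt "What is 1+1?" \
  --pre_gen_pte ./compile_only_artifacts
```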
You can select the KV Cache update mechanism at runtime by setting the `KV_UPDATER` variable to either "shift_pointer" or "smart_mask". By default, it is set to "smart_mask".
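As an illustration, one way this could look on the command line; how `KV_UPDATER` is actually consumed (here assumed to be forwarded via a `--kv_updater` argument) is an assumption, not a documented interface:

```bash
# Sketch only: select the KV cache update mechanism at runtime.
# Assumption: the shell variable is forwarded to the script via --kv_updater;
# "smart_mask" is the default when nothing is overridden.
KV_UPDATER="shift_pointer"
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} \
  --temperature 0 --model_mode hybrid --max_seq_len 1024 --prefill_ar_len 128 \
  --ptq 16a8w --decoder_model qwen2_5 --prompt "What is 1+1?" \
  --kv_updater ${KV_UPDATER}
```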
You can choose the lookahead mode to enhance decoding speed. To use this mode, you need to specify the corresponding lookahead decoding parameters.
For more details, please refer to the paper ["Break the Sequential Dependency of LLM Inference Using Lookahead Decoding"](https://arxiv.org/abs/2402.02057)
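For illustration only, a hypothetical invocation; the `lookahead` value for `--model_mode` and the `--ngram`, `--window`, and `--gcap` flags (N-gram size, lookahead window size, and number of verification branches from the paper) are assumptions here, not confirmed option names:

```bash
# Hypothetical sketch: enable lookahead decoding.
# --model_mode lookahead, --ngram, --window, and --gcap are assumed option
# names corresponding to the N-gram size, window size, and verification-branch
# count (G) defined in the lookahead decoding paper.
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} \
  --temperature 0 --model_mode lookahead --max_seq_len 1024 --prefill_ar_len 128 \
  --ptq 16a8w --decoder_model qwen2_5 --prompt "What is 1+1?" \
  --ngram 3 --window 2 --gcap 2
```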