Qualcomm AI Engine Direct - Static LLM Refactor & Qwen3 1.7B Improvement (#13755)
### Summary
- Refactor llama.py. The current script is hard to extend when
customizing configs, especially quantization configs. As more models
have been enabled, the script has become messy, with multiple
`if`/`else` statements deciding which optimization each model should
receive. We want to move all the model specs under `__init__.py`.
- Hide scale/offset in the model's metadata, so `args.quant_attrs_path`
is no longer required when evaluating the PPL score.
- Enable Qwen3 1.7B with 16a4w_block quantization. It previously used
16a8w, which is much slower. The goal is to maximize token rate while
keeping PPL within a 20% margin of the FP CPU baseline.
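Moving per-model specs out of `if`/`else` chains and into one declarative table could look like the sketch below. The names (`ModelSpec`, `MODEL_REGISTRY`, the field set) are illustrative assumptions, not the actual ExecuTorch API:

```python
# Hypothetical sketch of a declarative model-spec registry, replacing
# scattered if/else branches in llama.py. All names here are assumed
# for illustration only.
from dataclasses import dataclass, field

@dataclass
class ModelSpec:
    repo: str                  # model source, e.g. a HF repo id
    quant_scheme: str          # e.g. "16a8w", "16a4w_block"
    extra_passes: list = field(default_factory=list)

# Per-model specs live in one place (e.g. __init__.py) so adding a
# model means adding one entry, not another branch.
MODEL_REGISTRY = {
    "qwen3_1.7b": ModelSpec(repo="Qwen/Qwen3-1.7B",
                            quant_scheme="16a4w_block"),
    "llama3_2_1b": ModelSpec(repo="meta-llama/Llama-3.2-1B",
                             quant_scheme="16a8w"),
}

def get_spec(name: str) -> ModelSpec:
    """Look up the spec instead of branching on the model name."""
    return MODEL_REGISTRY[name]
```

The export script then consumes `get_spec(args.model)` uniformly, so model-specific decisions stay in the registry.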
#### Stats
- token rate = 37 tok/sec
- ppl = 14.79
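The "4w_block" part of the scheme above means each block of weights gets its own scale, which bounds quantization error better than a single per-tensor scale. A minimal sketch, assuming a symmetric int4 range and a block size of 32 (both illustrative, not the backend's actual parameters):

```python
# Minimal sketch of block-wise 4-bit weight quantization. Block size
# and the symmetric [-8, 7] int4 range are assumptions for illustration.
import numpy as np

def quantize_4w_block(w: np.ndarray, block_size: int = 32):
    w = w.reshape(-1, block_size)
    # One symmetric scale per block, mapping max |w| to the int4 limit 7.
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero blocks
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * scale).reshape(-1)

np.random.seed(0)
w = np.random.randn(128).astype(np.float32)
q, s = quantize_4w_block(w)
err = np.abs(dequantize(q, s) - w).max()  # bounded by half a block step
```

Because outliers only inflate the scale of their own block, the remaining blocks keep fine granularity, which is why 4-bit block quantization can stay within the PPL margin while being faster than 8-bit weights.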
### Test plan
Tested all scripts to ensure no regression.
cc: @haowhsu-quic