
Commit 60a2bd6

Update QCOM llama hardware support (#15965)
Added hardware support details and memory-limit error handling instructions. It seems that many users try to run the llama flow on hardware other than phones; let's document the supported hardware first.
1 parent 5e96b43 commit 60a2bd6

File tree

1 file changed: +34, -0 lines changed


examples/qualcomm/oss_scripts/llama/README.md

Lines changed: 34 additions & 0 deletions
@@ -33,6 +33,40 @@ We offer the following modes to execute the model:
- Lookahead Mode: Lookahead Mode introduces [lookahead decoding](https://arxiv.org/abs/2402.02057) and uses the AR-N model to process the prompt, enhancing token generation speed. While an LLM cannot truly decode multiple tokens in a single step, it can generate multiple guess tokens in parallel, and these guess tokens may fit into future parts of the generated sequence. The lookahead decoder generates and verifies these guess tokens, integrating them into the sequence when suitable. In some cases it can thus obtain more than one token in a single step. The result is lossless.

## Hardware Support

We’ve validated this flow on the **Samsung Galaxy S23**, **Samsung Galaxy S24**, **Samsung Galaxy S25**, and **OnePlus 12**.
Support on other hardware depends on the **HTP architecture (HtpArch)** and the feature set available on that version.

### HTP Minimum Version Requirements

- **LPBQ (16a4w block-wise quantization)** requires **V69 or newer**
- **Weight sharing** between prefill and decode requires **V73 or newer**
- **16-bit activations + 16-bit weights for matmul** (e.g., a 16-bit KV cache) requires **V73 or newer**

### Quantization Guidance for Older Devices

For older HTP versions, you may need to adjust the quantization strategy. Recommended starting points (an illustrative command sketch follows the list):

- Use **16a4w** as the baseline
- Optionally apply **SpinQuant**
- Use **16a8w selectively on some layers** to further improve accuracy (mixed-precision quantization)
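As a concrete starting point, here is a hedged sketch of what a 16a4w baseline export might look like. The exact flag names (`--ptq`, `--checkpoint`, `--params`, `--tokenizer_model`) and paths are assumptions for illustration; consult the script's `--help` output and the Instructions section below for the authoritative options.

```bash
# Illustrative sketch only: a 16a4w baseline export for an older HTP version.
# Flag names are assumptions; verify them with the script's --help output.
python examples/qualcomm/oss_scripts/llama/llama.py \
  --checkpoint /path/to/consolidated.00.pth \
  --params /path/to/params.json \
  --tokenizer_model /path/to/tokenizer.model \
  --ptq 16a4w \
  --prompt "What is the capital of France?"
```

From there, enable SpinQuant and/or selectively raise some layers to 16a8w, and compare accuracy against the 16a4w baseline.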

### Memory Limit Errors (4 GB HTP Limit)

If you encounter errors like the following, it typically means the model’s requested memory exceeds the **4 GB per-context limit** on HTP.
To resolve this, try **increasing the sharding number** (`num_sharding`) to reduce per-shard memory usage:
```
[ERROR] [Qnn ExecuTorch]: QnnDsp <E> Failed to find available PD for contextId 1 on deviceId 0 coreId 0 with context size estimate 4025634048
[ERROR] [Qnn ExecuTorch]: QnnDsp <E> context create from binary failed on contextId 1
[ERROR] [Qnn ExecuTorch]: QnnDsp <E> Fail to create context from binary with err 1002
[ERROR] [Qnn ExecuTorch]: QnnDsp <E> Size Calculation encounter error! Doing Hard reset of reserved mem to 0.
[ERROR] [Qnn ExecuTorch]: QnnDsp <E> Failed to create context from binary with err 0x3ea
[ERROR] [Qnn ExecuTorch]: Can't create context from binary
```
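For example, a hedged sketch of re-exporting with a higher shard count. The underlying parameter is `num_sharding` as noted above, but the exact flag spelling and the remaining options are assumptions, so keep the rest of your usual export flags:

```bash
# Illustrative sketch only: split the model across more HTP contexts so
# each shard stays under the 4 GB limit. Verify the flag with --help.
python examples/qualcomm/oss_scripts/llama/llama.py \
  --num_sharding 4 \
  --ptq 16a4w \
  --checkpoint /path/to/consolidated.00.pth \
  --params /path/to/params.json \
  --tokenizer_model /path/to/tokenizer.model
```

Larger shard counts typically lower per-context memory at the cost of some runtime overhead, so increase the value gradually (e.g., 2, then 4) until the context loads.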
## Instructions
### Note
1. For hybrid mode, the export time is longer and can take 1-4 hours to complete, depending on the specific model being exported.
