Adds hardware support details and memory-limit error-handling instructions.

It seems that many users try to run the llama flow on hardware other than phones, so let's document that first.
`examples/qualcomm/oss_scripts/llama/README.md`:

We offer the following modes to execute the model:
- Lookahead Mode: Lookahead Mode introduces [lookahead decoding](https://arxiv.org/abs/2402.02057) and uses the AR-N model to process the prompt, enhancing token generation speed. While decoding multiple tokens in a single step is infeasible, an LLM can generate multiple guess tokens in parallel. These guess tokens may fit into future parts of the generated sequence. The lookahead decoder generates and verifies these guess tokens, integrating them into the sequence when they pass verification. In some cases it can obtain more than one token in a single step. The result is lossless (see the sketch below).
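Conceptually, the verify-and-accept step looks like the following sketch (illustrative only; `propose_guesses` and `target_next_token` are hypothetical stand-ins, not the actual runner API):

```python
# Conceptual sketch of lossless lookahead verification -- NOT the runner's actual
# code; `propose_guesses` and `target_next_token` are hypothetical stand-ins.

def lookahead_step(seq, propose_guesses, target_next_token):
    """Extend `seq` by one or more tokens in a single step, matching greedy decoding."""
    guesses = propose_guesses(seq)                    # n-gram guess tokens, built in parallel
    if not guesses:                                   # no guesses available: plain AR step
        return seq + [target_next_token(seq)]
    accepted = []
    for guess in guesses:
        expected = target_next_token(seq + accepted)  # what the LLM itself would emit next
        if guess != expected:
            accepted.append(expected)                 # keep the model's own token instead
            break                                     # stop at the first mismatch
        accepted.append(guess)                        # match: an extra token "for free"
    return seq + accepted
```

In a real decoder the `target_next_token` checks are batched into a single forward pass, which is where the speedup comes from.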
## Hardware Support
We’ve validated this flow on the **Samsung Galaxy S23**, **Samsung Galaxy S24**, **Samsung Galaxy S25**, and **OnePlus 12**.
Support on other hardware depends on the **HTP architecture (HtpArch)** version and the feature set it provides.
### HTP Minimum Version Requirements
- **LPBQ (16a4w block-wise quantization)** requires **V69 or newer**
- **Weight sharing** between prefill and decode requires **V73 or newer**
- **16-bit activations + 16-bit weights for matmul** (e.g., 16-bit KV cache) requires **V73 or newer**
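A tiny, hypothetical helper (ours, not an ExecuTorch API) restates these requirements as a checkable table:

```python
# Hypothetical helper -- not an ExecuTorch API; it just restates the list above.
MIN_HTP_VERSION = {
    "lpbq_16a4w_blockwise": 69,  # LPBQ (16a4w block-wise quantization)
    "weight_sharing": 73,        # weight sharing between prefill and decode
    "matmul_16a16w": 73,         # 16-bit activations + 16-bit weights for matmul
}

def is_supported(feature: str, htp_arch_version: int) -> bool:
    """True if `feature` is available on the given HTP architecture version."""
    return htp_arch_version >= MIN_HTP_VERSION[feature]

assert is_supported("lpbq_16a4w_blockwise", 69)  # V69 is new enough for LPBQ
assert not is_supported("weight_sharing", 69)    # weight sharing needs V73+
```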
### Quantization Guidance for Older Devices
For older HTP versions, you may need to adjust the quantization strategy. Recommended starting points:
- Use **16a4w** as the baseline
- Optionally apply **SpinQuant**
- Use **16a8w selectively on some layers** to further improve accuracy (mixed-precision quantization)
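The mixed-precision idea can be sketched as a per-layer scheme lookup (illustrative only; the layer names and helper are hypothetical, not the actual quantizer API):

```python
# Illustrative sketch of mixed-precision selection -- not the real quantizer API.
# Keep the 16a4w baseline, but promote accuracy-sensitive layers to 16a8w.
SENSITIVE_LAYERS = {"lm_head"}  # hypothetical choice, for illustration only

def scheme_for(layer_name: str) -> str:
    """Pick a quantization scheme for one layer."""
    return "16a8w" if layer_name in SENSITIVE_LAYERS else "16a4w"

print(scheme_for("layers.0.attention"))  # -> 16a4w (baseline)
print(scheme_for("lm_head"))             # -> 16a8w (promoted for accuracy)
```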
### Memory Limit Errors (4 GB HTP Limit)
If you encounter errors like the following, it typically means the model’s requested memory exceeds the **4 GB per-context limit** on HTP.
To resolve this, try **increasing the sharding number** (`num_sharding`) to reduce per-shard memory usage:
```
[ERROR] [Qnn ExecuTorch]: QnnDsp <E> Failed to find available PD for contextId 1 on deviceId 0 coreId 0 with context size estimate 4025634048
[ERROR] [Qnn ExecuTorch]: QnnDsp <E> context create from binary failed on contextId 1
[ERROR] [Qnn ExecuTorch]: QnnDsp <E> Fail to create context from binary with err 1002
[ERROR] [Qnn ExecuTorch]: QnnDsp <E> Size Calculation encounter error! Doing Hard reset of reserved mem to 0.
[ERROR] [Qnn ExecuTorch]: QnnDsp <E> Failed to create context from binary with err 0x3ea
[ERROR] [Qnn ExecuTorch]: Can't create context from binary
```
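As a rough back-of-the-envelope check (the 10% headroom is our assumption, not a documented rule), the context-size estimate in the error suggests a starting shard count:

```python
# Rough sizing sketch: how many shards keep each context under the 4 GB limit?
# The headroom factor is an assumption; treat the result only as a starting point.
import math

HTP_CONTEXT_LIMIT = 4 * 1024**3  # 4 GB per-context limit on HTP
requested = 4_025_634_048        # "context size estimate" from the error above
headroom = 0.9                   # leave ~10% slack per shard (assumption)

num_sharding = math.ceil(requested / (HTP_CONTEXT_LIMIT * headroom))
print(num_sharding)  # -> 2: try num_sharding=2 and increase if the error persists
```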
## Instructions
### Note
1. For hybrid mode, the export time is longer and can take 1-4 hours to complete, depending on the specific model being exported.