Qualcomm AI Engine Direct - Refactor llama runner (#10578)
Summary:
- Refactored io_manager into five distinct components:
  - DecoderRunner: Module wrapper class.
  - PromptProcessor: Handles prompt processing using the decoder and key-value manager.
  - TokenGenerator: Generates tokens using the decoder and key-value manager.
  - KVManager: Manages the key-value cache via kv_updater, including data buffer allocation, cache updates, and buffer updates in TensorImpl.
  - IBufferAlloc: Allocates data buffers from RPC memory or a client buffer.
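A minimal sketch of how these five pieces might compose; the class and method names below are illustrative assumptions, not the actual interfaces under examples/qualcomm/oss_scripts/llama/runner/:
```
// Hypothetical sketch of the refactored components; names and signatures
// are assumptions for illustration, not the runner's real interfaces.
#include <cstddef>
#include <cstdint>
#include <vector>

// Allocates data buffers from RPC memory or a client buffer.
struct IBufferAlloc {
  virtual ~IBufferAlloc() = default;
  virtual void* allocate(std::size_t nbytes) = 0;
};

// Manages the key-value cache; kv_updater selects the update strategy.
class KVManager {
 public:
  enum class Updater { kShiftPointer, kSmartMask };
  KVManager(IBufferAlloc& alloc, Updater updater)
      : alloc_(alloc), updater_(updater) {}
  // Would shift tensor data pointers or rewrite the attention mask here.
  void update_cache(std::int64_t /*pos*/) {}

 private:
  IBufferAlloc& alloc_;
  Updater updater_;
};

// Thin wrapper around the exported decoder module.
class DecoderRunner {
 public:
  // Placeholder for one forward pass of the decoder .pte module.
  std::int64_t step(std::int64_t token, std::int64_t /*pos*/) {
    return token + 1;  // stand-in for logits -> argmax
  }
};

// Consumes the prompt with the decoder and key-value manager.
class PromptProcessor {
 public:
  PromptProcessor(DecoderRunner& d, KVManager& kv) : d_(d), kv_(kv) {}
  std::int64_t prefill(const std::vector<std::int64_t>& prompt) {
    std::int64_t last = 0;
    for (std::size_t i = 0; i < prompt.size(); ++i) {
      last = d_.step(prompt[i], static_cast<std::int64_t>(i));
      kv_.update_cache(static_cast<std::int64_t>(i));
    }
    return last;
  }

 private:
  DecoderRunner& d_;
  KVManager& kv_;
};

// Generates tokens one position at a time after prefill.
class TokenGenerator {
 public:
  TokenGenerator(DecoderRunner& d, KVManager& kv) : d_(d), kv_(kv) {}
  std::int64_t generate(std::int64_t cur, std::int64_t pos) {
    std::int64_t next = d_.step(cur, pos);
    kv_.update_cache(pos);
    return next;
  }

 private:
  DecoderRunner& d_;
  KVManager& kv_;
};
```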
- Validated story llama with CL=128, prefill_ar_len=16, QNN SDK 2.32.
- Original:

| CL | prefill_ar_len | eval_mode | kv_updater | Model Load Time (seconds) | Prompt evaluation (seconds) | Generated token rate (tokens/second) | Time to first generated token (seconds) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 128 | 16 | KV | shift_pointer | 0.3082 | 0.0105 | 237.5553131 | 0.0152 |
| 128 | 16 | KV | smart_mask | 0.2691 | 0.0501 | 258.9103433 | 0.0544 |
| 128 | 16 | hybrid | shift_pointer | 0.3408 | 0.008 | 232.1754892 | 0.008 |
| 128 | 16 | hybrid | smart_mask | 0.3175 | 0.0447 | 237.7134587 | 0.0447 |
- Refactor:

| CL | prefill_ar_len | eval_mode | kv_updater | Model Load Time (seconds) | Prompt evaluation (seconds) | Generated token rate (tokens/second) | Time to first generated token (seconds) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 128 | 16 | KV | shift_pointer | 0.2808 | 0.0124 | 234.835 | 0.0124 |
| 128 | 16 | KV | smart_mask | 0.238 | 0.027 | 251.004016 | 0.027 |
| 128 | 16 | hybrid | shift_pointer | 0.3305 | 0.0082 | 229.1122162 | 0.0082 |
| 128 | 16 | hybrid | smart_mask | 0.258 | 0.013 | 239.463602 | 0.013 |
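As background for the kv_updater column, here is a minimal sketch of the two update strategies, assuming a contiguous preallocated cache buffer; this is illustrative only, not the runner's actual code:
```
// Illustrative contrast of the two kv_updater strategies, assuming a
// contiguous preallocated cache of head_dim floats per position.
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

struct CacheView {
  float* data;              // base pointer registered with the backend
  std::int32_t* attn_mask;  // 1 = attend to this position, 0 = masked
};

// shift_pointer: keep writing at a fixed offset and slide the tensor's
// data pointer one slot per step so the visible window stays aligned.
inline float* shift_pointer_update(float* cur_ptr, std::size_t head_dim) {
  return cur_ptr - head_dim;  // TensorImpl would be repointed here
}

// smart_mask: leave pointers fixed, copy the new entry to its absolute
// slot, and unmask that position in the attention mask.
inline void smart_mask_update(CacheView& cache,
                              const std::vector<float>& new_kv,
                              std::int64_t pos,
                              std::size_t head_dim) {
  std::memcpy(cache.data + pos * head_dim,
              new_kv.data(),
              head_dim * sizeof(float));
  cache.attn_mask[pos] = 1;  // newly valid cache slot
}
```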
- Support multi-turn use case.
  - Validated on story llama. To simulate the scenario, I forced decode mode to generate 5 tokens per round; prompt tokens of random length are then inserted after each prefill->decode round finishes (see the sketch after the patch below).
- Reproduce command (note: some whitespace is missing from the output below due to decoding, but the tokens match the golden output):
```
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android \
  --checkpoint stories110M.pt --params params.json \
  --tokenizer_model tokenizer.model \
  --prompt "Once" "a little girl named Lily." "toys and her favorite toy was a big, red ball." "s mom asked her to help her with the laundry." "and she put all the clothes in the washing machine." \
  --temperature 0 --tokenizer_bin tokenizer.bin --llama_model stories110m \
  --model_mode hybrid --ptq 16a4w -m SM8650 -H ${HOST} -s ${DEVICE} \
  -a ${ARTIFACTS} --max_seq_len 128 --prefill_ar_len 16
Result:
Once upon a time, there wasa little girl named Lily. She loved to play
with hertoys and her favorite toy was a big, red ball. One day, Lily's
mom asked her to help her with the laundry. Lily was happy to helpand
she put all the clothes in the washing machine.
After the clothes were
```
- Need to apply the below patch to force decode mode to generate 5 tokens each round:
```
diff --git a/examples/qualcomm/oss_scripts/llama/runner/token_generator.cpp b/examples/qualcomm/oss_scripts/llama/runner/token_generator.cpp
index 8a81b598d..a8ec53cdb 100644
--- a/examples/qualcomm/oss_scripts/llama/runner/token_generator.cpp
+++ b/examples/qualcomm/oss_scripts/llama/runner/token_generator.cpp
@@ -170,7 +170,10 @@ Result<int64_t> TokenGenerator::generate(
"Failed to set output tensor for module %s",
forward_name_.c_str());
// Generate our tokens
- while (pos < seq_len - 1) {
+ // force decode to generate 5 runs at most
+ int64_t max_pos = std::min(pos + 5, (int64_t)seq_len - 1);
+// while (pos < seq_len - 1) {
+ while (pos < max_pos) {
// Fill in the token and position data
prepare_io(cur_token, pos);
// Only update data pointer of the cache to the tensor for SHIFT_POINTER
```
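For reference, a rough sketch of the multi-turn flow the patch exercises, with hypothetical helpers standing in for PromptProcessor and TokenGenerator; the real logic lives in examples/qualcomm/oss_scripts/llama/runner/:
```
// Rough sketch of the multi-turn flow: each turn is prefilled at the
// current cache position, then decode runs for at most 5 tokens, as in
// the patch above. Hypothetical helper, not the runner's real API.
#include <algorithm>
#include <cstdint>
#include <vector>

std::int64_t run_multi_turn(
    const std::vector<std::vector<std::int64_t>>& turns,
    std::int64_t max_seq_len) {
  std::int64_t pos = 0;
  for (const auto& prompt : turns) {
    // Prefill this turn's tokens at the current position.
    for (std::int64_t tok : prompt) {
      (void)tok;  // would feed PromptProcessor::prefill in AR-N chunks
      ++pos;
    }
    // Decode, capped at 5 generated tokens per round.
    std::int64_t max_pos = std::min<std::int64_t>(pos + 5, max_seq_len - 1);
    while (pos < max_pos) {
      // would call TokenGenerator::generate(cur_token, pos)
      ++pos;
    }
  }
  return pos;  // total positions consumed across all turns
}
```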
Excerpts from the files changed:

The --prompt argument in the export script now accepts multiple values so a multi-turn conversation can be entered:
```
     help="User prompts for Llama. When multiple prompts are entered, a multi-turn conversation will be initiated. Note that this feature is currently for testing purposes only.",
     required=True,
     type=str,
+    nargs="+",
 )

 parser.add_argument(
```

export_llama now raises instead of calling exit() when conflicting flags are set:
```
@@ -1018,7 +1018,7 @@ def _build_parser():
 def export_llama(args) -> None:
     if args.compile_only and args.pre_gen_pte:
-        exit("Cannot set both compile_only and pre_gen_pte as true")
+        raise RuntimeError("Cannot set both compile_only and pre_gen_pte as true")
```

The runner's prompt flag gets a matching help string:
```
-DEFINE_string(prompt, "The answer to the ultimate question is", "Prompt.");
+DEFINE_string(
+    prompt,
+    "The answer to the ultimate question is",
+    "User prompts for Llama. When multiple prompts are entered, a multi-turn conversation will be initiated. Note that this feature is currently for testing purposes only.");
 DEFINE_string(
     system_prompt,
     "",
@@ -49,10 +52,8 @@ DEFINE_int32(
     "Total number of tokens to generate (prompt + output).");
```