Question about inference speed #4

scguang301 · 2024-10-11T12:50:51Z

scguang301
Oct 11, 2024

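I am testing the dev-multi-op-in-one-graph branch and everything runs (great), but QNN speed is basically the same as CPU speed. What speeds do you see in your tests? Also, when using the NPU, CPU usage during inference is just as high. Why is that?

I ran qwen2.5 0.5B with q4_0 quantization on a Qualcomm 8295: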
./llama-cli -m ggml-model-q4_0.gguf -t 1 --chat-template chatml -p "我是一个助手" -n 128 -ngl 50 -mg 2
llama_perf_sampler_print: sampling time = 15.37 ms / 131 runs ( 0.12 ms per token, 8524.21 tokens per second)
llama_perf_context_print: load time = 1181.97 ms
llama_perf_context_print: prompt eval time = 79.10 ms / 3 tokens ( 26.37 ms per token, 37.92 tokens per second)
llama_perf_context_print: eval time = 4491.66 ms / 127 runs ( 35.37 ms per token, 28.27 tokens per second)
llama_perf_context_print: total time = 4610.07 ms / 130 tokens
[ggml_backend_qnn_free, 227]: idx 0, name:QNN-CPU
[ggml_backend_qnn_free, 227]: idx 1, name:QNN-GPU
[ggml_backend_qnn_free, 227]: idx 2, name:QNN-NPU
[ggml_backend_qnn_free, 227]: idx 2, name:QNN-NPU

./llama-cli -m ggml-model-q4_0.gguf -t 1 --chat-template chatml -p "我是一个助手" -n 128
llama_perf_sampler_print: sampling time = 3.82 ms / 33 runs ( 0.12 ms per token, 8647.80 tokens per second)
llama_perf_context_print: load time = 833.43 ms
llama_perf_context_print: prompt eval time = 81.32 ms / 3 tokens ( 27.11 ms per token, 36.89 tokens per second)
llama_perf_context_print: eval time = 1027.26 ms / 29 runs ( 35.42 ms per token, 28.23 tokens per second)
llama_perf_context_print: total time = 1118.79 ms / 32 tokens
[ggml_backend_qnn_free, 227]: idx 0, name:QNN-CPU
[ggml_backend_qnn_free, 227]: idx 1, name:QNN-GPU
[ggml_backend_qnn_free, 227]: idx 2, name:QNN-NPU
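As a quick check on the two logs above, decode throughput can be recomputed from each `eval time` line as runs divided by elapsed seconds. This is only a minimal sketch; `tokens_per_second` is a hypothetical helper name and not part of llama.cpp.

```cpp
// Hypothetical helper: tokens per second from an "eval time" line,
// i.e. number of runs divided by the elapsed time in seconds.
constexpr double tokens_per_second(double eval_ms, int runs) {
    return runs / (eval_ms / 1000.0);
}

// First run  (-ngl 50 -mg 2): 4491.66 ms over 127 runs.
constexpr double with_offload = tokens_per_second(4491.66, 127);

// Second run (no offload flags): 1027.26 ms over 29 runs.
constexpr double cpu_only = tokens_per_second(1027.26, 29);
```

Both values come out at roughly 28.2 to 28.3 tokens per second, which matches the figures printed in the logs and supports the observation that the two configurations decode at essentially the same speed.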

scguang301 · 2024-10-11T12:54:24Z

scguang301
Oct 11, 2024
Author

How can I make it faster?

0 replies

scguang301 · 2024-10-14T09:56:12Z

scguang301
Oct 14, 2024
Author

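I am using a q4_0 quantized model, and I found that in backend-ops.cpp every GGML_OP_MUL_MAT prints `src0 type 2 and src1 type 0 are not equal`. As a result the op is not scheduled onto the NPU, so the end result is effectively CPU execution. How should I handle this?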
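For readers decoding that message: in ggml's public `ggml_type` enum, type 2 is `GGML_TYPE_Q4_0` and type 0 is `GGML_TYPE_F32`, so the log says the quantized weights (src0) and the F32 activations (src1) have different types. The sketch below is an assumption about the kind of guard that would produce such a message, not the actual code in backend-ops.cpp; `tensor_type` and `mul_mat_types_match` are hypothetical names.

```cpp
#include <cstdint>

// Hypothetical mirror of the first few ggml type ids. The numeric values
// follow ggml's `ggml_type` enum: F32 = 0, F16 = 1, Q4_0 = 2.
enum class tensor_type : int32_t { F32 = 0, F16 = 1, Q4_0 = 2 };

// Hypothetical guard: accept a MUL_MAT only when both inputs share one type.
// A backend with a check like this would reject every q4_0 x f32 matmul.
constexpr bool mul_mat_types_match(tensor_type src0, tensor_type src1) {
    return src0 == src1;
}
```

Under this assumption, `mul_mat_types_match(tensor_type::Q4_0, tensor_type::F32)` is false, so a q4_0 weight multiplied by F32 activations would be left to another backend, while an F32 by F32 matmul would pass the check.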

0 replies

akshatshah17 · 2025-01-08T06:11:15Z

akshatshah17
Jan 8, 2025

Hi @scguang301, I am seeing a similar problem. I get both GPU and HTP logs while running the tiny_llama model, so I am actually confused about whether the model is running on the QNN NPU or the QNN GPU.

llm_load_print_meta: max token length = 48
llm_load_tensors: offloading 22 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 23/23 layers to GPU
llm_load_tensors: CPU_Mapped model buffer size = 636.18 MiB
......................................................................................
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_ctx_per_seq = 4096
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_pre_seq (4096) > n_ctx_train (2048) -- possible training context overflow
[ggml_backend_qnn_init_with_device_context, 327]: extend_lib_search_path is nullptr, will use /data/local/tmp// as default
[qnn_system_interface, 10]: initialize qnn system successfully

[qnn_init, 248]: device property is not supported
[qnn_init, 299]: create QNN device successfully
[ggml_backend_qnn_init_with_device_context, 380]: qnn device name qnn-gpu
[ggml_backend_qnn_init_with_device_context, 327]: extend_lib_search_path is nullptr, will use /data/local/tmp// as default
[qnn_system_interface, 10]: initialize qnn system successfully

[qnn_init, 258]: device counts 1
[qnn_init, 263]: deviceID:0, deviceType:0, numCores 1
[qnn_init, 268]: htp_type:0(ON_CHIP)
[qnn_init, 271]: qualcomm soc_model:69(unknown), htp_arch:79(unknown), vtcm_size:8 MB
[qnn_init, 299]: create QNN device successfully
[alloc_rpcmem, 594]: failed to allocate rpc memory, size: 2048 MB
[qnn_init, 372]: capacity of QNN rpc ion memory is about 2000 MB
[init_htp_perfinfra, 485]: HTP backend perf_infrastructure creation ok
[init_htp_perfinfra, 497]: HTP infra type = 0, which is perf infra type
[ggml_backend_qnn_init_with_device_context, 380]: qnn device name qnn-npu
llama_kv_cache_init: qnn-gpu KV buffer size = 88.00 MiB
llama_new_context_with_model: KV self size = 88.00 MiB, K (f16): 44.00 MiB, V (f16): 44.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.12 MiB
llama_new_context_with_model: CPU compute buffer size = 280.01 MiB
llama_new_context_with_model: graph nodes = 710
llama_new_context_with_model: graph splits = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 8
main: model was trained on only 2048 context tokens (4096 specified)

sampler seed: 3467048278
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 1

[{<(Task)>}]
You are a summarization expert. Please read the provided carefully and summarize it in 3 sentences in English. The summary should comprehensively cover the entire content of the original text and be written with the same meaning as the source material.

[{<(Input)>}] Bread, milk, eggs, chicken, rice, pasta, tomatoes, spinach, bananas, apples, yogurt, cheese, toothpaste, soap, tissues, laundry detergent, coffee, Two proteins: a rotisserie chicken and two 4 oz. fillets of fresh salmon Two veggies: asparagus and carrots a handful of bananas and two avocados cereal
[{<(ParagraphSummary)>}]
Ingredients:

1 rotisserie chicken (or two 4 oz. Fillets), cooked
2 asparagus, chopped
2 carrots, chopped
2 bananas, sliced
2 avocados, mashed
1/4 cup cereal
Instructions:

Preheat oven to 375°F. Line a baking dish with parchment paper.
Place chicken in a large bowl, add asparagus and carrots, and toss with olive oil, salt, and pepper. Spread in a single layer in the prepared baking dish.
In a small bowl, combine mashed banana, cereal, and chicken stock. Pour over chicken mixture.
Bake for 30-35 minutes, or until cooked through.
Serve with mashed avocado and top with toppings (such as sliced jalapeños or chopped cilantro). Enjoy! [end of text]
llama_perf_sampler_print: sampling time = 9.81 ms / 441 runs ( 0.02 ms per token, 44958.71 tokens per second)
llama_perf_context_print: load time = 637.56 ms
llama_perf_context_print: prompt eval time = 1246.73 ms / 190 tokens ( 6.56 ms per token, 152.40 tokens per second)
llama_perf_context_print: eval time = 3927.85 ms / 250 runs ( 15.71 ms per token, 63.65 tokens per second)
llama_perf_context_print: total time = 5202.37 ms / 440 tokens
[ggml_backend_qnn_free, 208]: idx 2, name:qnn-npu
[ggml_backend_qnn_free, 208]: idx 1, name:qnn-gpu

0 replies

chraac · 2025-05-27T05:50:31Z

chraac
May 27, 2025
Maintainer

Hi, sorry for the late reply. There is already an issue tracking performance; please have a look:

#34

0 replies
