Conversation

@MQ-mengqing
Contributor

  • ggml : optimize ggml_vec_dot_iq4_xs_q8_K for LoongArch ASX

  • ggml : optimize mul_sum_i8_pairs_float for LoongArch ASX

  • ggml : optimize ggml_vec_dot_q2_K_q8_K for LoongArch ASX

  • ggml : optimize ggml_vec_dot_q5_K_q8_K for LoongArch ASX

  • ggml : optimize ggml_vec_dot_q6_K_q8_K for LoongArch ASX

  • ggml : optimize ggml_vec_dot_q4_K_q8_K for LoongArch ASX

  • ggml : optimize ggml_vec_dot_q3_K_q8_K for LoongArch ASX
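All of these kernels share one structure: a block-wise integer dot product between a quantized weight row and Q8_K-quantized activations, with per-block scales folded in at the end. As a minimal scalar sketch of that pattern (an illustrative q8_0-style block layout with made-up names, not ggml's exact structs), the hot loop the LASX intrinsics vectorize looks like this:

```c
#include <stdint.h>

#define QK 32   // elements per block (illustrative; the K-quants use 256)

typedef struct {
    float  d;        // per-block scale
    int8_t qs[QK];   // quantized values, already unpacked to int8
} block_i8;          // simplified stand-in for ggml's block types

// Scalar reference of the vec_dot pattern. The SIMD versions in this PR
// vectorize the inner int8 multiply-accumulate (the job of
// mul_sum_i8_pairs_float on 256-bit LASX vectors) and then apply the
// per-block scales.
static float vec_dot_blocks(int n, const block_i8 * x, const block_i8 * y) {
    float sumf = 0.0f;
    for (int i = 0; i < n / QK; ++i) {
        int32_t sumi = 0;
        for (int j = 0; j < QK; ++j) {
            sumi += (int32_t) x[i].qs[j] * (int32_t) y[i].qs[j];
        }
        sumf += x[i].d * y[i].d * (float) sumi;
    }
    return sumf;
}
```

The per-format kernels differ mainly in how they unpack the 2-6 bit quants into int8 lanes before this multiply-accumulate; that unpacking plus the widening multiply is where the LASX-specific gains come from.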

github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Feb 13, 2025
@MQ-mengqing
Contributor Author

I got the GGUF models from https://huggingface.co/MaziyarPanahi/Llama-3.2-1B-Instruct-GGUF, and llama-bench on my LoongArch machine running AOSC OS shows:

$ llama-bench -m Llama-3.2-1B-Instruct.Q2_K.gguf \
              -m Llama-3.2-1B-Instruct.Q3_K_S.gguf \
              -m Llama-3.2-1B-Instruct.Q4_K_S.gguf \
              -m Llama-3.2-1B-Instruct.Q5_K_S.gguf \
              -m Llama-3.2-1B-Instruct.Q6_K.gguf \
              -m Llama-3.2-1B-Instruct.Q8_0.gguf \
              -m Llama-3.2-1B-Instruct.IQ4_XS.gguf
Before:
| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | CPU        |       8 |         pp512 |         33.15 ± 0.01 |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | CPU        |       8 |         tg128 |         28.61 ± 0.14 |
| llama 1B Q3_K - Small          | 604.50 MiB |     1.24 B | CPU        |       8 |         pp512 |         30.04 ± 0.00 |
| llama 1B Q3_K - Small          | 604.50 MiB |     1.24 B | CPU        |       8 |         tg128 |         23.49 ± 0.05 |
| llama 1B Q4_K - Small          | 732.25 MiB |     1.24 B | CPU        |       8 |         pp512 |         31.33 ± 0.00 |
| llama 1B Q4_K - Small          | 732.25 MiB |     1.24 B | CPU        |       8 |         tg128 |         22.41 ± 0.05 |
| llama 1B Q5_K - Small          | 843.75 MiB |     1.24 B | CPU        |       8 |         pp512 |         27.76 ± 0.01 |
| llama 1B Q5_K - Small          | 843.75 MiB |     1.24 B | CPU        |       8 |         tg128 |         20.27 ± 0.03 |
| llama 1B Q6_K                  | 967.00 MiB |     1.24 B | CPU        |       8 |         pp512 |         27.51 ± 0.00 |
| llama 1B Q6_K                  | 967.00 MiB |     1.24 B | CPU        |       8 |         tg128 |         22.98 ± 0.03 |
| llama 1B Q8_0                  |   1.22 GiB |     1.24 B | CPU        |       8 |         pp512 |         35.64 ± 0.01 |
| llama 1B Q8_0                  |   1.22 GiB |     1.24 B | CPU        |       8 |         tg128 |         22.16 ± 0.03 |
| llama 1B IQ4_XS - 4.25 bpw     | 701.25 MiB |     1.24 B | CPU        |       8 |         pp512 |         24.48 ± 0.00 |
| llama 1B IQ4_XS - 4.25 bpw     | 701.25 MiB |     1.24 B | CPU        |       8 |         tg128 |         18.93 ± 0.02 |


After:

| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | CPU        |       8 |         pp512 |         38.45 ± 0.01 |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | CPU        |       8 |         tg128 |         33.49 ± 0.01 |
| llama 1B Q3_K - Small          | 604.50 MiB |     1.24 B | CPU        |       8 |         pp512 |         36.53 ± 0.01 |
| llama 1B Q3_K - Small          | 604.50 MiB |     1.24 B | CPU        |       8 |         tg128 |         27.26 ± 0.11 |
| llama 1B Q4_K - Small          | 732.25 MiB |     1.24 B | CPU        |       8 |         pp512 |         38.34 ± 0.02 |
| llama 1B Q4_K - Small          | 732.25 MiB |     1.24 B | CPU        |       8 |         tg128 |         25.51 ± 0.06 |
| llama 1B Q5_K - Small          | 843.75 MiB |     1.24 B | CPU        |       8 |         pp512 |         34.06 ± 0.02 |
| llama 1B Q5_K - Small          | 843.75 MiB |     1.24 B | CPU        |       8 |         tg128 |         23.37 ± 0.03 |
| llama 1B Q6_K                  | 967.00 MiB |     1.24 B | CPU        |       8 |         pp512 |         37.92 ± 0.01 |
| llama 1B Q6_K                  | 967.00 MiB |     1.24 B | CPU        |       8 |         tg128 |         27.72 ± 0.03 |
| llama 1B Q8_0                  |   1.22 GiB |     1.24 B | CPU        |       8 |         pp512 |         37.25 ± 0.01 |
| llama 1B Q8_0                  |   1.22 GiB |     1.24 B | CPU        |       8 |         tg128 |         22.17 ± 0.03 |
| llama 1B IQ4_XS - 4.25 bpw     | 701.25 MiB |     1.24 B | CPU        |       8 |         pp512 |         33.47 ± 0.01 |
| llama 1B IQ4_XS - 4.25 bpw     | 701.25 MiB |     1.24 B | CPU        |       8 |         tg128 |         23.46 ± 0.25 |
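Relative to the "before" run, pp512 improves by roughly 16-38% and tg128 by roughly 14-24% across the K-quants and IQ4_XS, while Q8_0 is nearly unchanged (pp512 +4.5%, tg128 flat), as its kernel is not among those rewritten here and would only see the mul_sum_i8_pairs_float change.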

@ggerganov
Member

cc @junchao-loongson for review

@junchao-loongson
Collaborator

  • benchmark

cpu: Loongson 3A6000 @ 2.5 GHz
os: Deepin 23
gcc: 14.2.0

$ ./build/bin/llama-bench -m ../model-gguf/Llama-3.2-1B-Instruct.Q2_K.gguf \
              -m ../model-gguf/Llama-3.2-1B-Instruct.Q3_K_S.gguf \
              -m ../model-gguf/Llama-3.2-1B-Instruct.Q4_K_S.gguf \
              -m ../model-gguf/Llama-3.2-1B-Instruct.Q5_K_S.gguf \
              -m ../model-gguf/Llama-3.2-1B-Instruct.Q6_K.gguf \
              -m ../model-gguf/Llama-3.2-1B-Instruct.Q8_0.gguf \
              -m ../model-gguf/Llama-3.2-1B-Instruct.IQ4_XS.gguf
| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | CPU        |       8 |         pp512 |         38.15 ± 0.06 |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | CPU        |       8 |         tg128 |         36.28 ± 0.30 |
| llama 1B Q3_K - Small          | 604.50 MiB |     1.24 B | CPU        |       8 |         pp512 |         35.48 ± 0.01 |
| llama 1B Q3_K - Small          | 604.50 MiB |     1.24 B | CPU        |       8 |         tg128 |         33.67 ± 0.24 |
| llama 1B Q4_K - Small          | 732.25 MiB |     1.24 B | CPU        |       8 |         pp512 |         38.26 ± 0.05 |
| llama 1B Q4_K - Small          | 732.25 MiB |     1.24 B | CPU        |       8 |         tg128 |         36.11 ± 0.05 |
| llama 1B Q5_K - Small          | 843.75 MiB |     1.24 B | CPU        |       8 |         pp512 |         34.25 ± 0.05 |
| llama 1B Q5_K - Small          | 843.75 MiB |     1.24 B | CPU        |       8 |         tg128 |         32.50 ± 0.24 |
| llama 1B Q6_K                  | 967.00 MiB |     1.24 B | CPU        |       8 |         pp512 |         38.05 ± 0.40 |
| llama 1B Q6_K                  | 967.00 MiB |     1.24 B | CPU        |       8 |         tg128 |         31.22 ± 0.22 |
| llama 1B Q8_0                  |   1.22 GiB |     1.24 B | CPU        |       8 |         pp512 |         37.29 ± 0.13 |
| llama 1B Q8_0                  |   1.22 GiB |     1.24 B | CPU        |       8 |         tg128 |         25.68 ± 0.17 |
| llama 1B IQ4_XS - 4.25 bpw     | 701.25 MiB |     1.24 B | CPU        |       8 |         pp512 |         33.65 ± 0.04 |
| llama 1B IQ4_XS - 4.25 bpw     | 701.25 MiB |     1.24 B | CPU        |       8 |         tg128 |         31.12 ± 0.22 |

build: 66ed5e38 (4712)

The benchmark results reproduce.

  • ctest
$ bash ./ci/run.sh ./tmp/results ./tmp/mnt
.......
+ tee -a /home/junchao/work/ai/llama.cpp/tmp/results/ctest_release-ctest.log
+ ctest --output-on-failure -L main
Test project /home/junchao/work/ai/llama.cpp/build-ci-release
      Start  1: test-tokenizer-0-bert-bge
 1/28 Test  #1: test-tokenizer-0-bert-bge .........   Passed    0.03 sec
      Start  2: test-tokenizer-0-command-r
 2/28 Test  #2: test-tokenizer-0-command-r ........   Passed    0.60 sec
      Start  3: test-tokenizer-0-deepseek-coder
 3/28 Test  #3: test-tokenizer-0-deepseek-coder ...   Passed    0.07 sec
      Start  4: test-tokenizer-0-deepseek-llm
 4/28 Test  #4: test-tokenizer-0-deepseek-llm .....   Passed    0.20 sec
      Start  5: test-tokenizer-0-falcon
 5/28 Test  #5: test-tokenizer-0-falcon ...........   Passed    0.11 sec
      Start  6: test-tokenizer-0-gpt-2
 6/28 Test  #6: test-tokenizer-0-gpt-2 ............   Passed    0.09 sec
      Start  7: test-tokenizer-0-llama-bpe
 7/28 Test  #7: test-tokenizer-0-llama-bpe ........   Passed    0.35 sec
      Start  8: test-tokenizer-0-llama-spm
 8/28 Test  #8: test-tokenizer-0-llama-spm ........   Passed    0.04 sec
      Start  9: test-tokenizer-0-mpt
 9/28 Test  #9: test-tokenizer-0-mpt ..............   Passed    0.09 sec
      Start 10: test-tokenizer-0-phi-3
10/28 Test #10: test-tokenizer-0-phi-3 ............   Passed    0.04 sec
      Start 11: test-tokenizer-0-qwen2
11/28 Test #11: test-tokenizer-0-qwen2 ............   Passed    0.32 sec
      Start 12: test-tokenizer-0-refact
12/28 Test #12: test-tokenizer-0-refact ...........   Passed    0.09 sec
      Start 13: test-tokenizer-0-starcoder
13/28 Test #13: test-tokenizer-0-starcoder ........   Passed    0.09 sec
      Start 14: test-sampling
14/28 Test #14: test-sampling .....................   Passed    1.31 sec
      Start 15: test-grammar-parser
15/28 Test #15: test-grammar-parser ...............   Passed    0.00 sec
      Start 16: test-grammar-integration
16/28 Test #16: test-grammar-integration ..........   Passed    0.01 sec
      Start 17: test-llama-grammar
17/28 Test #17: test-llama-grammar ................   Passed    0.00 sec
      Start 18: test-chat
18/28 Test #18: test-chat .........................   Passed    0.67 sec
      Start 19: test-tokenizer-1-llama-spm
19/28 Test #19: test-tokenizer-1-llama-spm ........   Passed    0.28 sec
      Start 20: test-log
20/28 Test #20: test-log ..........................   Passed    0.02 sec
      Start 21: test-arg-parser
21/28 Test #21: test-arg-parser ...................   Passed    0.06 sec
      Start 22: test-chat-template
22/28 Test #22: test-chat-template ................   Passed    0.13 sec
      Start 23: test-gguf
23/28 Test #23: test-gguf .........................   Passed    0.16 sec
      Start 24: test-backend-ops
24/28 Test #24: test-backend-ops ..................   Passed    0.01 sec
      Start 27: test-barrier
25/28 Test #27: test-barrier ......................   Passed    0.29 sec
      Start 28: test-quantize-fns
26/28 Test #28: test-quantize-fns .................   Passed   17.78 sec
      Start 29: test-quantize-perf
27/28 Test #29: test-quantize-perf ................   Passed    0.07 sec
      Start 30: test-rope
28/28 Test #30: test-rope .........................   Passed    0.15 sec

100% tests passed, 0 tests failed out of 28

Label Time Summary:
main    =  23.05 sec*proc (28 tests)

Total Test time (real) =  23.06 sec
.....

ctest passes.

LGTM!

ggerganov merged commit 38e32eb into ggml-org:master on Feb 14, 2025
46 checks passed
orca-zhang pushed a commit to orca-zhang/llama.cpp that referenced this pull request on Feb 26, 2025
arthw pushed a commit to arthw/llama.cpp that referenced this pull request on Feb 26, 2025
mglambda pushed a commit to mglambda/llama.cpp that referenced this pull request on Mar 8, 2025