Conversation

@ngxson ngxson commented Sep 18, 2025

Ref discussion: #15783 (comment)

Refactor this code using a C++ template function.

I tested this code by running test-backend-ops against CPU <--> Metal/CUDA/Vulkan, but the test may still miss some cases. It would be nice if you could have a deeper look, thanks!

@ngxson ngxson requested review from ggerganov and slaren September 18, 2025 02:09
@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Sep 18, 2025
@github-actions github-actions bot added the testing Everything test related label Sep 18, 2025

ngxson commented Sep 18, 2025

I did a perf test comparing master and this PR. While the results fluctuate quite a lot from one run to another, I can see that the peak performance stays the same:

master:
  CPY(type_src=f32,type_dst=f16,ne=[512,3072,1,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]):                10021 runs -   101.79 us/run -     9216 kB/run -   86.41 GB/s
  CPY(type_src=f32,type_dst=f32,ne=[8192,512,2,1],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]):                 1290 runs -   878.65 us/run -    65536 kB/run -   71.41 GB/s
  CPY(type_src=f32,type_dst=f32,ne=[3072,512,2,1],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]):                 3762 runs -   298.64 us/run -    24576 kB/run -   78.60 GB/s
  CPY(type_src=f32,type_dst=q4_0,ne=[8192,512,2,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]):                 900 runs -  1464.49 us/run -    37376 kB/run -   24.43 GB/s
  CPY(type_src=q4_0,type_dst=f32,ne=[8192,512,2,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]):                2025 runs -   527.50 us/run -    37376 kB/run -   67.61 GB/s


PR:
  CPY(type_src=f32,type_dst=f16,ne=[512,3072,1,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]):                12754 runs -    93.92 us/run -     9216 kB/run -   93.65 GB/s
  CPY(type_src=f32,type_dst=f32,ne=[8192,512,2,1],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]):                 1806 runs -   570.91 us/run -    65536 kB/run -  109.90 GB/s
  CPY(type_src=f32,type_dst=f32,ne=[3072,512,2,1],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]):                 3420 runs -   303.12 us/run -    24576 kB/run -   77.43 GB/s
  CPY(type_src=f32,type_dst=q4_0,ne=[8192,512,2,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]):                 675 runs -  1552.52 us/run -    37376 kB/run -   23.05 GB/s
  CPY(type_src=q4_0,type_dst=f32,ne=[8192,512,2,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]):                1350 runs -   743.66 us/run -    37376 kB/run -   47.96 GB/s

@ngxson ngxson merged commit 0dd58b6 into ggml-org:master Sep 19, 2025
54 of 55 checks passed
yael-works pushed a commit to yael-works/llama.cpp that referenced this pull request Oct 15, 2025
* ggml : refactor forward_dup for cpu backend

* clean up a bit

* add quant/dequant perf test
pwilkin pushed a commit to pwilkin/llama.cpp that referenced this pull request Oct 23, 2025
* ggml : refactor forward_dup for cpu backend

* clean up a bit

* add quant/dequant perf test