@jeffbolznv jeffbolznv commented Nov 11, 2024

Split out from #10206, but the solution I went with is a bit different.

Add a variant of the copy shader for contiguous tensors. It avoids the complex addressing calculations and processes four elements per invocation to hide some other overhead. In #10206, the matrix multiply is much faster when the B matrix is fp16, so many of these contiguous copies are needed to do that conversion.

Apply similar changes to the scale shader, since scale is always contiguous.
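
To make the difference between the two paths concrete, here is a rough CPU-side model in Python. It is not the shader code from this PR. The function names, the element-based strides, and the way four consecutive elements are assigned to each invocation are all assumptions made for illustration; the real shader may distribute elements across invocations differently.

```python
ELEMS_PER_INVOCATION = 4  # "four elements per invocation", per the PR description


def copy_general(src, dst, ne, src_strides, dst_strides):
    """General path: decode a 4-D index per element, then apply strides.

    Strides are expressed in elements here (an assumption for this sketch).
    """
    ne0, ne1, ne2, ne3 = ne
    for idx in range(ne0 * ne1 * ne2 * ne3):
        # These div/mod operations are the "complex addressing calculations".
        i0 = idx % ne0
        i1 = (idx // ne0) % ne1
        i2 = (idx // (ne0 * ne1)) % ne2
        i3 = idx // (ne0 * ne1 * ne2)
        coords = (i0, i1, i2, i3)
        s = sum(c * st for c, st in zip(coords, src_strides))
        d = sum(c * st for c, st in zip(coords, dst_strides))
        dst[d] = src[s]


def copy_contiguous(src, dst, n):
    """Contiguous path: the flat index is the address, so no div/mod is needed."""
    num_invocations = (n + ELEMS_PER_INVOCATION - 1) // ELEMS_PER_INVOCATION
    for inv in range(num_invocations):  # one iteration models one invocation
        base = inv * ELEMS_PER_INVOCATION
        for k in range(ELEMS_PER_INVOCATION):
            i = base + k
            if i < n:  # guard the tail when n is not a multiple of four
                dst[i] = src[i]


# Both paths agree when the tensor really is contiguous.
ne = (3, 2, 1, 1)
contiguous_strides = (1, 3, 6, 6)
src = [float(v) for v in range(6)]
out_general = [None] * 6
out_contig = [None] * 6
copy_general(src, out_general, ne, contiguous_strides, contiguous_strides)
copy_contiguous(src, out_contig, len(src))
```

The contiguous path drops the index decode entirely and spreads the per-invocation setup cost over four elements, which is the intuition behind the change described above.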

The first commit fixes a bug in test-backend-ops perf where it computed the memory footprint of one iteration but then divided by the total time for all iterations.
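
A small sketch of the kind of mistake described above, written in Python rather than taken from test-backend-ops. The function names are hypothetical, and the use of 1024^3 bytes per GB is an assumption that happens to reproduce the logged figures.

```python
GIB = 1024 ** 3  # assumed unit behind the "GB/s" column


def bandwidth_buggy(bytes_per_run, runs, total_time_s):
    """One iteration's footprint divided by the time for all iterations."""
    return bytes_per_run / total_time_s / GIB


def bandwidth_fixed(bytes_per_run, runs, total_time_s):
    """Footprint of all iterations divided by the time for all iterations."""
    return bytes_per_run * runs / total_time_s / GIB


# Numbers from the first "Before" CPY line on the RTX 4070:
# 80102 runs, 12.86 us/run, 4608 kB/run.
runs = 80102
bytes_per_run = 4608 * 1024
total_time_s = runs * 12.86e-6

fixed = bandwidth_fixed(bytes_per_run, runs, total_time_s)
buggy = bandwidth_buggy(bytes_per_run, runs, total_time_s)
print(f"fixed: {fixed:.2f} GB/s, buggy: {buggy:.6f} GB/s")
```

With the mismatch, the reported figure is too small by a factor equal to the number of runs, so it also varies with how many iterations the harness happens to execute.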

Before/after on RTX 4070. In the after numbers, the larger copies are more or less framebuffer bandwidth-limited, and the smaller copies are hitting in L2.

Before:
  CPY(type_src=f32,type_dst=f16,ne=[256,3072,1,1],permute=[0,0,0,0]):                  80102 runs -    12.86 us/run -     4608 kB/run -  341.72 GB/s
  CPY(type_src=f32,type_dst=f16,ne=[512,3072,1,1],permute=[0,0,0,0]):                  43692 runs -    22.91 us/run -     9216 kB/run -  383.71 GB/s
  CPY(type_src=f32,type_dst=f16,ne=[4096,3072,1,1],permute=[0,0,0,0]):                  5016 runs -   206.73 us/run -    73728 kB/run -  340.61 GB/s
  CPY(type_src=f32,type_dst=f16,ne=[16384,16384,1,1],permute=[0,0,0,0]):                 198 runs -  5364.71 us/run -  1572864 kB/run -  288.08 GB/s
  SCALE(type=f32,ne=[256,3072,1,1],scale=2.000000):                    81930 runs -    12.85 us/run -     6144 kB/run -  456.08 GB/s
  SCALE(type=f32,ne=[512,3072,1,1],scale=2.000000):                    43696 runs -    23.07 us/run -    12288 kB/run -  508.00 GB/s
  SCALE(type=f32,ne=[4096,3072,1,1],scale=2.000000):                    4446 runs -   236.93 us/run -    98304 kB/run -  396.26 GB/s
  SCALE(type=f32,ne=[16384,16384,1,1],scale=2.000000):                   204 runs -  4983.04 us/run -  2097152 kB/run -  413.17 GB/s

After:
  CPY(type_src=f32,type_dst=f16,ne=[256,3072,1,1],permute=[0,0,0,0]):                 233024 runs -     4.41 us/run -     4608 kB/run -  997.23 GB/s
  CPY(type_src=f32,type_dst=f16,ne=[512,3072,1,1],permute=[0,0,0,0]):                 160204 runs -     6.34 us/run -     9216 kB/run - 1387.29 GB/s
  CPY(type_src=f32,type_dst=f16,ne=[4096,3072,1,1],permute=[0,0,0,0]):                  6384 runs -   163.27 us/run -    73728 kB/run -  431.28 GB/s
  CPY(type_src=f32,type_dst=f16,ne=[16384,16384,1,1],permute=[0,0,0,0]):                 264 runs -  3908.14 us/run -  1572864 kB/run -  395.45 GB/s
  SCALE(type=f32,ne=[256,3072,1,1],scale=2.000000):                   185708 runs -     5.49 us/run -     6144 kB/run - 1067.64 GB/s
  SCALE(type=f32,ne=[512,3072,1,1],scale=2.000000):                   114702 runs -     8.73 us/run -    12288 kB/run - 1342.44 GB/s
  SCALE(type=f32,ne=[4096,3072,1,1],scale=2.000000):                    4788 runs -   220.35 us/run -    98304 kB/run -  426.08 GB/s
  SCALE(type=f32,ne=[16384,16384,1,1],scale=2.000000):                   204 runs -  5030.21 us/run -  2097152 kB/run -  409.29 GB/s
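
For anyone reading the logs, the kB/run column is consistent with counting one read of the source plus one write of the destination per element. The Python sketch below reproduces that column under this assumption; the helper name and the type-size table are hypothetical and are not taken from the test harness.

```python
TYPE_SIZE_BYTES = {"f32": 4, "f16": 2}


def footprint_kb(ne, type_src, type_dst):
    """Bytes read plus bytes written for one run, in units of 1024 bytes."""
    n = 1
    for dim in ne:
        n *= dim
    total_bytes = n * (TYPE_SIZE_BYTES[type_src] + TYPE_SIZE_BYTES[type_dst])
    return total_bytes // 1024


cpy_small = footprint_kb([256, 3072, 1, 1], "f32", "f16")      # CPY f32 -> f16
cpy_large = footprint_kb([16384, 16384, 1, 1], "f32", "f16")
scale_small = footprint_kb([256, 3072, 1, 1], "f32", "f32")    # SCALE reads and writes f32
scale_large = footprint_kb([16384, 16384, 1, 1], "f32", "f32")
print(cpy_small, cpy_large, scale_small, scale_large)
```

These values line up with the 4608, 1572864, 6144, and 2097152 kB/run entries in the logs above. The two smallest footprints are only a few megabytes, which fits the observation that the smaller copies hit in L2.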

Add a flops calculation for flash attention.

Add one GGML_OP_CPY perf test.
@jeffbolznv jeffbolznv requested a review from 0cc4m November 11, 2024 19:10
@github-actions github-actions bot added testing Everything test related Vulkan Issues specific to the Vulkan backend ggml changes relating to the ggml tensor library for machine learning labels Nov 11, 2024
Add a variant of the copy shader for when the tensors are contiguous. Avoid
the complex addressing calculations, and do four elements per invocation
to hide some other overhead.

Apply similar changes to the scale shader, since scale is always contiguous.

Add a "progress bar" for shader compiles.

0cc4m commented Nov 12, 2024

I did some more benchmarks.

RTX 3090:

Before:
  CPY(type_src=f32,type_dst=f16,ne=[256,3072,1,1],permute=[0,0,0,0]):                  94666 runs -    10.66 us/run -     4608 kB/run -  412.15 GB/s
  CPY(type_src=f32,type_dst=f16,ne=[512,3072,1,1],permute=[0,0,0,0]):                  47333 runs -    21.58 us/run -     9216 kB/run -  407.36 GB/s
  CPY(type_src=f32,type_dst=f16,ne=[4096,3072,1,1],permute=[0,0,0,0]):                  6840 runs -   154.11 us/run -    73728 kB/run -  456.91 GB/s
  CPY(type_src=f32,type_dst=f16,ne=[16384,16384,1,1],permute=[0,0,0,0]):                 330 runs -  3222.34 us/run -  1572864 kB/run -  479.61 GB/s

After:
  CPY(type_src=f32,type_dst=f16,ne=[256,3072,1,1],permute=[0,0,0,0]):                 276716 runs -     3.70 us/run -     4608 kB/run - 1188.59 GB/s
  CPY(type_src=f32,type_dst=f16,ne=[512,3072,1,1],permute=[0,0,0,0]):                  80102 runs -    12.60 us/run -     9216 kB/run -  697.57 GB/s
  CPY(type_src=f32,type_dst=f16,ne=[4096,3072,1,1],permute=[0,0,0,0]):                 11400 runs -    91.06 us/run -    73728 kB/run -  773.32 GB/s
  CPY(type_src=f32,type_dst=f16,ne=[16384,16384,1,1],permute=[0,0,0,0]):                 550 runs -  1851.64 us/run -  1572864 kB/run -  834.64 GB/s

Tesla P40:

Before:
  CPY(type_src=f32,type_dst=f16,ne=[256,3072,1,1],permute=[0,0,0,0]):                  29128 runs -    40.55 us/run -     4608 kB/run -  108.38 GB/s
  CPY(type_src=f32,type_dst=f16,ne=[512,3072,1,1],permute=[0,0,0,0]):                  14564 runs -    75.19 us/run -     9216 kB/run -  116.91 GB/s
  CPY(type_src=f32,type_dst=f16,ne=[4096,3072,1,1],permute=[0,0,0,0]):                  1824 runs -   577.92 us/run -    73728 kB/run -  121.84 GB/s
  CPY(type_src=f32,type_dst=f16,ne=[16384,16384,1,1],permute=[0,0,0,0]):                  88 runs - 12307.07 us/run -  1572864 kB/run -  125.57 GB/s

After:
  CPY(type_src=f32,type_dst=f16,ne=[256,3072,1,1],permute=[0,0,0,0]):                  58256 runs -    18.64 us/run -     4608 kB/run -  235.76 GB/s
  CPY(type_src=f32,type_dst=f16,ne=[512,3072,1,1],permute=[0,0,0,0]):                  29128 runs -    35.05 us/run -     9216 kB/run -  250.77 GB/s
  CPY(type_src=f32,type_dst=f16,ne=[4096,3072,1,1],permute=[0,0,0,0]):                  4104 runs -   255.77 us/run -    73728 kB/run -  275.31 GB/s
  CPY(type_src=f32,type_dst=f16,ne=[16384,16384,1,1],permute=[0,0,0,0]):                 198 runs -  5384.78 us/run -  1572864 kB/run -  287.00 GB/s

Radeon Pro VII:

Before:
  CPY(type_src=f32,type_dst=f16,ne=[256,3072,1,1],permute=[0,0,0,0]):                  36410 runs -    29.78 us/run -     4608 kB/run -  147.60 GB/s
  CPY(type_src=f32,type_dst=f16,ne=[512,3072,1,1],permute=[0,0,0,0]):                  18205 runs -    67.49 us/run -     9216 kB/run -  130.26 GB/s
  CPY(type_src=f32,type_dst=f16,ne=[4096,3072,1,1],permute=[0,0,0,0]):                  2736 runs -   385.51 us/run -    73728 kB/run -  182.65 GB/s
  CPY(type_src=f32,type_dst=f16,ne=[16384,16384,1,1],permute=[0,0,0,0]):                 132 runs -  7695.48 us/run -  1572864 kB/run -  200.83 GB/s

After:
  CPY(type_src=f32,type_dst=f16,ne=[256,3072,1,1],permute=[0,0,0,0]):                 131076 runs -     7.68 us/run -     4608 kB/run -  571.95 GB/s
  CPY(type_src=f32,type_dst=f16,ne=[512,3072,1,1],permute=[0,0,0,0]):                  76461 runs -    13.51 us/run -     9216 kB/run -  650.84 GB/s
  CPY(type_src=f32,type_dst=f16,ne=[4096,3072,1,1],permute=[0,0,0,0]):                 10944 runs -    91.95 us/run -    73728 kB/run -  765.79 GB/s
  CPY(type_src=f32,type_dst=f16,ne=[16384,16384,1,1],permute=[0,0,0,0]):                 550 runs -  1885.12 us/run -  1572864 kB/run -  819.82 GB/s

Looks like a good improvement all around.

Edit: Similar improvements to SCALE.
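
To put a number on the improvement, the speedup for the largest copy (ne=[16384,16384,1,1]) can be read off the GB/s columns above. The short Python snippet below only does that arithmetic on the logged values; the dictionary layout is illustrative.

```python
# (before GB/s, after GB/s) for CPY f32 -> f16, ne=[16384,16384,1,1]
largest_copy = {
    "RTX 3090": (479.61, 834.64),
    "Tesla P40": (125.57, 287.00),
    "Radeon Pro VII": (200.83, 819.82),
}

speedups = {gpu: after / before for gpu, (before, after) in largest_copy.items()}
for gpu, factor in speedups.items():
    print(f"{gpu}: {factor:.2f}x")
```

The ratios come out to roughly 1.7x, 2.3x, and 4.1x respectively.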

@0cc4m 0cc4m left a comment


The Vulkan changes look good to me, and I tested them successfully on Nvidia and AMD. From my side this can be merged.

@ggerganov Are the test-backend-ops changes fine?

@0cc4m 0cc4m merged commit 80dd7ff into ggml-org:master Nov 13, 2024
53 checks passed
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024
* tests: Fix memory bandwidth calculation for perf tests

Add a flops calculation for flash attention.

Add one GGML_OP_CPY perf test.

* vulkan: Optimize contiguous copies

Add a variant of the copy shader for when the tensors are contiguous. Avoid
the complex addressing calculations, and do four elements per invocation
to hide some other overhead.

Apply similar changes to the scale shader, since scale is always contiguous.

Add a "progress bar" for shader compiles.