Conversation

@IMbackK (Collaborator) commented Jan 30, 2025

This adds selectable warp size support to mmv to improve performance on devices with a warp size != 32.

Predictably, this improves performance on CDNA (and GCN). A minimal sketch of the general approach appears after the benchmarks below.

Master:

  Device 0: AMD Instinct MI100, gfx908:sramecc+:xnack- (0x908), VMM: no, Wave Size: 64
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| llama 8B F16                   |  14.96 GiB |     8.03 B | ROCm       |  99 |          tg64 |         36.85 ± 0.10 |

PR:

  Device 0: AMD Instinct MI100, gfx908:sramecc+:xnack- (0x908), VMM: no, Wave Size: 64
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| llama 8B F16                   |  14.96 GiB |     8.03 B | ROCm       |  99 |          tg64 |         49.38 ± 0.10 |

And does nothing for RDNA2:

Master:

  Device 0: AMD Radeon RX 6800 XT, gfx1030 (0x1030), VMM: no, Wave Size: 32
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| llama 8B F16                   |  14.96 GiB |     8.03 B | ROCm       |  99 |          tg64 |         26.77 ± 0.07 |

PR:

  Device 0: AMD Radeon RX 6800 XT, gfx1030 (0x1030), VMM: no, Wave Size: 32
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| llama 8B F16                   |  14.96 GiB |     8.03 B | ROCm       |  99 |          tg64 |         26.81 ± 0.05 |
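
For illustration, here is a minimal sketch of the general approach (illustrative names and launch helper only, not the actual mmv kernel code): the wave width becomes a compile-time template parameter, and the host side picks the instantiation that matches the device's reported warp size.

```cpp
// Illustrative sketch only -- not the PR's actual mmv code.
// Idea: template the kernel on the wave width so the same source compiles to a
// 32-wide version (NVIDIA, RDNA) and a 64-wide version (GCN, CDNA).

template <int warp_size>
__global__ void mul_mat_vec_sketch(const float * x, const float * y, float * dst, const int ncols) {
    __shared__ float partial[warp_size];
    const int row  = blockIdx.x;   // one block (one wave) per output row
    const int lane = threadIdx.x;

    // Strided partial dot product over the row.
    float sum = 0.0f;
    for (int col = lane; col < ncols; col += warp_size) {
        sum += x[(size_t) row*ncols + col] * y[col];
    }
    partial[lane] = sum;
    __syncthreads();

    // Tree reduction across the wave, sized to the actual wave width.
    for (int offset = warp_size/2; offset > 0; offset >>= 1) {
        if (lane < offset) {
            partial[lane] += partial[lane + offset];
        }
        __syncthreads();
    }

    if (lane == 0) {
        dst[row] = partial[0];
    }
}

// Host side: pick the instantiation that matches the device's warp size
// (cudaDeviceProp::warpSize / hipDeviceProp_t::warpSize).
static void launch_mul_mat_vec_sketch(const float * x, const float * y, float * dst,
                                      int nrows, int ncols, int device_warp_size, cudaStream_t stream) {
    if (device_warp_size == 64) {
        mul_mat_vec_sketch<64><<<nrows, 64, 0, stream>>>(x, y, dst, ncols);
    } else {
        mul_mat_vec_sketch<32><<<nrows, 32, 0, stream>>>(x, y, dst, ncols);
    }
}
```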

@IMbackK IMbackK force-pushed the addWarpSize branch 2 times, most recently from a151674 to 9a6a6ef on January 30, 2025 16:47
@IMbackK (Collaborator, Author) commented Jan 30, 2025

I don't like the addition of GGML_TRUE_WARP_SIZE much, but I can't see another way that doesn't:

  1. require moving every kernel to selectable warp size at the same time, or
  2. lose developer intent by just hardcoding 32.
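
Roughly, the two-macro idea would look like this (illustrative sketch only; __AMDGCN_WAVEFRONT_SIZE__ is assumed to be the compiler-provided wave width and GGML_USE_HIP the existing backend guard):

```cpp
// Rough sketch of the two-macro idea (illustrative, not the PR's exact code).
#if defined(GGML_USE_HIP) && defined(__AMDGCN_WAVEFRONT_SIZE__)
#define GGML_TRUE_WARP_SIZE __AMDGCN_WAVEFRONT_SIZE__ // 64 on GCN/CDNA, 32 on RDNA
#else
#define GGML_TRUE_WARP_SIZE 32
#endif

// Kernels that have not yet been converted keep assuming a 32-wide warp:
#define WARP_SIZE 32
```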

@IMbackK (Collaborator, Author) commented Jan 30, 2025

I also don't know if #define GGML_TRUE_WARP_SIZE 32 is correct for MUSA.

The github-actions bot added the labels Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) on Jan 30, 2025
@IMbackK IMbackK force-pushed the addWarpSize branch 2 times, most recently from 09b02bc to f5dd31f on January 30, 2025 17:40
@Beinsezii (Contributor)

Does gfx11 not also support w64, or does that not matter here?

@IMbackK (Collaborator, Author) commented Jan 31, 2025

RDNA can be run in wave64 mode, and on RDNA3 this can provide huge performance improvements, as RDNA3 can dual-issue halves of a 64-wide wave for some operations, doubling throughput in these instances.

However, ROCm does not support RDNA in wave64 mode on HIP:

https://github.com/ROCm/HIP/blob/c1f7109cdd0e7921403cea649baf24a3c38cdd20/include/hip/hip_runtime.h#L40

The reason for this is that the RDNA ISA lacks some 64-wide across-wave operations in wave64 mode that HIP requires.

Regardless, if AMD somehow lifted this limitation and you compiled llama.cpp with '-mwavefrontsize64', this PR would detect that we are now in wave64 mode and work fine.

In reality, you will probably never see more than half of peak throughput on RDNA3 in regular generic HIP code. Either you have to use V_PK 2x32-bit instructions by hand in wave32 mode, or WMMA, which also internally dual-issues to the ALUs, where applicable.

@Beinsezii (Contributor)

Damn. They seem to have gfx11 just on the back burner for gfx94. Maybe we could open an issue on HIP just to see if it gets any attention at least.

@BlueSwordM commented Feb 1, 2025

Does this also work for GFX906 GPUs, like the Radeon VII/MI50/MI60?
I don't seem to be getting large speedups on my end:
cmake -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx906 -DBUILD_SHARED_LIBS=OFF

Device 0: AMD Radeon VII, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64

Is this only applicable to FP16 models?
I'm on ROCm 6.2.4 for reference, so far above the 5.5 requirement.
CachyOS (Arch) with 6.13.0 kernel.

@IMbackK (Collaborator, Author) commented Feb 1, 2025

> Is this only applicable to FP16 models? I'm on ROCm 6.2.4 for reference, so far above the 5.5 requirement. CachyOS (Arch) with 6.13.0 kernel.

This only affects mmv; quantized models mostly use mmvq, so you should not expect any change with quantized models.

> Damn. They seem to have gfx11 just on the back burner for gfx94. Maybe we could open an issue on HIP just to see if it gets any attention at least.

No, the ISA just doesn't support the required operations in wave64 mode; this is not something AMD can solve.

@BlueSwordM

> Is this only applicable to FP16 models? I'm on ROCm 6.2.4 for reference, so far above the 5.5 requirement. CachyOS (Arch) with 6.13.0 kernel.

> This only affects mmv; quantized models mostly use mmvq, so you should not expect any change with quantized models.

That's what I guessed, thank you.
Would it theoretically be possible to perform such an operation with mmvq, considering llama.cpp internally converts quantized int weights to FP16/FP32 at runtime? Is that even possible?

@IMbackK (Collaborator, Author) commented Feb 1, 2025

Sure, it's possible; it's also the plan.

@JohannesGaessler (Collaborator) left a comment

My preference would be to somehow define constexpr int warp_size = 64 at the beginning of the kernel and then use that instead of the WARP_SIZE macro. How about this: define a function like constexpr __device__ ggml_cuda_get_physical_warp_size in common.cuh and make that function return 32 by default but 64 for specific AMD architectures and compile flags.
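
A sketch of what such a helper could look like (the guards shown are assumptions, not necessarily the final code; __AMDGCN_WAVEFRONT_SIZE__ is the wave width reported by the HIP/clang device compiler and also reflects -mwavefrontsize64):

```cpp
// Sketch of the suggested helper (guards are assumptions, not necessarily the merged code).
static constexpr __device__ int ggml_cuda_get_physical_warp_size() {
#if defined(__HIP_DEVICE_COMPILE__) && defined(__AMDGCN_WAVEFRONT_SIZE__)
    return __AMDGCN_WAVEFRONT_SIZE__; // 64 on GCN/CDNA, 32 on RDNA (64 with -mwavefrontsize64)
#else
    return 32;                        // NVIDIA default
#endif
}

// Inside a kernel it would then replace the WARP_SIZE macro:
//   constexpr int warp_size = ggml_cuda_get_physical_warp_size();
```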

@IMbackK (Collaborator, Author) commented Feb 2, 2025

@JohannesGaessler done

Co-authored-by: Johannes Gäßler <[email protected]>
@IMbackK IMbackK merged commit 396856b into ggml-org:master Feb 2, 2025
46 checks passed
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Feb 4, 2025
CUDA/HIP: add support for selectable warp size to mmv

Author : Uvos
@BodhiHu (Contributor) commented Feb 5, 2025

> GGML_TRUE_WARP_SIZE

Hi @IMbackK, FYI, the warp size should be 128 for the MUSA SUDI and QY architectures:

https://docs.mthreads.com/musa-sdk/musa-sdk-doc-online/programming_guide/Chapter09

@IMbackK (Collaborator, Author) commented Feb 5, 2025

We can adjust ggml_cuda_get_physical_warp_size to return 128 on MUSA, but someone will have to test this regularly when changes are made to expand its use, as I of course lack the hardware to do so.
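
For example, the helper could gain a MUSA branch along these lines (untested sketch; GGML_USE_MUSA is assumed to be the backend's existing compile-time guard):

```cpp
// Untested sketch: extend the helper with a MUSA branch returning 128,
// per the MUSA programming guide linked above.
static constexpr __device__ int ggml_cuda_get_physical_warp_size() {
#if defined(GGML_USE_MUSA)
    return 128; // MUSA SUDI/QY
#elif defined(__HIP_DEVICE_COMPILE__) && defined(__AMDGCN_WAVEFRONT_SIZE__)
    return __AMDGCN_WAVEFRONT_SIZE__;
#else
    return 32;
#endif
}
```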

tinglou pushed a commit to tinglou/llama.cpp that referenced this pull request Feb 13, 2025
CUDA/HIP: add support for selectable warp size to mmv
orca-zhang pushed a commit to orca-zhang/llama.cpp that referenced this pull request Feb 26, 2025
CUDA/HIP: add support for selectable warp size to mmv
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Feb 26, 2025
CUDA/HIP: add support for selectable warp size to mmv
mglambda pushed a commit to mglambda/llama.cpp that referenced this pull request Mar 8, 2025
CUDA/HIP: add support for selectable warp size to mmv