CUDA/HIP: add support for selectable warp size to mmv #11519
Conversation
Force-pushed from a151674 to 9a6a6ef
I don't like the addition of GGML_TRUE_WARP_SIZE much, but I can't see another way that doesn't:
I also don't know if
Force-pushed from 09b02bc to f5dd31f
Does gfx11 not also support w64, or does that not matter here?
RDNA can be run in wave64 mode, and on RDNA3 this can provide huge performance improvements, as RDNA3 can dual-issue halves of a 64-wide wave for some operations, doubling throughput in these instances. However, ROCm does not support RDNA in wave64 mode on HIP: the RDNA ISA lacks some 64-wide across-wave operations in wave64 mode that HIP requires. Regardless, if AMD somehow lifted this limitation and you compiled llama.cpp with `-mwavefrontsize64`, this PR would detect that we are now in wave64 mode and work fine. In reality you will probably never see more than half of peak throughput on RDNA3 in regular generic HIP code: either you have to use V_PK 2x32-bit instructions by hand in wave32 mode, or WMMA, which also internally dual-issues to the ALUs, where applicable.
Damn. They seem to have gfx11 just on the back burner for gfx94. Maybe we could open an issue on HIP just to see if it gets any attention at least.
Does this also work for GFX906 GPUs, like the Radeon VII/MI50/MI60?
Is this only applicable to FP16 models?
This only affects mmv; quantized models mostly use mmvq, so you should not expect anything for quantized models.
No, the ISA just doesn't support the required operations in wave64 mode; this is not something AMD can solve.
That's what I guessed, thank you.
Sure, it's possible; it's also the plan.
JohannesGaessler
left a comment
My preference would be to somehow define `constexpr int warp_size = 64` at the beginning of the kernel and then use that instead of the `WARP_SIZE` macro. How about this: define a function like `constexpr __device__ ggml_cuda_get_physical_warp_size` in `common.cuh` and make that function return 32 by default but 64 for specific AMD architectures and compile flags.
@JohannesGaessler done
Co-authored-by: Johannes Gäßler <[email protected]>
Author: Uvos
Hi @IMbackK, FYI, the warp size should be 128 for the MUSA SUDI and QY arch: https://docs.mthreads.com/musa-sdk/musa-sdk-doc-online/programming_guide/Chapter09
We can adjust the return of `ggml_cuda_get_physical_warp_size` to return 128 on MUSA, but someone will have to test this regularly when changes are made to expand its use, as I of course lack the hardware to do so.
This adds selectable warp size support to mmv to improve performance on devices with a warp size != 32.
Predictably, this improves performance on CDNA (and GCN).
Master:
PR:
And does nothing for RDNA2.
Master:
PR: