CUDA: MoE helper in device code, better tile sizes #15525
Conversation
Looking at the code I realized that it's possible to reduce the number of tiles/CUDA blocks that need to be considered for MoE. These tiles/CUDA blocks would be skipped regardless, but you can squeeze out a few more percent if you never need to do that check in the first place; see the sketch below. (I updated the table in the OP.)
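For illustration, a minimal sketch of the difference (not the actual MMQ code; needs_work, kernel_full_grid, and kernel_tight_grid are made-up names): launching a full grid and returning early still pays for scheduling every block and evaluating the check, whereas sizing the grid to only the tiles that matter avoids both.

```cuda
#include <cuda_runtime.h>

// Hypothetical predicate: does this tile map to any rows of a used expert?
static __device__ bool needs_work(int tile) {
    return tile % 4 == 0; // stand-in condition, for illustration only
}

// Variant 1: one block per potential tile with an early-out.
// Skipped blocks are cheap, but they still get scheduled and still
// evaluate the check.
__global__ void kernel_full_grid(float * dst, int n_tiles) {
    const int tile = blockIdx.x;
    if (tile >= n_tiles || !needs_work(tile)) {
        return;
    }
    dst[tile] = 1.0f; // placeholder for the real tile computation
}

// Variant 2: precompute a compact list of the tiles that actually need
// work and launch exactly that many blocks - no per-block early-out.
__global__ void kernel_tight_grid(float * dst, const int * tile_list, int n_active) {
    const int i = blockIdx.x;
    if (i >= n_active) {
        return;
    }
    dst[tile_list[i]] = 1.0f; // placeholder for the real tile computation
}
```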
On CDNA, this PR causes crashes or (random) failures of various tests such as ./bin/test-backend-ops -p type_a=q8_0,type_b=f32,n_mats=4,n_used=2,b=0,m=512,n=32,k=256,o=1
as well as crashes in model inference.
ggml/src/ggml-cuda/common.cuh
Outdated
template<int width = WARP_SIZE>
static __device__ __forceinline__ int warp_reduce_all(int x) {
-#ifdef GGML_USE_HIP
+#ifndef GGML_USE_HIP
Why restrict this to CUDA? Might as well use __all on HIP and check ggml_cuda_get_physical_warp_size; even if we never use this with width == 64, that should still help RDNA.
I just wasn't aware of the __all instruction.
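For context, a minimal sketch of what warp-vote-based helpers could look like (not the exact code in common.cuh; WARP_SIZE = 32 is assumed and the HIP branch uses __all/__any as suggested above):

```cuda
#define WARP_SIZE 32

template<int width = WARP_SIZE>
static __device__ __forceinline__ int warp_reduce_all(int x) {
#ifdef GGML_USE_HIP
    // HIP exposes the __all warp vote function, which operates on the full
    // wavefront (64 lanes on CDNA, 32 on RDNA by default).
    return __all(x);
#else
    static_assert(width == WARP_SIZE, "width != WARP_SIZE not covered by this sketch");
    return __all_sync(0xFFFFFFFF, x);
#endif // GGML_USE_HIP
}

template<int width = WARP_SIZE>
static __device__ __forceinline__ int warp_reduce_any(int x) {
#ifdef GGML_USE_HIP
    return __any(x);
#else
    static_assert(width == WARP_SIZE, "width != WARP_SIZE not covered by this sketch");
    return __any_sync(0xFFFFFFFF, x);
#endif // GGML_USE_HIP
}
```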
ggml/src/ggml-cuda/common.cuh
Outdated
template<int width = WARP_SIZE>
static __device__ __forceinline__ int warp_reduce_any(int x) {
#ifndef GGML_USE_HIP
same
I think the problem was that I was misusing the warp reduce functions. I forgot that the default width is WARP_SIZE rather than the physical warp size.
Yup, warp_reduce_any<warp_size> is needed.
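Concretely, the call site has to request the physical warp width instead of relying on the default template argument. A sketch, assuming the warp_reduce_any helper above and that ggml_cuda_get_physical_warp_size() can be used as a compile-time constant as in common.cuh:

```cuda
__global__ void vote_example(const int * flags, int * out) {
    // 64 on CDNA, 32 elsewhere; must be a compile-time constant here.
    constexpr int warp_size = ggml_cuda_get_physical_warp_size();

    const int x = flags[threadIdx.x];

    // warp_reduce_any(x) would use the default width WARP_SIZE (32), so on
    // CDNA only half of the 64-lane wavefront would take part in the vote.
    // Passing the physical warp size reduces across the full warp:
    out[threadIdx.x] = warp_reduce_any<warp_size>(x);
}
```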
Should be ok now I think, let me check.
Correctness is fine and performance looks fine too on CDNA.
I'm noticing that for very large batch sizes there are issues with shared memory limits. On Pascal it should be …
Previously I was storing the token position in the physical batch and the expert index as 32-bit integers. However, that many bits are not needed: if you store the token position with 22 bits and the expert index with 10 bits, you only need half as much shared memory and the maximum physical batch size becomes twice as high as outlined in my previous post, which I think will be enough. The assigned bits only become insufficient for a physical batch size above 4M or more than 1024 used experts, which should be well above anything the code needs to handle. A sketch of the packing follows.
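The packing itself is simple; a minimal sketch of the idea (field order and helper names are illustrative, not necessarily what the kernel uses):

```cuda
#include <cstdint>

// Pack the token position in the physical batch (22 bits, < 4M) and the
// expert index (10 bits, < 1024) into a single 32-bit value, halving the
// shared memory needed per entry compared to two 32-bit integers.
static __host__ __device__ __forceinline__ uint32_t pack_token_expert(uint32_t token_pos, uint32_t expert) {
    return (token_pos << 10) | (expert & 0x3FF);
}

static __host__ __device__ __forceinline__ uint32_t unpack_token_pos(uint32_t packed) {
    return packed >> 10;
}

static __host__ __device__ __forceinline__ uint32_t unpack_expert(uint32_t packed) {
    return packed & 0x3FF;
}
```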
* CUDA: MoE helper in device code, better tile sizes
* reduce superfluous CUDA blocks
This PR moves the helper code for the MoE MMQ kernel from host code to device code. This eliminates the need for synchronization. It also sets a tighter bound for the maximum number of columns that the kernel selection logic optimizes for, so there is less waste at small batch sizes.
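A minimal sketch of the device-side idea (names, layout, and the kernel itself are illustrative, not the ggml implementation): build the per-expert row lists directly on the GPU, so the expert ids never have to be copied back to the host before launching the MMQ kernel.

```cuda
#include <cstdint>

// For every (token, slot) pair, append its index to the row list of the
// expert it was routed to. Doing this in a kernel instead of on the host
// removes the device-to-host copy of the expert ids and the synchronization
// that goes with it. The order of rows within an expert is nondeterministic,
// which is fine for a sketch.
__global__ void build_expert_rows(
        const int32_t * ids,                  // [n_tokens*n_used] expert id per token/slot
        int32_t       * counts,               // [n_experts] rows per expert, pre-zeroed
        int32_t       * rows,                 // [n_experts*max_rows_per_expert] compacted row ids
        const int       n_tokens,
        const int       n_used,
        const int       max_rows_per_expert) {
    const int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i >= n_tokens*n_used) {
        return;
    }

    const int expert = ids[i];
    const int slot   = atomicAdd(&counts[expert], 1);
    rows[expert*max_rows_per_expert + slot] = i; // which (token, slot) this expert handles
}
```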
Performance changes
@IMbackK if you could also check performance that would be appreciated.