CUDA: add a fused top-K MoE kernel #16130
Conversation
Force-pushed from 4d6c41a to 7345668
Some more performance numbers
Force-pushed from 324ecbb to 613b6c3
I would recommend you cache the logits in registers from the start instead of reading the same data from VRAM twice.
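A minimal sketch of that suggestion, with hypothetical names (this is not the PR's actual kernel): each thread loads its slice of the row's logits into a register array once, and both the softmax pass and the later top-k pass read from that array instead of issuing a second round of global loads.

```cpp
// Hedged sketch only: cache the logits in registers up front.
// N_EXPERTS_PER_THREAD and WARP_SIZE (32) are assumed to be defined elsewhere.
template <int N_EXPERTS_PER_THREAD>
__global__ void topk_moe_sketch(const float * logits, float * weights, int * ids,
                                const int n_experts, const int n_expert_used) {
    const int row = blockIdx.x;                    // one token per block
    const float * row_logits = logits + row * n_experts;

    float cached[N_EXPERTS_PER_THREAD];            // lives in registers
    #pragma unroll
    for (int i = 0; i < N_EXPERTS_PER_THREAD; ++i) {
        const int e = threadIdx.x + i * WARP_SIZE;
        cached[i] = e < n_experts ? row_logits[e] : -INFINITY;  // single read from VRAM
    }

    // ... max/softmax and the n_expert_used selection iterations all operate on
    // `cached`, so the same logits are never re-read from global memory before
    // the final weights + ids are written out.
}
```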
Force-pushed from 613b6c3 to 17c9e7c
Did not see further perf improvements after the changes; TG still improves by 6-7%.
ggml/src/ggml-cuda/ggml-cuda.cu (outdated)
//special case for topk-moe
if (ops.size() == 5 && ops.begin()[0] == GGML_OP_SOFT_MAX && ops.begin()[1] == GGML_OP_RESHAPE && ops.begin()[2] == GGML_OP_ARGSORT
    && ops.begin()[3] == GGML_OP_VIEW && ops.begin()[4] == GGML_OP_GET_ROWS) {

    for (int i = 0; i < 5; i++) {
        if (cgraph->nodes[node_idx + i]->op != ops.begin()[i]) return false;
    }
Isn't all this redundant since it is performed in ggml_can_fuse?
I think the logic of this function should be that you always apply ggml_can_fuse first and only then do special-cases.
Edit: got it, the RESHAPE and the VIEW are problematic in this case.
Yeah, as I mentioned in #16102 (comment), if I have to remove the empty ops before passing to ggml_can_fuse then it would still be a special case.
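As a purely illustrative sketch (not code from the PR) of why that stays a special case: filtering the no-op RESHAPE/VIEW entries out of the op list is easy, but the graph nodes still contain them, so the node indices no longer line up with the filtered list.

```cpp
// Illustrative only, assuming `ops` and `cgraph` as in the snippet above.
std::vector<enum ggml_op> non_empty;
for (enum ggml_op op : ops) {               // ops = SOFT_MAX, RESHAPE, ARGSORT, VIEW, GET_ROWS
    if (op != GGML_OP_RESHAPE && op != GGML_OP_VIEW) {
        non_empty.push_back(op);
    }
}
// non_empty is now {SOFT_MAX, ARGSORT, GET_ROWS}, but cgraph->nodes[node_idx + i]
// still interleaves the RESHAPE/VIEW nodes, so a 1:1 index comparison against
// non_empty would be wrong - the pattern match has to stay a special case.
```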
Force-pushed from 17c9e7c to 7a258bf
Hmm, looks like this causes an illegal memory access when running …
Compile the CUDA code with …
Yeah, I did that; however, I don't understand it yet. It seems like some tensor is not getting propagated downstream properly.
It looks like the dimensions are not right when using …
Yes, there is no problem with calling …
It does not crash when doing … Changing this line to use n_expert_used makes everything work (line 545 in 8ba548d).
Interestingly, if I move …
I think the special-case path added in this PR does not have a bounds check for the number of nodes in the graph - could this be causing the illegal memory access?
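(For reference, such a bounds check could look roughly like the sketch below, assuming the 5-op matching loop shown earlier; this is not the PR's actual code.)

```cpp
// Sketch: refuse to fuse if the op window would run past the end of the graph.
if (node_idx + (int) ops.size() > cgraph->n_nodes) {
    return false;
}
for (size_t i = 0; i < ops.size(); ++i) {
    if (cgraph->nodes[node_idx + i]->op != ops.begin()[i]) {
        return false;
    }
}
```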
Added the bounds check and the problem is still there. Since it only happens on warmup and …
Does it still happen if you keep the build forward expand and remove the new fusing logic?
No, it doesn't happen if I remove the fusing logic and keep the build forward expand.
Force-pushed from 7a258bf to a275f10
OK, the bug was not handling ties properly in the kernel; after fixing that it all works. I'm not exactly sure why, though.
This kernel does the following:
1. softmax over the logits per token [n_experts, n_tokens]
2. argmax reduce over the top-k (n_experts_used) logits
3. write weights + ids to global memory

It is intended as a fusion of the softmax->top-k->get_rows pipeline for MoE models.
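To make the tie-break fix concrete, here is a rough sketch (illustrative, not the PR's actual code) of the selection step: a warp-wide argmax over the softmaxed logits where ties are broken toward the lower expert index, so that every lane of the warp converges on the same winner.

```cpp
// Sketch of one top-k selection step: butterfly warp argmax with a deterministic
// tie-break on the expert index. WARP_SIZE (32) is assumed to be defined elsewhere.
__device__ void warp_argmax(float & val, int & idx) {
    #pragma unroll
    for (int offset = WARP_SIZE / 2; offset > 0; offset >>= 1) {
        const float other_val = __shfl_xor_sync(0xffffffff, val, offset, WARP_SIZE);
        const int   other_idx = __shfl_xor_sync(0xffffffff, idx, offset, WARP_SIZE);
        // strictly larger value wins; on an exact tie the lower index wins,
        // which keeps the result identical across all lanes
        if (other_val > val || (other_val == val && other_idx < idx)) {
            val = other_val;
            idx = other_idx;
        }
    }
}
```

In a fused kernel of this shape, the selection would presumably be repeated n_expert_used times, masking out the already-chosen expert (e.g. by setting its cached logit to -INFINITY) before the next iteration, with the selected weights + ids written to global memory at the end.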
Force-pushed from 4b2d2b9 to 639e954
Force-pushed from 639e954 to e772b28
Force-pushed from 941bc9e to 53acfe6
@JohannesGaessler will you merge once CI passes?
As described in …
I don't have write access; let me ping you once the CI passes so you can merge.
The CI failures don't seem related, so this should be good to merge. @JohannesGaessler
* CUDA: add a fused top-K MoE kernel
  This kernel does the following:
  1. softmax over the logits per token [n_experts, n_tokens]
  2. argmax reduce over the top-k (n_experts_used) logits
  3. write weights + ids to global memory
  It is intended as a fusion of the softmax->top-k->get_rows pipeline for MoE models
* Refactor into ggml_cuda_should_use_topk_moe
* Review: use better coalescing pattern, use WARP_SIZE, store logits into registers before
* Review: format + micro-optimizations
* Fix bug: fix tie breakers
* Add optional norm + clean-up code
* Use smem for final write
* Add bounds check
* Use better memory pattern for writeback
This is similar to the CUDA shader from ggml-org#16130, but doesn't use shared memory and handles different subgroup sizes.
    }
}

bool ggml_cuda_should_use_topk_moe(const ggml_tensor * softmax, const ggml_tensor * weights) {
To be robust, this ought to check that there aren't any unanticipated uses of tensors whose writes are elided.
Sorry, what do you mean by "elided"?
Writes to intermediate tensors that are removed. This is why we have the ggml_node_has_n_uses checks in ggml_can_fuse - to make sure that nothing outside of this fusion region is using any of the intermediate tensors. But this fusion is complex enough that ggml_can_fuse can't handle it.
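For context, the kind of use-count check being referred to can be approximated like this (an illustrative sketch, not the actual ggml_node_has_n_uses implementation):

```cpp
// Illustrative sketch: count how many nodes in the graph read from `node`.
// If an intermediate tensor is used outside the fusion region (or is flagged as
// a graph output), its write cannot safely be elided.
static int count_uses(const struct ggml_cgraph * cgraph, const struct ggml_tensor * node) {
    int n_uses = 0;
    for (int i = 0; i < cgraph->n_nodes; ++i) {
        for (int s = 0; s < GGML_MAX_SRC; ++s) {
            if (cgraph->nodes[i]->src[s] == node) {
                n_uses++;
            }
        }
    }
    return n_uses;
}
```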
Yes, agreed. I'm working on a more generic ggml_can_fuse; I'll put up a draft PR soon and tag you.
Should be more useful in TG (token generation) than PP (prompt processing).