CUDA: fuse gate + up for mmvq, and mmvf #16630
base: master
Conversation
Without having looked at this PR, consider also fusing the Q, K, and V matrix multiplications into a single, batched operation. It's not going to reduce I/O but it's going to reduce kernel launch overhead and tail effects.
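A minimal sketch of that idea, assuming the three weight matrices can be stacked into one (illustrative only, not code from this PR or from ggml): a single matrix-vector kernel produces Q, K, and V together, so there is one launch instead of three while the total I/O stays the same.

```cpp
// Hypothetical reference sketch: W_q, W_k, W_v stacked row-wise into one
// [3*n x k] matrix so a single kernel launch computes all three projections.
// I/O volume is unchanged; only launch overhead and tail effects shrink.
void qkv_single_launch_ref(const float * w_qkv, // stacked [3*n x k], row-major
                           const float * x,     // input vector, length k
                           float * qkv,         // output, length 3*n
                           int n, int k) {
    for (int i = 0; i < 3*n; ++i) {
        float acc = 0.0f;
        for (int j = 0; j < k; ++j) {
            acc += w_qkv[i*k + j] * x[j];
        }
        qkv[i] = acc; // rows [0,n) = Q, [n,2n) = K, [2n,3n) = V
    }
}
```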
struct test_fused_ffn_gate : public test_case {
Would it make sense to extend the test case for matrix multiplication instead?
I think it is fine, because it has a switch for mul_mat_id as well
Note that with quantized types and …
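For context on the discussion above, here is a hedged sketch of the graph such a gate + up test case would build (illustrative names, not the actual test-backend-ops code), assuming the SWIGLU, no-bias pattern described in this PR:

```cpp
#include "ggml.h"

// Illustrative only: two mat-vec products on the same input followed by
// SWIGLU (silu(gate) * up), i.e. the pattern the fused path should match.
ggml_tensor * build_ffn_gate_graph(ggml_context * ctx,
                                   ggml_tensor * w_gate, // [n_embd, n_ff]
                                   ggml_tensor * w_up,   // [n_embd, n_ff]
                                   ggml_tensor * x) {    // [n_embd, 1]
    ggml_tensor * gate = ggml_mul_mat(ctx, w_gate, x);   // [n_ff, 1]
    ggml_tensor * up   = ggml_mul_mat(ctx, w_up,   x);   // [n_ff, 1]
    return ggml_mul(ctx, ggml_silu(ctx, gate), up);      // SWIGLU, no bias
}
```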
It looks like the newly added tests for f16 and f32 are failing on the CI for the Tesla T4 GPU (CUDA 13) by quite a large amount. I notice that these might be the only tests with m=1 for mul_mat. On a rented T4 (CUDA 12.6) I don't see this problem, so it's either a CUDA version thing or an alignment thing. EDIT: it was neither, just a normal bug in not initializing …
This PR adds support for fusing mmvq and mmvf with an optional gate and GLU. Currently it only supports SWIGLU with no bias, which is by far the most common pattern. Perf gains in TG are 4-9% for quantized models, and smaller for fp models.
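As a rough mental model of the fusion (a reference sketch under the PR description's assumptions, not the actual CUDA kernels): per output element, the fused kernel computes the gate and up dot products in one pass and applies the GLU immediately, so the intermediate gate/up vectors never round-trip through global memory and three kernel launches become one.

```cpp
#include <cmath>

// Reference sketch of the fused pattern (not the PR's mmvq/mmvf kernels):
// out[i] = silu(dot(w_gate[i], x)) * dot(w_up[i], x), with no bias.
// Computing both dot products per output element in one kernel removes the
// intermediate gate/up vectors and the separate GLU launch.
static float silu(float v) { return v / (1.0f + std::exp(-v)); }

void fused_gate_up_swiglu_ref(const float * w_gate, // [n x k], row-major
                              const float * w_up,   // [n x k], row-major
                              const float * x,      // input vector, length k
                              float * out,          // output, length n
                              int n, int k) {
    for (int i = 0; i < n; ++i) {
        float g = 0.0f, u = 0.0f;
        for (int j = 0; j < k; ++j) {          // single pass for both rows
            g += w_gate[i*k + j] * x[j];
            u += w_up  [i*k + j] * x[j];
        }
        out[i] = silu(g) * u;                  // SWIGLU
    }
}
```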
After #16130 and this PR, ggml_can_fuse is too primitive to support fusion. What we want is a self-contained DAG with one exit point, where views are not used elsewhere in the graph. I will create a future PR for that.

Performance on a 4090