vulkan: Submit once enough matmul work has been recorded #12406
Conversation
Force-pushed c065fc5 to 8bd64be
Interesting. I tested this and can reproduce the uplift you report, but also a number of regressions in non-FA cases, especially with smaller models. I'm not sure if this is an actual problem or just a difference in an extreme case (big GPU, small model) that is unlikely in practice.

(benchmark tables for RTX 3090, AMD Radeon Pro VII, and Intel A770 omitted)
Interesting, maybe I need to scale down the threshold for smaller models. I'll poke around at it and get back to you.
I've been seeing significantly worse performance for tg with flash attention enabled vs. disabled, and it seems to be related to the submit heuristic. Change the heuristic to track how many bytes of weight matrices have been used, flush every 100 MB, and ramp the threshold up over the first few submits. This seems to resolve the issue, and also increases perf for non-FA a bit.
Force-pushed 8bd64be to 656c97f
I think I've found a good scale factor, @0cc4m please try again.
It's a little better, yeah.
This is what I had measured for small models on the 3090:

(benchmark table omitted)
The only explanation I can think of is that my system uses an AMD EPYC 7302, which has rather low single-core performance. That could mean it needs more time to record and submit command buffers, exaggerating the difference. I think it's fine to merge regardless, since large models are working well.
0cc4m left a comment
Any idea when coopmat2 will be in a release driver?
It'll be in the 575 release. I can't comment specifically on when that'll be out, but in general it tends to be a few months between major releases.
Perf on RTX 4070:

(benchmark table omitted)