metal: TRI, FILL, EXPM1, SOFTPLUS #16623
base: master
Conversation
|
Yikes, looks like there are some alternate platforms that will need to be handled. I'll dig through some of these failures.
|
My opinion is that we should assert that the CPU implementation is correct before we move towards reviewing and merging this PR.
I think @gabe-l-hart moved the TRI and CUMSUM CPU implementations to this PR as well, so I guess it's a question of adding some test cases? Those ops aren't very hard, and as far as basic correctness goes I think I've verified them quite extensively during my fights with Qwen3 Next. I've mimicked the basic logic for CUMSUM to be the same as SUM_ROWS, so it's basically always done on the first dimension.
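For reference, here is a minimal sketch of the CUMSUM semantics described above, written as plain C over a contiguous row-major float buffer. The function and parameter names (cumsum_rows, ne0, nrows) are illustrative, not the actual ggml code: like SUM_ROWS, the op runs along the first dimension, independently for each row.

```c
// A minimal sketch, assuming a contiguous row-major float buffer
// (illustrative only, not the actual ggml CUMSUM implementation).
static void cumsum_rows(float *dst, const float *src, int ne0, int nrows) {
    for (int r = 0; r < nrows; ++r) {
        float acc = 0.0f;
        for (int i = 0; i < ne0; ++i) {
            acc += src[r*ne0 + i];
            dst[r*ne0 + i] = acc; // running prefix sum within the row
        }
    }
}
```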
|
The thing is though that as of right now those ops aren't used anywhere on master. My opinion is that the new ops should be added in tandem with the model that needs them just in case it turns out that further changes are needed (even if it's unlikely).
|
@JohannesGaessler That makes total sense. I'm continuing to work towards the SSD formulation for SSM_SCAN, so this PR is really just a checkpoint for the primitive ops. My goal is better performance for Granite 4, which is why I went ahead with Metal and CUDA here.
@gabe-l-hart FYI, I've worked on the implementation of SSM_SCAN for the Vulkan backend (#16463), and that indeed helped with Granite 4.
|
@giuseppe I saw that yesterday! That support will help a ton. The SSD formulation is an additional optimization on top of the recurrent formulation that should have big benefits for prefill with long context. It's mathematically equivalent, but much more efficient. The challenge I'm working through right now is how best to decompose the problem to minimize impact. One option is to write it as a sub-graph composed of smaller ops that is used when the sequence length is > 1, but this would have two problems:
The alternative is to implement it inside a single backend's SSM_SCAN. This has the advantage of being self-contained so it can be done on a per-backend basis, but it has the inverse problem of requiring each backend to implement it separately in order to get the performance boost. It's also much harder to write as a single kernel since it involves allocating temporary tensors of different size than the input or output tensors.
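To make the "mathematically equivalent, but much more efficient" claim concrete, below is a hedged toy sketch in plain C: a scalar-state recurrence computed both as a sequential scan and in chunks via a lower-triangular decay matrix (the kind of structure TRI and CUMSUM help build). Everything here (T, Q, a, b, c, x, the scalar state) is an illustrative simplification, not the actual SSM_SCAN tensors or any backend's implementation.

```c
// Toy, scalar-state illustration of why the chunked (SSD-style) form matches
// the recurrent scan. All names and sizes here are assumptions for
// illustration, not taken from ggml.
#include <stdio.h>

#define T 8 // sequence length
#define Q 4 // chunk size (assumed to divide T)

int main(void) {
    float a[T], b[T], c[T], x[T];
    for (int t = 0; t < T; ++t) {
        a[t] = 0.9f; b[t] = 0.5f; c[t] = 1.0f; x[t] = (float)(t + 1);
    }

    // 1) recurrent form: h_t = a_t*h_{t-1} + b_t*x_t, y_t = c_t*h_t
    float y_rec[T], h = 0.0f;
    for (int t = 0; t < T; ++t) {
        h = a[t]*h + b[t]*x[t];
        y_rec[t] = c[t]*h;
    }

    // 2) chunked form: per chunk, build a lower-triangular decay matrix
    //    (the role a TRI/CUMSUM-style construction would play), apply it
    //    like a matmul, and carry the state across chunk boundaries.
    float y_chk[T];
    float h_carry = 0.0f; // state at the end of the previous chunk
    for (int s0 = 0; s0 < T; s0 += Q) {
        float L[Q][Q] = {{0}};
        for (int i = 0; i < Q; ++i) {
            for (int j = 0; j <= i; ++j) {
                float decay = 1.0f;
                for (int k = j + 1; k <= i; ++k) decay *= a[s0 + k];
                L[i][j] = decay; // product of decays between steps j and i
            }
        }
        for (int i = 0; i < Q; ++i) {
            // inter-chunk term: previous state decayed up to step i
            float decay_in = 1.0f;
            for (int k = 0; k <= i; ++k) decay_in *= a[s0 + k];
            float y = c[s0 + i]*decay_in*h_carry;
            // intra-chunk term: lower-triangular weighted sum of this
            // chunk's past inputs
            for (int j = 0; j <= i; ++j) {
                y += c[s0 + i]*L[i][j]*b[s0 + j]*x[s0 + j];
            }
            y_chk[s0 + i] = y;
        }
        // advance the carried state to the end of this chunk
        for (int i = 0; i < Q; ++i) {
            h_carry = a[s0 + i]*h_carry + b[s0 + i]*x[s0 + i];
        }
    }

    for (int t = 0; t < T; ++t) {
        printf("t=%d recurrent=%.5f chunked=%.5f\n", t, y_rec[t], y_chk[t]);
    }
    return 0;
}
```

In the real op the state and the B/C projections are matrices rather than scalars, so the intra-chunk term becomes batched matmuls, which is where the prefill speedup would come from.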
|
Dumb question, but can the CUDA gated deltanet code be optimized via Vidrial? It supposedly does autotuning by trying various configurations to find an ideal one... (?)
|
One of the core goals of llama.cpp/ggml is to avoid external dependencies if at all possible. These kernels are not going to be performance-critical, so they should just be plain CUDA code no matter what. For the performance-critical kernels where I am the main developer and maintainer, my opinion is that the code should be as low-level as possible and avoid external dependencies, as that makes debugging easier.
|
Is this PR required for Qwen3-Next CUDA support? Or have all of these been implemented in #17063?
CUDA support is still missing for |
|
@gabe-l-hart are you planning to adapt this? If you don't have the time, I can take over and make it compatible with the current master.
|
@pwilkin if you're on a roll, you're welcome to go for it. I can get to it next week sometime if you don't get to it first.
|
@gabe-l-hart Aight, might just do the CUDA kernels since I have neither the know-how nor the ability to test Metal kernels ;)
|
Deal, I can definitely tackle those |
I'm open to using my framework to optimize the CUDA ones once again after your initial implementation, though I don't have Metal hardware to test the Metal ones either 😅
|
It looks like @ggerganov added
|
@pwilkin I also see
The kernel does not work and is not optimized, but the code compiles and runs, so this will be the starting point now that the core op has been merged. Branch: ggml-cumsum-tri Signed-off-by: Gabe Goodhart <[email protected]>
This was added in the original draft, but later removed. With this, the kernel now passes tests. Branch: ggml-cumsum-tri Signed-off-by: Gabe Goodhart <[email protected]>
…n kernel Branch: ggml-cumsum-tri Signed-off-by: Gabe Goodhart <[email protected]>
Force-pushed from 5071fbd to f2ad887
|
@ggerganov @JohannesGaessler I've updated this to only do
Signed-off-by: Gabe Goodhart <[email protected]> Co-authored-by: Georgi Gerganov <[email protected]>
|
It looks like
Branch: ggml-cumsum-tri Signed-off-by: Gabe Goodhart <[email protected]>
ggerganov left a comment
Need to adapt to the changes after #17739
ushort sgitg[[simdgroup_index_in_threadgroup]],
ushort tiisg[[thread_index_in_simdgroup]],
These are unused
Description
EDIT: This PR has now been updated to match the kernel signatures in master and #17584. CUMSUM was already added in #17305 and there are several more ops added in #17063, so this PR now adds TRI, FILL, EXPM1, and SOFTPLUS for Metal.

This PR builds on some of the work by @pwilkin in #16095 and extends the CPU implementations of CUMSUM and TRI to Metal and CUDA. It also extends type support to F16 and BF16.

The goal of this PR is to establish these two ops in the interest of both the DELTA_NET op for Qwen3-Next and the chunked implementation of the State Space Duality form of SSM_SCAN for faster prefill.

I'm putting this up for review now in case it helps with the Qwen3-Next work and to get feedback on the kernels. I'm quite a novice at kernel development, so I suspect others may find significant optimizations for both Metal and CUDA.
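For context, here is a rough sketch of the reference semantics usually associated with the four ops, written in plain C. Details such as the SOFTPLUS stability threshold and how TRI treats the diagonal are assumptions for illustration; the actual ggml/Metal kernels operate on ggml tensors and may differ.

```c
// Rough reference semantics for TRI, FILL, EXPM1, and SOFTPLUS (sketch only;
// threshold and diagonal handling are assumptions, not the ggml kernels).
#include <math.h>

static float op_expm1(float x) { return expm1f(x); } // exp(x) - 1

static float op_softplus(float x) {
    // log(1 + exp(x)), with a guard against overflow for large x
    return x > 20.0f ? x : log1pf(expf(x));
}

// FILL: set every element of an n-element buffer to a constant
static void op_fill(float *dst, int n, float value) {
    for (int i = 0; i < n; ++i) dst[i] = value;
}

// TRI (lower-triangular variant): keep elements on or below the diagonal
// of a rows x cols matrix and zero the rest
static void op_tri_lower(float *dst, const float *src, int rows, int cols) {
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c) {
            dst[r*cols + c] = (c <= r) ? src[r*cols + c] : 0.0f;
        }
    }
}
```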