
Conversation

@gabe-l-hart (Collaborator) commented Oct 16, 2025

Description

EDIT: This PR has now been updated to match the kernel signatures in master and #17584. CUMSUM was already added in #17305 and several more ops were added in #17063, so this PR now adds TRI, FILL, EXPM1, and SOFTPLUS for Metal.


This PR builds on some of the work by @pwilkin in #16095 and extends the CPU implementations of CUMSUM and TRI to Metal and CUDA. It also extends type support to F16 and BF16.

The goal of this PR is to establish these two ops in support of both the DELTA_NET op for Qwen3-Next and the chunked implementation of the State Space Duality form of SSM_SCAN for faster prefill.

I'm putting this up for review now in case it helps with the Qwen3-Next work and to get feedback on the kernels. I'm quite a novice at kernel development, so I suspect others may find significant optimizations for both Metal and CUDA.
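
For reference, here is a rough CPU-style sketch of the semantics these ops are meant to implement (my own illustrative helpers on a contiguous F32 buffer, not the actual ggml or Metal kernels; the real ggml ops may parameterize things differently, e.g. triangle type and fill constant):

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Illustrative reference semantics only; not the ggml CPU/Metal kernels.
static float expm1_ref(float x)    { return std::expm1(x); }           // e^x - 1, accurate near 0
static float softplus_ref(float x) { return std::log1p(std::exp(x)); } // log(1 + e^x)

// FILL: write a constant into every element.
static void fill_ref(std::vector<float> & dst, float value) {
    for (float & v : dst) { v = value; }
}

// TRI, sketched here as "keep one triangle of each 2D slice, zero the rest".
static void tri_ref(std::vector<float> & dst, int64_t n_rows, int64_t n_cols, bool keep_lower) {
    for (int64_t r = 0; r < n_rows; ++r) {
        for (int64_t c = 0; c < n_cols; ++c) {
            const bool keep = keep_lower ? (c <= r) : (c >= r);
            if (!keep) { dst[r*n_cols + c] = 0.0f; }
        }
    }
}
```

A numerically stable softplus would switch to x + log1p(exp(-x)) for large x; the sketch keeps the textbook form for clarity.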

@github-actions bot added the testing, Nvidia GPU, examples, ggml, and Apple Metal labels on Oct 16, 2025
@gabe-l-hart mentioned this pull request Oct 16, 2025
@gabe-l-hart (Collaborator, Author)

Yikes, it looks like there are some alternate platforms that will need handling. I'll dig through some of these failures.

@JohannesGaessler (Collaborator)

My opinion is that we should assert that the CPU implementation is correct before we move towards reviewing and merging this PR.

@pwilkin (Collaborator) commented Oct 17, 2025

> My opinion is that we should assert that the CPU implementation is correct before we move towards reviewing and merging this PR.

I think @gabe-l-hart moved the TRI and CUMSUM CPU implementations to this PR as well, so I guess it's a question of adding some test cases?

Those ops aren't very hard, and as far as basic correctness goes, I think I've verified them quite extensively during my fights with Qwen3-Next.

I've mimicked the basic logic for CUMSUM to match SUM_ROWS, so it's basically always done along the first dimension.
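
To make that convention concrete, a minimal sketch (illustrative code, not the actual CPU kernel): the cumulative sum runs along ne[0], i.e. within each row, the same dimension SUM_ROWS reduces over:

```cpp
#include <cstdint>
#include <vector>

// CUMSUM along the first dimension (ne[0]): one running sum per row,
// mirroring the row-wise convention of SUM_ROWS. Illustrative only.
static void cumsum_rows_ref(const std::vector<float> & src, std::vector<float> & dst,
                            int64_t n_rows, int64_t n_cols) {
    for (int64_t r = 0; r < n_rows; ++r) {
        float acc = 0.0f;
        for (int64_t c = 0; c < n_cols; ++c) {
            acc += src[r*n_cols + c];
            dst[r*n_cols + c] = acc;
        }
    }
}
// e.g. the row [1, 2, 3, 4] becomes [1, 3, 6, 10]
```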

@JohannesGaessler (Collaborator)

The thing is, though, that as of right now those ops aren't used anywhere on master. My opinion is that new ops should be added in tandem with the model that needs them, just in case it turns out that further changes are needed (even if that's unlikely).

@gabe-l-hart (Collaborator, Author)

@JohannesGaessler That makes total sense. I'm continuing to work towards the SSD formulation for SSM_SCAN, so this PR is really just a checkpoint for the primitive ops. My goal is better performance for Granite 4, which is why I carried it through to Metal and CUDA here.

@giuseppe (Contributor)

> @JohannesGaessler That makes total sense. I'm continuing to work towards the SSD formulation for SSM_SCAN,

@gabe-l-hart FYI, I've worked on an implementation of SSM_SCAN for the Vulkan backend in #16463, and that indeed helped with Granite 4.

@gabe-l-hart (Collaborator, Author) commented Oct 17, 2025

@giuseppe I saw that yesterday! That support will help a ton. The SSD formulation is an additional optimization on top of the recurrent formulation that should have big benefits for prefill with long context. It's mathematically equivalent, but much more efficient. The challenge I'm working through right now is how best to decompose the problem to minimize impact. One option is to write it as a sub-graph composed of smaller ops that is used when the sequence length is > 1, but this would have two problems:

  1. It would break the ability to reuse the graph objects across generation steps

  2. It would make backends that don't support the primitive ops (CUMSUM and TRI) fall back to CPU (exactly what you're trying to fix in your PR for SSM_SCAN)

The alternative is to implement it inside a single backend's SSM_SCAN. This has the advantage of being self-contained so it can be done on a per-backend basis, but it has the inverse problem of requiring each backend to implement it separately in order to get the performance boost. It's also much harder to write as a single kernel since it involves allocating temporary tensors of different size than the input or output tensors.
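
To illustrate the first option (the names below are made up for illustration, not llama.cpp code): the graph builder would have to branch on how many tokens each sequence contributes, which is exactly why the built graph stops being identical between prefill and decode steps.

```cpp
#include <cstdint>

// Hypothetical dispatch sketch; these names are placeholders, not real llama.cpp functions.
enum class ssm_path { recurrent, chunked_ssd };

static ssm_path choose_ssm_path(int64_t n_seq_tokens) {
    // prefill (many tokens per sequence) -> chunked SSD sub-graph built from CUMSUM/TRI-style primitives
    // decode  (one token per sequence)   -> existing recurrent SSM_SCAN op
    return n_seq_tokens > 1 ? ssm_path::chunked_ssd : ssm_path::recurrent;
}
```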

@fat-tire (Contributor)

Dumb question, but can the CUDA gated DeltaNet code be optimized via Vidrial? It supposedly does autotuning by trying various configurations to find an ideal one... (?)

@JohannesGaessler (Collaborator)

One of the core goals of llama.cpp/ggml is to avoid external dependencies if at all possible. These kernels are not going to be performance-critical, so they should just be plain CUDA code no matter what. For the performance-critical kernels where I am the main developer and maintainer, my opinion is that the code should be as low-level as possible and avoid external dependencies, since that makes debugging easier.

@gabe-l-hart mentioned this pull request Nov 3, 2025
@CarlGao4

Is this PR required for Qwen3-Next CUDA support? Or has all of this been implemented in #17063?

@CISC (Collaborator) commented Nov 28, 2025

> Is this PR required for Qwen3-Next CUDA support? Or has all of this been implemented in #17063?

CUDA support is still missing for CUMSUM and TRI, so it will fall back to CPU.

@pwilkin (Collaborator) commented Nov 28, 2025

@gabe-l-hart are you planning to adapt this? If you don't have the time, I can take over and make it compatible with the current master.

@gabe-l-hart (Collaborator, Author)

@pwilkin if you're on a roll, you're welcome to go for it. I can get to it next week sometime if you don't get to it first.

@pwilkin (Collaborator) commented Nov 28, 2025

@gabe-l-hart Aight, I might just do the CUDA kernels since I have neither the know-how nor the ability to test Metal kernels ;)

@gabe-l-hart (Collaborator, Author)

Deal, I can definitely tackle those

@wsbagnsv1 commented Nov 28, 2025

> @gabe-l-hart Aight, I might just do the CUDA kernels since I have neither the know-how nor the ability to test Metal kernels ;)

I'm open to using my framework to optimize the CUDA ones once again after your initial implementation, though I don't have Metal hardware to test the Metal ones either 😅

@gabe-l-hart (Collaborator, Author)

It looks like @ggerganov added CUMSUM support for Metal in #17305, so for Metal I think GGML_OP_TRI is the only outstanding piece from this PR.

@gabe-l-hart (Collaborator, Author)

@pwilkin I also see ggml_fill_inplace used in qwen3next, and I don't see GGML_OP_FILL in any backend besides CPU and Vulkan. Should we get that implemented for Metal as well? It seems like it's only used at the top of the model build (here), so maybe those tensors are closer to input tensors and don't need to be computed on the device?

The kernel does not work and is not optimized, but the
code compiles and runs, so this will be the starting point
now that the core op has been merged.

Branch: ggml-cumsum-tri

Signed-off-by: Gabe Goodhart <[email protected]>
This was added in the original draft, but later removed. With this, the
kernel now passes tests.

Branch: ggml-cumsum-tri

Signed-off-by: Gabe Goodhart <[email protected]>
…n kernel

Branch: ggml-cumsum-tri

Signed-off-by: Gabe Goodhart <[email protected]>
@gabe-l-hart changed the title from "ggml: CUMSUM and TRI (CPU, Metal, CUDA)" to "metal: TRI" Dec 3, 2025
@gabe-l-hart (Collaborator, Author)

@ggerganov @JohannesGaessler I've updated this to only do TRI for Metal, since CUDA is being handled in #17584 and CUMSUM was already added in #17305. I did the simplest optimization of removing the conditional from the kernel and dispatching it using a template, but I'm not sure who the deep Metal expert is who might be able to help optimize further. Is there a correct reviewer for Metal PRs?
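
For the curious, a simplified C++-style sketch of that dispatch pattern (not the actual Metal source; the enum and function names here are made up): the triangle test that used to be a runtime conditional inside the kernel becomes a compile-time template parameter, so each instantiated variant has no per-element branch on the TRI type and the host just selects the matching pipeline.

```cpp
#include <cstdint>

// Illustrative template dispatch; in the real code the variant is likewise
// fixed per compiled kernel rather than tested per element.
enum class tri_type { lower, lower_diag, upper, upper_diag };

template <tri_type T>
static bool tri_keep(int64_t row, int64_t col) {
    // resolved at compile time for each instantiation
    if constexpr (T == tri_type::lower)           { return col <  row; }
    else if constexpr (T == tri_type::lower_diag) { return col <= row; }
    else if constexpr (T == tri_type::upper)      { return col >  row; }
    else                                          { return col >= row; }
}

template <tri_type T>
static void tri_apply(float * dst, const float * src, int64_t n_rows, int64_t n_cols) {
    for (int64_t r = 0; r < n_rows; ++r) {
        for (int64_t c = 0; c < n_cols; ++c) {
            dst[r*n_cols + c] = tri_keep<T>(r, c) ? src[r*n_cols + c] : 0.0f;
        }
    }
}
```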

Signed-off-by: Gabe Goodhart <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
@gabe-l-hart (Collaborator, Author)

It looks like Metal is also missing FILL, SOFTPLUS, and EXPM1 for full support based on #17063. I'll try to add them here as well.

@gabe-l-hart (Collaborator, Author) commented Dec 3, 2025

@pwilkin I see that in #17063 FILL is added but NOT as unary. Was that done for efficiency?

Never mind, I figured it out!

Branch: ggml-cumsum-tri

Signed-off-by: Gabe Goodhart <[email protected]>
@gabe-l-hart changed the title from "metal: TRI" to "metal: TRI, FILL, EXPM1, SOFTPLUS" Dec 3, 2025
@ggerganov (Member) left a comment

Need to adapt to the changes after #17739

Comment on lines +2023 to +2024

ushort sgitg[[simdgroup_index_in_threadgroup]],
ushort tiisg[[thread_index_in_simdgroup]],

@ggerganov (Member): These are unused
