Conversation

gabe-l-hart (Collaborator)

Description

This PR builds on some of the work by @pwilkin in #16095 and extends the CPU implementations of CUMSUM and TRI to Metal and CUDA. It also extends type support to F16 and BF16.
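
As a quick reference for what these ops compute (my own sketch of the semantics, not the PR's code; the lower-triangular convention for TRI is an assumption, and the real op may support more variants): CUMSUM produces running prefix sums along a row, and TRI selects a triangle of a matrix, zeroing the rest.

```cpp
// Sketch of TRI semantics only: keep the lower triangle (c <= r) of an
// n x n matrix and zero everything else. Assumed convention, for
// illustration; the ggml op is not necessarily implemented this way.
void tri_lower(const float * src, float * dst, int n) {
    for (int r = 0; r < n; ++r) {
        for (int c = 0; c < n; ++c) {
            dst[r*n + c] = (c <= r) ? src[r*n + c] : 0.0f;
        }
    }
}
```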

The goal of this PR is to establish these two ops in support of both the DELTA_NET op for Qwen3-Next and a chunked implementation of the State Space Duality (SSD) form of SSM_SCAN for faster prefill.
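
For context, here is my own summary of why these primitives appear in the SSD formulation (standard SSM notation, not text from this PR). The scalar-decay recurrence

$$h_t = a_t\,h_{t-1} + B_t x_t, \qquad y_t = C_t^\top h_t$$

unrolls over a chunk to

$$y_t = \sum_{s \le t} C_t^\top \Big(\prod_{r=s+1}^{t} a_r\Big) B_s x_s, \qquad \prod_{r=s+1}^{t} a_r = \exp\big(\operatorname{cumsum}(\log a)_t - \operatorname{cumsum}(\log a)_s\big),$$

so the decay products reduce to a CUMSUM over log-decays, and the causal restriction $s \le t$ is exactly a lower-triangular (TRI) mask over the chunk.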

I'm putting this up for review now in case it helps with the Qwen3-Next work and to get feedback on the kernels. I'm quite a novice at kernel development, so I suspect others may find significant optimizations for both Metal and CUDA.

ggerganov and others added 29 commits (October 9, 2025)
* gg/metal-mul-mat-fixes:
metal : fix mul-mm condition + fix mul-mv permuted kernels
Cherry-picked and edited from 7ec2df6

The original commit contained the DELTA_NET op as well, which I've removed in this cherry-picked version.

Co-Authored-By: Piotr Wilkin <[email protected]>

Signed-off-by: Gabe Goodhart <[email protected]>
Branch: Mamba2SSD

This should be using simd operations for better parallelism, but that will come next.

Branch: Mamba2SSD
Signed-off-by: Gabe Goodhart <[email protected]>
* origin/master: (32 commits)
metal : FA support F32 K and V and head size = 32 (ggml-org#16531)
graph : support cacheless embeddings with FA and iSWA (ggml-org#16528)
opencl: fix build targeting CL 2 (ggml-org#16554)
CUDA: fix numerical issues in tile FA kernel (ggml-org#16540)
ggml : fix build broken with -march=armv9-a on MacOS (ggml-org#16520)
CANN: fix CPU memory leak in CANN backend (ggml-org#16549)
fix: add remark plugin to render raw HTML as literal text (ggml-org#16505)
metal: add support for opt_step_sgd (ggml-org#16539)
ggml : fix scalar path for computing norm (ggml-org#16558)
CANN: Update several operators to support FP16 data format (ggml-org#16251)
metal : add opt_step_adamw and op_sum (ggml-org#16529)
webui: remove client-side context pre-check and rely on backend for limits (ggml-org#16506)
[SYCL] fix UT fault cases: count-equal, argsort, pad OPs (ggml-org#16521)
ci : add Vulkan on Ubuntu with default packages build (ggml-org#16532)
common : handle unicode during partial json parsing (ggml-org#16526)
common : update presets (ggml-org#16504)
ggml : Fix FP16 ELU positive branch (ggml-org#16519)
hparams : add check for layer index in is_recurrent (ggml-org#16511)
ggml: Correct SVE implementation in ggml_vec_dot_f16_unroll (ggml-org#16518)
CUDA: faster tile FA, add oob checks, more HSs (ggml-org#16492)
...
Branch: Mamba2SSD
Signed-off-by: Gabe Goodhart <[email protected]>

Branch: Mamba2Perf
Signed-off-by: Gabe Goodhart <[email protected]>
* origin/master:
Add server-driven parameter defaults and syncing (ggml-org#16515)
metal: optimise `GGML_OP_SUM` (ggml-org#16559)
server : fix img token logs (ggml-org#16595)
llama-quant: add support for mmproj (ggml-org#16592)
CUDA: Changing the CUDA scheduling strategy to spin (ggml-org#16585)
server : fix mtmd checkpoints (ggml-org#16591)
metal : avoid using Metal's gpuAddress property (ggml-org#16576)
vulkan: Add ACC_TYPE_VEC2 implementation (ggml-org#16203)
CUDA + openCL: fix bug in accessing rms_norm->src while doing fusion (ggml-org#16577)
vulkan: Support FA with K/V in F32 (ggml-org#16543)
vulkan: Improve build time for MSVC (ggml-org#16545)
CUDA: enable FA for FP32 KV cache (ggml-org#16546)
CUDA: use fastdiv + ggml_cuda_mad for mmvf (ggml-org#16557)
CUDA: add fp kernel for larger batch size MoE (ggml-org#16512)
cuda : remove legacy copy-op pointer indirection code (ggml-org#16485)
server : dynamic token limit for prompt cache (ggml-org#16560)
Branch: Mamba2SSD
Signed-off-by: Gabe Goodhart <[email protected]>
gabe-l-hart requested a review from slaren as a code owner on October 16, 2025.

The github-actions bot added the labels testing, Nvidia GPU, examples, ggml, and Apple Metal on Oct 16, 2025.

gabe-l-hart mentioned this pull request on Oct 16, 2025.
@gabe-l-hart (Collaborator, Author)

Yikes, it looks like there are some alternate platforms that will need to be handled. I'll dig through some of these failures.

@JohannesGaessler (Collaborator)

My opinion is that we should assert that the CPU implementation is correct before we move towards reviewing and merging this PR.

@pwilkin (Collaborator) commented Oct 17, 2025

> My opinion is that we should assert that the CPU implementation is correct before we move towards reviewing and merging this PR.

I think @gabe-l-hart moved the TRI and CUMSUM CPU implementations to this PR as well, so I guess it's a question of adding some test cases?

Those ops aren't very hard, and as far as basic correctness goes, I think I've verified them quite extensively during my fights with Qwen3 Next.

I've modeled the basic logic of CUMSUM on SUM_ROWS, so it's always applied along the first dimension.
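
As an illustrative sketch of that pattern (my own pseudocode, not the actual ggml source): the traversal is the same as SUM_ROWS, except the inner loop stores every partial sum instead of only the final total.

```cpp
#include <cstdint>

// Sketch: SUM_ROWS-style traversal, assuming contiguous f32 rows.
// The outer loop walks all rows; the inner loop runs along the first
// dimension (ne0) and keeps each running total.
void cumsum_rows_f32(const float * src, float * dst,
                     int64_t ne0, int64_t nrows) {
    for (int64_t r = 0; r < nrows; ++r) {
        const float * s = src + r*ne0;
        float       * d = dst + r*ne0;
        float acc = 0.0f;
        for (int64_t i0 = 0; i0 < ne0; ++i0) {
            acc  += s[i0];
            d[i0] = acc;
        }
    }
}
```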

@JohannesGaessler (Collaborator)

The thing is, though, that as of right now those ops aren't used anywhere on master. My opinion is that new ops should be added in tandem with the model that needs them, just in case it turns out that further changes are needed (even if that's unlikely).

@gabe-l-hart (Collaborator, Author)

@JohannesGaessler That makes total sense. I'm continuing to work towards the SSD formulation for SSM_SCAN, so this PR is really just a checkpoint for the primitive ops. My goal is better performance for Granite 4, which is why I carried the kernels through to Metal and CUDA here.

@giuseppe (Contributor)

> @JohannesGaessler That makes total sense. I'm continuing to work towards the SSD formulation for SSM_SCAN,

@gabe-l-hart FYI, I've worked on an implementation of SSM_SCAN for the Vulkan backend in #16463, and that indeed helped with Granite 4.

@gabe-l-hart (Collaborator, Author) commented Oct 17, 2025

@giuseppe I saw that yesterday! That support will help a ton. The SSD formulation is an additional optimization on top of the recurrent formulation that should yield big benefits for prefill with long context: it's mathematically equivalent, but much more efficient. The challenge I'm working through right now is how best to decompose the problem to minimize its impact on the rest of the codebase. One option is to write it as a sub-graph composed of smaller ops that is used when the sequence length is > 1 (see the sketch at the end of this comment), but this would have two problems:

  1. It would break the ability to reuse the graph objects across generation steps

  2. It would make backends that don't support the primitive ops (CUMSUM and TRI) fall back to CPU (exactly what you're trying to fix in your PR for SSM_SCAN)

The alternative is to implement it inside a single backend's SSM_SCAN kernel. This has the advantage of being self-contained, so it can be done on a per-backend basis, but it has the inverse problem: each backend has to implement it separately in order to get the performance boost. It's also much harder to write as a single kernel, since it involves allocating temporary tensors of different sizes than the input and output tensors.
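
To make option 1 concrete, here's a hedged sketch of the branching idea (the build_* helpers are hypothetical names, not ggml API; only the shape of the decision is the point):

```cpp
// Hypothetical sketch: choose the chunked SSD decomposition only for
// prefill (sequence length > 1), and keep the fused recurrent op for
// decode. The helper functions are illustrative placeholders.
struct ggml_tensor * build_ssm_scan(
        struct ggml_context * ctx,
        struct ggml_tensor  * x,
        struct ggml_tensor  * state,
        int64_t               n_seq_tokens) {
    if (n_seq_tokens == 1) {
        // decode: single fused recurrent SSM_SCAN kernel
        return build_ssm_scan_recurrent(ctx, x, state);    // hypothetical
    }
    // prefill: sub-graph built from primitive ops (CUMSUM for the decay
    // prefix sums, TRI for the causal mask). Problem 1: the graph topology
    // now differs between decode and prefill, breaking graph reuse.
    // Problem 2: backends without CUMSUM/TRI fall back to the CPU.
    return build_ssm_scan_ssd_chunked(ctx, x, state);      // hypothetical
}
```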
