ggml: CUMSUM and TRI (CPU, Metal, CUDA) #16623
base: master
Conversation
This reverts commit 00f115f.
* gg/metal-mul-mat-fixes: metal : fix mul-mm condition + fix mul-mv permuted kernels
Cherry-picked and edited from 7ec2df6. The original commit contained the DELTA_NET op as well, which I've removed in this cherry-picked version. Co-Authored-By: Piotr Wilkin <[email protected]> Signed-off-by: Gabe Goodhart <[email protected]>
This should be using simd operations for better parallelism, but that will come next. Branch: Mamba2SSD Signed-off-by: Gabe Goodhart <[email protected]>
* origin/master: (32 commits)
  metal : FA support F32 K and V and head size = 32 (ggml-org#16531)
  graph : support cacheless embeddings with FA and iSWA (ggml-org#16528)
  opencl: fix build targeting CL 2 (ggml-org#16554)
  CUDA: fix numerical issues in tile FA kernel (ggml-org#16540)
  ggml : fix build broken with -march=armv9-a on MacOS (ggml-org#16520)
  CANN: fix CPU memory leak in CANN backend (ggml-org#16549)
  fix: add remark plugin to render raw HTML as literal text (ggml-org#16505)
  metal: add support for opt_step_sgd (ggml-org#16539)
  ggml : fix scalar path for computing norm (ggml-org#16558)
  CANN: Update several operators to support FP16 data format (ggml-org#16251)
  metal : add opt_step_adamw and op_sum (ggml-org#16529)
  webui: remove client-side context pre-check and rely on backend for limits (ggml-org#16506)
  [SYCL] fix UT fault cases: count-equal, argsort, pad OPs (ggml-org#16521)
  ci : add Vulkan on Ubuntu with default packages build (ggml-org#16532)
  common : handle unicode during partial json parsing (ggml-org#16526)
  common : update presets (ggml-org#16504)
  ggml : Fix FP16 ELU positive branch (ggml-org#16519)
  hparams : add check for layer index in is_recurrent (ggml-org#16511)
  ggml: Correct SVE implementation in ggml_vec_dot_f16_unroll (ggml-org#16518)
  CUDA: faster tile FA, add oob checks, more HSs (ggml-org#16492)
  ...
* origin/master:
  Add server-driven parameter defaults and syncing (ggml-org#16515)
  metal: optimise `GGML_OP_SUM` (ggml-org#16559)
  server : fix img token logs (ggml-org#16595)
  llama-quant: add support for mmproj (ggml-org#16592)
  CUDA: Changing the CUDA scheduling strategy to spin (ggml-org#16585)
  server : fix mtmd checkpoints (ggml-org#16591)
  metal : avoid using Metal's gpuAddress property (ggml-org#16576)
  vulkan: Add ACC_TYPE_VEC2 implementation (ggml-org#16203)
  CUDA + openCL: fix bug in accessing rms_norm->src while doing fusion (ggml-org#16577)
  vulkan: Support FA with K/V in F32 (ggml-org#16543)
  vulkan: Improve build time for MSVC (ggml-org#16545)
  CUDA: enable FA for FP32 KV cache (ggml-org#16546)
  CUDA: use fastdiv + ggml_cuda_mad for mmvf (ggml-org#16557)
  CUDA: add fp kernel for larger batch size MoE (ggml-org#16512)
  cuda : remove legacy copy-op pointer indirection code (ggml-org#16485)
  server : dynamic token limit for prompt cache (ggml-org#16560)
Yikes, it looks like there are some alternate platforms that will need to be handled. I'll dig through some of these failures.
My opinion is that we should assert that the CPU implementation is correct before we move towards reviewing and merging this PR.
I think @gabe-l-hart moved the TRI and CUMSUM CPU implementation to this PR as well, so I guess it's a question of adding some test cases? Those ops aren't very hard, and as far as basic correctness goes I think I've verified them quite extensively during my fights with Qwen3-Next. I've mimicked the basic logic for CUMSUM to be the same as SUM_ROWS, so it's basically always done on the first dimension.
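To make those semantics concrete, here is a minimal CPU-style reference of a cumulative sum along the first dimension (ne0), the same dimension SUM_ROWS reduces over. This is an illustrative sketch only, not the kernel from this PR; the flat-buffer layout and the function name are assumptions.

```cpp
#include <cstdint>

// Reference sketch (not the actual ggml kernel): cumulative sum along the
// first dimension (ne0), mirroring how SUM_ROWS reduces along ne0.
// `src` is treated as a contiguous [ne0 x nrows] float buffer, where
// nrows collapses the remaining dimensions (ne1*ne2*ne3).
static void cumsum_ref(const float * src, float * dst, int64_t ne0, int64_t nrows) {
    for (int64_t r = 0; r < nrows; ++r) {
        const float * s = src + r*ne0;
        float       * d = dst + r*ne0;
        float sum = 0.0f;
        for (int64_t i = 0; i < ne0; ++i) {
            sum += s[i];      // running sum along the row
            d[i] = sum;       // dst[i] = src[0] + ... + src[i]
        }
    }
}
```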
The thing is, though, that as of right now those ops aren't used anywhere on master. My opinion is that the new ops should be added in tandem with the model that needs them, just in case it turns out that further changes are needed (even if that's unlikely).
@JohannesGaessler That makes total sense. I'm continuing to work towards the SSD formulation for SSM_SCAN, so this PR is really just a checkpoint for the primitive ops. My goal is better performance for Granite 4, which is why I carried these ops through to Metal and CUDA here.
@gabe-l-hart FYI, I've worked on the implementation of SSM_SCAN for the Vulkan backend: #16463, and that indeed helped with Granite 4.
@giuseppe I saw that yesterday! That support will help a ton. The SSD formulation is an additional optimization on top of the recurrent formulation that should have big benefits for prefill with long context. It's mathematically equivalent, but much more efficient. The challenge I'm working through right now is how best to decompose the problem to minimize impact. One option is to write it as a sub-graph composed of smaller ops that is used when the sequence length is > 1, but this would have two problems:
The alternative is to implement it inside a single backend's SSM_SCAN. This has the advantage of being self-contained so it can be done on a per-backend basis, but it has the inverse problem of requiring each backend to implement it separately in order to get the performance boost. It's also much harder to write as a single kernel since it involves allocating temporary tensors of a different size than the input or output tensors.
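To make the equivalence concrete, here is a toy scalar sketch (purely illustrative, not code from this PR or from ggml) comparing the recurrent scan with a chunked, SSD-style computation. It also shows where the Q x Q lower-triangular intermediate comes from, i.e. why the temporaries have a different shape than the inputs and outputs.

```cpp
#include <cstdio>
#include <vector>

// Toy scalar illustration (simplified assumption, not the PR's SSM_SCAN code) of why
// the chunked SSD form is equivalent to the recurrent form but needs intermediates
// of a different shape: each chunk of length Q builds a Q x Q lower-triangular
// matrix of decay products.
int main() {
    const int T = 8, Q = 4;                       // sequence length, chunk size (Q divides T)
    std::vector<float> a(T), bx(T), c(T);         // decay a_t, input b_t*x_t, output gate c_t
    for (int t = 0; t < T; ++t) {
        a[t]  = 0.9f - 0.01f*t;
        bx[t] = 0.1f*(t + 1);
        c[t]  = 1.0f + 0.05f*t;
    }

    // Recurrent form: h_t = a_t*h_{t-1} + bx_t,  y_t = c_t*h_t
    std::vector<float> y_rec(T);
    float h = 0.0f;
    for (int t = 0; t < T; ++t) {
        h = a[t]*h + bx[t];
        y_rec[t] = c[t]*h;
    }

    // Chunked form: within a chunk, y_i = c_i * sum_{j<=i} (prod_{k=j+1..i} a_k) * bx_j
    //               plus the carried-in state decayed by prod_{k=0..i} a_k.
    std::vector<float> y_chk(T);
    float h_in = 0.0f;                            // state carried between chunks
    for (int s = 0; s < T; s += Q) {
        std::vector<float> M(Q*Q, 0.0f);          // Q x Q lower-triangular decay matrix
        for (int i = 0; i < Q; ++i) {
            for (int j = 0; j <= i; ++j) {
                float prod = 1.0f;
                for (int k = j + 1; k <= i; ++k) prod *= a[s + k];
                M[i*Q + j] = prod;
            }
        }
        float decay = 1.0f;                       // running product of a over the chunk
        for (int i = 0; i < Q; ++i) {
            decay *= a[s + i];
            float y = c[s + i]*decay*h_in;        // contribution of the carried-in state
            for (int j = 0; j <= i; ++j) {
                y += c[s + i]*M[i*Q + j]*bx[s + j];
            }
            y_chk[s + i] = y;
        }
        h_in = decay*h_in;                        // new carried state: decayed old state ...
        for (int j = 0; j < Q; ++j) {             // ... plus each input decayed to the chunk end
            h_in += M[(Q - 1)*Q + j]*bx[s + j];
        }
    }

    for (int t = 0; t < T; ++t) {
        printf("t=%d  recurrent=%.6f  chunked=%.6f\n", t, y_rec[t], y_chk[t]);
    }
    return 0;
}
```

The per-chunk triangular matrix and the running decay products are roughly where TRI and CUMSUM come into play, and intermediates like the Q x Q matrix are what make a single fused kernel awkward.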
Description
This PR builds on some of the work by @pwilkin in #16095 and extends the CPU implementations of `CUMSUM` and `TRI` to Metal and CUDA. It also extends type support to `F16` and `BF16`.

The goal of this PR is to establish these two ops in the interest of both the `DELTA_NET` op for Qwen3-Next and the chunked implementation of the State Space Duality form of `SSM_SCAN` for faster prefill.

I'm putting this up for review now in case it helps with the Qwen3-Next work and to get feedback on the kernels. I'm quite a novice at kernel development, so I suspect others may find significant optimizations for both Metal and CUDA.
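For reference, here is a plain C++ sketch of what a lower-triangular masking op in the spirit of `TRI` might do. This is an illustrative assumption, not the PR's kernel code; the exact semantics (upper vs. lower variant, diagonal handling, fill value) may differ.

```cpp
#include <cstdint>

// Illustrative sketch of a lower-triangular masking op in the spirit of TRI.
// src and dst are row-major [nr x nc] matrices; elements above the diagonal
// are replaced with `fill`, the rest are copied through.
static void tri_lower_ref(const float * src, float * dst, int64_t nr, int64_t nc, float fill) {
    for (int64_t i = 0; i < nr; ++i) {
        for (int64_t j = 0; j < nc; ++j) {
            dst[i*nc + j] = (j <= i) ? src[i*nc + j] : fill;
        }
    }
}
```

A mask of this kind is what enforces causality within a chunk in the SSD formulation, which is presumably why `TRI` and `CUMSUM` appear together in this PR.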