
Conversation


@am17an commented on Sep 26, 2025

This PR adds mul_mat_id support for ncols_dst >= 16. It does this by packing ncols_dst tiles along with the experts into the kernel's y launch dimension (blockDim.y).

My tests on an RTX 3090 show that this is faster than the cuBLAS fallback for f16 up to bs=64, and for f32 up to bs=32.

Performance for n = 32 and n = 64:

| Backend | GGML op | Op parameters | TFLOPS master | TFLOPS cuda_mmf_mmid | Speedup |
|---------|---------|---------------|---------------|----------------------|---------|
| CUDA0 | MUL_MAT_ID | type_a=f16,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=32,k=2048,o=1 | 0.40 | 1.34 | 3.35 |
| CUDA0 | MUL_MAT_ID | type_a=f16,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=64,k=2048,o=1 | 0.70 | 1.36 | 1.93 |
| CUDA0 | MUL_MAT_ID | type_a=f32,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=32,k=2048,o=1 | 0.57 | 0.66 | 1.17 |
| CUDA0 | MUL_MAT_ID | type_a=f32,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=64,k=2048,o=1 | 0.89 | 0.87 | 0.98 |
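
For illustration, here is a minimal sketch of the packing idea: the y grid dimension carries both the expert index and the column-tile index, and each block decodes its pair with one division and one modulo. All names here (`tile_cols`, `mul_mat_id_tiles`) are hypothetical, not the actual mmf kernel from this PR.

```cuda
// Hypothetical sketch of packing (expert, column-tile) pairs into one grid
// dimension; all names are illustrative, not the actual mmf kernel.
#include <cuda_runtime.h>

static constexpr int tile_cols = 16; // dst columns handled per tile (ncols_dst granularity)

__global__ void mul_mat_id_tiles(float * dst, int ncols_dst) {
    const int ntiles_y  = (ncols_dst + tile_cols - 1) / tile_cols; // column tiles per expert
    const int expert    = blockIdx.y / ntiles_y; // which expert matrix this block works on
    const int tile      = blockIdx.y % ntiles_y; // which slice of dst columns
    const int col_first = tile * tile_cols;      // first dst column for this block
    // ... gather the rows of src1 routed to `expert`, multiply against that
    //     expert's weights, and write dst columns [col_first, col_first + tile_cols) ...
    (void) dst; (void) expert; (void) col_first;
}

int main() {
    const int n_experts = 128, ncols_dst = 32;
    const int ntiles_y  = (ncols_dst + tile_cols - 1) / tile_cols;
    // Experts and column tiles share the y grid dimension:
    const dim3 grid(48, n_experts * ntiles_y);
    mul_mat_id_tiles<<<grid, dim3(32, 8)>>>(nullptr, ncols_dst);
    cudaDeviceSynchronize();
    return 0;
}
```

Packing both indices into one launch dimension keeps a single kernel launch regardless of how many dst columns the batch has.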

github-actions bot added the labels testing (Everything test related), Nvidia GPU (Issues specific to Nvidia GPUs), and ggml (changes relating to the ggml tensor library for machine learning) on Sep 26, 2025
@JohannesGaessler left a comment

This should be okay; let me think for a bit about whether there is something that could be optimized.

@JohannesGaessler left a comment

This kernel in particular would probably benefit from the use of fastdiv, but I don't know how I would improve upon the overall structure without a more general refactor that optimizes large batch sizes over smaller ones.
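
For context, fastdiv replaces hardware integer division by a divisor known at launch time with a multiply, an add, and a shift, using a precomputed magic number (the Granlund–Montgomery scheme). Below is a minimal sketch of the idea, assuming 32-bit operands; the helper names are illustrative and the actual ggml implementation may differ.

```cuda
// Minimal sketch of magic-number division (Granlund–Montgomery style);
// illustrative only, not the exact ggml helper.
#include <cstdint>
#include <cstdio>

struct fastdiv_t { uint32_t mp; uint32_t L; };

// Precompute on the host: L = ceil(log2(d)), mp = floor(2^32 * (2^L - d) / d) + 1.
static fastdiv_t init_fastdiv(uint32_t d) {
    uint32_t L = 0;
    while (L < 32 && (uint32_t{1} << L) < d) { ++L; }
    const uint32_t mp = (uint32_t) (((uint64_t{1} << 32) * ((uint64_t{1} << L) - d)) / d + 1);
    return {mp, L};
}

// q = (umulhi(mp, n) + n) >> L replaces the hardware divide with one multiply,
// one add, and one shift. Valid as long as n*2^L/d stays below 2^32, which
// holds for typical tensor index ranges.
__device__ __forceinline__ uint32_t fastdiv(uint32_t n, fastdiv_t f) {
    return (__umulhi(f.mp, n) + n) >> f.L;
}

// Brute-force check against the hardware divide over a range of n.
__global__ void check(fastdiv_t f, uint32_t d) {
    const uint32_t n = blockIdx.x * blockDim.x + threadIdx.x;
    if (fastdiv(n, f) != n / d) {
        printf("mismatch at n=%u\n", n);
    }
}

int main() {
    const uint32_t d = 6; // e.g. a runtime-constant divisor such as the tile count
    check<<<1024, 256>>>(init_fastdiv(d), d);
    cudaDeviceSynchronize();
    return 0;
}
```

This pays off in kernels like the one above because every thread divides indices (such as the packed expert/tile index) by the same runtime-constant value.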

@JohannesGaessler merged commit c0bfc57 into ggml-org:master on Sep 27, 2025 (61 of 67 checks passed)
@am17an deleted the cuda_mmf_mmid branch on September 28, 2025 at 05:26
yael-works pushed a commit to yael-works/llama.cpp that referenced this pull request on Oct 15, 2025: CUDA: mul_mat_id for mmf for bs <= 64 for f16 and bs <= 32 for f32 (…gml-org#16277)
pwilkin pushed a commit to pwilkin/llama.cpp that referenced this pull request on Oct 23, 2025: CUDA: mul_mat_id for mmf for bs <= 64 for f16 and bs <= 32 for f32 (…gml-org#16277)
