
Conversation

@am17an (Collaborator) commented Oct 31, 2025

Based on #16769.

On a 4090:

| Model | Test | t/s master | t/s cuda-rope-fusion | Speedup |
| --- | --- | --- | --- | --- |
| llama 8B Q4_K_M | tg32 | 134.90 | 136.07 | 1.01 |
| llama 8B Q4_K_M | tg64 | 131.41 | 132.84 | 1.01 |
| llama 8B Q4_K_M | tg128 | 130.54 | 131.87 | 1.01 |
| qwen3moe 30B.A3B Q4_0 | tg32 | 167.18 | 168.23 | 1.01 |
| qwen3moe 30B.A3B Q4_0 | tg64 | 161.00 | 161.90 | 1.01 |
| qwen3moe 30B.A3B Q4_0 | tg128 | 158.84 | 159.83 | 1.01 |

@am17an requested review from CISC and slaren as code owners on October 31, 2025 05:20
@github-actions bot added the Nvidia GPU and ggml labels on Oct 31, 2025
@am17an force-pushed the cuda-add-rope-fusion branch from 406c867 to dc814b8 on October 31, 2025 12:21
@ORippler (Contributor) left a comment


While the fusion itself is quite simple, I would still recommend adding a test to test-backend-ops for it.

@am17an (Collaborator, Author) commented Oct 31, 2025

> While the fusion itself is quite simple, I would still recommend adding a test to test-backend-ops for it.

There is already a test added in the Vulkan PR #16769.

Comment on lines +81 to +82
```cpp
dst[idst + 0] = D(x[ix + 0]);
dst[idst + 1] = D(x[ix + 1]);
```
@JohannesGaessler (Collaborator) commented Nov 2, 2025


I would suggest you use ggml_cuda_cast defined in convert.cuh. Otherwise there will potentially be issues with FP16 <-> BF16 conversions.
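For illustration, a minimal sketch of what this might look like on the quoted lines, assuming `D` is the destination element type used by the kernel template (the surrounding kernel code is not shown here):

```cpp
#include "convert.cuh"  // defines ggml_cuda_cast

// Route the conversion through ggml_cuda_cast instead of constructing D directly,
// so that pairs like FP16 <-> BF16 take the proper intermediate conversion path.
dst[idst + 0] = ggml_cuda_cast<D>(x[ix + 0]);
dst[idst + 1] = ggml_cuda_cast<D>(x[ix + 1]);
```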

Comment on lines +99 to +100
```cpp
dst[idst + 0] = D(x0 * cos_theta - x1 * sin_theta);
dst[idst + 1] = D(x0 * sin_theta + x1 * cos_theta);
```
A collaborator commented:

When you're already working on optimizing RoPE: I think the memory access pattern here is suboptimal because there are gaps between each thread and I don't know whether the compiler is smart enough to combine the first and second write into a single one. I would suggest grouping the values as float2/half2 and either casting dst to that type or using ggml_cuda_memcpy_1.
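To illustrate the suggestion, here is a minimal sketch for the float destination case (not the PR's actual code); `x0`, `x1`, `sin_theta`, `cos_theta`, `dst`, and `idst` come from the quoted lines, while the alignment assumptions are mine:

```cpp
// Pack the rotated pair and issue one vectorized 8-byte store instead of two
// scalar 4-byte stores. For a half destination the analogous packed type is half2,
// or ggml_cuda_memcpy_1 could be used as suggested above.
// Assumes idst is even and dst + idst is 8-byte aligned.
float2 out;
out.x = x0 * cos_theta - x1 * sin_theta;
out.y = x0 * sin_theta + x1 * cos_theta;
*reinterpret_cast<float2 *>(dst + idst) = out;
```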
