
Conversation

@am17an (Collaborator) commented Oct 31, 2025

Based on #16769.

On a 4090:

| Model | Test | t/s master | t/s cuda-rope-fusion | Speedup |
| --- | --- | --- | --- | --- |
| llama 8B Q4_K_M | tg32 | 134.90 | 136.07 | 1.01 |
| llama 8B Q4_K_M | tg64 | 131.41 | 132.84 | 1.01 |
| llama 8B Q4_K_M | tg128 | 130.54 | 131.87 | 1.01 |
| qwen3moe 30B.A3B Q4_0 | tg32 | 167.18 | 168.23 | 1.01 |
| qwen3moe 30B.A3B Q4_0 | tg64 | 161.00 | 161.90 | 1.01 |
| qwen3moe 30B.A3B Q4_0 | tg128 | 158.84 | 159.83 | 1.01 |

@am17an requested review from CISC and slaren as code owners on October 31, 2025 05:20
@github-actions bot added the Nvidia GPU and ggml labels on Oct 31, 2025
@am17an force-pushed the cuda-add-rope-fusion branch from 406c867 to dc814b8 on October 31, 2025 12:21
@ORippler (Contributor) left a comment


While the fusion itself is quite simple, I would still recommend adding a test to test-backend-ops for it.

@am17an (Collaborator, Author) commented Oct 31, 2025

> While the fusion itself is quite simple, I would still recommend adding a test to test-backend-ops for it.

There is already a test added in the Vulkan PR #16769.

Comment on lines +81 to +82
```cpp
dst[idst + 0] = D(x[ix + 0]);
dst[idst + 1] = D(x[ix + 1]);
```
@JohannesGaessler (Collaborator) commented Nov 2, 2025


I would suggest you use ggml_cuda_cast defined in convert.cuh. Otherwise there will potentially be issues with FP16 <-> BF16 conversions.
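For illustration, a minimal sketch of what this might look like on the quoted lines, assuming `D` is the destination element type used by the kernel template (the surrounding kernel code is not shown here):

```cpp
#include "convert.cuh"  // defines ggml_cuda_cast

// Route the conversion through ggml_cuda_cast instead of constructing D directly,
// so that pairs like FP16 <-> BF16 take the proper intermediate conversion path.
dst[idst + 0] = ggml_cuda_cast<D>(x[ix + 0]);
dst[idst + 1] = ggml_cuda_cast<D>(x[ix + 1]);
```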

Comment on lines +99 to +100
```cpp
dst[idst + 0] = D(x0 * cos_theta - x1 * sin_theta);
dst[idst + 1] = D(x0 * sin_theta + x1 * cos_theta);
```
A collaborator commented:

When you're already working on optimizing RoPE: I think the memory access pattern here is suboptimal because there are gaps between each thread and I don't know whether the compiler is smart enough to combine the first and second write into a single one. I would suggest grouping the values as float2/half2 and either casting dst to that type or using ggml_cuda_memcpy_1.
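To illustrate the suggestion, here is a minimal sketch for the float destination case (not the PR's actual code); `x0`, `x1`, `sin_theta`, `cos_theta`, `dst`, and `idst` come from the quoted lines, while the alignment assumptions are mine:

```cpp
// Pack the rotated pair and issue one vectorized 8-byte store instead of two
// scalar 4-byte stores. For a half destination the analogous packed type is half2,
// or ggml_cuda_memcpy_1 could be used as suggested above.
// Assumes idst is even and dst + idst is 8-byte aligned.
float2 out;
out.x = x0 * cos_theta - x1 * sin_theta;
out.y = x0 * sin_theta + x1 * cos_theta;
*reinterpret_cast<float2 *>(dst + idst) = out;
```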
