
Conversation

@ORippler
Contributor

@ORippler ORippler commented Aug 6, 2025

Investigation of Gemma3n perf on NVGPUs identified the reduce_rows_f32 kernel as a major performance bottleneck. Profiling revealed the kernel to be severely latency-limited in the regime exercised by Gemma3n (nrows ~10, ncols in [2048, 8192]).

This PR addresses this issue, hiding the latency by a combination of:

  1. Manual loop unrolling, getting the compiler to request all unrolled data points at once instead of fetching data sequentially (unfortunately, #pragma unroll alone did not do the trick). A sketch of this idea follows the list below.
  2. Increasing the number of threads processing a row, where 512 threads are used for the low-parallelization regime (i.e. processing only a single row). This gives the SM 16 full warps to cycle through, further pipelining data fetching.
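
For illustration, here is a minimal, hedged sketch of the unrolling idea from point 1 (the function name, the `UNROLL` parameter, and the loop structure are assumptions, not the PR's actual kernel; the real code unrolls the loads manually, while this sketch approximates the effect with independent accumulators):

```cuda
// Hedged sketch only: illustrates the manual-unrolling idea, not the PR's exact kernel.
// UNROLL independent accumulators let the compiler issue all UNROLL loads of an
// iteration up front instead of serializing load -> add -> load -> add.
template <int UNROLL>
static __device__ float row_sum_unrolled(const float * __restrict__ row, const int ncols) {
    float sum_temp[UNROLL] = {0.0f};
    const int stride = blockDim.x;

    int col = threadIdx.x;
    for (; col + (UNROLL - 1) * stride < ncols; col += UNROLL * stride) {
#pragma unroll
        for (int i = 0; i < UNROLL; ++i) {
            sum_temp[i] += row[col + i * stride];   // UNROLL independent loads in flight
        }
    }

    float sum = 0.0f;
    for (; col < ncols; col += stride) {            // tail: columns not covered by the unrolled loop
        sum += row[col];
    }
#pragma unroll
    for (int i = 0; i < UNROLL; ++i) {              // fold the independent partial sums
        sum += sum_temp[i];
    }
    return sum;                                     // per-thread partial sum of the row
}
```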

Since perf regressions were identified in the high-parallelization regime (nrows >= 2x SM count), we use:

  • 128 threads for medium-to-large columns, effectively letting each SM process a single row (an SM can execute 4 warps x 32 threads = 128 threads concurrently).
  • As perf regressions were still observed for small columns (< 1024 cols, i.e. less than one unrolled loop iteration of a threadblock of size 128 with 8 unrolls), the thread count was further reduced to 32 threads for small columns. An alternative would have been to template the number of unrolls on the column size; however, this would increase binary size due to the required compilation of multiple kernel variants, and was thus not pursued further. A host-side sketch of the resulting selection logic follows the threshold table below.

The high/low parallelization threshold was empirically determined:

| GPU Model                    | Nrow SM Count Multiple, where 128 beats 512 threads |
| ---------------------------- | ---------------------------------------------------: |
| RTX 4000 SFF ADA             | 2.0x  |
| RTX 6000 ADA                 | 2.5x  |
| RTX PRO 6000 Blackwell Max-Q | 3.04x |
| RTX PRO 4500 Blackwell       | 3.15x |
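
A hedged host-side sketch of the block-size selection described above (the function name and the exact thresholds are illustrative stand-ins; the merged code derives its SM-count multiple from the measurements in the table):

```cuda
// Hedged sketch of the launch heuristic (illustrative names and thresholds only).
static dim3 reduce_rows_block_dims(const int64_t nrows, const int64_t ncols, const int sm_count) {
    // Low-parallelization regime (few rows): 512 threads give the SM 16 warps
    // to cycle through while memory requests are in flight.
    if (nrows < 2 * sm_count) {   // 2x used as a stand-in; break-even was 2.0x-3.15x depending on GPU
        return dim3(512, 1, 1);
    }
    // High-parallelization regime: roughly one threadblock per row.
    if (ncols < 1024) {           // less than one unrolled iteration of 128 threads x 8 unrolls
        return dim3(32, 1, 1);    // a single warp avoids idle threads on short rows
    }
    return dim3(128, 1, 1);       // 4 warps, matching what one SM executes concurrently
}
```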

In total, up to ~25x perf improvement was observed at the kernel level.
[Figure: speedup_comparison_multiple]
Moreover, no regression was observed in any of the investigated combinations.
[Figure: speedup_comparison_fractional]

As a consequence of this general kernel optimization, Gemma3n achieves a ~10% perf increase, going from ~130 to ~145 tok/s on an RTX PRO 6000 Blackwell Max-Q at batch size 1.

Naive:

  Device 0: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition, compute capability 12.0, VMM: yes
| model                          |       size |     params | backend    | ngl | n_batch |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | --------------: | -------------------: |
| gemma3n E2B Q8_0               |   4.45 GiB |     4.46 B | CUDA       |  99 |       1 |           pp100 |        147.27 ± 0.82 |
| gemma3n E2B Q8_0               |   4.45 GiB |     4.46 B | CUDA       |  99 |       1 |           tg100 |        130.75 ± 0.28 |

Optimized:

  Device 0: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition, compute capability 12.0, VMM: yes
| model                          |       size |     params | backend    | ngl | n_batch |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | --------------: | -------------------: |
| gemma3n E2B Q8_0               |   4.45 GiB |     4.46 B | CUDA       |  99 |       1 |           pp100 |        168.41 ± 0.37 |
| gemma3n E2B Q8_0               |   4.45 GiB |     4.46 B | CUDA       |  99 |       1 |           tg100 |        146.68 ± 0.68 |

Side note: Similar tendencies were observed for rms_norm_f32, and we intend to optimize said kernel in a separate PR.

@github-actions github-actions bot added the labels testing (Everything test related), Nvidia GPU (Issues specific to Nvidia GPUs), and ggml (changes relating to the ggml tensor library for machine learning) Aug 6, 2025
Factor out `reduce_rows_f32` from common.cuh

This increases iteration cycle speed by not having to recompile
every kernel all the time.

Further optimizations to `reduce_rows_f32`

1. Increase threadblock size to better hide latency of memory requests.
   As a consequence of bigger threadblocks, do 2-step summation, using
   shared memory to communicate results between invocations (see the sketch below)
2. Use sum_temp array to reduce waits on sum
3. Adjust num_unroll to reflect bigger threadblock
4. Improve default block_dims, increase support for more block_dims
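
For context, a hedged sketch of the 2-step summation from point 1 (this is the generic warp-shuffle plus shared-memory pattern; names and details are assumptions, not necessarily the commit's exact code):

```cuda
// Hedged sketch: reduce the per-thread partial sums of a threadblock in two steps.
// Step 1: each warp reduces its 32 partial sums with shuffles; lane 0 writes the
//         warp result to shared memory.
// Step 2: the first warp reduces the per-warp results from shared memory.
static __device__ float block_reduce_sum(float sum) {
    __shared__ float s_warp_sums[32];                 // enough for up to 1024 threads
    const int lane = threadIdx.x % 32;
    const int warp = threadIdx.x / 32;

    for (int offset = 16; offset > 0; offset >>= 1) { // step 1: intra-warp reduction
        sum += __shfl_down_sync(0xffffffff, sum, offset);
    }
    if (lane == 0) {
        s_warp_sums[warp] = sum;
    }
    __syncthreads();

    const int n_warps = blockDim.x / 32;              // step 2: reduce the warp results
    sum = (threadIdx.x < n_warps) ? s_warp_sums[lane] : 0.0f;
    if (warp == 0) {
        for (int offset = 16; offset > 0; offset >>= 1) {
            sum += __shfl_down_sync(0xffffffff, sum, offset);
        }
    }
    return sum;                                       // final sum is valid in thread 0
}
```
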
Add heuristic to toggle 128/512 threads based on sm count

Break-even point was the minimum of the following multiples.

| GPU Model                     | Nrow SM Count Multiple |
| ----------------------------- | ---------------------- |
| RTX 4000 SFF ADA              | 2.0x                   |
| RTX 6000 ADA                  | 2.5x                   |
| RTX PRO 6000 Blackwell Max-Q  | 3.04x                  |
| RTX PRO 4500 Blackwell        | 3.15x                  |

Ensure perf gains also for small ncols and large nrows

As an alternative to this, one could have also made the number of unrolls
template-able, but that would require compiling the kernel multiple
times, increasing binary size unnecessarily.

@ORippler ORippler force-pushed the osimons/optimize_reduce_rows_f32 branch from c6ed8cc to 9296d1f on August 7, 2025 07:46
@ORippler
Contributor Author

ORippler commented Aug 7, 2025

Rebased on current master, resolving conflicts along the way. Reran E2E perf tests for gemma3n, and we continue to see perf gains. Nice to see some other optimizations for tg were made in master 😃

Naive:
Device 0: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition, compute capability 12.0, VMM: yes

| model            |     size |  params | backend | ngl | n_batch |  test |           t/s |
| ---------------- | -------: | ------: | ------- | --: | ------: | ----: | ------------: |
| gemma3n E2B Q8_0 | 4.45 GiB |  4.46 B | CUDA    |  99 |       1 | pp100 | 146.89 ± 0.12 |
| gemma3n E2B Q8_0 | 4.45 GiB |  4.46 B | CUDA    |  99 |       1 | tg100 | 145.86 ± 0.13 |

Optimized:
Device 0: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition, compute capability 12.0, VMM: yes

| model            |     size |  params | backend | ngl | n_batch |  test |           t/s |
| ---------------- | -------: | ------: | ------- | --: | ------: | ----: | ------------: |
| gemma3n E2B Q8_0 | 4.45 GiB |  4.46 B | CUDA    |  99 |       1 | pp100 | 167.47 ± 0.29 |
| gemma3n E2B Q8_0 | 4.45 GiB |  4.46 B | CUDA    |  99 |       1 | tg100 | 167.28 ± 0.35 |

Remove sm_count property from `ggml_backend_cuda_context`

Requested by @JohannesGaessler, and should fix remaining CI issues as a
side effect.

@JohannesGaessler
Collaborator

Thank you for answering my questions (even though I could have gotten the answers by reading the PR description more carefully). If you test using CUB for GGML_MEAN this PR would essentially be good to merge from my side.

@IMbackK
Collaborator

IMbackK commented Aug 7, 2025

Quick test shows this PR is also broadly performance-positive on CDNA and performance-neutral on RDNA2.

Add CUB-based implementation for GGML_OP_MEAN

Currently this branch is only executed for nrows == 1.

Add heuristics to execute CUB branch only when it brings perf

Heuristics were determined on the following HW:

* RTX 4000 SFF ADA
* RTX 6000 ADA
* RTX PRO 6000 Blackwell Max-Q
* RTX PRO 4500 Blackwell

Add unit-test for CUB-based mean

Tests should run with CUDA Graphs enabled per default on NVGPUs.

@ORippler
Contributor Author

> Thank you for answering my questions (even though I could have gotten the answers by reading the PR description more carefully). If you test using CUB for GGML_MEAN this PR would essentially be good to merge from my side.

@JohannesGaessler As requested, I put up a naive implementation that uses CUB for GGML_OP_MEAN. The implementation uses CUB to compute the device-wide sum, and another kernel to divide the sum by ncols (CUB does not offer a device-wide mean operation).
Benchmarks show that for small ncols the CUB-based implementation is slower than reduce_rows_f32, indicating it is worse at hiding the latency of data access. However, using more than one threadblock for a single row allows it to scale much better for high ncols, and as a consequence it outperforms reduce_rows_f32 there. Because the CUB-based implementation has twice the CPU-side kernel-launch overhead of reduce_rows_f32, it starts to outperform reduce_rows_f32 only at higher ncols when CUDA Graphs are disabled.

I reflected the above insights by branching the execution in ggml_cuda_op_mean accordingly. I did not implement or benchmark a CUB-based implementation for nrows > 1, but expect it to be comparable to reduce_rows_f32 for most settings, as rows are effectively parallelized across threadblocks in the current reduce_rows_f32 implementation. I also did not investigate writing a single kernel that uses more granular CUB primitives, as I presume we want to preserve a hipify-able kernel (see this comment).
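
For reference, a minimal hedged sketch of the CUB-based approach described above (the function names, the use of cudaMallocAsync instead of ggml's pool allocator, and the trivial divide kernel are illustrative assumptions, not the PR's actual code):

```cuda
// Hedged sketch only: CUB computes the device-wide sum, then a tiny second kernel
// divides by ncols (CUB has no device-wide mean). Error checking is omitted.
#include <cstdint>
#include <cub/device/device_reduce.cuh>

static __global__ void divide_sum_by_ncols(float * dst, const int64_t ncols) {
    *dst /= (float) ncols;
}

static void mean_row_cub(const float * src, float * dst, const int64_t ncols, cudaStream_t stream) {
    void * d_temp   = nullptr;
    size_t temp_len = 0;
    // First call only queries the required temporary-storage size.
    cub::DeviceReduce::Sum(d_temp, temp_len, src, dst, ncols, stream);
    cudaMallocAsync(&d_temp, temp_len, stream);
    cub::DeviceReduce::Sum(d_temp, temp_len, src, dst, ncols, stream);
    cudaFreeAsync(d_temp, stream);

    divide_sum_by_ncols<<<1, 1, 0, stream>>>(dst, ncols);
}
```

This also makes the launch-overhead point above concrete: the host issues two kernel launches per mean (CUB's reduction plus the divide kernel), versus one for reduce_rows_f32.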

@ORippler
Contributor Author

I personally feel the CUB-based implementation is a bit beyond the original scope of this PR. However, since I am unable to create branches in the base repo and am unaware of how to represent stacked PRs on GitHub for PRs filed across forks, I left it in here.

Collaborator

@JohannesGaessler JohannesGaessler left a comment


Thank you for the high-effort PR.

@ORippler
Contributor Author

@JohannesGaessler Could we get this merged whenever you have the time? Unfortunately I don't have write access 🙈

@JohannesGaessler JohannesGaessler merged commit 6028bf7 into ggml-org:master Aug 13, 2025
47 checks passed
@JohannesGaessler
Collaborator

Ah sorry, I wanted to merge this yesterday (after the CI finishes) and I forgot about it.

@ORippler ORippler deleted the osimons/optimize_reduce_rows_f32 branch August 13, 2025 08:05
ggerganov pushed a commit to ggml-org/ggml that referenced this pull request Aug 13, 2025
…vement on kernel-level and 10% perf increase for Gemma3n (llama/15132)

* Factor out `reduce_rows_f32` from common.cuh

This increases iteration cycle speed by not having to recompile
every kernel all the time

* Hide memory-latency by loop unrolling in reduce_rows_f32

* Further optimizations to `reduce_rows_f32`

1. Increase threadblock size to better hide latency of memory requests.
   As a consequence of bigger threadblocks, do 2-step summation, using
   shared memory to communicate results between invocations
2. Use sum_temp array to reduce waits on sum
3. Adjust num_unroll to reflect bigger threadblock
4. Improve default block_dims, increase support for more block_dims

* Add perf tests for `reduce_rows_f32` kernel

* Add heuristic to toggle 128/512 threads based on sm count

Break even point was the minimum of the following multiples.

| GPU Model                     | Nrow SM Count Multiple |
| -----------                   | -----------            |
| RTX 4000 SFF ADA              | 2.0x                   |
| RTX 6000 ADA                  | 2.5x                   |
| RTX PRO 6000 Blackwell Max-Q  | 3.04x                  |
| RTX PRO 4500 Blackwell        | 3.15x                  |

* Ensure perf gains also for small ncols and large nrows

Alternative to this, one could have also made the number of unrollings
template-able, but that would require compiling the kernel multiple
times, increasing binary size unnecessarily

* Modify perf and unit-tests

* Apply auto-formatting by clang

* Fix CI build failure

See https://github.com/ggml-org/llama.cpp/actions/runs/16798370266/job/47573716079?pr=15132#step:7:486
Building with VS generator worked though.

* Remove sm_count property from `ggml_backend_cuda_context`

Requested by @JohannesGaessler, and should fix remaining CI issues as a
side-effect

* Add CUB-based implementation for GGML_OP_MEAN

Currently this branch is only executed for nrows==1

* Add heuristics to execute CUB branch only when it brings perf

Heuristics were determined on the following HW:

* RTX 4000 SFF ADA
* RTX 6000 ADA
* RTX PRO 6000 Blackwell Max-Q
* RTX PRO 4500 Blackwell

* Add unit-test for CUB-based mean

Tests should run with CUDA Graphs enabled per default on NVGPUs

* Rename `USE_CUB` to `GGML_CUDA_USE_CUB`

Suggested by @JohannesGaessler

* Unindent Preprocessor directives

See
ggml-org/llama.cpp#15132 (comment)

ggerganov pushed a commit to ggml-org/ggml that referenced this pull request Aug 14, 2025
…vement on kernel-level and 10% perf increase for Gemma3n (llama/15132)

ggerganov pushed a commit to ggml-org/whisper.cpp that referenced this pull request Aug 18, 2025
…vement on kernel-level and 10% perf increase for Gemma3n (llama/15132)

ggerganov pushed a commit to ggml-org/whisper.cpp that referenced this pull request Aug 18, 2025
…vement on kernel-level and 10% perf increase for Gemma3n (llama/15132)