CUDA: Optimize reduce_rows_f32 kernel, leading up to 25x perf improvement on kernel-level and 10% perf increase for Gemma3n #25856

Job Run time
11m 57s
4m 46s
13m 5s
1m 42s
1m 13s
2m 10s
2m 41s
3m 56s
16m 55s
10m 7s
2m 23s
1m 16s
2m 31s
3m 19s
4m 5s
2m 13s
11m 53s
1h 0m 42s
11m 55s
16m 36s
10m 37s
8m 21s
15m 43s
1m 19s
0s
7m 51s
10m 13s
18m 14s
6m 11s
9m 53s
3m 31s
1m 41s
1m 57s
10m 14s
45m 23s
26m 12s
4m 36s
3m 11s
4m 46s
8m 25s
3m 19s
6h 27m 2s