CUDA: Optimize reduce_rows_f32 kernel, leading up to 25x perf improvement on kernel-level and 10% perf increase for Gemma3n
#17276
| Job | Run time |
|---|---|
| | 7m 12s |
| | 10m 12s |
| | 11m 52s |
| | 6m 54s |
| Total | 36m 10s |