CUDA: Optimize reduce_rows_f32 kernel, leading up to 25x perf improvement on kernel-level and 10% perf increase for Gemma3n
#25856
| Job | Run time |
|---|---|
| | 11m 57s |
| | 4m 46s |
| | 13m 5s |
| | 1m 42s |
| | 1m 13s |
| | 2m 10s |
| | 2m 41s |
| | 3m 56s |
| | 16m 55s |
| | 10m 7s |
| | 2m 23s |
| | 1m 16s |
| | 2m 31s |
| | 3m 19s |
| | 4m 5s |
| | 2m 13s |
| | 11m 53s |
| | 1h 0m 42s |
| | 11m 55s |
| | 16m 36s |
| | 10m 37s |
| | 8m 21s |
| | 15m 43s |
| | 1m 19s |
| | 0s |
| | 7m 51s |
| | 10m 13s |
| | 18m 14s |
| | 6m 11s |
| | 9m 53s |
| | 3m 31s |
| | 1m 41s |
| | 1m 57s |
| | 10m 14s |
| | 45m 23s |
| | 26m 12s |
| | 4m 36s |
| | 3m 11s |
| | 4m 46s |
| | 8m 25s |
| | 3m 19s |
| | 6h 27m 2s |