CUDA: Optimize reduce_rows_f32 kernel, leading to up to 25x perf improvement at kernel level and a 10% perf increase for Gemma3n
#17276
server.yml
on: pull_request
server-windows
7m 12s
Matrix: server