For input sizes of about 2^27 or larger, the gpu__accumulate_previous_coupled_preblocks_ call starts to heavily dominate the runtime of the scan, according to CUDA.@profile on my system (tested with Int32 and Float32).
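A minimal way to reproduce this kind of profile might look like the sketch below. This assumes AcceleratedKernels.accumulate with an init keyword is the entry point being measured; adjust to whatever call site you are actually profiling.

```julia
using CUDA, AcceleratedKernels

x = CUDA.rand(Float32, 2^27)

# Warm up once so compilation is excluded from the profile
AcceleratedKernels.accumulate(+, x; init=0f0)

# CUDA.@profile reports per-kernel device times
CUDA.@profile AcceleratedKernels.accumulate(+, x; init=0f0)
```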
For example, using 2^27 elements:
- 3.37 ms are spent in the top-level gpu__accumulate_block_
- 13.11 µs are spent in the recursed gpu__accumulate_block_
- 11.16 ms are spent in gpu__accumulate_previous_coupled_preblocks_
Since accumulate_previous_coupled_preblocks is essentially just a vectorized add, it should not be this slow. The problem gets much worse for larger vectors: for 2^30 elements, gpu__accumulate_previous_coupled_preblocks_ takes 580 ms, 92% of the total accumulate time on my system.
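For a rough sense of what a purely bandwidth-bound add should cost at this size, one can time a plain broadcasted add over the same number of elements. A sketch, using the 2^30 size and Float32 element type from the numbers above:

```julia
using CUDA

n = 2^30
y = CUDA.rand(Float32, n)

# Warm up, then time a plain broadcasted add as a bandwidth baseline;
# CUDA.@elapsed uses CUDA events, so it measures device time.
y .+= 1f0
t = CUDA.@elapsed y .+= 1f0
println("broadcast add over $n elements: $(1000 * t) ms")
```

On most recent GPUs this lands in the low tens of milliseconds at worst, which is what makes the 580 ms figure stand out.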
Compared with a simple C++ cub reference implementation, AcceleratedKernels.jl keeps up with cub's performance very well for smaller inputs, but then suddenly falls off a cliff at these larger sizes. The cub reference takes ~12 ms for 2^30 elements, and an alpaka3 implementation of the same coupled lookback takes ~27 ms at this size.
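The cub reference itself is C++, but as a rough in-Julia proxy for a tuned scan at the same size, one could also time CUDA.jl's built-in accumulate! (a sketch; note this is CUDA.jl's own scan implementation, not cub):

```julia
using CUDA

n = 2^30
x = CUDA.rand(Float32, n)
y = similar(x)

# Warm up, then time CUDA.jl's built-in scan for comparison
accumulate!(+, y, x)
t = CUDA.@elapsed accumulate!(+, y, x)
println("CUDA.jl accumulate! over $n elements: $(1000 * t) ms")
```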