
Accumulate becomes slow for very large input sizes #75

@AntonReinhard

Description

For input sizes of about 2^27 elements or larger, the gpu__accumulate_previous_coupled_preblocks_ call starts to heavily dominate the runtime of the scan, according to CUDA.@profile on my system (tested with Int32 and Float32).

For example, with 2^27 elements:

  • 3.37 ms are spent in the top-level gpu__accumulate_block_
  • 13.11 µs are spent in the recursed gpu__accumulate_block_
  • 11.16 ms are spent in the gpu__accumulate_previous_coupled_preblocks_

Since accumulate_previous_coupled_preblocks is essentially just a vectorized add, it should not be this slow. The problem gets much worse for larger vectors: with 2^30 elements, gpu__accumulate_previous_coupled_preblocks_ takes 580 ms, which is 92% of the total accumulate time on my system.
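For context, that step corresponds to the second pass of a two-level block scan: after each block is scanned independently and the block aggregates are themselves scanned, every block just adds the exclusive prefix of the preceding blocks onto its elements. A minimal CPU-side sketch of that structure (illustrative names and layout, not the actual AcceleratedKernels.jl kernels):

```julia
# CPU sketch of a two-level block scan. The final loop is the
# "accumulate previous preblocks" step: a plain vectorized add of each
# block's exclusive prefix onto its elements.
function block_scan(v::Vector{T}, block_size::Int) where T
    n = length(v)
    out = similar(v)
    nblocks = cld(n, block_size)
    aggregates = Vector{T}(undef, nblocks)
    # Pass 1: scan each block independently, record its total.
    for b in 1:nblocks
        lo = (b - 1) * block_size + 1
        hi = min(b * block_size, n)
        acc = zero(T)
        for i in lo:hi
            acc += v[i]
            out[i] = acc
        end
        aggregates[b] = acc
    end
    # Scan of the block aggregates (the recursed accumulate step).
    prefix = cumsum(aggregates)
    # Pass 2: add each block's exclusive prefix — this is the step that
    # should be bandwidth-bound, i.e. roughly as fast as a vector add.
    for b in 2:nblocks
        lo = (b - 1) * block_size + 1
        hi = min(b * block_size, n)
        out[lo:hi] .+= prefix[b - 1]
    end
    return out
end
```

On this model, pass 2 reads and writes each element once, so for 2^30 elements it should take on the order of a single memory sweep, nowhere near 580 ms.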

Compared with a simple C++ CUB reference implementation, AcceleratedKernels.jl keeps up with CUB performance very well for smaller inputs, but then suddenly falls off a cliff at these larger sizes. The CUB reference takes ~12 ms for 2^30 elements, and an alpaka3 implementation of the same coupled lookback takes ~27 ms at this size.
