
[Feature Request] Better performance for T.reduce #1761

@bucket-xv

Description

Required prerequisites

  • I have searched the Issue Tracker and confirmed this hasn't already been reported. (If it has, please comment there instead.)

Motivation

Currently, the implementation of T.reduce is far from satisfactory. Compared to the CCCL implementation, it has several limitations, which I explain in the Solution section below.

Solution

Let's see how CCCL solves this (see the sketch after this list):

  1. For threads == 32, use the redux.sync instruction, a specialized reduction instruction available on sm_80 and newer architectures.
  2. For threads > 32:
    a. First do a per-warp reduction using redux.sync.
    b. The leader thread of each warp writes its partial result to temporary shared memory, then the block synchronizes once.
    c. Read the remaining partials (at most 31, since each leader already holds its own; the loop can be unrolled) back from shared memory and reduce them.
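A minimal device-side sketch of the scheme above, assuming an int sum reduction, a block size that is a multiple of 32, and sm_80+ (the __reduce_add_sync intrinsic lowers to redux.sync). The helper name block_reduce_sum is mine, not CCCL's, and step 2c uses a second redux.sync over the first warp in place of the unrolled read loop:

template <int block_size>  // assumed to be a multiple of 32
__device__ int block_reduce_sum(int value) {
  constexpr int num_warps = block_size / 32;
  __shared__ int warp_sums[num_warps];  // at most 32 slots, not block_size

  int const lane = threadIdx.x % 32;
  int const warp = threadIdx.x / 32;

  // Step 2a: per-warp reduction via the specialized instruction (sm_80+).
  int sum = __reduce_add_sync(0xffffffffu, value);

  // Step 2b: each warp leader writes its partial result to shared memory,
  // followed by the single block-level sync of the whole scheme.
  if (lane == 0) warp_sums[warp] = sum;
  __syncthreads();

  // Step 2c: the first warp reads the (at most 32) partials back and
  // reduces them; lanes beyond num_warps contribute a neutral 0.
  if (warp == 0) {
    sum = __reduce_add_sync(0xffffffffu, lane < num_warps ? warp_sums[lane] : 0);
  }
  return sum;  // valid in every lane of warp 0
}

For threads == 32 this degenerates to case 1: a single redux.sync (plus one redundant sync that a real implementation would branch around).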

The current solution suffers from (the classic pattern it follows is sketched after this list for contrast):

  1. Low speed: it performs many block-level syncs, while the solution above syncs at most once. It also fails to use the latest redux.sync instruction.
  2. More shared memory: doing the warp-level reduction first shrinks the temporary storage from block_size elements to at most 32.
  3. Stronger constraints: the solution above can handle any thread count that is a multiple of 32, but the current implementation can only handle powers of 2.
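For contrast, the pattern these three points describe is essentially the classic power-of-two shared-memory tree reduction. A sketch of that general pattern follows (illustrative only; I have not reproduced T.reduce's actual generated code):

template <int block_size>  // must be a power of 2
__device__ int tree_reduce_sum(int value) {
  __shared__ int scratch[block_size];  // block_size elements, not 32
  scratch[threadIdx.x] = value;
  __syncthreads();
  // One block-wide sync per halving step: log2(block_size) syncs in total,
  // e.g. 10 for a 1024-thread block, versus one in the scheme above.
  for (int stride = block_size / 2; stride > 0; stride /= 2) {
    if (threadIdx.x < stride) {
      scratch[threadIdx.x] += scratch[threadIdx.x + stride];
    }
    __syncthreads();
  }
  return scratch[0];
}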

Alternatives

Use the cub/block/block_reduce.cuh reduction directly, though this requires constructing a TempStorage in shared memory. Studying the PTX it generates is also instructive.

#include <cub/block/block_reduce.cuh>
#include <cstdio>
#include <vector>

template <int block_size>
__global__ void reduce(int const* data, int* global_sum) {
  using BlockReduce = cub::BlockReduce<int, block_size>;
  // CUB requires per-block temporary storage in shared memory.
  __shared__ typename BlockReduce::TempStorage temp_storage;

  int const index = threadIdx.x + blockIdx.x * blockDim.x;
  int const sum = BlockReduce(temp_storage).Sum(data[index]);

  // Only thread 0 of each block holds the valid block-wide result.
  if (threadIdx.x == 0) {
    global_sum[blockIdx.x] = sum;
  }
}

int main() {
  constexpr int N = 1024;
  std::vector<int> h_data(N, 1);  // all ones, so the expected sum is N

  int* d_data;
  int* d_result;
  cudaMalloc(&d_data, N * sizeof(int));
  cudaMalloc(&d_result, sizeof(int));  // one block, so one partial sum
  cudaMemcpy(d_data, h_data.data(), N * sizeof(int), cudaMemcpyHostToDevice);

  reduce<N><<<1, N>>>(d_data, d_result);

  int h_result = 0;
  cudaMemcpy(&h_result, d_result, sizeof(int), cudaMemcpyDeviceToHost);
  printf("sum = %d\n", h_result);  // prints: sum = 1024

  cudaFree(d_data);
  cudaFree(d_result);
  return 0;
}
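This should build with a plain nvcc invocation (e.g. nvcc -O2 reduce.cu), since CUB ships with the CUDA Toolkit; unlike redux.sync, cub::BlockReduce itself does not require sm_80.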

Additional context

No response
