Description
Required prerequisites
- I have searched the Issue Tracker to confirm this hasn't already been reported. (Comment there if it has.)
Motivation
Currently, the implementation of T.reduce is far from satisfactory. Compared to the CCCL implementation, it has several limitations, which I explain in the Solution section below.
Solution
Let's see how CCCL solves this (a code sketch follows the list):
- For threads == 32: use the redux.sync instruction, a specialized reduction instruction available on sm >= 80.
- For threads > 32:
  a. Do a per-warp reduction first, using redux.sync.
  b. Each warp's leader thread writes its partial sum to temporary shared memory, then synchronize the block once.
  c. Read the partials back from shared memory (at most 31 besides a thread's own; the loop can be unrolled to skip its own slot) and reduce them.
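For concreteness, here is a minimal sketch of that scheme (my own illustration, not the CCCL source). It assumes the block size is a multiple of 32 and an sm_80+ target, and uses the __reduce_add_sync intrinsic, which lowers to redux.sync; the helper name block_reduce_sum is hypothetical.

template <int kBlockSize>
__device__ int block_reduce_sum(int val) {
    static_assert(kBlockSize % 32 == 0, "block size must be a multiple of 32");
    constexpr int kNumWarps = kBlockSize / 32;
    __shared__ int warp_sums[kNumWarps];
    int const lane = threadIdx.x % 32;
    int const warp = threadIdx.x / 32;

    // (a) per-warp reduction; __reduce_add_sync compiles to redux.sync on sm_80+
    val = __reduce_add_sync(0xffffffffu, val);
    if (kNumWarps == 1) return val;  // threads == 32: done, no block sync at all

    // (b) each warp's leader publishes its partial sum, then one block-level sync
    if (lane == 0) warp_sums[warp] = val;
    __syncthreads();

    // (c) the first warp gathers the partials (at most 32) and reduces them
    if (warp == 0) {
        int const partial = lane < kNumWarps ? warp_sums[lane] : 0;
        val = __reduce_add_sync(0xffffffffu, partial);
    }
    return val;  // result is valid in warp 0
}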
The current solution suffers from:
- Low speed: it performs too many block-level syncs, whereas the scheme above syncs at most once. It also fails to use the latest instruction (redux.sync).
- More shared memory: doing the warp-level reduction first would shrink the temporary storage.
- Tighter constraints: the scheme above handles any thread count that is a multiple of 32, but the current implementation only handles powers of 2.
Alternatives
Use the reduction from cub/block/block_reduce.cuh directly, though this requires constructing a TempStorage. The PTX it generates is also worth learning from.
#include <cub/block/block_reduce.cuh>
#include <cstdio>

template <int block_size>
__global__ void reduce(int const* data, int* global_sum) {
    using BlockReduce = cub::BlockReduce<int, block_size>;
    __shared__ typename BlockReduce::TempStorage temp_storage;
    int const index = threadIdx.x + blockIdx.x * blockDim.x;
    int sum = BlockReduce(temp_storage).Sum(data[index]);
    // The reduced value is only valid in thread 0 of each block.
    if (threadIdx.x == 0) global_sum[blockIdx.x] = sum;
}

int main() {
    constexpr int N = 1024;
    int h_data[N];
    for (int i = 0; i < N; ++i) h_data[i] = 1;  // expected sum: N
    int* d_data;
    int* d_result;  // one partial sum per block (a single block here)
    cudaMalloc(&d_data, N * sizeof(int));
    cudaMalloc(&d_result, sizeof(int));
    cudaMemcpy(d_data, h_data, N * sizeof(int), cudaMemcpyHostToDevice);
    reduce<N><<<1, N>>>(d_data, d_result);
    int h_result = 0;
    cudaMemcpy(&h_result, d_result, sizeof(int), cudaMemcpyDeviceToHost);
    printf("sum = %d\n", h_result);
    cudaFree(d_data);
    cudaFree(d_result);
    return 0;
}
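For reference, the example above should build with something like nvcc -std=c++17 -O2 reduce.cu (flags assumed here, not from the original report); the redux.sync sketch earlier additionally needs -arch=sm_80 or newer.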
Additional context
No response