
[Feature Request] Better performance for T.reduce #1761

@bucket-xv

Description

Required prerequisites

  • I have searched the Issue Tracker and confirmed this hasn't already been reported. (If it has, please comment there instead.)

Motivation

Currently, the implementation of T.reduce is far from satisfactory. Compared to the CCCL implementation, it has several limitations, which I explain in the Solution section below.

Solution

Let's see how CCCL solves this (see the sketch after this list):

  1. For threads == 32, use the redux.sync instruction, a specialized reduction instruction available on sm_80 and newer architectures.
  2. For threads > 32:
    a. First do a per-warp reduction using redux.sync.
    b. The leader thread of each warp writes its partial result to temporary shared memory, then the block synchronizes once.
    c. Read the remaining partials (at most 31, since each leader already holds its own; the loop can be unrolled) back from shared memory and reduce them.
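A minimal device-side sketch of the scheme above, assuming an int sum reduction, a block size that is a multiple of 32, and sm_80+ (the __reduce_add_sync intrinsic lowers to redux.sync). The helper name block_reduce_sum is mine, not CCCL's, and step 2c uses a second redux.sync over the first warp in place of the unrolled read loop:

template <int block_size>  // assumed to be a multiple of 32
__device__ int block_reduce_sum(int value) {
  constexpr int num_warps = block_size / 32;
  __shared__ int warp_sums[num_warps];  // at most 32 slots, not block_size

  int const lane = threadIdx.x % 32;
  int const warp = threadIdx.x / 32;

  // Step 2a: per-warp reduction via the specialized instruction (sm_80+).
  int sum = __reduce_add_sync(0xffffffffu, value);

  // Step 2b: each warp leader writes its partial result to shared memory,
  // followed by the single block-level sync of the whole scheme.
  if (lane == 0) warp_sums[warp] = sum;
  __syncthreads();

  // Step 2c: the first warp reads the (at most 32) partials back and
  // reduces them; lanes beyond num_warps contribute a neutral 0.
  if (warp == 0) {
    sum = __reduce_add_sync(0xffffffffu, lane < num_warps ? warp_sums[lane] : 0);
  }
  return sum;  // valid in every lane of warp 0
}

For threads == 32 this degenerates to case 1: a single redux.sync (plus one redundant sync that a real implementation would branch around).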

The current solution suffers from (the classic pattern it follows is sketched after this list for contrast):

  1. Low speed: it performs many block-level syncs, while the solution above syncs at most once. It also fails to use the latest redux.sync instruction.
  2. More shared memory: doing the warp-level reduction first shrinks the temporary storage from block_size elements to at most 32.
  3. Stronger constraints: the solution above can handle any thread count that is a multiple of 32, but the current implementation can only handle powers of 2.
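For contrast, the pattern these three points describe is essentially the classic power-of-two shared-memory tree reduction. A sketch of that general pattern follows (illustrative only; I have not reproduced T.reduce's actual generated code):

template <int block_size>  // must be a power of 2
__device__ int tree_reduce_sum(int value) {
  __shared__ int scratch[block_size];  // block_size elements, not 32
  scratch[threadIdx.x] = value;
  __syncthreads();
  // One block-wide sync per halving step: log2(block_size) syncs in total,
  // e.g. 10 for a 1024-thread block, versus one in the scheme above.
  for (int stride = block_size / 2; stride > 0; stride /= 2) {
    if (threadIdx.x < stride) {
      scratch[threadIdx.x] += scratch[threadIdx.x + stride];
    }
    __syncthreads();
  }
  return scratch[0];
}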

Alternatives

Use the cub/block/block_reduce.cuh reduction directly, though this requires constructing a TempStorage in shared memory. Studying the PTX it generates is also instructive.

#include <cub/block/block_reduce.cuh>
#include <cstdio>
#include <vector>

template <int block_size>
__global__ void reduce(int const* data, int* global_sum) {
  using BlockReduce = cub::BlockReduce<int, block_size>;
  // CUB requires per-block temporary storage in shared memory.
  __shared__ typename BlockReduce::TempStorage temp_storage;

  int const index = threadIdx.x + blockIdx.x * blockDim.x;
  int const sum = BlockReduce(temp_storage).Sum(data[index]);

  // Only thread 0 of each block holds the valid block-wide result.
  if (threadIdx.x == 0) {
    global_sum[blockIdx.x] = sum;
  }
}

int main() {
  constexpr int N = 1024;
  std::vector<int> h_data(N, 1);  // all ones, so the expected sum is N

  int* d_data;
  int* d_result;
  cudaMalloc(&d_data, N * sizeof(int));
  cudaMalloc(&d_result, sizeof(int));  // one block, so one partial sum
  cudaMemcpy(d_data, h_data.data(), N * sizeof(int), cudaMemcpyHostToDevice);

  reduce<N><<<1, N>>>(d_data, d_result);

  int h_result = 0;
  cudaMemcpy(&h_result, d_result, sizeof(int), cudaMemcpyDeviceToHost);
  printf("sum = %d\n", h_result);  // prints: sum = 1024

  cudaFree(d_data);
  cudaFree(d_result);
  return 0;
}
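This should build with a plain nvcc invocation (e.g. nvcc -O2 reduce.cu), since CUB ships with the CUDA Toolkit; unlike redux.sync, cub::BlockReduce itself does not require sm_80.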

Additional context

No response
