Skip to content

Is __shfl_down_sync used incorrectly in the reduction? #398

@yuyygit

Description

@yuyygit

in reduction: kernel reduce7 has code:

...
const unsigned int shmem_extent  = (blockSize / warpSize) > 0 ? (blockSize / warpSize) : 1;
const unsigned int ballot_result = __ballot_sync(mask, tid < shmem_extent);
    if (tid < shmem_extent) {
        mySum = sdata[tid];
        // Reduce final warp using shuffle or reduce_add if T==int & CUDA_ARCH ==
        // SM 8.0
        mySum = warpReduceSum<T>(ballot_result, mySum);
    }
...

if the kernel : grid(1,1,1) block(32,1,1), the shmem_extent will be 1, and ballot_result is 1
the __shfl_dow_sync run as code:

template <class T> __device__ __forceinline__ T warpReduceSum(unsigned int mask, T mySum)
{
    for (int offset = warpSize / 2; offset > 0; offset /= 2) {
        mySum += __shfl_down_sync(mask, mySum, offset);
    }
    return mySum;
}

so only threadIdx.x ==0 do:
mySum += __shfl_down_sync(1, mySum, 16)
mySum += __shfl_down_sync(1, mySum, 8)
mySum += __shfl_down_sync(1, mySum, 4)
mySum += __shfl_down_sync(1, mySum, 2)
mySum += __shfl_down_sync(1, mySum, 1)

dst thread is 0,read data: mySum form thread(16/8/4/2/1) . In this situation, is it risky to execute __shfl_down_sync?

In the CUDA Programming Guide, it states that in this situation (when the source thread is inactive), the data obtained through __shfl_sync is undefined.
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html

Threads may only read data from another thread which is actively participating in the __shfl_sync() command. If the target thread is [inactive](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#simt-architecture-notes), the retrieved value is undefined.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions