-
Notifications
You must be signed in to change notification settings - Fork 2.2k
Open
Description
in reduction: kernel reduce7 has code:
...
const unsigned int shmem_extent = (blockSize / warpSize) > 0 ? (blockSize / warpSize) : 1;
const unsigned int ballot_result = __ballot_sync(mask, tid < shmem_extent);
if (tid < shmem_extent) {
mySum = sdata[tid];
// Reduce final warp using shuffle or reduce_add if T==int & CUDA_ARCH ==
// SM 8.0
mySum = warpReduceSum<T>(ballot_result, mySum);
}
...
if the kernel : grid(1,1,1) block(32,1,1), the shmem_extent will be 1, and ballot_result is 1
the __shfl_dow_sync run as code:
template <class T> __device__ __forceinline__ T warpReduceSum(unsigned int mask, T mySum)
{
for (int offset = warpSize / 2; offset > 0; offset /= 2) {
mySum += __shfl_down_sync(mask, mySum, offset);
}
return mySum;
}
so only threadIdx.x ==0 do:
mySum += __shfl_down_sync(1, mySum, 16)
mySum += __shfl_down_sync(1, mySum, 8)
mySum += __shfl_down_sync(1, mySum, 4)
mySum += __shfl_down_sync(1, mySum, 2)
mySum += __shfl_down_sync(1, mySum, 1)
dst thread is 0,read data: mySum form thread(16/8/4/2/1) . In this situation, is it risky to execute __shfl_down_sync?
In the CUDA Programming Guide, it states that in this situation (when the source thread is inactive), the data obtained through __shfl_sync is undefined.
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
Threads may only read data from another thread which is actively participating in the __shfl_sync() command. If the target thread is [inactive](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#simt-architecture-notes), the retrieved value is undefined.
Metadata
Metadata
Assignees
Labels
No labels