@@ -611,6 +611,70 @@ code, while the host can query it during runtime via the device properties. See
611611the :ref: `HIP language extension for warpSize <warp_size >` for information on
612612how to write portable wave-aware code.
613613
614+ Lane masks bit-shift
615+ ================================================================================
616+
617+ A thread in a warp is also called a lane, and a lane mask is a bitmask where
618+ each bit corresponds to a thread in a warp. A bit is 1 if the thread is active,
619+ 0 if it's inactive. Bit-shift operations are typically used to create lane masks
620+ and on AMD GPUs the ``warpSize `` can differ between different architectures,
621+ that's why it's essential to use correct bitmask type, when porting code.
622+
623+ Example:
624+
625+ .. code-block :: cpp
626+
627+ // Get the thread's position in the warp
628+ unsigned int laneId = threadIdx.x % warpSize;
629+
630+ // Use lane ID for bit-shift
631+ val & ((1 << (threadIdx.x % warpSize) )-1 );
632+
633+ // Shift 32 bit integer with val variable
634+ WarpReduce::sum( (val < warpSize) ? (1 << val) : 0);
635+
636+ Lane masks are 32-bit integer types as this is the integer precision that C
637+ assigns to such constants by default. GCN/CDNA architectures have a warp size of
638+ 64, :code: `threadIdx.x % warpSize ` and :code: `val ` in the example may obtain
639+ values greater than 31. Consequently, shifting by such values would clear the
640+ 32-bit register to which the shift operation is applied. For AMD
641+ architectures, a straightforward fix could look as follows:
642+
643+ .. code-block :: cpp
644+
645+ // Get the thread's position in the warp
646+ unsigned int laneId = threadIdx.x % warpSize;
647+
648+ // Use lane ID for bit-shift
649+ val & ((1ull << (threadIdx.x % warpSize) )-1 );
650+
651+ // Shift 64 bit integer with val variable
652+ WarpReduce::sum( (val < warpSize) ? (1ull << val) : 0);
653+
654+ For portability reasons, it is better to introduce appropriately
655+ typed placeholders as shown below:
656+
657+ .. code-block :: cpp
658+
659+ #if defined(__GFX8__) || defined(__GFX9__)
660+ typedef uint64_t lane_mask_t;
661+ #else
662+ typedef uint32_t lane_mask_t;
663+ #endif
664+
665+ The use of :code: `lane_mask_t ` with the previous example:
666+
667+ .. code-block :: cpp
668+
669+ // Get the thread's position in the warp
670+ unsigned int laneId = threadIdx.x % warpSize;
671+
672+ // Use lane ID for bit-shift
673+ val & ((lane_mask_t{1} << (threadIdx.x % warpSize) )-1 );
674+
675+ // Shift 32 or 64 bit integer with val variable
676+ WarpReduce::sum( (val < warpSize) ? (lane_mask_t{1} << val) : 0);
677+
614678 Porting from CUDA __launch_bounds__
615679================================================================================
616680
0 commit comments