1+ // MIT License
2+ //
3+ // Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
4+ //
5+ // Permission is hereby granted, free of charge, to any person obtaining a copy
6+ // of this software and associated documentation files (the "Software"), to deal
7+ // in the Software without restriction, including without limitation the rights
8+ // to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9+ // copies of the Software, and to permit persons to whom the Software is
10+ // furnished to do so, subject to the following conditions:
11+ //
12+ // The above copyright notice and this permission notice shall be included in all
13+ // copies or substantial portions of the Software.
14+ //
15+ // THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16+ // IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17+ // FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18+ // AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19+ // LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20+ // OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21+ // SOFTWARE.
22+
#include <hip/hip_runtime.h>

#include <cstdint>
#include <cstdlib>
#include <iostream>
#include <random>
#include <type_traits>
#include <vector>
28+
// Checks the result of a HIP runtime call and, on failure, reports the
// error code, its human-readable description, and the call site.
// Notes:
//  - No space is allowed between the macro name and '(' — otherwise this
//    becomes an object-like macro and every use fails to compile.
//  - Wrapped in do { } while (0) so it acts as a single statement and is
//    safe inside unbraced if/else.
//  - Exits with failure so later HIP calls don't run on a broken context.
#define HIP_CHECK(expression)                          \
    do {                                               \
        const hipError_t status = (expression);        \
        if (status != hipSuccess) {                    \
            std::cerr << "HIP error "                  \
                      << status << ": "                \
                      << hipGetErrorString(status)     \
                      << " at " << __FILE__ << ":"     \
                      << __LINE__ << std::endl;        \
            std::exit(EXIT_FAILURE);                   \
        }                                              \
    } while (0)
40+
41+ // [Sphinx template warp size block reduction kernel start]
/// Integer type holding one lane-mask bit per lane of a warp:
/// 32-bit for wave32 devices, 64-bit for wave64 devices.
template<uint32_t WarpSize>
using lane_mask_t = std::conditional_t<WarpSize == 32, uint32_t, uint64_t>;
44+
// Block-wide sum reduction of masked inputs.
// Each thread loads one int — zeroed when its global index is out of bounds
// or when its lane bit is cleared in the warp's mask word — the block then
// reduces in shared memory down to WarpSize partial sums, and the final
// warp-level reduction is done with shuffles. Thread 0 writes one sum per
// block to output[blockIdx.x].
//
// WarpSize must equal the device warp size (32 or 64) so that lane_mask_t
// is the right width and the shuffle loop below unrolls fully.
// Assumes blockDim.x is a power of two and a multiple of WarpSize
// (the halving loops only cover every element under that assumption).
//
// input:  element buffer (size elements are valid)
// mask:   one lane-mask word per warp per block, laid out block-major
// output: one int per block
// size:   number of valid elements in input
template <uint32_t WarpSize>
__global__ void block_reduce (int * input, lane_mask_t <WarpSize>* mask, int * output, size_t size) {
// Dynamic shared memory: the launch supplies blockDim.x * sizeof(int).
extern __shared__ int shared[];

// Bounds-checked, mask-gated read: contributes input[i] only when i is in
// range AND this lane's bit is set in the warp's mask word, else 0.
auto read_global_safe = [&](const uint32_t i, const uint32_t lane_id, const uint32_t mask_id)
{
lane_mask_t <WarpSize> warp_mask = lane_mask_t <WarpSize>(1 ) << lane_id;
return (i < size) && (mask[mask_id] & warp_mask) ? input[i] : 0 ;
};

// tid: thread index within the block; lid: lane within the warp;
// wid: warp index within the block; bid: block index; gid: global index.
const uint32_t tid = threadIdx.x ,
lid = threadIdx.x % WarpSize,
wid = threadIdx.x / WarpSize,
bid = blockIdx.x ,
gid = bid * blockDim.x + tid;

// Stage one (masked) element per thread into shared memory.
// Mask words are indexed block-major: block * warps_per_block + warp.
shared[tid] = read_global_safe (gid, lid, bid * (blockDim.x / WarpSize) + wid);
__syncthreads ();

// Tree reduction in shared memory: halve the active range each step,
// stopping once WarpSize partial sums remain in shared[0..WarpSize-1].
// The barrier each iteration orders the reads against the prior writes.
for (uint32_t i = blockDim.x / 2 ; i >= WarpSize; i /= 2 )
{
if (tid < i)
shared[tid] = shared[tid] + shared[tid + i];
__syncthreads ();
}

// Move the partial into a register before the shuffle stage; the barrier
// ensures all shared-memory writes above are visible first.
int result = shared[tid];
__syncthreads ();

// Warp-level reduction via shuffles. Because WarpSize is a compile-time
// constant this loop fully unrolls, which a runtime warpSize would prevent.
#pragma unroll
for (uint32_t i = WarpSize/2 ; i >= 1 ; i /= 2 ) {
result = result + __shfl_down (result, i);
}

// Only lane 0 of warp 0 (tid == 0) holds the complete block sum;
// other warps' shuffle chains produce unused values.
if (tid == 0 )
output[bid] = result;
};
88+ // [Sphinx template warp size block reduction kernel end]
89+
90+ // [Sphinx template warp size mask generation start]
91+ template <uint32_t WarpSize>
92+ void generate_and_copy_mask (
93+ void *d_mask,
94+ std::vector<int >& vectorExpected,
95+ int numOfBlocks,
96+ int numberOfWarp,
97+ int mask_size,
98+ int mask_element_size) {
99+
100+ std::random_device rd;
101+ std::mt19937_64 eng (rd ());
102+
103+ // Host side mask vector
104+ std::vector<lane_mask_t <WarpSize>> mask (mask_size);
105+ // Define uniform unsigned int distribution
106+ std::uniform_int_distribution<lane_mask_t <WarpSize>> distr;
107+ // Fill up the mask
108+ for (int i=0 ; i < numOfBlocks; i++) {
109+ int count = 0 ;
110+ for (int j=0 ; j < numberOfWarp; j++) {
111+ int mask_index = i * numberOfWarp + j;
112+ mask[mask_index] = distr (eng);
113+ if constexpr (WarpSize == 32 )
114+ count += __builtin_popcount (mask[mask_index]);
115+ else
116+ count += __builtin_popcountll (mask[mask_index]);
117+ }
118+ vectorExpected[i]= count;
119+ }
120+
121+ // Copy the mask array
122+ HIP_CHECK (hipMemcpy (d_mask, mask.data (), mask_size * mask_element_size, hipMemcpyHostToDevice));
123+ }
124+ // [Sphinx template warp size mask generation end]
125+
126+ int main () {
127+
128+ int deviceId = 0 ;
129+ int warpSizeHost;
130+ HIP_CHECK (hipDeviceGetAttribute (&warpSizeHost, hipDeviceAttributeWarpSize, deviceId));
131+ std::cout << " Warp size: " << warpSizeHost << std::endl;
132+
133+ constexpr int numOfBlocks = 16 ;
134+ constexpr int threadsPerBlock = 1024 ;
135+ const int numberOfWarp = threadsPerBlock / warpSizeHost;
136+ const int mask_element_size = warpSizeHost == 32 ? sizeof (uint32_t ) : sizeof (uint64_t );
137+ const int mask_size = numOfBlocks * numberOfWarp;
138+ constexpr size_t arraySize = numOfBlocks * threadsPerBlock;
139+
140+ int *d_data, *d_results;
141+ void *d_mask;
142+ int initValue = 1 ;
143+ std::vector<int > vectorInput (arraySize, initValue);
144+ std::vector<int > vectorOutput (numOfBlocks);
145+ std::vector<int > vectorExpected (numOfBlocks);
146+ // Allocate device memory
147+ HIP_CHECK (hipMalloc (&d_data, arraySize * sizeof (*d_data)));
148+ HIP_CHECK (hipMalloc (&d_mask, mask_size * mask_element_size));
149+ HIP_CHECK (hipMalloc (&d_results, numOfBlocks * sizeof (*d_results)));
150+ // Host to Device copy of the input array
151+ HIP_CHECK (hipMemcpy (d_data, vectorInput.data (), arraySize * sizeof (*d_data), hipMemcpyHostToDevice));
152+
153+ // [Sphinx template warp size select kernel start]
154+ // Fill up the mask variable, copy to device and select the right kernel.
155+ if (warpSizeHost == 32 ) {
156+ // Generate and copy mask arrays
157+ generate_and_copy_mask<32 >(d_mask, vectorExpected, numOfBlocks, numberOfWarp, mask_size, mask_element_size);
158+
159+ // Start the kernel
160+ block_reduce<32 ><<<dim3 (numOfBlocks), dim3 (threadsPerBlock), threadsPerBlock * sizeof (*d_data)>>>(
161+ d_data,
162+ static_cast <uint32_t *>(d_mask),
163+ d_results,
164+ arraySize);
165+ } else if (warpSizeHost == 64 ) {
166+ // Generate and copy mask arrays
167+ generate_and_copy_mask<64 >(d_mask, vectorExpected, numOfBlocks, numberOfWarp, mask_size, mask_element_size);
168+
169+ // Start the kernel
170+ block_reduce<64 ><<<dim3 (numOfBlocks), dim3 (threadsPerBlock), threadsPerBlock * sizeof (*d_data)>>>(
171+ d_data,
172+ static_cast <uint64_t *>(d_mask),
173+ d_results,
174+ arraySize);
175+ } else {
176+ std::cerr << " Unsupported warp size." << std::endl;
177+ return 0 ;
178+ }
179+ // [Sphinx template warp size select kernel end]
180+
181+ // Check the kernel launch
182+ HIP_CHECK (hipGetLastError ());
183+ // Check for kernel execution error
184+ HIP_CHECK (hipDeviceSynchronize ());
185+ // Device to Host copy of the result
186+ HIP_CHECK (hipMemcpy (vectorOutput.data (), d_results, numOfBlocks * sizeof (*d_results), hipMemcpyDeviceToHost));
187+
188+ // Verify results
189+ bool passed = true ;
190+ for (size_t i = 0 ; i < numOfBlocks; ++i) {
191+ if (vectorOutput[i] != vectorExpected[i]) {
192+ passed = false ;
193+ std::cerr << " Validation failed! Expected " << vectorExpected[i] << " got " << vectorOutput[i] << " at index: " << i << std::endl;
194+ }
195+ }
196+ if (passed){
197+ std::cout << " Execution completed successfully." << std::endl;
198+ }else {
199+ std::cerr << " Execution failed." << std::endl;
200+ }
201+
202+ // Cleanup
203+ HIP_CHECK (hipFree (d_data));
204+ HIP_CHECK (hipFree (d_mask));
205+ HIP_CHECK (hipFree (d_results));
206+ return 0 ;
207+ }