Document HIP extra arg behavior

Ewan Crawford · Ewan Crawford · commit e578228aa259 · 2024-12-02T13:22:36.000Z
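
The bookkeeping described in the documentation below can be illustrated with a small model. This is only a sketch under stated assumptions, not the adapter's code: ``LocalArgModel``, its fixed ``Alignment`` of 8 bytes, and ``hipLocalAccessorOffsetIndices`` are hypothetical names invented for illustration, and the model lays arguments out in argument-index order and recomputes every offset on each call, whereas the documentation only says the adapter recalculates the re-set index and the indices that follow it.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>

// Hypothetical stand-in for the adapter's local argument bookkeeping.
struct LocalArgModel {
  // Assumed fixed alignment; the real adapter's alignment rule may differ.
  static constexpr std::size_t Alignment = 8;

  struct Entry {
    std::size_t Size = 0;     // bytes requested via urKernelSetArgLocal
    std::uint32_t Offset = 0; // u32 byte offset passed as the kernel parameter
  };

  std::map<std::uint32_t, Entry> Args; // keyed and ordered by argument index
  std::size_t SharedMemBytes = 0;      // value passed as sharedMemBytes

  static std::size_t alignUp(std::size_t Value) {
    return (Value + Alignment - 1) / Alignment * Alignment;
  }

  // Set (or re-set) a local argument, then recompute all offsets and the
  // total size. Recomputing everything is a simplification: offsets of
  // arguments before Index come out unchanged.
  void setLocal(std::uint32_t Index, std::size_t Size) {
    Args[Index].Size = Size;
    std::size_t Cursor = 0;
    for (auto &[ArgIndex, Arg] : Args) {
      (void)ArgIndex;
      Cursor = alignUp(Cursor);
      assert(Cursor <= UINT32_MAX); // offsets must fit in a u32 parameter
      Arg.Offset = static_cast<std::uint32_t>(Cursor);
      Cursor += Arg.Size;
    }
    SharedMemBytes = Cursor;
  }
};

// On HIP, each local argument is followed by 3 value arguments at the next
// 3 consecutive indices (a 3-dimensional offset into the local accessor).
// This helper only computes those indices; it does not call into UR.
inline std::array<std::uint32_t, 3>
hipLocalAccessorOffsetIndices(std::uint32_t LocalArgIndex) {
  return {LocalArgIndex + 1, LocalArgIndex + 2, LocalArgIndex + 3};
}
```

Under the assumed 8-byte alignment, setting index 0 to 10 bytes and index 1 to 4 bytes yields offsets 0 and 16 with a total of 20 bytes. Re-setting index 0 to 20 bytes moves index 1 to offset 24 and grows the total to 28 bytes. For a HIP local argument at index 2, the three extra ``urKernelSetArgValue`` calls would target indices 3, 4 and 5.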
diff --git a/scripts/core/CUDA.rst b/scripts/core/CUDA.rst
@@ -157,25 +157,28 @@ space pointer arguments, which are set by the user with
 ``urKernelSetArgLocal`` with the number of bytes of local memory to allocate
 and make available from the pointer argument.
 
-The CUDA adapter implements local memory arguments to a kernel as a single
-``__shared__`` memory allocation, with each local address space pointer argument
-to the kernel converted to a byte offset parameter to the single memory
-allocation. Therefore for ``N`` local arguments that need set on a kernel with
-``urKernelSetArgLocal``, the total aligned size is calculated for the single
+The CUDA adapter implements local memory in a kernel as a single ``__shared__``
+memory allocation, and each individual local memory argument is a ``u32`` byte
+offset kernel parameter which is combined inside the kernel with the
+``__shared__`` memory allocation. Therefore, for ``N`` local arguments that need
+to be set on a kernel with ``urKernelSetArgLocal``, the total aligned size across
+the ``N`` calls to ``urKernelSetArgLocal`` is calculated for the ``__shared__``
 memory allocation by the CUDA adapter and passed as the ``sharedMemBytes``
 argument to ``cuLaunchKernel`` (or variants like ``cuLaunchCooperativeKernel``
-or ``cudaGraphAddKernelNode``).
+or ``cuGraphAddKernelNode``).
 
-For each kernel local memory parameter, aligned offsets into the single memory location
-are calculated and passed at runtime via ``kernelParams`` when launching the kernel (or
-adding as a graph node). When a user calls ``urKernelSetArgLocal`` with an
-argument index that has already been set the CUDA adapter recalculates the size of the
-single memory allocation and offsets of any local memory arguments at following indices.
+For each kernel ``u32`` local memory offset parameter, an aligned offset into the
+``__shared__`` memory allocation is calculated and passed at runtime by the adapter
+via ``kernelParams`` when launching the kernel (or adding the kernel as a graph
+node). When a user calls ``urKernelSetArgLocal`` with an argument index that
+has already been set on the kernel, the adapter recalculates the size of the
+``__shared__`` memory allocation and offset for the index, as well as the
+offsets of any local memory arguments at following indices.
 
 .. warning::
 
   The CUDA UR adapter implementation of local memory assumes the kernel created
-  has been created by DPC++, instumenting the device code so that local memory
+  has been created by DPC++, instrumenting the device code so that local memory
   arguments are offsets rather than pointers.
 
 Other Notes
diff --git a/scripts/core/HIP.rst b/scripts/core/HIP.rst
@@ -94,11 +94,42 @@ the user does not wish to use the global offset.
 Local Memory Arguments
 ----------------------
 
-.. todo::
-   Copy and update CUDA doc
-
-.. todo::
-   Document what extra args needed on HIP arg with local accessors
+In UR, local memory is a region of memory shared by all the work-items in
+a work-group. A kernel function signature can include local memory address
+space pointer arguments, which are set by the user with
+``urKernelSetArgLocal`` with the number of bytes of local memory to allocate
+and make available from the pointer argument.
+
+The HIP adapter implements local memory in a kernel as a single ``__shared__``
+memory allocation, and each individual local memory argument is a ``u32`` byte
+offset kernel parameter which is combined inside the kernel with the
+``__shared__`` memory allocation. Therefore, for ``N`` local arguments that need
+to be set on a kernel with ``urKernelSetArgLocal``, the total aligned size across
+the ``N`` calls to ``urKernelSetArgLocal`` is calculated for the ``__shared__``
+memory allocation by the HIP adapter and passed as the ``sharedMemBytes``
+argument to ``hipModuleLaunchKernel`` or ``hipGraphAddKernelNode``.
+
+For each kernel ``u32`` local memory offset parameter, an aligned offset into the
+``__shared__`` memory allocation is calculated and passed at runtime by the adapter
+via ``kernelParams`` when launching the kernel (or adding the kernel as a graph
+node). When a user calls ``urKernelSetArgLocal`` with an argument index that
+has already been set on the kernel, the adapter recalculates the size of the
+``__shared__`` memory allocation and offset for the index, as well as the
+offsets of any local memory arguments at following indices.
+
+.. warning::
+
+  The HIP UR adapter implementation of local memory assumes the kernel has
+  been created by DPC++, which instruments the device code so that local memory
+  arguments are offsets rather than pointers.
+
+
+HIP kernels that are generated for DPC++ kernels with SYCL local accessors
+contain extra value arguments in addition to the local memory argument for
+each local accessor. For each ``urKernelSetArgLocal`` argument, a user needs
+to make 3 calls to ``urKernelSetArgValue``, one for each of the next 3
+consecutive argument indices. Together, these 3 values represent a
+3-dimensional offset into the local accessor.
 
 Other Notes
 ===========
diff --git a/source/adapters/cuda/kernel.hpp b/source/adapters/cuda/kernel.hpp
@@ -158,8 +158,7 @@ struct ur_kernel_handle_t_ {
 
     void addLocalArg(size_t Index, size_t Size) {
       // Get the aligned argument size and offset into local data
-      size_t AlignedLocalSize, AlignedLocalOffset;
-      std::tie(AlignedLocalSize, AlignedLocalOffset) =
+      auto [AlignedLocalSize, AlignedLocalOffset] =
           calcAlignedLocalArgument(Index, Size);
 
       // Store argument details
@@ -178,8 +177,7 @@ struct ur_kernel_handle_t_ {
         }
 
         // Recalculate alignment
-        size_t SuccAlignedLocalSize, SuccAlignedLocalOffset;
-        std::tie(SuccAlignedLocalSize, SuccAlignedLocalOffset) =
+        auto [SuccAlignedLocalSize, SuccAlignedLocalOffset] =
             calcAlignedLocalArgument(SuccIndex, OriginalLocalSize);
 
         // Store new local memory size
diff --git a/source/adapters/hip/kernel.hpp b/source/adapters/hip/kernel.hpp
@@ -153,8 +153,7 @@ struct ur_kernel_handle_t_ {
 
     void addLocalArg(size_t Index, size_t Size) {
       // Get the aligned argument size and offset into local data
-      size_t AlignedLocalSize, AlignedLocalOffset;
-      std::tie(AlignedLocalSize, AlignedLocalOffset) =
+      auto [AlignedLocalSize, AlignedLocalOffset] =
           calcAlignedLocalArgument(Index, Size);
 
       // Store argument details
@@ -173,8 +172,7 @@ struct ur_kernel_handle_t_ {
         }
 
         // Recalculate alignment
-        size_t SuccAlignedLocalSize, SuccAlignedLocalOffset;
-        std::tie(SuccAlignedLocalSize, SuccAlignedLocalOffset) =
+        auto [SuccAlignedLocalSize, SuccAlignedLocalOffset] =
             calcAlignedLocalArgument(SuccIndex, OriginalLocalSize);
 
         // Store new local memory size
diff --git a/test/conformance/exp_command_buffer/update/local_memory_update.cpp b/test/conformance/exp_command_buffer/update/local_memory_update.cpp
diff --git a/test/conformance/kernel/urKernelSetArgLocal.cpp b/test/conformance/kernel/urKernelSetArgLocal.cpp