@@ -157,25 +157,28 @@ space pointer arguments, which are set by the user with
 ``urKernelSetArgLocal`` with the number of bytes of local memory to allocate
 and make available from the pointer argument.
 
-The CUDA adapter implements local memory arguments to a kernel as a single
-``__shared__`` memory allocation, with each local address space pointer argument
-to the kernel converted to a byte offset parameter to the single memory
-allocation. Therefore for ``N`` local arguments that need set on a kernel with
-``urKernelSetArgLocal``, the total aligned size is calculated for the single
+The CUDA adapter implements local memory in a kernel as a single ``__shared__``
+memory allocation, and each individual local memory argument is a ``u32`` byte
+offset kernel parameter which is combined inside the kernel with the
+``__shared__`` memory allocation. Therefore for ``N`` local arguments that need
+set on a kernel with ``urKernelSetArgLocal``, the total aligned size across the
+``N`` calls to ``urKernelSetArgLocal`` is calculated for the ``__shared__``
 memory allocation by the CUDA adapter and passed as the ``sharedMemBytes``
 argument to ``cuLaunchKernel`` (or variants like ``cuLaunchCooperativeKernel``
-or ``cudaGraphAddKernelNode``).
+or ``cuGraphAddKernelNode``).
 
-For each kernel local memory parameter, aligned offsets into the single memory location
-are calculated and passed at runtime via ``kernelParams`` when launching the kernel (or
-adding as a graph node). When a user calls ``urKernelSetArgLocal`` with an
-argument index that has already been set the CUDA adapter recalculates the size of the
-single memory allocation and offsets of any local memory arguments at following indices.
+For each kernel ``u32`` local memory offset parameter, aligned offsets into the
+single memory location are calculated and passed at runtime by the adapter via
+``kernelParams`` when launching the kernel (or adding the kernel as a graph
+node). When a user calls ``urKernelSetArgLocal`` with an argument index that
+has already been set on the kernel, the adapter recalculates the size of the
+``__shared__`` memory allocation and offset for the index, as well as the
+offsets of any local memory arguments at following indices.
 
 .. warning::
 
   The CUDA UR adapter implementation of local memory assumes the kernel created
-  has been created by DPC++, instumenting the device code so that local memory
+  has been created by DPC++, instrumenting the device code so that local memory
   arguments are offsets rather than pointers.
 
 Other Notes
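The offset bookkeeping described in the added text can be sketched roughly as follows. This is a minimal illustration, not the adapter's actual code: ``LocalArgLayout``, ``alignUp``, and the fixed ``Align`` value are hypothetical names and assumptions, since the documentation only states that offsets and the total size are "aligned" without giving the rule. The sketch recomputes every offset on each call for simplicity; offsets at indices before the changed one come out unchanged, which matches the described behaviour of updating the index and those following it.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical helper: round Value up to a multiple of Align (a power of two).
inline std::uint32_t alignUp(std::uint32_t Value, std::uint32_t Align) {
  return (Value + Align - 1) & ~(Align - 1);
}

// Hypothetical model of the per-kernel local memory state.
struct LocalArgLayout {
  std::map<std::uint32_t, std::uint32_t> Sizes;   // arg index -> requested bytes
  std::map<std::uint32_t, std::uint32_t> Offsets; // arg index -> u32 byte offset
  std::uint32_t TotalBytes = 0; // would be passed as sharedMemBytes
  std::uint32_t Align;          // assumed fixed alignment for every argument

  explicit LocalArgLayout(std::uint32_t A = 8) : Align(A) {
    assert(Align != 0 && (Align & (Align - 1)) == 0 && "power of two expected");
  }

  // Models a call to urKernelSetArgLocal(Index, Size): records the size, then
  // walks the arguments in index order assigning aligned offsets.
  void setArgLocal(std::uint32_t Index, std::uint32_t Size) {
    Sizes[Index] = Size;
    std::uint32_t Cursor = 0;
    for (const auto &Entry : Sizes) {
      Cursor = alignUp(Cursor, Align);
      Offsets[Entry.first] = Cursor; // value that would go into kernelParams
      Cursor += Entry.second;
    }
    TotalBytes = Cursor;
  }
};
```

Under these assumptions, setting an already-set index to a larger size shifts the offsets of every later local argument and grows the total, which is why the adapter has to recalculate them before the next launch.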