Commit 7c3cc1e

Revise the section on CU & WGP modes
Signed-off-by: Jan Stephan <[email protected]>
1 parent abdf2b0 commit 7c3cc1e

2 files changed (+30 -17 lines changed)
.wordlist.txt

Lines changed: 2 additions & 0 deletions
@@ -39,6 +39,8 @@ Dereferencing
 DFT
 dll
 DirectX
+DPP
+dst
 EIGEN
 enqueue
 enqueues

docs/how-to/hip_rtc.rst

Lines changed: 28 additions & 17 deletions
@@ -319,31 +319,42 @@ using the bitcode APIs provided by HIPRTC.
 vector<char> kernel_bitcode(bitCodeSize);
 hiprtcGetBitcode(prog, kernel_bitcode.data());
 
-CU Mode vs WGP mode
+CU mode vs WGP mode
 -------------------------------------------------------------------------------
 
-AMD GPUs consist of an array of workgroup processors, each built with 2 compute
-units (CUs) capable of executing SIMD32. All the CUs inside a workgroup
-processor use local data share (LDS).
+All :doc:`supported AMD GPUs <rocm-install-on-linux:reference/system-requirements>` are built around a data-parallel
+processor (DPP) array.
 
-gfx10+ support execution of wavefront in CU mode and work-group processor mode
-(WGP). Please refer to section 2.3 of `RDNA3 ISA reference <https://www.amd.com/content/dam/amd/en/documents/radeon-tech-docs/instruction-set-architectures/rdna3-shader-instruction-set-architecture-feb-2023_0.pdf>`_.
+On CDNA GPUs, the DPP is organized as a set of compute unit (CU) pipelines, with each CU containing a single SIMD64
+unit. Each CU has its own low-latency memory space called local data share (LDS), which threads from a warp running on
+the CU can access.
 
-gfx9 and below only supports CU mode.
+On RDNA GPUs, the DPP is organized as a set of workgroup processor (WGP) pipelines. Each WGP contains two CUs, and each
+CU contains two SIMD32 units. The LDS is attached to the WGP, so threads from different warps can access the same LDS if
+they run on CUs within the same WGP.
 
-In WGP mode, 4 warps of a block can simultaneously be executed on the workgroup
-processor, where as in CU mode only 2 warps of a block can simultaneously
-execute on a CU. In theory, WGP mode might help with occupancy and increase the
-performance of certain HIP programs (if not bound to inter warp communication),
-but might incur performance penalty on other HIP programs which rely on atomics
-and inter warp communication. This also has effect of how the LDS is split
-between warps, please refer to `RDNA3 ISA reference <https://www.amd.com/content/dam/amd/en/documents/radeon-tech-docs/instruction-set-architectures/rdna3-shader-instruction-set-architecture-feb-2023_0.pdf>`_ for more information.
+.. note::
+
+Because CDNA GPUs do not use workgroup processors and have a different CU layout, the following information applies
+only to RDNA GPUs.
+
+Warps are dispatched in one of two modes. These control whether warps are distributed across two SIMD32s (**CU mode**)
+or across all four SIMD32s within a WGP (**WGP mode**).
+
+CU mode executes two warps per block on a single CU and provides only half the LDS to those warps. Independence between
+CUs can improve performance for workloads avoiding inter-warp communication, but LDS capacity per CU is limited.
+
+WGP mode executes four warps per block on a WGP with a shared LDS. It can increase occupancy and improve performance
+for workloads without heavy inter-warp communication, but it can degrade performance for programs relying on atomics or
+extensive inter-warp communication.
+
+For more information on the differences between CU and WGP modes, please refer to the appropriate ISA reference under
+`AMD RDNA architecture <https://gpuopen.com/amd-gpu-architecture-programming-documentation/>`__.
 
 .. note::
 
-HIPRTC assumes **WGP mode by default** for gfx10+. This can be overridden by
-passing ``-mcumode`` to HIPRTC compile options in
-:cpp:func:`hiprtcCompileProgram`.
+HIPRTC assumes **WGP mode by default** for RDNA GPUs. This can be overridden by passing ``-mcumode`` as a compile
+option in :cpp:func:`hiprtcCompileProgram`.
 
 Linker APIs
 ===============================================================================
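
The revised note says that ``-mcumode`` can be passed as a compile option to ``hiprtcCompileProgram``. A minimal sketch of how a caller might assemble that option list follows. The helper names ``make_hiprtc_options`` and ``as_c_options`` are hypothetical and not part of HIPRTC, and the architecture string ``gfx1100`` in the usage comment is only an assumed example target. The HIPRTC call itself is shown in a comment so that the block compiles without the HIP headers.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not part of HIPRTC): builds the option strings for a
// compile. `-mcumode` is the flag named in the note; the architecture is
// whatever the caller targets.
std::vector<std::string> make_hiprtc_options(const std::string& arch, bool cu_mode) {
    std::vector<std::string> opts;
    opts.push_back("--gpu-architecture=" + arch);
    if (cu_mode) {
        // Override the WGP-mode default that HIPRTC assumes on RDNA GPUs.
        opts.push_back("-mcumode");
    }
    return opts;
}

// Hypothetical helper: converts the strings to the `const char*` array form
// that a C API expects. The pointers stay valid only while `opts` is alive
// and unmodified.
std::vector<const char*> as_c_options(const std::vector<std::string>& opts) {
    std::vector<const char*> out;
    out.reserve(opts.size());
    for (const auto& o : opts) {
        out.push_back(o.c_str());
    }
    return out;
}

// Usage with an existing hiprtcProgram `prog` (requires <hip/hiprtc.h>):
//   auto opts  = make_hiprtc_options("gfx1100", /*cu_mode=*/true);
//   auto copts = as_c_options(opts);
//   hiprtcCompileProgram(prog, static_cast<int>(copts.size()), copts.data());
```

Leaving ``cu_mode`` false keeps the default WGP mode described in the note, so the flag only needs to appear when a kernel is expected to benefit from CU mode.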
