-AMD GPUs consist of an array of workgroup processors, each built with 2 compute
-units (CUs) capable of executing SIMD32. All the CUs inside a workgroup
-processor use local data share (LDS).
+All :doc:`supported AMD GPUs <rocm-install-on-linux:reference/system-requirements>` are built around a data-parallel
+processor (DPP) array.

-gfx10+ support execution of wavefront in CU mode and work-group processor mode
-(WGP). Please refer to section 2.3 of `RDNA3 ISA reference <https://www.amd.com/content/dam/amd/en/documents/radeon-tech-docs/instruction-set-architectures/rdna3-shader-instruction-set-architecture-feb-2023_0.pdf>`_.
+On CDNA GPUs, the DPP is organized as a set of compute unit (CU) pipelines, with each CU containing a single SIMD64
+unit. Each CU has its own low-latency memory space called local data share (LDS), which threads from a warp running on
+the CU can access.

-gfx9 and below only supports CU mode.
+On RDNA GPUs, the DPP is organized as a set of workgroup processor (WGP) pipelines. Each WGP contains two CUs, and each
+CU contains two SIMD32 units. The LDS is attached to the WGP, so threads from different warps can access the same LDS if
+they run on CUs within the same WGP.

-In WGP mode, 4 warps of a block can simultaneously be executed on the workgroup
-processor, where as in CU mode only 2 warps of a block can simultaneously
-execute on a CU. In theory, WGP mode might help with occupancy and increase the
-performance of certain HIP programs (if not bound to inter warp communication),
-but might incur performance penalty on other HIP programs which rely on atomics
-and inter warp communication. This also has effect of how the LDS is split
-between warps, please refer to `RDNA3 ISA reference <https://www.amd.com/content/dam/amd/en/documents/radeon-tech-docs/instruction-set-architectures/rdna3-shader-instruction-set-architecture-feb-2023_0.pdf>`_ for more information.
+.. note::
+
+  Because CDNA GPUs do not use workgroup processors and have a different CU layout, the following information applies
+  only to RDNA GPUs.
+
+Warps are dispatched in one of two modes. These control whether warps are distributed across two SIMD32s (**CU mode**)
+or across all four SIMD32s within a WGP (**WGP mode**).
+
+CU mode executes two warps per block on a single CU and provides only half the LDS to those warps. Independence between
+CUs can improve performance for workloads avoiding inter-warp communication, but LDS capacity per CU is limited.
+
+WGP mode executes four warps per block on a WGP with a shared LDS. It can increase occupancy and improve performance
+for workloads without heavy inter-warp communication, but it can degrade performance for programs relying on atomics or
+extensive inter-warp communication.
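The RDNA layout and the two dispatch modes described in the added lines can be summarized in a small C++ model. This is only an illustrative sketch: the names `DispatchMode`, `simds_for_block`, and `lds_fraction` are hypothetical and are not part of HIP or ROCm. The constants are taken directly from the text above (two CUs per WGP, two SIMD32 units per CU, half of the LDS available in CU mode).

```cpp
// Illustrative model of the RDNA hierarchy described above.
// All names here are hypothetical; nothing in this block is a HIP API.
enum class DispatchMode { CU, WGP };

constexpr int kCusPerWgp  = 2;  // "Each WGP contains two CUs"
constexpr int kSimdsPerCu = 2;  // "each CU contains two SIMD32 units"

// Number of SIMD32 units that the warps of one block are distributed across.
constexpr int simds_for_block(DispatchMode mode) {
    return mode == DispatchMode::CU ? kSimdsPerCu
                                    : kCusPerWgp * kSimdsPerCu;
}

// Fraction of the WGP's LDS that is available to those warps, per the text:
// CU mode provides only half of it, WGP mode shares all of it.
constexpr double lds_fraction(DispatchMode mode) {
    return mode == DispatchMode::CU ? 0.5 : 1.0;
}

// Compile-time checks that the model matches the prose.
static_assert(simds_for_block(DispatchMode::CU) == 2, "CU mode: two SIMD32s");
static_assert(simds_for_block(DispatchMode::WGP) == 4, "WGP mode: four SIMD32s");
```

The model deliberately captures only the resource split. Whether a given kernel runs faster in one mode or the other depends on how much it relies on atomics and inter-warp communication, as the paragraphs above note.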
+
+For more information on the differences between CU and WGP modes, please refer to the appropriate ISA reference under