@@ -58,23 +58,45 @@ scopes and lifetimes:
5858
5959- ** generic** : Can point to any memory space; requires runtime resolution of
6060 actual address space. Use when pointer origin is unknown at compile time.
61- Performance varies based on the underlying memory space.
61+ Performance varies based on the underlying memory space. A pointer to this
62+ memory space is represented by ` LLVM_PointerGeneric ` in the NVVM Ops.
6263- ** global** : Accessible by all threads across all blocks; persists across
6364 kernel launches. Highest latency but largest capacity (device memory). Best
64- for large data and inter-kernel communication.
65+ for large data and inter-kernel communication. A pointer to this memory space
66+ is represented by ` LLVM_PointerGlobal ` in the NVVM Ops.
6567- ** shared** : Shared within a thread block (CTA); very fast on-chip memory for
6668 cooperation between threads in the same block. Limited capacity. Ideal for
6769 block-level collaboration, caching, and reducing global memory traffic.
70+ This memory is usually referred as ` shared_cta ` in the NVVMOps and as
71+ ` shared::cta ` in the PTX ISA. A pointer to this memory space is represented
72+ by the ` LLVM_PointerShared ` type in the NVVM Ops.
6873- ** constant** : Read-only memory cached per SM. Size typically limited to 64KB.
69- Best for read-only data and uniform values accessed by all threads.
74+ Best for read-only data and uniform values accessed by all threads. A pointer
75+ to this memory space is represented by ` LLVM_PointerConst ` type in NVVM Ops.
7076- ** local** : Private to each thread. Use for per-thread private data and
71- automatic variables that don't fit in registers.
77+ automatic variables that don't fit in registers. A pointer to this memory is
78+ represented by ` LLVM_PointerLocal ` type in NVVM Ops.
7279- ** tensor** : Special memory space for tensor core operations. Used by
7380 ` tcgen05 ` instructions on SM 100+ for tensor input/output operations.
81+ A pointer to this memory space is represented by the ` LLVM_PointerTensor `
82+ type in the NVVM Ops.
7483- ** shared_cluster** : Distributed shared memory across thread blocks within a
7584 cluster (SM 90+). Enables collaboration beyond single-block scope with fast
76- access across cluster threads.
77-
85+ access across cluster threads. This memory is usually referred as
86+ ` shared_cluster ` in the NVVMOps and as ` shared::cluster ` in the PTX ISA.
87+ A pointer to this memory space is represented by the ` LLVM_PointerSharedCluster `
88+ type in the NVVM Ops.
89+
90+ ## MBarrier objects
91+
92+ An `` mbarrier `` is a barrier created in shared memory that supports
93+ synchronizing any subset of threads within a CTA. An * mbarrier object*
94+ is an opaque object in shared memory with ` .b64 ` type and an alignment of
95+ 8-bytes. Unlike `` nvvm.barrier `` Op which can access only a limited number
96+ of barriers per CTA, the * mbarrier objects* are user-defined and are only
97+ limited by the total shared memory size available. The list of operations
98+ supported on an * mbarrier object* is exposed through the `` nvvm.mbarrier.* ``
99+ family of NVVM Ops.
78100
79101## Non-Goals
80102
0 commit comments