@@ -79,6 +79,45 @@ def NVVM_Dialect : Dialect {
7979 sequence must be expressed directly, NVVM provides an `nvvm.inline_ptx` op to
8080 embed PTX inline as a last-resort escape hatch, with explicit operands and
8181 results.
82+
83+
84+ **Memory Spaces:** The NVVM dialect introduces the following memory spaces,
85+ each with distinct scopes and lifetimes:
86+
87+ | Memory Space | Scope | Lifetime |
88+ |-------------------|----------------------|-------------------|
89+ | `generic` | All threads | Context-dependent |
90+ | `global` | All threads (device) | Application |
91+ | `shared` | Thread block (CTA) | Kernel execution |
92+ | `constant` | All threads (RO) | Application |
93+ | `local` | Single thread | Kernel execution |
94+ | `tensor` | Thread block (CTA) | Kernel execution |
95+ | `shared_cluster` | Thread block cluster | Kernel execution |
96+
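In IR, these named spaces surface as LLVM address-space numbers on pointer
types. A minimal sketch, assuming the NVPTX numbering conventionally used by
the NVVM dialect (generic = 0, global = 1, shared = 3, constant = 4,
local = 5, tensor = 6, shared_cluster = 7 — the last two in particular
should be checked against the `NVVMMemorySpace` definitions):

```mlir
// Hypothetical function mixing several NVVM memory spaces.
llvm.func @spaces(%g : !llvm.ptr<1>,   // global
                  %s : !llvm.ptr<3>,   // shared (CTA)
                  %c : !llvm.ptr<4>) { // constant
  // A shared pointer can be cast into the generic space when the
  // consumer does not know the concrete memory space at compile time.
  %gen = llvm.addrspacecast %s : !llvm.ptr<3> to !llvm.ptr
  llvm.return
}
```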
97+ **Memory Space Details:**
98+ - **generic**: Can point to any memory space; requires runtime resolution of
99+ actual address space. Use when pointer origin is unknown at compile time.
100+ Performance varies based on the underlying memory space.
101+ - **global**: Accessible by all threads across all blocks; persists across
102+ kernel launches. Highest latency (~400-800 cycles) but largest capacity
103+ (device memory). Best for large data and inter-kernel communication.
104+ - **shared**: Shared within a thread block (CTA); very fast on-chip memory
105+ (~20-40 cycles) for cooperation between threads in the same block. Limited
106+ capacity (48-228KB depending on architecture). Ideal for block-level
107+ collaboration, caching, and reducing global memory traffic.
108+ - **constant**: Read-only memory cached per SM; optimized for broadcast
109+ patterns where all threads access the same location. Fast access when cached
110+ (~20 cycles). Size typically limited to 64KB. Best for lookup tables and
111+ uniform values read by all threads.
112+ - **local**: Private to each thread; used for stack frames, register spills,
113+ and automatic variables that don't fit in registers. Physically resides in
114+ global memory but is cached in L1 (~100-200 cycles).
115+ - **tensor**: Tensor memory used by 5th-generation Tensor Core (`tcgen05`)
116+ operations on Blackwell (SM 100+) architectures; accessed through dedicated
117+ tcgen05 instructions. Provides very fast access for matrix operations.
118+ - **shared_cluster**: Shared across thread blocks within a cluster (SM 90+);
119+ enables collaboration beyond single-block scope with distributed shared
120+ memory. Fast access (~40-80 cycles) across cluster threads.
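The tradeoffs above can be sketched as a kernel fragment that stages a value
from high-latency global memory into low-latency shared memory for
block-wide reuse (a hedged sketch; the function and value names are
hypothetical):

```mlir
// %src points into global memory (address space 1); %buf into
// CTA-shared memory (address space 3).
llvm.func @stage(%src : !llvm.ptr<1>, %buf : !llvm.ptr<3>) {
  // Read once from global memory...
  %v = llvm.load %src : !llvm.ptr<1> -> f32
  // ...store into shared memory for reuse by the whole block...
  llvm.store %v, %buf : f32, !llvm.ptr<3>
  // ...and synchronize the block before other threads read it.
  nvvm.barrier0
  llvm.return
}
```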
82121 }];
83122
84123 let name = "nvvm";