@@ -84,15 +84,15 @@ def NVVM_Dialect : Dialect {
8484 **Memory Spaces:** The NVVM dialect introduces the following memory spaces,
8585 each with distinct scopes and lifetimes:
8686
87- | Memory Space | Scope | Lifetime |
88- |-------------------|----------------------|-------------------|
89- | `generic` | All threads | Context-dependent |
90- | `global` | All threads (device) | Application |
91- | `shared` | Thread block (CTA) | Kernel execution |
92- | `constant` | All threads (RO) | Application |
93- | `local` | Single thread | Kernel execution |
94- | `tensor` | Thread block (CTA) | Kernel execution |
95- | `shared_cluster` | Thread block cluster | Kernel execution |
87+ | Memory Space | Address Space | Scope | Lifetime |
88+ |-------------------|---------------|----------------------|-------------------|
89+ | `generic` | 0 | All threads | Context-dependent |
90+ | `global` | 1 | All threads (device) | Application |
91+ | `shared` | 3 | Thread block (CTA) | Kernel execution |
92+ | `constant` | 4 | All threads (RO) | Application |
93+ | `local` | 5 | Single thread | Kernel execution |
94+ | `tensor` | 6 | Thread block (CTA) | Kernel execution |
95+ | `shared_cluster` | 7 | Thread block cluster | Kernel execution |
9696
9797 **Memory Space Details:**
9898 - **generic**: Can point to any memory space; requires runtime resolution of
@@ -104,19 +104,15 @@ def NVVM_Dialect : Dialect {
104104 - **shared**: Shared within a thread block (CTA); very fast on-chip memory for
105105 cooperation between threads in the same block. Limited capacity. Ideal for
106106 block-level collaboration, caching, and reducing global memory traffic.
107- - **constant**: Read-only memory cached per SM; optimized for broadcast
108- patterns where all threads access the same location. Fast access when cached.
109- Size typically limited to 64KB. Best for read-only data and uniform values
110- accessed by all threads.
111- - **local**: Private to each thread; used for stack frames and register spills.
112- Actually resides in global memory but cached in L1. Use for per-thread
113- private data and automatic variables that don't fit in registers.
114- - **tensor**: Special memory space for Tensor Memory Accelerator (TMA)
115- operations on SM 80+ architectures; used with async tensor operations and
116- wgmma instructions. Provides very fast access for matrix operations.
117- - **shared_cluster**: Shared across thread blocks within a cluster (SM 90+);
118- enables collaboration beyond single-block scope with distributed shared
119- memory. Fast access across cluster threads.
107+ - **constant**: Read-only memory cached per SM. Size typically limited to
108+ 64KB. Best for read-only data and uniform values accessed by all threads.
109+ - **local**: Private to each thread. Use for per-thread private data and
110+ automatic variables that don't fit in registers.
111+ - **tensor**: Special memory space for tensor core operations. Used by
112+ `tcgen05` instructions on SM 100+ for tensor input/output operations.
113+ - **shared_cluster**: Distributed shared memory across thread blocks within
114+ a cluster (SM 90+). Enables collaboration beyond single-block scope with
115+ fast access across cluster threads.
120116 }];
121117
122118 let name = "nvvm";