@@ -37,84 +37,6 @@ def LLVM_PointerSharedCluster : LLVM_PointerInAddressSpace<7>;
 //===----------------------------------------------------------------------===//
 
 def NVVM_Dialect : Dialect {
-  let summary = "The NVVM dialect that models NVIDIA's public ISA";
-
-  let description = [{
-    The NVVM dialect is MLIR's LLVM-IR-based, NVIDIA-specific backend dialect.
-    It models NVVM intrinsics and public ISA functionality and introduces
-    NVIDIA extensions to the MLIR/LLVM type system and address spaces (e.g.,
-    global, shared, and cluster memory), enabling faithful lowering of GPU
-    kernels to the NVPTX toolchain. While an NVVM op usually maps to a single
-    LLVM IR intrinsic, the dialect uses type polymorphism and other attributes
-    so that a single NVVM op can map to different LLVM intrinsics.
-
-    **Scope and capabilities:** The dialect covers core GPU features such as
-    thread/block builtins, barriers and atomics, warp-level collectives (e.g.,
-    shuffle/vote), matrix/tensor core operations (e.g., `mma.sync`, `wgmma`),
-    tensor memory accelerator (TMA) operations, asynchronous copies
-    (`cp.async`, bulk/tensor variants) with memory barriers, cache and
-    prefetch controls, and NVVM-specific attributes and enums (e.g., FP
-    rounding modes, memory scopes, and MMA types/layouts).
-
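-    For instance, two of these primitives as they appear in IR (a minimal
-    sketch; the enclosing function is hypothetical):
-
-    ```mlir
-    llvm.func @axis_and_sync() {
-      // Read the x-coordinate of the current thread within its CTA.
-      %tid = nvvm.read.ptx.sreg.tid.x : i32
-      // CTA-wide barrier, i.e. PTX `bar.sync 0`.
-      nvvm.barrier0
-      llvm.return
-    }
-    ```
-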
-    **Non-goals:** NVVM is not a place for convenience or “wrapper” ops. It is
-    not intended to introduce high-level ops that expand into multiple
-    unrelated NVVM intrinsics or that lower to no intrinsic at all. Such
-    abstractions belong in higher-level dialects (e.g., `nvgpu`, `gpu`, or
-    project-specific dialects). The design intent is a thin, predictable,
-    low-level surface with near-mechanical lowering to NVVM/LLVM IR.
-
-    **Placement in the lowering pipeline:** NVVM sits below target-agnostic
-    dialects like `gpu` and NVIDIA's `nvgpu`. Typical pipelines convert
-    `gpu`/`nvgpu` ops into NVVM using `-convert-gpu-to-nvvm` and
-    `-convert-nvgpu-to-nvvm`, then translate into LLVM IR for final code
-    generation via the NVPTX backend, as sketched below.
-
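-    A minimal sketch of such an invocation, assuming the pass names above and
-    a hypothetical input file:
-
-    ```shell
-    mlir-opt kernel.mlir \
-      -convert-nvgpu-to-nvvm \
-      -convert-gpu-to-nvvm \
-      -o kernel-nvvm.mlir
-    ```
-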
-    **Target configuration and serialization:** NVVM provides a `#nvvm.target`
-    attribute to describe the GPU target (SM, features, and flags). In
-    conjunction with `gpu` serialization (e.g., `gpu-module-to-binary`), this
-    enables producing architecture-specific GPU binaries (such as CUBIN) from
-    nested GPU modules.
-
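-    For example, a target attached to a GPU module (a minimal sketch; the
-    module name and the chip/features values are illustrative):
-
-    ```mlir
-    gpu.module @kernels [#nvvm.target<chip = "sm_90", features = "+ptx80">] {
-      // ... kernel functions ...
-    }
-    ```
-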
-    **Inline PTX:** When an intrinsic is unavailable or a performance-critical
-    sequence must be expressed directly, NVVM provides an `nvvm.inline_ptx` op
-    to embed PTX inline as a last-resort escape hatch, with explicit operands
-    and results.
-
-    **Memory Spaces:** The NVVM dialect introduces the following memory
-    spaces, each with distinct scopes and lifetimes:
-
-    ```
-    | Memory Space     | Address Space | Scope                | Lifetime          |
-    |------------------|---------------|----------------------|-------------------|
-    | `generic`        | 0             | All threads          | Context-dependent |
-    | `global`         | 1             | All threads (device) | Application       |
-    | `shared`         | 3             | Thread block (CTA)   | Kernel execution  |
-    | `constant`       | 4             | All threads (RO)     | Application       |
-    | `local`          | 5             | Single thread        | Kernel execution  |
-    | `tensor`         | 6             | Thread block (CTA)   | Kernel execution  |
-    | `shared_cluster` | 7             | Thread block cluster | Kernel execution  |
-    ```
-
-    **Memory Space Details** (see the pointer sketch after this list):
-    - **generic**: Can point to any memory space; requires runtime resolution
-      of the actual address space. Use when the pointer's origin is unknown at
-      compile time. Performance varies with the underlying memory space.
-    - **global**: Accessible by all threads across all blocks; persists across
-      kernel launches. Highest latency but largest capacity (device memory).
-      Best for large data and inter-kernel communication.
-    - **shared**: Shared within a thread block (CTA); very fast on-chip memory
-      for cooperation between threads in the same block. Limited capacity.
-      Ideal for block-level collaboration, caching, and reducing global memory
-      traffic.
-    - **constant**: Read-only memory cached per SM. Size typically limited to
-      64 KB. Best for read-only data and uniform values accessed by all
-      threads.
-    - **local**: Private to each thread. Use for per-thread private data and
-      automatic variables that don't fit in registers.
-    - **tensor**: Special memory space for tensor core operations. Used by
-      `tcgen05` instructions on SM 100+ for tensor input/output operations.
-    - **shared_cluster**: Distributed shared memory across thread blocks
-      within a cluster (SM 90+). Enables collaboration beyond single-block
-      scope with fast access across cluster threads.
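-
-    In the LLVM dialect, these spaces surface as address-space-qualified
-    pointer types (a minimal sketch; the staging function is hypothetical):
-
-    ```mlir
-    llvm.func @stage(%src : !llvm.ptr<1>, %dst : !llvm.ptr<3>) {
-      // Load from global memory (address space 1)...
-      %v = llvm.load %src : !llvm.ptr<1> -> f32
-      // ...and store into CTA-shared memory (address space 3).
-      llvm.store %v, %dst : f32, !llvm.ptr<3>
-      llvm.return
-    }
-    ```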
-  }];
-
   let name = "nvvm";
   let cppNamespace = "::mlir::NVVM";
   let dependentDialects = ["LLVM::LLVMDialect"];