
Commit 76dac58

[MLIR][NVVM] Move the docs to markdown file (#168375)
1 parent 591c463 commit 76dac58

File tree: 2 files changed (+84, -78 lines)


mlir/docs/Dialects/NVVM/_index.md

Lines changed: 84 additions & 0 deletions
@@ -0,0 +1,84 @@
# NVVM Dialect

The NVVM dialect is MLIR's LLVM-IR-based, NVIDIA-specific backend dialect. It
models NVVM intrinsics and public ISA functionality and introduces NVIDIA
extensions to the MLIR/LLVM type system and address spaces (e.g., global,
shared, and cluster memory), enabling faithful lowering of GPU kernels to the
NVPTX toolchain. While an NVVM op usually maps to a single LLVM IR intrinsic,
the NVVM dialect uses type polymorphism and other attributes so that a single
NVVM op can map to different LLVM intrinsics.
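As a small illustration, the sketch below uses two representative NVVM ops that
lower essentially one-to-one to NVVM intrinsics. This is a sketch only: the
function name is made up, and exact textual forms can differ between MLIR
versions.

```mlir
llvm.func @illustrative_kernel() {
  // Reads the x thread index; lowers to the corresponding NVVM
  // special-register intrinsic.
  %tid = nvvm.read.ptx.sreg.tid.x : i32
  // Block-wide barrier; lowers to the barrier0 intrinsic.
  nvvm.barrier0
  llvm.return
}
```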
## Scope and Capabilities

The dialect covers core GPU features such as thread/block builtins, barriers
and atomics, warp-level collectives (e.g., shuffle/vote), matrix/tensor core
operations (e.g., `mma.sync`, `wgmma`), tensor memory accelerator (TMA)
operations, asynchronous copies (`cp.async`, bulk/tensor variants) with memory
barriers, cache and prefetch controls, and NVVM-specific attributes and enums
(e.g., FP rounding modes, memory scopes, and MMA types/layouts).
## Placement in the Lowering Pipeline

NVVM sits below target-agnostic dialects like `gpu` and NVIDIA's `nvgpu`.
Typical pipelines convert `gpu`/`nvgpu` ops into NVVM using
`-convert-gpu-to-nvvm` and `-convert-nvgpu-to-nvvm`, then translate into LLVM
for final code generation via the NVPTX backend.
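As a rough sketch of this placement, the snippet below shows a target-agnostic
`gpu` op and the NVVM op that the conversion typically rewrites it to. It is
simplified and not a standalone module; the actual conversion also inserts
index casts, which are omitted here.

```mlir
// Before conversion: target-agnostic GPU dialect.
%tid_idx = gpu.thread_id x

// After -convert-gpu-to-nvvm (simplified): the NVVM special-register op.
%tid_i32 = nvvm.read.ptx.sreg.tid.x : i32
```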
## Target Configuration and Serialization

NVVM provides a `#nvvm.target` attribute to describe the GPU target (SM,
features, and flags). In conjunction with `gpu` serialization (e.g.,
`gpu-module-to-binary`), this enables producing architecture-specific GPU
binaries (such as CUBIN) from nested GPU modules.
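A minimal sketch of attaching a target to a nested GPU module; the module name,
chip, and feature string are illustrative and should match your hardware and
PTX version.

```mlir
// The target attribute is attached to the gpu.module; a later
// gpu-module-to-binary pass uses it to drive architecture-specific
// code generation.
gpu.module @kernels [#nvvm.target<chip = "sm_90", features = "+ptx80">] {
  // ... gpu.func kernels ...
}
```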
## Inline PTX

When an intrinsic is unavailable or a performance-critical sequence must be
expressed directly, NVVM provides an `nvvm.inline_ptx` op to embed PTX inline
as a last-resort escape hatch, with explicit operands and results.
## Memory Spaces

The NVVM dialect introduces the following memory spaces, each with distinct
scopes and lifetimes:

| Memory Space     | Address Space | Scope                |
|------------------|---------------|----------------------|
| `generic`        | 0             | All threads          |
| `global`         | 1             | All threads (device) |
| `shared`         | 3             | Thread block (CTA)   |
| `constant`       | 4             | All threads          |
| `local`          | 5             | Single thread        |
| `tensor`         | 6             | Thread block (CTA)   |
| `shared_cluster` | 7             | Thread block cluster |
### Memory Space Details

- **generic**: Can point to any memory space; requires runtime resolution of
  the actual address space. Use when pointer origin is unknown at compile time
  (see the sketch after this list). Performance varies based on the underlying
  memory space.
- **global**: Accessible by all threads across all blocks; persists across
  kernel launches. Highest latency but largest capacity (device memory). Best
  for large data and inter-kernel communication.
- **shared**: Shared within a thread block (CTA); very fast on-chip memory for
  cooperation between threads in the same block. Limited capacity. Ideal for
  block-level collaboration, caching, and reducing global memory traffic.
- **constant**: Read-only memory cached per SM. Size typically limited to
  64 KB. Best for read-only data and uniform values accessed by all threads.
- **local**: Private to each thread. Use for per-thread private data and
  automatic variables that don't fit in registers.
- **tensor**: Special memory space for tensor core operations. Used by
  `tcgen05` instructions on SM 100+ for tensor input/output operations.
- **shared_cluster**: Distributed shared memory across thread blocks within a
  cluster (SM 90+). Enables collaboration beyond single-block scope with fast
  access across cluster threads.
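In the LLVM/NVVM dialects these spaces appear as the numeric address space on
pointer types. A minimal sketch (the function and value names are
illustrative):

```mlir
llvm.func @address_spaces(%global: !llvm.ptr<1>, %shared: !llvm.ptr<3>) {
  // Casting a shared (3) pointer to the generic (0) address space; accesses
  // through the result require runtime resolution of the underlying space.
  %generic = llvm.addrspacecast %shared : !llvm.ptr<3> to !llvm.ptr
  llvm.return
}
```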
## Non-Goals

NVVM is not a place for convenience or "wrapper" ops. It is not intended to
introduce high-level ops that expand into multiple unrelated NVVM intrinsics or
that lower to no intrinsic at all. Such abstractions belong in higher-level
dialects (e.g., `nvgpu`, `gpu`, or project-specific dialects). The design
intent is a thin, predictable, low-level surface with near-mechanical lowering
to NVVM/LLVM IR.

mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td

Lines changed: 0 additions & 78 deletions
@@ -37,84 +37,6 @@ def LLVM_PointerSharedCluster : LLVM_PointerInAddressSpace<7>;
//===----------------------------------------------------------------------===//

def NVVM_Dialect : Dialect {
  let summary = "The NVVM dialect that models NVIDIA's public ISA";

  let description = [{
    The NVVM dialect is MLIR's LLVM-IR-based, NVIDIA-specific backend dialect. It
    models NVVM intrinsics and public ISA functionality and introduces NVIDIA
    extensions to the MLIR/LLVM type system and address spaces (e.g., global,
    shared, and cluster memory), enabling faithful lowering of GPU kernels to the
    NVPTX toolchain. While a NVVM op usually maps to a single LLVM IR intrinsic,
    the NVVM dialect uses type polymorphism and other attributes so that a single
    NVVM op can map to different LLVM intrinsics.

    **Scope and capabilities:** The dialect covers core GPU features such as
    thread/block builtins, barriers and atomics, warp-level collectives (e.g.,
    shuffle/vote), matrix/tensor core operations (e.g., `mma.sync`, `wgmma`),
    tensor memory accelerator (TMA) operations, asynchronous copies (`cp.async`,
    bulk/tensor variants) with memory barriers, cache and prefetch controls, and
    NVVM-specific attributes and enums (e.g., FP rounding modes, memory scopes,
    and MMA types/layouts).

    **Non-goals:** NVVM is not a place for convenience or “wrapper” ops. It is
    not intended to introduce high-level ops that expand into multiple unrelated
    NVVM intrinsics or that lower to no intrinsic at all. Such abstractions belong
    in higher-level dialects (e.g., `nvgpu`, `gpu`, or project-specific dialects).
    The design intent is a thin, predictable, low-level surface with
    near-mechanical lowering to NVVM/LLVM IR.

    **Placement in the lowering pipeline:** NVVM sits below target-agnostic
    dialects like `gpu` and NVIDIA's `nvgpu`. Typical pipelines convert
    `gpu`/`nvgpu` ops into NVVM using `-convert-gpu-to-nvvm` and
    `-convert-nvgpu-to-nvvm`, then translate into LLVM for final code
    generation via NVPTX backend.

    **Target configuration and serialization:** NVVM provides a `#nvvm.target`
    attribute to describe the GPU target (SM, features, and flags). In
    conjunction with `gpu` serialization (e.g., `gpu-module-to-binary`), this
    enables producing architecture-specific GPU binaries (such as CUBIN) from
    nested GPU modules.

    **Inline PTX:** When an intrinsic is unavailable or a performance-critical
    sequence must be expressed directly, NVVM provides an `nvvm.inline_ptx` op to
    embed PTX inline as a last-resort escape hatch, with explicit operands and
    results.

    **Memory Spaces:** The NVVM dialect introduces the following memory spaces,
    each with distinct scopes and lifetimes:
    ```
    | Memory Space     | Address Space | Scope                | Lifetime          |
    |------------------|---------------|----------------------|-------------------|
    | `generic`        | 0             | All threads          | Context-dependent |
    | `global`         | 1             | All threads (device) | Application       |
    | `shared`         | 3             | Thread block (CTA)   | Kernel execution  |
    | `constant`       | 4             | All threads (RO)     | Application       |
    | `local`          | 5             | Single thread        | Kernel execution  |
    | `tensor`         | 6             | Thread block (CTA)   | Kernel execution  |
    | `shared_cluster` | 7             | Thread block cluster | Kernel execution  |
    ```
    **Memory Space Details:**
    - **generic**: Can point to any memory space; requires runtime resolution of
      actual address space. Use when pointer origin is unknown at compile time.
      Performance varies based on the underlying memory space.
    - **global**: Accessible by all threads across all blocks; persists across
      kernel launches. Highest latency but largest capacity (device memory). Best
      for large data and inter-kernel communication.
    - **shared**: Shared within a thread block (CTA); very fast on-chip memory for
      cooperation between threads in the same block. Limited capacity. Ideal for
      block-level collaboration, caching, and reducing global memory traffic.
    - **constant**: Read-only memory cached per SM. Size typically limited to
      64KB. Best for read-only data and uniform values accessed by all threads.
    - **local**: Private to each thread. Use for per-thread private data and
      automatic variables that don't fit in registers.
    - **tensor**: Special memory space for tensor core operations. Used by
      `tcgen05` instructions on SM 100+ for tensor input/output operations.
    - **shared_cluster**: Distributed shared memory across thread blocks within
      a cluster (SM 90+). Enables collaboration beyond single-block scope with
      fast access across cluster threads.
  }];

  let name = "nvvm";
  let cppNamespace = "::mlir::NVVM";
  let dependentDialects = ["LLVM::LLVMDialect"];
