Commit 99bff65

Author: git apple-llvm automerger (committed)
Merge commit '6ecbed8b3d65' from llvm.org/main into next
2 parents 6cd76dc + 6ecbed8 commit 99bff65

File tree

1 file changed: +46 −2 lines changed


mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td

Lines changed: 46 additions & 2 deletions
@@ -35,6 +35,50 @@ def LLVM_PointerSharedCluster : LLVM_PointerInAddressSpace<7>;
 //===----------------------------------------------------------------------===//
 
 def NVVM_Dialect : Dialect {
+  let summary = "The NVVM dialect that models NVIDIA's public ISA";
+
+  let description = [{
+    The NVVM dialect is MLIR's LLVM-IR-based, NVIDIA-specific backend dialect.
+    It models NVVM intrinsics and public ISA functionality and introduces
+    NVIDIA extensions to the MLIR/LLVM type system and address spaces (e.g.,
+    global, shared, and cluster memory), enabling faithful lowering of GPU
+    kernels to the NVPTX toolchain. While an NVVM op usually maps to a single
+    LLVM IR intrinsic, the NVVM dialect uses type polymorphism and other
+    attributes so that a single NVVM op can map to different LLVM intrinsics.
+
+    **Scope and capabilities:** The dialect covers core GPU features such as
+    thread/block builtins, barriers and atomics, warp-level collectives (e.g.,
+    shuffle/vote), matrix/tensor core operations (e.g., `mma.sync`, `wgmma`),
+    tensor memory accelerator (TMA) operations, asynchronous copies
+    (`cp.async`, bulk/tensor variants) with memory barriers, cache and
+    prefetch controls, and NVVM-specific attributes and enums (e.g., FP
+    rounding modes, memory scopes, and MMA types/layouts).
+
+    **Non-goals:** NVVM is not a place for convenience or “wrapper” ops. It is
+    not intended to introduce high-level ops that expand into multiple
+    unrelated NVVM intrinsics or that lower to no intrinsic at all. Such
+    abstractions belong in higher-level dialects (e.g., `nvgpu`, `gpu`, or
+    project-specific dialects). The design intent is a thin, predictable,
+    low-level surface with near-mechanical lowering to NVVM/LLVM IR.
+
+    **Placement in the lowering pipeline:** NVVM sits below target-agnostic
+    dialects like `gpu` and NVIDIA's `nvgpu`. Typical pipelines convert
+    `gpu`/`nvgpu` ops into NVVM using `-convert-gpu-to-nvvm` and
+    `-convert-nvgpu-to-nvvm`, then translate into LLVM IR for final code
+    generation via the NVPTX backend.
+
+    **Target configuration and serialization:** NVVM provides a `#nvvm.target`
+    attribute to describe the GPU target (SM, features, and flags). In
+    conjunction with `gpu` serialization (e.g., `gpu-module-to-binary`), this
+    enables producing architecture-specific GPU binaries (such as CUBIN) from
+    nested GPU modules.
+
+    **Inline PTX:** When an intrinsic is unavailable or a performance-critical
+    sequence must be expressed directly, NVVM provides an `nvvm.inline_ptx` op
+    to embed PTX inline as a last-resort escape hatch, with explicit operands
+    and results.
+  }];
+
   let name = "nvvm";
   let cppNamespace = "::mlir::NVVM";
   let dependentDialects = ["LLVM::LLVMDialect"];
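To make the description above concrete, here is an illustrative (hypothetical) snippet showing how a GPU module might carry a `#nvvm.target` attribute and use an NVVM thread-builtin op; the module name, chip, and feature string are placeholders, not from this commit:

```mlir
// Illustrative only: a gpu.module annotated with an #nvvm.target attribute,
// containing a function that reads the x-coordinate thread-id builtin.
gpu.module @kernels [#nvvm.target<chip = "sm_90", features = "+ptx80">] {
  llvm.func @tid_x() -> i32 {
    %0 = nvvm.read.ptx.sreg.tid.x : i32
    llvm.return %0 : i32
  }
}
```

With the target attribute in place, a pass such as `gpu-module-to-binary` can serialize the nested module into an architecture-specific binary, as the description notes.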
@@ -976,7 +1020,7 @@ def NVVM_ShflOp :
9761020
let description = [{
9771021
The `shfl.sync` Op implements data shuffle within threads of a warp.
9781022
The `thread_mask` denotes the threads participating in the Op where
979-
the bit position corresponds to a particular threads laneid.
1023+
the bit position corresponds to a particular thread's laneid.
9801024
The `offset` specifies a source lane or source lane offset
9811025
(depending on `kind`). The `val` is the input value to be copied from
9821026
the source. The `mask_and_clamp` contains two packed values specifying
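The `offset`/`mask_and_clamp` operands described above follow the source-lane selection rules in the PTX ISA. A minimal pure-Python sketch of that selection logic (names like `shfl_sync` and `kind` are illustrative, not the dialect's API):

```python
# Sketch of PTX shfl.sync source-lane selection, per the PTX ISA pseudocode.
# mask_and_clamp packs two values: bits [4:0] are the clamp (upper bound on
# the source lane) and bits [12:8] are the segment mask that logically
# splits the warp into sub-segments.
WARP_SIZE = 32

def shfl_sync(vals, kind, offset, mask_and_clamp):
    """Return (results, valid_flags) for all 32 lanes of one warp."""
    bval = offset & 0x1F
    cval = mask_and_clamp & 0x1F            # clamp value
    segmask = (mask_and_clamp >> 8) & 0x1F  # sub-segment mask
    results, valid = [], []
    for lane in range(WARP_SIZE):
        max_lane = (lane & segmask) | (cval & ~segmask & 0x1F)
        min_lane = lane & segmask
        if kind == "up":
            j, p = lane - bval, (lane - bval) >= max_lane
        elif kind == "down":
            j, p = lane + bval, (lane + bval) <= max_lane
        elif kind == "bfly":
            j, p = lane ^ bval, (lane ^ bval) <= max_lane
        else:  # "idx"
            j = min_lane | (bval & ~segmask & 0x1F)
            p = j <= max_lane
        if not p:
            j = lane  # out-of-range source: lane keeps its own value
        results.append(vals[j])
        valid.append(p)
    return results, valid
```

For example, `kind="bfly"` with `offset=1` and `mask_and_clamp=0x1F` swaps adjacent lanes, while `kind="idx"` with the same clamp broadcasts a single source lane to the whole warp.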
@@ -1031,7 +1075,7 @@ def NVVM_VoteSyncOp
10311075
- `ballot`: In the ballot form, the destination result is a 32 bit integer.
10321076
In this form, the predicate from each thread in membermask are copied into
10331077
the corresponding bit position of the result, where the bit position
1034-
corresponds to the threads lane id.
1078+
corresponds to the thread's lane id.
10351079

10361080
[For more information, see PTX ISA](https://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-vote-sync)
10371081
}];
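The bit-position/lane-id correspondence in the `ballot` form can be sketched in a few lines of Python (the helper name `vote_ballot_sync` is illustrative, not the dialect's API):

```python
# Sketch of the ballot form of PTX vote.sync: bit i of the result is
# lane i's predicate, restricted to lanes selected by membermask.
WARP_SIZE = 32

def vote_ballot_sync(membermask, predicates):
    """Pack per-lane predicates into a 32-bit ballot word."""
    ballot = 0
    for lane in range(WARP_SIZE):
        if (membermask >> lane) & 1 and predicates[lane]:
            ballot |= 1 << lane  # lane id selects the bit position
    return ballot
```

For instance, a warp where exactly the odd lanes hold a true predicate yields the ballot `0xAAAAAAAA`.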
