+bulk/tensor variants) with memory barriers, cache and prefetch controls, and
+NVVM-specific attributes and enums (e.g., FP rounding modes, memory scopes,
+and MMA types/layouts).
+
+**Non-goals:** NVVM is not a place for convenience or "wrapper" ops. It is
+not intended to introduce high-level ops that expand into multiple unrelated
+NVVM intrinsics or that lower to no intrinsic at all. Such abstractions belong
+in higher-level dialects (e.g., `nvgpu`, `gpu`, or project-specific dialects).
+The design intent is a thin, predictable, low-level surface with
+near-mechanical lowering to NVVM/LLVM IR.
+
+**Placement in the lowering pipeline:** NVVM sits below target-agnostic
+dialects like `gpu` and NVIDIA's `nvgpu`. Typical pipelines convert
+`gpu`/`nvgpu` ops into NVVM using `-convert-gpu-to-nvvm` and
+`-convert-nvgpu-to-nvvm`, then translate into LLVM IR for final code
+generation via the NVPTX backend.
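
A minimal sketch of that lowering, assuming the standard `-convert-gpu-to-nvvm` patterns (index/i32 conversion casts elided):

```mlir
// Before: target-agnostic gpu dialect op.
%tid = gpu.thread_id x

// After -convert-gpu-to-nvvm: a thin NVVM op that maps 1:1 to the
// PTX special-register read %tid.x (conversion casts elided).
%tid32 = nvvm.read.ptx.sreg.tid.x : i32
```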
+
+**Target configuration and serialization:** NVVM provides a `#nvvm.target`
+attribute to describe the GPU target (SM, features, and flags). In
+conjunction with `gpu` serialization (e.g., `gpu-module-to-binary`), this
+enables producing architecture-specific GPU binaries (such as CUBIN) from
+nested GPU modules.
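
A minimal sketch, assuming a Hopper-class target (the `sm_90` chip string is an illustrative choice):

```mlir
// A #nvvm.target attribute on a nested GPU module; the gpu-module-to-binary
// pass can then serialize this module to an architecture-specific binary.
gpu.module @kernels [#nvvm.target<chip = "sm_90">] {
}
```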
+
+**Inline PTX:** When an intrinsic is unavailable or a performance-critical
+sequence must be expressed directly, NVVM provides an `nvvm.inline_ptx` op to
+embed PTX inline as a last-resort escape hatch, with explicit operands and
+results.
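
A hypothetical sketch of that escape hatch; the exact `nvvm.inline_ptx` operand and result syntax shown here is an assumption, not confirmed by this commit:

```mlir
// Hypothetical: one read-only operand ($1) and one result ($0); the
// operand/result syntax is an assumption for illustration only.
%y = nvvm.inline_ptx "ex2.approx.ftz.f32 $0, $1;" (%x : f32) -> f32
```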
+}];
+
 let name = "nvvm";
 let cppNamespace = "::mlir::NVVM";
 let dependentDialects = ["LLVM::LLVMDialect"];
@@ -976,7 +1020,7 @@ def NVVM_ShflOp :
 let description = [{
 The `shfl.sync` Op implements data shuffle within threads of a warp.
 The `thread_mask` denotes the threads participating in the Op where
-the bit position corresponds to a particular thread’s laneid.
+the bit position corresponds to a particular thread's laneid.
 The `offset` specifies a source lane or source lane offset
 (depending on `kind`). The `val` is the input value to be copied from
 the source. The `mask_and_clamp` contains two packed values specifying
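
An illustrative use of the op (a minimal sketch; operand order follows the description above):

```mlir
// Butterfly shuffle over a full warp: each lane exchanges %val with the
// lane whose id differs in the bits selected by %offset.
%mask  = llvm.mlir.constant(-1 : i32) : i32  // all 32 lanes participate
%clamp = llvm.mlir.constant(31 : i32) : i32  // segment end / clamp value
%r = nvvm.shfl.sync bfly %mask, %val, %offset, %clamp : f32 -> f32
```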
@@ -1031,7 +1075,7 @@ def NVVM_VoteSyncOp
 - `ballot`: In the ballot form, the destination result is a 32-bit integer.
 In this form, the predicate from each thread in membermask is copied into
 the corresponding bit position of the result, where the bit position
-corresponds to the thread’s lane id.
+corresponds to the thread's lane id.

 [For more information, see PTX ISA](https://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-vote-sync)
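
For illustration, a minimal sketch of the ballot form described above:

```mlir
// Bit i of %ballot holds lane i's predicate across the warp.
%ballot = nvvm.vote.sync ballot %membermask, %pred -> i32
```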