Conversation

@grypp
Member

@grypp grypp commented Nov 17, 2025

No description provided.

@llvmbot
Member

llvmbot commented Nov 17, 2025

@llvm/pr-subscribers-mlir-llvm

Author: Guray Ozen (grypp)

Changes

Full diff: https://github.com/llvm/llvm-project/pull/168375.diff

2 Files Affected:

  • (added) mlir/docs/Dialects/NVVM/_index.md (+84)
  • (modified) mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td (-43)
diff --git a/mlir/docs/Dialects/NVVM/_index.md b/mlir/docs/Dialects/NVVM/_index.md
new file mode 100644
index 0000000000000..d8b64cda6054b
--- /dev/null
+++ b/mlir/docs/Dialects/NVVM/_index.md
@@ -0,0 +1,84 @@
+# NVVM Dialect
+
+The NVVM dialect is MLIR's LLVM-IR-based, NVIDIA-specific backend dialect. It
+models NVVM intrinsics and public ISA functionality and introduces NVIDIA
+extensions to the MLIR/LLVM type system and address spaces (e.g., global,
+shared, and cluster memory), enabling faithful lowering of GPU kernels to the
+NVPTX toolchain. While an NVVM op usually maps to a single LLVM IR intrinsic,
+the NVVM dialect uses type polymorphism and other attributes so that a single
+NVVM op can map to different LLVM intrinsics.
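+
+For example (a hedged sketch: the function name is illustrative and the exact
+assembly should be checked against the generated op reference), the single
+`nvvm.shfl.sync` op selects between the `llvm.nvvm.shfl.sync.bfly.i32` and
+`llvm.nvvm.shfl.sync.bfly.f32` intrinsics based on the type of its value
+operand:
+
+```mlir
+llvm.func @shuffle_both_types(%mask : i32, %offset : i32, %clamp : i32,
+                              %vi : i32, %vf : f32) {
+  // Same NVVM op; the value type picks the LLVM intrinsic.
+  %a = nvvm.shfl.sync bfly %mask, %vi, %offset, %clamp : i32 -> i32
+  %b = nvvm.shfl.sync bfly %mask, %vf, %offset, %clamp : f32 -> f32
+  llvm.return
+}
+```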
+
+## Scope and Capabilities
+
+The dialect covers core GPU features such as thread/block builtins, barriers
+and atomics, warp-level collectives (e.g., shuffle/vote), matrix/tensor core
+operations (e.g., `mma.sync`, `wgmma`), tensor memory accelerator (TMA)
+operations, asynchronous copies (`cp.async`, bulk/tensor variants) with memory
+barriers, cache and prefetch controls, and NVVM-specific attributes and enums
+(e.g., FP rounding modes, memory scopes, and MMA types/layouts).
+
+## Placement in the Lowering Pipeline
+
+NVVM sits below target-agnostic dialects like `gpu` and NVIDIA's `nvgpu`.
+Typical pipelines convert `gpu`/`nvgpu` ops into NVVM using
+`-convert-gpu-to-nvvm` and `-convert-nvgpu-to-nvvm`, then translate into LLVM
+for final code generation via the NVPTX backend.
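+
+As a small, hedged illustration (the kernel below is made up for this example;
+pass ordering and options vary by project), `-convert-gpu-to-nvvm` rewrites
+target-agnostic ops such as `gpu.thread_id` into their NVVM counterparts:
+
+```mlir
+gpu.module @kernels {
+  gpu.func @tid_kernel() kernel {
+    // -convert-gpu-to-nvvm rewrites this into nvvm.read.ptx.sreg.tid.x
+    // (plus an index cast), leaving an LLVM/NVVM-level body for translation.
+    %tid = gpu.thread_id x
+    gpu.return
+  }
+}
+```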
+
+## Target Configuration and Serialization
+
+NVVM provides a `#nvvm.target` attribute to describe the GPU target (SM,
+features, and flags). In conjunction with `gpu` serialization (e.g.,
+`gpu-module-to-binary`), this enables producing architecture-specific GPU
+binaries (such as CUBIN) from nested GPU modules.
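+
+A minimal sketch (the chip and features strings below are illustrative
+assumptions, not a recommendation):
+
+```mlir
+// Attach an NVVM target so gpu-module-to-binary can serialize this module
+// into an architecture-specific binary.
+gpu.module @kernels [#nvvm.target<chip = "sm_90", features = "+ptx80">] {
+  llvm.func @kernel() attributes {nvvm.kernel} {
+    llvm.return
+  }
+}
+```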
+
+## Inline PTX
+
+When an intrinsic is unavailable or a performance-critical sequence must be
+expressed directly, NVVM provides an `nvvm.inline_ptx` op to embed PTX inline
+as a last-resort escape hatch, with explicit operands and results.
+
+## Memory Spaces
+
+The NVVM dialect introduces the following memory spaces, each with distinct
+scopes and lifetimes:
+
+| Memory Space      | Address Space | Scope                | Lifetime          |
+|-------------------|---------------|----------------------|-------------------|
+| `generic`         | 0             | All threads          | Context-dependent |
+| `global`          | 1             | All threads (device) | Application       |
+| `shared`          | 3             | Thread block (CTA)   | Kernel execution  |
+| `constant`        | 4             | All threads (RO)     | Application       |
+| `local`           | 5             | Single thread        | Kernel execution  |
+| `tensor`          | 6             | Thread block (CTA)   | Kernel execution  |
+| `shared_cluster`  | 7             | Thread block cluster | Kernel execution  |
+
+### Memory Space Details
+
+- **generic**: Can point to any memory space; requires runtime resolution of
+  the actual address space. Use when the pointer's origin is unknown at
+  compile time.
+  Performance varies based on the underlying memory space.
+- **global**: Accessible by all threads across all blocks; persists across
+  kernel launches. Highest latency but largest capacity (device memory). Best
+  for large data and inter-kernel communication.
+- **shared**: Shared within a thread block (CTA); very fast on-chip memory for
+  cooperation between threads in the same block. Limited capacity. Ideal for
+  block-level collaboration, caching, and reducing global memory traffic.
+- **constant**: Read-only memory cached per SM. Size is typically limited to
+  64 KB. Best for read-only data and uniform values accessed by all threads.
+- **local**: Private to each thread. Use for per-thread private data and
+  automatic variables that don't fit in registers.
+- **tensor**: Special memory space for tensor core operations. Used by
+  `tcgen05` instructions on SM 100+ for tensor input/output operations.
+- **shared_cluster**: Distributed shared memory across thread blocks within a
+  cluster (SM 90+). Enables collaboration beyond single-block scope with fast
+  access across cluster threads.
+
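+The sketch below (hedged; the function is illustrative) shows how these memory
+spaces surface in IR: they appear as the address-space field of LLVM dialect
+pointer types, and `llvm.addrspacecast` converts into the generic space:
+
+```mlir
+llvm.func @load_from_shared(%p : !llvm.ptr<3>) -> f32 {
+  // Cast a shared-memory (address space 3) pointer to generic (0),
+  // then load through the generic pointer.
+  %g = llvm.addrspacecast %p : !llvm.ptr<3> to !llvm.ptr
+  %v = llvm.load %g : !llvm.ptr -> f32
+  llvm.return %v : f32
+}
+```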
+
+## Non-Goals
+
+NVVM is not a place for convenience or "wrapper" ops. It is not intended to
+introduce high-level ops that expand into multiple unrelated NVVM intrinsics or
+that lower to no intrinsic at all. Such abstractions belong in higher-level
+dialects (e.g., `nvgpu`, `gpu`, or project-specific dialects). The design
+intent is a thin, predictable, low-level surface with near-mechanical lowering
+to NVVM/LLVM IR.
\ No newline at end of file
diff --git a/mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td b/mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td
index 1cc5b74a3cb67..627a6bfe21eef 100644
--- a/mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td
+++ b/mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td
@@ -38,49 +38,6 @@ def LLVM_PointerSharedCluster : LLVM_PointerInAddressSpace<7>;
 
 def NVVM_Dialect : Dialect {
   let summary = "The NVVM dialect that models NVIDIA's public ISA";
-
-  let description = [{
-    The NVVM dialect is MLIR's LLVM-IR-based, NVIDIA-specific backend dialect. It
-    models NVVM intrinsics and public ISA functionality and introduces NVIDIA
-    extensions to the MLIR/LLVM type system and address spaces (e.g., global,
-    shared, and cluster memory), enabling faithful lowering of GPU kernels to the
-    NVPTX toolchain. While a NVVM op usually maps to a single LLVM IR intrinsic,
-    the NVVM dialect uses type polymorphism and other attributes so that a single
-    NVVM op can map to different LLVM intrinsics.
-
-    **Scope and capabilities:** The dialect covers core GPU features such as
-    thread/block builtins, barriers and atomics, warp-level collectives (e.g.,
-    shuffle/vote), matrix/tensor core operations (e.g., `mma.sync`, `wgmma`),
-    tensor memory accelerator (TMA) operations, asynchronous copies (`cp.async`,
-    bulk/tensor variants) with memory barriers, cache and prefetch controls, and
-    NVVM-specific attributes and enums (e.g., FP rounding modes, memory scopes,
-    and MMA types/layouts).
-
-    **Non-goals:** NVVM is not a place for convenience or “wrapper” ops. It is
-    not intended to introduce high-level ops that expand into multiple unrelated
-    NVVM intrinsics or that lower to no intrinsic at all. Such abstractions belong
-    in higher-level dialects (e.g., `nvgpu`, `gpu`, or project-specific dialects).
-    The design intent is a thin, predictable, low-level surface with
-    near-mechanical lowering to NVVM/LLVM IR.
-
-    **Placement in the lowering pipeline:** NVVM sits below target-agnostic
-    dialects like `gpu` and NVIDIA's `nvgpu`. Typical pipelines convert
-    `gpu`/`nvgpu` ops into NVVM using `-convert-gpu-to-nvvm` and
-    `-convert-nvgpu-to-nvvm`, then translate into LLVM for final code
-    generation via NVPTX backend.
-
-    **Target configuration and serialization:** NVVM provides a `#nvvm.target`
-    attribute to describe the GPU target (SM, features, and flags). In
-    conjunction with `gpu` serialization (e.g., `gpu-module-to-binary`), this
-    enables producing architecture-specific GPU binaries (such as CUBIN) from
-    nested GPU modules.
-
-    **Inline PTX:** When an intrinsic is unavailable or a performance-critical
-    sequence must be expressed directly, NVVM provides an `nvvm.inline_ptx` op to
-    embed PTX inline as a last-resort escape hatch, with explicit operands and
-    results.
-  }];
-
   let name = "nvvm";
   let cppNamespace = "::mlir::NVVM";
   let dependentDialects = ["LLVM::LLVMDialect"];

@llvmbot
Member

llvmbot commented Nov 17, 2025

@llvm/pr-subscribers-mlir

Contributor


For future:
Maybe we should add a link to that MD web page here after this PR lands.

@durga4github
Contributor

Change LGTM.

Is there a link from the CI builder where we can view the final rendering before we land?

@grypp grypp reopened this Nov 18, 2025
@github-actions

🐧 Linux x64 Test Results

  • 7081 tests passed
  • 594 tests skipped

@grypp grypp merged commit 76dac58 into llvm:main Nov 18, 2025
13 of 17 checks passed