Conversation

@abhilash1910
Contributor

[NVPTX] Add Intrinsics for discard.*
This PR adds intrinsics for all variations of the PTX discard.* instruction.

  • These intrinsics support both generic and global address spaces for all variations.
  • Lowering is handled directly from the NVVM intrinsics to NVPTX instructions via TableGen (see the sketch after this description).
  • Lit tests are added in discard.ll.
  • The generated PTX is verified with a 12.3 ptxas executable.
  • Documentation for these intrinsics is added in NVPTXUsage.rst.

For more information, refer to the PTX ISA:
https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-discard
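
As a quick usage illustration (a minimal sketch, not part of the patch; the function name is hypothetical), a module calls the new intrinsics as shown below. The size operand must be the immediate constant 128, and lowering requires sm_80 with PTX 7.4 (e.g. llc -mtriple=nvptx64 -mcpu=sm_80 -mattr=+ptx74):

  declare void @llvm.nvvm.discard.global.L2(ptr addrspace(1), i64)
  declare void @llvm.nvvm.discard.L2(ptr, i64)

  define void @discard_both(ptr addrspace(1) %gptr, ptr %p) {
    ; Drop one 128-byte line from L2 through the global-space variant.
    call void @llvm.nvvm.discard.global.L2(ptr addrspace(1) %gptr, i64 128)
    ; Same operation through the generic variant.
    call void @llvm.nvvm.discard.L2(ptr %p, i64 128)
    ret void
  }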

@llvmbot
Member

llvmbot commented Feb 23, 2025

@llvm/pr-subscribers-llvm-ir

@llvm/pr-subscribers-backend-nvptx

Author: Abhilash Majumder (abhilash1910)

Changes

[NVPTX] Add Intrinsics for discard.*
This PR adds intrinsics for all variations of the PTX discard.* instruction.

  • These intrinsics support both generic and global address spaces for all variations.
  • Lowering is handled directly from the NVVM intrinsics to NVPTX instructions via TableGen.
  • Lit tests are added in discard.ll.
  • The generated PTX is verified with a 12.3 ptxas executable.
  • Documentation for these intrinsics is added in NVPTXUsage.rst.

For more information, refer to the PTX ISA:
https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-discard


Full diff: https://github.com/llvm/llvm-project/pull/128404.diff

4 Files Affected:

  • (modified) llvm/docs/NVPTXUsage.rst (+25)
  • (modified) llvm/include/llvm/IR/IntrinsicsNVVM.td (+9)
  • (modified) llvm/lib/Target/NVPTX/NVPTXIntrinsics.td (+11)
  • (added) llvm/test/CodeGen/NVPTX/discard.ll (+35)
diff --git a/llvm/docs/NVPTXUsage.rst b/llvm/docs/NVPTXUsage.rst
index 675b458c41e7b..d905d9b56fe29 100644
--- a/llvm/docs/NVPTXUsage.rst
+++ b/llvm/docs/NVPTXUsage.rst
@@ -630,6 +630,31 @@ uses and eviction priority which can be accessed by the '``.level::eviction_prio
 For more information, refer to the PTX ISA
 `<https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-prefetch-prefetchu>`_.
 
+'``llvm.nvvm.discard.*``'
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+.. code-block:: llvm
+
+  declare void  @llvm.nvvm.discard.global.L2(ptr addrspace(1) %global_ptr, i64 %size)
+  declare void  @llvm.nvvm.discard.L2(ptr %ptr, i64 %size)
+
+Overview:
+"""""""""
+
+The '``@llvm.nvvm.discard.*``' intrinsics invalidate the data at the address range
+[a .. a + (size - 1)] in the cache level specified by the ``.level`` qualifier, without
+writing back the data in the cache to memory. The operand ``size`` is an integer constant
+that specifies the amount of data, in bytes, to be discarded from the cache level
+specified by the ``.level`` qualifier. The only supported value for the ``size`` operand is 128.
+If no state space is specified then generic addressing is used. If the specified address
+does not fall within the address window of the ``.global`` state space, the behavior is undefined.
+
+For more information, refer to the PTX ISA
+`<https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-discard>`_.
+
 '``llvm.nvvm.cp.async.bulk.tensor.g2s.tile.[1-5]d``'
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
diff --git a/llvm/include/llvm/IR/IntrinsicsNVVM.td b/llvm/include/llvm/IR/IntrinsicsNVVM.td
index c32bf0318b5d6..00613eb7d2d17 100644
--- a/llvm/include/llvm/IR/IntrinsicsNVVM.td
+++ b/llvm/include/llvm/IR/IntrinsicsNVVM.td
@@ -5043,6 +5043,15 @@ def int_nvvm_prefetch_global_L2_evict_last: Intrinsic<[], [llvm_global_ptr_ty],
 def int_nvvm_prefetchu_L1 : Intrinsic<[], [llvm_ptr_ty],
   [IntrArgMemOnly, ReadOnly<ArgIndex<0>>, NoCapture<ArgIndex<0>>]>;
 
+// Intrinsics for discard
+def int_nvvm_discard_global_L2 : DefaultAttrsIntrinsic<[], 
+    [llvm_global_ptr_ty, llvm_i64_ty], [IntrArgMemOnly, ReadOnly<ArgIndex<0>>, 
+    NoCapture<ArgIndex<0>>, ImmArg<ArgIndex<1>>]>;
+
+def int_nvvm_discard_L2 : DefaultAttrsIntrinsic<[], 
+    [llvm_ptr_ty, llvm_i64_ty], [IntrArgMemOnly, ReadOnly<ArgIndex<0>>, 
+    NoCapture<ArgIndex<0>>, ImmArg<ArgIndex<1>>]>;
+
 
 // Intrinsics for Bulk Copy using TMA (non-tensor)
 // From Global to Shared Cluster
diff --git a/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td b/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
index acb9fc9867b0f..fd93ce312c9db 100644
--- a/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
+++ b/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
@@ -789,6 +789,17 @@ def PREFETCH_GLOBAL_L2_EVICT_LAST   : NVPTXInst<(outs), (ins Int64Regs:$addr),
 
 def PREFETCHU_L1 : PREFETCH_INTRS<"prefetchu.L1">;
 
+// Discard Intrinsics
+class DISCARD_L2_INTRS<string Addr> :
+          NVPTXInst<(outs), (ins Int64Regs:$addr, Int64Regs:$size),
+          StrJoin<".", ["discard", Addr , "L2"]>.ret # " [$addr], $size;",
+          [(!cast<Intrinsic>(StrJoin<"_", ["int_nvvm_discard", Addr , "L2"]>.ret)
+          i64:$addr, i64:$size)]>,
+          Requires<[hasPTX<74>, hasSM<80>]>;
+
+def DISCARD_L2        : DISCARD_L2_INTRS<"">;
+def DISCARD_GLOBAL_L2 : DISCARD_L2_INTRS<"global">;
+
 //-----------------------------------
 // MBarrier Functions
 //-----------------------------------
diff --git a/llvm/test/CodeGen/NVPTX/discard.ll b/llvm/test/CodeGen/NVPTX/discard.ll
new file mode 100644
index 0000000000000..80217807765d0
--- /dev/null
+++ b/llvm/test/CodeGen/NVPTX/discard.ll
@@ -0,0 +1,35 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_80 -mattr=+ptx74 | FileCheck --check-prefixes=CHECK-PTX64 %s
+; RUN: %if ptxas-11.4 %{ llc < %s -mtriple=nvptx64 -mcpu=sm_80 -mattr=+ptx74 | %ptxas-verify -arch=sm_80 %}
+
+target triple = "nvptx64-nvidia-cuda"
+
+declare void  @llvm.nvvm.discard.global.L2(ptr addrspace(1) %global_ptr, i64 %size)
+declare void  @llvm.nvvm.discard.L2(ptr %ptr, i64 %size)
+
+define void @discard_global_L2(ptr addrspace(1) %global_ptr, i64 %size) {
+; CHECK-PTX64-LABEL: discard_global_L2(
+; CHECK-PTX64:       {
+; CHECK-PTX64-NEXT:    .reg .b64 %rd<2>;
+; CHECK-PTX64-EMPTY:
+; CHECK-PTX64-NEXT:  // %bb.0:
+; CHECK-PTX64-NEXT:    ld.param.u64 %rd1, [discard_global_L2_param_0];
+; CHECK-PTX64-NEXT:    discard.global.L2 [%rd1], 128;
+; CHECK-PTX64-NEXT:    ret;
+  tail call void @llvm.nvvm.discard.global.L2(ptr addrspace(1) %global_ptr, i64 128)
+  ret void
+}
+
+define void @discard_L2(ptr %ptr, i64 %size) {
+; CHECK-PTX64-LABEL: discard_L2(
+; CHECK-PTX64:       {
+; CHECK-PTX64-NEXT:    .reg .b64 %rd<2>;
+; CHECK-PTX64-EMPTY:
+; CHECK-PTX64-NEXT:  // %bb.0:
+; CHECK-PTX64-NEXT:    ld.param.u64 %rd1, [discard_L2_param_0];
+; CHECK-PTX64-NEXT:    discard.L2 [%rd1], 128;
+; CHECK-PTX64-NEXT:    ret;
+  tail call void @llvm.nvvm.discard.L2(ptr %ptr, i64 128)
+  ret void
+}
+
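
To illustrate the generic-addressing note in the documentation hunk above, here is a minimal sketch (the function name and the addrspacecast are illustrative, not taken from the patch): a pointer known to map into the .global window is cast to the generic address space and then discarded through the generic variant. Per the PTX ISA, the behavior is undefined if the generic address does not fall within the .global window.

  declare void @llvm.nvvm.discard.L2(ptr, i64)

  define void @discard_generic_from_global(ptr addrspace(1) %g) {
    ; Cast the global-space pointer into the generic address space first.
    %p = addrspacecast ptr addrspace(1) %g to ptr
    ; Discard one 128-byte L2 line; this lowers to a "discard.L2 [addr], 128;" instruction.
    call void @llvm.nvvm.discard.L2(ptr %p, i64 128)
    ret void
  }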

Contributor

@durga4github durga4github left a comment

LGTM

@durga4github durga4github requested a review from Artem-B February 24, 2025 07:44
Contributor

@gonzalobg gonzalobg left a comment

The documentation is still incorrect for LLVM and needs work.

Contributor

@gonzalobg gonzalobg left a comment

LGTM

@gonzalobg
Contributor

gonzalobg commented Feb 28, 2025 via email

Member

@Artem-B Artem-B left a comment

LGTM modulo documenting 128b alignment requirement.

@abhilash1910
Contributor Author

@durga4github could you please help to merge? Thanks.

@durga4github durga4github merged commit 7c58089 into llvm:main Mar 3, 2025
12 checks passed