
Commit 522177c

[NVPTX] Add a few more missing fence intrinsics (#166352)
This commit adds the below fence intrinsics:

- llvm.nvvm.fence.acquire.sync_restrict.space.cluster.scope.cluster
- llvm.nvvm.fence.release.sync_restrict.space.cta.scope.cluster
- llvm.nvvm.fence.mbarrier_init.release.cluster
- llvm.nvvm.fence.proxy.async_generic.acquire.sync_restrict.space.cluster.scope.cluster
- llvm.nvvm.fence.proxy.async_generic.release.sync_restrict.space.cta.scope.cluster
- llvm.nvvm.fence.proxy.alias
- llvm.nvvm.fence.proxy.async
- llvm.nvvm.fence.proxy.async.global
- llvm.nvvm.fence.proxy.async.shared_cluster
- llvm.nvvm.fence.proxy.async.shared_cta

For more information, please refer to the [PTX ISA](https://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-membar).
1 parent 9a15556 commit 522177c


9 files changed, +335 -13 lines changed


llvm/docs/NVPTXUsage.rst

Lines changed: 106 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -796,6 +796,112 @@ every time. For more information, refer PTX ISA
796796
Membar/Fences
797797
-------------
798798

799+
'``llvm.nvvm.fence.acquire/release.sync_restrict.*``'
800+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
801+
802+
Syntax:
803+
"""""""
804+
805+
.. code-block:: llvm
806+
807+
declare void @llvm.nvvm.fence.acquire.sync_restrict.space.cluster.scope.cluster()
808+
declare void @llvm.nvvm.fence.release.sync_restrict.space.cta.scope.cluster()
809+
810+
Overview:
811+
"""""""""
812+
813+
The `nvvm.fence.{semantics}.sync_restrict.*` intrinsics restrict the class of memory
814+
operations for which the fence instruction provides the memory ordering guarantees.
815+
When `sync_restrict` is restricted to `shared_cta`, the memory semantics must
816+
be `release` and the effect of the fence operation only applies to operations
817+
performed on objects in `shared_cta` space. Likewise, when `sync_restrict` is
818+
restricted to `shared_cluster`, then memory semantics must be `acquire` and the
819+
effect of the fence operation only applies to operations performed on objects in
820+
`shared_cluster` memory space. The scope for both operations is `cluster`. For more details,
821+
please refer to the `PTX ISA <https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parallel-synchronization-and-communication-instructions-membar>`__.
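
As an illustrative sketch only, the two fences could be emitted as shown below. The function names ``@publish`` and ``@observe`` are hypothetical, and the signalling that would pair the two sides is deliberately left out. The PTX quoted in the comments is what the tests added in this commit expect with ``-mcpu=sm_90 -mattr=+ptx86``.

.. code-block:: llvm

    declare void @llvm.nvvm.fence.release.sync_restrict.space.cta.scope.cluster()
    declare void @llvm.nvvm.fence.acquire.sync_restrict.space.cluster.scope.cluster()

    ; Hypothetical releasing side: the restriction covers objects in shared_cta.
    define void @publish() {
      ; Expected PTX: fence.release.sync_restrict::shared::cta.cluster;
      call void @llvm.nvvm.fence.release.sync_restrict.space.cta.scope.cluster()
      ret void
    }

    ; Hypothetical acquiring side: the restriction covers objects in shared_cluster.
    define void @observe() {
      ; Expected PTX: fence.acquire.sync_restrict::shared::cluster.cluster;
      call void @llvm.nvvm.fence.acquire.sync_restrict.space.cluster.scope.cluster()
      ret void
    }

Note that the intrinsic name encodes the only legal pairing: ``release`` goes with ``space.cta`` and ``acquire`` goes with ``space.cluster``.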
822+
823+
'``llvm.nvvm.fence.mbarrier_init.release.cluster``'
824+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
825+
826+
Syntax:
827+
"""""""
828+
829+
.. code-block:: llvm
830+
831+
declare void @llvm.nvvm.fence.mbarrier_init.release.cluster()
832+
833+
Overview:
834+
"""""""""
835+
836+
The `nvvm.fence.mbarrier_init.release.cluster` intrinsic restricts the class of
837+
memory operations for which the fence instruction provides the memory ordering
838+
guarantees. The `mbarrier_init` modifier restricts the synchronizing effect to
839+
the prior `mbarrier_init` operation executed by the same thread on mbarrier objects
840+
in `shared_cta` memory space. For more details, please refer to the `PTX ISA <https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parallel-synchronization-and-communication-instructions-membar>`__.
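
A minimal sketch of the intended ordering, assuming the existing ``llvm.nvvm.mbarrier.init.shared`` intrinsic is available in the LLVM version in use. The function name ``@init_barrier`` and its parameters are hypothetical; the cluster-wide synchronization that would normally follow is not shown.

.. code-block:: llvm

    declare void @llvm.nvvm.mbarrier.init.shared(ptr addrspace(3), i32)
    declare void @llvm.nvvm.fence.mbarrier_init.release.cluster()

    ; Hypothetical helper: initialize an mbarrier object in this CTA's shared
    ; memory, then issue the fence that covers the prior initialization.
    define void @init_barrier(ptr addrspace(3) %mbar, i32 %count) {
      call void @llvm.nvvm.mbarrier.init.shared(ptr addrspace(3) %mbar, i32 %count)
      ; Expected PTX: fence.mbarrier_init.release.cluster;
      call void @llvm.nvvm.fence.mbarrier_init.release.cluster()
      ret void
    }

According to the instruction definition in this commit, the fence is selected only for PTX 8.0 or later on ``sm_90`` or later.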
841+
842+
'``llvm.nvvm.fence.proxy.async_generic.acquire/release.sync_restrict``'
843+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
844+
845+
Syntax:
846+
"""""""
847+
848+
.. code-block:: llvm
849+
850+
declare void @llvm.nvvm.fence.proxy.async_generic.acquire.sync_restrict.space.cluster.scope.cluster()
851+
declare void @llvm.nvvm.fence.proxy.async_generic.release.sync_restrict.space.cta.scope.cluster()
852+
853+
Overview:
854+
"""""""""
855+
856+
The `nvvm.fence.proxy.async_generic.{semantics}.sync_restrict` intrinsics are used to establish
857+
ordering between a prior memory access performed via the `async proxy <https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#proxies>`__
858+
and a subsequent memory access performed via the generic proxy.
859+
``nvvm.fence.proxy.async_generic.release.sync_restrict`` can form a release
860+
sequence that synchronizes with an acquire sequence that contains the
861+
``nvvm.fence.proxy.async_generic.acquire.sync_restrict`` proxy fence. When
862+
`sync_restrict` is restricted to `shared_cta`, the memory semantics must
863+
be `release` and the effect of the fence operation only applies to operations
864+
performed on objects in `shared_cta` space. Likewise, when `sync_restrict` is
865+
restricted to `shared_cluster`, then memory semantics must be `acquire` and the
866+
effect of the fence operation only applies to operations performed on objects in
867+
`shared_cluster` memory space. The scope for both operations is `cluster`.
868+
For more details, please refer to the `PTX ISA <https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parallel-synchronization-and-communication-instructions-membar>`__.
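
The following sketch shows how the two uni-directional proxy fences might appear in IR. The function names are hypothetical and the asynchronous operations they would order are omitted. The intrinsic names use ``async_generic``, matching the definitions in ``IntrinsicsNVVM.td``, and the PTX in the comments is taken from the tests added in this commit.

.. code-block:: llvm

    declare void @llvm.nvvm.fence.proxy.async_generic.release.sync_restrict.space.cta.scope.cluster()
    declare void @llvm.nvvm.fence.proxy.async_generic.acquire.sync_restrict.space.cluster.scope.cluster()

    ; Hypothetical releasing side of the proxy fence pair.
    define void @proxy_release() {
      ; Expected PTX: fence.proxy.async::generic.release.sync_restrict::shared::cta.cluster;
      call void @llvm.nvvm.fence.proxy.async_generic.release.sync_restrict.space.cta.scope.cluster()
      ret void
    }

    ; Hypothetical acquiring side of the proxy fence pair.
    define void @proxy_acquire() {
      ; Expected PTX: fence.proxy.async::generic.acquire.sync_restrict::shared::cluster.cluster;
      call void @llvm.nvvm.fence.proxy.async_generic.acquire.sync_restrict.space.cluster.scope.cluster()
      ret void
    }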
869+
870+
'``llvm.nvvm.fence.proxy.<proxykind>``'
871+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
872+
873+
Syntax:
874+
"""""""
875+
876+
.. code-block:: llvm
877+
878+
declare void @llvm.nvvm.fence.proxy.alias()
879+
declare void @llvm.nvvm.fence.proxy.async()
880+
declare void @llvm.nvvm.fence.proxy.async.global()
881+
declare void @llvm.nvvm.fence.proxy.async.shared_cluster()
882+
declare void @llvm.nvvm.fence.proxy.async.shared_cta()
883+
884+
Overview:
885+
"""""""""
886+
887+
`nvvm.fence.proxy.{proxykind}` intrinsics represent a fence that establishes a
888+
bi-directional proxy ordering between memory accesses performed via the
889+
`generic proxy <https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#proxies>`__
890+
and the proxy specified by `proxykind`. A `bi-directional proxy` ordering between
891+
two proxykinds establishes two `uni-directional` proxy orderings: one from the
892+
first proxykind to the second proxykind and the other from the second proxykind
893+
to the first proxykind.
894+
895+
`alias` proxykind refers to memory accesses performed using virtually aliased
896+
addresses to the same memory location.
897+
898+
`async` proxykind specifies that the memory ordering is established between the
899+
`async proxy` and the `generic proxy`. The memory ordering is limited only to
900+
operations performed on objects in the state space specified (`generic`, `global`,
901+
`shared_cluster`, `shared_cta`). If no state space is specified, then the memory
902+
ordering applies to all state spaces. For more details, please refer to the
903+
`PTX ISA <https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parallel-synchronization-and-communication-instructions-membar>`__
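
As a hedged illustration, a bi-directional proxy fence could be placed between a generic-proxy write and a later async-proxy operation on the same shared memory. The function name ``@stage_for_async`` and its parameters are hypothetical, and the async-proxy operation itself is not shown.

.. code-block:: llvm

    declare void @llvm.nvvm.fence.proxy.async.shared_cta()

    ; Hypothetical helper: write to shared memory through the generic proxy,
    ; then fence so the write is ordered with respect to the async proxy.
    define void @stage_for_async(ptr addrspace(3) %smem, i32 %val) {
      store i32 %val, ptr addrspace(3) %smem
      ; Expected PTX: fence.proxy.async.shared::cta;
      call void @llvm.nvvm.fence.proxy.async.shared_cta()
      ; ... an operation performed via the async proxy would follow here ...
      ret void
    }

Per the instruction definitions in this commit, the ``alias`` variant is selected for PTX 7.5 or later on ``sm_70`` or later, while the ``async`` variants require PTX 8.0 or later on ``sm_90`` or later.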
904+
799905
'``llvm.nvvm.fence.proxy.tensormap_generic.*``'
800906
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
801907

llvm/include/llvm/IR/IntrinsicsNVVM.td

Lines changed: 45 additions & 13 deletions
@@ -1746,33 +1746,65 @@ let TargetPrefix = "nvvm" in {
17461746
def int_nvvm_barrier_cluster_wait_aligned : Intrinsic<[]>;
17471747
}
17481748

1749-
//
1750-
// Membar
1751-
//
1752-
let IntrProperties = [IntrNoCallback] in {
1749+
//
1750+
// Membar / Fence
1751+
//
1752+
let IntrProperties = [IntrNoCallback] in {
17531753
def int_nvvm_membar_cta : NVVMBuiltin, Intrinsic<[]>;
17541754
def int_nvvm_membar_gl : NVVMBuiltin, Intrinsic<[]>;
17551755
def int_nvvm_membar_sys : NVVMBuiltin, Intrinsic<[]>;
17561756
def int_nvvm_fence_sc_cluster : Intrinsic<[]>;
1757-
}
17581757

1759-
//
1760-
// Proxy fence (uni-directional)
1761-
//
1758+
// Operation fence
1759+
def int_nvvm_fence_mbarrier_init_release_cluster: Intrinsic<[], [], [],
1760+
"llvm.nvvm.fence.mbarrier_init.release.cluster">;
1761+
1762+
// Thread fence
1763+
def int_nvvm_fence_acquire_sync_restrict_space_cluster_scope_cluster :
1764+
Intrinsic<[], [], [],
1765+
"llvm.nvvm.fence.acquire.sync_restrict.space.cluster.scope.cluster">;
1766+
1767+
def int_nvvm_fence_release_sync_restrict_space_cta_scope_cluster :
1768+
Intrinsic<[], [], [],
1769+
"llvm.nvvm.fence.release.sync_restrict.space.cta.scope.cluster">;
1770+
1771+
//
1772+
// Proxy fence (uni-directional)
1773+
//
1774+
1775+
def int_nvvm_fence_proxy_async_generic_acquire_sync_restrict_space_cluster_scope_cluster :
1776+
Intrinsic<[], [], [],
1777+
"llvm.nvvm.fence.proxy.async_generic.acquire.sync_restrict.space.cluster.scope.cluster">;
1778+
1779+
def int_nvvm_fence_proxy_async_generic_release_sync_restrict_space_cta_scope_cluster :
1780+
Intrinsic<[], [], [],
1781+
"llvm.nvvm.fence.proxy.async_generic.release.sync_restrict.space.cta.scope.cluster">;
1782+
17621783
foreach scope = ["cta", "cluster", "gpu", "sys"] in {
17631784

17641785
def int_nvvm_fence_proxy_tensormap_generic_release_ # scope :
1765-
Intrinsic<[], [], [IntrNoCallback],
1786+
Intrinsic<[], [], [],
17661787
"llvm.nvvm.fence.proxy.tensormap_generic.release." # scope>;
17671788

17681789
// The imm-arg 'size' can only be 128.
17691790
def int_nvvm_fence_proxy_tensormap_generic_acquire_ # scope :
1770-
Intrinsic<[], [llvm_ptr_ty, llvm_i32_ty],
1771-
[IntrNoCallback, IntrArgMemOnly, ImmArg<ArgIndex<1>>,
1772-
Range<ArgIndex<1>, 128, 129>],
1773-
"llvm.nvvm.fence.proxy.tensormap_generic.acquire." # scope>;
1791+
Intrinsic<[], [llvm_ptr_ty, llvm_i32_ty], [],
1792+
"llvm.nvvm.fence.proxy.tensormap_generic.acquire." # scope> {
1793+
let IntrProperties = [IntrNoCallback, IntrArgMemOnly,
1794+
ImmArg<ArgIndex<1>>, Range<ArgIndex<1>, 128, 129>];
1795+
}
17741796
}
17751797

1798+
//
1799+
// Proxy fence (bi-directional)
1800+
//
1801+
foreach proxykind = ["alias", "async", "async.global", "async.shared_cta",
1802+
"async.shared_cluster"] in {
1803+
defvar Intr = IntrinsicName<"llvm.nvvm.fence.proxy." # proxykind>;
1804+
def Intr.record_name: Intrinsic<[], [], [], Intr.intr_name>;
1805+
}
1806+
}
1807+
17761808
//
17771809
// Async Copy
17781810
//

llvm/lib/Target/NVPTX/NVPTXIntrinsics.td

Lines changed: 35 additions & 0 deletions
@@ -364,7 +364,42 @@ def INT_FENCE_SC_CLUSTER:
364364
NullaryInst<"fence.sc.cluster", int_nvvm_fence_sc_cluster>,
365365
Requires<[hasPTX<78>, hasSM<90>]>;
366366

367+
def INT_FENCE_MBARRIER_INIT_RELEASE_CLUSTER:
368+
NullaryInst<"fence.mbarrier_init.release.cluster",
369+
int_nvvm_fence_mbarrier_init_release_cluster>,
370+
Requires<[hasPTX<80>, hasSM<90>]>;
371+
372+
let Predicates = [hasPTX<86>, hasSM<90>] in {
373+
def INT_FENCE_ACQUIRE_SYNC_RESTRICT_CLUSTER_CLUSTER:
374+
NullaryInst<"fence.acquire.sync_restrict::shared::cluster.cluster",
375+
int_nvvm_fence_acquire_sync_restrict_space_cluster_scope_cluster>;
376+
377+
def INT_FENCE_RELEASE_SYNC_RESTRICT_CTA_CLUSTER:
378+
NullaryInst<"fence.release.sync_restrict::shared::cta.cluster",
379+
int_nvvm_fence_release_sync_restrict_space_cta_scope_cluster>;
380+
}
381+
367382
// Proxy fence (uni-directional)
383+
let Predicates = [hasPTX<86>, hasSM<90>] in {
384+
def INT_NVVM_FENCE_PROXY_ASYNC_GENERIC_ACQUIRE_SYNC_RESTRICT_SPACE_CLUSTER_SCOPE_CLUSTER:
385+
NullaryInst<"fence.proxy.async::generic.acquire.sync_restrict::shared::cluster.cluster",
386+
int_nvvm_fence_proxy_async_generic_acquire_sync_restrict_space_cluster_scope_cluster>;
387+
388+
def INT_NVVM_FENCE_PROXY_ASYNC_GENERIC_RELEASE_SYNC_RESTRICT_SPACE_CTA_SCOPE_CLUSTER:
389+
NullaryInst<"fence.proxy.async::generic.release.sync_restrict::shared::cta.cluster",
390+
int_nvvm_fence_proxy_async_generic_release_sync_restrict_space_cta_scope_cluster>;
391+
}
392+
393+
// Proxy fence (bi-directional)
394+
foreach proxykind = ["alias", "async", "async.global", "async.shared_cta",
395+
"async.shared_cluster"] in {
396+
defvar Preds = !if(!eq(proxykind, "alias"), [hasPTX<75>, hasSM<70>],
397+
[hasPTX<80>, hasSM<90>]);
398+
defvar Intr = IntrinsicName<"llvm.nvvm.fence.proxy." # proxykind>;
399+
def : NullaryInst<"fence.proxy." # !subst("_", "::", proxykind),
400+
!cast<Intrinsic>(Intr.record_name)>, Requires<Preds>;
401+
}
402+
368403
class FENCE_PROXY_TENSORMAP_GENERIC_RELEASE<string Scope, Intrinsic Intr> :
369404
NullaryInst<"fence.proxy.tensormap::generic.release." # Scope, Intr>,
370405
Requires<[hasPTX<83>, hasSM<90>]>;
Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
1+
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
2+
; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx86 | FileCheck --check-prefixes=CHECK %s
3+
; RUN: %if ptxas-sm_90 && ptxas-isa-8.6 %{ llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx86 | %ptxas-verify -arch=sm_90 %}
4+
5+
define void @test_nvvm_fence_proxy_async_generic_acquire_sync_restrict() {
6+
; CHECK-LABEL: test_nvvm_fence_proxy_async_generic_acquire_sync_restrict(
7+
; CHECK: {
8+
; CHECK-EMPTY:
9+
; CHECK-EMPTY:
10+
; CHECK-NEXT: // %bb.0:
11+
; CHECK-NEXT: fence.proxy.async::generic.acquire.sync_restrict::shared::cluster.cluster;
12+
; CHECK-NEXT: ret;
13+
call void @llvm.nvvm.fence.proxy.async_generic.acquire.sync_restrict.space.cluster.scope.cluster()
14+
ret void
15+
}
16+
17+
define void @test_nvvm_fence_proxy_async_generic_release_sync_restrict() {
18+
; CHECK-LABEL: test_nvvm_fence_proxy_async_generic_release_sync_restrict(
19+
; CHECK: {
20+
; CHECK-EMPTY:
21+
; CHECK-EMPTY:
22+
; CHECK-NEXT: // %bb.0:
23+
; CHECK-NEXT: fence.proxy.async::generic.release.sync_restrict::shared::cta.cluster;
24+
; CHECK-NEXT: ret;
25+
call void @llvm.nvvm.fence.proxy.async_generic.release.sync_restrict.space.cta.scope.cluster()
26+
ret void
27+
}
Lines changed: 51 additions & 0 deletions
@@ -0,0 +1,51 @@
1+
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
2+
; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx80 | FileCheck --check-prefixes=CHECK %s
3+
; RUN: %if ptxas-sm_90 && ptxas-isa-8.0 %{ llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx80 | %ptxas-verify -arch=sm_90 %}
4+
5+
define void @test_nvvm_fence_proxy_async() {
6+
; CHECK-LABEL: test_nvvm_fence_proxy_async(
7+
; CHECK: {
8+
; CHECK-EMPTY:
9+
; CHECK-EMPTY:
10+
; CHECK-NEXT: // %bb.0:
11+
; CHECK-NEXT: fence.proxy.async;
12+
; CHECK-NEXT: ret;
13+
call void @llvm.nvvm.fence.proxy.async()
14+
ret void
15+
}
16+
17+
define void @test_nvvm_fence_proxy_async_global() {
18+
; CHECK-LABEL: test_nvvm_fence_proxy_async_global(
19+
; CHECK: {
20+
; CHECK-EMPTY:
21+
; CHECK-EMPTY:
22+
; CHECK-NEXT: // %bb.0:
23+
; CHECK-NEXT: fence.proxy.async.global;
24+
; CHECK-NEXT: ret;
25+
call void @llvm.nvvm.fence.proxy.async.global()
26+
ret void
27+
}
28+
29+
define void @test_nvvm_fence_proxy_async_shared_cluster() {
30+
; CHECK-LABEL: test_nvvm_fence_proxy_async_shared_cluster(
31+
; CHECK: {
32+
; CHECK-EMPTY:
33+
; CHECK-EMPTY:
34+
; CHECK-NEXT: // %bb.0:
35+
; CHECK-NEXT: fence.proxy.async.shared::cluster;
36+
; CHECK-NEXT: ret;
37+
call void @llvm.nvvm.fence.proxy.async.shared_cluster()
38+
ret void
39+
}
40+
41+
define void @test_nvvm_fence_proxy_async_shared_cta() {
42+
; CHECK-LABEL: test_nvvm_fence_proxy_async_shared_cta(
43+
; CHECK: {
44+
; CHECK-EMPTY:
45+
; CHECK-EMPTY:
46+
; CHECK-NEXT: // %bb.0:
47+
; CHECK-NEXT: fence.proxy.async.shared::cta;
48+
; CHECK-NEXT: ret;
49+
call void @llvm.nvvm.fence.proxy.async.shared_cta()
50+
ret void
51+
}
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
1+
; RUN: not llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx83 -o /dev/null 2>&1 | FileCheck %s
2+
3+
define void @test_fence_proxy_tensormap_generic_acquire(ptr addrspace(0) %addr) {
4+
; CHECK: immarg value 130 out of range [128, 129)
5+
call void @llvm.nvvm.fence.proxy.tensormap_generic.acquire.cta(ptr addrspace(0) %addr, i32 130);
6+
7+
ret void
8+
}
Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
1+
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
2+
; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_70 -mattr=+ptx75 | FileCheck --check-prefixes=CHECK %s
3+
; RUN: %if ptxas-sm_70 && ptxas-isa-7.5 %{ llc < %s -mtriple=nvptx64 -mcpu=sm_70 -mattr=+ptx75 | %ptxas-verify -arch=sm_70 %}
4+
5+
define void @test_nvvm_fence_proxy_alias() {
6+
; CHECK-LABEL: test_nvvm_fence_proxy_alias(
7+
; CHECK: {
8+
; CHECK-EMPTY:
9+
; CHECK-EMPTY:
10+
; CHECK-NEXT: // %bb.0:
11+
; CHECK-NEXT: fence.proxy.alias;
12+
; CHECK-NEXT: ret;
13+
call void @llvm.nvvm.fence.proxy.alias()
14+
ret void
15+
}
Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
1+
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
2+
; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx80 | FileCheck --check-prefixes=CHECK %s
3+
; RUN: %if ptxas-sm_90 && ptxas-isa-8.0 %{ llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx80 | %ptxas-verify -arch=sm_90 %}
4+
5+
; CHECK-LABEL: test_fence_mbarrier_init
6+
define void @test_fence_mbarrier_init() {
7+
; CHECK-LABEL: test_fence_mbarrier_init(
8+
; CHECK: {
9+
; CHECK-EMPTY:
10+
; CHECK-EMPTY:
11+
; CHECK-NEXT: // %bb.0:
12+
; CHECK-NEXT: fence.mbarrier_init.release.cluster;
13+
; CHECK-NEXT: ret;
14+
call void @llvm.nvvm.fence.mbarrier_init.release.cluster();
15+
16+
ret void
17+
}
Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
1+
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
2+
; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx86 | FileCheck --check-prefixes=CHECK %s
3+
; RUN: %if ptxas-sm_90 && ptxas-isa-8.6 %{ llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx86 | %ptxas-verify -arch=sm_90 %}
4+
5+
; CHECK-LABEL: test_fence_acquire
6+
define void @test_fence_acquire() {
7+
; CHECK-LABEL: test_fence_acquire(
8+
; CHECK: {
9+
; CHECK-EMPTY:
10+
; CHECK-EMPTY:
11+
; CHECK-NEXT: // %bb.0:
12+
; CHECK-NEXT: fence.acquire.sync_restrict::shared::cluster.cluster;
13+
; CHECK-NEXT: ret;
14+
call void @llvm.nvvm.fence.acquire.sync_restrict.space.cluster.scope.cluster();
15+
16+
ret void
17+
}
18+
19+
; CHECK-LABEL: test_fence_release
20+
define void @test_fence_release() {
21+
; CHECK-LABEL: test_fence_release(
22+
; CHECK: {
23+
; CHECK-EMPTY:
24+
; CHECK-EMPTY:
25+
; CHECK-NEXT: // %bb.0:
26+
; CHECK-NEXT: fence.release.sync_restrict::shared::cta.cluster;
27+
; CHECK-NEXT: ret;
28+
call void @llvm.nvvm.fence.release.sync_restrict.space.cta.scope.cluster();
29+
30+
ret void
31+
}
