[flang][cuda] Add interface and lowering for fence_proxy_async #163014

clementval · 2025-10-11T19:10:20Z

This adds part of the TMA operations defined in the CUDA Fortran Programming Guide: https://docs.nvidia.com/hpc-sdk/compilers/cuda-fortran-prog-guide/#load-and-store-functions-using-bulk-tma-operations

llvmbot · 2025-10-11T19:10:51Z

@llvm/pr-subscribers-flang-fir-hlfir

Author: Valentin Clement (バレンタインクレメン) (clementval)

Full diff: https://github.com/llvm/llvm-project/pull/163014.diff

4 Files Affected:

(modified) flang/include/flang/Optimizer/Builder/IntrinsicCall.h (+1)
(modified) flang/lib/Optimizer/Builder/IntrinsicCall.cpp (+15)
(modified) flang/module/cudadevice.f90 (+5)
(modified) flang/test/Lower/CUDA/cuda-device-proc.cuf (+7)

diff --git a/flang/include/flang/Optimizer/Builder/IntrinsicCall.h b/flang/include/flang/Optimizer/Builder/IntrinsicCall.h
index ca02693c53aeb..d0a96a512c2e7 100644
--- a/flang/include/flang/Optimizer/Builder/IntrinsicCall.h
+++ b/flang/include/flang/Optimizer/Builder/IntrinsicCall.h
@@ -274,6 +274,7 @@ struct IntrinsicLibrary {
                                       llvm::ArrayRef<fir::ExtendedValue>);
   template <Extremum, ExtremumBehavior>
   mlir::Value genExtremum(mlir::Type, llvm::ArrayRef<mlir::Value>);
+  void genFenceProxyAsync(llvm::ArrayRef<fir::ExtendedValue>);
   mlir::Value genFloor(mlir::Type, llvm::ArrayRef<mlir::Value>);
   mlir::Value genFraction(mlir::Type resultType,
                           mlir::ArrayRef<mlir::Value> args);
diff --git a/flang/lib/Optimizer/Builder/IntrinsicCall.cpp b/flang/lib/Optimizer/Builder/IntrinsicCall.cpp
index c9cf6c23a81a5..4890225db452f 100644
--- a/flang/lib/Optimizer/Builder/IntrinsicCall.cpp
+++ b/flang/lib/Optimizer/Builder/IntrinsicCall.cpp
@@ -502,6 +502,10 @@ static constexpr IntrinsicHandler handlers[]{
      &I::genExtendsTypeOf,
      {{{"a", asBox}, {"mold", asBox}}},
      /*isElemental=*/false},
+    {"fence_proxy_async",
+     &I::genFenceProxyAsync,
+     {},
+     /*isElemental=*/false},
     {"findloc",
      &I::genFindloc,
      {{{"array", asBox},
@@ -4354,6 +4358,17 @@ IntrinsicLibrary::genExtendsTypeOf(mlir::Type resultType,
                                      fir::getBase(args[1])));
 }
 
+// FENCE_PROXY_ASYNC (CUDA)
+void IntrinsicLibrary::genFenceProxyAsync(
+    llvm::ArrayRef<fir::ExtendedValue> args) {
+  assert(args.size() == 0);
+  auto kind = mlir::NVVM::ProxyKindAttr::get(
+      builder.getContext(), mlir::NVVM::ProxyKind::async_shared);
+  auto space = mlir::NVVM::SharedSpaceAttr::get(
+      builder.getContext(), mlir::NVVM::SharedSpace::shared_cta);
+  mlir::NVVM::FenceProxyOp::create(builder, loc, kind, space);
+}
+
 // FINDLOC
 fir::ExtendedValue
 IntrinsicLibrary::genFindloc(mlir::Type resultType,
diff --git a/flang/module/cudadevice.f90 b/flang/module/cudadevice.f90
index e6c9e958af365..548298ef854c9 100644
--- a/flang/module/cudadevice.f90
+++ b/flang/module/cudadevice.f90
@@ -2008,6 +2008,11 @@ attributes(device) function barrier_arrive_cnt(barrier, count) result(token)
     end function
   end interface
 
+  interface
+    attributes(device) subroutine fence_proxy_async()
+    end subroutine
+  end interface
+
 contains
 
   attributes(device) subroutine syncthreads()
diff --git a/flang/test/Lower/CUDA/cuda-device-proc.cuf b/flang/test/Lower/CUDA/cuda-device-proc.cuf
index 1bf714010f5d3..378d8ddf65ad9 100644
--- a/flang/test/Lower/CUDA/cuda-device-proc.cuf
+++ b/flang/test/Lower/CUDA/cuda-device-proc.cuf
@@ -419,3 +419,10 @@ end subroutine
 ! CHECK: %[[LLVM_PTR:.*]] = fir.convert %[[DECL_SHARED]]#0 : (!fir.ref<i64>) -> !llvm.ptr
 ! CHECK: %[[SHARED_PTR:.*]] = llvm.addrspacecast %[[LLVM_PTR]] : !llvm.ptr to !llvm.ptr<3>
 ! CHECK: nvvm.mbarrier.arrive.expect_tx %[[SHARED_PTR]], %{{.*}} : !llvm.ptr<3>, i32
+
+attributes(global) subroutine test_fence()
+  call fence_proxy_async()
+end subroutine
+
+! CHECK-LABEL: func.func @_QPtest_fence()
+! CHECK: nvvm.fence.proxy {kind = #nvvm.proxy_kind<async.shared>, space = #nvvm.shared_space<cta>}
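
For readers unfamiliar with the handler table touched above, the following is a minimal standalone sketch of the pattern the diff appears to follow: a table row maps an intrinsic name to a member-function generator, and the generator for a zero-argument subroutine emits exactly one op. The new row sits between the `extends_type_of` and `findloc` entries, which suggests the table is kept in name order; the sketch assumes a sorted table searched with binary search. `ToyLibrary`, `ToyHandler`, `findToyHandler`, and `lowerToyCall` are hypothetical names for illustration only and are not flang's API.

```cpp
// Toy model of a name-sorted intrinsic handler table. Every type and function
// here is hypothetical; none of this is flang's actual API.
#include <algorithm>
#include <cassert>
#include <iterator>
#include <string>
#include <string_view>
#include <vector>

struct ToyLibrary {
  // Stands in for the ops a real builder would create.
  std::vector<std::string> emitted;

  // Same shape as genFenceProxyAsync in the diff: no arguments, no result,
  // exactly one op emitted.
  void genFenceProxyAsync(const std::vector<int> &args) {
    assert(args.empty());
    emitted.push_back("nvvm.fence.proxy");
  }

  // Placeholder neighbour so the table has more than one row.
  void genFindloc(const std::vector<int> &) { emitted.push_back("findloc"); }
};

struct ToyHandler {
  std::string_view name;
  void (ToyLibrary::*generator)(const std::vector<int> &);
  bool isElemental;
};

// Assumption: rows are kept sorted by name so lookup can use binary search.
static constexpr ToyHandler toyHandlers[] = {
    {"fence_proxy_async", &ToyLibrary::genFenceProxyAsync, /*isElemental=*/false},
    {"findloc", &ToyLibrary::genFindloc, /*isElemental=*/false},
};

// Returns the matching row, or nullptr when the name is not in the table.
const ToyHandler *findToyHandler(std::string_view name) {
  const ToyHandler *it = std::lower_bound(
      std::begin(toyHandlers), std::end(toyHandlers), name,
      [](const ToyHandler &h, std::string_view n) { return h.name < n; });
  return (it != std::end(toyHandlers) && it->name == name) ? it : nullptr;
}

// Dispatches a zero-argument call by name; returns false for unknown names.
bool lowerToyCall(ToyLibrary &lib, std::string_view name) {
  const ToyHandler *handler = findToyHandler(name);
  if (!handler)
    return false;
  (lib.*(handler->generator))({});
  return true;
}
```

In this toy model, `lowerToyCall(lib, "fence_proxy_async")` records a single `nvvm.fence.proxy` entry, which corresponds to the single `CHECK` line the new test expects for `test_fence`.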

[flang][cuda] Add interface and lowering for fence_proxy_async

d400fe5

clementval requested a review from wangzpgi October 11, 2025 19:10

llvmbot added the flang and flang:fir-hlfir labels Oct 11, 2025

wangzpgi approved these changes Oct 11, 2025

Merge branch 'main' into cuf_fence_proxy

d87528f

clementval commented Oct 11, 2025

flang/test/Lower/CUDA/cuda-device-proc.cuf (outdated, resolved)

Update flang/test/Lower/CUDA/cuda-device-proc.cuf

070d11e

clementval enabled auto-merge (squash) October 11, 2025 19:31

clementval merged commit 9f06843 into llvm:main Oct 11, 2025
10 checks passed

clementval deleted the cuf_fence_proxy branch October 11, 2025 20:08
