
Conversation

@ergawy (Member) commented Oct 23, 2025

Temps needed for the allocatable reduction/privatization init regions are currently always allocated on the heap. However, this is a performance killer on GPUs, since malloc calls are prohibitively expensive. Therefore, we should do these allocations on the stack for GPU reductions.

This is similar to what we already do for arrays. Additionally, I am working on getting reductions-by-ref to work on GPUs, which is a bit of a challenge given the many steps involved (e.g. intra-warp and inter-warp reductions, shuffling data from remote lanes, ...). This patch is a prerequisite step for that work.
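For context, a minimal sketch of the kind of construct affected (illustrative only; the subroutine and variable names are made up and not taken from the test updated below):

```fortran
! Illustrative sketch only: a privatized allocatable scalar inside a target
! region. The privatizer's init region needs a temporary for the private copy;
! with this change, that temporary becomes a stack allocation when targeting GPUs.
subroutine illustrate_private_alloc
  implicit none
  integer, allocatable :: alloc_var

  allocate(alloc_var)
  !$omp target private(alloc_var)
    alloc_var = 10
  !$omp end target
  deallocate(alloc_var)
end subroutine illustrate_private_alloc
```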

@llvmbot added the flang (Flang issues not falling into any other category), flang:fir-hlfir, and flang:openmp labels on Oct 23, 2025
@llvmbot (Member) commented Oct 23, 2025

@llvm/pr-subscribers-flang-fir-hlfir

@llvm/pr-subscribers-flang-openmp

Author: Kareem Ergawy (ergawy)

Changes

Temps needed for the allocatable reduction/privatization init regions are currently always allocated on the heap. However, this is a performance killer on GPUs, since malloc calls are prohibitively expensive. Therefore, we should do these allocations on the stack for GPU reductions.


Full diff: https://github.com/llvm/llvm-project/pull/164761.diff

2 Files Affected:

  • (modified) flang/lib/Lower/Support/PrivateReductionUtils.cpp (+22-13)
  • (modified) flang/test/Lower/OpenMP/DelayedPrivatization/target-private-allocatable.f90 (+66-41)
diff --git a/flang/lib/Lower/Support/PrivateReductionUtils.cpp b/flang/lib/Lower/Support/PrivateReductionUtils.cpp
index d433ce367d259..c6c428860bca1 100644
--- a/flang/lib/Lower/Support/PrivateReductionUtils.cpp
+++ b/flang/lib/Lower/Support/PrivateReductionUtils.cpp
@@ -376,6 +376,8 @@ class PopulateInitAndCleanupRegionsHelper {
     loadedMoldArg = builder.loadIfRef(loc, moldArg);
     return loadedMoldArg;
   }
+
+  bool shouldAllocateTempOnStack() const;
 };
 
 } // namespace
@@ -438,8 +440,14 @@ void PopulateInitAndCleanupRegionsHelper::initAndCleanupBoxedScalar(
     builder.setInsertionPointToStart(&ifUnallocated.getElseRegion().front());
   }
 
-  mlir::Value valAlloc = builder.createHeapTemporary(loc, innerTy, /*name=*/{},
-                                                     /*shape=*/{}, lenParams);
+  bool shouldAllocateOnStack = shouldAllocateTempOnStack();
+  mlir::Value valAlloc =
+      (shouldAllocateOnStack)
+          ? builder.createTemporary(loc, innerTy, /*name=*/{},
+                                    /*shape=*/{}, lenParams)
+          : builder.createHeapTemporary(loc, innerTy, /*name=*/{},
+                                        /*shape=*/{}, lenParams);
+
   if (scalarInitValue)
     builder.createStoreWithConvert(loc, scalarInitValue, valAlloc);
   mlir::Value box = fir::EmboxOp::create(builder, loc, valType, valAlloc,
@@ -451,8 +459,9 @@ void PopulateInitAndCleanupRegionsHelper::initAndCleanupBoxedScalar(
   fir::StoreOp lastOp =
       fir::StoreOp::create(builder, loc, box, allocatedPrivVarArg);
 
-  createCleanupRegion(converter, loc, argType, cleanupRegion, sym,
-                      isDoConcurrent);
+  if (!shouldAllocateOnStack)
+    createCleanupRegion(converter, loc, argType, cleanupRegion, sym,
+                        isDoConcurrent);
 
   if (ifUnallocated)
     builder.setInsertionPointAfter(ifUnallocated);
@@ -462,6 +471,14 @@ void PopulateInitAndCleanupRegionsHelper::initAndCleanupBoxedScalar(
   createYield(allocatedPrivVarArg);
 }
 
+bool PopulateInitAndCleanupRegionsHelper::shouldAllocateTempOnStack() const {
+  // On the GPU, always allocate on the stack since heap allocations are very
+  // expensive.
+  auto offloadMod =
+      llvm::dyn_cast<mlir::omp::OffloadModuleInterface>(*builder.getModule());
+  return offloadMod && offloadMod.getIsGPU();
+}
+
 void PopulateInitAndCleanupRegionsHelper::initAndCleanupBoxedArray(
     fir::BaseBoxType boxTy, bool needsInitialization) {
   bool isAllocatableOrPointer =
@@ -504,15 +521,7 @@ void PopulateInitAndCleanupRegionsHelper::initAndCleanupBoxedArray(
   // Allocating on the heap in case the whole reduction/privatization is nested
   // inside of a loop
   auto temp = [&]() {
-    bool shouldAllocateOnStack = false;
-
-    // On the GPU, always allocate on the stack since heap allocatins are very
-    // expensive.
-    if (auto offloadMod = llvm::dyn_cast<mlir::omp::OffloadModuleInterface>(
-            *builder.getModule()))
-      shouldAllocateOnStack = offloadMod.getIsGPU();
-
-    if (shouldAllocateOnStack)
+    if (shouldAllocateTempOnStack())
       return createStackTempFromMold(loc, builder, source);
 
     auto [temp, needsDealloc] = createTempFromMold(loc, builder, source);
diff --git a/flang/test/Lower/OpenMP/DelayedPrivatization/target-private-allocatable.f90 b/flang/test/Lower/OpenMP/DelayedPrivatization/target-private-allocatable.f90
index 3d93fbc6e446e..272f34fc0fd1a 100644
--- a/flang/test/Lower/OpenMP/DelayedPrivatization/target-private-allocatable.f90
+++ b/flang/test/Lower/OpenMP/DelayedPrivatization/target-private-allocatable.f90
@@ -1,9 +1,22 @@
 ! Tests delayed privatization for `targets ... private(..)` for allocatables.
 
 ! RUN: %flang_fc1 -emit-hlfir -fopenmp -mmlir --enable-delayed-privatization-staging \
-! RUN:   -o - %s 2>&1 | FileCheck %s
+! RUN:   -o - %s 2>&1 | FileCheck %s --check-prefix=CPU
+
 ! RUN: bbc -emit-hlfir -fopenmp --enable-delayed-privatization-staging -o - %s 2>&1 \
-! RUN:   | FileCheck %s
+! RUN:   | FileCheck %s --check-prefix=CPU
+
+! RUN: %if amdgpu-registered-target %{ \
+! RUN:   %flang_fc1 -triple amdgcn-amd-amdhsa -emit-hlfir  \
+! RUN:     -fopenmp -fopenmp-is-target-device \
+! RUN:     -mmlir --enable-delayed-privatization-staging \
+! RUN:     -o - %s 2>&1 | \
+! RUN:   FileCheck %s --check-prefix=GPU  \
+! RUN: %}
+
+! RUN: bbc -emit-hlfir -fopenmp --enable-delayed-privatization-staging \
+! RUN:    -fopenmp-is-target-device -fopenmp-is-gpu -o - %s 2>&1 \
+! RUN:   | FileCheck %s --check-prefix=GPU
 
 subroutine target_allocatable
   implicit none
@@ -14,53 +27,65 @@ subroutine target_allocatable
   !$omp end target
 end subroutine target_allocatable
 
-! CHECK-LABEL: omp.private {type = private}
-! CHECK-SAME:    @[[VAR_PRIVATIZER_SYM:.*]] :
-! CHECK-SAME:      [[DESC_TYPE:!fir.box<!fir.heap<i32>>]] init {
-! CHECK:  ^bb0(%[[PRIV_ARG:.*]]: [[TYPE:!fir.ref<!fir.box<!fir.heap<i32>>>]], %[[PRIV_ALLOC:.*]]: [[TYPE]]):
+! CPU-LABEL: omp.private {type = private}
+! CPU-SAME:    @[[VAR_PRIVATIZER_SYM:.*]] :
+! CPU-SAME:      [[DESC_TYPE:!fir.box<!fir.heap<i32>>]] init {
+! CPU:  ^bb0(%[[PRIV_ARG:.*]]: [[TYPE:!fir.ref<!fir.box<!fir.heap<i32>>>]], %[[PRIV_ALLOC:.*]]: [[TYPE]]):
+
+! CPU-NEXT:   %[[PRIV_ARG_VAL:.*]] = fir.load %[[PRIV_ARG]] : [[TYPE]]
+! CPU-NEXT:   %[[PRIV_ARG_BOX:.*]] = fir.box_addr %[[PRIV_ARG_VAL]] : ([[DESC_TYPE]]) -> !fir.heap<i32>
+! CPU-NEXT:   %[[PRIV_ARG_ADDR:.*]] = fir.convert %[[PRIV_ARG_BOX]] : (!fir.heap<i32>) -> i64
+! CPU-NEXT:   %[[C0:.*]] = arith.constant 0 : i64
+! CPU-NEXT:   %[[ALLOC_COND:.*]] = arith.cmpi eq, %[[PRIV_ARG_ADDR]], %[[C0]] : i64
 
-! CHECK-NEXT:   %[[PRIV_ARG_VAL:.*]] = fir.load %[[PRIV_ARG]] : [[TYPE]]
-! CHECK-NEXT:   %[[PRIV_ARG_BOX:.*]] = fir.box_addr %[[PRIV_ARG_VAL]] : ([[DESC_TYPE]]) -> !fir.heap<i32>
-! CHECK-NEXT:   %[[PRIV_ARG_ADDR:.*]] = fir.convert %[[PRIV_ARG_BOX]] : (!fir.heap<i32>) -> i64
-! CHECK-NEXT:   %[[C0:.*]] = arith.constant 0 : i64
-! CHECK-NEXT:   %[[ALLOC_COND:.*]] = arith.cmpi eq, %[[PRIV_ARG_ADDR]], %[[C0]] : i64
+! CPU-NEXT:   fir.if %[[ALLOC_COND]] {
+! CPU-NEXT:     %[[ZERO_BOX:.*]] = fir.embox %[[PRIV_ARG_BOX]] : (!fir.heap<i32>) -> [[DESC_TYPE]]
+! CPU-NEXT:     fir.store %[[ZERO_BOX]] to %[[PRIV_ALLOC]] : [[TYPE]]
+! CPU-NEXT:   } else {
+! CPU-NEXT:     %[[PRIV_ALLOCMEM:.*]] = fir.allocmem i32
+! CPU-NEXT:     %[[PRIV_ALLOCMEM_BOX:.*]] = fir.embox %[[PRIV_ALLOCMEM]] : (!fir.heap<i32>) -> [[DESC_TYPE]]
+! CPU-NEXT:     fir.store %[[PRIV_ALLOCMEM_BOX]] to %[[PRIV_ALLOC]] : [[TYPE]]
+! CPU-NEXT:   }
 
-! CHECK-NEXT:   fir.if %[[ALLOC_COND]] {
-! CHECK-NEXT:     %[[ZERO_BOX:.*]] = fir.embox %[[PRIV_ARG_BOX]] : (!fir.heap<i32>) -> [[DESC_TYPE]]
-! CHECK-NEXT:     fir.store %[[ZERO_BOX]] to %[[PRIV_ALLOC]] : [[TYPE]]
-! CHECK-NEXT:   } else {
-! CHECK-NEXT:     %[[PRIV_ALLOCMEM:.*]] = fir.allocmem i32
-! CHECK-NEXT:     %[[PRIV_ALLOCMEM_BOX:.*]] = fir.embox %[[PRIV_ALLOCMEM]] : (!fir.heap<i32>) -> [[DESC_TYPE]]
-! CHECK-NEXT:     fir.store %[[PRIV_ALLOCMEM_BOX]] to %[[PRIV_ALLOC]] : [[TYPE]]
-! CHECK-NEXT:   }
+! CPU-NEXT:   omp.yield(%[[PRIV_ALLOC]] : [[TYPE]])
 
-! CHECK-NEXT:   omp.yield(%[[PRIV_ALLOC]] : [[TYPE]])
+! CPU-NEXT: } dealloc {
+! CPU-NEXT: ^bb0(%[[PRIV_ARG:.*]]: [[TYPE]]):
 
-! CHECK-NEXT: } dealloc {
-! CHECK-NEXT: ^bb0(%[[PRIV_ARG:.*]]: [[TYPE]]):
+! CPU-NEXT:   %[[PRIV_VAL:.*]] = fir.load %[[PRIV_ARG]]
+! CPU-NEXT:   %[[PRIV_ADDR:.*]] = fir.box_addr %[[PRIV_VAL]]
+! CPU-NEXT:   %[[PRIV_ADDR_I64:.*]] = fir.convert %[[PRIV_ADDR]]
+! CPU-NEXT:   %[[C0:.*]] = arith.constant 0 : i64
+! CPU-NEXT:   %[[PRIV_NULL_COND:.*]] = arith.cmpi ne, %[[PRIV_ADDR_I64]], %[[C0]] : i64
 
-! CHECK-NEXT:   %[[PRIV_VAL:.*]] = fir.load %[[PRIV_ARG]]
-! CHECK-NEXT:   %[[PRIV_ADDR:.*]] = fir.box_addr %[[PRIV_VAL]]
-! CHECK-NEXT:   %[[PRIV_ADDR_I64:.*]] = fir.convert %[[PRIV_ADDR]]
-! CHECK-NEXT:   %[[C0:.*]] = arith.constant 0 : i64
-! CHECK-NEXT:   %[[PRIV_NULL_COND:.*]] = arith.cmpi ne, %[[PRIV_ADDR_I64]], %[[C0]] : i64
+! CPU-NEXT:   fir.if %[[PRIV_NULL_COND]] {
+! CPU-NEXT:     fir.freemem %[[PRIV_ADDR]]
+! CPU-NEXT:   }
 
-! CHECK-NEXT:   fir.if %[[PRIV_NULL_COND]] {
-! CHECK-NEXT:     fir.freemem %[[PRIV_ADDR]]
-! CHECK-NEXT:   }
+! CPU-NEXT:   omp.yield
+! CPU-NEXT: }
 
-! CHECK-NEXT:   omp.yield
-! CHECK-NEXT: }
 
+! CPU-LABEL: func.func @_QPtarget_allocatable() {
 
-! CHECK-LABEL: func.func @_QPtarget_allocatable() {
+! CPU:  %[[VAR_ALLOC:.*]] = fir.alloca [[DESC_TYPE]]
+! CPU-SAME: {bindc_name = "alloc_var", {{.*}}}
+! CPU:  %[[VAR_DECL:.*]]:2 = hlfir.declare %[[VAR_ALLOC]]
+! CPU:  %[[BASE_ADDR:.*]] = fir.box_offset %[[VAR_DECL]]#0 base_addr : (!fir.ref<!fir.box<!fir.heap<i32>>>) -> [[MEMBER_TYPE:.*]]
+! CPU:  %[[MEMBER:.*]] = omp.map.info var_ptr(%[[VAR_DECL]]#0 : [[TYPE]], i32) map_clauses(to) capture(ByRef) var_ptr_ptr(%[[BASE_ADDR]] : [[MEMBER_TYPE:.*]]) -> {{.*}}
+! CPU:  %[[MAP_VAR:.*]] = omp.map.info var_ptr(%[[VAR_DECL]]#0 : [[TYPE]], [[DESC_TYPE]]) map_clauses(to) capture(ByRef) members(%[[MEMBER]] : [0] : !fir.llvm_ptr<!fir.ref<i32>>) -> !fir.ref<!fir.box<!fir.heap<i32>>>
 
-! CHECK:  %[[VAR_ALLOC:.*]] = fir.alloca [[DESC_TYPE]]
-! CHECK-SAME: {bindc_name = "alloc_var", {{.*}}}
-! CHECK:  %[[VAR_DECL:.*]]:2 = hlfir.declare %[[VAR_ALLOC]]
-! CHECK:  %[[BASE_ADDR:.*]] = fir.box_offset %[[VAR_DECL]]#0 base_addr : (!fir.ref<!fir.box<!fir.heap<i32>>>) -> [[MEMBER_TYPE:.*]]
-! CHECK:  %[[MEMBER:.*]] = omp.map.info var_ptr(%[[VAR_DECL]]#0 : [[TYPE]], i32) map_clauses(to) capture(ByRef) var_ptr_ptr(%[[BASE_ADDR]] : [[MEMBER_TYPE:.*]]) -> {{.*}}
-! CHECK:  %[[MAP_VAR:.*]] = omp.map.info var_ptr(%[[VAR_DECL]]#0 : [[TYPE]], [[DESC_TYPE]]) map_clauses(to) capture(ByRef) members(%[[MEMBER]] : [0] : !fir.llvm_ptr<!fir.ref<i32>>) -> !fir.ref<!fir.box<!fir.heap<i32>>>
+! CPU:  omp.target map_entries(%[[MAP_VAR]] -> %arg0, %[[MEMBER]] -> %arg1 : [[TYPE]], [[MEMBER_TYPE]]) private(
+! CPU-SAME: @[[VAR_PRIVATIZER_SYM]] %[[VAR_DECL]]#0 -> %{{.*}} [map_idx=0] : [[TYPE]]) {
 
-! CHECK:  omp.target map_entries(%[[MAP_VAR]] -> %arg0, %[[MEMBER]] -> %arg1 : [[TYPE]], [[MEMBER_TYPE]]) private(
-! CHECK-SAME: @[[VAR_PRIVATIZER_SYM]] %[[VAR_DECL]]#0 -> %{{.*}} [map_idx=0] : [[TYPE]]) {
+! GPU-LABEL: omp.private {type = private} {{.*}} init {
+! GPU:         fir.if %{{.*}} {
+! GPU-NEXT:    %[[ZERO_BOX:.*]] = fir.embox %{{.*}}
+! GPU-NEXT:     fir.store %[[ZERO_BOX]] to %{{.*}}
+! GPU-NEXT:   } else {
+! GPU-NOT:      fir.allocmem i32
+! GPU-NEXT:     %[[PRIV_ALLOC:.*]] = fir.alloca i32
+! GPU-NEXT:     %[[PRIV_ALLOC_BOX:.*]] = fir.embox %[[PRIV_ALLOC]]
+! GPU-NEXT:     fir.store %[[PRIV_ALLOC_BOX]] to %{{.*}}
+! GPU-NEXT:   }
+! GPU-NEXT:   omp.yield(%{{.*}})

@ergawy ergawy requested a review from mjklemm October 23, 2025 06:54
@tblah (Contributor) left a comment

I'm a little surprised that this is allowed in Fortran, but so far as I can tell the Fortran standard does not actually require ALLOCATABLEs to be allocated on the heap, so long as the allocation lifetime required by the standard is obeyed.

I agree that, in the case of a private variable, the expected lifetime is the duration of the outlined OpenMP construct, so a stack allocation should give us the correct lifetime.

Please wait for a second opinion on this before merging.

@jeanPerier (Contributor) commented

Is it legal to reallocate/deallocate the private ALLOCATABLE on the device?
If so, I am not sure this would work after your patch.

Maybe the StackArrays pass could be extended to turn scalar allocmem into alloca (assuming you only care about this under optimizations), although it would likely not be able to handle this case because of the branching logic around the allocation/deallocation.

@tblah (Contributor) commented Oct 23, 2025

> Is it legal to reallocate/deallocate the private ALLOCATABLE on the device? If so, I am not sure this would work after your patch.

Good point.

@ergawy note that even in cases where the OpenMP standard says explicit reallocation is not allowed (I haven't checked this case), there can still be implicit reallocations due to assignments (this has caught me out before). You will have to take care that these do not try to free() the stack allocation.
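A minimal sketch of the implicit-reallocation hazard being described (hypothetical code, not from this PR): assigning a value whose length or shape differs from the current allocation implicitly deallocates and reallocates the variable.

```fortran
! Hypothetical illustration of implicit reallocation on assignment: if the
! right-hand side does not match the current allocation, Fortran deallocates
! and reallocates the left-hand side behind the scenes. If the private copy
! were stack-allocated, that hidden deallocation must not call free() on it.
subroutine illustrate_implicit_realloc
  implicit none
  character(len=:), allocatable :: s

  s = "short"           ! allocates s with len = 5
  s = "a longer value"  ! implicitly deallocates and reallocates s with len = 14
end subroutine illustrate_implicit_realloc
```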

@tblah (Contributor) left a comment

Requesting changes to remove approval until allocatable assignments are handled.

@ergawy ergawy requested a review from jsjodin October 23, 2025 14:20
@kkwli (Collaborator) commented Oct 24, 2025

If the allocatable variable is mapped, reallocation by assignment is not allowed:

"If a list item of a map clause is an allocatable variable or is the subobject of an allocatable variable, the original list item must not be allocated, deallocated or reshaped while the corresponding list item has allocated storage."

"If the allocation status of a mapped variable or a list item that appears in a has_device_addr clause that has the ALLOCATABLE attribute is unallocated on entry to a target region, the allocation status of the corresponding variable in the device data environment must be unallocated upon exiting the region."

"If the allocation status of a mapped variable or a list item that appears in a has_device_addr clause that has the ALLOCATABLE attribute is allocated on entry to a target region, the allocation status and shape of the corresponding variable in the device data environment may not be changed, either explicitly or implicitly, in the region after entry to it."

@mjklemm (Contributor) commented Oct 24, 2025

As per the OpenMP spec, one is not allowed to alter the allocation status of an ALLOCATABLE entity that is used in a reduction clause. If that happens, the code is non-conforming. So, I do not think it will be harmful if ALLOCATABLEs in a reduction are privatized using alloca.

@ergawy (Member, Author) commented Oct 27, 2025

Thanks all for the feedback!

This is the most relevant part of the spec (OpenMP 5.2), according to which I think the changes in this PR are legal:

5.5.5 Properties Common to All Reduction Clauses
...
An original list item with the ALLOCATABLE attribute or any allocatable component of an
original list item that corresponds to a special variable identifier in the combiner expression or
the initializer expression must be in the allocated state at entry to the construct that contains the
reduction clause. Additionally, the list item or the allocatable component of the list item must be
neither deallocated nor allocated, explicitly or implicitly, within the region
...

OpenMP 6.0 maintains the same restriction.
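As a hedged illustration of what that restriction permits (hypothetical example, not taken from the PR or its tests): the reduction list item is allocated before the construct and its allocation status never changes inside the region, so a stack-allocated private copy whose lifetime matches the construct is sufficient.

```fortran
! Hypothetical, conforming usage under 5.5.5: `acc` is allocated before entry
! and is neither allocated nor deallocated inside the region, so the private
! copy only needs to live as long as the construct itself.
subroutine illustrate_reduction
  implicit none
  integer, allocatable :: acc
  integer :: i

  allocate(acc)
  acc = 0
  !$omp target teams distribute parallel do reduction(+:acc)
  do i = 1, 1024
    acc = acc + i
  end do
  deallocate(acc)
end subroutine illustrate_reduction
```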

@tblah @jeanPerier based on this and the comments above, do you have any objections to the changes in the PR? Please take another look and let me know.

@tblah (Contributor) left a comment

Thanks for the clarification, everyone.

@ergawy ergawy merged commit 585b6e2 into main Oct 27, 2025
14 checks passed
@ergawy ergawy deleted the users/ergawy/alloc_on_stack branch October 27, 2025 16:56
dvbuka pushed a commit to dvbuka/llvm-project that referenced this pull request Oct 27, 2025
Lukacma pushed a commit to Lukacma/llvm-project that referenced this pull request Oct 29, 2025
aokblast pushed a commit to aokblast/llvm-project that referenced this pull request Oct 30, 2025