Conversation

@ergawy
Member

@ergawy ergawy commented Jan 7, 2025

Problem:

Consider the following example:

```fortran
program test
  real :: x(1)
  integer :: i
  !$omp parallel do reduction(+:x)
    do i = 1,1
      x = 1
    end do
  !$omp end parallel do
end program
```

The HLFIR+OMP IR for this example looks like this:

```mlir
  func.func @_QQmain() {
    ...
    omp.parallel {
      %5 = fir.embox %4#0(%3) : (!fir.ref<!fir.array<1xf32>>, !fir.shape<1>) -> !fir.box<!fir.array<1xf32>>
      %6 = fir.alloca !fir.box<!fir.array<1xf32>>
      ...
      omp.wsloop private(@_QFEi_private_ref_i32 %1#0 -> %arg0 : !fir.ref<i32>) reduction(byref @add_reduction_byref_box_1xf32 %6 -> %arg1 : !fir.ref<!fir.box<!fir.array<1xf32>>>) {
        omp.loop_nest (%arg2) : i32 = (%c1_i32) to (%c1_i32_0) inclusive step (%c1_i32_1) {
          ...
          omp.yield
        }
      }
      omp.terminator
    }
    return
  }
```

The problem addressed by this PR involves the interaction between the `alloca` in the `omp.parallel` region and the corresponding `reduction` clause on the `omp.wsloop` op. When we translate the reduction from MLIR to LLVM IR, we have to choose an `alloca` insertion point. This happens in `convertOmpWsloop`; at entry to that function, this is what the LLVM module looks like:

```llvm
define void @_QQmain() {
  %tid.addr = alloca i32, align 4
  ...

entry:
  %omp_global_thread_num = call i32 @__kmpc_global_thread_num(ptr @1)
  br label %omp.par.entry

omp.par.entry:
  %tid.addr.local = alloca i32, align 4
  ...
  br label %omp.par.region

omp.par.region:
  br label %omp.par.region1

omp.par.region1:
  ...
  %5 = alloca { ptr, i64, i32, i8, i8, i8, i8, [1 x [3 x i64]] }, align 8
```

Possible solution:

Without the changes in this PR, the block chosen as the `alloca` insertion point for the reduction is `omp.par.entry`. The problem is that the allocation needed for the reduction has to reference the `%5` SSA value. This results in inserting allocations in `omp.par.entry` that reference an allocation in the later block `omp.par.region1`, which triggers the `Instruction does not dominate all uses!` verifier error.
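To make the dominance violation concrete, here is a hand-reduced sketch of the invalid placement (block names match the module above; the `%red.priv` allocation and the store are illustrative stand-ins, not the exact instructions the translation emits):

```llvm
; Sketch of the broken ordering: the reduction setup placed in
; omp.par.entry refers to %5, but %5 is only defined later in
; omp.par.region1, so the verifier reports
; "Instruction does not dominate all uses!".
omp.par.entry:
  %red.priv = alloca ptr, align 8
  store ptr %5, ptr %red.priv, align 8   ; use of %5 before its definition
  br label %omp.par.region

omp.par.region:
  br label %omp.par.region1

omp.par.region1:
  %5 = alloca { ptr, i64, i32, i8, i8, i8, i8, [1 x [3 x i64]] }, align 8
  ...
```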

The solution proposed by this PR is to allow `convertOmpOpRegions` to optionally record the first block of the OpenMP region being converted as an alloca block. For the above example, this means the allocation point chosen for the reduction will be the `omp.par.region1` block.
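Schematically, with the change applied, both the descriptor allocation and the reduction allocation that refers to it live in the same block, so dominance holds (`%red.priv` and the store are illustrative names continuing the sketch of the example above):

```llvm
omp.par.region1:
  %5 = alloca { ptr, i64, i32, i8, i8, i8, i8, [1 x [3 x i64]] }, align 8
  %red.priv = alloca ptr, align 8        ; reduction allocation now placed here
  store ptr %5, ptr %red.priv, align 8   ; %5 dominates this use
  ...
```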

For now, this new optional argument is enabled only for `parallel` and `target` ops.

Fixes #120254

@llvmbot llvmbot added mlir:llvm mlir flang Flang issues not falling into any other category mlir:openmp flang:openmp labels Jan 7, 2025
@llvmbot
Member

llvmbot commented Jan 7, 2025

@llvm/pr-subscribers-mlir-openmp

@llvm/pr-subscribers-mlir-llvm

Author: Kareem Ergawy (ergawy)

Changes



Patch is 35.54 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/121886.diff

10 Files Affected:

  • (modified) flang/test/Integration/OpenMP/atomic-capture-complex.f90 (+3-3)
  • (modified) mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp (+37-24)
  • (modified) mlir/test/Target/LLVMIR/omptarget-byref-bycopy-generation-host.mlir (+1-1)
  • (modified) mlir/test/Target/LLVMIR/omptarget-llvm.mlir (+8-8)
  • (modified) mlir/test/Target/LLVMIR/omptarget-parallel-llvm.mlir (+2-2)
  • (modified) mlir/test/Target/LLVMIR/openmp-llvm.mlir (+10-10)
  • (modified) mlir/test/Target/LLVMIR/openmp-parallel-reduction-multiblock.mlir (+1-1)
  • (modified) mlir/test/Target/LLVMIR/openmp-reduction-array-sections.mlir (+33-29)
  • (modified) mlir/test/Target/LLVMIR/openmp-reduction-init-arg.mlir (+1-1)
  • (modified) mlir/test/Target/LLVMIR/openmp-reduction-sections.mlir (+26-18)
diff --git a/flang/test/Integration/OpenMP/atomic-capture-complex.f90 b/flang/test/Integration/OpenMP/atomic-capture-complex.f90
index 4ffd18097d79ee..a76cbb643ef8ce 100644
--- a/flang/test/Integration/OpenMP/atomic-capture-complex.f90
+++ b/flang/test/Integration/OpenMP/atomic-capture-complex.f90
@@ -13,16 +13,16 @@
 !CHECK: %[[VAL_1:.*]] = alloca { float, float }, i64 1, align 8
 !CHECK: %[[ORIG_VAL:.*]] = alloca { float, float }, i64 1, align 8
 !CHECK: store { float, float } { float 2.000000e+00, float 2.000000e+00 }, ptr %[[ORIG_VAL]], align 4
-!CHECK: br label %entry
+!CHECK: br label %[[ENTRY:.*]]
 
-!CHECK: entry:
+!CHECK: [[ENTRY]]:
 !CHECK: %[[ATOMIC_TEMP_LOAD:.*]] = alloca { float, float }, align 8
 !CHECK: call void @__atomic_load(i64 8, ptr %[[ORIG_VAL]], ptr %[[ATOMIC_TEMP_LOAD]], i32 0)
 !CHECK: %[[PHI_NODE_ENTRY_1:.*]] = load { float, float }, ptr %[[ATOMIC_TEMP_LOAD]], align 8
 !CHECK: br label %.atomic.cont
 
 !CHECK: .atomic.cont
-!CHECK: %[[VAL_4:.*]] = phi { float, float } [ %[[PHI_NODE_ENTRY_1]], %entry ], [ %{{.*}}, %.atomic.cont ]
+!CHECK: %[[VAL_4:.*]] = phi { float, float } [ %[[PHI_NODE_ENTRY_1]], %[[ENTRY]] ], [ %{{.*}}, %.atomic.cont ]
 !CHECK: %[[VAL_5:.*]] = extractvalue { float, float } %[[VAL_4]], 0
 !CHECK: %[[VAL_6:.*]] = extractvalue { float, float } %[[VAL_4]], 1
 !CHECK: %[[VAL_7:.*]] = fadd contract float %[[VAL_5]], 1.000000e+00
diff --git a/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp b/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
index 87cb7f03fec6aa..384799c021ac00 100644
--- a/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
+++ b/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
@@ -345,31 +345,37 @@ findAllocaInsertPoint(llvm::IRBuilderBase &builder,
         allocaInsertPoint = frame.allocaInsertPoint;
         return WalkResult::interrupt();
       });
-  if (walkResult.wasInterrupted())
-    return allocaInsertPoint;
 
   // Otherwise, insert to the entry block of the surrounding function.
-  // If the current IRBuilder InsertPoint is the function's entry, it cannot
-  // also be used for alloca insertion which would result in insertion order
-  // confusion. Create a new BasicBlock for the Builder and use the entry block
-  // for the allocs.
+  if (!walkResult.wasInterrupted()) {
+    llvm::BasicBlock &funcEntryBlock =
+        builder.GetInsertBlock()->getParent()->getEntryBlock();
+    allocaInsertPoint = llvm::OpenMPIRBuilder::InsertPointTy(
+        &funcEntryBlock, funcEntryBlock.getFirstInsertionPt());
+  }
+
+  // If the current IRBuilder insertion block is the same as the alloca
+  // insertion block, it cannot also be used for alloca insertion which would
+  // result in insertion order confusion. Create a new BasicBlock for the
+  // Builder and use the entry block for the allocs.
+  //
   // TODO: Create a dedicated alloca BasicBlock at function creation such that
   // we do not need to move the current InertPoint here.
-  if (builder.GetInsertBlock() ==
-      &builder.GetInsertBlock()->getParent()->getEntryBlock()) {
+  if (builder.GetInsertBlock() == allocaInsertPoint.getBlock()) {
     assert(builder.GetInsertPoint() == builder.GetInsertBlock()->end() &&
            "Assuming end of basic block");
-    llvm::BasicBlock *entryBB = llvm::BasicBlock::Create(
-        builder.getContext(), "entry", builder.GetInsertBlock()->getParent(),
-        builder.GetInsertBlock()->getNextNode());
-    builder.CreateBr(entryBB);
-    builder.SetInsertPoint(entryBB);
+    auto *insertCont = splitBB(
+        llvm::OpenMPIRBuilder::InsertPointTy(
+            allocaInsertPoint.getBlock(), allocaInsertPoint.getBlock()->end()),
+        true, "insert.cont");
+    builder.SetInsertPoint(insertCont, insertCont->end());
   }
 
-  llvm::BasicBlock &funcEntryBlock =
-      builder.GetInsertBlock()->getParent()->getEntryBlock();
   return llvm::OpenMPIRBuilder::InsertPointTy(
-      &funcEntryBlock, funcEntryBlock.getFirstInsertionPt());
+      allocaInsertPoint.getBlock(),
+      allocaInsertPoint.getPoint() != allocaInsertPoint.getBlock()->end()
+          ? allocaInsertPoint.getPoint()
+          : allocaInsertPoint.getBlock()->getFirstInsertionPt());
 }
 
 /// Converts the given region that appears within an OpenMP dialect operation to
@@ -380,7 +386,8 @@ findAllocaInsertPoint(llvm::IRBuilderBase &builder,
 static llvm::Expected<llvm::BasicBlock *> convertOmpOpRegions(
     Region &region, StringRef blockName, llvm::IRBuilderBase &builder,
     LLVM::ModuleTranslation &moduleTranslation,
-    SmallVectorImpl<llvm::PHINode *> *continuationBlockPHIs = nullptr) {
+    SmallVectorImpl<llvm::PHINode *> *continuationBlockPHIs = nullptr,
+    bool saveFirstBlockForAlloca = false) {
   llvm::BasicBlock *continuationBlock =
       splitBB(builder, true, "omp.region.cont");
   llvm::BasicBlock *sourceBlock = builder.GetInsertBlock();
@@ -441,6 +448,14 @@ static llvm::Expected<llvm::BasicBlock *> convertOmpOpRegions(
   // Convert blocks one by one in topological order to ensure
   // defs are converted before uses.
   SetVector<Block *> blocks = getBlocksSortedByDominance(region);
+  llvm::BasicBlock *firstLLVMBB = moduleTranslation.lookupBlock(blocks.front());
+  std::optional<LLVM::ModuleTranslation::SaveStack<OpenMPAllocaStackFrame>>
+      frame;
+
+  if (saveFirstBlockForAlloca)
+    frame.emplace(moduleTranslation, llvm::OpenMPIRBuilder::InsertPointTy(
+                                         firstLLVMBB, firstLLVMBB->end()));
+
   for (Block *bb : blocks) {
     llvm::BasicBlock *llvmBB = moduleTranslation.lookupBlock(bb);
     // Retarget the branch of the entry block to the entry block of the
@@ -2093,15 +2108,11 @@ convertOmpParallel(omp::ParallelOp opInst, llvm::IRBuilderBase &builder,
     LLVM::ModuleTranslation::SaveStack<OpenMPVarMappingStackFrame> mappingGuard(
         moduleTranslation, reductionVariableMap);
 
-    // Save the alloca insertion point on ModuleTranslation stack for use in
-    // nested regions.
-    LLVM::ModuleTranslation::SaveStack<OpenMPAllocaStackFrame> frame(
-        moduleTranslation, allocaIP);
-
     // ParallelOp has only one region associated with it.
     builder.restoreIP(codeGenIP);
     llvm::Expected<llvm::BasicBlock *> regionBlock = convertOmpOpRegions(
-        opInst.getRegion(), "omp.par.region", builder, moduleTranslation);
+        opInst.getRegion(), "omp.par.region", builder, moduleTranslation,
+        /*continuationBlockPHIs=*/nullptr, /*saveFirstBlockForAlloca=*/true);
     if (!regionBlock)
       return regionBlock.takeError();
 
@@ -2186,6 +2197,7 @@ convertOmpParallel(omp::ParallelOp opInst, llvm::IRBuilderBase &builder,
 
   llvm::OpenMPIRBuilder::InsertPointTy allocaIP =
       findAllocaInsertPoint(builder, moduleTranslation);
+
   llvm::OpenMPIRBuilder::LocationDescription ompLoc(builder);
 
   llvm::OpenMPIRBuilder::InsertPointOrErrorTy afterIP =
@@ -4022,7 +4034,8 @@ convertOmpTarget(Operation &opInst, llvm::IRBuilderBase &builder,
 
     builder.restoreIP(codeGenIP);
     llvm::Expected<llvm::BasicBlock *> exitBlock = convertOmpOpRegions(
-        targetRegion, "omp.target", builder, moduleTranslation);
+        targetRegion, "omp.target", builder, moduleTranslation,
+        /*continuationBlockPHIs=*/nullptr, /*saveFirstBlockForAlloca=*/true);
 
     if (!exitBlock)
       return exitBlock.takeError();
diff --git a/mlir/test/Target/LLVMIR/omptarget-byref-bycopy-generation-host.mlir b/mlir/test/Target/LLVMIR/omptarget-byref-bycopy-generation-host.mlir
index 871f5caf7b2ffc..e4da548e84a1c0 100644
--- a/mlir/test/Target/LLVMIR/omptarget-byref-bycopy-generation-host.mlir
+++ b/mlir/test/Target/LLVMIR/omptarget-byref-bycopy-generation-host.mlir
@@ -26,7 +26,7 @@ module attributes {omp.is_target_device = false, omp.target_triples = ["amdgcn-a
 // CHECK: define void @_QQmain() {
 // CHECK: %[[BYCOPY_ALLOCA:.*]] = alloca ptr, align 8
 
-// CHECK: entry:                                            ; preds = %0
+// CHECK: {{.*}}:                                            ; preds = %0
 // CHECK: %[[LOAD_VAL:.*]] = load i32, ptr @_QFEi, align 4
 // CHECK: store i32 %[[LOAD_VAL]], ptr %[[BYCOPY_ALLOCA]], align 4
 // CHECK: %[[BYCOPY_LOAD:.*]] = load ptr, ptr %[[BYCOPY_ALLOCA]], align 8
diff --git a/mlir/test/Target/LLVMIR/omptarget-llvm.mlir b/mlir/test/Target/LLVMIR/omptarget-llvm.mlir
index 7f21095763a397..e6a3c54c6957f2 100644
--- a/mlir/test/Target/LLVMIR/omptarget-llvm.mlir
+++ b/mlir/test/Target/LLVMIR/omptarget-llvm.mlir
@@ -20,7 +20,7 @@ llvm.func @_QPopenmp_target_data() {
 // CHECK:         %[[VAL_2:.*]] = alloca [1 x ptr], align 8
 // CHECK:         %[[VAL_3:.*]] = alloca i32, i64 1, align 4
 // CHECK:         br label %[[VAL_4:.*]]
-// CHECK:       entry:                                            ; preds = %[[VAL_5:.*]]
+// CHECK:       [[VAL_4]]:                                            ; preds = %[[VAL_5:.*]]
 // CHECK:         %[[VAL_6:.*]] = getelementptr inbounds [1 x ptr], ptr %[[VAL_0]], i32 0, i32 0
 // CHECK:         store ptr %[[VAL_3]], ptr %[[VAL_6]], align 8
 // CHECK:         %[[VAL_7:.*]] = getelementptr inbounds [1 x ptr], ptr %[[VAL_1]], i32 0, i32 0
@@ -65,7 +65,7 @@ llvm.func @_QPopenmp_target_data_region(%0 : !llvm.ptr) {
 // CHECK:         %[[VAL_1:.*]] = alloca [1 x ptr], align 8
 // CHECK:         %[[VAL_2:.*]] = alloca [1 x ptr], align 8
 // CHECK:         br label %[[VAL_3:.*]]
-// CHECK:       entry:                                            ; preds = %[[VAL_4:.*]]
+// CHECK:       [[VAL_3]]:                                            ; preds = %[[VAL_4:.*]]
 // CHECK:         %[[ARR_OFFSET:.*]] = getelementptr inbounds [1024 x i32], ptr %[[ARR_DATA:.*]], i64 0, i64 0
 // CHECK:         %[[VAL_5:.*]] = getelementptr inbounds [1 x ptr], ptr %[[VAL_0]], i32 0, i32 0
 // CHECK:         store ptr %[[ARR_DATA]], ptr %[[VAL_5]], align 8
@@ -151,7 +151,7 @@ llvm.func @_QPomp_target_enter_exit(%1 : !llvm.ptr, %3 : !llvm.ptr) {
 // CHECK:         %[[VAL_9:.*]] = icmp slt i32 %[[VAL_8]], 10
 // CHECK:         %[[VAL_10:.*]] = load i32, ptr %[[VAL_6]], align 4
 // CHECK:         br label %[[VAL_11:.*]]
-// CHECK:       entry:                                            ; preds = %[[VAL_12:.*]]
+// CHECK:       [[VAL_11]]:                                            ; preds = %[[VAL_12:.*]]
 // CHECK:         br i1 %[[VAL_9]], label %[[VAL_13:.*]], label %[[VAL_14:.*]]
 // CHECK:       omp_if.then:                                      ; preds = %[[VAL_11]]
 // CHECK:         %[[ARR_OFFSET1:.*]] = getelementptr inbounds [1024 x i32], ptr %[[VAL_16:.*]], i64 0, i64 0
@@ -228,7 +228,7 @@ llvm.func @_QPopenmp_target_use_dev_ptr() {
 // CHECK:         %[[VAL_3:.*]] = alloca ptr, align 8
 // CHECK:         %[[VAL_4:.*]] = alloca ptr, i64 1, align 8
 // CHECK:         br label %[[VAL_5:.*]]
-// CHECK:       entry:                                            ; preds = %[[VAL_6:.*]]
+// CHECK:       [[VAL_5]]:                                            ; preds = %[[VAL_6:.*]]
 // CHECK:         %[[VAL_7:.*]] = getelementptr inbounds [1 x ptr], ptr %[[VAL_0]], i32 0, i32 0
 // CHECK:         store ptr %[[VAL_4]], ptr %[[VAL_7]], align 8
 // CHECK:         %[[VAL_8:.*]] = getelementptr inbounds [1 x ptr], ptr %[[VAL_1]], i32 0, i32 0
@@ -271,7 +271,7 @@ llvm.func @_QPopenmp_target_use_dev_addr() {
 // CHECK:         %[[VAL_2:.*]] = alloca [1 x ptr], align 8
 // CHECK:         %[[VAL_3:.*]] = alloca ptr, i64 1, align 8
 // CHECK:         br label %[[VAL_4:.*]]
-// CHECK:       entry:                                            ; preds = %[[VAL_5:.*]]
+// CHECK:       [[VAL_4]]:                                            ; preds = %[[VAL_5:.*]]
 // CHECK:         %[[VAL_6:.*]] = getelementptr inbounds [1 x ptr], ptr %[[VAL_0]], i32 0, i32 0
 // CHECK:         store ptr %[[VAL_3]], ptr %[[VAL_6]], align 8
 // CHECK:         %[[VAL_7:.*]] = getelementptr inbounds [1 x ptr], ptr %[[VAL_1]], i32 0, i32 0
@@ -312,7 +312,7 @@ llvm.func @_QPopenmp_target_use_dev_addr_no_ptr() {
 // CHECK:         %[[VAL_2:.*]] = alloca [1 x ptr], align 8
 // CHECK:         %[[VAL_3:.*]] = alloca i32, i64 1, align 4
 // CHECK:         br label %[[VAL_4:.*]]
-// CHECK:       entry:                                            ; preds = %[[VAL_5:.*]]
+// CHECK:       [[VAL_4]]:                                            ; preds = %[[VAL_5:.*]]
 // CHECK:         %[[VAL_6:.*]] = getelementptr inbounds [1 x ptr], ptr %[[VAL_0]], i32 0, i32 0
 // CHECK:         store ptr %[[VAL_3]], ptr %[[VAL_6]], align 8
 // CHECK:         %[[VAL_7:.*]] = getelementptr inbounds [1 x ptr], ptr %[[VAL_1]], i32 0, i32 0
@@ -359,7 +359,7 @@ llvm.func @_QPopenmp_target_use_dev_addr_nomap() {
 // CHECK:         %[[VAL_3:.*]] = alloca ptr, i64 1, align 8
 // CHECK:         %[[VAL_4:.*]] = alloca ptr, i64 1, align 8
 // CHECK:         br label %[[VAL_5:.*]]
-// CHECK:       entry:                                            ; preds = %[[VAL_6:.*]]
+// CHECK:       [[VAL_5]]:                                            ; preds = %[[VAL_6:.*]]
 // CHECK:         %[[VAL_7:.*]] = getelementptr inbounds [2 x ptr], ptr %[[VAL_0]], i32 0, i32 0
 // CHECK:         store ptr %[[VAL_4]], ptr %[[VAL_7]], align 8
 // CHECK:         %[[VAL_8:.*]] = getelementptr inbounds [2 x ptr], ptr %[[VAL_1]], i32 0, i32 0
@@ -418,7 +418,7 @@ llvm.func @_QPopenmp_target_use_dev_both() {
 // CHECK:         %[[VAL_4:.*]] = alloca ptr, i64 1, align 8
 // CHECK:         %[[VAL_5:.*]] = alloca ptr, i64 1, align 8
 // CHECK:         br label %[[VAL_6:.*]]
-// CHECK:       entry:                                            ; preds = %[[VAL_7:.*]]
+// CHECK:       [[VAL_6]]:                                            ; preds = %[[VAL_7:.*]]
 // CHECK:         %[[VAL_8:.*]] = getelementptr inbounds [2 x ptr], ptr %[[VAL_0]], i32 0, i32 0
 // CHECK:         store ptr %[[VAL_4]], ptr %[[VAL_8]], align 8
 // CHECK:         %[[VAL_9:.*]] = getelementptr inbounds [2 x ptr], ptr %[[VAL_1]], i32 0, i32 0
diff --git a/mlir/test/Target/LLVMIR/omptarget-parallel-llvm.mlir b/mlir/test/Target/LLVMIR/omptarget-parallel-llvm.mlir
index 4903656c22ec72..e586110529c005 100644
--- a/mlir/test/Target/LLVMIR/omptarget-parallel-llvm.mlir
+++ b/mlir/test/Target/LLVMIR/omptarget-parallel-llvm.mlir
@@ -54,9 +54,7 @@ module attributes {dlti.dl_spec = #dlti.dl_spec<#dlti.dl_entry<"dlti.alloca_memo
 // CHECK: define weak_odr protected amdgpu_kernel void @[[FUNC0:.*]](
 // CHECK-SAME: ptr %[[TMP:.*]], ptr %[[TMP0:.*]]) {
 // CHECK:         %[[TMP1:.*]] = alloca [1 x ptr], align 8, addrspace(5)
-// CHECK:         %[[TMP2:.*]] = addrspacecast ptr addrspace(5) %[[TMP1]] to ptr
 // CHECK:         %[[STRUCTARG:.*]] = alloca { ptr }, align 8, addrspace(5)
-// CHECK:         %[[STRUCTARG_ASCAST:.*]] = addrspacecast ptr addrspace(5) %[[STRUCTARG]] to ptr
 // CHECK:         %[[TMP3:.*]] = alloca ptr, align 8, addrspace(5)
 // CHECK:         %[[TMP4:.*]] = addrspacecast ptr addrspace(5) %[[TMP3]] to ptr
 // CHECK:         store ptr %[[TMP0]], ptr %[[TMP4]], align 8
@@ -64,6 +62,8 @@ module attributes {dlti.dl_spec = #dlti.dl_spec<#dlti.dl_entry<"dlti.alloca_memo
 // CHECK:         %[[EXEC_USER_CODE:.*]] = icmp eq i32 %[[TMP5]], -1
 // CHECK:         br i1 %[[EXEC_USER_CODE]], label %[[USER_CODE_ENTRY:.*]], label %[[WORKER_EXIT:.*]]
 // CHECK:         %[[TMP6:.*]] = load ptr, ptr %[[TMP4]], align 8
+// CHECK:         %[[TMP2:.*]] = addrspacecast ptr addrspace(5) %[[TMP1]] to ptr
+// CHECK:         %[[STRUCTARG_ASCAST:.*]] = addrspacecast ptr addrspace(5) %[[STRUCTARG]] to ptr
 // CHECK:         %[[OMP_GLOBAL_THREAD_NUM:.*]] = call i32 @__kmpc_global_thread_num(ptr addrspacecast (ptr addrspace(1) @[[GLOB1:[0-9]+]] to ptr))
 // CHECK:         %[[GEP_:.*]] = getelementptr { ptr }, ptr addrspace(5) %[[STRUCTARG]], i32 0, i32 0
 // CHECK:         store ptr %[[TMP6]], ptr addrspace(5) %[[GEP_]], align 8
diff --git a/mlir/test/Target/LLVMIR/openmp-llvm.mlir b/mlir/test/Target/LLVMIR/openmp-llvm.mlir
index 44e32c3f35f9b5..1177dccab80cb1 100644
--- a/mlir/test/Target/LLVMIR/openmp-llvm.mlir
+++ b/mlir/test/Target/LLVMIR/openmp-llvm.mlir
@@ -1415,16 +1415,16 @@ llvm.func @omp_atomic_update(%x:!llvm.ptr, %expr: i32, %xbool: !llvm.ptr, %exprb
 //CHECK: {{.*}} = alloca { float, float }, i64 1, align 8
 //CHECK: %[[ORIG_VAL:.*]] = alloca { float, float }, i64 1, align 8
 
-//CHECK: br label %entry
+//CHECK: br label %[[ENTRY:.*]]
 
-//CHECK: entry:
+//CHECK: [[ENTRY]]:
 //CHECK: %[[ATOMIC_TEMP_LOAD:.*]] = alloca { float, float }, align 8
 //CHECK: call void @__atomic_load(i64 8, ptr %[[ORIG_VAL]], ptr %[[ATOMIC_TEMP_LOAD]], i32 0)
 //CHECK: %[[PHI_NODE_ENTRY_1:.*]] = load { float, float }, ptr %[[ATOMIC_TEMP_LOAD]], align 8
 //CHECK: br label %.atomic.cont
 
 //CHECK: .atomic.cont
-//CHECK: %[[VAL_4:.*]] = phi { float, float } [ %[[PHI_NODE_ENTRY_1]], %entry ], [ %{{.*}}, %.atomic.cont ]
+//CHECK: %[[VAL_4:.*]] = phi { float, float } [ %[[PHI_NODE_ENTRY_1]], %{{.*}} ], [ %{{.*}}, %.atomic.cont ]
 //CHECK: %[[VAL_5:.*]] = extractvalue { float, float } %[[VAL_4]], 0
 //CHECK: %[[VAL_6:.*]] = extractvalue { float, float } %[[VAL_4]], 1
 //CHECK: %[[VAL_7:.*]] = fadd contract float %[[VAL_5]], 1.000000e+00
@@ -1467,16 +1467,16 @@ llvm.func @_QPomp_atomic_update_complex() {
 //CHECK: %[[VAL_1:.*]] = alloca { float, float }, i64 1, align 8
 //CHECK: %[[ORIG_VAL:.*]] = alloca { float, float }, i64 1, align 8
 //CHECK: store { float, float } { float 2.000000e+00, float 2.000000e+00 }, ptr %[[ORIG_VAL]], align 4
-//CHECK: br label %entry
+//CHECK: br label %[[ENTRY:.*]]
 
-//CHECK: entry:							; preds = %0
+//CHECK: [[ENTRY]]:							; preds = %0
 //CHECK: %[[ATOMIC_TEMP_LOAD:.*]] = alloca { float, float }, align 8
 //CHECK: call void @__atomic_load(i64 8, ptr %[[ORIG_VAL]], ptr %[[ATOMIC_TEMP_LOAD]], i32 0)
 //CHECK: %[[PHI_NODE_ENTRY_1:.*]] = load { float, float }, ptr %[[ATOMIC_TEMP_LOAD]], align 8
 //CHECK: br label %.atomic.cont
 
 //CHECK: .atomic.cont
-//CHECK: %[[VAL_4:.*]] = phi { float, float } [ %[[PHI_NODE_ENTRY_1]], %entry ], [ %{{.*}}, %.atomic.cont ]
+//CHECK: %[[VAL_4:.*]] = phi { float, float } [ %[[PHI_NODE_ENTRY_1]], %{{.*}} ], [ %{{.*}}, %.atomic.cont ]
 //CHECK: %[[VAL_5:.*]] = extractvalue { float, float } %[[VAL_4]], 0
 //CHECK: %[[VAL_6:.*]] = extractvalue { float, float } %[[VAL_4]], 1
 //CHECK: %[[VAL_7:.*]] = fadd contract float %[[VAL_5]], 1.000000e+00
@@ -1613,7 +1613,7 @@ llvm.func @omp_atomic_update_intrinsic(%x:!llvm.ptr, %expr: i32) {
 // CHECK-LABEL: @atomic_update_cmpxchg
 // CHECK-SAME: (ptr %[[X:.*]], ptr %[[EXPR:.*]]) {
 // CHECK:  %[[AT_LOAD_VAL:.*]] = load atomic i32, ptr %[[X]] monotonic, align 4
-// CHECK:  %[[LOAD_VAL_PHI:.*]] = phi i32 [ %[[AT_LOAD_VAL]], %entry ], [ %[[LOAD_VAL:.*]], %.atomic.cont ]
+// CHECK:  %[[LOAD_VAL_PHI:.*]] = phi i32 [ %[[AT_LOAD_VAL]], %{{.*}} ], [ %[[LOAD_VAL:.*]], %.atomic.cont ]
 // CHECK:  %[[VAL_SUCCESS:.*]] = cmpxchg ptr %[[X]], i32 %[[LOAD_VAL_PHI]], i32 %{{.*}} monotonic monotonic, align 4
 // CHECK:  %[[LOAD_VAL]] = extractvalue { i32, i1 } %[[VAL_SUCCESS]], 0
 // CHECK:  br i1 %{{.*}}, label %.atomic.exit, label %.atomic.cont
@@ -2216,8 +2216,8 @@ llvm.func @omp_sections_empty() -> () {
   omp.sections {
     omp.terminator
   }
-  // CHECK-NEXT: br label %entry
-  // CHECK: entry:
+  // CHECK-NEXT: br label %[[ENTRY:.*]]
+  // CHECK: [[ENTRY]]:
   // CHECK-NEXT: ret void
   llvm.return
 }
@@ -3093,7 +3093,7 @@ llvm.func @omp_task_final(%boolexpr: i1) {
 // CHECK:         br label %[[entry:[^,]+]]
 // CHECK:       [[entry]]:
 // CHECK:         br label %[[codeRepl:[^,]+]]
-// CHECK:       [[codeRepl]]:                                         ; preds = %entry
+// CHECK:       [[codeRepl]]:
 // CHECK:         %[[omp_global_thread_num:.+]] = call i32 @__kmpc_global_thread_num(ptr @{{.+}})
 // CHECK:         %[[final_flag:.+]] = select i1 %[[boolexpr]], i32 2, i32 0
 // CHECK:         %[[task_flags:.+]] = or i32 %[[final_flag]], 1
diff --git a/mlir/test/Target/LLVMIR/openmp-parallel-reduction-multiblock.mlir b/mlir/test/Target/LLVMIR/openmp-parallel-reduction-multiblock.mlir
index 55fb5954548a04..134eb8ef5fd9f2 100644
--- a/mlir/test/Target/LLVMIR/openmp-parallel-reduction-multiblock.mlir
+++ b/mlir/test/Target/LLVMIR/openmp-parallel-reduction-multiblock.mlir
@@ -30,7 +30,7 @@ llvm.func @missordered_blocks_(%arg0: !llvm.ptr {fir.bindc_name = "x"}, %arg1: !
 
 // CHECK:  ...
[truncated]

@ergawy
Member Author

ergawy commented Jan 7, 2025

I did not add new tests yet. If the proposed solution is acceptable, I will add some new tests to reproduce the bug fixed by the PR.

!RUN: %if aarch64-registerd-target %{ %flang_fc1 -triple aarch64-unknown-linux-gnu -emit-llvm -fopenmp %s -o - | FileCheck --check-prefixes=CHECK,AARCH64 %s %}

!CHECK: %[[X_NEW_VAL:.*]] = alloca { float, float }, align 8
!CHECK: %[[VAL_1:.*]] = alloca { float, float }, i64 1, align 8
Member Author

TODO: Add new test(s) to reproduce the bug fixed by the PR once the reviewers agree with the taken approach.

Contributor

@tblah tblah left a comment


Thanks for working on a fix for this. I have had a lot of trouble before getting the insertion points right for reductions.

Now, when we choose an alloca insertion point for the reduction, this is the chosen block omp.par.entry (without the changes in this PR). The problem is that the allocation needed for the reduction needs to reference the %5 SSA value. This results in inserting allocations in omp.par.entry that reference allocations in a later block omp.par.region1 which causes the Instruction does not dominate all uses! error.

Do you mean that the allocation region of the reduction needs to reference `%5`? Shouldn't that be the init region (which definitely shouldn't be in the alloca block)?

Comment on lines +357 to +360
// If the current IRBuilder insertion block is the same as the alloca
// insertion block, it cannot also be used for alloca insertion which would
// result in insertion order confusion. Create a new BasicBlock for the
// Builder and use the entry block for the allocs.
Contributor

Shouldn't it be okay so long as the allocas land at the start of the block (as they should be anyway)?

Comment on lines +376 to +378
allocaInsertPoint.getPoint() != allocaInsertPoint.getBlock()->end()
? allocaInsertPoint.getPoint()
: allocaInsertPoint.getBlock()->getFirstInsertionPt());
Contributor

Why not unconditionally put the alloca at the start of the block?

Comment on lines +88 to +92
// CHECK: %[[VAL_6:.*]] = alloca i32, align 4
// CHECK: %[[VAL_7:.*]] = alloca i32, align 4
// CHECK: %[[VAL_8:.*]] = alloca i32, align 4
// CHECK: %[[VAL_9:.*]] = alloca i32, align 4
// CHECK: %[[VAL_19:.*]] = alloca { ptr, i64, i32, i8, i8, i8, i8, [1 x [3 x i64]] }, i64 1, align 8
It is a very minor thing, but this is a bit of a regression for me. Ideally I would like to see all allocas at the start of the first block of the function. We were already not achieving that, so this doesn't matter too much, but it would have been nice if this patch moved us further in the right direction.

@ergawy (Member Author) commented Jan 7, 2025

Thanks for the quick review.

> Do you mean that the allocation region of the reduction needs to reference `%5`? Shouldn't that be the init region (which definitely shouldn't be in the alloca block)?

Yes, the init region indeed. Your comment helped me understand a bigger part of the picture regarding reductions. The problem is that in `initReductionVars`, we start emitting the reduction init logic right after the latest alloca block chosen for the reduction allocations: `builder.SetInsertPoint(latestAllocaBlock->getTerminator())`. Without the changes in this PR, the latest alloca block comes before the first block of the parent parallel region (in the example mentioned in the description), which causes the bug.

I think we can find a more suitable solution to the issue than the one proposed in this PR; if so, I will abandon this PR. Looking into it .... 👀
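
To illustrate the failure mode, a hand-written sketch based on the module shown in the description (not actual compiler output): with the init logic emitted at the end of the alloca block `omp.par.entry`, it would have to use `%5` before its definition:

```llvm
omp.par.entry:
  %tid.addr.local = alloca i32, align 4
  ; reduction init emitted here needs the boxed descriptor %5 ...
  ; ... but %5 is only defined in omp.par.region1 below, so the verifier
  ; reports "Instruction does not dominate all uses!"
  br label %omp.par.region

omp.par.region1:
  ...
  %5 = alloca { ptr, i64, i32, i8, i8, i8, i8, [1 x [3 x i64]] }, align 8
```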

ergawy added a commit to ergawy/llvm-project that referenced this pull request Jan 8, 2025
Problem

Consider the following example:
```fortran
program test
  real :: x(1)
  integer :: i
  !$omp parallel do reduction(+:x)
    do i = 1,1
      x = 1
    end do
  !$omp end parallel do
end program
```

The HLFIR+OMP IR for this example looks like this:
```mlir
  func.func @_QQmain() {
    ...
    omp.parallel {
      %5 = fir.embox %4#0(%3) : (!fir.ref<!fir.array<1xf32>>, !fir.shape<1>) -> !fir.box<!fir.array<1xf32>>
      %6 = fir.alloca !fir.box<!fir.array<1xf32>>
      ...
      omp.wsloop private(@_QFEi_private_ref_i32 %1#0 -> %arg0 : !fir.ref<i32>) reduction(byref @add_reduction_byref_box_1xf32 %6 -> %arg1 : !fir.ref<!fir.box<!fir.array<1xf32>>>) {
        omp.loop_nest (%arg2) : i32 = (%c1_i32) to (%c1_i32_0) inclusive step (%c1_i32_1) {
          ...
          omp.yield
        }
      }
      omp.terminator
    }
    return
  }
```

The problem addressed by this PR relates to the `alloca` in the `omp.parallel` region combined with the `reduction` clause on the `omp.wsloop` op. When we try to translate the reduction from MLIR to LLVM, we have to choose an `alloca` insertion point. This happens in `convertOmpWsloop`; at entry to that function, this is what the LLVM module looks like:

```llvm
define void @_QQmain() {
  %tid.addr = alloca i32, align 4
  ...

entry:
  %omp_global_thread_num = call i32 @__kmpc_global_thread_num(ptr @1)
  br label %omp.par.entry

omp.par.entry:
  %tid.addr.local = alloca i32, align 4
  ...
  br label %omp.par.region

omp.par.region:
  br label %omp.par.region1

omp.par.region1:
  ...
  %5 = alloca { ptr, i64, i32, i8, i8, i8, i8, [1 x [3 x i64]] }, align 8
```

Now, when we choose an `alloca` insertion point for the reduction, this is the chosen block `omp.par.entry` (without the changes in this PR). The problem is that the allocation needed for the reduction needs to reference the `%5` SSA value. This results in inserting allocations in `omp.par.entry` that reference allocations in a later block `omp.par.region1` which causes the `Instruction does not dominate all uses!` error.

Possible solution - take 2:

This PR contains a more localized solution than llvm#121886. It makes sure that, on entry to `initReductionVars`, the IR builder is at a point where we can start inserting the initialization regions; to make things cleaner, we still split off a dedicated `omp.reduction.init` block at the builder's insertion point. This way we avoid splitting after the latest allocation block, which is what causes the issue.
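
With that change, the block layout would look roughly like this (a hand-made sketch; block contents elided):

```llvm
omp.par.entry:                 ; allocas for the parallel region stay here
  %tid.addr.local = alloca i32, align 4
  ...
  br label %omp.par.region

omp.par.region1:
  ...
  %5 = alloca { ptr, i64, i32, i8, i8, i8, i8, [1 x [3 x i64]] }, align 8
  br label %omp.reduction.init

omp.reduction.init:            ; dedicated block: init code can safely use %5
  ...
```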
@ergawy (Member Author) commented Jan 8, 2025

#122079 is another possible solution to the issue. If it is accepted, I will abandon this PR.

@ergawy ergawy removed the request for review from jsonn January 8, 2025 09:32
ergawy added a commit to ergawy/llvm-project that referenced this pull request Jan 8, 2025
(Commit message repeats the problem description and "Possible solution - take 2" from the commit above.)
@ergawy (Member Author) commented Jan 8, 2025

Abandoning in favor of #122079.

@ergawy ergawy closed this Jan 8, 2025
ergawy added a commit that referenced this pull request Jan 9, 2025
Replaces #121886
Fixes #120254 (hopefully 🤞)

(Commit message repeats the problem description and "Possible solution - take 2" from above.)
github-actions bot pushed a commit to arm/arm-toolchain that referenced this pull request Jan 10, 2025
… (#122079)

Replaces llvm/llvm-project#121886
Fixes llvm/llvm-project#120254 (hopefully 🤞)

(Commit message repeats the problem description and "Possible solution - take 2" from above.)

Labels

flang:openmp, flang (Flang issues not falling into any other category), mlir:llvm, mlir:openmp, mlir

Development

Successfully merging this pull request may close these issues.

"Instruction does not dominate" error with multi-dimensional reductions
