Skip to content

Conversation

@luporl
Copy link
Contributor

@luporl luporl commented Jun 30, 2025

Assignments of n-dimensional arrays, with trivial RHS, were
always being converted to n nested loops. For contiguous arrays,
it's possible to flatten them and use a single loop, that can
usually be better optimized by LLVM.

In a test program, using a 3-dimensional array and varying its
size, the resulting speedup was as follows (measured on Graviton4):

16K 1.09
64K 1.40
128K 1.90
256K 1.91
512K 1.00

For sizes above or equal to 512K no improvement was observed.
It looks like LLVM stops trying to perform aggressive loop
unrolling at a certain threshold and just uses nested loops
instead. Larger sizes won't fit on L1 and L2 caches too.

This was noticed while profiling 527.cam4_r. This optimization
makes aer_rad_props_sw slightly faster, but unfortunately it
practically doesn't change 527.cam4_r total execution time.

Assignments of n-dimensional arrays, with trivial RHS, were
always being converted to n nested loops. For contiguous arrays,
it's possible to flatten them and use a single loop, that can
usually be better optimized by LLVM.

In a test program, using a 3-dimensional array and varying its
size, the resulting speedup was as follows (measured on Graviton4):

16K     1.09
64K     1.40
128K    1.90
256K    1.91
512K    1.00

For sizes above or equal to 512K no improvement was observed.
It looks like LLVM stops trying to perform aggressive loop
unrolling at a certain threshold and just uses nested loops
instead. Larger sizes won't fit on L1 and L2 caches too.

This was noticed while profiling 527.cam4_r. This optimization
makes aer_rad_props slightly faster, but unfortunately it
practically doesn't change 527.cam4_r total execution time.
@llvmbot llvmbot added flang Flang issues not falling into any other category flang:fir-hlfir labels Jun 30, 2025
@llvmbot
Copy link
Member

llvmbot commented Jun 30, 2025

@llvm/pr-subscribers-flang-fir-hlfir

Author: Leandro Lupori (luporl)

Changes

Assignments of n-dimensional arrays, with trivial RHS, were
always being converted to n nested loops. For contiguous arrays,
it's possible to flatten them and use a single loop, that can
usually be better optimized by LLVM.

In a test program, using a 3-dimensional array and varying its
size, the resulting speedup was as follows (measured on Graviton4):

16K 1.09
64K 1.40
128K 1.90
256K 1.91
512K 1.00

For sizes above or equal to 512K no improvement was observed.
It looks like LLVM stops trying to perform aggressive loop
unrolling at a certain threshold and just uses nested loops
instead. Larger sizes won't fit on L1 and L2 caches too.

This was noticed while profiling 527.cam4_r. This optimization
makes aer_rad_props slightly faster, but unfortunately it
practically doesn't change 527.cam4_r total execution time.


Full diff: https://github.com/llvm/llvm-project/pull/146408.diff

2 Files Affected:

  • (modified) flang/lib/Optimizer/HLFIR/Transforms/OptimizedBufferization.cpp (+36-7)
  • (modified) flang/test/HLFIR/opt-scalar-assign.fir (+41-12)
diff --git a/flang/lib/Optimizer/HLFIR/Transforms/OptimizedBufferization.cpp b/flang/lib/Optimizer/HLFIR/Transforms/OptimizedBufferization.cpp
index 91df8672c20d9..e88991b801415 100644
--- a/flang/lib/Optimizer/HLFIR/Transforms/OptimizedBufferization.cpp
+++ b/flang/lib/Optimizer/HLFIR/Transforms/OptimizedBufferization.cpp
@@ -786,13 +786,42 @@ llvm::LogicalResult BroadcastAssignBufferization::matchAndRewrite(
   mlir::Value shape = hlfir::genShape(loc, builder, lhs);
   llvm::SmallVector<mlir::Value> extents =
       hlfir::getIndexExtents(loc, builder, shape);
-  hlfir::LoopNest loopNest =
-      hlfir::genLoopNest(loc, builder, extents, /*isUnordered=*/true,
-                         flangomp::shouldUseWorkshareLowering(assign));
-  builder.setInsertionPointToStart(loopNest.body);
-  auto arrayElement =
-      hlfir::getElementAt(loc, builder, lhs, loopNest.oneBasedIndices);
-  builder.create<hlfir::AssignOp>(loc, rhs, arrayElement);
+
+  if (lhs.isSimplyContiguous() && extents.size() > 1) {
+    // Flatten the array to use a single assign loop, that can be better
+    // optimized.
+    mlir::Value n = extents[0];
+    for (size_t i = 1; i < extents.size(); ++i)
+      n = builder.create<mlir::arith::MulIOp>(loc, n, extents[i]);
+    extents = {n};
+    shape = builder.genShape(loc, extents);
+    mlir::Type flatArrayType =
+        fir::ReferenceType::get(fir::SequenceType::get(eleTy, 1));
+    mlir::Value flatArray = lhs.getBase();
+    if (mlir::isa<fir::BoxType>(lhs.getType()))
+      flatArray = builder.create<fir::BoxAddrOp>(loc, flatArray);
+    flatArray = builder.createConvert(loc, flatArrayType, flatArray);
+
+    hlfir::LoopNest loopNest =
+        hlfir::genLoopNest(loc, builder, extents, /*isUnordered=*/true,
+                           flangomp::shouldUseWorkshareLowering(assign));
+    builder.setInsertionPointToStart(loopNest.body);
+
+    mlir::Value coor = builder.create<fir::ArrayCoorOp>(
+        loc, fir::ReferenceType::get(eleTy), flatArray, shape,
+        /*slice=*/mlir::Value{}, loopNest.oneBasedIndices,
+        /*typeparams=*/mlir::ValueRange{});
+    builder.create<fir::StoreOp>(loc, rhs, coor);
+  } else {
+    hlfir::LoopNest loopNest =
+        hlfir::genLoopNest(loc, builder, extents, /*isUnordered=*/true,
+                           flangomp::shouldUseWorkshareLowering(assign));
+    builder.setInsertionPointToStart(loopNest.body);
+    auto arrayElement =
+        hlfir::getElementAt(loc, builder, lhs, loopNest.oneBasedIndices);
+    builder.create<hlfir::AssignOp>(loc, rhs, arrayElement);
+  }
+
   rewriter.eraseOp(assign);
   return mlir::success();
 }
diff --git a/flang/test/HLFIR/opt-scalar-assign.fir b/flang/test/HLFIR/opt-scalar-assign.fir
index 02ab02945b042..0f78d68f17ac8 100644
--- a/flang/test/HLFIR/opt-scalar-assign.fir
+++ b/flang/test/HLFIR/opt-scalar-assign.fir
@@ -12,18 +12,20 @@ func.func @_QPtest1() {
   return
 }
 // CHECK-LABEL:   func.func @_QPtest1() {
-// CHECK:           %[[VAL_0:.*]] = arith.constant 1 : index
-// CHECK:           %[[VAL_1:.*]] = arith.constant 0.000000e+00 : f32
-// CHECK:           %[[VAL_2:.*]] = arith.constant 11 : index
-// CHECK:           %[[VAL_3:.*]] = arith.constant 13 : index
-// CHECK:           %[[VAL_4:.*]] = fir.alloca !fir.array<11x13xf32> {bindc_name = "x", uniq_name = "_QFtest1Ex"}
-// CHECK:           %[[VAL_5:.*]] = fir.shape %[[VAL_2]], %[[VAL_3]] : (index, index) -> !fir.shape<2>
-// CHECK:           %[[VAL_6:.*]]:2 = hlfir.declare %[[VAL_4]](%[[VAL_5]]) {uniq_name = "_QFtest1Ex"} : (!fir.ref<!fir.array<11x13xf32>>, !fir.shape<2>) -> (!fir.ref<!fir.array<11x13xf32>>, !fir.ref<!fir.array<11x13xf32>>)
-// CHECK:           fir.do_loop %[[VAL_7:.*]] = %[[VAL_0]] to %[[VAL_3]] step %[[VAL_0]] unordered {
-// CHECK:             fir.do_loop %[[VAL_8:.*]] = %[[VAL_0]] to %[[VAL_2]] step %[[VAL_0]] unordered {
-// CHECK:               %[[VAL_9:.*]] = hlfir.designate %[[VAL_6]]#0 (%[[VAL_8]], %[[VAL_7]])  : (!fir.ref<!fir.array<11x13xf32>>, index, index) -> !fir.ref<f32>
-// CHECK:               hlfir.assign %[[VAL_1]] to %[[VAL_9]] : f32, !fir.ref<f32>
-// CHECK:             }
+// CHECK:           %[[VAL_0:.*]] = arith.constant 143 : index
+// CHECK:           %[[VAL_1:.*]] = arith.constant 1 : index
+// CHECK:           %[[VAL_2:.*]] = arith.constant 0.000000e+00 : f32
+// CHECK:           %[[VAL_3:.*]] = arith.constant 11 : index
+// CHECK:           %[[VAL_4:.*]] = arith.constant 13 : index
+// CHECK:           %[[VAL_5:.*]] = fir.alloca !fir.array<11x13xf32> {bindc_name = "x", uniq_name = "_QFtest1Ex"}
+// CHECK:           %[[VAL_6:.*]] = fir.shape %[[VAL_3]], %[[VAL_4]] : (index, index) -> !fir.shape<2>
+// CHECK:           %[[VAL_7:.*]]:2 = hlfir.declare %[[VAL_5]](%[[VAL_6]]) {uniq_name = "_QFtest1Ex"} : (!fir.ref<!fir.array<11x13xf32>>, !fir.shape<2>) -> (!fir.ref<!fir.array<11x13xf32>>, !fir.ref<!fir.array<11x13xf32>>)
+
+// CHECK:           %[[VAL_8:.*]] = fir.shape %[[VAL_0]] : (index) -> !fir.shape<1>
+// CHECK:           %[[VAL_9:.*]] = fir.convert %[[VAL_7]]#0 : (!fir.ref<!fir.array<11x13xf32>>) -> !fir.ref<!fir.array<?xf32>>
+// CHECK:           fir.do_loop %[[VAL_10:.*]] = %[[VAL_1]] to %[[VAL_0]] step %[[VAL_1]] unordered {
+// CHECK:             %[[VAL_11:.*]] = fir.array_coor %[[VAL_9]](%[[VAL_8]]) %[[VAL_10]] : (!fir.ref<!fir.array<?xf32>>, !fir.shape<1>, index) -> !fir.ref<f32>
+// CHECK:             fir.store %[[VAL_2]] to %[[VAL_11]] : !fir.ref<f32>
 // CHECK:           }
 // CHECK:           return
 // CHECK:         }
@@ -129,3 +131,30 @@ func.func @_QPtest5(%arg0: !fir.ref<!fir.array<77xcomplex<f32>>> {fir.bindc_name
 // CHECK:           }
 // CHECK:           return
 // CHECK:         }
+
+func.func @_QPtest6(%arg0: !fir.ref<!fir.box<!fir.heap<!fir.array<?x?xi32>>>> {fir.bindc_name = "x"}) {
+  %c0_i32 = arith.constant 0 : i32
+  %0:2 = hlfir.declare %arg0 {fortran_attrs = #fir.var_attrs<allocatable>, uniq_name = "_QFtest6Ex"} : (!fir.ref<!fir.box<!fir.heap<!fir.array<?x?xi32>>>>) -> (!fir.ref<!fir.box<!fir.heap<!fir.array<?x?xi32>>>>, !fir.ref<!fir.box<!fir.heap<!fir.array<?x?xi32>>>>)
+  hlfir.assign %c0_i32 to %0#0 realloc : i32, !fir.ref<!fir.box<!fir.heap<!fir.array<?x?xi32>>>>
+  return
+}
+
+// CHECK-LABEL:   func.func @_QPtest6(
+// CHECK-SAME:                        %[[VAL_0:.*]]: !fir.ref<!fir.box<!fir.heap<!fir.array<?x?xi32>>>> {fir.bindc_name = "x"}) {
+// CHECK:           %[[VAL_1:.*]] = arith.constant 1 : index
+// CHECK:           %[[VAL_2:.*]] = arith.constant 0 : index
+// CHECK:           %[[VAL_3:.*]] = arith.constant 0 : i32
+// CHECK:           %[[VAL_4:.*]]:2 = hlfir.declare %[[VAL_0]] {fortran_attrs = #fir.var_attrs<allocatable>, uniq_name = "_QFtest6Ex"} : (!fir.ref<!fir.box<!fir.heap<!fir.array<?x?xi32>>>>) -> (!fir.ref<!fir.box<!fir.heap<!fir.array<?x?xi32>>>>, !fir.ref<!fir.box<!fir.heap<!fir.array<?x?xi32>>>>)
+// CHECK:           %[[VAL_5:.*]] = fir.load %[[VAL_4]]#0 : !fir.ref<!fir.box<!fir.heap<!fir.array<?x?xi32>>>>
+// CHECK:           %[[VAL_6:.*]]:3 = fir.box_dims %[[VAL_5]], %[[VAL_2]] : (!fir.box<!fir.heap<!fir.array<?x?xi32>>>, index) -> (index, index, index)
+// CHECK:           %[[VAL_7:.*]]:3 = fir.box_dims %[[VAL_5]], %[[VAL_1]] : (!fir.box<!fir.heap<!fir.array<?x?xi32>>>, index) -> (index, index, index)
+// CHECK:           %[[VAL_8:.*]] = arith.muli %[[VAL_6]]#1, %[[VAL_7]]#1 : index
+// CHECK:           %[[VAL_9:.*]] = fir.shape %[[VAL_8]] : (index) -> !fir.shape<1>
+// CHECK:           %[[VAL_10:.*]] = fir.box_addr %[[VAL_5]] : (!fir.box<!fir.heap<!fir.array<?x?xi32>>>) -> !fir.heap<!fir.array<?x?xi32>>
+// CHECK:           %[[VAL_11:.*]] = fir.convert %[[VAL_10]] : (!fir.heap<!fir.array<?x?xi32>>) -> !fir.ref<!fir.array<?xi32>>
+// CHECK:           fir.do_loop %[[VAL_12:.*]] = %[[VAL_1]] to %[[VAL_8]] step %[[VAL_1]] unordered {
+// CHECK:             %[[VAL_13:.*]] = fir.array_coor %[[VAL_11]](%[[VAL_9]]) %[[VAL_12]] : (!fir.ref<!fir.array<?xi32>>, !fir.shape<1>, index) -> !fir.ref<i32>
+// CHECK:             fir.store %[[VAL_3]] to %[[VAL_13]] : !fir.ref<i32>
+// CHECK:           }
+// CHECK:           return
+// CHECK:         }

@luporl luporl requested a review from vzakhari June 30, 2025 19:15
Copy link
Contributor

@vzakhari vzakhari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the changes! It looks good to me, though I would prefer keeping HLFIR operations at this stage (please see my comments inlined).

Copy link
Contributor

@vzakhari vzakhari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update!


bool isArrayRef =
mlir::isa<fir::SequenceType>(fir::unwrapRefType(lhs.getType()));
if (lhs.isSimplyContiguous() && extents.size() > 1 &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't lhs.isSimplyContiguous() always true for the allocatable arrays? What is the reason for adding isAllocatableArray method?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, lhs.isSimplyContiguous() is always true for allocatable arrays.

I've added isAllocatableArray because the code now handles only !fir.ref<!fir.array> and !fir.box<!fir.heap<!fir.array>> types.

I guess it should also handle !fir.box<!fir.array>, when those are contiguous. They may appear when using non-default lower bounds.

By handling these cases, do you think it would be safe to drop the isAllocatableArray check? Or are there other array types that also pass the isSimplyContiguous check?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe isSimplyContiguous may also return true for !fir.box<!fir.ptr<.... I do not think you need to explicitly handle all these different cases. There is hlfir::derefPointersAndAllocatables call above. If lhs is a box, you may use the box_addr operation to get the base address of the array, otherwise, lhs is the base address of the array already.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Removing the isAllocatableArray check also made it easier to support !fir.box<!fir.array>.
Now all boxed arrays are converted to !fir.box<!fir.array<?x type>>

Copy link
Contributor

@vzakhari vzakhari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! LGTM

@luporl
Copy link
Contributor Author

luporl commented Jul 4, 2025

Thanks for the review!

@luporl luporl merged commit 0ba5958 into llvm:main Jul 4, 2025
9 checks passed
@luporl luporl deleted the luporl-opt-array-ini branch July 4, 2025 11:49
@eugeneepshteyn
Copy link
Contributor

eugeneepshteyn commented Jul 7, 2025

@luporl , this change seems to cause regression with a test in llvm-test-suite:

$ flang -c llvm-test-suite/Fortran/gfortran/regression/ISO_Fortran_binding_1.f90 -O1
flang: llvm-project/llvm/include/llvm/Support/Casting.h:572: decltype(auto) llvm::cast(From &) [To = mlir::arith::ConstantOp, From = mlir::Operation]: Assertion `isa<To>(Val) && "cast<Ty>() argument of incompatible type!"' failed.

The same failure can be observed in various apps that use C interop in similar fashion.

This failure only happens at -O1 and above.

The failure is gone if I revert this patch in my local workspace.

Could you please check?

luporl added a commit to luporl/llvm-project that referenced this pull request Jul 7, 2025
Host associated variables were not being handled properly.
For array references, get the fixed shape extents from the value
type instead, that works correctly in all cases.
@luporl
Copy link
Contributor Author

luporl commented Jul 7, 2025

@eugeneepshteyn, the problem is actually related to host associated array variables.
This is a reduced reproducer:

  integer, dimension(2,2) :: a

contains
  subroutine test
    a = 0
  end subroutine
end

ISO_Fortran_binding_1.f90 test seems to be disabled, and the buildbots I've checked are not failing in llvm-test-suite.
Was this regression detected with a custom run of llvm-test-suite?

In any case, I'm finishing a PR to fix this issue.

@eugeneepshteyn
Copy link
Contributor

It was detected by a custom run of a test from llvm-test-sute.

Should probably check why ISO_Fortran_binding_1.f90 is disabled. Perhaps it could be enabled now.

Do buildbots also run tests at -O1 and higher optimization levels?

@luporl
Copy link
Contributor Author

luporl commented Jul 7, 2025

Should probably check why ISO_Fortran_binding_1.f90 is disabled. Perhaps it could be enabled now.

It's listed in Fortran/gfortran/regression/DisabledFiles.cmake. The comment above it says:
# unimplemented: assumed rank in procedure interface.

Do buildbots also run tests at -O1 and higher optimization levels?

Probably. Some explicitly add -O3 to CMAKE_Fortran_FLAGS, like https://lab.llvm.org/buildbot/#/builders/17/builds/9391/steps/14/logs/stdio, but it seems the default, for gfortran test suite, is also -O3 if flags are not set.

There are also some tests that explicitly set the optimization level, as some are intended for -O0 and others for optimized code.

@eugeneepshteyn
Copy link
Contributor

The comment for ISO_Fortran_binding_1.f90 may be wrong at this point. I'll look into enabling this test.

luporl added a commit that referenced this pull request Jul 8, 2025
Host associated variables were not being handled properly.
For array references, get the fixed shape extents from the value
type instead, that works correctly in all cases.
luporl added a commit to luporl/llvm-project that referenced this pull request Oct 15, 2025
…#147371)

Host associated variables were not being handled properly.
For array references, get the fixed shape extents from the value
type instead, that works correctly in all cases.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

flang:fir-hlfir flang Flang issues not falling into any other category

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants