[flang] Optimize assignments of multidimensional arrays #146408
Conversation
Assignments of n-dimensional arrays, with trivial RHS, were always being converted to n nested loops. For contiguous arrays, it's possible to flatten them and use a single loop, which can usually be better optimized by LLVM.

In a test program, using a 3-dimensional array and varying its size, the resulting speedup was as follows (measured on Graviton4):

16K   1.09
64K   1.40
128K  1.90
256K  1.91
512K  1.00

For sizes at or above 512K no improvement was observed. It looks like LLVM stops trying to perform aggressive loop unrolling past a certain threshold and just uses nested loops instead. Larger sizes also won't fit in the L1 and L2 caches.

This was noticed while profiling 527.cam4_r. This optimization makes aer_rad_props_sw slightly faster, but unfortunately it practically doesn't change 527.cam4_r's total execution time.
@llvm/pr-subscribers-flang-fir-hlfir

Author: Leandro Lupori (luporl)

Full diff: https://github.com/llvm/llvm-project/pull/146408.diff — 2 Files Affected:
diff --git a/flang/lib/Optimizer/HLFIR/Transforms/OptimizedBufferization.cpp b/flang/lib/Optimizer/HLFIR/Transforms/OptimizedBufferization.cpp
index 91df8672c20d9..e88991b801415 100644
--- a/flang/lib/Optimizer/HLFIR/Transforms/OptimizedBufferization.cpp
+++ b/flang/lib/Optimizer/HLFIR/Transforms/OptimizedBufferization.cpp
@@ -786,13 +786,42 @@ llvm::LogicalResult BroadcastAssignBufferization::matchAndRewrite(
mlir::Value shape = hlfir::genShape(loc, builder, lhs);
llvm::SmallVector<mlir::Value> extents =
hlfir::getIndexExtents(loc, builder, shape);
- hlfir::LoopNest loopNest =
- hlfir::genLoopNest(loc, builder, extents, /*isUnordered=*/true,
- flangomp::shouldUseWorkshareLowering(assign));
- builder.setInsertionPointToStart(loopNest.body);
- auto arrayElement =
- hlfir::getElementAt(loc, builder, lhs, loopNest.oneBasedIndices);
- builder.create<hlfir::AssignOp>(loc, rhs, arrayElement);
+
+ if (lhs.isSimplyContiguous() && extents.size() > 1) {
+ // Flatten the array to use a single assign loop, that can be better
+ // optimized.
+ mlir::Value n = extents[0];
+ for (size_t i = 1; i < extents.size(); ++i)
+ n = builder.create<mlir::arith::MulIOp>(loc, n, extents[i]);
+ extents = {n};
+ shape = builder.genShape(loc, extents);
+ mlir::Type flatArrayType =
+ fir::ReferenceType::get(fir::SequenceType::get(eleTy, 1));
+ mlir::Value flatArray = lhs.getBase();
+ if (mlir::isa<fir::BoxType>(lhs.getType()))
+ flatArray = builder.create<fir::BoxAddrOp>(loc, flatArray);
+ flatArray = builder.createConvert(loc, flatArrayType, flatArray);
+
+ hlfir::LoopNest loopNest =
+ hlfir::genLoopNest(loc, builder, extents, /*isUnordered=*/true,
+ flangomp::shouldUseWorkshareLowering(assign));
+ builder.setInsertionPointToStart(loopNest.body);
+
+ mlir::Value coor = builder.create<fir::ArrayCoorOp>(
+ loc, fir::ReferenceType::get(eleTy), flatArray, shape,
+ /*slice=*/mlir::Value{}, loopNest.oneBasedIndices,
+ /*typeparams=*/mlir::ValueRange{});
+ builder.create<fir::StoreOp>(loc, rhs, coor);
+ } else {
+ hlfir::LoopNest loopNest =
+ hlfir::genLoopNest(loc, builder, extents, /*isUnordered=*/true,
+ flangomp::shouldUseWorkshareLowering(assign));
+ builder.setInsertionPointToStart(loopNest.body);
+ auto arrayElement =
+ hlfir::getElementAt(loc, builder, lhs, loopNest.oneBasedIndices);
+ builder.create<hlfir::AssignOp>(loc, rhs, arrayElement);
+ }
+
rewriter.eraseOp(assign);
return mlir::success();
}
diff --git a/flang/test/HLFIR/opt-scalar-assign.fir b/flang/test/HLFIR/opt-scalar-assign.fir
index 02ab02945b042..0f78d68f17ac8 100644
--- a/flang/test/HLFIR/opt-scalar-assign.fir
+++ b/flang/test/HLFIR/opt-scalar-assign.fir
@@ -12,18 +12,20 @@ func.func @_QPtest1() {
return
}
// CHECK-LABEL: func.func @_QPtest1() {
-// CHECK: %[[VAL_0:.*]] = arith.constant 1 : index
-// CHECK: %[[VAL_1:.*]] = arith.constant 0.000000e+00 : f32
-// CHECK: %[[VAL_2:.*]] = arith.constant 11 : index
-// CHECK: %[[VAL_3:.*]] = arith.constant 13 : index
-// CHECK: %[[VAL_4:.*]] = fir.alloca !fir.array<11x13xf32> {bindc_name = "x", uniq_name = "_QFtest1Ex"}
-// CHECK: %[[VAL_5:.*]] = fir.shape %[[VAL_2]], %[[VAL_3]] : (index, index) -> !fir.shape<2>
-// CHECK: %[[VAL_6:.*]]:2 = hlfir.declare %[[VAL_4]](%[[VAL_5]]) {uniq_name = "_QFtest1Ex"} : (!fir.ref<!fir.array<11x13xf32>>, !fir.shape<2>) -> (!fir.ref<!fir.array<11x13xf32>>, !fir.ref<!fir.array<11x13xf32>>)
-// CHECK: fir.do_loop %[[VAL_7:.*]] = %[[VAL_0]] to %[[VAL_3]] step %[[VAL_0]] unordered {
-// CHECK: fir.do_loop %[[VAL_8:.*]] = %[[VAL_0]] to %[[VAL_2]] step %[[VAL_0]] unordered {
-// CHECK: %[[VAL_9:.*]] = hlfir.designate %[[VAL_6]]#0 (%[[VAL_8]], %[[VAL_7]]) : (!fir.ref<!fir.array<11x13xf32>>, index, index) -> !fir.ref<f32>
-// CHECK: hlfir.assign %[[VAL_1]] to %[[VAL_9]] : f32, !fir.ref<f32>
-// CHECK: }
+// CHECK: %[[VAL_0:.*]] = arith.constant 143 : index
+// CHECK: %[[VAL_1:.*]] = arith.constant 1 : index
+// CHECK: %[[VAL_2:.*]] = arith.constant 0.000000e+00 : f32
+// CHECK: %[[VAL_3:.*]] = arith.constant 11 : index
+// CHECK: %[[VAL_4:.*]] = arith.constant 13 : index
+// CHECK: %[[VAL_5:.*]] = fir.alloca !fir.array<11x13xf32> {bindc_name = "x", uniq_name = "_QFtest1Ex"}
+// CHECK: %[[VAL_6:.*]] = fir.shape %[[VAL_3]], %[[VAL_4]] : (index, index) -> !fir.shape<2>
+// CHECK: %[[VAL_7:.*]]:2 = hlfir.declare %[[VAL_5]](%[[VAL_6]]) {uniq_name = "_QFtest1Ex"} : (!fir.ref<!fir.array<11x13xf32>>, !fir.shape<2>) -> (!fir.ref<!fir.array<11x13xf32>>, !fir.ref<!fir.array<11x13xf32>>)
+
+// CHECK: %[[VAL_8:.*]] = fir.shape %[[VAL_0]] : (index) -> !fir.shape<1>
+// CHECK: %[[VAL_9:.*]] = fir.convert %[[VAL_7]]#0 : (!fir.ref<!fir.array<11x13xf32>>) -> !fir.ref<!fir.array<?xf32>>
+// CHECK: fir.do_loop %[[VAL_10:.*]] = %[[VAL_1]] to %[[VAL_0]] step %[[VAL_1]] unordered {
+// CHECK: %[[VAL_11:.*]] = fir.array_coor %[[VAL_9]](%[[VAL_8]]) %[[VAL_10]] : (!fir.ref<!fir.array<?xf32>>, !fir.shape<1>, index) -> !fir.ref<f32>
+// CHECK: fir.store %[[VAL_2]] to %[[VAL_11]] : !fir.ref<f32>
// CHECK: }
// CHECK: return
// CHECK: }
@@ -129,3 +131,30 @@ func.func @_QPtest5(%arg0: !fir.ref<!fir.array<77xcomplex<f32>>> {fir.bindc_name
// CHECK: }
// CHECK: return
// CHECK: }
+
+func.func @_QPtest6(%arg0: !fir.ref<!fir.box<!fir.heap<!fir.array<?x?xi32>>>> {fir.bindc_name = "x"}) {
+ %c0_i32 = arith.constant 0 : i32
+ %0:2 = hlfir.declare %arg0 {fortran_attrs = #fir.var_attrs<allocatable>, uniq_name = "_QFtest6Ex"} : (!fir.ref<!fir.box<!fir.heap<!fir.array<?x?xi32>>>>) -> (!fir.ref<!fir.box<!fir.heap<!fir.array<?x?xi32>>>>, !fir.ref<!fir.box<!fir.heap<!fir.array<?x?xi32>>>>)
+ hlfir.assign %c0_i32 to %0#0 realloc : i32, !fir.ref<!fir.box<!fir.heap<!fir.array<?x?xi32>>>>
+ return
+}
+
+// CHECK-LABEL: func.func @_QPtest6(
+// CHECK-SAME: %[[VAL_0:.*]]: !fir.ref<!fir.box<!fir.heap<!fir.array<?x?xi32>>>> {fir.bindc_name = "x"}) {
+// CHECK: %[[VAL_1:.*]] = arith.constant 1 : index
+// CHECK: %[[VAL_2:.*]] = arith.constant 0 : index
+// CHECK: %[[VAL_3:.*]] = arith.constant 0 : i32
+// CHECK: %[[VAL_4:.*]]:2 = hlfir.declare %[[VAL_0]] {fortran_attrs = #fir.var_attrs<allocatable>, uniq_name = "_QFtest6Ex"} : (!fir.ref<!fir.box<!fir.heap<!fir.array<?x?xi32>>>>) -> (!fir.ref<!fir.box<!fir.heap<!fir.array<?x?xi32>>>>, !fir.ref<!fir.box<!fir.heap<!fir.array<?x?xi32>>>>)
+// CHECK: %[[VAL_5:.*]] = fir.load %[[VAL_4]]#0 : !fir.ref<!fir.box<!fir.heap<!fir.array<?x?xi32>>>>
+// CHECK: %[[VAL_6:.*]]:3 = fir.box_dims %[[VAL_5]], %[[VAL_2]] : (!fir.box<!fir.heap<!fir.array<?x?xi32>>>, index) -> (index, index, index)
+// CHECK: %[[VAL_7:.*]]:3 = fir.box_dims %[[VAL_5]], %[[VAL_1]] : (!fir.box<!fir.heap<!fir.array<?x?xi32>>>, index) -> (index, index, index)
+// CHECK: %[[VAL_8:.*]] = arith.muli %[[VAL_6]]#1, %[[VAL_7]]#1 : index
+// CHECK: %[[VAL_9:.*]] = fir.shape %[[VAL_8]] : (index) -> !fir.shape<1>
+// CHECK: %[[VAL_10:.*]] = fir.box_addr %[[VAL_5]] : (!fir.box<!fir.heap<!fir.array<?x?xi32>>>) -> !fir.heap<!fir.array<?x?xi32>>
+// CHECK: %[[VAL_11:.*]] = fir.convert %[[VAL_10]] : (!fir.heap<!fir.array<?x?xi32>>) -> !fir.ref<!fir.array<?xi32>>
+// CHECK: fir.do_loop %[[VAL_12:.*]] = %[[VAL_1]] to %[[VAL_8]] step %[[VAL_1]] unordered {
+// CHECK: %[[VAL_13:.*]] = fir.array_coor %[[VAL_11]](%[[VAL_9]]) %[[VAL_12]] : (!fir.ref<!fir.array<?xi32>>, !fir.shape<1>, index) -> !fir.ref<i32>
+// CHECK: fir.store %[[VAL_3]] to %[[VAL_13]] : !fir.ref<i32>
+// CHECK: }
+// CHECK: return
+// CHECK: }
vzakhari
left a comment
Thank you for the changes! It looks good to me, though I would prefer keeping HLFIR operations at this stage (please see my comments inlined).
flang/lib/Optimizer/HLFIR/Transforms/OptimizedBufferization.cpp (two resolved inline comments)
vzakhari
left a comment
Thanks for the update!
bool isArrayRef =
    mlir::isa<fir::SequenceType>(fir::unwrapRefType(lhs.getType()));
if (lhs.isSimplyContiguous() && extents.size() > 1 &&
Isn't lhs.isSimplyContiguous() always true for the allocatable arrays? What is the reason for adding isAllocatableArray method?
Yes, lhs.isSimplyContiguous() is always true for allocatable arrays.
I've added isAllocatableArray because the code now handles only !fir.ref<!fir.array> and !fir.box<!fir.heap<!fir.array>> types.
I guess it should also handle !fir.box<!fir.array>, when those are contiguous. They may appear when using non-default lower bounds.
By handling these cases, do you think it would be safe to drop the isAllocatableArray check? Or are there other array types that also pass the isSimplyContiguous check?
I believe isSimplyContiguous may also return true for !fir.box<!fir.ptr<.... I do not think you need to explicitly handle all these different cases. There is hlfir::derefPointersAndAllocatables call above. If lhs is a box, you may use the box_addr operation to get the base address of the array, otherwise, lhs is the base address of the array already.
Done. Removing the isAllocatableArray check also made it easier to support !fir.box<!fir.array>.
Now all boxed arrays are converted to !fir.box<!fir.array<?x type>>.
flang/lib/Optimizer/HLFIR/Transforms/OptimizedBufferization.cpp (resolved inline comment)
vzakhari
left a comment
Thank you! LGTM
Thanks for the review!
@luporl, this change seems to cause a regression with a test in llvm-test-suite:

The same failure can be observed in various apps that use C interop in similar fashion. This failure only happens at

The failure is gone if I revert this patch in my local workspace. Could you please check?
Host associated variables were not being handled properly. For array references, get the fixed shape extents from the value type instead, which works correctly in all cases.
@eugeneepshteyn, the problem is actually related to host associated array variables:

integer, dimension(2,2) :: a
contains
subroutine test
  a = 0
end subroutine
end

The ISO_Fortran_binding_1.f90 test seems to be disabled, and the buildbots I've checked are not failing in llvm-test-suite. In any case, I'm finishing a PR to fix this issue.
It was detected by a custom run of a test from llvm-test-suite. Should probably check why ISO_Fortran_binding_1.f90 is disabled. Perhaps it could be enabled now. Do buildbots also run tests at
It's listed in Fortran/gfortran/regression/DisabledFiles.cmake. The comment above it says:

Probably. Some explicitly add

There are also some tests that explicitly set the optimization level, as some are intended for
The comment for ISO_Fortran_binding_1.f90 may be wrong at this point. I'll look into enabling this test.
…#147371) Host associated variables were not being handled properly. For array references, get the fixed shape extents from the value type instead, which works correctly in all cases.