[flang] Optimize assignments of multidimensional arrays #146408
Conversation
Assignments of n-dimensional arrays, with trivial RHS, were always being converted to n nested loops. For contiguous arrays, it's possible to flatten them and use a single loop, which can usually be better optimized by LLVM.

In a test program, using a 3-dimensional array and varying its size, the resulting speedup was as follows (measured on Graviton4):

16K   1.09
64K   1.40
128K  1.90
256K  1.91
512K  1.00

For sizes at or above 512K no improvement was observed. It looks like LLVM stops trying to perform aggressive loop unrolling past a certain threshold and just uses nested loops instead. Larger sizes also won't fit in the L1 and L2 caches.

This was noticed while profiling 527.cam4_r. This optimization makes aer_rad_props_sw slightly faster, but unfortunately it practically doesn't change 527.cam4_r's total execution time.
@llvm/pr-subscribers-flang-fir-hlfir

Author: Leandro Lupori (luporl)

Full diff: https://github.com/llvm/llvm-project/pull/146408.diff — 2 Files Affected:
diff --git a/flang/lib/Optimizer/HLFIR/Transforms/OptimizedBufferization.cpp b/flang/lib/Optimizer/HLFIR/Transforms/OptimizedBufferization.cpp
index 91df8672c20d9..e88991b801415 100644
--- a/flang/lib/Optimizer/HLFIR/Transforms/OptimizedBufferization.cpp
+++ b/flang/lib/Optimizer/HLFIR/Transforms/OptimizedBufferization.cpp
@@ -786,13 +786,42 @@ llvm::LogicalResult BroadcastAssignBufferization::matchAndRewrite(
mlir::Value shape = hlfir::genShape(loc, builder, lhs);
llvm::SmallVector<mlir::Value> extents =
hlfir::getIndexExtents(loc, builder, shape);
- hlfir::LoopNest loopNest =
- hlfir::genLoopNest(loc, builder, extents, /*isUnordered=*/true,
- flangomp::shouldUseWorkshareLowering(assign));
- builder.setInsertionPointToStart(loopNest.body);
- auto arrayElement =
- hlfir::getElementAt(loc, builder, lhs, loopNest.oneBasedIndices);
- builder.create<hlfir::AssignOp>(loc, rhs, arrayElement);
+
+ if (lhs.isSimplyContiguous() && extents.size() > 1) {
+ // Flatten the array to use a single assign loop, that can be better
+ // optimized.
+ mlir::Value n = extents[0];
+ for (size_t i = 1; i < extents.size(); ++i)
+ n = builder.create<mlir::arith::MulIOp>(loc, n, extents[i]);
+ extents = {n};
+ shape = builder.genShape(loc, extents);
+ mlir::Type flatArrayType =
+ fir::ReferenceType::get(fir::SequenceType::get(eleTy, 1));
+ mlir::Value flatArray = lhs.getBase();
+ if (mlir::isa<fir::BoxType>(lhs.getType()))
+ flatArray = builder.create<fir::BoxAddrOp>(loc, flatArray);
+ flatArray = builder.createConvert(loc, flatArrayType, flatArray);
+
+ hlfir::LoopNest loopNest =
+ hlfir::genLoopNest(loc, builder, extents, /*isUnordered=*/true,
+ flangomp::shouldUseWorkshareLowering(assign));
+ builder.setInsertionPointToStart(loopNest.body);
+
+ mlir::Value coor = builder.create<fir::ArrayCoorOp>(
+ loc, fir::ReferenceType::get(eleTy), flatArray, shape,
+ /*slice=*/mlir::Value{}, loopNest.oneBasedIndices,
+ /*typeparams=*/mlir::ValueRange{});
+ builder.create<fir::StoreOp>(loc, rhs, coor);
+ } else {
+ hlfir::LoopNest loopNest =
+ hlfir::genLoopNest(loc, builder, extents, /*isUnordered=*/true,
+ flangomp::shouldUseWorkshareLowering(assign));
+ builder.setInsertionPointToStart(loopNest.body);
+ auto arrayElement =
+ hlfir::getElementAt(loc, builder, lhs, loopNest.oneBasedIndices);
+ builder.create<hlfir::AssignOp>(loc, rhs, arrayElement);
+ }
+
rewriter.eraseOp(assign);
return mlir::success();
}
diff --git a/flang/test/HLFIR/opt-scalar-assign.fir b/flang/test/HLFIR/opt-scalar-assign.fir
index 02ab02945b042..0f78d68f17ac8 100644
--- a/flang/test/HLFIR/opt-scalar-assign.fir
+++ b/flang/test/HLFIR/opt-scalar-assign.fir
@@ -12,18 +12,20 @@ func.func @_QPtest1() {
return
}
// CHECK-LABEL: func.func @_QPtest1() {
-// CHECK: %[[VAL_0:.*]] = arith.constant 1 : index
-// CHECK: %[[VAL_1:.*]] = arith.constant 0.000000e+00 : f32
-// CHECK: %[[VAL_2:.*]] = arith.constant 11 : index
-// CHECK: %[[VAL_3:.*]] = arith.constant 13 : index
-// CHECK: %[[VAL_4:.*]] = fir.alloca !fir.array<11x13xf32> {bindc_name = "x", uniq_name = "_QFtest1Ex"}
-// CHECK: %[[VAL_5:.*]] = fir.shape %[[VAL_2]], %[[VAL_3]] : (index, index) -> !fir.shape<2>
-// CHECK: %[[VAL_6:.*]]:2 = hlfir.declare %[[VAL_4]](%[[VAL_5]]) {uniq_name = "_QFtest1Ex"} : (!fir.ref<!fir.array<11x13xf32>>, !fir.shape<2>) -> (!fir.ref<!fir.array<11x13xf32>>, !fir.ref<!fir.array<11x13xf32>>)
-// CHECK: fir.do_loop %[[VAL_7:.*]] = %[[VAL_0]] to %[[VAL_3]] step %[[VAL_0]] unordered {
-// CHECK: fir.do_loop %[[VAL_8:.*]] = %[[VAL_0]] to %[[VAL_2]] step %[[VAL_0]] unordered {
-// CHECK: %[[VAL_9:.*]] = hlfir.designate %[[VAL_6]]#0 (%[[VAL_8]], %[[VAL_7]]) : (!fir.ref<!fir.array<11x13xf32>>, index, index) -> !fir.ref<f32>
-// CHECK: hlfir.assign %[[VAL_1]] to %[[VAL_9]] : f32, !fir.ref<f32>
-// CHECK: }
+// CHECK: %[[VAL_0:.*]] = arith.constant 143 : index
+// CHECK: %[[VAL_1:.*]] = arith.constant 1 : index
+// CHECK: %[[VAL_2:.*]] = arith.constant 0.000000e+00 : f32
+// CHECK: %[[VAL_3:.*]] = arith.constant 11 : index
+// CHECK: %[[VAL_4:.*]] = arith.constant 13 : index
+// CHECK: %[[VAL_5:.*]] = fir.alloca !fir.array<11x13xf32> {bindc_name = "x", uniq_name = "_QFtest1Ex"}
+// CHECK: %[[VAL_6:.*]] = fir.shape %[[VAL_3]], %[[VAL_4]] : (index, index) -> !fir.shape<2>
+// CHECK: %[[VAL_7:.*]]:2 = hlfir.declare %[[VAL_5]](%[[VAL_6]]) {uniq_name = "_QFtest1Ex"} : (!fir.ref<!fir.array<11x13xf32>>, !fir.shape<2>) -> (!fir.ref<!fir.array<11x13xf32>>, !fir.ref<!fir.array<11x13xf32>>)
+
+// CHECK: %[[VAL_8:.*]] = fir.shape %[[VAL_0]] : (index) -> !fir.shape<1>
+// CHECK: %[[VAL_9:.*]] = fir.convert %[[VAL_7]]#0 : (!fir.ref<!fir.array<11x13xf32>>) -> !fir.ref<!fir.array<?xf32>>
+// CHECK: fir.do_loop %[[VAL_10:.*]] = %[[VAL_1]] to %[[VAL_0]] step %[[VAL_1]] unordered {
+// CHECK: %[[VAL_11:.*]] = fir.array_coor %[[VAL_9]](%[[VAL_8]]) %[[VAL_10]] : (!fir.ref<!fir.array<?xf32>>, !fir.shape<1>, index) -> !fir.ref<f32>
+// CHECK: fir.store %[[VAL_2]] to %[[VAL_11]] : !fir.ref<f32>
// CHECK: }
// CHECK: return
// CHECK: }
@@ -129,3 +131,30 @@ func.func @_QPtest5(%arg0: !fir.ref<!fir.array<77xcomplex<f32>>> {fir.bindc_name
// CHECK: }
// CHECK: return
// CHECK: }
+
+func.func @_QPtest6(%arg0: !fir.ref<!fir.box<!fir.heap<!fir.array<?x?xi32>>>> {fir.bindc_name = "x"}) {
+ %c0_i32 = arith.constant 0 : i32
+ %0:2 = hlfir.declare %arg0 {fortran_attrs = #fir.var_attrs<allocatable>, uniq_name = "_QFtest6Ex"} : (!fir.ref<!fir.box<!fir.heap<!fir.array<?x?xi32>>>>) -> (!fir.ref<!fir.box<!fir.heap<!fir.array<?x?xi32>>>>, !fir.ref<!fir.box<!fir.heap<!fir.array<?x?xi32>>>>)
+ hlfir.assign %c0_i32 to %0#0 realloc : i32, !fir.ref<!fir.box<!fir.heap<!fir.array<?x?xi32>>>>
+ return
+}
+
+// CHECK-LABEL: func.func @_QPtest6(
+// CHECK-SAME: %[[VAL_0:.*]]: !fir.ref<!fir.box<!fir.heap<!fir.array<?x?xi32>>>> {fir.bindc_name = "x"}) {
+// CHECK: %[[VAL_1:.*]] = arith.constant 1 : index
+// CHECK: %[[VAL_2:.*]] = arith.constant 0 : index
+// CHECK: %[[VAL_3:.*]] = arith.constant 0 : i32
+// CHECK: %[[VAL_4:.*]]:2 = hlfir.declare %[[VAL_0]] {fortran_attrs = #fir.var_attrs<allocatable>, uniq_name = "_QFtest6Ex"} : (!fir.ref<!fir.box<!fir.heap<!fir.array<?x?xi32>>>>) -> (!fir.ref<!fir.box<!fir.heap<!fir.array<?x?xi32>>>>, !fir.ref<!fir.box<!fir.heap<!fir.array<?x?xi32>>>>)
+// CHECK: %[[VAL_5:.*]] = fir.load %[[VAL_4]]#0 : !fir.ref<!fir.box<!fir.heap<!fir.array<?x?xi32>>>>
+// CHECK: %[[VAL_6:.*]]:3 = fir.box_dims %[[VAL_5]], %[[VAL_2]] : (!fir.box<!fir.heap<!fir.array<?x?xi32>>>, index) -> (index, index, index)
+// CHECK: %[[VAL_7:.*]]:3 = fir.box_dims %[[VAL_5]], %[[VAL_1]] : (!fir.box<!fir.heap<!fir.array<?x?xi32>>>, index) -> (index, index, index)
+// CHECK: %[[VAL_8:.*]] = arith.muli %[[VAL_6]]#1, %[[VAL_7]]#1 : index
+// CHECK: %[[VAL_9:.*]] = fir.shape %[[VAL_8]] : (index) -> !fir.shape<1>
+// CHECK: %[[VAL_10:.*]] = fir.box_addr %[[VAL_5]] : (!fir.box<!fir.heap<!fir.array<?x?xi32>>>) -> !fir.heap<!fir.array<?x?xi32>>
+// CHECK: %[[VAL_11:.*]] = fir.convert %[[VAL_10]] : (!fir.heap<!fir.array<?x?xi32>>) -> !fir.ref<!fir.array<?xi32>>
+// CHECK: fir.do_loop %[[VAL_12:.*]] = %[[VAL_1]] to %[[VAL_8]] step %[[VAL_1]] unordered {
+// CHECK: %[[VAL_13:.*]] = fir.array_coor %[[VAL_11]](%[[VAL_9]]) %[[VAL_12]] : (!fir.ref<!fir.array<?xi32>>, !fir.shape<1>, index) -> !fir.ref<i32>
+// CHECK: fir.store %[[VAL_3]] to %[[VAL_13]] : !fir.ref<i32>
+// CHECK: }
+// CHECK: return
+// CHECK: }
vzakhari
left a comment
Thank you for the changes! It looks good to me, though I would prefer keeping HLFIR operations at this stage (please see my comments inlined).
flang/lib/Optimizer/HLFIR/Transforms/OptimizedBufferization.cpp (two resolved inline comments)
vzakhari
left a comment
Thanks for the update!
bool isArrayRef =
    mlir::isa<fir::SequenceType>(fir::unwrapRefType(lhs.getType()));
if (lhs.isSimplyContiguous() && extents.size() > 1 &&
Isn't lhs.isSimplyContiguous() always true for the allocatable arrays? What is the reason for adding isAllocatableArray method?
Yes, lhs.isSimplyContiguous() is always true for allocatable arrays.
I've added isAllocatableArray because the code now handles only !fir.ref<!fir.array> and !fir.box<!fir.heap<!fir.array>> types.
I guess it should also handle !fir.box<!fir.array>, when those are contiguous. They may appear when using non-default lower bounds.
By handling these cases, do you think it would be safe to drop the isAllocatableArray check? Or are there other array types that also pass the isSimplyContiguous check?
I believe isSimplyContiguous may also return true for !fir.box<!fir.ptr<.... I do not think you need to explicitly handle all these different cases. There is hlfir::derefPointersAndAllocatables call above. If lhs is a box, you may use the box_addr operation to get the base address of the array, otherwise, lhs is the base address of the array already.
Done. Removing the isAllocatableArray check also made it easier to support !fir.box<!fir.array>.
Now all boxed arrays are converted to !fir.box<!fir.array<?x type>>.
flang/lib/Optimizer/HLFIR/Transforms/OptimizedBufferization.cpp (resolved inline comment)
vzakhari
left a comment
Thank you! LGTM
Thanks for the review!
@luporl, this change seems to cause a regression with a test in llvm-test-suite:

The same failure can be observed in various apps that use C interop in similar fashion. This failure only happens at

The failure is gone if I revert this patch in my local workspace. Could you please check?
Host associated variables were not being handled properly. For array references, get the fixed shape extents from the value type instead, which works correctly in all cases.
@eugeneepshteyn, the problem is actually related to host associated array variables:

integer, dimension(2,2) :: a
contains
subroutine test
  a = 0
end subroutine
end

The ISO_Fortran_binding_1.f90 test seems to be disabled, and the buildbots I've checked are not failing in llvm-test-suite. In any case, I'm finishing a PR to fix this issue.
It was detected by a custom run of a test from llvm-test-suite. Should probably check why ISO_Fortran_binding_1.f90 is disabled. Perhaps it could be enabled now. Do buildbots also run tests at
It's listed in Fortran/gfortran/regression/DisabledFiles.cmake. The comment above it says:

Probably. Some explicitly add

There are also some tests that explicitly set the optimization level, as some are intended for
The comment for ISO_Fortran_binding_1.f90 may be wrong at this point. I'll look into enabling this test.
…#147371) Host associated variables were not being handled properly. For array references, get the fixed shape extents from the value type instead, which works correctly in all cases.