[Flang][OMP] Replace SUM intrinsic call with SUM operations #113082
Conversation
- Add custom omp loop wrapper
- Add recursive memory effects trait to workshare
- Remove stray include
- Remove omp.workshare verifier
- Add assembly format for wrapper and add test
- Add verification and descriptions
- Fix lower test for workshare
- Emit loop nests in a custom wrapper
- Only emit unordered loops as omp loops
- Fix uninitialized memory bug in genLoopNest
- Change to workshare loop wrapper op
- Move single op declaration
- Schedule pass properly
- Correctly handle nested loop nests to be parallelized by workshare
- Leave comments for shouldUseWorkshareLowering
- Use copyprivate to scatter val from omp.single (TODO: still need to implement copy function; TODO: transitive check for usage outside of omp.single not implemented yet)
- Transitively check for users outside of single op (TODO: need to implement copy func; TODO: need to hoist allocas outside of single regions)
- Add tests
- Hoist allocas
- More tests
- Emit body for copy func
- Test the tmp storing logic
- Clean up trivially dead ops
- Only handle single-block regions for now
- Fix tests for custom assembly for loop wrapper
- Only run the lower workshare pass if openmp is enabled
- Implement some missing functionality
- Fix tests
- Iterate backwards to find all trivially dead ops
- Add explanation comment for createCopyFunc
- Update test
- Bufferize test
- Add test for should use workshare lowering
@llvm/pr-subscribers-flang-fir-hlfir @llvm/pr-subscribers-flang-openmp

Author: Thirumalai Shaktivel (Thirumalai-Shaktivel)

Changes

Continuation from #104748. Using Ivan's patch (#104748), the intrinsic calls were enclosed within the single construct. The same idea has to be performed to implement all the intrinsics mentioned in the documentation.

Full diff: https://github.com/llvm/llvm-project/pull/113082.diff (6 Files Affected)
diff --git a/flang/lib/Optimizer/HLFIR/Transforms/CMakeLists.txt b/flang/lib/Optimizer/HLFIR/Transforms/CMakeLists.txt
index fa3a59303137ff..43da1eb92d0306 100644
--- a/flang/lib/Optimizer/HLFIR/Transforms/CMakeLists.txt
+++ b/flang/lib/Optimizer/HLFIR/Transforms/CMakeLists.txt
@@ -26,6 +26,7 @@ add_flang_library(HLFIRTransforms
FIRTransforms
HLFIRDialect
MLIRIR
+ FlangOpenMPTransforms
${dialect_libs}
LINK_COMPONENTS
diff --git a/flang/lib/Optimizer/OpenMP/CMakeLists.txt b/flang/lib/Optimizer/OpenMP/CMakeLists.txt
index 39e92d388288d4..776798d7239117 100644
--- a/flang/lib/Optimizer/OpenMP/CMakeLists.txt
+++ b/flang/lib/Optimizer/OpenMP/CMakeLists.txt
@@ -21,6 +21,7 @@ add_flang_library(FlangOpenMPTransforms
FortranCommon
MLIRFuncDialect
MLIROpenMPDialect
+ MLIRArithDialect
HLFIRDialect
MLIRIR
MLIRPass
diff --git a/flang/lib/Optimizer/OpenMP/LowerWorkshare.cpp b/flang/lib/Optimizer/OpenMP/LowerWorkshare.cpp
index 225c585a02d913..9fb84f3e099c4c 100644
--- a/flang/lib/Optimizer/OpenMP/LowerWorkshare.cpp
+++ b/flang/lib/Optimizer/OpenMP/LowerWorkshare.cpp
@@ -16,6 +16,7 @@
//
//===----------------------------------------------------------------------===//
+#include "flang/Optimizer/Builder/HLFIRTools.h"
#include <flang/Optimizer/Builder/FIRBuilder.h>
#include <flang/Optimizer/Dialect/FIROps.h>
#include <flang/Optimizer/Dialect/FIRType.h>
@@ -335,49 +336,129 @@ static void parallelizeRegion(Region &sourceRegion, Region &targetRegion,
for (auto [i, opOrSingle] : llvm::enumerate(regions)) {
bool isLast = i + 1 == regions.size();
if (std::holds_alternative<SingleRegion>(opOrSingle)) {
- OpBuilder singleBuilder(sourceRegion.getContext());
- Block *singleBlock = new Block();
- singleBuilder.setInsertionPointToStart(singleBlock);
-
OpBuilder allocaBuilder(sourceRegion.getContext());
Block *allocaBlock = new Block();
allocaBuilder.setInsertionPointToStart(allocaBlock);
- OpBuilder parallelBuilder(sourceRegion.getContext());
- Block *parallelBlock = new Block();
- parallelBuilder.setInsertionPointToStart(parallelBlock);
-
- auto [allParallelized, copyprivateVars] =
- moveToSingle(std::get<SingleRegion>(opOrSingle), allocaBuilder,
- singleBuilder, parallelBuilder);
- if (allParallelized) {
- // The single region was not required as all operations were safe to
- // parallelize
- assert(copyprivateVars.empty());
- assert(allocaBlock->empty());
- delete singleBlock;
+ it = block.begin();
+ while (&*it != terminator)
+ if (isa<hlfir::SumOp>(it))
+ break;
+ else
+ it++;
+
+ if (auto sumOp = dyn_cast<hlfir::SumOp>(it)) {
+ /// Implementation:
+ /// Intrinsic function `SUM` operations
+ /// --
+ /// x = sum(array)
+ ///
+ /// is converted to
+ ///
+ /// !$omp parallel do
+ /// do i = 1, size(array)
+ /// x = x + array(i)
+ /// end do
+ /// !$omp end parallel do
+
+ OpBuilder wslBuilder(sourceRegion.getContext());
+ Block *wslBlock = new Block();
+ wslBuilder.setInsertionPointToStart(wslBlock);
+
+ Value target = dyn_cast<hlfir::AssignOp>(++it).getLhs();
+ Value array = sumOp.getArray();
+ Value dim = sumOp.getDim();
+ fir::SequenceType arrayTy = dyn_cast<fir::SequenceType>(
+ hlfir::getFortranElementOrSequenceType(array.getType()));
+ llvm::ArrayRef<int64_t> arrayShape = arrayTy.getShape();
+ if (arrayShape.size() == 1 && !dim) {
+ Value itr = allocaBuilder.create<fir::AllocaOp>(
+ loc, allocaBuilder.getI64Type());
+ Value c_one = allocaBuilder.create<arith::ConstantOp>(
+ loc, allocaBuilder.getI64IntegerAttr(1));
+ Value c_arr_size = allocaBuilder.create<arith::ConstantOp>(
+ loc, allocaBuilder.getI64IntegerAttr(arrayShape[0]));
+ // Value c_zero = allocaBuilder.create<arith::ConstantOp>(loc,
+ // allocaBuilder.getZeroAttr(arrayTy.getEleTy()));
+ // allocaBuilder.create<fir::StoreOp>(loc, c_zero, target);
+
+ omp::WsloopOperands wslOps;
+ omp::WsloopOp wslOp =
+ rootBuilder.create<omp::WsloopOp>(loc, wslOps);
+
+ hlfir::LoopNest ln;
+ ln.outerOp = wslOp;
+ omp::LoopNestOperands lnOps;
+ lnOps.loopLowerBounds.push_back(c_one);
+ lnOps.loopUpperBounds.push_back(c_arr_size);
+ lnOps.loopSteps.push_back(c_one);
+ lnOps.loopInclusive = wslBuilder.getUnitAttr();
+ omp::LoopNestOp lnOp =
+ wslBuilder.create<omp::LoopNestOp>(loc, lnOps);
+ Block *lnBlock = wslBuilder.createBlock(&lnOp.getRegion());
+ lnBlock->addArgument(c_one.getType(), loc);
+ wslBuilder.create<fir::StoreOp>(
+ loc, lnOp.getRegion().getArgument(0), itr);
+ Value tarLoad = wslBuilder.create<fir::LoadOp>(loc, target);
+ Value itrLoad = wslBuilder.create<fir::LoadOp>(loc, itr);
+ hlfir::DesignateOp arrDesOp = wslBuilder.create<hlfir::DesignateOp>(
+ loc, fir::ReferenceType::get(arrayTy.getEleTy()), array,
+ itrLoad);
+ Value desLoad = wslBuilder.create<fir::LoadOp>(loc, arrDesOp);
+ Value addf =
+ wslBuilder.create<arith::AddFOp>(loc, tarLoad, desLoad);
+ wslBuilder.create<fir::StoreOp>(loc, addf, target);
+ wslBuilder.create<omp::YieldOp>(loc);
+ ln.body = lnBlock;
+ wslOp.getRegion().push_back(wslBlock);
+ targetRegion.front().getOperations().splice(
+ wslOp->getIterator(), allocaBlock->getOperations());
+ } else {
+            emitError(loc, "Only 1D array scalar assignment for sum "
+                           "intrinsic is supported in workshare construct");
+ return;
+ }
} else {
- omp::SingleOperands singleOperands;
- if (isLast)
- singleOperands.nowait = rootBuilder.getUnitAttr();
- singleOperands.copyprivateVars = copyprivateVars;
- cleanupBlock(singleBlock);
- for (auto var : singleOperands.copyprivateVars) {
- mlir::func::FuncOp funcOp =
- createCopyFunc(loc, var.getType(), firCopyFuncBuilder);
- singleOperands.copyprivateSyms.push_back(
- SymbolRefAttr::get(funcOp));
+ OpBuilder singleBuilder(sourceRegion.getContext());
+ Block *singleBlock = new Block();
+ singleBuilder.setInsertionPointToStart(singleBlock);
+
+ OpBuilder parallelBuilder(sourceRegion.getContext());
+ Block *parallelBlock = new Block();
+ parallelBuilder.setInsertionPointToStart(parallelBlock);
+
+ auto [allParallelized, copyprivateVars] =
+ moveToSingle(std::get<SingleRegion>(opOrSingle), allocaBuilder,
+ singleBuilder, parallelBuilder);
+ if (allParallelized) {
+ // The single region was not required as all operations were safe to
+ // parallelize
+ assert(copyprivateVars.empty());
+ assert(allocaBlock->empty());
+ delete singleBlock;
+ } else {
+ omp::SingleOperands singleOperands;
+ if (isLast)
+ singleOperands.nowait = rootBuilder.getUnitAttr();
+ singleOperands.copyprivateVars = copyprivateVars;
+ cleanupBlock(singleBlock);
+ for (auto var : singleOperands.copyprivateVars) {
+ mlir::func::FuncOp funcOp =
+ createCopyFunc(loc, var.getType(), firCopyFuncBuilder);
+ singleOperands.copyprivateSyms.push_back(
+ SymbolRefAttr::get(funcOp));
+ }
+ omp::SingleOp singleOp =
+ rootBuilder.create<omp::SingleOp>(loc, singleOperands);
+ singleOp.getRegion().push_back(singleBlock);
+ targetRegion.front().getOperations().splice(
+ singleOp->getIterator(), allocaBlock->getOperations());
}
- omp::SingleOp singleOp =
- rootBuilder.create<omp::SingleOp>(loc, singleOperands);
- singleOp.getRegion().push_back(singleBlock);
- targetRegion.front().getOperations().splice(
- singleOp->getIterator(), allocaBlock->getOperations());
+ rootBuilder.getInsertionBlock()->getOperations().splice(
+ rootBuilder.getInsertionPoint(), parallelBlock->getOperations());
+ delete parallelBlock;
}
- rootBuilder.getInsertionBlock()->getOperations().splice(
- rootBuilder.getInsertionPoint(), parallelBlock->getOperations());
delete allocaBlock;
- delete parallelBlock;
} else {
auto op = std::get<Operation *>(opOrSingle);
if (auto wslw = dyn_cast<omp::WorkshareLoopWrapperOp>(op)) {
diff --git a/flang/lib/Optimizer/Passes/Pipelines.cpp b/flang/lib/Optimizer/Passes/Pipelines.cpp
index c1a5902b747887..3a9166f2e0aa52 100644
--- a/flang/lib/Optimizer/Passes/Pipelines.cpp
+++ b/flang/lib/Optimizer/Passes/Pipelines.cpp
@@ -227,11 +227,11 @@ void createHLFIRToFIRPassPipeline(mlir::PassManager &pm, bool enableOpenMP,
hlfir::createOptimizedBufferization);
}
pm.addPass(hlfir::createLowerHLFIROrderedAssignments());
+ if (enableOpenMP)
+ pm.addPass(flangomp::createLowerWorkshare());
pm.addPass(hlfir::createLowerHLFIRIntrinsics());
pm.addPass(hlfir::createBufferizeHLFIR());
pm.addPass(hlfir::createConvertHLFIRtoFIR());
- if (enableOpenMP)
- pm.addPass(flangomp::createLowerWorkshare());
}
/// Create a pass pipeline for handling certain OpenMP transformations needed
diff --git a/flang/test/Fir/basic-program.fir b/flang/test/Fir/basic-program.fir
index 4b18acb7c2b430..9b9651a476e583 100644
--- a/flang/test/Fir/basic-program.fir
+++ b/flang/test/Fir/basic-program.fir
@@ -44,10 +44,10 @@ func.func @_QQmain() {
// PASSES-NEXT: 'omp.private' Pipeline
// PASSES-NEXT: OptimizedBufferization
// PASSES-NEXT: LowerHLFIROrderedAssignments
+// PASSES-NEXT: LowerWorkshare
// PASSES-NEXT: LowerHLFIRIntrinsics
// PASSES-NEXT: BufferizeHLFIR
// PASSES-NEXT: ConvertHLFIRtoFIR
-// PASSES-NEXT: LowerWorkshare
// PASSES-NEXT: CSE
// PASSES-NEXT: (S) 0 num-cse'd - Number of operations CSE'd
// PASSES-NEXT: (S) 0 num-dce'd - Number of operations DCE'd
diff --git a/flang/test/Integration/OpenMP/workshare02.f90 b/flang/test/Integration/OpenMP/workshare02.f90
new file mode 100644
index 00000000000000..68b810a32f247e
--- /dev/null
+++ b/flang/test/Integration/OpenMP/workshare02.f90
@@ -0,0 +1,36 @@
+!===----------------------------------------------------------------------===!
+! This directory can be used to add Integration tests involving multiple
+! stages of the compiler (for eg. from Fortran to LLVM IR). It should not
+! contain executable tests. We should only add tests here sparingly and only
+! if there is no other way to test. Repeat this message in each test that is
+! added to this directory and sub-directories.
+!===----------------------------------------------------------------------===!
+
+!RUN: %flang_fc1 -emit-mlir -fopenmp -O3 %s -o - | FileCheck %s --check-prefix MLIR
+
+program test_ws_01
+ implicit none
+ real(8) :: arr_01(10), x
+ arr_01 = [0.347,0.892,0.573,0.126,0.788,0.412,0.964,0.205,0.631,0.746]
+
+ !$omp parallel workshare
+ x = sum(arr_01)
+ !$omp end parallel workshare
+end program test_ws_01
+
+! MLIR: func.func @_QQmain
+! MLIR: omp.parallel {
+! [...]
+! MLIR: omp.wsloop {
+! MLIR: omp.loop_nest {{.*}}
+! [...]
+! MLIR: %[[SUM:.*]] = arith.addf {{.*}}
+! [...]
+! MLIR: omp.yield
+! MLIR: }
+! MLIR: }
+! MLIR: omp.barrier
+! MLIR: omp.terminator
+! MLIR: }
+! MLIR: return
+! MLIR: }
Two issues to fix in this PR:
Kindly share your thoughts on this idea.
I think there are two main approaches possible here, and they can coexist:

1. Have alternative intrinsic implementations in their own runtime library. E.g. the version of assignment for the workshare construct would be a separate workshare-aware entry point, and then for the supported subset the lowering of the array intrinsic becomes a call to that entry point instead of the usual runtime function. (It can also be implemented as a flag to the runtime function telling it which version it should use, and not a different rt func.) Hopefully using this approach would allow us to share some code between the current intrinsic runtime implementations (which do not assume an omp context) and these ones. I think this approach may be more flexible and allow for better optimizations, like offloading to BLAS libraries or having an optimized code path behind an aliasing check. (These are possible in the second approach but more cumbersome?) Caveat: I am not sure if mixing OpenMP constructs in the runtime with the ones flang generates works.

2. Code gen the intrinsics at the IR level, which is what this approach would be. My concern with this is that it may be quite a lot of effort, but perhaps it has some benefits in that we should have better support of intrinsics on GPU targets, where the current runtime does not compile as is?

Maybe looking for some discussion from when the approach of using a runtime library for the intrinsic implementation was decided could be useful, if it exists.
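To make the first option concrete, a workshare-aware runtime entry point could split the iteration space by thread, in the spirit of a worksharing loop. The sketch below is purely illustrative: the function name, the chunking policy, and driving it from a plain loop instead of an OpenMP team are all assumptions, not flang's actual runtime interface.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical workshare-aware SUM entry point. Each member of the team
// would call it with its own (tid, nthreads) pair and sum only a contiguous
// chunk of the array; the partial results are then combined. In a real
// runtime, tid/nthreads would come from the OpenMP library and the combine
// step would be an actual reduction across the team.
double SumWorkshareChunk(const double *base, std::size_t len, unsigned tid,
                         unsigned nthreads) {
  std::size_t chunk = (len + nthreads - 1) / nthreads; // ceiling division
  std::size_t begin = std::min<std::size_t>(len, std::size_t(tid) * chunk);
  std::size_t end = std::min<std::size_t>(len, begin + chunk);
  double partial = 0.0;
  for (std::size_t i = begin; i < end; ++i)
    partial += base[i];
  return partial;
}
```

Calling it once per team member and adding the partials reproduces the serial SUM; the flag-based variant mentioned above would instead branch inside the existing runtime function.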
tblah left a comment:
I don't have a strong opinion about whether this should be done in flang codegen or in a runtime library. Some drive-by thoughts:
- I imagine that using a C `omp for` inside of a Fortran `omp parallel` should work so long as they use the same openmp library. Of course care will need to be taken about privatisation etc but that can probably be handled in the interface of the runtime function.
- The above requirement basically breaks allowing users to pick their own openmp library when compiling their application. I'm not sure how widely used this is.
- This is already hard to read/review and SUM is a very simple case. I think there is a danger in re-inventing the wheel here. Something like a high quality parallelised MATMUL which is useful on both CPU and GPU sounds hard. There is a lot of work in that direction in "upstream" MLIR dialects. I know flang (for historical reasons) doesn't use linalg, affine, etc but we might have to eventually.
- If we decide not to re-use upstream mlir work, I think more complex intrinsics could get quite hard to review in this style. A C++ runtime library implementation would be easier to read and maintain
- Having it in a runtime library would make it easier for vendors to swap out their own implementations e.g. maybe somebody already has a great MATMUL for AMD GPUs and wants flang to use that, but the implementation is useless on a CPU
Overall I don't know what the right approach is. Maybe we should do both as Ivan said. I think if we do decide to implement several intrinsics this way, it should go in its own pass because the code could quickly become very long.
    fir::SequenceType arrayTy = dyn_cast<fir::SequenceType>(
        hlfir::getFortranElementOrSequenceType(array.getType()));
    llvm::ArrayRef<int64_t> arrayShape = arrayTy.getShape();
    if (arrayShape.size() == 1 && !dim) {
I think you also need to check that the element type is floating point.
    ///
    /// !$omp parallel do
    /// do i = 1, size(array)
    ///   x = x + array(i)
I don't think this will work. x will be simultaneously updated from all of the threads - overwriting each-other. It is a classic race condition.
I think you need to add a reduction clause to the wsloop operation reducing x with +.
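For reference, `reduction(+:x)` gives every thread a private accumulator (initialized to the identity, 0 for +) and folds the private copies into the shared variable once at the end, so no thread writes the shared `x` inside the loop. Below is a sequential emulation of those semantics for a hypothetical 4-thread team; the round-robin iteration split is illustrative, not what an OpenMP runtime actually does.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Emulate `!$omp parallel do reduction(+:x)`: each "thread" owns a private
// copy of x initialized to 0, accumulates its share of iterations into it,
// and the copies are combined into the shared x after the loop. Because the
// shared x is only written in the combine step, the race described above
// cannot occur.
double reducedSum(const std::vector<double> &array) {
  constexpr unsigned kThreads = 4;
  double x = 0.0;             // the shared variable being reduced
  double priv[kThreads] = {}; // per-thread private copies of x
  for (std::size_t i = 0; i < array.size(); ++i)
    priv[i % kThreads] += array[i]; // thread i % kThreads owns iteration i
  for (unsigned t = 0; t < kThreads; ++t)
    x += priv[t]; // combine step, done once after the loop
  return x;
}
```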
          targetRegion.front().getOperations().splice(
              wslOp->getIterator(), allocaBlock->getOperations());
        } else {
          emitError(loc, "Only 1D array scalar assignment for sum "
instead of emitting an error here it would be better to go to the outer else branch and use the runtime library version of SUM
  }
  pm.addPass(hlfir::createLowerHLFIROrderedAssignments());
  if (enableOpenMP)
    pm.addPass(flangomp::createLowerWorkshare());
I think the workshare lowering pass assumes it runs after HLFIR bufferization. For example see the changes in the previous PR #104748. @ivanradanov please could you advise on this.
Another approach would be to modify createLowerHLFIRIntrinsics so that it does not lower eligible hlfir.sum operations if they are inside of a workshare (a bit like how bufferization was modified in the PR above).
I think we can inject the code that lowers hlfir.<intrinsic> into openmp loops at the point the hlfir.<intrinsic>s are lowered into fir.calls into the flang rt library. We should have a check for whether we should use the generic lowering or the openmp loop lowering (using the shouldUseWorkshareLowering function) and then for the supported intrinsics lower it to openmp loops instead of the rt call.
Another option is to have another pass which runs before the pass that does the fir.call lowering and that pass takes care of the intrinsics we support.
Even if we use the first approach we should still probably have a separate compilation unit for the collection of those. This way it should not interfere much with the existing code and should be easy to integrate.
It might not be practical though. I guess we would have to pass the OpenMP struct to the runtime function and then do all openmp runtime calls manually inside of the runtime library - not ideal. Even this would be hard to do because the OpenMP struct is only added to the module when we convert to LLVMIR, so referring to it here would break the layering of the compiler.
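The first option described above amounts to a dispatch at the point where hlfir intrinsics become fir.calls. A toy sketch of that decision follows; the function names and the string-based stand-ins are hypothetical, since the real code would inspect MLIR ops and use the existing shouldUseWorkshareLowering helper.

```cpp
#include <cassert>
#include <string>

// Stand-in for the real query: is the op lexically inside omp.workshare?
bool shouldUseWorkshareLowering(bool insideWorkshare) {
  return insideWorkshare;
}

// Which intrinsics have an openmp-loop lowering so far (only SUM here).
bool hasOmpLoopLowering(const std::string &intrinsic) {
  return intrinsic == "sum";
}

// Prefer the openmp-loop lowering when it exists and we are inside a
// workshare construct; otherwise fall back to the generic lowering that
// emits a call into the Fortran runtime.
std::string lowerIntrinsic(const std::string &intrinsic,
                           bool insideWorkshare) {
  if (shouldUseWorkshareLowering(insideWorkshare) &&
      hasOmpLoopLowering(intrinsic))
    return "omp loops";
  return "runtime call";
}
```

Keeping this dispatch in its own compilation unit, as suggested, means unsupported intrinsics silently keep the existing runtime-call path.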
Thanks for the inputs @tblah.
How do we usually reach convergence on the definition of these runtime functions when we introduce a library? Is it usually through an RFC? Because if we think of vendors swapping implementations, does there need to be a consensus prior to the implementation?
To be clear, I don't know of anyone who has expressed an interest in swapping implementations here; this just looked like an advantage to me, like how some vendors have their own openmp library or veclib (vectorized libm). An RFC and discussion on community calls sounds like a good way to agree on an interface if this route is taken.
They won't be able to do that regardless. The OpenMP library will have to match what Flang is using; otherwise, two OpenMP implementations will co-exist in the same process and that's usually a recipe for problems. Regarding the initial question: I'm ambivalent on the direction (runtime entry point vs. code-gen). I will say though that if we go with the runtime entry point, we will tie the Fortran runtime to the OpenMP runtime, which may not be desired in all cases.
I probably wasn't clear. I meant something like
Force-pushed a0f2307 to 0abf058.
Continuation from #104748
Using Ivan's PR (#104748), the intrinsic calls were enclosed within the single construct.
So, I decided to pick one and experiment with enclosing it in a wsloop node instead,
which lets us share the work among different threads.
Here, I picked the SUM intrinsic and changed the lowering to generate a wsloop with the SUM operations
as its body. See: https://github.com/llvm/llvm-project/pull/113082/files#diff-51125680c30229c9697d5fd264bf0d3f4effb9034c8951abea53206e8402e72b for the generated MLIR.
The same idea has to be performed to implement all the intrinsics mentioned in
the documentation.