[Flang][OMP] Replace SUM intrinsic call with SUM operations #113082
Conversation
- Add custom omp loop wrapper
- Add recursive memory effects trait to workshare
- Remove stray include
- Remove omp.workshare verifier
- Add assembly format for wrapper and add test
- Add verification and descriptions
- Fix lower test for workshare
- Emit loop nests in a custom wrapper
- Only emit unordered loops as omp loops
- Fix uninitialized memory bug in genLoopNest
- Change to workshare loop wrapper op
- Move single op declaration
- Schedule pass properly
- Correctly handle nested loop nests to be parallelized by workshare
- Leave comments for shouldUseWorkshareLowering
- Use copyprivate to scatter val from omp.single (TODO: still need to implement copy function; TODO: transitive check for usage outside of omp.single not implemented yet)
- Transitively check for users outside of single op (TODO: need to implement copy func; TODO: need to hoist allocas outside of single regions)
- Add tests
- Hoist allocas
- More tests
- Emit body for copy func
- Test the tmp storing logic
- Clean up trivially dead ops
- Only handle single-block regions for now
- Fix tests for custom assembly for loop wrapper
- Only run the lower workshare pass if openmp is enabled
- Implement some missing functionality
- Fix tests
- Iterate backwards to find all trivially dead ops
- Add explanation comment for createCopyFunc
- Update test
- Bufferize test
- Add test for should use workshare lowering
@llvm/pr-subscribers-flang-fir-hlfir @llvm/pr-subscribers-flang-openmp

Author: Thirumalai Shaktivel (Thirumalai-Shaktivel)

Changes

Continuation from #104748. Using Ivan's patch (#104748), the intrinsic calls were enclosed within the single construct. The same idea has to be performed to implement all the intrinsics mentioned in the documentation.

Full diff: https://github.com/llvm/llvm-project/pull/113082.diff (6 Files Affected)
diff --git a/flang/lib/Optimizer/HLFIR/Transforms/CMakeLists.txt b/flang/lib/Optimizer/HLFIR/Transforms/CMakeLists.txt
index fa3a59303137ff..43da1eb92d0306 100644
--- a/flang/lib/Optimizer/HLFIR/Transforms/CMakeLists.txt
+++ b/flang/lib/Optimizer/HLFIR/Transforms/CMakeLists.txt
@@ -26,6 +26,7 @@ add_flang_library(HLFIRTransforms
FIRTransforms
HLFIRDialect
MLIRIR
+ FlangOpenMPTransforms
${dialect_libs}
LINK_COMPONENTS
diff --git a/flang/lib/Optimizer/OpenMP/CMakeLists.txt b/flang/lib/Optimizer/OpenMP/CMakeLists.txt
index 39e92d388288d4..776798d7239117 100644
--- a/flang/lib/Optimizer/OpenMP/CMakeLists.txt
+++ b/flang/lib/Optimizer/OpenMP/CMakeLists.txt
@@ -21,6 +21,7 @@ add_flang_library(FlangOpenMPTransforms
FortranCommon
MLIRFuncDialect
MLIROpenMPDialect
+ MLIRArithDialect
HLFIRDialect
MLIRIR
MLIRPass
diff --git a/flang/lib/Optimizer/OpenMP/LowerWorkshare.cpp b/flang/lib/Optimizer/OpenMP/LowerWorkshare.cpp
index 225c585a02d913..9fb84f3e099c4c 100644
--- a/flang/lib/Optimizer/OpenMP/LowerWorkshare.cpp
+++ b/flang/lib/Optimizer/OpenMP/LowerWorkshare.cpp
@@ -16,6 +16,7 @@
//
//===----------------------------------------------------------------------===//
+#include "flang/Optimizer/Builder/HLFIRTools.h"
#include <flang/Optimizer/Builder/FIRBuilder.h>
#include <flang/Optimizer/Dialect/FIROps.h>
#include <flang/Optimizer/Dialect/FIRType.h>
@@ -335,49 +336,129 @@ static void parallelizeRegion(Region &sourceRegion, Region &targetRegion,
for (auto [i, opOrSingle] : llvm::enumerate(regions)) {
bool isLast = i + 1 == regions.size();
if (std::holds_alternative<SingleRegion>(opOrSingle)) {
- OpBuilder singleBuilder(sourceRegion.getContext());
- Block *singleBlock = new Block();
- singleBuilder.setInsertionPointToStart(singleBlock);
-
OpBuilder allocaBuilder(sourceRegion.getContext());
Block *allocaBlock = new Block();
allocaBuilder.setInsertionPointToStart(allocaBlock);
- OpBuilder parallelBuilder(sourceRegion.getContext());
- Block *parallelBlock = new Block();
- parallelBuilder.setInsertionPointToStart(parallelBlock);
-
- auto [allParallelized, copyprivateVars] =
- moveToSingle(std::get<SingleRegion>(opOrSingle), allocaBuilder,
- singleBuilder, parallelBuilder);
- if (allParallelized) {
- // The single region was not required as all operations were safe to
- // parallelize
- assert(copyprivateVars.empty());
- assert(allocaBlock->empty());
- delete singleBlock;
+ it = block.begin();
+ while (&*it != terminator)
+ if (isa<hlfir::SumOp>(it))
+ break;
+ else
+ it++;
+
+ if (auto sumOp = dyn_cast<hlfir::SumOp>(it)) {
+ /// Implementation:
+ /// Intrinsic function `SUM` operations
+ /// --
+ /// x = sum(array)
+ ///
+ /// is converted to
+ ///
+ /// !$omp parallel do
+ /// do i = 1, size(array)
+ /// x = x + array(i)
+ /// end do
+ /// !$omp end parallel do
+
+ OpBuilder wslBuilder(sourceRegion.getContext());
+ Block *wslBlock = new Block();
+ wslBuilder.setInsertionPointToStart(wslBlock);
+
+ Value target = dyn_cast<hlfir::AssignOp>(++it).getLhs();
+ Value array = sumOp.getArray();
+ Value dim = sumOp.getDim();
+ fir::SequenceType arrayTy = dyn_cast<fir::SequenceType>(
+ hlfir::getFortranElementOrSequenceType(array.getType()));
+ llvm::ArrayRef<int64_t> arrayShape = arrayTy.getShape();
+ if (arrayShape.size() == 1 && !dim) {
+ Value itr = allocaBuilder.create<fir::AllocaOp>(
+ loc, allocaBuilder.getI64Type());
+ Value c_one = allocaBuilder.create<arith::ConstantOp>(
+ loc, allocaBuilder.getI64IntegerAttr(1));
+ Value c_arr_size = allocaBuilder.create<arith::ConstantOp>(
+ loc, allocaBuilder.getI64IntegerAttr(arrayShape[0]));
+ // Value c_zero = allocaBuilder.create<arith::ConstantOp>(loc,
+ // allocaBuilder.getZeroAttr(arrayTy.getEleTy()));
+ // allocaBuilder.create<fir::StoreOp>(loc, c_zero, target);
+
+ omp::WsloopOperands wslOps;
+ omp::WsloopOp wslOp =
+ rootBuilder.create<omp::WsloopOp>(loc, wslOps);
+
+ hlfir::LoopNest ln;
+ ln.outerOp = wslOp;
+ omp::LoopNestOperands lnOps;
+ lnOps.loopLowerBounds.push_back(c_one);
+ lnOps.loopUpperBounds.push_back(c_arr_size);
+ lnOps.loopSteps.push_back(c_one);
+ lnOps.loopInclusive = wslBuilder.getUnitAttr();
+ omp::LoopNestOp lnOp =
+ wslBuilder.create<omp::LoopNestOp>(loc, lnOps);
+ Block *lnBlock = wslBuilder.createBlock(&lnOp.getRegion());
+ lnBlock->addArgument(c_one.getType(), loc);
+ wslBuilder.create<fir::StoreOp>(
+ loc, lnOp.getRegion().getArgument(0), itr);
+ Value tarLoad = wslBuilder.create<fir::LoadOp>(loc, target);
+ Value itrLoad = wslBuilder.create<fir::LoadOp>(loc, itr);
+ hlfir::DesignateOp arrDesOp = wslBuilder.create<hlfir::DesignateOp>(
+ loc, fir::ReferenceType::get(arrayTy.getEleTy()), array,
+ itrLoad);
+ Value desLoad = wslBuilder.create<fir::LoadOp>(loc, arrDesOp);
+ Value addf =
+ wslBuilder.create<arith::AddFOp>(loc, tarLoad, desLoad);
+ wslBuilder.create<fir::StoreOp>(loc, addf, target);
+ wslBuilder.create<omp::YieldOp>(loc);
+ ln.body = lnBlock;
+ wslOp.getRegion().push_back(wslBlock);
+ targetRegion.front().getOperations().splice(
+ wslOp->getIterator(), allocaBlock->getOperations());
+ } else {
+            emitError(loc, "Only 1D array scalar assignment for sum "
+                           "intrinsic is supported in workshare construct");
+ return;
+ }
} else {
- omp::SingleOperands singleOperands;
- if (isLast)
- singleOperands.nowait = rootBuilder.getUnitAttr();
- singleOperands.copyprivateVars = copyprivateVars;
- cleanupBlock(singleBlock);
- for (auto var : singleOperands.copyprivateVars) {
- mlir::func::FuncOp funcOp =
- createCopyFunc(loc, var.getType(), firCopyFuncBuilder);
- singleOperands.copyprivateSyms.push_back(
- SymbolRefAttr::get(funcOp));
+ OpBuilder singleBuilder(sourceRegion.getContext());
+ Block *singleBlock = new Block();
+ singleBuilder.setInsertionPointToStart(singleBlock);
+
+ OpBuilder parallelBuilder(sourceRegion.getContext());
+ Block *parallelBlock = new Block();
+ parallelBuilder.setInsertionPointToStart(parallelBlock);
+
+ auto [allParallelized, copyprivateVars] =
+ moveToSingle(std::get<SingleRegion>(opOrSingle), allocaBuilder,
+ singleBuilder, parallelBuilder);
+ if (allParallelized) {
+ // The single region was not required as all operations were safe to
+ // parallelize
+ assert(copyprivateVars.empty());
+ assert(allocaBlock->empty());
+ delete singleBlock;
+ } else {
+ omp::SingleOperands singleOperands;
+ if (isLast)
+ singleOperands.nowait = rootBuilder.getUnitAttr();
+ singleOperands.copyprivateVars = copyprivateVars;
+ cleanupBlock(singleBlock);
+ for (auto var : singleOperands.copyprivateVars) {
+ mlir::func::FuncOp funcOp =
+ createCopyFunc(loc, var.getType(), firCopyFuncBuilder);
+ singleOperands.copyprivateSyms.push_back(
+ SymbolRefAttr::get(funcOp));
+ }
+ omp::SingleOp singleOp =
+ rootBuilder.create<omp::SingleOp>(loc, singleOperands);
+ singleOp.getRegion().push_back(singleBlock);
+ targetRegion.front().getOperations().splice(
+ singleOp->getIterator(), allocaBlock->getOperations());
}
- omp::SingleOp singleOp =
- rootBuilder.create<omp::SingleOp>(loc, singleOperands);
- singleOp.getRegion().push_back(singleBlock);
- targetRegion.front().getOperations().splice(
- singleOp->getIterator(), allocaBlock->getOperations());
+ rootBuilder.getInsertionBlock()->getOperations().splice(
+ rootBuilder.getInsertionPoint(), parallelBlock->getOperations());
+ delete parallelBlock;
}
- rootBuilder.getInsertionBlock()->getOperations().splice(
- rootBuilder.getInsertionPoint(), parallelBlock->getOperations());
delete allocaBlock;
- delete parallelBlock;
} else {
auto op = std::get<Operation *>(opOrSingle);
if (auto wslw = dyn_cast<omp::WorkshareLoopWrapperOp>(op)) {
diff --git a/flang/lib/Optimizer/Passes/Pipelines.cpp b/flang/lib/Optimizer/Passes/Pipelines.cpp
index c1a5902b747887..3a9166f2e0aa52 100644
--- a/flang/lib/Optimizer/Passes/Pipelines.cpp
+++ b/flang/lib/Optimizer/Passes/Pipelines.cpp
@@ -227,11 +227,11 @@ void createHLFIRToFIRPassPipeline(mlir::PassManager &pm, bool enableOpenMP,
hlfir::createOptimizedBufferization);
}
pm.addPass(hlfir::createLowerHLFIROrderedAssignments());
+ if (enableOpenMP)
+ pm.addPass(flangomp::createLowerWorkshare());
pm.addPass(hlfir::createLowerHLFIRIntrinsics());
pm.addPass(hlfir::createBufferizeHLFIR());
pm.addPass(hlfir::createConvertHLFIRtoFIR());
- if (enableOpenMP)
- pm.addPass(flangomp::createLowerWorkshare());
}
/// Create a pass pipeline for handling certain OpenMP transformations needed
diff --git a/flang/test/Fir/basic-program.fir b/flang/test/Fir/basic-program.fir
index 4b18acb7c2b430..9b9651a476e583 100644
--- a/flang/test/Fir/basic-program.fir
+++ b/flang/test/Fir/basic-program.fir
@@ -44,10 +44,10 @@ func.func @_QQmain() {
// PASSES-NEXT: 'omp.private' Pipeline
// PASSES-NEXT: OptimizedBufferization
// PASSES-NEXT: LowerHLFIROrderedAssignments
+// PASSES-NEXT: LowerWorkshare
// PASSES-NEXT: LowerHLFIRIntrinsics
// PASSES-NEXT: BufferizeHLFIR
// PASSES-NEXT: ConvertHLFIRtoFIR
-// PASSES-NEXT: LowerWorkshare
// PASSES-NEXT: CSE
// PASSES-NEXT: (S) 0 num-cse'd - Number of operations CSE'd
// PASSES-NEXT: (S) 0 num-dce'd - Number of operations DCE'd
diff --git a/flang/test/Integration/OpenMP/workshare02.f90 b/flang/test/Integration/OpenMP/workshare02.f90
new file mode 100644
index 00000000000000..68b810a32f247e
--- /dev/null
+++ b/flang/test/Integration/OpenMP/workshare02.f90
@@ -0,0 +1,36 @@
+!===----------------------------------------------------------------------===!
+! This directory can be used to add Integration tests involving multiple
+! stages of the compiler (for eg. from Fortran to LLVM IR). It should not
+! contain executable tests. We should only add tests here sparingly and only
+! if there is no other way to test. Repeat this message in each test that is
+! added to this directory and sub-directories.
+!===----------------------------------------------------------------------===!
+
+!RUN: %flang_fc1 -emit-mlir -fopenmp -O3 %s -o - | FileCheck %s --check-prefix MLIR
+
+program test_ws_01
+ implicit none
+ real(8) :: arr_01(10), x
+ arr_01 = [0.347,0.892,0.573,0.126,0.788,0.412,0.964,0.205,0.631,0.746]
+
+ !$omp parallel workshare
+ x = sum(arr_01)
+ !$omp end parallel workshare
+end program test_ws_01
+
+! MLIR: func.func @_QQmain
+! MLIR: omp.parallel {
+! [...]
+! MLIR: omp.wsloop {
+! MLIR: omp.loop_nest {{.*}}
+! [...]
+! MLIR: %[[SUM:.*]] = arith.addf {{.*}}
+! [...]
+! MLIR: omp.yield
+! MLIR: }
+! MLIR: }
+! MLIR: omp.barrier
+! MLIR: omp.terminator
+! MLIR: }
+! MLIR: return
+! MLIR: }
Two issues to fix in this PR:
Kindly share your thoughts on this idea.
I think there are two main approaches possible here, and they can coexist:

1. Have alternative intrinsic implementations in their own runtime library. E.g. the version of assignment for the workshare construct would be a separate workshare-aware entry point, and then for the supported subset the lowering of the array intrinsic becomes a call to that entry point instead of the usual runtime function. (It can also be implemented as a flag to the runtime function telling it which version it should use, and not a different rt func.) Hopefully using this approach would allow us to share some code between the current intrinsic runtime implementations (which do not assume an omp context) and these ones. I think this approach may be more flexible and allow for better optimizations, like offloading to BLAS libraries or having an optimized code path behind an aliasing check. (These are possible in the second approach but more cumbersome?) Caveat: I am not sure if mixing OpenMP constructs in the runtime with the ones flang generates works.

2. Code gen the intrinsics at the IR level, which is what this approach would be. My concern with this is that it may be quite a lot of effort, but perhaps it has some benefits in that we should have better support of intrinsics on GPU targets, where the current runtime does not compile as is?

Maybe looking for some discussion from when the approach of using a runtime library for the intrinsic implementation was decided could be useful, if it exists.
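To make the first option concrete, a workshare-aware runtime entry point could split the iteration space by thread, in the spirit of a worksharing loop. The sketch below is purely illustrative: the function name, the chunking policy, and driving it from a plain loop instead of an OpenMP team are all assumptions, not flang's actual runtime interface.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical workshare-aware SUM entry point. Each member of the team
// would call it with its own (tid, nthreads) pair and sum only a contiguous
// chunk of the array; the partial results are then combined. In a real
// runtime, tid/nthreads would come from the OpenMP library and the combine
// step would be an actual reduction across the team.
double SumWorkshareChunk(const double *base, std::size_t len, unsigned tid,
                         unsigned nthreads) {
  std::size_t chunk = (len + nthreads - 1) / nthreads; // ceiling division
  std::size_t begin = std::min<std::size_t>(len, std::size_t(tid) * chunk);
  std::size_t end = std::min<std::size_t>(len, begin + chunk);
  double partial = 0.0;
  for (std::size_t i = begin; i < end; ++i)
    partial += base[i];
  return partial;
}
```

Calling it once per team member and adding the partials reproduces the serial SUM; the flag-based variant mentioned above would instead branch inside the existing runtime function.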
tblah left a comment:
I don't have a strong opinion about whether this should be done in flang codegen or in a runtime library. Some drive-by thoughts:
- I imagine that using a C `omp for` inside of a Fortran `omp parallel` should work so long as they use the same openmp library. Of course care will need to be taken about privatisation etc but that can probably be handled in the interface of the runtime function.
- The above requirement basically breaks allowing users to pick their own openmp library when compiling their application. I'm not sure how widely used this is.
- This is already hard to read/review and SUM is a very simple case. I think there is a danger in re-inventing the wheel here. Something like a high quality parallelised MATMUL which is useful on both CPU and GPU sounds hard. There is a lot of work in that direction in "upstream" MLIR dialects. I know flang (for historical reasons) doesn't use linalg, affine, etc but we might have to eventually.
- If we decide not to re-use upstream mlir work, I think more complex intrinsics could get quite hard to review in this style. A C++ runtime library implementation would be easier to read and maintain
- Having it in a runtime library would make it easier for vendors to swap out their own implementations e.g. maybe somebody already has a great MATMUL for AMD GPUs and wants flang to use that, but the implementation is useless on a CPU
Overall I don't know what the right approach is. Maybe we should do both as Ivan said. I think if we do decide to implement several intrinsics this way, it should go in its own pass because the code could quickly become very long.
    fir::SequenceType arrayTy = dyn_cast<fir::SequenceType>(
        hlfir::getFortranElementOrSequenceType(array.getType()));
    llvm::ArrayRef<int64_t> arrayShape = arrayTy.getShape();
    if (arrayShape.size() == 1 && !dim) {
I think you also need to check that the element type is floating point.
    ///
    /// !$omp parallel do
    /// do i = 1, size(array)
    ///   x = x + array(i)
I don't think this will work. x will be simultaneously updated from all of the threads - overwriting each-other. It is a classic race condition.
I think you need to add a reduction clause to the wsloop operation reducing x with +.
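For reference, `reduction(+:x)` gives every thread a private accumulator (initialized to the identity, 0 for +) and folds the private copies into the shared variable once at the end, so no thread writes the shared `x` inside the loop. Below is a sequential emulation of those semantics for a hypothetical 4-thread team; the round-robin iteration split is illustrative, not what an OpenMP runtime actually does.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Emulate `!$omp parallel do reduction(+:x)`: each "thread" owns a private
// copy of x initialized to 0, accumulates its share of iterations into it,
// and the copies are combined into the shared x after the loop. Because the
// shared x is only written in the combine step, the race described above
// cannot occur.
double reducedSum(const std::vector<double> &array) {
  constexpr unsigned kThreads = 4;
  double x = 0.0;             // the shared variable being reduced
  double priv[kThreads] = {}; // per-thread private copies of x
  for (std::size_t i = 0; i < array.size(); ++i)
    priv[i % kThreads] += array[i]; // thread i % kThreads owns iteration i
  for (unsigned t = 0; t < kThreads; ++t)
    x += priv[t]; // combine step, done once after the loop
  return x;
}
```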
          targetRegion.front().getOperations().splice(
              wslOp->getIterator(), allocaBlock->getOperations());
        } else {
          emitError(loc, "Only 1D array scalar assignment for sum "
instead of emitting an error here it would be better to go to the outer else branch and use the runtime library version of SUM
  }
  pm.addPass(hlfir::createLowerHLFIROrderedAssignments());
  if (enableOpenMP)
    pm.addPass(flangomp::createLowerWorkshare());
I think the workshare lowering pass assumes it runs after HLFIR bufferization. For example see the changes in the previous PR #104748. @ivanradanov please could you advise on this.
Another approach would be to modify createLowerHLFIRIntrinsics so that it does not lower eligible hlfir.sum operations if they are inside of a workshare (a bit like how bufferization was modified in the PR above).
I think we can inject the code that lowers hlfir.<intrinsic> into openmp loops at the point the hlfir.<intrinsic>s are lowered into fir.calls into the flang rt library. We should have a check for whether we should use the generic lowering or the openmp loop lowering (using the shouldUseWorkshareLowering function) and then for the supported intrinsics lower it to openmp loops instead of the rt call.
Another option is to have another pass which runs before the pass that does the fir.call lowering and that pass takes care of the intrinsics we support.
Even if we use the first approach we should still probably have a separate compilation unit for the collection of those. This way it should not interfere much with the existing code and should be easy to integrate.
It might not be practical though. I guess we would have to pass the OpenMP struct to the runtime function and then do all openmp runtime calls manually inside of the runtime library - not ideal. Even this would be hard to do because the OpenMP struct is only added to the module when we convert to LLVMIR, so referring to it here would break the layering of the compiler.
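The first option described above amounts to a dispatch at the point where hlfir intrinsics become fir.calls. A toy sketch of that decision follows; the function names and the string-based stand-ins are hypothetical, since the real code would inspect MLIR ops and use the existing shouldUseWorkshareLowering helper.

```cpp
#include <cassert>
#include <string>

// Stand-in for the real query: is the op lexically inside omp.workshare?
bool shouldUseWorkshareLowering(bool insideWorkshare) {
  return insideWorkshare;
}

// Which intrinsics have an openmp-loop lowering so far (only SUM here).
bool hasOmpLoopLowering(const std::string &intrinsic) {
  return intrinsic == "sum";
}

// Prefer the openmp-loop lowering when it exists and we are inside a
// workshare construct; otherwise fall back to the generic lowering that
// emits a call into the Fortran runtime.
std::string lowerIntrinsic(const std::string &intrinsic,
                           bool insideWorkshare) {
  if (shouldUseWorkshareLowering(insideWorkshare) &&
      hasOmpLoopLowering(intrinsic))
    return "omp loops";
  return "runtime call";
}
```

Keeping this dispatch in its own compilation unit, as suggested, means unsupported intrinsics silently keep the existing runtime-call path.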
Thanks for the inputs @tblah.
How do we usually reach convergence on the definition of these runtime functions when we introduce a library? Is it usually through an RFC? Because if we think of vendors swapping implementations, does there need to be a consensus prior to the implementation?
To be clear, I don't know of anyone who has expressed an interest in swapping implementations here; this just looked like an advantage to me, like how some vendors have their own openmp library or veclib (vectorized libm). An RFC and discussion on community calls sounds like a good way to agree on an interface if this route is taken.
They won't be able to do that regardless. The OpenMP library will have to match what Flang is using; otherwise, two OpenMP implementations will co-exist in the same process and that's usually a recipe for problems. Regarding the initial question: I'm ambivalent on the direction (runtime entry point vs. code-gen). I will say though that if we go with the runtime entry point, we will tie the Fortran runtime to the OpenMP runtime, which may not be desired in all cases.
I probably wasn't clear. I meant something like
Force-pushed a0f2307 to 0abf058.
Continuation from #104748
Using Ivan's PR (#104748), the intrinsic calls were enclosed within the single construct.
So, I decided to pick one and experiment with enclosing it in a wsloop node instead,
which lets us share the work among different threads.
Here, I picked the SUM intrinsic and changed the lowering to generate a wsloop with the SUM operations
as its body. See: https://github.com/llvm/llvm-project/pull/113082/files#diff-51125680c30229c9697d5fd264bf0d3f4effb9034c8951abea53206e8402e72b for the generated MLIR.
The same idea has to be performed to implement all the intrinsics mentioned in
the documentation.