Conversation

@Thirumalai-Shaktivel (Member) commented Oct 20, 2024

Continuation from #104748

From Documentation:

Evaluation of transformational array intrinsic functions may be freely
subdivided into any number of units of work.
The transformational array intrinsic functions are MATMUL, DOT_PRODUCT, SUM,
PRODUCT, MAXVAL, MINVAL, COUNT, ANY, ALL, SPREAD, PACK, UNPACK, RESHAPE,
TRANSPOSE, EOSHIFT, CSHIFT, MINLOC, and MAXLOC.

In Ivan's PR #104748, the intrinsic calls were enclosed within the single construct.
So, I decided to pick one intrinsic and experiment with enclosing it in a wsloop node,
which lets the work be shared among different threads.
Here, I picked the SUM intrinsic and changed its lowering to generate a wsloop with the
operations as its body. See https://github.com/llvm/llvm-project/pull/113082/files#diff-51125680c30229c9697d5fd264bf0d3f4effb9034c8951abea53206e8402e72b for the generated MLIR.

The same approach then needs to be applied to all of the intrinsics mentioned in
the documentation.
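
As a concrete illustration (a minimal sketch; the variable names x, array, and i follow the comment added in LowerWorkshare.cpp), a workshare assignment such as

!$omp parallel workshare
    x = sum(array)
!$omp end parallel workshare

is intended to be lowered as if it had been written as

!$omp parallel do
do i = 1, size(array)
    x = x + array(i)
end do
!$omp end parallel do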

Add custom omp loop wrapper

Add recursive memory effects trait to workshare

Remove stray include

Remove omp.workshare verifier

Add assembly format for wrapper and add test

Add verification and descriptions
Fix lower test for workshare
Emit loop nests in a custom wrapper

Only emit unordered loops as omp loops

Fix uninitialized memory bug in genLoopNest
Change to workshare loop wrapper op

Move single op declaration

Schedule pass properly

Correctly handle nested loop nests to be parallelized by workshare

Leave comments for shouldUseWorkshareLowering

Use copyprivate to scatter val from omp.single

TODO still need to implement copy function
TODO transitive check for usage outside of omp.single not implemented yet

Transitively check for users outside of single op

TODO need to implement copy func
TODO need to hoist allocas outside of single regions

Add tests

Hoist allocas

More tests

Emit body for copy func

Test the tmp storing logic

Clean up trivially dead ops

Only handle single-block regions for now

Fix tests for custom assembly for loop wrapper

Only run the lower workshare pass if openmp is enabled

Implement some missing functionality

Fix tests

Fix test

Iterate backwards to find all trivially dead ops

Add explanation comment for createCopyFunc

Update test
Bufferize test

Bufferize test

Bufferize test

Add test for should use workshare lowering

@llvmbot commented Oct 20, 2024

@llvm/pr-subscribers-flang-fir-hlfir

@llvm/pr-subscribers-flang-openmp

Author: Thirumalai Shaktivel (Thirumalai-Shaktivel)

Changes

Full diff: https://github.com/llvm/llvm-project/pull/113082.diff

6 Files Affected:

  • (modified) flang/lib/Optimizer/HLFIR/Transforms/CMakeLists.txt (+1)
  • (modified) flang/lib/Optimizer/OpenMP/CMakeLists.txt (+1)
  • (modified) flang/lib/Optimizer/OpenMP/LowerWorkshare.cpp (+116-35)
  • (modified) flang/lib/Optimizer/Passes/Pipelines.cpp (+2-2)
  • (modified) flang/test/Fir/basic-program.fir (+1-1)
  • (added) flang/test/Integration/OpenMP/workshare02.f90 (+36)
diff --git a/flang/lib/Optimizer/HLFIR/Transforms/CMakeLists.txt b/flang/lib/Optimizer/HLFIR/Transforms/CMakeLists.txt
index fa3a59303137ff..43da1eb92d0306 100644
--- a/flang/lib/Optimizer/HLFIR/Transforms/CMakeLists.txt
+++ b/flang/lib/Optimizer/HLFIR/Transforms/CMakeLists.txt
@@ -26,6 +26,7 @@ add_flang_library(HLFIRTransforms
   FIRTransforms
   HLFIRDialect
   MLIRIR
+  FlangOpenMPTransforms
   ${dialect_libs}
 
   LINK_COMPONENTS
diff --git a/flang/lib/Optimizer/OpenMP/CMakeLists.txt b/flang/lib/Optimizer/OpenMP/CMakeLists.txt
index 39e92d388288d4..776798d7239117 100644
--- a/flang/lib/Optimizer/OpenMP/CMakeLists.txt
+++ b/flang/lib/Optimizer/OpenMP/CMakeLists.txt
@@ -21,6 +21,7 @@ add_flang_library(FlangOpenMPTransforms
   FortranCommon
   MLIRFuncDialect
   MLIROpenMPDialect
+  MLIRArithDialect
   HLFIRDialect
   MLIRIR
   MLIRPass
diff --git a/flang/lib/Optimizer/OpenMP/LowerWorkshare.cpp b/flang/lib/Optimizer/OpenMP/LowerWorkshare.cpp
index 225c585a02d913..9fb84f3e099c4c 100644
--- a/flang/lib/Optimizer/OpenMP/LowerWorkshare.cpp
+++ b/flang/lib/Optimizer/OpenMP/LowerWorkshare.cpp
@@ -16,6 +16,7 @@
 //
 //===----------------------------------------------------------------------===//
 
+#include "flang/Optimizer/Builder/HLFIRTools.h"
 #include <flang/Optimizer/Builder/FIRBuilder.h>
 #include <flang/Optimizer/Dialect/FIROps.h>
 #include <flang/Optimizer/Dialect/FIRType.h>
@@ -335,49 +336,129 @@ static void parallelizeRegion(Region &sourceRegion, Region &targetRegion,
     for (auto [i, opOrSingle] : llvm::enumerate(regions)) {
       bool isLast = i + 1 == regions.size();
       if (std::holds_alternative<SingleRegion>(opOrSingle)) {
-        OpBuilder singleBuilder(sourceRegion.getContext());
-        Block *singleBlock = new Block();
-        singleBuilder.setInsertionPointToStart(singleBlock);
-
         OpBuilder allocaBuilder(sourceRegion.getContext());
         Block *allocaBlock = new Block();
         allocaBuilder.setInsertionPointToStart(allocaBlock);
 
-        OpBuilder parallelBuilder(sourceRegion.getContext());
-        Block *parallelBlock = new Block();
-        parallelBuilder.setInsertionPointToStart(parallelBlock);
-
-        auto [allParallelized, copyprivateVars] =
-            moveToSingle(std::get<SingleRegion>(opOrSingle), allocaBuilder,
-                         singleBuilder, parallelBuilder);
-        if (allParallelized) {
-          // The single region was not required as all operations were safe to
-          // parallelize
-          assert(copyprivateVars.empty());
-          assert(allocaBlock->empty());
-          delete singleBlock;
+        it = block.begin();
+        while (&*it != terminator)
+          if (isa<hlfir::SumOp>(it))
+            break;
+          else
+            it++;
+
+        if (auto sumOp = dyn_cast<hlfir::SumOp>(it)) {
+          /// Implementation:
+          /// Intrinsic function `SUM` operations
+          /// --
+          /// x = sum(array)
+          ///
+          /// is converted to
+          ///
+          /// !$omp parallel do
+          /// do i = 1, size(array)
+          ///     x = x + array(i)
+          /// end do
+          /// !$omp end parallel do
+
+          OpBuilder wslBuilder(sourceRegion.getContext());
+          Block *wslBlock = new Block();
+          wslBuilder.setInsertionPointToStart(wslBlock);
+
+          Value target = dyn_cast<hlfir::AssignOp>(++it).getLhs();
+          Value array = sumOp.getArray();
+          Value dim = sumOp.getDim();
+          fir::SequenceType arrayTy = dyn_cast<fir::SequenceType>(
+              hlfir::getFortranElementOrSequenceType(array.getType()));
+          llvm::ArrayRef<int64_t> arrayShape = arrayTy.getShape();
+          if (arrayShape.size() == 1 && !dim) {
+            Value itr = allocaBuilder.create<fir::AllocaOp>(
+                loc, allocaBuilder.getI64Type());
+            Value c_one = allocaBuilder.create<arith::ConstantOp>(
+                loc, allocaBuilder.getI64IntegerAttr(1));
+            Value c_arr_size = allocaBuilder.create<arith::ConstantOp>(
+                loc, allocaBuilder.getI64IntegerAttr(arrayShape[0]));
+            // Value c_zero = allocaBuilder.create<arith::ConstantOp>(loc,
+            //     allocaBuilder.getZeroAttr(arrayTy.getEleTy()));
+            // allocaBuilder.create<fir::StoreOp>(loc, c_zero, target);
+
+            omp::WsloopOperands wslOps;
+            omp::WsloopOp wslOp =
+                rootBuilder.create<omp::WsloopOp>(loc, wslOps);
+
+            hlfir::LoopNest ln;
+            ln.outerOp = wslOp;
+            omp::LoopNestOperands lnOps;
+            lnOps.loopLowerBounds.push_back(c_one);
+            lnOps.loopUpperBounds.push_back(c_arr_size);
+            lnOps.loopSteps.push_back(c_one);
+            lnOps.loopInclusive = wslBuilder.getUnitAttr();
+            omp::LoopNestOp lnOp =
+                wslBuilder.create<omp::LoopNestOp>(loc, lnOps);
+            Block *lnBlock = wslBuilder.createBlock(&lnOp.getRegion());
+            lnBlock->addArgument(c_one.getType(), loc);
+            wslBuilder.create<fir::StoreOp>(
+                loc, lnOp.getRegion().getArgument(0), itr);
+            Value tarLoad = wslBuilder.create<fir::LoadOp>(loc, target);
+            Value itrLoad = wslBuilder.create<fir::LoadOp>(loc, itr);
+            hlfir::DesignateOp arrDesOp = wslBuilder.create<hlfir::DesignateOp>(
+                loc, fir::ReferenceType::get(arrayTy.getEleTy()), array,
+                itrLoad);
+            Value desLoad = wslBuilder.create<fir::LoadOp>(loc, arrDesOp);
+            Value addf =
+                wslBuilder.create<arith::AddFOp>(loc, tarLoad, desLoad);
+            wslBuilder.create<fir::StoreOp>(loc, addf, target);
+            wslBuilder.create<omp::YieldOp>(loc);
+            ln.body = lnBlock;
+            wslOp.getRegion().push_back(wslBlock);
+            targetRegion.front().getOperations().splice(
+                wslOp->getIterator(), allocaBlock->getOperations());
+          } else {
+            emitError(loc, "Only 1D array scalar assignment for sum "
+                           "instrinsic is supported in workshare construct");
+            return;
+          }
         } else {
-          omp::SingleOperands singleOperands;
-          if (isLast)
-            singleOperands.nowait = rootBuilder.getUnitAttr();
-          singleOperands.copyprivateVars = copyprivateVars;
-          cleanupBlock(singleBlock);
-          for (auto var : singleOperands.copyprivateVars) {
-            mlir::func::FuncOp funcOp =
-                createCopyFunc(loc, var.getType(), firCopyFuncBuilder);
-            singleOperands.copyprivateSyms.push_back(
-                SymbolRefAttr::get(funcOp));
+          OpBuilder singleBuilder(sourceRegion.getContext());
+          Block *singleBlock = new Block();
+          singleBuilder.setInsertionPointToStart(singleBlock);
+
+          OpBuilder parallelBuilder(sourceRegion.getContext());
+          Block *parallelBlock = new Block();
+          parallelBuilder.setInsertionPointToStart(parallelBlock);
+
+          auto [allParallelized, copyprivateVars] =
+              moveToSingle(std::get<SingleRegion>(opOrSingle), allocaBuilder,
+                           singleBuilder, parallelBuilder);
+          if (allParallelized) {
+            // The single region was not required as all operations were safe to
+            // parallelize
+            assert(copyprivateVars.empty());
+            assert(allocaBlock->empty());
+            delete singleBlock;
+          } else {
+            omp::SingleOperands singleOperands;
+            if (isLast)
+              singleOperands.nowait = rootBuilder.getUnitAttr();
+            singleOperands.copyprivateVars = copyprivateVars;
+            cleanupBlock(singleBlock);
+            for (auto var : singleOperands.copyprivateVars) {
+              mlir::func::FuncOp funcOp =
+                  createCopyFunc(loc, var.getType(), firCopyFuncBuilder);
+              singleOperands.copyprivateSyms.push_back(
+                  SymbolRefAttr::get(funcOp));
+            }
+            omp::SingleOp singleOp =
+                rootBuilder.create<omp::SingleOp>(loc, singleOperands);
+            singleOp.getRegion().push_back(singleBlock);
+            targetRegion.front().getOperations().splice(
+                singleOp->getIterator(), allocaBlock->getOperations());
           }
-          omp::SingleOp singleOp =
-              rootBuilder.create<omp::SingleOp>(loc, singleOperands);
-          singleOp.getRegion().push_back(singleBlock);
-          targetRegion.front().getOperations().splice(
-              singleOp->getIterator(), allocaBlock->getOperations());
+          rootBuilder.getInsertionBlock()->getOperations().splice(
+              rootBuilder.getInsertionPoint(), parallelBlock->getOperations());
+          delete parallelBlock;
         }
-        rootBuilder.getInsertionBlock()->getOperations().splice(
-            rootBuilder.getInsertionPoint(), parallelBlock->getOperations());
         delete allocaBlock;
-        delete parallelBlock;
       } else {
         auto op = std::get<Operation *>(opOrSingle);
         if (auto wslw = dyn_cast<omp::WorkshareLoopWrapperOp>(op)) {
diff --git a/flang/lib/Optimizer/Passes/Pipelines.cpp b/flang/lib/Optimizer/Passes/Pipelines.cpp
index c1a5902b747887..3a9166f2e0aa52 100644
--- a/flang/lib/Optimizer/Passes/Pipelines.cpp
+++ b/flang/lib/Optimizer/Passes/Pipelines.cpp
@@ -227,11 +227,11 @@ void createHLFIRToFIRPassPipeline(mlir::PassManager &pm, bool enableOpenMP,
                                          hlfir::createOptimizedBufferization);
   }
   pm.addPass(hlfir::createLowerHLFIROrderedAssignments());
+  if (enableOpenMP)
+    pm.addPass(flangomp::createLowerWorkshare());
   pm.addPass(hlfir::createLowerHLFIRIntrinsics());
   pm.addPass(hlfir::createBufferizeHLFIR());
   pm.addPass(hlfir::createConvertHLFIRtoFIR());
-  if (enableOpenMP)
-    pm.addPass(flangomp::createLowerWorkshare());
 }
 
 /// Create a pass pipeline for handling certain OpenMP transformations needed
diff --git a/flang/test/Fir/basic-program.fir b/flang/test/Fir/basic-program.fir
index 4b18acb7c2b430..9b9651a476e583 100644
--- a/flang/test/Fir/basic-program.fir
+++ b/flang/test/Fir/basic-program.fir
@@ -44,10 +44,10 @@ func.func @_QQmain() {
 // PASSES-NEXT: 'omp.private' Pipeline
 // PASSES-NEXT:    OptimizedBufferization
 // PASSES-NEXT:   LowerHLFIROrderedAssignments
+// PASSES-NEXT:   LowerWorkshare
 // PASSES-NEXT:   LowerHLFIRIntrinsics
 // PASSES-NEXT:   BufferizeHLFIR
 // PASSES-NEXT:   ConvertHLFIRtoFIR
-// PASSES-NEXT:   LowerWorkshare
 // PASSES-NEXT:   CSE
 // PASSES-NEXT:   (S) 0 num-cse'd - Number of operations CSE'd
 // PASSES-NEXT:   (S) 0 num-dce'd - Number of operations DCE'd
diff --git a/flang/test/Integration/OpenMP/workshare02.f90 b/flang/test/Integration/OpenMP/workshare02.f90
new file mode 100644
index 00000000000000..68b810a32f247e
--- /dev/null
+++ b/flang/test/Integration/OpenMP/workshare02.f90
@@ -0,0 +1,36 @@
+!===----------------------------------------------------------------------===!
+! This directory can be used to add Integration tests involving multiple
+! stages of the compiler (for eg. from Fortran to LLVM IR). It should not
+! contain executable tests. We should only add tests here sparingly and only
+! if there is no other way to test. Repeat this message in each test that is
+! added to this directory and sub-directories.
+!===----------------------------------------------------------------------===!
+
+!RUN: %flang_fc1 -emit-mlir -fopenmp -O3 %s -o - | FileCheck %s --check-prefix MLIR
+
+program test_ws_01
+    implicit none
+    real(8) :: arr_01(10), x
+    arr_01 = [0.347,0.892,0.573,0.126,0.788,0.412,0.964,0.205,0.631,0.746]
+
+    !$omp parallel workshare
+        x = sum(arr_01)
+    !$omp end parallel workshare
+end program test_ws_01
+
+! MLIR:  func.func @_QQmain
+! MLIR:    omp.parallel {
+!            [...]
+! MLIR:      omp.wsloop {
+! MLIR:        omp.loop_nest {{.*}}
+!                [...]
+! MLIR:          %[[SUM:.*]] = arith.addf {{.*}}
+!                [...]
+! MLIR:          omp.yield
+! MLIR:        }
+! MLIR:      }
+! MLIR:      omp.barrier
+! MLIR:      omp.terminator
+! MLIR:      }
+! MLIR:    return
+! MLIR:    }

@Thirumalai-Shaktivel (Author) commented Oct 20, 2024

Two issues to fix in this PR:

  1. Initialize the target with zero: https://github.com/llvm/llvm-project/pull/113082/files#diff-9fd96de2e44f679d43e7a59e7371b6a40c2731d496469c78ede4c6e960c53715R381-R383
  2. Fix Ivan's test failures.

@Thirumalai-Shaktivel (Author):

Kindly share your thoughts on this idea.

@ivanradanov (Contributor) commented Oct 20, 2024

I think there are two main approaches possible here, and they can coexist:

Have alternative intrinsic implementations in their own runtime library

e.g. the version of assignment for the workshare construct would be something like this:

workshare_Assign(a, b) {
  #pragma omp for
  for (int i = 0; i < size; i++)
    a[i] = b[i];
}

And then for the supported subset the lowering of the array intrinsic becomes not

omp.parallel {
  omp.single {
    call Assign()
  }
}

but

omp.parallel {
  call workshare_Assign()
}

(It could also be implemented as a flag to the runtime function telling it which version to use, rather than as a separate runtime function.)

Hopefully using this approach would allow us to share some code between the current intrinsic runtime implementations (which do not assume an omp context) and these ones.

I think this approach may be more flexible and allow for better optimizations like offloading to BLAS libraries or having an optimized code path behind an aliasing check. (These are possible in the second approach but more cumbersome?)

Caveat: I am not sure if mixing omp parallel from fortran and omp for from C/C++ actually works.

Code gen the intrinsics at the IR level

This is what the approach in this PR would be. My concern with this is that it may be quite a lot of effort, but perhaps it has some benefits in that we would get better support for intrinsics on GPU targets, where the current runtime does not compile as is.

It may be worth looking for the discussion (if it exists) from when the runtime-library approach for the intrinsic implementations was originally decided.

@tblah (Contributor) left a comment

I don't have a strong opinion about whether this should be done in flang codegen or in a runtime library. Some drive-by thoughts:

  • I imagine that using a C omp for inside of a fortran omp parallel should work so long as they use the same openmp library. Of course care will need to be taken about privatisation etc but that can probably be handled in the interface of the runtime function.
  • The above requirement basically breaks allowing users to pick their own openmp library when compiling their application. I'm not sure how widely used this is.
  • This is already hard to read/review and SUM is a very simple case. I think there is a danger in re-inventing the wheel here. Something like a high quality parallelised MATMUL which is useful on both CPU and GPU sounds hard. There is a lot of work in that direction in "upstream" MLIR dialects. I know flang (for historical reasons) doesn't use linalg, affine, etc but we might have to eventually.
  • If we decide not to re-use upstream mlir work, I think more complex intrinsics could get quite hard to review in this style. A C++ runtime library implementation would be easier to read and maintain
  • Having it in a runtime library would make it easier for vendors to swap out their own implementations e.g. maybe somebody already has a great MATMUL for AMD GPUs and wants flang to use that, but the implementation is useless on a CPU

Overall I don't know what the right approach is. Maybe we should do both as Ivan said. I think if we do decide to implement several intrinsics this way, it should go in its own pass because the code could quickly become very long.

fir::SequenceType arrayTy = dyn_cast<fir::SequenceType>(
hlfir::getFortranElementOrSequenceType(array.getType()));
llvm::ArrayRef<int64_t> arrayShape = arrayTy.getShape();
if (arrayShape.size() == 1 && !dim) {

I think you also need to check that the element type is floating point.

///
/// !$omp parallel do
/// do i = 1, size(array)
/// x = x + array(i)

I don't think this will work. x will be updated simultaneously from all of the threads, each overwriting the others' result. It is a classic race condition.

I think you need to add a reduction clause to the wsloop operation reducing x with +.
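
A minimal Fortran-level sketch of the suggested fix (illustrative only, not the PR's generated MLIR; the reduction clause gives each thread a private, zero-initialized copy of x and combines the partial sums when the loop ends):

x = 0.0
!$omp parallel do reduction(+:x)
do i = 1, size(array)
    x = x + array(i)
end do
!$omp end parallel do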

targetRegion.front().getOperations().splice(
wslOp->getIterator(), allocaBlock->getOperations());
} else {
emitError(loc, "Only 1D array scalar assignment for sum "

Instead of emitting an error here, it would be better to go to the outer else branch and use the runtime library version of SUM.

}
pm.addPass(hlfir::createLowerHLFIROrderedAssignments());
if (enableOpenMP)
pm.addPass(flangomp::createLowerWorkshare());

I think the workshare lowering pass assumes it runs after HLFIR bufferization. For example see the changes in the previous PR #104748. @ivanradanov please could you advise on this.

Another approach would be to modify createLowerHLFIRIntrinsics so that it does not lower eligible hlfir.sum operations if they are inside of a workshare (a bit like how bufferization was modified in the PR above).

@ivanradanov (Contributor) commented Oct 23, 2024

I think we can inject the code that lowers hlfir.<intrinsic> ops into OpenMP loops at the point where the hlfir.<intrinsic> ops are lowered into fir.call ops to the Flang runtime library. We should have a check for whether to use the generic lowering or the OpenMP loop lowering (using the shouldUseWorkshareLowering function) and then, for the supported intrinsics, lower to OpenMP loops instead of the runtime call.

Another option is to have a separate pass which runs before the pass that does the fir.call lowering and takes care of the intrinsics we support.

Even if we use the first approach, we should probably still keep the collection of these lowerings in a separate compilation unit. That way it should not interfere much with the existing code and should be easy to integrate.

@tblah (Contributor) commented Oct 23, 2024

  • I imagine that using a C omp for inside of a fortran omp parallel should work so long as they use the same openmp library. Of course care will need to be taken about privatisation etc but that can probably be handled in the interface of the runtime function.

It might not be practical though. An omp for inside of an omp parallel looks a bit like this:

static GlobalOpenMPData data;

void per_thread_callback([...]) {
  int lowerbound, upperbound, stride;
  __kmpc_for_static_init_4u(&data, [...], &lowerbound, &upperbound, &stride, [...]);
  for (int i = lowerbound; i < upperbound; i += stride) {
  }
  __kmpc_for_static_fini(&data, [...]);
}

void sourceFunc() {
  __kmpc_fork_call(&data, [...], per_thread_callback, [...]);
}

So I guess we would have to pass the OpenMP struct to the runtime function and then do all OpenMP runtime calls manually inside of the runtime library, which is not ideal. Even this would be hard to do, because the OpenMP struct is only added to the module when we convert to LLVM IR, so referring to it here would break the layering of the compiler.

@NimishMishra (Contributor) commented Oct 23, 2024

Thanks for the inputs @tblah.

Having it in a runtime library would make it easier for vendors to swap out their own implementations e.g. maybe somebody already has a great MATMUL for AMD GPUs and wants flang to use that, but the implementation is useless on a CPU

How do we usually reach convergence on the definition of these runtime functions when we introduce a library? Is it usually through an RFC? Because if we think of vendors swapping implementations, does there need to be a consensus prior to the implementation?

@NimishMishra requested a review from mjklemm, October 23, 2024 12:34
@tblah (Contributor) commented Oct 23, 2024

How do we usually have convergence on the definition of these runtime functions when we introduce a library? Is it usually through a RFC? Because if we think of vendors swapping implementations, does there need to be a consensus prior to the implementation?

To be clear, I don't know of anyone who has expressed an interest in swapping implementations here; this just looked like an advantage to me, like how some vendors have their own OpenMP library or veclib (vectorized libm). An RFC and discussion on community calls sounds like a good way to agree on an interface if this route is taken.

@mjklemm (Contributor) commented Oct 23, 2024

  • The above requirement basically breaks allowing users to pick their own openmp library when compiling their application. I'm not sure how widely used this is.

They won't be able to do that regardless. The OpenMP library will have to match what Flang is using; otherwise, two OpenMP implementations will co-exist in the same process, and that is usually a recipe for problems.

Regarding the initial question: I'm ambivalent on the direction (runtime entry point vs. code-gen). I will say though that if we go with the runtime entry point, we will tie the Fortran runtime to the OpenMP runtime, which may not be desired in all cases.

@tblah (Contributor) commented Oct 23, 2024

They won't be able to that regardless. The OpenMP library will have to match what Flang is using; otherwise, two OpenMP implementations will co-exist in the same process and that's usually a recipe for problems.

I probably wasn't clear. I meant something like -fopenmp=libomp or -fopenmp=libiomp5 etc. IIRC this works because the different openmp runtimes have the same interface so we can just swap out which one is linked to the user's application.

@jeanPerier self-requested a review, October 24, 2024 07:19
@ivanradanov force-pushed the users/ivanradanov/flang-workshare-hlfir-lower branch 2 times, most recently from a0f2307 to 0abf058, November 19, 2024 08:03
@ivanradanov deleted the branch llvm:users/ivanradanov/flang-workshare-hlfir-lower, November 20, 2024 01:49

Labels: flang:fir-hlfir, flang:openmp, flang