Skip to content

Conversation

@agozillon
Copy link
Contributor

@agozillon agozillon commented Dec 10, 2024

Currently the compiler will ICE in programs like the following on the device lowering pass:

program main
    implicit none

    type i1_t
       integer :: val(1000)
    end type i1_t
    integer :: i
    type(i1_t), pointer :: newi1
    type(i1_t), pointer :: tab=>null()

    integer, dimension(:), pointer :: tabval

!$omp THREADPRIVATE(tab)

allocate(newi1)

tab=>newi1
tab%val(:)=1
tabval=>tab%val

!$omp target teams distribute parallel do
  do i = 1, 1000
   tabval(i) = i
 end do
!$omp end target teams distribute parallel do

end program main

This is due to the fact that THREADPRIVATE returns a result operation, and this operation can actually be used by other LLVM dialect (or other dialect) operations. However, we currently skip the lowering of threadprivate, so we effectively never generate and bind an LLVM-IR result to the threadprivate operation result. So when we later go on to lower dependent LLVM dialect operations, we are missing the required LLVM-IR result, try to access and use it and then ICE. The fix in this particular PR is to allow compilation of threadprivate for device as well as host, and simply treat the device compilation as a no-op, binding the LLVM-IR result of threadprivate with no alterations and binding it, which will allow the rest of the compilation to proceed, where we'll eventually discard the host segment in any case.

The other possible solution to this I can think of, is doing something similar to Flang's passes that occur prior to CodeGen to the LLVM dialect, where they erase/no-op certain unrequired operations or transform them to lower level series of operations. And we would erase/no-op threadprivate on device as we'd never have these in target regions.

The main issues I can see with this are that we currently do not specialise this stage based on wether we're compiling for device or host, so it's setting a precedent and adding another point of having to understand the separation between target and host compilation. I am also not sure we'd necessarily want to enforce this at a dialect level incase someone else wishes to add a different lowering flow or translation flow. Another possible issue is that a target operation we have/utilise would depend on the result of threadprivate, meaning we'd not be allowed to entirely erase/no-op it, I am not sure of any situations where this may be an issue currently though.

@agozillon agozillon changed the title [OpenMP][MLIR] Fix threadprivate lowering when compiling for target w… [OpenMP][MLIR] Fix threadprivate lowering when compiling for target when target operations are in use Dec 10, 2024
@llvmbot
Copy link
Member

llvmbot commented Dec 10, 2024

@llvm/pr-subscribers-mlir
@llvm/pr-subscribers-mlir-llvm
@llvm/pr-subscribers-flang-openmp

@llvm/pr-subscribers-offload

Author: None (agozillon)

Changes

Currently the compiler will ICE in programs like the following on the device lowering pass:

program main
implicit none

type i1_t
   integer :: val(1000)
end type i1_t
integer :: i
type(i1_t), pointer :: newi1
type(i1_t), pointer :: tab=>null()

integer, dimension(:), pointer :: tabval

!$omp THREADPRIVATE(tab)

allocate(newi1)

tab=>newi1
tab%val(:)=1
tabval=>tab%val

!$omp target teams distribute parallel do
do i = 1, 1000
tabval(i) = i
end do
!$omp end target teams distribute parallel do

end program main

This is due to the fact that THREADPRIVATE returns a result operation, and this operation can actually be used by other LLVM dialect (or other dialect) operations. However, we currently skip the lowering of threadprivate, so we effectively never generate and bind an LLVM-IR result to the threadprivate operation result. So when we later go on to lower dependent LLVM dialect operations, we are missing the required LLVM-IR result, try to access and use it and then ICE. The fix in this particular PR is to allow compilation of threadprivate for device as well as host, and simply treat the device compilation as a no-op, binding the LLVM-IR result of threadprivate with no alterations and binding it, which will allow the rest of the compilation to proceed, where we'll eventually discard the host segment in any case.

The other possible solution to this I can think of, is doing something similar to Flang's passes that occur prior to CodeGen to the LLVM dialect, where they erase/no-op certain unrequired operations or transform them to lower level series of operations. And we would erase/no-op threadprivate on device as we'd never have these in target regions.

The main issues I can see with this are that we currently do not specialise this stage based on wether we're compiling for device or host, so it's setting a precedent and adding another point of having to understand the separation between target and host compilation. I am also not sure we'd necessarily want to enforce this at a dialect level incase someone else wishes to add a different lowering flow or translation flow. Another possible issue is that a target operation we have/utilise would depend on the result of threadprivate, meaning we'd not be allowed to entirely erase/no-op it, I am not sure of any situations where this may be an issue currently though.


Full diff: https://github.com/llvm/llvm-project/pull/119310.diff

2 Files Affected:

  • (modified) mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp (+29-12)
  • (added) offload/test/offloading/fortran/target-with-threadprivate.f90 (+37)
diff --git a/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp b/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
index 49fe509800491a..52c385545361b6 100644
--- a/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
+++ b/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
@@ -2537,6 +2537,10 @@ convertOmpThreadprivate(Operation &opInst, llvm::IRBuilderBase &builder,
 
   Value symAddr = threadprivateOp.getSymAddr();
   auto *symOp = symAddr.getDefiningOp();
+
+  if (auto asCast = dyn_cast<LLVM::AddrSpaceCastOp>(symOp))
+    symOp = asCast.getOperand().getDefiningOp();
+
   if (!isa<LLVM::AddressOfOp>(symOp))
     return opInst.emitError("Addressing symbol not found");
   LLVM::AddressOfOp addressOfOp = dyn_cast<LLVM::AddressOfOp>(symOp);
@@ -2544,17 +2548,23 @@ convertOmpThreadprivate(Operation &opInst, llvm::IRBuilderBase &builder,
   LLVM::GlobalOp global =
       addressOfOp.getGlobal(moduleTranslation.symbolTable());
   llvm::GlobalValue *globalValue = moduleTranslation.lookupGlobal(global);
-  llvm::Type *type = globalValue->getValueType();
-  llvm::TypeSize typeSize =
-      builder.GetInsertBlock()->getModule()->getDataLayout().getTypeStoreSize(
-          type);
-  llvm::ConstantInt *size = builder.getInt64(typeSize.getFixedValue());
-  llvm::StringRef suffix = llvm::StringRef(".cache", 6);
-  std::string cacheName = (Twine(global.getSymName()).concat(suffix)).str();
-  llvm::Value *callInst =
-      moduleTranslation.getOpenMPBuilder()->createCachedThreadPrivate(
-          ompLoc, globalValue, size, cacheName);
-  moduleTranslation.mapValue(opInst.getResult(0), callInst);
+
+  if (!moduleTranslation.getOpenMPBuilder()->Config.isTargetDevice()) {
+    llvm::Type *type = globalValue->getValueType();
+    llvm::TypeSize typeSize =
+        builder.GetInsertBlock()->getModule()->getDataLayout().getTypeStoreSize(
+            type);
+    llvm::ConstantInt *size = builder.getInt64(typeSize.getFixedValue());
+    llvm::StringRef suffix = llvm::StringRef(".cache", 6);
+    std::string cacheName = (Twine(global.getSymName()).concat(suffix)).str();
+    llvm::Value *callInst =
+        moduleTranslation.getOpenMPBuilder()->createCachedThreadPrivate(
+            ompLoc, globalValue, size, cacheName);
+    moduleTranslation.mapValue(opInst.getResult(0), callInst);
+  } else {
+    moduleTranslation.mapValue(opInst.getResult(0), globalValue);
+  }
+
   return success();
 }
 
@@ -4090,6 +4100,14 @@ static bool isTargetDeviceOp(Operation *op) {
   if (op->getParentOfType<omp::TargetOp>())
     return true;
 
+  // Certain operations return results, and wether utilised in host or
+  // target there is a chance an LLVM Dialect operation depends on it
+  // by taking it in as an operand, so we must always lower these in
+  // some manner or result in an ICE (whether they end up in a no-op
+  // or otherwise).
+  if (mlir::isa<omp::ThreadprivateOp>(op))
+    return true;
+
   if (auto parentFn = op->getParentOfType<LLVM::LLVMFuncOp>())
     if (auto declareTargetIface =
             llvm::dyn_cast<mlir::omp::DeclareTargetInterface>(
@@ -4107,7 +4125,6 @@ static bool isTargetDeviceOp(Operation *op) {
 static LogicalResult
 convertHostOrTargetOperation(Operation *op, llvm::IRBuilderBase &builder,
                              LLVM::ModuleTranslation &moduleTranslation) {
-
   llvm::OpenMPIRBuilder *ompBuilder = moduleTranslation.getOpenMPBuilder();
 
   return llvm::TypeSwitch<Operation *, LogicalResult>(op)
diff --git a/offload/test/offloading/fortran/target-with-threadprivate.f90 b/offload/test/offloading/fortran/target-with-threadprivate.f90
new file mode 100644
index 00000000000000..10c7cecf084123
--- /dev/null
+++ b/offload/test/offloading/fortran/target-with-threadprivate.f90
@@ -0,0 +1,37 @@
+! Basic offloading test that makes sure we can use the predominantly host
+! pragma threadprivate in the same program as target code
+! REQUIRES: flang, amdgpu
+
+! RUN: %libomptarget-compile-fortran-run-and-check-generic
+program main
+    implicit none
+
+    type dtype
+       integer :: val(10)
+    end type dtype
+
+    integer :: i
+    type(dtype), pointer :: pointer1
+    type(dtype), pointer :: pointer2=>null()
+    integer, dimension(:), pointer :: data_pointer
+
+!$omp threadprivate(pointer2)
+
+nullify(pointer1)
+allocate(pointer1)
+
+pointer2=>pointer1
+pointer2%val(:)=1
+data_pointer=>pointer2%val
+
+!$omp target
+   do i = 1, 10
+     data_pointer(i) = i
+   end do
+!$omp end target
+
+print *, data_pointer
+
+end program main
+
+! CHECK: 1 2 3 4 5 6 7 8 9 10

@llvmbot
Copy link
Member

llvmbot commented Dec 10, 2024

@llvm/pr-subscribers-mlir-openmp

Author: None (agozillon)

Changes

Currently the compiler will ICE in programs like the following on the device lowering pass:

program main
implicit none

type i1_t
   integer :: val(1000)
end type i1_t
integer :: i
type(i1_t), pointer :: newi1
type(i1_t), pointer :: tab=&gt;null()

integer, dimension(:), pointer :: tabval

!$omp THREADPRIVATE(tab)

allocate(newi1)

tab=>newi1
tab%val(:)=1
tabval=>tab%val

!$omp target teams distribute parallel do
do i = 1, 1000
tabval(i) = i
end do
!$omp end target teams distribute parallel do

end program main

This is due to the fact that THREADPRIVATE returns a result operation, and this operation can actually be used by other LLVM dialect (or other dialect) operations. However, we currently skip the lowering of threadprivate, so we effectively never generate and bind an LLVM-IR result to the threadprivate operation result. So when we later go on to lower dependent LLVM dialect operations, we are missing the required LLVM-IR result, try to access and use it and then ICE. The fix in this particular PR is to allow compilation of threadprivate for device as well as host, and simply treat the device compilation as a no-op, binding the LLVM-IR result of threadprivate with no alterations and binding it, which will allow the rest of the compilation to proceed, where we'll eventually discard the host segment in any case.

The other possible solution to this I can think of, is doing something similar to Flang's passes that occur prior to CodeGen to the LLVM dialect, where they erase/no-op certain unrequired operations or transform them to lower level series of operations. And we would erase/no-op threadprivate on device as we'd never have these in target regions.

The main issues I can see with this are that we currently do not specialise this stage based on wether we're compiling for device or host, so it's setting a precedent and adding another point of having to understand the separation between target and host compilation. I am also not sure we'd necessarily want to enforce this at a dialect level incase someone else wishes to add a different lowering flow or translation flow. Another possible issue is that a target operation we have/utilise would depend on the result of threadprivate, meaning we'd not be allowed to entirely erase/no-op it, I am not sure of any situations where this may be an issue currently though.


Full diff: https://github.com/llvm/llvm-project/pull/119310.diff

2 Files Affected:

  • (modified) mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp (+29-12)
  • (added) offload/test/offloading/fortran/target-with-threadprivate.f90 (+37)
diff --git a/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp b/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
index 49fe509800491a..52c385545361b6 100644
--- a/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
+++ b/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
@@ -2537,6 +2537,10 @@ convertOmpThreadprivate(Operation &opInst, llvm::IRBuilderBase &builder,
 
   Value symAddr = threadprivateOp.getSymAddr();
   auto *symOp = symAddr.getDefiningOp();
+
+  if (auto asCast = dyn_cast<LLVM::AddrSpaceCastOp>(symOp))
+    symOp = asCast.getOperand().getDefiningOp();
+
   if (!isa<LLVM::AddressOfOp>(symOp))
     return opInst.emitError("Addressing symbol not found");
   LLVM::AddressOfOp addressOfOp = dyn_cast<LLVM::AddressOfOp>(symOp);
@@ -2544,17 +2548,23 @@ convertOmpThreadprivate(Operation &opInst, llvm::IRBuilderBase &builder,
   LLVM::GlobalOp global =
       addressOfOp.getGlobal(moduleTranslation.symbolTable());
   llvm::GlobalValue *globalValue = moduleTranslation.lookupGlobal(global);
-  llvm::Type *type = globalValue->getValueType();
-  llvm::TypeSize typeSize =
-      builder.GetInsertBlock()->getModule()->getDataLayout().getTypeStoreSize(
-          type);
-  llvm::ConstantInt *size = builder.getInt64(typeSize.getFixedValue());
-  llvm::StringRef suffix = llvm::StringRef(".cache", 6);
-  std::string cacheName = (Twine(global.getSymName()).concat(suffix)).str();
-  llvm::Value *callInst =
-      moduleTranslation.getOpenMPBuilder()->createCachedThreadPrivate(
-          ompLoc, globalValue, size, cacheName);
-  moduleTranslation.mapValue(opInst.getResult(0), callInst);
+
+  if (!moduleTranslation.getOpenMPBuilder()->Config.isTargetDevice()) {
+    llvm::Type *type = globalValue->getValueType();
+    llvm::TypeSize typeSize =
+        builder.GetInsertBlock()->getModule()->getDataLayout().getTypeStoreSize(
+            type);
+    llvm::ConstantInt *size = builder.getInt64(typeSize.getFixedValue());
+    llvm::StringRef suffix = llvm::StringRef(".cache", 6);
+    std::string cacheName = (Twine(global.getSymName()).concat(suffix)).str();
+    llvm::Value *callInst =
+        moduleTranslation.getOpenMPBuilder()->createCachedThreadPrivate(
+            ompLoc, globalValue, size, cacheName);
+    moduleTranslation.mapValue(opInst.getResult(0), callInst);
+  } else {
+    moduleTranslation.mapValue(opInst.getResult(0), globalValue);
+  }
+
   return success();
 }
 
@@ -4090,6 +4100,14 @@ static bool isTargetDeviceOp(Operation *op) {
   if (op->getParentOfType<omp::TargetOp>())
     return true;
 
+  // Certain operations return results, and wether utilised in host or
+  // target there is a chance an LLVM Dialect operation depends on it
+  // by taking it in as an operand, so we must always lower these in
+  // some manner or result in an ICE (whether they end up in a no-op
+  // or otherwise).
+  if (mlir::isa<omp::ThreadprivateOp>(op))
+    return true;
+
   if (auto parentFn = op->getParentOfType<LLVM::LLVMFuncOp>())
     if (auto declareTargetIface =
             llvm::dyn_cast<mlir::omp::DeclareTargetInterface>(
@@ -4107,7 +4125,6 @@ static bool isTargetDeviceOp(Operation *op) {
 static LogicalResult
 convertHostOrTargetOperation(Operation *op, llvm::IRBuilderBase &builder,
                              LLVM::ModuleTranslation &moduleTranslation) {
-
   llvm::OpenMPIRBuilder *ompBuilder = moduleTranslation.getOpenMPBuilder();
 
   return llvm::TypeSwitch<Operation *, LogicalResult>(op)
diff --git a/offload/test/offloading/fortran/target-with-threadprivate.f90 b/offload/test/offloading/fortran/target-with-threadprivate.f90
new file mode 100644
index 00000000000000..10c7cecf084123
--- /dev/null
+++ b/offload/test/offloading/fortran/target-with-threadprivate.f90
@@ -0,0 +1,37 @@
+! Basic offloading test that makes sure we can use the predominantly host
+! pragma threadprivate in the same program as target code
+! REQUIRES: flang, amdgpu
+
+! RUN: %libomptarget-compile-fortran-run-and-check-generic
+program main
+    implicit none
+
+    type dtype
+       integer :: val(10)
+    end type dtype
+
+    integer :: i
+    type(dtype), pointer :: pointer1
+    type(dtype), pointer :: pointer2=>null()
+    integer, dimension(:), pointer :: data_pointer
+
+!$omp threadprivate(pointer2)
+
+nullify(pointer1)
+allocate(pointer1)
+
+pointer2=>pointer1
+pointer2%val(:)=1
+data_pointer=>pointer2%val
+
+!$omp target
+   do i = 1, 10
+     data_pointer(i) = i
+   end do
+!$omp end target
+
+print *, data_pointer
+
+end program main
+
+! CHECK: 1 2 3 4 5 6 7 8 9 10

@agozillon agozillon force-pushed the threadprivate-device-lower-fix-upstream branch from 145696a to 0fd92f4 Compare December 10, 2024 02:36
Copy link
Member

@skatrak skatrak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you Andrew, this LGTM. I have some minimal non-blocking comments.

Can you write an MLIR to LLVM IR unit test with a minimal reproducer for this issue?

Comment on lines 2558 to 2562
llvm::StringRef suffix = llvm::StringRef(".cache", 6);
std::string cacheName = (Twine(global.getSymName()).concat(suffix)).str();
llvm::Value *callInst =
moduleTranslation.getOpenMPBuilder()->createCachedThreadPrivate(
ompLoc, globalValue, size, cacheName);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Isn't this refactoring possible? I know this patch isn't changing this, but the implementation looks unnecessarily convoluted.

Suggested change
llvm::StringRef suffix = llvm::StringRef(".cache", 6);
std::string cacheName = (Twine(global.getSymName()).concat(suffix)).str();
llvm::Value *callInst =
moduleTranslation.getOpenMPBuilder()->createCachedThreadPrivate(
ompLoc, globalValue, size, cacheName);
llvm::Value *callInst =
ompBuilder->createCachedThreadPrivate(
ompLoc, globalValue, size, global.getSymName() + ".cache");

ompLoc, globalValue, size, cacheName);
moduleTranslation.mapValue(opInst.getResult(0), callInst);

if (!moduleTranslation.getOpenMPBuilder()->Config.isTargetDevice()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Add llvm::OpenMPIRBuilder *ompBuilder = moduleTranslation.getOpenMPBuilder(); to the start of the function, since there are now two uses of it.

Suggested change
if (!moduleTranslation.getOpenMPBuilder()->Config.isTargetDevice()) {
if (!ompBuilder->Config.isTargetDevice()) {

if (op->getParentOfType<omp::TargetOp>())
return true;

// Certain operations return results, and wether utilised in host or
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Certain operations return results, and wether utilised in host or
// Certain operations return results, and whether utilised in host or

Copy link
Contributor

@mjklemm mjklemm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, please fix the typos.

…hen target operations are in use

Currently the compiler will ICE in programs like the following on the device lowering pass:

program main
    implicit none

    type i1_t
       integer :: val(1000)
    end type i1_t
    integer :: i
    type(i1_t), pointer :: newi1
    type(i1_t), pointer :: tab=>null()

    integer, dimension(:), pointer :: tabval

!$omp THREADPRIVATE(tab)

allocate(newi1)

tab=>newi1
tab%val(:)=1
tabval=>tab%val

!$omp target teams distribute parallel do
  do i = 1, 1000
   tabval(i) = i
 end do
!$omp end target teams distribute parallel do

end program main

This is due to the fact that THREADPRIVATE returns a result operation, and this operation can actually be used by other LLVM dialect (or other dialect) operations. However, we currently skip the lowering of threadprivate, so we effectively never generate and bind an LLVM-IR result to the threadprivate operation result. So when we later go on to lower dependent LLVM dialect operations, we are missing the required LLVM-IR result, try to access and use it and then ICE. The fix in this particular PR is to allow compilation of threadprivate for device as well as host, and simply treat the device compilation as
a no-op, binding the LLVM-IR result of threadprivate with no alterations and binding it, which will allow the rest of the compilation to proceed, where we'll eventually discard the host segment in any case.

The other possible solution to this I can think of, is doing something similar to Flang's passes that occur prior to CodeGen to the LLVM dialect, where they erase/no-op certain unrequired operations or transform them to lower level series of operations. And we would erase/no-op threadprivate on device as we'd never have these in target regions.

The main issues I can see with this are that we currently do not specialise this stage based on wether we're compiling for device or host, so it's setting a precedent and adding another point of having to understand the separation between target and host compilation.  I am also not sure we'd necessarily want to enforce this at a dialect level incase someone else wishes to add a different lowering flow or translation flow. Another possible issue is that a target operation we have/utilise would depend on the result of threadprivate, meaning we'd not be allowed to entirely erase/no-op it, I am not sure of any situations where this may be an issue currently though.
@agozillon agozillon force-pushed the threadprivate-device-lower-fix-upstream branch from 0fd92f4 to cc6a0f3 Compare January 3, 2025 16:59
@agozillon
Copy link
Contributor Author

Thank you @mjklemm and @skatrak again for the quick reviews! Added a new test and made the requested changes, going to land this now!

@agozillon agozillon merged commit fa56e8b into llvm:main Jan 3, 2025
5 of 8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants