Conversation

skatrak
Member

@skatrak skatrak commented Jul 28, 2025

This patch updates the OpenMP optimization pass to know about the new DeviceRTL functions for loop constructs.

This change marks these functions as potentially containing parallel regions, which fixes a bug in the state machine rewrite optimization: it previously failed to identify parallel regions located inside the callbacks passed to these new DeviceRTL functions, causing the resulting code to skip executing those parallel regions.

As a result, Generic kernels produced by Flang that contain parallel regions now work properly.

One known related issue not fixed by this patch is that the presence of calls to these functions still prevents the SPMD-ization of Generic kernels by OpenMPOpt. Previously this was because the pass assumed there was no parallel region; with this patch, these calls are instead temporarily marked as unsupported in an SPMD context. The reason is that, without additional changes, code intended for the main thread of the team and located outside of the parallel region would not be guarded properly, resulting in race conditions and generally invalid behavior.
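
As a rough illustration, the kind of Generic kernel this affects is sketched below (a minimal example mirroring one of the added offload tests; the exact DeviceRTL entry points emitted by Flang's lowering are an assumption here, not shown by this patch). The parallel region ends up inside the callback passed to the distribute loop function, which the state machine rewrite previously did not account for:

program sketch
  implicit none
  integer :: i1, i2, counter

  counter = 0
  ! The `distribute` construct is assumed to lower to one of the new
  ! __kmpc_distribute_static_loop_* calls; the nested parallel region lives
  ! inside the callback passed to it.
  !$omp target teams distribute reduction(+:counter)
  do i1 = 1, 100
    !$omp parallel do reduction(+:counter)
    do i2 = 1, 50
      counter = counter + 1
    end do
  end do

  ! With this patch, the nested parallel regions execute and this prints 5000.
  print '(I0)', counter
end program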

@llvmbot
Member

llvmbot commented Jul 28, 2025

@llvm/pr-subscribers-offload

Author: Sergio Afonso (skatrak)

Changes

This patch updates the OpenMP optimization pass to know about the new DeviceRTL functions for loop constructs.

This change marks these functions as potentially containing parallel regions, which fixes a bug in the state machine rewrite optimization: it previously failed to identify parallel regions located inside the callbacks passed to these new DeviceRTL functions, causing the resulting code to skip executing those parallel regions.

As a result, Generic kernels produced by Flang that contain parallel regions now work properly.

One known related issue not fixed by this patch is that the presence of calls to these functions still prevents the SPMD-ization of Generic kernels by OpenMPOpt. Previously this was because the pass assumed there was no parallel region; with this patch, these calls are instead temporarily marked as unsupported in an SPMD context. The reason is that, without additional changes, code intended for the main thread of the team and located outside of the parallel region would not be guarded properly, resulting in race conditions and generally invalid behavior.


Full diff: https://github.com/llvm/llvm-project/pull/150927.diff

3 Files Affected:

  • (modified) llvm/lib/Transforms/IPO/OpenMPOpt.cpp (+22)
  • (added) offload/test/offloading/fortran/target-generic-loops.f90 (+130)
  • (added) offload/test/offloading/fortran/target-spmd-loops.f90 (+39)
diff --git a/llvm/lib/Transforms/IPO/OpenMPOpt.cpp b/llvm/lib/Transforms/IPO/OpenMPOpt.cpp
index 5e2247f2a88d0..d58da7b1db0e3 100644
--- a/llvm/lib/Transforms/IPO/OpenMPOpt.cpp
+++ b/llvm/lib/Transforms/IPO/OpenMPOpt.cpp
@@ -5020,6 +5020,28 @@ struct AAKernelInfoCallSite : AAKernelInfo {
       case OMPRTL___kmpc_free_shared:
         // Return without setting a fixpoint, to be resolved in updateImpl.
         return;
+      case OMPRTL___kmpc_distribute_static_loop_4:
+      case OMPRTL___kmpc_distribute_static_loop_4u:
+      case OMPRTL___kmpc_distribute_static_loop_8:
+      case OMPRTL___kmpc_distribute_static_loop_8u:
+      case OMPRTL___kmpc_distribute_for_static_loop_4:
+      case OMPRTL___kmpc_distribute_for_static_loop_4u:
+      case OMPRTL___kmpc_distribute_for_static_loop_8:
+      case OMPRTL___kmpc_distribute_for_static_loop_8u:
+      case OMPRTL___kmpc_for_static_loop_4:
+      case OMPRTL___kmpc_for_static_loop_4u:
+      case OMPRTL___kmpc_for_static_loop_8:
+      case OMPRTL___kmpc_for_static_loop_8u:
+        // Parallel regions might be reached by these calls, as they take a
+        // callback argument that may contain arbitrary user-provided code.
+        ReachedUnknownParallelRegions.insert(&CB);
+        // TODO: The presence of these calls on their own does not prevent a
+        // kernel from being SPMD-izable. We mark it as such because we need
+        // further changes in order to also consider the contents of the
+        // callbacks passed to them.
+        SPMDCompatibilityTracker.indicatePessimisticFixpoint();
+        SPMDCompatibilityTracker.insert(&CB);
+        break;
       default:
         // Unknown OpenMP runtime calls cannot be executed in SPMD-mode,
         // generally. However, they do not hide parallel regions.
diff --git a/offload/test/offloading/fortran/target-generic-loops.f90 b/offload/test/offloading/fortran/target-generic-loops.f90
new file mode 100644
index 0000000000000..07bcbfd2c8752
--- /dev/null
+++ b/offload/test/offloading/fortran/target-generic-loops.f90
@@ -0,0 +1,130 @@
+! Offloading test for generic target regions containing different kinds of
+! loop constructs.
+! REQUIRES: flang, amdgpu
+
+! RUN: %libomptarget-compile-fortran-run-and-check-generic
+program main
+  integer :: i1, i2, n1, n2, counter
+
+  n1 = 100
+  n2 = 50
+
+  counter = 0
+  !$omp target map(tofrom:counter)
+    !$omp teams distribute reduction(+:counter)
+    do i1=1, n1
+      counter = counter + 1
+    end do
+  !$omp end target
+
+  ! CHECK: 1 100
+  print '(I2" "I0)', 1, counter
+
+  counter = 0
+  !$omp target map(tofrom:counter)
+    !$omp parallel do reduction(+:counter)
+    do i1=1, n1
+      counter = counter + 1
+    end do
+    !$omp parallel do reduction(+:counter)
+    do i1=1, n1
+      counter = counter + 1
+    end do
+  !$omp end target
+
+  ! CHECK: 2 200
+  print '(I2" "I0)', 2, counter
+
+  counter = 0
+  !$omp target map(tofrom:counter)
+    counter = counter + 1
+    !$omp parallel do reduction(+:counter)
+    do i1=1, n1
+      counter = counter + 1
+    end do
+    counter = counter + 1
+    !$omp parallel do reduction(+:counter)
+    do i1=1, n1
+      counter = counter + 1
+    end do
+    counter = counter + 1
+  !$omp end target
+
+  ! CHECK: 3 203
+  print '(I2" "I0)', 3, counter
+
+  counter = 0
+  !$omp target map(tofrom: counter)
+    counter = counter + 1
+    !$omp parallel do reduction(+:counter)
+    do i1=1, n1
+      counter = counter + 1
+    end do
+    counter = counter + 1
+  !$omp end target
+
+  ! CHECK: 4 102
+  print '(I2" "I0)', 4, counter
+
+
+  counter = 0
+  !$omp target teams distribute reduction(+:counter)
+  do i1=1, n1
+    !$omp parallel do reduction(+:counter)
+    do i2=1, n2
+      counter = counter + 1
+    end do
+  end do
+
+  ! CHECK: 5 5000
+  print '(I2" "I0)', 5, counter
+
+  counter = 0
+  !$omp target teams distribute reduction(+:counter)
+  do i1=1, n1
+    counter = counter + 1
+    !$omp parallel do reduction(+:counter)
+    do i2=1, n2
+      counter = counter + 1
+    end do
+    counter = counter + 1
+  end do
+
+  ! CHECK: 6 5200
+  print '(I2" "I0)', 6, counter
+
+  counter = 0
+  !$omp target teams distribute reduction(+:counter)
+  do i1=1, n1
+    !$omp parallel do reduction(+:counter)
+    do i2=1, n2
+      counter = counter + 1
+    end do
+    !$omp parallel do reduction(+:counter)
+    do i2=1, n2
+      counter = counter + 1
+    end do
+  end do
+
+  ! CHECK: 7 10000
+  print '(I2" "I0)', 7, counter
+
+  counter = 0
+  !$omp target teams distribute reduction(+:counter)
+  do i1=1, n1
+    counter = counter + 1
+    !$omp parallel do reduction(+:counter)
+    do i2=1, n2
+      counter = counter + 1
+    end do
+    counter = counter + 1
+    !$omp parallel do reduction(+:counter)
+    do i2=1, n2
+      counter = counter + 1
+    end do
+    counter = counter + 1
+  end do
+
+  ! CHECK: 8 10300
+  print '(I2" "I0)', 8, counter
+end program
diff --git a/offload/test/offloading/fortran/target-spmd-loops.f90 b/offload/test/offloading/fortran/target-spmd-loops.f90
new file mode 100644
index 0000000000000..7407f0c0768cb
--- /dev/null
+++ b/offload/test/offloading/fortran/target-spmd-loops.f90
@@ -0,0 +1,39 @@
+! Offloading test for SPMD-compatible target regions containing different
+! kinds of loop constructs.
+! REQUIRES: flang, amdgpu
+
+! RUN: %libomptarget-compile-fortran-run-and-check-generic
+program main
+  integer :: i1, n1, counter
+
+  n1 = 100
+
+  counter = 0
+  !$omp target parallel do reduction(+:counter)
+  do i1=1, n1
+    counter = counter + 1
+  end do
+
+  ! CHECK: 1 100
+  print '(I2" "I0)', 1, counter
+
+  counter = 0
+  !$omp target map(tofrom:counter)
+    !$omp parallel do reduction(+:counter)
+    do i1=1, n1
+      counter = counter + 1
+    end do
+  !$omp end target
+
+  ! CHECK: 2 100
+  print '(I2" "I0)', 2, counter
+
+  counter = 0
+  !$omp target teams distribute parallel do reduction(+:counter)
+  do i1=1, n1
+    counter = counter + 1
+  end do
+
+  ! CHECK: 3 100
+  print '(I2" "I0)', 3, counter
+end program

@llvmbot
Member

llvmbot commented Jul 28, 2025

@llvm/pr-subscribers-llvm-transforms

(Same description and diff as in the previous comment.)
@skatrak skatrak force-pushed the users/skatrak/flang-generic-05-parallel-wrapper branch from 8771c0f to b6e9849 on July 28, 2025 11:23
@skatrak skatrak force-pushed the users/skatrak/flang-generic-06-openmp-opt branch from aac44b8 to b6c5ff8 on July 28, 2025 11:23
@skatrak skatrak force-pushed the users/skatrak/flang-generic-05-parallel-wrapper branch from b6e9849 to 6a81001 on August 6, 2025 12:54
@skatrak skatrak force-pushed the users/skatrak/flang-generic-06-openmp-opt branch from b6c5ff8 to a26b2a0 on August 6, 2025 12:54
Member

@Meinersbur Meinersbur left a comment


LGTM

Thanks for all the work put into this. The end2end tests demonstrate it is working as intended.

@skatrak skatrak force-pushed the users/skatrak/flang-generic-05-parallel-wrapper branch from 6a81001 to 4f751f7 on August 14, 2025 12:50
@skatrak skatrak force-pushed the users/skatrak/flang-generic-06-openmp-opt branch from a26b2a0 to 434ea1f on August 14, 2025 13:01
@skatrak skatrak force-pushed the users/skatrak/flang-generic-05-parallel-wrapper branch from 4f751f7 to e57c671 on August 15, 2025 15:00
@skatrak skatrak force-pushed the users/skatrak/flang-generic-06-openmp-opt branch from 434ea1f to 2d33426 on August 15, 2025 15:00
…unctions

This patch updates the OpenMP optimization pass to know about the new DeviceRTL
functions for loop constructs.

This change marks these functions as potentially containing parallel regions,
which fixes a current bug with the state machine rewrite optimization. It
previously failed to identify parallel regions located inside of the callbacks
passed to these new DeviceRTL functions, causing the resulting code to skip
executing these parallel regions.

As a result, Generic kernels produced by Flang that contain parallel regions
now work properly.

One known related issue not fixed by this patch is that the presence of calls
to these functions will prevent the SPMD-ization of Generic kernels by
OpenMPOpt. Previously, this was due to assuming there was no parallel region.
This is changed by this patch, but instead we now mark it temporarily as
unsupported in an SPMD context. The reason is that, without additional changes,
code intended for the main thread of the team located outside of the parallel
region would not be guarded properly, resulting in race conditions and
generally invalid behavior.
@skatrak skatrak force-pushed the users/skatrak/flang-generic-05-parallel-wrapper branch from e57c671 to f027b86 on October 3, 2025 14:50
@skatrak skatrak force-pushed the users/skatrak/flang-generic-06-openmp-opt branch from 2d33426 to 6ed5190 on October 3, 2025 14:51
@skatrak skatrak marked this pull request as draft on October 3, 2025 14:54
@skatrak
Member Author

skatrak commented Oct 3, 2025

Moving to draft because I've noticed this doesn't currently work whenever there are calls to different new DeviceRTL loop functions in the same compilation unit. The state machine rewrite optimization of OpenMPOpt causes the following code to run incorrectly, whereas the same code without the unused_problematic subroutine (or compiled with -mllvm -openmp-opt-disable-state-machine-rewrite) works:

! flang -fopenmp -fopenmp-version=52 --offload-arch=gfx1030 test.f90 && OMP_TARGET_OFFLOAD=MANDATORY ./a.out
subroutine test_subroutine(counter)
  implicit none
  integer, intent(out) :: counter
  integer :: i1, i2, n1, n2

  n1 = 100
  n2 = 50

  counter = 0
  !$omp target teams distribute reduction(+:counter)
  do i1=1, n1
    !$omp parallel do reduction(+:counter)
    do i2=1, n2
      counter = counter + 1
    end do
  end do
end subroutine

program main
  implicit none
  integer :: counter
  call test_subroutine(counter)

  ! Should print: 5000
  print '(I0)', counter
end program

subroutine foo(i)
  integer, intent(inout) :: i
end subroutine

! The presence of this unreachable function in the compilation unit causes
! the result of `test_subroutine` to be incorrect. Removing the `distribute`
! OpenMP directive avoids the problem.
subroutine unused_problematic()
  implicit none
  integer :: i

  !$omp target teams
  !$omp distribute
  do i=1, 100
    call foo(i)
  end do
  !$omp end target teams
end subroutine
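
For reference, combining the command at the top of the reproducer with the flag mentioned above, the workaround invocation would presumably be (not verified here):

flang -fopenmp -fopenmp-version=52 --offload-arch=gfx1030 -mllvm -openmp-opt-disable-state-machine-rewrite test.f90 && OMP_TARGET_OFFLOAD=MANDATORY ./a.out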

Labels

clang:openmp (OpenMP related changes to Clang), llvm:transforms, offload
