-
Notifications
You must be signed in to change notification settings - Fork 14.9k
[OpenMPOpt] Make parallel regions reachable from new DeviceRTL loop functions #150927
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: users/skatrak/flang-generic-05-parallel-wrapper
Are you sure you want to change the base?
[OpenMPOpt] Make parallel regions reachable from new DeviceRTL loop functions #150927
Conversation
@llvm/pr-subscribers-offload Author: Sergio Afonso (skatrak) ChangesThis patch updates the OpenMP optimization pass to know about the new DeviceRTL functions for loop constructs. This change marks these functions as potentially containing parallel regions, which fixes a current bug with the state machine rewrite optimization. It previously failed to identify parallel regions located inside of the callbacks passed to these new DeviceRTL functions, causing the resulting code to skip executing these parallel regions. As a result, Generic kernels produced by Flang that contain parallel regions now work properly. One known related issue not fixed by this patch is that the presence of calls to these functions will prevent the SPMD-ization of Generic kernels by OpenMPOpt. Previously, this was due to assuming there was no parallel region. This is changed by this patch, but instead we now mark it temporarily as unsupported in an SPMD context. The reason is that, without additional changes, code intended for the main thread of the team located outside of the parallel region would not be guarded properly, resulting in race conditions and generally invalid behavior. Full diff: https://github.com/llvm/llvm-project/pull/150927.diff 3 Files Affected:
diff --git a/llvm/lib/Transforms/IPO/OpenMPOpt.cpp b/llvm/lib/Transforms/IPO/OpenMPOpt.cpp
index 5e2247f2a88d0..d58da7b1db0e3 100644
--- a/llvm/lib/Transforms/IPO/OpenMPOpt.cpp
+++ b/llvm/lib/Transforms/IPO/OpenMPOpt.cpp
@@ -5020,6 +5020,28 @@ struct AAKernelInfoCallSite : AAKernelInfo {
case OMPRTL___kmpc_free_shared:
// Return without setting a fixpoint, to be resolved in updateImpl.
return;
+ case OMPRTL___kmpc_distribute_static_loop_4:
+ case OMPRTL___kmpc_distribute_static_loop_4u:
+ case OMPRTL___kmpc_distribute_static_loop_8:
+ case OMPRTL___kmpc_distribute_static_loop_8u:
+ case OMPRTL___kmpc_distribute_for_static_loop_4:
+ case OMPRTL___kmpc_distribute_for_static_loop_4u:
+ case OMPRTL___kmpc_distribute_for_static_loop_8:
+ case OMPRTL___kmpc_distribute_for_static_loop_8u:
+ case OMPRTL___kmpc_for_static_loop_4:
+ case OMPRTL___kmpc_for_static_loop_4u:
+ case OMPRTL___kmpc_for_static_loop_8:
+ case OMPRTL___kmpc_for_static_loop_8u:
+ // Parallel regions might be reached by these calls, as they take a
+ // callback argument potentially arbitrary user-provided code.
+ ReachedUnknownParallelRegions.insert(&CB);
+ // TODO: The presence of these calls on their own does not prevent a
+ // kernel from being SPMD-izable. We mark it as such because we need
+ // further changes in order to also consider the contents of the
+ // callbacks passed to them.
+ SPMDCompatibilityTracker.indicatePessimisticFixpoint();
+ SPMDCompatibilityTracker.insert(&CB);
+ break;
default:
// Unknown OpenMP runtime calls cannot be executed in SPMD-mode,
// generally. However, they do not hide parallel regions.
diff --git a/offload/test/offloading/fortran/target-generic-loops.f90 b/offload/test/offloading/fortran/target-generic-loops.f90
new file mode 100644
index 0000000000000..07bcbfd2c8752
--- /dev/null
+++ b/offload/test/offloading/fortran/target-generic-loops.f90
@@ -0,0 +1,130 @@
+! Offloading test for generic target regions containing different kinds of
+! loop constructs inside.
+! REQUIRES: flang, amdgpu
+
+! RUN: %libomptarget-compile-fortran-run-and-check-generic
+program main
+ integer :: i1, i2, n1, n2, counter
+
+ n1 = 100
+ n2 = 50
+
+ counter = 0
+ !$omp target map(tofrom:counter)
+ !$omp teams distribute reduction(+:counter)
+ do i1=1, n1
+ counter = counter + 1
+ end do
+ !$omp end target
+
+ ! CHECK: 1 100
+ print '(I2" "I0)', 1, counter
+
+ counter = 0
+ !$omp target map(tofrom:counter)
+ !$omp parallel do reduction(+:counter)
+ do i1=1, n1
+ counter = counter + 1
+ end do
+ !$omp parallel do reduction(+:counter)
+ do i1=1, n1
+ counter = counter + 1
+ end do
+ !$omp end target
+
+ ! CHECK: 2 200
+ print '(I2" "I0)', 2, counter
+
+ counter = 0
+ !$omp target map(tofrom:counter)
+ counter = counter + 1
+ !$omp parallel do reduction(+:counter)
+ do i1=1, n1
+ counter = counter + 1
+ end do
+ counter = counter + 1
+ !$omp parallel do reduction(+:counter)
+ do i1=1, n1
+ counter = counter + 1
+ end do
+ counter = counter + 1
+ !$omp end target
+
+ ! CHECK: 3 203
+ print '(I2" "I0)', 3, counter
+
+ counter = 0
+ !$omp target map(tofrom: counter)
+ counter = counter + 1
+ !$omp parallel do reduction(+:counter)
+ do i1=1, n1
+ counter = counter + 1
+ end do
+ counter = counter + 1
+ !$omp end target
+
+ ! CHECK: 4 102
+ print '(I2" "I0)', 4, counter
+
+
+ counter = 0
+ !$omp target teams distribute reduction(+:counter)
+ do i1=1, n1
+ !$omp parallel do reduction(+:counter)
+ do i2=1, n2
+ counter = counter + 1
+ end do
+ end do
+
+ ! CHECK: 5 5000
+ print '(I2" "I0)', 5, counter
+
+ counter = 0
+ !$omp target teams distribute reduction(+:counter)
+ do i1=1, n1
+ counter = counter + 1
+ !$omp parallel do reduction(+:counter)
+ do i2=1, n2
+ counter = counter + 1
+ end do
+ counter = counter + 1
+ end do
+
+ ! CHECK: 6 5200
+ print '(I2" "I0)', 6, counter
+
+ counter = 0
+ !$omp target teams distribute reduction(+:counter)
+ do i1=1, n1
+ !$omp parallel do reduction(+:counter)
+ do i2=1, n2
+ counter = counter + 1
+ end do
+ !$omp parallel do reduction(+:counter)
+ do i2=1, n2
+ counter = counter + 1
+ end do
+ end do
+
+ ! CHECK: 7 10000
+ print '(I2" "I0)', 7, counter
+
+ counter = 0
+ !$omp target teams distribute reduction(+:counter)
+ do i1=1, n1
+ counter = counter + 1
+ !$omp parallel do reduction(+:counter)
+ do i2=1, n2
+ counter = counter + 1
+ end do
+ counter = counter + 1
+ !$omp parallel do reduction(+:counter)
+ do i2=1, n2
+ counter = counter + 1
+ end do
+ counter = counter + 1
+ end do
+
+ ! CHECK: 8 10300
+ print '(I2" "I0)', 8, counter
+end program
diff --git a/offload/test/offloading/fortran/target-spmd-loops.f90 b/offload/test/offloading/fortran/target-spmd-loops.f90
new file mode 100644
index 0000000000000..7407f0c0768cb
--- /dev/null
+++ b/offload/test/offloading/fortran/target-spmd-loops.f90
@@ -0,0 +1,39 @@
+! Offloading test for generic target regions containing different kinds of
+! loop constructs inside.
+! REQUIRES: flang, amdgpu
+
+! RUN: %libomptarget-compile-fortran-run-and-check-generic
+program main
+ integer :: i1, n1, counter
+
+ n1 = 100
+
+ counter = 0
+ !$omp target parallel do reduction(+:counter)
+ do i1=1, n1
+ counter = counter + 1
+ end do
+
+ ! CHECK: 1 100
+ print '(I2" "I0)', 1, counter
+
+ counter = 0
+ !$omp target map(tofrom:counter)
+ !$omp parallel do reduction(+:counter)
+ do i1=1, n1
+ counter = counter + 1
+ end do
+ !$omp end target
+
+ ! CHECK: 2 100
+ print '(I2" "I0)', 2, counter
+
+ counter = 0
+ !$omp target teams distribute parallel do reduction(+:counter)
+ do i1=1, n1
+ counter = counter + 1
+ end do
+
+ ! CHECK: 3 100
+ print '(I2" "I0)', 3, counter
+end program
|
@llvm/pr-subscribers-llvm-transforms Author: Sergio Afonso (skatrak) ChangesThis patch updates the OpenMP optimization pass to know about the new DeviceRTL functions for loop constructs. This change marks these functions as potentially containing parallel regions, which fixes a current bug with the state machine rewrite optimization. It previously failed to identify parallel regions located inside of the callbacks passed to these new DeviceRTL functions, causing the resulting code to skip executing these parallel regions. As a result, Generic kernels produced by Flang that contain parallel regions now work properly. One known related issue not fixed by this patch is that the presence of calls to these functions will prevent the SPMD-ization of Generic kernels by OpenMPOpt. Previously, this was due to assuming there was no parallel region. This is changed by this patch, but instead we now mark it temporarily as unsupported in an SPMD context. The reason is that, without additional changes, code intended for the main thread of the team located outside of the parallel region would not be guarded properly, resulting in race conditions and generally invalid behavior. Full diff: https://github.com/llvm/llvm-project/pull/150927.diff 3 Files Affected:
diff --git a/llvm/lib/Transforms/IPO/OpenMPOpt.cpp b/llvm/lib/Transforms/IPO/OpenMPOpt.cpp
index 5e2247f2a88d0..d58da7b1db0e3 100644
--- a/llvm/lib/Transforms/IPO/OpenMPOpt.cpp
+++ b/llvm/lib/Transforms/IPO/OpenMPOpt.cpp
@@ -5020,6 +5020,28 @@ struct AAKernelInfoCallSite : AAKernelInfo {
case OMPRTL___kmpc_free_shared:
// Return without setting a fixpoint, to be resolved in updateImpl.
return;
+ case OMPRTL___kmpc_distribute_static_loop_4:
+ case OMPRTL___kmpc_distribute_static_loop_4u:
+ case OMPRTL___kmpc_distribute_static_loop_8:
+ case OMPRTL___kmpc_distribute_static_loop_8u:
+ case OMPRTL___kmpc_distribute_for_static_loop_4:
+ case OMPRTL___kmpc_distribute_for_static_loop_4u:
+ case OMPRTL___kmpc_distribute_for_static_loop_8:
+ case OMPRTL___kmpc_distribute_for_static_loop_8u:
+ case OMPRTL___kmpc_for_static_loop_4:
+ case OMPRTL___kmpc_for_static_loop_4u:
+ case OMPRTL___kmpc_for_static_loop_8:
+ case OMPRTL___kmpc_for_static_loop_8u:
+ // Parallel regions might be reached by these calls, as they take a
+ // callback argument potentially arbitrary user-provided code.
+ ReachedUnknownParallelRegions.insert(&CB);
+ // TODO: The presence of these calls on their own does not prevent a
+ // kernel from being SPMD-izable. We mark it as such because we need
+ // further changes in order to also consider the contents of the
+ // callbacks passed to them.
+ SPMDCompatibilityTracker.indicatePessimisticFixpoint();
+ SPMDCompatibilityTracker.insert(&CB);
+ break;
default:
// Unknown OpenMP runtime calls cannot be executed in SPMD-mode,
// generally. However, they do not hide parallel regions.
diff --git a/offload/test/offloading/fortran/target-generic-loops.f90 b/offload/test/offloading/fortran/target-generic-loops.f90
new file mode 100644
index 0000000000000..07bcbfd2c8752
--- /dev/null
+++ b/offload/test/offloading/fortran/target-generic-loops.f90
@@ -0,0 +1,130 @@
+! Offloading test for generic target regions containing different kinds of
+! loop constructs inside.
+! REQUIRES: flang, amdgpu
+
+! RUN: %libomptarget-compile-fortran-run-and-check-generic
+program main
+ integer :: i1, i2, n1, n2, counter
+
+ n1 = 100
+ n2 = 50
+
+ counter = 0
+ !$omp target map(tofrom:counter)
+ !$omp teams distribute reduction(+:counter)
+ do i1=1, n1
+ counter = counter + 1
+ end do
+ !$omp end target
+
+ ! CHECK: 1 100
+ print '(I2" "I0)', 1, counter
+
+ counter = 0
+ !$omp target map(tofrom:counter)
+ !$omp parallel do reduction(+:counter)
+ do i1=1, n1
+ counter = counter + 1
+ end do
+ !$omp parallel do reduction(+:counter)
+ do i1=1, n1
+ counter = counter + 1
+ end do
+ !$omp end target
+
+ ! CHECK: 2 200
+ print '(I2" "I0)', 2, counter
+
+ counter = 0
+ !$omp target map(tofrom:counter)
+ counter = counter + 1
+ !$omp parallel do reduction(+:counter)
+ do i1=1, n1
+ counter = counter + 1
+ end do
+ counter = counter + 1
+ !$omp parallel do reduction(+:counter)
+ do i1=1, n1
+ counter = counter + 1
+ end do
+ counter = counter + 1
+ !$omp end target
+
+ ! CHECK: 3 203
+ print '(I2" "I0)', 3, counter
+
+ counter = 0
+ !$omp target map(tofrom: counter)
+ counter = counter + 1
+ !$omp parallel do reduction(+:counter)
+ do i1=1, n1
+ counter = counter + 1
+ end do
+ counter = counter + 1
+ !$omp end target
+
+ ! CHECK: 4 102
+ print '(I2" "I0)', 4, counter
+
+
+ counter = 0
+ !$omp target teams distribute reduction(+:counter)
+ do i1=1, n1
+ !$omp parallel do reduction(+:counter)
+ do i2=1, n2
+ counter = counter + 1
+ end do
+ end do
+
+ ! CHECK: 5 5000
+ print '(I2" "I0)', 5, counter
+
+ counter = 0
+ !$omp target teams distribute reduction(+:counter)
+ do i1=1, n1
+ counter = counter + 1
+ !$omp parallel do reduction(+:counter)
+ do i2=1, n2
+ counter = counter + 1
+ end do
+ counter = counter + 1
+ end do
+
+ ! CHECK: 6 5200
+ print '(I2" "I0)', 6, counter
+
+ counter = 0
+ !$omp target teams distribute reduction(+:counter)
+ do i1=1, n1
+ !$omp parallel do reduction(+:counter)
+ do i2=1, n2
+ counter = counter + 1
+ end do
+ !$omp parallel do reduction(+:counter)
+ do i2=1, n2
+ counter = counter + 1
+ end do
+ end do
+
+ ! CHECK: 7 10000
+ print '(I2" "I0)', 7, counter
+
+ counter = 0
+ !$omp target teams distribute reduction(+:counter)
+ do i1=1, n1
+ counter = counter + 1
+ !$omp parallel do reduction(+:counter)
+ do i2=1, n2
+ counter = counter + 1
+ end do
+ counter = counter + 1
+ !$omp parallel do reduction(+:counter)
+ do i2=1, n2
+ counter = counter + 1
+ end do
+ counter = counter + 1
+ end do
+
+ ! CHECK: 8 10300
+ print '(I2" "I0)', 8, counter
+end program
diff --git a/offload/test/offloading/fortran/target-spmd-loops.f90 b/offload/test/offloading/fortran/target-spmd-loops.f90
new file mode 100644
index 0000000000000..7407f0c0768cb
--- /dev/null
+++ b/offload/test/offloading/fortran/target-spmd-loops.f90
@@ -0,0 +1,39 @@
+! Offloading test for generic target regions containing different kinds of
+! loop constructs inside.
+! REQUIRES: flang, amdgpu
+
+! RUN: %libomptarget-compile-fortran-run-and-check-generic
+program main
+ integer :: i1, n1, counter
+
+ n1 = 100
+
+ counter = 0
+ !$omp target parallel do reduction(+:counter)
+ do i1=1, n1
+ counter = counter + 1
+ end do
+
+ ! CHECK: 1 100
+ print '(I2" "I0)', 1, counter
+
+ counter = 0
+ !$omp target map(tofrom:counter)
+ !$omp parallel do reduction(+:counter)
+ do i1=1, n1
+ counter = counter + 1
+ end do
+ !$omp end target
+
+ ! CHECK: 2 100
+ print '(I2" "I0)', 2, counter
+
+ counter = 0
+ !$omp target teams distribute parallel do reduction(+:counter)
+ do i1=1, n1
+ counter = counter + 1
+ end do
+
+ ! CHECK: 3 100
+ print '(I2" "I0)', 3, counter
+end program
|
8771c0f
to
b6e9849
Compare
aac44b8
to
b6c5ff8
Compare
b6e9849
to
6a81001
Compare
b6c5ff8
to
a26b2a0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Thanks for all the work put into this. The end2end tests demonstrate it is working as intended.
6a81001
to
4f751f7
Compare
a26b2a0
to
434ea1f
Compare
4f751f7
to
e57c671
Compare
434ea1f
to
2d33426
Compare
…unctions This patch updates the OpenMP optimization pass to know about the new DeviceRTL functions for loop constructs. This change marks these functions as potentially containing parallel regions, which fixes a current bug with the state machine rewrite optimization. It previously failed to identify parallel regions located inside of the callbacks passed to these new DeviceRTL functions, causing the resulting code to skip executing these parallel regions. As a result, Generic kernels produced by Flang that contain parallel regions now work properly. One known related issue not fixed by this patch is that the presence of calls to these functions will prevent the SPMD-ization of Generic kernels by OpenMPOpt. Previously, this was due to assuming there was no parallel region. This is changed by this patch, but instead we now mark it temporarily as unsupported in an SPMD context. The reason is that, without additional changes, code intended for the main thread of the team located outside of the parallel region would not be guarded properly, resulting in race conditions and generally invalid behavior.
e57c671
to
f027b86
Compare
2d33426
to
6ed5190
Compare
Moving to draft because I've noticed this doesn't currently work whenever there are different calls to new DeviceRTL loop functions. The state machine rewrite optimization of OpenMPOpt causes the following code to not run properly, whereas the same code without the ! flang -fopenmp -fopenmp-version=52 --offload-arch=gfx1030 test.f90 && OMP_TARGET_OFFLOAD=MANDATORY ./a.out
subroutine test_subroutine(counter)
implicit none
integer, intent(out) :: counter
integer :: i1, i2, n1, n2
n1 = 100
n2 = 50
counter = 0
!$omp target teams distribute reduction(+:counter)
do i1=1, n1
!$omp parallel do reduction(+:counter)
do i2=1, n2
counter = counter + 1
end do
end do
end subroutine
program main
implicit none
integer :: counter
call test_subroutine(counter)
! Should print: 5000
print '(I0)', counter
end program
subroutine foo(i)
integer, intent(inout) :: i
end subroutine
! The presence of this unreachable function in the compilation unit causes
! the result of `test_subroutine` to be incorrect. Removing the `distribute`
! OpenMP directive avoids the problem.
subroutine unused_problematic()
implicit none
integer :: i
!$omp target teams
!$omp distribute
do i=1, 100
call foo(i)
end do
!$omp end target teams
end subroutine |
This patch updates the OpenMP optimization pass to know about the new DeviceRTL functions for loop constructs.
This change marks these functions as potentially containing parallel regions, which fixes a current bug with the state machine rewrite optimization. It previously failed to identify parallel regions located inside of the callbacks passed to these new DeviceRTL functions, causing the resulting code to skip executing these parallel regions.
As a result, Generic kernels produced by Flang that contain parallel regions now work properly.
One known related issue not fixed by this patch is that the presence of calls to these functions will prevent the SPMD-ization of Generic kernels by OpenMPOpt. Previously, this was due to assuming there was no parallel region. This is changed by this patch, but instead we now mark it temporarily as unsupported in an SPMD context. The reason is that, without additional changes, code intended for the main thread of the team located outside of the parallel region would not be guarded properly, resulting in race conditions and generally invalid behavior.