-
Notifications
You must be signed in to change notification settings - Fork 14.9k
[openmp] Segfaults/assertion errors on certain omp statements after calling omp_pause_resource_all(omp_pause_hard)
#154204
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…shm file because we could have unregistered the library right before fork via hard reset, and child can simply exit immediately.
Thank you for submitting a Pull Request (PR) to the LLVM Project! This PR will be automatically labeled and the relevant teams will be notified. If you wish to, you can add reviewers by using the "Reviewers" section on this page. If this is not working for you, it is probably because you do not have write permissions for the repository. In which case you can instead tag reviewers by name in a comment by using If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers. If you have further questions, they may be answered by the LLVM GitHub User Guide. You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums. |
Hi @jprotze , would you please be able to review this PR? If not can you please tell me who should review this? |
You can test this locally with the following command:git-clang-format --diff origin/main HEAD --extensions h,c,cpp -- openmp/runtime/src/kmp.h openmp/runtime/src/kmp_csupport.cpp openmp/runtime/src/kmp_global.cpp openmp/runtime/src/kmp_lock.cpp openmp/runtime/src/kmp_lock.h openmp/runtime/src/kmp_runtime.cpp openmp/runtime/src/ompt-internal.h openmp/runtime/src/z_Linux_util.cpp openmp/runtime/test/api/omp_pause_resource.c
View the diff from clang-format here.diff --git a/openmp/runtime/test/api/omp_pause_resource.c b/openmp/runtime/test/api/omp_pause_resource.c
index fce83824d..6154377b9 100644
--- a/openmp/runtime/test/api/omp_pause_resource.c
+++ b/openmp/runtime/test/api/omp_pause_resource.c
@@ -16,9 +16,13 @@ void doOmpWorkWithCritical(int *a_lockCtr, int *b_lockCtr) {
#pragma omp parallel num_threads(NUM_THREADS)
{
#pragma omp critical(a_lock)
- { *a_lockCtr = *a_lockCtr + 1; }
+ {
+ *a_lockCtr = *a_lockCtr + 1;
+ }
#pragma omp critical(b_lock)
- { *b_lockCtr = *b_lockCtr + 1; }
+ {
+ *b_lockCtr = *b_lockCtr + 1;
+ }
}
}
@@ -47,7 +51,8 @@ void test_omp_get_thread_num_after_omp_hard_pause_resource_all() {
// use omp to do some work, guarantees omp initialization
#pragma omp parallel num_threads(NUM_THREADS)
- {}
+ {
+ }
// omp hard pause should succeed
int rc = omp_pause_resource_all(omp_pause_hard);
@@ -62,7 +67,8 @@ void test_omp_get_thread_num_after_omp_hard_pause_resource_all() {
void test_omp_parallel_num_threads_after_omp_hard_pause_resource_all() {
// use omp to do some work
#pragma omp parallel num_threads(NUM_THREADS)
- {}
+ {
+ }
// omp hard pause should succeed
int rc = omp_pause_resource_all(omp_pause_hard);
@@ -70,7 +76,8 @@ void test_omp_parallel_num_threads_after_omp_hard_pause_resource_all() {
// this should not trigger any omp asserts
#pragma omp parallel num_threads(NUM_THREADS)
- {}
+ {
+ }
}
void test_KMP_INIT_AT_FORK_with_fork_after_omp_hard_pause_resource_all() {
@@ -106,7 +113,8 @@ void test_KMP_INIT_AT_FORK_with_fork_after_omp_hard_pause_resource_all() {
void test_fork_child_exiting_after_omp_hard_pause_resource_all() {
// use omp to do some work
#pragma omp parallel num_threads(NUM_THREADS)
- {}
+ {
+ }
// omp hard pause should succeed
int rc = omp_pause_resource_all(omp_pause_hard);
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A first round of comments
openmp/runtime/src/kmp_lock.cpp
Outdated
__kmp_free(ll->lock); | ||
ll->lock = NULL; | ||
// reset the reverse critical section pointer to 0 | ||
if (ll->rev_ptr_critSec && LIKELY(!__kmp_in_atexit)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rev_ptr_critSec
is not initialized. Set it to nullptr in __kmp_allocate_indirect_lock
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In __kmp_allocate_indirect_lock()
the new lock tables are allocated with multiples of sizeof(kmp_indirect_lock_t)
, and I believe __kmp_allocate()
will memset the allocated memory to 0. So I believe rev_ptr_critSec
is technically initialized to 0 in this case?
I can also explicitly initialize it to nullptr if that makes the code easier to reason about.
void __kmp_register_atfork(void) { | ||
if (__kmp_need_register_atfork) { | ||
// NOTE: we will not double register our fork handlers! It will cause deadlock | ||
if (!__kmp_already_registered_atfork && __kmp_need_register_atfork) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this change related to the pause_resource_all issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically yes, because if you see the test function test_KMP_INIT_AT_FORK_with_fork_after_omp_hard_pause_resource_all()
I added in the unit tests, the original behaviour will actually register the atfork handlers twice (if the environment variable KMP_INIT_AT_FORK
is explicitly set to 1), which causes a deadlock (since the atfork handlers are run twice, and inside the handler it will do locking).
There might be a cleaner way to prevent double atfork registration than adding another flag (I thought I can re-use the variable __kmp_need_register_atfork_specified
, but I didn't since it looks like its only used for debug printing)
openmp/runtime/src/kmp_runtime.cpp
Outdated
#ifdef KMP_TDATA_GTID | ||
/*reset __kmp_gtid to initial value*/ | ||
__kmp_gtid = KMP_GTID_DNE; | ||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thread 0 should not drop it's gtid
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do I know we are thread 0? Do I just check if __kmp_gtid == 0
(as __kmp_gtid
is declared as thread local storage) and if thats the case, I don't reset it?
I chose to reset __kmp_gtid
to KMP_GTID_DNE
because it is initialized to KMP_GTID_DNE
in kmp_global.cpp
, so I thought resetting to its default value was ok.
Can you please point me how this can cause an issue in the code, so that I can understand the code base better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When you look at code generated for an OpenMP program (IR or disassembled objdump), you will see that each function that contains __kmpc*
calls initially gets the gtid using __kmpc_global_thread_num
, which triggers serial initialization. This gtid is then passed to many __kmpc
calls. You already ran into the issue that __kmpc_push_num_threads
assumes that serial initialization has happened. I think, that the runtime must be in serial initialized state after omp_pause_resource[_all] to not break the assumptions of compiler+runtime implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right I did realize clang inserts __kmpc_global_thread_num()
at the beginning of functions.
But how does not resetting __kmp_gtid
to KMP_GTID_DNE
for thread 0 imply that the runtime in serial initialized state after omp_pause_resource calls? It sounds like we want to call __kmpc_global_thread_num()
(or something equivalent) right after omp_pause_resource to force serial initialized state?
To automatically address the code formatting issues, please run |
…!UNLIKELY(__kmp_in_atexit))` as suggested since they are equivalent
in `__kmpc_push_num_threads()`, as suggested in feedback
Thank you Joachim for reviewing the PR! I've replied to some of your comments and implemented some of the feedback, please re-review the comments and changes. Also I realized I'm probably not following the standard naming convention (for example, the new |
…hard reset" This reverts commit 773ab2f.
…__kmp_hard_pause()` This should ensure that the program is in serially initialized state after doing a hard pause, hence not breaking any compiler/runtime invariants/assumptions. Also fixed formatting issues
Hi @jprotze , I attempted to address the issue where after I can simply call So I want the serial initialization right after a hard pause to have the following properties (in the case that we want to fork immediately after a hard pause):
And it looks like we should be already doing this inside the I tried to understand what Here is the reason why I think the serial initialization satisfies the 2 properties I've stated above: I went in gdb and break pointed all of the pthread functions to see what will be called during I see only these pthread functions are used in
So I believe property 1 holds. Next, I break pointed the So I believe property 2 also holds. NOTE: this only verifies that the properties hold for a certain compilation, as other compiler flags might introduce different behaviour. Please correct me if my understanding is wrong, as I'd love to know and do this properly. |
My attempt to fix #154201.
I've run the omp tests using the
check-openmp
target, and the new tests pass without segfaults/assertion errors/deadlocks.However, I'm not certain that my changes are complete (i.e. I didn't miss any edge cases), as I occasionally get 1 or 2 segfaults when I run my changes with my company's regression test that heavily uses omp (I'm uncertain if its my omp changes or company's code base that caused the segfaults). However, I couldn't reproduce any segfaults with the tests I've added.
Also this is my first time contributing to open source, so any feedback is appreciated! :)