-
Notifications
You must be signed in to change notification settings - Fork 8.2k
posix: pthread: correct race condition between pthread_create and pthread_join #57637
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
b196df9 to
fdfcc7b
Compare
8cd587d to
219c4a3
Compare
41a88c3 to
23aab13
Compare
1abfd8f to
b6e6752
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see anything obviously incorrect.
b6e6752 to
5d61510
Compare
|
The state transition diagram is super helpful, thanks for figuring out how to get that into this thread (and, we hope, into the Zephyr docs too!). However, when looking at it, I'm confused as to why you think you need a new state here -- Maybe the actions that cause the state transitions need some additional annotation so we can understand the difference between the arcs with the same operation coming from the same node? For instance, TERMINATED to DETACHED/JOINABLE might be labeled as create(DETACHED)/create(JOINABLE)? I'd like to know how the two transitions from ABORTING are different. I'm hesitant to encourage a move from a spinlock/mutex to a semaphore as semaphores are harder to reason about.
Holding the lock across thread transitions is kinda icky. The "usual" way to manage things like this is to simply re-validate the state in the second thread and work from there, potentially re-trying the operation, potentially performing some other operation. |
Yeah, I think a flag would be appropriate (e.g.
The create parameter was there originally, but I just thought it was obvious. Will update. Same with transitions from ABORTING - a thread only goes to EXITED if the previous state was JOINABLE.
Hopefully I can keep it simple.
I'm not convinced that is sufficient to avoid race conditions and so far, my experiments show that hypothesis to be true. |
Yeah, I suspect we'll have to see the implementation to get to ground truth here. Diagrams are all nice and everything, but they don't actually run on computers... |
This reverts commit da0398d. Signed-off-by: Christopher Friedt <[email protected]>
Previously, we considered `PTHREAD_EXITED` as a valid state for reclaiming a thread. That was not the case. `PTHREAD_EXITED` requires that `pthread_join()` is called while the thread is in `PTHREAD_EXITED` state. With that the calling thread may retrieve the return value of the joining thread. Additionally, correct incorrectly aliased pointer argument to `k_thread_abort()`. Signed-off-by: Christopher Friedt <[email protected]>
POSIX has a somewhat odd behavior that when you call pthread_detach()[1], it is supposed to synchronously wake up anyone currently joining the thread. That isn't a Zephyr kernel behavior[2], so it needs to be exposed. The implementation is as simple as it sounds. This is minimal: we don't try to keep any state around to inform the k_thread_join() call whether the thread actually exited or if this new function was used. That has to be provided by the outer layers (pthreads already do this). [1] Which tells the OS/platform that it can clean the thread up as soon as it exits and doesn't have to keep it around as a zombie until someone calls pthread_join() on it. [2] Because... why!? Not needing to keep a thread around for joiners isn't the same thing as disallowing joins! Signed-off-by: Andy Ross <[email protected]>
The pthread_join() implementation was done using a condition variable and not k_thread_join(), and that's actually fatal. It's not possible in Zephyr to guarantee a thread has exited in a race-free way without calling either k_thread_abort() or k_thread_join(). This opened a hole where an app could call pthread_join(), which would go to sleep waiting on the condition variable, which would be signaled from the exit path of the thread, and then return to call pthread_create() (and thus k_thread_create()) on the same pthread_t object (and thus the same struct k_thread) BEFORE THE EARLIER THREAD HAD ACTUALLY EXITED. This makes the scheduler blow up, at least on SMP (I wasn't personally able to make it happen on uniprocessor configs, though the race is present always). Rework to call k_thread_join(). There's no other correct way to do this. Fixes zephyrproject-rtos#56163 Signed-off-by: Andy Ross <[email protected]>
00d9170 to
1767a08
Compare
|
@keith-packard , @andyross - some good / bad news. I improved my testsuite to also test raciness between The good news, is that pthreads are not alone in that they are racey 😀 The bad news, is that the same race exists today with k_threads 😞 Currently, testing pthreads is disabled by default. I also reduced the test duration from 60s to 10s (I think we actually have a bug in twister in terms of the timeout). Any failures below in the I was really hoping to get this fixed for v3.4.0 (and also v2.7.5) - but I'm a bit exhausted at the moment. Sadly, the "aborted _current back from dead" issue exists with all libcs in various configurations and on multiple architectures. twister -T tests/posix/pthread_pressure/
Renaming output directory to /home/cfriedt/zephyrproject/zephyr/twister-out.3
INFO - Using Ninja..
INFO - Zephyr version: zephyr-v3.3.0-4131-g1767a0844e0a
INFO - Using 'zephyr' toolchain.
INFO - Selecting default platforms per test case
INFO - Building initial testsuite list...
INFO - Writing JSON report /home/cfriedt/zephyrproject/zephyr/twister-out/testplan.json
INFO - JOBS: 32
INFO - Adding tasks to the queue...
INFO - Added initial list of jobs to queue
ERROR - qemu_x86_64 tests/posix/pthread_pressure/portability.posix.pthread_pressure.tls.newlib FAILED : unexpected eof
ERROR - see: /home/cfriedt/zephyrproject/zephyr/twister-out/qemu_x86_64/tests/posix/pthread_pressure/portability.posix.pthread_pressure.tls.newlib/handler.log
ERROR - qemu_x86_64 tests/posix/pthread_pressure/portability.posix.pthread_pressure.picolibc FAILED : unexpected eof
ERROR - see: /home/cfriedt/zephyrproject/zephyr/twister-out/qemu_x86_64/tests/posix/pthread_pressure/portability.posix.pthread_pressure.picolibc/handler.log
ERROR - qemu_cortex_a53_smp tests/posix/pthread_pressure/portability.posix.pthread_pressure.picolibc FAILED : unexpected eof
ERROR - see: /home/cfriedt/zephyrproject/zephyr/twister-out/qemu_cortex_a53_smp/tests/posix/pthread_pressure/portability.posix.pthread_pressure.picolibc/handler.log
ERROR - qemu_cortex_a53_smp tests/posix/pthread_pressure/portability.posix.pthread_pressure.tls.newlib FAILED : unexpected eof
ERROR - see: /home/cfriedt/zephyrproject/zephyr/twister-out/qemu_cortex_a53_smp/tests/posix/pthread_pressure/portability.posix.pthread_pressure.tls.newlib/handler.log
ERROR - qemu_riscv64_smp tests/posix/pthread_pressure/portability.posix.pthread_pressure.tls.newlib FAILED : Timeout
ERROR - see: /home/cfriedt/zephyrproject/zephyr/twister-out/qemu_riscv64_smp/tests/posix/pthread_pressure/portability.posix.pthread_pressure.tls.newlib/handler.log
ERROR - qemu_riscv32_smp tests/posix/pthread_pressure/portability.posix.pthread_pressure.tls.newlib FAILED : Timeout
ERROR - see: /home/cfriedt/zephyrproject/zephyr/twister-out/qemu_riscv32_smp/tests/posix/pthread_pressure/portability.posix.pthread_pressure.tls.newlib/handler.log
ERROR - qemu_cortex_a53_smp tests/posix/pthread_pressure/portability.posix.pthread_pressure.tls FAILED : unexpected eof
ERROR - see: /home/cfriedt/zephyrproject/zephyr/twister-out/qemu_cortex_a53_smp/tests/posix/pthread_pressure/portability.posix.pthread_pressure.tls/handler.log
ERROR - qemu_cortex_a53_smp tests/posix/pthread_pressure/portability.posix.pthread_pressure.newlib FAILED : unexpected eof
ERROR - see: /home/cfriedt/zephyrproject/zephyr/twister-out/qemu_cortex_a53_smp/tests/posix/pthread_pressure/portability.posix.pthread_pressure.newlib/handler.log
ERROR - qemu_x86_64 tests/posix/pthread_pressure/portability.posix.pthread_pressure FAILED : unexpected eof
ERROR - see: /home/cfriedt/zephyrproject/zephyr/twister-out/qemu_x86_64/tests/posix/pthread_pressure/portability.posix.pthread_pressure/handler.log
ERROR - qemu_cortex_a53_smp tests/posix/pthread_pressure/portability.posix.pthread_pressure FAILED : unexpected eof
ERROR - see: /home/cfriedt/zephyrproject/zephyr/twister-out/qemu_cortex_a53_smp/tests/posix/pthread_pressure/portability.posix.pthread_pressure/handler.log
ERROR - qemu_riscv64_smp tests/posix/pthread_pressure/portability.posix.pthread_pressure.picolibc FAILED : Timeout
ERROR - see: /home/cfriedt/zephyrproject/zephyr/twister-out/qemu_riscv64_smp/tests/posix/pthread_pressure/portability.posix.pthread_pressure.picolibc/handler.log
ERROR - qemu_riscv32_smp tests/posix/pthread_pressure/portability.posix.pthread_pressure.tls FAILED : Timeout
ERROR - see: /home/cfriedt/zephyrproject/zephyr/twister-out/qemu_riscv32_smp/tests/posix/pthread_pressure/portability.posix.pthread_pressure.tls/handler.log
ERROR - qemu_riscv64_smp tests/posix/pthread_pressure/portability.posix.pthread_pressure.newlib FAILED : Timeout
ERROR - see: /home/cfriedt/zephyrproject/zephyr/twister-out/qemu_riscv64_smp/tests/posix/pthread_pressure/portability.posix.pthread_pressure.newlib/handler.log
ERROR - qemu_riscv32_smp tests/posix/pthread_pressure/portability.posix.pthread_pressure FAILED : Timeout
ERROR - see: /home/cfriedt/zephyrproject/zephyr/twister-out/qemu_riscv32_smp/tests/posix/pthread_pressure/portability.posix.pthread_pressure/handler.log
ERROR - qemu_riscv64_smp tests/posix/pthread_pressure/portability.posix.pthread_pressure FAILED : Timeout
ERROR - see: /home/cfriedt/zephyrproject/zephyr/twister-out/qemu_riscv64_smp/tests/posix/pthread_pressure/portability.posix.pthread_pressure/handler.log
INFO - Total complete: 275/ 275 100% skipped: 171, failed: 15, error: 0
INFO - 9 test scenarios (275 test instances) selected, 171 configurations skipped (163 by static filter, 8 at runtime).
INFO - 89 of 275 test configurations passed (85.58%), 15 failed, 0 errored, 171 skipped with 0 warnings in 157.44 seconds
INFO - In total 121 test cases were executed, 429 skipped on 40 out of total 571 platforms (7.01%)
INFO - 102 test configurations executed on platforms, 2 test configurations were only built.
INFO - Saving reports...
INFO - Writing JSON report /home/cfriedt/zephyrproject/zephyr/twister-out/twister.json
INFO - Writing xunit report /home/cfriedt/zephyrproject/zephyr/twister-out/twister.xml...
INFO - Writing xunit report /home/cfriedt/zephyrproject/zephyr/twister-out/twister_report.xml...
INFO - -+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
INFO - The following issues were found (showing the top 10 items):
INFO - 1) tests/posix/pthread_pressure/portability.posix.pthread_pressure on qemu_cortex_a53_smp failed (unexpected eof)
INFO - 2) tests/posix/pthread_pressure/portability.posix.pthread_pressure on qemu_riscv64_smp failed (Timeout)
INFO - 3) tests/posix/pthread_pressure/portability.posix.pthread_pressure on qemu_riscv32_smp failed (Timeout)
INFO - 4) tests/posix/pthread_pressure/portability.posix.pthread_pressure on qemu_x86_64 failed (unexpected eof)
INFO - 5) tests/posix/pthread_pressure/portability.posix.pthread_pressure.newlib on qemu_cortex_a53_smp failed (unexpected eof)
INFO - 6) tests/posix/pthread_pressure/portability.posix.pthread_pressure.newlib on qemu_riscv64_smp failed (Timeout)
INFO - 7) tests/posix/pthread_pressure/portability.posix.pthread_pressure.tls on qemu_cortex_a53_smp failed (unexpected eof)
INFO - 8) tests/posix/pthread_pressure/portability.posix.pthread_pressure.tls on qemu_riscv32_smp failed (Timeout)
INFO - 9) tests/posix/pthread_pressure/portability.posix.pthread_pressure.tls.newlib on qemu_cortex_a53_smp failed (unexpected eof)
INFO - 10) tests/posix/pthread_pressure/portability.posix.pthread_pressure.tls.newlib on qemu_riscv64_smp failed (Timeout)
INFO -
INFO - To rerun the tests, call twister using the following commandline:
INFO - west twister -p <PLATFORM> -s <TEST ID>, for example:
INFO -
INFO - west twister -p qemu_riscv64_smp -s tests/posix/pthread_pressure/portability.posix.pthread_pressure.tls.newlib
INFO - or with west:
INFO - west build -p -b qemu_riscv64_smp -T tests/posix/pthread_pressure/portability.posix.pthread_pressure.tls.newlib
INFO - -+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
INFO - Run completed$ grep "back from dead" $(find twister-out/ -name handler.log)
twister-out/qemu_riscv32_smp/tests/posix/pthread_pressure/portability.posix.pthread_pressure.tls.newlib/handler.log: aborted _current back from dead
twister-out/qemu_riscv32_smp/tests/posix/pthread_pressure/portability.posix.pthread_pressure/handler.log: aborted _current back from dead
twister-out/qemu_x86_64/tests/posix/pthread_pressure/portability.posix.pthread_pressure.picolibc/handler.log: aborted _current back from dead
twister-out/qemu_cortex_a53_smp/tests/posix/pthread_pressure/portability.posix.pthread_pressure/handler.log: aborted _current back from dead
twister-out/qemu_riscv64_smp/tests/posix/pthread_pressure/portability.posix.pthread_pressure.tls.newlib/handler.log: aborted _current back from dead
twister-out/qemu_riscv64_smp/tests/posix/pthread_pressure/portability.posix.pthread_pressure.picolibc/handler.log: aborted _current back from dead |
|
A serious bummer. Sounds like we need to address the |
193b919 to
f124514
Compare
lib/posix/pthread.c
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, there's no protection here for using a thread index which is stale; checking only for one possible stale state (TERMINATED) provides a false sense of security. Makes me wonder if there should be some kind of validating mode for the API where a generation number is encoded in the upper bits of the thread ID that are validated against a current generation number stored in the thread?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the thought and it would at least replicate some of the other validation schemes in e.g. pthread_mutex.c.
To be honest, I think I should be able to transition pthreads to use <zephyr/sys/bitarray.h> as well.
Let me give it another spin.
Also, thanks @andyross and @npitre for validating the issue on aarch64. Will likely use an arch_exclude there if I am able to.
Will push in a moment.
lib/posix/pthread.c
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hrm. Is pthread_t at least as large as size_t and unisigned? Otherwise, I think to_posix_thread() needs some better bounds checking to avoid possible wrap issues (imagine a 32-bit pthread_t on a 64-bit size_t system, it's possible that a large address difference will wrap around back into range).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pthread_t is a uint32_t in Zephyr, so should be safe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that makes it unsafe -- on 64 bit targets, size_t is 64 bits, so you will get aliasing when converting from size_t to pthread_t. Probably need to fix pthread_self to bounds check the computed pointer before converting to an array index with the subtraction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True - for some reason I was thinking it was 32-bit. I'll add a bounds check in this rev
f124514 to
26b9337
Compare
|
Repeating the comment from elsewhere, regarding the kernel bug tracked in #58116 : FWIW: it seems not to be a kernel race in thread lifecycle, more like a arm64-specific race in context handling (and maybe x86_64, but with 1000x lower frequency). FWIW: I'd treat this as two separate issues and merge the pthread fixes. If you need to tune the test[1] on arm64 for CI, that seems acceptable as long as we have a bug tracking the instability. [1] SMP tests are already retried in CI because of unavoidable timer skew problems with scheduling "CPUs" onto host threads that can be preempted for 10ms+ in busy CI runs. (Which isn't what we're seeing here -- no timer interaction in this test!). The frequency I'm seeing probably wouldn't even show up above the noise floor. |
Previously, the pthread implementation used a `pthread_mutex_t` to synchronize thread state. This was required because a `pthread_cond_t` was used to notify other pthreads that were attempting to join or exit the same pthread. With the previous removal of the `pthread_cond_t` from `struct posix_thread`, and the addition of the `z_thread_wake_joiners()` utility, there is no longer a need to utilize a `pthread_mutex_t` to synchronize state. Instead, use a `struct k_spinlock` for synchronization and `z_thread_wake_joiners()` for notification. Additionally, use a `bitarray` instead of the the `pthread_pool_lock`, which makes `pthread.c` resource allocation more like that of the other pthread objects. Signed-off-by: Christopher Friedt <[email protected]>
This reverts commit 7c17bda. Signed-off-by: Christopher Friedt <[email protected]>
Technically speaking, there should be a synchronization to ensure that a spawned thread has started before `pthread_join()` is called. However, since we're recycling an existing no-op thread, just add a small delay. A separate test will be used to demonstrate stressing `pthread_create()` and `pthread_join()` which will use a synchronization primitive. Signed-off-by: Christopher Friedt <[email protected]>
Recently, a race condition was discovered in `pthread_create()` and `pthread_join()` that would trigger an assertion in `kernel/sched.c` of the form below. ``` aborted _current back from dead ``` Add a test that stresses the pthread implementation in order to more easily reproduce the failure. Signed-off-by: Christopher Friedt <[email protected]>
26b9337 to
cd77548
Compare
|
I'm going to close this PR, as somehow pthreads are less reliable than before in terms of passing Twister test runs, but I'll make a separate PR for the |
Recently, a race condition was discovered in
pthread_create()andpthread_join()that would trigger an assertion inkernel/sched.cof the form below.This required a few changes to address thoroughly, but all indications point to a significant improvement in determinism and stability.
PTHREAD_TERMINATEDz_thread_wake_joiners()(c/o @andyross)Fixes #56163
Andy's contributions brought in from #57721 (I hope that's ok). Thanks boss 👍
Compliance issue is a false positive