Description
Describe the bug
There is a race condition in the (optional) NATIVE_SIM_REBOOT component,
https://github.com/zephyrproject-rtos/zephyr/blob/main/boards/native/native_sim/reboot_bottom.c#L20
where before re-starting the process with exec(), all file descriptors are closed.
Unfortunately, this races with any other user of those descriptors and cannot be done safely.
This has been seen to crash the executable when pthread_exit() runs while the Zephyr threads are being terminated: depending on what has happened before, pthread_exit() may at that moment be (re)loading the libgcc_s.*.so library.
In principle, the same race can hit any other library or component that is still doing something asynchronously.
The safest option may be to simply not close any descriptor manually, and instead expect that whoever opened one did so with the O_CLOEXEC flag, which closes it automatically on a successful exec(). (This is the expectation for libraries.) Note that if somebody did not do this, the consequence would be either a "leak" of a descriptor to the child process or, in the worst case, that child process holding on to a port that should have been freed until it dies.
Regression
- This is a regression. (Issue has been there since the component was introduced)
Steps to reproduce
To ease reproduction, modify this test:
diff --git a/tests/boards/native_sim/reset_hw_info/src/main.c b/tests/boards/native_sim/reset_hw_info/src/main.c
index 7a8d5bf7c49..7a29d359357 100644
--- a/tests/boards/native_sim/reset_hw_info/src/main.c
+++ b/tests/boards/native_sim/reset_hw_info/src/main.c
@@ -14,6 +14,8 @@ int main(void)
 	uint32_t cause;
 	int err;
+	sys_reboot(SYS_REBOOT_WARM);
+
 	err = hwinfo_get_reset_cause(&cause);
 	if (err != 0) {
 		posix_print_error_and_exit("hwinfo_get_reset_cause() failed %i\n", err);
mkdir build && cd build
cmake -GNinja -DBOARD=native_sim ../tests/boards/native_sim/reset_hw_info && ninja
zephyr/zephyr.exe
The number of reboot iterations needed before it fails is quite random, but as of now it will eventually fail.
Relevant log output
*** Booting Zephyr OS build v4.2.0-4636-gb8541c53eeaf ***
This seems like the first start => Resetting
libgcc_s.so.1 must be installed for pthread_exit to work
Aborted (core dumped)
Impact
Annoyance – Minor irritation; no significant impact on usability or functionality.
Environment
- OS Linux (Ubuntu 24.04)
- Host gcc 14.2
- Main as of now b8541c5
Additional Context
Related to:
- boards: native_sim: add reboot for native sim #87987
- tests native_sim reset_hw_info: Skip by now #96765
- arch/posix: Workaround race condition in glibc #96780 (fixes the particular case for pthread_exit and the libgcc_s load)
Proposed fix:
- native_sim drivers: Set O_CLOEXEC for all native sim specific host descriptors by default #96806
- boards native_sim: reboot: Do not close descriptors manually #96812
Priority set as low, as the issue is in an optional component and does not happen always.
An strace of the issue reveals that pthread_exit() triggers an attempt to reload libgcc_s.so, which opens a new descriptor, which is then racily closed by that code in another thread:
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_LARGEFILE|O_CLOEXEC ..) = 3
close(3) = 0
In executions in which the two did not step on each other, that close simply fails, as expected:
close(3) = -1 EBADF (Bad file descriptor)