Race condition in NATIVE_SIM_REBOOT component #96809

@aescolar

Describe the bug

There is a race condition in the (optional) NATIVE_SIM_REBOOT component,
https://github.com/zephyrproject-rtos/zephyr/blob/main/boards/native/native_sim/reboot_bottom.c#L20
where all file descriptors are closed before the process is restarted with exec().
Unfortunately, this races with any other possible user of those descriptors and cannot be done safely.
This has been seen crashing the executable when pthread_exit() runs while the Zephyr threads are being terminated, which, depending on what has happened before, may be (re)loading the libgcc_s.*.so library at that moment.

But in principle this could race with any other library or component that may still be doing something asynchronously.

The safest option may be to simply not close any descriptor manually, and instead expect that whoever opened one did so with the O_CLOEXEC flag, which automatically closes it on a successful exec(). (This is the expectation for libraries.) Note that if somebody did not do this, the consequence would be either a "leak" of a descriptor to the child process or, in the worst case, that child process holding on to a port that should have been freed until it dies.
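As a rough illustration of the O_CLOEXEC expectation, the sketch below opens a descriptor with the flag, checks the flag with fcntl(), and sets it on a descriptor that was obtained without it. The helper names (open_cloexec, fd_is_cloexec, fd_set_cloexec) are made up for this example and are not part of Zephyr or native_sim; only the POSIX calls open() and fcntl() are real.

```c
#define _POSIX_C_SOURCE 200809L /* for O_CLOEXEC */
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: open read-only so that the kernel closes the
 * descriptor automatically on a successful exec().
 */
int open_cloexec(const char *path)
{
	return open(path, O_RDONLY | O_CLOEXEC);
}

/* Hypothetical helper: returns 1 if fd is marked close-on-exec,
 * 0 if it is not, and -1 if fd is not a valid descriptor.
 */
int fd_is_cloexec(int fd)
{
	int flags = fcntl(fd, F_GETFD);

	if (flags < 0) {
		return -1;
	}
	return (flags & FD_CLOEXEC) != 0;
}

/* Hypothetical helper: mark an already open descriptor close-on-exec,
 * for example one that was obtained without O_CLOEXEC.
 * Returns 0 on success, -1 on error.
 */
int fd_set_cloexec(int fd)
{
	int flags = fcntl(fd, F_GETFD);

	if (flags < 0) {
		return -1;
	}
	return fcntl(fd, F_SETFD, flags | FD_CLOEXEC);
}
```

With every descriptor marked this way, no manual close loop would be needed before exec(). Note that setting the flag after the fact with fcntl() leaves a short window between open and F_SETFD, which is why passing O_CLOEXEC at open time is generally preferred in multithreaded programs.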

Regression

  • This is a regression. (Issue has been there since the component was introduced)

Steps to reproduce

To ease reproduction modify this test:

diff --git a/tests/boards/native_sim/reset_hw_info/src/main.c b/tests/boards/native_sim/reset_hw_info/src/main.c
index 7a8d5bf7c49..7a29d359357 100644
--- a/tests/boards/native_sim/reset_hw_info/src/main.c
+++ b/tests/boards/native_sim/reset_hw_info/src/main.c
@@ -14,6 +14,8 @@ int main(void)
        uint32_t cause;
        int err;
 
+       sys_reboot(SYS_REBOOT_WARM);
+
        err = hwinfo_get_reset_cause(&cause);
        if (err != 0) {
                posix_print_error_and_exit("hwinfo_get_reset_cause() failed %i\n", err);

mkdir build && cd build
cmake -GNinja -DBOARD=native_sim ../tests/boards/native_sim/reset_hw_info && ninja
zephyr/zephyr.exe

The number of reboot iterations needed before it fails is quite random, but as of now it will eventually fail.

Relevant log output

*** Booting Zephyr OS build v4.2.0-4636-gb8541c53eeaf ***
This seems like the first start => Resetting
libgcc_s.so.1 must be installed for pthread_exit to work
Aborted (core dumped)

Impact

Annoyance – Minor irritation; no significant impact on usability or functionality.

Environment

  • OS Linux (Ubuntu 24.04)
  • Host gcc 14.2
  • Main as of now b8541c5

Additional Context

Related to:

Proposed fix:

Priority set as low, as the issue is in an optional component and does not happen always.


An strace of the issue reveals that pthread_exit() triggers an attempt to reload libgcc_s.so, which opens a new descriptor that is then racily closed by that code in another thread.

   openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_LARGEFILE|O_CLOEXEC ..) = 3
   close(3) = 0

In executions in which the two did not step on each other, that close simply fails as expected:
close(3) = -1 EBADF (Bad file descriptor)
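
The two outcomes above can be reproduced deterministically, without threads, by playing both roles in sequence. This is only a sketch of the hazard: close_fd_range() is a hypothetical stand-in for a "close every descriptor" loop and is not the actual reboot_bottom.c code, and owner_read_after_sweep() is likewise made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L /* for O_CLOEXEC */
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical stand-in for a "close every descriptor" loop.
 * Returns how many close() calls succeeded; closing a number that is
 * not open fails with EBADF and is not counted.
 */
int close_fd_range(int first, int last)
{
	int closed = 0;

	for (int fd = first; fd <= last; fd++) {
		if (close(fd) == 0) {
			closed++;
		}
	}
	return closed;
}

/* Plays the "library" and the "sweeper" one after the other:
 * the library opens a descriptor, the sweep closes it behind its back,
 * and the library then tries to use it.
 * Returns the errno seen by the library's read(), 0 if the read
 * unexpectedly worked, or -1 if the initial open() failed.
 */
int owner_read_after_sweep(void)
{
	char byte;
	int fd = open("/dev/null", O_RDONLY | O_CLOEXEC);

	if (fd < 0) {
		return -1;
	}

	close_fd_range(fd, fd); /* the sweep hits a descriptor it does not own */

	if (read(fd, &byte, 1) < 0) {
		return errno; /* the owner's descriptor is gone */
	}

	close(fd);
	return 0;
}
```

In the real race the sweep runs in a different thread, so whether the close lands between the library's open and its next use depends on timing, which matches the random number of iterations seen before a failure.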

Labels

  • area: native port (Host native arch port (native_sim))
  • bug (The issue is a bug, or the PR is fixing a bug)
  • priority: low (Low impact/importance bug)
