Description
Describe the bug
In complex scenarios involving several threads and a few mutexes, we have seen intermittent, timing-dependent crashes. I've narrowed this down to the scalable waitq implementation not re-inserting a thread when its priority changes, which can leave the rb tree incorrectly sorted and therefore invalid. Without the scalable waitq, the wrong thread may be run, which may or may not cause a problem depending on the application.
To Reproduce
This is not limited to mutexes, but the issue is most easily reproduced with mutexes when the scalable waitq is enabled. Consider 4 threads in decreasing priority order (A, B, C, and D), along with two mutexes, m0 and m1:
- D locks m1
- C locks m0
- C pends on m1
- B pends on m1
- A pends on m0 and boosts C's priority; the tree on m1 is now no longer sorted
- D unlocks m1; the left-most thread on the tree is B. When removing B from the tree, it cannot be found: the search goes to the right of C because of C's boosted priority, while B's node is actually on the left. rb_remove silently fails.
- B unlocks m1; the left-most thread on the tree is still B, and it tries to unpend itself, resulting in a NULL pointer dereference on B->base.pended_on.
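The failure mode in the steps above can be sketched outside the kernel. The following is a minimal stand-alone model, not Zephyr's rbtree or the change in #87235: it uses a plain unbalanced binary search tree, and `struct waiter`, `wq_insert`, `wq_remove`, `wq_best`, and `scenario` are hypothetical names made up for illustration. Priorities 0 to 2 are assumed for A, B, and C (lower number means higher priority, as in Zephyr). It only shows that changing a key in place while the node is queued makes a comparison-based removal miss the node, and that removing, updating, and re-inserting avoids this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a pended thread (not a Zephyr type). */
struct waiter {
	const char *name;
	int prio; /* lower value = higher priority */
	struct waiter *left, *right;
};

static bool less_than(const struct waiter *a, const struct waiter *b)
{
	return a->prio < b->prio;
}

/* Insert w by walking the tree using the current priorities. */
static void wq_insert(struct waiter **root, struct waiter *w)
{
	w->left = w->right = NULL;
	while (*root != NULL) {
		root = less_than(w, *root) ? &(*root)->left : &(*root)->right;
	}
	*root = w;
}

/* Find w by key comparison and unlink it. Returns false if the walk
 * never reaches w, in which case the tree is left unchanged.
 */
static bool wq_remove(struct waiter **root, struct waiter *w)
{
	while (*root != NULL && *root != w) {
		root = less_than(w, *root) ? &(*root)->left : &(*root)->right;
	}
	if (*root == NULL) {
		return false;
	}
	if (w->left == NULL) {
		*root = w->right;
	} else if (w->right == NULL) {
		*root = w->left;
	} else {
		/* Two children: splice in the in-order successor. */
		struct waiter **succ = &w->right;

		while ((*succ)->left != NULL) {
			succ = &(*succ)->left;
		}
		struct waiter *s = *succ;

		*succ = s->right;
		s->left = w->left;
		s->right = w->right;
		*root = s;
	}
	w->left = w->right = NULL;
	return true;
}

/* Left-most node, i.e. what the tree believes is the best waiter. */
static struct waiter *wq_best(struct waiter *root)
{
	while (root != NULL && root->left != NULL) {
		root = root->left;
	}
	return root;
}

/* Replays the m1 wait queue from the report. Returns whether removing
 * the left-most waiter succeeds when D unlocks m1.
 */
static bool scenario(bool requeue_on_boost)
{
	struct waiter b = { "B", 1, NULL, NULL };
	struct waiter c = { "C", 2, NULL, NULL };
	struct waiter *m1_waitq = NULL;

	wq_insert(&m1_waitq, &c); /* C pends on m1 */
	wq_insert(&m1_waitq, &b); /* B pends on m1, lands left of C */

	/* A (assumed priority 0) pends on m0, so C inherits priority 0. */
	if (requeue_on_boost) {
		wq_remove(&m1_waitq, &c);
		c.prio = 0;
		wq_insert(&m1_waitq, &c);
	} else {
		c.prio = 0; /* key changed in place: tree no longer sorted */
	}

	struct waiter *next = wq_best(m1_waitq);

	assert(next != NULL);
	return wq_remove(&m1_waitq, next);
}
```

In this model, `scenario(false)` picks B as the left-most waiter and then fails to remove it, because the walk compares B against the boosted C and goes right. `scenario(true)` re-inserts C after the boost, so the left-most waiter is C and the removal succeeds.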
Expected behavior
System does not crash.
Impact
Intermittent but fairly consistent crashes on our system.
Logs and console output
See test: https://github.com/zephyrproject-rtos/zephyr/pull/87235/files#diff-8adc688dcc6c66e2f0a064f4ed385d3ff51e325b66ab8e4e9b7570cf80d1bf22
Environment (please complete the following information):
- OS: WSL
- Toolchain gcc-arm-none-eabi
- v3.3, v3.7
Additional context
Fixed in #87235
Looking to get backported:
#87840
#87839
#87841