Fix: stack-use-after-free after recent run_loop changes #1742
Closed
There's still a use-after-free bug: a background thread can store true to `__done_`, and then, before it calls notify, the waiter thread may see that `__done_` is true and break (say, after a spurious wakeup, or because the wait never reached the first iteration of the loop); sync_wait then returns and destructs. By the way, I don't think you need to loop here: `std::atomic::wait` only returns when the value has changed, accounting for spurious wakeups (it has a loop internally).
I was able to detect this with Relacy, which I am in the process of updating to work with stdexec. The test is https://github.com/NVIDIA/stdexec/compare/main...ccotter:stdexec:sync-wait-rrd-bug?expand=1
To run the test, follow these instructions, except use this branch instead of relacy's main branch: dvyukov/relacy#33
I'm actually not sure how this can be addressed. When the background thread notifies the foreground thread to wake up and unblock, the communication channel (condition variable, atomic, or other) may be destroyed when the sync_wait objects created on the foreground thread's stack go out of scope. Whether the notifier uses `done.store(true); done.notify_one();` or `done = true; cv.notify();`, I'm not sure how to ensure the foreground thread will wake up and continue, but only after the background thread has finished doing the notification...
@lewissbaker - wondering if you might be able to shed some light on this .. I seem to have hit a mental dead end in not understanding how it's possible to have a sync_wait() without allocating the internal run_loop on the heap and giving the receiver a strong reference to guarantee lifetime.
@ccotter The receiver is the one allocating the shared state (currently on the stack).
In fact, the code you mentioned above was very similar to the code where I detected the bug this fix is for (see here: https://github.com/olympus-robotics/hephaestus/blob/main/modules/concurrency/tests/context_tests.cpp#L96).
In fact, I believe the solution I came up with is safe. I don't have a full proof, but I ran my test case around a million times.
Let me try to explain why what you are describing should be safe:
The execution context signals completion by performing the following steps (in a strict "happens before" order):
1.1. Set the run loop's finishing flag to true
1.2. (optional) Signal the run loop
1.3. Set done to true
1.4. Signal done
The execution context waiting on completion performs the following steps:
2.1. If finishing is true, go to 2.6.
2.2. Wait for new task
2.3. Execute task
2.4. Go to 2.1.
2.6. Wait for done to be set to true
2.7. Return (ends the lifetime of the state)
Without the 'done' step, we run into the situation that step 1.2 might access invalid memory because we already reached 2.7 too early.
Previously, this was ensured by actually holding the mutex while signalling the condition variable (stdexec/include/stdexec/__detail/__run_loop.hpp, line 210 in b273eee).
I can see how there is still a tiny window which can cause a use after free as you described (even after removing the loop).
I think the only way this can be truly fixed is to introduce a count of 'tasks in flight' for the run_loop:
- `run_loop::__push_back` increments this count.
- `run_loop::__execute_all` decrements this count by the number of tasks it handled.
- `run_loop::finish` increments the count before setting the finishing flag and decrements it before returning.
- `run_loop::run` keeps looping as long as the count is non-zero or finishing has not been set to true.

I wanted to avoid this, since it sounded expensive, and the solution I propose here at least seems to work...