[fix](executor) Fix rare self-deadlock that can cause the time-sharing task executor to hang. #58273
Conversation
Thank you for your contribution to Apache Doris. Please clearly describe your PR:

run nonConcurrent
force-pushed from 90c5711 to 4fab545
run nonConcurrent

run buildall

TPC-H: Total hot run time: 34076 ms
ClickBench: Total hot run time: 27.91 s
BE UT Coverage Report: Increment line coverage / Increment coverage report
run nonConcurrent
7 similar comments
force-pushed from 4fab545 to d3e7247
run buildall
force-pushed from d3e7247 to f3cc40b
run buildall

TPC-H: Total hot run time: 35954 ms
TPC-DS: Total hot run time: 186512 ms
ClickBench: Total hot run time: 27.2 s

run nonConcurrent
force-pushed from f3cc40b to 8651d73
run buildall
force-pushed from 8651d73 to 79a2525
run buildall

BE UT Coverage Report: Increment line coverage / Increment coverage report

run nonConcurrent
force-pushed from 70ed1bd to e1e8ec2
run buildall

TPC-H: Total hot run time: 32094 ms
TPC-DS: Total hot run time: 172925 ms
BE UT Coverage Report: Increment line coverage / Increment coverage report

BE Regression && UT Coverage Report: Increment line coverage / Increment coverage report

PR approved by at least one committer and no changes requested.

PR approved by anyone and no changes requested.
…g task executor to hang. (#58273)

### What problem does this PR solve?

### Release note

#### Summary

1. Fix a self-deadlock that can occur in `_dispatch_thread` when a destructor path attempts to re-acquire the same mutex already held by the thread. The root cause is destructors (triggered while holding `_lock`) performing operations that try to re-acquire the same mutex. A safe fix should ensure destructors that may call `remove_task()` run outside the `_lock` scope, or avoid re-locking the same mutex inside destructor paths.
2. Call `_task_executor->wait()` in `TaskExecutorSimplifiedScanScheduler::stop()`.

---

#### Details / Reproduction steps

1. `std::shared_ptr<PrioritizedSplitRunner> split = _tokenless->_entries->take();`
3. `l.lock();` — `_dispatch_thread` acquires `_lock`.
4. After the `while` loop finishes, `split` goes out of scope and the `shared_ptr` is destroyed.
5. `PrioritizedSplitRunner` destructor runs → destroys `_split_runner` (`ScannerSplitRunner`).
6. `ScannerSplitRunner::_scan_func` destructor runs → destroys captured `ctx` (`std::shared_ptr<ScannerContext>`).
7. `ScannerContext::~ScannerContext()` calls `remove_task()`.
8. `remove_task()` attempts to acquire `_lock`.
9. Result: **self-deadlock** because `_lock` is already held by `_dispatch_thread`.

---

#### Solution

We must explicitly release `split` BEFORE acquiring `_lock` to avoid self-deadlock. The destructor chain (`PrioritizedSplitRunner` -> `ScannerSplitRunner` -> `_scan_func` lambda -> captured `ScannerContext`) may call `remove_task()`, which tries to acquire `_lock`. Since `_lock` is not a recursive mutex, this would deadlock.
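To make the failure mode concrete, here is a minimal, self-contained sketch of the same pattern. The names (`g_lock`, `FakeScannerContext`) are hypothetical stand-ins, not the actual Doris classes: a thread takes a non-recursive mutex, then drops the last `shared_ptr` reference to an object whose destructor locks that same mutex.

```cpp
#include <memory>
#include <mutex>

// Hypothetical stand-ins (not the real Doris types): g_lock plays the role of
// the executor's _lock, FakeScannerContext plays the role of ScannerContext,
// whose destructor ends up calling remove_task() and re-locking the mutex.
std::mutex g_lock;

struct FakeScannerContext {
    ~FakeScannerContext() {
        std::lock_guard<std::mutex> guard(g_lock);  // re-acquires the same mutex
    }
};

int main() {
    auto split = std::make_shared<FakeScannerContext>();

    std::unique_lock<std::mutex> l(g_lock);  // "dispatch thread" takes the lock
    split.reset();  // last reference dies while the lock is held: the destructor
                    // blocks on g_lock and the thread deadlocks with itself
                    // (formally undefined behavior for a non-recursive std::mutex)
}
```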
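For contrast, a sketch of the fixed ordering that the Solution describes: release the reference before acquiring the mutex, so any `remove_task()`-style work in the destructor runs without the lock held. Again, the names are illustrative, not the executor's real code.

```cpp
#include <memory>
#include <mutex>

std::mutex g_lock;  // stand-in for the executor's non-recursive _lock

struct FakeScannerContext {
    ~FakeScannerContext() {
        // stand-in for ScannerContext::~ScannerContext() -> remove_task()
        std::lock_guard<std::mutex> guard(g_lock);
    }
};

int main() {
    auto split = std::make_shared<FakeScannerContext>();

    // Explicitly release the (possibly last) reference BEFORE taking the lock.
    // The destructor may lock g_lock, which is safe here because this thread
    // does not hold it yet.
    split.reset();

    std::unique_lock<std::mutex> l(g_lock);  // now safe to acquire
    // ... bookkeeping that requires the lock ...
}
```

Scoping `split` so it dies before the lock is taken would achieve the same effect; an explicit `reset()` simply makes the intent obvious at the call site.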
### Check List (For Author)
- Test
- Behavior changed:
- Does this need documentation?

### Check List (For Reviewer who merge this PR)