BUG: Fix ParallelSparseFieldLevelSet deadlock with PlatformMultiThreader (supersedes #6321)#6337
Conversation
ParallelSparseFieldLevelSetImageFilter::Iterate() dispatches its
per-iteration ThreadedApplyUpdate body via mt->ParallelizeArray().
The body invokes SignalNeighborsAndWait() several times, blocking on
per-thread std::condition_variables until both neighbor work units
have arrived at the same phase. This neighbor-only protocol assumes
every work unit holds its own concurrent OS thread for the full
duration of the dispatch. Pool/TBB-backed ParallelizeArray does not
honor this: N work units are submitted to a shared worker pool whose
size is bounded by core count, and CV-blocked tasks pin the pool so
queued work units never start and the dispatch deadlocks.
Reported as the intermittent timeout of
itkParallelSparseFieldLevelSetImageFilterTest in CDash test 2570265561
("Trying mf->Update()" followed by a 1000s timeout) on Windows
Release with TBB.
The existing test fails at ~3% per run under TBB, which is too rare
to catch the bug reliably in CI. This commit adds a new test
itkParallelSparseFieldLevelSetImageFilterRobustnessTest that drives
the deadlock to a high per-test-invocation probability:
Scenario 1 (sweep x repeat): cycle 1/2/4/8/11/16/32 work-unit
counts for 20 outer iterations, totalling 140 short filter runs.
Per-run pool-starvation deadlock probability is low but
amplification gives a ~95% per-test-invocation hit rate.
Scenario 2 (determinism): three repeated runs at workUnits=11 must
produce bit-identical output. Guards against future barrier
rewrites that introduce numerical non-determinism.
Scenario 3 (concurrent multi-pipeline): eight pipelines run
simultaneously via std::async (88 work units competing for a
core-count-bounded worker pool). Cross-pipeline contention
directly probes the dispatch-starvation deadlock path.
Local measurements on a 16-core x86 box with TBB:
- Upstream code, 30 invocations: 30 timeouts (100% deadlock).
- Upstream code, POOL/PLATFORM: pass (no worker shortage possible).
- Test runtime with a working synchronization: ~5s wallclock.
- CTest TIMEOUT set to 30 (6x leeway over expected runtime).
This commit intentionally introduces a failing test on master. The
proposed fix lands in a follow-up PR.
Iterate()'s ThreadedApplyUpdate dispatch blocks in SignalNeighborsAndWait until neighbor work units arrive, so every work unit needs its own concurrent OS thread. Pool/TBB ParallelizeArray queues N tasks onto a bounded worker set; CV-blocked tasks pin the workers so queued units never start, deadlocking. Force PlatformMultiThreader with DynamicMultiThreadingOff in the constructor (as SLICImageFilter does) so every work unit gets a dedicated OS thread. The neighbor-only sync protocol is unchanged. Supersedes InsightSoftwareConsortium#6321, which spawned explicit std::threads for only one ParallelizeArray call; forcing the threader preserves the filter's fixed thread==work-unit assumption across all parallel sections. Co-Authored-By: Bradley Lowekamp <blowekamp@mail.nih.gov>
fc15a07
into
InsightSoftwareConsortium:main
|
| Filename | Overview |
|---|---|
| Modules/Segmentation/LevelSets/include/itkParallelSparseFieldLevelSetImageFilter.hxx | 3-line constructor change: DynamicMultiThreadingOff() + SetMultiThreader(PlatformMultiThreader::New()) fixes the pool-starvation deadlock; matches the established itkSLICImageFilter pattern. Public SetMultiThreader on the base class could silently undo the fix if called post-construction. |
| Modules/Segmentation/LevelSets/test/itkParallelSparseFieldLevelSetImageFilterRobustnessGTest.cxx | New GTest suite with SweepRepeat, Determinism, and ConcurrentMultiPipeline scenarios. runWithDeadline is not exception-safe: an uncaught ITK exception from body() leaves the watchdog thread unjoinable, causing std::terminate() to kill the entire test binary. |
| Modules/Segmentation/LevelSets/test/CMakeLists.txt | Adds creategoogletestdriver block for the new GTest file; follows existing ITK conventions with no explicit itk_add_test calls needed (gtest_discover_tests handles registration). Clean change. |
Sequence Diagram
sequenceDiagram
participant Caller
participant Filter as ParallelSparseFieldLevelSetImageFilter
participant PMT as PlatformMultiThreader
participant T1 as WorkUnit Thread 1..N
Caller->>Filter: New() [constructor]
Filter->>Filter: DynamicMultiThreadingOff()
Filter->>PMT: New()
Filter->>Filter: SetMultiThreader(PMT)
Caller->>Filter: Update()
Filter->>PMT: ParallelizeArray(ThreadedApplyUpdate)
PMT->>T1: Spawn N dedicated OS threads
T1->>T1: ThreadedApplyUpdate + SignalNeighborsAndWait
Note over T1: All N threads run concurrently
T1-->>PMT: All threads complete
PMT-->>Filter: Return
Filter-->>Caller: Output ready
Reviews (1): Last reviewed commit: "BUG: Fix ParallelSparseFieldLevelSet dea..." | Re-trigger Greptile
| template <typename TFunctor> | ||
| void | ||
| runWithDeadline(unsigned int seconds, TFunctor && body) | ||
| { | ||
| std::mutex m; | ||
| std::condition_variable cv; | ||
| bool done = false; | ||
| std::thread watchdog([&] { | ||
| std::unique_lock<std::mutex> lk(m); | ||
| if (!cv.wait_for(lk, std::chrono::seconds(seconds), [&] { return done; })) | ||
| { | ||
| std::cerr << "Deadline exceeded (" << seconds << "s): ParallelizeArray dispatch deadlock\n"; | ||
| std::abort(); | ||
| } | ||
| }); | ||
| body(); | ||
| { | ||
| const std::lock_guard<std::mutex> lk(m); | ||
| done = true; | ||
| } | ||
| cv.notify_all(); | ||
| watchdog.join(); | ||
| } |
There was a problem hiding this comment.
runWithDeadline terminates the test driver on any pipeline exception
If body() throws (e.g., an itk::ExceptionObject from mf->Update()), the code after body() is never reached — done is never set, cv.notify_all() is never called, and watchdog.join() is never called. The watchdog destructor then invokes std::terminate() because the thread is still joinable, which kills the entire GTest binary. All subsequent test cases (Determinism, ConcurrentMultiPipeline) are silently lost rather than reported. A try/catch around body() that signals done and joins before rethrowing would let GTest report the failure normally.
| template <typename TFunctor> | |
| void | |
| runWithDeadline(unsigned int seconds, TFunctor && body) | |
| { | |
| std::mutex m; | |
| std::condition_variable cv; | |
| bool done = false; | |
| std::thread watchdog([&] { | |
| std::unique_lock<std::mutex> lk(m); | |
| if (!cv.wait_for(lk, std::chrono::seconds(seconds), [&] { return done; })) | |
| { | |
| std::cerr << "Deadline exceeded (" << seconds << "s): ParallelizeArray dispatch deadlock\n"; | |
| std::abort(); | |
| } | |
| }); | |
| body(); | |
| { | |
| const std::lock_guard<std::mutex> lk(m); | |
| done = true; | |
| } | |
| cv.notify_all(); | |
| watchdog.join(); | |
| } | |
| try | |
| { | |
| body(); | |
| } | |
| catch (...) | |
| { | |
| { | |
| const std::lock_guard<std::mutex> lk(m); | |
| done = true; | |
| } | |
| cv.notify_all(); | |
| watchdog.join(); | |
| throw; | |
| } | |
| { | |
| const std::lock_guard<std::mutex> lk(m); | |
| done = true; | |
| } | |
| cv.notify_all(); | |
| watchdog.join(); | |
| } |
| // Neighbor-blocking inter-phase sync requires a dedicated OS thread per work unit. | ||
| this->DynamicMultiThreadingOff(); | ||
| this->SetMultiThreader(PlatformMultiThreader::New()); |
There was a problem hiding this comment.
Post-construction
SetMultiThreader silently re-introduces deadlock
SetMultiThreader is public on the base class, so a caller who runs filter->SetMultiThreader(TBBMultiThreader::New()) after construction replaces PlatformMultiThreader with a pool-backed threader and restores exactly the dispatch-starvation deadlock this PR fixes. Since the filter's synchronisation protocol mandates one dedicated OS thread per work unit, the constructor should at minimum document that overriding the threader breaks the filter — or note it in the class-level Doxygen rather than only in a constructor comment.
|
|
||
| #include "gtest/gtest.h" | ||
|
|
||
| #include "itkLevelSetFunction.h" | ||
| #include "itkParallelSparseFieldLevelSetImageFilter.h" |
There was a problem hiding this comment.
File-level comment references a transient resource and exceeds the prose-budget cap
prose-budget.md caps in-source comments at one short line (≤ 100 chars) and explicitly forbids references to the PR ("See the PR for the full analysis") because the PR is a transient resource — "the link rots; the prose remains as noise." The second sentence and the "See the PR" cross-reference should be removed; the surviving single line should capture only the non-obvious invariant that the reader needs with the diff open.
Context Used: AGENTS.md (source)
Fixes the
ParallelSparseFieldLevelSetImageFilter::Iterate()dispatch-starvation deadlock by forcingPlatformMultiThreaderin the constructor (per @blowekamp's suggestion on #6321), rather than spawning explicitstd::threads. Intended to supersede #6321.Root cause and why this approach
Iterate()dispatchesThreadedApplyUpdateviamt->ParallelizeArray(); the body blocks inSignalNeighborsAndWait()until neighbor work units arrive. This neighbor-only protocol requires every work unit to hold its own concurrent OS thread for the full dispatch. Pool/TBB-backedParallelizeArrayqueues N tasks onto a bounded worker set — CV-blocked tasks pin all workers, queued units never start, deadlock.#6321 swapped only the one
ParallelizeArraycontainingSignalNeighborsAndWaitfor explicitstd::threads. @blowekamp noted that leaves other multi-threader/work-unit combinations potentially vulnerable, and suggested instead hard-codingPlatformMultiThreaderto preserve the filter's original fixed-thread assumption — the same patternitkSLICImageFilteruses:PlatformMultiThreader::SingleMethodExecutespawns one dedicated OS thread per work unit (calling thread + N−1 spawned), andParallelizeArrayHelperwith range == work-unit count gives each thread exactly one index — so everyThreadedApplyUpdate/SignalNeighborsAndWaitruns on its own concurrent thread. The neighbor-sync protocol and all data members are unchanged; the fix applies to every parallel section in the filter, not just one.Local validation — 700 runs, 0 failures
Built with
Module_ITKTBB=ONand exercised under both threaders:itkParallelSparseFieldLevelSetImageFilterTest(baseline image compare)700 total test executions, zero failures, zero deadlocks. Output bit-identical every run (the Determinism scenario and the MD5-pinned baseline both pass). Full
ninjabuild (3345 targets) clean;pre-commit run --all-filesclean.Note: the original deadlock is timing-/core-count-dependent (needs work units > available TBB workers). On the 48-core validation host
m_NumOfWorkUnitscaps to ≤48, so the committed sweep does not starve there; the deterministic 100%-deadlock reproduction is on a 16-core box (see #6321) and CI's TBB legs.