Skip to content

ENH: Add consistency (should fail) to force test failure in ParallelSparseFieldLevelSet deadlock#6320

Merged
hjmjohnson merged 1 commit into
InsightSoftwareConsortium:mainfrom
hjmjohnson:test/pslsif-deadlock-demo
May 27, 2026
Merged

ENH: Add consistency (should fail) to force test failure in ParallelSparseFieldLevelSet deadlock#6320
hjmjohnson merged 1 commit into
InsightSoftwareConsortium:mainfrom
hjmjohnson:test/pslsif-deadlock-demo

Conversation

@hjmjohnson
Copy link
Copy Markdown
Member

Adds a deterministic-failure regression test for the intermittent timeout of itkParallelSparseFieldLevelSetImageFilterTest (e.g. CDash test 2570265561).

This test intentionally fails on main — it is the first half of a two-PR sequence that demonstrates the bug here, then fixes it in a follow-up PR.

ParallelSparseFieldLevelSetImageFilter::Iterate() dispatches its inner ThreadedApplyUpdate body via mt->ParallelizeArray(). The body invokes SignalNeighborsAndWait() several times, blocking on per-thread std::condition_variables until both neighbor work units have arrived at the same phase. The neighbor-only sync requires every work unit to hold its own concurrent OS thread for the full duration of the dispatch. Pool/TBB-backed ParallelizeArray does not honor this: N work units are submitted to a shared worker pool whose size is bounded by core count, and CV-blocked tasks pin the pool so queued work units never start and the dispatch deadlocks. The existing baseline test fails at ~3% per run under TBB on a 16-core box — far too rare to catch the bug reliably.

Test scenarios

itkParallelSparseFieldLevelSetImageFilterRobustnessTest:

  1. Sweep × repeat. Cycle the work-unit counts {1, 2, 4, 8, 11, 16, 32} for 20 outer iterations (140 short filter runs). Each filter run has a low per-invocation deadlock probability; cycling amplifies it to ~95% per test invocation while keeping total runtime under ~10 s with a working synchronization. Output must be finite and in a plausible range at every count.

  2. Determinism. Three repeated runs at workUnits = 11 must produce bit-identical output. Guards future barrier rewrites against introducing numerical non-determinism.

  3. Concurrent multi-pipeline. Eight pipelines run simultaneously via std::async (8 × 11 = 88 work units competing for a single core-count-bounded worker pool). Cross-pipeline worker contention directly probes the dispatch-starvation deadlock.

Measured failure rate

Local soak on a 16-core x86 box with TBB:

Configuration Result
Current master, TBB, 30 invocations 30/30 timeouts (100% deadlock)
Current master, POOL, 20 invocations 20/20 pass (no worker shortage path)
Current master, PLATFORM, 20 invocations 20/20 pass (no worker shortage path)
Expected runtime with a working fix ~5 s wallclock

CTest TIMEOUT set to 30 s → 6× leeway over expected runtime on the local box, accommodates slower CI hardware while still flagging the deadlock within a reasonable bound.

How to verify locally
cmake -DModule_ITKTBB:BOOL=ON build-standard   # ensure TBB available
ninja -C build-standard ITKLevelSetsTestDriver
env ITK_GLOBAL_DEFAULT_THREADER=TBB \
    build-standard/bin/ITKLevelSetsTestDriver \
    itkParallelSparseFieldLevelSetImageFilterRobustnessTest
# expect deadlock at 30 s timeout on master

@github-actions github-actions Bot added type:Infrastructure Infrastructure/ecosystem related changes, such as CMake or buildbots type:Enhancement Improvement of existing methods or implementation type:Testing Ensure that the purpose of a class is met/the results on a wide set of test cases are correct area:Segmentation Issues affecting the Segmentation module labels May 21, 2026
@hjmjohnson hjmjohnson marked this pull request as ready for review May 21, 2026 22:33
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 21, 2026

Greptile Summary

This PR adds a deterministic regression test (itkParallelSparseFieldLevelSetImageFilterRobustnessTest) designed to reliably expose the pool/TBB deadlock in ParallelSparseFieldLevelSetImageFilter::Iterate(), where SignalNeighborsAndWait CV-blocks pin all pool workers when work-unit count exceeds pool size. The test is intentionally expected to deadlock (and timeout) on main under TBB; the fix is planned for a follow-up PR.

  • Scenario 1 cycles 7 work-unit counts for 20 sweeps (140 filter runs at 30 iterations each) to amplify a ~1-3% per-invocation deadlock probability above 95% per test run.
  • Scenario 2 asserts bit-identical output across three independent runs at workUnits=11 to guard future barrier rewrites against numerical non-determinism.
  • Scenario 3 launches 8 concurrent pipelines via std::async (88 total work units) to probe cross-pipeline pool-starvation directly; a stale file-level comment incorrectly states "four pipelines" while the code and per-scenario comment use eight."

Confidence Score: 3/5

The test logic is sound, but a factual comment mismatch and prose-budget violations should be corrected before merging.

The "four pipelines" claim in the top-of-file summary conflicts with both kConcurrentPipelines = 8 and the per-scenario comment, giving any reader half the actual stress-load figure. The 23-line file-level comment block, CDash test-number citation, and multi-line scenario prose inside main() all violate ITK's hard prose-budget cap, which AGENTS.md calls a review-blocking defect. The determinism assumption in Scenario 2 also warrants confirmation before the test contract is locked in.

itkParallelSparseFieldLevelSetImageFilterRobustnessTest.cxx needs the four-to-eight fix, comment trimming to meet the prose budget, and confirmation that the determinism assertion holds across independent filter runs.

Important Files Changed

Filename Overview
Modules/Segmentation/LevelSets/test/itkParallelSparseFieldLevelSetImageFilterRobustnessTest.cxx New 337-line regression test for the deadlock; has a stale "four pipelines" count in the file-level comment (code uses 8), a 23-line comment block violating the prose-budget cap, and uses the legacy test pattern rather than GoogleTest.
Modules/Segmentation/LevelSets/test/CMakeLists.txt Registers the new test source and adds an itk_add_test entry with a 30-second CTest TIMEOUT; straightforward and correct.

Sequence Diagram

sequenceDiagram
    participant Test as RobustnessTest (main)
    participant S1 as Scenario 1 (sweep x repeat)
    participant S2 as Scenario 2 (determinism)
    participant S3 as Scenario 3 (concurrent)
    participant F as MorphFilter::Update()
    participant PA as ParallelizeArray
    participant WU as WorkUnit (thread)

    Test->>S1: 20 reps x 7 work-unit counts
    S1->>F: run_one(wu, 30)
    F->>PA: dispatch N work units
    PA->>WU: SignalNeighborsAndWait() blocks on CV
    Note over PA,WU: Bug: pool size M < N, all workers blocked, deadlock
    WU-->>F: phase complete (if fix applied)
    F-->>S1: output image
    S1->>Test: isfinite + range check

    Test->>S2: "3 independent runs at workUnits=11"
    S2->>F: run_one(11, 100) x3
    F-->>S2: output image
    S2->>Test: bit-identical comparison

    Test->>S3: 6 reps x 8 async pipelines
    par 8 concurrent pipelines
        S3->>F: run_one(11, 60) via std::async
        F->>PA: "8x11=88 work units contend same pool"
        PA-->>F: all units complete (if fix applied)
    end
    F-->>S3: 8 output images
    S3->>Test: all outputs IsNotNull
Loading

Reviews (1): Last reviewed commit: "ENH: Add robustness test exposing Parall..." | Re-trigger Greptile

@hjmjohnson hjmjohnson force-pushed the test/pslsif-deadlock-demo branch 4 times, most recently from 5738416 to b89cbc4 Compare May 23, 2026 18:48
ParallelSparseFieldLevelSetImageFilter::Iterate() dispatches its
per-iteration ThreadedApplyUpdate body via mt->ParallelizeArray().
The body invokes SignalNeighborsAndWait() several times, blocking on
per-thread std::condition_variables until both neighbor work units
have arrived at the same phase.  This neighbor-only protocol assumes
every work unit holds its own concurrent OS thread for the full
duration of the dispatch.  Pool/TBB-backed ParallelizeArray does not
honor this: N work units are submitted to a shared worker pool whose
size is bounded by core count, and CV-blocked tasks pin the pool so
queued work units never start and the dispatch deadlocks.

Reported as the intermittent timeout of
itkParallelSparseFieldLevelSetImageFilterTest in CDash test 2570265561
("Trying mf->Update()" followed by a 1000s timeout) on Windows
Release with TBB.

The existing test fails at ~3% per run under TBB, which is too rare
to catch the bug reliably in CI.  This commit adds a new test
itkParallelSparseFieldLevelSetImageFilterRobustnessTest that drives
the deadlock to a high per-test-invocation probability:

  Scenario 1 (sweep x repeat): cycle 1/2/4/8/11/16/32 work-unit
    counts for 20 outer iterations, totalling 140 short filter runs.
    Per-run pool-starvation deadlock probability is low but
    amplification gives a ~95% per-test-invocation hit rate.

  Scenario 2 (determinism): three repeated runs at workUnits=11 must
    produce bit-identical output.  Guards against future barrier
    rewrites that introduce numerical non-determinism.

  Scenario 3 (concurrent multi-pipeline): eight pipelines run
    simultaneously via std::async (88 work units competing for a
    core-count-bounded worker pool).  Cross-pipeline contention
    directly probes the dispatch-starvation deadlock path.

Local measurements on a 16-core x86 box with TBB:
  - Upstream code, 30 invocations: 30 timeouts (100% deadlock).
  - Upstream code, POOL/PLATFORM: pass (no worker shortage possible).
  - Test runtime with a working synchronization: ~5s wallclock.
  - CTest TIMEOUT set to 30 (6x leeway over expected runtime).

This commit intentionally introduces a failing test on master.  The
proposed fix lands in a follow-up PR.
@hjmjohnson hjmjohnson force-pushed the test/pslsif-deadlock-demo branch from b89cbc4 to bf38f7e Compare May 23, 2026 23:06
@hjmjohnson
Copy link
Copy Markdown
Member Author

2503 - ParallelSparseFieldLevelSetRobustness.ConcurrentMultiPipeline (Subprocess aborted) 

is the flakey test that this more rigorous test helps to expose more often (about 40% of the time).

@hjmjohnson hjmjohnson changed the title ENH: Add robustness test exposing ParallelSparseFieldLevelSet deadlock ENH: Add robustness test force test failure in ParallelSparseFieldLevelSet deadlock May 24, 2026
@hjmjohnson
Copy link
Copy Markdown
Member Author

This commit reproduces the ParallelSparseFieldLevelSetImageFilter dispatch-starvation deadlock robustly across platforms — at HEAD bf38f7e0e4 it is the only failing test in the entire matrix, tripping independently on Linux-x86, Windows, and ARM.

CI evidence (HEAD bf38f7e)

Every failing job reports 99% tests passed, 1 tests failed, and the single failure is always ParallelSparseFieldLevelSetRobustness.ConcurrentMultiPipeline:

Job Result Test outcome
Pixi-Cxx (ubuntu-22.04) fail ConcurrentMultiPipeline — Subprocess aborted (watchdog)
Pixi-Cxx (windows-2022) fail ConcurrentMultiPipeline — Failed
ARMBUILD-Ubuntu-24.04-arm fail ConcurrentMultiPipeline — Subprocess aborted (watchdog)
Pixi-Cxx (macos-15) pass (no worker-shortage path hit)
ARMBUILD-x86_64-rosetta, ARMBUILD-Python pass

"Subprocess aborted" is the test's internal watchdog calling std::abort() when the dispatch fails to complete within the deadline — i.e. the deadlock, caught deterministically rather than hanging the runner. All other ~4500 tests pass on every platform, confirming the failure is isolated to the demonstrated bug and not collateral breakage.

Why it reproduces on some platforms and not others

The deadlock requires the TBB worker-shortage path: when ParallelizeArray queues N CV-blocking work units onto an M-worker pool with M < N, the blocked units pin all workers and queued units never start. Runners with enough workers, or using the platform/pool backend, complete normally — hence macOS passing here. The fix is in the stacked PR #6321 (replaces the one starving ParallelizeArray dispatch with explicit std::threads).

@hjmjohnson hjmjohnson changed the title ENH: Add robustness test force test failure in ParallelSparseFieldLevelSet deadlock ENH: (Should fail) Add consistency to force test failure in ParallelSparseFieldLevelSet deadlock May 25, 2026
@hjmjohnson hjmjohnson changed the title ENH: (Should fail) Add consistency to force test failure in ParallelSparseFieldLevelSet deadlock ENH: Add consistency (should fail) to force test failure in ParallelSparseFieldLevelSet deadlock May 25, 2026
@hjmjohnson hjmjohnson requested a review from dzenanz May 25, 2026 14:13
@dzenanz dzenanz requested review from blowekamp and seanm and removed request for dzenanz May 25, 2026 14:34
@hjmjohnson hjmjohnson merged commit bf38f7e into InsightSoftwareConsortium:main May 27, 2026
16 of 19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:Segmentation Issues affecting the Segmentation module type:Enhancement Improvement of existing methods or implementation type:Infrastructure Infrastructure/ecosystem related changes, such as CMake or buildbots type:Testing Ensure that the purpose of a class is met/the results on a wide set of test cases are correct

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant