
fix: Remove fail-fast (-x) and guard distributed teardown against deadlock#4139

Merged
ko3n1g merged 1 commit into NVIDIA:main from ko3n1g:ko3n1g/fix/remove-fail-fast
Apr 5, 2026

Conversation


@ko3n1g ko3n1g commented Apr 3, 2026

Problem

-x (fail-fast) was set in two places — pyproject.toml addopts and run_ci_test.sh — making one redundant. More importantly, investigation showed that -x was being used as a workaround for the wrong problem.

The actual risk is that distributed fixture teardown deadlocks when a rank is hanging:

  1. Rank 0 hangs inside an NCCL collective
  2. Rank 1 fails a test → pytest proceeds to fixture teardown
  3. Teardown calls barrier() + destroy_process_group() — both require all-rank coordination
  4. Rank 0 can't participate → both ranks deadlock indefinitely

With -x, rank 1 exits fast enough that torchrun kills rank 0 before teardown runs, which avoided the symptom but not the cause.
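The four-step deadlock can be reproduced in miniature with plain Python threads — a hedged sketch in which threading.Barrier stands in for the all-rank collective that teardown waits on; none of these names come from the actual Megatron-LM test code:

```python
import threading
import time

# Two "ranks" as threads; threading.Barrier stands in for the all-rank
# coordination that barrier()/destroy_process_group() require.
teardown_barrier = threading.Barrier(parties=2)
outcome = {}

def rank0():
    # Step 1: rank 0 is stuck in a collective and never reaches teardown.
    time.sleep(2)  # simulated hang (bounded so this demo terminates)

def rank1():
    # Steps 2-3: rank 1 fails a test, pytest proceeds to fixture teardown,
    # which waits on a barrier that needs every rank to arrive.
    try:
        teardown_barrier.wait(timeout=0.5)  # no timeout -> step 4: deadlock
        outcome["rank1"] = "teardown completed"
    except threading.BrokenBarrierError:
        outcome["rank1"] = "barrier timed out"

threads = [threading.Thread(target=rank0), threading.Thread(target=rank1)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(outcome["rank1"])  # barrier timed out
```

With no timeout on the wait, rank 1 would block forever exactly as in step 4; the bounded wait is what turns an indefinite hang into a detectable failure.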

Fix

Guard the barrier before teardown with a 30s timeout in both teardown sites:

  • tests/unit_tests/conftest.py — session-level cleanup fixture
  • tests/unit_tests/test_utilities.py — Utils.destroy_model_parallel()

If the barrier times out, a rank is unresponsive and we bail without calling destroy_process_group, breaking the deadlock. The session still exits non-zero (torchrun already recorded the failure).
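One generic way to implement such a guard — a sketch, not the PR's actual code, which lives in conftest.py and Utils.destroy_model_parallel() and may use torch.distributed's own timeout machinery instead — is to run the barrier in a helper thread and only call destroy_process_group() if it returns in time:

```python
import concurrent.futures
import time

BARRIER_TIMEOUT_S = 30  # the 30s guard described above

def guarded_teardown(barrier_fn, destroy_fn, timeout=BARRIER_TIMEOUT_S):
    """Call destroy_fn only if barrier_fn returns within `timeout` seconds.

    barrier_fn stands in for torch.distributed.barrier() and destroy_fn
    for torch.distributed.destroy_process_group(); both are passed in so
    this sketch stays runnable without a distributed backend.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        pool.submit(barrier_fn).result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        # A rank never reached the barrier: skip destroy_process_group(),
        # which would otherwise block forever on the same unresponsive rank.
        return False
    finally:
        pool.shutdown(wait=False)  # don't block on a hung barrier thread
    destroy_fn()
    return True
```

With a responsive peer the barrier returns quickly and teardown proceeds; with a hung peer the function returns False after the timeout, and the session can exit non-zero without deadlocking.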

With this in place, -x is no longer needed for session safety, so it is removed from both pyproject.toml and run_ci_test.sh. This means on a real failure the full suite continues running on the non-failing ranks, giving a complete picture of what broke.

Verification

Tested with a purpose-built scenario: rank 0 stuck in dist.all_reduce(), rank 1 failing before the collective. Without this fix, the session hung indefinitely (it had to be docker kill-ed). With this fix, both variants (with and without -x) exit cleanly in ~6-7s.

Scenario2  WITH  fail-fast  →  7s  ✅
Scenario2  WITHOUT fail-fast →  6s  ✅  (previously: ∞, deadlock)

For the Python-level hang scenario (Scenario 1), removing -x produces the expected behaviour — rank 1 runs remaining tests after the failure, giving a fuller picture before torchrun kills the hanging rank:

Scenario1  WITH  fail-fast  →  13s  (stops at first failure)
Scenario1  WITHOUT fail-fast →  19s  (runs all remaining tests, +6s = 3×2s tests)

🤖 Generated with Claude Code

Fail-fast (-x) was set in two places — pyproject.toml addopts and
run_ci_test.sh — making one redundant. More importantly, our investigation
showed that -x is the wrong fix: the real risk is that distributed fixture
teardown (barrier + destroy_process_group) deadlocks when a rank is hanging,
not that pytest keeps running tests too long.

Fix the root cause instead: wrap the barrier in cleanup (conftest.py) and
destroy_model_parallel (test_utilities.py) with a 30s timeout. If the
barrier times out a rank is unresponsive and we bail without calling
destroy_process_group, breaking the deadlock. This makes -x unnecessary
for session safety.

Remove -x from pyproject.toml addopts and run_ci_test.sh so that on a
real test failure the full suite still runs on the non-failing ranks,
giving a complete picture of what broke rather than stopping at the first
failure.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>

copy-pr-bot bot commented Apr 3, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


@ko3n1g ko3n1g requested a review from skyw April 3, 2026 23:00

ko3n1g commented Apr 3, 2026

/ok to test

@svcnvidia-nemo-ci svcnvidia-nemo-ci added this to the Core 0.16 milestone Apr 3, 2026

@skyw skyw left a comment


LGTM.

A note on the try: around torch.distributed.barrier — it doesn't do much for the NCCL backend.

@ko3n1g ko3n1g marked this pull request as ready for review April 3, 2026 23:08
@ko3n1g ko3n1g requested a review from a team as a code owner April 3, 2026 23:08
@svcnvidia-nemo-ci svcnvidia-nemo-ci requested a review from a team April 3, 2026 23:08
@svcnvidia-nemo-ci svcnvidia-nemo-ci added the Approved All necessary approvals have been made label Apr 5, 2026
@ko3n1g ko3n1g added this pull request to the merge queue Apr 5, 2026
@svcnvidia-nemo-ci

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/24005023889

Merged via the queue into NVIDIA:main with commit 0b8306b Apr 5, 2026
67 of 71 checks passed
@ko3n1g ko3n1g deleted the ko3n1g/fix/remove-fail-fast branch April 5, 2026 16:54

Labels

Approved (All necessary approvals have been made), complexity: low
