
fix: Remove fail-fast (-x) and guard distributed teardown against deadlock#4139

Merged
ko3n1g merged 1 commit into NVIDIA:main from ko3n1g:ko3n1g/fix/remove-fail-fast
Apr 5, 2026

Conversation


@ko3n1g ko3n1g commented Apr 3, 2026

Problem

-x (fail-fast) was set in two places — pyproject.toml addopts and run_ci_test.sh — making one redundant. More importantly, investigation showed that -x was being used as a workaround for the wrong problem.

The actual risk is that distributed fixture teardown deadlocks when a rank is hanging:

  1. Rank 0 hangs inside an NCCL collective
  2. Rank 1 fails a test → pytest proceeds to fixture teardown
  3. Teardown calls barrier() + destroy_process_group() — both require all-rank coordination
  4. Rank 0 can't participate → both ranks deadlock indefinitely

With -x, rank 1 exits fast enough that torchrun kills rank 0 before teardown runs, which avoided the symptom but not the cause.
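The four-step deadlock can be reproduced in miniature with plain Python threads — a hedged sketch in which threading.Barrier stands in for the all-rank collective that teardown waits on; none of these names come from the actual Megatron-LM test code:

```python
import threading
import time

# Two "ranks" as threads; threading.Barrier stands in for the all-rank
# coordination that barrier()/destroy_process_group() require.
teardown_barrier = threading.Barrier(parties=2)
outcome = {}

def rank0():
    # Step 1: rank 0 is stuck in a collective and never reaches teardown.
    time.sleep(2)  # simulated hang (bounded so this demo terminates)

def rank1():
    # Steps 2-3: rank 1 fails a test, pytest proceeds to fixture teardown,
    # which waits on a barrier that needs every rank to arrive.
    try:
        teardown_barrier.wait(timeout=0.5)  # no timeout -> step 4: deadlock
        outcome["rank1"] = "teardown completed"
    except threading.BrokenBarrierError:
        outcome["rank1"] = "barrier timed out"

threads = [threading.Thread(target=rank0), threading.Thread(target=rank1)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(outcome["rank1"])  # barrier timed out
```

With no timeout on the wait, rank 1 would block forever exactly as in step 4; the bounded wait is what turns an indefinite hang into a detectable failure.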

Fix

Guard the barrier before teardown with a 30s timeout in both teardown sites:

  • tests/unit_tests/conftest.py — session-level cleanup fixture
  • tests/unit_tests/test_utilities.py — Utils.destroy_model_parallel()

If the barrier times out, a rank is unresponsive and we bail without calling destroy_process_group, breaking the deadlock. The session still exits non-zero (torchrun already recorded the failure).
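One generic way to implement such a guard — a sketch, not the PR's actual code, which lives in conftest.py and Utils.destroy_model_parallel() and may use torch.distributed's own timeout machinery instead — is to run the barrier in a helper thread and only call destroy_process_group() if it returns in time:

```python
import concurrent.futures
import time

BARRIER_TIMEOUT_S = 30  # the 30s guard described above

def guarded_teardown(barrier_fn, destroy_fn, timeout=BARRIER_TIMEOUT_S):
    """Call destroy_fn only if barrier_fn returns within `timeout` seconds.

    barrier_fn stands in for torch.distributed.barrier() and destroy_fn
    for torch.distributed.destroy_process_group(); both are passed in so
    this sketch stays runnable without a distributed backend.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        pool.submit(barrier_fn).result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        # A rank never reached the barrier: skip destroy_process_group(),
        # which would otherwise block forever on the same unresponsive rank.
        return False
    finally:
        pool.shutdown(wait=False)  # don't block on a hung barrier thread
    destroy_fn()
    return True
```

With a responsive peer the barrier returns quickly and teardown proceeds; with a hung peer the function returns False after the timeout, and the session can exit non-zero without deadlocking.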

With this in place, -x is no longer needed for session safety, so it is removed from both pyproject.toml and run_ci_test.sh. This means on a real failure the full suite continues running on the non-failing ranks, giving a complete picture of what broke.

Verification

Tested with a purpose-built scenario: rank 0 stuck in dist.all_reduce(), rank 1 failing before the collective. Without this fix, the session hung indefinitely (it had to be docker kill-ed). With this fix, both variants (with and without -x) exit cleanly in ~6-7s.

Scenario2  WITH  fail-fast  →  7s  ✅
Scenario2  WITHOUT fail-fast →  6s  ✅  (previously: ∞, deadlock)

For the Python-level hang scenario (Scenario 1), removing -x produces the expected behaviour — rank 1 runs remaining tests after the failure, giving a fuller picture before torchrun kills the hanging rank:

Scenario1  WITH  fail-fast  →  13s  (stops at first failure)
Scenario1  WITHOUT fail-fast →  19s  (runs all remaining tests, +6s = 3×2s tests)

🤖 Generated with Claude Code

Fail-fast (-x) was set in two places — pyproject.toml addopts and
run_ci_test.sh — making one redundant. More importantly, our investigation
showed that -x is the wrong fix: the real risk is that distributed fixture
teardown (barrier + destroy_process_group) deadlocks when a rank is hanging,
not that pytest keeps running tests too long.

Fix the root cause instead: wrap the barrier in cleanup (conftest.py) and
destroy_model_parallel (test_utilities.py) with a 30s timeout. If the
barrier times out a rank is unresponsive and we bail without calling
destroy_process_group, breaking the deadlock. This makes -x unnecessary
for session safety.

Remove -x from pyproject.toml addopts and run_ci_test.sh so that on a
real test failure the full suite still runs on the non-failing ranks,
giving a complete picture of what broke rather than stopping at the first
failure.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>

copy-pr-bot bot commented Apr 3, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


@ko3n1g ko3n1g requested a review from skyw April 3, 2026 23:00

ko3n1g commented Apr 3, 2026

/ok to test

@svcnvidia-nemo-ci svcnvidia-nemo-ci added this to the Core 0.16 milestone Apr 3, 2026

@skyw skyw left a comment


LGTM.

A note on the try: around torch.distributed.barrier — it doesn't do much for the NCCL backend.

@ko3n1g ko3n1g marked this pull request as ready for review April 3, 2026 23:08
@ko3n1g ko3n1g requested a review from a team as a code owner April 3, 2026 23:08
@svcnvidia-nemo-ci svcnvidia-nemo-ci requested a review from a team April 3, 2026 23:08
@svcnvidia-nemo-ci svcnvidia-nemo-ci added the Approved All necessary approvals have been made label Apr 5, 2026
@ko3n1g ko3n1g added this pull request to the merge queue Apr 5, 2026
@svcnvidia-nemo-ci

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/24005023889

Merged via the queue into NVIDIA:main with commit 0b8306b Apr 5, 2026
67 of 71 checks passed
@ko3n1g ko3n1g deleted the ko3n1g/fix/remove-fail-fast branch April 5, 2026 16:54

Labels

Approved (All necessary approvals have been made), complexity: low
