fix: port already in use & nccl errors in gpu tests #21335

deependujha · 2025-11-05T14:20:52Z

What does this PR do?

thanks, pr: #21341

This revert removes the port management logic and switches to a simpler approach for addressing port-related issues, as this is specific to standalone test environments.
The initial step for this change was introduced in Lightning-AI/utilities (reference: "feat: specify standalone port", utilities#447).

Before submitting

Was this discussed/agreed via a GitHub issue? (not for typos and docs)
Did you read the contributor guideline, Pull Request section?
Did you make sure your PR does only one thing, instead of bundling different changes together?
Did you make sure to update the documentation with your changes? (if necessary)
Did you write any new necessary tests? (not for typos and docs)
Did you verify new and existing tests pass locally with your changes?
Did you list all the breaking changes introduced by this pull request?
Did you update the CHANGELOG? (not for typos, docs, test updates, or minor internal changes/refactors)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:

Reviewer checklist

Is this pull request ready for review? (if not, please submit in draft mode)
Check that all items from Before submitting are resolved
Make sure the title is self-explanatory and the description concisely explains the PR
Add labels and milestones (and optionally projects) to the PR so it can be classified

📚 Documentation preview 📚: https://pytorch-lightning--21335.org.readthedocs.build/en/21335/

codecov · 2025-11-05T14:28:13Z

Codecov Report

❌ Patch coverage is 77.77778% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 87%. Comparing base (19912d0) to head (4927768).
⚠️ Report is 1 commits behind head on master.

Additional details and impacted files

@@           Coverage Diff           @@
##           master   #21335   +/-   ##
=======================================
- Coverage      87%      87%   -0%     
=======================================
  Files         270      269    -1     
  Lines       23799    23708   -91     
=======================================
- Hits        20629    20536   -93     
- Misses       3170     3172    +2

littlebullGit · 2025-11-07T03:22:53Z

If I understand the fix correctly, we now rely on the test creator to explicitly set the STANDALONE_PORT env variable to avoid port conflicts? How does that prevent a conflict if multiple tests with the same port range are launched at the same time? Maybe I missed something? @deependujha

deependujha · 2025-11-07T03:50:54Z

Hi @littlebullGit,

We aren't using pytest-xdist to launch tests in parallel, but rather a standalone shell script.

The standalone shell script doesn't launch all tests at once. It launches them one by one, stores each PID, keeps checking whether they have passed, and then launches the next batch of tests.

pytest tests/.../test_file.py::test_name

Since the shell script itself is a single process, it can reliably use a list to store the ports already in use and assign a new one for the next test launch.

Hence, we can reliably say that multiple tests won't be launched with the same port.
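
The single-process allocation described above can be sketched roughly as follows. This is only an illustration, not the actual standalone script: the function names `assign_ports` and `build_env`, the base port 12000, and the test IDs in the usage note are hypothetical. Only the `STANDALONE_PORT` variable name comes from this discussion.

```python
import os
from typing import Dict, Iterable, List


def assign_ports(test_ids: Iterable[str], base_port: int = 12000) -> Dict[str, int]:
    """Give every test its own port, tracked in one in-process list."""
    used_ports: List[int] = []
    assignments: Dict[str, int] = {}
    next_port = base_port  # hypothetical starting port
    for test_id in test_ids:
        # Only this process hands out ports, so a plain list is enough
        # to guarantee that no port is assigned twice.
        while next_port in used_ports:
            next_port += 1
        used_ports.append(next_port)
        assignments[test_id] = next_port
    return assignments


def build_env(port: int) -> Dict[str, str]:
    """Copy the current environment and export the assigned port (hypothetical helper)."""
    env = dict(os.environ)
    env["STANDALONE_PORT"] = str(port)
    return env
```

A launcher could then start each test with something like `subprocess.Popen(["pytest", test_id], env=build_env(port))`. Note that this sketch only guarantees uniqueness among the ports it hands out itself; it does not check whether some unrelated process already holds a port.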

littlebullGit · 2025-11-07T03:59:36Z

Now it makes sense. Since the shell script is guaranteed to be the only process on the server that issues ports, it effectively does the same job as the file-lock approach I used for the port manager.

deependujha · 2025-11-07T04:01:49Z

Yeah, and it was really cool.

The only issue was that this was not a user-facing bug but a simple test-launching issue, and having the shell script specify the port is a decent fix, imo.

Thanks again for your time and great effort.

update

1fb08b5

deependujha requested review from Borda, ethanwharris, justusschock, lantiga and tchaton as code owners November 5, 2025 14:20

github-actions bot added the fabric lightning.fabric.Fabric label Nov 5, 2025

bhimrazy reviewed Nov 5, 2025

.lightning/workflows/pytorch.yml (outdated)

update with correct link

f4608dc

deependujha commented Nov 5, 2025

.lightning/workflows/fabric.yml (outdated)

Apply suggestion from @deependujha

156b8ce

bhimrazy reviewed Nov 5, 2025

src/lightning/fabric/plugins/environments/lightning.py (outdated)

bhimrazy mentioned this pull request Nov 5, 2025

[wip]: debug port issue #21328

Closed

7 tasks

Revert "Fix EADDRINUSE errors in distributed tests with port manager …

4efe570

…and retr…" This reverts commit 6a8d943.

bhimrazy mentioned this pull request Nov 6, 2025

Revert "Fix EADDRINUSE errors in distributed tests with port manager and retry logic" #21341

Closed

Merge branch 'revert-21309-fix/port-manager-eaddrinuse' into fix/port…

9ac7f92

…-issue

github-actions bot added the pl Generic label for PyTorch Lightning package label Nov 6, 2025

claymore

4927768

deependujha changed the title ~~[wip]: port issue~~ fix: port already in use & nccl errors in gpu tests Nov 6, 2025

bhimrazy approved these changes Nov 6, 2025

deependujha mentioned this pull request Nov 6, 2025

feat(fabric): introduce process-safe port management #21313

Closed

justusschock approved these changes Nov 6, 2025

deependujha merged commit c913649 into Lightning-AI:master Nov 6, 2025
118 of 119 checks passed

deependujha deleted the fix/port-issue branch November 6, 2025 08:15

deependujha mentioned this pull request Nov 6, 2025

fix: update port manager to use multiprocessing lock for process safety #21311

Closed

7 tasks

4 participants
