-
Notifications
You must be signed in to change notification settings - Fork 3.6k
fix: port already in use & nccl errors in gpu tests #21335
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #21335 +/- ##
=======================================
- Coverage 87% 87% -0%
=======================================
Files 270 269 -1
Lines 23799 23708 -91
=======================================
- Hits 20629 20536 -93
- Misses 3170 3172 +2 |
…and retr…" This reverts commit 6a8d943.
|
If I understand the fix correctly, we are now rely on the test creator to explicitly set the STANDALONE_PORT env variable to avoid the port conflict? How can it prevent the port conflict if multiple test with the same range got launched at the same time and run into conflict ? Maybe I missed something? @deependujha |
|
Hi @littlebullGit, We aren't using standalone shell script doesn't launches tests at once, but, it does it one by one, stores PID, and then keeps checking if they passed, and then launches next batch of tests. pytest tests/.../test_file.py::test_nameSince the shell script itself is a single process, it can reliably use a list to store used ports and assign new for next test launch. Hence, we can reliably say that multiple tests won't be launched with the same port. |
Now it makes sense. Since the shell is guarantee to be the only process on the server which issue ports, it effectively does the same as the file lock approach I did for the port manager. |
|
yeah, and it was really cool. The only issue with it was, this was not a user-facing bug, but a simple test launching issue, and shell script specifying port is a decent fix imo. Thanks again for your time and great effort. |
What does this PR do?
Reverts #21309
closes #21313 #21311
thanks, pr: #21341
Before submitting
PR review
Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:
Reviewer checklist
📚 Documentation preview 📚: https://pytorch-lightning--21335.org.readthedocs.build/en/21335/