Commit fa8c278

Ensure EADDRINUSE retries pick a fresh port
- Drop cached MASTER_PORT before rerun so LightningEnvironment re-allocates a new port instead of reusing the TIME_WAIT socket
- Extend backoff to 1.0s to give the OS time to close TCPStore sockets
- Prevents NCCL connect errors caused by retries hitting the same port
1 parent df45d53 commit fa8c278
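
The first bullet of the commit message relies on the cluster environment falling back to a newly allocated port when `MASTER_PORT` is absent. The sketch below illustrates that idea only. `find_free_port` and `resolve_master_port` are hypothetical helpers written for this example, not Lightning's actual implementation, and a plain dict stands in for `os.environ`.

```python
import socket


def find_free_port() -> int:
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))
        return sock.getsockname()[1]


def resolve_master_port(env: dict) -> int:
    """Hypothetical stand-in for a cluster environment's port lookup.

    Assumption: a cached MASTER_PORT is reused if present; otherwise a
    free port is allocated and cached for later lookups.
    """
    if "MASTER_PORT" not in env:
        env["MASTER_PORT"] = str(find_free_port())
    return int(env["MASTER_PORT"])


# A stale value left over from the failed attempt is reused as-is.
env = {"MASTER_PORT": "29500"}
stale = resolve_master_port(env)

# Dropping the cached value (as the commit does with os.environ.pop)
# forces the next lookup to allocate a port again.
env.pop("MASTER_PORT", None)
fresh = resolve_master_port(env)
```

Using `pop` with a default of `None` means the cleanup is safe even when the variable was never set.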

2 files changed: +12 -2 lines changed


tests/tests_fabric/conftest.py

Lines changed: 6 additions & 1 deletion

@@ -261,11 +261,16 @@ def pytest_runtest_makereport(item: pytest.Item, call: pytest.CallInfo) -> None:
 manager = get_port_manager()
 manager.release_all()

+# Clear MASTER_PORT so cluster environment allocates a fresh port on retry
+import os
+
+os.environ.pop("MASTER_PORT", None)
+
 # Re-run the test by raising Rerun exception
 # Note: This requires pytest-rerunfailures plugin
 import time

-time.sleep(0.5) # Brief delay to let ports settle
+time.sleep(1.0) # Wait for OS to release ports from TIME_WAIT state

 # If pytest-rerunfailures is available, use it
 try:

tests/tests_pytorch/conftest.py

Lines changed: 6 additions & 1 deletion

@@ -389,11 +389,16 @@ def pytest_runtest_makereport(item: pytest.Item, call: pytest.CallInfo) -> None:
 manager = get_port_manager()
 manager.release_all()

+# Clear MASTER_PORT so cluster environment allocates a fresh port on retry
+import os
+
+os.environ.pop("MASTER_PORT", None)
+
 # Re-run the test by raising Rerun exception
 # Note: This requires pytest-rerunfailures plugin
 import time

-time.sleep(0.5) # Brief delay to let ports settle
+time.sleep(1.0) # Wait for OS to release ports from TIME_WAIT state

 # If pytest-rerunfailures is available, use it
 try:
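
For context on the error these hunks work around, the following sketch shows how `EADDRINUSE` surfaces when a second socket binds to a port that is still held. It is an illustration under simplifying assumptions: it uses a live listening socket rather than a socket lingering in TIME_WAIT (which is harder to reproduce deterministically), and `try_bind` is a hypothetical helper, not part of the test suite.

```python
import errno
import socket


def try_bind(port: int):
    """Return None if binding succeeds, else the errno of the failure."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind(("127.0.0.1", port))
        return None
    except OSError as exc:
        return exc.errno
    finally:
        sock.close()


# Hold a port the way a still-open store socket might.
holder = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
holder.bind(("127.0.0.1", 0))
holder.listen(1)
port = holder.getsockname()[1]

# A retry that targets the same port fails while the port is held.
busy_errno = try_bind(port)
holder.close()
```

Retrying on a different port, as the `MASTER_PORT` cleanup above arranges, avoids depending on when the operating system finally releases the old one.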
