
Conversation

@littlebullGit (Contributor) commented Oct 23, 2025

Pull Request Description

What does this PR do?

Fixes #21308

This PR eliminates EADDRINUSE (address already in use) errors in distributed tests by implementing a two-layer port management strategy.

Problem

Distributed tests were experiencing frequent EADDRINUSE errors in CI because:

  • OS keeps released ports in TIME_WAIT state for 30-120 seconds
  • Tests immediately tried to reuse the same ports → collision
  • Spawned processes set MASTER_PORT without tracking
  • No mechanism to prevent rapid port reallocation

Solution

Layer 1: Proactive Prevention (Deque-based tracking)

New PortManager class (src/lightning/fabric/utilities/port_manager.py):

  • 1024-slot queue to track recently released ports
  • Prevents reallocation until ports cycle out
  • Thread-safe singleton for process-wide coordination
  • Tracks externally assigned ports via reserve_existing_port()
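
A minimal sketch of the idea (names mirror this description; the actual port_manager.py differs in detail):

```python
import socket
import threading
from collections import deque


class PortManager:
    """Process-wide port allocator that refuses to hand out recently released ports."""

    _instance = None
    _instance_lock = threading.Lock()

    def __init__(self, queue_size: int = 1024) -> None:
        self._lock = threading.Lock()
        self._allocated: set = set()
        # Bounded queue: a released port cannot be reused until it cycles out.
        self._recently_released: deque = deque(maxlen=queue_size)

    @classmethod
    def instance(cls) -> "PortManager":
        # Thread-safe singleton for process-wide coordination.
        with cls._instance_lock:
            if cls._instance is None:
                cls._instance = cls()
            return cls._instance

    def allocate_port(self) -> int:
        with self._lock:
            while True:
                with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
                    s.bind(("", 0))  # let the OS pick a free port
                    port = s.getsockname()[1]
                if port not in self._allocated and port not in self._recently_released:
                    self._allocated.add(port)
                    return port

    def reserve_existing_port(self, port: int) -> None:
        # Track a port assigned externally, e.g. a pre-set MASTER_PORT.
        with self._lock:
            self._allocated.add(port)

    def release_port(self, port: int) -> None:
        with self._lock:
            self._allocated.discard(port)
            self._recently_released.append(port)
```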

Environment integration (src/lightning/fabric/plugins/environments/lightning.py):

  • Check and reserve pre-existing MASTER_PORT on startup
  • Properly release and clean up ports in teardown()
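
Roughly, the environment-side wiring could look like this (a sketch, not the exact diff; it assumes the PortManager sketch above):

```python
import os


class LightningEnvironment:
    def __init__(self) -> None:
        self._main_port = -1
        # Respect a pre-set MASTER_PORT, but register it so the manager
        # never hands the same port to another test in this process.
        if "MASTER_PORT" in os.environ:
            self._main_port = int(os.environ["MASTER_PORT"])
            PortManager.instance().reserve_existing_port(self._main_port)

    @property
    def main_port(self) -> int:
        if self._main_port == -1:
            self._main_port = PortManager.instance().allocate_port()
        return self._main_port

    def teardown(self) -> None:
        # Return the port to the manager: it lands in the recently-released
        # queue and stays quarantined while the OS holds it in TIME_WAIT.
        if self._main_port != -1:
            PortManager.instance().release_port(self._main_port)
            self._main_port = -1
        os.environ.pop("MASTER_PORT", None)
```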

Test fixtures (tests/*/conftest.py):

  • Capture and track MASTER_PORT set by spawned child processes
  • Clean up environment variables between tests
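
One way such a fixture could be shaped (a sketch; the real conftest fixtures differ):

```python
import os

import pytest


@pytest.fixture(autouse=True)
def reset_master_port():
    """Track MASTER_PORT values set by spawned children and reset the env between tests."""
    before = os.environ.get("MASTER_PORT")
    yield
    after = os.environ.get("MASTER_PORT")
    if after is not None and after != before:
        # A spawned process picked this port; register it so the manager does
        # not hand it to the next test while it may still be in TIME_WAIT.
        PortManager.instance().reserve_existing_port(int(after))
    os.environ.pop("MASTER_PORT", None)
```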

Layer 2: Reactive Recovery (Retry logic)

Automatic retry on EADDRINUSE (tests/*/conftest.py):

  • pytest_runtest_makereport() hook detects EADDRINUSE errors
  • Automatically retries up to 3 times with 0.5s delay
  • Clears port manager state between retries
  • Logs retry attempts for debugging
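
The PR wires the retry through pytest_runtest_makereport(); the sketch below shows the same idea one level up, in pytest_runtest_protocol(), which is the simpler place to re-run a test from a conftest (all names besides the pytest hooks are illustrative):

```python
import os
import time

import pytest
from _pytest.runner import runtestprotocol

MAX_RETRIES = 3
RETRY_DELAY = 0.5  # seconds; a follow-up commit later extended this to 1.0s


def _hit_eaddrinuse(report) -> bool:
    return report.failed and "EADDRINUSE" in (report.longreprtext or "")


@pytest.hookimpl(tryfirst=True)
def pytest_runtest_protocol(item, nextitem):
    item.ihook.pytest_runtest_logstart(nodeid=item.nodeid, location=item.location)
    for attempt in range(MAX_RETRIES + 1):
        reports = runtestprotocol(item, nextitem=nextitem, log=False)
        if not any(_hit_eaddrinuse(r) for r in reports) or attempt == MAX_RETRIES:
            break
        print(f"EADDRINUSE in {item.nodeid}, retrying ({attempt + 1}/{MAX_RETRIES})")
        # Drop the cached port so the environment allocates a fresh one,
        # then give the OS a moment to close lingering sockets.
        os.environ.pop("MASTER_PORT", None)
        time.sleep(RETRY_DELAY)
    for report in reports:
        item.ihook.pytest_runtest_logreport(report=report)
    item.ihook.pytest_runtest_logfinish(nodeid=item.nodeid, location=item.location)
    return True  # the protocol has been handled; pytest should not re-run it
```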

Why Both Layers?

Deque alone: would fail if more than 1024 ports were allocated within the TIME_WAIT window.
Retry alone: would be slow and would mask the underlying problem.

Together: the deque prevents ~99% of conflicts, and the retry handles the remaining edge cases that slip through.

Changes

Core Implementation

  • ✅ New: src/lightning/fabric/utilities/port_manager.py (212 lines, 100% test coverage)
  • ✅ Modified: src/lightning/fabric/plugins/environments/lightning.py (port reservation/cleanup)
  • ✅ Modified: src/lightning/fabric/utilities/__init__.py (export port manager)

Test Infrastructure

  • ✅ New: tests/tests_fabric/utilities/test_port_manager.py (45 tests)
  • ✅ Modified: tests/tests_fabric/conftest.py (retry logic)
  • ✅ Modified: tests/tests_pytorch/conftest.py (retry logic)

Test Results

tests/tests_fabric/utilities/test_port_manager.py: 45 tests PASSED
tests/tests_fabric/plugins/environments/test_lightning.py: 10 tests PASSED

Does your PR introduce any breaking changes?

No. All changes are internal to the test infrastructure; there are no user-facing API changes.

Before submitting

  • Was this discussed/approved via a GitHub issue? (yes, EADDRINUSE errors in distributed tests #21308)
  • Did you write any new necessary tests?
  • Did you make sure all tests pass locally?
  • Did you update the documentation (if necessary)?
    • Not necessary - internal test infrastructure only

📚 Documentation preview 📚: https://pytorch-lightning--21309.org.readthedocs.build/en/21309/

Implemented a thread-safe port reservation system to prevent EADDRINUSE
errors in distributed training tests.

- Created PortManager class with mutex-protected port allocation
- Updated find_free_network_port() to use PortManager
- Enhanced test teardown to release ports after completion
- Added 24 comprehensive tests (17 unit + 7 integration)
- Added context manager for automatic port cleanup

Fixes port collision issues in:
- tests_fabric/strategies/test_ddp_integration.py::test_clip_gradients
- tests_pytorch/strategies/test_fsdp.py::test_checkpoint_multi_gpus
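
The context manager from this commit might look roughly like this (a sketch; reserved_port is a hypothetical name, and PortManager follows the sketch in the PR description above):

```python
from contextlib import contextmanager


def find_free_network_port() -> int:
    # Route all port discovery through the manager instead of raw sockets,
    # so every caller benefits from the recently-released quarantine.
    return PortManager.instance().allocate_port()


@contextmanager
def reserved_port():
    """Allocate a tracked port and guarantee it is released afterwards."""
    port = find_free_network_port()
    try:
        yield port
    finally:
        PortManager.instance().release_port(port)


# Usage:
#   with reserved_port() as port:
#       os.environ["MASTER_PORT"] = str(port)
```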
github-actions bot added labels: fabric (lightning.fabric.Fabric), pl (Generic label for PyTorch Lightning package) on Oct 23, 2025
codecov bot commented Oct 23, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 87%. Comparing base (f58a176) to head (fa8c278).
⚠️ Report is 1 commit behind head on master.

Additional details and impacted files
```
@@           Coverage Diff            @@
##           master   #21309    +/-   ##
========================================
- Coverage      87%      87%    -0%
========================================
  Files         269      270     +1
  Lines       23707    23801    +94
========================================
- Hits        20679    20643    -36
- Misses       3028     3158   +130
```

- Add logging when port queue utilization >80% to detect exhaustion
- Enhance RuntimeError with detailed diagnostics (queue utilization %)
- Add DistNetworkError type check in pytorch conftest for subprocess failures
- Add test coverage for high queue utilization warning
- Helps diagnose EADDRINUSE issues in CI distributed tests
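
A sketch of what such a utilization check could look like inside release_port() (illustrative names and threshold handling; only the 80% figure comes from the commit above):

```python
import logging
from collections import deque

log = logging.getLogger(__name__)

HIGH_UTILIZATION = 0.8  # warn once the queue is more than 80% full


def check_queue_utilization(recently_released: deque) -> None:
    """Called on each release; warns when the quarantine queue nears capacity."""
    if not recently_released.maxlen:
        return
    utilization = len(recently_released) / recently_released.maxlen
    if utilization > HIGH_UTILIZATION:
        log.warning(
            "Port quarantine queue is %d%% full (%d/%d); exhaustion and "
            "EADDRINUSE errors may follow.",
            int(utilization * 100),
            len(recently_released),
            recently_released.maxlen,
        )
```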
@Borda (Collaborator) commented Oct 23, 2025

@deependujha let's merge this one as it is to unblock master, and prepare another follow-up PR with your suggestions 🦩

@deependujha (Collaborator) commented Oct 23, 2025

Interesting, it seems more work is required. Also, copying logs from the litbot UI doesn't work properly.

Only copies what's visible on screen, not the selection.

@Borda (Collaborator) commented Oct 23, 2025

> Only copies what's visible on screen, not the selection.

```
E           torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address. port: 50379, useIpv6: false, code: -98, name: EADDRINUSE, message: address already in use

../.venv/lib/python3.12/site-packages/torch/distributed/rendezvous.py:198: DistNetworkError
```

- Drop cached MASTER_PORT before rerun so LightningEnvironment re-allocates
  a new port instead of reusing the TIME_WAIT socket
- Extend backoff to 1.0s to give the OS time to close TCPStore sockets
- Prevents NCCL connect errors caused by retries hitting the same port
@deependujha deependujha merged commit 6a8d943 into Lightning-AI:master Oct 23, 2025
112 checks passed
@deependujha (Collaborator) commented Oct 23, 2025

@Borda try selecting a longer piece of text. In my case, it copies only a small portion of the selected text.

@deependujha (Collaborator) commented
thanks for the great work @littlebullGit 💜⚡️

@deependujha (Collaborator) commented
Hi @littlebullGit, we're still sometimes getting the error: https://lightning.ai/lightning-ai/ci/jobs/ci-run-lightning-ai-pytorch-lightning-213433f-pytorch-yml-lit-job-lightning-3-12-c7ac01cf?app_id=jobs&job_detail_tab=logs

We run GPU tests in batches of 5. The current implementation of PortManager is thread-safe; I believe we need to make it process-safe.

Maybe it should track used ports via a file (or similar) to coordinate across processes?
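
(For illustration, such file-based coordination could look roughly like this; entirely hypothetical, POSIX-only, using fcntl advisory locks:)

```python
import fcntl
import json
from pathlib import Path

STATE_FILE = Path("/tmp/lightning_test_ports.json")  # hypothetical location


def reserve_port_across_processes(port: int) -> bool:
    """Atomically record a port in a shared file; returns False if it is already taken."""
    STATE_FILE.touch(exist_ok=True)
    with open(STATE_FILE, "r+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # exclusive advisory lock across processes
        try:
            raw = f.read()
            used = set(json.loads(raw)) if raw else set()
            if port in used:
                return False
            used.add(port)
            f.seek(0)
            f.truncate()
            json.dump(sorted(used), f)
            return True
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
```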

@littlebullGit (Contributor, Author) commented
> Hi @littlebullGit, we're still sometimes getting the error: https://lightning.ai/lightning-ai/ci/jobs/ci-run-lightning-ai-pytorch-lightning-213433f-pytorch-yml-lit-job-lightning-3-12-c7ac01cf?app_id=jobs&job_detail_tab=logs
>
> We run GPU tests in batches of 5. The current implementation of PortManager is thread-safe; I believe we need to make it process-safe.
>
> Maybe it should track used ports via a file (or similar) to coordinate across processes?

Saw your PR. Will take a look and see if I can help.

@littlebullGit (Contributor, Author) commented

#21313
@deependujha please see the PR above.
