Fix EADDRINUSE errors in distributed tests with port manager and retry logic #21309
Implemented a thread-safe port reservation system to prevent EADDRINUSE errors in distributed training tests.
- Created PortManager class with mutex-protected port allocation
- Updated find_free_network_port() to use PortManager
- Enhanced test teardown to release ports after completion
- Added 24 comprehensive tests (17 unit + 7 integration)
- Added context manager for automatic port cleanup

Fixes port collision issues in:
- tests_fabric/strategies/test_ddp_integration.py::test_clip_gradients
- tests_pytorch/strategies/test_fsdp.py::test_checkpoint_multi_gpus
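For readers skimming the thread, here is a rough sketch of what a mutex-protected port reservation with a cleanup context manager might look like. It is not the code from this PR: `SketchPortManager`, `ask_os_for_free_port`, and every method name below are hypothetical, and the injectable `find_port` callable is an assumption added only to make the sketch easy to exercise.

```python
import socket
import threading
from contextlib import contextmanager


def ask_os_for_free_port():
    """Bind to port 0 so the OS picks a free port, then close the socket."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))
        return sock.getsockname()[1]


class SketchPortManager:
    """Hypothetical stand-in for the PR's PortManager (names are illustrative)."""

    def __init__(self, find_port=ask_os_for_free_port):
        self._find_port = find_port      # assumption: injectable for testing
        self._lock = threading.Lock()    # guards the reserved set
        self._reserved = set()

    def allocate_port(self, max_attempts=100):
        """Return a port that this manager has not already handed out."""
        with self._lock:
            for _ in range(max_attempts):
                port = self._find_port()
                if port not in self._reserved:
                    self._reserved.add(port)
                    return port
        raise RuntimeError(f"No unreserved port found after {max_attempts} attempts")

    def release_port(self, port):
        """Forget a reservation; releasing an unknown port is a no-op."""
        with self._lock:
            self._reserved.discard(port)

    def is_reserved(self, port):
        with self._lock:
            return port in self._reserved

    @contextmanager
    def reserved_port(self):
        """Reserve a port for the duration of a `with` block."""
        port = self.allocate_port()
        try:
            yield port
        finally:
            self.release_port(port)
```

The lock only prevents two threads in the same process from receiving the same port; another process can still grab the port between the probe socket closing and the real bind, which is presumably why the PR pairs this with retry logic.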
Codecov Report
✅ All modified and coverable lines are covered by tests.

| | master | #21309 | +/- |
|---|---|---|---|
| Coverage | 87% | 87% | -0% |
| Files | 269 | 270 | +1 |
| Lines | 23707 | 23801 | +94 |
| Hits | 20679 | 20643 | -36 |
| Misses | 3028 | 3158 | +130 |
- Add logging when port queue utilization >80% to detect exhaustion
- Enhance RuntimeError with detailed diagnostics (queue utilization %)
- Add DistNetworkError type check in pytorch conftest for subprocess failures
- Add test coverage for high queue utilization warning
- Helps diagnose EADDRINUSE issues in CI distributed tests
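A minimal sketch of the utilization warning described in the commit above, assuming the tracking queue is a bounded `collections.deque`. The function name `queue_utilization_percent`, the logger, and the message wording are hypothetical and not taken from the PR.

```python
import logging
from collections import deque

log = logging.getLogger(__name__)


def queue_utilization_percent(queue, warn_above=80.0):
    """Return how full a bounded deque is, in percent, warning past a threshold."""
    capacity = queue.maxlen
    if not capacity:
        raise ValueError("queue must be a deque with a positive maxlen")
    percent = 100.0 * len(queue) / capacity
    if percent > warn_above:
        # Assumed wording; the PR's actual log message is not shown in this thread.
        log.warning(
            "Port queue utilization is %.1f%% (%d/%d); ports may be exhausted soon",
            percent, len(queue), capacity,
        )
    return percent
```

The same percentage could be folded into the `RuntimeError` message raised when no port can be allocated, which matches the "detailed diagnostics" bullet.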
@deependujha let's merge this one as it is to unblock master and prepare another/follow-up PR with your suggestions 🦩
Interesting. It seems more work is required. Also, copying logs from the litbot UI doesn't work properly: it only copies what's visible on screen, not the selection.
- Drop cached MASTER_PORT before rerun so LightningEnvironment re-allocates a new port instead of reusing the TIME_WAIT socket
- Extend backoff to 1.0s to give the OS time to close TCPStore sockets
- Prevents NCCL connect errors caused by retries hitting the same port
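The retry preparation in this commit could be sketched roughly as follows. This is an illustration under assumptions, not the conftest code from the PR: `is_address_in_use` and `prepare_rerun` are hypothetical helpers, and the string matching is a guess at how a subprocess failure might be recognized (the PR mentions a `DistNetworkError` type check, which is omitted here to avoid importing torch).

```python
import errno
import os
import time


def is_address_in_use(exc):
    """Best-effort check for an 'address already in use' failure."""
    if isinstance(exc, OSError) and exc.errno == errno.EADDRINUSE:
        return True
    # Assumption: wrapped or subprocess errors only expose the message text.
    message = str(exc)
    return "EADDRINUSE" in message or "address already in use" in message.lower()


def prepare_rerun(environ=os.environ, backoff_seconds=1.0, sleep=time.sleep):
    """Drop the cached MASTER_PORT and wait before the test is retried."""
    # Without this, the rerun would reuse the port still stuck in TIME_WAIT.
    environ.pop("MASTER_PORT", None)
    # Give the OS time to close the previous run's sockets.
    sleep(backoff_seconds)
```

Passing `environ` and `sleep` as parameters is purely a testing convenience in this sketch; a real hook would most likely act on `os.environ` directly.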
@Borda try selecting a longer piece of text. In my case, it copies only a small portion of the selected text.
thanks for the great work @littlebullGit 💜⚡️
Hi @littlebullGit, we're still sometimes getting the error: https://lightning.ai/lightning-ai/ci/jobs/ci-run-lightning-ai-pytorch-lightning-213433f-pytorch-yml-lit-job-lightning-3-12-c7ac01cf?app_id=jobs&job_detail_tab=logs We run GPU tests in a batch of 5. The current implementation of Maybe it should
Saw your PR. Will take a look and see if I can help.
#21313
Pull Request Description
What does this PR do?
Fixes #21308
This PR eliminates EADDRINUSE (address already in use) errors in distributed tests by implementing a two-layer port management strategy.
Problem
Distributed tests were experiencing frequent EADDRINUSE errors in CI because:
- [...] MASTER_PORT without tracking
Solution
Layer 1: Proactive Prevention (Deque-based tracking)
New PortManager class (src/lightning/fabric/utilities/port_manager.py):
- [...] reserve_existing_port()
Environment integration (src/lightning/fabric/plugins/environments/lightning.py):
- [...] MASTER_PORT on startup
- [...] teardown()
Test fixtures (tests/*/conftest.py):
- [...] MASTER_PORT set by spawned child processes
Layer 2: Reactive Recovery (Retry logic)
Automatic retry on EADDRINUSE (tests/*/conftest.py):
- pytest_runtest_makereport() hook detects EADDRINUSE errors
Why Both Layers?
Deque alone: Would fail if >1024 ports allocated during TIME_WAIT window
Retry alone: Would be slow and mask the underlying problem
Together: Deque prevents 99% of conflicts, retry handles the 1% edge case
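To make the deque trade-off concrete, here is a small sketch of tracking recently released ports in a bounded queue. `RecentPortTracker` and its methods are hypothetical names, and the default size of 1024 is inferred from the ">1024 ports" remark above rather than read from the PR's code.

```python
from collections import deque


class RecentPortTracker:
    """Remember the most recently released ports so they are not reused at once."""

    def __init__(self, maxlen=1024):  # assumed size, based on the text above
        self._recent = deque(maxlen=maxlen)

    def mark_released(self, port):
        # When the deque is full, appending silently evicts the oldest entry.
        self._recent.append(port)

    def is_recently_released(self, port):
        return port in self._recent

    def pick(self, candidates):
        """Return the first candidate that was not recently released."""
        for port in candidates:
            if port not in self._recent:
                return port
        raise RuntimeError("every candidate port was released recently")


# A tiny queue shows the edge case: once more ports are released than the
# queue can hold, the oldest one is forgotten and can be handed out again,
# possibly while it is still in TIME_WAIT.
tracker = RecentPortTracker(maxlen=2)
tracker.mark_released(1)
tracker.mark_released(2)
first_pick = tracker.pick([1, 2, 3])   # 1 and 2 are skipped
tracker.mark_released(3)               # evicts port 1 from the queue
second_pick = tracker.pick([1, 2, 3])  # port 1 is eligible again
```

That eviction is the case the retry layer is meant to catch.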
Changes
Core Implementation
- src/lightning/fabric/utilities/port_manager.py (212 lines, 100% test coverage)
- src/lightning/fabric/plugins/environments/lightning.py (port reservation/cleanup)
- src/lightning/fabric/utilities/__init__.py (export port manager)
Test Infrastructure
- tests/tests_fabric/utilities/test_port_manager.py (45 tests)
- tests/tests_fabric/conftest.py (retry logic)
- tests/tests_pytorch/conftest.py (retry logic)
Test Results
Does your PR introduce any breaking changes?
No - All changes are internal to the test infrastructure. No user-facing API changes.
Before submitting
📚 Documentation preview 📚: https://pytorch-lightning--21309.org.readthedocs.build/en/21309/