
Commit 2b435f5

Revert 'Fix EADDRINUSE errors with port manager (Lightning-AI#21309)'
This reverts commit 6a8d943.

Reason: The solution was thread-safe but not process-safe, which doesn't solve the actual GPU CI issue where tests run in parallel processes. The implementation added 1,208 lines of complex code that:

- Doesn't fix the root cause (process-level conflicts)
- Is over-engineered (12-24x more code than needed)
- Violates repository patterns

We'll implement a simpler, process-safe solution instead.
1 parent a883890 commit 2b435f5
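The distinction drawn in the message above is that an in-memory port registry guarded by a lock only coordinates threads inside one interpreter, while parallel CI jobs are separate processes that never see each other's registry. As a purely illustrative sketch of what "process-safe" could mean (this is not the follow-up fix referenced above; `claim_free_port` and the registry path are invented names, and it assumes a POSIX system with `fcntl`), a reservation can be made visible across processes by recording claimed ports in a lock-protected file:

import fcntl
import os
import socket
import tempfile

_REGISTRY = os.path.join(tempfile.gettempdir(), "test_port_registry.txt")


def claim_free_port() -> int:
    """Ask the OS for a free port and record it so parallel test processes skip it."""
    with open(_REGISTRY, "a+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # serializes processes, not just threads
        f.seek(0)
        claimed = {int(line) for line in f.read().split()}
        while True:
            s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            s.bind(("", 0))  # port 0: the OS picks an unused ephemeral port
            port = s.getsockname()[1]
            s.close()
            if port not in claimed:
                break
        f.write(f"{port}\n")  # "a+" appends, so the record survives for sibling processes
        fcntl.flock(f, fcntl.LOCK_UN)
    return port

Keeping the socket bound until the worker actually uses the port is another common way to narrow the reuse window; neither variant is implied by this commit.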

File tree

7 files changed (+7, -1208 lines)


src/lightning/fabric/CHANGELOG.md

Lines changed: 1 addition & 1 deletion
@@ -27,7 +27,7 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 
 ### Fixed
 
-- Fixed `EADDRINUSE` errors in distributed tests with port manager and retry logic ([#21309](https://github.com/Lightning-AI/pytorch-lightning/pull/21309))
+-
 
 
 ---

src/lightning/fabric/plugins/environments/lightning.py

Lines changed: 6 additions & 28 deletions
@@ -13,11 +13,11 @@
 # limitations under the License.
 
 import os
+import socket
 
 from typing_extensions import override
 
 from lightning.fabric.plugins.environments.cluster_environment import ClusterEnvironment
-from lightning.fabric.utilities.port_manager import get_port_manager
 from lightning.fabric.utilities.rank_zero import rank_zero_only
 
 
@@ -104,38 +104,16 @@ def teardown(self) -> None:
         if "WORLD_SIZE" in os.environ:
             del os.environ["WORLD_SIZE"]
 
-        if self._main_port != -1:
-            get_port_manager().release_port(self._main_port)
-            self._main_port = -1
-
-        os.environ.pop("MASTER_PORT", None)
-        os.environ.pop("MASTER_ADDR", None)
-
 
 def find_free_network_port() -> int:
     """Finds a free port on localhost.
 
     It is useful in single-node training when we don't want to connect to a real main node but have to set the
     `MASTER_PORT` environment variable.
 
-    The allocated port is reserved and won't be returned by subsequent calls until it's explicitly released.
-
-    Returns:
-        A port number that is reserved and free at the time of allocation
-
     """
-    # If an external launcher already specified a MASTER_PORT (for example, torch.distributed.spawn or
-    # multiprocessing helpers), reserve it through the port manager so no other test reuses the same number.
-    if "MASTER_PORT" in os.environ:
-        master_port_str = os.environ["MASTER_PORT"]
-        try:
-            existing_port = int(master_port_str)
-        except ValueError:
-            pass
-        else:
-            port_manager = get_port_manager()
-            if port_manager.reserve_existing_port(existing_port):
-                return existing_port
-
-    port_manager = get_port_manager()
-    return port_manager.allocate_port()
+    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
+    s.bind(("", 0))
+    port = s.getsockname()[1]
+    s.close()
+    return port
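For context, this is how the restored helper is typically consumed (an illustrative usage sketch, not code from this commit): the returned number seeds MASTER_PORT for a single-node run before distributed workers start. The port is free at query time, but another process could still bind it before the workers do, which is the classic free-port race the reverted code tried to paper over.

import os

from lightning.fabric.plugins.environments.lightning import find_free_network_port

# Seed the rendezvous address for a single-node test run.
os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = str(find_free_network_port())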

src/lightning/fabric/utilities/port_manager.py

Lines changed: 0 additions & 233 deletions
This file was deleted.
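The deleted module itself is not shown here, but the call sites removed above (get_port_manager, allocate_port, reserve_existing_port, release_port) imply roughly the following shape. This is a sketch reconstructed only from this diff, not the actual 233-line implementation, and it illustrates the objection in the commit message: a threading.Lock over an in-memory set serializes threads within one process, yet parallel test processes each hold their own registry and can still hand out the same port.

import socket
import threading


class _PortManager:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._reserved = set()

    def allocate_port(self) -> int:
        """Return an OS-assigned free port not yet handed out by this process."""
        with self._lock:
            while True:
                s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
                s.bind(("", 0))
                port = s.getsockname()[1]
                s.close()
                if port not in self._reserved:
                    self._reserved.add(port)
                    return port

    def reserve_existing_port(self, port: int) -> bool:
        """Record a port chosen elsewhere (e.g. an external launcher); True if it could be reserved."""
        with self._lock:
            if port in self._reserved:
                return False
            self._reserved.add(port)
            return True

    def release_port(self, port: int) -> None:
        with self._lock:
            self._reserved.discard(port)


_PORT_MANAGER = _PortManager()


def get_port_manager() -> _PortManager:
    return _PORT_MANAGER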
