allenwang28

This will enable multiple vLLM replicas to be spun up on the same local host.

Tested this by changing

services:
  policy:
    procs: 2
    num_replicas: 1
    with_gpus: true

in apps/vllm/llama3_8b.yaml.

Without my change, running with this config fails with:

  File "/home/allencwang/.conda/envs/forge_test_2/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 578, in init_worker_distributed_environment
    init_distributed_environment(parallel_config.world_size, rank,
  File "/home/allencwang/.conda/envs/forge_test_2/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 976, in init_distributed_environment
    torch.distributed.init_process_group(
  File "/home/allencwang/.conda/envs/forge_test_2/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
    return func(*args, **kwargs)
  File "/home/allencwang/.conda/envs/forge_test_2/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 95, in wrapper
    func_return = func(*args, **kwargs)
  File "/home/allencwang/.conda/envs/forge_test_2/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1752, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/allencwang/.conda/envs/forge_test_2/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 230, in _tcp_rendezvous_handler
    store = _create_c10d_store(
  File "/home/allencwang/.conda/envs/forge_test_2/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 198, in _create_c10d_store
    return TCPStore(
torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address. port: 12345, useIpv6: false, code: -98, name: EADDRINUSE, message: address already in use
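
For context, the EADDRINUSE above is what the OS reports when a second process tries to listen on a TCP port that is already bound (12345 in the trace). A minimal sketch of that collision, and of one common workaround (letting the OS pick a free port), could look like the following. `find_free_port`, `listen_on`, and `demo_collision` are hypothetical helpers written for illustration only; they are an assumption about the general technique, not the code in this PR.

```python
import errno
import socket


def find_free_port(host: str = "127.0.0.1") -> int:
    """Hypothetical helper: ask the OS for a currently unused TCP port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))  # port 0 lets the kernel choose a free port
        return s.getsockname()[1]


def listen_on(host: str, port: int) -> socket.socket:
    """Hypothetical helper: bind and listen, closing the socket on failure."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind((host, port))
        s.listen()
    except OSError:
        s.close()
        raise
    return s


def demo_collision() -> int:
    """Listen twice on the same port and return the errno of the second attempt."""
    first = listen_on("127.0.0.1", 0)  # stands in for the first replica's store
    port = first.getsockname()[1]
    try:
        second = listen_on("127.0.0.1", port)  # stands in for the second replica
    except OSError as exc:
        return exc.errno
    else:
        second.close()
        return 0
    finally:
        first.close()


if __name__ == "__main__":
    print("second listener failed with EADDRINUSE:",
          demo_collision() == errno.EADDRINUSE)
    print("a port the OS considers free right now:", find_free_port())
```

Note that a port returned by `find_free_port` is only free at the moment of the call; another process can still grab it before it is used, so callers typically retry on failure.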

@meta-cla meta-cla bot added the CLA Signed label Sep 22, 2025
@allenwang28 allenwang28 merged commit 00b2c98 into meta-pytorch:main Sep 23, 2025
5 checks passed
@allenwang28 allenwang28 deleted the vllm_port branch September 23, 2025 14:46