Use different ports for different GPU IDs #214

allenwang28 · 2025-09-22T21:31:45Z

This will enable multiple vLLM replicas to be spun up on the same local host.
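
As a rough illustration of the idea in the title (not the code in this PR), the sketch below derives a port from the GPU IDs assigned to a replica. The helper name `port_for_gpus`, the base port 12345, and the rule of offsetting by the lowest GPU ID are all assumptions made for the example.

```python
# Hypothetical sketch: none of these names come from provisioner.py.
BASE_PORT = 12345  # assumption: an arbitrary base port chosen for illustration


def port_for_gpus(gpu_ids, base_port=BASE_PORT):
    """Return a port offset by the lowest GPU ID assigned to a replica."""
    if not gpu_ids:
        raise ValueError("expected at least one GPU ID")
    return base_port + min(int(g) for g in gpu_ids)


# Two replicas on the same host with disjoint GPU sets get distinct ports.
replica_a = port_for_gpus(["0", "1"])
replica_b = port_for_gpus(["2", "3"])
```

Because GPU sets on one host do not overlap, any scheme keyed on a GPU ID should give each replica its own port, as long as the resulting ports are otherwise free.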

Tested this by changing

services:
  policy:
    procs: 2
    num_replicas: 1
    with_gpus: true

in apps/vllm/llama3_8b.yaml.

Without my change, running with this config fails with:

  File "/home/allencwang/.conda/envs/forge_test_2/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 578, in init_worker_distributed_environment
    init_distributed_environment(parallel_config.world_size, rank,
  File "/home/allencwang/.conda/envs/forge_test_2/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 976, in init_distributed_environment
    torch.distributed.init_process_group(
  File "/home/allencwang/.conda/envs/forge_test_2/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
    return func(*args, **kwargs)
  File "/home/allencwang/.conda/envs/forge_test_2/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 95, in wrapper
    func_return = func(*args, **kwargs)
  File "/home/allencwang/.conda/envs/forge_test_2/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1752, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/allencwang/.conda/envs/forge_test_2/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 230, in _tcp_rendezvous_handler
    store = _create_c10d_store(
  File "/home/allencwang/.conda/envs/forge_test_2/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 198, in _create_c10d_store
    return TCPStore(
torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address. port: 12345, useIpv6: false, code: -98, name: EADDRINUSE, message: address already in use
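
The `EADDRINUSE` failure above can be reproduced in isolation with plain sockets. This is a minimal sketch using only the Python standard library, not vLLM or torch; the helper `try_bind` is hypothetical. It binds one listener to a free port, then tries to bind a second listener to the same port.

```python
import errno
import socket


def try_bind(port):
    """Bind and listen on 127.0.0.1:port, returning the open socket."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind(("127.0.0.1", port))
        s.listen(1)
    except OSError:
        s.close()
        raise
    return s


first = try_bind(0)  # port 0 lets the OS pick a free port
port = first.getsockname()[1]
try:
    second = try_bind(port)  # same port again, as two processes sharing one port would
    second.close()
    collision = None
except OSError as exc:
    collision = exc.errno  # expected to be errno.EADDRINUSE
finally:
    first.close()
```

The second bind fails for the same reason the `TCPStore` server socket does in the traceback: another listener already holds the port.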

src/forge/controller/provisioner.py

different ports for different GPU IDs

ace6559

allenwang28 requested review from Jack-Khuu and JenniferWang September 22, 2025 21:31

meta-cla bot added the CLA Signed label Sep 22, 2025

Jack-Khuu approved these changes Sep 23, 2025

Jack-Khuu reviewed Sep 23, 2025

src/forge/controller/provisioner.py (outdated, resolved)

Update src/forge/controller/provisioner.py

3a93b04

ty Jack Co-authored-by: Jack-Khuu <[email protected]>

allenwang28 merged commit 00b2c98 into meta-pytorch:main Sep 23, 2025
5 checks passed

allenwang28 deleted the vllm_port branch September 23, 2025 14:46
