
Conversation

allenwang28 (Contributor) commented Sep 12, 2025

This PR introduces initial multi-host support. For context, see #144

To summarize, the main limitations we have around Monarch's HostMesh right now are:

  1. You can't spawn multiple proc meshes on a HostMesh
  2. Actors spawned on different remote meshes cannot communicate with each other

You can, however, facilitate communication from the main/client/controller. This PR creates multiple host meshes as we normally would, but there are a few limitations:

  • Only supports SLURM remote scheduling for now
  • For worker os.env/PyTorch Distributed (PTD) setup, I saw strange vLLM initialization issues if I tried to set environment variables after the proc_mesh was already created. The ideal path (see the sketch after this list):
    • Create a host mesh
    • Get the master addr/port from a host in the mesh
    • Create the proc mesh, setting the env variables
  • Can't show a real multi-host example until the Titan integration lands in the example
  • Can't use TorchStore because this requires remote actors that can communicate with each other. We can work around this with DCP save/load to NFS
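
As a minimal sketch of the ideal env/PTD ordering above (the helper names create_host_mesh, get_host_address, and spawn_procs are placeholders for whatever the HostMesh API ends up exposing, not actual Monarch calls):

# Placeholder API names; only the ordering matters: the rendezvous address
# must be known before the proc mesh (and therefore vLLM/PTD) is created.
def setup_worker_procs(num_hosts: int, procs_per_host: int):
    # 1. Allocate the hosts first, e.g. via the SLURM remote allocator.
    host_mesh = create_host_mesh(num_hosts=num_hosts)                # hypothetical

    # 2. Read the master address/port from one host in the mesh
    #    before any worker process exists.
    master_addr, master_port = get_host_address(host_mesh, rank=0)   # hypothetical

    # 3. Only now spawn the proc mesh, passing the env vars at creation time;
    #    setting them after the procs were already up caused odd vLLM init issues.
    env = {"MASTER_ADDR": master_addr, "MASTER_PORT": str(master_port)}
    return host_mesh.spawn_procs(per_host=procs_per_host, env=env)   # hypothetical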

Key API changes:

  1. ProcessConfig uses num_hosts to determine whether to use a remote allocation: None means spawn locally; num_hosts >= 1 runs an actual remote allocation
  2. ServiceConfig correspondingly introduces hosts_per_replica, which feeds into num_hosts (see the sketch below)
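
A rough sketch of how these two knobs compose; only num_hosts and hosts_per_replica come from this PR, and the other fields and defaults are illustrative, not the real dataclass definitions:

from dataclasses import dataclass
from typing import Optional

@dataclass
class ProcessConfig:
    procs: int = 1                           # illustrative field
    num_hosts: Optional[int] = None          # None -> spawn locally; >= 1 -> remote (SLURM) allocation

@dataclass
class ServiceConfig:
    procs_per_replica: int = 1               # illustrative field
    num_replicas: int = 1                    # illustrative field
    hosts_per_replica: Optional[int] = None  # forwarded as num_hosts for each replica's procs

# e.g. a service whose replicas each run on one remotely allocated SLURM host:
policy_service = ServiceConfig(procs_per_replica=8, num_replicas=2, hosts_per_replica=1)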

Other extras:

  • Added .gitignore entries for some SLURM artifacts
  • Added a launcher that runs the same GRPO example, just with multi-host enabled
  • Preps vLLM to work with DeepSeek. This won't work yet because we need 2 hosts and can't set up PTD correctly, but we can run it with 1 host and see an (expected) OOM
    • Also added a config for Qwen 32B. It works locally with 4 hosts, but is currently set with hosts_per_replica=1 just to test multi-host
    • Policy currently places the workers on the remote host, but the controller is placed on the client. This is because of the HostMesh limitations

For reviewers, I'd urge you not to focus too much on the provisioner, because that'll change once we have proper HostMesh support!

Sample logs: P1944199171

The meta-cla bot added the CLA Signed label on Sep 12, 2025.
allenwang28 (Contributor Author):

Note: I added as_engine_args because when I was trying DeepSeek I saw some weird pickling issues when trying to do get_vllm_args.choose(). I spent several hours trying to debug it before deciding it wasn't worth it.

Member:

This would be awesome!

Member:

?

allenwang28 (Contributor Author):

ah this is needed for EngineConfig so we can do

@classmethod
def as_engine_args(cls, config: Mapping | EngineConfig) ...

otherwise it'd need to be like

@classmethod
def as_engine_args(cls, config: Mapping | "EngineConfig") ...

and Python complains about the latter
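
For reference, a minimal repro of why (assuming Mapping is collections.abc.Mapping and the module does not use `from __future__ import annotations`, so the annotation is evaluated when the def runs; the class names here are stand-ins, not the real forge/vLLM types):

from collections.abc import Mapping

class EngineConfig:      # stand-in for the imported vLLM EngineConfig
    pass

class PolicyConfig:      # stand-in for the class that defines as_engine_args
    @classmethod
    def as_engine_args(cls, config: Mapping | EngineConfig):
        ...              # fine: both names are real types, so `|` builds a types.UnionType

# The quoted forward-ref spelling fails as soon as the def is executed, because
# `Mapping | "EngineConfig"` tries to `|` a type with a str and raises TypeError:
#
#     def as_engine_args(cls, config: Mapping | "EngineConfig"):  # TypeError
#
# If the import were undesirable, the usual escapes are quoting the whole
# annotation ("Mapping | EngineConfig"), typing.Union[Mapping, "EngineConfig"],
# or adding `from __future__ import annotations` to the module.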

Member:

n00b question: is there a way to query this information from a Python API?

allenwang28 (Contributor Author):

I'm not sure, to be honest; torchx doesn't have this, which makes me think there isn't.

Member:

Will this no longer work on MAST then?

allenwang28 (Contributor Author):

Yeah, I'm going to build MAST integration in another PR.

Contributor:

I think this snuck in from BS that I did and might not be necessary. Unless you found something?

allenwang28 (Contributor Author):

Ah, this app isn't running with TorchStore; it's that when I try to run DeepSeek-V3 it takes about an hour to download the weights. I added this here so it doesn't crash.

Contributor:

From my understanding, this timeout is decoupled from the e2e latency of an endpoint

pbontrager (Contributor) left a review comment:

Left some nits but it looks great so I'll pre-approve

Contributor:

I was under the impression that torchx meant we didn't need this? Either way, I think this should be in the GRPO app and still take the config as a conditional.

allenwang28 (Contributor Author):

Yeah, it's convoluted: we run sbatch to schedule the controller, then the controller calls sbatch through torchx.

We need the controller to run on a GPU node so that Monarch's build doesn't complain, because it's built with the tensor engine. I'm not sure what the right long-term solution is quite yet.

allenwang28 (Contributor Author):

I am going to leave it for now, but will think on this more

Contributor:

What's the reason for this? Can't we just point directly to it?

allenwang28 (Contributor Author):

I promise I'll fix this later lol

Contributor:

At this point it would be useful to have a diagram of all the concepts related to services so it'll be easier to maintain.

allenwang28 merged commit d7ecfc6 into meta-pytorch:main on Sep 15, 2025.
5 checks passed
allenwang28 deleted the host_mesh_2 branch on September 15, 2025 at 17:15.