
Conversation

allenwang28 (Contributor) commented Sep 12, 2025

This PR introduces initial multi-host support. For context, see #144

To summarize, the main limitations we have around Monarch's HostMesh right now are:

  1. You can't spawn multiple proc meshes on a HostMesh
  2. Actors spawned on different remote meshes cannot communicate with each other

You can, however, facilitate communication from the main/client/controller. This PR creates multiple host meshes as we normally would, but there are a few limitations:

  • Only supports SLURM remote scheduling for now
  • For worker os.env/PyTorch Distributed (PTD) setup, I saw strange vLLM initialization issues if I tried to set environment variables after the proc_mesh was already created. The ideal path (see the sketch after this list):
    • Create a host mesh
    • Get the master addr/port from a host in the mesh
    • Create the proc mesh, setting the env variables
  • Can't show a real multi-host example until the Titan integration lands in the example
  • Can't use TorchStore because this requires remote actors that can communicate with each other. We can work around this with DCP save/load to NFS
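
As a minimal sketch of the ideal env/PTD ordering above (the helper names create_host_mesh, get_host_address, and spawn_procs are placeholders for whatever the HostMesh API ends up exposing, not actual Monarch calls):

# Placeholder API names; only the ordering matters: the rendezvous address
# must be known before the proc mesh (and therefore vLLM/PTD) is created.
def setup_worker_procs(num_hosts: int, procs_per_host: int):
    # 1. Allocate the hosts first, e.g. via the SLURM remote allocator.
    host_mesh = create_host_mesh(num_hosts=num_hosts)                # hypothetical

    # 2. Read the master address/port from one host in the mesh
    #    before any worker process exists.
    master_addr, master_port = get_host_address(host_mesh, rank=0)   # hypothetical

    # 3. Only now spawn the proc mesh, passing the env vars at creation time;
    #    setting them after the procs were already up caused odd vLLM init issues.
    env = {"MASTER_ADDR": master_addr, "MASTER_PORT": str(master_port)}
    return host_mesh.spawn_procs(per_host=procs_per_host, env=env)   # hypothetical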

Key API changes:

  1. ProcessConfig uses num_hosts to determine whether to use a remote allocation: None means spawn locally; num_hosts >= 1 runs an actual remote allocation
  2. ServiceConfig correspondingly introduces hosts_per_replica, which feeds into num_hosts (see the sketch below)
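
A rough sketch of how these two knobs compose; only num_hosts and hosts_per_replica come from this PR, and the other fields and defaults are illustrative, not the real dataclass definitions:

from dataclasses import dataclass
from typing import Optional

@dataclass
class ProcessConfig:
    procs: int = 1                           # illustrative field
    num_hosts: Optional[int] = None          # None -> spawn locally; >= 1 -> remote (SLURM) allocation

@dataclass
class ServiceConfig:
    procs_per_replica: int = 1               # illustrative field
    num_replicas: int = 1                    # illustrative field
    hosts_per_replica: Optional[int] = None  # forwarded as num_hosts for each replica's procs

# e.g. a service whose replicas each run on one remotely allocated SLURM host:
policy_service = ServiceConfig(procs_per_replica=8, num_replicas=2, hosts_per_replica=1)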

Other extras:

  • Added .gitignore entries for some SLURM artifacts
  • Added a launcher that runs the same GRPO example, just with multi-host enabled
  • Preps vLLM to work with DeepSeek. This won't work yet because we need 2 hosts and can't set up PTD correctly, but we can run it with 1 host and see an (expected) OOM
    • Also added a config for Qwen 32B. It works locally with 4 hosts, but is currently set with hosts_per_replica=1 just to test multi-host
    • Policy currently places the workers on the remote host, but the controller is placed on the client. This is because of the HostMesh limitations

For reviewers, I'd urge you not to focus too much on the provisioner, because that'll change once we have proper HostMesh support!

Sample logs: P1944199171

The meta-cla bot added the CLA Signed label on Sep 12, 2025.
allenwang28 (Contributor Author):

Note: I added as_engine_args because when I was trying DeepSeek I saw some weird pickling issues when trying to do get_vllm_args.choose(). I spent several hours trying to debug it before deciding it wasn't worth it.

Member:

This would be awesome!

Member:

?

allenwang28 (Contributor Author):

ah this is needed for EngineConfig so we can do

@classmethod
def as_engine_args(cls, config: Mapping | EngineConfig) ...

otherwise it'd need to be like

@classmethod
def as_engine_args(cls, config: Mapping | "EngineConfig") ...

and Python complains about the latter
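
For reference, a minimal repro of why (assuming Mapping is collections.abc.Mapping and the module does not use `from __future__ import annotations`, so the annotation is evaluated when the def runs; the class names here are stand-ins, not the real forge/vLLM types):

from collections.abc import Mapping

class EngineConfig:      # stand-in for the imported vLLM EngineConfig
    pass

class PolicyConfig:      # stand-in for the class that defines as_engine_args
    @classmethod
    def as_engine_args(cls, config: Mapping | EngineConfig):
        ...              # fine: both names are real types, so `|` builds a types.UnionType

# The quoted forward-ref spelling fails as soon as the def is executed, because
# `Mapping | "EngineConfig"` tries to `|` a type with a str and raises TypeError:
#
#     def as_engine_args(cls, config: Mapping | "EngineConfig"):  # TypeError
#
# If the import were undesirable, the usual escapes are quoting the whole
# annotation ("Mapping | EngineConfig"), typing.Union[Mapping, "EngineConfig"],
# or adding `from __future__ import annotations` to the module.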

Member:

n00b question: is there a way to query this information from a Python API?

allenwang28 (Contributor Author):

I'm not sure, to be honest; torchx doesn't have this, which makes me think there isn't.

Member:

Will this no longer work on MAST then?

allenwang28 (Contributor Author):

Yeah, I'm going to build MAST integration in another PR.

Contributor:

I think this snuck in from BS that I did and might not be necessary. Unless you found something?

allenwang28 (Contributor Author):

Ah, this app isn't running with TorchStore; it's that when I try to run DeepSeek-V3 it takes about an hour to download the weights. I added this here so it doesn't crash.

Contributor:

From my understanding, this timeout is decoupled from the e2e latency of an endpoint

pbontrager (Contributor) left a review comment:

Left some nits but it looks great so I'll pre-approve

Contributor:

I was under the impression that torchx meant we didn't need this? Either way, I think this should be in the GRPO app and still take the config as a conditional.

allenwang28 (Contributor Author):

Yeah, it's convoluted: we run sbatch to schedule the controller, then the controller calls sbatch through torchx.

We need the controller to run on a GPU node so that Monarch's build doesn't complain, because it's built with the tensor engine. I'm not sure what the right long-term solution is quite yet.

allenwang28 (Contributor Author):

I am going to leave it for now, but will think on this more

Contributor:

What's the reason for this? Can't we just point directly to it?

allenwang28 (Contributor Author):

I promise I'll fix this later lol

Contributor:

At this point it would be useful to have a diagram of all the concepts related to services so it'll be easier to maintain.

allenwang28 merged commit d7ecfc6 into meta-pytorch:main on Sep 15, 2025.
5 checks passed
allenwang28 deleted the host_mesh_2 branch on September 15, 2025 at 17:15.