Initial multi-host support #151

Conversation
src/forge/actors/policy.py
Note: I added `as_engine_args` because when I was trying DeepSeek I saw some weird pickling issues when trying to do the `get_vllm_args.choose()`. I previously spent several hours trying to debug it until I decided it wasn't worth it.
apps/grpo/main.py
This would be awesome!
src/forge/actors/policy.py
?
Ah, this is needed for `EngineConfig`, so we can do:

```python
@classmethod
def as_engine_args(cls, config: Mapping | EngineConfig): ...
```

Otherwise it would need to be:

```python
@classmethod
def as_engine_args(cls, config: Mapping | "EngineConfig"): ...
```

and Python complains about the latter.
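For illustration, here is a minimal, self-contained sketch of why the unquoted union needs a real class in scope; `EngineConfig` is stubbed out here so the sketch runs standalone (in the PR it would come from vLLM):

```python
from collections.abc import Mapping

# Assumption: EngineConfig comes from vLLM; stubbed here so this runs standalone.
class EngineConfig: ...

class Policy:
    @classmethod
    def as_engine_args(cls, config: Mapping | EngineConfig):
        """Accept either a plain mapping or an already-built EngineConfig."""
        ...

# Without a real EngineConfig in scope, `Mapping | "EngineConfig"` fails at
# class-definition time with:
#   TypeError: unsupported operand type(s) for |: 'type' and 'str'
# because `|` evaluates its operands eagerly. The usual escape hatches are
# quoting the whole annotation ("Mapping | EngineConfig") or adding
# `from __future__ import annotations`, which defers annotation evaluation.
```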
src/forge/controller/provisioner.py
n00b question: is there a way to query this information from a Python API?
I'm not sure, to be honest; torchx doesn't have this, which makes me think there isn't.
src/forge/types.py
Will this no longer work on MAST, then?
Yeah, I'm going to build MAST integration in another PR.
apps/vllm/main.py
I think this snuck in from BS that I did and might not be necessary. Unless you found something?
Ah, so this app isn't running with torchstore; it's that when I try to run DeepSeek-V3, it takes about an hour to download the weights. I added this here so it doesn't crash.
From my understanding, this timeout is decoupled from the e2e latency of an endpoint
Left some nits, but it looks great, so I'll pre-approve.
launcher/job.sbatch
I was under the impression that torchx meant we didn't need this? Either way, I think this should be in the GRPO app and still take the config as a conditional.
Yeah, it's convoluted: we run sbatch to schedule the controller, then the controller calls sbatch through torchx (a rough sketch of that second hop follows this thread). We need the controller to run on a GPU node so that Monarch's build doesn't complain, because it's built with the tensor engine. I'm not sure what the right long-term solution is quite yet.
I am going to leave it for now, but will think on this more
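For context, a rough sketch of the second hop described above, using torchx's Python API; the app name, image, entrypoint, and resource shape are all assumptions for illustration, not the repo's actual launcher code:

```python
# Hypothetical sketch: the controller (already running on a GPU node via the
# first sbatch) submits the worker job back through Slurm using torchx.
from torchx import specs
from torchx.runner import get_runner

workers = specs.AppDef(
    name="forge-workers",
    roles=[
        specs.Role(
            name="worker",
            image="forge",  # assumed image/workspace name
            entrypoint="python",
            args=["-m", "apps.grpo.main"],
            num_replicas=2,
            resource=specs.Resource(cpu=16, gpu=8, memMB=128 * 1024),
        )
    ],
)

runner = get_runner()
# torchx's slurm scheduler turns this into an sbatch submission: this is the
# "controller calls sbatch through torchx" step described above.
app_handle = runner.run(workers, scheduler="slurm")
print(app_handle)
```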
src/forge/controller/proc_mesh.py
What's the reason for this? Can't we just point directly to it?
I promise I'll fix this later lol
src/forge/controller/provisioner.py
At this point it would be useful to have a diagram of all the concepts related to services so it'll be easier to maintain.
This PR introduces initial multi-host support. For context, see #144.

To summarize, the main limitations we have around Monarch's HostMesh right now are:

You can, however, facilitate communications from the main/client/controller. This PR creates multiple host meshes as we normally would, but there are a few limitations:
- `os.env`/PTD setup: I saw strange vLLM initialization issues if I tried to set environment variables after the `proc_mesh` was already created. The ideal path:

Key API changes (a rough sketch of these configs follows below):
- `ProcessConfig` uses `num_hosts` to determine whether or not to use a remote allocation: `None` means spawn locally; `num_hosts >= 1` will run an actual remote allocation.
- `ServiceConfig` correspondingly introduces `hosts_per_replica` to utilize `num_hosts`.

Other extras:
- `hosts_per_replica=1` just to test multi-host.

For reviewers, I'd urge you not to focus too much on the provisioner, because that'll be changed once we have proper HostMesh support!
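To make the API changes concrete, here is a minimal sketch of how the described fields might fit together; the dataclass layout, defaults, and helper methods are assumptions based on the summary above, not the PR's actual code:

```python
# Hypothetical sketch of the described config surface (not the PR's code).
from dataclasses import dataclass

@dataclass
class ProcessConfig:
    # None -> spawn locally; num_hosts >= 1 -> run an actual remote allocation.
    num_hosts: int | None = None

    @property
    def is_remote(self) -> bool:
        return self.num_hosts is not None and self.num_hosts >= 1

@dataclass
class ServiceConfig:
    # Each replica gets its own host mesh of this size; this feeds
    # ProcessConfig.num_hosts for every replica's allocation.
    hosts_per_replica: int | None = None
    num_replicas: int = 1  # assumed field, for illustration

    def to_process_config(self) -> ProcessConfig:
        return ProcessConfig(num_hosts=self.hosts_per_replica)

# e.g. one host per replica, just to exercise the multi-host path:
cfg = ServiceConfig(hosts_per_replica=1, num_replicas=2)
assert cfg.to_process_config().is_remote
```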
Sample logs: P1944199171