Mast setup #285
Conversation
Ritesh1905 commented Oct 1, 2025
- Attempt 2 of the MAST setup PR.
- Includes a README to help set up the MAST environment.
Can we change the design a bit?
There are other things I'd want cleaned up, but I think making this proposed change is sufficient for the sake of this PR.
async def launch_mast_job(self):
    handle = self.create_server_handle()
    server_spec = info(handle)
    if server_spec and server_spec.state == AppState.RUNNING:
Why do this at all? Why not always just return commands.get_or_create...?
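A self-contained toy (not the project's API; the exact `commands.get_or_create...` call is truncated in the comment above) illustrating why an idempotent get-or-create makes a separate RUNNING-state check redundant:

_servers: dict[str, str] = {}

def get_or_create(handle: str) -> str:
    # Create the server on first use; afterwards, return the existing one,
    # so callers never need to inspect server state before calling.
    if handle not in _servers:
        _servers[handle] = f"server-for-{handle}"
    return _servers[handle]

print(get_or_create("job-a"))  # creates the server
print(get_or_create("job-a"))  # reuses it; no state check needed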
return allocator, alloc_constraints
async def create_host_mesh(self, name: str, num_hosts: int):
Just as an observation: IMO we should be able to remove most of this complexity with host mesh. Specifically, we should use a single host mesh for the entire job, remove the TG-specific logic, and go from there.
}

# Function to mount a single workspace to /mnt/wsfuse
mount_workspace() {
No need to address this here, but in the future we shouldn't make this a requirement for launching MAST jobs. By leaving this open, it's actually possible for the remote job to use some locally passed assets.
Force-pushed from 0106aaf to 3a36aa5
Almost there! Thanks for your patience @Ritesh1905
src/forge/types.py
Outdated
class Launcher(Enum):
    MAST = "mast"
    SLURM = "slurm"
    LOCAL = "local"
Since there's no such thing as a local launcher, can we remove LOCAL?
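For reference, a minimal sketch of the enum with LOCAL dropped, as the comment suggests:

from enum import Enum

class Launcher(Enum):
    MAST = "mast"
    SLURM = "slurm"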
src/forge/controller/launcher.py
Outdated
return f"{self.scheduler_name}:///{self.job_name}" | ||
|
||
|
||
def get_launcher(cfg: DictConfig | None = None) -> BaseLauncher: |
I would like to keep OmegaConf constrained to only the apps as it isn't explicit enough. The pattern we've been following throughout Forge is something like this:
@dataclass
class LauncherConfig:
    ...

@dataclass
class SlurmConfig:
    ...

@dataclass
class MastConfig:
    ...
Would also like to get rid of the Local launcher:
def get_launcher(cfg: LauncherConfig | None = None) -> BaseLauncher | None:
    """Returns the launcher given the configuration."""
    if not cfg:
        return None
    if isinstance(cfg, MastConfig):
        return MastLauncher(cfg)
    elif isinstance(cfg, SlurmConfig):
        return SlurmLauncher(cfg)
    else:
        raise ValueError(f"Unsupported config provided, got {cfg}")
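For concreteness, a hypothetical call site for the pattern above; the config fields (`job_name`, `partition`) are illustrative assumptions, not fields from the PR:

from dataclasses import dataclass

@dataclass
class MastConfig:
    job_name: str  # assumed field, for illustration only

@dataclass
class SlurmConfig:
    partition: str  # assumed field, for illustration only

# With the get_launcher sketch above in scope, dispatch is driven by the
# config type rather than by string keys in an OmegaConf dict:
launcher = get_launcher(MastConfig(job_name="forge-demo"))  # -> MastLauncher
launcher = get_launcher(SlurmConfig(partition="batch"))     # -> SlurmLauncher
launcher = get_launcher(None)                               # -> None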
src/forge/controller/provisioner.py
Outdated
self._host_gpu_map = {
    self._this_host_id: GpuManager(available_local_devices),
}
self.launcher: BaseLauncher = get_launcher(cfg)
Suggested change:
-    self.launcher: BaseLauncher = get_launcher(cfg)
+    self.launcher: BaseLauncher | None = get_launcher(cfg)
+    if not self.launcher:
+        logger.warning("Launcher not provided, remote allocations will not work.")
Why? Thoughts on defaulting get_launcher to return Slurm, rather than throwing this exception?
Because people running on, say, a dev GPU might set this unknowingly and wonder why they're getting a SLURM error.
src/forge/controller/provisioner.py
Outdated
async def initialize(self):
    """Call this after creating the instance"""
    await self.launcher.initialize()
Suggested change:
-    await self.launcher.initialize()
+    if self.launcher is not None:
+        await self.launcher.initialize()
async def create_host_mesh(self, name: str, num_hosts: int) -> HostMesh:
    """Creates a remote server and a HostMesh on it."""
    # no need to lock here because this is already locked behind `get_proc_mesh`
    logger.debug(f"Creating remote server for alloc {name}")
if not self.launcher:
    raise RuntimeError(
        "You tried to create a remote allocation by specifying the number of hosts "
        "on an actor or service, but no launcher was specified."
    )
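One way the guard might slot into the method from the diff above (a sketch under the reviewer's suggestion, not the final code):

async def create_host_mesh(self, name: str, num_hosts: int) -> HostMesh:
    """Creates a remote server and a HostMesh on it."""
    # Fail fast with an actionable message instead of an AttributeError later.
    if not self.launcher:
        raise RuntimeError(
            "You tried to create a remote allocation by specifying the number "
            "of hosts on an actor or service, but no launcher was specified."
        )
    logger.debug(f"Creating remote server for alloc {name}")
    ...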
Force-pushed from 3a36aa5 to 6dbc5ee
@@ -0,0 +1,162 @@
# Grouped Relative Policy Optimization (GRPO)
# >>> python -m apps.grpo.main --config apps/grpo/qwen3_1_7b.yaml
nit: wrong cmd