Conversation

Ritesh1905 (Contributor):
  • Attempt 2 of the mast setup PR.
  • Includes a README to help set up the mast env.

@meta-cla bot added the CLA Signed label on Oct 1, 2025
@allenwang28 (Contributor):

Can we change the design a bit?

  • Let's retain the Provisioner as originally designed
  • We introduce scheduler-specific classes like this:
class BaseScheduler:
    async def initialize(self):
        pass

    async def get_allocator(self):
        pass


class MastScheduler(BaseScheduler):
    async def initialize(self):
        await self._build_appdef(...)
        await self._launch_mast_job(...)
        ...

    async def get_allocator(self):
        allocator = MastAllocator(...)
        return allocator, alloc_constraints


class SlurmScheduler(BaseScheduler):
    async def initialize(self):
        # no-op: nothing to launch up front for Slurm
        pass

    async def get_allocator(self):
        appdef = hyperactor.host_mesh(...)
        server_info = commands.get_or_create(...)
        alloc = RemoteAllocator(...)
        return alloc, None

There are other things I'd want cleaned up, but I think if we can make this proposed change that is sufficient for the sake of this PR.
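
For illustration, a minimal sketch of how the Provisioner might drive one of these schedulers under this proposal (constructor arguments are left out since they aren't specified above):

# Hypothetical wiring inside the Provisioner; only initialize() and
# get_allocator() come from the proposed BaseScheduler interface.
scheduler: BaseScheduler = MastScheduler()  # real construction args are an assumption
await scheduler.initialize()  # e.g. builds the appdef and launches the mast job
allocator, alloc_constraints = await scheduler.get_allocator()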

async def launch_mast_job(self):
    handle = self.create_server_handle()
    server_spec = info(handle)
    if server_spec and server_spec.state == AppState.RUNNING:

Contributor:

Why do this at all? Why not always just return commands.get_or_create...?
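
A minimal sketch of the simplification being suggested (the exact arguments and return handling of commands.get_or_create are assumptions, not taken from this PR):

async def launch_mast_job(self):
    # get_or_create is assumed to be idempotent: it reuses a RUNNING job
    # if one exists and launches a new one otherwise, so the explicit
    # state check above would no longer be needed.
    handle = self.create_server_handle()
    return commands.get_or_create(handle)  # args/awaitability are assumptions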


return allocator, alloc_constraints

async def create_host_mesh(self, name: str, num_hosts: int):

Contributor:

Just as an observation: IMO we should be able to remove most of this complexity with host mesh. Specifically, we should use a single host mesh for the entire job, remove TG-specific logic, and go from there.

}

# Function to mount a single workspace to /mnt/wsfuse
mount_workspace() {

Contributor:

No need to address this here, but in the future we shouldn't make this a requirement for launching mast jobs. By leaving this open, it's actually possible for the remote job to use some locally passed assets.

@Ritesh1905 force-pushed the rithesh/final_mast_changes branch 3 times, most recently from 0106aaf to 3a36aa5, on October 3, 2025 04:01

@allenwang28 left a comment:

almost there! Thanks for your patience @Ritesh1905

class Launcher(Enum):
    MAST = "mast"
    SLURM = "slurm"
    LOCAL = "local"

Contributor:

Since there isn't such a thing as a local launcher, can we remove LOCAL?
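
Presumably the enum would then shrink to just the two real launchers:

class Launcher(Enum):
    MAST = "mast"
    SLURM = "slurm"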

return f"{self.scheduler_name}:///{self.job_name}"


def get_launcher(cfg: DictConfig | None = None) -> BaseLauncher:

Contributor:

I would like to keep OmegaConf constrained to only the apps as it isn't explicit enough. The pattern we've been following throughout Forge is something like this:

@dataclass
class LauncherConfig:
    ...

@dataclass
class SlurmConfig:
    ...

@dataclass
class MastConfig:
    ...

Would also like to get rid of the Local launcher:

def get_launcher(cfg: LauncherConfig | None = None) ->  BaseLauncher | None:
    """Returns the launcher given the configuration."""
    if not cfg:
        return None

    if isinstance(cfg, MastConfig):
        return MastLauncher(cfg)
    elif isinstance(cfg, SlurmConfig):
        return SlurmLauncher(cfg)
    else:
        raise ValueError(f"Unsupported config provided, got {cfg}") 

self._host_gpu_map = {
    self._this_host_id: GpuManager(available_local_devices),
}
self.launcher: BaseLauncher = get_launcher(cfg)

Contributor:

Suggested change:

- self.launcher: BaseLauncher = get_launcher(cfg)
+ self.launcher: BaseLauncher | None = get_launcher(cfg)
+ if not self.launcher:
+     logger.warning("Launcher not provided, remote allocations will not work.")

@Ritesh1905 (Contributor, author):

Why? Thoughts on defaulting get_launcher to return the Slurm launcher, rather than throwing this exception?

Contributor:

Because people running on, say, a dev GPU might set this unknowingly and wonder why they're getting a SLURM error.


async def initialize(self):
    """Call this after creating the instance"""
    await self.launcher.initialize()

Contributor:

Suggested change:

- await self.launcher.initialize()
+ if self.launcher is not None:
+     await self.launcher.initialize()

async def create_host_mesh(self, name: str, num_hosts: int) -> HostMesh:
    """Creates a remote server and a HostMesh on it."""
    # no need to lock here because this is already locked behind `get_proc_mesh`
    logger.debug(f"Creating remote server for alloc {name}")

Contributor:

if not self.launcher:
    raise RuntimeError("You tried to create a remote allocation by specifying the number of hosts on an actor or service, but no launcher was specified.")

@Ritesh1905 force-pushed the rithesh/final_mast_changes branch from 3a36aa5 to 6dbc5ee on October 3, 2025 18:24
@Ritesh1905 merged commit 08404cb into main on Oct 3, 2025
7 checks passed
@allenwang28 mentioned this pull request on Oct 3, 2025
@@ -0,0 +1,162 @@
# Grouped Relative Policy Optimization (GRPO)
# >>> python -m apps.grpo.main --config apps/grpo/qwen3_1_7b.yaml

Contributor:

nit: wrong cmd

@Ritesh1905 deleted the rithesh/final_mast_changes branch on October 7, 2025 17:30