Mast setup #285
Conversation
Ritesh1905 commented Oct 1, 2025
- Attempt 2 of the MAST setup PR.
- Includes a README to help set up the MAST environment.
Can we change the design a bit?
There are other things I'd want cleaned up, but I think making this proposed change is sufficient for the sake of this PR.
async def launch_mast_job(self):
    handle = self.create_server_handle()
    server_spec = info(handle)
    if server_spec and server_spec.state == AppState.RUNNING:
Why do this at all? Why not always just return commands.get_or_create...?
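A self-contained toy (not the project's API; the exact `commands.get_or_create...` call is truncated in the comment above) illustrating why an idempotent get-or-create makes a separate RUNNING-state check redundant:

_servers: dict[str, str] = {}

def get_or_create(handle: str) -> str:
    # Create the server on first use; afterwards, return the existing one,
    # so callers never need to inspect server state before calling.
    if handle not in _servers:
        _servers[handle] = f"server-for-{handle}"
    return _servers[handle]

print(get_or_create("job-a"))  # creates the server
print(get_or_create("job-a"))  # reuses it; no state check needed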
return allocator, alloc_constraints
async def create_host_mesh(self, name: str, num_hosts: int):
Just as an observation: IMO we should be able to remove most of this complexity with host mesh. Specifically, we should use a single host mesh for the entire job, remove the TG-specific logic, and go from there.
}

# Function to mount a single workspace to /mnt/wsfuse
mount_workspace() {
No need to address this here, but in the future we shouldn't make this a requirement for launching MAST jobs. By leaving this open, it's actually possible for the remote job to use some locally passed assets.
Force-pushed from 0106aaf to 3a36aa5
Almost there! Thanks for your patience @Ritesh1905
src/forge/types.py
Outdated
class Launcher(Enum):
    MAST = "mast"
    SLURM = "slurm"
    LOCAL = "local"
Since there's no such thing as a local launcher, can we remove LOCAL?
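For reference, a minimal sketch of the enum with LOCAL dropped, as the comment suggests:

from enum import Enum

class Launcher(Enum):
    MAST = "mast"
    SLURM = "slurm"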
src/forge/controller/launcher.py
Outdated
return f"{self.scheduler_name}:///{self.job_name}" | ||
|
||
|
||
def get_launcher(cfg: DictConfig | None = None) -> BaseLauncher: |
I would like to keep OmegaConf constrained to only the apps as it isn't explicit enough. The pattern we've been following throughout Forge is something like this:
@dataclass
class LauncherConfig:
    ...

@dataclass
class SlurmConfig:
    ...

@dataclass
class MastConfig:
    ...
Would also like to get rid of the Local launcher:
def get_launcher(cfg: LauncherConfig | None = None) -> BaseLauncher | None:
    """Returns the launcher given the configuration."""
    if not cfg:
        return None
    if isinstance(cfg, MastConfig):
        return MastLauncher(cfg)
    elif isinstance(cfg, SlurmConfig):
        return SlurmLauncher(cfg)
    else:
        raise ValueError(f"Unsupported config provided, got {cfg}")
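For concreteness, a hypothetical call site for the pattern above; the config fields (`job_name`, `partition`) are illustrative assumptions, not fields from the PR:

from dataclasses import dataclass

@dataclass
class MastConfig:
    job_name: str  # assumed field, for illustration only

@dataclass
class SlurmConfig:
    partition: str  # assumed field, for illustration only

# With the get_launcher sketch above in scope, dispatch is driven by the
# config type rather than by string keys in an OmegaConf dict:
launcher = get_launcher(MastConfig(job_name="forge-demo"))  # -> MastLauncher
launcher = get_launcher(SlurmConfig(partition="batch"))     # -> SlurmLauncher
launcher = get_launcher(None)                               # -> None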
src/forge/controller/provisioner.py
Outdated
self._host_gpu_map = {
    self._this_host_id: GpuManager(available_local_devices),
}
self.launcher: BaseLauncher = get_launcher(cfg)
Suggested change:
-    self.launcher: BaseLauncher = get_launcher(cfg)
+    self.launcher: BaseLauncher | None = get_launcher(cfg)
+    if not self.launcher:
+        logger.warning("Launcher not provided, remote allocations will not work.")
Why? Thoughts on defaulting get_launcher to return Slurm, rather than throwing this exception?
Because people running on, say, a dev GPU might set this unknowingly and wonder why they're getting a SLURM error.
src/forge/controller/provisioner.py
Outdated
async def initialize(self):
    """Call this after creating the instance"""
    await self.launcher.initialize()
Suggested change:
-    await self.launcher.initialize()
+    if self.launcher is not None:
+        await self.launcher.initialize()
async def create_host_mesh(self, name: str, num_hosts: int) -> HostMesh:
    """Creates a remote server and a HostMesh on it."""
    # no need to lock here because this is already locked behind `get_proc_mesh`
    logger.debug(f"Creating remote server for alloc {name}")
if not self.launcher:
    raise RuntimeError(
        "You tried to create a remote allocation by specifying the number of hosts "
        "on an actor or service, but no launcher was specified."
    )
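One way the guard might slot into the method from the diff above (a sketch under the reviewer's suggestion, not the final code):

async def create_host_mesh(self, name: str, num_hosts: int) -> HostMesh:
    """Creates a remote server and a HostMesh on it."""
    # Fail fast with an actionable message instead of an AttributeError later.
    if not self.launcher:
        raise RuntimeError(
            "You tried to create a remote allocation by specifying the number "
            "of hosts on an actor or service, but no launcher was specified."
        )
    logger.debug(f"Creating remote server for alloc {name}")
    ...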
Force-pushed from 3a36aa5 to 6dbc5ee
@@ -0,0 +1,162 @@
# Grouped Relative Policy Optimization (GRPO)
# >>> python -m apps.grpo.main --config apps/grpo/qwen3_1_7b.yaml
nit: wrong cmd