allenwang28
Contributor

This PR implements a few cleanups after #285

Changes:

  • Introduces a get_remote_info() utility: given a host mesh, it spawns a proc and a setup actor, gets the host mesh's address and port, and kills the proc. This happens regardless of the underlying launcher. Includes some slight refactoring in the bootstrap
    • Launchers no longer return this remote info
  • Monarch now complains (with an AssertionError) if you try to create an empty workspace. This doesn't really apply to SLURM, so we just create a tempfile and pass it to the workspace
  • Gets rid of controller.proc_mesh.get_proc_mesh etc. and redirects everything to just use provisioner.get_proc_mesh
  • Adds host mesh <> proc tracking. I thought I could call spawn_procs(...) multiple times, but the prior procs still die. The intention was to enable colocating the vLLM controller and worker. We do this tracking because Monarch doesn't yet let you get the host_mesh from the proc_mesh (but it will eventually)
  • Also adds the capability to provide a host mesh to get_proc_mesh - this way you can colocate things
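
The tracking and colocation described in the last two bullets can be sketched roughly as follows. This is a minimal illustration only: HostMeshStub, ProcMeshStub, ProvisionerSketch, and host_mesh_of are hypothetical names invented for the sketch, not Monarch classes or the PR's actual code, and the sketch models only the bookkeeping, not process lifetimes.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(eq=False)  # eq=False keeps identity-based hashing, so instances can be dict keys
class HostMeshStub:
    """Hypothetical stand-in for a Monarch host mesh."""
    name: str


@dataclass(eq=False)
class ProcMeshStub:
    """Hypothetical stand-in for a Monarch proc mesh."""
    name: str


class ProvisionerSketch:
    """Records which host mesh each proc mesh was spawned on."""

    def __init__(self) -> None:
        self._host_of: Dict[ProcMeshStub, HostMeshStub] = {}
        self._count = 0

    def get_proc_mesh(self, host_mesh: Optional[HostMeshStub] = None) -> ProcMeshStub:
        # Allocate a fresh host mesh unless the caller supplies one to colocate on.
        if host_mesh is None:
            host_mesh = HostMeshStub(f"host-{self._count}")
        proc_mesh = ProcMeshStub(f"proc-{self._count}")
        self._count += 1
        self._host_of[proc_mesh] = host_mesh
        return proc_mesh

    def host_mesh_of(self, proc_mesh: ProcMeshStub) -> HostMeshStub:
        # The reverse lookup that the description says Monarch does not yet offer.
        return self._host_of[proc_mesh]


provisioner = ProvisionerSketch()
controller = provisioner.get_proc_mesh()
# Colocate the worker by reusing the controller's host mesh.
worker = provisioner.get_proc_mesh(host_mesh=provisioner.host_mesh_of(controller))
```

Keeping the mapping inside the provisioner means callers never need to hold the host mesh themselves, which fits the move to a single provisioner.get_proc_mesh entry point.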

@meta-cla meta-cla bot added the CLA Signed label Oct 3, 2025
except ImportError as e:
    print(f"Warning: Monarch meta/fb internal imports failed: {e}")
    print("Monarch functionality will be limited")
    # This means there is an error with MAST
Contributor

should it silently fail?

Contributor

@Ritesh1905 Ritesh1905 Oct 6, 2025

For now, yes (though some sort of logging would be ideal), until we figure out the right way to segregate Meta-only code blocks. These dependencies are Meta-internal and are installed via an internal fbpkg monarch build. The env setup for MAST requires a separate installation script.

Contributor Author

I've just added a block that says "if imports failed and you're trying to use MAST, print an error message saying the imports failed and that you should check your build was correct"

If None, an address will be detected.
Returns:
A proc mesh.
Contributor

can we use the actual class, e.g. ProcMesh?

also add a typehint?
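
As an illustration of this suggestion, the docstring and return annotation might look something like the following. The function name get_proc_mesh_sketch, its parameters, and the placeholder ProcMesh class are assumptions for the sketch; the real signature is not shown in this thread.

```python
from typing import Optional, get_type_hints


class ProcMesh:
    """Placeholder for the real ProcMesh class named in the review comment."""


def get_proc_mesh_sketch(num_procs: int, addr: Optional[str] = None) -> ProcMesh:
    """Hypothetical signature illustrating the requested docstring and type hint.

    Args:
        num_procs: Number of processes to spawn (assumed parameter).
        addr: Address to use. If None, an address will be detected.

    Returns:
        ProcMesh: The spawned proc mesh.
    """
    # A real implementation would spawn processes; the sketch returns the placeholder.
    return ProcMesh()
```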

Comment on lines 232 to 233
env_vars["HYPERACTOR_MESSAGE_DELIVER_TIMEOUT_SECS"] = "600"
env_vars["HYPERACTOR_MESSAGE_DELIVERY_TIMEOUT_SECS"] = "600"
Contributor

Do you need both of these? What's the difference between the two?

Contributor Author

Great catch! Removed the DELIVER one, which was a typo.
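
One possible way to guard against this kind of misspelled key is to define the name once and reject anything not on an allow-list. This is only a sketch: DELIVERY_TIMEOUT_KEY, ALLOWED_ENV_KEYS, and set_env_var are hypothetical helpers, and the allow-list contains just the one variable name confirmed in this thread.

```python
from typing import Dict

# Define the key once so every call site shares the same spelling.
DELIVERY_TIMEOUT_KEY = "HYPERACTOR_MESSAGE_DELIVERY_TIMEOUT_SECS"

# Hypothetical allow-list; a real one would cover every variable the launcher sets.
ALLOWED_ENV_KEYS = frozenset({DELIVERY_TIMEOUT_KEY})


def set_env_var(env_vars: Dict[str, str], key: str, value: str) -> None:
    """Set an environment variable, rejecting names that are not on the allow-list."""
    if key not in ALLOWED_ENV_KEYS:
        raise KeyError(f"Unknown environment variable: {key}")
    env_vars[key] = value


env_vars: Dict[str, str] = {}
set_env_var(env_vars, DELIVERY_TIMEOUT_KEY, "600")
```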

@allenwang28 allenwang28 merged commit 2dc7707 into meta-pytorch:main Oct 6, 2025
7 checks passed
@allenwang28 allenwang28 deleted the qwen32 branch October 6, 2025 15:50