
Conversation

@allenwang28 (Contributor) commented Oct 14, 2025

Addresses and closes both #171 and #144.

This is needed both to enable torchtitan / RLTrainer across multiple nodes and to colocate the vLLM policy with its workers.

Example run: https://wandb.ai/cabernet-team/grpo-training/runs/snehzq5o
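For context on the manual pattern being replaced, here is a minimal sketch. It only uses the standard env:// rendezvous variables that torchrun normally exports; the function name and the 8-GPUs-per-node assumption are illustrative and not taken from the repo.

```python
import os

import torch.distributed as dist


def manual_torchrun_env(rank: int, world_size: int, master_addr: str, master_port: int) -> None:
    """Hand-roll the env vars that torchrun would normally export (the pattern this PR removes)."""
    os.environ["MASTER_ADDR"] = master_addr
    os.environ["MASTER_PORT"] = str(master_port)
    os.environ["RANK"] = str(rank)
    os.environ["WORLD_SIZE"] = str(world_size)
    os.environ["LOCAL_RANK"] = str(rank % 8)  # assumption: 8 GPUs per node

    # init_process_group defaults to the env:// rendezvous, so it picks up
    # MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE from the variables set above.
    dist.init_process_group(backend="nccl")
```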

meta-cla bot added the CLA Signed label Oct 14, 2025
@allenwang28 changed the title from "Uses …" to "Use monarch's distributed setup utility and colocate vLLM with its workers" Oct 14, 2025
@joecummings (Member) left a comment


For posterity: can you link to a running WandB project?

@allenwang28 merged commit aa59857 into meta-pytorch:main Oct 15, 2025
9 checks passed
@allenwang28 deleted the titan_setup branch October 15, 2025 23:24

Labels

CLA Signed (managed by the Meta Open Source bot)


Development

Successfully merging this pull request may close these issues.

Utilize setup_env_for_distributed in Trainer and RefModel instead of manually creating Torchrun env variables
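A rough sketch of the change the linked issue asks for is below. The import path and call signature for setup_env_for_distributed are assumptions (only the function name comes from the issue title), so treat this as illustrative rather than the actual Trainer/RefModel code.

```python
import torch.distributed as dist

# Assumed import path; only the name setup_env_for_distributed comes from the issue.
from monarch.utils import setup_env_for_distributed


def init_trainer_process() -> None:
    # Let the utility populate MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE and
    # friends instead of the Trainer/RefModel writing them by hand.
    setup_env_for_distributed()  # signature assumed; the real utility may take arguments
    dist.init_process_group(backend="nccl")
```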

10 participants