
allenwang28 (Contributor)

Tested with both:

python -m apps.grpo.main --config=apps/grpo/qwen3_1_7b.yaml

MONARCH_HOSTMESH_V1=1 TORCHSTORE_USE_RDMA=1 python -m apps.grpo.main --config=apps/grpo/qwen3_1_7b.yaml

Notes:

with DCP

  rl_trainer_perf/push_weights/dcp_save/duration_avg_s: 5.659896367986221
  rl_trainer_perf/push_weights/dcp_save/duration_max_s: 5.659896367986221
  policy_worker_perf/update_weights/total_duration_avg_s: 6.1911617270088755
  policy_worker_perf/update_weights/total_duration_max_s: 6.1911617270088755

with RDMA

  rl_trainer_perf/push_weights/ts_save/duration_avg_s: 3.6854572310112417
  rl_trainer_perf/push_weights/ts_save/duration_max_s: 3.6854572310112417
  policy_worker_perf/update_weights/total_duration_avg_s: 4.642838474013843
  policy_worker_perf/update_weights/total_duration_max_s: 4.642838474013843
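For a quick side-by-side, here is a small sketch that computes seconds saved and the speedup factor (DCP time / RDMA time) from the averages logged above. It is plain arithmetic on this single reported run (avg and max are identical, so each value is likely one sample). The `compare` helper and the phase keys are illustrative names, not part of the codebase.

```python
# Averages copied from the logged metrics above (seconds, one run each).
dcp = {
    "push_weights": 5.659896367986221,     # rl_trainer_perf/push_weights/dcp_save
    "update_weights": 6.1911617270088755,  # policy_worker_perf/update_weights/total
}
rdma = {
    "push_weights": 3.6854572310112417,    # rl_trainer_perf/push_weights/ts_save
    "update_weights": 4.642838474013843,   # policy_worker_perf/update_weights/total
}


def compare(baseline: dict, candidate: dict) -> dict:
    """Hypothetical helper: seconds saved and speedup (baseline / candidate) per phase."""
    out = {}
    for phase, base_s in baseline.items():
        new_s = candidate[phase]
        out[phase] = {"saved_s": base_s - new_s, "speedup": base_s / new_s}
    return out


if __name__ == "__main__":
    for phase, stats in compare(dcp, rdma).items():
        print(f"{phase}: {stats['saved_s']:.2f}s saved, {stats['speedup']:.2f}x faster")
```

On these numbers, the RDMA path cuts roughly 2s from the trainer-side push and roughly 1.5s from the end-to-end policy update. More runs would be needed before treating these as stable figures.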

Will test on Slurm/MAST and make the relevant changes next.

allenwang28 requested a review from LucasLLC October 13, 2025 22:29
meta-cla bot added the CLA Signed label Oct 13, 2025
allenwang28 merged commit 4b3b3c2 into meta-pytorch:main Oct 13, 2025
9 of 14 checks passed
allenwang28 deleted the torchstore-pin branch October 13, 2025 22:42