allenwang28
Contributor

Wandb run: https://wandb.ai/cabernet-team/grpo-training/runs/6ydiynis

This has run successfully for me for more than 10 steps. The key to avoiding OOM was enabling full activation checkpointing.

Tracking perf logs here as a reference for future optimization (filtered to the entries with the largest values):

  main_perf/continuous_rollouts/policy_generation/duration_avg_s: 13.589468416663294
  main_perf/continuous_rollouts/policy_generation/duration_max_s: 254.07726019609254
  main_perf/continuous_rollouts/total_duration_avg_s: 14.894019355273485
  main_perf/continuous_rollouts/total_duration_max_s: 255.39975000789855
  main_perf/continuous_training/drop_weights/duration_avg_s: 8.684610838070512
  main_perf/continuous_training/drop_weights/duration_max_s: 8.684610838070512
  main_perf/continuous_training/push_weights/duration_avg_s: 16.870715045020916
  main_perf/continuous_training/push_weights/duration_max_s: 16.870715045020916
  main_perf/continuous_training/total_duration_avg_s: 479.61130712600425
  main_perf/continuous_training/total_duration_max_s: 479.61130712600425
  main_perf/continuous_training/train_step/duration_avg_s: 4.577446123003028
  main_perf/continuous_training/train_step/duration_max_s: 4.577446123003028
  main_perf/continuous_training/update_weights/duration_avg_s: 255.71232606796548
  main_perf/continuous_training/update_weights/duration_max_s: 255.71232606796548
  main_perf/continuous_training/waiting_for_buffer/duration_avg_s: 193.76620576600544
  main_perf/continuous_training/waiting_for_buffer/duration_max_s: 193.76620576600544
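
To make it easier to see where the training-loop time goes, here is a minimal sketch that ranks the `continuous_training` phases by their share of `total_duration_avg_s`. The values are copied from the logs above. Treating these five phases as the full set of components of the total is an assumption based on the metric names, and the `breakdown` helper is hypothetical, not part of the codebase.

```python
# Timings (seconds) copied from the continuous_training perf logs above.
phases = {
    "update_weights": 255.71232606796548,
    "waiting_for_buffer": 193.76620576600544,
    "push_weights": 16.870715045020916,
    "drop_weights": 8.684610838070512,
    "train_step": 4.577446123003028,
}
total = 479.61130712600425  # continuous_training/total_duration_avg_s


def breakdown(phases, total):
    """Return (name, seconds, share_of_total) tuples, largest phase first."""
    rows = [(name, secs, secs / total) for name, secs in phases.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


rows = breakdown(phases, total)
for name, secs, share in rows:
    print(f"{name:<20} {secs:8.2f}s  {share:6.1%}")

# Assumption check: how much of the total is not covered by the listed phases.
unaccounted = total - sum(phases.values())
print(f"unaccounted: {unaccounted:.6f}s")
```

With these numbers, `update_weights` and `waiting_for_buffer` together account for roughly 94% of the training loop time, while `train_step` itself is under 1%, so those two phases look like the first places to optimize.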

@meta-cla meta-cla bot added the CLA Signed label Oct 3, 2025
dtype: bfloat16
gc_freq: 1
compile:
  enable: false
Contributor

Random question: have we tried enabling compile in any of our runs? Just curious whether it runs successfully (and if so, whether there are any perf benefits).

Contributor Author

We haven't, but I would also be interested.

@allenwang28 allenwang28 merged commit 8cb21be into meta-pytorch:main Oct 4, 2025
7 checks passed
@allenwang28 allenwang28 deleted the qwen32_config branch October 4, 2025 00:25