Skip to content

[GRPO app] Logging hygiene request #372

@allenwang28

Description

@allenwang28

🐛 Describe the bug

As I run GRPO app, I had some ideas on some logging hygiene that could help us as we continue to test.

For context, what we see right now:

>>> Aggregated Logs (2025-10-10 14:04:57) >>>
[3 similar log lines] [PolicyWorker-0/4] 2025-10-10 14:04:57 INFO [PolicyWorker::update] start updating weights to version 3
[4 similar log lines] [PolicyWorker-0/4] 2025-10-10 14:04:57 INFO Loading policy_ver_0000000003.dcp_whole_state_dict from DCP with handle policy_ver_0000000003.dcp_whole_state_dict
<<< Aggregated Logs (2025-10-10 14:05:00) <<<

[1] [ReferenceModel-1/4] 2025-10-10 14:05:01 INFO [GC] Performing periodic GC collection took 0.01 seconds
[0] [ReferenceModel-0/4] 2025-10-10 14:05:01 INFO [GC] Performing periodic GC collection took 0.02 seconds
[2] [ReferenceModel-2/4] 2025-10-10 14:05:01 INFO [GC] Performing periodic GC collection took 0.02 seconds
[3] [ReferenceModel-3/4] 2025-10-10 14:05:01 INFO [GC] Performing periodic GC collection took 0.02 seconds
[2] [ReferenceModel-2/4] 2025-10-10 14:05:03 INFO [GC] Performing periodic GC collection took 0.01 seconds
[0] [ReferenceModel-0/4] 2025-10-10 14:05:03 INFO [GC] Performing periodic GC collection took 0.01 seconds
[3] [ReferenceModel-3/4] 2025-10-10 14:05:03 INFO [GC] Performing periodic GC collection took 0.01 seconds
[1] [ReferenceModel-1/4] 2025-10-10 14:05:03 INFO [GC] Performing periodic GC collection took 0.01 seconds
>>> Aggregated Logs (2025-10-10 14:05:00) >>>
[3 similar log lines] [PolicyWorker-2/4] 2025-10-10 14:05:00 INFO [PolicyWorker::update] start updating weights to version 3
[4 similar log lines] [PolicyWorker-0/4] 2025-10-10 14:05:00 INFO Loading policy_ver_0000000003.dcp_whole_state_dict from DCP with handle policy_ver_0000000003.dcp_whole_state_dict
<<< Aggregated Logs (2025-10-10 14:05:03) <<<

[2] [ReferenceModel-2/4] 2025-10-10 14:05:04 INFO [GC] Performing periodic GC collection took 0.02 seconds
[3] [ReferenceModel-3/4] 2025-10-10 14:05:04 INFO [GC] Performing periodic GC collection took 0.01 seconds
[0] [ReferenceModel-0/4] 2025-10-10 14:05:04 INFO [GC] Performing periodic GC collection took 0.01 seconds
[1] [ReferenceModel-1/4] 2025-10-10 14:05:04 INFO [GC] Performing periodic GC collection took 0.01 seconds
[2] [ReferenceModel-2/4] 2025-10-10 14:05:06 INFO [GC] Performing periodic GC collection took 0.01 seconds
[1] [ReferenceModel-1/4] 2025-10-10 14:05:06 INFO [GC] Performing periodic GC collection took 0.01 seconds
[0] [ReferenceModel-0/4] 2025-10-10 14:05:06 INFO [GC] Performing periodic GC collection took 0.01 seconds
[3] [ReferenceModel-3/4] 2025-10-10 14:05:06 INFO [GC] Performing periodic GC collection took 0.01 seconds
[3] [ReferenceModel-3/4] 2025-10-10 14:05:08 INFO [GC] Performing periodic GC collection took 0.01 seconds
[1] [ReferenceModel-1/4] 2025-10-10 14:05:08 INFO [GC] Performing periodic GC collection took 0.01 seconds
[2] [ReferenceModel-2/4] 2025-10-10 14:05:08 INFO [GC] Performing periodic GC collection took 0.01 seconds
[0] [ReferenceModel-0/4] 2025-10-10 14:05:08 INFO [GC] Performing periodic GC collection took 0.01 seconds
[0] [ReferenceModel-0/4] 2025-10-10 14:05:09 INFO [GC] Performing periodic GC collection took 0.01 seconds
[1] [ReferenceModel-1/4] 2025-10-10 14:05:09 INFO [GC] Performing periodic GC collection took 0.01 seconds
[3] [ReferenceModel-3/4] 2025-10-10 14:05:09 INFO [GC] Performing periodic GC collection took 0.01 seconds
[2] [ReferenceModel-2/4] 2025-10-10 14:05:09 INFO [GC] Performing periodic GC collection took 0.02 seconds
[0] [ReferenceModel-0/4] 2025-10-10 14:05:11 INFO [GC] Performing periodic GC collection took 0.01 seconds
[1] [ReferenceModel-1/4] 2025-10-10 14:05:11 INFO [GC] Performing periodic GC collection took 0.01 seconds
[3] [ReferenceModel-3/4] 2025-10-10 14:05:11 INFO [GC] Performing periodic GC collection took 0.01 seconds
[2] [ReferenceModel-2/4] 2025-10-10 14:05:11 INFO [GC] Performing periodic GC collection took 0.02 seconds
[3] [ReferenceModel-3/4] 2025-10-10 14:05:13 INFO [GC] Performing periodic GC collection took 0.01 seconds
[2] [ReferenceModel-2/4] 2025-10-10 14:05:13 INFO [GC] Performing periodic GC collection took 0.01 seconds
[1] [ReferenceModel-1/4] 2025-10-10 14:05:13 INFO [GC] Performing periodic GC collection took 0.01 seconds
[0] [ReferenceModel-0/4] 2025-10-10 14:05:13 INFO [GC] Performing periodic GC collection took 0.01 seconds
[0] [ReferenceModel-0/4] 2025-10-10 14:05:14 INFO [GC] Performing periodic GC collection took 0.01 seconds
[1] [ReferenceModel-1/4] 2025-10-10 14:05:14 INFO [GC] Performing periodic GC collection took 0.01 seconds
[2] [ReferenceModel-2/4] 2025-10-10 14:05:14 INFO [GC] Performing periodic GC collection took 0.01 seconds
[3] [ReferenceModel-3/4] 2025-10-10 14:05:14 INFO [GC] Performing periodic GC collection took 0.01 seconds
[3] [ReferenceModel-3/4] 2025-10-10 14:05:16 INFO [GC] Performing periodic GC collection took 0.01 seconds
[0] [ReferenceModel-0/4] 2025-10-10 14:05:16 INFO [GC] Performing periodic GC collection took 0.01 seconds
[1] [ReferenceModel-1/4] 2025-10-10 14:05:16 INFO [GC] Performing periodic GC collection took 0.01 seconds
[2] [ReferenceModel-2/4] 2025-10-10 14:05:16 INFO [GC] Performing periodic GC collection took 0.01 seconds
[2] [ReferenceModel-2/4] 2025-10-10 14:05:18 INFO [GC] Performing periodic GC collection took 0.01 seconds
[1] [ReferenceModel-1/4] 2025-10-10 14:05:18 INFO [GC] Performing periodic GC collection took 0.01 seconds
[0] [ReferenceModel-0/4] 2025-10-10 14:05:18 INFO [GC] Performing periodic GC collection took 0.01 seconds
[3] [ReferenceModel-3/4] 2025-10-10 14:05:18 INFO [GC] Performing periodic GC collection took 0.01 seconds
[1] [ReferenceModel-1/4] 2025-10-10 14:05:19 INFO [GC] Performing periodic GC collection took 0.01 seconds
[3] [ReferenceModel-3/4] 2025-10-10 14:05:19 INFO [GC] Performing periodic GC collection took 0.01 seconds
[2] [ReferenceModel-2/4] 2025-10-10 14:05:19 INFO [GC] Performing periodic GC collection took 0.01 seconds
[0] [ReferenceModel-0/4] 2025-10-10 14:05:19 INFO [GC] Performing periodic GC collection took 0.01 seconds
[0] [ReferenceModel-0/4] 2025-10-10 14:05:21 INFO [GC] Performing periodic GC collection took 0.01 seconds
[1] [ReferenceModel-1/4] 2025-10-10 14:05:21 INFO [GC] Performing periodic GC collection took 0.01 seconds
[2] [ReferenceModel-2/4] 2025-10-10 14:05:21 INFO [GC] Performing periodic GC collection took 0.01 seconds
[3] [ReferenceModel-3/4] 2025-10-10 14:05:21 INFO [GC] Performing periodic GC collection took 0.01 seconds
[3] [ReferenceModel-3/4] 2025-10-10 14:05:23 INFO [GC] Performing periodic GC collection took 0.01 seconds
[1] [ReferenceModel-1/4] 2025-10-10 14:05:23 INFO [GC] Performing periodic GC collection took 0.01 seconds
[0] [ReferenceModel-0/4] 2025-10-10 14:05:23 INFO [GC] Performing periodic GC collection took 0.01 seconds
[2] [ReferenceModel-2/4] 2025-10-10 14:05:23 INFO [GC] Performing periodic GC collection took 0.01 seconds
[2] [ReferenceModel-2/4] 2025-10-10 14:05:24 INFO [GC] Performing periodic GC collection took 0.01 seconds
[1] [ReferenceModel-1/4] 2025-10-10 14:05:24 INFO [GC] Performing periodic GC collection took 0.01 seconds
[0] [ReferenceModel-0/4] 2025-10-10 14:05:24 INFO [GC] Performing periodic GC collection took 0.02 seconds
[3] [ReferenceModel-3/4] 2025-10-10 14:05:24 INFO [GC] Performing periodic GC collection took 0.01 seconds
[0] [ReferenceModel-0/4] 2025-10-10 14:05:26 INFO [GC] Performing periodic GC collection took 0.02 seconds
[3] [ReferenceModel-3/4] 2025-10-10 14:05:26 INFO [GC] Performing periodic GC collection took 0.01 seconds
[1] [ReferenceModel-1/4] 2025-10-10 14:05:26 INFO [GC] Performing periodic GC collection took 0.01 seconds
[2] [ReferenceModel-2/4] 2025-10-10 14:05:26 INFO [GC] Performing periodic GC collection took 0.01 seconds
[3] [ReferenceModel-3/4] 2025-10-10 14:05:28 INFO [GC] Performing periodic GC collection took 0.01 seconds
[0] [ReferenceModel-0/4] 2025-10-10 14:05:28 INFO [GC] Performing periodic GC collection took 0.01 seconds
[1] [ReferenceModel-1/4] 2025-10-10 14:05:28 INFO [GC] Performing periodic GC collection took 0.01 seconds
[2] [ReferenceModel-2/4] 2025-10-10 14:05:28 INFO [GC] Performing periodic GC collection took 0.01 seconds
[2] [ReferenceModel-2/4] 2025-10-10 14:05:29 INFO [GC] Performing periodic GC collection took 0.01 seconds
[0] [ReferenceModel-0/4] 2025-10-10 14:05:29 INFO [GC] Performing periodic GC collection took 0.01 seconds
[3] [ReferenceModel-3/4] 2025-10-10 14:05:29 INFO [GC] Performing periodic GC collection took 0.01 seconds
[1] [ReferenceModel-1/4] 2025-10-10 14:05:29 INFO [GC] Performing periodic GC collection took 0.01 seconds
[3] [ReferenceModel-3/4] 2025-10-10 14:05:31 INFO [GC] Performing periodic GC collection took 0.01 seconds
[2] [ReferenceModel-2/4] 2025-10-10 14:05:31 INFO [GC] Performing periodic GC collection took 0.01 seconds
[0] [ReferenceModel-0/4] 2025-10-10 14:05:31 INFO [GC] Performing periodic GC collection took 0.01 seconds
[1] [ReferenceModel-1/4] 2025-10-10 14:05:31 INFO [GC] Performing periodic GC collection took 0.01 seconds
[3] [ReferenceModel-3/4] 2025-10-10 14:05:33 INFO [GC] Performing periodic GC collection took 0.01 seconds
[2] [ReferenceModel-2/4] 2025-10-10 14:05:33 INFO [GC] Performing periodic GC collection took 0.01 seconds
[1] [ReferenceModel-1/4] 2025-10-10 14:05:33 INFO [GC] Performing periodic GC collection took 0.01 seconds
[0] [ReferenceModel-0/4] 2025-10-10 14:05:33 INFO [GC] Performing periodic GC collection took 0.01 seconds
[0] [ReferenceModel-0/4] 2025-10-10 14:05:34 INFO [GC] Performing periodic GC collection took 0.01 seconds
[2] [ReferenceModel-2/4] 2025-10-10 14:05:35 INFO [GC] Performing periodic GC collection took 0.01 seconds
[1] [ReferenceModel-1/4] 2025-10-10 14:05:35 INFO [GC] Performing periodic GC collection took 0.01 seconds
[3] [ReferenceModel-3/4] 2025-10-10 14:05:35 INFO [GC] Performing periodic GC collection took 0.01 seconds
[0] INFO 10-10 14:07:19 [block_pool.py:321] Successfully reset prefix cache
[0] [Policy-0/1] 2025-10-10 14:07:19 INFO Weight update completed (now v3)
[0] [ReferenceModel-0/4] 2025-10-10 14:07:42 INFO [GC] Performing periodic GC collection took 0.02 seconds
[1] [ReferenceModel-1/4] 2025-10-10 14:07:42 INFO [GC] Performing periodic GC collection took 0.02 seconds
[3] [ReferenceModel-3/4] 2025-10-10 14:07:42 INFO [GC] Performing periodic GC collection took 0.02 seconds
[2] [ReferenceModel-2/4] 2025-10-10 14:07:42 INFO [GC] Performing periodic GC collection took 0.02 seconds
[1] [ReferenceModel-1/4] 2025-10-10 14:07:44 INFO [GC] Performing periodic GC collection took 0.02 seconds
[0] [ReferenceModel-0/4] 2025-10-10 14:07:44 INFO [GC] Performing periodic GC collection took 0.01 seconds
[2] [ReferenceModel-2/4] 2025-10-10 14:07:44 INFO [GC] Performing periodic GC collection took 0.02 seconds
[3] [ReferenceModel-3/4] 2025-10-10 14:07:44 INFO [GC] Performing periodic GC collection took 0.02 seconds
[3] [ReferenceModel-3/4] 2025-10-10 14:07:46 INFO [GC] Performing periodic GC collection took 0.01 seconds
[1] [ReferenceModel-1/4] 2025-10-10 14:07:46 INFO [GC] Performing periodic GC collection took 0.01 seconds
[2] [ReferenceModel-2/4] 2025-10-10 14:07:46 INFO [GC] Performing periodic GC collection took 0.01 seconds
[0] [ReferenceModel-0/4] 2025-10-10 14:07:46 INFO [GC] Performing periodic GC collection took 0.01 seconds
[2] [ReferenceModel-2/4] 2025-10-10 14:07:47 INFO [GC] Performing periodic GC collection took 0.01 seconds
[0] [ReferenceModel-0/4] 2025-10-10 14:07:47 INFO [GC] Performing periodic GC collection took 0.01 seconds
[3] [ReferenceModel-3/4] 2025-10-10 14:07:47 INFO [GC] Performing periodic GC collection took 0.01 seconds
[1] [ReferenceModel-1/4] 2025-10-10 14:07:47 INFO [GC] Performing periodic GC collection took 0.01 seconds
[1] [ReferenceModel-1/4] 2025-10-10 14:07:49 INFO [GC] Performing periodic GC collection took 0.01 seconds
[0] [ReferenceModel-0/4] 2025-10-10 14:07:49 INFO [GC] Performing periodic GC collection took 0.01 seconds
[3] [ReferenceModel-3/4] 2025-10-10 14:07:49 INFO [GC] Performing periodic GC collection took 0.01 seconds
[2] [ReferenceModel-2/4] 2025-10-10 14:07:49 INFO [GC] Performing periodic GC collection took 0.01 seconds
[0] [ReferenceModel-0/4] 2025-10-10 14:07:54 INFO [GC] Performing periodic GC collection took 0.02 seconds
[1] [ReferenceModel-1/4] 2025-10-10 14:07:54 INFO [GC] Performing periodic GC collection took 0.01 seconds
[2] [ReferenceModel-2/4] 2025-10-10 14:07:54 INFO [GC] Performing periodic GC collection took 0.02 seconds
[3] [ReferenceModel-3/4] 2025-10-10 14:07:54 INFO [GC] Performing periodic GC collection took 0.02 seconds
[0] [ReferenceModel-0/4] 2025-10-10 14:07:59 INFO [GC] Performing periodic GC collection took 0.02 seconds
[3] [ReferenceModel-3/4] 2025-10-10 14:07:59 INFO [GC] Performing periodic GC collection took 0.02 seconds
[2] [ReferenceModel-2/4] 2025-10-10 14:07:59 INFO [GC] Performing periodic GC collection took 0.02 seconds
[1] [ReferenceModel-1/4] 2025-10-10 14:07:59 INFO [GC] Performing periodic GC collection took 0.02 seconds
[0] INFO 10-10 14:08:00 [block_pool.py:321] Successfully reset prefix cache
[0] [Policy-0/1] 2025-10-10 14:08:00 INFO Weight update completed (now v3)
Dropping weights @ version 2
Dropped weights @ version 2, took 5.83 seconds
WandbBackend: Logged 101 metrics at global_step 3
=== [global_logger_2aeC_r0] - METRICS STEP 3 ===
  buffer/add/count_episodes_added: 786.0
  buffer/evict/avg_policy_staleness: 1.0
  buffer/evict/max_policy_staleness: 1.0
  buffer/evict/sum_episodes_evicted: 784.0
  buffer/sample/avg_buffer_utilization: 592.0
  buffer/sample/avg_buffer_utilization_pct: 5.405405405405405
  buffer/sample/count_sample_requests: 1.0
  buffer_perf/sample/total_duration_avg_s: 0.0055667751003056765
  buffer_perf/sample/total_duration_max_s: 0.0055667751003056765
  dataset/sample/avg_sample_len: 515.0816326530612
  dataset/sample/count_samples_generated: 49.0
  main/continuous_rollouts/count_rollout_iterations: 49.0
  main_perf/continuous_rollouts/compute_logprobs/duration_avg_s: 5.996471974162423e-05
  main_perf/continuous_rollouts/compute_logprobs/duration_max_s: 0.00010304898023605347
  main_perf/continuous_rollouts/data_loading/duration_avg_s: 0.0013156006996500858
  main_perf/continuous_rollouts/data_loading/duration_max_s: 0.0015002100262790918
  main_perf/continuous_rollouts/policy_generation/duration_avg_s: 137.51652606647957
  main_perf/continuous_rollouts/policy_generation/duration_max_s: 199.73020377987996
  main_perf/continuous_rollouts/reference_model_calculate_logprobs/duration_avg_s: 16.182050521685078
  main_perf/continuous_rollouts/reference_model_calculate_logprobs/duration_max_s: 36.03378381417133
  main_perf/continuous_rollouts/reward_evaluation/duration_avg_s: 0.38161538480020757
  main_perf/continuous_rollouts/reward_evaluation/duration_max_s: 1.0126315460074693
  main_perf/continuous_rollouts/total_duration_avg_s: 154.18965516618587
  main_perf/continuous_rollouts/total_duration_max_s: 213.60761165595613
  main_perf/continuous_training/drop_weights/duration_avg_s: 5.830546634970233
  main_perf/continuous_training/drop_weights/duration_max_s: 5.830546634970233
  main_perf/continuous_training/push_weights/duration_avg_s: 14.282035800861195
  main_perf/continuous_training/push_weights/duration_max_s: 14.282035800861195
  main_perf/continuous_training/total_duration_avg_s: 224.97223744285293
  main_perf/continuous_training/total_duration_max_s: 224.97223744285293
  main_perf/continuous_training/train_step/duration_avg_s: 7.712055152049288
  main_perf/continuous_training/train_step/duration_max_s: 7.712055152049288
  main_perf/continuous_training/update_weights/duration_avg_s: 197.13718346506357
  main_perf/continuous_training/update_weights/duration_max_s: 197.13718346506357
  main_perf/continuous_training/waiting_for_buffer/duration_avg_s: 0.01041399291716516
  main_perf/continuous_training/waiting_for_buffer/duration_max_s: 0.01041399291716516
  policy/generate/avg_tokens_generated: 800.0061274509804
  policy/generate/count_requests: 42.0
  policy/generate/count_sequences_completed: 816.0
  policy/generate/sum_tokens_generated: 652805.0
  policy/update_weights/count_weight_updates: 2.0
  policy_perf/generate/end/duration_avg_s: 2.7680000057443975e-06
  policy_perf/generate/end/duration_max_s: 2.816000021994114e-06
  policy_perf/generate/generate/duration_avg_s: 74.18627746332908
  policy_perf/generate/generate/duration_max_s: 199.69165625
  policy_perf/generate/process_inputs/duration_avg_s: 0.0009526073742457029
  policy_perf/generate/process_inputs/duration_max_s: 0.002440608024597168
  policy_perf/generate/prompt_truncation/duration_avg_s: 1.6781176540854517e-05
  policy_perf/generate/prompt_truncation/duration_max_s: 6.841599941253662e-05
  policy_perf/generate/total_duration_avg_s: 72.07658744781732
  policy_perf/generate/total_duration_max_s: 199.6942355780003
  policy_perf/update_weights/avg_pending_requests: 7.0
  policy_perf/update_weights/max_pending_requests: 8.0
  policy_worker_perf/update/total_duration_avg_s: 161.0851328125
  policy_worker_perf/update/total_duration_max_s: 180.38240625
  reference_perf/forward/avg_sequence_length: 2048.0
  reference_perf/forward/compute_logprobs/duration_avg_s: 0.026915128278732294
  reference_perf/forward/compute_logprobs/duration_max_s: 0.02828870391845703
  reference_perf/forward/count_forward_passes: 200.0
  reference_perf/forward/forward/duration_avg_s: 1.6151581799316403
  reference_perf/forward/forward/duration_max_s: 1.768312744140625
  reference_perf/forward/garbage_collection/duration_avg_s: 0.014899226388931274
  reference_perf/forward/garbage_collection/duration_max_s: 0.017250463485717772
  reference_perf/forward/memory_delta_end_start_avg_gb: 9.273681640625
  reference_perf/forward/memory_peak_max_gb: 47.74769067764282
  reference_perf/forward/to_device/duration_avg_s: 0.00029641920007765293
  reference_perf/forward/to_device/duration_max_s: 0.0004175359904766083
  reference_perf/forward/total_duration_avg_s: 1.657271959719452
  reference_perf/forward/total_duration_max_s: 1.811123239449691
  reward/evaluate_response/avg_MathReward_reward: 0.6261786600496322
  reward/evaluate_response/avg_ThinkingReward_reward: 0.7697270471464038
  reward/evaluate_response/avg_total_reward: 0.6979528535980215
  reward/evaluate_response/count_MathReward_calls: 806.0
  reward/evaluate_response/count_ThinkingReward_calls: 806.0
  reward/evaluate_response/std_MathReward_reward: 0.45319055073623077
  reward/evaluate_response/std_ThinkingReward_reward: 0.3622053691857174
  reward/evaluate_response/sum_MathReward_reward: 504.70000000000357
  reward/evaluate_response/sum_ThinkingReward_reward: 620.4000000000015
  rl_trainer/avg_grpo_loss: 0.16886521875858307
  rl_trainer/count_training_steps: 8.0
  rl_trainer/learning_rate: 0.001
  rl_trainer_perf/push_weights/dcp_save/duration_avg_s: 14.125432128906251
  rl_trainer_perf/push_weights/dcp_save/duration_max_s: 14.2749453125
  rl_trainer_perf/push_weights/flatten_state_dict/duration_avg_s: 0.0010219000056385995
  rl_trainer_perf/push_weights/flatten_state_dict/duration_max_s: 0.001112671971321106
  rl_trainer_perf/push_weights/memory_delta_end_start_avg_gb: 0.0
  rl_trainer_perf/push_weights/memory_peak_max_gb: 22.953386306762695
  rl_trainer_perf/push_weights/to_hf/duration_avg_s: 0.001430628001689911
  rl_trainer_perf/push_weights/to_hf/duration_max_s: 0.001469920039176941
  rl_trainer_perf/push_weights/total_duration_avg_s: 14.127981416915546
  rl_trainer_perf/push_weights/total_duration_max_s: 14.27768739247322
  rl_trainer_perf/step/forward_backward/duration_avg_s: 7.619148498535156
  rl_trainer_perf/step/forward_backward/duration_max_s: 7.62258251953125
  rl_trainer_perf/step/memory_delta_end_start_avg_gb: 0.0008854866027832031
  rl_trainer_perf/step/memory_peak_max_gb: 68.14654541015625
  rl_trainer_perf/step/optimizer_step/duration_avg_s: 0.03404809999465942
  rl_trainer_perf/step/optimizer_step/duration_max_s: 0.034347774505615235
  rl_trainer_perf/step/save_checkpoint/duration_avg_s: 0.003174496024847031
  rl_trainer_perf/step/save_checkpoint/duration_max_s: 0.003422719955444336
  rl_trainer_perf/step/total_duration_avg_s: 7.656416046554223
  rl_trainer_perf/step/total_duration_max_s: 7.659922744065523
==============================

A few challenges:

  • We enable garbage collection per step in torchtitan (both trainer and reference model), but it ideally shouldn't be logged at all.
  • We have very comprehensive observability! but I think we can minimize the number of metrics that are being logged both in wandb and console. What is difficult is that I think all of the metrics will be useful at some point - but if I'm analyzing the training stability and system throughput, here is a starting proposal:
=== [global_logger_2aeC_r0] - METRICS STEP 3 ===
  buffer/add/count_episodes_added: 786.0
  buffer/evict/avg_policy_staleness: 1.0
  buffer/evict/max_policy_staleness: 1.0
  buffer/evict/sum_episodes_evicted: 784.0
  main/continuous_rollouts/count_rollout_iterations: 49.0
  main_perf/continuous_rollouts/policy_generation/duration_avg_s: 137.51652606647957
  main_perf/continuous_rollouts/policy_generation/duration_max_s: 199.73020377987996
  main_perf/continuous_rollouts/reference_model_calculate_logprobs/duration_avg_s: 16.182050521685078
  main_perf/continuous_rollouts/reference_model_calculate_logprobs/duration_max_s: 36.03378381417133
  main_perf/continuous_rollouts/total_duration_avg_s: 154.18965516618587
  main_perf/continuous_rollouts/total_duration_max_s: 213.60761165595613
  main_perf/continuous_training/drop_weights/duration_avg_s: 5.830546634970233
  main_perf/continuous_training/drop_weights/duration_max_s: 5.830546634970233
  main_perf/continuous_training/push_weights/duration_avg_s: 14.282035800861195
  main_perf/continuous_training/push_weights/duration_max_s: 14.282035800861195
  main_perf/continuous_training/total_duration_avg_s: 224.97223744285293
  main_perf/continuous_training/total_duration_max_s: 224.97223744285293
  main_perf/continuous_training/train_step/duration_avg_s: 7.712055152049288
  main_perf/continuous_training/train_step/duration_max_s: 7.712055152049288
  main_perf/continuous_training/update_weights/duration_avg_s: 197.13718346506357
  main_perf/continuous_training/update_weights/duration_max_s: 197.13718346506357
  main_perf/continuous_training/waiting_for_buffer/duration_avg_s: 0.01041399291716516
  main_perf/continuous_training/waiting_for_buffer/duration_max_s: 0.01041399291716516
  policy/generate/avg_tokens_generated: 800.0061274509804
  policy/generate/count_requests: 42.0
  policy/generate/count_sequences_completed: 816.0
  policy_perf/generate/generate/duration_avg_s: 74.18627746332908
  policy_perf/generate/generate/duration_max_s: 199.69165625
  policy_perf/generate/total_duration_avg_s: 72.07658744781732
  policy_perf/generate/total_duration_max_s: 199.6942355780003
  policy_worker_perf/update/total_duration_avg_s: 161.0851328125
  policy_worker_perf/update/total_duration_max_s: 180.38240625
  reference_perf/forward/memory_delta_end_start_avg_gb: 9.273681640625
  reference_perf/forward/memory_peak_max_gb: 47.74769067764282
  reference_perf/forward/total_duration_avg_s: 1.657271959719452
  reference_perf/forward/total_duration_max_s: 1.811123239449691
  reward/evaluate_response/avg_MathReward_reward: 0.6261786600496322
  reward/evaluate_response/avg_ThinkingReward_reward: 0.7697270471464038
  reward/evaluate_response/avg_total_reward: 0.6979528535980215
  reward/evaluate_response/std_MathReward_reward: 0.45319055073623077
  reward/evaluate_response/std_ThinkingReward_reward: 0.3622053691857174
  reward/evaluate_response/sum_MathReward_reward: 504.70000000000357
  reward/evaluate_response/sum_ThinkingReward_reward: 620.4000000000015
  rl_trainer/avg_grpo_loss: 0.16886521875858307
  rl_trainer/count_training_steps: 8.0
  rl_trainer/learning_rate: 0.001
  rl_trainer_perf/push_weights/dcp_save/duration_avg_s: 14.125432128906251
  rl_trainer_perf/push_weights/dcp_save/duration_max_s: 14.2749453125
  rl_trainer_perf/push_weights/to_hf/duration_avg_s: 0.001430628001689911
  rl_trainer_perf/push_weights/to_hf/duration_max_s: 0.001469920039176941
  rl_trainer_perf/push_weights/total_duration_avg_s: 14.127981416915546
  rl_trainer_perf/push_weights/total_duration_max_s: 14.27768739247322
  rl_trainer_perf/step/forward_backward/duration_avg_s: 7.619148498535156
  rl_trainer_perf/step/forward_backward/duration_max_s: 7.62258251953125
  rl_trainer_perf/step/memory_delta_end_start_avg_gb: 0.0008854866027832031
  rl_trainer_perf/step/memory_peak_max_gb: 68.14654541015625
  rl_trainer_perf/step/save_checkpoint/duration_avg_s: 0.003174496024847031
  rl_trainer_perf/step/save_checkpoint/duration_max_s: 0.003422719955444336
  rl_trainer_perf/step/total_duration_avg_s: 7.656416046554223
  rl_trainer_perf/step/total_duration_max_s: 7.659922744065523
==============================

idk if we can or should introduce the concept of "metric verbosity" or something like that, but overall it's a bit difficult to find the information I'm looking for

Versions

No response

Metadata

Metadata

Labels

better engineeringTasks which help improve eng productivity e.g. building tools, cleaning up code, writing docsdiscussionStart a discussion

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions