-
Notifications
You must be signed in to change notification settings - Fork 16
Open
Labels
Better EngineeringTasks which help improve eng productivity e.g. building tools, cleaning up code, writing docsTasks which help improve eng productivity e.g. building tools, cleaning up code, writing docs
Description
🐛 Describe the bug
As I run GRPO app, I had some ideas on some logging hygiene that could help us as we continue to test.
For context, what we see right now:
>>> Aggregated Logs (2025-10-10 14:04:57) >>>
[3 similar log lines] [PolicyWorker-0/4] 2025-10-10 14:04:57 INFO [PolicyWorker::update] start updating weights to version 3
[4 similar log lines] [PolicyWorker-0/4] 2025-10-10 14:04:57 INFO Loading policy_ver_0000000003.dcp_whole_state_dict from DCP with handle policy_ver_0000000003.dcp_whole_state_dict
<<< Aggregated Logs (2025-10-10 14:05:00) <<<
[1] [ReferenceModel-1/4] 2025-10-10 14:05:01 INFO [GC] Performing periodic GC collection took 0.01 seconds
[0] [ReferenceModel-0/4] 2025-10-10 14:05:01 INFO [GC] Performing periodic GC collection took 0.02 seconds
[2] [ReferenceModel-2/4] 2025-10-10 14:05:01 INFO [GC] Performing periodic GC collection took 0.02 seconds
[3] [ReferenceModel-3/4] 2025-10-10 14:05:01 INFO [GC] Performing periodic GC collection took 0.02 seconds
[2] [ReferenceModel-2/4] 2025-10-10 14:05:03 INFO [GC] Performing periodic GC collection took 0.01 seconds
[0] [ReferenceModel-0/4] 2025-10-10 14:05:03 INFO [GC] Performing periodic GC collection took 0.01 seconds
[3] [ReferenceModel-3/4] 2025-10-10 14:05:03 INFO [GC] Performing periodic GC collection took 0.01 seconds
[1] [ReferenceModel-1/4] 2025-10-10 14:05:03 INFO [GC] Performing periodic GC collection took 0.01 seconds
>>> Aggregated Logs (2025-10-10 14:05:00) >>>
[3 similar log lines] [PolicyWorker-2/4] 2025-10-10 14:05:00 INFO [PolicyWorker::update] start updating weights to version 3
[4 similar log lines] [PolicyWorker-0/4] 2025-10-10 14:05:00 INFO Loading policy_ver_0000000003.dcp_whole_state_dict from DCP with handle policy_ver_0000000003.dcp_whole_state_dict
<<< Aggregated Logs (2025-10-10 14:05:03) <<<
[2] [ReferenceModel-2/4] 2025-10-10 14:05:04 INFO [GC] Performing periodic GC collection took 0.02 seconds
[3] [ReferenceModel-3/4] 2025-10-10 14:05:04 INFO [GC] Performing periodic GC collection took 0.01 seconds
[0] [ReferenceModel-0/4] 2025-10-10 14:05:04 INFO [GC] Performing periodic GC collection took 0.01 seconds
[1] [ReferenceModel-1/4] 2025-10-10 14:05:04 INFO [GC] Performing periodic GC collection took 0.01 seconds
[2] [ReferenceModel-2/4] 2025-10-10 14:05:06 INFO [GC] Performing periodic GC collection took 0.01 seconds
[1] [ReferenceModel-1/4] 2025-10-10 14:05:06 INFO [GC] Performing periodic GC collection took 0.01 seconds
[0] [ReferenceModel-0/4] 2025-10-10 14:05:06 INFO [GC] Performing periodic GC collection took 0.01 seconds
[3] [ReferenceModel-3/4] 2025-10-10 14:05:06 INFO [GC] Performing periodic GC collection took 0.01 seconds
[3] [ReferenceModel-3/4] 2025-10-10 14:05:08 INFO [GC] Performing periodic GC collection took 0.01 seconds
[1] [ReferenceModel-1/4] 2025-10-10 14:05:08 INFO [GC] Performing periodic GC collection took 0.01 seconds
[2] [ReferenceModel-2/4] 2025-10-10 14:05:08 INFO [GC] Performing periodic GC collection took 0.01 seconds
[0] [ReferenceModel-0/4] 2025-10-10 14:05:08 INFO [GC] Performing periodic GC collection took 0.01 seconds
[0] [ReferenceModel-0/4] 2025-10-10 14:05:09 INFO [GC] Performing periodic GC collection took 0.01 seconds
[1] [ReferenceModel-1/4] 2025-10-10 14:05:09 INFO [GC] Performing periodic GC collection took 0.01 seconds
[3] [ReferenceModel-3/4] 2025-10-10 14:05:09 INFO [GC] Performing periodic GC collection took 0.01 seconds
[2] [ReferenceModel-2/4] 2025-10-10 14:05:09 INFO [GC] Performing periodic GC collection took 0.02 seconds
[0] [ReferenceModel-0/4] 2025-10-10 14:05:11 INFO [GC] Performing periodic GC collection took 0.01 seconds
[1] [ReferenceModel-1/4] 2025-10-10 14:05:11 INFO [GC] Performing periodic GC collection took 0.01 seconds
[3] [ReferenceModel-3/4] 2025-10-10 14:05:11 INFO [GC] Performing periodic GC collection took 0.01 seconds
[2] [ReferenceModel-2/4] 2025-10-10 14:05:11 INFO [GC] Performing periodic GC collection took 0.02 seconds
[3] [ReferenceModel-3/4] 2025-10-10 14:05:13 INFO [GC] Performing periodic GC collection took 0.01 seconds
[2] [ReferenceModel-2/4] 2025-10-10 14:05:13 INFO [GC] Performing periodic GC collection took 0.01 seconds
[1] [ReferenceModel-1/4] 2025-10-10 14:05:13 INFO [GC] Performing periodic GC collection took 0.01 seconds
[0] [ReferenceModel-0/4] 2025-10-10 14:05:13 INFO [GC] Performing periodic GC collection took 0.01 seconds
[0] [ReferenceModel-0/4] 2025-10-10 14:05:14 INFO [GC] Performing periodic GC collection took 0.01 seconds
[1] [ReferenceModel-1/4] 2025-10-10 14:05:14 INFO [GC] Performing periodic GC collection took 0.01 seconds
[2] [ReferenceModel-2/4] 2025-10-10 14:05:14 INFO [GC] Performing periodic GC collection took 0.01 seconds
[3] [ReferenceModel-3/4] 2025-10-10 14:05:14 INFO [GC] Performing periodic GC collection took 0.01 seconds
[3] [ReferenceModel-3/4] 2025-10-10 14:05:16 INFO [GC] Performing periodic GC collection took 0.01 seconds
[0] [ReferenceModel-0/4] 2025-10-10 14:05:16 INFO [GC] Performing periodic GC collection took 0.01 seconds
[1] [ReferenceModel-1/4] 2025-10-10 14:05:16 INFO [GC] Performing periodic GC collection took 0.01 seconds
[2] [ReferenceModel-2/4] 2025-10-10 14:05:16 INFO [GC] Performing periodic GC collection took 0.01 seconds
[2] [ReferenceModel-2/4] 2025-10-10 14:05:18 INFO [GC] Performing periodic GC collection took 0.01 seconds
[1] [ReferenceModel-1/4] 2025-10-10 14:05:18 INFO [GC] Performing periodic GC collection took 0.01 seconds
[0] [ReferenceModel-0/4] 2025-10-10 14:05:18 INFO [GC] Performing periodic GC collection took 0.01 seconds
[3] [ReferenceModel-3/4] 2025-10-10 14:05:18 INFO [GC] Performing periodic GC collection took 0.01 seconds
[1] [ReferenceModel-1/4] 2025-10-10 14:05:19 INFO [GC] Performing periodic GC collection took 0.01 seconds
[3] [ReferenceModel-3/4] 2025-10-10 14:05:19 INFO [GC] Performing periodic GC collection took 0.01 seconds
[2] [ReferenceModel-2/4] 2025-10-10 14:05:19 INFO [GC] Performing periodic GC collection took 0.01 seconds
[0] [ReferenceModel-0/4] 2025-10-10 14:05:19 INFO [GC] Performing periodic GC collection took 0.01 seconds
[0] [ReferenceModel-0/4] 2025-10-10 14:05:21 INFO [GC] Performing periodic GC collection took 0.01 seconds
[1] [ReferenceModel-1/4] 2025-10-10 14:05:21 INFO [GC] Performing periodic GC collection took 0.01 seconds
[2] [ReferenceModel-2/4] 2025-10-10 14:05:21 INFO [GC] Performing periodic GC collection took 0.01 seconds
[3] [ReferenceModel-3/4] 2025-10-10 14:05:21 INFO [GC] Performing periodic GC collection took 0.01 seconds
[3] [ReferenceModel-3/4] 2025-10-10 14:05:23 INFO [GC] Performing periodic GC collection took 0.01 seconds
[1] [ReferenceModel-1/4] 2025-10-10 14:05:23 INFO [GC] Performing periodic GC collection took 0.01 seconds
[0] [ReferenceModel-0/4] 2025-10-10 14:05:23 INFO [GC] Performing periodic GC collection took 0.01 seconds
[2] [ReferenceModel-2/4] 2025-10-10 14:05:23 INFO [GC] Performing periodic GC collection took 0.01 seconds
[2] [ReferenceModel-2/4] 2025-10-10 14:05:24 INFO [GC] Performing periodic GC collection took 0.01 seconds
[1] [ReferenceModel-1/4] 2025-10-10 14:05:24 INFO [GC] Performing periodic GC collection took 0.01 seconds
[0] [ReferenceModel-0/4] 2025-10-10 14:05:24 INFO [GC] Performing periodic GC collection took 0.02 seconds
[3] [ReferenceModel-3/4] 2025-10-10 14:05:24 INFO [GC] Performing periodic GC collection took 0.01 seconds
[0] [ReferenceModel-0/4] 2025-10-10 14:05:26 INFO [GC] Performing periodic GC collection took 0.02 seconds
[3] [ReferenceModel-3/4] 2025-10-10 14:05:26 INFO [GC] Performing periodic GC collection took 0.01 seconds
[1] [ReferenceModel-1/4] 2025-10-10 14:05:26 INFO [GC] Performing periodic GC collection took 0.01 seconds
[2] [ReferenceModel-2/4] 2025-10-10 14:05:26 INFO [GC] Performing periodic GC collection took 0.01 seconds
[3] [ReferenceModel-3/4] 2025-10-10 14:05:28 INFO [GC] Performing periodic GC collection took 0.01 seconds
[0] [ReferenceModel-0/4] 2025-10-10 14:05:28 INFO [GC] Performing periodic GC collection took 0.01 seconds
[1] [ReferenceModel-1/4] 2025-10-10 14:05:28 INFO [GC] Performing periodic GC collection took 0.01 seconds
[2] [ReferenceModel-2/4] 2025-10-10 14:05:28 INFO [GC] Performing periodic GC collection took 0.01 seconds
[2] [ReferenceModel-2/4] 2025-10-10 14:05:29 INFO [GC] Performing periodic GC collection took 0.01 seconds
[0] [ReferenceModel-0/4] 2025-10-10 14:05:29 INFO [GC] Performing periodic GC collection took 0.01 seconds
[3] [ReferenceModel-3/4] 2025-10-10 14:05:29 INFO [GC] Performing periodic GC collection took 0.01 seconds
[1] [ReferenceModel-1/4] 2025-10-10 14:05:29 INFO [GC] Performing periodic GC collection took 0.01 seconds
[3] [ReferenceModel-3/4] 2025-10-10 14:05:31 INFO [GC] Performing periodic GC collection took 0.01 seconds
[2] [ReferenceModel-2/4] 2025-10-10 14:05:31 INFO [GC] Performing periodic GC collection took 0.01 seconds
[0] [ReferenceModel-0/4] 2025-10-10 14:05:31 INFO [GC] Performing periodic GC collection took 0.01 seconds
[1] [ReferenceModel-1/4] 2025-10-10 14:05:31 INFO [GC] Performing periodic GC collection took 0.01 seconds
[3] [ReferenceModel-3/4] 2025-10-10 14:05:33 INFO [GC] Performing periodic GC collection took 0.01 seconds
[2] [ReferenceModel-2/4] 2025-10-10 14:05:33 INFO [GC] Performing periodic GC collection took 0.01 seconds
[1] [ReferenceModel-1/4] 2025-10-10 14:05:33 INFO [GC] Performing periodic GC collection took 0.01 seconds
[0] [ReferenceModel-0/4] 2025-10-10 14:05:33 INFO [GC] Performing periodic GC collection took 0.01 seconds
[0] [ReferenceModel-0/4] 2025-10-10 14:05:34 INFO [GC] Performing periodic GC collection took 0.01 seconds
[2] [ReferenceModel-2/4] 2025-10-10 14:05:35 INFO [GC] Performing periodic GC collection took 0.01 seconds
[1] [ReferenceModel-1/4] 2025-10-10 14:05:35 INFO [GC] Performing periodic GC collection took 0.01 seconds
[3] [ReferenceModel-3/4] 2025-10-10 14:05:35 INFO [GC] Performing periodic GC collection took 0.01 seconds
[0] INFO 10-10 14:07:19 [block_pool.py:321] Successfully reset prefix cache
[0] [Policy-0/1] 2025-10-10 14:07:19 INFO Weight update completed (now v3)
[0] [ReferenceModel-0/4] 2025-10-10 14:07:42 INFO [GC] Performing periodic GC collection took 0.02 seconds
[1] [ReferenceModel-1/4] 2025-10-10 14:07:42 INFO [GC] Performing periodic GC collection took 0.02 seconds
[3] [ReferenceModel-3/4] 2025-10-10 14:07:42 INFO [GC] Performing periodic GC collection took 0.02 seconds
[2] [ReferenceModel-2/4] 2025-10-10 14:07:42 INFO [GC] Performing periodic GC collection took 0.02 seconds
[1] [ReferenceModel-1/4] 2025-10-10 14:07:44 INFO [GC] Performing periodic GC collection took 0.02 seconds
[0] [ReferenceModel-0/4] 2025-10-10 14:07:44 INFO [GC] Performing periodic GC collection took 0.01 seconds
[2] [ReferenceModel-2/4] 2025-10-10 14:07:44 INFO [GC] Performing periodic GC collection took 0.02 seconds
[3] [ReferenceModel-3/4] 2025-10-10 14:07:44 INFO [GC] Performing periodic GC collection took 0.02 seconds
[3] [ReferenceModel-3/4] 2025-10-10 14:07:46 INFO [GC] Performing periodic GC collection took 0.01 seconds
[1] [ReferenceModel-1/4] 2025-10-10 14:07:46 INFO [GC] Performing periodic GC collection took 0.01 seconds
[2] [ReferenceModel-2/4] 2025-10-10 14:07:46 INFO [GC] Performing periodic GC collection took 0.01 seconds
[0] [ReferenceModel-0/4] 2025-10-10 14:07:46 INFO [GC] Performing periodic GC collection took 0.01 seconds
[2] [ReferenceModel-2/4] 2025-10-10 14:07:47 INFO [GC] Performing periodic GC collection took 0.01 seconds
[0] [ReferenceModel-0/4] 2025-10-10 14:07:47 INFO [GC] Performing periodic GC collection took 0.01 seconds
[3] [ReferenceModel-3/4] 2025-10-10 14:07:47 INFO [GC] Performing periodic GC collection took 0.01 seconds
[1] [ReferenceModel-1/4] 2025-10-10 14:07:47 INFO [GC] Performing periodic GC collection took 0.01 seconds
[1] [ReferenceModel-1/4] 2025-10-10 14:07:49 INFO [GC] Performing periodic GC collection took 0.01 seconds
[0] [ReferenceModel-0/4] 2025-10-10 14:07:49 INFO [GC] Performing periodic GC collection took 0.01 seconds
[3] [ReferenceModel-3/4] 2025-10-10 14:07:49 INFO [GC] Performing periodic GC collection took 0.01 seconds
[2] [ReferenceModel-2/4] 2025-10-10 14:07:49 INFO [GC] Performing periodic GC collection took 0.01 seconds
[0] [ReferenceModel-0/4] 2025-10-10 14:07:54 INFO [GC] Performing periodic GC collection took 0.02 seconds
[1] [ReferenceModel-1/4] 2025-10-10 14:07:54 INFO [GC] Performing periodic GC collection took 0.01 seconds
[2] [ReferenceModel-2/4] 2025-10-10 14:07:54 INFO [GC] Performing periodic GC collection took 0.02 seconds
[3] [ReferenceModel-3/4] 2025-10-10 14:07:54 INFO [GC] Performing periodic GC collection took 0.02 seconds
[0] [ReferenceModel-0/4] 2025-10-10 14:07:59 INFO [GC] Performing periodic GC collection took 0.02 seconds
[3] [ReferenceModel-3/4] 2025-10-10 14:07:59 INFO [GC] Performing periodic GC collection took 0.02 seconds
[2] [ReferenceModel-2/4] 2025-10-10 14:07:59 INFO [GC] Performing periodic GC collection took 0.02 seconds
[1] [ReferenceModel-1/4] 2025-10-10 14:07:59 INFO [GC] Performing periodic GC collection took 0.02 seconds
[0] INFO 10-10 14:08:00 [block_pool.py:321] Successfully reset prefix cache
[0] [Policy-0/1] 2025-10-10 14:08:00 INFO Weight update completed (now v3)
Dropping weights @ version 2
Dropped weights @ version 2, took 5.83 seconds
WandbBackend: Logged 101 metrics at global_step 3
=== [global_logger_2aeC_r0] - METRICS STEP 3 ===
buffer/add/count_episodes_added: 786.0
buffer/evict/avg_policy_staleness: 1.0
buffer/evict/max_policy_staleness: 1.0
buffer/evict/sum_episodes_evicted: 784.0
buffer/sample/avg_buffer_utilization: 592.0
buffer/sample/avg_buffer_utilization_pct: 5.405405405405405
buffer/sample/count_sample_requests: 1.0
buffer_perf/sample/total_duration_avg_s: 0.0055667751003056765
buffer_perf/sample/total_duration_max_s: 0.0055667751003056765
dataset/sample/avg_sample_len: 515.0816326530612
dataset/sample/count_samples_generated: 49.0
main/continuous_rollouts/count_rollout_iterations: 49.0
main_perf/continuous_rollouts/compute_logprobs/duration_avg_s: 5.996471974162423e-05
main_perf/continuous_rollouts/compute_logprobs/duration_max_s: 0.00010304898023605347
main_perf/continuous_rollouts/data_loading/duration_avg_s: 0.0013156006996500858
main_perf/continuous_rollouts/data_loading/duration_max_s: 0.0015002100262790918
main_perf/continuous_rollouts/policy_generation/duration_avg_s: 137.51652606647957
main_perf/continuous_rollouts/policy_generation/duration_max_s: 199.73020377987996
main_perf/continuous_rollouts/reference_model_calculate_logprobs/duration_avg_s: 16.182050521685078
main_perf/continuous_rollouts/reference_model_calculate_logprobs/duration_max_s: 36.03378381417133
main_perf/continuous_rollouts/reward_evaluation/duration_avg_s: 0.38161538480020757
main_perf/continuous_rollouts/reward_evaluation/duration_max_s: 1.0126315460074693
main_perf/continuous_rollouts/total_duration_avg_s: 154.18965516618587
main_perf/continuous_rollouts/total_duration_max_s: 213.60761165595613
main_perf/continuous_training/drop_weights/duration_avg_s: 5.830546634970233
main_perf/continuous_training/drop_weights/duration_max_s: 5.830546634970233
main_perf/continuous_training/push_weights/duration_avg_s: 14.282035800861195
main_perf/continuous_training/push_weights/duration_max_s: 14.282035800861195
main_perf/continuous_training/total_duration_avg_s: 224.97223744285293
main_perf/continuous_training/total_duration_max_s: 224.97223744285293
main_perf/continuous_training/train_step/duration_avg_s: 7.712055152049288
main_perf/continuous_training/train_step/duration_max_s: 7.712055152049288
main_perf/continuous_training/update_weights/duration_avg_s: 197.13718346506357
main_perf/continuous_training/update_weights/duration_max_s: 197.13718346506357
main_perf/continuous_training/waiting_for_buffer/duration_avg_s: 0.01041399291716516
main_perf/continuous_training/waiting_for_buffer/duration_max_s: 0.01041399291716516
policy/generate/avg_tokens_generated: 800.0061274509804
policy/generate/count_requests: 42.0
policy/generate/count_sequences_completed: 816.0
policy/generate/sum_tokens_generated: 652805.0
policy/update_weights/count_weight_updates: 2.0
policy_perf/generate/end/duration_avg_s: 2.7680000057443975e-06
policy_perf/generate/end/duration_max_s: 2.816000021994114e-06
policy_perf/generate/generate/duration_avg_s: 74.18627746332908
policy_perf/generate/generate/duration_max_s: 199.69165625
policy_perf/generate/process_inputs/duration_avg_s: 0.0009526073742457029
policy_perf/generate/process_inputs/duration_max_s: 0.002440608024597168
policy_perf/generate/prompt_truncation/duration_avg_s: 1.6781176540854517e-05
policy_perf/generate/prompt_truncation/duration_max_s: 6.841599941253662e-05
policy_perf/generate/total_duration_avg_s: 72.07658744781732
policy_perf/generate/total_duration_max_s: 199.6942355780003
policy_perf/update_weights/avg_pending_requests: 7.0
policy_perf/update_weights/max_pending_requests: 8.0
policy_worker_perf/update/total_duration_avg_s: 161.0851328125
policy_worker_perf/update/total_duration_max_s: 180.38240625
reference_perf/forward/avg_sequence_length: 2048.0
reference_perf/forward/compute_logprobs/duration_avg_s: 0.026915128278732294
reference_perf/forward/compute_logprobs/duration_max_s: 0.02828870391845703
reference_perf/forward/count_forward_passes: 200.0
reference_perf/forward/forward/duration_avg_s: 1.6151581799316403
reference_perf/forward/forward/duration_max_s: 1.768312744140625
reference_perf/forward/garbage_collection/duration_avg_s: 0.014899226388931274
reference_perf/forward/garbage_collection/duration_max_s: 0.017250463485717772
reference_perf/forward/memory_delta_end_start_avg_gb: 9.273681640625
reference_perf/forward/memory_peak_max_gb: 47.74769067764282
reference_perf/forward/to_device/duration_avg_s: 0.00029641920007765293
reference_perf/forward/to_device/duration_max_s: 0.0004175359904766083
reference_perf/forward/total_duration_avg_s: 1.657271959719452
reference_perf/forward/total_duration_max_s: 1.811123239449691
reward/evaluate_response/avg_MathReward_reward: 0.6261786600496322
reward/evaluate_response/avg_ThinkingReward_reward: 0.7697270471464038
reward/evaluate_response/avg_total_reward: 0.6979528535980215
reward/evaluate_response/count_MathReward_calls: 806.0
reward/evaluate_response/count_ThinkingReward_calls: 806.0
reward/evaluate_response/std_MathReward_reward: 0.45319055073623077
reward/evaluate_response/std_ThinkingReward_reward: 0.3622053691857174
reward/evaluate_response/sum_MathReward_reward: 504.70000000000357
reward/evaluate_response/sum_ThinkingReward_reward: 620.4000000000015
rl_trainer/avg_grpo_loss: 0.16886521875858307
rl_trainer/count_training_steps: 8.0
rl_trainer/learning_rate: 0.001
rl_trainer_perf/push_weights/dcp_save/duration_avg_s: 14.125432128906251
rl_trainer_perf/push_weights/dcp_save/duration_max_s: 14.2749453125
rl_trainer_perf/push_weights/flatten_state_dict/duration_avg_s: 0.0010219000056385995
rl_trainer_perf/push_weights/flatten_state_dict/duration_max_s: 0.001112671971321106
rl_trainer_perf/push_weights/memory_delta_end_start_avg_gb: 0.0
rl_trainer_perf/push_weights/memory_peak_max_gb: 22.953386306762695
rl_trainer_perf/push_weights/to_hf/duration_avg_s: 0.001430628001689911
rl_trainer_perf/push_weights/to_hf/duration_max_s: 0.001469920039176941
rl_trainer_perf/push_weights/total_duration_avg_s: 14.127981416915546
rl_trainer_perf/push_weights/total_duration_max_s: 14.27768739247322
rl_trainer_perf/step/forward_backward/duration_avg_s: 7.619148498535156
rl_trainer_perf/step/forward_backward/duration_max_s: 7.62258251953125
rl_trainer_perf/step/memory_delta_end_start_avg_gb: 0.0008854866027832031
rl_trainer_perf/step/memory_peak_max_gb: 68.14654541015625
rl_trainer_perf/step/optimizer_step/duration_avg_s: 0.03404809999465942
rl_trainer_perf/step/optimizer_step/duration_max_s: 0.034347774505615235
rl_trainer_perf/step/save_checkpoint/duration_avg_s: 0.003174496024847031
rl_trainer_perf/step/save_checkpoint/duration_max_s: 0.003422719955444336
rl_trainer_perf/step/total_duration_avg_s: 7.656416046554223
rl_trainer_perf/step/total_duration_max_s: 7.659922744065523
==============================
A few challenges:
- We enable garbage collection per step in torchtitan (both trainer and reference model), but it ideally shouldn't be logged at all.
- We have very comprehensive observability! but I think we can minimize the number of metrics that are being logged both in wandb and console. What is difficult is that I think all of the metrics will be useful at some point - but if I'm analyzing the training stability and system throughput, here is a starting proposal:
=== [global_logger_2aeC_r0] - METRICS STEP 3 ===
buffer/add/count_episodes_added: 786.0
buffer/evict/avg_policy_staleness: 1.0
buffer/evict/max_policy_staleness: 1.0
buffer/evict/sum_episodes_evicted: 784.0
main/continuous_rollouts/count_rollout_iterations: 49.0
main_perf/continuous_rollouts/policy_generation/duration_avg_s: 137.51652606647957
main_perf/continuous_rollouts/policy_generation/duration_max_s: 199.73020377987996
main_perf/continuous_rollouts/reference_model_calculate_logprobs/duration_avg_s: 16.182050521685078
main_perf/continuous_rollouts/reference_model_calculate_logprobs/duration_max_s: 36.03378381417133
main_perf/continuous_rollouts/total_duration_avg_s: 154.18965516618587
main_perf/continuous_rollouts/total_duration_max_s: 213.60761165595613
main_perf/continuous_training/drop_weights/duration_avg_s: 5.830546634970233
main_perf/continuous_training/drop_weights/duration_max_s: 5.830546634970233
main_perf/continuous_training/push_weights/duration_avg_s: 14.282035800861195
main_perf/continuous_training/push_weights/duration_max_s: 14.282035800861195
main_perf/continuous_training/total_duration_avg_s: 224.97223744285293
main_perf/continuous_training/total_duration_max_s: 224.97223744285293
main_perf/continuous_training/train_step/duration_avg_s: 7.712055152049288
main_perf/continuous_training/train_step/duration_max_s: 7.712055152049288
main_perf/continuous_training/update_weights/duration_avg_s: 197.13718346506357
main_perf/continuous_training/update_weights/duration_max_s: 197.13718346506357
main_perf/continuous_training/waiting_for_buffer/duration_avg_s: 0.01041399291716516
main_perf/continuous_training/waiting_for_buffer/duration_max_s: 0.01041399291716516
policy/generate/avg_tokens_generated: 800.0061274509804
policy/generate/count_requests: 42.0
policy/generate/count_sequences_completed: 816.0
policy_perf/generate/generate/duration_avg_s: 74.18627746332908
policy_perf/generate/generate/duration_max_s: 199.69165625
policy_perf/generate/total_duration_avg_s: 72.07658744781732
policy_perf/generate/total_duration_max_s: 199.6942355780003
policy_worker_perf/update/total_duration_avg_s: 161.0851328125
policy_worker_perf/update/total_duration_max_s: 180.38240625
reference_perf/forward/memory_delta_end_start_avg_gb: 9.273681640625
reference_perf/forward/memory_peak_max_gb: 47.74769067764282
reference_perf/forward/total_duration_avg_s: 1.657271959719452
reference_perf/forward/total_duration_max_s: 1.811123239449691
reward/evaluate_response/avg_MathReward_reward: 0.6261786600496322
reward/evaluate_response/avg_ThinkingReward_reward: 0.7697270471464038
reward/evaluate_response/avg_total_reward: 0.6979528535980215
reward/evaluate_response/std_MathReward_reward: 0.45319055073623077
reward/evaluate_response/std_ThinkingReward_reward: 0.3622053691857174
reward/evaluate_response/sum_MathReward_reward: 504.70000000000357
reward/evaluate_response/sum_ThinkingReward_reward: 620.4000000000015
rl_trainer/avg_grpo_loss: 0.16886521875858307
rl_trainer/count_training_steps: 8.0
rl_trainer/learning_rate: 0.001
rl_trainer_perf/push_weights/dcp_save/duration_avg_s: 14.125432128906251
rl_trainer_perf/push_weights/dcp_save/duration_max_s: 14.2749453125
rl_trainer_perf/push_weights/to_hf/duration_avg_s: 0.001430628001689911
rl_trainer_perf/push_weights/to_hf/duration_max_s: 0.001469920039176941
rl_trainer_perf/push_weights/total_duration_avg_s: 14.127981416915546
rl_trainer_perf/push_weights/total_duration_max_s: 14.27768739247322
rl_trainer_perf/step/forward_backward/duration_avg_s: 7.619148498535156
rl_trainer_perf/step/forward_backward/duration_max_s: 7.62258251953125
rl_trainer_perf/step/memory_delta_end_start_avg_gb: 0.0008854866027832031
rl_trainer_perf/step/memory_peak_max_gb: 68.14654541015625
rl_trainer_perf/step/save_checkpoint/duration_avg_s: 0.003174496024847031
rl_trainer_perf/step/save_checkpoint/duration_max_s: 0.003422719955444336
rl_trainer_perf/step/total_duration_avg_s: 7.656416046554223
rl_trainer_perf/step/total_duration_max_s: 7.659922744065523
==============================
idk if we can or should introduce the concept of "metric verbosity" or something like that, but overall it's a bit difficult to find the information I'm looking for
Versions
No response
Metadata
Metadata
Assignees
Labels
Better EngineeringTasks which help improve eng productivity e.g. building tools, cleaning up code, writing docsTasks which help improve eng productivity e.g. building tools, cleaning up code, writing docs