[training_utils] fix: RM extra scaling in KL/PG losses #4711
Conversation
Code Review
This pull request addresses a bug where KL, PG, and VF loss metrics were being scaled incorrectly by the number of gradient accumulation steps, causing the reported metrics to be dependent on the micro-batch size. The fix removes this extra scaling factor from the metric logging in dp_actor.py and dp_critic.py. This change correctly separates the scaling needed for gradient backpropagation from metric reporting, ensuring the logged values accurately reflect the mean loss. The change is consistent, logical, and supported by the test results in the description. I have reviewed the changes and found no issues.
Got it. This is indeed a mistake, but there is still bias in the fixed version.
I would suggest the following fix:
This reverts commit e1ed63a.
Good catch @yyDing1. The aggregation in `reduce_metrics` (`verl/utils/metric/utils.py`) currently computes the mean of the logged values:
```python
# Excerpt of the aggregation helper in verl/utils/metric/utils.py.
import numpy as np

def reduce_metrics(metrics: dict) -> dict:
    new_metrics = dict()
    for key, val in metrics.items():
        if "max" in key:
            new_metrics[key] = np.max(val)
        elif "min" in key:
            new_metrics[key] = np.min(val)
        elif key.endswith("+"):
            new_metrics[key[:-1]] = np.sum(val)
        else:
            new_metrics[key] = np.mean(val)
    return new_metrics
```
If we want to still do the aggregation in `reduce_metrics`, the logged losses could instead be rescaled at the end of `update_policy` in `dp_actor.py`:

```python
self.actor_optimizer.zero_grad()
# Rescale each logged per-micro-batch loss by the number of logged entries, so that the
# mean taken later in reduce_metrics effectively becomes a sum of the original values.
metrics["actor/pg_loss"] = [loss * len(metrics["actor/pg_loss"]) for loss in metrics["actor/pg_loss"]]
metrics["actor/kl_loss"] = [loss * len(metrics["actor/kl_loss"]) for loss in metrics["actor/kl_loss"]]
return metrics
```

WDYT?
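For what it's worth, a small numeric check (hypothetical values) of why this leaves `reduce_metrics` untouched: multiplying every logged entry by the number of entries turns the later `np.mean` into a sum of the original values.

```python
import numpy as np

logged = [0.1, 0.3, 0.2, 0.4]                          # hypothetical per-micro-batch values
rescaled = [x * len(logged) for x in logged]           # the rescaling proposed above
assert np.isclose(np.mean(rescaled), np.sum(logged))   # mean of rescaled == sum of originals
```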
Yes, I agree with you that we should make a minimal change while preserving the existing logic for computing the mean of the values.
Agreed, since option 1 would require a modification to the `reduce_metrics` aggregation in `verl/utils/metric/utils.py`.
I think your current implementation looks better. If you have no concerns, we can merge it.
### What does this PR do?

The KL/PG losses currently logged are scaled by the number of micro-batches twice. As a result, the logged metrics represent the mean value across micro-batches **scaled by the number of micro-batches**. This PR scales only once, so the logged metrics represent the mean across micro-batches with no extra scaling.

First scaling: https://github.com/volcengine/verl/blob/cd4072daad2652794ecff0b5816a05afedff8608/verl/workers/actor/dp_actor.py#L533

Second scaling: https://github.com/volcengine/verl/blob/cd4072daad2652794ecff0b5816a05afedff8608/verl/utils/metric/utils.py#L53

### Test

On `main`, decreasing the micro-batch size from 8 to 2 decreases the logged loss by a factor of 4:

<img width="970" height="640" alt="image" src="https://github.com/user-attachments/assets/9d6cf0a5-1cef-46ad-9d4b-c1d1d56a9af7" />

Decreasing the micro-batch size on this branch does not affect the metric magnitude:

<img width="988" height="644" alt="image" src="https://github.com/user-attachments/assets/c8f6bc34-da02-4469-8e16-58b53c6235a9" />

```bash
python -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.dataloader_num_workers=0 \
    data.return_full_prompt=True \
    data.train_files=$SAVE_PATH/gsm8k/train.parquet \
    data.val_files=$SAVE_PATH/gsm8k/test.parquet \
    data.train_batch_size=8 \
    data.max_prompt_length=512 \
    data.max_response_length=1024 \
    data.filter_overlong_prompts=True \
    data.truncation='error' \
    actor_rollout_ref.model.path=Qwen/Qwen2.5-0.5B-Instruct \
    +actor_rollout_ref.ref.model.path=Qwen/Qwen2.5-3B-Instruct \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.actor.ppo_mini_batch_size=8 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 \
    actor_rollout_ref.actor.use_kl_loss=True \
    actor_rollout_ref.actor.kl_loss_coef=10 \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl \
    actor_rollout_ref.actor.entropy_coeff=0 \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.fsdp_config.param_offload=False \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
    actor_rollout_ref.rollout.n=5 \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=8 \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    algorithm.use_kl_in_reward=False \
    trainer.critic_warmup=0 \
    trainer.logger='["console","wandb"]' \
    trainer.project_name='verl_fix_metrics' \
    trainer.experiment_name='NEW/ppo_micro_batch_size_per_gpu2' \
    trainer.n_gpus_per_node=1 \
    trainer.nnodes=1 \
    trainer.save_freq=20 \
    trainer.test_freq=5 \
    trainer.resume_mode="disable" \
    trainer.total_epochs=15 \
    actor_rollout_ref.actor.use_torch_compile=False \
    actor_rollout_ref.actor.fsdp_config.use_torch_compile=False \
    trainer.val_before_train=False \
    actor_rollout_ref.rollout.enforce_eager=True \
    actor_rollout_ref.ref.fsdp_config.use_torch_compile=False
```

### Design & Code Changes

Remove the extra scaling of the logged losses in `dp_actor`.
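As a minimal sketch of the double scaling described above (hypothetical numbers, not the actual verl code): on `main` the per-micro-batch loss is divided by the number of gradient-accumulation steps before it is logged, and the metric aggregation then averages it again, so going from micro-batch size 8 to 2 shrinks the logged value by 4x; logging the unscaled loss makes the reported mean independent of the micro-batch size.

```python
import numpy as np

per_sample_loss = np.ones(8)  # one mini-batch of 8 samples, true mean loss = 1.0

def logged_mean(micro_batch_size: int, fix_applied: bool) -> float:
    micro_batches = np.split(per_sample_loss, len(per_sample_loss) // micro_batch_size)
    grad_accum = len(micro_batches)          # gradient-accumulation steps per mini-batch
    logged = []
    for mb in micro_batches:
        loss = mb.mean()                     # per-micro-batch mean loss
        # On `main`, the value stored in metrics is already divided by grad_accum (first scaling).
        logged.append(loss if fix_applied else loss / grad_accum)
    return float(np.mean(logged))            # second scaling: the mean taken at aggregation time

print(logged_mean(8, fix_applied=False))  # 1.0
print(logged_mean(2, fix_applied=False))  # 0.25 -> shrinks by 4x when micro-batch size goes 8 -> 2
print(logged_mean(2, fix_applied=True))   # 1.0  -> invariant to the micro-batch size
```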