[training_utils] fix: RM extra scaling in KL/PG losses #4711
Conversation
Code Review
This pull request addresses a bug where KL, PG, and VF loss metrics were being scaled incorrectly by the number of gradient accumulation steps, causing the reported metrics to be dependent on the micro-batch size. The fix removes this extra scaling factor from the metric logging in dp_actor.py and dp_critic.py. This change correctly separates the scaling needed for gradient backpropagation from metric reporting, ensuring the logged values accurately reflect the mean loss. The change is consistent, logical, and supported by the test results in the description. I have reviewed the changes and found no issues.
Got it. This is indeed a mistake, but there is still bias in the fixed version.
I would suggest the following fix:
This reverts commit e1ed63a.
Good catch @yyDing1. The aggregation in `reduce_metrics` (`verl/utils/metric/utils.py`) currently computes the mean of the logged values:
```python
# Excerpt of the aggregation helper in verl/utils/metric/utils.py.
import numpy as np

def reduce_metrics(metrics: dict) -> dict:
    new_metrics = dict()
    for key, val in metrics.items():
        if "max" in key:
            new_metrics[key] = np.max(val)
        elif "min" in key:
            new_metrics[key] = np.min(val)
        elif key.endswith("+"):
            new_metrics[key[:-1]] = np.sum(val)
        else:
            new_metrics[key] = np.mean(val)
    return new_metrics
```
If we want to still do the aggregation in `reduce_metrics`, the logged losses could instead be rescaled at the end of `update_policy` in `dp_actor.py`:

```python
self.actor_optimizer.zero_grad()
# Rescale each logged per-micro-batch loss by the number of logged entries, so that the
# mean taken later in reduce_metrics effectively becomes a sum of the original values.
metrics["actor/pg_loss"] = [loss * len(metrics["actor/pg_loss"]) for loss in metrics["actor/pg_loss"]]
metrics["actor/kl_loss"] = [loss * len(metrics["actor/kl_loss"]) for loss in metrics["actor/kl_loss"]]
return metrics
```

WDYT?
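For what it's worth, a small numeric check (hypothetical values) of why this leaves `reduce_metrics` untouched: multiplying every logged entry by the number of entries turns the later `np.mean` into a sum of the original values.

```python
import numpy as np

logged = [0.1, 0.3, 0.2, 0.4]                          # hypothetical per-micro-batch values
rescaled = [x * len(logged) for x in logged]           # the rescaling proposed above
assert np.isclose(np.mean(rescaled), np.sum(logged))   # mean of rescaled == sum of originals
```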
Yes, I agree with you that we should make a minimal change while preserving the existing logic for computing the mean of the values.
Agreed, since option 1 would require a modification to the `reduce_metrics` aggregation in `verl/utils/metric/utils.py`.
I think your current implementation looks better. If you have no concerns, we can merge it.
### What does this PR do?

The KL/PG losses currently logged are scaled by the number of micro-batches twice. As a result, the logged metrics represent the mean value across micro-batches **scaled by the number of micro-batches**. This PR scales only once, so the logged metrics represent the mean across micro-batches with no extra scaling.

First scaling: https://github.com/volcengine/verl/blob/cd4072daad2652794ecff0b5816a05afedff8608/verl/workers/actor/dp_actor.py#L533

Second scaling: https://github.com/volcengine/verl/blob/cd4072daad2652794ecff0b5816a05afedff8608/verl/utils/metric/utils.py#L53

### Test

On `main`, decreasing the micro-batch size from 8 to 2 decreases the logged loss by a factor of 4:

<img width="970" height="640" alt="image" src="https://github.com/user-attachments/assets/9d6cf0a5-1cef-46ad-9d4b-c1d1d56a9af7" />

Decreasing the micro-batch size on this branch does not affect the metric magnitude:

<img width="988" height="644" alt="image" src="https://github.com/user-attachments/assets/c8f6bc34-da02-4469-8e16-58b53c6235a9" />

```bash
python -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.dataloader_num_workers=0 \
    data.return_full_prompt=True \
    data.train_files=$SAVE_PATH/gsm8k/train.parquet \
    data.val_files=$SAVE_PATH/gsm8k/test.parquet \
    data.train_batch_size=8 \
    data.max_prompt_length=512 \
    data.max_response_length=1024 \
    data.filter_overlong_prompts=True \
    data.truncation='error' \
    actor_rollout_ref.model.path=Qwen/Qwen2.5-0.5B-Instruct \
    +actor_rollout_ref.ref.model.path=Qwen/Qwen2.5-3B-Instruct \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.actor.ppo_mini_batch_size=8 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 \
    actor_rollout_ref.actor.use_kl_loss=True \
    actor_rollout_ref.actor.kl_loss_coef=10 \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl \
    actor_rollout_ref.actor.entropy_coeff=0 \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.fsdp_config.param_offload=False \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
    actor_rollout_ref.rollout.n=5 \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=8 \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    algorithm.use_kl_in_reward=False \
    trainer.critic_warmup=0 \
    trainer.logger='["console","wandb"]' \
    trainer.project_name='verl_fix_metrics' \
    trainer.experiment_name='NEW/ppo_micro_batch_size_per_gpu2' \
    trainer.n_gpus_per_node=1 \
    trainer.nnodes=1 \
    trainer.save_freq=20 \
    trainer.test_freq=5 \
    trainer.resume_mode="disable" \
    trainer.total_epochs=15 \
    actor_rollout_ref.actor.use_torch_compile=False \
    actor_rollout_ref.actor.fsdp_config.use_torch_compile=False \
    trainer.val_before_train=False \
    actor_rollout_ref.rollout.enforce_eager=True \
    actor_rollout_ref.ref.fsdp_config.use_torch_compile=False
```

### Design & Code Changes

Remove the extra scaling of the logged losses in `dp_actor`.
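As a minimal sketch of the double scaling described above (hypothetical numbers, not the actual verl code): on `main` the per-micro-batch loss is divided by the number of gradient-accumulation steps before it is logged, and the metric aggregation then averages it again, so going from micro-batch size 8 to 2 shrinks the logged value by 4x; logging the unscaled loss makes the reported mean independent of the micro-batch size.

```python
import numpy as np

per_sample_loss = np.ones(8)  # one mini-batch of 8 samples, true mean loss = 1.0

def logged_mean(micro_batch_size: int, fix_applied: bool) -> float:
    micro_batches = np.split(per_sample_loss, len(per_sample_loss) // micro_batch_size)
    grad_accum = len(micro_batches)          # gradient-accumulation steps per mini-batch
    logged = []
    for mb in micro_batches:
        loss = mb.mean()                     # per-micro-batch mean loss
        # On `main`, the value stored in metrics is already divided by grad_accum (first scaling).
        logged.append(loss if fix_applied else loss / grad_accum)
    return float(np.mean(logged))            # second scaling: the mean taken at aggregation time

print(logged_mean(8, fix_applied=False))  # 1.0
print(logged_mean(2, fix_applied=False))  # 0.25 -> shrinks by 4x when micro-batch size goes 8 -> 2
print(logged_mean(2, fix_applied=True))   # 1.0  -> invariant to the micro-batch size
```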