
Conversation

@JacobHelwig
Collaborator

What does this PR do?

The logged KL/PG losses are currently scaled by 1 / (number of micro-batches) twice: once by the loss scale factor and once again when averaging across micro-batches. As a result, the logged metrics represent the mean across micro-batches divided by the number of micro-batches. This PR applies the scaling only once, so the logged metrics represent the mean across micro-batches with no extra scaling.

First scaling (https://github.com/volcengine/verl/blob/cd4072daad2652794ecff0b5816a05afedff8608/verl/workers/actor/dp_actor.py#L533):

```python
micro_batch_metrics["actor/kl_loss"] = kl_loss.detach().item() * loss_scale_factor
```

Second scaling (https://github.com/volcengine/verl/blob/cd4072daad2652794ecff0b5816a05afedff8608/verl/utils/metric/utils.py#L53):

```python
metrics[key] = np.mean(val)
```
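
To see how the double division arises, here is a toy reproduction (not verl code; grad_acc and the loss values are made up, with loss_scale_factor = 1 / grad_acc as in the non-dynamic case):

```python
import numpy as np

grad_acc = 4                       # micro-batches per mini-batch
kl_per_micro_batch = [0.8, 1.2, 0.9, 1.1]
loss_scale_factor = 1 / grad_acc   # the factor applied before backward()

# First scaling: each logged value is multiplied by 1 / grad_acc.
logged = [kl * loss_scale_factor for kl in kl_per_micro_batch]

# Second scaling: reduce_metrics then averages the already-scaled values.
print(np.mean(logged))              # ≈ 0.25 -> mean / grad_acc (what gets logged on main)
print(np.mean(kl_per_micro_batch))  # ≈ 1.0  -> the intended mean across micro-batches
```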

Test

On main, decreasing micro-batch size from 8->2 decreases logged loss by a factor of 4:

[screenshot: https://github.com/user-attachments/assets/9d6cf0a5-1cef-46ad-9d4b-c1d1d56a9af7]

Decreasing micro-batch size on this branch does not affect metric magnitude:

[screenshot: https://github.com/user-attachments/assets/c8f6bc34-da02-4469-8e16-58b53c6235a9]
```bash
python -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.dataloader_num_workers=0 \
    data.return_full_prompt=True \
    data.train_files=$SAVE_PATH/gsm8k/train.parquet \
    data.val_files=$SAVE_PATH/gsm8k/test.parquet \
    data.train_batch_size=8 \
    data.max_prompt_length=512 \
    data.max_response_length=1024 \
    data.filter_overlong_prompts=True \
    data.truncation='error' \
    actor_rollout_ref.model.path=Qwen/Qwen2.5-0.5B-Instruct \
    +actor_rollout_ref.ref.model.path=Qwen/Qwen2.5-3B-Instruct \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.actor.ppo_mini_batch_size=8 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 \
    actor_rollout_ref.actor.use_kl_loss=True \
    actor_rollout_ref.actor.kl_loss_coef=10 \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl \
    actor_rollout_ref.actor.entropy_coeff=0 \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.fsdp_config.param_offload=False \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
    actor_rollout_ref.rollout.n=5 \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=8 \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    algorithm.use_kl_in_reward=False \
    trainer.critic_warmup=0 \
    trainer.logger='["console","wandb"]' \
    trainer.project_name='verl_fix_metrics' \
    trainer.experiment_name='NEW/ppo_micro_batch_size_per_gpu2' \
    trainer.n_gpus_per_node=1 \
    trainer.nnodes=1 \
    trainer.save_freq=20 \
    trainer.test_freq=5 \
    trainer.resume_mode="disable" \
    trainer.total_epochs=15 \
    actor_rollout_ref.actor.use_torch_compile=False \
    actor_rollout_ref.actor.fsdp_config.use_torch_compile=False \
    trainer.val_before_train=False \
    actor_rollout_ref.rollout.enforce_eager=True \
    actor_rollout_ref.ref.fsdp_config.use_torch_compile=False
```

Design & Code Changes

Remove the extra loss-scale factor from metric logging in dp_actor

Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request addresses a bug where KL, PG, and VF loss metrics were being scaled incorrectly by the number of gradient accumulation steps, causing the reported metrics to be dependent on the micro-batch size. The fix removes this extra scaling factor from the metric logging in dp_actor.py and dp_critic.py. This change correctly separates the scaling needed for gradient backpropagation from metric reporting, ensuring the logged values accurately reflect the mean loss. The change is consistent, logical, and supported by the test results in the description. I have reviewed the changes and found no issues.

@yyDing1
Collaborator

yyDing1 commented Dec 29, 2025

Got it, this is indeed a mistake, but there is still bias in the fixed version.

  • It's fine for use_dynamic_bsz=False, since the loss scaling factor is the same for every micro-batch (i.e., 1 / grad_acc).
  • It's still wrong for use_dynamic_bsz=True, since the micro-batch sizes may differ.

I would suggest the following fix (a short sketch follows this list):

  • Keep the micro-batch-level log, i.e., micro_batch_metrics["actor/kl_loss"] = kl_loss.detach().item() * loss_scale_factor.
  • When aggregating the micro-batch logs, use sum instead of mean, since each value has already been scaled.
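
To make the dynamic-batch-size point concrete, a toy sketch (not verl code; the loss values and per-micro-batch scale factors are made up):

```python
import numpy as np

# Dynamic batch sizes: each micro-batch carries its own loss_scale_factor
# (e.g. its token share of the mini-batch).
kl_per_micro_batch = [0.8, 1.2, 0.9, 1.1]
scale = [0.40, 0.30, 0.20, 0.10]  # sums to 1.0

logged = [kl * s for kl, s in zip(kl_per_micro_batch, scale)]

# Summing the pre-scaled values gives the correctly weighted mini-batch loss;
# averaging them divides by the number of micro-batches again and also
# ignores the unequal weights.
print(float(np.sum(logged)))   # ≈ 0.97  (correct weighted loss)
print(float(np.mean(logged)))  # ≈ 0.24  (biased)
```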

@JacobHelwig
Collaborator Author

JacobHelwig commented Dec 29, 2025

> Got it, this is indeed a mistake, but there is still bias in the fixed version.
>
>   • It's fine for use_dynamic_bsz=False, since the loss scaling factor is the same for every micro-batch (i.e., 1 / grad_acc).
>   • It's still wrong for use_dynamic_bsz=True, since the micro-batch sizes may differ.
>
> I would suggest the following fix:
>
>   • Keep the micro-batch-level log, i.e., micro_batch_metrics["actor/kl_loss"] = kl_loss.detach().item() * loss_scale_factor.
>   • When aggregating the micro-batch logs, use sum instead of mean, since each value has already been scaled.

Good catch @yyDing1. Since the np.mean in verl.utils.metric.utils.reduce_metrics is still needed for some of the metrics returned from the update (e.g., pg_metrics), I handled the aggregation directly in the update function instead: e43dadb
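
For reference, a self-contained toy sketch of that idea (aggregate once inside the update so the later np.mean in reduce_metrics is a no-op); the loop, names, and numbers below are illustrative stand-ins, not verl's actual update code:

```python
import numpy as np

# One mini-batch split into micro-batches with (possibly unequal) scale factors.
micro_batch_kl = [0.8, 1.2, 0.9, 1.1]   # per-micro-batch KL loss values
scale_factors = [0.3, 0.3, 0.2, 0.2]    # per-micro-batch loss_scale_factor, sums to 1

kl_loss_agg = 0.0
for kl, scale in zip(micro_batch_kl, scale_factors):
    # In the real update loop the scaled loss is also used for loss.backward();
    # here we only accumulate the scaled contribution for logging.
    kl_loss_agg += kl * scale

metrics = {"actor/kl_loss": [kl_loss_agg]}  # a single pre-aggregated value
print(np.mean(metrics["actor/kl_loss"]))    # ≈ 1.0; the mean over a length-1 list changes nothing
```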

If we still want to do the aggregation in reduce_metrics rather than in the update, I see two options that keep the np.mean in reduce_metrics:

  • Option 1: Append a "+" to the names of metrics that need to be summed, and strip it after summing in reduce_metrics (a usage example follows the two options):

```python
import numpy as np

def reduce_metrics(metrics: dict) -> dict:
    new_metrics = dict()
    for key, val in metrics.items():
        if "max" in key:
            new_metrics[key] = np.max(val)
        elif "min" in key:
            new_metrics[key] = np.min(val)
        elif key.endswith("+"):
            # Values were already scaled per micro-batch, so sum them
            # and drop the "+" marker from the logged name.
            new_metrics[key[:-1]] = np.sum(val)
        else:
            new_metrics[key] = np.mean(val)
    return new_metrics
```
  • Option 2: Don't change reduce_metrics; in the update function, multiply each element of the KL/PG loss lists by the number of grad accumulation steps so that the mean in reduce_metrics effectively computes a sum:

```python
# At the end of the update function, after the final zero_grad:
self.actor_optimizer.zero_grad()
# len(metrics["actor/pg_loss"]) is the number of micro-batches, so the later
# np.mean divides this factor back out, leaving the sum of the scaled values.
metrics["actor/pg_loss"] = [loss * len(metrics["actor/pg_loss"]) for loss in metrics["actor/pg_loss"]]
metrics["actor/kl_loss"] = [loss * len(metrics["actor/kl_loss"]) for loss in metrics["actor/kl_loss"]]
return metrics
```
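
For concreteness, a quick check of the Option 1 behavior, assuming the reduce_metrics variant sketched above is in scope; the metric names and values here are made up:

```python
metrics = {
    "actor/kl_loss+": [0.20, 0.30, 0.25, 0.25],  # each entry already multiplied by 1/grad_acc
    "actor/entropy": [2.1, 2.0, 2.2, 1.9],
}
print(reduce_metrics(metrics))
# actor/kl_loss aggregates to ≈ 1.0 (sum), actor/entropy to ≈ 2.05 (mean)
```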

WDYT?

@yyDing1
Collaborator

yyDing1 commented Dec 29, 2025

Yes, I agree that we should make a minimal change while preserving the existing logic for computing the mean of the values.

@JacobHelwig
Collaborator Author

> Yes, I agree that we should make a minimal change while preserving the existing logic for computing the mean of the values.

Agreed. Since option 1 would require modifying reduce_metrics, the current approach or option 2 seems best for keeping the changes isolated. IMO, the current approach is a bit more readable and less hacky than option 2, but I'm fine with either. WDYT?

@yyDing1
Collaborator

yyDing1 commented Dec 30, 2025

I think your current implementation looks better. If you have no concerns, we can merge it.

@yyDing1 self-requested a review December 30, 2025 06:21
@yyDing1 merged commit 0af6a38 into volcengine:main Dec 30, 2025
50 of 53 checks passed
boren-ms pushed a commit to boren-ms/verl that referenced this pull request Dec 30, 2025
jsfanfanfan pushed a commit to meituan-search/verl that referenced this pull request Jan 9, 2026
vyomakesh0728 added a commit to vyomakesh0728/verl that referenced this pull request Jan 22, 2026
sophiayyya pushed a commit to sophiayyya/verl that referenced this pull request Jan 25, 2026