Unexpectedly long reward time in example script #6

Hi, I am running your example script [`grpo_7B.sh`](https://github.com/mit-han-lab/fastrl/tree/main/examples/grpo_7B.sh).

**Environment:**
- Machine: 8×H20 GPUs
- Modified hyperparameters: changed `max_response_length` from 32k to 16k (all others unchanged)
- Training steps: 200

**Issue:**
I ran training for 200 steps and observed the following timing distribution:

![Timing distribution](https://github.com/user-attachments/assets/1ea2bc0e-c265-402b-80a9-8ebc501eb600)

The figure plots `timing_s/gen` and `timing_s/reward`, each divided by `timing_s/step`, for every training step. The reward time (orange line) dominates the step time.

**Statistics:**
- Mean share of step time, averaged across all steps:
  - Rollout: ~30%
  - Reward: ~40%
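
For anyone who wants to reproduce this kind of breakdown, here is a minimal sketch of how the per-step shares could be averaged. Only the metric names `timing_s/gen`, `timing_s/reward`, and `timing_s/step` come from the training logs; the list-of-dicts layout, the function name `mean_time_shares`, and the numbers in `example` are assumptions made up for illustration, not the script's actual logging format or my measured values.

```python
from statistics import mean


def mean_time_shares(step_metrics):
    """Return the mean fraction of step time spent in generation and reward.

    `step_metrics` is a hypothetical list of per-step dicts keyed by the
    logged metric names.
    """
    gen_shares = []
    reward_shares = []
    for metrics in step_metrics:
        step_time = metrics["timing_s/step"]
        if step_time <= 0:
            continue  # skip malformed entries to avoid division by zero
        gen_shares.append(metrics["timing_s/gen"] / step_time)
        reward_shares.append(metrics["timing_s/reward"] / step_time)
    if not gen_shares:
        raise ValueError("no steps with a positive timing_s/step")
    return {"gen": mean(gen_shares), "reward": mean(reward_shares)}


# Placeholder numbers for illustration only (not real measurements).
example = [
    {"timing_s/gen": 30.0, "timing_s/reward": 40.0, "timing_s/step": 100.0},
    {"timing_s/gen": 60.0, "timing_s/reward": 80.0, "timing_s/step": 200.0},
]
shares = mean_time_shares(example)
print(f"gen: {shares['gen']:.0%}, reward: {shares['reward']:.0%}")
```

Note that this averages the per-step ratios; dividing total reward time by total step time would weight long steps more heavily and can give a different number.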

**Question:**
The reward time seems unexpectedly long to me. Is this expected behavior? Could you please help clarify?
