Unexpectedly long reward time in example script #6

@baihuajun24

Hi, I am running your example script grpo_7B.sh.

Environment:

  • Machine: 8×H20 GPUs
  • Modified hyperparameters: changed max_response_length from 32k to 16k (all others unchanged)
  • Training steps: 200

Issue:
I ran training for 200 steps and observed the following timing distribution:

[Figure: Timing distribution]

The figure plots timing_s/gen and timing_s/reward, each as a fraction of timing_s/step, for every training step. The reward time (orange line) dominates the step time.

Statistics:

  • Mean share of step time across all steps:
    • Rollout (timing_s/gen): ~30%
    • Reward (timing_s/reward): ~40%
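
For reference, shares like the ones above could be computed roughly as follows. This is only a minimal sketch: it assumes the per-step metrics are available as a list of dicts keyed by the metric names, and that each percentage is the mean of the per-step ratios. The helper `mean_time_shares` and the sample values are hypothetical and are not taken from the training logs.

```python
from statistics import mean


def mean_time_shares(step_metrics,
                     keys=("timing_s/gen", "timing_s/reward"),
                     total_key="timing_s/step"):
    """Return, for each key, the mean over steps of metric / total step time."""
    shares = {k: [] for k in keys}
    for metrics in step_metrics:
        total = metrics[total_key]
        if total <= 0:
            # Skip steps without a usable total to avoid division by zero.
            continue
        for k in keys:
            shares[k].append(metrics[k] / total)
    return {k: mean(v) for k, v in shares.items() if v}


# Made-up numbers for illustration only (hypothetical, not real measurements).
example_steps = [
    {"timing_s/gen": 30.0, "timing_s/reward": 40.0, "timing_s/step": 100.0},
    {"timing_s/gen": 60.0, "timing_s/reward": 80.0, "timing_s/step": 200.0},
]

result = mean_time_shares(example_steps)
for name, share in result.items():
    print(f"{name}: {share:.0%} of step time")
```

Averaging per-step ratios weights every step equally; dividing summed times instead would weight long steps more heavily, so the two can differ when step times vary a lot.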

Question:
The reward time seems unexpectedly long to me: on average it takes a larger share of each step than rollout does. Is this expected behavior? Could you please help clarify?
