Description
Hi, I am running your example script grpo_7B.sh.
Environment:
- Machine: 8×H20 GPUs
- Modified hyperparameters: changed max_response_length from 32k to 16k (others unchanged)
- Training steps: 200
Issue:
I ran training for 200 steps and observed the following timing distribution:
The figure plots timing_s/gen and timing_s/reward as fractions of timing_s/step for each training step. The reward time (orange line) dominates the step time.
Statistics:
- Mean share of step time across all steps:
  - Rollout: ~30%
  - Reward: ~40%
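A minimal sketch of how such mean shares can be computed, assuming the per-step metrics are available as a list of dicts keyed by the metric names above. The function name timing_shares and the sample numbers are hypothetical, for illustration only; they are not measurements from this run.

```python
from statistics import mean


def timing_shares(step_metrics,
                  parts=("timing_s/gen", "timing_s/reward"),
                  total="timing_s/step"):
    """Return the mean fraction of total step time spent in each part.

    step_metrics is assumed to be a list of dicts, one per training step,
    each holding the timing values in seconds.
    """
    shares = {p: [] for p in parts}
    for metrics in step_metrics:
        step_time = metrics[total]
        if step_time <= 0:
            continue  # skip steps without a usable total time
        for p in parts:
            shares[p].append(metrics[p] / step_time)
    # Average the per-step fractions; drop parts that had no valid steps.
    return {p: mean(values) for p, values in shares.items() if values}


# Made-up numbers for illustration only.
example = [
    {"timing_s/gen": 30.0, "timing_s/reward": 40.0, "timing_s/step": 100.0},
    {"timing_s/gen": 60.0, "timing_s/reward": 80.0, "timing_s/step": 200.0},
]
result = timing_shares(example)
print(result)
```

Averaging per-step fractions weights every step equally; dividing summed reward time by summed step time would instead weight longer steps more, so the two can differ slightly.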
Question:
The reward time seems unexpectedly long to me, and I'm not sure whether this is expected behavior. Could you please help clarify?
