Hi, thanks for the great work!
I have a question about the training configuration.
In the paper and your code, the configuration is as follows:
Generation: 512 prompts × 16 responses = 8,192 responses
Minibatch updates: 512 responses × 16 iterations per rollout
This means that, within each rollout, 15 of the 16 gradient updates are performed on data sampled from the old (stale) policy, causing policy drift.
Instead, if we split the generation into 16 smaller rollouts:
Generation: 32 prompts × 16 responses = 512 responses
Minibatch updates: 512 responses × 1 update
This would require the same total computation but avoid policy drift, since every gradient update would be performed on freshly sampled data.
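To make the comparison concrete, here is a minimal arithmetic sketch of the two schedules (the function name and structure are illustrative, not from the repo); it just counts how many gradient updates per rollout reuse data sampled from a now-stale policy:

```python
# Hypothetical sketch: count gradient updates performed on stale
# (off-policy) data under each schedule. Names are illustrative.

def stale_updates_per_rollout(num_prompts, responses_per_prompt, minibatch_size):
    total_responses = num_prompts * responses_per_prompt
    num_updates = total_responses // minibatch_size
    # The first minibatch update uses freshly sampled data;
    # every subsequent update reuses samples from the old policy.
    return num_updates - 1

# Current design: one large rollout, 16 minibatch updates.
print(stale_updates_per_rollout(512, 16, 512))       # -> 15

# Proposed alternative: 16 small rollouts, 1 update each.
print(16 * stale_updates_per_rollout(32, 16, 512))   # -> 0
```

Both schemes perform 16 gradient updates per 8,192 generated responses; they differ only in how many of those updates are off-policy.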
So my question is: is there a specific reason or advantage to the current design (large rollout + multiple minibatch updates) beyond what I described, or am I misunderstanding something about the training process?
I'm also wondering whether this is simply a common convention in RL training that I'm not aware of, since my RL background is not that deep. Any clarification would be appreciated! Thanks in advance.