Policy drift from minibatch updates vs. Smaller rollout batches #40

@SangYeop-Lee

Hi, thanks for the great work!

I have a question about the training configuration.

In the paper and your code, the configuration is as follows:

Generation:       512 prompts × 16 responses = 8,192 responses
Minibatch updates: 512 responses × 16 iterations per rollout

This means that after the first update, the remaining 15 gradient updates are performed on data sampled from an increasingly stale policy, causing policy drift.

Instead, if we split the generation into 16 smaller rollouts:

Generation:       32 prompts × 16 responses = 512 responses
Minibatch updates: 512 responses × 1 update

This would require the same total computation but avoid policy drift, since every gradient update would be performed on freshly sampled data.
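To make the comparison concrete, here is a minimal toy sketch of the two schedules. It is not the repo's actual training loop; `generate`, `split`, and the version-tagging are invented stand-ins, used only to count how stale the sampled data is (in gradient updates) at the moment each update is applied.

```python
from typing import List, Tuple

def generate(policy_version: int, prompts: List[str], n: int) -> List[Tuple[str, int, int]]:
    """Toy stand-in for rollout generation: tag each response with the
    policy version that produced it (prompt, response index, version)."""
    return [(p, i, policy_version) for p in prompts for i in range(n)]

def split(items: List, parts: int) -> List[List]:
    """Split a list into `parts` equal contiguous chunks."""
    size = len(items) // parts
    return [items[i * size:(i + 1) * size] for i in range(parts)]

def schedule_large_rollout() -> List[int]:
    """512 prompts x 16 responses in one rollout, then 16 minibatch updates."""
    policy_version = 0
    staleness_per_update = []
    batch = generate(policy_version, [f"p{i}" for i in range(512)], n=16)
    for minibatch in split(batch, 16):  # 16 minibatches of 512 responses
        # All data was sampled at version 0; the policy has since moved on.
        staleness_per_update.append(policy_version - minibatch[0][2])
        policy_version += 1  # one gradient update
    return staleness_per_update  # grows: 0, 1, ..., 15

def schedule_small_rollouts() -> List[int]:
    """16 rollouts of 32 prompts x 16 responses, one update per rollout."""
    policy_version = 0
    staleness_per_update = []
    for chunk in split([f"p{i}" for i in range(512)], 16):  # 32 prompts each
        batch = generate(policy_version, chunk, n=16)  # 512 fresh responses
        staleness_per_update.append(policy_version - batch[0][2])  # always 0
        policy_version += 1
    return staleness_per_update
```

In this toy accounting, the large-rollout schedule applies its later updates to data that is up to 15 updates stale, while the small-rollout schedule is fully on-policy, which is the drift I am asking about.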

So my question is — is there a specific reason or advantage to the current design (large rollout + multiple minibatch updates) beyond what I described? Or am I misunderstanding something about the training process?

I'm also wondering if this is just a common convention in RL training that I'm not aware of, since my background in RL is not that deep. Any clarification would be appreciated! Thanks in advance.
