Hi, thanks for the great work!
I have a question about the training configuration.
In the paper and your code, the configuration is as follows:
Generation: 512 prompts × 16 responses = 8,192 responses
Minibatch updates: 512 responses × 16 iterations per rollout
This means that, within each rollout, 15 of the 16 gradient updates are performed on data sampled from the old (stale) policy, causing policy drift.
Instead, if we split the generation into 16 smaller rollouts:
Generation: 32 prompts × 16 responses = 512 responses
Minibatch updates: 512 responses × 1 update
This would require the same total computation but avoid policy drift, since every gradient update would be performed on freshly sampled data.
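To make the comparison concrete, here is a minimal arithmetic sketch of the two schedules (the function name and structure are illustrative, not from the repo); it just counts how many gradient updates per rollout reuse data sampled from a now-stale policy:

```python
# Hypothetical sketch: count gradient updates performed on stale
# (off-policy) data under each schedule. Names are illustrative.

def stale_updates_per_rollout(num_prompts, responses_per_prompt, minibatch_size):
    total_responses = num_prompts * responses_per_prompt
    num_updates = total_responses // minibatch_size
    # The first minibatch update uses freshly sampled data;
    # every subsequent update reuses samples from the old policy.
    return num_updates - 1

# Current design: one large rollout, 16 minibatch updates.
print(stale_updates_per_rollout(512, 16, 512))       # -> 15

# Proposed alternative: 16 small rollouts, 1 update each.
print(16 * stale_updates_per_rollout(32, 16, 512))   # -> 0
```

Both schemes perform 16 gradient updates per 8,192 generated responses; they differ only in how many of those updates are off-policy.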
So my question is: is there a specific reason or advantage to the current design (large rollout + multiple minibatch updates) beyond what I described, or am I misunderstanding something about the training process?
I'm also wondering whether this is simply a common convention in RL training that I'm not aware of, since my RL background is not that deep. Any clarification would be appreciated! Thanks in advance.