In the paper, specifically in Section 3.1 (Clip Higher) and Section 3.3 (Dynamic Sampling), I found two figures showing entropy curves, as follows:
Could you tell me the difference between the RL methods used to produce these figures? My understanding is that Figure 2 uses an actor model trained with GRPO/PPO, while Figure 4 uses a model trained with DAPO. Is that correct?