In the paper, specifically in Section 3.1 (Clip Higher) and Section 3.3 (Dynamic Sampling), I found two figures showing entropy curves, as follows:
Could you tell me the difference between the RL methods used to produce these figures? My understanding is that Figure 2 uses an actor model trained with GRPO/PPO, while Figure 4 uses a model trained with DAPO. Is that correct?