This example demonstrates the two algorithms Clip_B and Clip_V from the paper *On the Entropy Dynamics in Reinforcement Fine-Tuning of Large Language Models*.
We use the DAPO-Math-17k dataset as our training set and hold out 500 questions to form the validation set (denoted dapo-validation-500). We filter out training samples with excessively high (≥ 15/16) or low (≤ 1/16) pass rates, as evaluated by Qwen2.5-7B-Instruct.
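The filtering step above can be sketched as follows. This is a minimal illustration, not the actual preprocessing script: the `passes` field (number of correct rollouts out of an assumed 16 per question) and the dataset construction are hypothetical stand-ins for the real pass-rate evaluation with Qwen2.5-7B-Instruct.

```python
import random

random.seed(0)

# Hypothetical dataset: each question carries a pass count out of an
# assumed 16 rollouts (the real counts come from Qwen2.5-7B-Instruct).
dataset = [{"question": f"q{i}", "passes": random.randint(0, 16)}
           for i in range(20000)]

# Hold out 500 questions as the validation set (dapo-validation-500).
validation = dataset[:500]
train_pool = dataset[500:]

# Drop questions with pass rate >= 15/16 or <= 1/16,
# i.e. keep only those with 2..14 passes out of 16.
train = [ex for ex in train_pool if 1 < ex["passes"] < 15]

print(len(validation))  # 500
```

Filtering out near-always-solved and near-never-solved questions keeps training focused on problems where the policy's pass rate, and hence the reward signal, is informative.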
- Apply the patch to keep entropy information in the trainer batch:

  ```bash
  cd /path/to/Trinity-RFT
  git apply examples/entropy/clipb_trainer.patch
  ```

- Update the dataset paths in the config file `clipb.yaml` to point to your local data.

- Run the experiment:

  ```bash
  trinity run examples/entropy/clipb.yaml
  ```

Coming soon.