This example demonstrates the two algorithms Clip_B and Clip_V from the paper *On the Entropy Dynamics in Reinforcement Fine-Tuning of Large Language Models*.
We use the DAPO-Math-17k dataset as our training set and hold out 500 questions to form the validation set (denoted dapo-validation-500). We filter out training samples with excessively high (≥ 15/16) or low (≤ 1/16) pass rates, as evaluated by Qwen2.5-7B-Instruct.
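The filtering step above can be sketched as follows. This is a minimal illustration, not the actual preprocessing script: the `passes` field (number of correct rollouts out of an assumed 16 per question) and the dataset construction are hypothetical stand-ins for the real pass-rate evaluation with Qwen2.5-7B-Instruct.

```python
import random

random.seed(0)

# Hypothetical dataset: each question carries a pass count out of an
# assumed 16 rollouts (the real counts come from Qwen2.5-7B-Instruct).
dataset = [{"question": f"q{i}", "passes": random.randint(0, 16)}
           for i in range(20000)]

# Hold out 500 questions as the validation set (dapo-validation-500).
validation = dataset[:500]
train_pool = dataset[500:]

# Drop questions with pass rate >= 15/16 or <= 1/16,
# i.e. keep only those with 2..14 passes out of 16.
train = [ex for ex in train_pool if 1 < ex["passes"] < 15]

print(len(validation))  # 500
```

Filtering out near-always-solved and near-never-solved questions keeps training focused on problems where the policy's pass rate, and hence the reward signal, is informative.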
- Apply the patch to keep entropy information in the trainer batch:

  ```bash
  cd /path/to/Trinity-RFT
  git apply examples/entropy/clipb_trainer.patch
  ```

- Update the dataset paths in the config file `clipb.yaml` to point to your local data.

- Run the experiment:

  ```bash
  trinity run examples/entropy/clipb.yaml
  ```

Coming soon.