
Commit e8be774

[Example] Clip_B and Clip_V from entropy dynamics (#509)
1 parent 02c7c8e commit e8be774

File tree: 14 files changed, +1496 −9 lines

examples/entropy/README.md

Lines changed: 80 additions & 0 deletions
# Entropy dynamics of RL training

This example implements the two algorithms **Clip_B** and **Clip_V** from [On the Entropy Dynamics in Reinforcement Fine-Tuning of Large Language Models](https://arxiv.org/pdf/2602.03392).

NOTE: This example has only been tested with trinity==0.5.1 and verl==0.7.0. The experiments below require `synchronizer.sync_interval=1` and `trainer.trainer_config.algorithm.rollout_correction.bypass_mode=false`.

We also provide a runnable branch in the [Trinity-RFT](https://github.com/hiyuchang/Trinity-RFT/tree/example/entropy) repository that already includes all patches for this example.

## Data Preparation

We use the [DAPO-Math-17k](https://huggingface.co/datasets/open-r1/DAPO-Math-17k-Processed) dataset as our training set, holding out 500 questions from it as the validation set (denoted dapo-validation-500).
We then filter the training set by removing samples with excessively high (≥ 15/16) or low (≤ 1/16) pass rates, as evaluated by Qwen2.5-7B-Instruct.
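The pass-rate filter described above can be sketched as follows (the function name and data layout are ours, not the repo's API; pass counts assume 16 rollouts per question, matching the 1/16 and 15/16 thresholds):

```python
def filter_by_pass_rate(pass_counts, n_rollouts=16):
    """Keep only questions whose pass rate over n_rollouts attempts
    is strictly between 1/16 and 15/16 (i.e. neither trivially easy
    nor essentially unsolvable for the grading model)."""
    kept = []
    for qid, n_passed in pass_counts.items():
        rate = n_passed / n_rollouts
        if 1 / 16 < rate < 15 / 16:
            kept.append(qid)
    return kept

# Example: only q3 survives; q2 is at the low cutoff, q4 at the high one.
kept = filter_by_pass_rate({"q1": 0, "q2": 1, "q3": 8, "q4": 15, "q5": 16})
```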
## Clip_B Experiment

1. Apply the patch to keep entropy information in the trainer batch:

    ```bash
    cd /path/to/Trinity-RFT
    git apply examples/entropy/clipb_trainer.patch
    # if not successful, try:
    # git apply --3way --ignore-whitespace examples/entropy/clipb_trainer.patch
    ```

2. Update the dataset paths and other configurations in [`clipb.yaml`](./clipb.yaml) to point to your local data.

3. Run the experiment:

    ```bash
    trinity run examples/entropy/clipb.yaml
    ```

## Clip_V Experiment

1. Apply the patch to keep entropy information in the trainer batch:

    ```bash
    cd /path/to/Trinity-RFT
    git apply examples/entropy/clipv_trainer.patch
    # if not successful, try:
    # git apply --3way --ignore-whitespace examples/entropy/clipv_trainer.patch
    ```

2. Update the dataset paths and other configurations in [`clipv.yaml`](./clipv.yaml) to point to your local data.

3. Run the experiment:

    ```bash
    trinity run examples/entropy/clipv.yaml
    ```

### Logic of Clip_V

As shown in the following flowchart, the forward pass in [examples/entropy/clipv_dp_actor.py](./clipv_dp_actor.py) outputs `log_probs`, `entropy`, and `nec`.
These signals are then used by the [Clip_V advantage function](../../trinity/algorithm/advantage_fn/clipv_advantage.py) to compute `xD` and clip only negative-advantage tokens, returning the revised `advantages`.

```mermaid
flowchart TD
    A["data"]
    B["forward pass"]
    C1["log_probs"]
    C2["entropy (additional)"]
    C3["nec (additional)"]
    subgraph D["advantage computation"]
        direction TB
        F["xD = nec - exp(log_probs) * (entropy + log_probs)"]
        G["only clip negative-advantage tokens"]
        F --> G
    end
    E["advantages"]

    A --> B
    B --> C1
    B --> C2
    B --> C3
    C1 --> D
    C2 --> D
    C3 --> D
    D --> E
```
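The computation in the flowchart can be sketched per token as below. The `xD` formula is taken from the flowchart; the exact clipping rule is an assumption on our part (we zero out negative-advantage tokens whose `xD` exceeds the `mu` threshold from `clipv.yaml`), so treat this as an illustration rather than the repo's implementation:

```python
import math

def clipv_advantages(log_probs, entropy, nec, advantages, mu=8.5):
    """Illustrative sketch of Clip_V (helper name and clipping rule are ours).
    For each token: xD = nec - exp(log_prob) * (entropy + log_prob).
    Only negative-advantage tokens are eligible for clipping."""
    revised = []
    for lp, h, n, adv in zip(log_probs, entropy, nec, advantages):
        xD = n - math.exp(lp) * (h + lp)
        if adv < 0 and xD > mu:
            revised.append(0.0)  # clip: drop this token's negative advantage
        else:
            revised.append(adv)  # positive-advantage tokens are never clipped
    return revised
```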

examples/entropy/clipb.yaml

Lines changed: 100 additions & 0 deletions
project: math_dapo
name: clipb_example
checkpoint_root_dir: ${oc.env:TRINITY_CHECKPOINT_ROOT_DIR,./checkpoints}
model:
  model_path: ${oc.env:TRINITY_MODEL_PATH,Qwen/Qwen2.5-7B-Instruct}
  max_prompt_tokens: 1024
  max_response_tokens: 7168
algorithm:
  algorithm_type: grpo_verl
  advantage_fn: clipb
  advantage_fn_args:
    mu: 2.5
  repeat_times: 16
  kl_loss_fn_args:
    kl_coef: 0.0
cluster:
  node_num: 1
  gpu_per_node: 8
buffer:
  total_epochs: 20
  batch_size: 64
  explorer_input:
    taskset:
      name: dapo_235
      storage_type: file
      path: ${oc.env:TRINITY_TASKSET_PATH} # processed DAPO-Math-17k
      format:
        prompt_key: 'question'
        response_key: 'ground_truth'
      rollout_args:
        temperature: 1.0
        logprobs: 20
    eval_tasksets:
      - name: dapo-validation-500
        storage_type: file
        path: '/path/to/dapo-validation' # validation samples from DAPO-Math-17k
        split: 'test'
        repeat_times: 32
        format:
          prompt_key: 'question'
          response_key: 'ground_truth'
        rollout_args:
          temperature: 0.7
      - name: amc23
        storage_type: file
        path: math-ai/amc23 # Path to the AMC23 dataset
        split: 'test'
        repeat_times: 32
        format:
          prompt_key: 'question'
          response_key: 'answer'
        rollout_args:
          temperature: 0.7
      - name: aime24
        storage_type: file
        path: HuggingFaceH4/aime_2024 # Path to the AIME2024 dataset
        split: 'train'
        repeat_times: 32
        format:
          prompt_key: 'problem'
          response_key: 'answer'
        rollout_args:
          temperature: 0.7
      - name: aime25
        storage_type: file
        path: math-ai/aime25 # Path to the AIME2025 dataset
        split: 'test'
        repeat_times: 32
        format:
          prompt_key: 'problem'
          response_key: 'answer'
        rollout_args:
          temperature: 0.7
    default_workflow_type: 'async_math_workflow'
    default_reward_fn_type: 'math_boxed_reward'
  trainer_input:
    experience_buffer:
      name: math_buffer
      storage_type: queue
      max_read_timeout: 7200
explorer:
  eval_interval: 20
  eval_on_startup: true
  runner_per_model: 8
  rollout_model:
    engine_type: vllm_async
    engine_num: 4
    tensor_parallel_size: 1
seed: 42
trainer:
  trainer_type: 'verl'
  save_interval: 200
  trainer_config:
    algorithm:
      rollout_correction:
        bypass_mode: false
synchronizer:
  sync_method: 'nccl'
  sync_interval: 1
  sync_timeout: 3200
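The `${oc.env:VAR,default}` entries in the config are OmegaConf environment-variable interpolations, so the model, checkpoint, and taskset locations can be supplied from the shell before launching (the paths below are placeholders for your own local data):

```shell
# Placeholders: point these at your own checkpoint dir, model, and processed dataset.
export TRINITY_CHECKPOINT_ROOT_DIR=/data/checkpoints
export TRINITY_MODEL_PATH=Qwen/Qwen2.5-7B-Instruct
export TRINITY_TASKSET_PATH=/data/dapo-math-17k-processed
```

With these set, `trinity run examples/entropy/clipb.yaml` picks them up; unset variables with a default (e.g. `TRINITY_MODEL_PATH`) fall back to the value after the comma.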
Lines changed: 11 additions & 0 deletions
--- a/trinity/trainer/verl_trainer.py
+++ b/trinity/trainer/verl_trainer.py
@@ -501,7 +501,8 @@ class VerlPPOTrainerWrapper(RayPPOTrainer, TrainEngineWrapper):
         }
         metrics.update(old_log_prob_metrics)
-        old_log_prob.batch.pop("entropys")
+        # Keep entropys in batch so advantage_fn (e.g. Clip_B) can use it
+        # old_log_prob.batch.pop("entropys")
         batch = batch.union(old_log_prob)
         if "rollout_log_probs" in batch.batch.keys():
             # TODO: we may want to add diff of probs too.
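The point of the patch is that `entropys` now survives the `batch.union(old_log_prob)` merge and is visible to the advantage function. A minimal dict-based sketch of that data flow (a plain-dict stand-in for verl's batch object; the `union` helper and merge semantics here are ours, purely for illustration):

```python
def union(batch, other):
    """Dict-based stand-in for batch.union(): merge keys from both sides."""
    merged = dict(other)
    merged.update(batch)
    return merged

old_log_prob = {"old_log_probs": [-0.2, -1.3], "entropys": [0.7, 1.1]}
batch = {"responses": ["...", "..."]}

# Without the patch, "entropys" would have been popped before this merge
# and the advantage function could never see it.
batch = union(batch, old_log_prob)
```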

examples/entropy/clipv.yaml

Lines changed: 100 additions & 0 deletions
project: math_dapo
name: clipv_example
checkpoint_root_dir: ${oc.env:TRINITY_CHECKPOINT_ROOT_DIR,./checkpoints}
model:
  model_path: ${oc.env:TRINITY_MODEL_PATH,Qwen/Qwen2.5-7B-Instruct}
  max_prompt_tokens: 1024
  max_response_tokens: 7168
algorithm:
  algorithm_type: grpo_verl
  advantage_fn: clipv
  advantage_fn_args:
    mu: 8.5
  repeat_times: 8
  kl_loss_fn_args:
    kl_coef: 0.0
cluster:
  node_num: 1
  gpu_per_node: 8
buffer:
  total_epochs: 20
  batch_size: 64
  explorer_input:
    taskset:
      name: dapo_235
      storage_type: file
      path: ${oc.env:TRINITY_TASKSET_PATH} # processed DAPO-Math-17k
      format:
        prompt_key: 'question'
        response_key: 'ground_truth'
      rollout_args:
        temperature: 1.0
        logprobs: 20
    eval_tasksets:
      - name: dapo-validation-500
        storage_type: file
        path: '/path/to/dapo-validation' # validation samples from DAPO-Math-17k
        split: 'test'
        repeat_times: 32
        format:
          prompt_key: 'question'
          response_key: 'ground_truth'
        rollout_args:
          temperature: 0.7
      - name: amc23
        storage_type: file
        path: math-ai/amc23 # Path to the AMC23 dataset
        split: 'test'
        repeat_times: 32
        format:
          prompt_key: 'question'
          response_key: 'answer'
        rollout_args:
          temperature: 0.7
      - name: aime24
        storage_type: file
        path: HuggingFaceH4/aime_2024 # Path to the AIME2024 dataset
        split: 'train'
        repeat_times: 32
        format:
          prompt_key: 'problem'
          response_key: 'answer'
        rollout_args:
          temperature: 0.7
      - name: aime25
        storage_type: file
        path: math-ai/aime25 # Path to the AIME2025 dataset
        split: 'test'
        repeat_times: 32
        format:
          prompt_key: 'problem'
          response_key: 'answer'
        rollout_args:
          temperature: 0.7
    default_workflow_type: 'async_math_workflow'
    default_reward_fn_type: 'math_boxed_reward'
  trainer_input:
    experience_buffer:
      name: math_buffer
      storage_type: queue
      max_read_timeout: 7200
explorer:
  eval_interval: 20
  eval_on_startup: true
  runner_per_model: 8
  rollout_model:
    engine_type: vllm_async
    engine_num: 4
    tensor_parallel_size: 1
seed: 42
trainer:
  trainer_type: 'verl'
  save_interval: 100
  trainer_config:
    algorithm:
      rollout_correction:
        bypass_mode: false
synchronizer:
  sync_method: 'nccl'
  sync_interval: 1
  sync_timeout: 3600
