examples/grpo_gsm8k_trainable_ruler/README.md
Ref: ART's RULER; Kimi-k2.

Simulate a scenario where only a fraction (`PROBABILITY_GROUND_TRUTH_AVAILABLE = 0.2`) of tasks have ground-truth answers available for rule-based reward.

Two RL objectives are optimized jointly: one for solution generation, the other for RULER-reward generation.
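
A minimal sketch of how this split might be simulated, assuming a task is simply a dict and the gold answer is hidden when it is "unavailable" (the constant name comes from this example; the function and field names are hypothetical, not the repository's actual API):

```python
import random

PROBABILITY_GROUND_TRUTH_AVAILABLE = 0.2  # fraction of tasks that keep their gold answer

def assign_reward_source(tasks, seed=42):
    """Randomly mark each task as having (or lacking) a usable ground-truth answer."""
    rng = random.Random(seed)
    for task in tasks:
        if rng.random() < PROBABILITY_GROUND_TRUTH_AVAILABLE:
            task["reward_source"] = "rule"   # rule-based reward against the gold answer
        else:
            task["reward_source"] = "ruler"  # reward produced by the trainable RULER judge
            task.pop("answer", None)         # hide the gold answer from the workflow
    return tasks
```
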
## Configurations and Metrics

The config files are located in [`gsm8k_ruler.yaml`](gsm8k_ruler.yaml) and [`train_gsm8k_trainable_ruler.yaml`](train_gsm8k_trainable_ruler.yaml).

Some key configs in this example are:

* `default_workflow_type`: set to `math_trainable_ruler_workflow`.
* `std_threshold` for the GRPO advantage: set to a small value to filter out groups of experiences whose rewards are all identical (e.g., when RULER fails to return valid scores, the whole group's rewards are set to zero); see the sketch below.
* `sync_style`: use `dynamic_by_explorer`, since experience filtering makes the number of experiences available per step vary.
* `train_batch_size`: set to 960; note that one explore step can generate more than 96 * 8 = 768 experiences.
* `lr`: set to a small value (2e-6) for stability, as rewards can be noisy.
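
The filtering behind `std_threshold` can be pictured with the following sketch, which drops any group of experiences whose rewards are (near-)constant, such as an all-zero group from a failed RULER call (the data layout and function name are illustrative assumptions, not the framework's internals):

```python
import statistics

def filter_groups(groups, std_threshold=1e-6):
    """Keep only groups of experiences whose rewards actually differ.

    `groups` is assumed to be a list of lists of dicts with a "reward" key;
    a group with identical rewards carries no GRPO advantage signal.
    """
    kept = []
    for group in groups:
        rewards = [exp["reward"] for exp in group]
        if statistics.pstdev(rewards) > std_threshold:
            kept.append(group)
    return kept
```
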
Some important metrics to pay attention to are:

* `reward`: reward calculated by rule or by RULER.
* `gold_reward`: sum of `accuracy_reward` and `format_reward`, a rule-based calculation using the ground truth.
* `judge_success`: whether RULER successfully returns a valid score (a coarse estimate, since it mixes the two types of experiences).
* `reward_for_judger`: reward for the LLM acting as the RULER reward model, calculated from the mean absolute error (MAE) between its scores and the gold scores; a sketch follows this list.
* `eval_accuracy`: accuracy on the evaluation set (the ultimate metric of RL success).
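
As a rough illustration of `reward_for_judger`, the judge's reward could be taken as the negative MAE between its scores and the rule-based gold scores for the same group of solutions (the exact sign and scaling used in this example are assumptions here):

```python
def reward_for_judger(ruler_scores, gold_scores):
    """Hypothetical MAE-based reward for the LLM acting as the RULER judge.

    A smaller distance from the gold scores yields a higher (less negative) reward.
    """
    assert ruler_scores and len(ruler_scores) == len(gold_scores)
    mae = sum(abs(r - g) for r, g in zip(ruler_scores, gold_scores)) / len(ruler_scores)
    return -mae
```
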
## Results

Compare with baseline: previous RULER workflow with Qwen2.5-1.5B-Instruct as LLM.

## Potential improvements

Balance the number of samples / loss weights for generation vs. for RULER.