Commit 3b71fa1

hjh0119 authored and Jintao-Huang committed
[doc] correct the formula for DAPO soft-overlong (#5557)
* fix doc
* remove if
1 parent e62db3d commit 3b71fa1

File tree

2 files changed: +8 -8 lines changed
  • docs
    • source_en/Instruction/GRPO/AdvancedResearch
    • source/Instruction/GRPO/AdvancedResearch

docs/source/Instruction/GRPO/AdvancedResearch/DAPO.md

Lines changed: 4 additions & 4 deletions
@@ -72,13 +72,13 @@ DAPO designs a three-stage length penalty function:
 $$
 R_{\text{length}}(L) =
 \begin{cases}
-0, & \text{if } L \leq L_{\text{cache}} \\[10pt]
--\dfrac{L - L_{\text{cache}}}{L_{\text{max}} - L_{\text{cache}}}, & \text{if } L_{\text{cache}} < L < L_{\text{max}} \\[10pt]
--1, & \text{if } L \geq L_{\text{max}}
+0, & L \leq L_{\text{max}} - L_{\text{cache}} \\[10pt]
+\dfrac{(L_{\text{max}} - L_{\text{cache}}) - L}{L_{\text{cache}}}, & L_{\text{max}} - L_{\text{cache}} < L \leq L_{\text{max}} \\[10pt]
+-1, & L > L_{\text{max}}
 \end{cases}
 $$

-When the length falls within the interval (L_cache < L < L_max), a linearly increasing penalty is applied; for (L ≥ L_max), the maximum penalty (-1) is imposed
+When the length falls within the interval $(L_{\text{max}} - L_{\text{cache}} < L \leq L_{\text{max}})$, a linearly increasing penalty is applied; for $(L > L_{\text{max}})$, the maximum penalty (-1) is imposed


 Parameters
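
As a quick boundary check on the added formula (a reader's derivation, not part of the commit), substituting the two endpoints of the middle interval into the linear branch gives:

$$
\left.\dfrac{(L_{\text{max}} - L_{\text{cache}}) - L}{L_{\text{cache}}}\right|_{L = L_{\text{max}} - L_{\text{cache}}} = 0,
\qquad
\left.\dfrac{(L_{\text{max}} - L_{\text{cache}}) - L}{L_{\text{cache}}}\right|_{L = L_{\text{max}}} = \dfrac{-L_{\text{cache}}}{L_{\text{cache}}} = -1
$$

So the three branches join continuously, and the penalty ramps down over the last $L_{\text{cache}}$ tokens before $L_{\text{max}}$. The removed formula instead started ramping at $L = L_{\text{cache}}$ and spread the ramp over $L_{\text{max}} - L_{\text{cache}}$ tokens.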

docs/source_en/Instruction/GRPO/AdvancedResearch/DAPO.md

Lines changed: 4 additions & 4 deletions
@@ -60,13 +60,13 @@ DAPO designs a three-stage length penalty function:
 $$
 R_{\text{length}}(L) =
 \begin{cases}
-0, & \text{if } L \leq L_{\text{cache}} \\[10pt]
--\dfrac{L - L_{\text{cache}}}{L_{\text{max}} - L_{\text{cache}}}, & \text{if } L_{\text{cache}} < L < L_{\text{max}} \\[10pt]
--1, & \text{if } L \geq L_{\text{max}}
+0, & L \leq L_{\text{max}} - L_{\text{cache}} \\[10pt]
+\dfrac{(L_{\text{max}} - L_{\text{cache}}) - L}{L_{\text{cache}}}, & L_{\text{max}} - L_{\text{cache}} < L \leq L_{\text{max}} \\[10pt]
+-1, & L > L_{\text{max}}
 \end{cases}
 $$

-When the length falls within the interval (L_cache < L < L_max), a linearly increasing penalty is applied. For lengths (L ≥ L_max), the maximum penalty (-1) is imposed.
+When the length falls within the interval $(L_{\text{max}} - L_{\text{cache}} < L \leq L_{\text{max}})$, a linearly increasing penalty is applied. For lengths $(L > L_{\text{max}})$, the maximum penalty (-1) is imposed.

 Parameters:
 - `reward_funcs soft_overlong` enables this reward function.
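
For readers who want to sanity-check the corrected formula numerically, here is a minimal Python sketch of the piecewise function as written in the added lines. The function name and the parameter names `max_length` and `cache_length` are illustrative assumptions standing in for $L_{\text{max}}$ and $L_{\text{cache}}$; this is not the library's implementation.

```python
def soft_overlong_reward(length: int, max_length: int, cache_length: int) -> float:
    """Evaluate the corrected three-stage length penalty R_length(L).

    Hypothetical helper: names are illustrative, not taken from the library.
    """
    if not 0 < cache_length <= max_length:
        raise ValueError("expected 0 < cache_length <= max_length")

    # Lengths up to this threshold receive no penalty.
    threshold = max_length - cache_length

    if length <= threshold:
        return 0.0
    if length <= max_length:
        # Linear ramp from 0 (at the threshold) down to -1 (at max_length).
        return (threshold - length) / cache_length
    # Beyond max_length the maximum penalty applies.
    return -1.0


if __name__ == "__main__":
    # Example values chosen only for illustration.
    for n in (1000, 3072, 3584, 4096, 5000):
        print(n, soft_overlong_reward(n, max_length=4096, cache_length=1024))
```

With these example values the penalty is zero up to 3072 tokens, reaches half strength at 3584, and saturates at -1 from 4096 onward.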
