Commit fa82707

committed
[feat] update readme
1 parent 5129f0e commit fa82707

2 files changed: +4 −4 lines changed


README.md

Lines changed: 2 additions & 2 deletions

```diff
@@ -1211,7 +1211,7 @@ python train_ppo.py
 $$\mathcal{L}_{GRPO} = -\mathbb{E}\left[r_t \cdot A_t - \beta \cdot \text{KL}_t\right]$$
 
 Where:
-- **Policy term**: $f(r_t) = r_t$ (uses the probability ratio directly, no clipping)
+- **Policy term**: $f(r_t) = \min(r_t, \text{clip}(r_t))$ (uses the probability ratio with clipping)
 - **Advantage term**: $g(A_t) = \frac{R - \mu_{group}}{\sigma_{group}}$ (within-group normalization, eliminating the Critic network)
 - **Regularization term**: $h(\text{KL}_t) = \beta \cdot \text{KL}_t$ (token-level KL divergence constraint)
@@ -1294,7 +1294,7 @@ python train_spo.py
 |------|----------------|----------------|----------------------|----------|
 | **DPO** | $\log r_w - \log r_l$ | Implicit (preference contrast) | Implicit in $\beta$ | 2 |
 | **PPO** | $\min(r, \text{clip}(r))$ | $R - V(s)$ | $\beta \cdot \mathbb{E}[\text{KL}]$ | 4 |
-| **GRPO** | $r$ | $\frac{R - \mu}{\sigma}$ | $\beta \cdot \text{KL}_t$ | 2 |
+| **GRPO** | $\min(r, \text{clip}(r))$ | $\frac{R - \mu}{\sigma}$ | $\beta \cdot \text{KL}_t$ | 2 |
 | **SPO** | $\log \pi_\theta$ | $R - B_t^{adaptive}$ | $\beta \cdot \text{KL}_t$ | 2 |
 
 **RL is elegant and self-consistent**
```
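The corrected objective in this diff — a clipped policy ratio, a group-normalized advantage in place of a Critic, and a token-level KL penalty — can be sketched in NumPy, writing the policy term out in its full PPO form $\min(r_t A_t, \text{clip}(r_t) A_t)$. This is a minimal illustration, not code from the repository: the function name `grpo_loss`, the tensor shapes, the `beta`/`eps` defaults, and the choice of the k3 KL estimator are all assumptions for demonstration.

```python
import numpy as np

def grpo_loss(logp_new, logp_old, logp_ref, rewards, beta=0.04, eps=0.2):
    """Sketch of the GRPO objective with the PPO-style clipped policy term.

    logp_new / logp_old / logp_ref: (G, T) per-token log-probs for a group
    of G sampled responses of T tokens (current, rollout, and frozen
    reference policy). rewards: (G,) one scalar reward per response.
    """
    # Probability ratio r_t = pi_theta / pi_old, per token.
    ratio = np.exp(logp_new - logp_old)
    # Group-normalized advantage A = (R - mu_group) / sigma_group:
    # no Critic network is needed, the group itself is the baseline.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    adv = adv[:, None]  # broadcast one advantage over all tokens of a response
    # Clipped policy term: min(r_t * A_t, clip(r_t, 1-eps, 1+eps) * A_t).
    policy = np.minimum(ratio * adv, np.clip(ratio, 1 - eps, 1 + eps) * adv)
    # Token-level KL penalty against the reference policy (k3 estimator).
    kl = np.exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1
    # Negative sign: we minimize the loss to maximize the objective.
    return -(policy - beta * kl).mean()
```

Taking the elementwise `min` with the unclipped term (rather than clipping the ratio alone) keeps the bound pessimistic for both positive and negative advantages, which is exactly what distinguishes the corrected row in the table from the old `r`-only entry.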

README_en.md

Lines changed: 2 additions & 2 deletions

```diff
@@ -1193,7 +1193,7 @@ In early 2025, DeepSeek-R1 became extremely popular, and equally popular was the
 $$\mathcal{L}_{GRPO} = -\mathbb{E}\left[r_t \cdot A_t - \beta \cdot \text{KL}_t\right]$$
 
 Where:
-- **Policy term**: $f(r_t) = r_t$ (directly use probability ratio, no clip clipping)
+- **Policy term**: $f(r_t) = \min(r_t, \text{clip}(r_t))$ (use probability ratio with clip clipping)
 - **Advantage term**: $g(A_t) = \frac{R - \mu_{group}}{\sigma_{group}}$ (within-group normalization, eliminate Critic network)
 - **Regularization term**: $h(\text{KL}_t) = \beta \cdot \text{KL}_t$ (token-level KL divergence constraint)
@@ -1271,7 +1271,7 @@ We return to the "**unified framework**", reorganizing the table showing all dif
 |-----------|----------------|----------------|----------------------|----------|
 | **DPO** | $\log r_w - \log r_l$ | Implicit (preference contrast) | Implicit in $\beta$ | 2 |
 | **PPO** | $\min(r, \text{clip}(r))$ | $R - V(s)$ | $\beta \cdot \mathbb{E}[\text{KL}]$ | 4 |
-| **GRPO** | $r$ | $\frac{R - \mu}{\sigma}$ | $\beta \cdot \text{KL}_t$ | 2 |
+| **GRPO** | $\min(r, \text{clip}(r))$ | $\frac{R - \mu}{\sigma}$ | $\beta \cdot \text{KL}_t$ | 2 |
 | **SPO** | $\log \pi_\theta$ | $R - B_t^{adaptive}$ | $\beta \cdot \text{KL}_t$ | 2 |
 
 **RL is Elegant and Self-Consistent**
```
