
Commit a39635c

Sindhujach217 committed: update readme.md
1 parent a8d0d1a · commit a39635c

File tree: 1 file changed (+15, -18 lines)

src/aixpert/training/README.md

Lines changed: 15 additions & 18 deletions
@@ -19,76 +19,73 @@ This section summarizes the **training objectives** used in this repository.
Given a preference tuple $(x, y_w, y_l)$ and a reference policy $\pi_{\text{ref}}$, the **Direct Preference Optimization (DPO)** margin is defined as:

```math
m(x, y_w, y_l) =
\log \frac{\pi_\theta(y_w \mid x)}{\pi_\theta(y_l \mid x)}
-
\log \frac{\pi_{\text{ref}}(y_w \mid x)}{\pi_{\text{ref}}(y_l \mid x)}
```
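
In code, the margin is just a difference of log-ratios. Below is a minimal PyTorch sketch (an assumption; the framework and the `dpo_margin` name are illustrative, not this repository's API), assuming each tensor already holds the summed per-token log-probability of a whole response:

```python
import torch

def dpo_margin(
    policy_logp_chosen: torch.Tensor,    # log pi_theta(y_w | x), shape (batch,)
    policy_logp_rejected: torch.Tensor,  # log pi_theta(y_l | x), shape (batch,)
    ref_logp_chosen: torch.Tensor,       # log pi_ref(y_w | x), shape (batch,)
    ref_logp_rejected: torch.Tensor,     # log pi_ref(y_l | x), shape (batch,)
) -> torch.Tensor:
    """m = (policy log-ratio) - (frozen reference log-ratio)."""
    policy_logratio = policy_logp_chosen - policy_logp_rejected
    ref_logratio = ref_logp_chosen - ref_logp_rejected
    return policy_logratio - ref_logratio
```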

The **Original DPO loss** is:

```math
\mathcal{L}_{\text{DPO}}(\theta)
=
-\mathbb{E}_{(x,y_w,y_l)}
\left[
\log \sigma\left(\beta \cdot m(x,y_w,y_l)\right)
\right]
```

where:

- $\pi_\theta$ is the trainable policy
- $\pi_{\text{ref}}$ is the frozen reference policy
- $\beta$ is a temperature parameter
- $\sigma(\cdot)$ is the sigmoid function
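
The loss applies a temperature-scaled log-sigmoid to the margin and averages over the batch. A sketch under the same assumptions, continuing the hypothetical `dpo_margin` helper above:

```python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_logp_chosen: torch.Tensor,
    policy_logp_rejected: torch.Tensor,
    ref_logp_chosen: torch.Tensor,
    ref_logp_rejected: torch.Tensor,
    beta: float = 0.1,  # illustrative default for the temperature
) -> torch.Tensor:
    """Original DPO loss: -E[ log sigma(beta * m) ]."""
    # reuses dpo_margin from the sketch above
    m = dpo_margin(
        policy_logp_chosen, policy_logp_rejected,
        ref_logp_chosen, ref_logp_rejected,
    )
    # logsigmoid is numerically stabler than log(sigmoid(...))
    return -F.logsigmoid(beta * m).mean()
```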

---

### Factual-DPO

Each preference tuple additionally includes factuality indicators $h_w, h_l \in \{0,1\}$, where $1$ denotes a factual violation.

After label transformation, define:

```math
\Delta h = h_l - h_w \in \{0, 1\}
```

The **factuality-aware margin** is:

```math
m_{\text{fact}} =
m - \lambda \cdot \Delta h
```

The **Factual-DPO loss** is:

```math
\mathcal{L}_{\text{FactualDPO}}(\theta)
=
-\mathbb{E}_{(x,y_w,y_l,h_w,h_l)}
\left[
\log \sigma\left(\beta \cdot (m - \lambda \cdot \Delta h)\right)
\right]
```

where:

- $\lambda$ controls the strength of the factuality penalty
- Larger $\lambda$ enforces stronger hallucination suppression
- When $\Delta h = 0$, the loss reduces to **Original DPO**
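
Relative to Original DPO, the only change is the $\lambda \cdot \Delta h$ shift applied to the margin before the log-sigmoid. A sketch under the same assumptions as above, with `h_chosen`/`h_rejected` as illustrative names for the factuality labels:

```python
import torch
import torch.nn.functional as F

def factual_dpo_loss(
    policy_logp_chosen: torch.Tensor,
    policy_logp_rejected: torch.Tensor,
    ref_logp_chosen: torch.Tensor,
    ref_logp_rejected: torch.Tensor,
    h_chosen: torch.Tensor,    # h_w in {0, 1}, 1 = factual violation
    h_rejected: torch.Tensor,  # h_l in {0, 1}
    beta: float = 0.1,         # illustrative defaults
    lam: float = 1.0,
) -> torch.Tensor:
    """Factual-DPO loss: -E[ log sigma(beta * (m - lambda * Delta h)) ]."""
    # reuses dpo_margin from the sketch above
    m = dpo_margin(
        policy_logp_chosen, policy_logp_rejected,
        ref_logp_chosen, ref_logp_rejected,
    )
    delta_h = (h_rejected - h_chosen).float()  # Delta h = h_l - h_w
    # when delta_h == 0 this is exactly the Original DPO loss
    return -F.logsigmoid(beta * (m - lam * delta_h)).mean()
```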

---

### Key Difference

| Method | Optimization Target |
|--------|---------------------|
| Original DPO | $\log \sigma(\beta \cdot m)$ |
| Factual-DPO | $\log \sigma(\beta \cdot (m - \lambda \Delta h))$ |
