Commit 12adce8

Update main readme with arxiv v2 (#121)
1 parent 675ff5b commit 12adce8

File tree

10 files changed (+300 −179 lines)

README.md

Lines changed: 138 additions & 83 deletions
7 binary image files updated (1.01 MB, 59 KB, 205 KB, 761 KB, 134 KB, 379 KB, 668 KB); previews not rendered.

docs/sphinx_doc/source/main.md

Lines changed: 157 additions & 96 deletions

docs/sphinx_doc/source/tutorial/example_mix_algo.md

Lines changed: 5 additions & 0 deletions
@@ -20,6 +20,11 @@ $$

The first term corresponds to the standard GRPO objective, which aims to maximize the expected reward. The last term is an auxiliary objective defined on expert data, encouraging the policy to imitate expert behavior. $\mu$ is a weighting factor that controls the relative importance of the two terms.

Added in this commit:

A visualization of this pipeline is as follows:

![](../../assets/trinity-mix.png)
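The combined objective described above can be sketched in a few lines. The following is a minimal, framework-agnostic illustration in plain Python with hypothetical names (`mixed_objective_loss`, the default `mu`), not the project's actual implementation:

```python
def mixed_objective_loss(grpo_loss, expert_logprobs, mu=0.1):
    """Hypothetical sketch: GRPO loss plus a mu-weighted imitation term.

    grpo_loss: scalar loss from the standard GRPO objective.
    expert_logprobs: log-probabilities the current policy assigns to
        expert response tokens.
    mu: weighting factor controlling the relative importance of the two
        terms (default chosen arbitrarily for illustration).
    """
    # Auxiliary objective on expert data: negative mean log-likelihood,
    # so minimizing it encourages the policy to imitate expert behavior.
    sft_loss = -sum(expert_logprobs) / len(expert_logprobs)
    return grpo_loss + mu * sft_loss
```

In a real trainer both terms would be differentiable tensors produced by the policy network; the point here is only the `grpo + mu * imitation` composition.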
## Step 0: Prepare the Expert Data
We prompt a powerful LLM to generate responses with a chain-of-thought (CoT) process for a set of pre-defined questions. The collected data are treated as experiences from an expert. We store them in a `jsonl` file, `expert_data.jsonl`, with the following format:
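As a rough illustration, one way to write and read back such a JSON Lines file in Python is sketched below. The `question` and `response` keys are hypothetical placeholders, since the tutorial's actual schema is not shown in this excerpt:

```python
import json

# Hypothetical expert records; the real expert_data.jsonl schema may differ.
records = [
    {"question": "What is 2 + 3?",
     "response": "First add the numbers: 2 + 3 = 5. The answer is 5."},
]

# One JSON object per line is the jsonl convention.
with open("expert_data.jsonl", "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Reading it back yields one dict per line.
with open("expert_data.jsonl", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
```

Each line is an independent JSON object, so the file can be streamed record by record during training without loading everything into memory.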
