Intuitive understanding of the algorithm? #27
Hey authors! I find your KTO paper quite interesting and would like to explore its application in my work. I am here to see if I can get a better intuitive understanding of the algorithm, especially how it compares with methods such as PPO or DPO. I could be wrong or may have missed some key points in the paper, and would appreciate it if you could point them out!
Here are some of my questions:
- Why that specific form of $r_\theta$? I didn't find any discussion of the relationship between human utility and the preference probability for a pair of sentences (the Bradley-Terry style). To me, the formula for $r_\theta$ just comes out of thin air in Definition 3.4, and a natural question is whether there is a better formulation of $r_\theta$ that gives better results. Although it is explained how this definition compares to classic prospect theory, I find it hard to understand why we should define it in nats like this.
- Why does a biased KL divergence estimate work? It is hard to see that the estimate is "good". The experiments show it works empirically, but what does that mean? Does it mean the estimate is not actually that noisy, or that it is the existence, rather than the value, of the baseline that matters?
- How does KTO intuitively work? Page 6 has a paragraph beginning "Intuitively, KTO works as follows", but does that really hold up given that the KL estimate is noisy and no gradient flows through it? The loss is not penalizing a large KL at all, and a positive KL estimate pushes the model to favor an even larger $r_\theta$. This should only make "the model increases the reward of a desirable example in a blunt manner" even worse.
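To make the last point concrete, here is a minimal sketch of how I read the per-example loss: $\lambda_D(1 - \sigma(\beta(r_\theta - z_{ref})))$ for desirable examples and $\lambda_U(1 - \sigma(\beta(z_{ref} - r_\theta)))$ for undesirable ones, with the KL estimate $z_{ref}$ treated as a detached constant (the function and variable names below are mine, not from your repo):

```python
import math

def sigmoid(t: float) -> float:
    return 1.0 / (1.0 + math.exp(-t))

def kto_loss(r_theta: float, z_ref: float, desirable: bool,
             beta: float = 0.1, lam_d: float = 1.0, lam_u: float = 1.0) -> float:
    """Per-example KTO-style loss as I understand it.

    r_theta: log pi_theta(y|x) - log pi_ref(y|x), in nats.
    z_ref:   the (biased, batch-level) KL estimate. Treated here as a plain
             constant, mirroring the detach: no gradient flows through it,
             so it only shifts the argument of the sigmoid.
    """
    if desirable:
        return lam_d * (1.0 - sigmoid(beta * (r_theta - z_ref)))
    return lam_u * (1.0 - sigmoid(beta * (z_ref - r_theta)))
```

Since `z_ref` is a constant shift inside the sigmoid, a larger KL estimate makes the loss on a desirable example larger at the same `r_theta`, which is exactly why I'd expect the model to respond by pushing `r_theta` even higher rather than shrinking the KL.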
Thanks for reading, and I look forward to hearing back!