In SDPO, the top-k tokens used to compute the KL term are selected from the student model's distribution. In Openclaw-RL, however, they are selected from the teacher model's distribution. Which one should we follow, and why?
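To make the difference concrete, here is a minimal NumPy sketch of the two selection rules. It is not taken from either codebase: the function name `topk_kl`, the `select_from` argument, the KL direction (student relative to teacher), and the choice to sum over the selected tokens without renormalizing are all assumptions made for illustration.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def topk_kl(student_logits, teacher_logits, k, select_from="student"):
    """Truncated KL(student || teacher) over k selected tokens.

    Hypothetical helper: the KL direction and the lack of
    renormalization are assumptions, not either project's method.
    """
    s_logp = log_softmax(student_logits)
    t_logp = log_softmax(teacher_logits)
    # The only difference between the two variants: whose top-k is used.
    ref = s_logp if select_from == "student" else t_logp
    idx = np.argsort(-ref, axis=-1)[..., :k]
    s_k = np.take_along_axis(s_logp, idx, axis=-1)
    t_k = np.take_along_axis(t_logp, idx, axis=-1)
    return (np.exp(s_k) * (s_k - t_k)).sum(axis=-1)

# Toy example: the student and teacher prefer opposite ends of the vocabulary.
student = np.array([2.0, 1.0, 0.0, -1.0])
teacher = np.array([-1.0, 0.0, 1.0, 2.0])
kl_student_topk = topk_kl(student, teacher, k=2, select_from="student")
kl_teacher_topk = topk_kl(student, teacher, k=2, select_from="teacher")
print(kl_student_topk, kl_teacher_topk)
```

In this toy case the student-selected variant sums over tokens where the student puts most of its mass, while the teacher-selected variant sums over tokens the student rarely produces, so the two give very different values. Once `k` equals the vocabulary size, both reduce to the full KL.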