In a reinforcement learning (RL) pipeline, we typically follow this sequence (sample formats for each stage are sketched right after the list):

1. Supervised Fine-Tuning (SFT) with Chain-of-Thought (CoT) data
2. Reinforcement Learning (RL)
3. SFT on non-reasoning data
4. RL again, and so on.
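To make the hand-off between stages concrete, here is a minimal sketch of what a training sample might look like at each step. The `<think>`/`</think>` delimiters and the dict layout are illustrative assumptions, not any particular model's or framework's format:

```python
# Stage 1 SFT sample: the CoT trace is part of the supervised target.
cot_sample = {
    "prompt": "Q: What is 17 * 24?",
    "target": "<think>17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408</think>\nThe answer is 408.",
}

# Stage 2 (after RL): the model now emits <think> ... </think> on its own,
# so any later SFT data has to decide what to do with that span.

# Stage 3 SFT sample: non-reasoning data, so no think span is available.
# Option A: leave the target as-is (the model may then stop "thinking" on such prompts).
# Option B: insert an empty or distilled think block to keep the format consistent.
plain_sample = {
    "prompt": "Write a one-line haiku about autumn.",
    "target": "<think></think>\nLeaves drift on cold wind.",
}
```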
However, after RL, the model starts generating "think" tokens or internal reasoning steps. This raises several questions:
1. How do we perform SFT after RL when the model now outputs "think" tokens?
2. What loss function should be used to align the model correctly? (See the loss sketch after this list.)
3. Does fine-tuning on non-reasoning data cause the model to forget the CoT patterns learned earlier?
4. If we only have non-CoT data for the initial SFT (no CoT data at all), how does this impact the model once RL follows?
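For concreteness, here is a minimal sketch of one option for questions 1 and 2: keep the model's own `<think>...</think>` span in the SFT target and train with standard next-token cross-entropy, masking only the prompt tokens with the usual `-100` ignore index. This is not a definitive recipe; `build_labels`, `think_spans`, and the `<think>` delimiters are assumptions for illustration, not part of any specific library.

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # ignore index commonly used for cross-entropy in SFT codebases

def build_labels(input_ids, prompt_len, think_spans=None, mask_think=False):
    """Build SFT labels for one sequence (1-D tensor of token ids).

    input_ids:   prompt + response tokens; the response may contain <think> tokens
    prompt_len:  number of prompt tokens; loss is never computed on the prompt
    think_spans: optional list of (start, end) token ranges covering <think>...</think>
    mask_think:  if True, the think spans are also excluded from the loss,
                 so SFT only supervises the final answer tokens
    """
    labels = input_ids.clone()
    labels[:prompt_len] = IGNORE_INDEX          # never train on the prompt
    if mask_think and think_spans:
        for start, end in think_spans:          # optionally skip the reasoning span
            labels[start:end] = IGNORE_INDEX
    return labels

def sft_loss(logits, labels):
    """Standard next-token cross-entropy over a batch: predict token t+1 from tokens <= t."""
    shift_logits = logits[:, :-1, :].contiguous()   # [batch, seq-1, vocab]
    shift_labels = labels[:, 1:].contiguous()       # [batch, seq-1]
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=IGNORE_INDEX,
    )
```

Whether to keep the think span in the loss (`mask_think=False` above) or mask it out is exactly the trade-off behind question 3: supervising the model's own reasoning traces helps preserve the RL-learned format during later SFT rounds, while training only on the final answers of non-reasoning data risks washing that format out.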
Looking for insights, research papers, or references that address these concerns. 🚀