In a reinforcement learning (RL) pipeline, we typically follow this sequence (sample formats for each stage are sketched right after the list):

1. Supervised Fine-Tuning (SFT) with Chain-of-Thought (CoT) data
2. Reinforcement Learning (RL)
3. SFT on non-reasoning data
4. RL again, and so on.
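To make the hand-off between stages concrete, here is a minimal sketch of what a training sample might look like at each step. The `<think>`/`</think>` delimiters and the dict layout are illustrative assumptions, not any particular model's or framework's format:

```python
# Stage 1 SFT sample: the CoT trace is part of the supervised target.
cot_sample = {
    "prompt": "Q: What is 17 * 24?",
    "target": "<think>17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408</think>\nThe answer is 408.",
}

# Stage 2 (after RL): the model now emits <think> ... </think> on its own,
# so any later SFT data has to decide what to do with that span.

# Stage 3 SFT sample: non-reasoning data, so no think span is available.
# Option A: leave the target as-is (the model may then stop "thinking" on such prompts).
# Option B: insert an empty or distilled think block to keep the format consistent.
plain_sample = {
    "prompt": "Write a one-line haiku about autumn.",
    "target": "<think></think>\nLeaves drift on cold wind.",
}
```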
However, after RL, the model starts generating "think" tokens or internal reasoning steps. This raises several questions:
1. How do we perform SFT after RL when the model now outputs "think" tokens?
2. What loss function should be used to align the model correctly? (See the loss sketch after this list.)
3. Does fine-tuning on non-reasoning data cause the model to forget the CoT patterns learned earlier?
4. If we only have non-CoT data for the initial SFT (no CoT data at all), how does this impact the model once RL follows?
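For concreteness, here is a minimal sketch of one option for questions 1 and 2: keep the model's own `<think>...</think>` span in the SFT target and train with standard next-token cross-entropy, masking only the prompt tokens with the usual `-100` ignore index. This is not a definitive recipe; `build_labels`, `think_spans`, and the `<think>` delimiters are assumptions for illustration, not part of any specific library.

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # ignore index commonly used for cross-entropy in SFT codebases

def build_labels(input_ids, prompt_len, think_spans=None, mask_think=False):
    """Build SFT labels for one sequence (1-D tensor of token ids).

    input_ids:   prompt + response tokens; the response may contain <think> tokens
    prompt_len:  number of prompt tokens; loss is never computed on the prompt
    think_spans: optional list of (start, end) token ranges covering <think>...</think>
    mask_think:  if True, the think spans are also excluded from the loss,
                 so SFT only supervises the final answer tokens
    """
    labels = input_ids.clone()
    labels[:prompt_len] = IGNORE_INDEX          # never train on the prompt
    if mask_think and think_spans:
        for start, end in think_spans:          # optionally skip the reasoning span
            labels[start:end] = IGNORE_INDEX
    return labels

def sft_loss(logits, labels):
    """Standard next-token cross-entropy over a batch: predict token t+1 from tokens <= t."""
    shift_logits = logits[:, :-1, :].contiguous()   # [batch, seq-1, vocab]
    shift_labels = labels[:, 1:].contiguous()       # [batch, seq-1]
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=IGNORE_INDEX,
    )
```

Whether to keep the think span in the loss (`mask_think=False` above) or mask it out is exactly the trade-off behind question 3: supervising the model's own reasoning traces helps preserve the RL-learned format during later SFT rounds, while training only on the final answers of non-reasoning data risks washing that format out.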
Looking for insights, research papers, or references that address these concerns. 🚀