Yongliang Wu, Yizhou Zhou, Ziheng Zhou, Yingzhe Peng, Xinyu Ye, Xinting Hu, Wenbo Zhu, Lu Qi, Ming-Hsuan Yang, Xu Yang
We are grateful for the many thoughtful comments and feedback from the community regarding DFT, ranging from discussions of related ideas to reports of its application in different scenarios. We have heard of both successes and failures when applying DFT, for instance in literary or financial tasks.
Here, we would like to clarify that we do not claim DFT can replace SFT in all cases, as noted in our limitations section:
“While our experiments demonstrate substantial gains from DFT on mathematical reasoning benchmarks, this evaluation is confined to math-focused and code-focused (will be released in next version) datasets and models up to 7 billion parameters.”
Nonetheless, these less successful cases, as well as community discussions on platforms such as Zhihu or Xiao Hong Shu about the intuitive principles behind DFT, together with our own experimental experience, have prompted us to think more deeply about the conditions under which DFT works well, and why it may be less effective in other contexts.
All this feedback reminds us of a remark by computing pioneer Richard Hamming in The Art of Doing Science and Engineering: Learning to Learn (p.27), which we have slightly adapted:
“Almost everyone who opens up a new field does not really understand it the way the followers—or the critics—do.”
We hope this work can contribute to renewed interest in exploring the interplay between SFT and RL, and in better understanding the factors that underlie both the successes and the limitations of methods like DFT. Looking ahead, we welcome researchers interested in our work to improve DFT in the currently unsuccessful cases, or to leverage its ideas to uncover other connections between RL algorithms and SFT, ultimately achieving RL-like benefits at the cost of SFT across a broader range of settings.
- [2025.08.08] We have released the training scripts, evaluation scripts, and model checkpoints.
We present a simple yet theoretically motivated improvement to Supervised Fine-Tuning (SFT) for Large Language Models (LLMs), addressing its limited generalization compared to reinforcement learning (RL). Through mathematical analysis, we reveal that standard SFT gradients implicitly encode a problematic reward structure that may severely restrict the generalization capabilities of the model. To rectify this, we propose Dynamic Fine-Tuning (DFT), which stabilizes gradient updates for each token by dynamically rescaling the objective function with the probability of that token. Remarkably, this single-line code change significantly outperforms standard SFT across multiple challenging benchmarks and base models, demonstrating greatly improved generalization. Additionally, our approach shows competitive results in offline RL settings, offering an effective yet simpler alternative. This work bridges theoretical insight with practical solutions, substantially advancing SFT performance.
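For intuition, here is a per-token sketch of the argument (our rendering of the abstract's claim, not the paper's full derivation). The SFT gradient carries an implicit $1/\pi_\theta$ weight, which acts like an ill-behaved implicit reward on low-probability tokens, and DFT cancels it with a stop-gradient rescaling:

$$
\nabla_\theta \mathcal{L}_{\text{SFT}}(y_t)
= -\,\nabla_\theta \log \pi_\theta(y_t \mid y_{<t}, x)
= -\,\frac{1}{\pi_\theta(y_t \mid y_{<t}, x)}\,\nabla_\theta\, \pi_\theta(y_t \mid y_{<t}, x)
$$

$$
\mathcal{L}_{\text{DFT}}(y_t)
= -\,\mathrm{sg}\!\left[\pi_\theta(y_t \mid y_{<t}, x)\right]\,\log \pi_\theta(y_t \mid y_{<t}, x)
\;\;\Longrightarrow\;\;
\nabla_\theta \mathcal{L}_{\text{DFT}}(y_t) = -\,\nabla_\theta\, \pi_\theta(y_t \mid y_{<t}, x)
$$

where $\mathrm{sg}[\cdot]$ denotes stop-gradient (the `.detach()` in the one-line change below), so the $1/\pi_\theta$ factor no longer appears in the update.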
DFT is a one-line change to standard SFT: scale each token’s loss by its predicted probability (detached to avoid gradient flow).
loss = loss * torch.softmax(shift_logits, dim=-1).gather(1, shift_labels.unsqueeze(-1)).squeeze(-1).detach()
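For reference, below is a minimal, self-contained sketch of the same idea in a Hugging Face-style per-token loss. It is illustrative rather than the official verl trainer; the `[batch, seq, vocab]` logit shape and the `-100` padding convention are assumptions.

```python
import torch
import torch.nn.functional as F

def dft_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Per-token cross-entropy rescaled by the (detached) predicted token probability."""
    # Shift so that tokens < t predict token t
    shift_logits = logits[:, :-1, :].contiguous().view(-1, logits.size(-1))
    shift_labels = labels[:, 1:].contiguous().view(-1)

    # Standard SFT loss, kept per-token (positions labeled -100 contribute 0)
    loss = F.cross_entropy(shift_logits, shift_labels, reduction="none", ignore_index=-100)

    # DFT: multiply each token's loss by its predicted probability, detached
    safe_labels = shift_labels.clamp_min(0)  # avoid indexing with -100
    token_prob = (
        torch.softmax(shift_logits, dim=-1)
        .gather(1, safe_labels.unsqueeze(-1))
        .squeeze(-1)
        .detach()
    )
    loss = loss * token_prob

    # Average over non-padding tokens
    mask = (shift_labels != -100).float()
    return (loss * mask).sum() / mask.sum().clamp_min(1.0)
```

Dropping the `token_prob` rescaling recovers standard SFT exactly.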
Our codebase has been tested on H100 servers with the following environment:
python 3.10.0
torch 2.6.0+cu124
git clone https://github.com/yongliang-wu/DFT.git
cd DFT
conda create -n DFT python=3.10 -y
conda activate DFT
cd verl
bash scripts/install_vllm_sglang_mcore.sh
pip install --no-deps -e .
# Generate training data (optional: change --train_end to control volume)
python examples/data_preprocess/numina_cot.py --train_end 100000
# Generate evaluation data
python examples/data_preprocess/math_dataset.py
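If you want to sanity-check the generated data before training, a quick inspection like the following should line up with the trainer configuration in the command below. The `extra_info` / `question` / `answer` field names are inferred from `data.prompt_key`, `data.prompt_dict_keys`, and `data.response_dict_keys`; adjust them if your preprocessing differs.

```python
# Illustrative sanity check of the generated training parquet
import pandas as pd

df = pd.read_parquet("data/numina_cot/train.parquet")
print(f"{len(df)} training examples")

info = df.iloc[0]["extra_info"]  # field names assumed from the trainer config
print("question:", str(info["question"])[:200])
print("answer:", str(info["answer"])[:200])
```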
nproc_per_node=8
project_name=numina-cot
experiment_name=numina-cot-dft-qwen-2.5-math-1.5b
save_path=checkpoints/$experiment_name
torchrun --standalone --nnodes=1 --nproc_per_node=$nproc_per_node \
-m verl.trainer.fsdp_dft_trainer \
data.train_files=data/numina_cot/train.parquet \
data.val_files=data/math500/test.parquet \
data.prompt_key=extra_info \
data.response_key=extra_info \
data.train_batch_size=256 \
data.max_length=2048 \
optim.lr=5e-5 \
data.prompt_dict_keys=['question'] \
data.response_dict_keys=['answer'] \
data.micro_batch_size_per_gpu=4 \
model.partial_pretrain=Qwen/Qwen2.5-Math-1.5B \
model.use_liger=True \
model.fsdp_config.model_dtype=bf16 \
trainer.default_local_dir=$save_path \
trainer.project_name=$project_name \
trainer.experiment_name="$experiment_name-$(date +%Y%m%d-%H%M%S)" \
trainer.logger=['console','tensorboard'] \
trainer.default_hdfs_dir=null \
trainer.test_freq=10 \
trainer.save_freq=50 \
trainer.total_epochs=1 \
ulysses_sequence_parallel_size=1 \
use_remove_padding=true
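Before running the full evaluation harness described below, you can optionally smoke-test a trained model with a quick generation. The checkpoint path here is a placeholder; point it at wherever your run exported a Hugging Face-compatible model.

```python
# Optional smoke test (illustrative); the checkpoint path is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "checkpoints/numina-cot-dft-qwen-2.5-math-1.5b"  # adjust to your exported checkpoint
tok = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.bfloat16, device_map="auto")

prompt = "What is 12 * 13? Please reason step by step, and put your final answer within \\boxed{}."
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```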
To evaluate the trained model, please first follow the Qwen2.5-Math repository to set up the evaluation environment.
# Select the prompt format matching your model
PROMPT_TYPE="qwen-boxed"
# PROMPT_TYPE="llama-base-boxed"
# PROMPT_TYPE="deepseek-math"
# Set available GPUs
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
# Configure sampling settings
N_SAMPLING=16
TEMPERATURE=1
# Specify model and output directories
MODEL_NAME_OR_PATH="" # e.g., checkpoints/your-model-name
OUTPUT_DIR="" # e.g., outputs/eval_results
# Run evaluation
bash sh/eval.sh $PROMPT_TYPE $MODEL_NAME_OR_PATH $OUTPUT_DIR $N_SAMPLING $TEMPERATURE
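If you aggregate the sampled generations yourself (e.g., with `N_SAMPLING=16` above), the standard unbiased pass@k estimator can be computed from per-problem counts. This helper is generic and independent of the harness, which may already report its own metrics.

```python
# Unbiased pass@k estimator (Chen et al., 2021), computed from per-problem counts:
# n = samples generated per problem, c = number of correct samples.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 16 samples, 9 correct -> pass@1 equals the per-sample accuracy 9/16
print(pass_at_k(16, 9, 1))  # 0.5625
```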
Based on our evaluations and community feedback, DFT performs strongly on tasks with non-deterministic solution trajectories (i.e., tasks that admit multiple valid reasoning paths), such as mathematical chain-of-thought (CoT) reasoning, solutions to highly complex coding problems, and multimodal reasoning with informative CoT. By contrast, its performance is weaker on tasks with a single, well-specified ground-truth answer, particularly when the associated CoT (if one exists) is highly constrained and near-deterministic (low-entropy).
If you find this paper valuable for your research or applications, we would appreciate it if you could cite our work:
@article{wu2025generalization,
title={On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification},
author={Wu, Yongliang and Zhou, Yizhou and Ziheng, Zhou and Peng, Yingzhe and Ye, Xinyu and Hu, Xinting and Zhu, Wenbo and Qi, Lu and Yang, Ming-Hsuan and Yang, Xu},
journal={arXiv preprint arXiv:2508.05629},
year={2025}
}
- https://github.com/huggingface/trl: TRL now supports DFT; see this script.
- https://github.com/hiyouga/LLaMA-Factory: LLaMA-Factory now supports DFT; see this script.
- https://github.com/modelscope/ms-swift: ms-swift now supports DFT; see this script.
- https://github.com/Lauorie/DFT: A reproduction of the DFT method without using verl.
- https://github.com/volcengine/verl: Codebase used for training.
- https://github.com/QwenLM/Qwen2.5-Math: Codebase used for evaluation.