Self-Hinting Language Models Enhance Reinforcement Learning

When the reward signal is lost for hard prompts during RL (all sampled trajectories are wrong), the LLM self-generates a hint to guide sampling, improving both prompt usage and LLM performance.


🔥 News

  • [02/08/2026] SAGE reproduction code is released! A Hugging Face collection of datasets and models is also available.
  • [02/03/2026] The SAGE paper is released on arXiv!

🌟 Overview


When the LLM cannot sample any correct trajectory for a hard prompt, it self-generates a hint from the prompt's reference solution. The hint is then fed to the LLM together with the hard prompt, avoiding advantage collapse and ensuring that correct trajectories are sampled to update the policy model.
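The loop above can be sketched in a few lines of Python. This is a minimal, hypothetical sketch, not the repository's actual API (the real implementation lives in `recipe/hint`); the helper names `verify`, `make_hint`, and the plain-text `Hint:` format are illustrative assumptions:

```python
def grpo_advantages(rewards):
    # Group-relative advantage: each reward minus the group mean.
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

def sample_group(llm, prompt, n=4):
    # Sample a GRPO group of n trajectories for one prompt.
    return [llm(prompt) for _ in range(n)]

def sage_rollout(llm, prompt, reference_solution, make_hint, verify, n=4):
    """Sample a group; if every trajectory is wrong (advantage collapse),
    self-generate a hint from the reference solution and resample."""
    responses = sample_group(llm, prompt, n)
    rewards = [verify(r) for r in responses]
    if max(rewards) == 0:  # no correct trajectory -> all advantages would be 0
        hint = make_hint(reference_solution)
        responses = sample_group(llm, prompt + "\nHint: " + hint, n)
        rewards = [verify(r) for r in responses]
    return responses, grpo_advantages(rewards)
```

Note the design point this captures: the hinted resampling only triggers when the unhinted group is uniformly wrong, so easy prompts still follow the standard on-policy GRPO path.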

Without a hint, some hard prompts are never used for GRPO; SAGE increases the prompt usage rate by 10% for the weaker LLM.
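The "prompt usage rate" can be made concrete: under GRPO, a prompt only contributes a gradient when its sampled group contains a mix of rewards, since identical rewards (all wrong, or all right) give every trajectory an advantage of zero. A hypothetical helper, not taken from the repository:

```python
def prompt_usage_rate(reward_groups):
    """Fraction of prompts whose sampled group yields a non-zero GRPO signal.

    A group with identical rewards gives every trajectory an advantage of
    zero, so that prompt is effectively unused for the policy update.
    """
    used = sum(1 for rewards in reward_groups if len(set(rewards)) > 1)
    return used / len(reward_groups)
```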

Using hard prompts during RL encourages the LLM's exploration, leading to consistently better performance.


Among all methods, SAGE retains the on-policy property of GRPO, with a similar entropy scale. Learning from hard prompts promotes exploration, with response length growing steadily across various LLMs.

📦 Installation

Our code is based on verl. If you already have a verl environment, you can reuse it and install any extra packages when prompted.

1. Create a new environment

   ```shell
   python -m venv ~/.python/sage
   source ~/.python/sage/bin/activate

   # Or use conda
   # conda create -n sage python==3.10
   # conda activate sage
   ```

2. Install dependencies

   ```shell
   pip install --upgrade pip
   pip install uv

   python -m uv pip install torch==2.8.0 --index-url https://download.pytorch.org/whl/cu128
   python -m uv pip install -U pip setuptools wheel packaging psutil
   python -m uv pip install flash-attn==2.8.0.post2 --no-build-isolation

   git clone https://github.com/BaohaoLiao/SAGE.git
   cd ./SAGE
   python -m uv pip install -r requirements.txt
   python -m uv pip install -e .
   python -m uv pip install vllm==0.10.1
   ```

⚡ Training

1. Prepare the training set

   ```shell
   bash scripts/prepare_data.sh
   ```

2. Train with SAGE / SAGE-light. The key code is located in `recipe/hint`.

   ```shell
   bash scripts/run_sage.sh
   ```

3. Baselines (optional)

🤗 Trained Models

| Model name | Link |
| --- | --- |
| SAGE_Llama-3.2-3B-Instruct | https://huggingface.co/baohao/SAGE_Llama-3.2-3B-Instruct |
| SAGE-light_Llama-3.2-3B-Instruct | https://huggingface.co/baohao/SAGE-light_Llama-3.2-3B-Instruct |
| SAGE_Qwen2.5-7B-Instruct | https://huggingface.co/baohao/SAGE_Qwen2.5-7B-Instruct |
| SAGE-light_Qwen2.5-7B-Instruct | https://huggingface.co/baohao/SAGE-light_Qwen2.5-7B-Instruct |
| SAGE_Qwen3-4B-Instruct-2507 | https://huggingface.co/baohao/SAGE_Qwen3-4B-Instruct-2507 |
| SAGE-light_Qwen3-4B-Instruct-2507 | https://huggingface.co/baohao/SAGE-light_Qwen3-4B-Instruct-2507 |

🎓 Evaluation

```shell
bash scripts/eval.sh
```

📝 Citation

If you find SAGE useful, please cite as:

```bibtex
@misc{liao2026selfhintinglanguagemodelsenhance,
      title={Self-Hinting Language Models Enhance Reinforcement Learning},
      author={Baohao Liao and Hanze Dong and Xinxing Xu and Christof Monz and Jiang Bian},
      year={2026},
      eprint={2602.03143},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2602.03143},
}
```

🙏 Acknowledgments

Our code is based on verl for training, vllm for sampling, and oat for response grading. We really appreciate their contributions to the RL community.
