When the reward signal is lost on hard prompts during RL (all sampled trajectories are wrong), the LLM self-generates a hint to aid sampling, improving both prompt usage and LLM performance.
- [02/08/2026] SAGE reproduction code is released! A Hugging Face collection of datasets and models is also released.
- [02/03/2026] SAGE paper is released on arXiv!
When an LLM cannot sample any correct trajectory for a hard prompt, it self-generates a hint from the prompt's reference solution. The hint is then combined with the hard prompt as input to the LLM, avoiding advantage collapse and ensuring that correct trajectories are sampled to update the policy model.
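The advantage collapse mentioned above follows directly from GRPO's group-normalized advantage: if every rollout in a group gets the same reward, all advantages are zero and the prompt contributes no gradient. A minimal sketch (`grpo_advantages` is an illustrative helper, not this repo's API):

```python
import statistics

def grpo_advantages(rewards, eps=1e-6):
    # GRPO-style group normalization: A_i = (r_i - mean) / (std + eps)
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# All-wrong group: every advantage is zero, so the prompt gives no signal.
print(grpo_advantages([0, 0, 0, 0]))  # [0.0, 0.0, 0.0, 0.0]
# One correct rollout (e.g. obtained with a hint) restores a learning signal.
print(grpo_advantages([1, 0, 0, 0]))
```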
Without a hint, some hard prompts are never used for GRPO, whereas SAGE increases the prompt usage rate by 10% for the weaker LLM. Using hard prompts during RL encourages the LLM's exploration, leading to consistently better performance.
Among all compared methods, SAGE retains the on-policy property of GRPO, exhibiting a similar entropy scale. Learning from hard prompts also promotes exploration, with response length growing steadily across various LLMs.
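Conceptually, the self-hinting loop described above can be sketched as follows. This is a minimal illustration under assumed names: `sample_fn`, `reward_fn`, and `hint_fn` are hypothetical stand-ins for the sampler, response grader, and hint generator, not this repo's actual interfaces.

```python
def sage_group(prompt, reference_solution, sample_fn, reward_fn, hint_fn,
               group_size=8):
    """Sample one GRPO group; if every rollout is wrong, self-generate a
    hint from the reference solution and resample with the hinted prompt."""
    responses = [sample_fn(prompt) for _ in range(group_size)]
    rewards = [reward_fn(r) for r in responses]
    if max(rewards) == 0:  # advantage collapse: no correct trajectory
        hint = hint_fn(prompt, reference_solution)
        hinted_prompt = f"{prompt}\nHint: {hint}"
        responses = [sample_fn(hinted_prompt) for _ in range(group_size)]
        rewards = [reward_fn(r) for r in responses]
    return responses, rewards

# Toy demo: the "model" only answers correctly when a hint is present.
sample = lambda p: "42" if "Hint:" in p else "7"
reward = lambda r: 1 if r == "42" else 0
hint = lambda p, sol: f"the answer begins with {sol[0]}"
_, rs = sage_group("What is 6*7?", "42", sample, reward, hint, group_size=4)
print(rs)  # [1, 1, 1, 1]
```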
Our code is based on verl. If you already have a verl environment, you can reuse it and install the extra packages when prompted.
- Create a new environment

  ```bash
  python -m venv ~/.python/sage
  source ~/.python/sage/bin/activate
  # Or use conda
  # conda create -n sage python==3.10
  # conda activate sage
  ```
- Install dependencies

  ```bash
  pip install --upgrade pip
  pip install uv
  python -m uv pip install torch==2.8.0 --index-url https://download.pytorch.org/whl/cu128
  python -m uv pip install -U pip setuptools wheel packaging psutil
  python -m uv pip install flash-attn==2.8.0.post2 --no-build-isolation
  git clone https://github.com/BaohaoLiao/SAGE.git
  cd ./SAGE
  python -m uv pip install -r requirements.txt
  python -m uv pip install -e .
  python -m uv pip install vllm==0.10.1
  ```
- Prepare the training set

  ```bash
  bash scripts/prepare_data.sh
  ```
- Train with SAGE / SAGE-light. The key code is located in `recipe/hint`.

  ```bash
  bash scripts/run_sage.sh
  ```
- Baselines (optional)
  - GRPO:

    ```bash
    bash scripts/run_grpo.sh
    ```

  - LUFFY: We use LUFFY's open-sourced code. The training set is already preprocessed to LUFFY's style.
  - SFT: We use LUFFY's open-sourced code for SFT. The training set is already preprocessed to LUFFY's style.
  - Scaf-GRPO: We use Scaf-GRPO's open-sourced code. The training set is already preprocessed to Scaf-GRPO's style.
| Model name | Link |
|---|---|
| SAGE_Llama-3.2-3B-Instruct | https://huggingface.co/baohao/SAGE_Llama-3.2-3B-Instruct |
| SAGE-light_Llama-3.2-3B-Instruct | https://huggingface.co/baohao/SAGE-light_Llama-3.2-3B-Instruct |
| SAGE_Qwen2.5-7B-Instruct | https://huggingface.co/baohao/SAGE_Qwen2.5-7B-Instruct |
| SAGE-light_Qwen2.5-7B-Instruct | https://huggingface.co/baohao/SAGE-light_Qwen2.5-7B-Instruct |
| SAGE_Qwen3-4B-Instruct-2507 | https://huggingface.co/baohao/SAGE_Qwen3-4B-Instruct-2507 |
| SAGE-light_Qwen3-4B-Instruct-2507 | https://huggingface.co/baohao/SAGE-light_Qwen3-4B-Instruct-2507 |
```bash
bash scripts/eval.sh
```

If you find SAGE useful, please cite it as:
```bibtex
@misc{liao2026selfhintinglanguagemodelsenhance,
      title={Self-Hinting Language Models Enhance Reinforcement Learning},
      author={Baohao Liao and Hanze Dong and Xinxing Xu and Christof Monz and Jiang Bian},
      year={2026},
      eprint={2602.03143},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2602.03143},
}
```

Our code is based on verl for training, vllm for sampling, and oat for response grading. We really appreciate their contributions to the RL community.



