When the reward signal is lost on hard prompts during RL (all sampled trajectories are wrong), the LLM self-generates a hint to aid sampling, improving both prompt usage and LLM performance.
- [02/08/2026] SAGE reproduction code is released! A Hugging Face collection of datasets and models is also released.
- [02/03/2026] SAGE paper is released on arXiv!
When an LLM cannot sample any correct trajectory for a hard prompt, it self-generates a hint from the prompt's reference solution. The hint is then combined with the hard prompt as input to the LLM, avoiding advantage collapse and ensuring that correct trajectories are sampled to update the policy model.
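The advantage collapse mentioned above follows directly from GRPO's group-normalized advantage: if every rollout in a group gets the same reward, all advantages are zero and the prompt contributes no gradient. A minimal sketch (`grpo_advantages` is an illustrative helper, not this repo's API):

```python
import statistics

def grpo_advantages(rewards, eps=1e-6):
    # GRPO-style group normalization: A_i = (r_i - mean) / (std + eps)
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# All-wrong group: every advantage is zero, so the prompt gives no signal.
print(grpo_advantages([0, 0, 0, 0]))  # [0.0, 0.0, 0.0, 0.0]
# One correct rollout (e.g. obtained with a hint) restores a learning signal.
print(grpo_advantages([1, 0, 0, 0]))
```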
Without a hint, some hard prompts are never used for GRPO, whereas SAGE increases the prompt usage rate by 10% for the weaker LLM. Using hard prompts during RL encourages the LLM's exploration, leading to consistently better performance.
Among all compared methods, SAGE retains the on-policy property of GRPO, exhibiting a similar entropy scale. Learning from hard prompts also promotes exploration, with response length growing steadily across various LLMs.
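Conceptually, the self-hinting loop described above can be sketched as follows. This is a minimal illustration under assumed names: `sample_fn`, `reward_fn`, and `hint_fn` are hypothetical stand-ins for the sampler, response grader, and hint generator, not this repo's actual interfaces.

```python
def sage_group(prompt, reference_solution, sample_fn, reward_fn, hint_fn,
               group_size=8):
    """Sample one GRPO group; if every rollout is wrong, self-generate a
    hint from the reference solution and resample with the hinted prompt."""
    responses = [sample_fn(prompt) for _ in range(group_size)]
    rewards = [reward_fn(r) for r in responses]
    if max(rewards) == 0:  # advantage collapse: no correct trajectory
        hint = hint_fn(prompt, reference_solution)
        hinted_prompt = f"{prompt}\nHint: {hint}"
        responses = [sample_fn(hinted_prompt) for _ in range(group_size)]
        rewards = [reward_fn(r) for r in responses]
    return responses, rewards

# Toy demo: the "model" only answers correctly when a hint is present.
sample = lambda p: "42" if "Hint:" in p else "7"
reward = lambda r: 1 if r == "42" else 0
hint = lambda p, sol: f"the answer begins with {sol[0]}"
_, rs = sage_group("What is 6*7?", "42", sample, reward, hint, group_size=4)
print(rs)  # [1, 1, 1, 1]
```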
Our code is based on verl. If you already have a verl environment, you can reuse it and install the extra packages when prompted.
- Create a new environment

  ```bash
  python -m venv ~/.python/sage
  source ~/.python/sage/bin/activate
  # Or use conda
  # conda create -n sage python==3.10
  # conda activate sage
  ```
- Install dependencies

  ```bash
  pip install --upgrade pip
  pip install uv
  python -m uv pip install torch==2.8.0 --index-url https://download.pytorch.org/whl/cu128
  python -m uv pip install -U pip setuptools wheel packaging psutil
  python -m uv pip install flash-attn==2.8.0.post2 --no-build-isolation
  git clone https://github.com/BaohaoLiao/SAGE.git
  cd ./SAGE
  python -m uv pip install -r requirements.txt
  python -m uv pip install -e .
  python -m uv pip install vllm==0.10.1
  ```
- Prepare the training set

  ```bash
  bash scripts/prepare_data.sh
  ```
- Train with SAGE / SAGE-light. The key code is located in `recipe/hint`.

  ```bash
  bash scripts/run_sage.sh
  ```
- Baselines (optional)
  - GRPO:

    ```bash
    bash scripts/run_grpo.sh
    ```

  - LUFFY: We use LUFFY's open-sourced code. The training set is already preprocessed to LUFFY's style.
  - SFT: We use LUFFY's open-sourced code for SFT. The training set is already preprocessed to LUFFY's style.
  - Scaf-GRPO: We use Scaf-GRPO's open-sourced code. The training set is already preprocessed to Scaf-GRPO's style.
| Model name | Link |
|---|---|
| SAGE_Llama-3.2-3B-Instruct | https://huggingface.co/baohao/SAGE_Llama-3.2-3B-Instruct |
| SAGE-light_Llama-3.2-3B-Instruct | https://huggingface.co/baohao/SAGE-light_Llama-3.2-3B-Instruct |
| SAGE_Qwen2.5-7B-Instruct | https://huggingface.co/baohao/SAGE_Qwen2.5-7B-Instruct |
| SAGE-light_Qwen2.5-7B-Instruct | https://huggingface.co/baohao/SAGE-light_Qwen2.5-7B-Instruct |
| SAGE_Qwen3-4B-Instruct-2507 | https://huggingface.co/baohao/SAGE_Qwen3-4B-Instruct-2507 |
| SAGE-light_Qwen3-4B-Instruct-2507 | https://huggingface.co/baohao/SAGE-light_Qwen3-4B-Instruct-2507 |
```bash
bash scripts/eval.sh
```

If you find SAGE useful, please cite it as:
```bibtex
@misc{liao2026selfhintinglanguagemodelsenhance,
      title={Self-Hinting Language Models Enhance Reinforcement Learning},
      author={Baohao Liao and Hanze Dong and Xinxing Xu and Christof Monz and Jiang Bian},
      year={2026},
      eprint={2602.03143},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2602.03143},
}
```

Our code is based on verl for training, vllm for sampling, and oat for response grading. We really appreciate their contributions to the RL community.



