STEP-LEVEL UNCERTAINTY-AWARE REASONING DATA SELECTION IN REINFORCEMENT LEARNING FOR LLM MULTI-STEP REASONING
Our training pipeline is adapted from verl and rllm (DeepScaleR). The installation commands we have verified to work are as follows:
conda create -y -n rlvr_train python=3.10
conda activate rlvr_train
pip install -e .
pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu121
pip install ray vllm==0.6.3
pip install flash-attn --no-build-isolation
pip install wandb matplotlib
pip install huggingface_hub

If you are using H100 nodes and see errors like `CUDA error: device kernel image is invalid`, please refer to this issue for a fix.
Our evaluation pipeline for math reasoning tasks is adapted from Qwen2.5-Math. The installation commands we have verified to work are as follows:
conda create -y -n rlvr_eval python=3.10
conda activate rlvr_eval
cd Qwen2.5-Eval/evaluation
cd latex2sympy
pip install -e .
cd ..
pip install -r requirements.txt
pip install vllm==0.5.1 --no-build-isolation
pip install transformers==4.42.3
pip install wandb matplotlib
pip install -U transformers
pip install vllm==0.6.3

To compute data-quality scores for training examples and sample subsets accordingly, run the commands below (illustrative sketches of the scoring metrics follow the list):

# our proposed method
python quality_evaluation/evaluate.py
# Random sampling
python data/sample_random.py --input data/train/rlvr_gsm8k/gsm8k_full.parquet --num 1000
# perplexity (PPL)
python quality_evaluation/compute_ppl.py
# token length
python quality_evaluation/compute_tokenlength.py
# IFD score
python quality_evaluation/compute_ifd.py
# Sample by score
python data/sample_by_score.py --input data/train/rlvr/math_full.parquet --score_file data/train/rlvr/math_full_with_scores_merged.json --field uncertainty_score --num 5 --mode top
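
For reference, here is a minimal sketch of how a per-example perplexity score can be computed with Hugging Face transformers. The model name and the plain-text scoring setup are assumptions for illustration; `quality_evaluation/compute_ppl.py` is the authoritative implementation and may differ in details.

```python
# Minimal perplexity sketch. Assumption: scoring plain question+solution text
# with a causal LM; the model choice is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-Math-1.5B"  # placeholder
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    # With labels=input_ids, the model returns the mean next-token
    # cross-entropy; exponentiating gives perplexity.
    loss = model(input_ids=ids, labels=ids).loss
    return torch.exp(loss).item()
```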
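The step-level uncertainty score itself is computed by `quality_evaluation/evaluate.py`, which is the authoritative implementation. The following is only a rough illustration of what such a score could look like, under the assumption that "uncertainty" means mean next-token entropy averaged over reasoning steps (split on newlines); it reuses `tok` and `model` from the perplexity sketch above.

```python
# Illustration only: step-level uncertainty as mean next-token entropy per
# reasoning step. The actual definition in quality_evaluation/evaluate.py may
# differ. Reuses `tok` and `model` from the perplexity sketch.
import torch
import torch.nn.functional as F

@torch.no_grad()
def step_uncertainty(question: str, solution: str) -> float:
    steps = [s for s in solution.split("\n") if s.strip()]
    text = question
    per_step = []
    for step in steps:
        text = text + "\n" + step
        ids = tok(text, return_tensors="pt").input_ids
        logits = model(input_ids=ids).logits[0]
        probs = F.softmax(logits.float(), dim=-1)
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum(-1)
        n_step = tok("\n" + step, return_tensors="pt").input_ids.shape[1]
        # Mean entropy at the positions covering this step (approximate alignment).
        per_step.append(entropy[-n_step:].mean().item())
    return sum(per_step) / max(len(per_step), 1)
```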
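The IFD (Instruction-Following Difficulty) score of Li et al. compares the model's perplexity on the answer with and without the question as context. A sketch under the same assumptions as above (again reusing `tok` and `model`), with `quality_evaluation/compute_ifd.py` as the authoritative version:

```python
# IFD sketch: IFD(Q, A) = PPL(A | Q) / PPL(A). Values near 1 mean the
# question barely helps the model predict the answer. Reuses `tok`, `model`.
import math
import torch

@torch.no_grad()
def _answer_loss(question: str, answer: str, conditioned: bool) -> float:
    a_ids = tok(answer, return_tensors="pt").input_ids
    if conditioned:
        q_ids = tok(question, return_tensors="pt").input_ids
        ids = torch.cat([q_ids, a_ids], dim=1)
        labels = ids.clone()
        labels[:, : q_ids.shape[1]] = -100  # ignore question tokens in the loss
    else:
        ids, labels = a_ids, a_ids
    return model(input_ids=ids, labels=labels).loss.item()

def ifd(question: str, answer: str) -> float:
    # Ratio of conditioned to unconditioned answer perplexity.
    l_cond = _answer_loss(question, answer, conditioned=True)
    l_free = _answer_loss(question, answer, conditioned=False)
    return math.exp(l_cond - l_free)
```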
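Selection by score is then a simple sort-and-slice. A sketch matching the flags of `data/sample_by_score.py` above, where the score file's layout (a list of records keyed by example index) is an assumption:

```python
# Sketch of score-based subset selection. Assumption: the score JSON is a list
# of records carrying the example index ("idx") and the field named by --field.
import json
import pandas as pd

def sample_by_score(input_path: str, score_file: str, field: str,
                    num: int, mode: str = "top") -> pd.DataFrame:
    df = pd.read_parquet(input_path)
    with open(score_file) as f:
        records = json.load(f)
    scores = {r["idx"]: r[field] for r in records}  # "idx" key is assumed
    df[field] = df.index.map(scores)
    df = df.sort_values(field, ascending=(mode != "top"))
    return df.head(num)  # the selected subset, ready to write back to parquet
```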
To launch RL training on the selected data, run:

conda activate rlvr_train
bash scripts/train/training_1.5b_math_full.sh

Please change data.train_files and trainer.experiment_name in the training script when trying other training subsets.
To run evaluation for our method on 6 common math reasoning benchmarks (MATH500, AIME24, AMC23, Minerva Math, OlympiadBench, and AIME25), run the following commands:
conda activate rlvr_eval
cd Qwen2.5-Eval/evaluation
bash sh/eval_one_experiment_all_ckpts.sh

Here, for AIME24, AMC23, and AIME25, we evaluate pass@8 results.
Please adjust the experiment name in Qwen2.5-Eval/evaluation/sh/eval_one_experiment_all_ckpts.sh when using other training examples.
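
For context, pass@8 numbers of this kind are typically computed with the unbiased pass@k estimator of Chen et al. (2021); a minimal sketch follows (the evaluation scripts' exact implementation may differ):

```python
# Unbiased pass@k (Chen et al., 2021): with n samples and c correct,
# pass@k = 1 - C(n - c, k) / C(n, k).
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:  # too few failures to fill a size-k draw without a success
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 16 samples per problem, 4 correct -> pass@8 ~= 0.96
```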
- Our training experiments are powered by a modified fork of rllm(DeepScaleR) and verl.
- Our evaluation experiments are based on a modified fork of Qwen2.5-Math.
- Our models are trained on top of Qwen2.5-Math-1.5B, Qwen2.5-Math-7B, Llama-3.2-3B-Instruct, and DeepSeek-R1-Distill-Qwen-1.5B.