STEP-LEVEL UNCERTAINTY-AWARE REASONING DATA SELECTION IN REINFORCEMENT LEARNING FOR LLM MULTI-STEP REASONING
Our training pipeline is adapted from verl and rllm (DeepScaleR). The installation commands we have verified to work are as follows:
conda create -y -n rlvr_train python=3.10
conda activate rlvr_train
pip install -e .
pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu121
pip install ray vllm==0.6.3
pip install flash-attn --no-build-isolation
pip install wandb matplotlib
pip install huggingface_hub

If you are using H100 nodes and see errors like `CUDA error: device kernel image is invalid`, please refer to this issue for a fix.
Our evaluation pipeline for math reasoning tasks is adapted from Qwen2.5-Math. The installation commands we have verified to work are as follows:
conda create -y -n rlvr_eval python=3.10
conda activate rlvr_eval
cd Qwen2.5-Eval/evaluation
cd latex2sympy
pip install -e .
cd ..
pip install -r requirements.txt
pip install vllm==0.5.1 --no-build-isolation
pip install transformers==4.42.3
pip install wandb matplotlib
pip install -U transformers
pip install vllm==0.6.3

To compute data-quality scores for training examples and sample subsets accordingly, run the commands below (illustrative sketches of the scoring metrics follow the list):

# our proposed method
python quality_evaluation/evaluate.py
# Random sampling
python data/sample_random.py --input data/train/rlvr_gsm8k/gsm8k_full.parquet --num 1000
# perplexity (PPL)
python quality_evaluation/compute_ppl.py
# token length
python quality_evaluation/compute_tokenlength.py
# IFD score
python quality_evaluation/compute_ifd.py
# Sample by score
python data/sample_by_score.py --input data/train/rlvr/math_full.parquet --score_file data/train/rlvr/math_full_with_scores_merged.json --field uncertainty_score --num 5 --mode top
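
For reference, here is a minimal sketch of how a per-example perplexity score can be computed with Hugging Face transformers. The model name and the plain-text scoring setup are assumptions for illustration; `quality_evaluation/compute_ppl.py` is the authoritative implementation and may differ in details.

```python
# Minimal perplexity sketch. Assumption: scoring plain question+solution text
# with a causal LM; the model choice is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-Math-1.5B"  # placeholder
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    # With labels=input_ids, the model returns the mean next-token
    # cross-entropy; exponentiating gives perplexity.
    loss = model(input_ids=ids, labels=ids).loss
    return torch.exp(loss).item()
```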
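The step-level uncertainty score itself is computed by `quality_evaluation/evaluate.py`, which is the authoritative implementation. The following is only a rough illustration of what such a score could look like, under the assumption that "uncertainty" means mean next-token entropy averaged over reasoning steps (split on newlines); it reuses `tok` and `model` from the perplexity sketch above.

```python
# Illustration only: step-level uncertainty as mean next-token entropy per
# reasoning step. The actual definition in quality_evaluation/evaluate.py may
# differ. Reuses `tok` and `model` from the perplexity sketch.
import torch
import torch.nn.functional as F

@torch.no_grad()
def step_uncertainty(question: str, solution: str) -> float:
    steps = [s for s in solution.split("\n") if s.strip()]
    text = question
    per_step = []
    for step in steps:
        text = text + "\n" + step
        ids = tok(text, return_tensors="pt").input_ids
        logits = model(input_ids=ids).logits[0]
        probs = F.softmax(logits.float(), dim=-1)
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum(-1)
        n_step = tok("\n" + step, return_tensors="pt").input_ids.shape[1]
        # Mean entropy at the positions covering this step (approximate alignment).
        per_step.append(entropy[-n_step:].mean().item())
    return sum(per_step) / max(len(per_step), 1)
```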
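The IFD (Instruction-Following Difficulty) score of Li et al. compares the model's perplexity on the answer with and without the question as context. A sketch under the same assumptions as above (again reusing `tok` and `model`), with `quality_evaluation/compute_ifd.py` as the authoritative version:

```python
# IFD sketch: IFD(Q, A) = PPL(A | Q) / PPL(A). Values near 1 mean the
# question barely helps the model predict the answer. Reuses `tok`, `model`.
import math
import torch

@torch.no_grad()
def _answer_loss(question: str, answer: str, conditioned: bool) -> float:
    a_ids = tok(answer, return_tensors="pt").input_ids
    if conditioned:
        q_ids = tok(question, return_tensors="pt").input_ids
        ids = torch.cat([q_ids, a_ids], dim=1)
        labels = ids.clone()
        labels[:, : q_ids.shape[1]] = -100  # ignore question tokens in the loss
    else:
        ids, labels = a_ids, a_ids
    return model(input_ids=ids, labels=labels).loss.item()

def ifd(question: str, answer: str) -> float:
    # Ratio of conditioned to unconditioned answer perplexity.
    l_cond = _answer_loss(question, answer, conditioned=True)
    l_free = _answer_loss(question, answer, conditioned=False)
    return math.exp(l_cond - l_free)
```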
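Selection by score is then a simple sort-and-slice. A sketch matching the flags of `data/sample_by_score.py` above, where the score file's layout (a list of records keyed by example index) is an assumption:

```python
# Sketch of score-based subset selection. Assumption: the score JSON is a list
# of records carrying the example index ("idx") and the field named by --field.
import json
import pandas as pd

def sample_by_score(input_path: str, score_file: str, field: str,
                    num: int, mode: str = "top") -> pd.DataFrame:
    df = pd.read_parquet(input_path)
    with open(score_file) as f:
        records = json.load(f)
    scores = {r["idx"]: r[field] for r in records}  # "idx" key is assumed
    df[field] = df.index.map(scores)
    df = df.sort_values(field, ascending=(mode != "top"))
    return df.head(num)  # the selected subset, ready to write back to parquet
```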
To launch RL training on the selected data, run:

conda activate rlvr_train
bash scripts/train/training_1.5b_math_full.sh

Please change data.train_files and trainer.experiment_name in the training script when trying other training subsets.
To run evaluation for our method on 6 common math reasoning benchmarks (MATH500, AIME24, AMC23, Minerva Math, OlympiadBench, and AIME25), run the following commands:
conda activate rlvr_eval
cd Qwen2.5-Eval/evaluation
bash sh/eval_one_experiment_all_ckpts.sh

Here, for AIME24, AMC23, and AIME25, we evaluate pass@8 results.
Please adjust the experiment name in Qwen2.5-Eval/evaluation/sh/eval_one_experiment_all_ckpts.sh when using other training examples.
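
For context, pass@8 numbers of this kind are typically computed with the unbiased pass@k estimator of Chen et al. (2021); a minimal sketch follows (the evaluation scripts' exact implementation may differ):

```python
# Unbiased pass@k (Chen et al., 2021): with n samples and c correct,
# pass@k = 1 - C(n - c, k) / C(n, k).
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:  # too few failures to fill a size-k draw without a success
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 16 samples per problem, 4 correct -> pass@8 ~= 0.96
```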
- Our training experiments are powered by a modified fork of rllm(DeepScaleR) and verl.
- Our evaluation experiments are based on a modified fork of Qwen2.5-Math.
- Our models are trained on top of Qwen2.5-Math-1.5B, Qwen2.5-Math-7B, Llama-3.2-3B-Instruct, and DeepSeek-R1-Distill-Qwen-1.5B.