Skip to content

Official repository for the paper STEP-LEVEL UNCERTAINTY-AWARE REASONING DATA SELECTION IN REINFORCEMENT LEARNING FOR LLM MULTI-STEP REASONING

Notifications You must be signed in to change notification settings

clsr1008/Star-DS

Repository files navigation

STEP-LEVEL UNCERTAINTY-AWARE REASONING DATA SELECTION IN REINFORCEMENT LEARNING FOR LLM MULTI-STEP REASONING

Setup

Train Enviroment

Our training pipeline is adapted from verl and rllm(DeepScaleR). The installation commands that we verified as viable are as follows:

conda create -y -n rlvr_train python=3.10
conda activate rlvr_train
pip install -e .
pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu121
pip install ray vllm==0.6.3
pip install flash-attn --no-build-isolation
pip install wandb matplotlib
pip install huggingface_hub

If you are using H100 nodes and see errors like CUDA error: device kernel image is invalid, please refer to this issue for fixing the problem.

Eval Enviroment

Our evaluation pipeline for math reasoning tasks is adapted from Qwen2.5-Math. The installation commands that we verified as viable are as follows:

conda create -y -n rlvr_eval python=3.10
conda activate rlvr_eval
cd Qwen2.5-Eval/evaluation
cd latex2sympy
pip install -e .
cd ..
pip install -r requirements.txt 
pip install vllm==0.5.1 --no-build-isolation
pip install transformers==4.42.3
pip install wandb matplotlib
pip install -U transformers
pip install vllm==0.6.3

Data Quality Evaluation

# our proposed method
python quality_evaluation/evaluate.py

# Random sampling
python data/sample_random.py --input data/train/rlvr_gsm8k/gsm8k_full.parquet --num 1000

# ppl perplexity
python quality_evaluation/compute_ppl.py

# token length
python quality_evaluation/compute_tokenlength.py

# IFD score
python quality_evaluation/compute_ifd.py

# Sample by score
python data/sample_by_score.py --input data/train/rlvr/math_full.parquet --score_file data/train/rlvr/math_full_with_scores_merged.json --field uncertainty_score --num 5 --mode top

Training Models

conda activate rlvr_train
bash scripts/train/training_1.5b_math_full.sh

Please change data.train_files and trainer.experiment_name in the training script when trying other training subsets.

Evaluation

Eval Scripts for Qwen Models

To run evaluation for our method on 6 common math reasoning benchmarks (MATH500, AIME24, AMC23, Minerva Math, OlympiadBench, AIME25), we can follow the commands:

conda activate rlvr_eval
cd Qwen2.5-Eval/evaluation
bash sh/eval_one_experiment_all_ckpts.sh

Here for AIME24, AMC23, and AIME25, we evaluate the pass@8 results. Please adjust the experiment name in Qwen2.5-Eval/evaluation/sh/eval_one_experiment_all_ckpts.sh when using other training examples.

Acknowledgements

About

Official repository for the paper STEP-LEVEL UNCERTAINTY-AWARE REASONING DATA SELECTION IN REINFORCEMENT LEARNING FOR LLM MULTI-STEP REASONING

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published