Public code release for the paper "When Does Verification Pay Off?: A Closer Look at LLMs as Solution Verifiers"

When Does Verification Pay Off? A Closer Look at LLMs as Solution Verifiers

Overview

This codebase provides a framework for studying how the end performance of a solver-verifier system depends on factors such as model family, model size, post-training, and task type. The framework supports:

  • Both Real-World and Synthetic Tasks: Real-world tasks include GSM8K, AIME, MMLU, CommonsenseQA, and GPQA. Synthetic tasks include SAT, Sudoku, and Matrix Multiplication (with generation scripts in src/generate_problems). See Supported Datasets.
  • Large Suite of Open-Source Models: Post-trained models from the Llama3, Qwen2.5, Qwen3, and DeepSeek-R1 families; base models from the Llama3, Qwen2.5, and Qwen3 families. See Supported Models.
  • Solver and Verifier Evaluation Metrics: Solver accuracy, plus verifier accuracy, TPR, FPR, FNR, precision, recall, F1, gain, and more. See src/inference.py.
  • Rejection Sampling: The solver can iteratively re-solve problems that fail test-time verification. See src/rejection_sampling.py.
  • Embedding Analysis: Compute pairwise similarity between model outputs across different LLMs. See src/compute_embedding.py.
  • Automatic Caching: Efficiently reuse solver outputs when experimenting with different verifier configurations. See src/solver_cache.py.
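As an illustration of the verifier metrics listed above, here is a minimal sketch of computing them from binary labels and predictions (a hypothetical helper, not the repo's actual implementation in src/inference.py):

```python
def verifier_metrics(y_true, y_pred):
    """Compute confusion-matrix metrics for a binary verifier.

    y_true: 1 if the solution is actually correct, else 0.
    y_pred: 1 if the verifier accepted the solution, else 0.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    tpr = tp / (tp + fn) if tp + fn else 0.0  # also the verifier's recall
    fpr = fp / (fp + tn) if fp + tn else 0.0
    fnr = fn / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * tpr / (precision + tpr) if precision + tpr else 0.0
    accuracy = (tp + tn) / len(y_true)
    return {"accuracy": accuracy, "tpr": tpr, "fpr": fpr, "fnr": fnr,
            "precision": precision, "recall": tpr, "f1": f1}
```
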

Installation

# Create Python 3.11 environment
uv venv --python 3.11
source .venv/bin/activate

# Install PyTorch with CUDA 11.8 support (adjust for your own CUDA version)
uv pip install torch==2.5.1 torchvision==0.20.1 --index-url https://download.pytorch.org/whl/cu118

# Install dependencies
uv pip install -r requirements.txt

Scripts and Example Commands

1. Main Experiment Pipeline (src/inference.py)

The primary script for running solver-verifier evaluations. Generates solutions with a solver model, then verifies them with a verifier model, computing solver and verifier metrics along the way. All sampling automatically supports multi-GPU inference (thanks, vLLM!).

python src/inference.py \
    --solver_model_name Qwen/Qwen3-0.6B \
    --verifier_model_name Qwen/Qwen3-0.6B \
    --dataset_name gsm \
    --output_dir result

Notable arguments:

| Argument | Type | Default | Description |
| --- | --- | --- | --- |
| `--solver_model_name` | str | required | HuggingFace model ID for solving |
| `--verifier_model_name` | str | required | HuggingFace model ID for verification |
| `--dataset_name` | str | required | Dataset to use (see Supported Datasets) |
| `--dataset_subset_ratio` | float | 1.0 | Fraction of the dataset to use |
| `--data_generation_kwargs` | str | `""` | Parameters for procedurally generated datasets (comma-separated `key=value` pairs) |
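The `--data_generation_kwargs` format (comma-separated `key=value` pairs) can be parsed with a small helper along these lines (a hypothetical sketch, not the repo's actual parser):

```python
def parse_generation_kwargs(spec: str) -> dict:
    """Parse a comma-separated 'key=value' string into a kwargs dict.

    Values are coerced to int when possible, otherwise kept as strings.
    An empty spec yields an empty dict.
    """
    kwargs = {}
    if not spec:
        return kwargs
    for pair in spec.split(","):
        key, _, value = pair.partition("=")
        try:
            kwargs[key.strip()] = int(value)
        except ValueError:
            kwargs[key.strip()] = value.strip()
    return kwargs
```

For example, `parse_generation_kwargs("sat_type=2,num_samples=100,seed=42")` yields `{"sat_type": 2, "num_samples": 100, "seed": 42}`.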

2. Rejection Sampling (src/rejection_sampling.py)

Implements iterative rejection sampling: problems that fail verification are re-solved in subsequent rounds. The script tracks metrics per iteration, including solver accuracy at each round, number of newly accepted problems per round, and cumulative compute (GFLOPs).

python src/rejection_sampling.py \
    --solver_model_name Qwen/Qwen3-0.6B \
    --verifier_model_name Qwen/Qwen3-0.6B \
    --dataset_name gsm \
    --max_attempts 5 \
    --output_dir result

Notable Arguments:

| Argument | Type | Default | Description |
| --- | --- | --- | --- |
| `--max_attempts` | int | 5 | Maximum solver attempts per problem |
| `--oracle_verifier` | flag | - | Use ground truth for verification instead of an LLM |
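The iterative loop can be sketched as follows, with `solve` and `verify` as hypothetical stand-ins for the actual LLM calls:

```python
def rejection_sample(problems, solve, verify, max_attempts=5):
    """Iterative rejection sampling: re-solve problems that fail verification.

    `solve(problem)` returns a candidate solution; `verify(problem, solution)`
    returns True if the verifier accepts it. Unaccepted problems carry over
    to the next round, up to `max_attempts` rounds.
    """
    accepted = {}                      # problem -> accepted solution
    pending = list(problems)
    for attempt in range(max_attempts):
        still_pending = []
        for problem in pending:
            solution = solve(problem)
            if verify(problem, solution):
                accepted[problem] = solution
            else:
                still_pending.append(problem)
        pending = still_pending
        if not pending:
            break
    return accepted, pending           # accepted solutions, unsolved problems
```

The real script additionally logs per-round metrics (solver accuracy, newly accepted problems, cumulative GFLOPs), which this sketch omits.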

3. Embedding Similarity Analysis (src/compute_embedding.py)

Computes pairwise semantic similarity between solver outputs from different models using solution embeddings.

python src/compute_embedding.py \
    --model_names Qwen/Qwen3-0.6B Qwen/Qwen3-1.7B Qwen/Qwen3-4B \
    --dataset_name gsm \
    --output_dir utils/similarity_maps

Note: This script requires cached solver outputs for all specified models. Run inference.py (optionally with --no_verify) first to populate the cache. See Solver Cache System for details.
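The pairwise comparison can be sketched as mean per-problem cosine similarity between models' embedding sets (a minimal illustration; the repo's actual embedding model and aggregation may differ):

```python
from itertools import combinations
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def pairwise_model_similarity(embeddings):
    """Mean pairwise cosine similarity between models' solution embeddings.

    `embeddings` maps model name -> list of per-problem embedding vectors,
    where vector i embeds that model's solution to problem i.
    """
    sims = {}
    for a, b in combinations(sorted(embeddings), 2):
        per_problem = [cosine(u, v) for u, v in zip(embeddings[a], embeddings[b])]
        sims[(a, b)] = sum(per_problem) / len(per_problem)
    return sims
```
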


Supported Models

See Table 1 in our paper for all supported models. To use a model not in this list, add it to MODEL_SIZES in src/inference.py.


Supported Datasets

Real-World Datasets

| Dataset Name | Description | Source |
| --- | --- | --- |
| `gsm` | GSM8K math word problems | openai/gsm8k |
| `aime` | AIME competition math (1983-2025) | TianHongZXY/aime-1983-2025 |
| `mmlu_[SPLIT]` | All MMLU subjects (57 categories); SPLIT = all, stem, social_sciences, humanities, other | cais/mmlu |
| `csqa` | CommonsenseQA | tau/commonsense_qa |
| `gpqa` | GPQA (graduate-level science) | Idavidrein/gpqa |

Synthetic Datasets

These datasets are generated on the fly in inference.py, with parameters configurable via --data_generation_kwargs. In the paper, we use the following generation parameters, but feel free to play around with them to see how they affect difficulty!

2SAT/3SAT

--dataset_name sat \
--data_generation_kwargs "sat_type=2,num_samples=100,min_vars=2,max_vars=8,min_clauses=2,max_clauses=8,seed=42"
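Under those parameters, instance generation can be sketched as follows (a hypothetical illustration, not the actual generator in src/generate_problems):

```python
import random

def generate_sat(sat_type=2, num_samples=100, min_vars=2, max_vars=8,
                 min_clauses=2, max_clauses=8, seed=42):
    """Generate random k-SAT instances in a simple clause-list encoding.

    Each clause is a tuple of `sat_type` signed literals: +i means variable
    x_i, -i means its negation. Variables are numbered from 1.
    """
    rng = random.Random(seed)
    instances = []
    for _ in range(num_samples):
        num_vars = rng.randint(min_vars, max_vars)
        num_clauses = rng.randint(min_clauses, max_clauses)
        clauses = []
        for _ in range(num_clauses):
            # Sample distinct variables, then flip a coin for each polarity.
            variables = rng.sample(range(1, num_vars + 1), min(sat_type, num_vars))
            clauses.append(tuple(v if rng.random() < 0.5 else -v for v in variables))
        instances.append({"num_vars": num_vars, "clauses": clauses})
    return instances
```

Fixing `seed` makes generation reproducible across runs.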

Sudoku

--dataset_name sudoku \
--data_generation_kwargs "size=4,num_samples=100,min_empty=4,max_empty=8,seed=42"
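One simple way to generate such puzzles is to blank cells of a solved grid, sketched below for the 4x4 case (hypothetical; the actual generator in src/generate_problems may differ):

```python
import random

def generate_sudoku(size=4, num_samples=100, min_empty=4, max_empty=8, seed=42):
    """Generate 4x4 Sudoku puzzles by blanking cells of a relabeled solved grid.

    Starts from a fixed valid 4x4 solution, relabels digits with a random
    permutation (which preserves validity), then blanks between min_empty and
    max_empty cells (0 denotes an empty cell). Only size=4 is sketched here.
    """
    assert size == 4, "this sketch only handles 4x4 grids"
    base = [[1, 2, 3, 4],
            [3, 4, 1, 2],
            [2, 1, 4, 3],
            [4, 3, 2, 1]]          # a valid solved 4x4 grid
    rng = random.Random(seed)
    puzzles = []
    for _ in range(num_samples):
        digits = [1, 2, 3, 4]
        rng.shuffle(digits)
        relabel = {i + 1: digits[i] for i in range(4)}
        solution = [[relabel[v] for v in row] for row in base]
        puzzle = [row[:] for row in solution]
        for r, c in rng.sample([(r, c) for r in range(4) for c in range(4)],
                               rng.randint(min_empty, max_empty)):
            puzzle[r][c] = 0
        puzzles.append({"puzzle": puzzle, "solution": solution})
    return puzzles
```
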

Matrix Multiplication

--dataset_name matmul \
--data_generation_kwargs "size=5,num_samples=100,min_val=-5,max_val=5,seed=42"
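Matrix multiplication problems pair random integer matrices with their exact product as the ground-truth answer; a sketch under those parameters (hypothetical, not the actual generator):

```python
import random

def generate_matmul(size=5, num_samples=100, min_val=-5, max_val=5, seed=42):
    """Generate matrix multiplication problems with integer entries.

    Each sample holds two size x size matrices A and B plus their product,
    computed here in pure Python as the ground-truth answer.
    """
    rng = random.Random(seed)

    def rand_matrix():
        return [[rng.randint(min_val, max_val) for _ in range(size)]
                for _ in range(size)]

    problems = []
    for _ in range(num_samples):
        a, b = rand_matrix(), rand_matrix()
        product = [[sum(a[i][k] * b[k][j] for k in range(size))
                    for j in range(size)] for i in range(size)]
        problems.append({"A": a, "B": b, "answer": product})
    return problems
```
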

Solver Cache System

Solver outputs are automatically cached based on a SHA256 hash of all parameters that affect the output. This enables fast iteration when testing different verifier configurations for the same solver. The caching system is automatic, deterministic (given the same hardware configuration), and robust. Use --no_load_solver_cache to force re-running the solver, or --no_save_solver_cache to skip saving outputs to the cache.
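The cache-key idea can be sketched as hashing a canonical serialization of every output-affecting parameter (hypothetical field names; see src/solver_cache.py for the real implementation):

```python
import hashlib
import json

def solver_cache_key(params: dict) -> str:
    """Derive a deterministic cache key from solver parameters.

    Serializing with sorted keys makes the key independent of dict ordering,
    so the same configuration always maps to the same SHA256 digest.
    """
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Any change to a parameter (model, dataset, sampling settings, etc.) changes the digest and therefore triggers a fresh solver run.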

Workflow for efficient experimentation:

# Step 1: Populate cache with solver outputs (skip verification with --no_verify)
python src/inference.py \
    --no_verify \
    --solver_model_name Qwen/Qwen3-0.6B \
    --dataset_name gsm \
    --output_dir result

# Step 2: Test different verifiers quickly (loads from solver cache automatically)
python src/inference.py \
    --solver_model_name Qwen/Qwen3-0.6B \
    --verifier_model_name Qwen/Qwen3-1.7B \
    --dataset_name gsm \
    --output_dir result

Citations

If you have any questions or find any bugs, please feel free to contact Jack Lu (yl11330@nyu.edu). If you find our work helpful, please consider giving us a ⭐ and citing us!

@misc{lu2025llmverification,
    title={When Does Verification Pay Off? A Closer Look at LLMs as Solution Verifiers},
    author={Jack Lu and Ryan Teehan and Jinran Jin and Mengye Ren},
    year={2025},
    eprint={2512.02304},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
