Paper: RLVR: Learning from Automated Verification (Zhang et al., 2024)
Status: Reference documentation (implementation pending)
RLVR replaces the learned reward model of standard RLHF with automated verification for domains with verifiable correctness (math, code, logic).
Standard RLHF: learned reward model (subjective)
RLVR: verification-based rewards (objective)
- For math: check whether the final answer is correct
- For code: run the test cases
- For logic: verify proof steps
def verification_reward(response, problem):
    """Return 1 if correct, 0 if incorrect."""
    if problem.type == "math":
        return check_math_answer(response, problem.answer)
    elif problem.type == "code":
        return run_test_cases(response, problem.test_cases)
    elif problem.type == "logic":
        return verify_proof(response, problem.theorem)
    return 0  # Unrecognized problem type yields no reward

Train on final answer correctness:
r(x, y) = 1{final_answer(y) == ground_truth(x)}
Reward intermediate steps:
r(x, y) = sum_t w_t · 1{step_t correct}
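A minimal sketch of the step-weighted variant; step_verifier and the uniform default weights are illustrative assumptions, not details from the paper:

def process_reward(steps, step_verifier, weights=None):
    # Weighted sum of per-step correctness indicators; step_verifier is a
    # hypothetical callable returning True/False for one reasoning step.
    if weights is None:
        weights = [1.0] * len(steps)  # Uniform weights as a default assumption
    return sum(w * float(step_verifier(s)) for w, s in zip(weights, steps))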
L_RLVR = E_{x,y~π}[ -r_verify(x, y) · log π(y|x) ]
where r_verify(x, y) ∈ {0, 1} is the verification result
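A minimal PyTorch sketch of this objective, assuming logprobs holds each sampled response's summed token log-probabilities:

import torch

def rlvr_loss(logprobs, rewards):
    # REINFORCE-style estimate of L_RLVR: -r_verify * log pi(y|x),
    # averaged over the sampled batch.
    return -(rewards * logprobs).mean()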
Can use any RL algorithm (PPO, GRPO, RLOO) with verification rewards.
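For instance, a GRPO-style update converts the binary verification rewards into group-normalized advantages; this sketch assumes group_size responses are sampled per prompt:

import torch

def grpo_advantages(rewards, group_size):
    # Normalize 0/1 rewards within each prompt's group of samples, so
    # correct responses get positive advantages and incorrect ones negative.
    r = rewards.view(-1, group_size)
    adv = (r - r.mean(dim=1, keepdim=True)) / (r.std(dim=1, keepdim=True) + 1e-8)
    return adv.view(-1)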
import torch

class RLVRAgent:
    def __init__(self, policy, verifier, rl_agent):
        self.policy = policy
        self.verifier = verifier    # Automated verification function
        self.rl_agent = rl_agent    # Underlying RL optimizer (e.g., GRPO, RLOO)

    def train_step(self, prompts):
        # Generate responses
        responses = self.policy.generate(prompts)
        # Verify each response to obtain 0/1 rewards
        rewards = []
        for prompt, response in zip(prompts, responses):
            rewards.append(self.verifier(prompt, response))
        # Update policy with RL (e.g., GRPO, RLOO)
        self.rl_agent.update({
            "input_ids": responses,
            "rewards": torch.tensor(rewards, dtype=torch.float32),
            # ... additional fields required by the RL algorithm
        })
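A hedged usage sketch of the loop above; my_policy, my_grpo, and prompt_loader are placeholder objects, not an API from the paper:

agent = RLVRAgent(
    policy=my_policy,
    # Adapt verification_reward's (response, problem) signature, assuming
    # each prompt carries its problem specification.
    verifier=lambda problem, response: verification_reward(response, problem),
    rl_agent=my_grpo,
)
for prompt_batch in prompt_loader:
    agent.train_step(prompt_batch)  # Generate, verify, update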
Strengths:
- Objective rewards: No annotation bias
- Scalable: Automated verification
- Domain-specific: Leverages verifiability
- Accurate: Ground truth available
Limitations:
- Domain-specific: Only works for verifiable tasks
- Binary rewards: Often 0/1, limited feedback
- Verifier quality: Depends on verifier correctness
Mathematics:
- Arithmetic, algebra, calculus
- Verify final answer against ground truth
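check_math_answer is not defined in this excerpt; a minimal sketch, assuming final answers normalize to exact rationals:

from fractions import Fraction

def check_math_answer(response, ground_truth):
    # Compare final answers, tolerating formatting differences
    # ("0.5" vs "1/2") by normalizing both to exact fractions.
    try:
        return int(Fraction(str(response).strip()) == Fraction(str(ground_truth).strip()))
    except (ValueError, ZeroDivisionError):
        return int(str(response).strip() == str(ground_truth).strip())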
Code Generation:
- Unit test execution
- Static analysis checks
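run_test_cases is likewise undefined here; a minimal sketch that executes the candidate code and assert-style tests (a real system needs sandboxing and timeouts):

def run_test_cases(code_str, test_cases):
    # Execute candidate code, then each test, e.g. "assert add(2, 3) == 5".
    # WARNING: exec on untrusted model output requires real isolation.
    namespace = {}
    try:
        exec(code_str, namespace)
        for test in test_cases:
            exec(test, namespace)
        return 1
    except Exception:
        return 0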
Formal Logic:
- Proof verification
- Type checking
Games:
- Win/loss verification
- Rule compliance
Best for:
- Verifiable domains (math, code, logic)
- Objective correctness metrics
- Automated verifiers available
Avoid when:
- Tasks are subjective (e.g., creative writing)
- No ground truth exists
- No verifier is available
Hybrid approach:
r_total = α · r_verify + (1 - α) · r_learned
where:
r_verify: verification reward
r_learned: score from a learned reward model
α: trade-off weight
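A one-line sketch of the interpolation; the alpha default is an arbitrary illustrative value:

def hybrid_reward(r_verify, r_learned, alpha=0.5):
    # Blend the objective verification signal with a learned reward model score.
    return alpha * r_verify + (1 - alpha) * r_learned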

@article{zhang2024rlvr,
title={RLVR: Learning from Automated Verification},
author={Zhang, Jie and others},
year={2024}
}

Related:
- Process Reward Models (Lightman et al., 2023)
- Let's Verify Step by Step (OpenAI, 2023)
- RLEIF (Yuan et al., 2024)