Version requirement: ms-swift>=3.10
REINFORCE Leave-One-Out (RLOO) is a reinforcement learning algorithm based on the classic REINFORCE policy-gradient method. It constructs an unbiased advantage baseline via the Leave-One-Out (LOO) technique.
For clarity, we explain RLOO by contrasting it with GRPO (Group Relative Policy Optimization).
Both GRPO and RLOO estimate advantages via intra-group comparisons to avoid the high variance of a global baseline. Their core differences are mainly in the following aspects:
1. GRPO (Group Relative Policy Optimization)
For each prompt, GRPO generates $G$ responses and standardizes each reward within the group to obtain the advantage:

$$
A_i = \frac{R_i - \text{mean}(\{R_j\}_{j=1}^G)}{\text{std}(\{R_j\}_{j=1}^G)}
$$

Where:

- $R_i$ is the reward of the $i$-th sample
- $\text{mean}(\{R_j\}_{j=1}^G) = \frac{1}{G}\sum_{j=1}^G R_j$ is the group mean
- $\text{std}(\{R_j\}_{j=1}^G)$ is the group standard deviation
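As a concrete illustration, the standardization above can be sketched in a few lines of NumPy (a toy sketch, not the GRPOTrainer implementation; the `eps` safeguard against a zero standard deviation is an assumption):

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Standardize rewards within a group of G samples (GRPO-style).

    `eps` guards against division by zero when all rewards in the
    group are identical; a real trainer may handle this differently.
    """
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: a group of G = 4 sampled responses for one prompt
adv = grpo_advantages([1.0, 2.0, 3.0, 4.0])
print(adv)  # zero-mean advantages, scaled by the group std
```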
2. RLOO (REINFORCE Leave-One-Out)
For each prompt, RLOO generates $K$ responses. The baseline for the $i$-th sample is the mean reward of the other $K-1$ samples:

$$
A_i = R_i - \frac{1}{K-1}\sum_{j \neq i} R_j
$$

This can be equivalently rewritten as:

$$
A_i = \frac{K}{K-1}\left(R_i - \text{mean}(\{R_j\}_{j=1}^K)\right)
$$

where $\text{mean}(\{R_j\}_{j=1}^K) = \frac{1}{K}\sum_{j=1}^K R_j$ is the group mean.

Note: We use $K$ here to match the notation in the paper. It has the same meaning as $G$ in GRPO and corresponds to the configuration parameter `num_generations`.
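The equivalence of the two forms can be checked numerically. The following sketch (not ms-swift code) computes the leave-one-out baseline explicitly and confirms it matches the closed form:

```python
import numpy as np

def rloo_advantages(rewards):
    """Leave-one-out advantages: the baseline for sample i is the
    mean of the other K-1 rewards in the group."""
    r = np.asarray(rewards, dtype=np.float64)
    K = r.size
    loo_baseline = (r.sum() - r) / (K - 1)  # excludes each sample's own reward
    return r - loo_baseline

r = [1.0, 2.0, 3.0, 4.0]
a1 = rloo_advantages(r)

# Equivalent closed form: K/(K-1) * (R_i - mean(R))
K = len(r)
a2 = K / (K - 1) * (np.asarray(r) - np.mean(r))
print(np.allclose(a1, a2))  # True
```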
Why Leave-One-Out?
The key advantage is unbiasedness. For the $i$-th sample, the baseline $\frac{1}{K-1}\sum_{j \neq i} R_j$ is computed only from the other $K-1$ samples, so it is statistically independent of $R_i$. Subtracting a baseline that does not depend on the current sample's reward reduces variance without biasing the policy-gradient estimate. By contrast, GRPO's group mean includes $R_i$ itself, so its advantage estimate is not strictly unbiased.
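The independence of the baseline from the sample's own reward can be verified directly. This toy snippet (illustration only, not ms-swift code) perturbs one sample's reward and shows that its LOO baseline is unchanged, while the plain group mean shifts:

```python
import numpy as np

def loo_baselines(rewards):
    """LOO baseline for each sample: mean of the other K-1 rewards."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r.sum() - r) / (r.size - 1)

r = np.array([1.0, 2.0, 3.0, 4.0])
r_perturbed = r.copy()
r_perturbed[0] += 100.0  # change only sample 0's own reward

# Sample 0's LOO baseline does not depend on its own reward...
print(loo_baselines(r)[0] == loo_baselines(r_perturbed)[0])  # True
# ...while the group mean (which GRPO's baseline includes) shifts:
print(np.isclose(r.mean(), r_perturbed.mean()))  # False
```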
To prevent the policy from drifting too far from the reference policy, both algorithms introduce KL divergence regularization, but in different ways:
GRPO: Adds KL divergence as an independent regularization term to the loss:

$$
\mathcal{L} = \mathcal{L}_{\text{policy}} + \beta \, D_{\text{KL}}\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right)
$$

RLOO: Integrates KL divergence directly into the reward, constructing a modified reward:

$$
\tilde{R}_i = R_i - \beta \, D_{\text{KL}}\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right)
$$

where $\beta$ is the KL regularization coefficient (set via `--beta`) and $\pi_{\text{ref}}$ is the reference policy; the leave-one-out advantage is then computed from the modified rewards $\tilde{R}_i$.
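A minimal sketch of the RLOO-style reward modification, assuming per-sample KL estimates are already available (variable names are illustrative, not GRPOTrainer internals):

```python
import numpy as np

beta = 0.04  # KL coefficient (the `--beta` argument)

rewards = np.array([1.0, 2.0, 3.0, 4.0])
kl = np.array([0.10, 0.05, 0.20, 0.15])  # per-sample KL(pi_theta || pi_ref) estimates

# GRPO style (kl_in_reward=false): KL enters the loss as a separate term,
# e.g. loss = policy_loss + beta * kl.mean()

# RLOO style (kl_in_reward=true): fold KL into the reward first,
# then compute leave-one-out advantages from the modified rewards
modified = rewards - beta * kl
K = modified.size
advantages = K / (K - 1) * (modified - modified.mean())
print(advantages)
```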
RLOO training can be enabled on top of GRPOTrainer by setting the following parameters:
```bash
# Basic RLOO configuration
--advantage_estimator rloo   # Use RLOO's leave-one-out advantage estimator
--kl_in_reward true          # Integrate KL divergence into the reward (default for RLOO)
```

You can refer to this script for training.
- `--advantage_estimator`: Choose the advantage estimator
  - `grpo` (default): standardize using the group mean and standard deviation
  - `rloo`: construct the baseline via Leave-One-Out
- `--kl_in_reward`: Controls where the KL term is applied
  - `false`: KL as a separate regularization term in the loss (GRPO style)
  - `true`: subtract KL directly from the reward to form a modified reward (RLOO style)
- `--num_generations`: Number of samples per prompt, i.e., $K$
- `--beta`: KL regularization coefficient $\beta$; controls how conservatively the policy updates
Other parameters are consistent with the GRPO arguments.
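Putting the flags together, a launch command might look as follows. The three RLOO-related flags come from this document; the `swift rlhf --rlhf_type grpo` entry point and the model/dataset/reward arguments are illustrative assumptions based on typical GRPO usage and should be adapted to your setup:

```shell
# Sketch of an RLOO run on top of the GRPO trainer.
# Model, dataset, and reward arguments below are placeholders.
swift rlhf \
    --rlhf_type grpo \
    --model Qwen/Qwen2.5-7B-Instruct \
    --dataset AI-MO/NuminaMath-TIR \
    --reward_funcs accuracy \
    --advantage_estimator rloo \
    --kl_in_reward true \
    --num_generations 8 \
    --beta 0.04
```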