PRIME-RL implements asynchronous off-policy training instead of the traditional synchronous on-policy training. This means that we allow inference to generate rollouts from a stale policy up to `max_async_level` (denoted $k$) steps ahead of the trainer. With $k=1$, and with trainer and inference step timings being equal, this allows running without any idle time on either the trainer or inference. By default, we set $k=2$ to additionally allow overlapping the weight broadcast over the Internet, which is needed for decentralized training.
We adopt a loss objective capable of handling the natural distribution shift caused by the off-policy nature of the training. By default, we use a token-level variant of the AIPO training objective introduced in LlamaRL, but omit the entropy and KL loss terms.
At each step, we sample rollouts from the inference policy and minimize the token-level loss

$$
\mathcal{L}(\theta) = -\frac{1}{|y|} \sum_{t=1}^{|y|} \min\!\left(\frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\text{rollout}}(y_t \mid x, y_{<t})},\, \delta\right) A,
$$

where $\pi_\theta$ is the policy being trained, $\pi_{\text{rollout}}$ is the (possibly stale) policy that generated the rollout, $A$ is the advantage, and $\delta$ is the importance-ratio clipping threshold.
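A minimal sketch of such a clipped importance-sampling loss, assuming per-token log-probabilities of the sampled tokens under both policies are available (the function name, argument layout, and the `delta=4.0` default are illustrative assumptions, not the PRIME-RL implementation):

```python
import math

def aipo_token_loss(trainer_logprobs, rollout_logprobs, advantage, delta=4.0):
    """Token-level clipped importance-sampling loss (AIPO-style sketch).

    trainer_logprobs / rollout_logprobs: log-probs of the sampled tokens
    under the current policy and the (possibly stale) rollout policy.
    Capping the ratio at `delta` keeps stale rollouts from blowing up
    the update. Hyperparameter values here are hypothetical.
    """
    losses = []
    for lp_new, lp_old in zip(trainer_logprobs, rollout_logprobs):
        ratio = math.exp(lp_new - lp_old)  # pi_theta / pi_rollout per token
        losses.append(-min(ratio, delta) * advantage)
    return sum(losses) / len(losses)
```

With identical log-probs the ratio is 1 and the loss reduces to the negated advantage; when the trainer has drifted far from the rollout policy, the ratio is clipped at `delta`, bounding the gradient contribution of each token.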
PRIME-RL uses a global training step definition:

- Trainer: Produces policy $\pi_n$ with weights $\theta_n$ from rollouts $(x_n, y_n)$
- Inference: Produces rollouts $(x_n, y_n)$ from policy $\pi_{\max(0, n-k)}$

Here, $k$ is the `max_async_level` parameter, which defaults to 2. Note that we use 0-indexed steps to cleanly indicate that at each step, the off-policy gap is at most $k$.
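The bound on the off-policy gap can be checked directly from the step definition (a small illustrative helper, not part of the PRIME-RL codebase):

```python
def off_policy_gap(n: int, k: int) -> int:
    """Gap between trainer step n and the policy version pi_{max(0, n-k)}
    that generated its rollouts. With 0-indexed steps the bound gap <= k
    holds exactly: the first k steps bootstrap from pi_0 with a smaller
    gap, and every later step has a gap of exactly k."""
    return n - max(0, n - k)

# For k = 2: gaps over the first six steps are [0, 1, 2, 2, 2, 2].
gaps = [off_policy_gap(n, k=2) for n in range(6)]
print(gaps)
```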
