Commit b060d1d

docs: update grpo.md (#1106)
Signed-off-by: Xuehan <xxman@google.com>
1 parent 62112f6 commit b060d1d

File tree

1 file changed: +1 -0 lines changed


docs/guides/grpo.md

Lines changed: 1 addition & 0 deletions
@@ -252,6 +252,7 @@ We roughly approximate the entropy of the LLM's distribution throughout training
 $$
 E_{x \sim \pi_{\text{inference}}}\left[-\frac{\pi_{\text{training}}(x)}{\pi_{\text{inference}}(x)} \log \pi_{\text{training}}(x)\right]
 $$
+
 using the rollouts in each training global batch as Monte-Carlo samples. The ratio of $\pi$ in the formula importance-corrects for the mismatch between the training policy, which drifts over the course of a single GRPO step, and the policy of the inference framework that generated the rollouts.
 
 We use this to track whether our models are entropy-collapsing too quickly during training (as is quite common). This is a fairly rough Monte-Carlo approximation, so we wouldn't recommend using it directly as an entropy bonus or otherwise backpropagating through it. You can take a look at NeMo-Aligner's [implementation](https://github.com/NVIDIA/NeMo-Aligner/blob/main/nemo_aligner/utils/distributed.py#L351) of a full entropy calculation if you're interested (an efficient calculation in NeMo-RL is work in progress).
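For illustration only, here is a minimal sketch (not the NeMo-RL or NeMo-Aligner implementation) of how the importance-corrected entropy estimate described in the added section could be computed from per-token log-probabilities; the tensor names `train_logprobs`, `inference_logprobs`, and `mask` are assumptions for this example.

```python
import torch

@torch.no_grad()  # the doc advises against backpropagating through this metric
def approx_entropy(train_logprobs: torch.Tensor,
                   inference_logprobs: torch.Tensor,
                   mask: torch.Tensor) -> torch.Tensor:
    """Monte-Carlo estimate of E[-(pi_training / pi_inference) * log pi_training]
    over the rollouts in a training global batch (hypothetical tensor layout).

    train_logprobs, inference_logprobs: [batch, seq] log-probabilities of the
        sampled tokens under the training policy and the inference framework.
    mask: [batch, seq] with 1.0 for generated response tokens, 0.0 elsewhere.
    """
    # Importance ratio pi_training / pi_inference for each sampled token,
    # correcting for the training/inference policy mismatch.
    ratio = torch.exp(train_logprobs - inference_logprobs)
    # Per-token contribution -ratio * log pi_training, then a masked mean.
    per_token = -ratio * train_logprobs
    return (per_token * mask).sum() / mask.sum().clamp(min=1.0)
```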
