I’d like to share an issue I encountered and resolved while training the Gemma-3 Instruction model with vLLM serve and GRPO.
When running training with the Gemma-3 Instruction model through a vLLM-based server, I ran into an issue where the KL loss diverged. To investigate, I looked at the part of the code where the KL is computed from the log probabilities.
It turned out that when calling generate on the vLLM serve endpoint, the responses came back without an EOS token; because of this, the completion mask was not applied correctly, which led to an incorrect KL computation.
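To make the failure mode concrete, here is a simplified sketch (not the exact TRL code) of a completion mask that keeps everything up to and including the first EOS; when no EOS appears in a row, the whole row stays unmasked, so padding leaks into the per-token KL term. The tensor name `completion_ids` and the `(batch, seq_len)` right-padded layout are assumptions for illustration:

```python
import torch

def completion_mask_from_eos(completion_ids: torch.Tensor, eos_token_id: int) -> torch.Tensor:
    """Keep tokens up to and including the first EOS of each row.

    If a row contains no EOS at all, every position (padding included)
    stays unmasked -- which is what triggers the diverging KL here.
    """
    batch, seq_len = completion_ids.shape
    is_eos = completion_ids == eos_token_id
    # Default: "keep everything" for rows that never emit an EOS.
    eos_idx = torch.full((batch,), seq_len, dtype=torch.long, device=completion_ids.device)
    has_eos = is_eos.any(dim=1)
    eos_idx[has_eos] = is_eos.int().argmax(dim=1)[has_eos]
    positions = torch.arange(seq_len, device=completion_ids.device).expand(batch, -1)
    return (positions <= eos_idx.unsqueeze(1)).int()
```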
To fix this, I added logic to the _prepare_inputs function of GRPOTrainer.py that replaces the first occurrence of a pad token in each completion with an EOS token, roughly as in the sketch below.
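A minimal sketch of that patch, assuming the completions arrive in `_prepare_inputs` as a right-padded `(batch, seq_len)` tensor; the helper name `replace_first_pad_with_eos` is hypothetical and just illustrates the idea:

```python
import torch

def replace_first_pad_with_eos(completion_ids: torch.Tensor,
                               pad_token_id: int,
                               eos_token_id: int) -> torch.Tensor:
    """Replace the first pad token of every row with the EOS token.

    Rows without any padding (the generation filled the full sequence)
    are left untouched.
    """
    completion_ids = completion_ids.clone()
    is_pad = completion_ids == pad_token_id
    # True only at the first pad position of each row.
    first_pad = is_pad & (is_pad.int().cumsum(dim=1) == 1)
    completion_ids[first_pad] = eos_token_id
    return completion_ids
```

With the EOS restored, the mask from the previous sketch ends at the right position and the KL is computed only over real completion tokens.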
This issue seems to stem from a characteristic of the Gemma-3 family of models: the EOS token is not emitted in the generated responses. Hopefully this helps others facing the same problem, so I'm sharing it here in the Discussion. :)