Fix Numerical issues in MaskPPO Training #302
Open
Description
High-level description: Improve numerical stability by mirroring the logic used by TorchRL, which uses `-inf` instead of a small number and handles masking better.
Details of changes:
In `__init__`, store the original "pristine" logits, then recompute the masked logits each time from that pristine copy in `apply_masking`.
`min_real` (the minimum possible value of the dtype of the logits), re-normalizes, and then computes entropy only over valid entries.
Context
closes #81
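The masking change described under "Details of changes" can be illustrated with a minimal, library-free sketch. It is not the code in this PR: the class name `MaskedCategoricalSketch`, the attribute names, and the use of plain Python lists are assumptions made for illustration, and `MIN_REAL` stands in for the most negative finite value of the logits' dtype. The sketch keeps a pristine copy of the logits, rebuilds the masked logits from that copy on every `apply_masking` call, and sums the entropy over valid entries only.

```python
import math
import sys

NEG_INF = float("-inf")
# Stand-in for "the minimum possible value of the dtype of the logits".
MIN_REAL = -sys.float_info.max


class MaskedCategoricalSketch:
    """Hypothetical illustration of the masking logic; not the PR's class."""

    def __init__(self, logits):
        # Keep an untouched copy so repeated masking never compounds.
        self._pristine_logits = list(logits)
        self._mask = [True] * len(self._pristine_logits)
        self.log_probs = self._normalize(self._pristine_logits)

    @staticmethod
    def _normalize(logits):
        # Subtract log-sum-exp, shifted by the max for stability.
        top = max(logits)
        log_z = top + math.log(sum(math.exp(x - top) for x in logits))
        return [x - log_z for x in logits]

    def apply_masking(self, mask):
        if not any(mask):
            raise ValueError("mask must allow at least one action")
        self._mask = list(mask)
        # Always start from the pristine logits, never from the last masked ones.
        masked = [
            x if keep else NEG_INF
            for x, keep in zip(self._pristine_logits, self._mask)
        ]
        self.log_probs = self._normalize(masked)

    def entropy(self):
        # A finite fill value avoids 0 * -inf = nan when forming p * log(p).
        filled = [
            x if keep else MIN_REAL
            for x, keep in zip(self._pristine_logits, self._mask)
        ]
        log_probs = self._normalize(filled)
        # Sum over valid entries only.
        return -sum(
            math.exp(lp) * lp
            for lp, keep in zip(log_probs, self._mask)
            if keep
        )


dist = MaskedCategoricalSketch([0.0, 0.0, 5.0])
dist.apply_masking([True, True, False])
probs = [math.exp(lp) for lp in dist.log_probs]
```

With the third action masked out, the two remaining actions have equal logits, so each receives probability 0.5 and the entropy equals `math.log(2)`. Applying a different mask afterwards restores the previously masked action, because the masked logits are rebuilt from the pristine copy.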
Types of changes
Checklist:
`make format` (required)
`make check-codestyle` and `make lint` (required)
`make pytest` and `make type` both pass. (required)