 class NormalizeReward(
     gym.Wrapper[ObsType, ActType, ObsType, ActType], gym.utils.RecordConstructorArgs
 ):
-    r"""This wrapper will scale rewards s.t. the discounted returns have a mean of 0 and std of 1.
-
-    In a nutshell, the rewards are divided through by the standard deviation of a rolling discounted sum of the reward.
-    The exponential moving average will have variance :math:`(1 - \gamma)^2`.
+    r"""Normalizes immediate rewards such that their exponential moving average has an approximately fixed variance.
 
     The property `_update_running_mean` allows you to freeze/continue the running mean calculation of the reward
     statistics. If `True` (default), the `RunningMeanStd` will get updated every time `self.normalize()` is called.
     If `False`, the calculated statistics are used but not updated anymore; this may be used during evaluation.
 
     A vector version of the wrapper exists: :class:`gymnasium.wrappers.vector.NormalizeReward`.
 
-    Important note:
-        Contrary to what the name suggests, this wrapper does not normalize the rewards to have a mean of 0 and a standard
-        deviation of 1. Instead, it scales the rewards such that **discounted returns** have approximately unit variance.
-        See [Engstrom et al.](https://openreview.net/forum?id=r1etN1rtPB) on "reward scaling" for more information.
-
     Note:
         In v0.27, NormalizeReward was updated as the forward discounted reward estimate was incorrectly computed in Gym v0.25+.
         For more detail, read [#3154](https://github.com/openai/gym/pull/3152).
@@ -74,7 +66,6 @@ class NormalizeReward(
     ...     episode_rewards.append(reward)
     ...
     >>> env.close()
-    >>> # will approach 0.99 with more episodes
     >>> np.var(episode_rewards)
     np.float64(0.010162116476634746)
 
@@ -89,7 +80,7 @@ def __init__(
         gamma: float = 0.99,
         epsilon: float = 1e-8,
     ):
-        """This wrapper will normalize immediate rewards s.t. their exponential moving average has a fixed variance.
+        """This wrapper will normalize immediate rewards s.t. their exponential moving average has an approximately fixed variance.
 
         Args:
             env (env): The environment to apply the wrapper to
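The docstring change above clarifies what is actually scaled: each immediate reward is divided by the standard deviation of a running discounted-return estimate, so that discounted returns (not rewards) end up with roughly unit variance. A minimal NumPy-only sketch of that mechanism, written as an illustrative re-implementation rather than Gymnasium's actual code (the names `RunningVariance` and `normalize_rewards` are hypothetical):

```python
import numpy as np


class RunningVariance:
    """Welford-style incremental mean/variance estimate over a stream of scalars."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, x: float) -> None:
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def var(self) -> float:
        # Fall back to 1.0 until we have at least two samples.
        return self.m2 / self.count if self.count > 1 else 1.0


def normalize_rewards(rewards, gamma: float = 0.99, epsilon: float = 1e-8):
    """Scale each reward by the std of a running discounted-return estimate."""
    rms = RunningVariance()
    discounted = 0.0
    out = []
    for r in rewards:
        # Forward discounted reward estimate (the quantity whose computation
        # the note above says was fixed in v0.27).
        discounted = discounted * gamma + r
        rms.update(discounted)
        # The immediate reward is scaled; this makes *returns*, not rewards,
        # have approximately unit variance, as the new docstring states.
        out.append(r / np.sqrt(rms.var + epsilon))
    return out
```

In the real wrapper, the running statistics can additionally be frozen for evaluation (the `_update_running_mean` property mentioned in the docstring), and the discounted estimate is reset on episode termination; both details are omitted here for brevity.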