In this loss function, there is a kind of regularization built in for the new policy ($\pi_\theta$).
When the advantage is positive, the objective will increase if the action becomes more likely—that is, if $\pi_{\theta}(a|s)$ increases.
But the min in this term puts a limit on how much the objective can increase.
Once $\pi_{\theta}(a|s) > (1+\epsilon) \pi_{\theta_k}(a|s) $, the min kicks in and this term hits a ceiling of $(1+\epsilon) A^{\pi_{\theta_k}}(s,a)$.
Thus, the new policy does not benefit by going far away from the old policy.
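To make this ceiling concrete, here is a small numeric sketch of the positive-advantage case. The advantage value, the clip range $\epsilon = 0.2$, and the probability ratios are made-up numbers for illustration, not values from the example code:

```python
import numpy as np

def clipped_term(ratio, advantage, epsilon):
    """Single-sample clipped objective: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    return min(ratio * advantage,
               np.clip(ratio, 1 - epsilon, 1 + epsilon) * advantage)

advantage = 2.0   # A^{pi_theta_k}(s, a) > 0 (made-up value)
epsilon = 0.2     # clip range (made-up value)

# ratio = pi_theta(a|s) / pi_theta_k(a|s); once it exceeds 1 + epsilon,
# the term stays pinned at the ceiling (1 + epsilon) * A = 2.4.
for ratio in [1.0, 1.1, 1.2, 1.5, 3.0]:
    print(ratio, clipped_term(ratio, advantage, epsilon))
```

Increasing the ratio past $1+\epsilon$ yields no further gain, so the gradient through this term vanishes there and the update has no incentive to push the policy further away.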
Likewise, when the advantage is negative, the objective will increase if the action becomes less likely—that is, if $\pi_{\theta}(a|s)$ decreases.
But the max in this term puts a limit on how much the objective can increase.
Once $\pi_{\theta}(a|s) < (1-\epsilon) \pi_{\theta_k}(a|s)$, the max kicks in and this term hits a ceiling of $(1-\epsilon) A^{\pi_{\theta_k}}(s,a)$.
Thus, again the new policy does not benefit by going far away from the old policy.
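Putting both cases together, the full clipped surrogate objective over a batch can be sketched as below. This is a minimal NumPy illustration, not the example code itself; the function name, argument names, and the default $\epsilon = 0.2$ are assumptions:

```python
import numpy as np

def ppo_clip_objective(log_probs_new, log_probs_old, advantages, epsilon=0.2):
    """Mean clipped surrogate objective (to be maximized).

    Hypothetical sketch: names and epsilon default are illustrative.
    """
    # ratio r = pi_theta(a|s) / pi_theta_k(a|s), computed from log-probs
    ratio = np.exp(log_probs_new - log_probs_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - epsilon, 1 + epsilon) * advantages
    # min covers both cases: it caps the gain at (1+eps)*A when A > 0,
    # and caps it at (1-eps)*A when A < 0.
    return np.mean(np.minimum(unclipped, clipped))
```

In practice one maximizes this objective (or minimizes its negative) with minibatch gradient ascent on $\theta$, holding the old log-probabilities fixed.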
Advantage Estimates
In our example code, the advantage estimate is computed as follows: