The following line can produce NaN when 'self.max_action' is large enough for the magnitude of 'action' to exceed 1. In that case 1-action.pow(2)+self.reparam_noise is negative, and the logarithm of a negative number is NaN.
log_probs -= T.log(1-action.pow(2)+self.reparam_noise)
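A minimal sketch that reproduces the problem, assuming 'action' is a tanh output multiplied by 'self.max_action' (the report does not show how 'action' is built, so this is an assumption). The values of reparam_noise and max_action and the sample tensor are hypothetical. It also shows one possible workaround, not necessarily the author's intended fix: compute the correction term on the unscaled tanh output, which always lies in (-1, 1).

```python
import torch as T

reparam_noise = 1e-6  # assumed stand-in for self.reparam_noise
max_action = 2.0      # assumed stand-in for self.max_action, chosen greater than 1

raw = T.tensor([0.1, 1.5, 3.0])  # hypothetical pre-squash samples
squashed = T.tanh(raw)           # always strictly inside (-1, 1)
action = squashed * max_action   # scaled action, magnitude can exceed 1

# Term computed on the scaled action: the log argument goes negative
# for the larger samples, so T.log returns NaN there.
broken = T.log(1 - action.pow(2) + reparam_noise)

# Term computed on the unscaled tanh output: the argument stays positive.
safe = T.log(1 - squashed.pow(2) + reparam_noise)

print(broken)
print(safe)
```

With these inputs, 'broken' contains NaN entries while 'safe' stays finite, which suggests the scaling by 'self.max_action' should happen after the log-probability correction rather than before it.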