In train.py, when a trajectory is cut off at max_ep_len, the last done value appended for that trajectory to ppo_agent.buffer.is_terminals is False, because the environment never actually signals termination at the time-limit step.
train.py, lines 173 to 181 at commit 728cce8:
for t in range(1, max_ep_len+1):

    # select action with policy
    action = ppo_agent.select_action(state)
    state, reward, done, _ = env.step(action)

    # saving reward and is_terminals
    ppo_agent.buffer.rewards.append(reward)
    ppo_agent.buffer.is_terminals.append(done)
This causes a problem in the update function of PPO.py: the Monte Carlo return is computed incorrectly when the last is_terminal value of a trajectory is False, because discounted_reward is never reset to 0 at that trajectory boundary, so rewards from the next trajectory stored in the buffer leak into the truncated one.
PPO.py, lines 200 to 208 at commit 728cce8:
def update(self):

    # Monte Carlo estimate of returns
    rewards = []
    discounted_reward = 0
    for reward, is_terminal in zip(reversed(self.buffer.rewards), reversed(self.buffer.is_terminals)):
        if is_terminal:
            discounted_reward = 0
        discounted_reward = reward + (self.gamma * discounted_reward)
        rewards.insert(0, discounted_reward)
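To make the failure concrete, here is a minimal, self-contained sketch of the same backward loop (hypothetical rewards and gamma, not taken from the repo) over a buffer holding two consecutive trajectories, where the first one was cut off at max_ep_len and therefore ends with is_terminal = False:

gamma = 0.99

# two trajectories of length 3, flattened into one buffer;
# the first one was truncated at max_ep_len, so its last flag is (wrongly) False
rewards      = [1.0, 1.0, 1.0,  5.0, 5.0, 5.0]
is_terminals = [False, False, False,  False, False, True]

returns = []
discounted_reward = 0
for reward, is_terminal in zip(reversed(rewards), reversed(is_terminals)):
    if is_terminal:
        discounted_reward = 0
    discounted_reward = reward + (gamma * discounted_reward)
    returns.insert(0, discounted_reward)

# the returns of the first trajectory now include the 5.0 rewards of the
# second trajectory, because discounted_reward was never reset at index 3
print(returns[:3])  # roughly [17.4, 16.5, 15.7] instead of [2.97, 1.99, 1.0]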
Proposed solution:
for t in range(1, max_ep_len+1):

    # select action with policy
    action = ppo_agent.select_action(state)
    state, reward, done, _ = env.step(action)

    # saving reward and is_terminals
    ppo_agent.buffer.rewards.append(reward)
    ppo_agent.buffer.is_terminals.append(True if t == max_ep_len else done)
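Re-running the same hypothetical two-trajectory sketch with the flag patched at the truncation step confirms that discounted_reward is now reset at the boundary:

gamma = 0.99

# same buffer as in the sketch above, but with the flag patched at index 2
rewards      = [1.0, 1.0, 1.0,  5.0, 5.0, 5.0]
is_terminals = [False, False, True,  False, False, True]

returns = []
discounted_reward = 0
for reward, is_terminal in zip(reversed(rewards), reversed(is_terminals)):
    if is_terminal:
        discounted_reward = 0
    discounted_reward = reward + (gamma * discounted_reward)
    returns.insert(0, discounted_reward)

# the first trajectory's returns now discount only its own rewards
print(returns[:3])  # [2.9701, 1.99, 1.0]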