
Discounted reward calculation in PPO.py breaks when trajectory reaches max_ep_len in train.py #73

@CaoAnda

Description


In train.py, when a trajectory is cut off because it reaches max_ep_len, the last done value stored for that trajectory in ppo_agent.buffer.is_terminals is False, since the environment itself has not signalled termination.

PPO-PyTorch/train.py

Lines 173 to 181 in 728cce8

    for t in range(1, max_ep_len+1):

        # select action with policy
        action = ppo_agent.select_action(state)
        state, reward, done, _ = env.step(action)

        # saving reward and is_terminals
        ppo_agent.buffer.rewards.append(reward)
        ppo_agent.buffer.is_terminals.append(done)

This causes a problem in the update function of PPO.py: the running discounted_reward is only reset when is_terminal is True, so when the last is_terminal value of a trajectory is False, the returns computed for that trajectory absorb the discounted rewards of the following trajectory in the buffer (see the sketch after the snippet below).

PPO-PyTorch/PPO.py

Lines 200 to 208 in 728cce8

    def update(self):
        # Monte Carlo estimate of returns
        rewards = []
        discounted_reward = 0
        for reward, is_terminal in zip(reversed(self.buffer.rewards), reversed(self.buffer.is_terminals)):
            if is_terminal:
                discounted_reward = 0
            discounted_reward = reward + (self.gamma * discounted_reward)
            rewards.insert(0, discounted_reward)
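
To make the failure concrete, here is a minimal, standalone sketch using the same return loop as update(), with hypothetical reward values: the buffer holds two trajectories back to back, the first cut off at max_ep_len (its last flag is False) and the second terminated normally.

    gamma = 0.99

    # trajectory 1 (cut off at max_ep_len) followed by trajectory 2 (terminated normally)
    rewards      = [1.0, 1.0, 1.0, 5.0, 5.0]
    is_terminals = [False, False, False, False, True]

    returns = []
    discounted_reward = 0
    for reward, is_terminal in zip(reversed(rewards), reversed(is_terminals)):
        if is_terminal:
            discounted_reward = 0
        discounted_reward = reward + (gamma * discounted_reward)
        returns.insert(0, discounted_reward)

    print(returns)
    # [12.62..., 11.74..., 10.85..., 9.95, 5.0]
    # The first three returns belong to trajectory 1, but they include the
    # discounted 5.0 rewards of trajectory 2, because discounted_reward is
    # never reset at the truncation boundary.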

Solution:

    for t in range(1, max_ep_len+1):

        # select action with policy
        action = ppo_agent.select_action(state)
        state, reward, done, _ = env.step(action)

        # saving reward and is_terminals
        ppo_agent.buffer.rewards.append(reward)
        ppo_agent.buffer.is_terminals.append(True if t == max_ep_len else done)
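
With the truncation step flagged as terminal, the return of each trajectory only accumulates its own rewards. This does discard whatever reward the agent might have collected beyond the time limit, so the return estimate for truncated episodes is slightly pessimistic, but it stops returns from unrelated trajectories leaking into each other. Re-running the sketch above with the corrected flags (same hypothetical reward values) shows the effect:

    gamma   = 0.99
    rewards = [1.0, 1.0, 1.0, 5.0, 5.0]

    # the step that hit max_ep_len is now recorded as terminal
    is_terminals = [False, False, True, False, True]

    returns = []
    discounted_reward = 0
    for reward, is_terminal in zip(reversed(rewards), reversed(is_terminals)):
        if is_terminal:
            discounted_reward = 0
        discounted_reward = reward + (gamma * discounted_reward)
        returns.insert(0, discounted_reward)

    print(returns)
    # [2.9701, 1.99, 1.0, 9.95, 5.0] -- trajectory 1's returns no longer
    # contain trajectory 2's rewards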
