-
Hi, I wanted to start a discussion on how to properly use the Prioritized Experience Replay class. You provide a basic DQN example in this repository, which I have some questions about. To update the priorities you seem to use the mean squared error, but I think in the paper they use abs(target - Q_pred) to update the priorities.
My next question is about the importance sampling weights: does somebody have experience with the hyperparameters? This is roughly how I'm using the PER (illustrative sketch below). Am I doing something conceptually wrong?
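(The following is only a sketch of the pattern I mean; the network, gamma, optimizer, and env_dict layout are placeholders rather than my actual code.)

```python
# Illustrative sketch only -- q_net, gamma, optimizer etc. are placeholders.
import numpy as np
import torch
import torch.nn as nn
from cpprb import PrioritizedReplayBuffer

obs_dim, n_act, gamma = 4, 2, 0.99
q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_act))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

rb = PrioritizedReplayBuffer(
    2 ** 16,
    {"obs": {"shape": obs_dim}, "act": {"dtype": np.int64}, "rew": {},
     "next_obs": {"shape": obs_dim}, "done": {}},
    alpha=0.6,  # priority exponent
)

# Fill the buffer with a few dummy transitions so train_step() can run.
for _ in range(256):
    rb.add(obs=np.random.rand(obs_dim), act=np.random.randint(n_act),
           rew=np.random.rand(), next_obs=np.random.rand(obs_dim), done=0.0)

def train_step(batch_size=32, beta=0.4):  # beta would normally be annealed towards 1
    batch = rb.sample(batch_size, beta)
    obs = torch.as_tensor(batch["obs"], dtype=torch.float32)
    act = torch.as_tensor(batch["act"], dtype=torch.int64)
    rew = torch.as_tensor(batch["rew"], dtype=torch.float32).squeeze(1)
    next_obs = torch.as_tensor(batch["next_obs"], dtype=torch.float32)
    done = torch.as_tensor(batch["done"], dtype=torch.float32).squeeze(1)
    weights = torch.as_tensor(batch["weights"], dtype=torch.float32)

    q_pred = q_net(obs).gather(1, act).squeeze(1)
    with torch.no_grad():
        # (A separate target network is omitted for brevity.)
        target = rew + gamma * (1.0 - done) * q_net(next_obs).max(1).values

    td_error = target - q_pred
    # Importance-sampling weights applied per sample before averaging.
    loss = (weights * td_error.pow(2)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # New priorities from abs(target - Q_pred), as I understand the PER paper.
    rb.update_priorities(batch["indexes"], np.abs(td_error.detach().numpy()))
```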
Thanks a lot.
-
Hi @Kait0, I'm sorry for my incorrect example. https://github.com/ymd-h/cpprb/blob/444a510282bb8bcba57ec21f6e9050ea2e181de0/example/dqn.py I assume you are referring to the example code above, right?
You are right. The priority should be the absolute value of the TD error.
Yes.
No, that is a bug. (To be honest, I hadn't understood it correctly when I wrote that example code.)
Recently, there has been a paper which theoretically explains the contributions of alpha and beta in PER. According to that paper, PER is somewhat distorted when combined with an MSE loss and beta is not 1.
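Roughly, my paraphrase of that result is this: with sampling probability p_i proportional to |delta_i|^alpha and importance weight w_i = (N p_i)^(-beta), the expected gradient of the weighted squared TD error is

$$
\mathbb{E}_{i \sim p}\left[ w_i \, \nabla_\theta \delta_i^2 \right]
= \sum_i p_i \, (N p_i)^{-\beta} \, \nabla_\theta \delta_i^2
\;\propto\; \sum_i |\delta_i|^{\alpha (1-\beta)} \, \nabla_\theta \delta_i^2 ,
$$

so only beta = 1 recovers the plain uniform MSE gradient; for beta < 1 the large-error samples remain over-weighted in the loss itself.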
I am still implementing LAP support (hopefully it will come soon); however, you can already implement it yourself on top of the current PrioritizedReplayBuffer, roughly along the lines of the sketch below.
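(My reading of LAP: sample proportionally to the clipped priority max(|TD error|, 1), drop the importance-sampling weights, and train with a Huber loss. The snippet is a toy sketch with placeholder data, not a finalized API.)

```python
# Sketch of a LAP-style update on top of the current PrioritizedReplayBuffer.
# Toy data and a tiny regression "network" stand in for a real agent.
import numpy as np
import torch
import torch.nn.functional as F
from cpprb import PrioritizedReplayBuffer

rb = PrioritizedReplayBuffer(32, {"obs": {"shape": 4}, "target": {}}, alpha=0.4)
for _ in range(32):
    rb.add(obs=np.random.rand(4), target=np.random.rand())

net = torch.nn.Linear(4, 1)
opt = torch.optim.SGD(net.parameters(), lr=1e-2)

batch = rb.sample(16)                   # batch["weights"] is computed but intentionally unused
obs = torch.as_tensor(batch["obs"], dtype=torch.float32)
target = torch.as_tensor(batch["target"], dtype=torch.float32)

pred = net(obs)
loss = F.smooth_l1_loss(pred, target)   # Huber loss instead of an IS-weighted MSE
opt.zero_grad()
loss.backward()
opt.step()

td_error = (target - pred).detach().numpy().ravel()
# LAP-style priority: clipped below at 1, so small-error samples are drawn uniformly.
rb.update_priorities(batch["indexes"], np.maximum(np.abs(td_error), 1.0))
```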
(Internally, it is a bit inefficient because of the unused weight calculation.) If you still have questions, please feel free to ask me.
-
Hi,
thanks for your reply.
I got the PER to work on my problem now.
In case somebody comes across the same problems, the first one was a technical issue:
in my code all the Q value / TD error tensors were of matrix shape [batch_size, 1],
while the importance sampling weights from the library are of vector shape [batch_size].
When I multiplied them with the per-sample loss tensor, PyTorch broadcast the result to a [batch_size, batch_size] matrix, which I didn't notice because mean() also works on a matrix and still returns a single number.
To fix this I simply had to unsqueeze the weights after reading them:
replay_weights = torch.unsqueeze(torch.from_numpy(batch1['weights']).to(device), 1)
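A tiny shape-only reproduction of the pitfall (no replay buffer needed; the sizes are made up):

```python
import torch

weights = torch.ones(32)        # [batch_size]     -- what the library returns
td_error = torch.ones(32, 1)    # [batch_size, 1]  -- what my Q/TD tensors looked like

bad = weights * td_error.pow(2)                  # silently broadcasts to [32, 32]
good = weights.unsqueeze(1) * td_error.pow(2)    # [32, 1]: one loss term per sample

print(bad.shape, good.shape)    # torch.Size([32, 32]) torch.Size([32, 1])
print(bad.mean(), good.mean())  # both still produce a single number, which hid the bug
```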
The…