Add Prioritized Approximation Loss feature #2166
base: master
Conversation
Previous experiments deleted; see the more recent experiments in the latest comment.
Has anyone else been able to get this to work and provide a working example with the code they used? @bilelsgh does not appear to be active anymore. I've been getting a RuntimeError (element 0 does not require grad or have a grad_fn) when using this implementation.
Hi, I'm a bit busy these days. I didn't have this error when I ran the code.
The code runs properly. There is no significant improvement in the reward for CartPole. As detailed in the original PER paper, PER does not always lead to better performance, particularly in environments with low variance in TD errors and a limited number of rare or informative transitions.
However, the reward is substantially better on LunarLander with PAL, showing its effectiveness. Feel free to evaluate the PR directly, or refer to the experiments presented in the paper used as the basis for this implementation.
Here is the code used for the evaluation:

```python
import gymnasium as gym

from stable_baselines3 import DQN
from stable_baselines3.common.buffers import PrioritizedReplayBuffer

env_names = ["CartPole-v1", "LunarLander-v3"]

for env_name in env_names:
    # Compare the default uniform replay buffer against the prioritized one (PAL)
    for buffer in [None, PrioritizedReplayBuffer]:
        log_name = f"{env_name}_classic" if not buffer else f"{env_name}_PAL"
        env = gym.make(env_name)
        model = DQN(
            "MlpPolicy",
            env,
            replay_buffer_class=buffer,
            tensorboard_log="./pe_board",
            verbose=1,
        )
        model.learn(total_timesteps=100_000, log_interval=4, tb_log_name=log_name)
```


Feature overview
Implementation of Prioritized Experience Replay (PER) with Prioritized Approximation Loss (PAL) (linked to #1622).
A NeurIPS 2020 paper shows that using PER is equivalent to adapting the loss function while using uniform experience replay.
This means we can avoid managing a priority-sorted buffer and the associated complexity, while still converging to the same expected gradient.
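For readers who want to see what this looks like concretely, below is a minimal sketch of a PAL-style loss in PyTorch, following the formulation in the referenced NeurIPS 2020 paper: a Huber-like quadratic region for small TD errors and a polynomial region whose exponent mimics prioritized sampling with priority |δ|^α. The function name, default hyperparameters, and exact normalization here are illustrative assumptions and may differ from the code actually added in this PR.

```python
import torch


def pal_loss(td_errors: torch.Tensor, alpha: float = 0.4, min_priority: float = 1.0) -> torch.Tensor:
    """Illustrative PAL-style loss over uniformly sampled TD errors (not the PR's exact code)."""
    abs_td = td_errors.abs()
    # Huber-like quadratic region below the minimum-priority threshold
    quadratic = (min_priority ** alpha) * 0.5 * td_errors.pow(2)
    # Polynomial region: |delta|^(1 + alpha) mimics sampling with priority |delta|^alpha
    polynomial = min_priority * abs_td.pow(1.0 + alpha) / (1.0 + alpha)
    loss = torch.where(abs_td <= min_priority, quadratic, polynomial)
    # Normalize by the (detached) mean priority so the gradient scale matches PER
    mean_priority = abs_td.clamp(min=min_priority).pow(alpha).mean().detach()
    return loss.mean() / mean_priority
```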
Description
I've added a new loss function, which adapts the Huber loss by incorporating priority, as described in the referenced paper. The buffer itself performs uniform sampling (ReplayBuffer). Additionally, I implemented a PrioritizedReplayBuffer to initialize the parameters alpha and beta (following the PAL and PER papers) and to properly handle the case where the PAL loss is applied within the training method.
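To illustrate how these pieces could fit together, here is a hypothetical sketch of the loss dispatch inside the training step. The helper name `compute_loss`, the `alpha` attribute on the buffer, and the reuse of the `pal_loss` sketch above are assumptions for illustration only, not the PR's actual API.

```python
import torch
import torch.nn.functional as F

from stable_baselines3.common.buffers import PrioritizedReplayBuffer


def compute_loss(current_q: torch.Tensor, target_q: torch.Tensor, replay_buffer) -> torch.Tensor:
    # Hypothetical helper: pick the loss depending on which replay buffer is in use.
    if isinstance(replay_buffer, PrioritizedReplayBuffer):
        # Uniform sampling + PAL loss approximates PER + Huber loss in expectation.
        # `replay_buffer.alpha` and `pal_loss` (from the sketch above) are assumed names.
        return pal_loss(current_q - target_q, alpha=replay_buffer.alpha)
    # Default DQN behaviour in SB3: Huber (smooth L1) loss
    return F.smooth_l1_loss(current_q, target_q)
```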
Motivation and Context
In accordance with @AlexPasqua's PR Prioritized experience replay #1622 (and the corresponding issue Prioritized Experience Replay for DQN #1242) (👋 @araffin)
Types of changes
Checklist
- `make format` (required)
- `make check-codestyle` and `make lint` (required)
- `make pytest` and `make type` both pass (required)
- `make doc` (required)

Note: You can run most of the checks using `make commit-checks`.
Note: we are using a maximum length of 127 characters per line.