A recurrent, multi-process and readable PyTorch implementation of the A2C and PPO deep reinforcement learning algorithms, inspired by 3 repositories.
Features:
- General kinds of observation spaces: tensors and dict of tensors
- General kinds of action spaces: discrete and continuous
- Recurrent policy with a --recurrence argument
- Observation preprocessing
- Reward shaping
- Entropy regularization
- Fast:
  - Multiprocessing for collecting trajectories in multiple environments simultaneously
  - GPU (CUDA) for tensor operations
- Training logs:
  - CSV
  - Tensorboard
- Requires PyTorch 0.4.0
You have to clone the repository and then install the module:
pip3 install -e torch_rl
To get updates of the code, you just need to do a git pull. There is no need to install the module again.
The module consists of:
- 2 classes torch_rl.A2CAlgo and torch_rl.PPOAlgo for, respectively, the A2C and PPO algorithms
- 2 abstract classes torch_rl.ACModel and torch_rl.RecurrentACModel for, respectively, non-recurrent and recurrent actor-critic models
- 1 class torch_rl.DictList for making dictionaries of lists batch-friendly
The following details the points that cannot be understood immediately from the definition files of the classes, or from the arguments of scripts/train.py listed by the scripts/train.py --help command.
torch_rl.A2CAlgo and torch_rl.PPOAlgo have 2 methods:
- __init__ that may take, among the other parameters:
  - an acmodel actor-critic model that is an instance of a class inheriting from one of the two abstract classes torch_rl.ACModel or torch_rl.RecurrentACModel.
  - a preprocess_obss function that transforms a list of observations given by the environment into an object X. This object X must allow retrieving a sublist of preprocessed observations given a list of indexes indexes with X[indexes]. By default, the observations given by the environment are transformed into a PyTorch tensor.
  - a reshape_reward function that takes as parameters, in this order, an observation obs, the action action of the model, the reward reward and the terminal status done, and returns a new reward.
  - a recurrence number to specify over how many timesteps the gradient will be backpropagated. This number is only considered if a recurrent model is used, and it must divide the num_frames_per_agent parameter and, for PPO, the batch_size parameter.
- an update_parameters method that returns some logs.
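To illustrate how these pieces fit together, here is a minimal sketch. The list of environments passed as first parameter, the number of processes, the training loop and the call pattern of update_parameters are assumptions; check the class definitions and scripts/train.py for the exact signatures. MyACModel refers to the example model sketched further below.

import gym
import gym_minigrid  # noqa: assumed installed, registers the MiniGrid environments
import torch
import torch_rl

# Assumed: the algorithm takes a list of environments to collect trajectories
# from in parallel (one per process).
envs = [gym.make("MiniGrid-DoorKey-5x5-v0") for _ in range(16)]

def preprocess_obss(obss):
    # Default-like behaviour: turn a list of observations into a PyTorch tensor
    # that can be indexed with X[indexes].
    return torch.tensor([obs["image"] for obs in obss], dtype=torch.float)

def reshape_reward(obs, action, reward, done):
    # Example of reward shaping: simply rescale the reward.
    return 10 * reward

# MyACModel is the hypothetical non-recurrent model sketched below.
acmodel = MyACModel(envs[0].observation_space, envs[0].action_space)

algo = torch_rl.PPOAlgo(envs, acmodel,
                        preprocess_obss=preprocess_obss,
                        reshape_reward=reshape_reward,
                        recurrence=1)  # recurrence only matters for recurrent models

for update in range(100):
    logs = algo.update_parameters()  # returns some logs (losses, returns, ...)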
torch_rl.ACModel has 2 abstract methods:
- __init__ that takes as parameters the observation_space and the action_space given by the environment.
- forward that takes as parameter N preprocessed observations obs and returns a PyTorch distribution dist and a tensor of values value. The tensor of values must be of size N, not N x 1.
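For instance, a minimal non-recurrent model satisfying this interface could look like the following sketch. The architecture (a small MLP over a flattened MiniGrid image observation) is only an illustrative assumption; model.py contains the actual model provided with the package.

import torch
import torch.nn as nn
from torch.distributions import Categorical
import torch_rl

class MyACModel(nn.Module, torch_rl.ACModel):
    def __init__(self, observation_space, action_space):
        super().__init__()
        # Assumes a MiniGrid-like Dict observation space with an "image" key.
        obs_size = int(torch.tensor(observation_space.spaces["image"].shape).prod())
        self.actor = nn.Sequential(nn.Linear(obs_size, 64), nn.Tanh(),
                                   nn.Linear(64, action_space.n))
        self.critic = nn.Sequential(nn.Linear(obs_size, 64), nn.Tanh(),
                                    nn.Linear(64, 1))

    def forward(self, obs):
        x = obs.reshape(obs.shape[0], -1)         # N preprocessed observations
        dist = Categorical(logits=self.actor(x))  # a PyTorch distribution
        value = self.critic(x).squeeze(1)         # size N, not N x 1
        return dist, value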
torch_rl.RecurrentACModel has 3 abstract methods:
- __init__ that takes the same parameters as torch_rl.ACModel.
- forward that takes the same parameters as torch_rl.ACModel along with a tensor of N memories memory of size N x M, where M is the size of a memory. It returns the same things as torch_rl.ACModel plus a tensor of N memories memory.
- memory_size that returns the size M of a memory.
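A recurrent variant of the sketch above, again only as an illustration of the interface: the memory here stores the hidden and cell states of an LSTM cell side by side, and memory_size is assumed to be exposed as a property, as in the model provided in model.py.

class MyRecurrentACModel(nn.Module, torch_rl.RecurrentACModel):
    def __init__(self, observation_space, action_space):
        super().__init__()
        obs_size = int(torch.tensor(observation_space.spaces["image"].shape).prod())
        self.hidden_size = 64
        self.rnn = nn.LSTMCell(obs_size, self.hidden_size)
        self.actor = nn.Linear(self.hidden_size, action_space.n)
        self.critic = nn.Linear(self.hidden_size, 1)

    @property
    def memory_size(self):
        # M: hidden state and cell state stored side by side.
        return 2 * self.hidden_size

    def forward(self, obs, memory):
        x = obs.reshape(obs.shape[0], -1)
        hidden = (memory[:, :self.hidden_size], memory[:, self.hidden_size:])
        h, c = self.rnn(x, hidden)
        memory = torch.cat([h, c], dim=1)          # new N x M memory tensor
        dist = Categorical(logits=self.actor(h))
        value = self.critic(h).squeeze(1)
        return dist, value, memory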
For speed purposes, the observations are only preprocessed once. Hence, because of the use of batches in PPO, the preprocessed observations X must allow retrieving a sublist of preprocessed observations given a list of indexes indexes with X[indexes]. If your preprocessed observations are a PyTorch tensor, this is already the case. If you want your preprocessed observations to be a dictionary of lists or of tensors, this is also already the case if you use the torch_rl.DictList class as follows:
>>> d = DictList({"a": [[1, 2], [3, 4]], "b": [[5], [6]]})
>>> d.a
[[1, 2], [3, 4]]
>>> d[0]
DictList({"a": [1, 2], "b": [5]})

Note: if you use an RNN, you will need to set batch_first to True.
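For example, a preprocess_obss function returning a DictList for MiniGrid-like observations could be sketched as follows. The fixed-size, hash-based instruction encoding and the "mission" key are toy assumptions; the real implementation is the ObssPreprocessor.__call__ method in utils/format.py.

import torch
from torch_rl import DictList

def preprocess_obss(obss):
    # Stack the image observations into one tensor.
    images = torch.tensor([obs["image"] for obs in obss], dtype=torch.float)
    # Toy instruction encoding: hash each word of the instruction (assumed to be
    # under the "mission" key) into a vocabulary of 100 ids, padded or truncated
    # to 8 tokens per instruction.
    instrs = torch.zeros(len(obss), 8, dtype=torch.long)
    for i, obs in enumerate(obss):
        for j, word in enumerate(obs["mission"].split()[:8]):
            instrs[i, j] = hash(word) % 100
    # Both fields can now be sliced together with X[indexes] inside PPO batches.
    return DictList({"image": images, "instr": instrs})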
An example of use of torch_rl.A2CAlgo and torch_rl.PPOAlgo classes is given in scripts/train.py.
An example of implementation of torch_rl.RecurrentACModel abstract class is given in model.py.
An example of use of torch_rl.DictList and an example of a preprocess_obss function are given in the ObssPreprocessor.__call__ method of utils/format.py.
OMP_NUM_THREADS affects the number of threads used by MKL. The default value may severely damage your performance. This can be avoided by setting it to 1:
export OMP_NUM_THREADS=1
For your own purposes, you will probably need to change:
- the model in model.py,
- the ObssPreprocessor.__call__ method in utils/format.py.
Along with the torch_rl package is provided a model that:
- has a memory. This can be disabled by setting use_memory to False in the constructor.
- understands instructions. This can be disabled by setting use_instr to False in the constructor.
Along with the torch_rl package are provided 3 general reinforcement learning scripts:
- train.py for training an actor-critic model with A2C or PPO.
- enjoy.py for visualizing your trained model acting.
- evaluate.py for evaluating the performance of your trained model over X episodes.
These scripts were designed especially for the MiniGrid environments. These environments give the agent an observation containing an image and a textual instruction, and a reward of 1 if it successfully executes the instruction, 0 otherwise. They are used in what follows for illustration purposes.
These scripts assume that you have already installed the gym package (with pip3 install gym for example). By default, models and logs are stored in the storage folder. You can define a different folder in the environment variable TORCH_RL_STORAGE.
scripts/train.py enables you to load a model, train it with the specified actor-critic algorithm and save it in the storage folder.
2 arguments are required:
- --algo ALGO: name of the actor-critic algorithm.
- --env ENV: name of the environment to train on.
and a bunch of optional arguments are available among which:
- --model MODEL: name of the model, used for loading and saving it. If not specified, it is the _-concatenation of the environment name and the algorithm name.
- --frames-per-proc FRAMES_PER_PROC: number of frames per process before updating parameters.
- --no-instr: disables the instruction understanding of the original model in model.py. If your model is trained on an environment where there is no need to understand instructions, it is advised to disable it for faster training.
- --no-mem: disables the memory of the original model in model.py. If your model is trained on an environment where there is no need to remember anything, it is advised to disable it for faster training.
- ... (see more using --help)
Here is an example of command:
python3 -m scripts.train --algo ppo --env MiniGrid-DoorKey-5x5-v0 --no-instr --no-mem --model DoorKey --save-interval 10
This will print some logs in your terminal, where:
- "U" is for "Update".
- "F" is for the total number of "Frames".
- "FPS" is for "Frames Per Second".
- "D" is for "Duration".
- "rR" is for "reshaped Return" per episode. The 4 following numbers are, in order, the mean x̄, the standard deviation σ, the minimum m and the maximum M of the reshaped return per episode during the update.
- "F" is for the number of "Frames" per episode. The 4 following numbers are again, in order, the mean, the standard deviation, the minimum and the maximum of the number of frames per episode during the update.
- "H" is for "Entropy".
- "V" is for "Value".
- "pL" is for "policy Loss".
- "vL" is for "value Loss".
- "∇" is for the gradient norm.
These logs are also saved in a logging format in log.log and in a CSV format in log.csv in the storage folder.
If you add --tb to the command, logs are also plotted in Tensorboard using the tensorboardX package that you can install with pip3 install tensorboardX. Then, you just have to execute:
tensorboard --logdir storage
and you will get the training curves plotted in Tensorboard.
scripts/enjoy.py enables you to visualize your trained model acting.
2 arguments are required:
- --env ENV: name of the environment to act on.
- --model MODEL: name of the trained model.
and several optional arguments are available (see more using --help).
Here is an example of command:
python3 -m scripts.enjoy --env MiniGrid-DoorKey-5x5-v0 --model DoorKey
In the MiniGrid-DoorKey-6x6-v0 environment, the agent has to reach the green goal. In particular, it has to learn how to open a locked door.
In the MiniGrid-GoToDoor-5x5-v0 environment, the agent has to open a door specified by its color. In particular, it has to understand textual instructions.
In the MiniGrid-RedBlueDoors-6x6-v0 environment, the agent has to open the red door and then the blue door. Because the agent initially faces the blue door, it has to remember if the red door is opened.
scripts/evaluate.py enables you to evaluate the performance of your trained model on X episodes.
2 arguments are required:
- --env ENV: name of the environment to act on.
- --model MODEL: name of the trained model.
and several optional arguments are available (see more using --help).
By default, the model is tested on 100 episodes with a random seed set to 2, instead of 1 as during training.
Here is an example of command:
python3 -m scripts.evaluate --env MiniGrid-DoorKey-5x5-v0 --model DoorKey
This will print the evaluation in your terminal, where "R" is for "Return" per episode.