diff --git a/README.md b/README.md
index 0a18ba9..7527bf6 100644
--- a/README.md
+++ b/README.md
@@ -108,45 +108,75 @@ OpenAI environments.
## RoadMap
-- [ ] **Working on** **1.0.0** Base version is completed with working model visualizations proving performance / expected failure. At
-this point, all models should have guaranteed environments they should succeed in.
-- [ ] 1.1.0 **Working on** More Traditional RL models
- - [ ] **Working on** Add PPO
- - [ ] **Working on** Add TRPO
+- [ ] 1.1.0 More Traditional RL models
+  - [X] Add Cross Entropy Method (CEM)
+  - [X] Add N-step experience replay
+  - [X] Add Gaussian and Factored Gaussian noise exploration replacement
+  - [X] Add Distributional DQN
+  - [X] Add RAINBOW DQN (note warnings; will require refactor / re-testing)
+ - [ ] **Working on** Add REINFORCE
+ - [ ] **Working on** Add PPO
+ - [ ] **Working on** Add TRPO
- [ ] Add D4PG
- [ ] Add A2C
- [ ] Add A3C
-- [ ] 1.2.0 HRL models *Possibly might change version to 2.0 depending on SMDP issues*
+ - [ ] Add SAC
+- [ ] 2.0.0 Mass refactor / performance update
+  - [ ] Environments need to be faster. Beat the OpenAI baseline of 350 frames per second
+    - Comparing against the performance of https://github.com/PacktPublishing/Deep-Reinforcement-Learning-Hands-On
+  - [ ] fastrl needs to handle RAM better
+ - [ ] Use Pong as "expensive computation" benchmark for all compatible models (discrete).
+ - [ ] 2 Runs image space
+ - [ ] Use Cartpole as "cheap computation" benchmark for all compatible models (discrete).
+ - [ ] 5 Runs state space
+ - [ ] 2 Runs image space
+  - [ ] Use Mountain car as "far distance goal" benchmark for all compatible models (discrete).
+ - [ ] 5 Runs state space
+ - [ ] 2 Runs image space
+ - [ ] Use Ant as "expensive computation" benchmark for all compatible models (continuous).
+ - [ ] 2 Runs image space
+ - [ ] Use Pendulum as "cheap computation" benchmark for all compatible models (continuous).
+ - [ ] 5 Runs state space
+ - [ ] 2 Runs image space
+  - [ ] Use Mountain car continuous as a "cheap computation" / "far distance goal" benchmark for all compatible models (continuous).
+ - [ ] 5 Runs state space
+ - [ ] 2 Runs image space
+  - [ ] Use `yield` instead of `return` in the MDPDataset object
+  - [ ] Unify common code shared across all models
+ - [ ] Transition entire project to [nbdev](https://github.com/fastai/nbdev)
+ - Make documentation easier / more expansive. Current method is tedious.
+- [ ] 2.1.0 HRL models *Version may change depending on SMDP issues*
- [ ] Add SMDP
- [ ] Add Goal oriented MDPs. Will Require a new "Step"
- [ ] Add FeUdal Network
- [ ] Add storage based DataBunch memory management. This can prevent RAM from being used up by episode image frames
that may or may not serve any use to the agent, but only for logging.
-- [ ] 1.3.0
+- [ ] 2.2.0
- [ ] Add HAC
- [ ] Add MAXQ
- [ ] Add HIRO
-- [ ] 1.4.0
+- [ ] 2.3.0
- [ ] Add h-DQN
- [ ] Add Modulated Policy Hierarchies
- [ ] Add Meta Learning Shared Hierarchies
-- [ ] 1.5.0
+- [ ] 2.4.0
- [ ] Add STRategic Attentive Writer (STRAW)
- [ ] Add H-DRLN
- [ ] Add Abstract Markov Decision Process (AMDP)
- [ ] Add conda integration so that installation can be truly one step.
-- [ ] 1.6.0 HRL Options models *Possibly will already be implemented in a previous model*
+- [ ] 2.5.0 HRL Options models *May already be implemented in a previous model*
- [ ] Options augmentation to DQN based models
- [ ] Options augmentation to actor critic models
- [ ] Options augmentation to async actor critic models
-- [ ] 1.8.0 HRL Skills
+- [ ] 2.6.0 HRL Skills
- [ ] Skills augmentation to DQN based models
- [ ] Skills augmentation to actor critic models
- [ ] Skills augmentation to async actor critic models
-- [ ] 1.9.0
-- [ ] 2.0.0 Add PyBullet Fetch Environments
- - [ ] 2.0.0 Not part of this repo, however the envs need to subclass the OpenAI `gym.GoalEnv`
- - [ ] 2.0.0 Add HER
+- [ ] 2.7.0 Add PyBullet Fetch Environments
+ - [ ] Envs need to subclass OpenAI `gym.GoalEnv`
+ - [ ] Add HER
+- [ ] 3.0.0 Breaking refactor of all methods
+ - [ ] Move to fastai 2.0
## Contribution
diff --git a/ROADMAP.md b/ROADMAP.md
index 5747d9d..a983a9a 100644
--- a/ROADMAP.md
+++ b/ROADMAP.md
@@ -3,39 +3,72 @@
- [X] 0.9.0 Notebook demonstrations of basic model usage.
- [X] **1.0.0** Base version is completed with working model visualizations proving performance / expected failure. At
this point, all models should have guaranteed environments they should succeed in.
-- [ ] **Working on** 1.1.0 More Traditional RL models
+- [ ] 1.1.0 More Traditional RL models
+  - [X] Add Cross Entropy Method (CEM)
+  - [X] Add N-step experience replay
+  - [X] Add Gaussian and Factored Gaussian noise exploration replacement
+  - [X] Add Distributional DQN
+  - [X] Add RAINBOW DQN (note warnings; will require refactor / re-testing)
+ - [X] Add REINFORCE
- [ ] **Working on** Add PPO
- [ ] **Working on** Add TRPO
- [ ] Add D4PG
- [ ] Add A2C
- [ ] Add A3C
-- [ ] 1.2.0 HRL models *Possibly might change version to 2.0 depending on SMDP issues*
+ - [ ] Add SAC
+- [ ] 2.0.0 Mass refactor / performance update
+  - [ ] Environments need to be faster. Beat the OpenAI baseline of 350 frames per second
+    - Comparing against the performance of https://github.com/PacktPublishing/Deep-Reinforcement-Learning-Hands-On
+  - [ ] fastrl needs to handle RAM better
+ - [ ] Use Pong as "expensive computation" benchmark for all compatible models (discrete).
+ - [ ] 2 Runs image space
+ - [ ] Use Cartpole as "cheap computation" benchmark for all compatible models (discrete).
+ - [ ] 5 Runs state space
+ - [ ] 2 Runs image space
+  - [ ] Use Mountain car as "far distance goal" benchmark for all compatible models (discrete).
+ - [ ] 5 Runs state space
+ - [ ] 2 Runs image space
+ - [ ] Use Ant as "expensive computation" benchmark for all compatible models (continuous).
+ - [ ] 2 Runs image space
+ - [ ] Use Pendulum as "cheap computation" benchmark for all compatible models (continuous).
+ - [ ] 5 Runs state space
+ - [ ] 2 Runs image space
+  - [ ] Use Mountain car continuous as a "cheap computation" / "far distance goal" benchmark for all compatible models (continuous).
+ - [ ] 5 Runs state space
+ - [ ] 2 Runs image space
+  - [ ] Use `yield` instead of `return` in the MDPDataset object
+  - [ ] Unify common code shared across all models
+ - [ ] Transition entire project to [nbdev](https://github.com/fastai/nbdev)
+ - Make documentation easier / more expansive. Current method is tedious.
+- [ ] 2.1.0 HRL models *Version may change depending on SMDP issues*
- [ ] Add SMDP
- [ ] Add Goal oriented MDPs. Will Require a new "Step"
- [ ] Add FeUdal Network
- [ ] Add storage based DataBunch memory management. This can prevent RAM from being used up by episode image frames
that may or may not serve any use to the agent, but only for logging.
-- [ ] 1.3.0
+- [ ] 2.2.0
- [ ] Add HAC
- [ ] Add MAXQ
- [ ] Add HIRO
-- [ ] 1.4.0
+- [ ] 2.3.0
- [ ] Add h-DQN
- [ ] Add Modulated Policy Hierarchies
- [ ] Add Meta Learning Shared Hierarchies
-- [ ] 1.5.0
+- [ ] 2.4.0
- [ ] Add STRategic Attentive Writer (STRAW)
- [ ] Add H-DRLN
- [ ] Add Abstract Markov Decision Process (AMDP)
-- [ ] 1.6.0 HRL Options models *Possibly will already be implemented in a previous model*
+ - [ ] Add conda integration so that installation can be truly one step.
+- [ ] 2.5.0 HRL Options models *May already be implemented in a previous model*
- [ ] Options augmentation to DQN based models
- [ ] Options augmentation to actor critic models
- [ ] Options augmentation to async actor critic models
-- [ ] 1.8.0 HRL Skills
+- [ ] 2.6.0 HRL Skills
- [ ] Skills augmentation to DQN based models
- [ ] Skills augmentation to actor critic models
- [ ] Skills augmentation to async actor critic models
-- [ ] 1.9.0
-- [ ] 2.0.0 Add PyBullet Fetch Environments
- - [ ] 2.0.0 Not part of this repo, however the envs need to subclass the OpenAI `gym.GoalEnv`
- - [ ] 2.0.0 Add HER
\ No newline at end of file
+- [ ] 2.7.0 Add PyBullet Fetch Environments
+ - [ ] Envs need to subclass OpenAI `gym.GoalEnv`
+ - [ ] Add HER
+- [ ] 3.0.0 Breaking refactor of all methods
+ - [ ] Move to fastai 2.0
diff --git a/docs_src/rl.agents.cem.ipynb b/docs_src/rl.agents.cem.ipynb
new file mode 100644
index 0000000..ce45d6d
--- /dev/null
+++ b/docs_src/rl.agents.cem.ipynb
@@ -0,0 +1,61 @@
+{
+ "cells": [
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {
+ "pycharm": {
+ "is_executing": false
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Can't import one of these: No module named 'pybullet'\n",
+ "pygame 2.0.0.dev6 (SDL 2.0.10, python 3.6.7)\n",
+ "Hello from the pygame community. https://www.pygame.org/contribute.html\n",
+ "Can't import one of these: No module named 'gym_minigrid'\n"
+ ]
+ }
+ ],
+ "source": [
+ "from fastai.tabular.data import emb_sz_rule\n",
+ "from fast_rl.agents.cem import CEMLearner, CEMTrainer\n",
+ "from fast_rl.agents.cem_models import CEMModel\n",
+ "from fast_rl.core.data_block import MDPDataBunch\n",
+ "import numpy as np\n",
+ "from fast_rl.core.metrics import RewardMetric, RollingRewardMetric"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.6.7"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 1
+}
diff --git a/docs_src/rl.agents.trpo.ipynb b/docs_src/rl.agents.trpo.ipynb
new file mode 100644
index 0000000..c768de4
--- /dev/null
+++ b/docs_src/rl.agents.trpo.ipynb
@@ -0,0 +1,96 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "collapsed": true,
+ "pycharm": {
+ "is_executing": false
+ }
+ },
+ "source": [
+ "\n",
+ "## TRPO\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from fastai.gen_doc.nbdoc import show_doc\n",
+ "from fast_rl.agents.trpo_models import *"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/markdown": [
+        "## `TRPOModule`\n",
+        "\n",
+        "> `TRPOModule`(**`ni`**:`int`, **`na`**:`int`, **`discount`**:`float`, **`fc_layers`**:`List`\\[`int`\\]=***`None`***, **`conv_filters`**:`List`\\[`int`\\]=***`None`***, **`nc`**=***`3`***, **`bn`**=***`False`***, **`q_lr`**=***`0.001`***, **`v_lr`**=***`0.0001`***, **`ks`**:`List`\\[`int`\\]=***`None`***, **`stride`**:`List`\\[`int`\\]=***`None`***) :: [`PrePostInitMeta`](/core.html#PrePostInitMeta) :: [`Module`](/torch_core.html#Module)\n",
+ "\n",
+ "\n",
+ "\n",
+ "Implementation of the TRPO (Trust Region Policy Optimization) algorithm. Policy Gradient based algorithm for reinforcement learning in discrete\n",
+ " and continuous state and action spaces. Details of the algorithm's mathematical background can be found in [1].\n",
+ "\n",
+ "References:\n",
+ " [1] (Schulman et al., 2017) Trust Region Policy Optimization.\n",
+ "\n",
+ "Args:\n",
+ " ni: na: discount: fc_layers: conv_filters:\n",
+ " nc:\n",
+ " bn:\n",
+ " q_lr:\n",
+ " v_lr:\n",
+ " ks:\n",
+ " stride: "
+ ],
+ "text/plain": [
+        "<IPython.core.display.Markdown object>"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "show_doc(TRPOModule)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.6.7"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 1
+}
diff --git a/fast_rl/agents/cem.py b/fast_rl/agents/cem.py
new file mode 100644
index 0000000..280fb58
--- /dev/null
+++ b/fast_rl/agents/cem.py
@@ -0,0 +1,110 @@
+from typing import *
+
+import gym
+import numpy as np
+from fastai.basic_train import LearnerCallback, torch, ifnone, listify, OptimWrapper
+from torch import nn
+
+from fast_rl.core.agent_core import ExplorationStrategy, Experience
+from fast_rl.core.basic_train import AgentLearner
+from fast_rl.core.data_block import MDPStep
+
+
+class EpisodeBuffer(Experience):
+ def __init__(self, memory_size,**kwargs):
+ super().__init__(memory_size,**kwargs)
+ self.episodes:List[Dict[str,List[MDPStep]]]=[{}]
+ self.current_episode_reward=0
+
+ def __len__(self): return len(self.episodes)
+
+ def not_empty_episodes(self): return [e for e in self.episodes if e]
+
+ def update(self, item, **kwargs):
+ if 'episode' not in self.episodes[-1]: self.episodes[-1]['episode']=[]
+ self.episodes[-1]['episode'].append(item)
+ self.current_episode_reward+=item.reward.item()
+ if item.d:
+ self.episodes[-1]['reward']=self.current_episode_reward
+ self.current_episode_reward=0
+ self.episodes.append({})
+
+
+class CEMTrainer(LearnerCallback):
+ def __init__(self, learn):
+ super().__init__(learn)
+ self.cache_loss=None
+
+ def on_batch_begin(self,**kwargs):
+ return {'last_target':self.learn.data.x.items[-1].a.squeeze(0).to(device=self.learn.data.device)}
+
+ def on_backward_begin(self,smooth_loss, **kwargs:Any):
+ self.cache_loss=ifnone(self.cache_loss,smooth_loss)
+        # wait until the buffer holds a full batch of episodes before backpropagating
+        if len(self.learn.memory)<self.learn.data.batch_size: return {'last_loss':self.cache_loss,'skip_bwd':True}
+
+    def on_batch_end(self, **kwargs:Any)->None:
+        if self.learn.model.training: self.learn.memory.update(item=self.learn.data.x.items[-1])
+
+
+class Probabilistic(ExplorationStrategy):
+ def __init__(self):
+ super().__init__()
+ self.sm=nn.Softmax(dim=1)
+
+ def perturb(self, action, action_space: gym.Space):
+ action=self.sm(action)
+ a_prob=action.squeeze(0).data.detach().cpu().numpy()
+ return np.random.choice(len(a_prob),p=a_prob)
+
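+# Usage sketch (illustrative only): `Probabilistic.perturb` treats the raw network output
+# as logits, softmaxes them, and samples an action index from the resulting distribution.
+# The 1 x 2 logit tensor below is hypothetical.
+#
+#   strategy = Probabilistic()
+#   logits = torch.tensor([[2.0, 0.5]])
+#   action = strategy.perturb(logits, None)   # int in {0, 1}, sampled ~82% / ~18%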
+
+class CEMLearner(AgentLearner):
+ def __init__(self, data,model,percentile=70,trainers=None,lr=0.01,exploration_strategy=None,wd=0,**kwargs):
+ self.percentile=percentile
+ trainers=ifnone(trainers,CEMTrainer)
+ super().__init__(data=data, model=model,wd=wd,**kwargs)
+ self.opt=OptimWrapper.create(self.opt_func, lr=lr,layer_groups=[self.model.action_model])
+ self.loss_func=nn.CrossEntropyLoss()
+ self.exploration_strategy=ifnone(exploration_strategy,Probabilistic())
+ self.trainers=listify(trainers)
+ self.memory=EpisodeBuffer(self.data.batch_size)
+ for t in self.trainers: self.callbacks.append(t(self))
+
+ def filter_memory(self):
+ episodes=self.memory.not_empty_episodes()
+ r=list(map(lambda x: x['reward'],episodes))
+ r_boundary=np.percentile(r,self.percentile)
+ r_mean=float(np.mean(r))
+
+ s=[]
+ a=[]
+ for e in episodes:
+            if e['reward']<r_boundary: continue
+            s.extend([step.s for step in e['episode']])
+            a.extend([step.a for step in e['episode']])
+        return s,a,r_boundary,r_mean
+
+    # Builds the conv feature extractor; returns True if any conv blocks were added.
+    # Signature is assumed from the body below.
+    def setup_conv_block(self, ni, nc, cv_l, stride, padding, use_bn) -> bool:
+ if cv_l is None or len(cv_l)==0: return False
+ # gen a list of conv blocks based on the input size and the list of filter sizes
+ conv_blocks=[conv_bn_lrelu(_ni, nf, s, p, bn=use_bn) for _ni, nf, s, p in
+ zip([ni]+cv_l[:-1], cv_l[1:], stride, padding)]
+ fixed_conv_blocks=self.fix_switched_channels(ni, nc, conv_blocks)
+ self.action_model.add_module('conv_block', Sequential(fixed_conv_blocks+[Flatten()]))
+ return True
+
+ def setup_linear_layers(self, ni, emb_szs, layers, ao, use_bn):
+ tabular_model=TabularModel(emb_szs=emb_szs, n_cont=ni if not emb_szs else 0, layers=layers, out_sz=ao,
+ use_bn=use_bn)
+ if not emb_szs: tabular_model.embeds=None
+ if not use_bn: tabular_model.bn_cont=FakeBatchNorm()
+ self.action_model.add_module('lin_block', TabularEmbedWrapper(tabular_model))
+
+ def forward(self, xi: Tensor):
+ training=self.training
+ if xi.shape[0]==1: self.eval()
+ pred=self.action_model(xi)
+ if training: self.train()
+ return pred
diff --git a/fast_rl/agents/ddpg_models.py b/fast_rl/agents/ddpg_models.py
index a568825..aca05f8 100644
--- a/fast_rl/agents/ddpg_models.py
+++ b/fast_rl/agents/ddpg_models.py
@@ -98,13 +98,6 @@ def __init__(self, ni: int, ao: int, layers: Collection[int], discount: float =
References:
[1] Lillicrap, Timothy P., et al. "Continuous control with deep reinforcement learning."
arXiv preprint arXiv:1509.02971 (2015).
-
- Args:
- data: Primary data object to use.
- memory: How big the tree buffer will be for offline training.
- tau: Defines how "soft/hard" we will copy the target networks over to the primary networks.
- discount: Determines the amount of discounting the existing Q reward.
- lr: Rate that the opt will learn parameter gradients.
"""
super().__init__()
self.name = 'DDPG'
diff --git a/fast_rl/agents/dist_dqn.py b/fast_rl/agents/dist_dqn.py
new file mode 100644
index 0000000..f8c3a79
--- /dev/null
+++ b/fast_rl/agents/dist_dqn.py
@@ -0,0 +1,185 @@
+import collections
+from copy import deepcopy
+
+import gym
+import numpy as np
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+import torch.optim as optim
+from fastai.basic_train import LearnerCallback
+from fastai.imports import Any
+
+from fast_rl.agents.dist_dqn_models import TargetNet
+from fast_rl.agents.dqn_models import distr_projection
+from fast_rl.core.agent_core import ExperienceReplay, NStepExperienceReplay
+from fast_rl.core.basic_train import AgentLearner, listify, List
+from fast_rl.core.data_block import MDPDataBunch, MDPStep
+
+Vmax = 10
+Vmin = -10
+N_ATOMS = 51
+DELTA_Z = (Vmax - Vmin) / (N_ATOMS - 1)
+
+ExperienceFirstLast = collections.namedtuple('ExperienceFirstLast', ('state', 'action', 'reward', 'last_state','done'))
+
+
+
+
+def unpack_batch(batch):
+ states, actions, rewards, dones, last_states = [], [], [], [], []
+ for exp in batch:
+ state = np.array(exp.state, copy=False)
+ states.append(state)
+ actions.append(exp.action)
+ rewards.append(exp.reward)
+ dones.append(exp.done)
+ # if exp.last_state is None:
+ # last_states.append(state) # the result will be masked anyway
+ # else:
+ last_states.append(np.array(exp.last_state, copy=False))
+ return np.array(states, copy=False), np.array(actions), np.array(rewards, dtype=np.float32), \
+ np.array(dones, dtype=np.uint8), np.array(last_states, copy=False)
+
+
+def calc_loss(batch, net, tgt_net, gamma, device="cpu", save_prefix=None):
+ states, actions, rewards, dones, next_states = unpack_batch(batch)
+ batch_size = len(batch)
+
+ states_v = torch.tensor(states).to(device)
+ actions_v = torch.tensor(actions).to(device)
+ next_states_v = torch.tensor(next_states).to(device)
+
+ # next state distribution
+ next_distr_v, next_qvals_v = tgt_net.both(next_states_v)
+ next_actions = next_qvals_v.max(1)[1].data.cpu().numpy()
+ next_distr = tgt_net.apply_softmax(next_distr_v).data.cpu().numpy()
+
+ next_best_distr = next_distr[range(batch_size), next_actions]
+ dones = dones.astype(np.bool)
+
+ # project our distribution using Bellman update
+ proj_distr = distr_projection(next_best_distr, rewards, dones, Vmin, Vmax, N_ATOMS, gamma)
+
+ # calculate net output
+ distr_v = net(states_v)
+ state_action_values = distr_v[range(batch_size), actions_v.data]
+ state_log_sm_v = F.log_softmax(state_action_values, dim=1)
+ proj_distr_v = torch.tensor(proj_distr).to(device)
+
+ loss_v = -state_log_sm_v * proj_distr_v
+ return loss_v.sum(dim=1).mean()
+
+
+
+class BaseDistDQNTrainer(LearnerCallback):
+ def __init__(self, learn: 'DistDQNLearner', max_episodes=None):
+ r"""Handles basic DQN end of step model optimization."""
+ super().__init__(learn)
+ self.n_skipped = 0
+ self._persist = max_episodes is not None
+ self.max_episodes = max_episodes
+ self.episode = -1
+ self.iteration = 0
+ # For the callback handler
+ self._order = 0
+ self.previous_item = None
+
+ @property
+ def learn(self)->'DistDQNLearner':
+ return self._learn()
+
+ def on_train_begin(self, n_epochs, **kwargs: Any):
+ self.max_episodes = n_epochs if not self._persist else self.max_episodes
+
+ def on_epoch_begin(self, epoch, **kwargs: Any):
+ pass
+
+ def on_backward_begin(self, **kwargs: Any):return {'skip_bwd': self.learn.warming_up}
+ def on_backward_end(self, **kwargs:Any): return {'skip_step':False}
+ def on_step_end(self, **kwargs: Any):return {'skip_zero': False}
+
+ def on_loss_begin(self, **kwargs: Any):
+ r"""Performs tree updates, exploration updates, and model optimization."""
+ if self.learn.model.training:
+ self.learn.memory.update(item=self.learn.data.x.items[-1])
+ self.iteration+=1
+ self.learn.epsilon_tracker.frame(self.iteration)
+
+ if not self.learn.warming_up:
+            samples: List[MDPStep]=self.learn.memory.sample(self.learn.data.bs)
+ batch=[ExperienceFirstLast(state=deepcopy(s.s[0]),action=deepcopy(s.action.taken_action),
+ reward=deepcopy(s.reward),last_state=deepcopy(s.s_prime[0]),done=deepcopy(s.done)) for s in samples]
+ # model_func=lambda x: self.learn.model.qvals(x)
+ loss=calc_loss(batch,self.learn.model,self.learn.target_net.target_model,gamma=0.99,device=self.learn.data.device,save_prefix=None)
+ return {'last_output':loss}
+ else: return None
+
+ def on_batch_end(self, **kwargs:Any) ->None:
+ if self.iteration % 300 == 0:
+ self.learn.target_net.sync()
+
+
+class ArgmaxActionSelector(object):
+ """
+ Selects actions using argmax
+ """
+ def __call__(self, scores):
+ assert isinstance(scores, np.ndarray)
+ return np.argmax(scores, axis=1)
+
+
+class EpsilonGreedyActionSelector(object):
+ def __init__(self, epsilon=0.05, selector=None):
+ self.epsilon = epsilon
+ self.selector = selector if selector is not None else ArgmaxActionSelector()
+
+ def __call__(self, scores):
+ assert isinstance(scores, np.ndarray)
+ batch_size, n_actions = scores.shape
+ actions = self.selector(scores)
+ mask = np.random.random(size=batch_size) < self.epsilon
+ rand_actions = np.random.choice(n_actions, sum(mask))
+ actions[mask] = rand_actions
+ return actions
+
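+# Usage sketch (illustrative): `scores` is a (batch, n_actions) array of Q-values. The
+# greedy action is taken per row, then with probability `epsilon` a row is replaced by a
+# uniformly random action. The array below is hypothetical.
+#
+#   selector = EpsilonGreedyActionSelector(epsilon=0.5)
+#   selector(np.array([[0.1, 0.9], [0.8, 0.2]]))   # greedy picks [1, 0] before exploration
+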
+class EpsilonTracker:
+ def __init__(self, epsilon_greedy_selector, params):
+ self.epsilon_greedy_selector = epsilon_greedy_selector
+ self.epsilon_start = params['epsilon_start']
+ self.epsilon_final = params['epsilon_final']
+ self.epsilon_frames = params['epsilon_frames']
+ self.frame(0)
+
+ def frame(self, frame):
+ self.epsilon_greedy_selector.epsilon = \
+ max(self.epsilon_final, self.epsilon_start - frame / self.epsilon_frames)
+
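+# Worked example (illustrative): with the defaults passed by DistDQNLearner below
+# (epsilon_start=1.0, epsilon_final=0.02, epsilon_frames=100) the decay is linear:
+# frame 0 -> 1.0, frame 50 -> max(0.02, 1.0 - 50/100) = 0.5, and from frame 98 onward
+# the value stays clamped at the floor of 0.02.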
+
+
+class DistDQNLearner(AgentLearner):
+ def __init__(self, data: MDPDataBunch, model, trainers, loss_func=None,opt=torch.optim.Adam,**learn_kwargs):
+ super().__init__(data=data, model=model, opt=opt,loss_func=loss_func, **learn_kwargs)
+ self._loss_func=loss_func
+ self.memory=NStepExperienceReplay(100000)
+ self.target_net=TargetNet(self.model)
+ self.exploration_method=EpsilonGreedyActionSelector(1.0)
+ self.epsilon_tracker=EpsilonTracker(self.exploration_method, {'epsilon_frames': 100, 'epsilon_start': 1.0,
+ 'epsilon_final': 0.02})
+ self.trainers=listify(trainers)
+ for t in self.trainers: self.callbacks.append(t(self))
+
+ def init(self, init):pass
+ # def init_loss_func(self):pass
+
+ def predict(self, element, **kwargs):
+ model_func=lambda x: self.model.qvals(x)
+ q_v=model_func(element)
+ q=q_v.data.cpu().numpy()
+ actions=self.exploration_method(q)
+ return actions
+
diff --git a/fast_rl/agents/dist_dqn_models.py b/fast_rl/agents/dist_dqn_models.py
new file mode 100644
index 0000000..f191ca5
--- /dev/null
+++ b/fast_rl/agents/dist_dqn_models.py
@@ -0,0 +1,72 @@
+import copy
+
+import torch
+import torch.nn as nn
+from fastai.imports import torch
+
+Vmax = 10
+Vmin = -10
+N_ATOMS = 51
+DELTA_Z = (Vmax - Vmin) / (N_ATOMS - 1)
+
+
+class TargetNet:
+ """
+ Wrapper around model which provides copy of it instead of trained weights
+ """
+ def __init__(self, model):
+ self.model = model
+ self.target_model = copy.deepcopy(model)
+
+ def sync(self):
+ self.target_model.load_state_dict(self.model.state_dict())
+
+ def alpha_sync(self, alpha):
+ """
+ Blend params of target net with params from the model
+ :param alpha:
+ """
+ assert isinstance(alpha, float)
+ assert 0.0 < alpha <= 1.0
+ state = self.model.state_dict()
+ tgt_state = self.target_model.state_dict()
+ for k, v in state.items():
+ tgt_state[k] = tgt_state[k] * alpha + (1 - alpha) * v
+ self.target_model.load_state_dict(tgt_state)
+
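+# Usage sketch (illustrative, works with any nn.Module): `sync` hard-copies the online
+# weights into the target copy, while `alpha_sync` keeps `alpha` of the target weights
+# and blends in (1 - alpha) of the online weights, giving a slow-moving target.
+#
+#   net = nn.Linear(4, 2)        # hypothetical online model
+#   tgt = TargetNet(net)
+#   tgt.sync()                   # target == online
+#   tgt.alpha_sync(alpha=0.99)   # target = 0.99*target + 0.01*online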
+
+class DistributionalDQN(nn.Module):
+ def __init__(self, input_shape, n_actions):
+ super(DistributionalDQN, self).__init__()
+
+ self.fc = nn.Sequential(
+ nn.Linear(input_shape[0], 512),
+ nn.ReLU(),
+ nn.Linear(512, n_actions * N_ATOMS)
+ )
+
+ self.register_buffer("supports", torch.arange(Vmin, Vmax+DELTA_Z, DELTA_Z))
+ self.softmax = nn.Softmax(dim=1)
+
+ self.loss_func=None
+
+ def set_opt(self,_):pass
+
+ def forward(self, x):
+ batch_size = x.size()[0]
+ fc_out = self.fc(x.float())
+ return fc_out.view(batch_size, -1, N_ATOMS)
+
+ def both(self, x):
+ cat_out = self(x)
+ probs = self.apply_softmax(cat_out)
+ weights = probs * self.supports
+ res = weights.sum(dim=2)
+ return cat_out, res
+
+ def qvals(self, x):
+ return self.both(x)[1]
+
+ def apply_softmax(self, t):
+ return self.softmax(t.view(-1, N_ATOMS)).view(t.size())
+
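+
+# Shape sketch (illustrative, assumes a 4-float state and 2 actions): forward() returns
+# per-action atom logits of shape (batch, n_actions, N_ATOMS); qvals() softmaxes over the
+# 51 atoms and takes the expectation against `supports` to get plain Q-values.
+if __name__ == '__main__':
+    _net = DistributionalDQN((4,), 2)
+    _q = _net.qvals(torch.zeros(8, 4))
+    assert _q.shape == (8, 2)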
diff --git a/fast_rl/agents/dqn.py b/fast_rl/agents/dqn.py
index 04a5afa..d60b825 100644
--- a/fast_rl/agents/dqn.py
+++ b/fast_rl/agents/dqn.py
@@ -101,7 +101,8 @@ def create_dqn_model(data: MDPDataBunch, base_arch: DQNModule, layers=None, igno
DoubleDQNModule: [BaseDQNTrainer, FixedTargetDQNTrainer],
DuelingDQNModule: [BaseDQNTrainer, FixedTargetDQNTrainer],
DoubleDuelingModule: [BaseDQNTrainer, FixedTargetDQNTrainer],
- FixedTargetDQNModule: [BaseDQNTrainer, FixedTargetDQNTrainer]
+ FixedTargetDQNModule: [BaseDQNTrainer, FixedTargetDQNTrainer],
+ DistributionalDQN: [BaseDQNTrainer, FixedTargetDQNTrainer]
}
diff --git a/fast_rl/agents/dqn_models.py b/fast_rl/agents/dqn_models.py
index f4d6306..ffceec8 100644
--- a/fast_rl/agents/dqn_models.py
+++ b/fast_rl/agents/dqn_models.py
@@ -1,204 +1,416 @@
from fastai.callback import OptimWrapper
from fast_rl.core.layers import *
+# import copy
+
+
+def distr_projection(next_distr, rewards, dones, Vmin, Vmax, n_atoms, gamma):
+ """
+ Perform distribution projection aka Catergorical Algorithm from the
+ "A Distributional Perspective on RL" paper
+ """
+ batch_size = len(rewards)
+ proj_distr = np.zeros((batch_size, n_atoms), dtype=np.float32)
+ delta_z = (Vmax - Vmin) / (n_atoms - 1)
+ for atom in range(n_atoms):
+ tz_j = np.minimum(Vmax, np.maximum(Vmin, rewards + (Vmin + atom * delta_z) * gamma))
+ b_j = (tz_j - Vmin) / delta_z
+ l = np.floor(b_j).astype(np.int64)
+ u = np.ceil(b_j).astype(np.int64)
+ eq_mask = u == l
+ proj_distr[eq_mask, l[eq_mask]] += next_distr[eq_mask, atom]
+ ne_mask = u != l
+ proj_distr[ne_mask, l[ne_mask]] += next_distr[ne_mask, atom] * (u - b_j)[ne_mask]
+ proj_distr[ne_mask, u[ne_mask]] += next_distr[ne_mask, atom] * (b_j - l)[ne_mask]
+ if dones.any():
+ proj_distr[dones] = 0.0
+ tz_j = np.minimum(Vmax, np.maximum(Vmin, rewards[dones]))
+ b_j = (tz_j - Vmin) / delta_z
+ l = np.floor(b_j).astype(np.int64)
+ u = np.ceil(b_j).astype(np.int64)
+ eq_mask = u == l
+ eq_dones = dones.copy()
+ eq_dones[dones] = eq_mask
+ if eq_dones.any():
+ proj_distr[eq_dones, l[eq_mask]] = 1.0
+ ne_mask = u != l
+ ne_dones = dones.copy()
+ ne_dones[dones] = ne_mask
+ if ne_dones.any():
+ proj_distr[ne_dones, l[ne_mask]] = (u - b_j)[ne_mask]
+ proj_distr[ne_dones, u[ne_mask]] = (b_j - l)[ne_mask]
+ return proj_distr
+
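+# Minimal usage sketch (illustrative, hypothetical arrays): project a uniform next-state
+# distribution over 51 atoms for two transitions, the second of which is terminal.
+# `next_distr` has shape (batch, n_atoms); `rewards` and `dones` are 1-D of length batch.
+#
+#   next_distr = np.full((2, 51), 1 / 51, dtype=np.float32)
+#   rewards = np.array([1.0, 0.5], dtype=np.float32)
+#   dones = np.array([False, True])
+#   proj = distr_projection(next_distr, rewards, dones, Vmin=-10, Vmax=10, n_atoms=51, gamma=0.99)
+#   # proj.shape == (2, 51); the terminal row collapses onto the atoms nearest reward 0.5
+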
class DQNModule(Module):
- def __init__(self, ni: int, ao: int, layers: Collection[int], discount: float = 0.99, lr=0.001,
- n_conv_blocks: Collection[int] = 0, nc=3, opt=None, emb_szs: ListSizes = None, loss_func=None,
- w=-1, h=-1, ks: Union[None, list]=None, stride: Union[None, list]=None, grad_clip=5,
- conv_kern_proportion=0.1, stride_proportion=0.1, pad=False, batch_norm=False):
- r"""
- Basic DQN Module.
-
- Args:
- ni: Number of inputs. Expecting a flat state `[1 x ni]`
- ao: Number of actions to output.
- layers: Number of layers where is determined per element.
- n_conv_blocks: If `n_conv_blocks` is not 0, then convolutional blocks will be added
- to the head on top of existing linear layers.
- nc: Number of channels that will be expected by the convolutional blocks.
- """
- super().__init__()
- self.name = 'DQN'
- self.loss = None
- self.loss_func = loss_func
- self.discount = discount
- self.gradient_clipping_norm = grad_clip
- self.lr = lr
- self.batch_norm = batch_norm
- self.switched = False
- # self.ks, self.stride = ([], []) if len(n_conv_blocks) == 0 else ks_stride(ks, stride, w, h, n_conv_blocks, conv_kern_proportion, stride_proportion)
- self.ks, self.stride=([], []) if len(n_conv_blocks)==0 else (ifnone(ks, [10, 10, 10]), ifnone(stride, [5, 5, 5]))
- self.action_model = nn.Sequential()
- _layers = [conv_bn_lrelu(ch, self.nf, ks=ks, stride=stride, pad=pad, bn=self.batch_norm) for ch, self.nf, ks, stride in zip([nc]+n_conv_blocks[:-1],n_conv_blocks, self.ks, self.stride)]
-
- if _layers: ni = self.setup_conv_block(_layers=_layers, ni=ni, nc=nc, w=w, h=h)
- self.setup_linear_block(_layers=_layers, ni=ni, nc=nc, w=w, h=h, emb_szs=emb_szs, layers=layers, ao=ao)
- self.init_weights(self.action_model)
- self.opt = None
- self.set_opt(opt)
-
- def set_opt(self, opt):
- self.opt=OptimWrapper.create(ifnone(optim.Adam, opt), lr=self.lr, layer_groups=[self.action_model])
-
- def setup_conv_block(self, _layers, ni, nc, w, h):
- self.action_model.add_module('conv_block', nn.Sequential(*(self.fix_switched_channels(ni, nc, _layers) + [Flatten()])))
- training = self.action_model.training
- self.action_model.eval()
- ni = int(self.action_model(torch.zeros((1, w, h, nc) if self.switched else (1, nc, w, h))).view(-1, ).shape[0])
- self.action_model.train(training)
- return ni
-
- def setup_linear_block(self, _layers, ni, nc, w, h, emb_szs, layers, ao):
- tabular_model = TabularModel(emb_szs=emb_szs, n_cont=ni if not emb_szs else 0, layers=layers, out_sz=ao, use_bn=self.batch_norm)
- if not emb_szs: tabular_model.embeds = None
- if not self.batch_norm: tabular_model.bn_cont = FakeBatchNorm()
- self.action_model.add_module('lin_block', TabularEmbedWrapper(tabular_model))
-
- def fix_switched_channels(self, current_channels, expected_channels, layers: list):
- if current_channels == expected_channels:
- return layers
- else:
- self.switched = True
- return [ChannelTranspose()] + layers
-
- def forward(self, xi: Tensor):
- training = self.training
- if xi.shape[0] == 1: self.eval()
- pred = self.action_model(xi)
- if training: self.train()
- return pred
-
- def init_weights(self, m):
- if type(m) == nn.Linear:
- torch.nn.init.xavier_uniform_(m.weight)
- m.bias.data.fill_(0.01)
-
- def sample_mask(self, d):
- return torch.sub(1.0, d)
-
- def optimize(self, sampled):
- r"""Uses ER to optimize the Q-net (without fixed targets).
-
- Uses the equation:
-
- .. math::
- Q^{*}(s, a) = \mathbb{E}_{s'∼ \Big\epsilon} \Big[r + \lambda \displaystyle\max_{a'}(Q^{*}(s' , a'))
- \;|\; s, a \Big]
-
-
- Returns (dict): Optimization information
-
- """
- with torch.no_grad():
- r = torch.cat([item.reward.float() for item in sampled])
- s_prime = torch.cat([item.s_prime for item in sampled])
- s = torch.cat([item.s for item in sampled])
- a = torch.cat([item.a.long() for item in sampled])
- d = torch.cat([item.done.float() for item in sampled])
- masking = self.sample_mask(d)
-
- y_hat = self.y_hat(s, a)
- y = self.y(s_prime, masking, r, y_hat)
-
- loss = self.loss_func(y, y_hat)
-
- if self.training:
- self.opt.zero_grad()
- loss.backward()
- torch.nn.utils.clip_grad_norm_(self.action_model.parameters(), self.gradient_clipping_norm)
- for param in self.action_model.parameters():
- if param.grad is not None: param.grad.data.clamp_(-1, 1)
- self.opt.step()
-
- with torch.no_grad():
- self.loss = loss
- post_info = {'td_error': to_detach(y - y_hat).cpu().numpy()}
- return post_info
-
- def y_hat(self, s, a):
- return self.action_model(s).gather(1, a)
-
- def y(self, s_prime, masking, r, y_hat):
- return self.discount * self.action_model(s_prime).max(1)[0].unsqueeze(1) * masking + r.expand_as(y_hat)
+ def __init__(self, ni: int, ao: int, layers: Collection[int], discount: float = 0.99, lr=0.001,
+ n_conv_blocks: Collection[int] = 0, nc=3, opt=None, emb_szs: ListSizes = None, loss_func=None,
+ w=-1, h=-1, ks: Union[None, list]=None, stride: Union[None, list]=None, grad_clip=5,
+ conv_kern_proportion=0.1, stride_proportion=0.1, pad=False, batch_norm=False,lin_cls=nn.Linear,
+ do_grad_clipping=True):
+ r"""
+ Basic DQN Module.
+
+ Args:
+ ni: Number of inputs. Expecting a flat state `[1 x ni]`
+ ao: Number of actions to output.
+ layers: Number of layers where is determined per element.
+ n_conv_blocks: If `n_conv_blocks` is not 0, then convolutional blocks will be added
+ to the head on top of existing linear layers.
+ nc: Number of channels that will be expected by the convolutional blocks.
+ """
+ super().__init__()
+ self.lin_cls=lin_cls
+ self.name = 'DQN'
+ self.loss = None
+ self.loss_func = loss_func
+ self.discount = discount
+ self.gradient_clipping_norm = grad_clip
+ self.lr = lr
+ self.batch_norm = batch_norm
+ self.switched = False
+ self.do_grad_clipping=do_grad_clipping
+ # self.ks, self.stride = ([], []) if len(n_conv_blocks) == 0 else ks_stride(ks, stride, w, h, n_conv_blocks, conv_kern_proportion, stride_proportion)
+ self.ks, self.stride=([], []) if len(n_conv_blocks)==0 else (ifnone(ks, [10, 10, 10]), ifnone(stride, [5, 5, 5]))
+ self.action_model = nn.Sequential()
+ _layers = [conv_bn_lrelu(ch, self.nf, ks=ks, stride=stride, pad=pad, bn=self.batch_norm) for ch, self.nf, ks, stride in zip([nc]+n_conv_blocks[:-1],n_conv_blocks, self.ks, self.stride)]
+
+ if _layers: ni = self.setup_conv_block(_layers=_layers, ni=ni, nc=nc, w=w, h=h)
+ self.setup_linear_block(_layers=_layers, ni=ni, nc=nc, w=w, h=h, emb_szs=emb_szs, layers=layers, ao=ao)
+ self.init_weights(self.action_model)
+ self.opt = None
+ self.set_opt(opt)
+
+ def set_opt(self, opt):
+ self.opt=OptimWrapper.create(ifnone(optim.Adam, opt), lr=self.lr, layer_groups=[self.action_model])
+
+ def setup_conv_block(self, _layers, ni, nc, w, h):
+ self.action_model.add_module('conv_block', nn.Sequential(*(self.fix_switched_channels(ni, nc, _layers) + [Flatten()])))
+ training = self.action_model.training
+ self.action_model.eval()
+ ni = int(self.action_model(torch.zeros((1, w, h, nc) if self.switched else (1, nc, w, h))).view(-1, ).shape[0])
+ self.action_model.train(training)
+ return ni
+
+ def setup_linear_block(self, _layers, ni, nc, w, h, emb_szs, layers, ao):
+ tabular_model = TabularModel(emb_szs=emb_szs, n_cont=ni if not emb_szs else 0, layers=layers, out_sz=ao, use_bn=self.batch_norm,lin_cls=self.lin_cls)
+ if not emb_szs: tabular_model.embeds = None
+ if not self.batch_norm: tabular_model.bn_cont = FakeBatchNorm()
+ self.action_model.add_module('lin_block', TabularEmbedWrapper(tabular_model))
+
+ def fix_switched_channels(self, current_channels, expected_channels, layers: list):
+ if current_channels == expected_channels:
+ return layers
+ else:
+ self.switched = True
+ return [ChannelTranspose()] + layers
+
+ def forward(self, xi: Tensor):
+ training = self.training
+ if xi.shape[0] == 1: self.eval()
+ pred = self.action_model(xi)
+ if training: self.train()
+ return pred
+
+ def init_weights(self, m):
+ if issubclass(m.__class__,nn.Linear):
+ torch.nn.init.xavier_uniform_(m.weight)
+ m.bias.data.fill_(0.01)
+
+ def sample_mask(self, d):
+ return torch.sub(1.0, d)
+
+ def optimize(self, sampled):
+ r"""Uses ER to optimize the Q-net (without fixed targets).
+
+ Uses the equation:
+
+ .. math::
+ Q^{*}(s, a) = \mathbb{E}_{s'∼ \Big\epsilon} \Big[r + \lambda \displaystyle\max_{a'}(Q^{*}(s' , a'))
+ \;|\; s, a \Big]
+
+
+ Returns (dict): Optimization information
+
+ """
+ with torch.no_grad():
+ r = torch.cat([item.reward.float() for item in sampled])
+ s_prime = torch.cat([item.s_prime for item in sampled])
+ s = torch.cat([item.s for item in sampled])
+ a = torch.cat([item.a.long() for item in sampled])
+ d = torch.cat([item.done.float() for item in sampled])
+ masking = self.sample_mask(d)
+
+ y_hat = self.y_hat(s, a,s_prime,r,masking)
+ y = self.y(s_prime, masking, r, y_hat,s,a)
+ self.opt.zero_grad()
+ loss = self.loss_func(y, y_hat)
+
+ if self.training:
+ loss.backward()
+ if self.do_grad_clipping:
+ torch.nn.utils.clip_grad_norm_(self.action_model.parameters(), self.gradient_clipping_norm)
+ for param in self.action_model.parameters():
+ if param.grad is not None: param.grad.data.clamp_(-1, 1)
+ self.opt.step()
+
+ with torch.no_grad():
+ self.loss = loss
+ post_info = {'td_error': to_detach(y - y_hat).cpu().numpy()}
+ return post_info
+
+ def y_hat(self, s, a,s_prime,r,masking):
+ return self.action_model(s).gather(1, a)
+
+ def y(self, s_prime, masking, r, y_hat,s,a):
+ return self.discount * self.action_model(s_prime).max(1)[0].unsqueeze(1) * masking + r.expand_as(y_hat)
class FixedTargetDQNModule(DQNModule):
- def __init__(self, ni: int, ao: int, layers: Collection[int], tau=1, **kwargs):
- super().__init__(ni, ao, layers, **kwargs)
- self.name = 'Fixed Target DQN'
- self.tau = tau
- self.target_model = copy(self.action_model)
+ def __init__(self, ni: int, ao: int, layers: Collection[int], tau=1, **kwargs):
+ super().__init__(ni, ao, layers, **kwargs)
+ self.name = 'Fixed Target DQN'
+ self.tau = tau
+ self.target_model = deepcopy(self.action_model)
- def target_copy_over(self):
- r""" Updates the target network from calls in the FixedTargetDQNTrainer callback."""
- # self.target_net.load_state_dict(self.action_model.state_dict())
- for target_param, local_param in zip(self.target_model.parameters(), self.action_model.parameters()):
- target_param.data.copy_(self.tau * local_param.data + (1.0 - self.tau) * target_param.data)
+ def target_copy_over(self):
+ r""" Updates the target network from calls in the FixedTargetDQNTrainer callback."""
+ # self.target_net.load_state_dict(self.action_model.state_dict())
+ # for target_param, local_param in zip(self.target_model.parameters(), self.action_model.parameters()):
+ # target_param.data.copy_(self.tau * local_param.data + (1.0 - self.tau) * target_param.data)
+ self.target_model.load_state_dict(self.action_model.state_dict())
- def y(self, s_prime, masking, r, y_hat):
- r"""
- Uses the equation:
+ def y(self, s_prime, masking, r, y_hat,s,a):
+ r"""
+ Uses the equation:
- .. math::
+ .. math::
- Q^{*}(s, a) = \mathbb{E}_{s'∼ \Big\epsilon} \Big[r + \lambda \displaystyle\max_{a'}(Q^{*}(s' , a'))
- \;|\; s, a \Big]
+ Q^{*}(s, a) = \mathbb{E}_{s'∼ \Big\epsilon} \Big[r + \lambda \displaystyle\max_{a'}(Q^{*}(s' , a'))
+ \;|\; s, a \Big]
- """
- return self.discount * self.target_model(s_prime).max(1)[0].unsqueeze(1) * masking + r.expand_as(y_hat)
+ """
+ return self.discount * self.target_model(s_prime).max(1)[0].unsqueeze(1) * masking + r.expand_as(y_hat)
class DoubleDQNModule(FixedTargetDQNModule):
- def __init__(self, ni: int, ao: int, layers: Collection[int], **kwargs):
- super().__init__(ni, ao, layers, **kwargs)
- self.name = 'DDQN'
+ def __init__(self, ni: int, ao: int, layers: Collection[int], **kwargs):
+ super().__init__(ni, ao, layers, **kwargs)
+ self.name = 'DDQN'
- def calc_y(self, s_prime, masking, r, y_hat):
- return self.discount * self.target_model(s_prime).gather(1, self.action_model(s_prime).argmax(1).unsqueeze(
- 1)) * masking + r.expand_as(y_hat)
+ def y(self, s_prime, masking, r, y_hat,s,a):
+ return self.discount * self.target_model(s_prime).gather(1, self.action_model(s_prime).argmax(1).unsqueeze(
+ 1)) * masking + r.expand_as(y_hat)
class DuelingBlock(nn.Module):
- def __init__(self, ao, stream_input_size):
- super().__init__()
+ def __init__(self, ao, stream_input_size,lin_cls=nn.Linear):
+ super().__init__()
- self.val = nn.Linear(stream_input_size, 1)
- self.adv = nn.Linear(stream_input_size, ao)
+ self.val = lin_cls(stream_input_size, 1)
+ self.adv = lin_cls(stream_input_size, ao)
- def forward(self, xi):
- r"""Splits the base neural net output into 2 streams to evaluate the advantage and v of the s space and
- corresponding actions.
+ def forward(self, xi):
+ r"""Splits the base neural net output into 2 streams to evaluate the advantage and v of the s space and
+ corresponding actions.
- .. math::
- Q(s,a;\; \Theta, \\alpha, \\beta) = V(s;\; \Theta, \\beta) + A(s, a;\; \Theta, \\alpha) - \\frac{1}{|A|}
- \\Big\\sum_{a'} A(s, a';\; \Theta, \\alpha)
+ .. math::
+ Q(s,a;\; \Theta, \\alpha, \\beta) = V(s;\; \Theta, \\beta) + A(s, a;\; \Theta, \\alpha) - \\frac{1}{|A|}
+ \\Big\\sum_{a'} A(s, a';\; \Theta, \\alpha)
- """
- val, adv = self.val(xi), self.adv(xi)
- xi = val.expand_as(adv) + (adv - adv.mean()).squeeze(0)
- return xi
+ """
+ val, adv = self.val(xi), self.adv(xi)
+ xi = val.expand_as(adv) + (adv - adv.mean()).squeeze(0)
+ return xi
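+
+# Worked example (illustrative): with V(s) = 1.0 and advantages A = [2.0, 0.0], mean(A) = 1.0,
+# so Q = V + (A - mean(A)) = [2.0, 0.0]. Subtracting the mean advantage keeps the two streams
+# identifiable: only the relative preference between actions is carried by A.
+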
class DuelingDQNModule(FixedTargetDQNModule):
- def __init__(self, **kwargs):
- super().__init__(**kwargs)
- self.name = 'Dueling DQN'
+ def __init__(self, **kwargs):
+ super().__init__(**kwargs)
+ self.name = 'Dueling DQN'
- def setup_linear_block(self, _layers, ni, nc, w, h, emb_szs, layers, ao):
- tabular_model = TabularModel(emb_szs=emb_szs, n_cont=ni if not emb_szs else 0, layers=layers, out_sz=ao,
- use_bn=self.batch_norm)
- if not emb_szs: tabular_model.embeds = None
- if not self.batch_norm: tabular_model.bn_cont = FakeBatchNorm()
- tabular_model.layers, removed_layer = split_model(tabular_model.layers, [last_layer(tabular_model)])
- ni = removed_layer[0].in_features
- self.action_model.add_module('lin_block', TabularEmbedWrapper(tabular_model))
- self.action_model.add_module('dueling_block', DuelingBlock(ao, ni))
+ def setup_linear_block(self, _layers, ni, nc, w, h, emb_szs, layers, ao):
+ tabular_model = TabularModel(emb_szs=emb_szs, n_cont=ni if not emb_szs else 0, layers=layers, out_sz=ao,
+ use_bn=self.batch_norm,lin_cls=self.lin_cls)
+ if not emb_szs: tabular_model.embeds = None
+ if not self.batch_norm: tabular_model.bn_cont = FakeBatchNorm()
+ tabular_model.layers, removed_layer = split_model(tabular_model.layers, [last_layer(tabular_model)])
+ ni = removed_layer[0].in_features
+ self.action_model.add_module('lin_block', TabularEmbedWrapper(tabular_model))
+ self.action_model.add_module('dueling_block', DuelingBlock(ao, ni))
class DoubleDuelingModule(DuelingDQNModule, DoubleDQNModule):
- def __init__(self, **kwargs):
- super().__init__(**kwargs)
- self.name = 'DDDQN'
+ def __init__(self, **kwargs):
+ super().__init__(**kwargs)
+ self.name = 'DDDQN'
+
+
+def distributional_loss_fn(s_log_sm_v,proj_distr_v):
+ loss= (-s_log_sm_v*proj_distr_v)
+ return loss.sum(dim=1).mean()
+
+
+# class DistributionalDQN(FixedTargetDQNModule):
+# def __init__(self,ao,n_atoms=51,v_min=-10,v_max=10,**kwargs):
+# self.z_delta=(v_max-v_min)/(n_atoms-1)
+# self.n_atoms=n_atoms
+# self.v_min=v_min
+# self.v_max=v_max
+# super().__init__(ao=ao*n_atoms,**kwargs)
+# self.name='Distributional DQN'
+# # self.sm=nn.Softmax(dim=1)
+#
+# self.loss_func=distributional_loss_fn
+#
+# def init_weights(self, m):pass
+#
+# def setup_linear_block(self, **kwargs):
+# super(DistributionalDQN,self).setup_linear_block(**kwargs)
+# self.action_model.register_buffer('supports', torch.arange(self.v_min, self.v_max+self.z_delta, self.z_delta))
+# self.action_model.add_module('softmax_buff',nn.Softmax(dim=1))
+#
+# def both(self,xi,use_target=False):
+# if not use_target: cat_out=self(xi,False)
+# else: cat_out=self.target_model(xi).view(xi.size()[0],-1,self.n_atoms)
+# probs=self.apply_softmax(cat_out,use_target)
+# if not use_target: weights=probs*self.action_model.supports
+# else: weights=probs*self.target_model.supports
+# res=weights.sum(dim=2)
+# return cat_out,res
+#
+# def q_vals(self,xi):
+# return self.both(xi)[1]
+#
+# def apply_softmax(self,t,use_target=False):
+# if not use_target: return self.action_model.softmax_buff(t.view(-1,self.n_atoms)).view(t.size())
+# return self.target_model.softmax_buff(t.view(-1,self.n_atoms)).view(t.size())
+#
+# def y(self, s_prime, masking, r, y_hat,s,a):
+# distr_v=self(s,only_q=False)
+# state_action_values=distr_v[range(s.size()[0]), a.data]
+# state_log_sm_v=F.log_softmax(state_action_values, dim=1)
+# return state_log_sm_v
+#
+# def y_hat(self, s, a,s_prime,r,masking):
+# next_distr_v, next_q_vals_v=self.both(s_prime,True) # target
+# next_actions=next_q_vals_v.max(1)[1].data.cpu().numpy()
+# next_distr=self.apply_softmax(next_distr_v,True).data.cpu() # target
+# next_best_distr=next_distr[range(s_prime.size()[0]),next_actions]
+# proj_distr=distr_projection(next_best_distr,r,masking,self.v_min,self.v_max,self.n_atoms,self.discount)
+# proj_distr_v=torch.tensor(proj_distr).to(device=self.action_model.supports.device)
+# return proj_distr_v
+#
+# def forward(self, xi: Tensor,only_q=True):
+# return self.q_vals(xi) if only_q else super(DistributionalDQN,self).forward(xi).view(xi.size()[0],-1,self.n_atoms)
+
+class DistributionalDQNModule(nn.Module):
+ def __init__(self, input_shape, n_actions,n_atoms=51,v_min=-10,v_max=10,):
+ super(DistributionalDQNModule, self).__init__()
+ self.n_atoms=n_atoms
+ self.v_min=v_min
+ self.v_max=v_max
+ self.z_delta=(v_max-v_min)/(n_atoms-1)
+
+ self.fc = nn.Sequential(
+ nn.Linear(input_shape, 512),
+ nn.ReLU(),
+ nn.Linear(512, n_actions * self.n_atoms)
+ )
+
+ self.register_buffer("supports", torch.arange(self.v_min, self.v_max+self.z_delta, self.z_delta))
+ self.softmax = nn.Softmax(dim=1)
+
+ def forward(self, x):
+ batch_size = x.size()[0]
+ fc_out = self.fc(x.float())
+ return fc_out.view(batch_size, -1, self.n_atoms)
+
+ def both(self, x):
+ cat_out = self(x)
+ probs = self.apply_softmax(cat_out)
+ weights = probs * self.supports
+ res = weights.sum(dim=2)
+ return cat_out, res
+
+ def qvals(self, x): return self.both(x)[1]
+ def apply_softmax(self, t): return self.softmax(t.view(-1, self.n_atoms)).view(t.size())
+
+class DistributionalDQN(FixedTargetDQNModule):
+ def __init__(self,ao,n_atoms=51,v_min=-10,v_max=10,**kwargs):
+ self.z_delta=(v_max-v_min)/(n_atoms-1)
+ self.n_atoms=n_atoms
+ self.v_min=v_min
+ self.v_max=v_max
+ super().__init__(ao=ao,**kwargs)
+ self.do_grad_clipping=False
+ self.name='Distributional DQN'
+ # self.sm=nn.Softmax(dim=1)
+
+ self.loss_func=distributional_loss_fn
+
+ def init_weights(self, m):pass
+
+ def setup_linear_block(self, _layers, ni, nc, w, h, emb_szs, layers, ao,**kwargs):
+ self.action_model=DistributionalDQNModule(ni,ao)
+
+ def optimize(self, sampled):
+ with torch.no_grad():
+ r = torch.cat([item.reward.float() for item in sampled]).flatten().cpu().numpy()
+ s_prime = torch.cat([item.s_prime for item in sampled])
+ s = torch.cat([item.s for item in sampled])
+ a = torch.cat([item.a.long() for item in sampled])
+ d = torch.cat([item.done.float() for item in sampled]).flatten().cpu().numpy()
+ # masking = self.sample_mask(d)
+
+ batch_size=len(r)
+
+ # next state distribution
+ next_distr_v, next_qvals_v=self.target_model.both(s_prime)
+ next_actions=next_qvals_v.max(1)[1].data.cpu().numpy()
+ next_distr=self.target_model.apply_softmax(next_distr_v).data.cpu().numpy()
+
+ next_best_distr=next_distr[range(batch_size), next_actions]
+ dones=d.astype(np.bool)
+
+ # project our distribution using Bellman update
+ proj_distr=distr_projection(next_best_distr, r, dones, self.v_min, self.v_max, self.n_atoms, self.discount)
+
+ # calculate net output
+ distr_v=self.action_model(s)
+ state_action_values=distr_v[range(batch_size), a.data]
+ state_log_sm_v=F.log_softmax(state_action_values, dim=1)
+ proj_distr_v=torch.tensor(proj_distr).to(self.action_model.supports.device)
+
+ loss=-state_log_sm_v*proj_distr_v
+ loss= loss.sum(dim=1).mean()
+
+ with torch.no_grad():
+ self.loss = loss
+ _,y=self.action_model.both(s.to(device=self.action_model.supports.device))
+ post_info = {'td_error': to_detach(y - next_qvals_v).cpu().numpy()}
+ return post_info
+
+ def q_vals(self,xi):
+ return self.action_model.both(xi)[1]
+
+ # def y(self, s_prime, masking, r, y_hat,s,a):
+ # distr_v=self(s,only_q=False)
+ # state_action_values=distr_v[range(s.size()[0]), a.data]
+ # state_log_sm_v=F.log_softmax(state_action_values, dim=1)
+ # return state_log_sm_v
+ #
+ # def y_hat(self, s, a,s_prime,r,masking):
+ # next_distr_v, next_q_vals_v=self.both(s_prime,True) # target
+ # next_actions=next_q_vals_v.max(1)[1].data.cpu().numpy()
+ # next_distr=self.apply_softmax(next_distr_v,True).data.cpu() # target
+ # next_best_distr=next_distr[range(s_prime.size()[0]),next_actions]
+ # proj_distr=distr_projection(next_best_distr,r,masking,self.v_min,self.v_max,self.n_atoms,self.discount)
+ # proj_distr_v=torch.tensor(proj_distr).to(device=self.action_model.supports.device)
+ # return proj_distr_v
+
+ def forward(self, xi: Tensor,only_q=True):
+ bs=xi.size()[0]
+ return self.q_vals(xi) if only_q else super(DistributionalDQN,self).forward(xi).view(bs,-1,self.n_atoms)
\ No newline at end of file
diff --git a/fast_rl/agents/native_dist_dqn.py b/fast_rl/agents/native_dist_dqn.py
new file mode 100644
index 0000000..13cd77c
--- /dev/null
+++ b/fast_rl/agents/native_dist_dqn.py
@@ -0,0 +1,1898 @@
+#!/usr/bin/env python3
+"""basic wrappers, useful for reinforcement learning on gym envs"""
+# Mostly copy-pasted from https://github.com/openai/baselines/blob/master/baselines/common/atari_wrappers.py
+import argparse
+import collections
+import math
+import operator
+import random
+import sys
+import time
+from collections import namedtuple, deque
+from datetime import timedelta
+
+import cv2
+import gym
+import numpy as np
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+import torch.optim as optim
+from gym import spaces
+from torch.autograd import Variable
+
+
+# one single experience step
+Experience = namedtuple('Experience', ['state', 'action', 'reward', 'done'])
+
+
+class SMAQueue:
+ """
+ Queue of fixed size with mean, max, min operations
+ """
+ def __init__(self, size):
+ self.queue = collections.deque()
+ self.size = size
+
+ def __iadd__(self, other):
+ if isinstance(other, (list, tuple)):
+ self.queue.extend(other)
+ else:
+ self.queue.append(other)
+ while len(self.queue) > self.size:
+ self.queue.popleft()
+ return self
+
+ def __len__(self):
+ return len(self.queue)
+
+ def __repr__(self):
+ return "SMAQueue(size=%d)" % self.size
+
+ def __str__(self):
+ return "SMAQueue(size=%d, len=%d)" % (self.size, len(self.queue))
+
+ def min(self):
+ if not self.queue:
+ return None
+ return np.min(self.queue)
+
+ def mean(self):
+ if not self.queue:
+ return None
+ return np.mean(self.queue)
+
+ def max(self):
+ if not self.queue:
+ return None
+ return np.max(self.queue)
+
+
+class SpeedMonitor:
+ def __init__(self, batch_size, autostart=True):
+ self.batch_size = batch_size
+ self.start_ts = None
+        self.batches = None
+        self.epoches = None
+ if autostart:
+ self.reset()
+
+ def epoch(self):
+ if self.epoches is not None:
+ self.epoches += 1
+
+ def batch(self):
+ if self.batches is not None:
+ self.batches += 1
+
+ def reset(self):
+ self.start_ts = time.time()
+ self.batches = 0
+ self.epoches = 0
+
+ def seconds(self):
+ """
+ Seconds since last reset
+ :return:
+ """
+ return time.time() - self.start_ts
+
+ def samples_per_sec(self):
+ """
+ Calculate samples per second since last reset() call
+ :return: float count samples per second or None if not started
+ """
+ if self.start_ts is None:
+ return None
+ secs = self.seconds()
+ if abs(secs) < 1e-5:
+ return 0.0
+ return (self.batches + 1) * self.batch_size / secs
+
+ def epoch_time(self):
+ """
+ Calculate average epoch time
+ :return: timedelta object
+ """
+ if self.start_ts is None:
+ return None
+ s = self.seconds()
+ if self.epoches > 0:
+ s /= self.epoches + 1
+ return timedelta(seconds=s)
+
+ def batch_time(self):
+ """
+ Calculate average batch time
+ :return: timedelta object
+ """
+ if self.start_ts is None:
+ return None
+ s = self.seconds()
+ if self.batches > 0:
+ s /= self.batches + 1
+ return timedelta(seconds=s)
+
+
+class WeightedMSELoss(nn.Module):
+ def __init__(self, size_average=True):
+ super(WeightedMSELoss, self).__init__()
+ self.size_average = size_average
+
+ def forward(self, input, target, weights=None):
+ if weights is None:
+ return nn.MSELoss(self.size_average)(input, target)
+
+ loss_rows = (input - target) ** 2
+ if len(loss_rows.size()) != 1:
+ loss_rows = torch.sum(loss_rows, dim=1)
+ res = (weights * loss_rows).sum()
+ if self.size_average:
+ res /= len(weights)
+ return res
+
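+# Usage sketch (illustrative, hypothetical tensors): with per-sample importance-sampling
+# weights the loss is the weighted sum of squared errors, divided by the number of
+# samples when size_average=True.
+#
+#   loss_fn = WeightedMSELoss()
+#   pred, target = torch.tensor([1.0, 2.0]), torch.tensor([0.0, 2.0])
+#   w = torch.tensor([0.5, 1.0])
+#   loss_fn(pred, target, w)   # (0.5*1.0 + 1.0*0.0) / 2 = 0.25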
+
+class SegmentTree(object):
+ def __init__(self, capacity, operation, neutral_element):
+ """Build a Segment Tree data structure.
+
+ https://en.wikipedia.org/wiki/Segment_tree
+
+ Can be used as regular array, but with two
+ important differences:
+
+ a) setting item's value is slightly slower.
+ It is O(lg capacity) instead of O(1).
+ b) user has access to an efficient `reduce`
+ operation which reduces `operation` over
+ a contiguous subsequence of items in the
+ array.
+
+        Parameters
+        ----------
+        capacity: int
+            Total size of the array - must be a power of two.
+        operation: lambda obj, obj -> obj
+            an operation for combining elements (e.g. sum, max); must form
+            a mathematical group together with the set of possible values
+            for array elements.
+ neutral_element: obj
+ neutral element for the operation above. eg. float('-inf')
+ for max and 0 for sum.
+ """
+ assert capacity > 0 and capacity & (capacity - 1) == 0, "capacity must be positive and a power of 2."
+ self._capacity = capacity
+ self._value = [neutral_element for _ in range(2 * capacity)]
+ self._operation = operation
+
+ def _reduce_helper(self, start, end, node, node_start, node_end):
+ if start == node_start and end == node_end:
+ return self._value[node]
+ mid = (node_start + node_end) // 2
+ if end <= mid:
+ return self._reduce_helper(start, end, 2 * node, node_start, mid)
+ else:
+ if mid + 1 <= start:
+ return self._reduce_helper(start, end, 2 * node + 1, mid + 1, node_end)
+ else:
+ return self._operation(
+ self._reduce_helper(start, mid, 2 * node, node_start, mid),
+ self._reduce_helper(mid + 1, end, 2 * node + 1, mid + 1, node_end)
+ )
+
+ def reduce(self, start=0, end=None):
+ """Returns result of applying `self.operation`
+ to a contiguous subsequence of the array.
+
+ self.operation(arr[start], operation(arr[start+1], operation(... arr[end])))
+
+ Parameters
+ ----------
+ start: int
+ beginning of the subsequence
+ end: int
+ end of the subsequences
+
+ Returns
+ -------
+ reduced: obj
+ result of reducing self.operation over the specified range of array elements.
+ """
+ if end is None:
+ end = self._capacity
+ if end < 0:
+ end += self._capacity
+ end -= 1
+ return self._reduce_helper(start, end, 1, 0, self._capacity - 1)
+
+ def __setitem__(self, idx, val):
+ # index of the leaf
+ idx += self._capacity
+ self._value[idx] = val
+ idx //= 2
+ while idx >= 1:
+ self._value[idx] = self._operation(
+ self._value[2 * idx],
+ self._value[2 * idx + 1]
+ )
+ idx //= 2
+
+ def __getitem__(self, idx):
+ assert 0 <= idx < self._capacity
+ return self._value[self._capacity + idx]
+
+
+class SumSegmentTree(SegmentTree):
+ def __init__(self, capacity):
+ super(SumSegmentTree, self).__init__(
+ capacity=capacity,
+ operation=operator.add,
+ neutral_element=0.0
+ )
+
+ def sum(self, start=0, end=None):
+ """Returns arr[start] + ... + arr[end]"""
+ return super(SumSegmentTree, self).reduce(start, end)
+
+ def find_prefixsum_idx(self, prefixsum):
+ """Find the highest index `i` in the array such that
+            sum(arr[0] + arr[1] + ... + arr[i - 1]) <= prefixsum
+
+ if array values are probabilities, this function
+ allows to sample indexes according to the discrete
+ probability efficiently.
+
+ Parameters
+ ----------
+        prefixsum: float
+ upperbound on the sum of array prefix
+
+ Returns
+ -------
+ idx: int
+ highest index satisfying the prefixsum constraint
+ """
+ assert 0 <= prefixsum <= self.sum() + 1e-5
+ idx = 1
+ while idx < self._capacity: # while non-leaf
+ if self._value[2 * idx] > prefixsum:
+ idx = 2 * idx
+ else:
+ prefixsum -= self._value[2 * idx]
+ idx = 2 * idx + 1
+ return idx - self._capacity
+
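+# Usage sketch (illustrative): treating leaf values as unnormalized priorities,
+# `find_prefixsum_idx` inverts the cumulative sum in O(log capacity), which is how
+# prioritized replay samples indices. Capacity must be a power of two.
+#
+#   tree = SumSegmentTree(4)
+#   tree[0], tree[1], tree[2], tree[3] = 1.0, 2.0, 3.0, 4.0
+#   tree.sum()                   # 10.0
+#   tree.find_prefixsum_idx(2.5) # 1  (cumulative sums: 1, 3, 6, 10)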
+
+class MinSegmentTree(SegmentTree):
+ def __init__(self, capacity):
+ super(MinSegmentTree, self).__init__(
+ capacity=capacity,
+ operation=min,
+ neutral_element=float('inf')
+ )
+
+ def min(self, start=0, end=None):
+ """Returns min(arr[start], ..., arr[end])"""
+
+ return super(MinSegmentTree, self).reduce(start, end)
+
+
+class TBMeanTracker:
+ """
+ TensorBoard value tracker: allows to batch fixed amount of historical values and write their mean into TB
+
+ Designed and tested with pytorch-tensorboard in mind
+ """
+ def __init__(self, writer, batch_size):
+ """
+ :param writer: writer with close() and add_scalar() methods
+ :param batch_size: integer size of batch to track
+ """
+ assert isinstance(batch_size, int)
+ assert writer is not None
+ self.writer = writer
+ self.batch_size = batch_size
+
+ def __enter__(self):
+ self._batches = collections.defaultdict(list)
+ return self
+
+ def __exit__(self, exc_type, exc_val, exc_tb):
+ self.writer.close()
+
+ @staticmethod
+ def _as_float(value):
+ assert isinstance(value, (float, int, np.ndarray, np.generic, torch.autograd.Variable)) or torch.is_tensor(value)
+ tensor_val = None
+ if isinstance(value, torch.autograd.Variable):
+ tensor_val = value.data
+ elif torch.is_tensor(value):
+ tensor_val = value
+
+ if tensor_val is not None:
+ return tensor_val.float().mean()
+ elif isinstance(value, np.ndarray):
+ return float(np.mean(value))
+ else:
+ return float(value)
+
+ def track(self, param_name, value, iter_index):
+ assert isinstance(param_name, str)
+ assert isinstance(iter_index, int)
+
+ data = self._batches[param_name]
+ data.append(self._as_float(value))
+
+ if len(data) >= self.batch_size:
+ self.writer.add_scalar(param_name, np.mean(data), iter_index)
+ data.clear()
+
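+# Minimal usage sketch for TBMeanTracker (assumes a tensorboardX SummaryWriter and
+# a hypothetical `loss_value` computed inside the training loop): values are
+# buffered per name and only their mean over `batch_size` calls is written, which
+# keeps the TensorBoard event file small; the writer is closed on exit.
+#
+#   with TBMeanTracker(writer, batch_size=10) as tracker:
+#       for step in range(1000):
+#           tracker.track("loss", loss_value, step)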
+
+class RewardTracker:
+ def __init__(self, writer):
+ self.writer = writer
+
+ def __enter__(self):
+ self.ts = time.time()
+ self.ts_frame = 0
+ self.total_rewards = []
+ return self
+
+ def __exit__(self, *args):
+ self.writer.close()
+
+ def reward(self, reward, frame, epsilon=None):
+ self.total_rewards.append(reward)
+ speed = (frame - self.ts_frame) / (time.time() - self.ts)
+ self.ts_frame = frame
+ self.ts = time.time()
+ mean_reward = np.mean(self.total_rewards[-100:])
+ epsilon_str = "" if epsilon is None else ", eps %.2f" % epsilon
+ print("%d: done %d episodes, mean reward %.3f, speed %.2f f/s%s" % (
+ frame, len(self.total_rewards), mean_reward, speed, epsilon_str
+ ))
+ sys.stdout.flush()
+ if epsilon is not None:
+ self.writer.add_scalar("epsilon", epsilon, frame)
+ self.writer.add_scalar("speed", speed, frame)
+ self.writer.add_scalar("reward_100", mean_reward, frame)
+ self.writer.add_scalar("reward", reward, frame)
+ return mean_reward if len(self.total_rewards) > 30 else None
+
+
+
+class ActionSelector:
+ """
+    Abstract class which converts scores to actions
+ """
+ def __call__(self, scores):
+ raise NotImplementedError
+
+
+class ArgmaxActionSelector(ActionSelector):
+ """
+ Selects actions using argmax
+ """
+ def __call__(self, scores):
+ assert isinstance(scores, np.ndarray)
+ return np.argmax(scores, axis=1)
+
+
+class EpsilonGreedyActionSelector(ActionSelector):
+ def __init__(self, epsilon=0.05, selector=None):
+ self.epsilon = epsilon
+ self.selector = selector if selector is not None else ArgmaxActionSelector()
+
+ def __call__(self, scores):
+ assert isinstance(scores, np.ndarray)
+ batch_size, n_actions = scores.shape
+ actions = self.selector(scores)
+ mask = np.random.random(size=batch_size) < self.epsilon
+ rand_actions = np.random.choice(n_actions, sum(mask))
+ actions[mask] = rand_actions
+ return actions
+
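+# Minimal usage sketch for EpsilonGreedyActionSelector: scores are a
+# (batch, n_actions) array of Q-values; with probability epsilon each row's
+# greedy action is replaced by a uniformly random one.
+#
+#   selector = EpsilonGreedyActionSelector(epsilon=0.1)
+#   q_values = np.array([[0.1, 0.9], [0.5, 0.2]])
+#   actions = selector(q_values)   # mostly [1, 0], occasionally randomized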
+
+class ProbabilityActionSelector(ActionSelector):
+ """
+    Converts action probabilities into actions by sampling them
+ """
+ def __call__(self, probs):
+ assert isinstance(probs, np.ndarray)
+ actions = []
+ for prob in probs:
+ actions.append(np.random.choice(len(prob), p=prob))
+ return np.array(actions)
+
+
+
+class ExperienceSource:
+ """
+ Simple n-step experience source using single or multiple environments
+
+    Every experience contains a list of n Experience entries
+ """
+ def __init__(self, env, agent, steps_count=2, steps_delta=1, vectorized=False):
+ """
+ Create simple experience source
+ :param env: environment or list of environments to be used
+ :param agent: callable to convert batch of states into actions to take
+ :param steps_count: count of steps to track for every experience chain
+ :param steps_delta: how many steps to do between experience items
+ :param vectorized: support of vectorized envs from OpenAI universe
+ """
+ assert isinstance(env, (gym.Env, list, tuple))
+ assert isinstance(agent, BaseAgent)
+ assert isinstance(steps_count, int)
+ assert steps_count >= 1
+ assert isinstance(vectorized, bool)
+ if isinstance(env, (list, tuple)):
+ self.pool = env
+ else:
+ self.pool = [env]
+ self.agent = agent
+ self.steps_count = steps_count
+ self.steps_delta = steps_delta
+ self.total_rewards = []
+ self.total_steps = []
+ self.vectorized = vectorized
+
+ def __iter__(self):
+ states, agent_states, histories, cur_rewards, cur_steps = [], [], [], [], []
+ env_lens = []
+ for env in self.pool:
+ obs = env.reset()
+            # if the environment is vectorized, all its outputs are lists of results.
+ # Details are here: https://github.com/openai/universe/blob/master/doc/env_semantics.rst
+ if self.vectorized:
+ obs_len = len(obs)
+ states.extend(obs)
+ else:
+ obs_len = 1
+ states.append(obs)
+ env_lens.append(obs_len)
+
+ for _ in range(obs_len):
+ histories.append(deque(maxlen=self.steps_count))
+ cur_rewards.append(0.0)
+ cur_steps.append(0)
+ agent_states.append(self.agent.initial_state())
+
+ iter_idx = 0
+ while True:
+ actions = [None] * len(states)
+ states_input = []
+ states_indices = []
+ for idx, state in enumerate(states):
+ if state is None:
+ actions[idx] = self.pool[0].action_space.sample() # assume that all envs are from the same family
+ else:
+ states_input.append(state)
+ states_indices.append(idx)
+ if states_input:
+ states_actions, new_agent_states = self.agent(states_input, agent_states)
+ for idx, action in enumerate(states_actions):
+ g_idx = states_indices[idx]
+ actions[g_idx] = action
+ agent_states[g_idx] = new_agent_states[idx]
+ grouped_actions = _group_list(actions, env_lens)
+
+ global_ofs = 0
+ for env_idx, (env, action_n) in enumerate(zip(self.pool, grouped_actions)):
+ if self.vectorized:
+ next_state_n, r_n, is_done_n, _ = env.step(action_n)
+ else:
+ next_state, r, is_done, _ = env.step(action_n[0])
+ next_state_n, r_n, is_done_n = [next_state], [r], [is_done]
+
+ for ofs, (action, next_state, r, is_done) in enumerate(zip(action_n, next_state_n, r_n, is_done_n)):
+ idx = global_ofs + ofs
+ state = states[idx]
+ history = histories[idx]
+
+ cur_rewards[idx] += r
+ cur_steps[idx] += 1
+ if state is not None:
+ history.append(Experience(state=state, action=action, reward=r, done=is_done))
+ if len(history) == self.steps_count and iter_idx % self.steps_delta == 0:
+ yield tuple(history)
+ states[idx] = next_state
+ if is_done:
+ # generate tail of history
+ while len(history) >= 1:
+ yield tuple(history)
+ history.popleft()
+ self.total_rewards.append(cur_rewards[idx])
+ self.total_steps.append(cur_steps[idx])
+ cur_rewards[idx] = 0.0
+ cur_steps[idx] = 0
+ # vectorized envs are reset automatically
+ states[idx] = env.reset() if not self.vectorized else None
+ agent_states[idx] = self.agent.initial_state()
+ history.clear()
+ global_ofs += len(action_n)
+ iter_idx += 1
+
+ def pop_total_rewards(self):
+ r = self.total_rewards
+ if r:
+ self.total_rewards = []
+ self.total_steps = []
+ return r
+
+ def pop_rewards_steps(self):
+ res = list(zip(self.total_rewards, self.total_steps))
+ if res:
+ self.total_rewards, self.total_steps = [], []
+ return res
+
+
+def _group_list(items, lens):
+ """
+ Unflat the list of items by lens
+ :param items: list of items
+ :param lens: list of integers
+ :return: list of list of items grouped by lengths
+ """
+ res = []
+ cur_ofs = 0
+ for g_len in lens:
+ res.append(items[cur_ofs:cur_ofs+g_len])
+ cur_ofs += g_len
+ return res
+
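+# Worked example: _group_list(['a', 'b', 'c', 'd'], [1, 3]) returns [['a'], ['b', 'c', 'd']].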
+
+# those entries are emitted from ExperienceSourceFirstLast. Reward is discounted over the trajectory piece
+ExperienceFirstLast = collections.namedtuple('ExperienceFirstLast', ('state', 'action', 'reward', 'last_state'))
+
+
+class ExperienceSourceFirstLast(ExperienceSource):
+ """
+    This is a wrapper around ExperienceSource to prevent storing the full trajectory in the replay buffer when we
+    need only the first and last states. For every trajectory piece it calculates the discounted reward and emits
+    only the first and last states and the action taken in the first state.
+
+    If we have a partial trajectory at the end of an episode, last_state will be None
+ """
+ def __init__(self, env, agent, gamma, steps_count=1, steps_delta=1, vectorized=False):
+ assert isinstance(gamma, float)
+ super(ExperienceSourceFirstLast, self).__init__(env, agent, steps_count+1, steps_delta, vectorized=vectorized)
+ self.gamma = gamma
+ self.steps = steps_count
+
+ def __iter__(self):
+ for exp in super(ExperienceSourceFirstLast, self).__iter__():
+ if exp[-1].done and len(exp) <= self.steps:
+ last_state = None
+ elems = exp
+ else:
+ last_state = exp[-1].state
+ elems = exp[:-1]
+ total_reward = 0.0
+ for e in reversed(elems):
+ total_reward *= self.gamma
+ total_reward += e.reward
+ yield ExperienceFirstLast(state=exp[0].state, action=exp[0].action,
+ reward=total_reward, last_state=last_state)
+
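+# Minimal usage sketch for ExperienceSourceFirstLast (assumes a Gym env, a
+# hypothetical Q-network `net`, and the DQNAgent / ArgmaxActionSelector classes
+# defined later in this module): each yielded item holds the first state and
+# action, the reward discounted over `steps_count` steps, and the state reached
+# after them (None if the episode ended inside the chain).
+#
+#   agent = DQNAgent(net, ArgmaxActionSelector())
+#   exp_source = ExperienceSourceFirstLast(env, agent, gamma=0.99, steps_count=2)
+#   first = next(iter(exp_source))
+#   print(first.state, first.action, first.reward, first.last_state)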
+
+def discount_with_dones(rewards, dones, gamma):
+ discounted = []
+ r = 0
+ for reward, done in zip(rewards[::-1], dones[::-1]):
+ r = reward + gamma*r*(1.-done)
+ discounted.append(r)
+ return discounted[::-1]
+
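+# Worked example for discount_with_dones with gamma=0.9: rewards [1, 1, 1] and
+# dones [0, 0, 1] give [1 + 0.9 * 1.9, 1 + 0.9 * 1.0, 1.0] = [2.71, 1.9, 1.0];
+# the done flag stops the return from leaking across episode boundaries.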
+
+class ExperienceSourceRollouts:
+ """
+    N-step rollout experience source following the A3C rollouts scheme. Has to be used with an agent
+    that keeps the value in its state (for example, ActorCriticAgent).
+
+ Yields batches of num_envs * n_steps samples with the following arrays:
+ 1. observations
+ 2. actions
+ 3. discounted rewards, with values approximation
+ 4. values
+ """
+ def __init__(self, env, agent, gamma, steps_count=5):
+ """
+ Constructs the rollout experience source
+ :param env: environment or list of environments to be used
+        :param agent: callable to convert batch of states into actions
+        :param gamma: discount factor for the rollout rewards
+        :param steps_count: how many steps to perform rollouts
+ """
+ assert isinstance(env, (gym.Env, list, tuple))
+ assert isinstance(agent, BaseAgent)
+ assert isinstance(gamma, float)
+ assert isinstance(steps_count, int)
+ assert steps_count >= 1
+
+ if isinstance(env, (list, tuple)):
+ self.pool = env
+ else:
+ self.pool = [env]
+ self.agent = agent
+ self.gamma = gamma
+ self.steps_count = steps_count
+ self.total_rewards = []
+ self.total_steps = []
+
+ def __iter__(self):
+ pool_size = len(self.pool)
+ states = [np.array(e.reset()) for e in self.pool]
+ mb_states = np.zeros((pool_size, self.steps_count) + states[0].shape, dtype=states[0].dtype)
+ mb_rewards = np.zeros((pool_size, self.steps_count), dtype=np.float32)
+ mb_values = np.zeros((pool_size, self.steps_count), dtype=np.float32)
+ mb_actions = np.zeros((pool_size, self.steps_count), dtype=np.int64)
+ mb_dones = np.zeros((pool_size, self.steps_count), dtype=np.bool)
+ total_rewards = [0.0] * pool_size
+ total_steps = [0] * pool_size
+ agent_states = None
+ step_idx = 0
+
+ while True:
+ actions, agent_states = self.agent(states, agent_states)
+ rewards = []
+ dones = []
+ new_states = []
+ for env_idx, (e, action) in enumerate(zip(self.pool, actions)):
+ o, r, done, _ = e.step(action)
+ total_rewards[env_idx] += r
+ total_steps[env_idx] += 1
+ if done:
+ o = e.reset()
+ self.total_rewards.append(total_rewards[env_idx])
+ self.total_steps.append(total_steps[env_idx])
+ total_rewards[env_idx] = 0.0
+ total_steps[env_idx] = 0
+ new_states.append(np.array(o))
+ dones.append(done)
+ rewards.append(r)
+ # we need an extra step to get values approximation for rollouts
+ if step_idx == self.steps_count:
+ # calculate rollout rewards
+ for env_idx, (env_rewards, env_dones, last_value) in enumerate(zip(mb_rewards, mb_dones, agent_states)):
+ env_rewards = env_rewards.tolist()
+ env_dones = env_dones.tolist()
+ if not env_dones[-1]:
+ env_rewards = discount_with_dones(env_rewards + [last_value], env_dones + [False], self.gamma)[:-1]
+ else:
+ env_rewards = discount_with_dones(env_rewards, env_dones, self.gamma)
+ mb_rewards[env_idx] = env_rewards
+ yield mb_states.reshape((-1,) + mb_states.shape[2:]), mb_rewards.flatten(), mb_actions.flatten(), mb_values.flatten()
+ step_idx = 0
+ mb_states[:, step_idx] = states
+ mb_rewards[:, step_idx] = rewards
+ mb_values[:, step_idx] = agent_states
+ mb_actions[:, step_idx] = actions
+ mb_dones[:, step_idx] = dones
+ step_idx += 1
+ states = new_states
+
+ def pop_total_rewards(self):
+ r = self.total_rewards
+ if r:
+ self.total_rewards = []
+ self.total_steps = []
+ return r
+
+ def pop_rewards_steps(self):
+ res = list(zip(self.total_rewards, self.total_steps))
+ if res:
+ self.total_rewards, self.total_steps = [], []
+ return res
+
+
+class ExperienceSourceBuffer:
+ """
+ The same as ExperienceSource, but takes episodes from the buffer
+ """
+ def __init__(self, buffer, steps_count=1):
+ """
+ Create buffered experience source
+ :param buffer: list of episodes, each is a list of Experience object
+ :param steps_count: count of steps in every entry
+ """
+ self.update_buffer(buffer)
+ self.steps_count = steps_count
+
+ def update_buffer(self, buffer):
+ self.buffer = buffer
+ self.lens = list(map(len, buffer))
+
+ def __iter__(self):
+ """
+        Infinitely sample an episode from the buffer and then sample an item offset
+ """
+ while True:
+ episode = random.randrange(len(self.buffer))
+ ofs = random.randrange(self.lens[episode] - self.steps_count - 1)
+ yield self.buffer[episode][ofs:ofs+self.steps_count]
+
+
+class ExperienceReplayBuffer:
+ def __init__(self, experience_source, buffer_size):
+ assert isinstance(experience_source, (ExperienceSource, type(None)))
+ assert isinstance(buffer_size, int)
+ self.experience_source_iter = None if experience_source is None else iter(experience_source)
+ self.buffer = []
+ self.capacity = buffer_size
+ self.pos = 0
+
+ def __len__(self):
+ return len(self.buffer)
+
+ def __iter__(self):
+ return iter(self.buffer)
+
+ def sample(self, batch_size):
+ """
+ Get one random batch from experience replay
+ TODO: implement sampling order policy
+ :param batch_size:
+ :return:
+ """
+ if len(self.buffer) <= batch_size:
+ return self.buffer
+ # Warning: replace=False makes random.choice O(n)
+ keys = np.random.choice(len(self.buffer), batch_size, replace=True)
+ return [self.buffer[key] for key in keys]
+
+ def _add(self, sample):
+ if len(self.buffer) < self.capacity:
+ self.buffer.append(sample)
+ else:
+ self.buffer[self.pos] = sample
+ self.pos = (self.pos + 1) % self.capacity
+
+ def populate(self, samples):
+ """
+ Populates samples into the buffer
+ :param samples: how many samples to populate
+ """
+ for _ in range(samples):
+ entry = next(self.experience_source_iter)
+ self._add(entry)
+
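+# Minimal usage sketch of the training loop around ExperienceReplayBuffer
+# (batch_size is a hyperparameter, e.g. 32):
+#
+#   buffer = ExperienceReplayBuffer(exp_source, buffer_size=100000)
+#   buffer.populate(1)                # pull one transition from the experience source
+#   if len(buffer) >= batch_size:
+#       batch = buffer.sample(batch_size)
+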
+class PrioReplayBufferNaive:
+ def __init__(self, exp_source, buf_size, prob_alpha=0.6):
+ self.exp_source_iter = iter(exp_source)
+ self.prob_alpha = prob_alpha
+ self.capacity = buf_size
+ self.pos = 0
+ self.buffer = []
+ self.priorities = np.zeros((buf_size, ), dtype=np.float32)
+
+ def __len__(self):
+ return len(self.buffer)
+
+ def populate(self, count):
+ max_prio = self.priorities.max() if self.buffer else 1.0
+ for _ in range(count):
+ sample = next(self.exp_source_iter)
+ if len(self.buffer) < self.capacity:
+ self.buffer.append(sample)
+ else:
+ self.buffer[self.pos] = sample
+ self.priorities[self.pos] = max_prio
+ self.pos = (self.pos + 1) % self.capacity
+
+ def sample(self, batch_size, beta=0.4):
+ if len(self.buffer) == self.capacity:
+ prios = self.priorities
+ else:
+ prios = self.priorities[:self.pos]
+ probs = np.array(prios, dtype=np.float32) ** self.prob_alpha
+
+ probs /= probs.sum()
+ indices = np.random.choice(len(self.buffer), batch_size, p=probs, replace=True)
+ samples = [self.buffer[idx] for idx in indices]
+ total = len(self.buffer)
+ weights = (total * probs[indices]) ** (-beta)
+ weights /= weights.max()
+ return samples, indices, np.array(weights, dtype=np.float32)
+
+ def update_priorities(self, batch_indices, batch_priorities):
+ for idx, prio in zip(batch_indices, batch_priorities):
+ self.priorities[idx] = prio
+
+
+class PrioritizedReplayBuffer(ExperienceReplayBuffer):
+ def __init__(self, experience_source, buffer_size, alpha):
+ super(PrioritizedReplayBuffer, self).__init__(experience_source, buffer_size)
+ assert alpha > 0
+ self._alpha = alpha
+
+ it_capacity = 1
+ while it_capacity < buffer_size:
+ it_capacity *= 2
+
+ self._it_sum = SumSegmentTree(it_capacity)
+ self._it_min = MinSegmentTree(it_capacity)
+ self._max_priority = 1.0
+
+ def _add(self, *args, **kwargs):
+ idx = self.pos
+ super()._add(*args, **kwargs)
+ self._it_sum[idx] = self._max_priority ** self._alpha
+ self._it_min[idx] = self._max_priority ** self._alpha
+
+ def _sample_proportional(self, batch_size):
+ res = []
+ for _ in range(batch_size):
+ mass = random.random() * self._it_sum.sum(0, len(self) - 1)
+ idx = self._it_sum.find_prefixsum_idx(mass)
+ res.append(idx)
+ return res
+
+ def sample(self, batch_size, beta):
+ assert beta > 0
+
+ idxes = self._sample_proportional(batch_size)
+
+ weights = []
+ p_min = self._it_min.min() / self._it_sum.sum()
+ max_weight = (p_min * len(self)) ** (-beta)
+
+ for idx in idxes:
+ p_sample = self._it_sum[idx] / self._it_sum.sum()
+ weight = (p_sample * len(self)) ** (-beta)
+ weights.append(weight / max_weight)
+ weights = np.array(weights, dtype=np.float32)
+ samples = [self.buffer[idx] for idx in idxes]
+ return samples, idxes, weights
+
+ def update_priorities(self, idxes, priorities):
+ assert len(idxes) == len(priorities)
+ for idx, priority in zip(idxes, priorities):
+ assert priority > 0
+ assert 0 <= idx < len(self)
+ self._it_sum[idx] = priority ** self._alpha
+ self._it_min[idx] = priority ** self._alpha
+
+ self._max_priority = max(self._max_priority, priority)
+
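+# Minimal usage sketch for PrioritizedReplayBuffer: sample with importance
+# weights, compute per-sample TD errors in the loss, then feed the absolute
+# errors back as new priorities (a small constant keeps every priority strictly
+# positive, as update_priorities() asserts):
+#
+#   samples, idxes, weights = buffer.sample(batch_size=32, beta=0.4)
+#   # ... compute a per-sample TD error array `td` from `samples` ...
+#   buffer.update_priorities(idxes, np.abs(td) + 1e-5)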
+
+class BatchPreprocessor:
+ """
+    Abstract preprocessor class. Descendants convert an experience
+    batch into a form suitable for learning.
+ """
+ def preprocess(self, batch):
+ raise NotImplementedError
+
+
+class QLearningPreprocessor(BatchPreprocessor):
+ """
+ Supports SimpleDQN, TargetDQN, DoubleDQN and can additionally feed TD-error back to
+ experience replay buffer.
+
+ To use different modes, use appropriate class method
+ """
+ def __init__(self, model, target_model, use_double_dqn=False, batch_td_error_hook=None, gamma=0.99, device="cpu"):
+ self.model = model
+ self.target_model = target_model
+ self.use_double_dqn = use_double_dqn
+        self.batch_td_error_hook = batch_td_error_hook
+ self.gamma = gamma
+ self.device = device
+
+ @staticmethod
+ def simple_dqn(model, **kwargs):
+ return QLearningPreprocessor(model=model, target_model=None, use_double_dqn=False, **kwargs)
+
+ @staticmethod
+    def target_dqn(model, target_model, **kwargs):
+        return QLearningPreprocessor(model, target_model, use_double_dqn=False, **kwargs)
+
+ @staticmethod
+ def double_dqn(model, target_model, **kwargs):
+ return QLearningPreprocessor(model, target_model, use_double_dqn=True, **kwargs)
+
+ def _calc_Q(self, states_first, states_last):
+ """
+        Calculates appropriate Q values for the first and last states. The calculation method depends on our settings.
+ :param states_first: numpy array of first states
+ :param states_last: numpy array of last states
+ :return: tuple of numpy arrays of q values
+ """
+ # here we need both first and last values calculated using our main model, so we
+ # combine both states into one batch for efficiency and separate results later
+ if self.target_model is None or self.use_double_dqn:
+ states_t = torch.tensor(np.concatenate((states_first, states_last), axis=0)).to(self.device)
+ res_both = self.model(states_t).data.cpu().numpy()
+ return res_both[:len(states_first)], res_both[len(states_first):]
+
+ # in this case we have target_model set and use_double_dqn==False
+ # so, we should calculate first_q and last_q using different models
+ states_first_v = torch.tensor(states_first).to(self.device)
+ states_last_v = torch.tensor(states_last).to(self.device)
+ q_first = self.model(states_first_v).data
+ q_last = self.target_model(states_last_v).data
+ return q_first.cpu().numpy(), q_last.cpu().numpy()
+
+ def _calc_target_rewards(self, states_last, q_last):
+ """
+        Calculate rewards from final states according to the variants from our construction:
+        1. simple DQN: max(Q(states, model))
+        2. target DQN: max(Q(states, target_model))
+        3. double DQN: Q(states, target_model)[argmax(Q(states, model))]
+ :param states_last: numpy array of last states from the games
+ :param q_last: numpy array of last q values
+ :return: vector of target rewards
+ """
+ # in this case we handle both simple DQN and target DQN
+ if self.target_model is None or not self.use_double_dqn:
+ return q_last.max(axis=1)
+
+ # here we have target_model set and use_double_dqn==True
+ actions = q_last.argmax(axis=1)
+ # calculate Q values using target net
+ states_last_v = torch.tensor(states_last).to(self.device)
+ q_last_target = self.target_model(states_last_v).data.cpu().numpy()
+ return q_last_target[range(q_last_target.shape[0]), actions]
+
+ def preprocess(self, batch):
+ """
+ Calculates data for Q learning from batch of observations
+ :param batch: list of lists of Experience objects
+ :return: tuple of numpy arrays:
+ 1. states -- observations
+ 2. target Q-values
+ 3. vector of td errors for every batch entry
+ """
+ # first and last states for every entry
+ state_0 = np.array([exp[0].state for exp in batch], dtype=np.float32)
+ state_L = np.array([exp[-1].state for exp in batch], dtype=np.float32)
+
+ q0, qL = self._calc_Q(state_0, state_L)
+ rewards = self._calc_target_rewards(state_L, qL)
+
+ td = np.zeros(shape=(len(batch),))
+
+ for idx, (total_reward, exps) in enumerate(zip(rewards, batch)):
+ # game is done, no final reward
+ if exps[-1].done:
+ total_reward = 0.0
+ for exp in reversed(exps[:-1]):
+ total_reward *= self.gamma
+ total_reward += exp.reward
+ # update total reward and calculate td error
+ act = exps[0].action
+ td[idx] = q0[idx][act] - total_reward
+ q0[idx][act] = total_reward
+
+ return state_0, q0, td
+
+
+class NoopResetEnv(gym.Wrapper):
+ def __init__(self, env=None, noop_max=30):
+ """Sample initial states by taking random number of no-ops on reset.
+ No-op is assumed to be action 0.
+ """
+ super(NoopResetEnv, self).__init__(env)
+ self.noop_max = noop_max
+ self.override_num_noops = None
+ assert env.unwrapped.get_action_meanings()[0] == 'NOOP'
+
+ def step(self, action):
+ return self.env.step(action)
+
+ def reset(self):
+ """ Do no-op action for a number of steps in [1, noop_max]."""
+ self.env.reset()
+ if self.override_num_noops is not None:
+ noops = self.override_num_noops
+ else:
+ noops = np.random.randint(1, self.noop_max + 1)
+ assert noops > 0
+ obs = None
+ for _ in range(noops):
+ obs, _, done, _ = self.env.step(0)
+ if done:
+ obs = self.env.reset()
+ return obs
+
+
+class FireResetEnv(gym.Wrapper):
+ def __init__(self, env=None):
+        """For environments where the user needs to press FIRE for the game to start."""
+ super(FireResetEnv, self).__init__(env)
+ assert env.unwrapped.get_action_meanings()[1] == 'FIRE'
+ assert len(env.unwrapped.get_action_meanings()) >= 3
+
+ def step(self, action):
+ return self.env.step(action)
+
+ def reset(self):
+ self.env.reset()
+ obs, _, done, _ = self.env.step(1)
+ if done:
+ self.env.reset()
+ obs, _, done, _ = self.env.step(2)
+ if done:
+ self.env.reset()
+ return obs
+
+
+class EpisodicLifeEnv(gym.Wrapper):
+ def __init__(self, env=None):
+ """Make end-of-life == end-of-episode, but only reset on true game over.
+ Done by DeepMind for the DQN and co. since it helps value estimation.
+ """
+ super(EpisodicLifeEnv, self).__init__(env)
+ self.lives = 0
+ self.was_real_done = True
+ self.was_real_reset = False
+
+ def step(self, action):
+ obs, reward, done, info = self.env.step(action)
+ self.was_real_done = done
+ # check current lives, make loss of life terminal,
+ # then update lives to handle bonus lives
+ lives = self.env.unwrapped.ale.lives()
+ if lives < self.lives and lives > 0:
+            # for Qbert we sometimes stay in the lives == 0 condition for a few frames,
+            # so it's important to keep lives > 0 so that we only reset once
+            # the environment advertises done.
+ done = True
+ self.lives = lives
+ return obs, reward, done, info
+
+ def reset(self):
+ """Reset only when lives are exhausted.
+ This way all states are still reachable even though lives are episodic,
+ and the learner need not know about any of this behind-the-scenes.
+ """
+ if self.was_real_done:
+ obs = self.env.reset()
+ self.was_real_reset = True
+ else:
+ # no-op step to advance from terminal/lost life state
+ obs, _, _, _ = self.env.step(0)
+ self.was_real_reset = False
+ self.lives = self.env.unwrapped.ale.lives()
+ return obs
+
+
+class MaxAndSkipEnv(gym.Wrapper):
+ def __init__(self, env=None, skip=4):
+ """Return only every `skip`-th frame"""
+ super(MaxAndSkipEnv, self).__init__(env)
+ # most recent raw observations (for max pooling across time steps)
+ self._obs_buffer = deque(maxlen=2)
+ self._skip = skip
+
+ def step(self, action):
+ total_reward = 0.0
+ done = None
+ for _ in range(self._skip):
+ obs, reward, done, info = self.env.step(action)
+ self._obs_buffer.append(obs)
+ total_reward += reward
+ if done:
+ break
+
+ max_frame = np.max(np.stack(self._obs_buffer), axis=0)
+
+ return max_frame, total_reward, done, info
+
+ def reset(self):
+ """Clear past frame buffer and init. to first obs. from inner env."""
+ self._obs_buffer.clear()
+ obs = self.env.reset()
+ self._obs_buffer.append(obs)
+ return obs
+
+
+class ProcessFrame84(gym.ObservationWrapper):
+ def __init__(self, env=None):
+ super(ProcessFrame84, self).__init__(env)
+ self.observation_space = spaces.Box(low=0, high=255, shape=(84, 84, 1), dtype=np.uint8)
+
+ def observation(self, obs):
+ return ProcessFrame84.process(obs)
+
+ @staticmethod
+ def process(frame):
+ if frame.size == 210 * 160 * 3:
+ img = np.reshape(frame, [210, 160, 3]).astype(np.float32)
+ elif frame.size == 250 * 160 * 3:
+ img = np.reshape(frame, [250, 160, 3]).astype(np.float32)
+ else:
+ assert False, "Unknown resolution."
+ img = img[:, :, 0] * 0.299 + img[:, :, 1] * 0.587 + img[:, :, 2] * 0.114
+ resized_screen = cv2.resize(img, (84, 110), interpolation=cv2.INTER_AREA)
+ x_t = resized_screen[18:102, :]
+ x_t = np.reshape(x_t, [84, 84, 1])
+ return x_t.astype(np.uint8)
+
+
+class ClippedRewardsWrapper(gym.RewardWrapper):
+ def reward(self, reward):
+ """Change all the positive rewards to 1, negative to -1 and keep zero."""
+ return np.sign(reward)
+
+
+class LazyFrames(object):
+ def __init__(self, frames):
+ """This object ensures that common frames between the observations are only stored once.
+ It exists purely to optimize memory usage which can be huge for DQN's 1M frames replay
+ buffers.
+ This object should only be converted to numpy array before being passed to the model.
+        You'd not believe how complex the previous solution was."""
+ self._frames = frames
+
+ def __array__(self, dtype=None):
+ out = np.concatenate(self._frames, axis=0)
+ if dtype is not None:
+ out = out.astype(dtype)
+ return out
+
+
+class FrameStack(gym.Wrapper):
+ def __init__(self, env, k):
+ """Stack k last frames.
+ Returns lazy array, which is much more memory efficient.
+ See Also
+ --------
+ baselines.common.atari_wrappers.LazyFrames
+ """
+ gym.Wrapper.__init__(self, env)
+ self.k = k
+ self.frames = deque([], maxlen=k)
+ shp = env.observation_space.shape
+ self.observation_space = spaces.Box(low=0, high=255, shape=(shp[0]*k, shp[1], shp[2]), dtype=np.float32)
+
+ def reset(self):
+ ob = self.env.reset()
+ for _ in range(self.k):
+ self.frames.append(ob)
+ return self._get_ob()
+
+ def step(self, action):
+ ob, reward, done, info = self.env.step(action)
+ self.frames.append(ob)
+ return self._get_ob(), reward, done, info
+
+ def _get_ob(self):
+ assert len(self.frames) == self.k
+ return LazyFrames(list(self.frames))
+
+
+class ScaledFloatFrame(gym.ObservationWrapper):
+ def observation(self, obs):
+ # careful! This undoes the memory optimization, use
+ # with smaller replay buffers only.
+ return np.array(obs).astype(np.float32) / 255.0
+
+
+class ImageToPyTorch(gym.ObservationWrapper):
+ """
+    Move the channel axis to the front so images are channels-first for PyTorch
+ """
+ def __init__(self, env):
+ super(ImageToPyTorch, self).__init__(env)
+ old_shape = self.observation_space.shape
+ self.observation_space = gym.spaces.Box(low=0.0, high=1.0, shape=(old_shape[-1], old_shape[0], old_shape[1]),
+ dtype=np.float32)
+
+ def observation(self, observation):
+ return np.swapaxes(observation, 2, 0)
+
+
+def wrap_dqn(env, stack_frames=4, episodic_life=True, reward_clipping=True):
+ """Apply a common set of wrappers for Atari games."""
+ assert 'NoFrameskip' in env.spec.id
+ if episodic_life:
+ env = EpisodicLifeEnv(env)
+ env = NoopResetEnv(env, noop_max=30)
+ env = MaxAndSkipEnv(env, skip=4)
+ if 'FIRE' in env.unwrapped.get_action_meanings():
+ env = FireResetEnv(env)
+ env = ProcessFrame84(env)
+ env = ImageToPyTorch(env)
+ env = FrameStack(env, stack_frames)
+ if reward_clipping:
+ env = ClippedRewardsWrapper(env)
+ return env
+
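+# Minimal usage sketch for wrap_dqn (assumes gym and an Atari NoFrameskip env id):
+#
+#   env = wrap_dqn(gym.make("PongNoFrameskip-v4"))
+#   obs = env.reset()   # LazyFrames of 4 stacked 84x84 grayscale frames, channels first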
+
+from tensorboardX import SummaryWriter
+
+import sys
+import time
+import math
+import argparse
+import numpy as np
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+import torch.optim as optim
+
+
+HYPERPARAMS = {
+ 'cartpole': {
+ 'env_name': "CartPole-v1",
+ 'stop_reward': 199.0,
+ 'run_name': 'cartpole',
+ 'replay_size': 100000,
+ 'replay_initial': 32,
+ 'target_net_sync': 300,
+ 'epsilon_frames': 10 ** 2,
+ 'epsilon_start': 1.0,
+ 'epsilon_final': 0.02,
+ 'learning_rate': 0.0001,
+ 'gamma': 0.99,
+ 'batch_size': 32
+ },
+ 'pong': {
+ 'env_name': "PongNoFrameskip-v4",
+ 'stop_reward': 18.0,
+ 'run_name': 'pong',
+ 'replay_size': 100000,
+ 'replay_initial': 10000,
+ 'target_net_sync': 1000,
+ 'epsilon_frames': 10**5,
+ 'epsilon_start': 1.0,
+ 'epsilon_final': 0.02,
+ 'learning_rate': 0.0001,
+ 'gamma': 0.99,
+ 'batch_size': 32
+ },
+ 'breakout-small': {
+ 'env_name': "BreakoutNoFrameskip-v4",
+ 'stop_reward': 500.0,
+ 'run_name': 'breakout-small',
+ 'replay_size': 3*10 ** 5,
+ 'replay_initial': 20000,
+ 'target_net_sync': 1000,
+ 'epsilon_frames': 10 ** 6,
+ 'epsilon_start': 1.0,
+ 'epsilon_final': 0.1,
+ 'learning_rate': 0.0001,
+ 'gamma': 0.99,
+ 'batch_size': 64
+ },
+ 'breakout': {
+ 'env_name': "BreakoutNoFrameskip-v4",
+ 'stop_reward': 500.0,
+ 'run_name': 'breakout',
+ 'replay_size': 10 ** 6,
+ 'replay_initial': 50000,
+ 'target_net_sync': 10000,
+ 'epsilon_frames': 10 ** 6,
+ 'epsilon_start': 1.0,
+ 'epsilon_final': 0.1,
+ 'learning_rate': 0.00025,
+ 'gamma': 0.99,
+ 'batch_size': 32
+ },
+ 'invaders': {
+ 'env_name': "SpaceInvadersNoFrameskip-v4",
+ 'stop_reward': 500.0,
+        'run_name': 'invaders',
+ 'replay_size': 10 ** 6,
+ 'replay_initial': 50000,
+ 'target_net_sync': 10000,
+ 'epsilon_frames': 10 ** 6,
+ 'epsilon_start': 1.0,
+ 'epsilon_final': 0.1,
+ 'learning_rate': 0.00025,
+ 'gamma': 0.99,
+ 'batch_size': 32
+ },
+}
+"""
+An agent is something that converts states into actions and has internal state
+"""
+
+
+class BaseAgent:
+ """
+ Abstract Agent interface
+ """
+ def initial_state(self):
+ """
+        Should create an initial empty state for the agent. It will be called at the start of the episode
+        :return: anything the agent wants to remember
+ """
+ return None
+
+ def __call__(self, states, agent_states):
+ """
+ Convert observations and states into actions to take
+ :param states: list of environment states to process
+ :param agent_states: list of states with the same length as observations
+ :return: tuple of actions, states
+ """
+ assert isinstance(states, list)
+ assert isinstance(agent_states, list)
+ assert len(agent_states) == len(states)
+
+ raise NotImplementedError
+
+
+def default_states_preprocessor(states):
+ """
+    Convert a list of states into a form suitable for the model
+    :param states: list of numpy arrays with states
+    :return: torch tensor
+ """
+ if len(states) == 1:
+ np_states = np.expand_dims(states[0], 0)
+ else:
+ np_states = np.array([np.array(s, copy=False) for s in states], copy=False)
+ return torch.tensor(np_states)
+
+
+def float32_preprocessor(states):
+ np_states = np.array(states, dtype=np.float32)
+ return torch.tensor(np_states)
+
+
+class DQNAgent(BaseAgent):
+ """
+ DQNAgent is a memoryless DQN agent which calculates Q values
+    from the observations and converts them into actions using action_selector
+ """
+ def __init__(self, dqn_model, action_selector, device="cpu", preprocessor=default_states_preprocessor):
+ self.dqn_model = dqn_model
+ self.action_selector = action_selector
+ self.preprocessor = preprocessor
+ self.device = device
+
+ def __call__(self, states, agent_states=None):
+ if agent_states is None:
+ agent_states = [None] * len(states)
+ if self.preprocessor is not None:
+ states = self.preprocessor(states)
+ if torch.is_tensor(states):
+ states = states.to(self.device)
+ q_v = self.dqn_model(states)
+ q = q_v.data.cpu().numpy()
+ actions = self.action_selector(q)
+ return actions, agent_states
+
+
+class TargetNet:
+ """
+    Wrapper around a model which keeps a separate copy of its weights (a target network) that is synced on demand
+ """
+ def __init__(self, model):
+ self.model = model
+ import copy
+ self.target_model = copy.deepcopy(model)
+
+ def sync(self):
+ self.target_model.load_state_dict(self.model.state_dict())
+
+ def alpha_sync(self, alpha):
+ """
+        Blend params of the target net with params from the model
+        :param alpha: fraction of the old target weights to keep; (1 - alpha) comes from the model
+ """
+ assert isinstance(alpha, float)
+ assert 0.0 < alpha <= 1.0
+ state = self.model.state_dict()
+ tgt_state = self.target_model.state_dict()
+ for k, v in state.items():
+ tgt_state[k] = tgt_state[k] * alpha + (1 - alpha) * v
+ self.target_model.load_state_dict(tgt_state)
+
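+# Minimal usage sketch for TargetNet (assumes an nn.Module `net`): sync() copies
+# the online weights verbatim (hard update), while alpha_sync() keeps `alpha` of
+# the current target weights and mixes in (1 - alpha) of the online weights
+# (soft update):
+#
+#   tgt_net = TargetNet(net)
+#   tgt_net.sync()             # e.g. every target_net_sync frames
+#   tgt_net.alpha_sync(0.99)   # or a small soft update every step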
+
+class PolicyAgent(BaseAgent):
+ """
+ Policy agent gets action probabilities from the model and samples actions from it
+ """
+    # TODO: unify code with DQNAgent, as only the action selector differs.
+ def __init__(self, model, action_selector=ProbabilityActionSelector(), device="cpu",
+ apply_softmax=False, preprocessor=default_states_preprocessor):
+ self.model = model
+ self.action_selector = action_selector
+ self.device = device
+ self.apply_softmax = apply_softmax
+ self.preprocessor = preprocessor
+
+ def __call__(self, states, agent_states=None):
+ """
+ Return actions from given list of states
+ :param states: list of states
+ :return: list of actions
+ """
+ if agent_states is None:
+ agent_states = [None] * len(states)
+ if self.preprocessor is not None:
+ states = self.preprocessor(states)
+ if torch.is_tensor(states):
+ states = states.to(self.device)
+ probs_v = self.model(states)
+ if self.apply_softmax:
+ probs_v = F.softmax(probs_v, dim=1)
+ probs = probs_v.data.cpu().numpy()
+ actions = self.action_selector(probs)
+ return np.array(actions), agent_states
+
+
+class ActorCriticAgent(BaseAgent):
+ """
+    Policy agent which returns policy and value tensors from observations. Values are stored in the agent's state
+    and can be reused for rollout calculations by ExperienceSource.
+ """
+ def __init__(self, model, action_selector=ProbabilityActionSelector(), device="cpu",
+ apply_softmax=False, preprocessor=default_states_preprocessor):
+ self.model = model
+ self.action_selector = action_selector
+ self.device = device
+ self.apply_softmax = apply_softmax
+ self.preprocessor = preprocessor
+
+ def __call__(self, states, agent_states=None):
+ """
+ Return actions from given list of states
+ :param states: list of states
+ :return: list of actions
+ """
+ if self.preprocessor is not None:
+ states = self.preprocessor(states)
+ if torch.is_tensor(states):
+ states = states.to(self.device)
+ probs_v, values_v = self.model(states)
+ if self.apply_softmax:
+ probs_v = F.softmax(probs_v, dim=1)
+ probs = probs_v.data.cpu().numpy()
+ actions = self.action_selector(probs)
+ agent_states = values_v.data.squeeze().cpu().numpy().tolist()
+ return np.array(actions), agent_states
+
+
+def unpack_batch(batch):
+ states, actions, rewards, dones, last_states = [], [], [], [], []
+ for exp in batch:
+ state = np.array(exp.state, copy=False)
+ states.append(state)
+ actions.append(exp.action)
+ rewards.append(exp.reward)
+ dones.append(exp.last_state is None)
+ if exp.last_state is None:
+ last_states.append(state) # the result will be masked anyway
+ else:
+ last_states.append(np.array(exp.last_state, copy=False))
+ return np.array(states, copy=False), np.array(actions), np.array(rewards, dtype=np.float32), \
+ np.array(dones, dtype=np.uint8), np.array(last_states, copy=False)
+
+
+def calc_loss_dqn(batch, net, tgt_net, gamma, device="cpu"):
+ states, actions, rewards, dones, next_states = unpack_batch(batch)
+
+ states_v = torch.tensor(states).to(device)
+ next_states_v = torch.tensor(next_states).to(device)
+ actions_v = torch.tensor(actions).to(device)
+ rewards_v = torch.tensor(rewards).to(device)
+ done_mask = torch.ByteTensor(dones).to(device)
+
+ state_action_values = net(states_v).gather(1, actions_v.unsqueeze(-1)).squeeze(-1)
+ next_state_values = tgt_net(next_states_v).max(1)[0]
+ next_state_values[done_mask] = 0.0
+
+ expected_state_action_values = next_state_values.detach() * gamma + rewards_v
+ return nn.MSELoss()(state_action_values, expected_state_action_values)
+
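+# In equation form, calc_loss_dqn() implements the standard one-step DQN target:
+#   L = MSE( Q(s, a), r + gamma * max_a' Q_tgt(s', a') * (1 - done) )
+# with the max taken under the target network and zeroed on terminal transitions.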
+
+class RewardTracker:
+ def __init__(self, writer, stop_reward):
+ self.writer = writer
+ self.stop_reward = stop_reward
+
+ def __enter__(self):
+ self.ts = time.time()
+ self.ts_frame = 0
+ self.total_rewards = []
+ return self
+
+ def __exit__(self, *args):
+ self.writer.close()
+
+ def reward(self, reward, frame, epsilon=None):
+ self.total_rewards.append(reward)
+ speed = (frame - self.ts_frame) / (time.time() - self.ts)
+ self.ts_frame = frame
+ self.ts = time.time()
+ mean_reward = np.mean(self.total_rewards[-100:])
+ epsilon_str = "" if epsilon is None else ", eps %.2f" % epsilon
+ print("%d: done %d games, mean reward %.3f, speed %.2f f/s%s" % (
+ frame, len(self.total_rewards), mean_reward, speed, epsilon_str
+ ))
+ sys.stdout.flush()
+ if epsilon is not None:
+ self.writer.add_scalar("epsilon", epsilon, frame)
+ self.writer.add_scalar("speed", speed, frame)
+ self.writer.add_scalar("reward_100", mean_reward, frame)
+ self.writer.add_scalar("reward", reward, frame)
+ if mean_reward > self.stop_reward:
+ print("Solved in %d frames!" % frame)
+ return True
+ return False
+
+
+class EpsilonTracker:
+ def __init__(self, epsilon_greedy_selector, params):
+ self.epsilon_greedy_selector = epsilon_greedy_selector
+ self.epsilon_start = params['epsilon_start']
+ self.epsilon_final = params['epsilon_final']
+ self.epsilon_frames = params['epsilon_frames']
+ self.frame(0)
+
+ def frame(self, frame):
+ self.epsilon_greedy_selector.epsilon = \
+ max(self.epsilon_final, self.epsilon_start - frame / self.epsilon_frames)
+
+
+def distr_projection(next_distr, rewards, dones, Vmin, Vmax, n_atoms, gamma):
+ """
+    Perform distribution projection, aka the Categorical Algorithm, from the
+ "A Distributional Perspective on RL" paper
+ """
+ batch_size = len(rewards)
+ proj_distr = np.zeros((batch_size, n_atoms), dtype=np.float32)
+ delta_z = (Vmax - Vmin) / (n_atoms - 1)
+ for atom in range(n_atoms):
+ tz_j = np.minimum(Vmax, np.maximum(Vmin, rewards + (Vmin + atom * delta_z) * gamma))
+ b_j = (tz_j - Vmin) / delta_z
+ l = np.floor(b_j).astype(np.int64)
+ u = np.ceil(b_j).astype(np.int64)
+ eq_mask = u == l
+ proj_distr[eq_mask, l[eq_mask]] += next_distr[eq_mask, atom]
+ ne_mask = u != l
+ proj_distr[ne_mask, l[ne_mask]] += next_distr[ne_mask, atom] * (u - b_j)[ne_mask]
+ proj_distr[ne_mask, u[ne_mask]] += next_distr[ne_mask, atom] * (b_j - l)[ne_mask]
+ if dones.any():
+ proj_distr[dones] = 0.0
+ tz_j = np.minimum(Vmax, np.maximum(Vmin, rewards[dones]))
+ b_j = (tz_j - Vmin) / delta_z
+ l = np.floor(b_j).astype(np.int64)
+ u = np.ceil(b_j).astype(np.int64)
+ eq_mask = u == l
+ eq_dones = dones.copy()
+ eq_dones[dones] = eq_mask
+ if eq_dones.any():
+ proj_distr[eq_dones, l[eq_mask]] = 1.0
+ ne_mask = u != l
+ ne_dones = dones.copy()
+ ne_dones[dones] = ne_mask
+ if ne_dones.any():
+ proj_distr[ne_dones, l[ne_mask]] = (u - b_j)[ne_mask]
+ proj_distr[ne_dones, u[ne_mask]] = (b_j - l)[ne_mask]
+ return proj_distr
+
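+# In equation form, distr_projection() applies the Bellman operator to every
+# support atom z_j: Tz_j = clip(r + gamma * z_j, Vmin, Vmax), b_j = (Tz_j - Vmin) / delta_z,
+# and the probability mass of atom j is split between the neighbouring atoms
+# floor(b_j) and ceil(b_j) in proportion to their distance. Terminal transitions
+# place all mass at clip(r, Vmin, Vmax), again split between the two nearest atoms.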
+
+Vmax = 10
+Vmin = -10
+N_ATOMS = 51
+DELTA_Z = (Vmax - Vmin) / (N_ATOMS - 1)
+
+
+class DistributionalDQN(nn.Module):
+ def __init__(self, input_shape, n_actions):
+ super(DistributionalDQN, self).__init__()
+
+ self.fc = nn.Sequential(
+ nn.Linear(input_shape[0], 512),
+ nn.ReLU(),
+ nn.Linear(512, n_actions * N_ATOMS)
+ )
+
+ self.register_buffer("supports", torch.arange(Vmin, Vmax+DELTA_Z, DELTA_Z))
+ self.softmax = nn.Softmax(dim=1)
+
+ def forward(self, x):
+ batch_size = x.size()[0]
+ fc_out = self.fc(x.float())
+ return fc_out.view(batch_size, -1, N_ATOMS)
+
+ def both(self, x):
+ cat_out = self(x)
+ probs = self.apply_softmax(cat_out)
+ weights = probs * self.supports
+ res = weights.sum(dim=2)
+ return cat_out, res
+
+ def qvals(self, x):
+ return self.both(x)[1]
+
+ def apply_softmax(self, t):
+ return self.softmax(t.view(-1, N_ATOMS)).view(t.size())
+
+def calc_loss(batch, net, tgt_net, gamma, device="cpu", save_prefix=None):
+ states, actions, rewards, dones, next_states = unpack_batch(batch)
+ batch_size = len(batch)
+
+ states_v = torch.tensor(states).to(device)
+ actions_v = torch.tensor(actions).to(device)
+ next_states_v = torch.tensor(next_states).to(device)
+
+ # next state distribution
+ next_distr_v, next_qvals_v = tgt_net.both(next_states_v)
+ next_actions = next_qvals_v.max(1)[1].data.cpu().numpy()
+ next_distr = tgt_net.apply_softmax(next_distr_v).data.cpu().numpy()
+
+ next_best_distr = next_distr[range(batch_size), next_actions]
+ dones = dones.astype(np.bool)
+
+ # project our distribution using Bellman update
+ proj_distr = distr_projection(next_best_distr, rewards, dones, Vmin, Vmax, N_ATOMS, gamma)
+
+ # calculate net output
+ distr_v = net(states_v)
+ state_action_values = distr_v[range(batch_size), actions_v.data]
+ state_log_sm_v = F.log_softmax(state_action_values, dim=1)
+ proj_distr_v = torch.tensor(proj_distr).to(device)
+
+ loss_v = -state_log_sm_v * proj_distr_v
+ return loss_v.sum(dim=1).mean()
+
+
+# if __name__ == "__main__":
+# params = HYPERPARAMS['cartpole']
+# # params['epsilon_frames'] *= 2
+# parser = argparse.ArgumentParser()
+# parser.add_argument("--cuda", default=False, action="store_true", help="Enable cuda")
+# args = parser.parse_args()
+# device = torch.device("cuda" if args.cuda or True else "cpu")
+#
+# env = gym.make(params['env_name'])
+#
+# writer = SummaryWriter(comment="-" + params['run_name'] + "-distrib")
+# net = DistributionalDQN(env.observation_space.shape, env.action_space.n).to(device)
+#
+# tgt_net = TargetNet(net)
+# selector = EpsilonGreedyActionSelector(epsilon=params['epsilon_start'])
+# epsilon_tracker = EpsilonTracker(selector, params)
+# agent = DQNAgent(lambda x: net.qvals(x), selector, device=device)
+#
+# exp_source = ExperienceSourceFirstLast(env, agent, gamma=params['gamma'], steps_count=1)
+# buffer = ExperienceReplayBuffer(exp_source, buffer_size=params['replay_size'])
+# optimizer = optim.Adam(net.parameters(), lr=params['learning_rate'])
+#
+# frame_idx = 0
+# eval_states = None
+# prev_save = 0
+# save_prefix = None
+#
+# with RewardTracker(writer, params['stop_reward']) as reward_tracker:
+# while True:
+# frame_idx += 1
+# buffer.populate(1)
+# epsilon_tracker.frame(frame_idx)
+#
+# new_rewards = exp_source.pop_total_rewards()
+# if new_rewards:
+# if reward_tracker.reward(new_rewards[0], frame_idx, selector.epsilon):
+# break
+#
+# if len(buffer) < params['replay_initial']:
+# continue
+#
+# optimizer.zero_grad()
+# batch = buffer.sample(params['batch_size'])
+#
+# loss_v = calc_loss(batch, net, tgt_net.target_model, gamma=params['gamma'],
+# device=device, save_prefix=save_prefix)
+# loss_v.backward()
+# # print(str(loss_v.data))
+# optimizer.step()
+#
+# if frame_idx % params['target_net_sync'] == 0:
+# tgt_net.sync()
+
+
+class NoisyLinear(nn.Linear):
+ def __init__(self, in_features, out_features, sigma_init=0.017, bias=True):
+ super(NoisyLinear, self).__init__(in_features, out_features, bias=bias)
+ self.sigma_weight = nn.Parameter(torch.full((out_features, in_features), sigma_init))
+ self.register_buffer("epsilon_weight", torch.zeros(out_features, in_features))
+ if bias:
+ self.sigma_bias = nn.Parameter(torch.full((out_features,), sigma_init))
+ self.register_buffer("epsilon_bias", torch.zeros(out_features))
+ self.reset_parameters()
+
+ def reset_parameters(self):
+ std = math.sqrt(3 / self.in_features)
+ self.weight.data.uniform_(-std, std)
+ self.bias.data.uniform_(-std, std)
+
+ def forward(self, input):
+ self.epsilon_weight.normal_()
+ bias = self.bias
+ if bias is not None:
+ self.epsilon_bias.normal_()
+ bias = bias + self.sigma_bias * self.epsilon_bias.data
+ return F.linear(input, self.weight + self.sigma_weight * self.epsilon_weight.data, bias)
+
+
+class NoisyFactorizedLinear(nn.Linear):
+ """
+ NoisyNet layer with factorized gaussian noise
+
+    N.B. nn.Linear already initializes weight and bias, so no extra parameter reset is done here
+ """
+ def __init__(self, in_features, out_features, sigma_zero=0.4, bias=True):
+ super(NoisyFactorizedLinear, self).__init__(in_features, out_features, bias=bias)
+ sigma_init = sigma_zero / math.sqrt(in_features)
+ self.sigma_weight = nn.Parameter(torch.full((out_features, in_features), sigma_init))
+ self.register_buffer("epsilon_input", torch.zeros(1, in_features))
+ self.register_buffer("epsilon_output", torch.zeros(out_features, 1))
+ if bias:
+ self.sigma_bias = nn.Parameter(torch.full((out_features,), sigma_init))
+
+ def forward(self, input):
+ self.epsilon_input.normal_()
+ self.epsilon_output.normal_()
+
+ func = lambda x: torch.sign(x) * torch.sqrt(torch.abs(x))
+ eps_in = func(self.epsilon_input.data)
+ eps_out = func(self.epsilon_output.data)
+
+ bias = self.bias
+ if bias is not None:
+ bias = bias + self.sigma_bias * eps_out.t()
+ noise_v = torch.mul(eps_in, eps_out)
+ return F.linear(input, self.weight + self.sigma_weight * noise_v, bias)
+
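+# The factorized noise above follows the NoisyNet paper: with f(x) = sign(x) * sqrt(|x|),
+# the weight noise is the outer product f(eps_out) * f(eps_in)^T, so only
+# in_features + out_features random numbers are drawn per forward pass instead of
+# in_features * out_features as in the non-factorized NoisyLinear layer.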
+
+class DQN(nn.Module):
+ def __init__(self, input_shape, n_actions):
+ super(DQN, self).__init__()
+
+ # self.conv = nn.Sequential(
+ # nn.Conv2d(input_shape[0], 32, kernel_size=8, stride=4),
+ # nn.ReLU(),
+ # nn.Conv2d(32, 64, kernel_size=4, stride=2),
+ # nn.ReLU(),
+ # nn.Conv2d(64, 64, kernel_size=3, stride=1),
+ # nn.ReLU()
+ # )
+ #
+ # conv_out_size = self._get_conv_out(input_shape)
+ self.fc = nn.Sequential(
+ nn.Linear(input_shape[0], 512),
+ nn.ReLU(),
+ nn.Linear(512, n_actions)
+ )
+
+ def _get_conv_out(self, shape):
+ o = self.conv(torch.zeros(1, *shape))
+ return int(np.prod(o.size()))
+
+ def forward(self, x):
+ fx = x.float()# / 256
+ # conv_out = self.conv(fx).view(fx.size()[0], -1)
+ return self.fc(fx)
+
+
+class NoisyDQN(nn.Module):
+ def __init__(self, input_shape, n_actions):
+ super(NoisyDQN, self).__init__()
+
+ # self.conv = nn.Sequential(
+ # nn.Conv2d(input_shape[0], 32, kernel_size=8, stride=4),
+ # nn.ReLU(),
+ # nn.Conv2d(32, 64, kernel_size=4, stride=2),
+ # nn.ReLU(),
+ # nn.Conv2d(64, 64, kernel_size=3, stride=1),
+ # nn.ReLU()
+ # )
+ #
+ # conv_out_size = self._get_conv_out(input_shape)
+ self.noisy_layers = [
+ NoisyLinear(input_shape[0], 512),
+ NoisyLinear(512, n_actions)
+ ]
+ self.fc = nn.Sequential(
+ self.noisy_layers[0],
+ nn.ReLU(),
+ self.noisy_layers[1]
+ )
+
+ def _get_conv_out(self, shape):
+ o = self.conv(torch.zeros(1, *shape))
+ return int(np.prod(o.size()))
+
+ def forward(self, x):
+ fx = x.float() #/ 256
+ # conv_out = self.conv(fx).view(fx.size()[0], -1)
+ return self.fc(fx)
+
+ def noisy_layers_sigma_snr(self):
+ return [
+ ((layer.weight ** 2).mean().sqrt() / (layer.sigma_weight ** 2).mean().sqrt()).item()
+ for layer in self.noisy_layers
+ ]
+
+
+if __name__ == "__main__":
+ params = HYPERPARAMS['cartpole']
+ parser = argparse.ArgumentParser()
+ parser.add_argument("--cuda", default=False, action="store_true", help="Enable cuda")
+ args = parser.parse_args()
+ device = torch.device("cuda" if args.cuda else "cpu")
+
+ env = gym.make(params['env_name'])
+ # env = wrap_dqn(env)
+
+ writer = SummaryWriter(comment="-" + params['run_name'] + "-noisy-net")
+ net = NoisyDQN(env.observation_space.shape, env.action_space.n).to(device)
+ tgt_net = TargetNet(net)
+ agent = DQNAgent(net, ArgmaxActionSelector(), device=device)
+
+ exp_source = ExperienceSourceFirstLast(env, agent, gamma=params['gamma'], steps_count=1)
+ buffer = ExperienceReplayBuffer(exp_source, buffer_size=params['replay_size'])
+ optimizer = optim.Adam(net.parameters(), lr=params['learning_rate'])
+
+ frame_idx = 0
+
+ with RewardTracker(writer, params['stop_reward']) as reward_tracker:
+ while True:
+ frame_idx += 1
+ buffer.populate(1)
+
+ new_rewards = exp_source.pop_total_rewards()
+ if new_rewards:
+ if reward_tracker.reward(new_rewards[0], frame_idx):
+ break
+
+ if len(buffer) < params['replay_initial']:
+ continue
+
+ optimizer.zero_grad()
+ batch = buffer.sample(params['batch_size'])
+ loss_v = calc_loss_dqn(batch, net, tgt_net.target_model, gamma=params['gamma'], device=device)
+ loss_v.backward()
+ print(loss_v)
+ optimizer.step()
+
+ if frame_idx % params['target_net_sync'] == 0:
+ tgt_net.sync()
+
+ if frame_idx % 500 == 0:
+ for layer_idx, sigma_l2 in enumerate(net.noisy_layers_sigma_snr()):
+ writer.add_scalar("sigma_snr_layer_%d" % (layer_idx+1),
+ sigma_l2, frame_idx)
\ No newline at end of file
diff --git a/fast_rl/agents/rainbow_dqn.py b/fast_rl/agents/rainbow_dqn.py
new file mode 100644
index 0000000..fc7c4f7
--- /dev/null
+++ b/fast_rl/agents/rainbow_dqn.py
@@ -0,0 +1,203 @@
+import collections
+from copy import deepcopy
+from warnings import warn
+
+from fastai.basic_train import LearnerCallback
+from fastai.imports import torch, Any
+
+from fast_rl.agents.dist_dqn_models import TargetNet
+from fast_rl.agents.dqn_models import distr_projection
+from fast_rl.core.agent_core import ExperienceReplay, NStepExperienceReplay, NStepPriorityExperienceReplay
+from fast_rl.core.basic_train import AgentLearner, listify, List
+from fast_rl.core.data_block import MDPDataBunch, MDPStep
+from fastai.imports import torch
+
+import gym
+import numpy as np
+import argparse
+
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+import torch.optim as optim
+
+Vmax = 10
+Vmin = -10
+N_ATOMS = 51
+# Vmax = 5
+# Vmin = -5
+# N_ATOMS = 8
+DELTA_Z = (Vmax - Vmin) / (N_ATOMS - 1)
+
+ExperienceFirstLast = collections.namedtuple('ExperienceFirstLast', ('state', 'action', 'reward', 'last_state','done'))
+
+
+
+
+def unpack_batch(batch):
+ states, actions, rewards, dones, last_states = [], [], [], [], []
+ for exp in batch:
+ state = np.array(exp.state, copy=False)
+ states.append(state)
+ actions.append(exp.action)
+ rewards.append(exp.reward)
+ dones.append(exp.done)
+ # if exp.last_state is None:
+ # last_states.append(state) # the result will be masked anyway
+ # else:
+ last_states.append(np.array(exp.last_state, copy=False))
+ return np.array(states, copy=False), np.array(actions), np.array(rewards, dtype=np.float32), \
+ np.array(dones, dtype=np.uint8), np.array(last_states, copy=False)
+
+
+def calc_loss(batch, batch_weights, net, tgt_net, gamma, device="cpu"):
+ states, actions, rewards, dones, next_states = unpack_batch(batch)
+ batch_size = len(batch)
+
+ states_v = torch.tensor(states).to(device)
+ actions_v = torch.tensor(actions).to(device)
+ next_states_v = torch.tensor(next_states).to(device)
+ if batch_weights is not None: batch_weights_v = torch.tensor(batch_weights).to(device)
+
+ # next state distribution
+ # dueling arch -- actions from main net, distr from tgt_net
+
+ # calc at once both next and cur states
+ distr_v, qvals_v = net.both(torch.cat((states_v, next_states_v)))
+ next_qvals_v = qvals_v[batch_size:]
+ distr_v = distr_v[:batch_size]
+
+ next_actions_v = next_qvals_v.max(1)[1]
+ next_distr_v = tgt_net(next_states_v)
+ next_best_distr_v = next_distr_v[range(batch_size), next_actions_v.data]
+ next_best_distr_v = tgt_net.apply_softmax(next_best_distr_v)
+ next_best_distr = next_best_distr_v.data.cpu().numpy()
+
+ dones = dones.astype(np.bool)
+
+ # project our distribution using Bellman update
+ proj_distr = distr_projection(next_best_distr, rewards, dones, Vmin, Vmax, N_ATOMS, gamma)
+
+ # calculate net output
+ state_action_values = distr_v[range(batch_size), actions_v.data]
+ state_log_sm_v = F.log_softmax(state_action_values, dim=1)
+ proj_distr_v = torch.tensor(proj_distr).to(device)
+
+ loss_v = -state_log_sm_v * proj_distr_v
+ if batch_weights is not None: loss_v = batch_weights_v * loss_v.sum(dim=1)
+ return loss_v.mean(), loss_v + 1e-5
+
+
+class BaseRainbowDQNTrainer(LearnerCallback):
+ def __init__(self, learn: 'DistDQNLearner', max_episodes=None):
+        r"""Handles basic Rainbow DQN end-of-step model optimization."""
+ super().__init__(learn)
+ self.n_skipped = 0
+ self._persist = max_episodes is not None
+ self.max_episodes = max_episodes
+ self.episode = -1
+ self.iteration = 0
+ # For the callback handler
+ self._order = 0
+ self.previous_item = None
+
+ @property
+ def learn(self)->'RainbowDQNLearner':
+ return self._learn()
+
+ def on_train_begin(self, n_epochs, **kwargs: Any):
+ self.max_episodes = n_epochs if not self._persist else self.max_episodes
+
+ def on_epoch_begin(self, epoch, **kwargs: Any):
+ pass
+
+ def on_backward_begin(self, **kwargs: Any):return {'skip_bwd': self.learn.warming_up}
+ def on_backward_end(self, **kwargs:Any): return {'skip_step':False}
+ def on_step_end(self, **kwargs: Any):return {'skip_zero': False}
+
+ def on_loss_begin(self, **kwargs: Any):
+ r"""Performs tree updates, exploration updates, and model optimization."""
+ if self.learn.model.training:
+ self.learn.memory.update(item=self.learn.data.x.items[-1])
+ self.iteration+=1
+ self.learn.epsilon_tracker.frame(self.iteration)
+
+ if not self.learn.warming_up:
+ samples: List[MDPStep]=self.memory.sample(self.learn.data.bs)
+ batch=[ExperienceFirstLast(state=deepcopy(s.s[0]),action=deepcopy(s.action.taken_action),
+ reward=deepcopy(s.reward),last_state=deepcopy(s.s_prime[0]),done=deepcopy(s.done)) for s in samples]
+ # model_func=lambda x: self.learn.model.qvals(x)
+ loss,weight_loss=calc_loss(batch,self.memory.weights(),self.learn.model,self.learn.target_net.target_model,gamma=0.99,device=self.learn.data.device)
+ self.learn.memory.refresh({'td_error':weight_loss})
+ return {'last_output':loss}
+ else: return None
+
+ def on_batch_end(self, **kwargs:Any) ->None:
+ if self.iteration % 300 == 0:
+ self.learn.target_net.sync()
+
+
+class ArgmaxActionSelector(object):
+ """
+ Selects actions using argmax
+ """
+ def __call__(self, scores):
+ assert isinstance(scores, np.ndarray)
+ return np.argmax(scores, axis=1)
+
+
+class EpsilonGreedyActionSelector(object):
+ def __init__(self, epsilon=0.05, selector=None):
+ self.epsilon = epsilon
+ self.selector = selector if selector is not None else ArgmaxActionSelector()
+
+ def __call__(self, scores):
+ assert isinstance(scores, np.ndarray)
+ batch_size, n_actions = scores.shape
+ actions = self.selector(scores)
+ mask = np.random.random(size=batch_size) < self.epsilon
+ rand_actions = np.random.choice(n_actions, sum(mask))
+ actions[mask] = rand_actions
+ return actions
+
+class EpsilonTracker:
+ def __init__(self, epsilon_greedy_selector, params):
+ self.epsilon_greedy_selector = epsilon_greedy_selector
+ self.epsilon_start = params['epsilon_start']
+ self.epsilon_final = params['epsilon_final']
+ self.epsilon_frames = params['epsilon_frames']
+ self.frame(0)
+
+ def frame(self, frame):
+ self.epsilon_greedy_selector.epsilon = \
+ max(self.epsilon_final, self.epsilon_start - frame / self.epsilon_frames)
+
+
+
+class RainbowDQNLearner(AgentLearner):
+ def __init__(self, data: MDPDataBunch, model, trainers,use_per=True, loss_func=None,opt=torch.optim.Adam,**learn_kwargs):
+ super().__init__(data=data, model=model, opt=opt,loss_func=loss_func, **learn_kwargs)
+ self._loss_func=loss_func
+ self.memory=NStepPriorityExperienceReplay(100000,n_step=2) if use_per else NStepExperienceReplay(100000,step_sz=2)
+        if use_per: warn('Using PER on simpler envs such as CartPole has not been solved even after 2000 epochs. '
+                         'We will see if there is a way to configure PER to handle these simpler environments')
+        warn('RAINBOW on envs like CartPole converges extremely slowly, requiring more than 600 epochs. '
+             'Due to a memory issue, we cannot test beyond this number of epochs to get detailed convergence.')
+ self.target_net=TargetNet(self.model)
+ self.exploration_method=EpsilonGreedyActionSelector(1.0)
+ self.epsilon_tracker=EpsilonTracker(self.exploration_method, {'epsilon_frames': 100, 'epsilon_start': 1.0,
+ 'epsilon_final': 0.02})
+ self.trainers=listify(trainers)
+ for t in self.trainers: self.callbacks.append(t(self))
+
+ def init(self, init):pass
+ # def init_loss_func(self):pass
+
+ def predict(self, element, **kwargs):
+ model_func=lambda x: self.model.qvals(x)
+ q_v=model_func(element)
+ q=q_v.data.cpu().numpy()
+ actions=self.exploration_method(q)
+ return actions
+
+
diff --git a/fast_rl/agents/rainbow_dqn_models.py b/fast_rl/agents/rainbow_dqn_models.py
new file mode 100644
index 0000000..abbb66d
--- /dev/null
+++ b/fast_rl/agents/rainbow_dqn_models.py
@@ -0,0 +1,66 @@
+# n-step
+from fast_rl.core.layers import *
+
+REWARD_STEPS = 2
+
+# priority replay
+PRIO_REPLAY_ALPHA = 0.6
+BETA_START = 0.4
+BETA_FRAMES = 100000
+
+# C51
+Vmax = 10
+Vmin = -10
+N_ATOMS = 51
+DELTA_Z = (Vmax - Vmin) / (N_ATOMS - 1)
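+# The value distribution is represented by N_ATOMS support points spaced DELTA_Z apart
+# across [Vmin, Vmax]; RainbowDQN.qvals() below collapses it back to scalar Q-values by
+# taking the expectation over this support.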
+
+
+class RainbowDQN(nn.Module):
+ def __init__(self, input_shape, n_actions):
+ super(RainbowDQN, self).__init__()
+
+ hd_sz=512
+
+ self.fc_val = nn.Sequential(
+ GaussianNoisyLinear(input_shape[0], hd_sz),
+ nn.ReLU(),
+ GaussianNoisyLinear(hd_sz, N_ATOMS)
+ )
+
+ self.fc_adv = nn.Sequential(
+ GaussianNoisyLinear(input_shape[0], hd_sz),
+ nn.ReLU(),
+ GaussianNoisyLinear(hd_sz, n_actions * N_ATOMS)
+ )
+
+ self.register_buffer("supports", torch.arange(Vmin, Vmax+DELTA_Z, DELTA_Z))
+ self.softmax = nn.Softmax(dim=1)
+
+ self.loss_func=None
+
+ def _get_conv_out(self, shape):
+ o = self.conv(torch.zeros(1, *shape))
+ return int(np.prod(o.size()))
+
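+ # Dueling aggregation over the distribution logits: value stream plus mean-centred advantage stream.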
+ def forward(self, x):
+ batch_size = x.size()[0]
+ fx = x.float()
+ val_out = self.fc_val(fx).view(batch_size, 1, N_ATOMS)
+ adv_out = self.fc_adv(fx).view(batch_size, -1, N_ATOMS)
+ adv_mean = adv_out.mean(dim=1, keepdim=True)
+ return val_out + (adv_out - adv_mean)
+
+ def set_opt(self, _): pass
+
+ def both(self, x):
+ cat_out = self(x)
+ probs = self.apply_softmax(cat_out)
+ weights = probs * self.supports
+ res = weights.sum(dim=2)
+ return cat_out, res
+
+ def qvals(self, x):
+ return self.both(x)[1]
+
+ def apply_softmax(self, t):
+ return self.softmax(t.view(-1, N_ATOMS)).view(t.size())
diff --git a/fast_rl/agents/reinforce.py b/fast_rl/agents/reinforce.py
new file mode 100644
index 0000000..cc2c5cc
--- /dev/null
+++ b/fast_rl/agents/reinforce.py
@@ -0,0 +1,171 @@
+import collections
+from copy import deepcopy
+from functools import partial
+from warnings import warn
+
+from fastai.basic_train import LearnerCallback
+from fastai.imports import torch, Any
+
+from fast_rl.agents.dist_dqn_models import TargetNet
+from fast_rl.agents.dqn_models import distr_projection
+from fast_rl.core.agent_core import ExperienceReplay, NStepExperienceReplay, NStepPriorityExperienceReplay, \
+ ExplorationStrategy
+from fast_rl.core.basic_train import AgentLearner, listify, List
+from fast_rl.core.data_block import MDPDataBunch, MDPStep
+
+import gym
+import numpy as np
+import argparse
+
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+import torch.optim as optim
+
+from itertools import groupby
+
+
+class PolicyExploration(ExplorationStrategy):
+ def __init__(self,apply_softmax=False):
+ super().__init__()
+ self.apply_softmax=apply_softmax
+ self.epsilon=0
+
+ def perturb(self, action, action_space) -> np.ndarray:
+ if self.apply_softmax:
+ action=F.softmax(action.double(), dim=1) # cast to double for extra precision, otherwise np.random.choice can raise a "probabilities do not sum to 1" error
+ action/=action.sum()
+ action_list = [np.random.choice(len(prob), p=prob) for prob in action]
+ return np.array(action_list)
+
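+# Discounted return for every step of an episode, computed with a single reverse pass over
+# the rewards; the mean return is subtracted as a simple baseline to reduce gradient variance.
+# e.g. calc_qvals([1., 1., 1.], gamma=1.0) -> [3, 2, 1] minus their mean 2 -> [1, 0, -1]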
+def calc_qvals(rewards,gamma):
+ res = []
+ sum_r = 0.0
+ for r in reversed(rewards):
+ sum_r *= gamma
+ sum_r += r
+ res.append(sum_r)
+ res = list(reversed(res))
+ mean_q = np.mean(res)
+ return [q - mean_q for q in res]
+
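+# REINFORCE policy-gradient loss: the negative mean of Q(s,a) * log pi(a|s), where the
+# log-probabilities are gathered from the logits of the actions actually taken.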
+def calc_loss(states,actions,logits,q_vals):
+ log_prob_v=F.log_softmax(logits, dim=1)
+ log_prob_actions_v=q_vals*log_prob_v[range(len(states)), actions]
+ loss_v=-log_prob_actions_v.mean()
+ return loss_v
+
+class BaseReinforceTrainer(LearnerCallback):
+ def __init__(self, learn: 'ReinforceLearner', max_episodes=None):
+ r"""Handles basic DQN end of step model optimization."""
+ super().__init__(learn)
+
+ self.n_skipped = 0
+ self._persist = max_episodes is not None
+ self.max_episodes = max_episodes
+ self.episode = -1
+ self.iteration = 0
+ # For the callback handler
+ self._order = 0
+ self.previous_item = None
+ self.loss=None
+
+ @property
+ def learn(self)->'ReinforceLearner':
+ return self._learn()
+
+ def on_train_begin(self, n_epochs, **kwargs: Any):
+ self.max_episodes = n_epochs if not self._persist else self.max_episodes
+
+ def on_epoch_begin(self, epoch, **kwargs: Any):
+ pass
+
+ def on_backward_begin(self, **kwargs: Any):
+ return {'skip_bwd':self.learn.current_n_episodes_to_train<self.learn.n_episodes_to_train}
+
+ def on_backward_end(self, **kwargs: Any):
+ if self.learn.current_n_episodes_to_train>=self.learn.n_episodes_to_train:
+ return {'skip_step': False}
+ return {'skip_step':True}
+ def on_step_end(self, **kwargs: Any):
+ if self.learn.current_n_episodes_to_train>=self.learn.n_episodes_to_train:
+ self.learn.current_n_episodes_to_train=0
+ return {'skip_zero': False}
+ return {'skip_zero': True}
+
+ def on_batch_begin(self, **kwargs:Any) ->None:
+ if self.learn.model.training:
+ if self.learn.data.x.items[-1].done: self.learn.current_n_episodes_to_train+=1
+ if not self.learn.warming_up and self.learn.loss_func is None: self.learn.init_loss_func()
+
+ def on_loss_begin(self, **kwargs: Any):
+ r"""Performs tree updates, exploration updates, and model optimization."""
+ if self.learn.model.training:
+ # if self.learn.data.x.items[-1].done: self.learn.current_n_episodes_to_train+=1
+ self.learn.memory.update(item=self.learn.data.x.items[-1])
+ if len(self.learn.data.x.items)>10:self.learn.data.x.items=np.delete(self.learn.data.x.items,0)
+ self.iteration+=1
+ self.learn.exploration_method.update(self.episode, max_episodes=self.max_episodes, explore=self.learn.model.training)
+
+ if self.learn.current_n_episodes_to_train>=self.learn.n_episodes_to_train and not self.learn.warming_up:
+ samples=list(self.memory.memory)
+ assert int(sum([s.done for s in samples]))==self.learn.current_n_episodes_to_train
+
+ _episode_counter=[0]
+ q_vals=[]
+ def paint_episodes(x,counter):
+ if not bool(x.done): return counter[0]
+ else:
+ counter[0]+=1
+ return counter[0]
+ for g_n,o in groupby([s for s in samples],partial(paint_episodes,counter=_episode_counter)):
+ q_vals.extend(calc_qvals([float(s.reward) for s in o],self.learn.discount))
+
+ loss=calc_loss(
+ states=torch.cat([s.s for s in samples]),
+ actions=torch.cat([s.a for s in samples]),
+ logits=self.learn.model(torch.cat([s.s for s in samples])),
+ q_vals=torch.Tensor(q_vals).to(self.learn.data.device)
+ )
+ self.learn.memory.memory.clear()
+ self.loss=loss.detach().cpu()
+ return {'last_output': loss}
+ return {'last_output':self.loss}
+
+ def on_batch_end(self, **kwargs:Any) ->None:
+ if self.iteration % 300 == 0:
+ self.learn.target_net.sync()
+
+
+class ReinforceLearner(AgentLearner):
+ def __init__(self, data: MDPDataBunch, model, trainers,loss_func=None,episodes_to_train=4,discount=0.99,opt=torch.optim.Adam,**learn_kwargs):
+ super().__init__(data=data, model=model, opt=opt,loss_func=loss_func, **learn_kwargs)
+ self._loss_func=loss_func
+ self.memory=ExperienceReplay(100000)
+ self.discount=discount
+
+ self.target_net=TargetNet(self.model)
+ self.exploration_method=PolicyExploration(apply_softmax=True)
+ self.trainers=listify(trainers)
+ for t in self.trainers: self.callbacks.append(t(self))
+ self.n_episodes_to_train=episodes_to_train
+ self.current_n_episodes_to_train=0
+ self.stay_warmed_up_toggle=True
+
+ def init(self, init):pass
+ # def init_loss_func(self):pass
+ def remove_loss_func(self):
+ self.loss_func=None
+
+ @property
+ def warming_up(self):
+ if self.n_episodes_to_train<=self.current_n_episodes_to_train:
+ self.stay_warmed_up_toggle=False
+ return self.stay_warmed_up_toggle
+
+ def predict(self, element, **kwargs):
+ q_v=self.model(element)
+ actions=self.exploration_method.perturb(q_v,self.data.action.action_space)
+ return actions
+
+
diff --git a/fast_rl/agents/reinforce_models.py b/fast_rl/agents/reinforce_models.py
new file mode 100644
index 0000000..b9cc141
--- /dev/null
+++ b/fast_rl/agents/reinforce_models.py
@@ -0,0 +1,18 @@
+from torch import nn
+
+
+class PGN(nn.Module):
+ def __init__(self, input_size, n_actions):
+ super(PGN, self).__init__()
+
+ self.net = nn.Sequential(
+ nn.Linear(input_size[0], 128),
+ nn.ReLU(),
+ nn.Linear(128, n_actions)
+ )
+ self.loss_func=None
+
+ def set_opt(self, _): pass
+
+ def forward(self, x):
+ return self.net(x)
\ No newline at end of file
diff --git a/fast_rl/core/agent_core.py b/fast_rl/core/agent_core.py
index 419a675..75b61fe 100644
--- a/fast_rl/core/agent_core.py
+++ b/fast_rl/core/agent_core.py
@@ -5,12 +5,13 @@
from fastai.basic_train import *
from fastai.torch_core import *
+from fast_rl.core.data_block import MDPStep
from fast_rl.core.data_structures import SumTree
class ExplorationStrategy:
def __init__(self, explore: bool = True): self.explore=explore
- def update(self, max_episodes, explore, **kwargs): self.explore=explore
+ def update(self,episode, max_episodes, explore, **kwargs): self.explore=explore
def perturb(self, action, action_space) -> np.ndarray:
"""
Base method just returns the action. Subclass, and change to return randomly / augmented actions.
@@ -42,7 +43,7 @@ def perturb(self, action, action_space: gym.Space):
return action_space.sample() if np.random.random()<self.epsilon else action
+ if len(self._memory)>=self.n_step or item.done:
+ super(NStepPriorityExperienceReplay,self).update(deepcopy(self._memory))
+ self._memory.clear()
+
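+ # Draw batch//n_step n-step sequences from the prioritized buffer and flatten them into individual steps.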
+ def sample(self, batch, **kwargs):
+ samples=super(NStepPriorityExperienceReplay, self).sample(batch//self.n_step)
+ self._temp_samples=samples
+ return [o for ll in samples for o in ll]
\ No newline at end of file
diff --git a/fast_rl/core/basic_train.py b/fast_rl/core/basic_train.py
index 8a7f8c9..4bff6a9 100644
--- a/fast_rl/core/basic_train.py
+++ b/fast_rl/core/basic_train.py
@@ -3,6 +3,7 @@
from fastai.basic_train import Learner, load_callback
from fastai.torch_core import *
+from fast_rl.core.agent_core import ExplorationStrategy
from fast_rl.core.data_block import MDPDataBunch
@@ -35,10 +36,11 @@ class AgentLearner(Learner):
def __init__(self, data, loss_func=None, callback_fns=None, opt=torch.optim.Adam, **kwargs):
super().__init__(data=data, callback_fns=ifnone(callback_fns, []) + data.callback, **kwargs)
- self.model.loss_func = ifnone(loss_func, F.mse_loss)
+ self.model.loss_func = ifnone(ifnone(loss_func,self.model.loss_func), F.mse_loss)
self.model.set_opt(opt)
self.loss_func = None
self.trainers = None
+ self.exploration_strategy: Union[None,ExplorationStrategy]=None
self._loss_func = WrapperLossFunc(self)
@property
@@ -53,7 +55,7 @@ def init_loss_func(self):
By default, the learner will have a `None` loss function, and so the fit function will not try to log that
loss.
"""
- self.loss_func = WrapperLossFunc(self)
+ self.loss_func=self._loss_func
def export(self, file:PathLikeOrBinaryStream='export.pkl', destroy=False, pickle_data=False):
"Export the state of the `Learner` in `self.path/file`. `file` can be file-like (file or buffer)"
diff --git a/fast_rl/core/data_block.py b/fast_rl/core/data_block.py
index 190cdca..d4d7347 100644
--- a/fast_rl/core/data_block.py
+++ b/fast_rl/core/data_block.py
@@ -157,7 +157,7 @@ def __init__(self,env,w_step:int,h_step:int):
def render(self,mode='human',**kwargs):
img = super(ResolutionWrapper,self).render(mode=mode,**kwargs)
- return img if len(img)==0 else img[::self.w_step,::self.h_step,:]
+ return img if type(img)==bool or len(img)==0 else img[::self.w_step,::self.h_step,:]
@dataclass
class Bounds(object):
@@ -393,10 +393,11 @@ class MDPStep(object):
step: int
def __post_init__(self):
- self.action = deepcopy(self.action)
- self.state = deepcopy(self.state)
- self.reward = torch.tensor(data=self.reward).reshape(1, -1).float()
- self.done = torch.tensor(data=self.done).reshape(1, -1).float()
+ with torch.no_grad():
+ self.action = deepcopy(self.action)
+ self.state = deepcopy(self.state)
+ self.reward = torch.tensor(data=self.reward).reshape(1, -1).float()
+ self.done = torch.tensor(data=self.done).reshape(1, -1).float()
def to(self, device):
self.reward = self.reward.to(device=device)
@@ -451,9 +452,10 @@ def __init__(self, learn, keep_env_open=True):
def on_batch_begin(self, last_input, last_target, train, **kwargs: Any):
r""" Set the Action of a dataset, determine if still warming up. """
a = self.learn.predict(last_input)
- if self.learn.model.training:
- self.train_ds.action = Action(taken_action=a, action_space=self.train_ds.action.action_space)
- else: self.valid_ds.action = Action(taken_action=a, action_space=self.train_ds.action.action_space)
+ with torch.no_grad():
+ if self.learn.model.training:
+ self.train_ds.action = Action(taken_action=a, action_space=self.train_ds.action.action_space)
+ else: self.valid_ds.action = Action(taken_action=a, action_space=self.train_ds.action.action_space)
self.train_ds.is_warming_up = self.learn.warming_up
if self.valid_ds is not None: self.valid_ds.is_warming_up = self.learn.warming_up
if not self.learn.warming_up and self.learn.loss_func is None: self.learn.init_loss_func()
@@ -655,9 +657,9 @@ def new(self, _):
self.s_prime, reward, done, _, self.alt_s_prime = self.stage_2_env_step()
# If both the current item and the done are both true, then we need to retry the env
if self.item is not None and self.item.d and done: return self.new(_)
-
- self.state = State(s, self.s_prime, alt_s, self.alt_s_prime, self.env.observation_space, self.feed_type)
- self.item = MDPStep(self.action, self.state, done, reward, self.episode, self.counter)
+ with torch.no_grad():
+ self.state = State(s, self.s_prime, alt_s, self.alt_s_prime, self.env.observation_space, self.feed_type)
+ self.item = MDPStep(self.action, self.state, done, reward, self.episode, self.counter)
self.counter += 1
return MDPList([self.item])
diff --git a/fast_rl/core/layers.py b/fast_rl/core/layers.py
index 3009f25..eb20bbd 100644
--- a/fast_rl/core/layers.py
+++ b/fast_rl/core/layers.py
@@ -1,89 +1,225 @@
r"""`fast_rl.layers` provides essential functions to building and modifying `model` architectures"""
from math import ceil
+from fastai.layers import embedding
from fastai.torch_core import *
-from fastai.tabular import TabularModel
+from torch.distributions import Normal
+
+
+def distr_projection(next_distr, rewards, dones, Vmin, Vmax, n_atoms, gamma):
+ """
+ Perform distribution projection aka Catergorical Algorithm from the
+ "A Distributional Perspective on RL" paper
+ """
+ batch_size = len(rewards)
+ proj_distr = np.zeros((batch_size, n_atoms), dtype=np.float32)
+ delta_z = (Vmax - Vmin) / (n_atoms - 1)
+ for atom in range(n_atoms):
+ tz_j = np.minimum(Vmax, np.maximum(Vmin, rewards + (Vmin + atom * delta_z) * gamma))
+ b_j = (tz_j - Vmin) / delta_z
+ l = np.floor(b_j).astype(np.int64)
+ u = np.ceil(b_j).astype(np.int64)
+ eq_mask = u == l
+ proj_distr[eq_mask, l[eq_mask]] += next_distr[eq_mask, atom]
+ ne_mask = u != l
+ proj_distr[ne_mask, l[ne_mask]] += next_distr[ne_mask, atom] * (u - b_j)[ne_mask]
+ proj_distr[ne_mask, u[ne_mask]] += next_distr[ne_mask, atom] * (b_j - l)[ne_mask]
+ if dones.any():
+ proj_distr[dones] = 0.0
+ tz_j = np.minimum(Vmax, np.maximum(Vmin, rewards[dones]))
+ b_j = (tz_j - Vmin) / delta_z
+ l = np.floor(b_j).astype(np.int64)
+ u = np.ceil(b_j).astype(np.int64)
+ eq_mask = u == l
+ eq_dones = dones.copy()
+ eq_dones[dones] = eq_mask
+ if eq_dones.any():
+ proj_distr[eq_dones, l[eq_mask]] = 1.0
+ ne_mask = u != l
+ ne_dones = dones.copy()
+ ne_dones[dones] = ne_mask
+ if ne_dones.any():
+ proj_distr[ne_dones, l[ne_mask]] = (u - b_j)[ne_mask]
+ proj_distr[ne_dones, u[ne_mask]] = (b_j - l)[ne_mask]
+ return proj_distr
+
+
+
+def bn_drop_lin(n_in:int, n_out:int, bn:bool=True, p:float=0., actn:Optional[nn.Module]=None,lin_cls=nn.Linear):
+ "Sequence of batchnorm (if `bn`), dropout (with `p`) and linear (`n_in`,`n_out`) layers followed by `actn`."
+ layers = [nn.BatchNorm1d(n_in)] if bn else []
+ if p != 0: layers.append(nn.Dropout(p))
+ layers.append(lin_cls(n_in, n_out))
+ if actn is not None: layers.append(actn)
+ return layers
+
+
+class TabularModel(Module):
+ "Basic model for tabular data."
+ def __init__(self, emb_szs:ListSizes, n_cont:int, out_sz:int, layers:Collection[int], ps:Collection[float]=None,
+ emb_drop:float=0., y_range:OptRange=None, use_bn:bool=True, bn_final:bool=False,lin_cls=nn.Linear):
+ super().__init__()
+ ps = ifnone(ps, [0]*len(layers))
+ ps = listify(ps, layers)
+ self.embeds = nn.ModuleList([embedding(ni, nf) for ni,nf in emb_szs])
+ self.emb_drop = nn.Dropout(emb_drop)
+ self.bn_cont = nn.BatchNorm1d(n_cont)
+ n_emb = sum(e.embedding_dim for e in self.embeds)
+ self.n_emb,self.n_cont,self.y_range = n_emb,n_cont,y_range
+ sizes = self.get_sizes(layers, out_sz)
+ actns = [nn.ReLU(inplace=True) for _ in range(len(sizes)-2)] + [None]
+ layers = []
+ for i,(n_in,n_out,dp,act) in enumerate(zip(sizes[:-1],sizes[1:],[0.]+ps,actns)):
+ layers += bn_drop_lin(n_in, n_out, bn=use_bn and i!=0, p=dp, actn=act,lin_cls=lin_cls)
+ if bn_final: layers.append(nn.BatchNorm1d(sizes[-1]))
+ self.layers = nn.Sequential(*layers)
+
+ def get_sizes(self, layers, out_sz):
+ return [self.n_emb + self.n_cont] + layers + [out_sz]
+
+ def forward(self, x_cat:Tensor, x_cont:Tensor) -> Tensor:
+ if self.n_emb != 0:
+ x = [e(x_cat[:,i]) for i,e in enumerate(self.embeds)]
+ x = torch.cat(x, 1)
+ x = self.emb_drop(x)
+ if self.n_cont != 0:
+ x_cont = self.bn_cont(x_cont)
+ x = torch.cat([x, x_cont], 1) if self.n_emb != 0 else x_cont
+ x = self.layers(x)
+ if self.y_range is not None:
+ x = (self.y_range[1]-self.y_range[0]) * torch.sigmoid(x) + self.y_range[0]
+ return x
def init_cnn(mod: Any):
- r""" Utility for initializing cnn Modules. """
- if getattr(mod, 'bias', None) is not None: nn.init.constant_(mod.bias, 0)
- if isinstance(mod, (nn.Conv2d, nn.Linear)): nn.init.kaiming_normal_(mod.weight)
- for sub_mod in mod.children(): init_cnn(sub_mod)
+ r""" Utility for initializing cnn Modules. """
+ if getattr(mod, 'bias', None) is not None: nn.init.constant_(mod.bias, 0)
+ if isinstance(mod, (nn.Conv2d, nn.Linear)): nn.init.kaiming_normal_(mod.weight)
+ for sub_mod in mod.children(): init_cnn(sub_mod)
def ks_stride(ks, stride, w, h, n_blocks, kern_proportion=.1, stride_proportion=0.3):
- r""" Utility for determing the the kernel size and stride. """
- kernels, strides, max_dim = [], [], max((w, h))
- for i in range(len(n_blocks)):
- kernels.append(max_dim * kern_proportion)
- strides.append(kernels[-1] * stride_proportion)
- max_dim = (max_dim - kernels[-1]) / strides[-1]
- assert max_dim > 1
+ r""" Utility for determing the the kernel size and stride. """
+ kernels, strides, max_dim=[], [], max((w, h))
+ for i in range(len(n_blocks)):
+ kernels.append(max_dim*kern_proportion)
+ strides.append(kernels[-1]*stride_proportion)
+ max_dim=(max_dim-kernels[-1])/strides[-1]
+ assert max_dim>1
- return ifnone(ks, map(ceil, kernels)), ifnone(stride, map(ceil, strides))
+ return ifnone(ks, map(ceil, kernels)), ifnone(stride, map(ceil, strides))
class Flatten(nn.Module):
- def forward(self, y): return y.view(y.size(0), -1)
+ def forward(self, y): return y.view(y.size(0), -1)
class FakeBatchNorm(Module):
- r""" If we want all the batch norm layers gone, then we will replace the tabular batch norm with this. """
- def forward(self, xi: Tensor, *args): return xi
+ r""" If we want all the batch norm layers gone, then we will replace the tabular batch norm with this. """
+ def forward(self, xi: Tensor, *args): return xi
+
+
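+# NoisyNet-style linear layer: learnable sigma parameters scale Gaussian noise that is
+# re-sampled on every forward pass and added to the weights and bias, so exploration
+# comes from the network itself rather than an epsilon-greedy schedule.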
+class GaussianNoisyLinear(nn.Linear):
+ def __init__(self, in_features, out_features, sigma_zero=0.4, bias=True):
+ super(GaussianNoisyLinear, self).__init__(in_features, out_features, bias=bias)
+ sigma_init = sigma_zero / math.sqrt(in_features)
+ self.sigma_weight = nn.Parameter(torch.full((out_features, in_features), sigma_init))
+ self.register_buffer("epsilon_input", torch.zeros(1, in_features))
+ self.register_buffer("epsilon_output", torch.zeros(out_features, 1))
+ if bias:
+ self.sigma_bias = nn.Parameter(torch.full((out_features,), sigma_init))
+
+ def forward(self, input):
+ self.epsilon_input.normal_()
+ self.epsilon_output.normal_()
+
+ func = lambda x: torch.sign(x) * torch.sqrt(torch.abs(x))
+ eps_in = func(self.epsilon_input.data)
+ eps_out = func(self.epsilon_output.data)
+
+ bias = self.bias
+ if bias is not None:
+ bias = bias + self.sigma_bias * eps_out.t()
+ noise_v = torch.mul(eps_in, eps_out)
+ return F.linear(input, self.weight + self.sigma_weight * noise_v, bias)
+
+class GaussianNoisyFactorizedLinear(nn.Linear):
+ def __init__(self, in_features, out_features,sigma_zero=0.4,bias=True):
+ super().__init__(in_features, out_features,bias=bias)
+ sigma_init=sigma_zero/math.sqrt(in_features)
+ self.sigma_weight=nn.Parameter(torch.full((out_features,in_features),sigma_init))
+ self.register_buffer("epsilon_input",torch.zeros((1,in_features)))
+ self.register_buffer("epsilon_output", torch.zeros((out_features,1)))
+ if bias:
+ self.sigma_bias=nn.Parameter(torch.full((out_features,),sigma_init))
+
+ def square_direction(self,x): return torch.sign(x)*torch.sqrt(torch.abs(x))
+
+ def forward(self, xi):
+ self.epsilon_input.normal_()
+ self.epsilon_output.normal_()
+
+ eps_in,eps_out=self.square_direction(self.epsilon_input),self.square_direction(self.epsilon_output)
+
+ bias=self.bias
+ if bias is not None:
+ bias=bias+self.sigma_bias*eps_out.t()
+ noise_v=torch.mul(eps_in,eps_out)
+ return F.linear(xi,self.weight+self.sigma_weight*noise_v,bias)
+
def conv_bn_lrelu(ni: int, nf: int, ks: int = 3, stride: int = 1, pad=True, bn=True) -> nn.Sequential:
- r""" Create a sequence Conv2d->BatchNorm2d->LeakyReLu layer. (from darknet.py). Allows excluding BatchNorm2d Layer."""
- return nn.Sequential(
- nn.Conv2d(ni, nf, kernel_size=ks, bias=False, stride=stride, padding=(ks // 2) if pad else 0),
- nn.BatchNorm2d(nf) if bn else FakeBatchNorm(),
- nn.LeakyReLU(negative_slope=0.1, inplace=True))
+ r""" Create a sequence Conv2d->BatchNorm2d->LeakyReLu layer. (from darknet.py). Allows excluding BatchNorm2d Layer."""
+ return nn.Sequential(
+ nn.Conv2d(ni, nf, kernel_size=ks, bias=False, stride=stride, padding=(ks//2) if pad else 0),
+ nn.BatchNorm2d(nf) if bn else FakeBatchNorm(),
+ nn.LeakyReLU(negative_slope=0.1, inplace=True))
class ChannelTranspose(Module):
- r""" Runtime image input channel changing. Useful for handling different image channel outputs from different envs. """
- def forward(self, xi: Tensor):
- return xi.transpose(3, 1).transpose(3, 2)
+ r""" Runtime image input channel changing. Useful for handling different image channel outputs from different envs. """
+ def forward(self, xi: Tensor):
+ return xi.transpose(3, 1).transpose(3, 2)
class StateActionSplitter(Module):
- r""" `Actor / Critic` models require breaking the state and action into 2 streams. """
+ r""" `Actor / Critic` models require breaking the state and action into 2 streams. """
- def forward(self, s_a_tuple: Tuple[Tensor]):
- r""" Returns tensors as -> (State Tensor, Action Tensor) """
- return s_a_tuple[0], s_a_tuple[1]
+ def forward(self, s_a_tuple: Tuple[Tensor]):
+ r""" Returns tensors as -> (State Tensor, Action Tensor) """
+ return s_a_tuple[0], s_a_tuple[1]
class StateActionPassThrough(nn.Module):
- r""" Passes action input untouched, but runs the state tensors through a sub module. """
- def __init__(self, layers):
- super().__init__()
- self.layers = layers
+ r""" Passes action input untouched, but runs the state tensors through a sub module. """
+ def __init__(self, layers):
+ super().__init__()
+ self.layers=layers
- def forward(self, state_action):
- return self.layers(state_action[0]), state_action[1]
+ def forward(self, state_action):
+ return self.layers(state_action[0]), state_action[1]
class TabularEmbedWrapper(Module):
- r""" Basic `TabularModel` compatibility wrapper. Typically, state inputs will be either categorical or continuous. """
- def __init__(self, tabular_model: TabularModel):
- super().__init__()
- self.tabular_model = tabular_model
+ r""" Basic `TabularModel` compatibility wrapper. Typically, state inputs will be either categorical or continuous. """
+ def __init__(self, tabular_model: TabularModel):
+ super().__init__()
+ self.tabular_model=tabular_model
- def forward(self, xi: Tensor, *args):
- return self.tabular_model(xi, xi)
+ def forward(self, xi: Tensor, *args):
+ return self.tabular_model(xi, xi)
class CriticTabularEmbedWrapper(Module):
- r""" Similar to `TabularEmbedWrapper` but assumes input is state / action and requires concatenation. """
- def __init__(self, tabular_model: TabularModel, exclude_cat):
- super().__init__()
- self.tabular_model = tabular_model
- self.exclude_cat = exclude_cat
-
- def forward(self, args):
- if not self.exclude_cat:
- return self.tabular_model(*args)
- else:
- return self.tabular_model(0, torch.cat(args, 1))
+ r""" Similar to `TabularEmbedWrapper` but assumes input is state / action and requires concatenation. """
+ def __init__(self, tabular_model: TabularModel, exclude_cat):
+ super().__init__()
+ self.tabular_model=tabular_model
+ self.exclude_cat=exclude_cat
+
+ def forward(self, args):
+ if not self.exclude_cat:
+ return self.tabular_model(*args)
+ else:
+ return self.tabular_model(0, torch.cat(args, 1))
diff --git a/fast_rl/core/metrics.py b/fast_rl/core/metrics.py
index 02d293b..e66b216 100644
--- a/fast_rl/core/metrics.py
+++ b/fast_rl/core/metrics.py
@@ -1,6 +1,9 @@
+from collections import deque
+
import torch
-from fastai.basic_train import LearnerCallback, Any
+from fastai.basic_train import LearnerCallback, Any, ifnone
from fastai.callback import Callback, is_listy, add_metrics
+import numpy as np
class EpsilonMetric(LearnerCallback):
@@ -43,3 +46,29 @@ def on_train_begin(self, **kwargs):
def on_epoch_end(self, last_metrics, **kwargs: Any):
return add_metrics(last_metrics, [sum(self.train_reward), sum(self.valid_reward)])
+
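+# Logs the reward summed over each epoch, averaged across a rolling window of recent epochs
+# (window size defaults to the DataBunch batch size).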
+class RollingRewardMetric(LearnerCallback):
+ _order = -20
+
+ def __init__(self, learn,rolling_size=None):
+ super().__init__(learn)
+ self.rolling_sz=ifnone(rolling_size,self.learn.data.bs)
+ self.train_reward, self.valid_reward = [], []
+ self.train_rolling_reward,self.valid_rolling_reward=deque([],maxlen=self.rolling_sz), deque([],maxlen=self.rolling_sz)
+
+ def on_epoch_begin(self, **kwargs:Any):
+ self.train_reward, self.valid_reward = [], []
+
+ def on_batch_end(self, **kwargs: Any):
+ if self.learn.model.training: self.train_reward.append(self.learn.data.train_ds.item.reward.cpu().numpy()[0][0])
+ elif not self.learn.recorder.no_val: self.valid_reward.append(self.learn.data.valid_ds.item.reward.cpu().numpy()[0][0])
+
+ def on_train_begin(self, **kwargs):
+ metric_names = ['train_rolling_reward'] if self.learn.recorder.no_val else ['train_rolling_reward', 'valid_rolling_reward']
+ self.learn.recorder.add_metric_names(metric_names)
+
+ def on_epoch_end(self, last_metrics, **kwargs: Any):
+ self.train_rolling_reward.append(sum(self.train_reward))
+ self.valid_rolling_reward.append(sum(self.valid_reward))
+ return add_metrics(last_metrics, [np.average(self.train_rolling_reward), np.average(self.valid_rolling_reward)])
+
diff --git a/tests/test_cem.py b/tests/test_cem.py
new file mode 100644
index 0000000..40db469
--- /dev/null
+++ b/tests/test_cem.py
@@ -0,0 +1,23 @@
+import pytest
+from fastai.tabular.data import emb_sz_rule
+
+from fast_rl.agents.cem import CEMLearner, CEMTrainer
+from fast_rl.agents.cem_models import CEMModel
+from fast_rl.core.data_block import MDPDataBunch
+import numpy as np
+
+from fast_rl.core.metrics import RewardMetric, RollingRewardMetric
+
+
+@pytest.mark.usefixtures('skip_performance_check')
+def test_cem():
+ data=MDPDataBunch.from_env('CartPole-v0',render='human',add_valid=False,bs=16)
+ bs, state, action=data.bs, data.state, data.action
+ if np.any(state.n_possible_values==np.inf):
+ emb_szs=[]
+ else:
+ emb_szs=[(d+1, int(emb_sz_rule(d))) for d in state.n_possible_values.reshape(-1, )]
+
+ model=CEMModel(4,2,embed_szs=emb_szs,layers=[128])
+ cem_learner=CEMLearner(data,model,trainers=[CEMTrainer],callback_fns=[RewardMetric,RollingRewardMetric])
+ cem_learner.fit(600,lr=0.01,wd=0)
diff --git a/tests/test_dist_dqn.py b/tests/test_dist_dqn.py
new file mode 100644
index 0000000..7ac826f
--- /dev/null
+++ b/tests/test_dist_dqn.py
@@ -0,0 +1,25 @@
+import pytest
+
+from fast_rl.agents.dist_dqn import DistDQNLearner, BaseDistDQNTrainer
+from fast_rl.agents.dist_dqn_models import DistributionalDQN
+from fast_rl.core.data_block import MDPDataBunch, partial, ResolutionWrapper
+from fast_rl.core.metrics import *
+
+
+def test_dist_dqn():
+ data=MDPDataBunch.from_env('CartPole-v0', render='rgb_array', bs=4, add_valid=False, keep_env_open=False,
+ res_wrap=partial(ResolutionWrapper, w_step=4, h_step=4))
+ model=DistributionalDQN((4,),2)
+ metrics=[RewardMetric, RollingRewardMetric,EpsilonMetric]
+ learner=DistDQNLearner(data=data,model=model,trainers=BaseDistDQNTrainer,callback_fns=metrics,loss_func=lambda x,y:x)
+ learner.fit(3,wd=0,lr=0.0001)
+
+
+@pytest.mark.usefixtures('skip_performance_check')
+def test_dist_dqn_perf():
+ data=MDPDataBunch.from_env('CartPole-v0', render='human', bs=32, add_valid=False, keep_env_open=False,
+ res_wrap=partial(ResolutionWrapper, w_step=4, h_step=4))
+ model=DistributionalDQN((4,),2)
+ metrics=[RewardMetric, RollingRewardMetric,EpsilonMetric]
+ learner=DistDQNLearner(data=data,model=model,trainers=BaseDistDQNTrainer,callback_fns=metrics,loss_func=lambda x,y:x)
+ learner.fit(1600,wd=0,lr=0.0001)
\ No newline at end of file
diff --git a/tests/test_dqn.py b/tests/test_dqn.py
index 141d104..26db85e 100644
--- a/tests/test_dqn.py
+++ b/tests/test_dqn.py
@@ -6,15 +6,16 @@
from fast_rl.agents.dqn import create_dqn_model, dqn_learner, DQNLearner
from fast_rl.agents.dqn_models import *
-from fast_rl.core.agent_core import ExperienceReplay, PriorityExperienceReplay, GreedyEpsilon
+from fast_rl.core.agent_core import ExperienceReplay, PriorityExperienceReplay, GreedyEpsilon, NStepExperienceReplay, \
+ ExplorationStrategy
from fast_rl.core.data_block import MDPDataBunch, FEED_TYPE_STATE, FEED_TYPE_IMAGE, ResolutionWrapper
-from fast_rl.core.metrics import RewardMetric, EpsilonMetric
+from fast_rl.core.metrics import RewardMetric, EpsilonMetric, RollingRewardMetric
from fast_rl.core.train import GroupAgentInterpretation, AgentInterpretation
from torch import optim
p_model = [DQNModule, FixedTargetDQNModule,DoubleDuelingModule,DuelingDQNModule,DoubleDQNModule]
p_exp = [ExperienceReplay,
- PriorityExperienceReplay]
+ PriorityExperienceReplay,NStepExperienceReplay]
p_format = [FEED_TYPE_STATE]#, FEED_TYPE_IMAGE]
p_envs = ['CartPole-v1']
@@ -37,18 +38,21 @@ def learner2gif(lnr:DQNLearner,s_format,group_interp:GroupAgentInterpretation,na
def trained_learner(model_cls, env, s_format, experience, bs, layers, memory_size=1000000, decay=0.001,
- copy_over_frequency=300, lr=None, epochs=450,**kwargs):
+ copy_over_frequency=300, lr=None, epochs=450,lin_cls=None,explore=None,model_kwargs=None,**kwargs):
if lr is None: lr = [0.001, 0.00025]
+ model_kwargs=ifnone(model_kwargs,{})
memory = experience(memory_size=memory_size, reduce_ram=True)
- explore = GreedyEpsilon(epsilon_start=1, epsilon_end=0.1, decay=decay)
+ metrics=[RewardMetric, RollingRewardMetric]
+ if explore is None: metrics.append(EpsilonMetric)
+ explore = ifnone(explore,GreedyEpsilon(epsilon_start=1, epsilon_end=0.02, decay=decay))
if type(lr) == list: lr = lr[0] if model_cls == DQNModule else lr[1]
data = MDPDataBunch.from_env(env, render='human', bs=bs, add_valid=False, keep_env_open=False, feed_type=s_format,
- memory_management_strategy='k_partitions_top', k=3,**kwargs)
- if model_cls == DQNModule: model = create_dqn_model(data=data, base_arch=model_cls, lr=lr, layers=layers, opt=optim.RMSprop)
- else: model = create_dqn_model(data=data, base_arch=model_cls, lr=lr, layers=layers)
+ memory_management_strategy='k_partitions_top', k=1,**kwargs)
+ if model_cls == DQNModule: model = create_dqn_model(data=data, base_arch=model_cls, lr=lr, layers=layers, opt=optim.Adam,lin_cls=ifnone(lin_cls,nn.Linear),**model_kwargs)
+ else: model = create_dqn_model(data=data, base_arch=model_cls, lr=lr, layers=layers,lin_cls=ifnone(lin_cls,nn.Linear),**model_kwargs)
learn = dqn_learner(data, model, memory=memory, exploration_method=explore, copy_over_frequency=copy_over_frequency,
- callback_fns=[RewardMetric, EpsilonMetric])
- learn.fit(epochs)
+ callback_fns=metrics)
+ learn.fit(epochs,wd=0)
return learn
# @pytest.mark.usefixtures('skip_performance_check')
@@ -103,28 +107,6 @@ def test_dqn_fit_maze_env(model_cls, s_format, mem):
memory_size=1000000, decay=0.00001, res_wrap=partial(ResolutionWrapper, w_step=3, h_step=3))
learner2gif(learn,s_format,group_interp,'maze_5x5',extra_s)
- # success = False
- # while not success:
- # try:
- # data = MDPDataBunch.from_env('maze-random-5x5-v0', render='rgb_array', bs=5, max_steps=20,
- # add_valid=False, keep_env_open=False, feed_type=s_format)
- # model = create_dqn_model(data, model_cls, opt=torch.optim.RMSprop)
- # memory = ExperienceReplay(10000)
- # exploration_method = GreedyEpsilon(epsilon_start=1, epsilon_end=0.1, decay=0.001)
- # learner = dqn_learner(data=data, model=model, memory=memory, exploration_method=exploration_method,
- # callback_fns=[RewardMetric, EpsilonMetric])
- # learner.fit(2)
- #
- # assert config_env_expectations['maze-random-5x5-v0']['action_shape'] == (
- # 1, data.action.n_possible_values.item())
- # if s_format == FEED_TYPE_STATE:
- # assert config_env_expectations['maze-random-5x5-v0']['state_shape'] == data.state.s.shape
- # sleep(1)
- # success = True
- # except Exception as e:
- # if not str(e).__contains__('Surface'):
- # raise Exception
-
@pytest.mark.usefixtures('skip_performance_check')
@pytest.mark.parametrize(["model_cls", "s_format", 'experience'], list(product(p_model, p_format, p_exp)))
@@ -148,6 +130,19 @@ def test_dqn_models_minigrids(model_cls, s_format, experience):
@pytest.mark.parametrize(["model_cls", "s_format", 'experience'],
list(product(p_model, p_format, p_exp)))
def test_dqn_models_cartpole(model_cls, s_format, experience):
+ group_interp = GroupAgentInterpretation()
+ extra_s=f'{experience.__name__}_{model_cls.__name__}_{s_format}'
+ for i in range(1):
+ learn = trained_learner(model_cls, 'CartPole-v1', s_format, experience, bs=32, layers=[64, 64],epochs=200,
+ memory_size=1000000,decay=0.001,res_wrap=partial(ResolutionWrapper, w_step=3, h_step=3))
+ # learner2gif(learn,s_format,group_interp,'cartpole',extra_s)
+
+
+@pytest.mark.usefixtures('skip_performance_check')
+@pytest.mark.parametrize(["s_format", 'experience'],
+ list(product(p_format, p_exp)))
+def test_dqn_models_categorical_cartpole(s_format, experience):
+ model_cls=DistributionalDQN
group_interp = GroupAgentInterpretation()
extra_s=f'{experience.__name__}_{model_cls.__name__}_{s_format}'
for i in range(5):
@@ -155,14 +150,40 @@ def test_dqn_models_cartpole(model_cls, s_format, experience):
memory_size=1000000, decay=0.001)
learner2gif(learn,s_format,group_interp,'cartpole',extra_s)
- # meta = f'{experience.__name__}_{"FEED_TYPE_STATE" if s_format == FEED_TYPE_STATE else "FEED_TYPE_IMAGE"}'
- # interp = AgentInterpretation(learn, ds_type=DatasetType.Train)
- # interp.plot_rewards(cumulative=True, per_episode=True, group_name=meta)
- # group_interp.add_interpretation(interp)
- # filename = f'{learn.model.name.lower()}_{meta}'
- # group_interp.to_pickle(f'../docs_src/data/cartpole_{learn.model.name.lower()}/', filename)
- # [g.write('../res/run_gifs/cartpole') for g in interp.generate_gif()]
- # del learn
+
+
+@pytest.mark.usefixtures('skip_performance_check')
+@pytest.mark.parametrize(["s_format"],
+ list(product(p_format)))
+def test_dqn_models_distributional_cartpole(s_format):
+ experience=ExperienceReplay
+ group_interp = GroupAgentInterpretation()
+ model_cls=DistributionalDQN
+ extra_s=f'{experience.__name__}_{model_cls.__name__}_{s_format}'
+ for i in range(1):
+ learn = trained_learner(model_cls, 'CartPole-v1', s_format, experience, bs=32, layers=[512],lr=[0.0001,0.0001],
+ memory_size=100000, decay=0.008,model_kwargs={'v_max':10,'v_min':-10,'n_atoms':51,'opt':optim.Adam},
+ res_wrap=partial(ResolutionWrapper, w_step=3, h_step=3),epochs=800,copy_over_frequency=300)
+
+ # learner2gif(learn,s_format,group_interp,'cartpole',extra_s)
+
+
+layer_clss=[GaussianNoisyLinear,GaussianNoisyFactorizedLinear]
+
+
+@pytest.mark.usefixtures('skip_performance_check')
+@pytest.mark.parametrize(["model_cls", "s_format",'layer_cls'],
+ list(product(p_model, p_format,layer_clss)))
+def test_dqn_models_noisy_layers_cartpole(model_cls, s_format,layer_cls):
+ experience=ExperienceReplay
+ group_interp = GroupAgentInterpretation()
+ extra_s=f'{experience.__name__}_{model_cls.__name__}_{s_format}_{layer_cls.__name__}'
+ for i in range(1):
+ # Since we are using noisy layers, just use default exploration strategy (no exploration)
+ learn = trained_learner(model_cls, 'CartPole-v1', s_format, experience, bs=32, layers=[512],lr=0.0001,
+ memory_size=100000, decay=0.01,lin_cls=layer_cls,explore=ExplorationStrategy(),epochs=800)
+
+ learner2gif(learn,s_format,group_interp,'cartpole_layer_exp',extra_s)
@pytest.mark.usefixtures('skip_performance_check')
@@ -177,14 +198,6 @@ def test_dqn_models_lunarlander(model_cls, s_format, experience):
learner2gif(learn, s_format, group_interp, 'lunarlander', extra_s)
del learn
gc.collect()
- # meta = f'{experience.__name__}_{"FEED_TYPE_STATE" if s_format == FEED_TYPE_STATE else "FEED_TYPE_IMAGE"}'
- # interp = AgentInterpretation(learn, ds_type=DatasetType.Train)
- # interp.plot_rewards(cumulative=True, per_episode=True, group_name=meta)
- # group_interp.add_interpretation(interp)
- # filename = f'{learn.model.name.lower()}_{meta}'
- # group_interp.to_pickle(f'../docs_src/data/lunarlander_{learn.model.name.lower()}/', filename)
- # [g.write('../res/run_gifs/lunarlander') for g in interp.generate_gif()]
- # del learn
@pytest.mark.usefixtures('skip_performance_check')
diff --git a/tests/test_rainbow_dqn.py b/tests/test_rainbow_dqn.py
new file mode 100644
index 0000000..1cb10cf
--- /dev/null
+++ b/tests/test_rainbow_dqn.py
@@ -0,0 +1,27 @@
+from time import sleep
+
+import pytest
+
+from fast_rl.agents.rainbow_dqn import *
+from fast_rl.agents.rainbow_dqn_models import RainbowDQN
+from fast_rl.core.data_block import MDPDataBunch, partial, ResolutionWrapper
+from fast_rl.core.metrics import *
+
+
+def test_rainbow_dqn():
+ data=MDPDataBunch.from_env('CartPole-v0', render='rgb_array', bs=4, add_valid=False, keep_env_open=False,
+ res_wrap=partial(ResolutionWrapper, w_step=4, h_step=4))
+ model=RainbowDQN((4,),2)
+ metrics=[RewardMetric, RollingRewardMetric,EpsilonMetric]
+ learner=RainbowDQNLearner(data=data,model=model,trainers=BaseRainbowDQNTrainer,callback_fns=metrics,loss_func=lambda x,y:x)
+ learner.fit(3,wd=0,lr=0.0001)
+
+
+@pytest.mark.usefixtures('skip_performance_check')
+def test_rainbow_dqn_perf():
+ data=MDPDataBunch.from_env('CartPole-v0', render='human', bs=32, add_valid=False, keep_env_open=False,
+ res_wrap=partial(ResolutionWrapper, w_step=4, h_step=4))
+ model=RainbowDQN((4,),2)
+ metrics=[RewardMetric, RollingRewardMetric,EpsilonMetric]
+ learner=RainbowDQNLearner(data=data,model=model,trainers=BaseRainbowDQNTrainer,callback_fns=metrics,use_per=False,loss_func=lambda x,y:x)
+ learner.fit(1600,wd=0,lr=0.0001)
diff --git a/tests/test_reinforce.py b/tests/test_reinforce.py
new file mode 100644
index 0000000..26892ef
--- /dev/null
+++ b/tests/test_reinforce.py
@@ -0,0 +1,26 @@
+import pytest
+
+from fast_rl.agents.reinforce import BaseReinforceTrainer, ReinforceLearner
+from fast_rl.agents.reinforce_models import PGN
+from fast_rl.core.data_block import MDPDataBunch, partial, ResolutionWrapper
+from fast_rl.core.metrics import *
+
+
+def test_reinforce():
+ data=MDPDataBunch.from_env('CartPole-v0', render='rgb_array', bs=1, add_valid=False, keep_env_open=False,
+ res_wrap=partial(ResolutionWrapper, w_step=4, h_step=4))
+ model=PGN((4,),2)
+ metrics=[RewardMetric, RollingRewardMetric,EpsilonMetric]
+ learner=ReinforceLearner(data=data,model=model,trainers=BaseReinforceTrainer,episodes_to_train=4,
+ callback_fns=metrics,loss_func=lambda x,y:x)
+ learner.fit(3,wd=0,lr=0.0001)
+
+
+@pytest.mark.usefixtures('skip_performance_check')
+def test_reinforce_perf():
+ data=MDPDataBunch.from_env('CartPole-v0', render='rgb_array', bs=1, add_valid=False, keep_env_open=False,
+ res_wrap=partial(ResolutionWrapper, w_step=4, h_step=4))
+ model=PGN((4,),2)
+ metrics=[RewardMetric, RollingRewardMetric,EpsilonMetric]
+ learner=ReinforceLearner(data=data,model=model,trainers=BaseReinforceTrainer,episodes_to_train=4,callback_fns=metrics,loss_func=lambda x,y:x)
+ learner.fit(500,wd=0,lr=0.01)