
Commit 095a331

Merge commit 095a331 (2 parents: e55a95f + 94e39ab)

File tree: 9 files changed (+957 / -11 lines)


README.md

Lines changed: 34 additions & 2 deletions
@@ -1,9 +1,23 @@
-# Popular Model-free Reinforcement Learning Algorithms [![Tweet](https://img.shields.io/twitter/url/http/shields.io.svg?style=social)](https://twitter.com/intent/tweet?text=State-of-the-art-Model-free-Reinforcement-Learning-Algorithms%20&url=hhttps://github.com/quantumiracle/STOA-RL-Algorithms&hashtags=RL)
+# Popular Model-free Reinforcement Learning Algorithms
+<!-- [![Tweet](https://img.shields.io/twitter/url/http/shields.io.svg?style=social)](https://twitter.com/intent/tweet?text=State-of-the-art-Model-free-Reinforcement-Learning-Algorithms%20&url=hhttps://github.com/quantumiracle/STOA-RL-Algorithms&hashtags=RL) -->
 
 **PyTorch** and **Tensorflow 2.0** implementation of state-of-the-art model-free reinforcement learning algorithms on both OpenAI Gym environments and a self-implemented Reacher environment.
 
-Algorithms include **Soft Actor-Critic (SAC), Deep Deterministic Policy Gradient (DDPG), Twin Delayed DDPG (TD3), Actor-Critic (AC/A2C), Proximal Policy Optimization (PPO), QT-Opt (including Cross-entropy (CE) Method)**, **PointNet**, **Transporter**, **Recurrent Policy Gradient**, **Soft Decision Tree**, **Probabilistic Mixture-of-Experts**, etc.
+Algorithms include:
+* **Actor-Critic (AC/A2C)**;
+* **Soft Actor-Critic (SAC)**;
+* **Deep Deterministic Policy Gradient (DDPG)**;
+* **Twin Delayed DDPG (TD3)**;
+* **Proximal Policy Optimization (PPO)**;
+* **QT-Opt (including Cross-entropy (CE) Method)**;
+* **PointNet**;
+* **Transporter**;
+* **Recurrent Policy Gradient**;
+* **Soft Decision Tree**;
+* **Probabilistic Mixture-of-Experts**;
+* **QMIX**;
+* etc.
 
 Please note that this repo is more a personal collection of algorithms I implemented and tested during my research and study than an official open-source library/package. However, I think sharing it could be helpful to others, and I welcome discussions on my implementations. I did not spend much time cleaning or structuring the code, and you may notice several versions of the implementation for some algorithms; I intentionally keep all of them here for reference and comparison. Also, this repo contains only the **PyTorch** implementation.
@@ -36,6 +50,10 @@ Since Tensorflow 2.0 has already incorporated the dynamic graph construction ins
 `sac_discrete.py`: for discrete action space.
 
 paper (the author is actually one of my classmates at IC): https://arxiv.org/abs/1910.07207
+
+**SAC Discrete PER**
+
+`sac_discrete_per.py`: for discrete action space, with prioritized experience replay (PER).
 
 * **Deep Deterministic Policy Gradient (DDPG)**:

@@ -86,6 +104,8 @@ Since Tensorflow 2.0 has already incorporated the dynamic graph construction ins
 `td3_lstm.py`: TD3 with LSTM policy.
 
 `sac_v2_lstm.py`: SAC with LSTM policy.
+
+`sac_v2_gru.py`: SAC with GRU policy.
 
 References:
@@ -107,6 +127,18 @@ Since Tensorflow 2.0 has already incorporated the dynamic graph construction ins
 paper: [Probabilistic Mixture-of-Experts for Efficient Deep Reinforcement Learning](https://arxiv.org/pdf/2104.09122)
 
+* **QMIX**:
+
+`qmix.py`: a fully cooperative multi-agent RL algorithm; the demo environment uses [pettingzoo](https://www.pettingzoo.ml/atari/entombed_cooperative).
+
+paper: http://proceedings.mlr.press/v80/rashid18a.html
+
+* **Phasic Policy Gradient (PPG)**:
+
+todo
+
+paper: [Phasic Policy Gradient](http://proceedings.mlr.press/v139/cobbe21a.html)
 
 * **Maximum a Posteriori Policy Optimisation (MPO)**:
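For readers unfamiliar with QMIX, below is a minimal PyTorch sketch of the mixing-network idea from the paper (Rashid et al., 2018); it is not the code in `qmix.py`. Per-agent Q-values are combined into a joint Q_tot by a small mixing network whose weights are generated by hypernetworks conditioned on the global state and constrained to be non-negative, which keeps Q_tot monotonic in each agent's Q-value. All layer sizes and names here are illustrative.

```python
import torch
import torch.nn as nn

class QMixer(nn.Module):
    def __init__(self, n_agents, state_dim, mixing_dim=32):
        super().__init__()
        self.n_agents = n_agents
        self.mixing_dim = mixing_dim
        # hypernetworks: map the global state to the mixing network's weights and biases
        self.hyper_w1 = nn.Linear(state_dim, n_agents * mixing_dim)
        self.hyper_b1 = nn.Linear(state_dim, mixing_dim)
        self.hyper_w2 = nn.Linear(state_dim, mixing_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, mixing_dim), nn.ReLU(),
                                      nn.Linear(mixing_dim, 1))

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents), state: (batch, state_dim)
        batch = agent_qs.size(0)
        # abs() enforces non-negative mixing weights -> monotonicity of Q_tot in each agent's Q
        w1 = torch.abs(self.hyper_w1(state)).view(batch, self.n_agents, self.mixing_dim)
        b1 = self.hyper_b1(state).view(batch, 1, self.mixing_dim)
        hidden = torch.relu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)   # (batch, 1, mixing_dim)
        w2 = torch.abs(self.hyper_w2(state)).view(batch, self.mixing_dim, 1)
        b2 = self.hyper_b2(state).view(batch, 1, 1)
        q_tot = torch.bmm(hidden, w2) + b2                               # (batch, 1, 1)
        return q_tot.view(batch, 1)

# example: 2 agents, global state of size 8, batch of 4
mixer = QMixer(n_agents=2, state_dim=8)
q_tot = mixer(torch.randn(4, 2), torch.randn(4, 8))   # -> shape (4, 1)
```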

Binary file (1.77 KB) not shown.

common/buffers.py

Lines changed: 50 additions & 5 deletions
@@ -36,6 +36,56 @@ def __len__(
     def get_length(self):
         return len(self.buffer)
 
+class ReplayBufferPER:
+    """
+    Replay buffer with Prioritized Experience Replay (PER),
+    using the TD error as sampling weight. This is a simple version without a sum tree.
+
+    Reference:
+    https://github.com/Felhof/DiscreteSAC/blob/main/utilities/ReplayBuffer.py
+    """
+    def __init__(self, capacity):
+        self.capacity = capacity
+        self.buffer = []
+        self.position = 0
+        self.weights = np.zeros(int(capacity))
+        self.max_weight = 10**-2
+        self.delta = 10**-4
+
+    def push(self, state, action, reward, next_state, done):
+        if len(self.buffer) < self.capacity:
+            self.buffer.append(None)
+        self.buffer[self.position] = (state, action, reward, next_state, done)
+        self.weights[self.position] = self.max_weight  # a new sample gets the maximum weight
+        self.position = int((self.position + 1) % self.capacity)  # as a ring buffer
+
+    def sample(self, batch_size):
+        # sample over the filled part of the buffer, so entries are not lost after the ring buffer wraps around
+        upper_bound = len(self.buffer)
+        set_weights = self.weights[:upper_bound] + self.delta
+        probabilities = set_weights / np.sum(set_weights)
+        self.indices = np.random.choice(range(upper_bound), batch_size, p=probabilities, replace=False)
+        batch = [self.buffer[i] for i in self.indices]
+        state, action, reward, next_state, done = map(np.stack, zip(*batch))  # stack for each element
+        '''
+        the * serves as unpack: sum(a,b) <=> batch=(a,b), sum(*batch) ;
+        zip: a=[1,2], b=[2,3], zip(a,b) => [(1, 2), (2, 3)] ;
+        the map serves as mapping the function on each list element: map(square, [2,3]) => [4,9] ;
+        np.stack((1,2)) => array([1, 2])
+        '''
+        return state, action, reward, next_state, done
+
+    def update_weights(self, prediction_errors):
+        # after a learning step, use the new TD errors of the sampled batch as its priorities
+        max_error = max(prediction_errors)
+        self.max_weight = max(self.max_weight, max_error)
+        self.weights[self.indices] = prediction_errors
+
+    def __len__(self):
+        # note: len(replay_buffer) is not available through a multiprocessing manager proxy, hence get_length() below
+        return len(self.buffer)
+
+    def get_length(self):
+        return len(self.buffer)
 
 class ReplayBufferLSTM:
     """
@@ -73,9 +123,6 @@ def __len__(
     def get_length(self):
         return len(self.buffer)
 
-
-
-
 class ReplayBufferLSTM2:
     """
     Replay buffer for agent with LSTM network additionally storing previous action,
@@ -128,8 +175,6 @@ def get_length(self):
         return len(self.buffer)
 
 
-
-
 class ReplayBufferGRU:
     """
     Replay buffer for agent with GRU network additionally storing previous action,
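For context, here is a minimal usage sketch of the `ReplayBufferPER` class added above, assuming it is importable from `common/buffers.py` as laid out in this repo. Each transition is sampled with probability proportional to its stored weight plus `delta`, and `update_weights()` writes the sampled batch's latest TD errors back as priorities; the TD errors below are random placeholders for whatever the learner actually computes.

```python
import numpy as np
from common.buffers import ReplayBufferPER  # path as laid out in this repo

buffer = ReplayBufferPER(capacity=1000)

# fill the buffer with dummy transitions (4-dim state, binary action)
for _ in range(100):
    state, next_state = np.random.randn(4), np.random.randn(4)
    buffer.push(state, np.random.randint(2), np.random.randn(), next_state, False)

# draw a prioritized batch: sampling probability is proportional to weight + delta
states, actions, rewards, next_states, dones = buffer.sample(batch_size=32)

# after a learning step, write the batch's absolute TD errors back as new priorities
td_errors = np.abs(np.random.randn(32))  # placeholder for |Q_target - Q_predicted|
buffer.update_weights(td_errors)
```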

common/wrappers.py

Lines changed: 80 additions & 0 deletions
@@ -0,0 +1,80 @@
+import numpy as np
+import gym
+
+
+class Dict2TupleWrapper():
+    """ Wrap the PettingZoo envs to have a similar style as LaserFrame in NFSP """
+    def __init__(self, env, keep_info=False):
+        super(Dict2TupleWrapper, self).__init__()
+        self.env = env
+        self.num_agents = env.num_agents
+        self.keep_info = keep_info  # if True, keep info as a dict
+        if len(env.observation_space.shape) > 1:  # image observation
+            old_shape = env.observation_space.shape
+            self.observation_space = gym.spaces.Box(low=0.0, high=1.0, shape=(old_shape[-1], old_shape[0], old_shape[1]), dtype=np.uint8)
+            self.obs_type = 'rgb_image'
+        else:
+            self.observation_space = env.observation_space
+            self.obs_type = 'ram'
+        self.action_space = env.action_space
+        self.observation_spaces = env.observation_spaces
+        self.action_spaces = env.action_spaces
+        try:  # both pettingzoo and slimevolley can work with this
+            self.agents = env.agents
+        except AttributeError:
+            self.agents = env.unwrapped.agents
+
+    @property
+    def unwrapped(self):
+        return self.env
+
+    @property
+    def spec(self):
+        return self.env.spec
+
+    def observation_swapaxis(self, observation):
+        # swap the channel axis to the front for each agent's image observation: (H, W, C) -> (C, W, H)
+        return (np.swapaxes(observation[0], 2, 0), np.swapaxes(observation[1], 2, 0))
+
+    def reset(self):
+        obs_dict = self.env.reset()
+        if self.obs_type == 'ram':
+            return tuple(obs_dict.values())
+        else:
+            return self.observation_swapaxis(tuple(obs_dict.values()))
+
+    def step(self, actions):
+        actions = {agent_name: action for agent_name, action in zip(self.agents, actions)}
+        obs, rewards, dones, infos = self.env.step(actions)
+        if self.obs_type == 'ram':
+            o = tuple(obs.values())
+        else:
+            o = self.observation_swapaxis(tuple(obs.values()))
+        r = list(rewards.values())
+        d = list(dones.values())
+        if self.keep_info:  # a special case for VectorEnv
+            info = infos
+        else:
+            info = list(infos.values())
+        del obs, rewards, dones, infos
+        # r = self._zerosum_filter(r)
+
+        return o, r, d, info
+
+    def _zerosum_filter(self, r):
+        # zero-sum filter: makes a non-zero-sum game zero-sum (e.g. tennis_v2)
+        # by giving the other agent the negated reward of the scoring agent
+        if np.sum(r) != 0:
+            nonzero_idx = np.nonzero(r)[0][0]
+            r[1 - nonzero_idx] = -r[nonzero_idx]
+        return r
+
+    def seed(self, seed):
+        self.env.seed(seed)
+        np.random.seed(seed)
+
+    def render(self):
+        self.env.render()
+
+    def close(self):
+        self.env.close()
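A small self-contained sketch of how `Dict2TupleWrapper` is intended to be used: it converts the dict-keyed observations, rewards, dones, and infos of a PettingZoo-style parallel environment into agent-ordered tuples/lists. The `StubEnv` below is a hypothetical stand-in for a real multi-agent env (e.g. the entombed_cooperative demo), so its spaces and agent names are purely illustrative.

```python
import numpy as np
import gym

from common.wrappers import Dict2TupleWrapper  # path as laid out in this repo


class StubEnv:
    """Hypothetical dict-based two-agent env, mimicking the PettingZoo parallel API."""
    def __init__(self):
        self.agents = ['first_0', 'second_0']
        self.num_agents = 2
        self.observation_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(8,), dtype=np.float32)
        self.action_space = gym.spaces.Discrete(4)
        self.observation_spaces = {a: self.observation_space for a in self.agents}
        self.action_spaces = {a: self.action_space for a in self.agents}
        self.spec = None

    def reset(self):
        return {a: self.observation_space.sample() for a in self.agents}

    def step(self, actions):  # actions: dict keyed by agent name
        obs = {a: self.observation_space.sample() for a in self.agents}
        rewards = {a: 0.0 for a in self.agents}
        dones = {a: False for a in self.agents}
        infos = {a: {} for a in self.agents}
        return obs, rewards, dones, infos

    def seed(self, seed):
        pass

    def render(self):
        pass

    def close(self):
        pass


env = Dict2TupleWrapper(StubEnv())
obs = env.reset()                                          # tuple of per-agent observations
actions = [env.action_space.sample() for _ in range(env.num_agents)]
obs, rewards, dones, infos = env.step(actions)             # tuple / list outputs instead of dicts
env.close()
```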
