Implemented the Monte-Carlo method on the GymMinigrid Environment, MiniGrid-Empty-8x8-v0.
`gen_obs` generates the agent's partially observable view (an image). For a discrete observation, we use
`agent_pos`, which returns the grid number at which the agent is present.
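
A tabular method needs one integer per state. The sketch below shows one way to get it, assuming `agent_pos` is an `(x, y)` pair and the grid is 8 cells wide. The helper name `state_index` is hypothetical and not taken from the original code.

```python
def state_index(agent_pos, grid_width=8):
    """Map an (x, y) grid position to a single integer index (row-major).

    Assumption: agent_pos is an (x, y) pair and the grid is grid_width wide.
    """
    x, y = agent_pos
    return y * grid_width + x


# Example: the cell at column 1, row 1 of an 8-wide grid.
print(state_index((1, 1)))  # 9
```
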
| Num | Action |
|---|---|
| 0 | Turn Left |
| 1 | Turn Right |
| 2 | Move Forward |
- Reward is 1 when agent reaches goal, else 0
- Gamma
  - 0.9
- Training Episodes
  - 75
- Exploration
  - Epsilon = Epsilon/1.1
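
The settings above can be sketched as two small helpers: one computes the discounted return for every step of an episode with gamma = 0.9, the other applies the epsilon decay. The function names are hypothetical, and the starting value of epsilon is not stated in the original, so it is left to the caller.

```python
def discounted_returns(rewards, gamma=0.9):
    """Return G_t for every step t of one episode, computed backwards."""
    returns = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns


def decay_epsilon(epsilon, factor=1.1):
    """Apply the exploration schedule Epsilon = Epsilon / factor."""
    return epsilon / factor


# An episode that reaches the goal on its third step (reward 1 only at the end).
episode_rewards = [0, 0, 1]
print(discounted_returns(episode_rewards))
```

Because the only non-zero reward is at the goal, earlier steps receive a return that shrinks by a factor of gamma per step of distance from the goal.
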
Implemented the SARSA-λ and Backward SARSA methods on the GymMinigrid Environments MiniGrid-Empty-8x8-v0 and MiniGrid-FourRooms-v0.
`gen_obs` generates the agent's partially observable view (an image). For a discrete observation, we use
`agent_pos`, which returns the grid number at which the agent is present.
| Num | Action |
|---|---|
| 0 | Turn Left |
| 1 | Turn Right |
| 2 | Move Forward |
- Reward is 1 when agent reaches goal, else 0
- Gamma
  - 0.9
- SARSA Lambda (λ)
  - 0.99
- Training Episodes
  - 50
- Exploration
  - Epsilon = Epsilon/1.05
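
For reference, one backward-view SARSA(λ) step with accumulating eligibility traces can be sketched as below, using gamma = 0.9 and λ = 0.99 from the list above. The step size `alpha = 0.1`, the function name, and the dictionary-based tables are assumptions for illustration, not the original implementation.

```python
from collections import defaultdict


def sarsa_lambda_update(Q, E, s, a, r, s_next, a_next, done,
                        alpha=0.1, gamma=0.9, lam=0.99):
    """One backward-view SARSA(lambda) step with accumulating traces.

    Q and E map (state, action) pairs to values and eligibility traces.
    alpha is a hypothetical step size; it is not given in the original.
    """
    target = r if done else r + gamma * Q[(s_next, a_next)]
    delta = target - Q[(s, a)]
    E[(s, a)] += 1.0  # mark the visited pair as eligible
    for key in list(E):
        Q[key] += alpha * delta * E[key]  # credit every eligible pair
        E[key] *= gamma * lam             # decay its trace
    return delta


Q = defaultdict(float)
E = defaultdict(float)
# Two "Move Forward" (action 2) steps, the second one reaching the goal.
sarsa_lambda_update(Q, E, s=0, a=2, r=0.0, s_next=1, a_next=2, done=False)
sarsa_lambda_update(Q, E, s=1, a=2, r=1.0, s_next=2, a_next=0, done=True)
print(Q[(0, 2)], Q[(1, 2)])
```

The trace lets the goal reward update the earlier state-action pair in the same step, which is the main practical difference from one-step SARSA. Traces in `E` should be reset at the start of each episode.
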
Implemented the SARSA-λ and Backward SARSA methods on the GymMinigrid Environment, MiniGrid-Empty-8x8-v0.
`gen_obs` generates the agent's partially observable view (an image). For a discrete observation, we use
`agent_pos`, which returns the grid number at which the agent is present.
| Num | Action |
|---|---|
| 0 | Turn Left |
| 1 | Turn Right |
| 2 | Move Forward |
- Reward is 1 when agent reaches goal, else 0
- Gamma
  - Trained agents with 5 different values of gamma
  - 0.9, 0.7, 0.5, 0.3, 0.1
- Training Episodes
  - 150
- Exploration
  - Epsilon = Epsilon/1.1
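
To see why the choice of gamma matters with a goal-only reward, the sketch below prints the return seen at the start state when the reward of 1 arrives after a given number of steps. The episode length of 10 steps and the helper name are hypothetical values chosen for illustration.

```python
gammas = [0.9, 0.7, 0.5, 0.3, 0.1]
steps_to_goal = 10  # hypothetical episode length, not from the original


def discounted_goal_reward(gamma, steps):
    """Return at the start state when the only reward (1) arrives on step `steps`."""
    return gamma ** (steps - 1)


for gamma in gammas:
    value = discounted_goal_reward(gamma, steps_to_goal)
    print(f"gamma={gamma}: start-state return = {value:.2e}")
```

Smaller values of gamma shrink the learning signal far from the goal much faster, so those agents can be expected to need more episodes before distant states get useful value estimates.
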
Implemented DQN on the Gym Environment, Gym-CartPole-v0.
| Num | Observation | Min | Max |
|---|---|---|---|
| 0 | Cart Position | -4.8 | 4.8 |
| 1 | Cart Velocity | -Inf | Inf |
| 2 | Pole Angle | -0.418 rad (-24 deg) | 0.418 rad (24 deg) |
| 3 | Pole Angular Velocity | -Inf | Inf |
| Num | Action |
|---|---|
| 0 | Push Cart to Left |
| 1 | Push Cart to Right |
- Reward is 1 for every step taken, including the termination step
- Network Architecture
  - 4 Linear Layers of dim = [16, 32, 16, 2]
- Optimizer
  - Adam Optimizer
- Learning Rate
  - 0.0001
- Batch Size
  - 128
- Training Episodes
  - 700
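
DQN is usually trained on mini-batches drawn from an experience replay buffer. The sketch below shows how the batch size of 128 would typically be used, assuming such a buffer. The class name, the buffer capacity, and the discount factor `gamma = 0.99` are hypothetical; none of them are stated in the original.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity=10_000):  # capacity is an assumed value
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=128):
        # Raises ValueError if fewer than batch_size transitions are stored.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def td_target(reward, next_q_values, done, gamma=0.99):
    """One-step Q-learning target; gamma is an assumed discount factor."""
    return reward if done else reward + gamma * max(next_q_values)


buffer = ReplayBuffer()
for step in range(200):
    # Dummy CartPole-shaped transitions: 4-dimensional state, action 0 or 1.
    state = [0.0, 0.0, 0.0, 0.0]
    buffer.push(state, step % 2, 1.0, state, False)

batch = buffer.sample(128)
print(len(batch))  # 128
```

Sampling uniformly from past transitions breaks the correlation between consecutive steps, which is the usual motivation for the buffer.
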
Implemented the Policy Gradient Method (Actor-Critic) on the Gym Pendulum environment.
The observation is an ndarray with shape (3,) representing the x-y coordinates of the pendulum's free end and its angular velocity.
| Num | Observation | Min | Max |
|---|---|---|---|
| 0 | x = cos(theta) | -1.0 | 1.0 |
| 1 | y = sin(theta) | -1.0 | 1.0 |
| 2 | Angular Velocity | -8.0 | 8.0 |
The action is an ndarray with shape (1,) representing the torque applied to the free end of the pendulum.
| Num | Action | Min | Max |
|---|---|---|---|
| 0 | Torque | -2.0 | 2.0 |
- The reward function is a function of theta, the angle made by the pendulum.
- Network Architecture
  - Actor
    - 4 Linear Layers of dim = [31,128,32,2]
  - Critic
    - 4 Linear Layers of dim = [31,128,32,1]
- Optimizer
  - Adam Optimizer
- Learning Rate
  - 0.0005
- Batch Size
  - 64
- Training Episodes
  - 1200
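
The observation and action tables above can be tied together with a few small helpers. This sketch recovers theta from the `(cos, sin)` observation, clips a raw action to the torque range, and evaluates the reward formula given in Gym's Pendulum documentation. The function names are hypothetical, and the reward used in the original code is assumed to be that standard environment reward.

```python
import math

MAX_TORQUE = 2.0  # torque bounds from the action table


def angle_from_obs(obs):
    """Recover theta in [-pi, pi] from (cos(theta), sin(theta), angular velocity)."""
    cos_theta, sin_theta, _ = obs
    return math.atan2(sin_theta, cos_theta)


def clip_torque(raw_torque):
    """Clamp an actor output to the valid torque range [-2.0, 2.0]."""
    return max(-MAX_TORQUE, min(MAX_TORQUE, raw_torque))


def pendulum_reward(theta, theta_dot, torque):
    """Reward as documented for Gym's Pendulum (assumed to apply here)."""
    return -(theta ** 2 + 0.1 * theta_dot ** 2 + 0.001 * torque ** 2)


obs = (0.0, 1.0, 0.5)  # pendulum horizontal, rotating slowly
theta = angle_from_obs(obs)
torque = clip_torque(3.5)
print(theta, torque, pendulum_reward(theta, obs[2], torque))
```

The reward is never positive and is largest (zero) when the pendulum is upright, at rest, with no torque applied.
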











