Commit f492208

Add results for RL algorithm (#12)

1 parent a346382 commit f492208

13 files changed (+507 / -99 lines)

README.md

Lines changed: 49 additions & 96 deletions
@@ -6,119 +6,72 @@ Inverse Reinforcement Learning Algorithm implementation with python.

# Implemented Algorithms

-## Maximum Entropy IRL: [1]
+## Maximum Entropy IRL:

-## Maximum Entropy Deep IRL
+Implementation of the Maximum Entropy inverse reinforcement learning algorithm from [1], based on the implementation
+of [lets-do-irl](https://github.com/reinforcement-learning-kr/lets-do-irl/tree/master/mountaincar/maxent).
+It is an IRL algorithm using Q-Learning with a Maximum Entropy update function.

-# Experiments
+## Maximum Entropy Deep IRL:

-## Mountaincar-v0
-[gym](https://www.gymlibrary.dev/environments/classic_control/mountain_car/)
-
-The expert demonstrations for the Mountaincar-v0 are the same as used in [lets-do-irl](https://github.com/reinforcement-learning-kr/lets-do-irl/tree/master/mountaincar/maxent).
-
-*Heatmap of Expert demonstrations with 400 states*:
-
-<img src="demo/heatmaps/expert_state_frequencies_mountaincar.png">
-
-### Maximum Entropy Inverse Reinforcement Learning
-
-IRL using Q-Learning with a Maximum Entropy update function.
-
-#### Training
-
-*Learner training for 1000 episodes*:
-
-<img src="demo/learning_curves/maxent_999_flat.png">
-
-*Learner training for 4000 episodes*:
-
-<img src="demo/learning_curves/maxent_4999_flat.png">
-
-#### Heatmaps
-
-*Learner state frequencies after 1000 episodes*:
-
-<img src="demo/heatmaps/learner_999_flat.png">
-
-*Learner state frequencies after 2000 episodes*:
-
-<img src="demo/heatmaps/learner_1999_flat.png">
-
-*Learner state frequencies after 5000 episodes*:
-
-<img src="demo/heatmaps/learner_4999_flat.png">
-
-<img src="demo/heatmaps/theta_999_flat.png">
-
-*State rewards heatmap after 5000 episodes*:
-
-<img src="demo/heatmaps/theta_4999_flat.png">
+An implementation of the Maximum Entropy inverse reinforcement learning algorithm that uses a neural network for the
+actor.
+The estimated IRL reward is learned similarly to Maximum Entropy IRL.
+It is an IRL algorithm using Deep Q-Learning with a Maximum Entropy update function.

-*State rewards heatmap after 14000 episodes*:
+## Maximum Entropy Deep RL:

-<img src="demo/heatmaps/theta_13999_flat.png">
+An implementation of the Maximum Entropy reinforcement learning algorithm.
+This algorithm is used to compare the IRL algorithms with an RL algorithm.

-#### Testing
+# Experiment

-*Testing results of the model after 29000 episodes*:
-
-<img src="demo/test_results/test_maxentropy_flat.png">
-
-
-### Deep Maximum Entropy Inverse Reinforcement Learning
-
-IRL using Deep Q-Learning with a Maximum Entropy update function.
-
-#### Training
-
-*Learner training for 1000 episodes*:
-
-<img src="demo/learning_curves/maxentdeep_999_w_reset_10.png">
-
-*Learner training for 5000 episodes*:
-
-<img src="demo/learning_curves/maxentdeep_4999_w_reset_10.png">
-
-#### Heatmaps
-
-*Learner state frequencies after 1000 episodes*:
-
-<img src="demo/heatmaps/learner_999_maxentdeep_w_reset_10.png">
-
-*Learner state frequencies after 2000 episodes*:
-
-<img src="demo/heatmaps/learner_1999_maxentdeep_w_reset_10.png">
-
-*Learner state frequencies after 5000 episodes*:
-
-<img src="demo/heatmaps/learner_4999_maxentdeep_w_reset_10.png">
-
-*State rewards heatmap after 1000 episodes*:
-
-<img src="demo/heatmaps/theta_999_maxentdeep_w_reset_10.png">
+## Mountaincar-v0

-*State rewards heatmap after 2000 episodes*:
+The Mountaincar-v0 is used for evaluating the different algorithms.
+For this, the implementation of the MDP for the Mountaincar
+from [gym](https://www.gymlibrary.dev/environments/classic_control/mountain_car/) is used.

-<img src="demo/heatmaps/theta_1999_maxentdeep_w_reset_10.png">
+The expert demonstrations for the Mountaincar-v0 are the same as used
+in [lets-do-irl](https://github.com/reinforcement-learning-kr/lets-do-irl/tree/master/mountaincar/maxent).

-*State rewards heatmap after 5000 episodes*:
+*Heatmap of Expert demonstrations with 400 states*:

-<img src="demo/heatmaps/theta_4999_maxentdeep_w_reset_10.png">
+<img src="demo/heatmaps/expert_state_frequencies_mountaincar.png">

+### Comparing the algorithms

-#### Testing
+The following tables compare the results of training and testing the two IRL algorithms, Maximum Entropy and
+Maximum Entropy Deep. Furthermore, results for the RL algorithm Maximum Entropy Deep RL are shown to
+highlight the differences between IRL and RL.

-*Testing results of the best model after 5000 episodes*:
+| Algorithm | Training Curve after 1000 Episodes | Training Curve after 5000 Episodes |
+|---|---|---|
+| Maximum Entropy IRL | <img src="demo/learning_curves/maxent_999_flat.png" width="400"> | <img src="demo/learning_curves/maxent_4999_flat.png" width="400"> |
+| Maximum Entropy Deep IRL | <img src="demo/learning_curves/maxentdeep_999_w_reset_10.png" width="400"> | <img src="demo/learning_curves/maxentdeep_4999_w_reset_10.png" width="400"> |
+| Maximum Entropy Deep RL | <img src="demo/learning_curves/maxentdeep_999_RL.png" width="400"> | <img src="demo/learning_curves/maxentdeep_4999_RL.png" width="400"> |

-<img src="demo/test_results/test_maxentropydeep_best_model_results.png">
+| Algorithm | State Frequencies Learner: 1000 Episodes | State Frequencies Learner: 2000 Episodes | State Frequencies Learner: 5000 Episodes |
+|---|---|---|---|
+| Maximum Entropy IRL | <img src="demo/heatmaps/learner_999_flat.png" width="400"> | <img src="demo/heatmaps/learner_1999_flat.png" width="400"> | <img src="demo/heatmaps/learner_4999_flat.png" width="400"> |
+| Maximum Entropy Deep IRL | <img src="demo/heatmaps/learner_999_maxentdeep_w_reset_10.png" width="400"> | <img src="demo/heatmaps/learner_1999_maxentdeep_w_reset_10.png" width="400"> | <img src="demo/heatmaps/learner_4999_maxentdeep_w_reset_10.png" width="400"> |
+| Maximum Entropy Deep RL | <img src="demo/heatmaps/learner_999_deep_RL.png" width="400"> | <img src="demo/heatmaps/learner_1999_deep_RL.png" width="400"> | <img src="demo/heatmaps/learner_4999_deep_RL.png" width="400"> |

-### Deep Maximum Entropy Inverse Reinforcement Learning with Critic
+| Algorithm | IRL Rewards: 1000 Episodes | IRL Rewards: 2000 Episodes | IRL Rewards: 5000 Episodes | IRL Rewards: 14000 Episodes |
+|---|---|---|---|---|
+| Maximum Entropy IRL | <img src="demo/heatmaps/theta_999_flat.png" width="400"> | None | <img src="demo/heatmaps/theta_4999_flat.png" width="400"> | <img src="demo/heatmaps/theta_13999_flat.png" width="400"> |
+| Maximum Entropy Deep IRL | <img src="demo/heatmaps/theta_999_maxentdeep_w_reset_10.png" width="400"> | <img src="demo/heatmaps/theta_1999_maxentdeep_w_reset_10.png" width="400"> | <img src="demo/heatmaps/theta_4999_maxentdeep_w_reset_10.png" width="400"> | None |
+| Maximum Entropy Deep RL | None | None | None | None |

-Coming soon...
+| Algorithm | Testing Results: 100 Runs |
+|---|---|
+| Maximum Entropy IRL | <img src="demo/test_results/test_maxentropy_flat.png" width="400"> |
+| Maximum Entropy Deep IRL | <img src="demo/test_results/test_maxentropydeep_best_model_results.png" width="400"> |
+| Maximum Entropy Deep RL | <img src="demo/test_results/test_maxentropydeep_best_model_RL_results.png" width="400"> |

# References
-The implementation of MaxEntropyIRL and MountainCar is based on the implementation of:
+
+The implementation of MaxEntropyIRL and MountainCar is based on the implementation of:
[lets-do-irl](https://github.com/reinforcement-learning-kr/lets-do-irl/tree/master/mountaincar/maxent)

[1] [BD. Ziebart, et al., "Maximum Entropy Inverse Reinforcement Learning", AAAI 2008](https://cdn.aaai.org/AAAI/2008/AAAI08-227.pdf).
@@ -133,12 +86,12 @@ pip install .
# Usage

```commandline
-usage: irl [-h] [--version] [--training] [--testing] [--render] ALGORITHM
+usage: irl-runner [-h] [--version] [--training] [--testing] [--render] ALGORITHM

Implementation of IRL algorithms

positional arguments:
-  ALGORITHM   Currently supported training algorithm: [max-entropy, max-entropy-deep]
+  ALGORITHM   Currently supported training algorithm: [max-entropy, max-entropy-deep, max-entropy-deep-rl]

options:
  -h, --help  show this help message and exit
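
For context on the "Maximum Entropy update function" that the updated README text refers to: in the Ziebart et al. formulation [1], the reward weights theta are moved along the difference between the expert's and the learner's state-visitation feature expectations, and the IRL reward of a state is the dot product of its feature vector with theta. The following is a minimal sketch of that step; the function names and the learning rate are illustrative and not taken from this commit.

```python
import numpy as np


def maxent_update(theta, expert_features, learner_features, learning_rate=0.05):
    """One gradient step of the Maximum Entropy IRL reward update.

    theta:            current reward weights, one entry per discretized state
    expert_features:  expert state-visitation feature expectations
    learner_features: learner state-visitation feature expectations
    """
    # Increase the reward of features the expert visits more often than the learner.
    gradient = expert_features - learner_features
    return theta + learning_rate * gradient


def irl_reward(feature_matrix, theta, state_idx):
    """IRL reward of a discretized state under the current weights."""
    return feature_matrix[int(state_idx)].dot(theta)
```

According to the README descriptions above, the tabular (Q-Learning) and deep (Deep Q-Learning) IRL variants share this update and differ mainly in how the actor is represented.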
Binary image files changed (19.2 KB, 21.2 KB, 21 KB, 22.2 KB, 27.1 KB, 42.1 KB); contents not shown.

setup.cfg

Lines changed: 1 addition & 1 deletion
@@ -78,7 +78,7 @@ testing =
# script_name = irlwpython.module:function
# For example:
console_scripts =
-    irl = irlwpython.main:run
+    irl-runner = irlwpython.main:run
# And any other entry points, for example:
# pyscaffold.cli =
#     awesome = pyscaffoldext.awesome.extension:AwesomeExtension
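
With the console script renamed from `irl` to `irl-runner`, the new RL algorithm would presumably be invoked along the lines shown below. The calls are illustrative, derived from the usage text in the README diff above, and are not output of the tool.

```commandline
pip install .
irl-runner --training --render max-entropy-deep-rl
irl-runner --testing max-entropy-deep-rl
```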

src/irlwpython/MaxEntropyDeepRL.py

Lines changed: 197 additions & 0 deletions
@@ -0,0 +1,197 @@
import numpy as np
import math

import torch
import torch.optim as optim
import torch.nn as nn

from irlwpython.FigurePrinter import FigurePrinter


class QNetwork(nn.Module):
    def __init__(self, input_size, output_size):
        super(QNetwork, self).__init__()
        self.fc1 = nn.Linear(input_size, 64)
        self.relu1 = nn.ReLU()
        self.fc2 = nn.Linear(64, 32)
        self.relu2 = nn.ReLU()
        self.output_layer = nn.Linear(32, output_size)

        self.printer = FigurePrinter()

    def forward(self, state):
        x = self.fc1(state)
        x = self.relu1(x)
        x = self.fc2(x)
        x = self.relu2(x)
        q_values = self.output_layer(x)
        return q_values


class MaxEntropyDeepRL:
    def __init__(self, target, state_dim, action_size, feature_matrix, one_feature, learning_rate=0.001, gamma=0.99):
        self.feature_matrix = feature_matrix
        self.one_feature = one_feature

        self.target = target

        self.q_network = QNetwork(state_dim, action_size)
        self.target_q_network = QNetwork(state_dim, action_size)
        self.target_q_network.load_state_dict(self.q_network.state_dict())
        self.optimizer = optim.Adam(self.q_network.parameters(), lr=learning_rate)

        self.gamma = gamma

        self.printer = FigurePrinter()

    def select_action(self, state, epsilon):
        """
        Selects an action epsilon-greedily based on the Q-values from the network.
        :param state: Current environment state.
        :param epsilon: Exploration probability.
        :return: Index of the chosen action.
        """
        if np.random.rand() < epsilon:
            return np.random.choice(3)
        else:
            with torch.no_grad():
                q_values = self.q_network(torch.FloatTensor(state))
            return torch.argmax(q_values).item()

    def update_q_network(self, state, action, reward, next_state, done):
        """
        Updates the Q-network based on the observed reward.
        :param state: Current state.
        :param action: Action taken in the current state.
        :param reward: Reward received for the transition.
        :param next_state: Resulting state.
        :param done: Whether the episode terminated after this transition.
        """
        state = torch.FloatTensor(state)
        next_state = torch.FloatTensor(next_state)
        q_values = self.q_network(state)
        next_q_values = self.target_q_network(next_state)

        target = q_values.clone()
        if not done:
            target[action] = reward + self.gamma * torch.max(next_q_values).item()
        else:
            target[action] = reward

        loss = nn.MSELoss()(q_values, target.detach())
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

    def update_target_network(self):
        """
        Copies the weights of the Q-network into the target network.
        """
        self.target_q_network.load_state_dict(self.q_network.state_dict())

    def train(self, n_states, episodes=30000, max_steps=200,
              epsilon_start=1.0,
              epsilon_decay=0.995, epsilon_min=0.01):
        """
        Trains the network using the maximum entropy deep reinforcement learning algorithm.
        :param n_states: Number of discretized states.
        :param episodes: Count of training episodes.
        :param max_steps: Max steps per episode.
        :param epsilon_start: Initial exploration probability.
        :param epsilon_decay: Multiplicative decay of epsilon per episode.
        :param epsilon_min: Lower bound for epsilon.
        """
        learner_feature_expectations = np.zeros(n_states)

        epsilon = epsilon_start
        episode_arr, scores = [], []

        best_reward = -math.inf
        for episode in range(episodes):
            state, info = self.target.env_reset()
            total_reward = 0

            for step in range(max_steps):
                action = self.select_action(state, epsilon)

                next_state, reward, done, _, _ = self.target.env_step(action)
                total_reward += reward

                self.update_q_network(state, action, reward, next_state, done)
                self.update_target_network()

                # State counting for the state-visitation density
                state_idx = self.target.state_to_idx(state)
                learner_feature_expectations += self.feature_matrix[int(state_idx)]

                state = next_state
                if done:
                    break

            # Keep track of the best performing network
            if total_reward > best_reward:
                best_reward = total_reward
                torch.save(self.q_network.state_dict(),
                           f"../results/maxentropydeep_{episode}_best_network_w_{total_reward}_RL.pth")

            if (episode + 1) % 10 == 0:
                # Calculate the state-visitation density and reset the counter
                learner = learner_feature_expectations / episode
                learner_feature_expectations = np.zeros(n_states)

            scores.append(total_reward)
            episode_arr.append(episode)
            epsilon = max(epsilon * epsilon_decay, epsilon_min)
            print(f"Episode: {episode + 1}, Total Reward: {total_reward}, Epsilon: {epsilon}")

            if (episode + 1) % 1000 == 0:
                score_avg = np.mean(scores)
                print('{} episode average score is {:.2f}'.format(episode, score_avg))
                self.printer.save_plot_as_png(episode_arr, scores,
                                              f"../learning_curves/maxent_{episodes}_{episode}_qnetwork_RL.png")
                self.printer.save_heatmap_as_png(learner.reshape((20, 20)), f"../heatmap/learner_{episode}_deep_RL.png")

                torch.save(self.q_network.state_dict(), f"../results/maxent_{episodes}_{episode}_network_main.pth")

            if episode == episodes - 1:
                self.printer.save_plot_as_png(episode_arr, scores,
                                              f"../learning_curves/maxentdeep_{episodes}_qdeep_RL.png")

        torch.save(self.q_network.state_dict(), f"src/irlwpython/results/maxentdeep_{episodes}_q_network_RL.pth")

    def test(self, model_path, epsilon=0.01, repeats=100):
        """
        Tests the previously trained model.
        :param model_path: Path to the saved Q-network weights.
        :param epsilon: Exploration probability used during testing.
        :param repeats: Number of test episodes.
        """
        self.q_network.load_state_dict(torch.load(model_path))
        episodes, scores = [], []

        for episode in range(repeats):
            state, info = self.target.env_reset()
            score = 0

            while True:
                self.target.env_render()
                action = self.select_action(state, epsilon)
                next_state, reward, done, _, _ = self.target.env_step(action)

                score += reward
                state = next_state

                if done:
                    scores.append(score)
                    episodes.append(episode)
                    break

            if episode % 1 == 0:
                print('{} episode score is {:.2f}'.format(episode, score))

        self.printer.save_plot_as_png(episodes, scores,
                                      "src/irlwpython/learning_curves"
                                      "/test_maxentropydeep_best_model_RL_results.png")
