
Commit c494530

fixing some things in the wholesale component
1 parent 5823a37 commit c494530

2 files changed: 44 additions, 42 deletions

src/body.tex

Lines changed: 43 additions & 41 deletions
@@ -102,7 +102,7 @@ \chapter{Introduction}
 takes approximately two to three hours to complete and each time slot, a slot in which energy is produced, consumed and
 traded, takes five seconds. Previous researchers have
 identified the problem as a \ac{POMDP}, a common model of \ac {RL} literature \cite[]{tactexurieli2016mdp}. Deep neural network
-architectures have proven to be successful in solving games in a variety of instances. It is therefore intuitive to
+architectures have proven to be successful in solving games in a variety of instances. It is intuitive to
 attempt to apply such architectures to the problems posed by the \ac{PowerTAC} simulation. Unfortunately, most such
 implementations are only available in Python \cite[]{baselines, plappert2016kerasrl, schaarschmidt2017tensorforce} and
 \ac{PowerTAC} is almost exclusively based on Java. An extension of the current communication protocols to other
@@ -118,7 +118,7 @@ \chapter{Introduction}
 attempt to align their decisions to those that their teacher would do \cite[]{schmitt2018kickstarting}. High level
 problem solving agents may be trained by first training several small narrow focus agent networks on sub problems and
 then applying transfer learning to transfer the knowledge from the narrow focus agents to the generic high level
-https://itsfoss.com/pdf-editors-linux/agent \cite[]{parisotto2015actor}. For problems where a reward function is difficult to construct, \emph{inverse
+agent \cite[]{parisotto2015actor}. For problems where a reward function is difficult to construct, \emph{inverse
 reinforcement learning} can be used to train an agent to behave similar to an observable expert. The policy function of
 the agent shows good performance despite lacking a specific reward function \cite[]{NG2004Apprentice}.

@@ -1922,7 +1922,7 @@ \section{Wholesale market}
 \citet{tactexurieli2016mdp} was assumed. More specifically, the agent only concerns itself with the activities in the
 wholesale market and does not act or evaluate tariff market or balancing market activities. This is due to the
 separation of concern approach described earlier. It is therefore a \ac{MDP} that can be solved with \ac{RL} techniques.
-The goal was the ability to apply current and future neural network implementations to the \ac{PowerTAC} problem set. For this,
+The goal was the ability to apply current and future deep \ac{RL} implementations to the \ac{PowerTAC} problem set. For this,
 many of the previously described implementations were necessary. Now that a Python based broker is possible, application
 of \ac{PPO}, \ac{DQN} and other modern \ac{RL} agent implementations seems reasonable. All required messages can be
 subscribed to via the publish-subscribe pattern. What is missing are the following components which are explained in
@@ -1945,7 +1945,7 @@ \subsection{\ac{MDP} design}%
 \subsubsection{\ac{MDP} design comparison}%
 \label{ssub:mdp_design_comparison}

-There are two possible ways of modelling the \ac{MDP}: per time slot or per game. Per time slot is aligned to the
+There are two possible ways of modeling the \ac{MDP}: per time slot or per game. Per time slot is aligned to the
 definition by \citet{tactexurieli2016mdp}. Per game considers each game a unified \ac{MDP} where the agent acts in all
 time slots and therefore has an action space of 48 values per time slot.

@@ -1967,10 +1967,10 @@ \subsubsection{\ac{MDP} design comparison}%
 such as \ac{DQN}, \ac{SARSA} or \ac{A3C} are not easily applied to such large action spaces. They are written to be
 applied to discrete action spaces \cite[]{baselines}. \ac{PowerTAC} trading is in its purest form a continuous action
 space, allowing the agent to define both amount and price for a target time slot. Furthermore, the agent would observe
-24 environments in parallel and generate 24 largely independent trading decisions. The network would have to learn to
-match each input block to an output action, as the input for time slot 370 has little effect on the action that should
-be taken in time slot 380. In a separated \ac{MDP}, each environment observation would only hold the data needed for the
-specific time slot rather than information about earlier and later slots.
+information for 24 open time slots in parallel and generate 24 largely independent trading decisions. The network would
+have to learn to match each input block to an output action, as the input for time slot 370 has little effect on the
+action that should be taken in time slot 380. In a separated \ac{MDP}, each environment observation would only hold the
+data needed for the specific time slot rather than information about earlier and later slots.

 \subsubsection{\ac{MDP} implementation}%
 \label{sub:mdp_design_and_implementation}
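
Editor's note: the per-time-slot versus per-game distinction discussed in this hunk can be illustrated through the action spaces the two designs imply. The following is a minimal Gym-style sketch; the bounds and the two-values-per-order layout (amount and price) are assumptions for illustration, not the broker's actual implementation.

    import numpy as np
    from gym import spaces

    # Per-time-slot MDP: one order per step, defined by a normalized
    # amount and price. Bounds are placeholder assumptions.
    per_slot_action = spaces.Box(low=-1.0, high=1.0, shape=(2,), dtype=np.float32)

    # Per-game MDP: the agent acts on all 24 open time slots at once,
    # i.e. 24 orders x 2 values = 48 continuous action values per step.
    per_game_action = spaces.Box(low=-1.0, high=1.0, shape=(24, 2), dtype=np.float32)

    print(per_slot_action.shape)   # (2,)
    print(per_game_action.shape)   # (24, 2) -> the 48 values mentioned in the text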
@@ -2021,7 +2021,7 @@ \subsection{Reversal of program flow control}%

 The environment expects the agent to expose an \ac{API} that includes two calls: \texttt{forward} and
 \texttt{backward}. This pattern has been adopted from the keras-rl and Tensorforce libraries. The reason is simple:
-While most libraries put the agent control of the program flow, the \ac{PowerTAC} broker will be stepped by the
+While most libraries put the agent in control of the program flow, the \ac{PowerTAC} broker will be stepped by the
 server and the \ac{RL} agent itself has no control of the flow. The forward and backward methods are
 directly aligned with the keras-rl framework and easily applicable to the Tensorforce \texttt{act()} and
 \texttt{atomic\_observe()} methods of their agent implementations. The abstract \texttt{PowerTacWholesaleAgent} class just defines a
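
Editor's note: a minimal sketch of such an inverted-control interface. The method names forward and backward are taken from the text above; the exact signatures of the project's PowerTacWholesaleAgent are assumptions, not the real class.

    from abc import ABC, abstractmethod

    import numpy as np


    class WholesaleAgentSketch(ABC):
        """Agent that is stepped by the broker instead of driving the loop itself.

        The broker, driven by the PowerTAC server, calls forward() whenever an
        observation arrives and backward() once the reward for a previous action
        is known. Signatures are illustrative only.
        """

        @abstractmethod
        def forward(self, observation: np.ndarray) -> np.ndarray:
            """Map an environment observation to an action (e.g. amount, price)."""

        @abstractmethod
        def backward(self, reward: float, terminal: bool) -> None:
            """Receive the reward for the last action and learn from it."""


    class RandomAgent(WholesaleAgentSketch):
        """Placeholder policy: trades a random normalized amount at a random price."""

        def forward(self, observation: np.ndarray) -> np.ndarray:
            return np.random.uniform(-1.0, 1.0, size=2)

        def backward(self, reward: float, terminal: bool) -> None:
            pass  # a learning agent would update its parameters here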
@@ -2065,12 +2065,13 @@ \subsection{Reward functions}%
 environment \cite[p.469ff.]{amodei2016concrete, sutton2018reinforcement}. While the Atari agents often receive their reward directly from the
 game as many games include a game point counter \cite[]{mnih2013playing}, \ac{PowerTAC} technically simulates a
 real-world energy market which means the score equals the brokers profit. Nonetheless, the profit is dependent on a number of
-factors and therefore hardly a good choice for a reward proxy. Using the purchase prices of the energy purchased is also
-noisy, as it depends on the supply and demand of the entire market. Generally, the broker attempts to purchase energy at
-a good price and a good price can be defined as one that is better than that of other participants in the market. A
-reward function based on the relation between the average price paid by the broker and the average price paid by the
-overall market hence describes how well the agent did in comparison to the others and consequently removes the market
-price fluctuation noise from the reward values.
+factors, where the wholesale trading component only makes up a comparatively small part and thus it is hardly a
+good choice for a reward proxy. Using the purchase prices of the energy purchased is also noisy, as it depends on the
+supply and demand of the entire market. Generally, the broker attempts to trade energy at a good price and a good
+price can be defined as one that is better than that of other participants in the market. A reward function based on the
+relation between the average price paid by the broker and the average price paid by the overall market hence describes
+how well the agent did in comparison to the others and consequently removes the market price fluctuation noise from the
+reward values.

 To calculate this reward, all the purchases of the agent as well as all market clearings are averaged for a given target
 time slot.
@@ -2090,10 +2091,11 @@ \subsection{Reward functions}%
 \end{equation}

 for both the market averages and the broker averages. This encourages the agent to buy for low prices and to sell for high
-prices when possible. $sum(q)$ is the net purchasing amount after the 24 trading opportunities are completed, i.e.\ did
-the broker end up with a positive or negative net flow of energy in the wholesale market. This reward function has one
-one immediate drawback: it can only be calculated once the market for the target time slot is closed. The agent
-therefore doesn't get any feedback during any step except the terminal state.
+prices when possible. $sum(q)$ is the net purchasing amount after the 24 trading opportunities are completed, i.e.\ it
+describes if
+the broker ended up with a positive or negative net flow of energy in the wholesale market. This reward function has one
+immediate drawback: it can only be calculated once the market for the target time slot is closed. The agent therefore
+doesn't get any feedback during any step except the terminal state.

 While \ac{RL} research has stated sparse reward as a core part of \ac{RL}, many of the recent algorithms do
 not deal well with such sparse rewards. Experience replay partially works so well in the Atari domain due to the dense
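
Editor's note: a rough sketch of the price-relation reward described in these hunks. Names, sign conventions and the volume weighting are assumptions for illustration, not the code in reward_functions.py.

    import numpy as np


    def price_relation_reward(broker_trades, market_clearings):
        """Relate the broker's average price to the market's average price.

        broker_trades / market_clearings: non-empty lists of (quantity_mwh, price)
        tuples for one target time slot, available only once that slot has closed.
        Returns a positive value when the broker traded at better prices than the
        market average and a negative value otherwise.
        """
        broker_q = np.array([q for q, _ in broker_trades])
        broker_p = np.array([p for _, p in broker_trades])
        market_q = np.array([q for q, _ in market_clearings])
        market_p = np.array([p for _, p in market_clearings])

        # volume-weighted average prices for the broker and the whole market
        broker_avg = np.average(broker_p, weights=np.abs(broker_q))
        market_avg = np.average(market_p, weights=np.abs(market_q))

        # paying less than the market average yields a positive reward;
        # assumes a nonzero market average price
        return (market_avg - broker_avg) / abs(market_avg)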
@@ -2136,11 +2138,11 @@ \subsection{Reward functions}%
 where the price is influenced by any market participant.

 Other reward functions are present in the \texttt{reward\_functions.py} file such as an automatically adjusting one that
-punishes balancing strongly at first and disregards the price but shifts towards the price based reward using a factor
-similar to $\alpha$ above once the balancing amounts are reduced. Generally, a lot of work is still required to
-construct a more ideal reward function. Reward functions are difficult to design, because systems tend to overfit on the
-reward function in a way that the results do not intuitively make sense but optimize towards the slightly misdefined
-reward function \cite[]{amodei2016concrete}.
+punishes balancing strongly at first and disregards the price but shifts towards the price based reward once the
+balancing amounts are reduced. Generally, more work is required to construct a better reward function.
+Reward functions are difficult to design, because systems tend to overfit on the reward function in a way that the
+results do not intuitively make sense but optimize towards the slightly misdefined reward function
+\cite[]{amodei2016concrete}.

 \subsection{Input preprocessing}%
 \label{sub:input_preprocessing}
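
Editor's note: the automatically adjusting reward mentioned above can be sketched as a blend whose weight shifts from a balancing penalty towards the price based reward as balancing shrinks. The concrete shift rule and the scale constant here are guesses, not the implementation in reward_functions.py.

    def shifting_reward(price_reward, balancing_kwh, scale=50.0):
        """Blend a balancing penalty with the price based reward.

        While the absolute balancing energy is large, alpha stays near 0 and the
        (negative) balancing term dominates; as the balancing amount shrinks,
        alpha approaches 1 and the price relation takes over. 'scale' is an
        assumed tuning constant.
        """
        alpha = 1.0 / (1.0 + abs(balancing_kwh) / scale)
        balancing_penalty = -abs(balancing_kwh) / scale
        return (1.0 - alpha) * balancing_penalty + alpha * price_reward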
@@ -2164,18 +2166,17 @@ \subsection{Tensorforce agent}%
 the environment, returns actions and learns when passed the required information. The development of \ac{RL} agents
 includes a lot of trial and error which is why I created another \ac{CLI} endpoint called \texttt{wholesale}. The
 \ac{CLI} allows the custom selection of the reward function, action type, network structure, agent type and tagging the
-trial with custom strings. It starts an instance of the \texttt{LogEnvManagerAdapter} which runs through all recorded
+trial with custom strings. It starts an instance of the \texttt{LogEnvManagerAdapter} which runs through recorded
 games and simulates the necessary events. To run several trials, I created a helper tool that automatically generates
 these CLI calls and runs all variations of them in sequence. In total, dozens of offline simulation approximations were
-run during the development and a set of
-%TODO or more?
-72 configurations were run with a variety of hyperparameters. Each run included 5 simulated games which result in
-roughly 200.000 learning steps and the average reward of the last game is considered the final performance of that run.
-The following table summarizes all trials executed.
+run during the development and a set of 72 configurations were run as a final analysis with a variety of
+hyperparameters. Each run included 5 simulated games which result in roughly 200.000 learning steps and the average
+reward of the last game is considered the final performance of that run. Table~\ref{tab:trading} summarizes all trials
+executed.

 \begin{table}[]
 \caption{Wholesale offline trading results overview for various hyperparameters}
-\label{fig:wstable}
+\label{tab:trading}

 \resizebox{\textwidth}{0.48\textheight}{
 \begin{tabular}{l|l|l|l|l|l}
@@ -2256,8 +2257,8 @@ \subsection{Tensorforce agent}%
 \end{tabular}
 }
 \end{table}
-The reward function shown in the table were received by multiplying the results of the original reward functions by 1000
-to improve signal strength. Forecasting error was set at 2\% per time slot, the network configurations can be seen in
+The reward values shown in the table were received by multiplying the results of the original reward functions by 1000
+to improve signal strength. The forecasting error was set at 2\% per time slot, the network configurations can be seen in
 the \texttt{broker-python} repository\footnote{All trials were recorded using tensorboard and are included in the
 attached DVD}.
 As a frame of reference, the benchmark agent which always orders exactly what is forecast at every time step and offers
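
Editor's note: the helper tool described in the preceding hunk, which generates all CLI call variations and runs them in sequence, can be approximated as below. The module path, flag names and hyperparameter values are hypothetical placeholders, not the actual wholesale endpoint options.

    import itertools
    import subprocess

    # hypothetical hyperparameter grid; the real option names and values differ
    grid = {
        "--reward":  ["price_relation", "shifting"],
        "--action":  ["continuous", "discrete"],
        "--network": ["dense_small", "dense_large"],
        "--agent":   ["ppo", "vpg"],
    }

    keys = list(grid.keys())
    for combo in itertools.product(*(grid[k] for k in keys)):
        cmd = ["python", "-m", "broker", "wholesale"]   # assumed entry point
        for key, value in zip(keys, combo):
            cmd += [key, value]
        cmd += ["--tag", "-".join(combo)]               # tag the trial with its config
        subprocess.run(cmd, check=False)                # run all trials in sequence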
@@ -2269,16 +2270,17 @@ \subsection{Tensorforce agent}%
 It is also not clearly visible what caused the wide range of rewards. One hypothesis is that the reward functions tried
 were not describing the problem appropriately enough for the agent to make good decisions. The reward in the
 \ac{PowerTAC} setting differs significantly from the reward functions in other research as it doesn't have a "way
-forward". Atari game rewards are directly taken from the games high scores and the Mujoco locomotion rewards are based
-on distance traveled \cite[]{heess2017emergence}. The wholesale trading reward described above in contrast offers no
-such progress forward. The actions of the agent, depending on the action type configured, allow for "easy" highly
+forward". Atari game rewards are directly taken from the games high scores and the Mujoco based locomotion rewards are
+describing the distance traveled \cite[]{heess2017emergence}. The wholesale trading reward described above in contrast offers no
+such \emph{progressing} forward. The actions of the agent, depending on the action type configured, allow for "easy" highly
 negative rewards but to achieve a good positive reward, the agent needs to find the right chain of trades that balances
-the portfolio at a good price. It may be that the agents get caught in local optima or that some other parameter setting
-was overlooked by me. Another hypothesis is the lack of memory of the agent. The wholesale trader
-implementations tried were all based on acyclic feed-forward networks and therefore contained no sense of memory. A
-\ac{LSTM} based approach may lead to better results, but an initial trial did not lead to an immediate improvement.
+the portfolio at a good price. All good actions of the first 23 steps can be ruined with a terrible trade at the end.
+This doesn't apply to the formerly mentioned environments. It may also be that the agents get caught in local optima or that
+some other parameter setting was overlooked by me. Another hypothesis is the lack of memory of the agent. The wholesale
+trader implementations tried were all based on acyclic feed-forward networks and therefore contained no sense of memory.
+A \ac{LSTM} based approach may lead to better results, but an initial trial did not lead to an immediate improvement.

-In summary, I was not yet able to create a learning neural network based algorithm that showed any significant level of
+In summary, I was not able to create a learning neural network based algorithm that showed any significant level of
 competency in the wholesale trading environment. Several reward function schemes, implementations and hyperparameters
 have been tried but further investigation is required to determine why the performance of the agent variants is as
 unsatisfying as it is.

src/main.tex

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-\documentclass[12pt,a4paper,oneside,hyphens]{report}
+\documentclass[12pt,a4paper,oneside,hyphens, draft]{report}
 \input{head.tex}
 \input{glossary.tex}
