@@ -102,7 +102,7 @@ \chapter{Introduction}
 takes approximately two to three hours to complete, and each time slot, a slot in which energy is produced, consumed and
 traded, takes five seconds. Previous researchers have
 identified the problem as a \ac{POMDP}, a common model in the \ac{RL} literature \cite[]{tactexurieli2016mdp}. Deep neural network
-architectures have proven to be successful in solving games in a variety of instances. It is therefore intuitive to
+architectures have proven to be successful in solving games in a variety of instances. It is intuitive to
 attempt to apply such architectures to the problems posed by the \ac{PowerTAC} simulation. Unfortunately, most such
 implementations are only available in Python \cite[]{baselines, plappert2016kerasrl, schaarschmidt2017tensorforce} and
 \ac{PowerTAC} is almost exclusively based on Java. An extension of the current communication protocols to other
@@ -118,7 +118,7 @@ \chapter{Introduction}
 attempt to align their decisions to those that their teacher would make \cite[]{schmitt2018kickstarting}. High level
 problem solving agents may be trained by first training several small narrow focus agent networks on sub problems and
 then applying transfer learning to transfer the knowledge from the narrow focus agents to the generic high level
-https://itsfoss.com/pdf-editors-linux/ agent \cite[]{parisotto2015actor}. For problems where a reward function is difficult to construct, \emph{inverse
+agent \cite[]{parisotto2015actor}. For problems where a reward function is difficult to construct, \emph{inverse
 reinforcement learning} can be used to train an agent to behave similarly to an observable expert. The policy function of
 the agent shows good performance despite lacking a specific reward function \cite[]{NG2004Apprentice}.

@@ -1922,7 +1922,7 @@ \section{Wholesale market}
 \citet{tactexurieli2016mdp} was assumed. More specifically, the agent only concerns itself with the activities in the
 wholesale market and does not act or evaluate tariff market or balancing market activities. This is due to the
 separation of concerns approach described earlier. It is therefore a \ac{MDP} that can be solved with \ac{RL} techniques.
-The goal was the ability to apply current and future neural network implementations to the \ac{PowerTAC} problem set. For this,
+The goal was the ability to apply current and future deep \ac{RL} implementations to the \ac{PowerTAC} problem set. For this,
 many of the previously described implementations were necessary. Now that a Python based broker is possible, application
 of \ac{PPO}, \ac{DQN} and other modern \ac{RL} agent implementations seems reasonable. All required messages can be
 subscribed to via the publish-subscribe pattern. What is missing are the following components which are explained in
@@ -1945,7 +1945,7 @@ \subsection{\ac{MDP} design}%
 \subsubsection{\ac{MDP} design comparison}%
 \label{ssub:mdp_design_comparison}

-There are two possible ways of modelling the \ac{MDP}: per time slot or per game. Per time slot is aligned to the
+There are two possible ways of modeling the \ac{MDP}: per time slot or per game. Per time slot is aligned to the
 definition by \citet{tactexurieli2016mdp}. Per game considers each game a unified \ac{MDP} where the agent acts in all
 time slots and therefore has an action space of 48 values per time slot.

@@ -1967,10 +1967,10 @@ \subsubsection{\ac{MDP} design comparison}%
 such as \ac{DQN}, \ac{SARSA} or \ac{A3C} are not easily applied to such large action spaces. They are written to be
 applied to discrete action spaces \cite[]{baselines}. \ac{PowerTAC} trading is in its purest form a continuous action
 space, allowing the agent to define both amount and price for a target time slot. Furthermore, the agent would observe
-24 environments in parallel and generate 24 largely independent trading decisions. The network would have to learn to
-match each input block to an output action, as the input for time slot 370 has little effect on the action that should
-be taken in time slot 380. In a separated \ac{MDP}, each environment observation would only hold the data needed for the
-specific time slot rather than information about earlier and later slots.
+information for 24 open time slots in parallel and generate 24 largely independent trading decisions. The network would
+have to learn to match each input block to an output action, as the input for time slot 370 has little effect on the
+action that should be taken in time slot 380. In a separated \ac{MDP}, each environment observation would only hold the
+data needed for the specific time slot rather than information about earlier and later slots.

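To make the comparison concrete, the two designs imply very differently shaped action spaces. The following is a minimal sketch in Gym notation; the shapes follow the numbers given above (24 open time slots, an amount and a price each), while the unbounded limits and the use of gym.spaces are illustrative assumptions rather than the actual broker implementation.

# Illustrative sketch of the two MDP layouts discussed above, expressed as
# Gym-style spaces. Bounds and the use of gym.spaces are assumptions.
import numpy as np
from gym import spaces

# Per-game MDP: one action covers all 24 open time slots at once,
# i.e. 24 x (amount, price) = 48 continuous values in every time slot step.
per_game_action_space = spaces.Box(low=-np.inf, high=np.inf, shape=(24, 2), dtype=np.float32)

# Per-time-slot MDP: one environment instance per target time slot, so each
# observation only carries data for that slot and the action is a single
# (amount, price) pair.
per_slot_action_space = spaces.Box(low=-np.inf, high=np.inf, shape=(2,), dtype=np.float32)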
 \subsubsection{\ac{MDP} implementation}%
 \label{sub:mdp_design_and_implementation}
@@ -2021,7 +2021,7 @@ \subsection{Reversal of program flow control}%

 The environment expects the agent to expose an \ac{API} that includes two calls: \texttt{forward} and
 \texttt{backward}. This pattern has been adopted from the keras-rl and Tensorforce libraries. The reason is simple:
-While most libraries put the agent control of the program flow, the \ac{PowerTAC} broker will be stepped by the
+While most libraries put the agent in control of the program flow, the \ac{PowerTAC} broker will be stepped by the
 server and the \ac{RL} agent itself has no control of the flow. The forward and backward methods are
 directly aligned with the keras-rl framework and easily applicable to the Tensorforce \texttt{act()} and
 \texttt{atomic\_observe()} methods of their agent implementations. The abstract \texttt{PowerTacWholesaleAgent} class just defines a
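As a rough illustration of this inverted control flow, the following sketch shows a forward/backward agent interface in the keras-rl style (Agent.forward(observation) and Agent.backward(reward, terminal)); the class name, docstrings and exact signatures are assumptions and may differ from the actual PowerTacWholesaleAgent definition.

# Sketch of a forward/backward agent interface with keras-rl style semantics;
# names and signatures are assumptions, not the actual PowerTacWholesaleAgent.
from abc import ABC, abstractmethod


class WholesaleAgentSketch(ABC):
    """The environment drives the agent: forward() is called whenever the
    PowerTAC server steps the broker, backward() once feedback is available."""

    @abstractmethod
    def forward(self, observation):
        """Map the current observation to a trading action, e.g. (amount, price)."""

    @abstractmethod
    def backward(self, reward, terminal):
        """Receive the (possibly delayed) reward for an earlier action and learn."""


# Simplified environment-side usage: the loop belongs to the environment.
#   action = agent.forward(observation)
#   ...the server clears the market, a reward becomes available later...
#   agent.backward(reward, terminal=False)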
@@ -2065,12 +2065,13 @@ \subsection{Reward functions}%
 environment \cite[p.469ff.]{amodei2016concrete, sutton2018reinforcement}. While the Atari agents often receive their reward directly from the
 game, as many games include a game point counter \cite[]{mnih2013playing}, \ac{PowerTAC} technically simulates a
 real-world energy market, which means the score equals the broker's profit. Nonetheless, the profit is dependent on a number of
-factors and therefore hardly a good choice for a reward proxy. Using the purchase prices of the energy purchased is also
-noisy, as it depends on the supply and demand of the entire market. Generally, the broker attempts to purchase energy at
-a good price and a good price can be defined as one that is better than that of other participants in the market. A
-reward function based on the relation between the average price paid by the broker and the average price paid by the
-overall market hence describes how well the agent did in comparison to the others and consequently removes the market
-price fluctuation noise from the reward values.
+factors, of which the wholesale trading component makes up only a comparatively small part, and is thus hardly a
+good choice for a reward proxy. Using the prices paid for the energy purchased is also noisy, as it depends on the
+supply and demand of the entire market. Generally, the broker attempts to trade energy at a good price, and a good
+price can be defined as one that is better than that of other participants in the market. A reward function based on the
+relation between the average price paid by the broker and the average price paid by the overall market hence describes
+how well the agent did in comparison to the others and consequently removes the market price fluctuation noise from the
+reward values.

 To calculate this reward, all the purchases of the agent as well as all market clearings are averaged for a given target
 time slot.
@@ -2090,10 +2091,11 @@ \subsection{Reward functions}%
 \end{equation}

 for both the market averages and the broker averages. This encourages the agent to buy for low prices and to sell for high
-prices when possible. $sum(q)$ is the net purchasing amount after the 24 trading opportunities are completed, i.e.\ did
-the broker end up with a positive or negative net flow of energy in the wholesale market. This reward function has one
-one immediate drawback: it can only be calculated once the market for the target time slot is closed. The agent
-therefore doesn't get any feedback during any step except the terminal state.
+prices when possible. $sum(q)$ is the net purchasing amount after the 24 trading opportunities are completed,
+i.e.\ it describes whether the broker ended up with a positive or negative net flow of energy in the wholesale
+market. This reward function has one immediate drawback: it can only be calculated once the market for the
+target time slot is closed. The agent therefore doesn't get any feedback during any step except the terminal
+state.

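A minimal sketch of how such a relative-price reward can be computed once the target time slot has closed follows; the helper names, the orientation of the ratio and the omission of the sum(q) term are simplifying assumptions and do not reproduce the exact formula above.

# Hedged sketch of a relative-price reward for one closed target time slot.
# Helper names and the ratio orientation are assumptions; the sell side and
# the net-flow term sum(q) are omitted for brevity.

def weighted_average_price(trades):
    """trades: iterable of (quantity_kwh, price) pairs; volume-weighted mean."""
    total = sum(abs(q) for q, _ in trades)
    if total == 0:
        return 0.0
    return sum(abs(q) * p for q, p in trades) / total

def relative_price_reward(broker_trades, market_clearings):
    """Compare the broker's average price with the overall market average,
    which removes the general market price fluctuation from the signal."""
    broker_avg = weighted_average_price(broker_trades)
    market_avg = weighted_average_price(market_clearings)
    if broker_avg == 0 or market_avg == 0:
        return 0.0
    # For a net-buying broker, a ratio above 1 means it paid less than the
    # market average for its energy, i.e. it traded comparatively well.
    return market_avg / broker_avg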
 While \ac{RL} research has described sparse rewards as a core part of \ac{RL}, many of the recent algorithms do
 not deal well with such sparse rewards. Experience replay works so well in the Atari domain partly due to the dense
@@ -2136,11 +2138,11 @@ \subsection{Reward functions}%
 where the price is influenced by any market participant.

 Other reward functions are present in the \texttt{reward\_functions.py} file, such as an automatically adjusting one that
-punishes balancing strongly at first and disregards the price but shifts towards the price based reward using a factor
-similar to $\alpha$ above once the balancing amounts are reduced. Generally, a lot of work is still required to
-construct a more ideal reward function. Reward functions are difficult to design, because systems tend to overfit on the
-reward function in a way that the results do not intuitively make sense but optimize towards the slightly misdefined
-reward function \cite[]{amodei2016concrete}.
+punishes balancing strongly at first and disregards the price but shifts towards the price-based reward once the
+balancing amounts are reduced. Generally, more work is required to construct a better reward function.
+Reward functions are difficult to design because systems tend to overfit to the reward function in a way that the
+results do not intuitively make sense but optimize towards the slightly misdefined reward function
+\cite[]{amodei2016concrete}.

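As a hedged sketch, the shifting behaviour described here could look roughly as follows, assuming a blend that is weighted by how much energy still has to be balanced; the blending rule and parameter names are assumptions, and the actual logic in reward_functions.py may differ.

# Hedged sketch of a shifting reward: a strong balancing penalty dominates at
# first, and the price-based reward takes over once balancing amounts shrink.
# The blending rule and names are assumptions, not the reward_functions.py code.

def shifting_reward(balancing_kwh, traded_kwh, price_reward):
    """Blend a balancing penalty with a price-based reward term."""
    # Share of the handled energy that had to be balanced by the utility.
    denom = max(abs(traded_kwh), 1e-6)
    balancing_share = min(abs(balancing_kwh) / denom, 1.0)
    # alpha grows towards 1 as balancing is reduced, shifting the focus from
    # avoiding imbalance to trading at good prices.
    alpha = 1.0 - balancing_share
    return alpha * price_reward - (1.0 - alpha) * balancing_share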
 \subsection{Input preprocessing}%
 \label{sub:input_preprocessing}
@@ -2164,18 +2166,17 @@ \subsection{Tensorforce agent}%
 the environment, returns actions and learns when passed the required information. The development of \ac{RL} agents
 includes a lot of trial and error, which is why I created another \ac{CLI} endpoint called \texttt{wholesale}. The
 \ac{CLI} allows custom selection of the reward function, action type, network structure and agent type, as well as tagging the
-trial with custom strings. It starts an instance of the \texttt{LogEnvManagerAdapter} which runs through all recorded
+trial with custom strings. It starts an instance of the \texttt{LogEnvManagerAdapter} which runs through recorded
 games and simulates the necessary events. To run several trials, I created a helper tool that automatically generates
 these CLI calls and runs all variations of them in sequence. In total, dozens of offline simulation approximations were
-run during the development and a set of
-% TODO or more?
-72 configurations were run with a variety of hyperparameters. Each run included 5 simulated games which result in
-roughly 200.000 learning steps and the average reward of the last game is considered the final performance of that run.
-The following table summarizes all trials executed.
+run during the development and a set of 72 configurations was run as a final analysis with a variety of
+hyperparameters. Each run included 5 simulated games, which results in roughly 200,000 learning steps, and the average
+reward of the last game is considered the final performance of that run. Table~\ref{tab:trading} summarizes all trials
+executed.

 \begin{table}[]
     \caption{Wholesale offline trading results overview for various hyperparameters}
-    \label{fig:wstable}
+    \label{tab:trading}

     \resizebox{\textwidth}{0.48\textheight}{
     \begin{tabular}{l|l|l|l|l|l}
@@ -2256,8 +2257,8 @@ \subsection{Tensorforce agent}%
     \end{tabular}
     }
 \end{table}
-The reward function shown in the table were received by multiplying the results of the original reward functions by 1000
-to improve signal strength. Forecasting error was set at 2\% per time slot, the network configurations can be seen in
+The reward values shown in the table were obtained by multiplying the results of the original reward functions by 1000
+to improve signal strength. The forecasting error was set at 2\% per time slot; the network configurations can be seen in
 the \texttt{broker-python} repository\footnote{All trials were recorded using TensorBoard and are included in the
 attached DVD}.
 As a frame of reference, the benchmark agent, which always orders exactly what is forecast at every time step and offers
@@ -2269,16 +2270,17 @@ \subsection{Tensorforce agent}%
 It is also not clear what caused the wide range of rewards. One hypothesis is that the reward functions tried
 did not describe the problem well enough for the agent to make good decisions. The reward in the
 \ac{PowerTAC} setting differs significantly from the reward functions in other research as it doesn't have a ``way
-forward". Atari game rewards are directly taken from the games high scores and the Mujoco locomotion rewards are based
-on distance traveled \cite[]{heess2017emergence}. The wholesale trading reward described above in contrast offers no
-such progress forward. The actions of the agent, depending on the action type configured, allow for "easy" highly
+forward''. Atari game rewards are directly taken from the games' high scores and the MuJoCo-based locomotion rewards
+describe the distance traveled \cite[]{heess2017emergence}. The wholesale trading reward described above, in contrast, offers no
+such forward \emph{progression}. The actions of the agent, depending on the action type configured, allow for ``easy'' highly
 negative rewards, but to achieve a good positive reward, the agent needs to find the right chain of trades that balances
-the portfolio at a good price. It may be that the agents get caught in local optima or that some other parameter setting
-was overlooked by me. Another hypothesis is the lack of memory of the agent. The wholesale trader
-implementations tried were all based on acyclic feed-forward networks and therefore contained no sense of memory. A
-\ac{LSTM} based approach may lead to better results, but an initial trial did not lead to an immediate improvement.
+the portfolio at a good price. All good actions in the first 23 steps can be ruined by a terrible trade at the end.
+This doesn't apply to the previously mentioned environments. It may also be that the agents get caught in local optima or that
+some other parameter setting was overlooked by me. Another hypothesis is the lack of memory of the agent. The wholesale
+trader implementations tried were all based on acyclic feed-forward networks and therefore contained no sense of memory.
+An \ac{LSTM}-based approach may lead to better results, but an initial trial did not lead to an immediate improvement.

-In summary, I was not yet able to create a learning neural network based algorithm that showed any significant level of
+In summary, I was not able to create a learning neural-network-based algorithm that showed any significant level of
 competency in the wholesale trading environment. Several reward function schemes, implementations and hyperparameters
 have been tried, but further investigation is required to determine why the performance of the agent variants is as
 unsatisfying as it is.