3.2 RL Environment
The environment E, labelled as ctc-executioner-v0 and a child class of gym.Env, is a simulator for order execution.
This section first provides an overview of the environment and then describes each component and its functionality.

The environment covers the entire process of an order execution, such that an agent using it does not have to be aware of the inner workings and can regard the execution process as a black box.
Upon initialization, an order book and a match engine are provided. The order book is the essential core that implicitly defines the state space and the outcome of each step. All other components, including the match engine, are therefore abstractions and mechanisms used to construct an environment in which order placement can be investigated and learned.
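To make the relationship between these components concrete, the following is a minimal sketch of the data structures involved; the class names and fields are assumptions chosen to mirror the description above, not the repository's actual API.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Illustrative data structures only; OrderbookState, Orderbook and MatchEngine are
# hypothetical names mirroring the description, not the repository's exact classes.

@dataclass
class OrderbookState:
    """One snapshot of the book: price levels on both sides."""
    bids: List[Tuple[float, float]]  # (price, volume), best bid first
    asks: List[Tuple[float, float]]  # (price, volume), best ask first

@dataclass
class Orderbook:
    """A time-ordered sequence of snapshots; this implicitly defines the state space."""
    states: List[OrderbookState] = field(default_factory=list)

class MatchEngine:
    """Abstraction that replays an order against successive book states."""
    def __init__(self, orderbook: Orderbook):
        self.orderbook = orderbook
```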
During the execution process, which is initiated by an agent calling the reset method, a memory serves as storage for an internal state that represents the ongoing execution; its values are updated as the agent proceeds through its epochs. (The current implementation supports only one execution stored in the memory, so multiple agents at a time would cause race conditions.)
With every step taken by the agent, a chain of tasks is processed, sketched in code after this list:
- The agent selects an action a and passes it to the environment.
- An internal state s (defined as ActionState) is constructed, derived either from the previous state or, if a new epoch has started, from the order book.
- An Order is then created according to the agent's remaining inventory, its remaining time horizon, and the specified action.
- The order is sent to the match engine, which attempts to execute it in the current order book state (from which the agent's state was derived) and in the following order book states, provided the time horizon is not yet consumed.
- The matching results in no execution, a partial execution, or a full execution of the submitted order. Whatever the outcome, a reward can be derived, alongside the next state (again derived from the order book) and whether the epoch is done.
- These values are then stored in the memory and returned to the agent, which may take another step.
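The following is a schematic Python sketch of this chain of tasks. The names ActionState and Order are taken from the description above, while every signature and helper call (from_orderbook, match, reward_from, advance) is an assumption made for illustration and does not reproduce the repository's actual implementation.

```python
# Schematic sketch only -- helper names and signatures are assumed, not the repo's API.
def step(env, action):
    # 1. Derive the internal state s: continue from the previous state, or build a
    #    fresh one from the order book if a new epoch has just started.
    state = env.memory.last_state or ActionState.from_orderbook(env.orderbook)

    # 2. Create an Order from the remaining inventory, the remaining time horizon,
    #    and the limit level selected by the agent.
    order = Order(inventory=state.remaining_inventory,
                  horizon=state.remaining_horizon,
                  level=action)

    # 3. The match engine attempts execution in the current order book state and the
    #    following states, as long as the time horizon is not yet consumed.
    trades, next_book_state = env.match_engine.match(order, state.remaining_horizon)

    # 4. The outcome (no, partial, or full execution) determines the reward, the next
    #    state, and whether the epoch is done.
    reward = env.reward_from(trades, state)
    next_state = state.advance(trades, next_book_state)
    done = next_state.remaining_inventory == 0 or next_state.remaining_horizon == 0

    # 5. Store the transition in memory and return it to the agent.
    env.memory.store(next_state)
    return next_state.observation(), reward, done, {}
```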
Unlike in most traditional reinforcement learning environments, each step taken by the agent leads to a complete change of the state space. Consider a chess board environment, where the state space is the board equipped with pieces. After every move made by the agent, the state space looks exactly the same, except for the piece moved in that step. This process continues until the agent either wins or loses the game, at which point the state space is reset to the very same configuration as at the beginning of the previous epoch. In the execution environment, however, the state space will likely never be the same, since a random sequence of order book states throughout an epoch defines the state space. Because these order book states are likely to differ at every step, the state the agent is in changes accordingly. It is as if not only one or two pieces on the chess board changed their position, but almost all of them.
An agent that is compatible with the OpenAI gym.Env interface will be able to make use of this environment.
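As a usage sketch under that interface, a minimal interaction loop could look as follows; the environment id is taken from the text above, whereas the registering import and the random policy are illustrative assumptions, not the repository's documented usage.

```python
import gym
import gym_ctc_executioner  # module assumed to register ctc-executioner-v0

env = gym.make("ctc-executioner-v0")

for epoch in range(10):
    observation = env.reset()      # starts a new execution epoch
    done = False
    while not done:
        action = env.action_space.sample()               # illustrative random policy
        observation, reward, done, info = env.step(action)
env.close()
```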
At each time step the agent selects an action a_t from the set of legal actions A = {l_min, ..., l_max}, where l_min is the most negative and l_max the most positive limit level.
The discrete action space has a size equal to the number of configured limit levels.
Its actions a ∈ Z represent the limit level in $0.10 steps.
The action space is configurable; the default implementation has size 101, derived from limit levels ranging from -50 up to +50. Negative limit levels indicate a placement deep in the book, whereas positive levels relate to levels on the opposing side of the book.
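As an illustration of this default mapping, the 101 discrete actions indexed 0 to 100 correspond to limit levels -50 to +50, each standing for one $0.10 step relative to a reference price. The helper names below are hypothetical, and the direction of the offset (towards the agent's own side or the opposing side) depends on whether the order buys or sells.

```python
def action_to_limit_level(action_index: int, min_level: int = -50, max_level: int = 50) -> int:
    """Map a discrete action index (0..100 by default) onto a limit level (-50..+50)."""
    if not 0 <= action_index <= max_level - min_level:
        raise ValueError("action index out of range")
    return min_level + action_index

def level_to_price_offset(level: int, tick: float = 0.10) -> float:
    """Each limit level represents one $0.10 step; the sign convention (own side vs.
    opposing side of the book) depends on the side of the order."""
    return level * tick

# Example: index 75 maps to level +25, i.e. an offset of $2.50 from the reference price.
print(action_to_limit_level(75), level_to_price_offset(25))  # -> 25 2.5
```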