3.2 RL Environment
The environment, registered as ctc-executioner-v0, is a child class of gym.Env in which the methods step and reset are implemented to simulate order executions.
This section first provides an overview of the environment and then describes each component and its functionality.
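As a point of reference, a custom environment id like this is typically registered with Gym roughly as follows; the entry point module and class name shown here are assumptions for illustration and not necessarily the layout used in this repository.

```python
from gym.envs.registration import register

# Hypothetical registration sketch; the actual entry point may differ.
register(
    id='ctc-executioner-v0',
    entry_point='ctc_executioner.execution_env:ExecutionEnv',
)
```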

The environment covers the entire process of an order execution, such that an agent making use of this environment does not have to be aware of its inner workings and can treat the execution process as a black box.
Upon initialization, an order book and a match engine are provided. The order book is the essential core that implicitly defines the state space and the outcome of each step. All other components, including the match engine, are therefore abstractions and mechanisms used to construct an environment in which order placement can be investigated and learned.
During the execution process, which is initiated by an agent, a memory serves as the storage for the ongoing execution; its values are updated as the agent proceeds through its epochs. (The current implementation supports only one execution in memory at a time, so running multiple agents simultaneously would cause race conditions.) A simplified sketch of this structure is given below.
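The following is a minimal sketch, assuming hypothetical constructor arguments and attribute names (orderbook_states, match_engine, inventory, horizon); it illustrates the initialization, the random starting point of an epoch, and the single-slot memory, not the actual implementation.

```python
import random
import gym

class ExecutionEnv(gym.Env):
    """Simplified sketch of the described environment (names are assumptions)."""

    def __init__(self, orderbook_states, match_engine, inventory=100.0, horizon=10):
        self.orderbook_states = orderbook_states     # historical order book snapshots
        self.match_engine = match_engine             # matches orders against snapshots
        self.inventory = inventory                   # amount to execute per epoch
        self.horizon = horizon                       # number of steps per epoch
        self.action_space = gym.spaces.Discrete(11)  # e.g. price levels relative to best
        self.memory = None                           # single slot for the ongoing execution

    def reset(self):
        # A new epoch starts from a randomly chosen order book state.
        self.t = random.randrange(len(self.orderbook_states))
        self.remaining = self.inventory
        self.steps_left = self.horizon
        self.memory = None
        return self._make_state()

    def _make_state(self):
        # The agent's state ("ActionState") is derived from the current order book
        # snapshot together with the remaining inventory and time horizon.
        return (self.orderbook_states[self.t], self.remaining, self.steps_left)
```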
With every step taken by the agent, a chain of tasks is processed (see the step sketch after this list):
- A state (defined as ActionState) is constructed; it can either be derived from the previous state or, if a new epoch has started, from the order book.
- An Order is then created according to the agent's remaining inventory, the remaining time horizon, and the specified action to be taken.
- The order is sent to the match engine, which attempts to execute it in the current order book state (from which the agent's state was derived) and in the following order book states, provided the time horizon is not yet consumed.
- The matching results in either no execution, a partial execution, or a full execution of the submitted order. Whatever the outcome, a reward can be derived, along with the next state (again derived from the order book) and whether the epoch is done.
- These values are then stored in the memory and returned to the agent, which may take another step.
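Continuing the sketch above, this chain of tasks could be expressed in a step method along the following lines; the match engine interface (a match call returning filled size and proceeds) and the reward definition are simplified assumptions, not the repository's actual code.

```python
    def step(self, action):
        # 1) Construct the agent's state, either carried over from the previous
        #    step or derived from the order book at the start of a new epoch.
        book, remaining, steps_left = self.memory[0] if self.memory else self._make_state()

        # 2) Create an order from the remaining inventory, the remaining time
        #    horizon and the chosen action (here: a relative price level).
        order = {'size': remaining, 'level': action, 'steps_left': steps_left}

        # 3) The match engine attempts to execute the order against the current
        #    and following order book states (assumed interface, illustration only).
        filled, proceeds = self.match_engine.match(order, self.orderbook_states, self.t)

        # 4) Derive the reward, the next state and the done flag from the result.
        self.remaining = remaining - filled
        self.steps_left = steps_left - 1
        self.t = min(self.t + 1, len(self.orderbook_states) - 1)
        reward = proceeds                     # e.g. proceeds relative to a benchmark
        done = self.remaining <= 0 or self.steps_left <= 0
        next_state = self._make_state()

        # 5) Store the transition in the single-slot memory and return it.
        self.memory = (next_state, reward, done)
        return next_state, reward, done, {}
```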
Unlike in most traditional reinforcement learning environments, each step taken by the agent leads to a complete change of the state space. Consider a chess environment, where the state space is the board equipped with pieces. After every move taken by the agent, the state space would look exactly the same, except for the piece moved in that step. This process continues until the agent either wins or loses the game, at which point the state space is reset to the very same state as at the beginning of the previous epoch. In the execution environment, however, the state will likely never be the same, since a random order book state is chosen at the beginning of each epoch and the initial state is derived from it. From there, the execution proceeds, leading to another state that is again derived from an order book state. Since these order book states are likely to differ, the state the agent is in changes accordingly. It is as if not just one or two pieces on the chess board changed position, but almost all of them.
An agent that is compatible with the OpenAI gym.Env interface will be able to make use of this environment.
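For example, a simple random agent could drive the sketched environment as follows; load_snapshots and SimpleMatchEngine are placeholders for whatever provides the historical order book data and the matching logic.

```python
# Placeholder data source and match engine; any objects with the assumed
# interfaces (a sequence of snapshots, a match(order, states, t) method) work.
env = ExecutionEnv(orderbook_states=load_snapshots(), match_engine=SimpleMatchEngine())

state = env.reset()
done = False
while not done:
    action = env.action_space.sample()            # a random agent for illustration
    state, reward, done, info = env.step(action)
```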