Ecole Theoretical Model
=======================

The ECOLE API and classes directly relate to the different components of
an episodic `partially-observable Markov decision process <https://en.wikipedia.org/wiki/Partially_observable_Markov_decision_process>`_
(PO-MDP).

Markov decision process
-----------------------
Consider a regular Markov decision process
:math:`(\mathcal{S}, \mathcal{A}, p_{init}, p_{trans}, R)`, whose components are

* a state space :math:`\mathcal{S}`
* an action space :math:`\mathcal{A}`
* an initial state distribution :math:`p_{init}: \mathcal{S} \to \mathbb{R}_{\geq 0}`
* a state transition distribution
  :math:`p_{trans}: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}_{\geq 0}`
* a reward function :math:`R: \mathcal{S} \to \mathbb{R}`.

.. note::

    The choice of having deterministic rewards :math:`r_t = R(s_t)` is made
    here to best fit the ECOLE API. It is not a restrictive choice though, as
    any MDP with stochastic rewards
    :math:`r_t \sim p_{reward}(r_t|s_{t-1},a_{t-1},s_{t})`
    can be converted into an equivalent MDP with deterministic rewards, by
    considering the reward as part of the state.
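
    To sketch that conversion (assuming a stochastic reward distribution
    :math:`p_{reward}` as above), one can define an augmented state
    :math:`\tilde{s}_t := (s_t, r_t)` together with

    .. math::

        \tilde{p}_{trans}(\tilde{s}_{t+1}|a_t,\tilde{s}_t) :=
        p_{trans}(s_{t+1}|a_t,s_t) \, p_{reward}(r_{t+1}|s_t,a_t,s_{t+1})
        \qquad \text{and} \qquad
        \tilde{R}(\tilde{s}_t) := r_t
        \text{,}

    so that the reward collected at each time step is a deterministic
    function of the augmented state.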

Together with an action policy

.. math::

    \pi: \mathcal{A} \times \mathcal{S} \to \mathbb{R}_{\geq 0}

an MDP can be unrolled to produce state-action trajectories

.. math::

    \tau=(s_0,a_0,s_1,\dots)

that obey the following joint distribution

.. math::

    \tau \sim \underbrace{p_{init}(s_0)}_{\text{initial state}}
    \prod_{t=0}^\infty \underbrace{\pi(a_t | s_t)}_{\text{next action}}
    \underbrace{p_{trans}(s_{t+1} | a_t, s_t)}_{\text{next state}}
    \text{.}
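
To make the unrolling procedure concrete, here is a minimal Python sketch. It
is not part of the ECOLE API: ``p_init``, ``p_trans``, ``reward`` and
``policy`` below are hypothetical stand-ins for :math:`p_{init}`,
:math:`p_{trans}`, :math:`R` and :math:`\pi` on a toy state space.

.. code-block:: python

    import random

    TERMINAL = 3  # terminal state of this toy episodic MDP

    def p_init():
        """Sample an initial state s_0 ~ p_init."""
        return 0

    def p_trans(state, action):
        """Sample s_{t+1} ~ p_trans(. | a_t, s_t): the action may fail with probability 0.1."""
        return min(state + action, TERMINAL) if random.random() < 0.9 else state

    def reward(state):
        """Deterministic reward r_t = R(s_t)."""
        return 0.0 if state == TERMINAL else -1.0

    def policy(state):
        """Sample an action a_t ~ pi(a_t | s_t)."""
        return random.choice([1, 2])

    def unroll():
        """Sample one trajectory tau = (s_0, a_0, s_1, ...) and its total reward."""
        state = p_init()
        trajectory, total_reward = [state], reward(state)
        while state != TERMINAL:
            action = policy(state)
            state = p_trans(state, action)
            trajectory += [action, state]
            total_reward += reward(state)
        return trajectory, total_reward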

MDP control problem
^^^^^^^^^^^^^^^^^^^
We define the MDP control problem as that of finding a policy
:math:`\pi^\star` which is optimal with respect to the expected total
reward,

.. math::
    :label: mdp_control

    \pi^\star = \argmax_{\pi} \lim_{T \to \infty}
    \mathbb{E}_\tau\left[\sum_{t=0}^{T} r_t\right]
    \text{,}

where :math:`r_t := R(s_t)`.

.. note::

    In the general case this quantity may not be bounded, for example for
    MDPs that correspond to continuing tasks. In ECOLE we guarantee that all
    environments correspond to **episodic** tasks, that is, each episode is
    guaranteed to start from an initial state :math:`s_0`, and to end in a
    terminal state :math:`s_{final}`. For convenience this terminal state can
    be considered absorbing, i.e.,
    :math:`p_{trans}(s_{t+1}|a_t,s_t=s_{final}) := \delta_{s_{final}}(s_{t+1})`,
    and associated with a null reward, :math:`R(s_{final}) := 0`, so that all
    future states encountered after :math:`s_{final}` can be safely ignored
    in the MDP control problem.

Partially-observable Markov decision process
--------------------------------------------
In the PO-MDP setting, complete information about the current MDP state
is not necessarily available to the decision-maker. Instead,
at each step only a partial observation :math:`o \in \Omega`
is made available, which can be seen as the result of applying an observation
function :math:`O: \mathcal{S} \to \Omega` to the current state. As a result,
PO-MDP trajectories take the form

.. math::

    \tau=(o_0,r_0,a_0,o_1,\dots)
    \text{,}

where :math:`o_t:= O(s_t)` and :math:`r_t:=R(s_t)` are respectively the
observation and the reward collected at time step :math:`t`. Due to the
non-Markovian nature of those trajectories, that is,

.. math::

    o_{t+1},r_{t+1} \nindep o_0,r_0,a_0,\dots,o_{t-1},r_{t-1},a_{t-1} \mid o_t,r_t,a_t
    \text{,}

the decision-maker must take into account the whole history of past
observations, rewards and actions in order to decide on an optimal action
at the current time step :math:`t`. The PO-MDP policy then takes the form

.. math::

    \pi:\mathcal{A} \times \mathcal{H} \to \mathbb{R}_{\geq 0}
    \text{,}

where :math:`h_t:=(o_0,r_0,a_0,\dots,o_t,r_t)\in\mathcal{H}` represents the
PO-MDP history at time step :math:`t`, so that :math:`a_t \sim \pi(a_t|h_t)`.
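
Concretely, a PO-MDP policy can be implemented as an object that keeps the
growing history :math:`h_t` and maps it to a distribution over actions. The
following sketch is purely illustrative and is not part of the ECOLE API; the
uniform action choice stands in for any history-dependent rule.

.. code-block:: python

    import random

    class HistoryPolicy:
        """A policy pi(a_t | h_t) that conditions on the whole PO-MDP history."""

        def __init__(self, actions):
            self.actions = actions  # the action space A, as a list
            self.history = []       # h_t = (o_0, r_0, a_0, ..., o_t, r_t)

        def __call__(self, observation, reward):
            # Extend the history with the latest observation and reward.
            self.history += [observation, reward]
            # A real policy would map self.history to action probabilities;
            # here we sample uniformly to keep the sketch self-contained.
            action = random.choice(self.actions)
            self.history.append(action)
            return action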

PO-MDP control problem
^^^^^^^^^^^^^^^^^^^^^^
The PO-MDP control problem can then be written identically to the MDP one,

.. math::
    :label: pomdp_control

    \pi^\star = \argmax_{\pi} \lim_{T \to \infty}
    \mathbb{E}_\tau\left[\sum_{t=0}^{T} r_t\right]
    \text{.}

ECOLE as PO-MDP components
--------------------------

The following ECOLE components can be directly translated into PO-MDP
components of the above formulation:

* :py:class:`~ecole.typing.RewardFunction` <=> :math:`R`
* :py:class:`~ecole.typing.ObservationFunction` <=> :math:`O`
* :py:meth:`~ecole.typing.Dynamics.reset_dynamics` <=> :math:`p_{init}(s_0)`
* :py:meth:`~ecole.typing.Dynamics.step_dynamics` <=> :math:`p_{trans}(s_{t+1}|s_t,a_t)`

The :py:class:`~ecole.environment.EnvironmentComposer` class wraps all of
those components together to form the PO-MDP. Its API can be interpreted as
follows (a usage sketch is given after the list):

* :py:meth:`~ecole.environment.EnvironmentComposer.reset` <=>
  :math:`s_0 \sim p_{init}(s_0), r_0=R(s_0), o_0=O(s_0)`
* :py:meth:`~ecole.environment.EnvironmentComposer.step` <=>
  :math:`s_{t+1} \sim p_{trans}(s_{t+1}|a_t,s_t), r_{t+1}=R(s_{t+1}), o_{t+1}=O(s_{t+1})`
* ``done == True`` <=> the PO-MDP will now enter the terminal state,
  :math:`s_{t+1}=s_{final}`. As such, the current episode ends now.
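
As an illustration, a typical interaction loop is sketched below. This is only
a sketch: the concrete environment class ``ecole.environment.Branching``, the
instance file path, the ``policy`` callable, and the exact return signatures
of ``reset`` and ``step`` are assumptions made for the example and may differ
from the actual API.

.. code-block:: python

    import ecole

    # Hypothetical concrete environment built on EnvironmentComposer;
    # the class name and default constructor arguments are assumptions.
    env = ecole.environment.Branching()

    def policy(observation, action_set):
        """Placeholder policy: pick the first valid action from the action set."""
        return action_set[0]

    # reset: s_0 ~ p_init(s_0), r_0 = R(s_0), o_0 = O(s_0)
    observation, action_set, reward, done = env.reset("path/to/instance.lp")
    cumulative_reward = reward  # the initial reward r_0 is preserved (see note below)

    while not done:
        # a_t ~ pi(a_t | h_t)
        action = policy(observation, action_set)
        # step: s_{t+1} ~ p_trans(s_{t+1} | a_t, s_t), r_{t+1} = R(s_{t+1}), o_{t+1} = O(s_{t+1})
        observation, action_set, reward, done = env.step(action)
        cumulative_reward += reward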

The state space :math:`\mathcal{S}` can be considered to be the whole computer
memory occupied by the environment, which includes the state of the underlying
SCIP solver instance. The action space :math:`\mathcal{A}` is specific to each
environment.

.. note::

    We allow the environment to specify a set of valid actions at each time
    step :math:`t`. The ``action_set`` value returned by
    :py:meth:`~ecole.environment.EnvironmentComposer.reset` and
    :py:meth:`~ecole.environment.EnvironmentComposer.step` serves this
    purpose, and can be left to ``None`` when the action set is implicit.

.. note::

    As can be seen from :eq:`pomdp_control`, the initial reward :math:`r_0`
    returned by :py:meth:`~ecole.environment.EnvironmentComposer.reset`
    does not affect the control problem. In ECOLE we nevertheless chose to
    preserve this initial reward, in order to obtain meaningful cumulative
    episode rewards (e.g., total running time).