 Ecole Theoretical Model
 =======================

-The ECOLE API and classes directly relate to the different components of
+The Ecole API and classes directly relate to the different components of
 an episodic `partially-observable Markov decision process <https://en.wikipedia.org/wiki/Partially_observable_Markov_decision_process>`_
 (PO-MDP).

@@ -20,7 +20,7 @@ Consider a regular Markov decision process
 .. note::

     The choice of having deterministic rewards :math:`r_t = R(s_t)` is
-    arbitrary here, in order to best fit the ECOLE API. Note that it is
+    arbitrary here, in order to best fit the Ecole API. Note that it is
     not a restrictive choice though, as any MDP with stochastic rewards
     :math:`r_t \sim p_{reward}(r_t|s_{t-1},a_{t-1},s_{t})`
     can be converted into an equivalent MDP with deterministic ones,
@@ -56,16 +56,16 @@ reward,
 .. math::
     :label: mdp_control

-    \pi^\star = \argmax_{\pi} \lim_{T \to \infty}
-    \mathbb{E}_\tau \left[\sum_{t=0}^{T} r_t\right]
+    \pi^\star = \underset{\pi}{\operatorname{arg\,max}}
+    \lim_{T \to \infty} \mathbb{E}_\tau \left[\sum_{t=0}^{T} r_t\right]
     \text{,}

 where :math:`r_t := R(s_t)`.

 .. note::

     In the general case this quantity may not be bounded, for example for MDPs
-    that correspond to continuing tasks. In ECOLE we garantee that all
+    that correspond to continuing tasks. In Ecole we guarantee that all
     environments correspond to **episodic** tasks, that is, each episode is
     guaranteed to start from an initial state :math:`s_0`, and end in a
     terminal state :math:`s_{final}`. For convenience this terminal state can
@@ -95,7 +95,7 @@ non-Markovian nature of those trajectories, that is,

 .. math::

-    o_{t+1},r_{t+1} \nindep o_0,r_0,a_0,\dots,o_{t-1},r_{t-1},a_{t-1} \mid o_t,r_t,a_t
+    o_{t+1},r_{t+1} \mathop{\rlap{\perp}\mkern2mu{\not\perp}} o_0,r_0,a_0,\dots,o_{t-1},r_{t-1},a_{t-1} \mid o_t,r_t,a_t
     \text{,}

 the decision-maker must take into account the whole history of past
@@ -117,14 +117,14 @@ The PO-MDP control problem can then be written identically to the MDP one,
 .. math::
     :label: pomdp_control

-    \pi^\star = \argmax_{\pi} \lim_{T \to \infty}
+    \pi^\star = \underset{\pi}{\operatorname{arg\,max}} \lim_{T \to \infty}
     \mathbb{E}_\tau \left[\sum_{t=0}^{T} r_t\right]
     \text{.}

-ECOLE as PO-MDP components
+Ecole as PO-MDP components
 --------------------------

-The following ECOLE components can be directly translated into PO-MDP
+The following Ecole components can be directly translated into PO-MDP
 components from the above formulation:

 * :py:class:`~ecole.typing.RewardFunction` <=> :math:`R`
@@ -160,6 +160,6 @@ environment.

     As can be seen from :eq:`pomdp_control`, the initial reward :math:`r_0`
     returned by :py:meth:`~ecole.environment.EnvironmentComposer.reset`
-    does not affect the control problem. In ECOLE we
+    does not affect the control problem. In Ecole we
     nevertheless chose to preserve this initial reward, in order to obtain
     meaningful cumulative episode rewards (e.g., total running time).