@@ -8,21 +8,21 @@ an episodic `partially-observable Markov decision process <https://en.wikipedia.
 Markov decision process
 -----------------------
 Consider a regular Markov decision process
-:math:`(\mathcal{S}, \mathcal{A}, p_{init}, p_{trans}, R)`, whose components are
+:math:`(\mathcal{S}, \mathcal{A}, p_\textit{init}, p_\textit{trans}, R)`, whose components are
 
 * a state space :math:`\mathcal{S}`
 * an action space :math:`\mathcal{A}`
-* an initial state distribution :math:`p_{init}: \mathcal{S} \to \mathbb{R}_{\geq 0}`
+* an initial state distribution :math:`p_\textit{init}: \mathcal{S} \to \mathbb{R}_{\geq 0}`
 * a state transition distribution
-  :math:`p_{trans}: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}_{\geq 0}`
+  :math:`p_\textit{trans}: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}_{\geq 0}`
 * a reward function :math:`R: \mathcal{S} \to \mathbb{R}`.
 
 .. note::
 
    The choice of having deterministic rewards :math:`r_t = R(s_t)` is
    arbitrary here, in order to best fit the Ecole API. It is not a
    restrictive choice, though: any MDP with stochastic rewards
-   :math:`r_t \sim p_{reward}(r_t|s_{t-1},a_{t-1},s_t)`
+   :math:`r_t \sim p_\textit{reward}(r_t|s_{t-1},a_{t-1},s_t)`
    can be converted into an equivalent MDP with deterministic ones,
    by considering the reward as part of the state.
 
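To make the tuple concrete, the following is a minimal sketch of such an MDP
in plain Python (a toy two-state chain with hypothetical names, not Ecole
code):

.. code-block:: python

   import random
   from dataclasses import dataclass
   from typing import Callable, List

   @dataclass
   class ToyMDP:
       states: List[int]                   # S
       actions: List[int]                  # A
       p_init: Callable[[], int]           # samples s_0
       p_trans: Callable[[int, int], int]  # samples s_{t+1} given (s_t, a_t)
       reward: Callable[[int], float]      # R(s), deterministic as above

   # Two states; action 1 tries to move to state 1, action 0 stays put.
   mdp = ToyMDP(
       states=[0, 1],
       actions=[0, 1],
       p_init=lambda: 0,
       p_trans=lambda s, a: max(s, a) if random.random() < 0.9 else s,
       reward=lambda s: float(s),
   )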
@@ -42,9 +42,9 @@ that obey the following joint distribution
 
 .. math::
 
-   \tau \sim \underbrace{p_{init}(s_0)}_{\text{initial state}}
+   \tau \sim \underbrace{p_\textit{init}(s_0)}_{\text{initial state}}
    \prod_{t=0}^\infty \underbrace{\pi(a_t | s_t)}_{\text{next action}}
-   \underbrace{p_{trans}(s_{t+1} | a_t, s_t)}_{\text{next state}}
+   \underbrace{p_\textit{trans}(s_{t+1} | a_t, s_t)}_{\text{next state}}
    \text{.}
 
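This factorization translates directly into ancestral sampling: draw
:math:`s_0`, then alternate action and state draws. A minimal sketch, reusing
the hypothetical ``mdp`` object from the previous snippet and truncating the
infinite product at a fixed horizon:

.. code-block:: python

   def sample_trajectory(mdp, policy, horizon=10):
       """Sample (s_0, a_0, s_1, a_1, ...) under a policy, ancestrally."""
       s = mdp.p_init()              # s_0 ~ p_init(s_0)
       trajectory = [s]
       for _ in range(horizon):
           a = policy(s)             # a_t ~ pi(a_t | s_t)
           s = mdp.p_trans(s, a)     # s_{t+1} ~ p_trans(s_{t+1} | a_t, s_t)
           trajectory += [a, s]
       return trajectory

   # A uniformly random policy over the toy action set.
   tau = sample_trajectory(mdp, policy=lambda s: random.choice(mdp.actions))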
 MDP control problem
@@ -68,11 +68,11 @@ where :math:`r_t := R(s_t)`.
 that correspond to continuing tasks. In Ecole we guarantee that all
 environments correspond to **episodic** tasks, that is, each episode is
 guaranteed to start from an initial state :math:`s_0`, and to end in a
-terminal state :math:`s_{final}`. For convenience this terminal state can
+terminal state :math:`s_\textit{final}`. For convenience this terminal state can
 be considered as absorbing, i.e.,
-:math:`p_{dyn}(s_{t+1}|a_t,s_t=s_{final}) := \delta_{s_{final}}(s_{t+1})`,
-and associated to a null reward, :math:`R(s_{final}) := 0`, so that all
-future states encountered after :math:`s_{final}` can be safely ignored in
+:math:`p_\textit{trans}(s_{t+1}|a_t,s_t=s_\textit{final}) := \delta_{s_\textit{final}}(s_{t+1})`,
+and associated with a null reward, :math:`R(s_\textit{final}) := 0`, so that all
+future states encountered after :math:`s_\textit{final}` can be safely ignored in
 the MDP control problem.
 
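One way to picture the absorbing convention: once :math:`s_\textit{final}` is
reached, the dynamics return it forever and the reward is null, so truncating
the episode there loses nothing. A toy sketch on top of the previous snippets
(the sentinel value is hypothetical, not Ecole code):

.. code-block:: python

   S_FINAL = -1  # designated terminal state (hypothetical sentinel)

   def p_trans_absorbing(s, a):
       # The terminal state is absorbing: it only transitions to itself ...
       if s == S_FINAL:
           return S_FINAL
       return mdp.p_trans(s, a)

   def reward_absorbing(s):
       # ... and yields a null reward, so all later states can be ignored.
       return 0.0 if s == S_FINAL else mdp.reward(s)
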
 Partially-observable Markov decision process
@@ -129,19 +129,19 @@ components from the above formulation:
 
 * :py:class:`~ecole.typing.RewardFunction` <=> :math:`R`
 * :py:class:`~ecole.typing.ObservationFunction` <=> :math:`O`
-* :py:meth:`~ecole.typing.Dynamics.reset_dynamics` <=> :math:`p_{init}(s_0)`
-* :py:meth:`~ecole.typing.Dynamics.step_dynamics` <=> :math:`p_{trans}(s_{t+1}|s_t,a_t)`
+* :py:meth:`~ecole.typing.Dynamics.reset_dynamics` <=> :math:`p_\textit{init}(s_0)`
+* :py:meth:`~ecole.typing.Dynamics.step_dynamics` <=> :math:`p_\textit{trans}(s_{t+1}|s_t,a_t)`
 
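In other words, a dynamics object packages the two distributions above as
methods. The class below is only a schematic stand-in built on the toy MDP
from the earlier snippets; it is not Ecole's actual interface, whose exact
signatures differ:

.. code-block:: python

   class ToyDynamics:
       """Schematic counterpart of ecole.typing.Dynamics (illustrative only)."""

       def reset_dynamics(self):
           # <=> s_0 ~ p_init(s_0): draw and return an initial state.
           return mdp.p_init()

       def step_dynamics(self, state, action):
           # <=> s_{t+1} ~ p_trans(s_{t+1} | s_t, a_t): advance the state.
           return mdp.p_trans(state, action)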
 The :py:class:`~ecole.environment.EnvironmentComposer` class wraps all of
 those components together to form the PO-MDP. Its API can be interpreted as
 follows (a usage sketch is given after the list):
 
 * :py:meth:`~ecole.environment.EnvironmentComposer.reset` <=>
-  :math:`s_0 \sim p_{init}(s_0), r_0=R(s_0), o_0=O(s_0)`
+  :math:`s_0 \sim p_\textit{init}(s_0), r_0=R(s_0), o_0=O(s_0)`
 * :py:meth:`~ecole.environment.EnvironmentComposer.step` <=>
-  :math:`s_{t+1} \sim p_{trans}(s_{t+1}|a_s,s_t), r_t=R(s_t), o_t=O(s_t)`
+  :math:`s_{t+1} \sim p_\textit{trans}(s_{t+1}|a_t,s_t), r_t=R(s_t), o_t=O(s_t)`
 * ``done == True`` <=> the PO-MDP will now enter the terminal state,
-  :math:`s_{t+1}==s_{final}`. As such, the current episode ends now.
+  :math:`s_{t+1}==s_\textit{final}`. As such, the current episode ends now.
 
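Putting the pieces together, an episode can then be driven by the familiar
reset/step loop. The sketch below runs on the toy objects from the previous
snippets, not on a real :py:class:`~ecole.environment.EnvironmentComposer`,
whose exact return values are documented in the API reference:

.. code-block:: python

   def run_episode(dynamics, reward_fn, obs_fn, policy, max_steps=100):
       """Generic reset/step loop over the sketched dynamics (not Ecole's API)."""
       s = dynamics.reset_dynamics()         # reset <=> s_0 ~ p_init(s_0)
       r, o = reward_fn(s), obs_fn(s)        # r_0 = R(s_0), o_0 = O(s_0)
       total = r
       for _ in range(max_steps):
           a = policy(o)
           s = dynamics.step_dynamics(s, a)  # step <=> s_{t+1} ~ p_trans(.|a_t, s_t)
           r, o = reward_fn(s), obs_fn(s)
           total += r
           if s == S_FINAL:                  # done == True <=> terminal state reached
               break
       return total

   total_reward = run_episode(
       ToyDynamics(),
       reward_fn=reward_absorbing,
       obs_fn=lambda s: s,                   # O: fully observed, for simplicity
       policy=lambda o: random.choice(mdp.actions),
   )
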
 The state space :math:`\mathcal{S}` can be considered to be the whole computer
 memory occupied by the environment, which includes the state of the underlying