
Commit a96302b

gasse authored and AntoinePrv committed
Minor text fixes
1 parent 78d0a8b commit a96302b

File tree

docs/discussion/theory.rst

1 file changed: 15 additions & 15 deletions
@@ -8,21 +8,21 @@ an episodic `partially-observable Markov decision process <https://en.wikipedia.
 Markov decision process
 -----------------------
 Consider a regular Markov decision process
-:math:`(\mathcal{S}, \mathcal{A}, p_{init}, p_{trans}, R)`, whose components are
+:math:`(\mathcal{S}, \mathcal{A}, p_\textit{init}, p_\textit{trans}, R)`, whose components are
 
 * a state space :math:`\mathcal{S}`
 * an action space :math:`\mathcal{A}`
-* an initial state distribution :math:`p_{init}: \mathcal{S} \to \mathbb{R}_{\geq 0}`
+* an initial state distribution :math:`p_\textit{init}: \mathcal{S} \to \mathbb{R}_{\geq 0}`
 * a state transition distribution
-  :math:`p_{trans}: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}_{\geq 0}`
+  :math:`p_\textit{trans}: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}_{\geq 0}`
 * a reward function :math:`R: \mathcal{S} \to \mathbb{R}`.
 
 .. note::
 
    The choice of having deterministic rewards :math:`r_t = R(s_t)` is
    arbitrary here, in order to best fit the Ecole API. Note that it is
    not a restrictive choice though, as any MDP with stochastic rewards
-   :math:`r_t \sim p_{reward}(r_t|s_{t-1},a_{t-1},s_{t})`
+   :math:`r_t \sim p_\textit{reward}(r_t|s_{t-1},a_{t-1},s_{t})`
    can be converted into an equivalent MDP with deterministic ones,
    by considering the reward as part of the state.

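As an aside on the note in the hunk above: the claim that stochastic rewards can always be folded into deterministic ones admits a short construction. The following sketch is added for clarity and is not part of the diffed file; :math:`p_\textit{reward}` is the stochastic reward distribution from the note, and the primed symbols are hypothetical.

.. math::

   \mathcal{S}' := \mathcal{S} \times \mathbb{R}, \qquad
   p'_\textit{trans}\big((s_{t+1}, r_{t+1}) \mid a_t, (s_t, r_t)\big)
   := p_\textit{trans}(s_{t+1} \mid a_t, s_t)\,
   p_\textit{reward}(r_{t+1} \mid s_t, a_t, s_{t+1}), \qquad
   R'\big((s_t, r_t)\big) := r_t \text{.}

The augmented MDP :math:`(\mathcal{S}', \mathcal{A}, p'_\textit{init}, p'_\textit{trans}, R')`, with any fixed initial reward such as :math:`r_0 := 0`, then has deterministic rewards while generating the same reward sequence in distribution.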
@@ -42,9 +42,9 @@ that obey the following joint distribution
 
 .. math::
 
-   \tau \sim \underbrace{p_{init}(s_0)}_{\text{initial state}}
+   \tau \sim \underbrace{p_\textit{init}(s_0)}_{\text{initial state}}
    \prod_{t=0}^\infty \underbrace{\pi(a_t | s_t)}_{\text{next action}}
-   \underbrace{p_{trans}(s_{t+1} | a_t, s_t)}_{\text{next state}}
+   \underbrace{p_\textit{trans}(s_{t+1} | a_t, s_t)}_{\text{next state}}
    \text{.}
 
 MDP control problem
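The joint distribution in the hunk above can also be read operationally as a sampling loop. Below is a minimal Python sketch, not taken from the Ecole code base: ``p_init``, ``policy``, ``p_trans``, ``reward_fn`` and ``is_terminal`` are stand-in callables for :math:`p_\textit{init}`, :math:`\pi`, :math:`p_\textit{trans}`, :math:`R` and the terminal-state test.

.. code-block:: python

   def sample_trajectory(p_init, policy, p_trans, reward_fn, is_terminal,
                         max_steps=1000):
       """Draw tau = (s_0, a_0, s_1, a_1, ...) together with r_t = R(s_t)."""
       s = p_init()                       # s_0 ~ p_init(s_0)
       trajectory = [(s, reward_fn(s))]   # rewards are deterministic: r_0 = R(s_0)
       for _ in range(max_steps):
           if is_terminal(s):             # episodic task: stop once s_final is reached
               break
           a = policy(s)                  # a_t ~ pi(a_t | s_t)
           s = p_trans(s, a)              # s_{t+1} ~ p_trans(s_{t+1} | a_t, s_t)
           trajectory.append((s, reward_fn(s)))
       return trajectory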
@@ -68,11 +68,11 @@ where :math:`r_t := R(s_t)`.
 that correspond to continuing tasks. In Ecole we guarantee that all
 environments correspond to **episodic** tasks, that is, each episode is
 guaranteed to start from an initial state :math:`s_0`, and end in a
-terminal state :math:`s_{final}`. For convenience this terminal state can
+terminal state :math:`s_\textit{final}`. For convenience this terminal state can
 be considered as absorbing, i.e.,
-:math:`p_{dyn}(s_{t+1}|a_t,s_t=s_{final}) := \delta_{s_{final}}(s_{t+1})`,
-and associated with a null reward, :math:`R(s_{final}) := 0`, so that all
-future states encountered after :math:`s_{final}` can be safely ignored in
+:math:`p_\textit{trans}(s_{t+1}|a_t,s_t=s_\textit{final}) := \delta_{s_\textit{final}}(s_{t+1})`,
+and associated with a null reward, :math:`R(s_\textit{final}) := 0`, so that all
+future states encountered after :math:`s_\textit{final}` can be safely ignored in
 the MDP control problem.
 
 Partially-observable Markov decision process
@@ -129,19 +129,19 @@ components from the above formulation:
 
 * :py:class:`~ecole.typing.RewardFunction` <=> :math:`R`
 * :py:class:`~ecole.typing.ObservationFunction` <=> :math:`O`
-* :py:meth:`~ecole.typing.Dynamics.reset_dynamics` <=> :math:`p_{init}(s_0)`
-* :py:meth:`~ecole.typing.Dynamics.step_dynamics` <=> :math:`p_{trans}(s_{t+1}|s_t,a_t)`
+* :py:meth:`~ecole.typing.Dynamics.reset_dynamics` <=> :math:`p_\textit{init}(s_0)`
+* :py:meth:`~ecole.typing.Dynamics.step_dynamics` <=> :math:`p_\textit{trans}(s_{t+1}|s_t,a_t)`
 
 The :py:class:`~ecole.environment.EnvironmentComposer` class wraps all of
 those components together to form the PO-MDP. Its API can be interpreted as
 follows:
 
 * :py:meth:`~ecole.environment.EnvironmentComposer.reset` <=>
-  :math:`s_0 \sim p_{init}(s_0), r_0=R(s_0), o_0=O(s_0)`
+  :math:`s_0 \sim p_\textit{init}(s_0), r_0=R(s_0), o_0=O(s_0)`
 * :py:meth:`~ecole.environment.EnvironmentComposer.step` <=>
-  :math:`s_{t+1} \sim p_{trans}(s_{t+1}|a_t,s_t), r_t=R(s_t), o_t=O(s_t)`
+  :math:`s_{t+1} \sim p_\textit{trans}(s_{t+1}|a_t,s_t), r_t=R(s_t), o_t=O(s_t)`
 * ``done == True`` <=> the PO-MDP will now enter the terminal state,
-  :math:`s_{t+1}==s_{final}`. As such, the current episode ends now.
+  :math:`s_{t+1}==s_\textit{final}`. As such, the current episode ends now.
 
 The state space :math:`\mathcal{S}` can be considered to be the whole computer
 memory occupied by the environment, which includes the state of the underlying

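Finally, the reset/step correspondence in the last hunk can be pictured as an episode loop. The sketch below is not part of this commit: the ``ecole.environment.Branching`` class, the instance path, and the ``observation, action_set, reward, done, info`` return signature are assumptions about the surrounding Ecole API, so the actual documentation should be checked for the exact signatures.

.. code-block:: python

   import ecole

   # Hedged sketch only; the environment class and the exact return values of
   # reset()/step() are assumptions, not taken from this diff.
   env = ecole.environment.Branching()   # wraps dynamics, reward and observation functions
   instance = "path/to/instance.lp"      # hypothetical problem instance

   # reset  <=>  s_0 ~ p_init(s_0), r_0 = R(s_0), o_0 = O(s_0)
   observation, action_set, reward, done, info = env.reset(instance)
   while not done:
       action = action_set[0]            # trivial policy pi(a_t | o_t)
       # step  <=>  s_{t+1} ~ p_trans(s_{t+1} | a_t, s_t), r_t = R(s_t), o_t = O(s_t)
       observation, action_set, reward, done, info = env.step(action)
   # done == True  <=>  the next state is s_final; the current episode ends here.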