Commit c2f9a31

gasse authored and AntoinePrv committed
Documentation: theory page
1 parent d98ce88 commit c2f9a31

2 files changed: +177 -0 lines changed

docs/conf.py.in

Lines changed: 14 additions & 0 deletions
@@ -53,6 +53,20 @@ napoleon_google_docstring = False
 napoleon_numpy_docstring = True
 
 
+# LaTeX configuration (for math)
+extensions += ["sphinx.ext.imgmath"]
+imgmath_image_format = "svg"
+imgmath_latex_preamble = r'''
+\DeclareMathOperator*{\argmax}{arg\,max}
+\DeclareMathOperator*{\argmin}{arg\,min}
+\newcommand\indep{\protect\mathpalette{\protect\independenT}{\perp}}
+\def\independenT#1#2{\mathop{\rlap{$#1#2$}\mkern2mu{#1#2}}}
+\newcommand\nindep{\protect\mathpalette{\protect\nindependenT}{\perp}}
+\def\nindependenT#1#2{\mathop{\rlap{$#1#2$}\mkern2mu{\not#1#2}}}
+\newcommand{\overbar}[1]{\mkern 1.5mu\overline{\mkern-1.5mu#1\mkern-1.5mu}\mkern 1.5mu}
+'''
+
+
 # Preprocess docstring to remove "core" from type name
 def preprocess_signature(app, what, name, obj, options, signature, return_annotation):
     if signature is not None:
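
The macros declared in this preamble allow the theory page below to write :math:`\argmax` and :math:`\nindep` inside Sphinx math roles and directives. For example (illustrative only, not part of this commit), a documentation page can then contain

.. math::

    \argmin_{x} f(x), \qquad X \nindep Y \mid Z, \qquad \overbar{A}
    \text{,}

and ``sphinx.ext.imgmath`` renders the formula to an SVG image at build time using this preamble.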

docs/discussion/theory.rst

Lines changed: 163 additions & 0 deletions
@@ -1,2 +1,165 @@

Ecole Theoretical Model
=======================

The ECOLE API and classes directly relate to the different components of
an episodic `partially-observable Markov decision process <https://en.wikipedia.org/wiki/Partially_observable_Markov_decision_process>`_
(PO-MDP).

Markov decision process
-----------------------
Consider a regular Markov decision process
:math:`(\mathcal{S}, \mathcal{A}, p_{init}, p_{trans}, R)`, whose components are

* a state space :math:`\mathcal{S}`
* an action space :math:`\mathcal{A}`
* an initial state distribution :math:`p_{init}: \mathcal{S} \to \mathbb{R}_{\geq 0}`
* a state transition distribution
  :math:`p_{trans}: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}_{\geq 0}`
* a reward function :math:`R: \mathcal{S} \to \mathbb{R}`.

.. note::

    The choice of deterministic rewards :math:`r_t = R(s_t)` is arbitrary
    here, made to best fit the ECOLE API. It is not a restrictive choice,
    though: any MDP with stochastic rewards
    :math:`r_t \sim p_{reward}(r_t|s_{t-1},a_{t-1},s_{t})`
    can be converted into an equivalent MDP with deterministic rewards,
    by considering the reward as part of the state.

Together with an action policy

.. math::

    \pi: \mathcal{A} \times \mathcal{S} \to \mathbb{R}_{\geq 0}

an MDP can be unrolled to produce state-action trajectories

.. math::

    \tau=(s_0,a_0,s_1,\dots)

that obey the following joint distribution

.. math::

    \tau \sim \underbrace{p_{init}(s_0)}_{\text{initial state}}
    \prod_{t=0}^\infty \underbrace{\pi(a_t | s_t)}_{\text{next action}}
    \underbrace{p_{trans}(s_{t+1} | a_t, s_t)}_{\text{next state}}
    \text{.}
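
For intuition, this generative process can be sketched in a few lines of Python. The snippet below is purely illustrative and not part of the ECOLE API: ``p_init``, ``p_trans``, ``policy`` and ``reward_function`` are hypothetical callables standing in for :math:`p_{init}`, :math:`p_{trans}`, :math:`\pi` and :math:`R`.

.. code-block:: python

    def unroll(p_init, p_trans, policy, reward_function, horizon):
        """Sample one trajectory tau = (s_0, a_0, s_1, ...) and its rewards."""
        trajectory, rewards = [], []
        state = p_init()                             # s_0 ~ p_init(s_0)
        for _ in range(horizon):
            rewards.append(reward_function(state))   # r_t = R(s_t)
            action = policy(state)                   # a_t ~ pi(a_t | s_t)
            trajectory.append((state, action))
            state = p_trans(state, action)           # s_{t+1} ~ p_trans(s_{t+1} | a_t, s_t)
        return trajectory, rewards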
MDP control problem
^^^^^^^^^^^^^^^^^^^
We define the MDP control problem as that of finding a policy
:math:`\pi^\star` which is optimal with respect to the expected total
reward,

.. math::
    :label: mdp_control

    \pi^\star = \argmax_{\pi} \lim_{T \to \infty}
    \mathbb{E}_\tau\left[\sum_{t=0}^{T} r_t\right]
    \text{,}

where :math:`r_t := R(s_t)`.

.. note::

    In the general case this quantity may not be bounded, for example for MDPs
    that correspond to continuing tasks. In ECOLE we guarantee that all
    environments correspond to **episodic** tasks, that is, each episode is
    guaranteed to start from an initial state :math:`s_0` and to end in a
    terminal state :math:`s_{final}`. For convenience this terminal state can
    be considered absorbing, i.e.,
    :math:`p_{trans}(s_{t+1}|a_t,s_t=s_{final}) := \delta_{s_{final}}(s_{t+1})`,
    and associated with a null reward, :math:`R(s_{final}) := 0`, so that all
    future states encountered after :math:`s_{final}` can be safely ignored in
    the MDP control problem.
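
Continuing the illustrative ``unroll`` sketch above (a hypothetical helper, not part of the ECOLE API), the quantity inside the expectation of :eq:`mdp_control` is simply the sum of the rewards collected along one sampled trajectory:

.. code-block:: python

    # One Monte Carlo sample of the total reward maximized in expectation by pi*;
    # the finite horizon truncates the limit T -> infinity of the objective.
    trajectory, rewards = unroll(p_init, p_trans, policy, reward_function, horizon=100)
    total_reward = sum(rewards)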
Partially-observable Markov decision process
--------------------------------------------
In the PO-MDP setting, complete information about the current MDP state
is not necessarily available to the decision-maker. Instead,
at each step only a partial observation :math:`o \in \Omega`
is made available, which can be seen as the result of applying an observation
function :math:`O: \mathcal{S} \to \Omega` to the current state. As a result,
PO-MDP trajectories take the form

.. math::

    \tau=(o_0,r_0,a_0,o_1,\dots)
    \text{,}

where :math:`o_t:= O(s_t)` and :math:`r_t:=R(s_t)` are respectively the
observation and the reward collected at time step :math:`t`. Due to the
non-Markovian nature of those trajectories, that is,

.. math::

    o_{t+1},r_{t+1} \nindep o_0,r_0,a_0,\dots,o_{t-1},r_{t-1},a_{t-1} \mid o_t,r_t,a_t
    \text{,}

the decision-maker must take into account the whole history of past
observations, rewards and actions in order to decide on an optimal action
at the current time step :math:`t`. The PO-MDP policy then takes the form

.. math::

    \pi:\mathcal{A} \times \mathcal{H} \to \mathbb{R}_{\geq 0}
    \text{,}

where :math:`h_t:=(o_0,r_0,a_0,\dots,o_t,r_t)\in\mathcal{H}` represents the
PO-MDP history at time step :math:`t`, so that :math:`a_t \sim \pi(a_t|h_t)`.
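
As a purely illustrative sketch (not part of the ECOLE API), the bookkeeping behind a PO-MDP policy can be pictured as follows, where ``policy`` is a hypothetical callable over histories:

.. code-block:: python

    # Illustrative only: a PO-MDP policy conditions on the whole history h_t,
    # not on the (hidden) Markov state s_t.
    def act(policy, history, observation, reward):
        history += [observation, reward]   # extend h_{t-1} into h_t = (..., o_t, r_t)
        action = policy(history)           # a_t ~ pi(a_t | h_t)
        history += [action]                # a_t becomes part of the next history
        return action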
PO-MDP control problem
^^^^^^^^^^^^^^^^^^^^^^
The PO-MDP control problem can then be written identically to the MDP one,

.. math::
    :label: pomdp_control

    \pi^\star = \argmax_{\pi} \lim_{T \to \infty}
    \mathbb{E}_\tau\left[\sum_{t=0}^{T} r_t\right]
    \text{.}

ECOLE as PO-MDP components
--------------------------

The following ECOLE components can be directly translated into PO-MDP
components from the above formulation:

* :py:class:`~ecole.typing.RewardFunction` <=> :math:`R`
* :py:class:`~ecole.typing.ObservationFunction` <=> :math:`O`
* :py:meth:`~ecole.typing.Dynamics.reset_dynamics` <=> :math:`p_{init}(s_0)`
* :py:meth:`~ecole.typing.Dynamics.step_dynamics` <=> :math:`p_{trans}(s_{t+1}|s_t,a_t)`

The :py:class:`~ecole.environment.EnvironmentComposer` class wraps all of
those components together to form the PO-MDP. Its API can be interpreted as
follows (see also the sketch below):

* :py:meth:`~ecole.environment.EnvironmentComposer.reset` <=>
  :math:`s_0 \sim p_{init}(s_0), r_0=R(s_0), o_0=O(s_0)`
* :py:meth:`~ecole.environment.EnvironmentComposer.step` <=>
  :math:`s_{t+1} \sim p_{trans}(s_{t+1}|a_t,s_t), r_{t+1}=R(s_{t+1}), o_{t+1}=O(s_{t+1})`
* ``done == True`` <=> the PO-MDP enters the terminal state
  :math:`s_{t+1}==s_{final}`, and the current episode ends.

The state space :math:`\mathcal{S}` can be considered to be the whole computer
memory occupied by the environment, which includes the state of the underlying
SCIP solver instance. The action space :math:`\mathcal{A}` is specific to each
environment.
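
Concretely, these components are exercised through the usual reset/step loop. The sketch below is illustrative only: ``Branching`` stands in for a concrete environment class, ``problem.lp`` for a problem instance, the policy simply picks the first valid action, and the exact return values of ``reset`` and ``step`` should be checked against the API reference rather than read off this sketch.

.. code-block:: python

    import ecole

    # Hypothetical concrete environment; the action space is environment-specific.
    env = ecole.environment.Branching()

    # reset: s_0 ~ p_init(s_0), o_0 = O(s_0), r_0 = R(s_0)
    observation, action_set, done = env.reset("problem.lp")
    while not done:
        action = action_set[0]  # a trivial stand-in for a_t ~ pi(a_t | h_t)
        # step: s_{t+1} ~ p_trans(. | a_t, s_t), o_{t+1} = O(s_{t+1}), r_{t+1} = R(s_{t+1})
        observation, action_set, reward, done, info = env.step(action)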
.. note::

    We allow the environment to specify a set of valid actions at each time
    step :math:`t`. The ``action_set`` value returned by
    :py:meth:`~ecole.environment.EnvironmentComposer.reset` and
    :py:meth:`~ecole.environment.EnvironmentComposer.step` serves this purpose,
    and can be left to ``None`` when the action set is implicit.


.. note::

    As can be seen from :eq:`pomdp_control`, the initial reward :math:`r_0`
    returned by :py:meth:`~ecole.environment.EnvironmentComposer.reset`
    does not affect the control problem. In ECOLE we nevertheless chose to
    preserve this initial reward, in order to obtain meaningful cumulative
    episode rewards (e.g., total running time).
