3 changes: 3 additions & 0 deletions docs/conf.py.in
@@ -11,6 +11,9 @@ extensions = [
"sphinx.ext.viewcode",
]

# Math setting
extensions += ["sphinx.ext.mathjax"]

# Code style
pygments_style = "monokai"

163 changes: 163 additions & 0 deletions docs/discussion/theory.rst
@@ -1,2 +1,165 @@
Ecole Theoretical Model
=======================

The Ecole API and classes directly relate to the different components of
an episodic `partially-observable Markov decision process <https://en.wikipedia.org/wiki/Partially_observable_Markov_decision_process>`_
(PO-MDP).

Markov decision process
-----------------------
Consider a regular Markov decision process
:math:`(\mathcal{S}, \mathcal{A}, p_{init}, p_{trans}, R)`, whose components are

* a state space :math:`\mathcal{S}`
* an action space :math:`\mathcal{A}`
* an initial state distribution :math:`p_{init}: \mathcal{S} \to \mathbb{R}_{\geq 0}`
* a state transition distribution
  :math:`p_{trans}: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}_{\geq 0}`
* a reward function :math:`R: \mathcal{S} \to \mathbb{R}`.

.. note::

    The choice of deterministic rewards :math:`r_t = R(s_t)` is arbitrary
    here, in order to best fit the Ecole API. It is not a restrictive choice,
    though, as any MDP with stochastic rewards
    :math:`r_t \sim p_{reward}(r_t|s_{t-1},a_{t-1},s_{t})`
    can be converted into an equivalent MDP with deterministic ones,
    by considering the reward as part of the state.
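
    Concretely, one standard construction (shown here purely as an
    illustration, not taken from the Ecole documentation) augments the state
    with the reward just collected,

    .. math::

        \tilde{s}_t := (s_t, r_t), \qquad
        \tilde{p}_{trans}(\tilde{s}_{t+1}|a_t,\tilde{s}_t) :=
        p_{trans}(s_{t+1}|a_t,s_t) \,
        p_{reward}(r_{t+1}|s_t,a_t,s_{t+1}), \qquad
        \tilde{R}(\tilde{s}_t) := r_t
        \text{,}

    so that the reward of the augmented MDP is a deterministic function of
    its state.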

Together with an action policy

.. math::

    \pi: \mathcal{A} \times \mathcal{S} \to \mathbb{R}_{\geq 0}

an MDP can be unrolled to produce state-action trajectories

.. math::

    \tau=(s_0,a_0,s_1,\dots)

that obey the following joint distribution

.. math::

    \tau \sim \underbrace{p_{init}(s_0)}_{\text{initial state}}
    \prod_{t=0}^\infty \underbrace{\pi(a_t | s_t)}_{\text{next action}}
    \underbrace{p_{trans}(s_{t+1} | a_t, s_t)}_{\text{next state}}
    \text{.}
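
To make the unrolling above concrete, the following sketch samples a finite
prefix of a trajectory from a hand-written two-state MDP under a uniformly
random policy. It is purely illustrative: none of the names below belong to
the Ecole API.

.. code-block:: python

    import random

    STATES = [0, 1]
    ACTIONS = ["stay", "switch"]

    def sample_initial_state():
        """Sample s_0 ~ p_init."""
        return random.choice(STATES)

    def sample_next_state(state, action):
        """Sample s_{t+1} ~ p_trans(. | a_t, s_t)."""
        return 1 - state if action == "switch" else state

    def reward(state):
        """Deterministic reward r_t = R(s_t)."""
        return 1.0 if state == 1 else 0.0

    def sample_action(state):
        """Sample a_t ~ pi(. | s_t); here, a uniformly random policy."""
        return random.choice(ACTIONS)

    def unroll(horizon=10):
        """Produce a truncated trajectory tau = (s_0, a_0, s_1, ...)."""
        state = sample_initial_state()
        trajectory = []
        for _ in range(horizon):
            action = sample_action(state)
            trajectory.append((state, reward(state), action))
            state = sample_next_state(state, action)
        return trajectory

    print(unroll())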

MDP control problem
^^^^^^^^^^^^^^^^^^^
We define the MDP control problem as that of finding a policy
:math:`\pi^\star` which is optimal with respect to the expected total
reward,

.. math::
    :label: mdp_control

    \pi^\star = \underset{\pi}{\operatorname{arg\,max}}
    \lim_{T \to \infty} \mathbb{E}_\tau\left[\sum_{t=0}^{T} r_t\right]
    \text{,}

where :math:`r_t := R(s_t)`.

.. note::

    In the general case this quantity may not be bounded, for example for MDPs
    that correspond to continuing tasks. In Ecole we guarantee that all
    environments correspond to **episodic** tasks, that is, each episode is
    guaranteed to start from an initial state :math:`s_0` and to end in a
    terminal state :math:`s_{final}`. For convenience this terminal state can
    be considered absorbing, i.e.,
    :math:`p_{trans}(s_{t+1}|a_t,s_t=s_{final}) := \delta_{s_{final}}(s_{t+1})`,
    and associated with a null reward, :math:`R(s_{final}) := 0`, so that all
    future states encountered after :math:`s_{final}` can be safely ignored in
    the MDP control problem.
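
Because every Ecole episode terminates, the limit in :eq:`mdp_control`
reduces to a finite sum per episode, and the objective can be approximated by
averaging episode returns. The sketch below is purely illustrative and uses a
hand-written toy MDP rather than the Ecole API.

.. code-block:: python

    import random
    import statistics

    def episode_return(policy):
        """Collect sum_t r_t over one episode of a toy episodic MDP."""
        state = 0            # s_0 ~ p_init (deterministic here)
        total = 0.0
        while state != -1:   # -1 plays the role of the terminal state s_final
            total += 1.0 if state == 1 else 0.0  # r_t = R(s_t), R(s_final) = 0
            action = policy(state)               # a_t ~ pi(. | s_t)
            # s_{t+1} ~ p_trans(. | a_t, s_t): "stop" leads to the terminal state
            state = -1 if action == "stop" else 1 - state
        return total

    def estimate_objective(policy, episodes=1000):
        """Monte Carlo estimate of E_tau[sum_t r_t] under the given policy."""
        return statistics.mean(episode_return(policy) for _ in range(episodes))

    def cautious(state):
        return "stop"                              # terminate immediately

    def explorer(state):
        return random.choice(["stop", "switch"])   # terminate eventually

    print(estimate_objective(cautious), estimate_objective(explorer))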

Partially-observable Markov decision process
--------------------------------------------
In the PO-MDP setting, complete information about the current MDP state
is not necessarily available to the decision-maker. Instead,
at each step only a partial observation :math:`o \in \Omega`
is made available, which can be seen as the result of applying an observation
function :math:`O: \mathcal{S} \to \Omega` to the current state. As a result,
PO-MDP trajectories take the form

.. math::

    \tau=(o_0,r_0,a_0,o_1,\dots)
    \text{,}

where :math:`o_t:= O(s_t)` and :math:`r_t:=R(s_t)` are respectively the
observation and the reward collected at time step :math:`t`. Due to the
non-Markovian nature of those trajectories, that is,

.. math::

    o_{t+1},r_{t+1} \mathop{\rlap{\perp}\mkern2mu{\not\perp}} o_0,r_0,a_0,\dots,o_{t-1},r_{t-1},a_{t-1} \mid o_t,r_t,a_t
    \text{,}

the decision-maker must take into account the whole history of past
observations, rewards and actions in order to decide on an optimal action
at the current time step :math:`t`. The PO-MDP policy then takes the form

.. math::

    \pi:\mathcal{A} \times \mathcal{H} \to \mathbb{R}_{\geq 0}
    \text{,}

where :math:`h_t:=(o_0,r_0,a_0,\dots,o_t,r_t)\in\mathcal{H}` represents the
PO-MDP history at time step :math:`t`, so that :math:`a_t \sim \pi(a_t|h_t)`.
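
In code, this simply means that the agent keeps the full history and feeds it
to the policy at every step. The sketch below is purely illustrative and does
not use the Ecole API.

.. code-block:: python

    import random

    def policy(history):
        """A toy pi(a_t | h_t): repeat the previous action half of the time."""
        past_actions = [a for (_, _, a) in history if a is not None]
        if past_actions and random.random() < 0.5:
            return past_actions[-1]
        return random.choice(["stay", "switch"])

    history = []                    # h_t = (o_0, r_0, a_0, ..., o_t, r_t)
    observation, reward = 0, 0.0    # toy stand-ins for o_0 = O(s_0), r_0 = R(s_0)
    for _ in range(5):
        history.append((observation, reward, None))   # record (o_t, r_t)
        action = policy(history)                      # a_t ~ pi(a_t | h_t)
        history[-1] = (observation, reward, action)   # attach a_t to the history
        # the environment would now return o_{t+1} and r_{t+1}; toy stand-ins:
        observation, reward = observation + 1, 1.0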

PO-MDP control problem
^^^^^^^^^^^^^^^^^^^^^^
The PO-MDP control problem can then be written identically to the MDP one,

.. math::
    :label: pomdp_control

    \pi^\star = \underset{\pi}{\operatorname{arg\,max}} \lim_{T \to \infty}
    \mathbb{E}_\tau\left[\sum_{t=0}^{T} r_t\right]
    \text{.}

Ecole as PO-MDP components
--------------------------

The following Ecole components can be directly translated into PO-MDP
components from the above formulation:

* :py:class:`~ecole.typing.RewardFunction` <=> :math:`R`
* :py:class:`~ecole.typing.ObservationFunction` <=> :math:`O`
* :py:meth:`~ecole.typing.Dynamics.reset_dynamics` <=> :math:`p_{init}(s_0)`
* :py:meth:`~ecole.typing.Dynamics.step_dynamics` <=> :math:`p_{trans}(s_{t+1}|s_t,a_t)`

The :py:class:`~ecole.environment.EnvironmentComposer` class wraps all of
those components together to form the PO-MDP. Its API can be interpreted as
follows (a usage sketch follows the list):

* :py:meth:`~ecole.environment.EnvironmentComposer.reset` <=>
  :math:`s_0 \sim p_{init}(s_0), r_0=R(s_0), o_0=O(s_0)`
* :py:meth:`~ecole.environment.EnvironmentComposer.step` <=>
  :math:`s_{t+1} \sim p_{trans}(s_{t+1}|a_t,s_t), r_{t+1}=R(s_{t+1}), o_{t+1}=O(s_{t+1})`
* ``done == True`` <=> the PO-MDP will now enter the terminal state, i.e.,
  :math:`s_{t+1}=s_{final}`. As such, the current episode ends now.
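
The sketch below illustrates how these calls could drive one episode. The
concrete environment class, the problem instance path, and the exact return
values are assumptions made for illustration only; refer to the environment
documentation for the authoritative signatures.

.. code-block:: python

    import ecole

    # Hypothetical concrete environment and instance path, for illustration.
    # The tuples returned by reset and step may differ between Ecole versions.
    env = ecole.environment.Branching()
    instance = "path/to/instance.lp"

    # reset  <=>  s_0 ~ p_init(s_0), r_0 = R(s_0), o_0 = O(s_0)
    observation, action_set, reward, done = env.reset(instance)
    cumulative_reward = reward  # r_0 is preserved (see the last note below)

    while not done:
        # a_t ~ pi(a_t | h_t); here, simply pick the first valid action
        action = action_set[0]
        # step  <=>  s_{t+1} ~ p_trans(s_{t+1}|a_t,s_t), r_{t+1} = R(s_{t+1}),
        #            o_{t+1} = O(s_{t+1})
        observation, action_set, reward, done = env.step(action)
        cumulative_reward += reward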

The state space :math:`\mathcal{S}` can be considered to be the whole computer
memory occupied by the environment, which includes the state of the underlying
SCIP solver instance. The action space :math:`\mathcal{A}` is specific to each
environment.

.. note::
    We allow the environment to specify a set of valid actions at each time
    step :math:`t`. The ``action_set`` value returned by
    :py:meth:`~ecole.environment.EnvironmentComposer.reset` and
    :py:meth:`~ecole.environment.EnvironmentComposer.step` serves this purpose,
    and can be ``None`` when the action set is implicit.


.. note::

    As can be seen from :eq:`pomdp_control`, the initial reward :math:`r_0`
    returned by :py:meth:`~ecole.environment.EnvironmentComposer.reset`
    does not depend on the policy and therefore does not affect the control
    problem. In Ecole we nevertheless chose to preserve this initial reward,
    in order to obtain meaningful cumulative episode rewards (e.g., total
    running time).
49 changes: 49 additions & 0 deletions docs/static/css/custom.css
@@ -120,3 +120,52 @@
.highlight .k {
color: #77D1F6 !important;
}

/* CSS to fix Mathjax equation numbers displaying above.
*
* Credit to @hagenw https://github.com/readthedocs/sphinx_rtd_theme/pull/383
*/
div.math {
position: relative;
padding-right: 2.5em;
}
.eqno {
height: 100%;
position: absolute;
right: 0;
padding-left: 5px;
padding-bottom: 5px;
/* Fix for mouse over in Firefox */
padding-right: 1px;
}
.eqno:before {
/* Force vertical alignment of number */
display: inline-block;
height: 100%;
vertical-align: middle;
content: "";
}
.eqno .headerlink {
display: none;
visibility: hidden;
font-size: 14px;
padding-left: .3em;
}
.eqno:hover .headerlink {
display: inline-block;
visibility: hidden;
margin-right: -1.05em;
}
.eqno .headerlink:after {
visibility: visible;
content: "\f0c1";
font-family: FontAwesome;
display: inline-block;
margin-left: -.9em;
}
/* Make responsive */
.MathJax_Display {
max-width: 100%;
overflow-x: auto;
overflow-y: hidden;
}