DOC Revamp create custom datafits and penalties (#116)

PABannier · Badr-MOUFAD · web-flow · commit 2cef37c9bda5 · 2022-11-02T20:18:09.000+01:00
Co-authored-by: Pierre-Antoine Bannier <pierreantoine.bannier@gmail.com>
Co-authored-by: Badr MOUFAD <Badr.MOUFAD@emines.um6p.ma>
diff --git a/doc/add.rst b/doc/add.rst
diff --git a/doc/add_datafit.rst b/doc/add_datafit.rst
@@ -0,0 +1,121 @@
+:orphan:
+
+How to add a custom datafit
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. _how:
+
+Motivated by generalized linear models but not limited to them, ``skglm`` solves problems of the form
+
+.. math::
+      \hat{\beta} \in
+      \arg\min_{\beta \in \mathbb{R}^p}
+      F(X\beta) + \Omega(\beta)
+      := \sum_{i=1}^n f_i([X\beta]_i) + \sum_{j=1}^p \Omega_j(\beta_j)
+      \enspace .
+
+
+Here, :math:`X \in \mathbb{R}^{n \times p}` denotes the design matrix with :math:`n` samples and :math:`p` features,
+and :math:`\beta \in \mathbb{R}^p` is the coefficient vector.
+
+skglm can solve any problem of this form with an arbitrary smooth datafit :math:`F` and an arbitrary penalty :math:`\Omega` whose proximal operator can be evaluated explicitly, by defining two classes: a ``Penalty`` and a ``Datafit``.
+
+They can then be passed to a :class:`~skglm.GeneralizedLinearEstimator`.
+
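+.. code-block:: python
+
+   from skglm import GeneralizedLinearEstimator
+
+   # MyDatafit and MyPenalty stand for your own custom classes
+   clf = GeneralizedLinearEstimator(
+       MyDatafit(),
+       MyPenalty(),
+   )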
+
+
+A ``Datafit`` is a jitclass which must inherit from the ``BaseDatafit`` class:
+
+.. literalinclude:: ../skglm/datafits/base.py
+   :pyobject: BaseDatafit
+
+
+To define a custom datafit, you need to implement the methods declared in the ``BaseDatafit`` class.
+At a minimum, the ``value`` and ``gradient`` methods must be overridden for skglm to support the datafit.
+Optionally, overriding the methods with the suffix ``_sparse`` adds support for sparse datasets (CSC matrices).
+As an example, we show how to implement the Poisson datafit in skglm.
+
+
+A case in point: defining Poisson datafit
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+First, this requires deriving some quantities used by the solvers like the gradient or the Hessian matrix of the datafit.
+With :math:`y \in \mathbb{R}^n` the target vector, the Poisson datafit reads
+
+.. math::
+    f(X\beta) = \frac{1}{n}\sum_{i=1}^n \left(\exp([X\beta]_i) - y_i[X\beta]_i\right)
+    \enspace .
+
+
+Let's define some useful quantities to simplify our computations. For :math:`z \in \mathbb{R}^n` and :math:`\beta \in \mathbb{R}^p`,
+
+.. math::
+   f(z) = \sum_{i=1}^n f_i(z_i)  \qquad  F(\beta) = f(X\beta)
+   \enspace .
+
+
+Computing the gradient of :math:`F` and its Hessian matrix yields
+
+.. math::
+   \nabla F(\beta) = X^{\top} \underbrace{\nabla f(X\beta)}_{\textrm{raw grad}} \qquad \nabla^2 F(\beta) = X^{\top} \underbrace{\nabla^2 f(X\beta)}_{\textrm{raw hessian}} X
+   \enspace .
+
+
+Besides, it directly follows that
+
+.. math::
+   \nabla f(z) = (f_i'(z_i))_{1 \leq i \leq n} \qquad \nabla^2 f(z) = \textrm{diag}(f_i''(z_i))_{1 \leq i \leq n}
+   \enspace .
+
+
+We can now apply these definitions to the Poisson datafit:
+
+.. math::
+    f_i(z_i) = \frac{1}{n} \left(\exp(z_i) - y_iz_i\right)
+    \enspace .
+
+
+Therefore,
+
+.. math::
+   f_i'(z_i) = \frac{1}{n}(\exp(z_i) - y_i) \qquad f_i''(z_i) = \frac{1}{n}\exp(z_i)
+   \enspace .
+
+
+Computing ``raw_grad`` and ``raw_hessian`` for the Poisson datafit yields
+
+.. math::
+   \nabla f(X\beta) = \frac{1}{n}(\exp([X\beta]_i) - y_i)_{1 \leq i \leq n} \qquad \nabla^2 f(X\beta) = \frac{1}{n}\textrm{diag}(\exp([X\beta]_i))_{1 \leq i \leq n}
+   \enspace .
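+
+As a rough, dependency-free sketch of these two formulas (plain Python lists rather than the NumPy arrays and jitclass used in skglm; the function names and toy data are purely illustrative), the first entry of the raw gradient can be checked against a finite difference of the datafit value:
+
```python
import math


def poisson_value(y, Xw):
    """Datafit value f(z) = (1/n) * sum_i (exp(z_i) - y_i * z_i), with z = Xw."""
    n = len(y)
    return sum(math.exp(z_i) - y_i * z_i for z_i, y_i in zip(Xw, y)) / n


def poisson_raw_grad(y, Xw):
    """Gradient of f at z = Xw: entry i is (exp(z_i) - y_i) / n."""
    n = len(y)
    return [(math.exp(z_i) - y_i) / n for z_i, y_i in zip(Xw, y)]


def poisson_raw_hessian(y, Xw):
    """Diagonal of the Hessian of f at z = Xw: entry i is exp(z_i) / n."""
    n = len(y)
    return [math.exp(z_i) / n for z_i in Xw]


# Toy data (assumed values, for illustration only)
y = [1.0, 0.0, 3.0]
Xw = [0.2, -0.5, 1.0]

# Forward finite difference of f with respect to z_0
eps = 1e-6
Xw_shifted = [Xw[0] + eps] + Xw[1:]
fd = (poisson_value(y, Xw_shifted) - poisson_value(y, Xw)) / eps

grad = poisson_raw_grad(y, Xw)
hess_diag = poisson_raw_hessian(y, Xw)
```
+
+Here ``fd`` and ``grad[0]`` should agree up to the finite-difference error, and every entry of ``hess_diag`` is positive since the datafit is convex.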
+
+
+Both ``raw_grad`` and ``raw_hessian`` are methods used by the ``ProxNewton`` solver.
+Other solvers require different methods to be implemented. For instance, ``AndersonCD`` uses the ``gradient_scalar`` method,
+which computes the derivative of the datafit with respect to the :math:`j`-th coordinate of :math:`\beta`.
+
+For the Poisson datafit, this yields
+
+.. math::
+    \frac{\partial F(\beta)}{\partial \beta_j} = \frac{1}{n}
+      \sum_{i=1}^n X_{i,j} \left(
+         \exp([X\beta]_i) - y_i
+      \right)
+      \enspace .
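+
+The partial derivative above can be sketched in plain Python as follows. This is only an illustration under simplifying assumptions (nested lists instead of NumPy arrays, hypothetical helper names ``matvec`` and ``poisson_objective``, toy data), not the skglm implementation; it compares the formula with a finite difference of the objective:
+
```python
import math


def matvec(X, beta):
    """Compute X @ beta for a matrix stored as a list of rows."""
    return [sum(x_ij * b_j for x_ij, b_j in zip(row, beta)) for row in X]


def poisson_objective(X, y, beta):
    """F(beta) = (1/n) * sum_i (exp([X beta]_i) - y_i * [X beta]_i)."""
    Xw = matvec(X, beta)
    return sum(math.exp(z) - y_i * z for z, y_i in zip(Xw, y)) / len(y)


def poisson_gradient_scalar(X, y, Xw, j):
    """Derivative of F with respect to beta_j, given the precomputed Xw = X @ beta."""
    n = len(y)
    return sum(X[i][j] * (math.exp(Xw[i]) - y[i]) for i in range(n)) / n


# Toy data (assumed values, for illustration only)
X = [[1.0, 2.0], [0.5, -1.0], [0.0, 1.5]]
y = [2.0, 0.0, 1.0]
beta = [0.3, -0.2]
j = 1

grad_j = poisson_gradient_scalar(X, y, matvec(X, beta), j)

# Forward finite difference along coordinate j
eps = 1e-6
beta_shifted = list(beta)
beta_shifted[j] += eps
fd = (poisson_objective(X, y, beta_shifted) - poisson_objective(X, y, beta)) / eps
```
+
+Passing the precomputed product ``Xw`` avoids recomputing :math:`X\beta` for every coordinate, which is what makes coordinate-wise updates cheap.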
+
+
+Implementing these quantities in the ``Poisson`` datafit class gives:
+
+.. literalinclude:: ../skglm/datafits/single_task.py
+   :pyobject: Poisson
+
+
+Note that we have not initialized any quantities in the ``initialize`` method.
+Usually, it serves to compute a Lipschitz constant of the gradient of the datafit, whose inverse is used by the solver as a step size.
+However, in this example, the gradient of the Poisson datafit is not Lipschitz continuous, since the eigenvalues of the Hessian matrix are unbounded.
+This implies that a step size is not known in advance and a line search has to be performed at every epoch by the solver.
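+
+To illustrate what such a line search does, here is a generic backtracking (Armijo) sketch on a one-sample, one-dimensional Poisson datafit. It is not the line search implemented in skglm's solvers; the function ``backtracking_step`` and its parameters ``shrink`` and ``c`` are assumptions made for this example:
+
```python
import math


def backtracking_step(f, grad_f, x, step=1.0, shrink=0.5, c=1e-4, max_iter=50):
    """Take one gradient step from x, shrinking the step until f decreases enough."""
    fx, g = f(x), grad_f(x)
    for _ in range(max_iter):
        x_new = x - step * g
        # Armijo sufficient-decrease condition
        if f(x_new) <= fx - c * step * g * g:
            return x_new
        step *= shrink
    return x  # no acceptable step found: stay in place


# One-sample Poisson datafit in 1D: f(z) = exp(z) - y * z, minimized at z = log(y)
y = 2.0


def f(z):
    return math.exp(z) - y * z


def grad_f(z):
    return math.exp(z) - y


z = 3.0
for _ in range(100):
    z = backtracking_step(f, grad_f, z)
```
+
+Starting from ``z = 3.0``, a unit step would overshoot badly because the curvature grows exponentially; the step is shrunk until the objective decreases, and the iterates approach ``math.log(y)``.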
diff --git a/doc/add_penalty.rst b/doc/add_penalty.rst
@@ -0,0 +1,72 @@
+:orphan:
+
+How to add a custom penalty
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. _how:
+
+skglm supports any proximable penalty.
+
+
+A penalty is implemented as a jitclass which must inherit from the ``BasePenalty`` class:
+
+.. literalinclude:: ../skglm/penalties/base.py
+   :pyobject: BasePenalty
+
+To implement your own penalty, you only need to define a new jitclass inheriting from ``BasePenalty`` and specify how its value, proximal operator, distance to subdifferential (for KKT violation) and penalized features are computed.
+
+A case in point: defining L1 penalty
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+We detail how the :math:`\ell_1` penalty is implemented in skglm.
+For a vector :math:`\beta \in \mathbb{R}^p`, the :math:`\ell_1` penalty is defined as follows:
+
+.. math::
+   \lVert \beta \rVert_1 = \sum_{j=1}^p |\beta_j| \enspace .
+
+
+The regularization level is controlled by the hyperparameter :math:`\lambda \in \mathbb{R}^+`, which is defined and initialized in the constructor of the class.
+
+The method ``get_spec`` strongly types the attributes of the penalty object, which allows Numba to JIT-compile the class.
+It should return an iterable of tuples, the first element being the name of the attribute, the second its Numba type (e.g. ``float64``, ``bool_``).
+Additionally, a penalty should implement ``params_to_dict``, a helper method that returns all the parameters of the penalty in a dictionary.
+
+To optimize an objective with a given penalty, skglm needs at least the proximal operator of the penalty applied to the :math:`j`-th coordinate.
+For the ``L1`` penalty, it is the well-known soft-thresholding operator:
+
+.. math::
+    \textrm{ST}(\beta , \lambda) = \mathrm{max}(0, \lvert \beta \rvert - \lambda) \mathrm{sgn}(\beta) \enspace .
+
+
+Note that skglm expects the threshold level to be the regularization hyperparameter :math:`\lambda \in \mathbb{R}^+` **scaled by** the stepsize.
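+
+A minimal sketch of this convention, written as standalone plain-Python functions (the names ``soft_threshold`` and ``l1_prox_1d`` are illustrative and are not claimed to match skglm's internals):
+
```python
def soft_threshold(x, threshold):
    """ST(x, threshold) = max(0, |x| - threshold) * sgn(x)."""
    if x > threshold:
        return x - threshold
    if x < -threshold:
        return x + threshold
    return 0.0


def l1_prox_1d(alpha, value, stepsize):
    """Proximal operator of stepsize * alpha * |.| evaluated at value.

    The threshold is the regularization level multiplied by the step size.
    """
    return soft_threshold(value, alpha * stepsize)


shrunk = l1_prox_1d(alpha=1.0, value=3.0, stepsize=0.5)
zeroed = l1_prox_1d(alpha=1.0, value=-0.2, stepsize=0.5)
```
+
+With ``alpha=1.0`` and ``stepsize=0.5`` the threshold is ``0.5``: the value ``3.0`` is shrunk towards zero by that amount, while ``-0.2`` falls inside the threshold and is set to zero.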
+
+
+Besides, by default all solvers in skglm have ``ws_strategy`` set to ``subdiff``.
+This means that the optimality conditions (and thus the stopping criterion) are computed using the method ``subdiff_distance`` of the penalty.
+If the penalty does not implement this method, the user should set ``ws_strategy`` to ``fixpoint``.
+
+For the :math:`\ell_1` penalty, the distance of the negative gradient of the datafit :math:`F` to the subdifferential of the penalty reads
+
+.. math::
+   \mathrm{dist}(-\nabla_j F(\beta), \lambda \partial |\beta_j|) = \begin{cases}
+        \mathrm{max}(0, \lvert -\nabla_j F(\beta) \rvert - \lambda) & \textrm{if } \beta_j = 0, \\
+        \lvert -\nabla_j F(\beta) - \lambda \mathrm{sgn}(\beta_j) \rvert & \textrm{otherwise}
+    \end{cases}
+   \enspace .
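+
+For a single coordinate, this distance can be sketched as below. The first branch applies when the coefficient is zero (the subdifferential is then the interval :math:`[-\lambda, \lambda]`), the second when it is nonzero (the subdifferential is the single point :math:`\lambda \, \mathrm{sgn}(\beta_j)`). The function name and its scalar signature are assumptions made for illustration:
+
```python
def l1_subdiff_distance_1d(alpha, w_j, grad_j):
    """Distance of -grad_j to the subdifferential of alpha * |.| at w_j."""
    if w_j == 0.0:
        # Subdifferential is the interval [-alpha, alpha]
        return max(0.0, abs(grad_j) - alpha)
    # Subdifferential is the single point alpha * sign(w_j)
    sign = 1.0 if w_j > 0 else -1.0
    return abs(-grad_j - alpha * sign)


at_zero_ok = l1_subdiff_distance_1d(alpha=1.0, w_j=0.0, grad_j=0.5)
at_zero_violated = l1_subdiff_distance_1d(alpha=1.0, w_j=0.0, grad_j=-1.5)
nonzero_ok = l1_subdiff_distance_1d(alpha=1.0, w_j=2.0, grad_j=-1.0)
```
+
+A distance of zero means the optimality condition holds for that coordinate; a positive value measures how strongly it is violated.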
+
+
+The method ``is_penalized`` returns a binary mask with the penalized features.
+For the :math:`\ell_1` penalty, all the coefficients are penalized.
+Finally, ``generalized_support`` returns the generalized support of the penalty for some coefficient vector ``w``.
+For the :math:`\ell_1` penalty, it is the set of nonzero coefficients of ``w``.
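+
+For the :math:`\ell_1` penalty, both masks are simple to write down. The following plain-Python sketch uses lists of booleans where skglm would use NumPy boolean arrays, and the function names are illustrative:
+
```python
def l1_is_penalized(n_features):
    """Every feature is penalized by the L1 penalty."""
    return [True] * n_features


def l1_generalized_support(w):
    """Mask of the coefficients of w that are nonzero."""
    return [w_j != 0 for w_j in w]


w = [0.0, 1.3, 0.0, -0.7]
penalized = l1_is_penalized(len(w))
support = l1_generalized_support(w)
```
+
+Here ``support`` flags the second and fourth coefficients only, whereas ``penalized`` flags all four.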
+
+
+Optionally, a penalty might implement ``alpha_max``, which returns the smallest :math:`\lambda` for which the optimal solution is the zero vector.
+Note that since ``lambda`` is a reserved keyword in Python, ``alpha`` in the skglm codebase corresponds to :math:`\lambda`.
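+
+For the :math:`\ell_1` penalty, the zero vector is optimal exactly when :math:`\lvert \nabla_j F(0) \rvert \leq \lambda` for every :math:`j`, so the critical value is the largest absolute entry of the gradient at zero. A minimal sketch, assuming the gradient of the datafit at :math:`\beta = 0` is available as a list (the function name and toy values are illustrative):
+
```python
def l1_alpha_max(gradient0):
    """Smallest alpha for which beta = 0 is optimal with the L1 penalty."""
    return max(abs(g) for g in gradient0)


# Gradient of the datafit at beta = 0 (assumed toy values)
gradient0 = [0.5, -2.0, 1.0]
alpha_max = l1_alpha_max(gradient0)

# At alpha = alpha_max, no coordinate violates optimality at zero
violations = [max(0.0, abs(g) - alpha_max) for g in gradient0]
```
+
+Any ``alpha`` below ``alpha_max`` leaves at least one coordinate with a positive violation, so the solution is no longer identically zero.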
+
+Putting it all together gives the implementation of the ``L1`` penalty:
+
+
+.. literalinclude:: ../skglm/penalties/separable.py
+   :pyobject: L1
+
diff --git a/doc/conf.py b/doc/conf.py
@@ -159,7 +159,8 @@
     'navbar_links': [
         ("Examples", "auto_examples/index"),
         ("API", "api"),
-        ("Add custom penalty and datafit", "add"),
+        ("Add custom datafit", "add_datafit"),
+        ("Add custom penalty", "add_penalty"),
         ("GitHub", "https://github.com/scikit-learn-contrib/skglm", True)
     ],
     'bootswatch_theme': "united"