Commit 2cef37c

DOC Revamp create custom datafits and penalties (#116)
Co-authored-by: Pierre-Antoine Bannier <[email protected]>
Co-authored-by: Badr MOUFAD <[email protected]>
1 parent a4362ba commit 2cef37c

File tree

4 files changed: +195 -50 lines changed


doc/add.rst

Lines changed: 0 additions & 49 deletions
This file was deleted.

doc/add_datafit.rst

Lines changed: 121 additions & 0 deletions
@@ -0,0 +1,121 @@
:orphan:

How to add a custom datafit
~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. _how:
Motivated by generalized linear models but not limited to them, ``skglm`` solves problems of the form

.. math::
    \hat{\beta} \in
    \arg\min_{\beta \in \mathbb{R}^p}
    F(X\beta) + \Omega(\beta)
    := \sum_{i=1}^n f_i([X\beta]_i) + \sum_{j=1}^p \Omega_j(\beta_j)
    \enspace .

Here, :math:`X \in \mathbb{R}^{n \times p}` denotes the design matrix with :math:`n` samples and :math:`p` features,
and :math:`\beta \in \mathbb{R}^p` is the coefficient vector.

skglm can solve any problem of this form with an arbitrary smooth datafit :math:`F` and an arbitrary penalty :math:`\Omega` whose proximal operator can be evaluated explicitly, by defining two classes: a ``Penalty`` and a ``Datafit``.

They can then be passed to a :class:`~skglm.GeneralizedLinearEstimator`.

.. code-block:: python

    from skglm import GeneralizedLinearEstimator

    clf = GeneralizedLinearEstimator(
        MyDatafit(),
        MyPenalty(),
    )
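
For instance, combining the built-in ``Quadratic`` datafit with the ``L1`` penalty gives a Lasso-like estimator. The snippet below is a minimal sketch: it assumes ``Quadratic`` and ``L1`` can be imported from ``skglm.datafits`` and ``skglm.penalties``, and that ``alpha`` is the regularization level.

.. code-block:: python

    import numpy as np
    from skglm import GeneralizedLinearEstimator
    from skglm.datafits import Quadratic  # assumed import path
    from skglm.penalties import L1        # assumed import path

    X = np.random.randn(50, 10)
    y = np.random.randn(50)

    # least-squares datafit + L1 penalty, i.e. a Lasso-type estimator
    clf = GeneralizedLinearEstimator(Quadratic(), L1(alpha=0.1))
    clf.fit(X, y)
    print(clf.coef_)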

A ``Datafit`` is a jitclass which must inherit from the ``BaseDatafit`` class:

.. literalinclude:: ../skglm/datafits/base.py
   :pyobject: BaseDatafit

To define a custom datafit, you need to implement the methods declared in the ``BaseDatafit`` class.
At a minimum, you must overload the ``value`` and ``gradient`` methods for skglm to support the datafit.
Optionally, overloading the methods with the suffix ``_sparse`` adds support for sparse datasets (CSC matrices).
As an example, we show how to implement the Poisson datafit in skglm.
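
Schematically, a custom datafit is a class of the following shape. This is only a sketch: the method names follow the ``BaseDatafit`` interface above, but the exact signatures (here ``value(y, w, Xw)`` and ``gradient_scalar(X, y, w, Xw, j)``) are assumptions to check against your version of skglm.

.. code-block:: python

    import numpy as np
    from skglm.datafits import BaseDatafit  # assumed import path


    class MyLeastSquares(BaseDatafit):
        """Hypothetical datafit: F(Xw) = ||y - Xw||^2 / (2 n), dense data only."""

        def get_spec(self):
            return ()  # no attribute to type for Numba

        def params_to_dict(self):
            return dict()

        def initialize(self, X, y):
            pass  # nothing to pre-compute in this sketch

        def value(self, y, w, Xw):
            return np.sum((y - Xw) ** 2) / (2 * len(y))

        def gradient_scalar(self, X, y, w, Xw, j):
            # derivative of the datafit w.r.t. the j-th coordinate of w
            return np.dot(X[:, j], Xw - y) / len(y)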

A case in point: defining the Poisson datafit
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

First, this requires deriving some quantities used by the solvers, such as the gradient and the Hessian matrix of the datafit.
With :math:`y \in \mathbb{R}^n` the target vector, the Poisson datafit reads

.. math::
    f(X\beta) = \frac{1}{n}\sum_{i=1}^n \left( \exp([X\beta]_i) - y_i[X\beta]_i \right)
    \enspace .

Let's define some useful quantities to simplify our computations. For :math:`z \in \mathbb{R}^n` and :math:`\beta \in \mathbb{R}^p`,

.. math::
    f(z) = \sum_{i=1}^n f_i(z_i) \qquad F(\beta) = f(X\beta)
    \enspace .

Computing the gradient of :math:`F` and its Hessian matrix yields

.. math::
    \nabla F(\beta) = X^{\top} \underbrace{\nabla f(X\beta)}_\textrm{raw grad} \qquad \nabla^2 F(\beta) = X^{\top} \underbrace{\nabla^2 f(X\beta)}_\textrm{raw hessian} X
    \enspace .

Besides, it directly follows that

.. math::
    \nabla f(z) = (f_i'(z_i))_{1 \leq i \leq n} \qquad \nabla^2 f(z) = \textrm{diag}(f_i''(z_i))_{1 \leq i \leq n}
    \enspace .

We can now apply these definitions to the Poisson datafit:

.. math::
    f_i(z_i) = \frac{1}{n} \left(\exp(z_i) - y_i z_i\right)
    \enspace .

Therefore,

.. math::
    f_i'(z_i) = \frac{1}{n}(\exp(z_i) - y_i) \qquad f_i''(z_i) = \frac{1}{n}\exp(z_i)
    \enspace .

Computing ``raw_grad`` and ``raw_hessian`` for the Poisson datafit yields

.. math::
    \nabla f(X\beta) = \frac{1}{n}(\exp([X\beta]_i) - y_i)_{1 \leq i \leq n} \qquad \nabla^2 f(X\beta) = \frac{1}{n}\textrm{diag}(\exp([X\beta]_i))_{1 \leq i \leq n}
    \enspace .

Both ``raw_grad`` and ``raw_hessian`` are methods used by the ``ProxNewton`` solver.
Other solvers require different methods to be implemented. For instance, ``AndersonCD`` uses the ``gradient_scalar`` method,
which is the derivative of the datafit with respect to the :math:`j`-th coordinate of :math:`\beta`.

For the Poisson datafit, this yields

.. math::
    \frac{\partial F(\beta)}{\partial \beta_j} = \frac{1}{n}
    \sum_{i=1}^n X_{i,j} \left(
    \exp([X\beta]_i) - y_i
    \right)
    \enspace .
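
Before looking at the actual implementation, the quantities derived above can be translated into code roughly as follows. This is a sketch only: the method names match the ones discussed here, but the signatures (``raw_grad(y, Xw)``, ``raw_hessian(y, Xw)``, ``gradient_scalar(X, y, w, Xw, j)``) are assumptions to check against the skglm source.

.. code-block:: python

    import numpy as np


    class PoissonSketch:
        """Hypothetical Poisson datafit exposing the derived quantities."""

        def value(self, y, w, Xw):
            # f(Xw) = 1/n * sum_i (exp(Xw_i) - y_i * Xw_i)
            return np.sum(np.exp(Xw) - y * Xw) / len(y)

        def raw_grad(self, y, Xw):
            # nabla f(Xw), used by ProxNewton
            return (np.exp(Xw) - y) / len(y)

        def raw_hessian(self, y, Xw):
            # diagonal of nabla^2 f(Xw), used by ProxNewton
            return np.exp(Xw) / len(y)

        def gradient_scalar(self, X, y, w, Xw, j):
            # partial derivative w.r.t. the j-th coordinate, used by AndersonCD
            return np.dot(X[:, j], np.exp(Xw) - y) / len(y)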

When implementing these quantities in the ``Poisson`` datafit class, this gives:

.. literalinclude:: ../skglm/datafits/single_task.py
   :pyobject: Poisson

Note that we have not initialized any quantities in the ``initialize`` method.
Usually it serves to compute a Lipschitz constant of the datafit, whose inverse is used by the solver as a step size.
However, the gradient of the Poisson datafit is not Lipschitz continuous, since the eigenvalues of its Hessian matrix are unbounded.
This implies that a step size is not known in advance, and the solver has to perform a line search at every epoch.

doc/add_penalty.rst

Lines changed: 72 additions & 0 deletions
@@ -0,0 +1,72 @@
:orphan:

How to add a custom penalty
~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. _how:

skglm supports any proximable penalty.

A penalty is implemented as a jitclass which must inherit from the ``BasePenalty`` class:

.. literalinclude:: ../skglm/penalties/base.py
   :pyobject: BasePenalty

To implement your own penalty, you only need to define a new jitclass inheriting from ``BasePenalty``, and define how its value, proximal operator, distance to the subdifferential (used to assess KKT violation) and penalized features are computed, as sketched below.
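
Schematically, this amounts to a class of the following shape. This is a bare sketch: the method names follow the ``BasePenalty`` interface above, and the signatures shown are assumptions to adapt.

.. code-block:: python

    from skglm.penalties import BasePenalty  # assumed import path


    class MyPenalty(BasePenalty):
        def __init__(self, alpha):
            self.alpha = alpha  # regularization level

        def value(self, w):
            """Penalty value at the coefficient vector w."""
            ...

        def prox_1d(self, value, stepsize, j):
            """Proximal operator applied to the j-th coordinate."""
            ...

        def subdiff_distance(self, w, grad, ws):
            """Distance of -grad to the subdifferential (KKT violation)."""
            ...

        def is_penalized(self, n_features):
            """Boolean mask of penalized features."""
            ...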

A case in point: defining the L1 penalty
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We detail how the :math:`\ell_1` penalty is implemented in skglm.
For a vector :math:`\beta \in \mathbb{R}^p`, the :math:`\ell_1` penalty is defined as follows:

.. math::
    \lVert \beta \rVert_1 = \sum_{j=1}^p |\beta_j| \enspace .

The regularization level is controlled by the hyperparameter :math:`\lambda \in \mathbb{R}^+`, which is defined and initialized in the constructor of the class.

The ``get_spec`` method declares the types of the attributes of the penalty object, which allows Numba to JIT-compile the class.
It should return an iterable of tuples, the first element being the name of the attribute and the second its Numba type (e.g. ``float64``, ``bool_``).
Additionally, a penalty should implement ``params_to_dict``, a helper method that returns all the parameters of the penalty in a dictionary.
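
For an :math:`\ell_1`-style penalty with a single attribute ``alpha``, the constructor and these two helpers might look as follows (a sketch; the Numba types and return formats are assumptions):

.. code-block:: python

    from numba import float64
    from skglm.penalties import BasePenalty  # assumed import path


    class MyL1(BasePenalty):
        def __init__(self, alpha):
            self.alpha = alpha

        def get_spec(self):
            # (attribute name, Numba type) pairs
            return (('alpha', float64),)

        def params_to_dict(self):
            return dict(alpha=self.alpha)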

To optimize an objective with a given penalty, skglm needs at least the proximal operator of the penalty applied to the :math:`j`-th coordinate.
For the ``L1`` penalty, it is the well-known soft-thresholding operator:

.. math::
    \textrm{ST}(\beta, \lambda) = \max(0, |\beta| - \lambda) \, \mathrm{sgn}(\beta) \enspace .

Note that skglm expects the threshold level to be the regularization hyperparameter :math:`\lambda \in \mathbb{R}^+` **scaled by** the stepsize.
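
In code, the coordinate-wise proximal operator of the ``MyL1`` sketch might read (the ``prox_1d(value, stepsize, j)`` signature is an assumption):

.. code-block:: python

    import numpy as np


    # method of the MyL1 sketch above
    def prox_1d(self, value, stepsize, j):
        # soft-thresholding at level alpha * stepsize
        shift = self.alpha * stepsize
        return np.sign(value) * max(abs(value) - shift, 0.)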

Besides, by default all solvers in skglm have ``ws_strategy`` set to ``subdiff``.
This means that the optimality conditions (and thus the stopping criterion) are computed using the ``subdiff_distance`` method of the penalty.
If it is not implemented, the user should set ``ws_strategy`` to ``fixpoint``.

For the :math:`\ell_1` penalty, the distance of the negative gradient of the datafit :math:`F` to the subdifferential of the penalty reads

.. math::
    \mathrm{dist}(-\nabla_j F(\beta), \partial (\lambda |\beta_j|)) =
    \begin{cases}
        \max(0, |\nabla_j F(\beta)| - \lambda) & \text{if } \beta_j = 0 \\
        |-\nabla_j F(\beta) - \lambda \, \mathrm{sgn}(\beta_j)| & \text{otherwise}
    \end{cases}
    \enspace .
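
This distance can be sketched in code as follows, with ``ws`` the indices of the working set and ``grad`` the datafit gradient restricted to it (the ``subdiff_distance(w, grad, ws)`` signature is an assumption):

.. code-block:: python

    import numpy as np


    # method of the MyL1 sketch above
    def subdiff_distance(self, w, grad, ws):
        dist = np.zeros(len(ws))
        for idx, j in enumerate(ws):
            if w[j] == 0.:
                # distance of -grad_j to the interval [-alpha, alpha]
                dist[idx] = max(abs(grad[idx]) - self.alpha, 0.)
            else:
                # distance of -grad_j to the point alpha * sign(w_j)
                dist[idx] = abs(-grad[idx] - self.alpha * np.sign(w[j]))
        return dist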

The ``is_penalized`` method returns a binary mask of the penalized features.
For the :math:`\ell_1` penalty, all the coefficients are penalized.
Finally, ``generalized_support`` returns the generalized support of the penalty for some coefficient vector ``w``.
For :math:`\ell_1`, it is typically the set of non-zero coefficients of the solution vector.

Optionally, a penalty might implement ``alpha_max``, which returns the smallest :math:`\lambda` for which the optimal solution is the null vector.
Note that since ``lambda`` is a reserved keyword in Python, ``alpha`` is used in the skglm codebase to denote :math:`\lambda`.
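
For the :math:`\ell_1` case, ``is_penalized``, ``generalized_support`` and ``alpha_max`` can be sketched as follows (signatures, in particular the ``gradient0`` argument of ``alpha_max``, are assumptions):

.. code-block:: python

    import numpy as np


    # methods of the MyL1 sketch above
    def is_penalized(self, n_features):
        # every coefficient is penalized by the L1 norm
        return np.ones(n_features, dtype=np.bool_)

    def generalized_support(self, w):
        # non-zero coefficients
        return w != 0

    def alpha_max(self, gradient0):
        # smallest regularization level for which 0 is a solution,
        # with gradient0 the gradient of the datafit at w = 0
        return np.max(np.abs(gradient0))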

Putting it all together, this gives the implementation of the ``L1`` penalty:

.. literalinclude:: ../skglm/penalties/separable.py
   :pyobject: L1
doc/conf.py

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -159,7 +159,8 @@
159159
'navbar_links': [
160160
("Examples", "auto_examples/index"),
161161
("API", "api"),
162-
("Add custom penalty and datafit", "add"),
162+
("Add custom datafit", "add_datafit"),
163+
("Add custom penalty", "add_penalty"),
163164
("GitHub", "https://github.com/scikit-learn-contrib/skglm", True)
164165
],
165166
'bootswatch_theme': "united"
