Commit eee3223

borispf, mathurinm, and Badr-MOUFAD authored
DOC use math mode in all docstrings (#146)
Co-authored-by: mathurinm <[email protected]>
Co-authored-by: Badr-MOUFAD <[email protected]>
1 parent 07045ff commit eee3223

File tree: 17 files changed, +306 -238 lines changed

doc/_static/scripts/asciimath-defines.js

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+window.MathJax.startup = {
+    ready: () => {
+        AM = MathJax.InputJax.AsciiMath.AM;
+        AM.newsymbol({ input: "ell", tag: "mi", output: "\u2113", tex: "ell", ttype: AM.TOKEN.CONST });
+        AM.newsymbol({ input: "||", tag: "mi", output: "\u2225", tex: "Vert", ttype: AM.TOKEN.CONST });
+        AM.newsymbol({ input: "triangleq", tag: "mo", output: "\u225C", tex: "triangleq", ttype: AM.TOKEN.CONST });
+        MathJax.startup.defaultReady();
+    }
+};

doc/conf.py

Lines changed: 16 additions & 0 deletions
@@ -179,6 +179,22 @@
     ],
 }
 
+# Enable asciimath parsing in MathJax and configure the HTML renderer to output
+# the default asciimath delimiters. Asciimath will not be correctly rendered in
+# other output formats, but can likely be fixed using py-asciimath[1] to convert
+# to Latex.
+# [1]: https://pypi.org/project/py-asciimath/
+mathjax3_config = {
+    "loader": {
+        "load": ['input/asciimath']
+    },
+}
+mathjax_inline = ['`', '`']
+mathjax_display = ['`', '`']
+
+html_static_path = ['_static']
+html_js_files = ["scripts/asciimath-defines.js"]
+
 # -- Options for copybutton ---------------------------------------------
 # complete explanation of the regex expression can be found here
 # https://sphinx-copybutton.readthedocs.io/en/latest/use.html#using-regexp-prompt-identifiers
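
With these delimiters, backtick-delimited text in docstrings is parsed as AsciiMath and rendered by MathJax in the HTML build, and the extra symbols registered in asciimath-defines.js (ell, ||, triangleq) become available. A minimal, hypothetical docstring sketch, not copied from skglm's source:

class L1Sketch:
    r"""Illustrative penalty docstring using the AsciiMath delimiters above.

    The `ell_1` penalty reads `lambda ||beta||_1 = lambda sum_(j=1)^p |beta_j|`,
    where `lambda in RR^+` controls the regularization strength.
    """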

doc/doc-requirements.txt

Lines changed: 1 addition & 1 deletion
@@ -8,4 +8,4 @@ sphinx-bootstrap-theme
 sphinx_copybutton
 sphinx-gallery
 pytest
-furo
+furo

doc/index.rst

Lines changed: 6 additions & 6 deletions
@@ -3,16 +3,16 @@
 You can adapt this file completely to your liking, but it should at least
 contain the root `toctree` directive.
 
-========
+=========
 ``skglm``
-========
+=========
 *— A fast and modular scikit-learn replacement for sparse GLMs —*
 
 --------
 
 
-``skglm`` is a Python package that offers **fast estimators** for sparse Generalized Linear Models (GLMs)
-that are **100% compatible with** ``scikit-learn``. It is **highly flexible** and supports a wide range of GLMs.
+``skglm`` is a Python package that offers **fast estimators** for sparse Generalized Linear Models (GLMs)
+that are **100% compatible with** ``scikit-learn``. It is **highly flexible** and supports a wide range of GLMs.
 You get to choose from ``skglm``'s already-made estimators or **customize your own** by combining the available datafits and penalties.
 
 Get a hands-on glimpse on ``skglm`` through the :ref:`Getting started page <getting_started>`.
@@ -40,7 +40,7 @@ There are several reasons to opt for ``skglm`` among which:
 
 
 Installing ``skglm``
--------------------
+--------------------
 
 ``skglm`` is available on PyPi. Get the latest version of the package by running
 
@@ -59,7 +59,7 @@ Other advanced topics and uses-cases are covered in :ref:`Tutorials <tutorials>`
 
 .. note::
 
-    Currently, ``skglm`` is unavailable on Conda but will be released very soon...
+    Currently, ``skglm`` is unavailable on Conda but will be released very soon...
 
 
 Cite

doc/tutorials/add_datafit.rst

Lines changed: 14 additions & 14 deletions
@@ -12,7 +12,7 @@ Motivated by generalized linear models but not limited to it, ``skglm`` solves p
     \arg\min_{\beta \in \mathbb{R}^p}
     F(X\beta) + \Omega(\beta)
     := \sum_{i=1}^n f_i([X\beta]_i) + \sum_{j=1}^p \Omega_j(\beta_j)
-    \enspace .
+    \ .
 
 
 Here, :math:`X \in \mathbb{R}^{n \times p}` denotes the design matrix with :math:`n` samples and :math:`p` features,
@@ -49,50 +49,50 @@ First, this requires deriving some quantities used by the solvers like the gradi
 With :math:`y \in \mathbb{R}^n` the target vector, the Poisson datafit reads
 
 .. math::
-    f(X\beta) = \frac{1}{n}\sum_{i=1}^n \exp([X\beta]_i) - y_i[X\beta]_i
-    \enspace .
+    f(X\beta) = \frac{1}{n}\sum_{i=1}^n \exp([X\beta]_i) - y_i[X\beta]_i
+    \ .
 
 
 Let's define some useful quantities to simplify our computations. For :math:`z \in \mathbb{R}^n` and :math:`\beta \in \mathbb{R}^p`,
 
 .. math::
     f(z) = \sum_{i=1}^n f_i(z_i) \qquad F(\beta) = f(X\beta)
-    \enspace .
+    \ .
 
 
 Computing the gradient of :math:`F` and its Hessian matrix yields
 
 .. math::
-    \nabla F(\beta) = X^{\top} \underbrace{\nabla f(X\beta)}_\textrm{raw grad} \qquad \nabla^2 F(\beta) = X^{\top} \underbrace{\nabla^2 f(X\beta)}_\textrm{raw hessian} X
-    \enspace .
+    \nabla F(\beta) = X^{\top} \underbrace{\nabla f(X\beta)}_"raw grad" \qquad \nabla^2 F(\beta) = X^{\top} \underbrace{\nabla^2 f(X\beta)}_"raw hessian" X
+    \ .
 
 
 Besides, it directly follows that
 
 .. math::
-    \nabla f(z) = (f_i'(z_i))_{1 \leq i \leq n} \qquad \nabla^2 f(z) = \textrm{diag}(f_i''(z_i))_{1 \leq i \leq n}
-    \enspace .
+    \nabla f(z) = (f_i^'(z_i))_{1 \leq i \leq n} \qquad \nabla^2 f(z) = "diag"(f_i^('')(z_i))_{1 \leq i \leq n}
+    \ .
 
 
 We can now apply these definitions to the Poisson datafit:
 
 .. math::
     f_i(z_i) = \frac{1}{n} \left(\exp(z_i) - y_iz_i\right)
-    \enspace .
+    \ .
 
 
 Therefore,
 
 .. math::
-    f_i'(z_i) = \frac{1}{n}(\exp(z_i) - y_i) \qquad f_i''(z_i) = \frac{1}{n}\exp(z_i)
-    \enspace .
+    f_i^'(z_i) = \frac{1}{n}(\exp(z_i) - y_i) \qquad f_i^('')(z_i) = \frac{1}{n}\exp(z_i)
+    \ .
 
 
 Computing ``raw_grad`` and ``raw_hessian`` for the Poisson datafit yields
 
 .. math::
-    \nabla f(X\beta) = \frac{1}{n}(\exp([X\beta]_i) - y_i)_{1 \leq i \leq n} \qquad \nabla^2 f(X\beta) = \frac{1}{n}\textrm{diag}(\exp([X\beta]_i))_{1 \leq i \leq n}
-    \enspace .
+    \nabla f(X\beta) = \frac{1}{n}(\exp([X\beta]_i) - y_i)_{1 \leq i \leq n} \qquad \nabla^2 f(X\beta) = \frac{1}{n}"diag"(\exp([X\beta]_i))_{1 \leq i \leq n}
+    \ .
 
 
 Both ``raw_grad`` and ``raw_hessian`` are methods used by the ``ProxNewton`` solver.
@@ -106,7 +106,7 @@ For the Poisson datafit, this yields
     \sum_{i=1}^n X_{i,j} \left(
     \exp([X\beta]_i) - y
     \right)
-    \enspace .
+    \ .
 
 
 When implementing these quantities in the ``Poisson`` datafit class, this gives:
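
As a rough sketch of how this derivation maps to code, assuming ``y`` and the linear predictions ``Xw`` are NumPy arrays (the method names follow the tutorial, but the exact signatures in skglm may differ):

import numpy as np

class PoissonSketch:
    """Illustrative datafit exposing the two quantities derived above."""

    def raw_grad(self, y, Xw):
        # nabla f(X beta): (exp([X beta]_i) - y_i) / n for each sample i
        return (np.exp(Xw) - y) / len(y)

    def raw_hessian(self, y, Xw):
        # diagonal of nabla^2 f(X beta): exp([X beta]_i) / n
        return np.exp(Xw) / len(y)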

doc/tutorials/add_penalty.rst

Lines changed: 8 additions & 9 deletions
@@ -22,20 +22,20 @@ We detail how the :math:`\ell_1` penalty is implemented in skglm.
 For a vector :math:`\beta \in \mathbb{R}^p`, the :math:`\ell_1` penalty is defined as follows:
 
 .. math::
-    \lvert\lvert \beta \rvert\rvert_1 = \sum_{i=1}^p |\beta _i| \enspace .
+    || \beta ||_1 = \sum_{i=1}^p |\beta _i| \ .
 
 
-The regularization level is controlled by the hyperparameter :math:`\lambda \in \mathbb{R}^+`, that is defined and initialized in the constructor of the class.
+The regularization level is controlled by the hyperparameter :math:`\lambda \in bb(R)^+`, that is defined and initialized in the constructor of the class.
 
 The method ``get_spec`` allows to strongly type the attributes of the penalty object, thus allowing Numba to JIT-compile the class.
-It should return an iterable of tuples, the first element being the name of the attribute, the second its Numba type (e.g. `float64`, `bool_`).
+It should return an iterable of tuples, the first element being the name of the attribute, the second its Numba type (e.g. ``float64``, ``bool_``).
 Additionally, a penalty should implement ``params_to_dict``, a helper method to get all the parameters of a penalty returned in a dictionary.
 
 To optimize an objective with a given penalty, skglm needs at least the proximal operator of the penalty applied to the :math:`j`-th coordinate.
 For the ``L1`` penalty, it is the well-known soft-thresholding operator:
 
 .. math::
-    \textrm{ST}(\beta , \lambda) = \mathrm{max}(0, \lvert \beta \rvert - \lambda) \mathrm{sgn}(\beta) \enspace .
+    "ST"(\beta , \lambda) = "max"(0, |\beta| - \lambda) "sgn"(\beta)\ .
 
 
 Note that skglm expects the threshold level to be the regularization hyperparameter :math:`\lambda \in \mathbb{R}^+` **scaled by** the stepsize.
@@ -48,11 +48,10 @@ If not implemented, the user should set ``ws_strategy`` to ``fixpoint``.
 For the :math:`\ell_1` penalty, the distance of the negative gradient of the datafit :math:`F` to the subdifferential of the penalty reads
 
 .. math::
-    \mathrm{dist}(-\nabla_j F(\beta), \partial |\beta_j|) = \begin{cases}
-    \mathrm{max}(0, \lvert -\nabla_j F(\beta) \rvert - \lambda) \\
-    \lvert -\nabla_j F(\beta) - \lambda \mathrm{sgn}(\beta_j) \lvert \\
-    \end{cases}
-    \enspace .
+    "dist"(-\nabla_j F(\beta), \partial |\beta_j|) =
+    {("max"(0, | -\nabla_j F(\beta) | - \lambda),),
+    (| -\nabla_j F(\beta) - \lambda "sgn"(\beta_j) |,):}
+    \ .
 
 
 The method ``is_penalized`` returns a binary mask with the penalized features.
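
A minimal NumPy sketch of the two coordinate-wise operations above, soft-thresholding and the subdifferential distance, with illustrative names only (skglm's actual ``L1`` class additionally handles working sets, stepsizes and Numba typing):

import numpy as np

def soft_threshold(beta_j, threshold):
    # ST(beta_j, threshold) = max(0, |beta_j| - threshold) * sgn(beta_j),
    # where threshold is lambda scaled by the stepsize, as explained above
    return np.sign(beta_j) * max(0.0, abs(beta_j) - threshold)

def l1_subdiff_distance(neg_grad_j, beta_j, lmbda):
    # distance of -grad_j F(beta) to the subdifferential of lambda * |beta_j|
    if beta_j == 0.0:
        return max(0.0, abs(neg_grad_j) - lmbda)
    return abs(neg_grad_j - lmbda * np.sign(beta_j))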

doc/tutorials/intercept.md

Lines changed: 62 additions & 62 deletions
@@ -1,54 +1,54 @@
-This note gives insights and guidance for the handling of an intercept coefficient within the $\texttt{skglm}$ solvers.
+This note gives insights and guidance for the handling of an intercept coefficient within the `skglm` solvers.
 
-Let the design matrix be $X\in \mathbb{R}^{n\times p}$ where $n$ is the number of samples and $p$ the number of features.
-We denote $\beta\in\mathbb{R}^p$ the coefficients of the Generalized Linear Model and $\beta_0$ its intercept.
-In many packages such as `liblinear`, the intercept is handled by adding an extra column of ones in the design matrix. This is costly in memory, and may lead to different solutions if all coefficients are penalized, as the intercept $\beta_0$ is usually not.
+Let the design matrix be $X in RR^{n times p}$ where $n$ is the number of samples and $p$ the number of features.
+We denote $beta in RR^p$ the coefficients of the Generalized Linear Model and $beta_0$ its intercept.
+In many packages such as `liblinear`, the intercept is handled by adding an extra column of ones in the design matrix. This is costly in memory, and may lead to different solutions if all coefficients are penalized, as the intercept $beta_0$ is usually not.
 `skglm` follows a different route and solves directly:
 
-\begin{align}
-    \beta^\star, \beta_0^\star
-    \in
-    \underset{\beta \in \mathbb{R}^p, \beta_0 \in \mathbb{R}}{\text{argmin}}
-    \Phi(\beta)
-    \triangleq
-    \underbrace{F(X\beta + \beta_0\boldsymbol{1}_{n})}_{\triangleq f(\beta, \beta_0)}
-    + \sum_{j=1}^p g_j(\beta_j)
-    \enspace ,
-\end{align}
+```{math}
+    beta^star, beta_0^star
+    in
+    underset(beta in RR^p, beta_0 in RR)("argmin")
+    Phi(beta)
+    triangleq
+    underbrace(F(X beta + beta_0 bb"1"_n))_(triangleq f(beta, beta_0))
+    + sum_(j=1)^p g_j(beta_j)
+    \ ,
+```
 
 
-where $\boldsymbol{1}_{n}$ is the vector of size $n$ composed only of ones.
+where $bb"1"_{n}$ is the vector of size $n$ composed only of ones.
 
 
-The solvers of `skglm` update the intercept after each update of $\beta$ by doing a (1 dimensional) gradient descent update:
+The solvers of `skglm` update the intercept after each update of $beta$ by doing a (1 dimensional) gradient descent update:
 
-\begin{align}
-    \beta^{(k+1)}_0 = \beta^{(k)}_0 - \frac{1}{L_0}\nabla_{\beta_0}F(X\beta^{(k)} + \beta_0^{(k)}\boldsymbol{1}_{n})
-    \enspace ,
-\end{align}
+```{math}
+    beta_0^((k+1)) = beta_0^((k)) - 1/(L_0) nabla_(beta_0)F(X beta^((k)) + beta_0^((k)) bb"1"_{n})
+    \ ,
+```
 
 where $L_0$ is the Lipschitz constant associated to the intercept.
 The local Lipschitz constant $L_0$ statisfies the following inequality
 
 $$
-\forall x, x_0\in \mathbb{R}^p \times \mathbb{R}, \forall h \in \mathbb{R}, |\nabla_{x_0} f(x, x_0 + h) - \nabla_{x_0} f(x, x_0)| \leq L_0 |h| \enspace .
+\forall x, x_0 in RR^p times RR, \forall h in RR, |nabla_(x_0) f(x, x_0 + h) - nabla_(x_0) f(x, x_0)| <= L_0 |h| \ .
 $$
 
 This update rule should be implemented in the `intercept_update_step` method of the datafit class.
 
-The convergence criterion computed for the gradient is then only the absolute value of the gradient with respect to $\beta_0$ since the intercept optimality condition, for a solution $\beta^\star$, $\beta_0^\star$ is:
+The convergence criterion computed for the gradient is then only the absolute value of the gradient with respect to $beta_0$ since the intercept optimality condition, for a solution $beta^star$, $beta_0^star$ is:
 
-\begin{align}
-    \nabla_{\beta_0}F(X\beta^\star + \beta_0^\star\boldsymbol{1}_{n}) = 0
-    \enspace ,
-\end{align}
+```{math}
+    nabla_(beta_0)F(X beta^star + beta_0^star bb"1"_n) = 0
+    \ ,
+```
 
 Moreover, we have that
 
-\begin{align}
-    \nabla_{\beta_0}F(X\beta + \beta_0\boldsymbol{1}_{n}) = \boldsymbol{1}_{n}^\top \nabla_\beta F(X\beta + \beta_0\boldsymbol{1}_{n})
-    \enspace .
-\end{align}
+```{math}
+    nabla_(beta_0) F(X beta + beta_0 bb"1"_n) = bb"1"_n^\top nabla_beta F(X beta + beta_0 bb"1"_n)
+    \ .
+```
 
 
 We will now derive the update used in Equation 2 for three different datafitting functions.
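
A minimal sketch of this one-dimensional step, combining Eq. 2 with the identity of Eq. 4; ``grad_f`` stands for the datafit gradient nabla f, and all names are illustrative rather than skglm's API:

import numpy as np

def intercept_gradient_step(grad_f, y, X, w, intercept, lipschitz_0):
    # One gradient descent step on beta_0 (Eq. 2); by Eq. 4 the gradient
    # w.r.t. beta_0 is 1_n^T grad_f(X beta + beta_0 1_n).
    Xw = X @ w + intercept
    return intercept - np.sum(grad_f(y, Xw)) / lipschitz_0
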
@@ -59,19 +59,19 @@ We will now derive the update used in Equation 2 for three different datafitting
 
 We define
 
-\begin{align}
-    F(X\beta + \beta_0\boldsymbol{1}_{n}) = \frac{1}{2n} \lVert y - X\beta - \beta_0\boldsymbol{1}_{n} \rVert^2_2
-    \enspace .
-\end{align}
+```{math}
+    F(X beta + beta_0 bb"1"_n) = 1/(2n) norm(y - X beta - beta_0 bb"1"_{n})_2^2
+    \ .
+```
 
-In this case $\nabla f(z) = \frac{1}{n}(z - y)$ hence Eq. 4 is equal to:
+In this case $nabla f(z) = 1/n (z - y)$ hence Eq. 4 is equal to:
 
-\begin{align}
-    \nabla_{\beta_0}F(X\beta + \beta_0\boldsymbol{1}_{n}) = \frac{1}{n}\sum_{i=1}^n(X_{i:}\beta + \beta_0 - y_i)
-    \enspace .
-\end{align}
+```{math}
+    nabla_(beta_0) F(X beta + beta_0 bb"1"_n) = 1/n sum_(i=1)^n (X_( i: ) beta + beta_0 - y_i)
+    \ .
+```
 
-Finally, the Lipschitz constant is $L_0 = \frac{1}{n}\sum_{i=1}^n 1^2 = 1$.
+Finally, the Lipschitz constant is $L_0 = 1/n sum_(i=1)^n 1^2 = 1$.
 
 
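
For the quadratic datafit this specializes to the sketch below (illustrative names, not skglm's actual ``intercept_update_step``), using nabla f(z) = (z - y)/n and L_0 = 1:

import numpy as np

def quadratic_intercept_step(y, X, w, intercept):
    # gradient w.r.t. beta_0: (1/n) sum_i (X_i: w + beta_0 - y_i); L_0 = 1
    grad_0 = np.mean(X @ w + intercept - y)
    return intercept - grad_0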

@@ -81,46 +81,46 @@ Finally, the Lipschitz constant is $L_0 = \frac{1}{n}\sum_{i=1}^n 1^2 = 1$.
 
 In this case,
 
-\begin{align}
-    F(X\beta + \beta_0\boldsymbol{1}_{n}) = \frac{1}{n} \sum_{i=1}^n \log(1 + \exp(-y_i(X_{i:}\beta + \beta_0\boldsymbol{1}_n))
-\end{align}
+```{math}
+    F(X beta + beta_0 bb"1"_{n}) = 1/n sum_(i=1)^n log(1 + exp(-y_i(X_( i: ) beta + beta_0 bb"1"_n))
+```
 
 
 We can then write
 
-\begin{align}
-    \nabla_{\beta_0}F(X\beta + \beta_0\boldsymbol{1}_{n}) = \frac{1}{n} \sum_{i=1}^n \frac{-y_i}{1 + \exp(- y_i(X_{i:}\beta + \beta_0\boldsymbol{1}_n))} \enspace .
-\end{align}
+```{math}
+    nabla_(beta_0) F(X beta + beta_0 bb"1"_n) = 1/n sum_(i=1)^n (-y_i)/(1 + exp(-y_i(X_( i: ) beta + beta_0 bb"1"_n))) \ .
+```
 
 
-Finally, the Lipschitz constant is $L_0 = \frac{1}{4n}\sum_{i=1}^n 1^2 = \frac{1}{4}$.
+Finally, the Lipschitz constant is $L_0 = 1/(4n) sum_(i=1)^n 1^2 = 1/4$.
 
 ---
 
 ## The Huber datafit
 
 In this case,
 
-\begin{align}
-    F(X\beta + \beta_0\boldsymbol{1}_{n}) = \frac{1}{n} \sum_{i=1}^n f_{\delta}(y_i - X_{i:}\beta - \beta_0\boldsymbol{1}_n)) \enspace ,
-\end{align}
+```{math}
+    F(X beta + beta_0 bb"1"_{n}) = 1/n sum_(i=1)^n f_(delta) (y_i - X_( i: ) beta - beta_0 bb"1"_n) \ ,
+```
 
 where
 
-\begin{align}
-    f_\delta(x) = \begin{cases}
-        \frac{1}{2}x^2 & \text{if } x \leq \delta \\
-        \delta |x| - \frac{1}{2}\delta^2 & \text{if } x > \delta
-    \end{cases} \enspace .
-\end{align}
+```{math}
+    f_delta(x) = {
+        (1/2 x^2, if x <= delta),
+        (delta |x| - 1/2 delta^2, if x > delta)
+    :} \ .
+```
 
 
-Let $r_i = y_i - X_{i:}\beta - \beta_0\boldsymbol{1}_n$. We can then write
+Let $r_i = y_i - X_( i: ) beta - beta_0 bb"1"_n$. We can then write
 
-\begin{align}
-    \nabla_{\beta_0}F(X\beta + \beta_0\boldsymbol{1}_{n}) = \frac{1}{n} \sum_{i=1}^n r_i\mathbb{1}_{\{|r_i|\leq\delta\}} + \text{sign}(r_i)\delta\mathbb{1}_{\{|r_i|>\delta\}} \enspace ,
-\end{align}
+```{math}
+    nabla_(beta_0) F(X beta + beta_0 bb"1"_{n}) = 1/n sum_(i=1)^n r_i bbb"1"_({|r_i| <= delta}) + "sign"(r_i) delta bbb"1"_({|r_i| > delta}) \ ,
+```
 
-where $1_{x > \delta}$ is the classical indicator function.
+where $bbb"1"_({x > delta})$ is the classical indicator function.
 
-Finally, the Lipschitz constant is $L_0 = \frac{1}{n}\sum_{i=1}^n 1^2 = 1$.
+Finally, the Lipschitz constant is $L_0 = 1/n sum_(i=1)^n 1^2 = 1$.
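
Analogous sketches for the logistic and Huber datafits (illustrative only, with labels y in {-1, 1} for the logistic case), using the Lipschitz constants L_0 = 1/4 and L_0 = 1 derived above:

import numpy as np

def logistic_intercept_step(y, X, w, intercept):
    # gradient of the logistic datafit w.r.t. beta_0; L_0 = 1/4
    Xw = X @ w + intercept
    grad_0 = np.mean(-y / (1.0 + np.exp(y * Xw)))
    return intercept - grad_0 / 0.25

def huber_intercept_step(y, X, w, intercept, delta):
    # gradient of the Huber datafit w.r.t. beta_0: mean of the residuals
    # clipped at +/- delta (up to sign); L_0 = 1
    r = y - X @ w - intercept
    grad_0 = -np.mean(np.clip(r, -delta, delta))
    return intercept - grad_0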
