
Commit 92c2a45

Klopfe and mathurinm authored
[WIP] Doc - Intercept computation (#84)
Co-authored-by: mathurinm <[email protected]>
1 parent 1ee69d4 commit 92c2a45

5 files changed (+144, −1 lines)

doc/conf.py

Lines changed: 7 additions & 1 deletion
@@ -58,9 +58,15 @@
     'sphinx.ext.autosectionlabel',
     'numpydoc',
     'sphinx.ext.linkcode',
-    'gh_substitutions',  # custom ext, see ./sphinxext/gh_substitutions.py
+    'gh_substitutions',
+    'myst_parser',
+    # custom ext, see ./sphinxext/gh_substitutions.py
 ]
 
+myst_enable_extensions = [
+    "dollarmath",
+    "amsmath"
+]
 # generate autosummary even if no references
 autosummary_generate = True

doc/doc-requirements.txt

Lines changed: 1 addition & 0 deletions
@@ -1,6 +1,7 @@
 benchopt
 libsvmdata>=0.2
 matplotlib>=2.0.0
+myst_parser
 numpydoc
 pillow
 sphinx-bootstrap-theme

doc/index.rst

Lines changed: 2 additions & 0 deletions
@@ -56,3 +56,5 @@ Release history
    :maxdepth: 1
 
    whats_new.rst
+   intercept.rst
+

doc/intercept.md

Lines changed: 126 additions & 0 deletions
This note gives insights and guidance for the handling of an intercept coefficient within the `skglm` solvers.

Let the design matrix be $X\in \mathbb{R}^{n\times p}$, where $n$ is the number of samples and $p$ the number of features.
We denote by $\beta\in\mathbb{R}^p$ the coefficients of the Generalized Linear Model and by $\beta_0$ its intercept.
In many packages such as `liblinear`, the intercept is handled by adding an extra column of ones to the design matrix. This is costly in memory, and may lead to a different solution if all coefficients are penalized, since the intercept $\beta_0$ is usually not penalized.
`skglm` follows a different route and directly solves

\begin{align}
\beta^\star, \beta_0^\star
\in
\underset{\beta \in \mathbb{R}^p, \beta_0 \in \mathbb{R}}{\text{argmin}}
\Phi(\beta, \beta_0)
\triangleq
\underbrace{F(X\beta + \beta_0\boldsymbol{1}_{n})}_{\triangleq f(\beta, \beta_0)}
+ \sum_{j=1}^p g_j(\beta_j)
\enspace ,
\end{align}

where $\boldsymbol{1}_{n}$ is the vector of size $n$ whose entries are all ones.

The solvers of `skglm` update the intercept after each update of $\beta$ with a one-dimensional gradient descent step:

\begin{align}
\beta^{(k+1)}_0 = \beta^{(k)}_0 - \frac{1}{L_0}\nabla_{\beta_0}F(X\beta^{(k)} + \beta_0^{(k)}\boldsymbol{1}_{n})
\enspace ,
\end{align}

where $L_0$ is the Lipschitz constant associated to the intercept, that is, a constant satisfying

$$
\forall x \in \mathbb{R}^p, \ \forall x_0, h \in \mathbb{R}, \quad |\nabla_{x_0} f(x, x_0 + h) - \nabla_{x_0} f(x, x_0)| \leq L_0 |h| \enspace .
$$

This update rule should be implemented in the `intercept_update_step` method of the datafit class.
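
As a rough illustration, here is a sketch of what such a method could look like for the quadratic datafit treated below. The method name comes from the note above, but the signature and return convention (taking `y` and the current prediction `Xw`, and returning $\frac{1}{L_0}\nabla_{\beta_0}F$, which the solver subtracts from $\beta_0$) are assumptions made for this example, not a description of the exact `skglm` API.

```python
import numpy as np


class QuadraticDatafitSketch:
    """Hypothetical datafit sketch, not the actual skglm class."""

    def intercept_update_step(self, y, Xw):
        # Assumed convention: return (1 / L_0) * grad_{beta_0} F evaluated at Xw.
        # For the quadratic datafit below, grad = mean(Xw - y) and L_0 = 1.
        return np.mean(Xw - y)
```

Under this convention, a solver would update the intercept as `beta_0 -= datafit.intercept_update_step(y, X @ beta + beta_0)` after each pass on $\beta$.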

The stopping criterion associated with the intercept is then simply the absolute value of this gradient, since the optimality condition for a solution $(\beta^\star, \beta_0^\star)$ reads

\begin{align}
\nabla_{\beta_0}F(X\beta^\star + \beta_0^\star\boldsymbol{1}_{n}) = 0
\enspace .
\end{align}

Moreover, we have that

\begin{align}
\nabla_{\beta_0}F(X\beta + \beta_0\boldsymbol{1}_{n}) = \boldsymbol{1}_{n}^\top \nabla F(X\beta + \beta_0\boldsymbol{1}_{n})
\enspace ,
\end{align}

where $\nabla F$ denotes the gradient of $F$ with respect to its vector argument.
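
As a quick sanity check of this identity (an illustration written for this note, not part of `skglm`), one can compare the right-hand side with a finite-difference approximation of the derivative with respect to $\beta_0$, here for the quadratic datafit introduced below:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 5
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
beta = rng.standard_normal(p)
beta_0 = 0.3

def F(z):
    # Quadratic datafit: F(z) = ||y - z||^2 / (2 n)
    return np.sum((y - z) ** 2) / (2 * n)

z = X @ beta + beta_0
grad_F = (z - y) / n                 # gradient of F w.r.t. its argument
lhs = np.ones(n) @ grad_F            # 1_n^T grad F(z)
eps = 1e-6
rhs = (F(X @ beta + beta_0 + eps) - F(X @ beta + beta_0 - eps)) / (2 * eps)
print(np.allclose(lhs, rhs))         # True, up to finite-difference error
```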

We now derive the gradient and Lipschitz constant used in this update for three different datafits.

---

## The Quadratic datafit

We define

\begin{align}
F(X\beta + \beta_0\boldsymbol{1}_{n}) = \frac{1}{2n} \lVert y - X\beta - \beta_0\boldsymbol{1}_{n} \rVert^2_2
\enspace .
\end{align}

In this case $\nabla F(z) = \frac{1}{n}(z - y)$, hence the gradient with respect to the intercept is

\begin{align}
\nabla_{\beta_0}F(X\beta + \beta_0\boldsymbol{1}_{n}) = \frac{1}{n}\sum_{i=1}^n(X_{i:}\beta + \beta_0 - y_i)
\enspace .
\end{align}

Finally, the Lipschitz constant is $L_0 = \frac{1}{n}\sum_{i=1}^n 1^2 = 1$.
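
For illustration, here is a hypothetical NumPy version of this update (a sketch for this note, not `skglm` code). Since $F$ is exactly quadratic in $\beta_0$ with curvature $L_0 = 1$, a single step cancels the intercept gradient:

```python
import numpy as np

def quadratic_intercept_update(X, y, beta, beta_0):
    # Gradient w.r.t. the intercept: (1 / n) * sum(X_i beta + beta_0 - y_i); L_0 = 1.
    grad = np.mean(X @ beta + beta_0 - y)
    return beta_0 - grad / 1.0

rng = np.random.default_rng(0)
X, y = rng.standard_normal((10, 3)), rng.standard_normal(10)
beta = rng.standard_normal(3)
new_beta_0 = quadratic_intercept_update(X, y, beta, beta_0=0.0)
print(np.isclose(new_beta_0, np.mean(y - X @ beta)))  # True: one step reaches optimality in beta_0
```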
---

## The Logistic datafit

In this case,

\begin{align}
F(X\beta + \beta_0\boldsymbol{1}_{n}) = \frac{1}{n} \sum_{i=1}^n \log(1 + \exp(-y_i(X_{i:}\beta + \beta_0)))
\enspace .
\end{align}

We can then write

\begin{align}
\nabla_{\beta_0}F(X\beta + \beta_0\boldsymbol{1}_{n}) = \frac{1}{n} \sum_{i=1}^n \frac{-y_i}{1 + \exp(y_i(X_{i:}\beta + \beta_0))} \enspace .
\end{align}

Finally, since the labels satisfy $y_i \in \{-1, 1\}$, the Lipschitz constant is $L_0 = \frac{1}{4n}\sum_{i=1}^n 1^2 = \frac{1}{4}$.
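
A NumPy sketch of the corresponding update (illustration only, assuming labels $y_i \in \{-1, 1\}$; the helper name is hypothetical, not `skglm` code):

```python
import numpy as np

def logistic_intercept_update(X, y, beta, beta_0):
    # Gradient w.r.t. the intercept: (1 / n) * sum(-y_i / (1 + exp(y_i (X_i beta + beta_0)))).
    z = X @ beta + beta_0
    grad = np.mean(-y / (1.0 + np.exp(y * z)))
    L_0 = 0.25  # Lipschitz constant derived above
    return beta_0 - grad / L_0
```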
---

## The Huber datafit

In this case,

\begin{align}
F(X\beta + \beta_0\boldsymbol{1}_{n}) = \frac{1}{n} \sum_{i=1}^n f_{\delta}(y_i - X_{i:}\beta - \beta_0) \enspace ,
\end{align}

where

\begin{align}
f_\delta(x) = \begin{cases}
\frac{1}{2}x^2 & \text{if } |x| \leq \delta \\
\delta |x| - \frac{1}{2}\delta^2 & \text{if } |x| > \delta
\end{cases} \enspace .
\end{align}

Let $r_i = y_i - X_{i:}\beta - \beta_0$. We can then write

\begin{align}
\nabla_{\beta_0}F(X\beta + \beta_0\boldsymbol{1}_{n}) = -\frac{1}{n} \sum_{i=1}^n \left( r_i\mathbb{1}_{\{|r_i|\leq\delta\}} + \delta \, \mathrm{sign}(r_i)\mathbb{1}_{\{|r_i|>\delta\}} \right) \enspace ,
\end{align}

where $\mathbb{1}_{\{\cdot\}}$ is the indicator function of the event in subscript.

Finally, the Lipschitz constant is $L_0 = \frac{1}{n}\sum_{i=1}^n 1^2 = 1$.
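
And a NumPy sketch of the corresponding update (again a hypothetical helper for illustration, not `skglm` code), with the piecewise derivative $f_\delta'$ written via `np.where`:

```python
import numpy as np

def huber_intercept_update(X, y, beta, beta_0, delta):
    # Residuals and piecewise derivative f_delta'(r): r if |r| <= delta, else delta * sign(r).
    r = y - X @ beta - beta_0
    f_prime = np.where(np.abs(r) <= delta, r, delta * np.sign(r))
    grad = -np.mean(f_prime)  # gradient w.r.t. the intercept
    L_0 = 1.0                 # Lipschitz constant derived above
    return beta_0 - grad / L_0
```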

doc/intercept.rst

Lines changed: 8 additions & 0 deletions
.. _intercept:

Computation of the intercept
============================
.. currentmodule:: skglm

.. include:: intercept.md
   :parser: myst_parser.sphinx_
