This note gives insights and guidance for the handling of the intercept coefficient within the `skglm` solvers.

Let the design matrix be $X\in \mathbb{R}^{n\times p}$, where $n$ is the number of samples and $p$ the number of features.
We denote by $\beta\in\mathbb{R}^p$ the coefficients of the Generalized Linear Model and by $\beta_0$ its intercept.
In many packages, such as `liblinear`, the intercept is handled by adding an extra column of ones to the design matrix. This is costly in memory, and may lead to a different solution if all coefficients are penalized, since the intercept $\beta_0$ usually is not.
`skglm` follows a different route and directly solves:

\begin{align}
    \beta^\star, \beta_0^\star
    \in
    \underset{\beta \in \mathbb{R}^p, \beta_0 \in \mathbb{R}}{\text{argmin}}
    \Phi(\beta, \beta_0)
    \triangleq
    \underbrace{F(X\beta + \beta_0\boldsymbol{1}_{n})}_{\triangleq f(\beta, \beta_0)}
    + \sum_{j=1}^p g_j(\beta_j)
    \enspace ,
\end{align}

where $\boldsymbol{1}_{n}$ is the vector of size $n$ composed only of ones.
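
For illustration, here is what this looks like from the user side. This is a minimal sketch assuming `skglm`'s scikit-learn-compatible `Lasso` estimator and its `fit_intercept` flag; no column of ones is ever appended to $X$:

```python
import numpy as np
from skglm import Lasso  # scikit-learn-compatible estimator

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
y = X @ rng.standard_normal(10) + 1.5  # data generated with a true intercept

# The intercept is fitted as a separate, unpenalized variable.
model = Lasso(alpha=0.1, fit_intercept=True).fit(X, y)
print(model.coef_)       # beta, the penalized coefficients
print(model.intercept_)  # beta_0, fitted without penalty
```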

The solvers of `skglm` update the intercept after each update of $\beta$ by performing a one-dimensional gradient descent step:

\begin{align}
    \beta^{(k+1)}_0 = \beta^{(k)}_0 - \frac{1}{L_0}\nabla_{\beta_0}F(X\beta^{(k)} + \beta_0^{(k)}\boldsymbol{1}_{n})
    \enspace ,
\end{align}

where $L_0$ is the Lipschitz constant associated with the intercept.
The Lipschitz constant $L_0$ satisfies the following inequality:

$$
\forall (x, x_0) \in \mathbb{R}^p \times \mathbb{R}, \ \forall h \in \mathbb{R}, \quad |\nabla_{x_0} f(x, x_0 + h) - \nabla_{x_0} f(x, x_0)| \leq L_0 |h| \enspace .
$$

This update rule should be implemented in the `intercept_update_step` method of the datafit class.
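
As a rough sketch of what such a method computes (the name and signature below are illustrative, not `skglm`'s exact API), the step amounts to summing the gradient of $F$ over the samples and scaling by $1/L_0$:

```python
import numpy as np

def intercept_update_step(grad, L0):
    """One-dimensional gradient descent step for the intercept (Equation 2).

    grad : gradient of F w.r.t. its argument, evaluated at X @ beta + beta_0 * ones(n)
    L0   : Lipschitz constant associated with the intercept
    """
    # By the chain rule (see the identity below), the gradient w.r.t.
    # beta_0 is the sum of the coordinates of the gradient of F.
    return np.sum(grad) / L0  # the solver subtracts this from beta_0
```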

The convergence criterion for the intercept is then simply the absolute value of the gradient with respect to $\beta_0$, since the optimality condition for a solution $(\beta^\star, \beta_0^\star)$ is:

\begin{align}
    \nabla_{\beta_0}F(X\beta^\star + \beta_0^\star\boldsymbol{1}_{n}) = 0
    \enspace .
\end{align}

Moreover, applying the chain rule to $\beta_0 \mapsto F(X\beta + \beta_0\boldsymbol{1}_{n})$ yields

\begin{align}
    \nabla_{\beta_0}F(X\beta + \beta_0\boldsymbol{1}_{n}) = \boldsymbol{1}_{n}^\top \nabla F(X\beta + \beta_0\boldsymbol{1}_{n})
    \enspace .
\end{align}
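
This identity is easy to check numerically. The snippet below is a small sanity check, assuming NumPy and using the quadratic datafit of the next section as the smooth function $F$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 5
X, y = rng.standard_normal((n, p)), rng.standard_normal(n)
beta, beta0 = rng.standard_normal(p), 0.7

F = lambda z: np.sum((y - z) ** 2) / (2 * n)  # quadratic datafit
grad_F = lambda z: (z - y) / n                # its gradient w.r.t. z

z = X @ beta + beta0 * np.ones(n)
eps = 1e-6
fd = (F(X @ beta + (beta0 + eps) * np.ones(n)) - F(z)) / eps
print(fd, np.ones(n) @ grad_F(z))  # both values agree up to O(eps)
```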

We now derive the update of Equation 2 for three different datafit functions.

---

## The Quadratic datafit

We define

\begin{align}
    F(X\beta + \beta_0\boldsymbol{1}_{n}) = \frac{1}{2n} \lVert y - X\beta - \beta_0\boldsymbol{1}_{n} \rVert^2_2
    \enspace .
\end{align}

In this case $\nabla F(z) = \frac{1}{n}(z - y)$, hence Equation 4 becomes:

\begin{align}
    \nabla_{\beta_0}F(X\beta + \beta_0\boldsymbol{1}_{n}) = \frac{1}{n}\sum_{i=1}^n(X_{i:}\beta + \beta_0 - y_i)
    \enspace .
\end{align}

Finally, the Lipschitz constant is $L_0 = \frac{1}{n}\sum_{i=1}^n 1^2 = 1$.
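
Written as code, the resulting update is a one-liner. The helper below is an illustrative sketch assuming NumPy, not the `skglm` implementation:

```python
import numpy as np

def quadratic_intercept_step(X, y, beta, beta0):
    # Gradient w.r.t. beta_0 derived above: mean of X @ beta + beta_0 - y.
    grad_beta0 = np.mean(X @ beta + beta0 - y)
    L0 = 1.0  # Lipschitz constant derived above
    return beta0 - grad_beta0 / L0
```

Note that with $L_0 = 1$ a single step sets $\beta_0$ to $\frac{1}{n}\sum_{i=1}^n (y_i - X_{i:}\beta)$, which is exactly the minimizer of the quadratic datafit in $\beta_0$ for fixed $\beta$.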

---

## The Logistic datafit

In this case,

\begin{align}
    F(X\beta + \beta_0\boldsymbol{1}_{n}) = \frac{1}{n} \sum_{i=1}^n \log(1 + \exp(-y_i(X_{i:}\beta + \beta_0)))
    \enspace .
\end{align}

We can then write

\begin{align}
    \nabla_{\beta_0}F(X\beta + \beta_0\boldsymbol{1}_{n}) = \frac{1}{n} \sum_{i=1}^n \frac{-y_i}{1 + \exp(y_i(X_{i:}\beta + \beta_0))} \enspace .
\end{align}

Finally, since the second derivative of $t \mapsto \log(1 + \exp(-t))$ is bounded by $\frac{1}{4}$, the Lipschitz constant is $L_0 = \frac{1}{4n}\sum_{i=1}^n 1^2 = \frac{1}{4}$.
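
An illustrative sketch of this step, assuming NumPy and labels $y_i \in \{-1, 1\}$:

```python
import numpy as np

def logistic_intercept_step(X, y, beta, beta0):
    # Gradient w.r.t. beta_0 derived above, for labels y in {-1, 1}.
    z = X @ beta + beta0
    grad_beta0 = np.mean(-y / (1.0 + np.exp(y * z)))
    L0 = 0.25  # Lipschitz constant derived above
    return beta0 - grad_beta0 / L0
```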

---

## The Huber datafit

In this case,

\begin{align}
    F(X\beta + \beta_0\boldsymbol{1}_{n}) = \frac{1}{n} \sum_{i=1}^n f_{\delta}(y_i - X_{i:}\beta - \beta_0) \enspace ,
\end{align}

where

\begin{align}
    f_\delta(x) = \begin{cases}
        \frac{1}{2}x^2 & \text{if } |x| \leq \delta \\
        \delta |x| - \frac{1}{2}\delta^2 & \text{if } |x| > \delta
    \end{cases} \enspace .
\end{align}

Let $r_i = y_i - X_{i:}\beta - \beta_0$. We can then write

\begin{align}
    \nabla_{\beta_0}F(X\beta + \beta_0\boldsymbol{1}_{n}) = -\frac{1}{n} \sum_{i=1}^n \left( r_i\mathbb{1}_{\{|r_i|\leq\delta\}} + \text{sign}(r_i)\,\delta\,\mathbb{1}_{\{|r_i|>\delta\}} \right) \enspace ,
\end{align}

where $\mathbb{1}_{\{\cdot\}}$ is the indicator function, equal to $1$ when its condition holds and $0$ otherwise.

Finally, since $|f_\delta''| \leq 1$, the Lipschitz constant is $L_0 = \frac{1}{n}\sum_{i=1}^n 1^2 = 1$.
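
An illustrative sketch of this last step, assuming NumPy; note that `np.clip` implements both cases of the gradient at once:

```python
import numpy as np

def huber_intercept_step(X, y, beta, beta0, delta):
    # np.clip(r, -delta, delta) equals r where |r| <= delta
    # and sign(r) * delta elsewhere, matching the gradient derived above.
    r = y - X @ beta - beta0
    grad_beta0 = -np.mean(np.clip(r, -delta, delta))
    L0 = 1.0  # Lipschitz constant derived above
    return beta0 - grad_beta0 / L0
```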