@@ -233,24 +233,23 @@ Cross-Validation.
 Lasso
 =====
 
-The :class:`Lasso` is a linear model that estimates sparse coefficients.
+The :class:`Lasso` is a linear model that estimates sparse coefficients, i.e., it is
+able to set coefficients exactly to zero.
 It is useful in some contexts due to its tendency to prefer solutions
 with fewer non-zero coefficients, effectively reducing the number of
 features upon which the given solution is dependent. For this reason,
 Lasso and its variants are fundamental to the field of compressed sensing.
-Under certain conditions, it can recover the exact set of non-zero
-coefficients (see
+Under certain conditions, it can recover the exact set of non-zero coefficients (see
 :ref:`sphx_glr_auto_examples_applications_plot_tomography_l1_reconstruction.py`).
 
 Mathematically, it consists of a linear model with an added regularization term.
 The objective function to minimize is:
 
-.. math:: \min_{w} \frac{1}{2n_{\text{samples}}} ||X w - y||_2^2 + \alpha ||w||_1
+.. math:: \min_{w} P(w) = \frac{1}{2n_{\text{samples}}} ||X w - y||_2^2 + \alpha ||w||_1
 
-The lasso estimate thus solves the minimization of the
-least-squares penalty with :math:`\alpha ||w||_1` added, where
-:math:`\alpha` is a constant and :math:`||w||_1` is the :math:`\ell_1`-norm of
-the coefficient vector.
+The lasso estimate thus solves the least-squares problem with the added penalty
+:math:`\alpha ||w||_1`, where :math:`\alpha` is a constant and :math:`||w||_1` is the
+:math:`\ell_1`-norm of the coefficient vector.
 
 The implementation in the class :class:`Lasso` uses coordinate descent as
 the algorithm to fit the coefficients. See :ref:`least_angle_regression`
@@ -281,18 +280,86 @@ computes the coefficients along the full path of possible values.
 
 .. dropdown:: References
 
-    The following two references explain the iterations
-    used in the coordinate descent solver of scikit-learn, as well as
-    the duality gap computation used for convergence control.
+    The following references explain the origin of the Lasso as well as properties
+    of the Lasso problem and the duality gap computation used for convergence control.
 
-    * "Regularization Path For Generalized linear Models by Coordinate Descent",
-      Friedman, Hastie & Tibshirani, J Stat Softw, 2010 (`Paper
-      <https://www.jstatsoft.org/article/view/v033i01/v33i01.pdf>`__).
+    * :doi:`Robert Tibshirani. (1996). Regression Shrinkage and Selection Via the Lasso.
+      J. R. Stat. Soc. Ser. B Stat. Methodol., 58(1):267-288
+      <10.1111/j.2517-6161.1996.tb02080.x>`
     * "An Interior-Point Method for Large-Scale L1-Regularized Least Squares,"
       S. J. Kim, K. Koh, M. Lustig, S. Boyd and D. Gorinevsky,
       in IEEE Journal of Selected Topics in Signal Processing, 2007
       (`Paper <https://web.stanford.edu/~boyd/papers/pdf/l1_ls.pdf>`__)
 
+.. _coordinate_descent:
+
+Coordinate Descent with Gap Safe Screening Rules
+------------------------------------------------
+
+Coordinate descent (CD) is a strategy to solve a minimization problem by considering a
+single feature :math:`j` at a time. This way, the optimization problem is reduced to a
+1-dimensional problem which is easier to solve:
+
+.. math:: \min_{w_j} \frac{1}{2n_{\text{samples}}} ||x_j w_j + X_{-j}w_{-j} - y||_2^2 + \alpha |w_j|
+
+with the index :math:`-j` meaning all features but :math:`j`. The solution is
+
+.. math:: w_j = \frac{S(x_j^T (y - X_{-j}w_{-j}), n_{\text{samples}}\alpha)}{||x_j||_2^2}
+
+with the soft-thresholding function
+:math:`S(z, \alpha) = \operatorname{sign}(z) \max(0, |z|-\alpha)`.
+Note that the soft-thresholding function is exactly zero whenever
+:math:`\alpha \geq |z|`.
+The CD solver then loops over the features, either in a cycle, picking one feature
+after the other in the order given by `X` (`selection="cyclic"`), or by randomly
+picking features (`selection="random"`).
+It stops if the duality gap is smaller than the provided tolerance `tol`.
+
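The update above can be sketched in plain NumPy as follows (illustration only: the function names are made up for this sketch, the solver runs a fixed number of cycles instead of checking the duality gap, and scikit-learn's actual solver is optimized Cython code):

```python
# Minimal NumPy sketch of cyclic coordinate descent for the Lasso
# objective  min_w ||Xw - y||_2^2 / (2 n_samples) + alpha * ||w||_1.
import numpy as np

def soft_threshold(z, t):
    # S(z, t) = sign(z) * max(0, |z| - t)
    return np.sign(z) * np.maximum(0.0, np.abs(z) - t)

def lasso_cd(X, y, alpha, n_iter=100):
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    residual = y.astype(float).copy()   # r = y - Xw, here with w = 0
    sq_norms = (X ** 2).sum(axis=0)     # ||x_j||_2^2
    for _ in range(n_iter):
        for j in range(n_features):     # selection="cyclic"
            # x_j^T (y - X_{-j} w_{-j}) = x_j^T r + ||x_j||^2 * w_j
            rho = X[:, j] @ residual + sq_norms[j] * w[j]
            # the threshold n_samples * alpha comes from the
            # 1 / (2 n_samples) factor in the least-squares term
            w_new = soft_threshold(rho, n_samples * alpha) / sq_norms[j]
            residual += X[:, j] * (w[j] - w_new)  # keep r up to date
            w[j] = w_new
    return w
```

Keeping the residual `r` updated incrementally, instead of recomputing `y - Xw` per feature, is what makes each coordinate update cheap.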
+.. dropdown:: Mathematical details
+
+    The duality gap :math:`G(w, v)` is an upper bound on the difference between the
+    current primal objective function of the Lasso, :math:`P(w)`, and its minimum
+    :math:`P(w^\star)`, i.e. :math:`G(w, v) \geq P(w) - P(w^\star)`. It is given by
+    :math:`G(w, v) = P(w) - D(v)` with the dual objective function
+
+    .. math:: D(v) = \frac{1}{2n_{\text{samples}}}(2y^Tv - ||v||_2^2)
+
+    subject to :math:`||X^Tv||_{\infty} \leq n_{\text{samples}}\alpha`.
+    With the (scaled) dual variable :math:`v = cr`, the current residual
+    :math:`r = y - Xw` and the dual scaling
+
+    .. math::
+        c = \begin{cases}
+            1, & ||X^Tr||_{\infty} \leq n_{\text{samples}}\alpha, \\
+            \frac{n_{\text{samples}}\alpha}{||X^Tr||_{\infty}}, & \text{otherwise}
+        \end{cases}
+
+    the stopping criterion is
+
+    .. math:: G(w, cr) \leq \text{tol} \cdot \frac{||y||_2^2}{n_{\text{samples}}}\,.
+
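In code, the gap and the stopping check could look like this (a NumPy sketch following the formulas above, with the dual objective written as :math:`(2y^Tv - ||v||_2^2)/(2n_{\text{samples}})`; the function names are made up here and this is not the library's internal routine):

```python
# NumPy sketch of the Lasso duality gap G(w, c*r) used for convergence
# control; scikit-learn computes this internally in its solver.
import numpy as np

def lasso_duality_gap(X, y, w, alpha):
    n_samples = X.shape[0]
    r = y - X @ w                                      # residual
    primal = (r @ r) / (2 * n_samples) + alpha * np.abs(w).sum()
    # dual scaling c makes v = c * r dual feasible,
    # i.e. ||X^T v||_inf <= n_samples * alpha
    xtr_inf = np.max(np.abs(X.T @ r))
    c = min(1.0, n_samples * alpha / xtr_inf) if xtr_inf > 0 else 1.0
    v = c * r
    dual = (2 * y @ v - v @ v) / (2 * n_samples)       # D(v)
    return primal - dual                               # >= P(w) - P(w*)

def has_converged(X, y, w, alpha, tol=1e-4):
    n_samples = X.shape[0]
    return lasso_duality_gap(X, y, w, alpha) <= tol * (y @ y) / n_samples
```

By weak duality the returned gap is always non-negative, so it can be checked at any iterate without knowing the true minimum.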
+A clever method to speed up the coordinate descent algorithm is to screen features,
+i.e., to detect features whose coefficient is zero at the optimum, :math:`w_j = 0`.
+Gap safe screening rules are such a tool: at any point during the optimization, they
+can tell which features can be safely excluded, i.e., set to zero with certainty.
+
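A sphere-based variant of such a test can be sketched as follows. Under the formulation above, the dual optimum lies within distance :math:`\sqrt{2 n_{\text{samples}} G(w, v)}` of any feasible :math:`v` (by strong concavity of the dual objective), so any feature whose correlation with every dual point in that sphere stays strictly below :math:`n_{\text{samples}}\alpha` must satisfy :math:`w_j = 0` at the optimum. This is only a sketch with made-up names; see the references below for the precise rules:

```python
# Sketch of a gap safe sphere test for the Lasso; the radius follows from
# the strong concavity of the dual objective in this formulation.
import numpy as np

def gap_safe_screen(X, y, w, alpha):
    """Boolean mask: True where w_j = 0 at the optimum is certain."""
    n_samples = X.shape[0]
    r = y - X @ w
    xtr_inf = np.max(np.abs(X.T @ r))
    c = min(1.0, n_samples * alpha / xtr_inf) if xtr_inf > 0 else 1.0
    v = c * r                                      # feasible dual point
    primal = (r @ r) / (2 * n_samples) + alpha * np.abs(w).sum()
    dual = (2 * y @ v - v @ v) / (2 * n_samples)
    gap = max(primal - dual, 0.0)
    radius = np.sqrt(2 * n_samples * gap)          # dual optimum within this distance of v
    col_norms = np.sqrt((X ** 2).sum(axis=0))
    # feature j is safe to discard if even the most favorable dual point
    # in the sphere cannot reach the constraint boundary n_samples * alpha
    return np.abs(X.T @ v) + radius * col_norms < n_samples * alpha
```

The smaller the duality gap, the smaller the sphere, so more features can be discarded as the solver converges.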
+.. dropdown:: References
+
+    The first reference explains the coordinate descent solver used in scikit-learn;
+    the others treat gap safe screening rules.
+
+    * :doi:`Friedman, Hastie & Tibshirani. (2010).
+      Regularization Paths for Generalized Linear Models via Coordinate Descent.
+      J Stat Softw 33(1), 1-22 <10.18637/jss.v033.i01>`
+    * :arxiv:`O. Fercoq, A. Gramfort, J. Salmon. (2015).
+      Mind the duality gap: safer rules for the Lasso.
+      Proceedings of Machine Learning Research 37:333-342
+      <1505.03410>`
+    * :arxiv:`E. Ndiaye, O. Fercoq, A. Gramfort, J. Salmon. (2017).
+      Gap Safe Screening Rules for Sparsity Enforcing Penalties.
+      Journal of Machine Learning Research 18(128):1-33
+      <1611.05780>`
+
 Setting regularization parameter
 --------------------------------
 