Commit eee3223

borispf, mathurinm, and Badr-MOUFAD authored
DOC use math mode in all docstrings (#146)
Co-authored-by: mathurinm <[email protected]>
Co-authored-by: Badr-MOUFAD <[email protected]>
1 parent 07045ff commit eee3223

File tree: 17 files changed, +306 -238 lines changed

doc/_static/scripts/asciimath-defines.js

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+window.MathJax.startup = {
+    ready: () => {
+        AM = MathJax.InputJax.AsciiMath.AM;
+        AM.newsymbol({ input: "ell", tag: "mi", output: "\u2113", tex: "ell", ttype: AM.TOKEN.CONST });
+        AM.newsymbol({ input: "||", tag: "mi", output: "\u2225", tex: "Vert", ttype: AM.TOKEN.CONST });
+        AM.newsymbol({ input: "triangleq", tag: "mo", output: "\u225C", tex: "triangleq", ttype: AM.TOKEN.CONST });
+        MathJax.startup.defaultReady();
+    }
+};

doc/conf.py

Lines changed: 16 additions & 0 deletions
@@ -179,6 +179,22 @@
     ],
 }
 
+# Enable asciimath parsing in MathJax and configure the HTML renderer to output
+# the default asciimath delimiters. Asciimath will not be correctly rendered in
+# other output formats, but can likely be fixed using py-asciimath[1] to convert
+# to Latex.
+# [1]: https://pypi.org/project/py-asciimath/
+mathjax3_config = {
+    "loader": {
+        "load": ['input/asciimath']
+    },
+}
+mathjax_inline = ['`', '`']
+mathjax_display = ['`', '`']
+
+html_static_path = ['_static']
+html_js_files = ["scripts/asciimath-defines.js"]
+
 # -- Options for copybutton ---------------------------------------------
 # complete explanation of the regex expression can be found here
 # https://sphinx-copybutton.readthedocs.io/en/latest/use.html#using-regexp-prompt-identifiers
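
With these delimiters, backtick-delimited text in docstrings is parsed as AsciiMath and rendered by MathJax in the HTML build, and the extra symbols registered in asciimath-defines.js (ell, ||, triangleq) become available. A minimal, hypothetical docstring sketch, not copied from skglm's source:

class L1Sketch:
    r"""Illustrative penalty docstring using the AsciiMath delimiters above.

    The `ell_1` penalty reads `lambda ||beta||_1 = lambda sum_(j=1)^p |beta_j|`,
    where `lambda in RR^+` controls the regularization strength.
    """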

doc/doc-requirements.txt

Lines changed: 1 addition & 1 deletion
@@ -8,4 +8,4 @@ sphinx-bootstrap-theme
 sphinx_copybutton
 sphinx-gallery
 pytest
-furo
+furo

doc/index.rst

Lines changed: 6 additions & 6 deletions
@@ -3,16 +3,16 @@
 You can adapt this file completely to your liking, but it should at least
 contain the root `toctree` directive.
 
-========
+=========
 ``skglm``
-========
+=========
 *— A fast and modular scikit-learn replacement for sparse GLMs —*
 
 --------
 
 
-``skglm`` is a Python package that offers **fast estimators** for sparse Generalized Linear Models (GLMs)
-that are **100% compatible with** ``scikit-learn``. It is **highly flexible** and supports a wide range of GLMs.
+``skglm`` is a Python package that offers **fast estimators** for sparse Generalized Linear Models (GLMs)
+that are **100% compatible with** ``scikit-learn``. It is **highly flexible** and supports a wide range of GLMs.
 You get to choose from ``skglm``'s already-made estimators or **customize your own** by combining the available datafits and penalties.
 
 Get a hands-on glimpse on ``skglm`` through the :ref:`Getting started page <getting_started>`.
@@ -40,7 +40,7 @@ There are several reasons to opt for ``skglm`` among which:
 
 
 Installing ``skglm``
--------------------
+--------------------
 
 ``skglm`` is available on PyPi. Get the latest version of the package by running
 
@@ -59,7 +59,7 @@ Other advanced topics and uses-cases are covered in :ref:`Tutorials <tutorials>`
 
 .. note::
 
-    Currently, ``skglm`` is unavailable on Conda but will be released very soon...
+    Currently, ``skglm`` is unavailable on Conda but will be released very soon...
 
 
 Cite

doc/tutorials/add_datafit.rst

Lines changed: 14 additions & 14 deletions
@@ -12,7 +12,7 @@ Motivated by generalized linear models but not limited to it, ``skglm`` solves p
     \arg\min_{\beta \in \mathbb{R}^p}
     F(X\beta) + \Omega(\beta)
     := \sum_{i=1}^n f_i([X\beta]_i) + \sum_{j=1}^p \Omega_j(\beta_j)
-    \enspace .
+    \ .
 
 
 Here, :math:`X \in \mathbb{R}^{n \times p}` denotes the design matrix with :math:`n` samples and :math:`p` features,
@@ -49,50 +49,50 @@ First, this requires deriving some quantities used by the solvers like the gradi
 With :math:`y \in \mathbb{R}^n` the target vector, the Poisson datafit reads
 
 .. math::
-    f(X\beta) = \frac{1}{n}\sum_{i=1}^n \exp([X\beta]_i) - y_i[X\beta]_i
-    \enspace .
+    f(X\beta) = \frac{1}{n}\sum_{i=1}^n \exp([X\beta]_i) - y_i[X\beta]_i
+    \ .
 
 
 Let's define some useful quantities to simplify our computations. For :math:`z \in \mathbb{R}^n` and :math:`\beta \in \mathbb{R}^p`,
 
 .. math::
     f(z) = \sum_{i=1}^n f_i(z_i) \qquad F(\beta) = f(X\beta)
-    \enspace .
+    \ .
 
 
 Computing the gradient of :math:`F` and its Hessian matrix yields
 
 .. math::
-    \nabla F(\beta) = X^{\top} \underbrace{\nabla f(X\beta)}_\textrm{raw grad} \qquad \nabla^2 F(\beta) = X^{\top} \underbrace{\nabla^2 f(X\beta)}_\textrm{raw hessian} X
-    \enspace .
+    \nabla F(\beta) = X^{\top} \underbrace{\nabla f(X\beta)}_"raw grad" \qquad \nabla^2 F(\beta) = X^{\top} \underbrace{\nabla^2 f(X\beta)}_"raw hessian" X
+    \ .
 
 
 Besides, it directly follows that
 
 .. math::
-    \nabla f(z) = (f_i'(z_i))_{1 \leq i \leq n} \qquad \nabla^2 f(z) = \textrm{diag}(f_i''(z_i))_{1 \leq i \leq n}
-    \enspace .
+    \nabla f(z) = (f_i^'(z_i))_{1 \leq i \leq n} \qquad \nabla^2 f(z) = "diag"(f_i^('')(z_i))_{1 \leq i \leq n}
+    \ .
 
 
 We can now apply these definitions to the Poisson datafit:
 
 .. math::
     f_i(z_i) = \frac{1}{n} \left(\exp(z_i) - y_iz_i\right)
-    \enspace .
+    \ .
 
 
 Therefore,
 
 .. math::
-    f_i'(z_i) = \frac{1}{n}(\exp(z_i) - y_i) \qquad f_i''(z_i) = \frac{1}{n}\exp(z_i)
-    \enspace .
+    f_i^'(z_i) = \frac{1}{n}(\exp(z_i) - y_i) \qquad f_i^('')(z_i) = \frac{1}{n}\exp(z_i)
+    \ .
 
 
 Computing ``raw_grad`` and ``raw_hessian`` for the Poisson datafit yields
 
 .. math::
-    \nabla f(X\beta) = \frac{1}{n}(\exp([X\beta]_i) - y_i)_{1 \leq i \leq n} \qquad \nabla^2 f(X\beta) = \frac{1}{n}\textrm{diag}(\exp([X\beta]_i))_{1 \leq i \leq n}
-    \enspace .
+    \nabla f(X\beta) = \frac{1}{n}(\exp([X\beta]_i) - y_i)_{1 \leq i \leq n} \qquad \nabla^2 f(X\beta) = \frac{1}{n}"diag"(\exp([X\beta]_i))_{1 \leq i \leq n}
+    \ .
 
 
 Both ``raw_grad`` and ``raw_hessian`` are methods used by the ``ProxNewton`` solver.
@@ -106,7 +106,7 @@ For the Poisson datafit, this yields
     \sum_{i=1}^n X_{i,j} \left(
     \exp([X\beta]_i) - y
     \right)
-    \enspace .
+    \ .
 
 
 When implementing these quantities in the ``Poisson`` datafit class, this gives:
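
As a rough sketch of how this derivation maps to code, assuming ``y`` and the linear predictions ``Xw`` are NumPy arrays (the method names follow the tutorial, but the exact signatures in skglm may differ):

import numpy as np

class PoissonSketch:
    """Illustrative datafit exposing the two quantities derived above."""

    def raw_grad(self, y, Xw):
        # nabla f(X beta): (exp([X beta]_i) - y_i) / n for each sample i
        return (np.exp(Xw) - y) / len(y)

    def raw_hessian(self, y, Xw):
        # diagonal of nabla^2 f(X beta): exp([X beta]_i) / n
        return np.exp(Xw) / len(y)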

doc/tutorials/add_penalty.rst

Lines changed: 8 additions & 9 deletions
@@ -22,20 +22,20 @@ We detail how the :math:`\ell_1` penalty is implemented in skglm.
 For a vector :math:`\beta \in \mathbb{R}^p`, the :math:`\ell_1` penalty is defined as follows:
 
 .. math::
-    \lvert\lvert \beta \rvert\rvert_1 = \sum_{i=1}^p |\beta _i| \enspace .
+    || \beta ||_1 = \sum_{i=1}^p |\beta _i| \ .
 
 
-The regularization level is controlled by the hyperparameter :math:`\lambda \in \mathbb{R}^+`, that is defined and initialized in the constructor of the class.
+The regularization level is controlled by the hyperparameter :math:`\lambda \in bb(R)^+`, that is defined and initialized in the constructor of the class.
 
 The method ``get_spec`` allows to strongly type the attributes of the penalty object, thus allowing Numba to JIT-compile the class.
-It should return an iterable of tuples, the first element being the name of the attribute, the second its Numba type (e.g. `float64`, `bool_`).
+It should return an iterable of tuples, the first element being the name of the attribute, the second its Numba type (e.g. ``float64``, ``bool_``).
 Additionally, a penalty should implement ``params_to_dict``, a helper method to get all the parameters of a penalty returned in a dictionary.
 
 To optimize an objective with a given penalty, skglm needs at least the proximal operator of the penalty applied to the :math:`j`-th coordinate.
 For the ``L1`` penalty, it is the well-known soft-thresholding operator:
 
 .. math::
-    \textrm{ST}(\beta , \lambda) = \mathrm{max}(0, \lvert \beta \rvert - \lambda) \mathrm{sgn}(\beta) \enspace .
+    "ST"(\beta , \lambda) = "max"(0, |\beta| - \lambda) "sgn"(\beta)\ .
 
 
 Note that skglm expects the threshold level to be the regularization hyperparameter :math:`\lambda \in \mathbb{R}^+` **scaled by** the stepsize.
@@ -48,11 +48,10 @@ If not implemented, the user should set ``ws_strategy`` to ``fixpoint``.
 For the :math:`\ell_1` penalty, the distance of the negative gradient of the datafit :math:`F` to the subdifferential of the penalty reads
 
 .. math::
-    \mathrm{dist}(-\nabla_j F(\beta), \partial |\beta_j|) = \begin{cases}
-    \mathrm{max}(0, \lvert -\nabla_j F(\beta) \rvert - \lambda) \\
-    \lvert -\nabla_j F(\beta) - \lambda \mathrm{sgn}(\beta_j) \lvert \\
-    \end{cases}
-    \enspace .
+    "dist"(-\nabla_j F(\beta), \partial |\beta_j|) =
+    {("max"(0, | -\nabla_j F(\beta) | - \lambda),),
+    (| -\nabla_j F(\beta) - \lambda "sgn"(\beta_j) |,):}
+    \ .
 
 
 The method ``is_penalized`` returns a binary mask with the penalized features.
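
A minimal NumPy sketch of the two coordinate-wise operations above, soft-thresholding and the subdifferential distance, with illustrative names only (skglm's actual ``L1`` class additionally handles working sets, stepsizes and Numba typing):

import numpy as np

def soft_threshold(beta_j, threshold):
    # ST(beta_j, threshold) = max(0, |beta_j| - threshold) * sgn(beta_j),
    # where threshold is lambda scaled by the stepsize, as explained above
    return np.sign(beta_j) * max(0.0, abs(beta_j) - threshold)

def l1_subdiff_distance(neg_grad_j, beta_j, lmbda):
    # distance of -grad_j F(beta) to the subdifferential of lambda * |beta_j|
    if beta_j == 0.0:
        return max(0.0, abs(neg_grad_j) - lmbda)
    return abs(neg_grad_j - lmbda * np.sign(beta_j))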

doc/tutorials/intercept.md

Lines changed: 62 additions & 62 deletions
@@ -1,54 +1,54 @@
-This note gives insights and guidance for the handling of an intercept coefficient within the $\texttt{skglm}$ solvers.
+This note gives insights and guidance for the handling of an intercept coefficient within the `skglm` solvers.
 
-Let the design matrix be $X\in \mathbb{R}^{n\times p}$ where $n$ is the number of samples and $p$ the number of features.
-We denote $\beta\in\mathbb{R}^p$ the coefficients of the Generalized Linear Model and $\beta_0$ its intercept.
-In many packages such as `liblinear`, the intercept is handled by adding an extra column of ones in the design matrix. This is costly in memory, and may lead to different solutions if all coefficients are penalized, as the intercept $\beta_0$ is usually not.
+Let the design matrix be $X in RR^{n times p}$ where $n$ is the number of samples and $p$ the number of features.
+We denote $beta in RR^p$ the coefficients of the Generalized Linear Model and $beta_0$ its intercept.
+In many packages such as `liblinear`, the intercept is handled by adding an extra column of ones in the design matrix. This is costly in memory, and may lead to different solutions if all coefficients are penalized, as the intercept $beta_0$ is usually not.
 `skglm` follows a different route and solves directly:
 
-\begin{align}
-    \beta^\star, \beta_0^\star
-    \in
-    \underset{\beta \in \mathbb{R}^p, \beta_0 \in \mathbb{R}}{\text{argmin}}
-    \Phi(\beta)
-    \triangleq
-    \underbrace{F(X\beta + \beta_0\boldsymbol{1}_{n})}_{\triangleq f(\beta, \beta_0)}
-    + \sum_{j=1}^p g_j(\beta_j)
-    \enspace ,
-\end{align}
+```{math}
+    beta^star, beta_0^star
+    in
+    underset(beta in RR^p, beta_0 in RR)("argmin")
+    Phi(beta)
+    triangleq
+    underbrace(F(X beta + beta_0 bb"1"_n))_(triangleq f(beta, beta_0))
+    + sum_(j=1)^p g_j(beta_j)
+    \ ,
+```
 
 
-where $\boldsymbol{1}_{n}$ is the vector of size $n$ composed only of ones.
+where $bb"1"_{n}$ is the vector of size $n$ composed only of ones.
 
 
-The solvers of `skglm` update the intercept after each update of $\beta$ by doing a (1 dimensional) gradient descent update:
+The solvers of `skglm` update the intercept after each update of $beta$ by doing a (1 dimensional) gradient descent update:
 
-\begin{align}
-    \beta^{(k+1)}_0 = \beta^{(k)}_0 - \frac{1}{L_0}\nabla_{\beta_0}F(X\beta^{(k)} + \beta_0^{(k)}\boldsymbol{1}_{n})
-    \enspace ,
-\end{align}
+```{math}
+    beta_0^((k+1)) = beta_0^((k)) - 1/(L_0) nabla_(beta_0)F(X beta^((k)) + beta_0^((k)) bb"1"_{n})
+    \ ,
+```
 
 where $L_0$ is the Lipschitz constant associated to the intercept.
 The local Lipschitz constant $L_0$ statisfies the following inequality
 
 $$
-\forall x, x_0\in \mathbb{R}^p \times \mathbb{R}, \forall h \in \mathbb{R}, |\nabla_{x_0} f(x, x_0 + h) - \nabla_{x_0} f(x, x_0)| \leq L_0 |h| \enspace .
+\forall x, x_0 in RR^p times RR, \forall h in RR, |nabla_(x_0) f(x, x_0 + h) - nabla_(x_0) f(x, x_0)| <= L_0 |h| \ .
 $$
 
 This update rule should be implemented in the `intercept_update_step` method of the datafit class.
 
-The convergence criterion computed for the gradient is then only the absolute value of the gradient with respect to $\beta_0$ since the intercept optimality condition, for a solution $\beta^\star$, $\beta_0^\star$ is:
+The convergence criterion computed for the gradient is then only the absolute value of the gradient with respect to $beta_0$ since the intercept optimality condition, for a solution $beta^star$, $beta_0^star$ is:
 
-\begin{align}
-    \nabla_{\beta_0}F(X\beta^\star + \beta_0^\star\boldsymbol{1}_{n}) = 0
-    \enspace ,
-\end{align}
+```{math}
+    nabla_(beta_0)F(X beta^star + beta_0^star bb"1"_n) = 0
+    \ ,
+```
 
 Moreover, we have that
 
-\begin{align}
-    \nabla_{\beta_0}F(X\beta + \beta_0\boldsymbol{1}_{n}) = \boldsymbol{1}_{n}^\top \nabla_\beta F(X\beta + \beta_0\boldsymbol{1}_{n})
-    \enspace .
-\end{align}
+```{math}
+    nabla_(beta_0) F(X beta + beta_0 bb"1"_n) = bb"1"_n^\top nabla_beta F(X beta + beta_0 bb"1"_n)
+    \ .
+```
 
 
 We will now derive the update used in Equation 2 for three different datafitting functions.
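
A minimal sketch of this one-dimensional step, combining Eq. 2 with the identity of Eq. 4; ``grad_f`` stands for the datafit gradient nabla f, and all names are illustrative rather than skglm's API:

import numpy as np

def intercept_gradient_step(grad_f, y, X, w, intercept, lipschitz_0):
    # One gradient descent step on beta_0 (Eq. 2); by Eq. 4 the gradient
    # w.r.t. beta_0 is 1_n^T grad_f(X beta + beta_0 1_n).
    Xw = X @ w + intercept
    return intercept - np.sum(grad_f(y, Xw)) / lipschitz_0
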
@@ -59,19 +59,19 @@ We will now derive the update used in Equation 2 for three different datafitting
 
 We define
 
-\begin{align}
-    F(X\beta + \beta_0\boldsymbol{1}_{n}) = \frac{1}{2n} \lVert y - X\beta - \beta_0\boldsymbol{1}_{n} \rVert^2_2
-    \enspace .
-\end{align}
+```{math}
+    F(X beta + beta_0 bb"1"_n) = 1/(2n) norm(y - X beta - beta_0 bb"1"_{n})_2^2
+    \ .
+```
 
-In this case $\nabla f(z) = \frac{1}{n}(z - y)$ hence Eq. 4 is equal to:
+In this case $nabla f(z) = 1/n (z - y)$ hence Eq. 4 is equal to:
 
-\begin{align}
-    \nabla_{\beta_0}F(X\beta + \beta_0\boldsymbol{1}_{n}) = \frac{1}{n}\sum_{i=1}^n(X_{i:}\beta + \beta_0 - y_i)
-    \enspace .
-\end{align}
+```{math}
+    nabla_(beta_0) F(X beta + beta_0 bb"1"_n) = 1/n sum_(i=1)^n (X_( i: ) beta + beta_0 - y_i)
+    \ .
+```
 
-Finally, the Lipschitz constant is $L_0 = \frac{1}{n}\sum_{i=1}^n 1^2 = 1$.
+Finally, the Lipschitz constant is $L_0 = 1/n sum_(i=1)^n 1^2 = 1$.
 
 
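
For the quadratic datafit this specializes to the sketch below (illustrative names, not skglm's actual ``intercept_update_step``), using nabla f(z) = (z - y)/n and L_0 = 1:

import numpy as np

def quadratic_intercept_step(y, X, w, intercept):
    # gradient w.r.t. beta_0: (1/n) sum_i (X_i: w + beta_0 - y_i); L_0 = 1
    grad_0 = np.mean(X @ w + intercept - y)
    return intercept - grad_0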

@@ -81,46 +81,46 @@ Finally, the Lipschitz constant is $L_0 = \frac{1}{n}\sum_{i=1}^n 1^2 = 1$.
 
 In this case,
 
-\begin{align}
-    F(X\beta + \beta_0\boldsymbol{1}_{n}) = \frac{1}{n} \sum_{i=1}^n \log(1 + \exp(-y_i(X_{i:}\beta + \beta_0\boldsymbol{1}_n))
-\end{align}
+```{math}
+    F(X beta + beta_0 bb"1"_{n}) = 1/n sum_(i=1)^n log(1 + exp(-y_i(X_( i: ) beta + beta_0 bb"1"_n))
+```
 
 
 We can then write
 
-\begin{align}
-    \nabla_{\beta_0}F(X\beta + \beta_0\boldsymbol{1}_{n}) = \frac{1}{n} \sum_{i=1}^n \frac{-y_i}{1 + \exp(- y_i(X_{i:}\beta + \beta_0\boldsymbol{1}_n))} \enspace .
-\end{align}
+```{math}
+    nabla_(beta_0) F(X beta + beta_0 bb"1"_n) = 1/n sum_(i=1)^n (-y_i)/(1 + exp(-y_i(X_( i: ) beta + beta_0 bb"1"_n))) \ .
+```
 
 
-Finally, the Lipschitz constant is $L_0 = \frac{1}{4n}\sum_{i=1}^n 1^2 = \frac{1}{4}$.
+Finally, the Lipschitz constant is $L_0 = 1/(4n) sum_(i=1)^n 1^2 = 1/4$.
 
 ---
 
 ## The Huber datafit
 
 In this case,
 
-\begin{align}
-    F(X\beta + \beta_0\boldsymbol{1}_{n}) = \frac{1}{n} \sum_{i=1}^n f_{\delta}(y_i - X_{i:}\beta - \beta_0\boldsymbol{1}_n)) \enspace ,
-\end{align}
+```{math}
+    F(X beta + beta_0 bb"1"_{n}) = 1/n sum_(i=1)^n f_(delta) (y_i - X_( i: ) beta - beta_0 bb"1"_n) \ ,
+```
 
 where
 
-\begin{align}
-    f_\delta(x) = \begin{cases}
-        \frac{1}{2}x^2 & \text{if } x \leq \delta \\
-        \delta |x| - \frac{1}{2}\delta^2 & \text{if } x > \delta
-    \end{cases} \enspace .
-\end{align}
+```{math}
+    f_delta(x) = {
+        (1/2 x^2, if x <= delta),
+        (delta |x| - 1/2 delta^2, if x > delta)
+    :} \ .
+```
 
 
-Let $r_i = y_i - X_{i:}\beta - \beta_0\boldsymbol{1}_n$. We can then write
+Let $r_i = y_i - X_( i: ) beta - beta_0 bb"1"_n$. We can then write
 
-\begin{align}
-    \nabla_{\beta_0}F(X\beta + \beta_0\boldsymbol{1}_{n}) = \frac{1}{n} \sum_{i=1}^n r_i\mathbb{1}_{\{|r_i|\leq\delta\}} + \text{sign}(r_i)\delta\mathbb{1}_{\{|r_i|>\delta\}} \enspace ,
-\end{align}
+```{math}
+    nabla_(beta_0) F(X beta + beta_0 bb"1"_{n}) = 1/n sum_(i=1)^n r_i bbb"1"_({|r_i| <= delta}) + "sign"(r_i) delta bbb"1"_({|r_i| > delta}) \ ,
+```
 
-where $1_{x > \delta}$ is the classical indicator function.
+where $bbb"1"_({x > delta})$ is the classical indicator function.
 
-Finally, the Lipschitz constant is $L_0 = \frac{1}{n}\sum_{i=1}^n 1^2 = 1$.
+Finally, the Lipschitz constant is $L_0 = 1/n sum_(i=1)^n 1^2 = 1$.
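
Analogous sketches for the logistic and Huber datafits (illustrative only, with labels y in {-1, 1} for the logistic case), using the Lipschitz constants L_0 = 1/4 and L_0 = 1 derived above:

import numpy as np

def logistic_intercept_step(y, X, w, intercept):
    # gradient of the logistic datafit w.r.t. beta_0; L_0 = 1/4
    Xw = X @ w + intercept
    grad_0 = np.mean(-y / (1.0 + np.exp(y * Xw)))
    return intercept - grad_0 / 0.25

def huber_intercept_step(y, X, w, intercept, delta):
    # gradient of the Huber datafit w.r.t. beta_0: mean of the residuals
    # clipped at +/- delta (up to sign); L_0 = 1
    r = y - X @ w - intercept
    grad_0 = -np.mean(np.clip(r, -delta, delta))
    return intercept - grad_0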
