.. _getting_started:

===============
Getting started
===============

This page provides a starter example to get familiar with ``skglm`` and explore some of its features.

In the first section, we fit a Lasso estimator on a high-dimensional
toy dataset (the number of features is much larger than the number of samples). In this regime, plain linear models don't generalize well
to unseen data. By adding an :math:`\ell_1` penalty, we can train an estimator that overcomes this drawback.

In the last section, we explore other combinations of datafit and penalty to create a custom estimator that achieves a lower prediction error,
namely :math:`\ell_1` Huber regression. We show that ``skglm`` is perfectly suited to these experiments thanks to its modular design.

Beforehand, make sure that you have ``skglm`` installed:

.. code-block:: shell

    # using pip
    pip install -U skglm

    # using conda
    conda install -c conda-forge skglm


Fitting a Lasso estimator
-------------------------

Let's start by generating a toy dataset and splitting it into train and test sets.
For that, we will use ``scikit-learn``'s
`make_regression <https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_regression.html#sklearn.datasets.make_regression>`_

.. code-block:: python

    # imports
    from sklearn.datasets import make_regression
    from sklearn.model_selection import train_test_split

    # generate a toy dataset
    X, y = make_regression(n_samples=100, n_features=1000)

    # split data into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y)

Then let's fit the ``skglm`` :ref:`Lasso <skglm.Lasso>` estimator and print its score on the test set.

.. code-block:: python

    # import estimator
    from skglm import Lasso

    # init and fit
    estimator = Lasso()
    estimator.fit(X_train, y_train)

    # compute R² on the test set
    estimator.score(X_test, y_test)


.. note::

    - The first fit after importing ``skglm`` has an overhead, as ``skglm`` uses `Numba <https://numba.pydata.org/>`_.
      Subsequent fits achieve top speed since the Numba compilation is cached.

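You can observe this warm-up effect by timing two successive fits. A minimal sketch (the exact timings depend on your machine):

.. code-block:: python

    import time

    from skglm import Lasso

    # first fit triggers Numba compilation
    start = time.perf_counter()
    Lasso().fit(X_train, y_train)
    print(f"first fit:  {time.perf_counter() - start:.2f} s")

    # second fit reuses the cached compilation and runs at full speed
    start = time.perf_counter()
    Lasso().fit(X_train, y_train)
    print(f"second fit: {time.perf_counter() - start:.2f} s")
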
``skglm`` has several other ``scikit-learn``-compatible estimators.
Check the :ref:`API <Estimators>` for more information about the available estimators.

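Since these estimators follow the ``scikit-learn`` API, they also work with standard ``scikit-learn`` tools. As an illustrative sketch (the grid values are arbitrary), the regularization strength of :ref:`Lasso <skglm.Lasso>` can be tuned with ``GridSearchCV``:

.. code-block:: python

    from sklearn.model_selection import GridSearchCV

    from skglm import Lasso

    # tune the regularization strength by cross-validation
    search = GridSearchCV(Lasso(), param_grid={"alpha": [0.1, 1., 10.]})
    search.fit(X_train, y_train)
    print(search.best_params_)
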
Fitting :math:`\ell_1` Huber regression
---------------------------------------

Suppose now that the dataset contains outliers, and we would like to mitigate their effect on the learned coefficients
while keeping an estimator that generalizes well to unseen data. Ideally, we would like to fit an :math:`\ell_1` Huber regressor.

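The Huber loss is quadratic for small residuals and linear for large ones, which caps the influence of outliers. Up to the exact normalization used by the ``Huber`` datafit, with threshold :math:`\delta` it reads:

.. math::

    \phi_\delta(r) =
    \begin{cases}
        \frac{1}{2} r^2 & \text{if } |r| \leq \delta \\
        \delta |r| - \frac{1}{2} \delta^2 & \text{otherwise}
    \end{cases}
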
``skglm`` offers high flexibility to compose custom estimators. Through a simple API, it is possible to combine any
``skglm`` :ref:`datafit <Datafits>` and :ref:`penalty <Penalties>`.

.. note::

    - :math:`\ell_1` regularization is not supported in ``scikit-learn`` for ``HuberRegressor``.

Let's explore how to achieve that.


Generate corrupted data
***********************

We will use the same script as before, except that we will select 10 random samples and corrupt their values.

.. code-block:: python

    # imports
    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.model_selection import train_test_split

    # generate a toy dataset
    n_samples, n_features = 100, 1000
    X, y = make_regression(n_samples=n_samples, n_features=n_features)

    # select and corrupt 10 random samples
    y[np.random.choice(n_samples, 10)] = 100 * y.max()

    # split data into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y)


Now let's compose a custom estimator using :ref:`GeneralizedLinearEstimator <skglm.GeneralizedLinearEstimator>`.
It's the go-to way to create a custom estimator by combining a datafit and a penalty.

.. code-block:: python

    # import penalty and datafit
    from skglm.penalties import L1
    from skglm.datafits import Huber

    # import GLM estimator
    from skglm import GeneralizedLinearEstimator

    # build and fit estimator
    estimator = GeneralizedLinearEstimator(
        Huber(1.),
        L1(alpha=1.)
    )
    estimator.fit(X_train, y_train)


.. note::

    - The arguments given here to the datafit and penalty are arbitrary, for the sake of illustration only.

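To check that the robust datafit pays off, you can compare test scores with the plain Lasso from the first section. A quick sketch (the exact values depend on the random draw):

.. code-block:: python

    from skglm import Lasso

    # l1 Huber estimator vs. plain Lasso on the corrupted data
    print(estimator.score(X_test, y_test))
    print(Lasso().fit(X_train, y_train).score(X_test, y_test))
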
``GeneralizedLinearEstimator`` allows combining any datafit and penalty implemented in ``skglm``.
If you don't find an estimator in the ``estimators`` module, you can build your own by combining the appropriate datafit and penalty
and passing them to ``GeneralizedLinearEstimator``. Explore the list of supported :ref:`datafits <Datafits>` and :ref:`penalties <Penalties>`.

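For instance, a sparse estimator with the non-convex MCP penalty can be sketched as follows (assuming the ``MCPenalty(alpha, gamma)`` signature; the values are illustrative):

.. code-block:: python

    from skglm import GeneralizedLinearEstimator
    from skglm.datafits import Quadratic
    from skglm.penalties import MCPenalty

    # least squares datafit combined with the MCP penalty
    estimator = GeneralizedLinearEstimator(
        Quadratic(),
        MCPenalty(alpha=1., gamma=3.),
    )
    estimator.fit(X_train, y_train)
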
.. important::

    - It is possible to create your own datafits and penalties. Check the tutorials on :ref:`how to add a custom datafit <how_to_add_custom_datafit>`
      and :ref:`how to add a custom penalty <how_to_add_custom_penalty>`.


Explore further advanced topics and get hands-on examples on the :ref:`tutorials page <tutorials>`.