
Commit f89578f

TimotheeMathieu and rth authored
FIX typos in doc, add kernel approximations doc, add doc precomputed kmedoids (#97)
Co-authored-by: Roman Yurchak <[email protected]>
1 parent aa9880c commit f89578f

7 files changed: +119 −17 lines changed

CONTRIBUTING.rst

Lines changed: 57 additions & 0 deletions

@@ -0,0 +1,57 @@
+..
+    Contribution code partially copied from https://github.com/scikit-learn-contrib/category_encoders
+
+Contributing
+============
+
+We welcome and in fact would love some help.
+
+How to Contribute
+=================
+
+The preferred workflow to contribute is:
+
+1. Fork this repository into your own GitHub account.
+2. Clone your fork onto your local disk:
+
+   .. code-block:: console
+
+      git clone [email protected]:YourLogin/scikit-learn-extra.git
+      cd scikit-learn-extra
+
+3. Create a branch for your new feature; do not work in the master branch:
+
+   .. code-block:: console
+
+      git checkout -b new-feature
+
+4. Write some code, or docs, or tests.
+5. When you are done, submit a pull request.
+
+Guidelines
+==========
+
+This is still a very young project, but we do have a few guiding principles:
+
+1. Maintain the semantics of the scikit-learn API
+2. Write detailed docstrings in numpy format
+3. Support pandas dataframes and numpy arrays as inputs
+4. Write tests
+5. Format with black
+
+Running Tests
+=============
+
+To run the tests, use:
+
+.. code-block:: console
+
+   pytest
+
+Easy Issues / Getting Started
+=============================
+
+There are usually some issues on the project's GitHub page looking for
+contributors; if not, you are welcome to propose ideas there. A great first
+step is often to simply use the library and add to the examples directory.
+This helps us with documentation, and often helps to find things that would
+make the library better to use.
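For step 5 above, a minimal sketch of publishing the feature branch so a pull request can be opened from it (assuming your fork is configured as the ``origin`` remote and the branch is the ``new-feature`` created in step 3):

.. code-block:: console

   git add .
   git commit -m "Add new feature"
   git push origin new-feature

The ``pytest`` command above can likewise be pointed at a subdirectory to run only part of the suite, for example ``pytest sklearn_extra/cluster`` (an illustrative path).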

README.rst

Lines changed: 5 additions & 0 deletions

@@ -61,8 +61,13 @@ The development version can be installed with,
 
     pip install https://github.com/scikit-learn-contrib/scikit-learn-extra/archive/master.zip
 
+Contributing
+------------
+We appreciate and welcome contributions. If you would like to take part in scikit-learn-extra development, take a look at the file `CONTRIBUTING.rst`_.
 
+.. _CONTRIBUTING.rst : https://github.com/scikit-learn-contrib/scikit-learn-extra/CONTRIBUTING.rst
 License
 -------
 
 This package is released under the 3-Clause BSD license.
+

doc/modules/cluster.rst

Lines changed: 15 additions & 13 deletions

@@ -4,36 +4,38 @@
 Clustering with KMedoids and Common-nearest-neighbors
 =====================================================
 .. _k_medoids:
+.. currentmodule:: sklearn_extra.cluster
 
 K-Medoids
 =========
 
-:class:`KMedoids` is related to the :class:`KMeans` algorithm. While
-:class:`KMeans` tries to minimize the within cluster sum-of-squares,
+
+:class:`KMedoids` is related to the :class:`KMeans <sklearn.cluster.KMeans>` algorithm. While
+:class:`KMeans <sklearn.cluster.KMeans>` tries to minimize the within-cluster sum-of-squares,
 :class:`KMedoids` tries to minimize the sum of distances between each point and
 the medoid of its cluster. The medoid is a data point (unlike the centroid)
-which has least total distance to the other members of its cluster. The use of
+which has the least total distance to the other members of its cluster. The use of
 a data point to represent each cluster's center allows the use of any distance
 metric for clustering. It may also be a practical advantage: for instance, K-Medoids
 algorithms have been used for facial recognition, for which the medoid is a
 typical photo of the person to recognize, while K-Means would have obtained a blurry
 image mixing several pictures of the person to recognize.
 
-:class:`KMedoids` can be more robust to noise and outliers than :class:`KMeans`
+:class:`KMedoids` can be more robust to noise and outliers than :class:`KMeans <sklearn.cluster.KMeans>`
 as it will choose one of the cluster members as the medoid while
-:class:`KMeans` will move the center of the cluster towards the outlier which
+:class:`KMeans <sklearn.cluster.KMeans>` will move the center of the cluster towards the outlier which
 might in turn move other points away from the cluster center.
 
-:class:`KMedoids` is also different from K-Medians, which is analogous to :class:`KMeans`
+:class:`KMedoids` is also different from K-Medians, which is analogous to :class:`KMeans <sklearn.cluster.KMeans>`
 except that the Manhattan Median is used for each cluster center instead of
 the centroid. K-Medians is robust to outliers, but it is limited to the
-Manhattan Distance metric and, similar to :class:`KMeans`, it does not guarantee
+Manhattan Distance metric and, similar to :class:`KMeans <sklearn.cluster.KMeans>`, it does not guarantee
 that the center of each cluster will be a member of the original dataset.
 
 The complexity of K-Medoids is :math:`O(N^2 K T)` where :math:`N` is the number
 of samples, :math:`T` is the number of iterations and :math:`K` is the number of
 clusters. This makes it more suitable for smaller datasets in comparison to
-:class:`KMeans` which is :math:`O(N K T)`.
+:class:`KMeans <sklearn.cluster.KMeans>` which is :math:`O(N K T)`.
 
 .. topic:: Examples:
 
@@ -60,12 +62,12 @@ when speed is an issue.
 * PAM method works as follows:
 
   * Initialize: Greedy initialization of ``n_clusters``. First select the point
-    in the dataset that minimize the sum of distances to a point. Then, add one
-    point that minimize the cost and loop until ``n_clusters`` point are selected.
+    in the dataset that minimizes the sum of distances to a point. Then, add one
+    point that minimizes the cost and loop until ``n_clusters`` points are selected.
     This is the ``init`` parameter called ``build``.
-  * Swap Step: for all medoids already selected, compute the cost of swaping this
-    medoid with any non-medoid point. Then, make the swap that decrease the cost
-    the moste. Loop and stop when there is no change anymore.
+  * Swap Step: for all medoids already selected, compute the cost of swapping this
+    medoid with any non-medoid point. Then, make the swap that decreases the cost
+    the most. Loop and stop when there is no change anymore.
 
 .. topic:: References:
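To make the K-Medoids discussion above concrete, here is a minimal sketch using the options it describes (``method='pam'`` with the greedy ``init='build'``); the toy data and the printed attributes follow the standard scikit-learn clustering API:

.. code-block:: python

   import numpy as np
   from sklearn_extra.cluster import KMedoids

   # Two tight blobs plus one far-away outlier.
   X = np.array([[1.0, 2.0], [1.0, 4.0], [1.0, 0.0],
                 [10.0, 2.0], [10.0, 4.0], [10.0, 0.0],
                 [100.0, 100.0]])

   # PAM with the greedy "build" initialization described above.
   km = KMedoids(n_clusters=2, method="pam", init="build", random_state=0).fit(X)

   print(km.labels_)           # cluster assignment for each point
   print(km.cluster_centers_)  # medoids: actual rows of X, unlike K-Means centroids

Note that each row of ``cluster_centers_`` is an actual data point (a medoid), which is what makes arbitrary metrics, and use cases like the facial-recognition example above, possible.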

doc/modules/kernel_approximation.rst

Lines changed: 35 additions & 0 deletions

@@ -0,0 +1,35 @@
+.. _kernel_approximation:
+
+==================================================
+Kernel map approximation for faster kernel methods
+==================================================
+
+.. currentmodule:: sklearn_extra.kernel_approximation
+
+Kernel methods, which are among the most flexible and influential tools in
+machine learning with applications in virtually all areas of the field, rely
+on high-dimensional feature spaces in order to construct powerful classifiers,
+regressors, or clustering algorithms. The main drawback of kernel methods
+is their prohibitive computational complexity: both space and time complexity
+are at least quadratic, because the whole kernel matrix has to be computed.
+
+One popular way to improve the computational scalability of kernel methods is
+to approximate the feature map implicit behind the kernel method. In practice,
+this means computing a low-dimensional approximation of the otherwise
+high-dimensional embedding used to define the kernel method.
+
+:class:`Fastfood` approximates the feature map of an RBF kernel by a Monte Carlo
+approximation of its Fourier transform.
+
+Fastfood replaces the random matrix of Random Kitchen Sinks
+(`RBFSampler <https://scikit-learn.org/stable/modules/generated/sklearn.kernel_approximation.RBFSampler.html#sklearn.kernel_approximation.RBFSampler>`_)
+with an approximation that uses the Walsh-Hadamard transformation to gain
+significant speed and storage advantages. The computational complexity for
+mapping a single example is O(n_components log d); the space complexity is
+O(n_components).
+
+See the `scikit-learn User Guide <https://scikit-learn.org/stable/modules/kernel_approximation.html#kernel-approximation>`_ for more general information on kernel approximation.
+
+See also :class:`EigenProRegressor <sklearn_extra.kernel_methods.EigenProRegressor>` and
+:class:`EigenProClassifier <sklearn_extra.kernel_methods.EigenProClassifier>` for another
+fast approach to kernel methods.
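As an illustration of the pattern described above (approximate the feature map, then use a fast linear method), here is a sketch of :class:`Fastfood` in a pipeline; only the ``n_components`` parameter mentioned above and the standard scikit-learn ``random_state`` convention are assumed:

.. code-block:: python

   import numpy as np
   from sklearn.linear_model import SGDClassifier
   from sklearn.pipeline import make_pipeline
   from sklearn_extra.kernel_approximation import Fastfood

   rng = np.random.RandomState(0)
   X = rng.randn(200, 16)                    # 200 samples, d = 16 features
   y = (X[:, 0] * X[:, 1] > 0).astype(int)   # a non-linearly separable target

   # Map to a low-dimensional RBF feature approximation, then fit a linear
   # classifier on it instead of computing the full 200 x 200 kernel matrix.
   model = make_pipeline(Fastfood(n_components=64, random_state=0),
                         SGDClassifier(random_state=0))
   model.fit(X, y)
   print(model.score(X, y))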

doc/modules/robust.rst

Lines changed: 4 additions & 4 deletions

@@ -26,7 +26,7 @@ What is an outlier ?
 
 The term "outlier" refers to a discordant minority of the dataset. It is
 generally assumed to be a set of points situated outside the bulk of the data,
-but there exists more complex cases as illustrated in the figure below.
+but there exist more complex cases, as illustrated in the figure below.
 
 Formally, we define outliers for a given task by considering points for
 which the loss function takes unusually high values.
@@ -140,7 +140,7 @@ important to do it for SGD). In the context of a corrupted dataset, please use
 
 This algorithm has been studied in the context of "mom" weights in the
 article [1]_; the context of "huber" weights has been mentioned in [2]_.
-Both weighting scheme can be seen as special cases of the algorithm in [3]_.
+Both weighting schemes can be seen as special cases of the algorithm in [3]_.
 
 Comparison with other robust estimators
 ---------------------------------------
@@ -161,7 +161,7 @@ regressions are robust only to outliers in the label Y but not in X.
 Pro: RANSACRegressor and TheilSenRegressor both use a hard rejection of
 outliers. This can be interpreted as though there were an outlier detection
 step and then a regression step, whereas RobustWeightedRegressor is directly
-robust to outliers. This often increase the performance on moderatly corrupted
+robust to outliers. This often increases the performance on moderately corrupted
 datasets.
 
 Con: In general, this algorithm is slower than both TheilSenRegressor and
@@ -172,7 +172,7 @@ Speed and limits of the algorithm
 
 Most of the time, it is interesting to do robust statistics only when there
 are outliers; note that many datasets have previously been "cleaned"
-of an outliers in which case this algorithm is not better than base_estimator.
+of outliers, in which case this algorithm is not better than base_estimator.
 
 In high dimension, the algorithm is expected to be as good
 (or as bad) as base_estimator.
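For context, a hypothetical sketch of ``RobustWeightedRegressor`` on data with a few corrupted labels; the ``weighting`` parameter name is an assumption based on the "huber"/"mom" schemes discussed above, so treat the exact signature as illustrative:

.. code-block:: python

   import numpy as np
   from sklearn_extra.robust import RobustWeightedRegressor

   rng = np.random.RandomState(0)
   X = rng.randn(100, 2)
   y = X @ np.array([1.0, 2.0]) + 0.1 * rng.randn(100)
   y[:5] += 50.0  # corrupt a few labels with large outliers

   # The "huber" weighting down-weights points whose loss is unusually
   # high, so the fit is not dragged towards the corrupted labels.
   reg = RobustWeightedRegressor(weighting="huber").fit(X, y)
   print(reg.predict(X[:3]))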

doc/user_guide.rst

Lines changed: 1 addition & 0 deletions

@@ -13,3 +13,4 @@ User guide
    modules/eigenpro.rst
    modules/cluster.rst
    modules/robust.rst
+   modules/kernel_approximation.rst

sklearn_extra/cluster/_k_medoids.py

Lines changed: 2 additions & 0 deletions

@@ -37,6 +37,8 @@ class KMedoids(BaseEstimator, ClusterMixin, TransformerMixin):
 
     metric : string, or callable, optional, default: 'euclidean'
         What distance metric to use. See :func:`metrics.pairwise_distances`.
+        metric can be 'precomputed'; the user must then feed the fit method
+        a precomputed distance matrix rather than the design matrix X.
 
     method : {'alternate', 'pam'}, default: 'alternate'
         Which algorithm to use. 'alternate' is faster while 'pam' is more accurate.
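A short sketch of the option documented above: precompute a pairwise distance matrix and pass it to ``fit`` in place of the design matrix X:

.. code-block:: python

   import numpy as np
   from sklearn.metrics import pairwise_distances
   from sklearn_extra.cluster import KMedoids

   X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])

   # Precompute the pairwise distances with any metric, e.g. manhattan.
   D = pairwise_distances(X, metric="manhattan")

   # With metric='precomputed', fit receives the distance matrix, not X.
   km = KMedoids(n_clusters=2, metric="precomputed", random_state=0).fit(D)
   print(km.labels_)          # cluster assignment for each point
   print(km.medoid_indices_)  # indices of the medoids in the original data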
