
Commit f89578f

TimotheeMathieu and rth authored
FIX typos in doc, add kernel approximations doc, add doc precomputed kmedoids (#97)
Co-authored-by: Roman Yurchak <[email protected]>
1 parent aa9880c commit f89578f

7 files changed: +119 −17 lines changed

CONTRIBUTING.rst

Lines changed: 57 additions & 0 deletions

@@ -0,0 +1,57 @@
+..
+    Contribution code partially copied from https://github.com/scikit-learn-contrib/category_encoders
+
+Contributing
+============
+
+We welcome and in fact would love some help.
+
+How to Contribute
+=================
+
+The preferred workflow to contribute is:
+
+1. Fork this repository into your own GitHub account.
+2. Clone your fork onto your local disk:
+
+   .. code-block:: console
+
+      git clone [email protected]:YourLogin/scikit-learn-extra.git
+      cd scikit-learn-extra
+
+3. Create a branch for your new feature; do not work in the master branch:
+
+   .. code-block:: console
+
+      git checkout -b new-feature
+
+4. Write some code, or docs, or tests.
+5. When you are done, submit a pull request.
+
+Guidelines
+==========
+
+This is still a very young project, but we do have a few guiding principles:
+
+1. Maintain the semantics of the scikit-learn API
+2. Write detailed docstrings in numpy format
+3. Support pandas dataframes and numpy arrays as inputs
+4. Write tests
+5. Format with black
+
+Running Tests
+=============
+
+To run the tests, use:
+
+.. code-block:: console
+
+   pytest
+
+Easy Issues / Getting Started
+=============================
+
+There are usually some issues on the project's GitHub page looking for
+contributors; if not, you are welcome to propose ideas there. A great first
+step is often to simply use the library and add to the examples directory.
+This helps us with documentation, and often helps to find things that would
+make the library better to use.
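For step 5 above, a minimal sketch of publishing the feature branch so a pull request can be opened from it (assuming your fork is configured as the ``origin`` remote and the branch is the ``new-feature`` created in step 3):

.. code-block:: console

   git add .
   git commit -m "Add new feature"
   git push origin new-feature

The ``pytest`` command above can likewise be pointed at a subdirectory to run only part of the suite, for example ``pytest sklearn_extra/cluster`` (an illustrative path).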

README.rst

Lines changed: 5 additions & 0 deletions

@@ -61,8 +61,13 @@ The development version can be installed with,
 
     pip install https://github.com/scikit-learn-contrib/scikit-learn-extra/archive/master.zip
 
+Contributing
+------------
+We appreciate and welcome contributions. If you would like to take part in scikit-learn-extra development, take a look at the file `CONTRIBUTING.rst`_.
 
+.. _CONTRIBUTING.rst : https://github.com/scikit-learn-contrib/scikit-learn-extra/CONTRIBUTING.rst
 License
 -------
 
 This package is released under the 3-Clause BSD license.
+

doc/modules/cluster.rst

Lines changed: 15 additions & 13 deletions

@@ -4,36 +4,38 @@
 Clustering with KMedoids and Common-nearest-neighbors
 =====================================================
 .. _k_medoids:
+.. currentmodule:: sklearn_extra.cluster
 
 K-Medoids
 =========
 
-:class:`KMedoids` is related to the :class:`KMeans` algorithm. While
-:class:`KMeans` tries to minimize the within cluster sum-of-squares,
+
+:class:`KMedoids` is related to the :class:`KMeans <sklearn.cluster.KMeans>` algorithm. While
+:class:`KMeans <sklearn.cluster.KMeans>` tries to minimize the within-cluster sum-of-squares,
 :class:`KMedoids` tries to minimize the sum of distances between each point and
 the medoid of its cluster. The medoid is a data point (unlike the centroid)
-which has least total distance to the other members of its cluster. The use of
+which has the least total distance to the other members of its cluster. The use of
 a data point to represent each cluster's center allows the use of any distance
 metric for clustering. It may also be a practical advantage: for instance, K-Medoids
 algorithms have been used for facial recognition, for which the medoid is a
 typical photo of the person to recognize, while K-Means would have obtained a blurry
 image mixing several pictures of the person to recognize.
 
-:class:`KMedoids` can be more robust to noise and outliers than :class:`KMeans`
+:class:`KMedoids` can be more robust to noise and outliers than :class:`KMeans <sklearn.cluster.KMeans>`
 as it will choose one of the cluster members as the medoid while
-:class:`KMeans` will move the center of the cluster towards the outlier which
+:class:`KMeans <sklearn.cluster.KMeans>` will move the center of the cluster towards the outlier which
 might in turn move other points away from the cluster center.
 
-:class:`KMedoids` is also different from K-Medians, which is analogous to :class:`KMeans`
+:class:`KMedoids` is also different from K-Medians, which is analogous to :class:`KMeans <sklearn.cluster.KMeans>`
 except that the Manhattan Median is used for each cluster center instead of
 the centroid. K-Medians is robust to outliers, but it is limited to the
-Manhattan Distance metric and, similar to :class:`KMeans`, it does not guarantee
+Manhattan Distance metric and, similar to :class:`KMeans <sklearn.cluster.KMeans>`, it does not guarantee
 that the center of each cluster will be a member of the original dataset.
 
 The complexity of K-Medoids is :math:`O(N^2 K T)` where :math:`N` is the number
 of samples, :math:`T` is the number of iterations and :math:`K` is the number of
 clusters. This makes it more suitable for smaller datasets in comparison to
-:class:`KMeans` which is :math:`O(N K T)`.
+:class:`KMeans <sklearn.cluster.KMeans>` which is :math:`O(N K T)`.
 
 .. topic:: Examples:
 
@@ -60,12 +62,12 @@ when speed is an issue.
 * PAM method works as follows:
 
   * Initialize: Greedy initialization of ``n_clusters``. First select the point
-    in the dataset that minimize the sum of distances to a point. Then, add one
-    point that minimize the cost and loop until ``n_clusters`` point are selected.
+    in the dataset that minimizes the sum of distances to a point. Then, add one
+    point that minimizes the cost and loop until ``n_clusters`` points are selected.
     This is the ``init`` parameter called ``build``.
-  * Swap Step: for all medoids already selected, compute the cost of swaping this
-    medoid with any non-medoid point. Then, make the swap that decrease the cost
-    the moste. Loop and stop when there is no change anymore.
+  * Swap Step: for all medoids already selected, compute the cost of swapping this
+    medoid with any non-medoid point. Then, make the swap that decreases the cost
+    the most. Loop and stop when there is no change anymore.
 
 .. topic:: References:
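To make the K-Medoids discussion above concrete, here is a minimal sketch using the options it describes (``method='pam'`` with the greedy ``init='build'``); the toy data and the printed attributes follow the standard scikit-learn clustering API:

.. code-block:: python

   import numpy as np
   from sklearn_extra.cluster import KMedoids

   # Two tight blobs plus one far-away outlier.
   X = np.array([[1.0, 2.0], [1.0, 4.0], [1.0, 0.0],
                 [10.0, 2.0], [10.0, 4.0], [10.0, 0.0],
                 [100.0, 100.0]])

   # PAM with the greedy "build" initialization described above.
   km = KMedoids(n_clusters=2, method="pam", init="build", random_state=0).fit(X)

   print(km.labels_)           # cluster assignment for each point
   print(km.cluster_centers_)  # medoids: actual rows of X, unlike K-Means centroids

Note that each row of ``cluster_centers_`` is an actual data point (a medoid), which is what makes arbitrary metrics, and use cases like the facial-recognition example above, possible.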

doc/modules/kernel_approximation.rst

Lines changed: 35 additions & 0 deletions

@@ -0,0 +1,35 @@
+.. _kernel_approximation:
+
+==================================================
+Kernel map approximation for faster kernel methods
+==================================================
+
+.. currentmodule:: sklearn_extra.kernel_approximation
+
+Kernel methods, which are among the most flexible and influential tools in
+machine learning with applications in virtually all areas of the field, rely
+on high-dimensional feature spaces in order to construct powerful classifiers,
+regressors, or clustering algorithms. The main drawback of kernel methods
+is their prohibitive computational complexity: both space and time complexity
+are at least quadratic, because the whole kernel matrix has to be computed.
+
+One popular way to improve the computational scalability of kernel methods is
+to approximate the feature map implicit behind the kernel method. In practice,
+this means computing a low-dimensional approximation of the otherwise
+high-dimensional embedding used to define the kernel method.
+
+:class:`Fastfood` approximates the feature map of an RBF kernel by a Monte Carlo
+approximation of its Fourier transform.
+
+Fastfood replaces the random matrix of Random Kitchen Sinks
+(`RBFSampler <https://scikit-learn.org/stable/modules/generated/sklearn.kernel_approximation.RBFSampler.html#sklearn.kernel_approximation.RBFSampler>`_)
+with an approximation that uses the Walsh-Hadamard transformation to gain
+significant speed and storage advantages. The computational complexity for
+mapping a single example is O(n_components log d); the space complexity is
+O(n_components).
+
+See the `scikit-learn User Guide <https://scikit-learn.org/stable/modules/kernel_approximation.html#kernel-approximation>`_ for more general information on kernel approximation.
+
+See also :class:`EigenProRegressor <sklearn_extra.kernel_methods.EigenProRegressor>` and
+:class:`EigenProClassifier <sklearn_extra.kernel_methods.EigenProClassifier>` for another
+fast approach to kernel methods.
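As an illustration of the pattern described above (approximate the feature map, then use a fast linear method), here is a sketch of :class:`Fastfood` in a pipeline; only the ``n_components`` parameter mentioned above and the standard scikit-learn ``random_state`` convention are assumed:

.. code-block:: python

   import numpy as np
   from sklearn.linear_model import SGDClassifier
   from sklearn.pipeline import make_pipeline
   from sklearn_extra.kernel_approximation import Fastfood

   rng = np.random.RandomState(0)
   X = rng.randn(200, 16)                    # 200 samples, d = 16 features
   y = (X[:, 0] * X[:, 1] > 0).astype(int)   # a non-linearly separable target

   # Map to a low-dimensional RBF feature approximation, then fit a linear
   # classifier on it instead of computing the full 200 x 200 kernel matrix.
   model = make_pipeline(Fastfood(n_components=64, random_state=0),
                         SGDClassifier(random_state=0))
   model.fit(X, y)
   print(model.score(X, y))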

doc/modules/robust.rst

Lines changed: 4 additions & 4 deletions

@@ -26,7 +26,7 @@ What is an outlier ?
 
 The term "outlier" refers to a discordant minority of the dataset. It is
 generally assumed to be a set of points situated outside the bulk of the data,
-but there exists more complex cases as illustrated in the figure below.
+but there exist more complex cases, as illustrated in the figure below.
 
 Formally, we define outliers for a given task by considering points for
 which the loss function takes unusually high values.
@@ -140,7 +140,7 @@ important to do it for SGD). In the context of a corrupted dataset, please use
 
 This algorithm has been studied in the context of "mom" weights in the
 article [1]_; the context of "huber" weights has been mentioned in [2]_.
-Both weighting scheme can be seen as special cases of the algorithm in [3]_.
+Both weighting schemes can be seen as special cases of the algorithm in [3]_.
 
 Comparison with other robust estimators
 ---------------------------------------
@@ -161,7 +161,7 @@ regressions are robust only to outliers in the label Y but not in X.
 Pro: RANSACRegressor and TheilSenRegressor both use a hard rejection of
 outliers. This can be interpreted as though there were an outlier detection
 step and then a regression step, whereas RobustWeightedRegressor is directly
-robust to outliers. This often increase the performance on moderatly corrupted
+robust to outliers. This often increases the performance on moderately corrupted
 datasets.
 
 Con: In general, this algorithm is slower than both TheilSenRegressor and
@@ -172,7 +172,7 @@ Speed and limits of the algorithm
 
 Most of the time, it is interesting to do robust statistics only when there
 are outliers; note that many datasets have previously been "cleaned"
-of an outliers in which case this algorithm is not better than base_estimator.
+of outliers, in which case this algorithm is not better than base_estimator.
 
 In high dimension, the algorithm is expected to be as good
 (or as bad) as base_estimator.
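For context, a hypothetical sketch of ``RobustWeightedRegressor`` on data with a few corrupted labels; the ``weighting`` parameter name is an assumption based on the "huber"/"mom" schemes discussed above, so treat the exact signature as illustrative:

.. code-block:: python

   import numpy as np
   from sklearn_extra.robust import RobustWeightedRegressor

   rng = np.random.RandomState(0)
   X = rng.randn(100, 2)
   y = X @ np.array([1.0, 2.0]) + 0.1 * rng.randn(100)
   y[:5] += 50.0  # corrupt a few labels with large outliers

   # The "huber" weighting down-weights points whose loss is unusually
   # high, so the fit is not dragged towards the corrupted labels.
   reg = RobustWeightedRegressor(weighting="huber").fit(X, y)
   print(reg.predict(X[:3]))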

doc/user_guide.rst

Lines changed: 1 addition & 0 deletions

@@ -13,3 +13,4 @@ User guide
    modules/eigenpro.rst
    modules/cluster.rst
    modules/robust.rst
+   modules/kernel_approximation.rst

sklearn_extra/cluster/_k_medoids.py

Lines changed: 2 additions & 0 deletions

@@ -37,6 +37,8 @@ class KMedoids(BaseEstimator, ClusterMixin, TransformerMixin):
 
     metric : string, or callable, optional, default: 'euclidean'
         What distance metric to use. See :func:`metrics.pairwise_distances`.
+        metric can be 'precomputed'; the user must then feed the fit method
+        a precomputed distance matrix rather than the design matrix X.
 
     method : {'alternate', 'pam'}, default: 'alternate'
         Which algorithm to use. 'alternate' is faster while 'pam' is more accurate.
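A short sketch of the option documented above: precompute a pairwise distance matrix and pass it to ``fit`` in place of the design matrix X:

.. code-block:: python

   import numpy as np
   from sklearn.metrics import pairwise_distances
   from sklearn_extra.cluster import KMedoids

   X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])

   # Precompute the pairwise distances with any metric, e.g. manhattan.
   D = pairwise_distances(X, metric="manhattan")

   # With metric='precomputed', fit receives the distance matrix, not X.
   km = KMedoids(n_clusters=2, metric="precomputed", random_state=0).fit(D)
   print(km.labels_)          # cluster assignment for each point
   print(km.medoid_indices_)  # indices of the medoids in the original data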
