|
1 | | -""" |
2 | | -Sequential selection |
| 1 | +r""" |
| 2 | +This module contains data sub-selection modules primarily corresponding to |
| 3 | +methods derived from CUR matrix decomposition and Farthest Point Sampling. In |
| 4 | +their classical form, CUR and FPS determine a data subset that maximizes the |
| 5 | +variance (CUR) or distribution (FPS) of the features or samples. These methods |
| 6 | +can be modified to incorporate supervised target information; the modified
| 7 | +methods are denoted `PCov-CUR` and `PCov-FPS`. For further reading, refer to
| 8 | +[Imbalzano2018]_ and [Cersonsky2021]_. These selectors can be used for both
| 9 | +feature and sample selection, with similar instantiations. All sub-selection
| 10 | +methods score each feature or sample (without an estimator) and choose the one
| 11 | +with the maximum score. A simple example of usage:
| 12 | +
|
| 13 | +.. doctest:: |
| 14 | +
|
| 15 | + >>> # feature selection |
| 16 | + >>> import numpy as np |
| 17 | + >>> from skmatter.feature_selection import CUR, FPS, PCovCUR, PCovFPS |
| 18 | + >>> selector = CUR( |
| 19 | + ... # the number of selections to make |
| 20 | + ... # if None, set to half the samples or features |
| 21 | + ... # if float, fraction of the total dataset to select |
| 22 | + ... # if int, absolute number of selections to make |
| 23 | + ... n_to_select=2, |
| 24 | + ... # option to use `tqdm <https://tqdm.github.io/>`_ progress bar |
| 25 | + ... progress_bar=True, |
| 26 | + ... # float, cutoff score to stop selecting |
| 27 | + ... score_threshold=1e-12, |
| 28 | + ... # boolean, whether to select randomly after non-redundant selections |
| 29 | + ... # are exhausted |
| 30 | + ... full=False, |
| 31 | + ... ) |
| 32 | + >>> X = np.array( |
| 33 | + ... [ |
| 34 | + ... [0.12, 0.21, 0.02], # 3 samples, 3 features |
| 35 | + ... [-0.09, 0.32, -0.10], |
| 36 | + ... [-0.03, -0.53, 0.08], |
| 37 | + ... ] |
| 38 | + ... ) |
| 39 | + >>> y = np.array([0.0, 0.0, 1.0]) # classes of each sample |
| 40 | + >>> selector.fit(X) |
| 41 | + CUR(n_to_select=2, progress_bar=True, score_threshold=1e-12) |
| 42 | + >>> Xr = selector.transform(X) |
| 43 | + >>> print(Xr.shape) |
| 44 | + (3, 2) |
| 45 | + >>> selector = PCovCUR(n_to_select=2) |
| 46 | + >>> selector.fit(X, y) |
| 47 | + PCovCUR(n_to_select=2) |
| 48 | + >>> Xr = selector.transform(X) |
| 49 | + >>> print(Xr.shape) |
| 50 | + (3, 2) |
| 51 | + >>> |
| 52 | + >>> # Now sample selection |
| 53 | + >>> from skmatter.sample_selection import CUR, FPS, PCovCUR, PCovFPS |
| 54 | + >>> selector = CUR(n_to_select=2) |
| 55 | + >>> selector.fit(X) |
| 56 | + CUR(n_to_select=2) |
| 57 | + >>> Xr = X[selector.selected_idx_] |
| 58 | + >>> print(Xr.shape) |
| 59 | + (2, 3) |
| 60 | +
|
| 61 | +These selectors are available: |
| 62 | +
|
| 63 | +* :ref:`CUR-api` decomposition: an iterative feature selection method based upon the
| 64 | + singular value decomposition.
| 65 | +* :ref:`PCov-CUR-api` decomposition: extends CUR by using augmented right or left
| 66 | + singular vectors inspired by Principal Covariates Regression.
| 67 | +* :ref:`FPS-api`: a common selection technique intended to exploit the diversity of
| 68 | + the input space. The selection of the first point is made at random or by a
| 69 | + separate metric.
| 70 | +* :ref:`PCov-FPS-api`: extends FPS much like PCov-CUR extends CUR.
| 71 | +* :ref:`Voronoi-FPS-api`: conducts FPS selection, taking advantage of Voronoi
| 72 | + tessellations to accelerate selection. |
| 73 | +* :ref:`DCH-api`: selects samples by constructing a directional convex hull and |
| 74 | + determining which samples lie on the bounding surface. |
3 | 75 | """ |
4 | 76 |
|
5 | 77 | import numbers |
|
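
The docstring describes FPS as scoring each candidate and choosing the one with the maximum score. As a rough illustration of that idea, here is a minimal NumPy sketch of greedy farthest point sampling over the rows of `X`. It is not skmatter's implementation: the function name `farthest_point_sampling`, the `initial` argument, and the choice of squared Euclidean distance as the score are assumptions made for this example.

```python
import numpy as np


def farthest_point_sampling(X, n_to_select, initial=0):
    """Greedy FPS over the rows of X (illustrative sketch, hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    selected = [initial]
    # Score of each row: squared distance to its nearest already-selected row.
    min_dist = np.sum((X - X[initial]) ** 2, axis=1)
    for _ in range(n_to_select - 1):
        # Choose the row with the maximum score, i.e. farthest from the selection.
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        # Update the scores with distances to the newly selected row.
        new_dist = np.sum((X - X[nxt]) ** 2, axis=1)
        min_dist = np.minimum(min_dist, new_dist)
    return np.array(selected)


# Same toy data as in the docstring: 3 samples, 3 features.
X = np.array(
    [
        [0.12, 0.21, 0.02],
        [-0.09, 0.32, -0.10],
        [-0.03, -0.53, 0.08],
    ]
)
idx = farthest_point_sampling(X, n_to_select=2)  # sample selection
Xr = X[idx]
```

Passing `X.T` instead of `X` selects features rather than samples under the same assumptions. Here the first point is fixed by `initial`; choosing it at random is the other option the docstring mentions.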