
Commit b0b7451

agoscinski authored and PicoCentauri committed
separate getting-started page from intro in doc
* Put the intro of scikit-matter into the index page and abbreviate it
* Add a getting-started page which gives an overview of important implementations
* Include the existing introductory text for reconstruction measures in the API reference, so all introductory texts from the API are included in the getting-started page
* Reword the text a bit for more soundness within the getting-started page
* Move doc and code corresponding to a module into __init__ files so they are included in the doc via automodule
1 parent 0e1d581 commit b0b7451

9 files changed: +196 −157 lines changed

docs/src/getting-started.rst

Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@
+Getting started
+===============
+
+This guide illustrates the main functionalities that ``scikit-matter`` provides. It
+assumes a very basic working knowledge of how ``scikit-learn`` works. Please refer to
+our :ref:`installation` instructions for installing ``scikit-matter``.
+
+For a detailed explanation of the functionalities, please look at the
+:ref:`selection-api`.
+
+Features and Samples Selection
+------------------------------
+
+.. automodule:: skmatter._selection
+    :noindex:
+
+Notebook Examples
+^^^^^^^^^^^^^^^^^
+
+.. include:: examples/selection/index.rst
+   :start-line: 4
+
+
+Reconstruction Measures
+-----------------------
+
+.. automodule:: skmatter.metrics
+    :noindex:
+
+Notebook Examples
+^^^^^^^^^^^^^^^^^
+
+.. include:: examples/reconstruction/index.rst
+   :start-line: 4
+
+Principal Covariates Regression
+-------------------------------
+
+.. automodule:: skmatter.decomposition
+    :noindex:
+
+Notebook Examples
+^^^^^^^^^^^^^^^^^
+
+.. include:: examples/pcovr/index.rst
+   :start-line: 4

docs/src/gfrm.rst

Lines changed: 19 additions & 5 deletions
@@ -1,31 +1,45 @@
 .. _gfrm:

 Reconstruction Measures
-======================================
+=======================

-.. currentmodule:: skmatter.metrics
+.. marker-reconstruction-introduction-begin
+
+.. automodule:: skmatter.metrics
+
+These reconstruction measures are available:

+* :ref:`GRE-api` (GRE) computes the amount of linearly-decodable information
+  recovered through a global linear reconstruction.
+* :ref:`GRD-api` (GRD) computes the amount of distortion contained in a global linear
+  reconstruction.
+* :ref:`LRE-api` (LRE) computes the amount of decodable information recovered through
+  a local linear reconstruction for the k-nearest neighborhood of each sample.
+
+.. marker-reconstruction-introduction-end
+
+.. currentmodule:: skmatter.metrics

 .. _GRE-api:

 Global Reconstruction Error
-###########################
+---------------------------

 .. autofunction:: pointwise_global_reconstruction_error
 .. autofunction:: global_reconstruction_error

 .. _GRD-api:

 Global Reconstruction Distortion
-################################
+--------------------------------

 .. autofunction:: pointwise_global_reconstruction_distortion
 .. autofunction:: global_reconstruction_distortion

 .. _LRE-api:

 Local Reconstruction Error
-##########################
+--------------------------

 .. autofunction:: pointwise_local_reconstruction_error
 .. autofunction:: local_reconstruction_error
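
As a quick illustration of the measures listed above, the following minimal sketch compares a truncated feature matrix against the full one. It assumes ``global_reconstruction_error`` and ``global_reconstruction_distortion`` accept two feature matrices describing the same samples (reconstructing the second from the first), as suggested by the autofunction entries; the data and the exact argument order are illustrative assumptions rather than a documented recipe.

>>> import numpy as np
>>> from skmatter.metrics import (
...     global_reconstruction_error,
...     global_reconstruction_distortion,
... )
>>> rng = np.random.RandomState(0)
>>> X_full = rng.normal(size=(20, 4))  # "full" feature matrix: 20 samples, 4 features
>>> X_small = X_full[:, :2]  # truncated representation of the same samples
>>> # how much of X_full is linearly decodable from X_small (assumed call pattern)
>>> gre = global_reconstruction_error(X_small, X_full)
>>> grd = global_reconstruction_distortion(X_small, X_full)

A GRE near zero would indicate that the truncated features retain essentially all of the linearly-decodable information of the full set.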

docs/src/index.rst

Lines changed: 11 additions & 19 deletions
@@ -1,23 +1,11 @@
-scikit-matter documentation
-===========================
+scikit-matter
+=============

-``scikit-matter`` is a collection of `scikit-learn <https://scikit.org/>`_ compatible
-utilities that implement methods born out of the materials science and chemistry
-communities.
-
-Convenient-to-use libraries such as scikit-learn have accelerated the adoption and
-application of machine learning (ML) workflows and data-driven methods. Such libraries
-have gained great popularity partly because the implemented methods are generally
-applicable in multiple domains. While developments in the atomistic learning community
-have put forward general-use machine learning methods, their deployment is commonly
-entangled with domain-specific functionalities, preventing access to a wider audience.
-
-scikit-matter targets domain-agnostic implementations of methods developed in the
-computational chemical and materials science community, following the scikit-learn API
+scikit-matter is a toolbox of methods developed in the
+computational chemical and materials science community, following the
+`scikit-learn <https://scikit.org/>`_ API
 and coding guidelines to promote usability and interoperability with existing workflows.
-scikit-matter contains a toolbox of methods for unsupervised and supervised analysis of
-ML datasets, including the comparison, decomposition, and selection of features and
-samples.
+

 .. include:: ../../README.rst
    :start-after: marker-issues
@@ -27,9 +15,13 @@ samples.
    :maxdepth: 1
    :caption: Contents:

-   intro
+   getting-started
    installation
    reference
    tutorials
    contributing
    bibliography
+
+
+If you would like to contribute to scikit-matter, check out our :ref:`contributing`
+page!

docs/src/installation.rst

Lines changed: 2 additions & 0 deletions
@@ -1,3 +1,5 @@
+.. _installation:
+
 .. include:: ../../README.rst
    :start-after: marker-installation
    :end-before: marker-ci-tests

docs/src/intro.rst

Lines changed: 0 additions & 68 deletions
This file was deleted.

docs/src/selection.rst

Lines changed: 2 additions & 60 deletions
@@ -1,67 +1,9 @@
+.. _selection-api:

 Feature and Sample Selection
 ============================

-`scikit-matter` contains multiple data sub-selection modules,
-primarily corresponding to methods derived from CUR matrix decomposition
-and Farthest Point Sampling. In their classical form, CUR and FPS determine
-a data subset that maximizes the
-variance (CUR) or distribution (FPS) of the features or samples. These methods
-can be modified to combine supervised and unsupervised learning, in a formulation
-denoted `PCov-CUR` and `PCov-FPS`.
-For further reading, refer to [Imbalzano2018]_ and [Cersonsky2021]_.
-
-These selectors can be used for both feature and sample selection, with similar
-instantiations. This can be executed using:
-
-.. doctest::
-
-    >>> # feature selection
-    >>> import numpy as np
-    >>> from skmatter.feature_selection import CUR, FPS, PCovCUR, PCovFPS
-    >>> selector = CUR(
-    ...     # the number of selections to make
-    ...     # if None, set to half the samples or features
-    ...     # if float, fraction of the total dataset to select
-    ...     # if int, absolute number of selections to make
-    ...     n_to_select=2,
-    ...     # option to use `tqdm <https://tqdm.github.io/>`_ progress bar
-    ...     progress_bar=True,
-    ...     # float, cutoff score to stop selecting
-    ...     score_threshold=1e-12,
-    ...     # boolean, whether to select randomly after non-redundant selections
-    ...     # are exhausted
-    ...     full=False,
-    ... )
-    >>> X = np.array(
-    ...     [
-    ...         [0.12, 0.21, 0.02],  # 3 samples, 3 features
-    ...         [-0.09, 0.32, -0.10],
-    ...         [-0.03, -0.53, 0.08],
-    ...     ]
-    ... )
-    >>> y = np.array([0.0, 0.0, 1.0])  # classes of each sample
-    >>> selector.fit(X)
-    CUR(n_to_select=2, progress_bar=True, score_threshold=1e-12)
-    >>> Xr = selector.transform(X)
-    >>> print(Xr.shape)
-    (3, 2)
-    >>> selector = PCovCUR(n_to_select=2)
-    >>> selector.fit(X, y)
-    PCovCUR(n_to_select=2)
-    >>> Xr = selector.transform(X)
-    >>> print(Xr.shape)
-    (3, 2)
-    >>>
-    >>> # Now sample selection
-    >>> from skmatter.sample_selection import CUR, FPS, PCovCUR, PCovFPS
-    >>> selector = CUR(n_to_select=2)
-    >>> selector.fit(X)
-    CUR(n_to_select=2)
-    >>> Xr = X[selector.selected_idx_]
-    >>> print(Xr.shape)
-    (2, 3)
-
+.. automodule:: skmatter._selection

 .. _CUR-api:

src/skmatter/_selection.py

Lines changed: 74 additions & 2 deletions
@@ -1,5 +1,77 @@
-"""
-Sequential selection
+r"""
+This module contains data sub-selection modules primarily corresponding to
+methods derived from CUR matrix decomposition and Farthest Point Sampling. In
+their classical form, CUR and FPS determine a data subset that maximizes the
+variance (CUR) or distribution (FPS) of the features or samples. These methods
+can be modified to incorporate supervised target information, in the variants
+denoted `PCov-CUR` and `PCov-FPS`. For further reading, refer to
+[Imbalzano2018]_ and [Cersonsky2021]_. These selectors can be used for both
+feature and sample selection, with similar instantiations. All sub-selection
+methods score each feature or sample (without requiring an estimator) and
+choose the one with the maximum score. A simple example of usage:
+
+.. doctest::
+
+    >>> # feature selection
+    >>> import numpy as np
+    >>> from skmatter.feature_selection import CUR, FPS, PCovCUR, PCovFPS
+    >>> selector = CUR(
+    ...     # the number of selections to make
+    ...     # if None, set to half the samples or features
+    ...     # if float, fraction of the total dataset to select
+    ...     # if int, absolute number of selections to make
+    ...     n_to_select=2,
+    ...     # option to use `tqdm <https://tqdm.github.io/>`_ progress bar
+    ...     progress_bar=True,
+    ...     # float, cutoff score to stop selecting
+    ...     score_threshold=1e-12,
+    ...     # boolean, whether to select randomly after non-redundant selections
+    ...     # are exhausted
+    ...     full=False,
+    ... )
+    >>> X = np.array(
+    ...     [
+    ...         [0.12, 0.21, 0.02],  # 3 samples, 3 features
+    ...         [-0.09, 0.32, -0.10],
+    ...         [-0.03, -0.53, 0.08],
+    ...     ]
+    ... )
+    >>> y = np.array([0.0, 0.0, 1.0])  # classes of each sample
+    >>> selector.fit(X)
+    CUR(n_to_select=2, progress_bar=True, score_threshold=1e-12)
+    >>> Xr = selector.transform(X)
+    >>> print(Xr.shape)
+    (3, 2)
+    >>> selector = PCovCUR(n_to_select=2)
+    >>> selector.fit(X, y)
+    PCovCUR(n_to_select=2)
+    >>> Xr = selector.transform(X)
+    >>> print(Xr.shape)
+    (3, 2)
+    >>>
+    >>> # Now sample selection
+    >>> from skmatter.sample_selection import CUR, FPS, PCovCUR, PCovFPS
+    >>> selector = CUR(n_to_select=2)
+    >>> selector.fit(X)
+    CUR(n_to_select=2)
+    >>> Xr = X[selector.selected_idx_]
+    >>> print(Xr.shape)
+    (2, 3)
+
+These selectors are available:
+
+* :ref:`CUR-api`: an iterative feature selection method based upon the
+  singular value decomposition.
+* :ref:`PCov-CUR-api`: extends CUR by using augmented right or left
+  singular vectors, inspired by Principal Covariates Regression.
+* :ref:`FPS-api`: a common selection technique intended to exploit the diversity of
+  the input space. The selection of the first point is made at random or by a
+  separate metric.
+* :ref:`PCov-FPS-api`: extends FPS much like PCov-CUR extends CUR.
+* :ref:`Voronoi-FPS-api`: conducts FPS selection, taking advantage of Voronoi
+  tessellations to accelerate selection.
+* :ref:`DCH-api`: selects samples by constructing a directional convex hull and
+  determining which samples lie on the bounding surface.
 """

 import numbers
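
The doctest above covers CUR and PCov-CUR; as a small complementary sketch, plain farthest point sampling on the sample axis might look as follows. It assumes ``FPS`` accepts ``n_to_select`` and an ``initialize`` argument giving the index of the first selected sample, and exposes ``selected_idx_`` after fitting, mirroring the CUR usage shown above; treat it as illustrative rather than canonical.

>>> import numpy as np
>>> from skmatter.sample_selection import FPS
>>> rng = np.random.RandomState(0)
>>> X = rng.normal(size=(10, 3))  # 10 samples, 3 features
>>> # start the farthest-point search from sample 0 and pick 4 samples
>>> fps = FPS(n_to_select=4, initialize=0).fit(X)
>>> X_selected = X[fps.selected_idx_]  # the four most spread-out samples
>>> print(X_selected.shape)
(4, 3)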

src/skmatter/decomposition/__init__.py

Lines changed: 25 additions & 3 deletions
@@ -1,6 +1,28 @@
-"""
-The :mod:`skmatter.decomposition` module includes the two distance
-measures, as defined by Principal Covariates Regression (PCovR)
+r"""
+Often, one wants to construct new ML features from their current representation
+in order to compress data or visualise trends in the dataset. In the archetypal
+method for this dimensionality reduction, principal components analysis (PCA),
+features are transformed into the latent space which best preserves the
+variance of the original data. This module provides Principal Covariates
+Regression (PCovR), introduced by [deJong1992]_, a modification to PCA
+that incorporates target information, such that the resulting embedding can
+be tuned using a mixing parameter α to improve performance in regression tasks
+(:math:`\alpha = 0` corresponding to linear regression and :math:`\alpha = 1`
+corresponding to PCA). [Helfrecht2020]_ introduced the non-linear version,
+Kernel Principal Covariates Regression (KPCovR), where the mixing parameter α
+now interpolates between kernel ridge regression (:math:`\alpha = 0`) and
+kernel principal components analysis (KPCA, :math:`\alpha = 1`).
+
+The module includes:
+
+* :ref:`PCovR-api`: the standard Principal Covariates Regression. Utilises a
+  combination of a PCA-like and an LR-like loss, and therefore attempts to find
+  a low-dimensional projection of the feature vectors that simultaneously minimises
+  information loss and error in predicting the target properties using only the
+  latent space vectors :math:`\mathbf{T}`.
+* :ref:`KPCovR-api`: Kernel Principal Covariates Regression, a kernel-based
+  variation on the original PCovR method, proposed in [Helfrecht2020]_.
 """

 from ._kernel_pcovr import KernelPCovR
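
To make the role of the mixing parameter concrete, here is a minimal sketch of PCovR usage. It assumes ``PCovR`` takes ``mixing`` (the α above) and ``n_components`` and follows the scikit-learn ``fit``/``transform``/``predict`` pattern; the data are arbitrary illustration values, not part of the library's documented examples.

>>> import numpy as np
>>> from skmatter.decomposition import PCovR
>>> rng = np.random.RandomState(0)
>>> X = rng.normal(size=(10, 4))  # 10 samples, 4 features
>>> y = X @ np.array([1.0, -2.0, 0.5, 0.0])  # a linear target, for illustration only
>>> # mixing=0.5 balances the PCA-like and regression-like parts of the loss
>>> pcovr = PCovR(mixing=0.5, n_components=2).fit(X, y)
>>> T = pcovr.transform(X)  # latent-space projection
>>> y_pred = pcovr.predict(X)  # regression carried out through the latent space
>>> print(T.shape)
(10, 2)

Setting ``mixing`` closer to 1 pushes the embedding toward an ordinary PCA latent space; closer to 0, toward directions chosen purely for regression.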
