
Commit 7ef05eb

committed
overhaul of documentation
* adding general description of the library
* adding intro page for new users with an overview of the functionalities
* adding contribution guidelines for adding datasets, submitting issues and bugfixes
* tweaking reconstruction measures doc inside the code a bit
1 parent 724993b commit 7ef05eb

File tree

8 files changed: +173 −4 lines changed


docs/source/contributing.rst

Lines changed: 116 additions & 0 deletions
@@ -39,6 +39,122 @@ You may want to setup your editor to automatically apply the
files, there are plugins to do this with `all major
editors <https://black.readthedocs.io/en/stable/editor_integration.html>`_.

Issues and Pull Requests
########################

Having a problem with scikit-COSMO? Please let us know by `submitting an issue <https://github.com/lab-cosmo/scikit-cosmo/issues>`_.

Submit new features or bug fixes through a `pull request <https://github.com/lab-cosmo/scikit-cosmo/pulls>`_.

Contributing Datasets
#####################

Have an example dataset that would fit into scikit-COSMO?

Contributing a dataset is easy. First, copy your numpy file into
``skcosmo/datasets/data/`` with an informative name. Here, we'll call it ``my-dataset.npz``.

Next, create a documentation file in ``skcosmo/datasets/descr/my-dataset.rst``.
This file should look like this:

.. code-block::

  .. _my-dataset:

  My Dataset
  ##########

  This is a summary of my dataset. My dataset was originally published in My Paper.

  Function Call
  -------------

  .. function:: skcosmo.datasets.load_my_dataset

  Data Set Characteristics
  ------------------------

  :Number of Instances: ______

  :Number of Features: ______

  The representations were computed using the _____ package with the hyperparameters:

  +------------------------+------------+
  | key                    | value      |
  +------------------------+------------+
  | hyperparameter 1       | _____      |
  +------------------------+------------+
  | hyperparameter 2       | _____      |
  +------------------------+------------+

  Of the ____ resulting features, ____ were selected via _____.

  References
  ----------

  Reference Code
  --------------
Then, show ``scikit-cosmo`` how to load your data by adding a loader function to
``skcosmo/datasets/_base.py``. It should look like this:

.. code-block:: python

  def load_my_dataset():
      """Load and return my dataset.

      Returns
      -------
      my_data : sklearn.utils.Bunch
          Dictionary-like object, with the following attributes:

          data : `sklearn.utils.Bunch` --
              contains the keys ``X`` and ``y``,
              my input vectors and properties, respectively.

          DESCR: `str` --
              The full description of the dataset.
      """
      module_path = dirname(__file__)
      target_filename = join(module_path, "data", "my-dataset.npz")
      raw_data = np.load(target_filename)
      data = Bunch(
          X=raw_data["X"],
          y=raw_data["y"],
      )
      with open(join(module_path, "descr", "my-dataset.rst")) as rst_file:
          fdescr = rst_file.read()

      return Bunch(data=data, DESCR=fdescr)

Add this function to ``skcosmo/datasets/__init__.py``.
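The loader returns a nested :obj:`sklearn.utils.Bunch`, a dictionary whose keys are also readable as attributes. The sketch below uses a minimal stand-in class (an illustration only, not the real sklearn implementation) to show the shape of the object your loader returns:

```python
# Minimal stand-in for sklearn.utils.Bunch (illustration only): a dict
# whose keys can also be read as attributes.
class Bunch(dict):
    def __getattr__(self, key):
        try:
            return self[key]
        except KeyError as err:
            raise AttributeError(key) from err

# Mimic the nested structure a dataset loader returns.
my_data = Bunch(
    data=Bunch(X=[[0.0, 1.0], [2.0, 3.0]], y=[0.5, 1.5]),
    DESCR="My Dataset\n##########\n...",
)

print(my_data.data.X[0])     # attribute access
print(my_data["data"]["y"])  # equivalent key access
```

Attribute access (``my_data.data.X``) and key access (``my_data["data"]["X"]``) are interchangeable, which is why the tests below can use either style.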
Finally, add a test to ``skcosmo/tests/test_datasets.py`` to check that your dataset
loads properly. It should look something like this:

.. code-block:: python

  class MyDatasetTests(unittest.TestCase):
      @classmethod
      def setUpClass(cls):
          cls.my_data = load_my_dataset()

      def test_load_my_data(self):
          # check that representations and properties have commensurate shape
          self.assertTrue(self.my_data.data.X.shape[0] == self.my_data.data.y.shape[0])

      def test_load_my_data_descr(self):
          # check that the description loads without error
          self.my_data.DESCR

You're good to go! Time to submit a `pull request <https://github.com/lab-cosmo/scikit-cosmo/pulls>`_.
License
#######

docs/source/datasets.rst

Lines changed: 2 additions & 1 deletion
@@ -1,6 +1,7 @@
Datasets
========

.. include:: ../../skcosmo/datasets/descr/degenerate_CH4_manifold.rst

.. include:: ../../skcosmo/datasets/descr/csd-1000r.rst

docs/source/gfrm.rst

Lines changed: 6 additions & 0 deletions
@@ -6,18 +6,24 @@ Reconstruction Measures
.. currentmodule:: skcosmo.metrics


.. _GRE-api:

Global Reconstruction Error
###########################

.. autofunction:: pointwise_global_reconstruction_error
.. autofunction:: global_reconstruction_error

.. _GRD-api:

Global Reconstruction Distortion
################################

.. autofunction:: pointwise_global_reconstruction_distortion
.. autofunction:: global_reconstruction_distortion

.. _LRE-api:

Local Reconstruction Error
##########################

docs/source/index.rst

Lines changed: 14 additions & 0 deletions
@@ -5,6 +5,20 @@ scikit-cosmo documentation
compatible utilities that implement methods developed in the `COSMO laboratory
<https://cosmo.epfl.ch>`_.

Convenient-to-use libraries such as scikit-learn have accelerated the adoption and application
of machine learning (ML) workflows and data-driven methods. Such libraries have gained great
popularity partly because the implemented methods are generally applicable across multiple domains.
While developments in the atomistic learning community have put forward general-use machine
learning methods, their deployment is commonly entangled with domain-specific functionalities,
preventing access to a wider audience.

scikit-COSMO targets domain-agnostic implementations of methods developed in the
computational chemical and materials science community, following the
scikit-learn API and coding guidelines to promote usability and interoperability
with existing workflows. scikit-COSMO contains a toolbox of methods for
unsupervised and supervised analysis of ML datasets, including the comparison,
decomposition, and selection of features and samples.

.. toctree::
   :maxdepth: 1
   :caption: Contents:

docs/source/intro.rst

Lines changed: 27 additions & 0 deletions
@@ -12,4 +12,31 @@ Currently, scikit-COSMO contains models described in [Imbalzano2018]_, [Helfrech
as some modifications to sklearn functionalities and minimal datasets that are useful within the field
of computational materials science and chemistry.

- Fingerprint Selection:
  Multiple data sub-selection modules, for selecting the most relevant features and samples out of a large set of candidates [Imbalzano2018]_, [Helfrecht2020]_ and [Cersonsky2021]_.

  * :ref:`CUR-api` decomposition: an iterative feature selection method based upon the singular value decomposition.
  * :ref:`PCov-CUR-api` decomposition extends upon CUR by using augmented right or left singular vectors inspired by Principal Covariates Regression.
  * :ref:`FPS-api`: a common selection technique intended to exploit the diversity of the input space. The selection of the first point is made at random or by a separate metric.
  * :ref:`PCov-FPS-api` extends upon FPS much like PCov-CUR does to CUR.
  * :ref:`Voronoi-FPS-api`: conducts FPS selection, taking advantage of Voronoi tessellations to accelerate the selection.
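The greedy idea behind farthest point sampling can be sketched in a few lines of plain Python. This is a toy illustration only (the ``fps`` helper below is hypothetical, not the library API): starting from a given sample, repeatedly pick the sample farthest from everything selected so far. The library selectors operate on numpy arrays and add the PCov-augmented and Voronoi-accelerated variants.

```python
import math

def fps(X, n_to_select, initial=0):
    """Toy farthest point sampling over rows of a list-of-lists matrix."""
    selected = [initial]
    # distance of every sample to its nearest selected sample
    min_dist = [math.dist(x, X[initial]) for x in X]
    while len(selected) < n_to_select:
        # greedily pick the sample farthest from the selected set
        idx = max(range(len(X)), key=lambda i: min_dist[i])
        selected.append(idx)
        min_dist = [min(d, math.dist(x, X[idx])) for d, x in zip(min_dist, X)]
    return selected

X = [[0.0, 0.0], [0.1, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(fps(X, 3))  # → [0, 2, 3]
```

Note how the near-duplicate sample ``[0.1, 0.0]`` is skipped: FPS favors diverse samples, which is exactly the property the selectors above exploit.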
- Reconstruction Measures:
  A set of easily interpretable error measures of the relative information capacity of feature space `F` with respect to feature space `F'`.
  The methods return a value between 0 and 1, where 0 means that `F` and `F'` are completely distinct in terms of linearly-decodable information, and where 1 means that `F'` is contained in `F`.
  All methods are implemented as the root mean-square error for the regression of the feature matrix `X_F'` (sometimes called `Y` in the doc) from `X_F` (sometimes called `X` in the doc) for transformations with different constraints (linear, orthogonal, locally linear).
  By default, a custom 2-fold cross-validation :py:class:`skcosmo.linear_model.RidgeRegression2FoldCV` is used to ensure the generalization of the transformation and the efficiency of the computation, since we deal with a multi-target regression problem.
  These methods were applied to compare different forms of featurization through different hyperparameters and induced metrics and kernels in [Goscinski2021]_.

  * :ref:`GRE-api` (GRE) computes the amount of linearly-decodable information recovered through a global linear reconstruction.
  * :ref:`GRD-api` (GRD) computes the amount of distortion contained in a global linear reconstruction.
  * :ref:`LRE-api` (LRE) computes the amount of decodable information recovered through a local linear reconstruction for the k-nearest neighborhood of each sample.
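To make the regression-based construction concrete, here is a pure-Python sketch of the underlying RMSE step for a single source and target feature column. The ``gre_1d`` helper is hypothetical and for illustration only; the library measures work on full feature matrices and use cross-validated ridge regression rather than plain least squares.

```python
import math

def gre_1d(x, y):
    """Toy 1-D reconstruction error: RMSE of regressing feature y from feature x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # ordinary least-squares slope and intercept
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var = sum((xi - mean_x) ** 2 for xi in x)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    # root-mean-square error of the linear reconstruction
    return math.sqrt(sum((slope * xi + intercept - yi) ** 2
                         for xi, yi in zip(x, y)) / n)

# y is an exact linear function of x, so the linear reconstruction is perfect
print(gre_1d([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]))  # → 0.0
```

A nonlinear relation (e.g. ``y = x**2``) yields a nonzero error, reflecting information that no linear transformation of `x` can decode.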
- Principal Covariates Regression

  * :ref:`PCovR-api`: the standard Principal Covariates Regression [deJong1992]_. It utilises a combination of a PCA-like and an LR-like loss, and therefore attempts to find a low-dimensional projection of the feature vectors that simultaneously minimises information loss and the error in predicting the target properties using only the latent-space vectors :math:`\mathbf{T}`.
  * :ref:`KPCovR-api`: Kernel Principal Covariates Regression (KPCovR), a kernel-based variation on the original PCovR method, proposed in [Helfrecht2020]_.

If you would like to contribute to scikit-COSMO, check out our :ref:`contributing` page!

docs/source/selection.rst

Lines changed: 5 additions & 0 deletions
@@ -112,6 +112,7 @@ They are instantiated using
    Xr = selector.transform(X)

.. _PCov-CUR-api:

PCov-CUR
########
@@ -204,6 +205,8 @@ These selectors can be instantiated using
    Xr = selector.transform(X)

.. _PCov-FPS-api:

PCov-FPS
########
PCov-FPS extends upon FPS much like PCov-CUR does to CUR. Instead of using the
@@ -247,6 +250,8 @@ be instantiated using
    Xr = selector.transform(X)

.. _Voronoi-FPS-api:

Voronoi FPS
###########

docs/source/tutorials.rst

Lines changed: 1 addition & 0 deletions
@@ -29,4 +29,5 @@ check out the pedagogic notebooks in our companion project `kernel-tutorials <ht
   :caption: Feature Reconstruction Measures

   read-only-examples/PlotGFRE
   read-only-examples/PlotPointwiseGFRE.ipynb
   read-only-examples/PlotLFRE

skcosmo/linear_model/_ridge.py

Lines changed: 2 additions & 3 deletions
@@ -26,9 +26,8 @@ class RidgeRegression2FoldCV(MultiOutputMixin, RegressorMixin):
and in general more accurate, see issue #40. However, it is constrained to an SVD
solver for the matrix inversion.
It offers additional functionality in comparison to :obj:`sklearn.linear_model.Ridge`:
the regularization parameters can be chosen relative to the largest eigenvalue of the feature matrix,
as can the regularization method. Details are explained in the `Parameters` section.

Parameters
----------
