
Commit 7ef05eb

committed
overhaul of documentation
* adding general description of the library
* adding intro page for new users with an overview of the functionalities
* adding contribution guidelines for adding datasets, submitting issues and bugfixes
* tweaking reconstruction measures doc inside the code a bit
1 parent 724993b commit 7ef05eb

File tree

8 files changed: +173 −4 lines changed


docs/source/contributing.rst

Lines changed: 116 additions & 0 deletions
@@ -39,6 +39,122 @@ You may want to setup your editor to automatically apply the
files, there are plugins to do this with `all major
editors <https://black.readthedocs.io/en/stable/editor_integration.html>`_.

Issues and Pull Requests
########################

Having a problem with scikit-COSMO? Please let us know by `submitting an issue <https://github.com/lab-cosmo/scikit-cosmo/issues>`_.

Submit new features or bug fixes through a `pull request <https://github.com/lab-cosmo/scikit-cosmo/pulls>`_.

Contributing Datasets
#####################

Have an example dataset that would fit into scikit-COSMO?

Contributing a dataset is easy. First, copy your numpy file into
``skcosmo/datasets/data/`` with an informative name. Here, we'll call it ``my-dataset.npz``.

Next, create a documentation file in ``skcosmo/datasets/descr/my-dataset.rst``.
This file should look like this:

.. code-block::

  .. _my-dataset:

  My Dataset
  ##########

  This is a summary of my dataset. My dataset was originally published in My Paper.

  Function Call
  -------------

  .. function:: skcosmo.datasets.load_my_dataset

  Data Set Characteristics
  ------------------------

  :Number of Instances: ______

  :Number of Features: ______

  The representations were computed using the _____ package with the hyperparameters:

  +------------------------+------------+
  | key                    | value      |
  +------------------------+------------+
  | hyperparameter 1       | _____      |
  +------------------------+------------+
  | hyperparameter 2       | _____      |
  +------------------------+------------+

  Of the ____ resulting features, ____ were selected via _____.

  References
  ----------

  Reference Code
  --------------
Then, show ``scikit-cosmo`` how to load your data by adding a loader function to
``skcosmo/datasets/_base.py``. It should look like this:

.. code-block:: python

  def load_my_dataset():
      """Load and return my dataset.

      Returns
      -------
      my_data : sklearn.utils.Bunch
          Dictionary-like object, with the following attributes:

          data : `sklearn.utils.Bunch` --
              contains the keys ``X`` and ``y``,
              my input vectors and properties, respectively.

          DESCR: `str` --
              The full description of the dataset.
      """
      module_path = dirname(__file__)
      target_filename = join(module_path, "data", "my-dataset.npz")
      raw_data = np.load(target_filename)
      data = Bunch(
          X=raw_data["X"],
          y=raw_data["y"],
      )
      with open(join(module_path, "descr", "my-dataset.rst")) as rst_file:
          fdescr = rst_file.read()

      return Bunch(data=data, DESCR=fdescr)

Add this function to ``skcosmo/datasets/__init__.py``.
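The loader returns a nested :obj:`sklearn.utils.Bunch`, a dictionary whose keys are also readable as attributes. The sketch below uses a minimal stand-in class (an illustration only, not the real sklearn implementation) to show the shape of the object your loader returns:

```python
# Minimal stand-in for sklearn.utils.Bunch (illustration only): a dict
# whose keys can also be read as attributes.
class Bunch(dict):
    def __getattr__(self, key):
        try:
            return self[key]
        except KeyError as err:
            raise AttributeError(key) from err

# Mimic the nested structure a dataset loader returns.
my_data = Bunch(
    data=Bunch(X=[[0.0, 1.0], [2.0, 3.0]], y=[0.5, 1.5]),
    DESCR="My Dataset\n##########\n...",
)

print(my_data.data.X[0])     # attribute access
print(my_data["data"]["y"])  # equivalent key access
```

Attribute access (``my_data.data.X``) and key access (``my_data["data"]["X"]``) are interchangeable, which is why the tests below can use either style.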
Finally, add a test to ``skcosmo/tests/test_datasets.py`` to check that your dataset
loads properly. It should look something like this:

.. code-block:: python

  class MyDatasetTests(unittest.TestCase):
      @classmethod
      def setUpClass(cls):
          cls.my_data = load_my_dataset()

      def test_load_my_data(self):
          # check that representations and properties have commensurate shape
          self.assertTrue(self.my_data.data.X.shape[0] == self.my_data.data.y.shape[0])

      def test_load_my_data_descr(self):
          # check that the description loads without error
          self.my_data.DESCR

You're good to go! Time to submit a `pull request <https://github.com/lab-cosmo/scikit-cosmo/pulls>`_.
License
#######

docs/source/datasets.rst

Lines changed: 2 additions & 1 deletion
@@ -1,6 +1,7 @@
Datasets
========

.. include:: ../../skcosmo/datasets/descr/degenerate_CH4_manifold.rst

.. include:: ../../skcosmo/datasets/descr/csd-1000r.rst

docs/source/gfrm.rst

Lines changed: 6 additions & 0 deletions
@@ -6,18 +6,24 @@ Reconstruction Measures
.. currentmodule:: skcosmo.metrics


.. _GRE-api:

Global Reconstruction Error
###########################

.. autofunction:: pointwise_global_reconstruction_error
.. autofunction:: global_reconstruction_error

.. _GRD-api:

Global Reconstruction Distortion
################################

.. autofunction:: pointwise_global_reconstruction_distortion
.. autofunction:: global_reconstruction_distortion

.. _LRE-api:

Local Reconstruction Error
##########################

docs/source/index.rst

Lines changed: 14 additions & 0 deletions
@@ -5,6 +5,20 @@ scikit-cosmo documentation
compatible utilities that implement methods developed in the `COSMO laboratory
<https://cosmo.epfl.ch>`_.

Convenient-to-use libraries such as scikit-learn have accelerated the adoption and application
of machine learning (ML) workflows and data-driven methods. Such libraries have gained great
popularity partly because the implemented methods are generally applicable across multiple domains.
While developments in the atomistic learning community have put forward general-use machine
learning methods, their deployment is commonly entangled with domain-specific functionalities,
preventing access to a wider audience.

scikit-COSMO targets domain-agnostic implementations of methods developed in the
computational chemical and materials science community, following the
scikit-learn API and coding guidelines to promote usability and interoperability
with existing workflows. scikit-COSMO contains a toolbox of methods for
unsupervised and supervised analysis of ML datasets, including the comparison,
decomposition, and selection of features and samples.

.. toctree::
   :maxdepth: 1
   :caption: Contents:

docs/source/intro.rst

Lines changed: 27 additions & 0 deletions
@@ -12,4 +12,31 @@ Currently, scikit-COSMO contains models described in [Imbalzano2018]_, [Helfrech
as some modifications to sklearn functionalities and minimal datasets that are useful within the field
of computational materials science and chemistry.

- Fingerprint Selection:
  Multiple data sub-selection modules, for selecting the most relevant features and samples out of a large set of candidates [Imbalzano2018]_, [Helfrecht2020]_ and [Cersonsky2021]_.

  * :ref:`CUR-api` decomposition: an iterative feature selection method based upon the singular value decomposition.
  * :ref:`PCov-CUR-api` decomposition extends upon CUR by using augmented right or left singular vectors inspired by Principal Covariates Regression.
  * :ref:`FPS-api`: a common selection technique intended to exploit the diversity of the input space. The selection of the first point is made at random or by a separate metric.
  * :ref:`PCov-FPS-api` extends upon FPS much like PCov-CUR does to CUR.
  * :ref:`Voronoi-FPS-api`: conducts FPS selection, taking advantage of Voronoi tessellations to accelerate the selection.
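The greedy idea behind farthest point sampling can be sketched in a few lines of plain Python. This is a toy illustration only (the ``fps`` helper below is hypothetical, not the library API): starting from a given sample, repeatedly pick the sample farthest from everything selected so far. The library selectors operate on numpy arrays and add the PCov-augmented and Voronoi-accelerated variants.

```python
import math

def fps(X, n_to_select, initial=0):
    """Toy farthest point sampling over rows of a list-of-lists matrix."""
    selected = [initial]
    # distance of every sample to its nearest selected sample
    min_dist = [math.dist(x, X[initial]) for x in X]
    while len(selected) < n_to_select:
        # greedily pick the sample farthest from the selected set
        idx = max(range(len(X)), key=lambda i: min_dist[i])
        selected.append(idx)
        min_dist = [min(d, math.dist(x, X[idx])) for d, x in zip(min_dist, X)]
    return selected

X = [[0.0, 0.0], [0.1, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(fps(X, 3))  # → [0, 2, 3]
```

Note how the near-duplicate sample ``[0.1, 0.0]`` is skipped: FPS favors diverse samples, which is exactly the property the selectors above exploit.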
- Reconstruction Measures:
  A set of easily interpretable error measures of the relative information capacity of feature space `F` with respect to feature space `F'`.
  The methods return a value between 0 and 1, where 0 means that `F` and `F'` are completely distinct in terms of linearly-decodable information, and where 1 means that `F'` is contained in `F`.
  All methods are implemented as the root mean-square error for the regression of the feature matrix `X_F'` (sometimes called `Y` in the doc) from `X_F` (sometimes called `X` in the doc) for transformations with different constraints (linear, orthogonal, locally linear).
  By default, a custom 2-fold cross-validation :py:class:`skcosmo.linear_model.RidgeRegression2FoldCV` is used to ensure the generalization of the transformation and the efficiency of the computation, since we deal with a multi-target regression problem.
  These methods were applied to compare different forms of featurization through different hyperparameters and induced metrics and kernels in [Goscinski2021]_.

  * :ref:`GRE-api` (GRE) computes the amount of linearly-decodable information recovered through a global linear reconstruction.
  * :ref:`GRD-api` (GRD) computes the amount of distortion contained in a global linear reconstruction.
  * :ref:`LRE-api` (LRE) computes the amount of decodable information recovered through a local linear reconstruction for the k-nearest neighborhood of each sample.
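To make the regression-based construction concrete, here is a pure-Python sketch of the underlying RMSE step for a single source and target feature column. The ``gre_1d`` helper is hypothetical and for illustration only; the library measures work on full feature matrices and use cross-validated ridge regression rather than plain least squares.

```python
import math

def gre_1d(x, y):
    """Toy 1-D reconstruction error: RMSE of regressing feature y from feature x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # ordinary least-squares slope and intercept
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var = sum((xi - mean_x) ** 2 for xi in x)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    # root-mean-square error of the linear reconstruction
    return math.sqrt(sum((slope * xi + intercept - yi) ** 2
                         for xi, yi in zip(x, y)) / n)

# y is an exact linear function of x, so the linear reconstruction is perfect
print(gre_1d([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]))  # → 0.0
```

A nonlinear relation (e.g. ``y = x**2``) yields a nonzero error, reflecting information that no linear transformation of `x` can decode.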
- Principal Covariates Regression

  * :ref:`PCovR-api`: the standard Principal Covariates Regression [deJong1992]_. It utilises a combination of a PCA-like and an LR-like loss, and therefore attempts to find a low-dimensional projection of the feature vectors that simultaneously minimises information loss and the error in predicting the target properties using only the latent-space vectors :math:`\mathbf{T}`.
  * :ref:`KPCovR-api`: Kernel Principal Covariates Regression (KPCovR), a kernel-based variation on the original PCovR method, proposed in [Helfrecht2020]_.

If you would like to contribute to scikit-COSMO, check out our :ref:`contributing` page!

docs/source/selection.rst

Lines changed: 5 additions & 0 deletions
@@ -112,6 +112,7 @@ They are instantiated using
    Xr = selector.transform(X)

.. _PCov-CUR-api:

PCov-CUR
########
@@ -204,6 +205,8 @@ These selectors can be instantiated using
    Xr = selector.transform(X)

.. _PCov-FPS-api:

PCov-FPS
########
PCov-FPS extends upon FPS much like PCov-CUR does to CUR. Instead of using the
@@ -247,6 +250,8 @@ be instantiated using
    Xr = selector.transform(X)

.. _Voronoi-FPS-api:

Voronoi FPS
###########

docs/source/tutorials.rst

Lines changed: 1 addition & 0 deletions
@@ -29,4 +29,5 @@ check out the pedagogic notebooks in our companion project `kernel-tutorials <ht
   :caption: Feature Reconstruction Measures

   read-only-examples/PlotGFRE
   read-only-examples/PlotPointwiseGFRE.ipynb
   read-only-examples/PlotLFRE

skcosmo/linear_model/_ridge.py

Lines changed: 2 additions & 3 deletions
@@ -26,9 +26,8 @@ class RidgeRegression2FoldCV(MultiOutputMixin, RegressorMixin):
and in general more accurate, see issue #40. However, it is constrained to an SVD
solver for the matrix inversion.
It offers additional functionality in comparison to :obj:`sklearn.linear_model.Ridge`:
the regularization parameters can be chosen relative to the largest eigenvalue of the feature matrix,
as can the regularization method. Details are explained in the `Parameters` section.

Parameters
----------
