scikit-learn-contrib
diff --git a/‎.gitignore‎
Lines changed: 1 addition & 0 deletions b/‎.gitignore‎
diff --git a/‎.pre-commit-config.yaml‎
Lines changed: 2 additions & 0 deletions b/‎.pre-commit-config.yaml‎
diff --git a/‎README.rst‎
Lines changed: 104 additions & 132 deletions b/‎README.rst‎
@@ -9,6 +9,7 @@ __pycache__/
 # documentation specific
 docs/_build/
 docs/generated/
+docs/examples/tutorials/
 
 # Distribution / packaging
 
 
@@ -23,3 +23,5 @@ repos:
     rev: v1.1.1
     hooks:
       - id: mypy
+        args: [--ignore-missing-imports]
+        additional_dependencies: [types-requests]
@@ -39,165 +39,141 @@ Python 3.8+
 🛠 Installation
 ===============
 
-Install via `pip`:
+Qolmat can be installed in different ways:
 
 .. code:: sh
 
-    $ pip install qolmat
-
-If you need to use tensorflow, you can install it with the following 'pip' command:
-
-.. code:: sh
-
-    $ pip install qolmat[tensorflow]
-
-To install directly from the github repository :
-
-.. code:: sh
-
-    $ pip install git+https://github.com/Quantmetry/qolmat
+    $ pip install qolmat  # installation via pip
+    $ pip install qolmat[tensorflow]  # if you need tensorflow
+    $ pip install git+https://github.com/Quantmetry/qolmat  # or directly from the GitHub repository
 
 ⚡️ Quickstart
 ==============
 
-Let us start with a basic imputation problem. Here, we generate one-dimensional noisy time series.
-
-.. code-block:: python
+Let us start with a basic imputation problem.
+We load three columns of the ``Beijing`` dataset and create artificial holes in them.
+With just a few lines of code, you can
 
-    import matplotlib.pyplot as plt
-    import numpy as np
-    import pandas as pd
-
-    np.random.seed(42)
-    t = np.linspace(0,1,1000)
-    y = np.cos(2*np.pi*t*10)+np.random.randn(1000)/2
-    df = pd.DataFrame({'y': y}, index=pd.Series(t, name='index'))
-
-For this demonstration, let us create artificial holes in our dataset.
+- impute missing values with one particular imputer;
+- benchmark multiple imputation methods with different metrics.
 
 .. code-block:: python
 
-    from qolmat.utils.data import add_holes
-    plt.rcParams.update({'font.size': 18})
-
-    ratio_masked = 0.1
-    mean_size = 20
-    df_with_nan = add_holes(df, ratio_masked=ratio_masked, mean_size=mean_size)
-    is_na = df_with_nan['y'].isna()
+  import numpy as np
+  import pandas as pd
 
-    plt.figure(figsize=(25,4))
-    plt.plot(df_with_nan['y'],'.')
-    plt.plot(df.loc[is_na, 'y'],'.')
-    plt. grid()
-    plt.xlim(0,1)
+  from qolmat.benchmark import comparator, missing_patterns
+  from qolmat.imputations import imputers
+  from qolmat.utils import data
 
-    plt.legend(['Data', 'Missing data'])
-    plt.savefig('readme1.png')
-    plt.show()
-
-.. image:: https://raw.githubusercontent.com/Quantmetry/qolmat/main/docs/images/readme1.png
-    :align: center
+  # load and prepare csv data
 
-To impute missing data, there are several methods that can be imported with ``from qolmat.imputations import imputers``.
-The creation of an imputation dictionary will enable us to benchmark the various imputations.
+  df_data = data.get_data("Beijing")
+  columns = ["TEMP", "PRES", "WSPM"]
+  df_data = df_data[columns]
+  df_with_nan = data.add_holes(df_data, ratio_masked=0.2, mean_size=120)
 
-.. code-block:: python
-
-    from sklearn.linear_model import LinearRegression
-    from qolmat.imputations import imputers
-
-    imputer_mean = imputers.ImputerMean()
-    imputer_median = imputers.ImputerMedian()
-    imputer_mode = imputers.ImputerMode()
-    imputer_locf = imputers.ImputerLOCF()
-    imputer_nocb = imputers.ImputerNOCB()
-    imputer_interpol = imputers.ImputerInterpolation(method="linear")
-    imputer_spline = imputers.ImputerInterpolation(method="spline", order=2)
-    imputer_shuffle = imputers.ImputerShuffle()
-    imputer_residuals = imputers.ImputerResiduals(period=10, model_tsa="additive", extrapolate_trend="freq", method_interpolation="linear")
-    imputer_rpca = imputers.ImputerRPCA(columnwise=True, period=10, max_iter=200, tau=2, lam=.3)
-    imputer_rpca_opti = imputers.ImputerRPCA(columnwise=True, period = 10, max_iter=100)
-    imputer_ou = imputers.ImputerEM(model="multinormal", method="sample", max_iter_em=34, n_iter_ou=15, dt=1e-3)
-    imputer_tsou = imputers.ImputerEM(model="VAR1", method="sample", max_iter_em=34, n_iter_ou=15, dt=1e-3)
-    imputer_tsmle = imputers.ImputerEM(model="VAR1", method="mle", max_iter_em=34, n_iter_ou=15, dt=1e-3)
-    imputer_knn = imputers.ImputerKNN(k=10)
-    imputer_mice = imputers.ImputerMICE(estimator=LinearRegression(), sample_posterior=False, max_iter=100, missing_values=np.nan)
-    imputer_regressor = imputers.ImputerRegressor(estimator=LinearRegression())
-
-    dict_imputers = {
+  # impute and compare
+  imputer_mean = imputers.ImputerMean(groups=("station",))
+  imputer_interpol = imputers.ImputerInterpolation(method="linear", groups=("station",))
+  imputer_var1 = imputers.ImputerEM(model="VAR", groups=("station",), method="mle", max_iter_em=50, n_iter_ou=15, dt=1e-3, p=1)
+  dict_imputers = {
         "mean": imputer_mean,
-        "median": imputer_median,
-        "mode": imputer_mode,
         "interpolation": imputer_interpol,
-        "spline": imputer_spline,
-        "shuffle": imputer_shuffle,
-        "residuals": imputer_residuals,
-        "OU": imputer_ou,
-        "TSOU": imputer_tsou,
-        "TSMLE": imputer_tsmle,
-        "RPCA": imputer_rpca,
-        "RPCA_opti": imputer_rpca_opti,
-        "locf": imputer_locf,
-        "nocb": imputer_nocb,
-        "knn": imputer_knn,
-        "ols": imputer_regressor,
-        "mice_ols": imputer_mice,
-    }
-
-It is possible to define a parameter dictionary for an imputer with three pieces of information: min, max and type. The aim of the dictionary is to determine the optimal parameters for data imputation. Here, we call this dictionary ``dict_config_opti``.
-
-.. code-block:: python
-
-    search_params = {
-        "RPCA_opti": {
-            "tau": {"min": .5, "max": 5, "type":"Real"},
-            "lam": {"min": .1, "max": 1, "type":"Real"},
-        }
+        "VAR(1) process": imputer_var1
     }
-
-Then with the comparator function in ``from qolmat.benchmark import comparator``, we can compare the different imputation methods. This **does not use knowledge on missing values**, but it relies data masking instead. For more details on how imputors and comparator work, please see the following `link <https://qolmat.readthedocs.io/en/latest/explanation.html>`_.
-
-.. code-block:: python
-
-    from qolmat.benchmark import comparator
-
-    generator_holes = missing_patterns.EmpiricalHoleGenerator(n_splits=4, ratio_masked=0.1)
-
-    comparison = comparator.Comparator(
+  generator_holes = missing_patterns.EmpiricalHoleGenerator(n_splits=4, ratio_masked=0.1)
+  comparison = comparator.Comparator(
         dict_imputers,
-        ['y'],
+        columns,
         generator_holes = generator_holes,
         metrics = ["mae", "wmape", "KL_columnwise", "ks_test", "energy"],
-        n_calls_opt = 10,
-        dict_config_opti = dict_config_opti,
     )
-    results = comparison.compare(df_with_nan)
-
-We can observe the benchmark results.
+  results = comparison.compare(df_with_nan)
+  results.style.highlight_min(color="lightsteelblue", axis=1)
 
-.. image:: https://raw.githubusercontent.com/Quantmetry/qolmat/main/docs/images/readme2.png
+.. image:: https://raw.githubusercontent.com/Quantmetry/qolmat/main/docs/images/readme_tabular_comparison.png
     :align: center
 
-Finally, we keep the best ``TSMLE`` imputor we represent.
+📘 Documentation
+================
 
-.. code-block:: python
+The full documentation can be found `on this link <https://qolmat.readthedocs.io/en/latest/>`_.
+
+**How does Qolmat work?**
 
-    dfs_imputed =  imputer_tsmle.fit_transform(df_with_nan)
+Qolmat allows model selection for scikit-learn compatible imputation algorithms, by performing the three steps pictured below:
+
+1) For each of the K folds, Qolmat artificially masks a set of observed values using a default or user-specified `hole generator <explanation.html#hole-generator>`_.
+2) For each fold and each compared `imputation method <imputers.html>`_, Qolmat fills both the missing and the masked values, then computes each of the default or user-specified `performance metrics <explanation.html#metrics>`_.
+3) For each compared imputer, Qolmat pools the computed metrics from the K folds into a single value.
 
-    plt.figure(figsize=(25,5))
-    plt.plot(df['y'],'.g')
-    plt.plot(dfs_imputed['y'],'.r')
-    plt.plot(df_with_nan['y'],'.b')
-    plt.show()
+This is very similar in spirit to the `cross_val_score <https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html>`_ function in scikit-learn.
 
-.. image:: https://raw.githubusercontent.com/Quantmetry/qolmat/main/docs/images/readme3.png
+.. image:: https://raw.githubusercontent.com/Quantmetry/qolmat/main/docs/images/schema_qolmat.png
     :align: center
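
The three steps above can be illustrated with a minimal sketch in plain Python. This is not Qolmat's implementation: the helper names (``impute_mean``, ``impute_locf``, ``benchmark``), the uniform random masking, and the use of the mean absolute error as the only metric are assumptions made for illustration.

```python
import random
from statistics import mean


def impute_mean(values):
    """Fill missing entries (None) with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def impute_locf(values):
    """Carry the last observation forward; leading gaps take the first observed value."""
    last = next(v for v in values if v is not None)
    out = []
    for v in values:
        last = last if v is None else v
        out.append(last)
    return out


def benchmark(values, imputers, n_folds=4, ratio_masked=0.1, seed=0):
    """Return the mean absolute error of each imputer, averaged over the folds."""
    rng = random.Random(seed)
    observed_idx = [i for i, v in enumerate(values) if v is not None]
    n_masked = max(1, int(ratio_masked * len(observed_idx)))
    scores = {name: [] for name in imputers}
    for _ in range(n_folds):
        # Step 1: hide a random subset of the observed values.
        masked_idx = rng.sample(observed_idx, n_masked)
        corrupted = list(values)
        for i in masked_idx:
            corrupted[i] = None
        # Step 2: impute, then score only on the artificially masked entries,
        # since these are the only ones whose true value is known.
        for name, impute in imputers.items():
            imputed = impute(corrupted)
            mae = mean(abs(imputed[i] - values[i]) for i in masked_idx)
            scores[name].append(mae)
    # Step 3: pool the per-fold metrics into a single value per imputer.
    return {name: mean(fold_scores) for name, fold_scores in scores.items()}


# A steadily increasing series with two truly missing entries.
series = [float(i) for i in range(50)]
series[10] = None
series[30] = None

results = benchmark(series, {"mean": impute_mean, "locf": impute_locf})
print(results)
```

The score never uses the truly missing entries: their ground truth is unknown, which is why the comparison relies on artificially masked values.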
 
+**Imputation methods**
+
+The following table contains the available imputation methods. We distinguish single imputation methods (aiming for pointwise accuracy, mostly deterministic) from multiple imputation methods (aiming for distribution similarity, mostly stochastic).
+
+.. list-table::
+   :widths: 25 70 15 15
+   :header-rows: 1
+
+   * - Method
+     - Description
+     - Tabular or Time series
+     - Single or Multiple
+   * - mean
+     - Imputes the missing values using the mean along each column
+     - tabular
+     - single
+   * - median
+     - Imputes the missing values using the median along each column
+     - tabular
+     - single
+   * - LOCF
+     - Imputes missing entries by carrying the last observation forward for each column
+     - time series
+     - single
+   * - shuffle
+     - Imputes missing entries with a random value drawn from each column
+     - tabular
+     - multiple
+   * - interpolation
+     - Imputes missing values using one of the interpolation strategies supported by pd.Series.interpolate
+     - time series
+     - single
+   * - impute on residuals
+     - The series are de-seasonalised, residuals are imputed via linear interpolation, then residuals are re-seasonalised
+     - time series
+     - single
+   * - MICE
+     - Multiple Imputation by Chained Equations
+     - tabular
+     - both
+   * - RPCA
+     - Robust Principal Component Analysis
+     - both
+     - single
+   * - SoftImpute
+     - Iterative method for matrix completion that uses nuclear-norm regularization
+     - tabular
+     - single
+   * - KNN
+     - K-nearest neighbors
+     - tabular
+     - single
+   * - EM sampler
+     - Imputes missing values via the EM algorithm
+     - both
+     - both
+   * - TabDDPM
+     - Imputer based on Denoising Diffusion Probabilistic Models
+     - both
+     - both
 
-📘 Documentation
-================
 
-The full documentation can be found `on this link <https://qolmat.readthedocs.io/en/latest/>`_.
 
 📝 Contributing
 ===============
@@ -222,8 +198,6 @@ Qolmat has been developed by Quantmetry.
 🔍  References
 ==============
 
-Qolmat methods belong to the field of conformal inference.
-
 [1] Candès, Emmanuel J., et al. “Robust principal component analysis?.”
 Journal of the ACM (JACM) 58.3 (2011): 1-37,
 (`pdf <https://arxiv.org/abs/0912.3599>`__)
@@ -234,15 +208,13 @@ Journal of advanced transportation 2018 (2018).
 (`pdf <https://www.hindawi.com/journals/jat/2018/7191549/>`__)
 
 [3] Chen, Yuxin, et al. “Bridging convex and nonconvex optimization in
-robust PCA: Noise, outliers, and missing data.” arXiv preprint
-arXiv:2001.05484 (2020), (`pdf <https://arxiv.org/abs/2001.05484>`__)
+robust PCA: Noise, outliers, and missing data.” Annals of Statistics, 49(5), 2948 (2021), (`pdf <https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9491514/pdf/nihms-1782570.pdf>`__)
 
 [4] Shahid, Nauman, et al. “Fast robust PCA on graphs.” IEEE Journal of
 Selected Topics in Signal Processing 10.4 (2016): 740-756.
 (`pdf <https://arxiv.org/abs/1507.08173>`__)
 
-[5] Jiashi Feng, et al. “Online robust pca via stochastic opti-
-mization.“ Advances in neural information processing systems, 26, 2013.
+[5] Jiashi Feng, et al. “Online robust PCA via stochastic optimization.” Advances in neural information processing systems, 26, 2013.
 (`pdf <https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.721.7506&rep=rep1&type=pdf>`__)
 
 [6] García, S., Luengo, J., & Herrera, F. "Data preprocessing in data mining". 2015.