Qolmat
=

The Qolmat package is designed for the implementation and comparison of imputation methods. It can be divided into two main parts:

1. Imputation of missing values via multiple algorithms;
2. Comparison of the imputation methods in a supervised manner.

### **Imputation methods**

For univariate time series (a pandas sketch of these strategies follows the list):

* ```ImputeByMean```/```ImputeByMedian```/```ImputeByMode```: replaces missing entries with the mean, median or mode of each column. It uses ```pd.DataFrame.fillna()```.
* ```RandomImpute```: replaces each missing entry with a value sampled at random from the observed values of the same column.
* ```ImputeLOCF```/```ImputeNOCB```: replaces missing entries by carrying the last observation forward / the next observation backward, for each column.
* ```ImputeByInterpolation```: replaces missing entries using the interpolation strategies supported by ```pd.Series.interpolate```.
* ```ImputeRPCA```: imputes values via an RPCA method.
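The univariate wrappers are documented above as relying on pandas. Since the package's own call signatures are not shown in this README, here is a minimal pandas-only sketch of the underlying strategies (the variable names and the toy series are illustrative, not part of the package):

```python
import numpy as np
import pandas as pd

# Toy univariate series with missing entries.
s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

# Mean / median / mode, the strategies behind ImputeByMean / ImputeByMedian / ImputeByMode.
s_mean = s.fillna(s.mean())
s_median = s.fillna(s.median())
s_mode = s.fillna(s.mode()[0])

# Last observation carried forward / next observation carried backward (ImputeLOCF / ImputeNOCB).
s_locf = s.ffill()
s_nocb = s.bfill()

# Any strategy supported by pd.Series.interpolate (ImputeByInterpolation).
s_interp = s.interpolate(method="linear")

# Random draw from the observed values of the same series (RandomImpute).
rng = np.random.default_rng(42)
s_rand = s.copy()
s_rand[s.isna()] = rng.choice(s.dropna().to_numpy(), size=int(s.isna().sum()))
```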
For multivariate time series (a scikit-learn sketch follows the list):

* ```ImputeKNN```: replaces missing entries using the k nearest neighbours. It uses ```sklearn.impute.KNNImputer```.
* ```ImputeIterative```: imputes each Series within a DataFrame multiple times, iterating fits and transforms until the imputation stabilises. It uses ```sklearn.impute.IterativeImputer```.
* ```ImputeRegressor```: imputes each Series with missing values using a regression model whose features are based on the complete columns only.
* ```ImputeStochasticRegressor```: imputes each Series with missing values using a stochastic regression model whose features are based on the complete columns only.
* ```ImputeRPCA```: imputes values via an RPCA method.
* ```ImputeEM```: imputes missing values with a multivariate Gaussian model, fitted via EM optimisation and a projected (Ornstein-Uhlenbeck) process.
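```ImputeKNN``` and ```ImputeIterative``` are documented above as wrapping scikit-learn imputers, so a minimal sketch of those underlying estimators (the toy dataframe and parameter values are illustrative) might look like:

```python
import numpy as np
import pandas as pd
# This import is required: IterativeImputer is still experimental in scikit-learn.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer

df = pd.DataFrame(
    {
        "a": [1.0, np.nan, 3.0, 4.0],
        "b": [2.0, 2.5, np.nan, 4.5],
        "c": [0.1, 0.2, 0.3, np.nan],
    }
)

# k-nearest-neighbours imputation, the estimator behind ImputeKNN.
df_knn = pd.DataFrame(KNNImputer(n_neighbors=2).fit_transform(df), columns=df.columns)

# Iterative (round-robin) imputation, the estimator behind ImputeIterative.
df_iter = pd.DataFrame(
    IterativeImputer(max_iter=10, random_state=0).fit_transform(df), columns=df.columns
)
```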
### **Comparator**

The ```Comparator``` class implements a way to compare multiple imputation methods. It is based on the standard approach of selecting some observed values, setting them to missing, and comparing their imputations with the true values.

More specifically, starting from the initial dataframe with missing values, we generate additional missing values (N samples/times). Missing values can be generated following three mechanisms, MCAR, MAR and MNAR (a minimal sketch of two of these mechanisms follows the list below):
* In the MCAR setting, each value is masked according to the realisation of a Bernoulli random variable with a fixed parameter.
* In the MAR setting, for each experiment, a fixed subset of variables that cannot have missing values is sampled. The remaining variables then have missing values according to a logistic model with random weights, which takes the non-missing variables as inputs. A bias term is fitted using line search to attain the desired proportion of missing values.
* Finally, two different mechanisms are implemented in the MNAR setting:
  * The first is identical to the previously described MAR mechanism, but the inputs of the logistic model are themselves masked by an MCAR mechanism. Hence, the logistic model's outcome now depends on potentially missing values.
  * The second mechanism, ``self masked``, samples a subset of variables whose values in the lower and upper p-th percentiles are masked according to a Bernoulli random variable, while the values in between are left untouched.
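Qolmat's hole-generation API is not shown in this README, so the following is a minimal numpy/pandas sketch of the two mechanisms that are simplest to reproduce, MCAR and ``self masked`` MNAR; the helper names and parameters are hypothetical:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def mask_mcar(df: pd.DataFrame, proba: float) -> pd.DataFrame:
    """MCAR: mask each observed value with a fixed Bernoulli probability."""
    bernoulli = rng.random(df.shape) < proba
    return df.mask(bernoulli & df.notna().to_numpy())

def mask_self_masked(df: pd.DataFrame, columns: list, p: float, proba: float) -> pd.DataFrame:
    """MNAR 'self masked': in the given columns, values in the lower and upper
    p-th percentiles are masked according to a Bernoulli random variable."""
    out = df.copy()
    for col in columns:
        low, high = out[col].quantile([p, 1.0 - p])
        extreme = (out[col] <= low) | (out[col] >= high)
        bernoulli = rng.random(len(out)) < proba
        out.loc[extreme & bernoulli, col] = np.nan
    return out
```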
On each sample, the different imputation models are tested and reconstruction errors are computed on these artificially missing entries. The errors of each imputation model are then averaged over the samples, eventually yielding a single error score per model. This procedure allows different models to be compared on the same dataset.
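Since the ```Comparator``` signature is not given in this README, here is a hypothetical, self-contained sketch of that procedure, reusing the ```mask_mcar``` helper and the ```df``` example from the sketches above:

```python
import numpy as np

def compare_imputers(df, imputers, n_samples=10, proba=0.1):
    """Hypothetical helper: repeatedly add MCAR holes, impute, and average
    the RMSE computed on the artificially masked entries."""
    scores = {name: [] for name in imputers}
    for _ in range(n_samples):
        df_holes = mask_mcar(df, proba)            # defined in the sketch above
        added = df_holes.isna() & df.notna()       # the artificially masked cells
        for name, impute in imputers.items():
            errors = (impute(df_holes) - df)[added]
            scores[name].append(float(np.sqrt(np.nanmean(np.square(errors.to_numpy())))))
    return {name: float(np.mean(vals)) for name, vals in scores.items()}

# Two simple imputation functions standing in for Qolmat imputers:
results = compare_imputers(
    df,
    {
        "mean": lambda d: d.fillna(d.mean()),
        "interpolate": lambda d: d.interpolate().ffill().bfill(),
    },
)
```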
<p align="center" width="100%">
<img src="docs/images/comparator.png" alt="comparator" width="60%"/>
</p>
### **Installation**

```
conda env create -f conda.yml
conda activate env_qolmat
```
### **Install pre-commit**

Once the environment is installed, pre-commit is installed too, but it needs to be activated using the following command:
```
pre-commit install
```