Commit 6003dd4

Julien Roussel authored and committed
Merge branch 'dev' into 'main'
New stable version released, including grouped hole generator and grouped imputers.

See merge request quantmetry/retd/qolmat!8
2 parents a5246df + b735615 commit 6003dd4

330 files changed, +439774 -68838 lines changed

.gitignore

Lines changed: 5 additions & 0 deletions
@@ -4,6 +4,11 @@ __pycache__/
 .ipynb_checkpoints
 /data
 /figures
+qolmat/notebooks/figures
+qolmat/notebooks/*.ipynb
 *.egg-info
 /dist
 /build
+*SOURCES.txt
+TimeSynth*
+/scripts

.pre-commit-config.yaml

Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
+repos:
+- repo: https://github.com/pre-commit/pre-commit-hooks
+rev: v2.3.0
+hooks:
+- id: check-yaml
+exclude: (docs/)
+- id: end-of-file-fixer
+exclude: (docs/)
+- id: trailing-whitespace
+exclude: (docs/)
+- repo: https://github.com/psf/black
+rev: 22.8.0
+hooks:
+- id: black
+exclude: (tests/)
+args:
+- "-l 99"
+# Flake8
+- repo: https://gitlab.com/pycqa/flake8
+rev: 4.0.1
+hooks:
+- id: flake8
+exclude: (tests/)
+args:
+- --max-line-length=99
+- --ignore=E302,E305,W503,E203,E203,E731,E402
+- --per-file-ignores=
+- */__init__.py:F401
+- qolmat/imputations/models.py:F401

.vscode/settings.json

Lines changed: 1 addition & 1 deletion
@@ -1,3 +1,3 @@
 {
 "editor.formatOnSave": true
-}
+}

README.md

Lines changed: 42 additions & 76 deletions
@@ -1,96 +1,62 @@
-RPCA for anomaly detection and data imputation
+Qolmat
 =
 
+The Qolmat package is created for the implementation and comparison of imputation methods. It can be divided into two main parts:
 
-## **Robust Principal Component Analysis**
+1. Impute missing values via multiple algorithms;
+2. Compare the imputation methods in a supervised manner.
 
-<details>
-<summary> What is robust principal component analysis? </summary>
+### **Imputation methods**
 
+For univariate time series:
 
-Robust Principal Component Analysis (RPCA) is a modification of the statistical procedure of [principal component analysis (PCA)](https://en.wikipedia.org/wiki/Principal_component_analysis) which allows to work with grossly corrupted observations.
+* ```ImputeByMean```/```ImputeByMedian```/```ImputeByMode``` : Replaces missing entries with the mean, median or mode of each column. It uses ```pd.DataFrame.fillna()```.
+* ```RandomImpute``` : Replaces missing entries with a random value from each column.
+* ```ImputeLOCF```/```ImputeNOCB``` : Replaces missing entries by carrying the last observation forward/next observation backward, for each column.
+* ```ImputeByInterpolation```: Replaces missing entries using one of the interpolation strategies
+supported by ```pd.Series.interpolate```.
+* ```ImputeRPCA``` : Imputes values via a RPCA method.
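
The univariate strategies listed above map onto a handful of pandas calls. The following is a minimal sketch using plain pandas rather than the Qolmat classes themselves (their constructors are not shown in this diff); the toy series and variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy univariate series with gaps; the values are illustrative only.
s = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, 6.0])

# Mean / median imputation: fill every gap with one summary statistic
# computed on the observed values.
by_mean = s.fillna(s.mean())
by_median = s.fillna(s.median())

# LOCF / NOCB: carry the last observation forward, or the next one backward.
locf = s.ffill()
nocb = s.bfill()

# Interpolation: pd.Series.interpolate supports several strategies,
# "linear" being the default.
linear = s.interpolate(method="linear")

print(pd.DataFrame({"raw": s, "mean": by_mean, "locf": locf,
                    "nocb": nocb, "linear": linear}))
```

Summary statistics ignore the time ordering entirely, whereas LOCF, NOCB and interpolation exploit it, which is why they tend to behave differently on smooth signals.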
 
-Suppose we are given a large data matrix $`\mathbf{D}`$, and know that it may be decomposed as
-```math
-\mathbf{D} = \mathbf{X}^* + \mathbf{A}^*
-```
-where $`\mathbf{X}^*`$ has low-rank and $`\mathbf{A}^*`$ is sparse. We do not know the low-dimensional column and row space of $`\mathbf{X}^*`$, not even their dimension. Similarly, for the non-zero entries of $`\mathbf{A}^*`$, we do not know their location, magnitude or even their number. Are the low-rank and sparse parts possible to recover both *accurately* and *efficiently*?
+For multivariate time series:
 
-Of course, for the separation problem to make sense, the low-rank part cannot be sparse and analogously, the sparse part cannot be low-rank. See [here](https://arxiv.org/abs/0912.3599) for more details.
+* ```ImputeKNN``` : Replaces missing entries using the k-nearest neighbors. It uses ```sklearn.impute.KNNImputer```.
+* ```ImputeIterative``` : Imputes each Series within a DataFrame multiple times using an iteration of fits and transformations to reach a stable state of imputation each time. It uses ```sklearn.impute.IterativeImputer```.
+* ```ImputeRegressor```: Imputes each Series with missing values within a DataFrame using a regression model whose features are based on the complete ones only.
+* ```ImputeStochasticRegressor```: Imputes each Series with missing values within a DataFrame using a stochastic regression model whose features are based on the complete ones only.
+* ```ImputeRPCA``` : Imputes values via a RPCA method.
+* ```ImputeEM``` : Imputation of missing values using a multivariate Gaussian model through EM optimization and using a projected (Ornstein-Uhlenbeck) process.
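
Two of the multivariate imputers above are described as relying on scikit-learn. As a rough illustration, the sketch below calls the underlying scikit-learn estimators directly; it does not use the Qolmat wrappers, whose signatures are not visible in this diff, and the toy matrix and parameter values are assumptions.

```python
import numpy as np
# IterativeImputer is still experimental in scikit-learn and must be
# enabled explicitly before it can be imported.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer

# Toy matrix: rows are observations, columns are variables.
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# k-nearest neighbours: each gap is filled with the average of that
# column over the 2 closest rows (distance computed on observed entries).
knn_filled = KNNImputer(n_neighbors=2).fit_transform(X)

# Iterative imputation: each column is regressed on the others, and the
# fit/predict rounds are repeated for up to max_iter iterations.
iter_filled = IterativeImputer(max_iter=10, random_state=0).fit_transform(X)

print(knn_filled)
print(iter_filled)
```

Both estimators leave observed entries untouched and only fill the gaps, so their outputs can be compared entry by entry.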
 
-Formally, the problem is expressed as
-```math
-\begin{align*}
-& \text{minimise} \quad \text{rank} (\mathbf{X}) + \lambda \Vert \mathbf{A} \Vert_0 \\
-& \text{s.t.} \quad \mathbf{D} = \mathbf{X} + \mathbf{A}
-\end{align*}
-```
-Unfortunately this optimization problem is a NP-hard problem due to its nonconvexity and discontinuity. So then, a widely used solving scheme is replacing rank($`\mathbf{X}`$) by its convex envelope —the nuclear norm $`\Vert \mathbf{X} \Vert_*`$— and the $`\ell_0`$ penalty is replaced with the $`\ell_1`$-norm, which is good at modeling the sparse noise and has high efficient solution. Therefore, the problem becomes
-```math
-\begin{align*}
-& \text{minimise} \quad \Vert \mathbf{X} \Vert_* + \lambda \Vert \mathbf{A} \Vert_1 \\
-& \text{s.t.} \quad \mathbf{D} = \mathbf{X} + \mathbf{A}
-\end{align*}
-```
+### **Comparator**
 
-Theoretically, this is guaranteed to work even if the rank of $`\mathbf{X}^*`$ grows almost linearly in the dimension of the matrix, and the errors in $`\mathbf{A}^*`$ are up to a constant fraction of all entries. Algorithmically, the above problem can be solved by efficient and scalable algorithms, at a cost not so much higher than the classical PCA. Empirically, a number of simulations and experiments suggest this works under surprisingly broad conditions for many types of real data.
+The ```Comparator``` class implements a way to compare multiple imputation methods.
+It is based on the standard approach of selecting some observations, setting their status to missing, and comparing
+their imputation with their true values.
 
-Some examples of real-life applications are background modelling from video surveillance, face recognition, speech recognition. We here focus on anomaly detection in time series.
-</details>
+More specifically, from the initial dataframe with missing values, we generate additional missing values (N samples/times).
+Missing values can be generated following three mechanisms: MCAR, MAR and MNAR.
 
+* In the MCAR setting, each value is masked according to the realisation of a Bernoulli random variable with a fixed parameter.
+* In the MAR setting, for each experiment, a fixed subset of variables that cannot have missing values is sampled. Then, the remaining variables have missing values according to a logistic model with random weights, which takes the non-missing variables as inputs. A bias term is fitted using line search to attain the desired proportion of missing values.
+* Finally, two different mechanisms are implemented in the MNAR setting.
 
-<details>
-<summary>What's in this repo ?</summary>
-
-Some classes are implemented:
-1. ```RPCA``` (see p.29 of this [paper](https://arxiv.org/abs/0912.3599)).
-The optimisation problem is the following
-```math
-\begin{align*}
-& \text{minimise} \quad \Vert \mathbf{X} \Vert_* + \lambda \Vert \mathbf{A} \Vert_1 \\
-& \text{s.t.} \quad \mathbf{D} = \mathbf{X} + \mathbf{A}
-\end{align*}
-```
-2. ```ImprovedRPCA``` (based on this [paper](https://www.hindawi.com/journals/jat/2018/7191549/)). The optimisation problem is the following
-```math
-\begin{align*}
-& \text{minimise} \quad \Vert \mathbf{X} \Vert_* + \lambda \Vert \mathbf{A} \Vert_1 + \sum_{i=1}^p \eta_i \Vert \mathbf{H_iX} \Vert_1\\
-& \text{s.t.} \quad P_{\Omega}(\mathbf{D}) = P_{\Omega}(\mathbf{X} + \mathbf{A})
-\end{align*}
-```
-
-3. ```NoisyRPCA``` (based on this [paper](https://arxiv.org/abs/2001.05484) and this [paper](https://www.hindawi.com/journals/jat/2018/7191549/)). The optimisation problem is the following
-```math
-\text{minimise} \quad \Vert P_{\Omega}(\mathbf{X}+\mathbf{A}-\mathbf{D}) \Vert_F^2 + \tau \Vert \mathbf{X} \Vert_* + \lambda \Vert \mathbf{A} \Vert_1 + \sum_{i=1}^p \eta_i \Vert \mathbf{H_iX} \Vert_1
-```
-4. ```GraphRPCA``` (based on this [paper](https://arxiv.org/abs/1507.08173)). The optimisation problem is the following
-```math
-\begin{align*}
-& \text{minimise} \quad \Vert \mathbf{A} \Vert_1 + \gamma_1 \text{tr}(\mathbf{X} \mathbf{\mathcal{L}_1} \mathbf{X}^T) + \gamma_2 \text{tr}(\mathbf{X}^T \mathbf{\mathcal{L}_2} \mathbf{X}) \\
-& \text{s.t.} \quad \mathbf{D} = \mathbf{X} + \mathbf{A}
-\end{align*}
-```
+* The first is identical to the previously described MAR mechanism, but the inputs of the logistic model are then masked by a MCAR mechanism. Hence, the logistic model’s outcome now depends on potentially missing values.
+* The second mechanism, ``self masked``, samples a subset of variables whose values in the lower and upper p-th percentiles are masked according to a Bernoulli random variable, and the values in-between are left not missing.
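
The missingness mechanisms described above can be sketched with NumPy alone. This is an illustrative sketch, not Qolmat's hole generators: the function names `mcar_mask`, `mar_mask` and `self_masked` are hypothetical, and the bias of the logistic model is found here by bisection, a simple stand-in for the line search mentioned in the README.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))  # complete toy data, illustrative only


def mcar_mask(shape, p, rng):
    """MCAR: mask each entry independently with probability p."""
    return rng.random(shape) < p


def mar_mask(inputs, target_rate, rng, n_steps=50):
    """MAR: mask one variable through a logistic model on observed inputs.

    The weights are random; the bias is tuned by bisection so that the
    average masking probability matches target_rate.
    """
    weights = rng.normal(size=inputs.shape[1])
    scores = inputs @ weights
    low, high = -50.0, 50.0
    for _ in range(n_steps):
        bias = (low + high) / 2.0
        rate = np.mean(1.0 / (1.0 + np.exp(-(scores + bias))))
        if rate < target_rate:
            low = bias   # too few holes: increase the bias
        else:
            high = bias  # too many holes: decrease the bias
    probs = 1.0 / (1.0 + np.exp(-(scores + bias)))
    return rng.random(len(scores)) < probs


def self_masked(column, q, p, rng):
    """Self-masked MNAR: only values in the lower and upper q-th
    percentiles can be masked, each with probability p."""
    low, high = np.percentile(column, [q, 100 - q])
    in_tails = (column <= low) | (column >= high)
    return in_tails & (rng.random(column.shape) < p)


mask_mcar = mcar_mask(X.shape, p=0.2, rng=rng)
mask_mar = mar_mask(X[:, :2], target_rate=0.3, rng=rng)  # holes in column 2
mask_mnar = self_masked(X[:, 0], q=10, p=0.5, rng=rng)
print(mask_mcar.mean(), mask_mar.mean(), mask_mnar.mean())
```

In the MAR case the mask depends only on columns that stay fully observed, whereas in the self-masked case it depends on the very values being hidden, which is what makes the mechanism MNAR.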
 
-The operator $`P_{\Omega}(\mathbf{M})`$ is the projection of $`\mathbf{M}`$ on the set of observed data $`\Omega`$. This allows to deal with missing values. Each of these classes is adapted to take as input either a time series or a matrix directly. If a time series is passed, a pre-processing is done.
-</details>
+On each sample, different imputation models are tested and reconstruction errors are computed on these artificially missing entries. The errors of each imputation model are then averaged, so that we eventually obtain a single error score per model. This procedure allows the comparison of different models on the same dataset.
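
The comparison procedure described above can be sketched in a few lines. This is a simplified illustration, not the ```Comparator``` class: the toy dataframe is fully observed (the real input may already contain gaps), holes are drawn MCAR, the error metric is a mean absolute error, and the two imputers are plain pandas lambdas chosen for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
t = np.arange(200)
df = pd.DataFrame({"y": np.sin(t / 10.0)})  # smooth toy signal

# Candidate imputation models: each maps a dataframe with gaps to a
# completed dataframe.
imputers = {
    "mean": lambda d: d.fillna(d.mean()),
    "interpolation": lambda d: d.interpolate(limit_direction="both"),
}

n_samples = 5  # number of artificial hole patterns
errors = {name: [] for name in imputers}
for _ in range(n_samples):
    # Hide 20% of the entries at random and remember where they are.
    mask = rng.random(df.shape) < 0.2
    corrupted = df.mask(mask)
    for name, impute in imputers.items():
        imputed = impute(corrupted)
        # Reconstruction error on the artificially hidden entries only.
        mae = np.abs(imputed.to_numpy()[mask] - df.to_numpy()[mask]).mean()
        errors[name].append(mae)

# One averaged score per model.
scores = {name: float(np.mean(errs)) for name, errs in errors.items()}
print(scores)
```

Because every model is scored on the same hidden entries, the averaged errors are directly comparable; on a smooth signal like this one, interpolation is expected to beat the column mean.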
 
+<p align="center" width="100%">
+<img src="docs/images/comparator.png" alt="comparator" width="60%"/>
+</p>
 
-
-**TL;DR** RPCA can be described as the decomposition of a matrix of observations $`\mathbf{D}`$ into two matrices: a low-rank matrix $`\mathbf{X}`$ and a sparse matrix $`\mathbf{A}`$. See the examples folder for a first overview of the implemented classes.
-
-
-## **Installation**
+### **Installation**
 
 ```
-conda env create -f environment.yml
-conda activate robustpcaEnv
+conda env create -f conda.yml
+conda activate env_qolmat
 ```
+### Install pre-commit
 
-## **References**
-[1] Candès, Emmanuel J., et al. "Robust principal component analysis?." Journal of the ACM (JACM) 58.3 (2011): 1-37, ([pdf](https://arxiv.org/abs/0912.3599))
-
-[2] Wang, Xuehui, et al. "An improved robust principal component analysis model for anomalies detection of subway passenger flow." Journal of advanced transportation 2018 (2018). ([pdf](https://www.hindawi.com/journals/jat/2018/7191549/))
-
-[3] Chen, Yuxin, et al. "Bridging convex and nonconvex optimization in robust PCA: Noise, outliers, and missing data." arXiv preprint arXiv:2001.05484 (2020), ([pdf](https://arxiv.org/abs/2001.05484))
-
-[4] Shahid, Nauman, et al. "Fast robust PCA on graphs." IEEE Journal of Selected Topics in Signal Processing 10.4 (2016): 740-756. ([pdf](https://arxiv.org/abs/1507.08173))
+Once the environment is created, pre-commit is installed but still needs to be activated with the following command:
+```
+pre-commit install
+```

README.rst

Lines changed: 9 additions & 9 deletions
@@ -13,7 +13,7 @@ which allows to work with grossly corrupted observations.
 Suppose we are given a large data matrix :math:`\mathbf{D}`, and know
 that it may be decomposed as
 
-.. math::
+.. math::
 
 \mathbf{D} = \mathbf{X}^* + \mathbf{A}^*
 
@@ -30,7 +30,7 @@ See `here <https://arxiv.org/abs/0912.3599>`__ for more details.
 
 Formally, the problem is expressed as
 
-.. math::
+.. math::
 
 \begin{align*}
 & \text{minimise} \quad \text{rank} (\mathbf{X}) + \lambda \Vert \mathbf{A} \Vert_0 \\
@@ -45,7 +45,7 @@ penalty is replaced with the :math:`\ell_1`-norm, which is good at
 modeling the sparse noise and has high efficient solution. Therefore,
 the problem becomes
 
-.. math::
+.. math::
 
 \begin{align*}
 & \text{minimise} \quad \Vert \mathbf{X} \Vert_* + \lambda \Vert \mathbf{A} \Vert_1 \\
@@ -69,11 +69,11 @@ on anomaly detection in time series.
 What’s in this repo?
 ====================
 
-Some classes are implemented:
+Some classes are implemented:
 
 * :class:`RPCA` class (see p.29 of this `paper <https://arxiv.org/abs/0912.3599>`__). The optimisation problem is the following
 
-.. math::
+.. math::
 
 \begin{align*}
 & \text{minimise} \quad \Vert \mathbf{X} \Vert_* + \lambda \Vert \mathbf{A} \Vert_1 \\
@@ -82,7 +82,7 @@ Some classes are implemented:
 
 * :class:`GraphRPCA` class (based on this `paper <https://arxiv.org/abs/1507.08173>`__). The optimisation problem is the following
 
-.. math::
+.. math::
 
 \begin{align*}
 & \text{minimise} \quad \Vert \mathbf{A} \Vert_1 + \gamma_1 \text{tr}(\mathbf{X} \mathbf{\mathcal{L}_1} \mathbf{X}^T) + \gamma_2 \text{tr}(\mathbf{X}^T \mathbf{\mathcal{L}_2} \mathbf{X}) \\
@@ -91,14 +91,14 @@ Some classes are implemented:
 
 * :class:`TemporalRPCA` class (based on this `paper <https://arxiv.org/abs/2001.05484>`__ and this `paper <https://www.hindawi.com/journals/jat/2018/7191549/>`__). The optimisation problem is the following
 
-.. math::
+.. math::
 
 \text{minimise} \quad \Vert P_{\Omega}(\mathbf{X}+\mathbf{A}-\mathbf{D}) \Vert_F^2 + \lambda_1 \Vert \mathbf{X} \Vert_* + \lambda_2 \Vert \mathbf{A} \Vert_1 + \sum_{k=1}^K \eta_k \Vert \mathbf{XH_k} \Vert_p
 
 where :math:`\Vert \mathbf{XH_k} \Vert_p` is either :math:`\Vert \mathbf{XH_k} \Vert_1` or :math:`\Vert \mathbf{XH_k} \Vert_F^2`.
 
 
-The operator :math:`P_{\Omega}` is the projection operator such that
+The operator :math:`P_{\Omega}` is the projection operator such that
 :math:`P_{\Omega}(\mathbf{M})` is the projection of
 :math:`\mathbf{M}` on the set of observed data :math:`\Omega`. This
 allows to deal with missing values. Each of these classes is adapted to
@@ -115,7 +115,7 @@ Install directly from the gitlab repository:
 Contributing
 ============
 
-Feel free to open an issue or contact us at [email protected]
+Feel free to open an issue or contact us at [email protected]
 
 References
 ==========

conda.yml

Lines changed: 35 additions & 28 deletions
@@ -1,31 +1,38 @@
 name: env_qolmat
 channels:
-- conda-forge
-- defaults
+- conda-forge
+- defaults
 dependencies:
-- bump2version
-- flake8
-- ipykernel
-- jupyter
-- mypy
-- numpy
-- numpydoc
-- pandas
-- pytest
-- pytest-cov
-- scikit-learn
-- sphinx
-- sphinx-gallery
-- sphinx_rtd_theme
-- twine
-- wheel
-- jupyterlab
-# - fbprophet
-- s3fs
-- scikit-optimize
-- pyarrow
-- matplotlib
-- plotly
-- pip
-- pip:
-- -e .
+- bump2version
+- ipykernel
+- jupyter
+- seaborn
+- sphinx
+- sphinx-gallery
+- sphinx_rtd_theme
+- twine
+- wheel
+- jupyterlab
+- jupytext
+- s3fs
+- pyarrow
+- pip
+- pip:
+- -e .
+- flake8==6.0.0
+- jupytext==1.14.4
+- matplotlib==3.6.2
+- missingpy==0.2.0
+- numpy==1.24.1
+- numpydoc==1.5.0
+- mypy=0.911
+- pandas==1.5.2
+- plotly==5.11.0
+- pre-commit==2.21.0
+- pytest==7.2.0
+- pytest-cov==4.0.0
+- seaborn==0.12.2
+- scikit-learn==1.1.3
+- scikit-optimize==0.9.0
+- forge:
+- gcc=12.1.0
