Qolmat
=

The Qolmat package is designed for the implementation and comparison of imputation methods. It can be divided into two main parts:

1. Imputation of missing values via multiple algorithms;
2. Comparison of the imputation methods in a supervised manner.

### **Imputation methods**

For univariate time series (a pandas sketch of these strategies follows the list):

* ```ImputeByMean```/```ImputeByMedian```/```ImputeByMode```: replaces missing entries with the mean, median or mode of each column. It uses ```pd.DataFrame.fillna()```.
* ```RandomImpute```: replaces each missing entry with a value sampled at random from the observed values of the same column.
* ```ImputeLOCF```/```ImputeNOCB```: replaces missing entries by carrying the last observation forward / the next observation backward, for each column.
* ```ImputeByInterpolation```: replaces missing entries using the interpolation strategies supported by ```pd.Series.interpolate```.
* ```ImputeRPCA```: imputes values via an RPCA method.
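The univariate wrappers are documented above as relying on pandas. Since the package's own call signatures are not shown in this README, here is a minimal pandas-only sketch of the underlying strategies (the variable names and the toy series are illustrative, not part of the package):

```python
import numpy as np
import pandas as pd

# Toy univariate series with missing entries.
s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

# Mean / median / mode, the strategies behind ImputeByMean / ImputeByMedian / ImputeByMode.
s_mean = s.fillna(s.mean())
s_median = s.fillna(s.median())
s_mode = s.fillna(s.mode()[0])

# Last observation carried forward / next observation carried backward (ImputeLOCF / ImputeNOCB).
s_locf = s.ffill()
s_nocb = s.bfill()

# Any strategy supported by pd.Series.interpolate (ImputeByInterpolation).
s_interp = s.interpolate(method="linear")

# Random draw from the observed values of the same series (RandomImpute).
rng = np.random.default_rng(42)
s_rand = s.copy()
s_rand[s.isna()] = rng.choice(s.dropna().to_numpy(), size=int(s.isna().sum()))
```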
For multivariate time series (a scikit-learn sketch follows the list):

* ```ImputeKNN```: replaces missing entries using the k nearest neighbours. It uses ```sklearn.impute.KNNImputer```.
* ```ImputeIterative```: imputes each Series within a DataFrame multiple times, iterating fits and transforms until the imputation stabilises. It uses ```sklearn.impute.IterativeImputer```.
* ```ImputeRegressor```: imputes each Series with missing values using a regression model whose features are based on the complete columns only.
* ```ImputeStochasticRegressor```: imputes each Series with missing values using a stochastic regression model whose features are based on the complete columns only.
* ```ImputeRPCA```: imputes values via an RPCA method.
* ```ImputeEM```: imputes missing values with a multivariate Gaussian model, fitted via EM optimisation and a projected (Ornstein-Uhlenbeck) process.
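```ImputeKNN``` and ```ImputeIterative``` are documented above as wrapping scikit-learn imputers, so a minimal sketch of those underlying estimators (the toy dataframe and parameter values are illustrative) might look like:

```python
import numpy as np
import pandas as pd
# This import is required: IterativeImputer is still experimental in scikit-learn.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer

df = pd.DataFrame(
    {
        "a": [1.0, np.nan, 3.0, 4.0],
        "b": [2.0, 2.5, np.nan, 4.5],
        "c": [0.1, 0.2, 0.3, np.nan],
    }
)

# k-nearest-neighbours imputation, the estimator behind ImputeKNN.
df_knn = pd.DataFrame(KNNImputer(n_neighbors=2).fit_transform(df), columns=df.columns)

# Iterative (round-robin) imputation, the estimator behind ImputeIterative.
df_iter = pd.DataFrame(
    IterativeImputer(max_iter=10, random_state=0).fit_transform(df), columns=df.columns
)
```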
### **Comparator**

The ```Comparator``` class implements a way to compare multiple imputation methods. It is based on the standard approach of selecting some observed values, setting them to missing, and comparing their imputations with the true values.

More specifically, starting from the initial dataframe with missing values, we generate additional missing values (N samples/times). Missing values can be generated following three mechanisms, MCAR, MAR and MNAR (a minimal sketch of two of these mechanisms follows the list below):
* In the MCAR setting, each value is masked according to the realisation of a Bernoulli random variable with a fixed parameter.
* In the MAR setting, for each experiment, a fixed subset of variables that cannot have missing values is sampled. The remaining variables then have missing values according to a logistic model with random weights, which takes the non-missing variables as inputs. A bias term is fitted using line search to attain the desired proportion of missing values.
* Finally, two different mechanisms are implemented in the MNAR setting:
  * The first is identical to the previously described MAR mechanism, but the inputs of the logistic model are themselves masked by an MCAR mechanism. Hence, the logistic model's outcome now depends on potentially missing values.
  * The second mechanism, ``self masked``, samples a subset of variables whose values in the lower and upper p-th percentiles are masked according to a Bernoulli random variable, while the values in between are left untouched.
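Qolmat's hole-generation API is not shown in this README, so the following is a minimal numpy/pandas sketch of the two mechanisms that are simplest to reproduce, MCAR and ``self masked`` MNAR; the helper names and parameters are hypothetical:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def mask_mcar(df: pd.DataFrame, proba: float) -> pd.DataFrame:
    """MCAR: mask each observed value with a fixed Bernoulli probability."""
    bernoulli = rng.random(df.shape) < proba
    return df.mask(bernoulli & df.notna().to_numpy())

def mask_self_masked(df: pd.DataFrame, columns: list, p: float, proba: float) -> pd.DataFrame:
    """MNAR 'self masked': in the given columns, values in the lower and upper
    p-th percentiles are masked according to a Bernoulli random variable."""
    out = df.copy()
    for col in columns:
        low, high = out[col].quantile([p, 1.0 - p])
        extreme = (out[col] <= low) | (out[col] >= high)
        bernoulli = rng.random(len(out)) < proba
        out.loc[extreme & bernoulli, col] = np.nan
    return out
```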
On each sample, the different imputation models are tested and reconstruction errors are computed on these artificially missing entries. The errors of each imputation model are then averaged over the samples, eventually yielding a single error score per model. This procedure allows different models to be compared on the same dataset.
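Since the ```Comparator``` signature is not given in this README, here is a hypothetical, self-contained sketch of that procedure, reusing the ```mask_mcar``` helper and the ```df``` example from the sketches above:

```python
import numpy as np

def compare_imputers(df, imputers, n_samples=10, proba=0.1):
    """Hypothetical helper: repeatedly add MCAR holes, impute, and average
    the RMSE computed on the artificially masked entries."""
    scores = {name: [] for name in imputers}
    for _ in range(n_samples):
        df_holes = mask_mcar(df, proba)            # defined in the sketch above
        added = df_holes.isna() & df.notna()       # the artificially masked cells
        for name, impute in imputers.items():
            errors = (impute(df_holes) - df)[added]
            scores[name].append(float(np.sqrt(np.nanmean(np.square(errors.to_numpy())))))
    return {name: float(np.mean(vals)) for name, vals in scores.items()}

# Two simple imputation functions standing in for Qolmat imputers:
results = compare_imputers(
    df,
    {
        "mean": lambda d: d.fillna(d.mean()),
        "interpolate": lambda d: d.interpolate().ffill().bfill(),
    },
)
```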
<p align="center" width="100%">
<img src="docs/images/comparator.png" alt="comparator" width="60%"/>
</p>
### **Installation**

```
conda env create -f conda.yml
conda activate env_qolmat
```
### **Install pre-commit**

Once the environment is installed, pre-commit is installed too, but it needs to be activated using the following command:
```
pre-commit install
```