Changes to ``HISTORY.rst``: 10 additions, 0 deletions
History
=======

0.1.4 (2024-04-15)
------------------

* ImputerMean, ImputerMedian and ImputerMode have been merged into ImputerSimple
* New file preprocessing.py added, with classes MixteHGBM, BinTransformer, OneHotEncoderProjector and WrapperTransformer providing tools to manage mixed-type data
* Tutorial plot_tuto_categorical showcasing mixed-type imputation
* Titanic dataset added
* Accuracy metric implemented
* metrics.py rationalized and split with algebra.py
Changes to ``docs/imputers.rst``: 23 additions, 18 deletions
Imputers
========

All imputers can be found in the ``qolmat.imputations`` folder.

1. Simple (mean/median/mode)
----------------------------

Imputes the missing values using basic statistics: the mode (most frequent value) for categorical columns, and the mean, median or mode (depending on the user parameter) for numerical columns. See the :class:`~qolmat.imputations.imputers.ImputerSimple` class.
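The idea can be sketched with plain pandas (an illustration only, not the qolmat API; the toy columns ``age`` and ``sex`` are invented for the example):

```python
import pandas as pd

# Toy frame with one numerical and one categorical column (hypothetical data)
df = pd.DataFrame({
    "age": [22.0, None, 38.0, 26.0],
    "sex": ["male", "female", None, "female"],
})

# Numerical column: impute with the mean
df["age"] = df["age"].fillna(df["age"].mean())
# Categorical column: impute with the mode (most frequent value)
df["sex"] = df["sex"].fillna(df["sex"].mode()[0])
```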
2. Shuffle
----------

Imputes the missing values using a random value sampled in the same column. See :class:`~qolmat.imputations.imputers.ImputerShuffle`.
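A minimal sketch of the sampling idea with numpy and pandas (not the qolmat implementation; the toy series is invented):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

# Draw replacements uniformly from the observed values of the same column
observed = s.dropna().to_numpy()
s[s.isna()] = rng.choice(observed, size=int(s.isna().sum()))
```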
3. LOCF
-------

Imputes the missing values using the last observation carried forward. See :class:`~qolmat.imputations.imputers.ImputerLOCF`.
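In pandas terms, LOCF is a forward fill (an illustration, not the qolmat API):

```python
import pandas as pd

s = pd.Series([1.0, None, None, 4.0, None])
# Carry the last observed value forward; a leading NaN would stay missing
filled = s.ffill()
```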
4. Time interpolation and TSA decomposition
-------------------------------------------

Imputes missing values using the interpolation strategies supported by `pd.Series.interpolate <https://pandas.pydata.org/docs/reference/api/pandas.Series.interpolate.html>`_. It is done column by column. See the :class:`~qolmat.imputations.imputers.ImputerInterpolation` class. When data are temporal with a clear seasonal decomposition, we can interpolate on the residuals instead of interpolating the raw data directly. Series are de-seasonalised based on `statsmodels.tsa.seasonal.seasonal_decompose <https://www.statsmodels.org/stable/generated/statsmodels.tsa.seasonal.seasonal_decompose.html>`_, residuals are imputed via linear interpolation, then re-seasonalised. This is also done column by column. See :class:`~qolmat.imputations.imputers.ImputerResiduals`.
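The plain-interpolation case can be sketched directly with pandas (a toy daily series, invented for the example; the residuals variant would first remove the seasonal component with ``seasonal_decompose``):

```python
import pandas as pd

idx = pd.date_range("2024-01-01", periods=5, freq="D")
s = pd.Series([0.0, None, None, 3.0, 4.0], index=idx)

# Linear-in-time interpolation, applied column by column (single column here)
filled = s.interpolate(method="time")
```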
5. MICE
-------

Multiple Imputation by Chained Equations: multiple imputations based on ICE. It uses `IterativeImputer <https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html#sklearn.impute.IterativeImputer>`_. See the :class:`~qolmat.imputations.imputers.ImputerMICE` class.
6. RPCA
-------

Robust Principal Component Analysis (RPCA) is a modification of the statistical procedure of PCA that can handle a data matrix :math:`\mathbf{D} \in \mathbb{R}^{n \times d}` containing missing values and grossly corrupted observations. We consider here the imputation task alone, but these methods can also tackle anomaly correction.
Two cases are considered.

**RPCA via Principal Component Pursuit (PCP)** [1, 12]

The class :class:`RpcaPcp` implements a matrix decomposition :math:`\mathbf{D} = \mathbf{M} + \mathbf{A}`, where :math:`\mathbf{M}` is low-rank and :math:`\mathbf{A}` is sparse. It relies on the following optimisation problem
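For intuition, the classical PCP objective from the literature [1] takes the form below (a sketch, with :math:`\lambda` the usual sparsity trade-off weight; the exact formulation used by the class may differ):

.. math::
    \text{min}_{\mathbf{M}, \mathbf{A}} \quad \Vert \mathbf{M} \Vert_* + \lambda \Vert \mathbf{A} \Vert_1 \quad \text{s.t.} \quad \mathbf{D} = \mathbf{M} + \mathbf{A}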
See the :class:`~qolmat.imputations.imputers.ImputerRpcaPcp` class for implementation details.

**Noisy RPCA** [2, 3, 4]

The class :class:`RpcaNoisy` implements a recommended improved version, which relies on a decomposition :math:`\mathbf{D} = \mathbf{M} + \mathbf{A} + \mathbf{E}`. The additional term encodes a Gaussian noise and makes the numerical convergence more reliable. This class also implements a time-consistency penalization for time series, parametrized by the :math:`\eta_k` and :math:`H_k`. Defining :math:`\Vert \mathbf{MH_k} \Vert_p` as either :math:`\Vert \mathbf{MH_k} \Vert_1` or :math:`\Vert \mathbf{MH_k} \Vert_F^2`, the optimisation problem is the following
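A sketch of such an objective, consistent with the terms named above (the weights :math:`\tau` and :math:`\lambda` are assumptions of this sketch, and :math:`P_\Omega` denotes the projection onto the observed entries):

.. math::
    \text{min}_{\mathbf{M}, \mathbf{A}} \quad \frac{1}{2} \Vert P_\Omega(\mathbf{D} - \mathbf{M} - \mathbf{A}) \Vert_F^2 + \tau \Vert \mathbf{M} \Vert_* + \lambda \Vert \mathbf{A} \Vert_1 + \sum_k \eta_k \Vert \mathbf{MH_k} \Vert_p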
with :math:`\mathbf{E} = \mathbf{D} - \mathbf{M} - \mathbf{A}`.

See the :class:`~qolmat.imputations.imputers.ImputerRpcaNoisy` class for implementation details.

7. SoftImpute
-------------

SoftImpute is an iterative method for matrix completion that uses nuclear-norm regularization [11]. It is a faster alternative to RPCA, although it is much less robust due to the quadratic penalization. Given a matrix :math:`\mathbf{D} \in\mathbb{R}^{n \times d}` with observed entries indexed by the set :math:`\Omega`, this algorithm solves the following problem:
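In the factorised form commonly used for SoftImpute-style solvers, with :math:`\mathbf{L} \in \mathbb{R}^{n \times r}` and :math:`\mathbf{Q} \in \mathbb{R}^{r \times d}` (the exact scaling of the penalty :math:`\lambda` is an assumption of this sketch):

.. math::
    \text{min}_{\mathbf{L}, \mathbf{Q}} \quad \frac{1}{2} \Vert P_\Omega(\mathbf{D} - \mathbf{LQ}) \Vert_F^2 + \frac{\lambda}{2} \left( \Vert \mathbf{L} \Vert_F^2 + \Vert \mathbf{Q} \Vert_F^2 \right)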
The imputed values are then given by the matrix :math:`M=LQ` on the unobserved data.
See the :class:`~qolmat.imputations.imputers.ImputerSoftImpute` class for implementation details.

8. KNN
------

K-nearest neighbors, based on `KNNImputer <https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html>`_. See the :class:`~qolmat.imputations.imputers.ImputerKNN` class.
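A minimal sketch with the scikit-learn ``KNNImputer`` the class is based on (the toy matrix is invented):

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0], [3.0, 4.0], [np.nan, 6.0], [8.0, 8.0]])

# Each missing value is replaced by the mean of its 2 nearest neighbours,
# with distances computed on the mutually observed coordinates
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)
```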

9. EM sampler
-------------

Imputes missing values via the EM algorithm [5], and more precisely via the MCEM algorithm [6]. See the :class:`~qolmat.imputations.imputers.ImputerEM` class.
Suppose the data :math:`\mathbf{X}` has a density :math:`p_\theta` parametrized by some parameter :math:`\theta`. The EM algorithm allows us to draw samples from this distribution by alternating between the expectation and maximization steps.
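Generically, the textbook EM alternation reads as below (stated here for intuition; the MCEM variant replaces the conditional expectation with a Monte-Carlo estimate over samples of the missing entries):

.. math::
    Q(\theta \mid \theta^{(t)}) = \mathbb{E}\left[ \log p_\theta(\mathbf{X}) \mid \mathbf{X}_{obs}, \theta^{(t)} \right], \qquad
    \theta^{(t+1)} = \mathrm{argmax}_\theta \, Q(\theta \mid \theta^{(t)})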
We estimate the distribution parameter :math:`\theta` by likelihood maximization.

Once the parameter :math:`\theta^*` has been estimated, the final data imputation can be done in two different ways, depending on the value of the argument `method`:

* `mle`: Returns the maximum likelihood estimator

.. math::
    X^* = \mathrm{argmax}_X L(X, \theta^*)
Two parametric distributions are implemented:

* :class:`~qolmat.imputations.em_sampler.VARpEM` [7]: :math:`\mathbf{X} \in \mathbb{R}^{n \times d} \sim VAR_p(\nu, B_1, ..., B_p)` is generated by a VAR(p) process such that :math:`X_t = \nu + B_1 X_{t-1} + ... + B_p X_{t-p} + u_t`, where :math:`\nu \in \mathbb{R}^d` is a vector of intercept terms, the :math:`B_i \in \mathbb{R}^{d \times d}` are the lag coefficient matrices and :math:`u_t` is white noise with nonsingular covariance matrix :math:`\Sigma_u \in \mathbb{R}^{d \times d}`, so that :math:`\theta = (\nu, B_1, ..., B_p, \Sigma_u)`.

10. TabDDPM
-----------

:class:`~qolmat.imputations.diffusions.ddpms.TabDDPM` is a deep learning imputer based on Denoising Diffusion Probabilistic Models (DDPMs) [8] for handling multivariate tabular data. Our implementation mainly follows the works of [8, 9]. Diffusion models focus on modeling the process of data transitions from noisy and incomplete observations to the underlying true data. They include two main processes:
In training phase, we use the self-supervised learning method of [9] to train in…

In the case of time-series data, we also propose :class:`~qolmat.imputations.diffusions.ddpms.TsDDPM` (built on top of :class:`~qolmat.imputations.diffusions.ddpms.TabDDPM`) to capture time-based relationships between data points in a dataset. In fact, the dataset is pre-processed using a sliding window method to obtain a set of data partitions. The noise prediction of the model :math:`\epsilon_\theta` takes into account not only the observed data at the current time step but also data from previous time steps. These time-based relationships are encoded using a transformer-based architecture [9].

References (Imputers)
---------------------

[1] Candès, Emmanuel J., et al. `Robust principal component analysis? <https://arxiv.org/abs/2001.05484>`_ Journal of the ACM (JACM) 58.3 (2011): 1-37.

[7] Botterman, HL., Roussel, J., Morzadec, T., Jabbari, A., Brunel, N. "Robust PCA for Anomaly Detection and Data Imputation in Seasonal Time Series" (2022) in International Conference on Machine Learning, Optimization, and Data Science. Cham: Springer Nature Switzerland. (`pdf <https://link.springer.com/chapter/10.1007/978-3-031-25891-6_21>`__)