Commit c4b7773

Merge pull request #131 from scikit-learn-contrib/dev ("Dev")
2 parents 5da1e22 + 0b579e3

35 files changed: +2803 −732 lines

.bumpversion.cfg

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 [bumpversion]
-current_version = 0.1.2
+current_version = 0.1.4
 commit = True
 tag = True

.flake8

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 [flake8]
-exclude = .git,__pycache__,.vscode,tests
+exclude = .git,__pycache__,.vscode
 max-line-length=99
 ignore=E302,E305,W503,E203,E731,E402,E266,E712,F401,F821
 indent-size = 4

HISTORY.rst

Lines changed: 10 additions & 0 deletions
@@ -2,6 +2,16 @@
 History
 =======

+0.1.4 (2024-04-15)
+------------------
+
+* ImputerMean, ImputerMedian and ImputerMode have been merged into ImputerSimple
+* New file preprocessing.py with the classes MixteHGBM, BinTransformer, OneHotEncoderProjector and WrapperTransformer, providing tools to manage mixed-type data
+* Tutorial plot_tuto_categorical showcasing mixed-type imputation
+* Titanic dataset added
+* Accuracy metric implemented
+* metrics.py rationalized and split with algebra.py
+
 0.1.3 (2024-03-07)
 ------------------
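
A minimal migration sketch for the ImputerSimple merge (hedged: the exact constructor arguments of ImputerSimple are not shown in this diff; the scikit-learn-style ``fit_transform`` call matches how Qolmat imputers are used elsewhere in these docs):

```python
import numpy as np
import pandas as pd
from qolmat.imputations.imputers import ImputerSimple

df = pd.DataFrame({"num": [1.0, np.nan, 3.0], "cat": ["a", "b", np.nan]})

# Before 0.1.4 (now removed):
# from qolmat.imputations.imputers import ImputerMean, ImputerMedian, ImputerMode

# From 0.1.4: one class covers all three strategies. Per the docs below,
# categorical columns get the mode, numerical columns the chosen statistic
# (mean by default; the parameter name is not shown in this diff).
df_imputed = ImputerSimple().fit_transform(df)
```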

README.rst

Lines changed: 2 additions & 0 deletions
@@ -232,6 +232,8 @@ Selected Topics in Signal Processing 10.4 (2016): 740-756.
 [6] García, S., Luengo, J., & Herrera, F. "Data preprocessing in data mining". 2015.
 (`pdf <https://www.academia.edu/download/60477900/Garcia__Luengo__Herrera-Data_Preprocessing_in_Data_Mining_-_Springer_International_Publishing_201520190903-77973-th1o73.pdf>`__)

+[7] Botterman, HL., Roussel, J., Morzadec, T., Jabbari, A., Brunel, N. "Robust PCA for Anomaly Detection and Data Imputation in Seasonal Time Series" (2022). In International Conference on Machine Learning, Optimization, and Data Science. Cham: Springer Nature Switzerland. (`pdf <https://link.springer.com/chapter/10.1007/978-3-031-25891-6_21>`__)
+
 📝 License
 ==========

docs/api.rst

Lines changed: 38 additions & 21 deletions
@@ -4,8 +4,8 @@ Qolmat API

 .. currentmodule:: qolmat

-Imputers
-=========
+Imputers API
+============

 .. autosummary::
    :toctree: generated/
@@ -15,10 +15,8 @@ Imputers
    imputations.imputers.ImputerKNN
    imputations.imputers.ImputerInterpolation
    imputations.imputers.ImputerLOCF
-   imputations.imputers.ImputerMedian
-   imputations.imputers.ImputerMean
+   imputations.imputers.ImputerSimple
    imputations.imputers.ImputerMICE
-   imputations.imputers.ImputerMode
    imputations.imputers.ImputerNOCB
    imputations.imputers.ImputerOracle
    imputations.imputers.ImputerRegressor
@@ -28,17 +26,17 @@ Imputers
    imputations.imputers.ImputerSoftImpute
    imputations.imputers.ImputerShuffle

-Comparator
-===========
+Comparator API
+==============

 .. autosummary::
    :toctree: generated/
    :template: class.rst

    benchmark.comparator.Comparator

-Missing Patterns
-================
+Missing Patterns API
+====================

 .. autosummary::
    :toctree: generated/
@@ -51,8 +49,8 @@ Missing Patterns
    benchmark.missing_patterns.GroupedHoleGenerator


-Metrics
-=======
+Metrics API
+===========

 .. autosummary::
    :toctree: generated/
@@ -63,6 +61,7 @@ Metrics
    benchmark.metrics.mean_absolute_error
    benchmark.metrics.mean_absolute_percentage_error
    benchmark.metrics.weighted_mean_absolute_percentage_error
+   benchmark.metrics.accuracy
    benchmark.metrics.dist_wasserstein
    benchmark.metrics.kl_divergence
    benchmark.metrics.kolmogorov_smirnov_test
@@ -75,19 +74,19 @@ Metrics
    benchmark.metrics.pattern_based_weighted_mean_metric


-RPCA engine
-================
+RPCA engine API
+===============

 .. autosummary::
    :toctree: generated/
    :template: class.rst

-   imputations.rpca.rpca_pcp.RPCAPCP
-   imputations.rpca.rpca_noisy.RPCANoisy
+   imputations.rpca.rpca_pcp.RpcaPcp
+   imputations.rpca.rpca_noisy.RpcaNoisy


-EM engine
-================
+Expectation-Maximization engine API
+===================================

 .. autosummary::
    :toctree: generated/
@@ -96,8 +95,8 @@ EM engine
    imputations.em_sampler.MultiNormalEM
    imputations.em_sampler.VARpEM

-Diffusion engine
-================
+Diffusion Model engine API
+==========================

 .. autosummary::
    :toctree: generated/
@@ -107,9 +106,27 @@ Diffusion engine
    imputations.diffusions.ddpms.TabDDPM
    imputations.diffusions.ddpms.TsDDPM

+Preprocessing API
+=================
+
+.. autosummary::
+   :toctree: generated/
+   :template: class.rst
+
+   imputations.preprocessing.MixteHGBM
+   imputations.preprocessing.BinTransformer
+   imputations.preprocessing.OneHotEncoderProjector
+   imputations.preprocessing.WrapperTransformer
+
+.. autosummary::
+   :toctree: generated/
+   :template: function.rst
+
+   imputations.preprocessing.make_pipeline_mixte_preprocessing
+   imputations.preprocessing.make_robust_MixteHGB

-Utils
-================
+Utils API
+=========

 .. autosummary::
    :toctree: generated/
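
A hedged sketch of the new preprocessing helpers (the function names come from the API listing above; their signatures are not shown in this diff, so the zero-argument calls are an assumption):

```python
from qolmat.imputations.preprocessing import (
    make_pipeline_mixte_preprocessing,
    make_robust_MixteHGB,
)

# A scikit-learn pipeline mapping mixed numerical/categorical columns
# to a purely numerical representation (assumed default construction).
preprocessing = make_pipeline_mixte_preprocessing()

# A HistGradientBoosting-based estimator wrapped to stay robust on
# mixed-type inputs, as introduced in 0.1.4.
model = make_robust_MixteHGB()
```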

docs/conf.py

Lines changed: 1 addition & 1 deletion
@@ -27,7 +27,7 @@
 author = "Quantmetry"

 # The full version, including alpha/beta/rc tags
-version = "0.1.2"
+version = "0.1.4"
 release = version

 # -- General configuration ---------------------------------------------------

docs/imputers.rst

Lines changed: 23 additions & 18 deletions
@@ -3,32 +3,36 @@ Imputers

 All imputers can be found in the ``qolmat.imputations`` folder.

-1. mean/median/shuffle
-----------------------
-Imputes the missing values using the mean/median along each column or with a random value in each column. See the :class:`~qolmat.imputations.imputers.ImputerMean`, :class:`~qolmat.imputations.imputers.ImputerMedian` and :class:`~qolmat.imputations.imputers.ImputerShuffle` classes.
+1. Simple (mean/median/mode)
+----------------------------
+Imputes the missing values using basic statistics: the mode (most frequent value) for categorical columns, and the mean, median or mode (depending on the user parameter) for numerical columns. See :class:`~qolmat.imputations.imputers.ImputerSimple`.

-2. LOCF
+2. Shuffle
+----------
+Imputes the missing values using a random value sampled from the same column. See :class:`~qolmat.imputations.imputers.ImputerShuffle`.
+
+3. LOCF
 -------
-Imputes the missing values using the last observation carried forward. See the :class:`~qolmat.imputations.imputers.ImputerLOCF` class.
+Imputes the missing values using the last observation carried forward. See :class:`~qolmat.imputations.imputers.ImputerLOCF`.

-3. interpolation (on residuals)
--------------------------------
-Imputes missing using some interpolation strategies supported by `pd.Series.interpolate <https://pandas.pydata.org/docs/reference/api/pandas.Series.interpolate.html>`_. It is done column by column. See the :class:`~qolmat.imputations.imputers.ImputerInterpolation` class. When data are temporal with clear seasonal decomposition, we can interpolate on the residuals instead of directly interpolate the raw data. Series are de-seasonalised based on `statsmodels.tsa.seasonal.seasonal_decompose <https://www.statsmodels.org/stable/generated/statsmodels.tsa.seasonal.seasonal_decompose.html>`_, residuals are imputed via linear interpolation, then residuals are re-seasonalised. It is also done column by column. See the :class:`~qolmat.imputations.imputers.ImputerResiduals` class.
+4. Time interpolation and TSA decomposition
+-------------------------------------------
+Imputes missing values using interpolation strategies supported by `pd.Series.interpolate <https://pandas.pydata.org/docs/reference/api/pandas.Series.interpolate.html>`_. It is done column by column. See the :class:`~qolmat.imputations.imputers.ImputerInterpolation` class. When data are temporal with a clear seasonal decomposition, we can interpolate on the residuals instead of interpolating the raw data directly. Series are de-seasonalised based on `statsmodels.tsa.seasonal.seasonal_decompose <https://www.statsmodels.org/stable/generated/statsmodels.tsa.seasonal.seasonal_decompose.html>`_, residuals are imputed via linear interpolation, then re-seasonalised. It is also done column by column. See :class:`~qolmat.imputations.imputers.ImputerResiduals`.


-4. MICE
+5. MICE
 -------
 Multiple Imputation by Chained Equation: multiple imputations based on ICE. It uses `IterativeImputer <https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html#sklearn.impute.IterativeImputer>`_. See the :class:`~qolmat.imputations.imputers.ImputerMICE` class.

-5. RPCA
+6. RPCA
 -------
 Robust Principal Component Analysis (RPCA) is a modification of the statistical procedure of PCA which allows to work with a data matrix :math:`\mathbf{D} \in \mathbb{R}^{n \times d}` containing missing values and grossly corrupted observations. We consider here the imputation task alone, but these methods can also tackle anomaly correction.

 Two cases are considered.

 **RPCA via Principal Component Pursuit (PCP)** [1, 12]

-The class :class:`RPCAPCP` implements a matrix decomposition :math:`\mathbf{D} = \mathbf{M} + \mathbf{A}` where :math:`\mathbf{M}` has low-rank and :math:`\mathbf{A}` is sparse. It relies on the following optimisation problem
+The class :class:`RpcaPcp` implements a matrix decomposition :math:`\mathbf{D} = \mathbf{M} + \mathbf{A}`, where :math:`\mathbf{M}` is low-rank and :math:`\mathbf{A}` is sparse. It relies on the following optimisation problem

 .. math::
    \text{min}_{\mathbf{M} \in \mathbb{R}^{m \times n}} \quad \Vert \mathbf{M} \Vert_* + \lambda \Vert P_\Omega(\mathbf{D-M}) \Vert_1
@@ -38,15 +42,15 @@ See the :class:`~qolmat.imputations.imputers.ImputerRpcaPcp` class for implementation details.

 **Noisy RPCA** [2, 3, 4]

-The class :class:`RPCANoisy` implements an recommanded improved version, which relies on a decomposition :math:`\mathbf{D} = \mathbf{M} + \mathbf{A} + \mathbf{E}`. The additionnal term encodes a Gaussian noise and makes the numerical convergence more reliable. This class also implements a time-consistency penalization for time series, parametrized by the :math:`\eta_k`and :math:`H_k`. By defining :math:`\Vert \mathbf{MH_k} \Vert_p` is either :math:`\Vert \mathbf{MH_k} \Vert_1` or :math:`\Vert \mathbf{MH_k} \Vert_F^2`, the optimisation problem is the following
+The class :class:`RpcaNoisy` implements a recommended improved version, which relies on a decomposition :math:`\mathbf{D} = \mathbf{M} + \mathbf{A} + \mathbf{E}`. The additional term encodes a Gaussian noise and makes the numerical convergence more reliable. This class also implements a time-consistency penalization for time series, parametrized by :math:`\eta_k` and :math:`H_k`. Defining :math:`\Vert \mathbf{MH_k} \Vert_p` as either :math:`\Vert \mathbf{MH_k} \Vert_1` or :math:`\Vert \mathbf{MH_k} \Vert_F^2`, the optimisation problem is the following

 .. math::
    \text{min}_{\mathbf{M, A} \in \mathbb{R}^{m \times n}} \quad \frac 1 2 \Vert P_{\Omega} (\mathbf{D}-\mathbf{M}-\mathbf{A}) \Vert_F^2 + \tau \Vert \mathbf{M} \Vert_* + \lambda \Vert \mathbf{A} \Vert_1 + \sum_{k=1}^K \eta_k \Vert \mathbf{M H_k} \Vert_p

 with :math:`\mathbf{E} = \mathbf{D} - \mathbf{M} - \mathbf{A}`.
 See the :class:`~qolmat.imputations.imputers.ImputerRpcaNoisy` class for implementation details.

-6. SoftImpute
+7. SoftImpute
 -------------
 SoftImpute is an iterative method for matrix completion that uses nuclear-norm regularization [11]. It is a faster alternative to RPCA, although it is much less robust due to the quadratic penalization. Given a matrix :math:`\mathbf{D} \in \mathbb{R}^{n \times d}` with observed entries indexed by the set :math:`\Omega`, this algorithm solves the following problem:

@@ -56,11 +60,11 @@ SoftImpute is an iterative method for matrix completion that uses nuclear-norm regularization [11].
 The imputed values are then given by the matrix :math:`M=LQ` on the unobserved data.
 See the :class:`~qolmat.imputations.imputers.ImputerSoftImpute` class for implementation details.

-7. KNN
+8. KNN
 ------
 K-nearest neighbors, based on `KNNImputer <https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html>`_. See the :class:`~qolmat.imputations.imputers.ImputerKNN` class.

-8. EM sampler
+9. EM sampler
 -------------
 Imputes missing values via EM algorithm [5], and more precisely via MCEM algorithm [6]. See the :class:`~qolmat.imputations.imputers.ImputerEM` class.
 Suppose the data :math:`\mathbf{X}` has a density :math:`p_\theta` parametrized by some parameter :math:`\theta`. The EM algorithm allows to draw samples from this distribution by alternating between the expectation and maximization steps.
@@ -91,6 +95,7 @@ We estimate the distribution parameter :math:`\theta` by likelihood maximization
 Once the parameter :math:`\theta^*` has been estimated the final data imputation can be done in two different ways, depending on the value of the argument `method`:

 * `mle`: Returns the maximum likelihood estimator
+
 .. math::
     X^* = \mathrm{argmax}_X L(X, \theta^*)

@@ -103,7 +108,7 @@ Two parametric distributions are implemented:
 * :class:`~qolmat.imputations.em_sampler.VARpEM`: [7]: :math:`\mathbf{X} \in \mathbb{R}^{n \times d} \sim VAR_p(\nu, B_1, ..., B_p)` is generated by a VAR(p) process such that :math:`X_t = \nu + B_1 X_{t-1} + ... + B_p X_{t-p} + u_t`, where :math:`\nu \in \mathbb{R}^d` is a vector of intercept terms, the :math:`B_i \in \mathbb{R}^{d \times d}` are the lag coefficient matrices and :math:`u_t` is white noise with nonsingular covariance matrix :math:`\Sigma_u \in \mathbb{R}^{d \times d}`, so that :math:`\theta = (\nu, B_1, ..., B_p, \Sigma_u)`.


-9. TabDDPM
+10. TabDDPM
 -----------

 :class:`~qolmat.imputations.diffusions.ddpms.TabDDPM` is a deep learning imputer based on Denoising Diffusion Probabilistic Models (DDPMs) [8] for handling multivariate tabular data. Our implementation mainly follows the works of [8, 9]. Diffusion models focus on modeling the process of data transitions from noisy and incomplete observations to the underlying true data. They include two main processes:
@@ -115,8 +120,8 @@ In training phase, we use the self-supervised learning method of [9] to train

 In the case of time-series data, we also propose :class:`~qolmat.imputations.diffusions.ddpms.TsDDPM` (built on top of :class:`~qolmat.imputations.diffusions.ddpms.TabDDPM`) to capture time-based relationships between data points in a dataset. In fact, the dataset is pre-processed using a sliding window method to obtain a set of data partitions. The noise prediction of the model :math:`\epsilon_\theta` takes into account not only the observed data at the current time step but also data from previous time steps. These time-based relationships are encoded using a transformer-based architecture [9].

-References
-----------
+References (Imputers)
+---------------------

 [1] Candès, Emmanuel J., et al. `Robust principal component analysis? <https://arxiv.org/abs/2001.05484>`_ Journal of the ACM (JACM) 58.3 (2011): 1-37.
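
Since the RPCA classes were renamed in this release, here is a short sketch of the new decompose API. The constructor arguments and the ``decompose(D, Omega)`` call are taken from the examples/RPCA.md change below; the toy data matrix ``D`` and observation mask ``Omega`` are made up for illustration:

```python
import numpy as np
from qolmat.imputations.rpca.rpca_noisy import RpcaNoisy

rng = np.random.default_rng(0)
D = rng.normal(size=(100, 10))        # toy data matrix
Omega = rng.random((100, 10)) > 0.1   # True where an entry is observed

# Same call as in examples/RPCA.md: M is the low-rank part, A the sparse part.
rpca_noisy = RpcaNoisy(tau=1, lam=0.4, rank=2, norm="L2")
M, A = rpca_noisy.decompose(D, Omega)
# Missing entries can then be read off the smooth component M.
```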

docs/index.rst

Lines changed: 1 addition & 0 deletions
@@ -16,6 +16,7 @@

    imputers
    examples/tutorials/plot_tuto_benchmark_TS
+   examples/tutorials/plot_tuto_categorical
    examples/tutorials/plot_tuto_diffusion_models

 .. toctree::

environment.dev.yml

Lines changed: 1 addition & 1 deletion
@@ -16,7 +16,7 @@ dependencies:
   - python=3.8
   - pip=23.0.1
   - scipy=1.10.1
-  - scikit-learn=1.2.2
+  - scikit-learn=1.3.2
   - sphinx=4.3.2
   - sphinx-gallery=0.10.1
   - sphinx_rtd_theme=1.0.0

examples/RPCA.md

Lines changed: 1 addition & 1 deletion
@@ -199,7 +199,7 @@ plt.show()

 ```python
 %%time
-# rpca_noisy = RPCANoisy(period=10, tau=1, lam=0.4, rank=2, list_periods=[10], list_etas=[0.01], norm="L2")
+# rpca_noisy = RpcaNoisy(period=10, tau=1, lam=0.4, rank=2, list_periods=[10], list_etas=[0.01], norm="L2")
 rpca_noisy = RpcaNoisy(tau=1, lam=0.4, rank=2, norm="L2")
 M, A = rpca_noisy.decompose(D, Omega)
 # imputed = X
