Commit c4b7773

Merge pull request #131 from scikit-learn-contrib/dev ("Dev")
2 parents 5da1e22 + 0b579e3

35 files changed: +2803 −732 lines

.bumpversion.cfg

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 [bumpversion]
-current_version = 0.1.2
+current_version = 0.1.4
 commit = True
 tag = True

.flake8

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 [flake8]
-exclude = .git,__pycache__,.vscode,tests
+exclude = .git,__pycache__,.vscode
 max-line-length=99
 ignore=E302,E305,W503,E203,E731,E402,E266,E712,F401,F821
 indent-size = 4

HISTORY.rst

Lines changed: 10 additions & 0 deletions
@@ -2,6 +2,16 @@
 History
 =======

+0.1.4 (2024-04-15)
+------------------
+
+* ImputerMean, ImputerMedian and ImputerMode have been merged into ImputerSimple
+* New file preprocessing.py with the classes MixteHGBM, BinTransformer, OneHotEncoderProjector and WrapperTransformer, providing tools to manage mixed-type data
+* Tutorial plot_tuto_categorical showcasing mixed-type imputation
+* Titanic dataset added
+* Accuracy metric implemented
+* metrics.py rationalized and split with algebra.py
+
 0.1.3 (2024-03-07)
 ------------------
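
A minimal migration sketch for the ImputerSimple merge (hedged: the exact constructor arguments of ImputerSimple are not shown in this diff; the scikit-learn-style ``fit_transform`` call matches how Qolmat imputers are used elsewhere in these docs):

```python
import numpy as np
import pandas as pd
from qolmat.imputations.imputers import ImputerSimple

df = pd.DataFrame({"num": [1.0, np.nan, 3.0], "cat": ["a", "b", np.nan]})

# Before 0.1.4 (now removed):
# from qolmat.imputations.imputers import ImputerMean, ImputerMedian, ImputerMode

# From 0.1.4: one class covers all three strategies. Per the docs below,
# categorical columns get the mode, numerical columns the chosen statistic
# (mean by default; the parameter name is not shown in this diff).
df_imputed = ImputerSimple().fit_transform(df)
```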

README.rst

Lines changed: 2 additions & 0 deletions
@@ -232,6 +232,8 @@ Selected Topics in Signal Processing 10.4 (2016): 740-756.
 [6] García, S., Luengo, J., & Herrera, F. "Data preprocessing in data mining". 2015.
 (`pdf <https://www.academia.edu/download/60477900/Garcia__Luengo__Herrera-Data_Preprocessing_in_Data_Mining_-_Springer_International_Publishing_201520190903-77973-th1o73.pdf>`__)

+[7] Botterman, HL., Roussel, J., Morzadec, T., Jabbari, A., Brunel, N. "Robust PCA for Anomaly Detection and Data Imputation in Seasonal Time Series" (2022). In International Conference on Machine Learning, Optimization, and Data Science. Cham: Springer Nature Switzerland. (`pdf <https://link.springer.com/chapter/10.1007/978-3-031-25891-6_21>`__)
+
 📝 License
 ==========

docs/api.rst

Lines changed: 38 additions & 21 deletions
@@ -4,8 +4,8 @@ Qolmat API

 .. currentmodule:: qolmat

-Imputers
-=========
+Imputers API
+============

 .. autosummary::
    :toctree: generated/
@@ -15,10 +15,8 @@ Imputers
    imputations.imputers.ImputerKNN
    imputations.imputers.ImputerInterpolation
    imputations.imputers.ImputerLOCF
-   imputations.imputers.ImputerMedian
-   imputations.imputers.ImputerMean
+   imputations.imputers.ImputerSimple
    imputations.imputers.ImputerMICE
-   imputations.imputers.ImputerMode
    imputations.imputers.ImputerNOCB
    imputations.imputers.ImputerOracle
    imputations.imputers.ImputerRegressor
@@ -28,17 +26,17 @@ Imputers
    imputations.imputers.ImputerSoftImpute
    imputations.imputers.ImputerShuffle

-Comparator
-===========
+Comparator API
+==============

 .. autosummary::
    :toctree: generated/
    :template: class.rst

    benchmark.comparator.Comparator

-Missing Patterns
-================
+Missing Patterns API
+====================

 .. autosummary::
    :toctree: generated/
@@ -51,8 +49,8 @@ Missing Patterns
    benchmark.missing_patterns.GroupedHoleGenerator


-Metrics
-=======
+Metrics API
+===========

 .. autosummary::
    :toctree: generated/
@@ -63,6 +61,7 @@ Metrics
    benchmark.metrics.mean_absolute_error
    benchmark.metrics.mean_absolute_percentage_error
    benchmark.metrics.weighted_mean_absolute_percentage_error
+   benchmark.metrics.accuracy
    benchmark.metrics.dist_wasserstein
    benchmark.metrics.kl_divergence
    benchmark.metrics.kolmogorov_smirnov_test
@@ -75,19 +74,19 @@ Metrics
    benchmark.metrics.pattern_based_weighted_mean_metric


-RPCA engine
-================
+RPCA engine API
+===============

 .. autosummary::
    :toctree: generated/
    :template: class.rst

-   imputations.rpca.rpca_pcp.RPCAPCP
-   imputations.rpca.rpca_noisy.RPCANoisy
+   imputations.rpca.rpca_pcp.RpcaPcp
+   imputations.rpca.rpca_noisy.RpcaNoisy


-EM engine
-================
+Expectation-Maximization engine API
+===================================

 .. autosummary::
    :toctree: generated/
@@ -96,8 +95,8 @@ EM engine
    imputations.em_sampler.MultiNormalEM
    imputations.em_sampler.VARpEM

-Diffusion engine
-================
+Diffusion Model engine API
+==========================

 .. autosummary::
    :toctree: generated/
@@ -107,9 +106,27 @@ Diffusion engine
    imputations.diffusions.ddpms.TabDDPM
    imputations.diffusions.ddpms.TsDDPM

+Preprocessing API
+=================
+
+.. autosummary::
+   :toctree: generated/
+   :template: class.rst
+
+   imputations.preprocessing.MixteHGBM
+   imputations.preprocessing.BinTransformer
+   imputations.preprocessing.OneHotEncoderProjector
+   imputations.preprocessing.WrapperTransformer
+
+.. autosummary::
+   :toctree: generated/
+   :template: function.rst
+
+   imputations.preprocessing.make_pipeline_mixte_preprocessing
+   imputations.preprocessing.make_robust_MixteHGB

-Utils
-================
+Utils API
+=========

 .. autosummary::
    :toctree: generated/
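
A hedged sketch of the new preprocessing helpers (the function names come from the API listing above; their signatures are not shown in this diff, so the zero-argument calls are an assumption):

```python
from qolmat.imputations.preprocessing import (
    make_pipeline_mixte_preprocessing,
    make_robust_MixteHGB,
)

# A scikit-learn pipeline mapping mixed numerical/categorical columns
# to a purely numerical representation (assumed default construction).
preprocessing = make_pipeline_mixte_preprocessing()

# A HistGradientBoosting-based estimator wrapped to stay robust on
# mixed-type inputs, as introduced in 0.1.4.
model = make_robust_MixteHGB()
```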

docs/conf.py

Lines changed: 1 addition & 1 deletion
@@ -27,7 +27,7 @@
 author = "Quantmetry"

 # The full version, including alpha/beta/rc tags
-version = "0.1.2"
+version = "0.1.4"
 release = version

 # -- General configuration ---------------------------------------------------

docs/imputers.rst

Lines changed: 23 additions & 18 deletions
@@ -3,32 +3,36 @@ Imputers

 All imputers can be found in the ``qolmat.imputations`` folder.

-1. mean/median/shuffle
-----------------------
-Imputes the missing values using the mean/median along each column or with a random value in each column. See the :class:`~qolmat.imputations.imputers.ImputerMean`, :class:`~qolmat.imputations.imputers.ImputerMedian` and :class:`~qolmat.imputations.imputers.ImputerShuffle` classes.
+1. Simple (mean/median/mode)
+----------------------------
+Imputes the missing values using basic statistics: the mode (most frequent value) for categorical columns, and the mean, median or mode (depending on the user parameter) for numerical columns. See :class:`~qolmat.imputations.imputers.ImputerSimple`.

-2. LOCF
+2. Shuffle
+----------
+Imputes the missing values using a random value sampled from the same column. See :class:`~qolmat.imputations.imputers.ImputerShuffle`.
+
+3. LOCF
 -------
-Imputes the missing values using the last observation carried forward. See the :class:`~qolmat.imputations.imputers.ImputerLOCF` class.
+Imputes the missing values using the last observation carried forward. See :class:`~qolmat.imputations.imputers.ImputerLOCF`.

-3. interpolation (on residuals)
--------------------------------
-Imputes missing using some interpolation strategies supported by `pd.Series.interpolate <https://pandas.pydata.org/docs/reference/api/pandas.Series.interpolate.html>`_. It is done column by column. See the :class:`~qolmat.imputations.imputers.ImputerInterpolation` class. When data are temporal with clear seasonal decomposition, we can interpolate on the residuals instead of directly interpolate the raw data. Series are de-seasonalised based on `statsmodels.tsa.seasonal.seasonal_decompose <https://www.statsmodels.org/stable/generated/statsmodels.tsa.seasonal.seasonal_decompose.html>`_, residuals are imputed via linear interpolation, then residuals are re-seasonalised. It is also done column by column. See the :class:`~qolmat.imputations.imputers.ImputerResiduals` class.
+4. Time interpolation and TSA decomposition
+-------------------------------------------
+Imputes missing values using interpolation strategies supported by `pd.Series.interpolate <https://pandas.pydata.org/docs/reference/api/pandas.Series.interpolate.html>`_. It is done column by column. See the :class:`~qolmat.imputations.imputers.ImputerInterpolation` class. When data are temporal with a clear seasonal decomposition, we can interpolate on the residuals instead of interpolating the raw data directly. Series are de-seasonalised based on `statsmodels.tsa.seasonal.seasonal_decompose <https://www.statsmodels.org/stable/generated/statsmodels.tsa.seasonal.seasonal_decompose.html>`_, residuals are imputed via linear interpolation, then re-seasonalised. It is also done column by column. See :class:`~qolmat.imputations.imputers.ImputerResiduals`.


-4. MICE
+5. MICE
 -------
 Multiple Imputation by Chained Equation: multiple imputations based on ICE. It uses `IterativeImputer <https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html#sklearn.impute.IterativeImputer>`_. See the :class:`~qolmat.imputations.imputers.ImputerMICE` class.

-5. RPCA
+6. RPCA
 -------
 Robust Principal Component Analysis (RPCA) is a modification of the statistical procedure of PCA which allows to work with a data matrix :math:`\mathbf{D} \in \mathbb{R}^{n \times d}` containing missing values and grossly corrupted observations. We consider here the imputation task alone, but these methods can also tackle anomaly correction.

 Two cases are considered.

 **RPCA via Principal Component Pursuit (PCP)** [1, 12]

-The class :class:`RPCAPCP` implements a matrix decomposition :math:`\mathbf{D} = \mathbf{M} + \mathbf{A}` where :math:`\mathbf{M}` has low-rank and :math:`\mathbf{A}` is sparse. It relies on the following optimisation problem
+The class :class:`RpcaPcp` implements a matrix decomposition :math:`\mathbf{D} = \mathbf{M} + \mathbf{A}`, where :math:`\mathbf{M}` is low-rank and :math:`\mathbf{A}` is sparse. It relies on the following optimisation problem

 .. math::
    \text{min}_{\mathbf{M} \in \mathbb{R}^{m \times n}} \quad \Vert \mathbf{M} \Vert_* + \lambda \Vert P_\Omega(\mathbf{D-M}) \Vert_1
@@ -38,15 +42,15 @@ See the :class:`~qolmat.imputations.imputers.ImputerRpcaPcp` class for implementation details.

 **Noisy RPCA** [2, 3, 4]

-The class :class:`RPCANoisy` implements an recommanded improved version, which relies on a decomposition :math:`\mathbf{D} = \mathbf{M} + \mathbf{A} + \mathbf{E}`. The additionnal term encodes a Gaussian noise and makes the numerical convergence more reliable. This class also implements a time-consistency penalization for time series, parametrized by the :math:`\eta_k`and :math:`H_k`. By defining :math:`\Vert \mathbf{MH_k} \Vert_p` is either :math:`\Vert \mathbf{MH_k} \Vert_1` or :math:`\Vert \mathbf{MH_k} \Vert_F^2`, the optimisation problem is the following
+The class :class:`RpcaNoisy` implements a recommended improved version, which relies on a decomposition :math:`\mathbf{D} = \mathbf{M} + \mathbf{A} + \mathbf{E}`. The additional term encodes a Gaussian noise and makes the numerical convergence more reliable. This class also implements a time-consistency penalization for time series, parametrized by :math:`\eta_k` and :math:`H_k`. Defining :math:`\Vert \mathbf{MH_k} \Vert_p` as either :math:`\Vert \mathbf{MH_k} \Vert_1` or :math:`\Vert \mathbf{MH_k} \Vert_F^2`, the optimisation problem is the following

 .. math::
    \text{min}_{\mathbf{M, A} \in \mathbb{R}^{m \times n}} \quad \frac 1 2 \Vert P_{\Omega} (\mathbf{D}-\mathbf{M}-\mathbf{A}) \Vert_F^2 + \tau \Vert \mathbf{M} \Vert_* + \lambda \Vert \mathbf{A} \Vert_1 + \sum_{k=1}^K \eta_k \Vert \mathbf{M H_k} \Vert_p

 with :math:`\mathbf{E} = \mathbf{D} - \mathbf{M} - \mathbf{A}`.
 See the :class:`~qolmat.imputations.imputers.ImputerRpcaNoisy` class for implementation details.

-6. SoftImpute
+7. SoftImpute
 -------------
 SoftImpute is an iterative method for matrix completion that uses nuclear-norm regularization [11]. It is a faster alternative to RPCA, although it is much less robust due to the quadratic penalization. Given a matrix :math:`\mathbf{D} \in \mathbb{R}^{n \times d}` with observed entries indexed by the set :math:`\Omega`, this algorithm solves the following problem:

@@ -56,11 +60,11 @@ SoftImpute is an iterative method for matrix completion that uses nuclear-norm regularization [11].
 The imputed values are then given by the matrix :math:`M=LQ` on the unobserved data.
 See the :class:`~qolmat.imputations.imputers.ImputerSoftImpute` class for implementation details.

-7. KNN
+8. KNN
 ------
 K-nearest neighbors, based on `KNNImputer <https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html>`_. See the :class:`~qolmat.imputations.imputers.ImputerKNN` class.

-8. EM sampler
+9. EM sampler
 -------------
 Imputes missing values via EM algorithm [5], and more precisely via MCEM algorithm [6]. See the :class:`~qolmat.imputations.imputers.ImputerEM` class.
 Suppose the data :math:`\mathbf{X}` has a density :math:`p_\theta` parametrized by some parameter :math:`\theta`. The EM algorithm allows to draw samples from this distribution by alternating between the expectation and maximization steps.
@@ -91,6 +95,7 @@ We estimate the distribution parameter :math:`\theta` by likelihood maximization
 Once the parameter :math:`\theta^*` has been estimated the final data imputation can be done in two different ways, depending on the value of the argument `method`:

 * `mle`: Returns the maximum likelihood estimator
+
 .. math::
     X^* = \mathrm{argmax}_X L(X, \theta^*)

@@ -103,7 +108,7 @@ Two parametric distributions are implemented:
 * :class:`~qolmat.imputations.em_sampler.VARpEM`: [7]: :math:`\mathbf{X} \in \mathbb{R}^{n \times d} \sim VAR_p(\nu, B_1, ..., B_p)` is generated by a VAR(p) process such that :math:`X_t = \nu + B_1 X_{t-1} + ... + B_p X_{t-p} + u_t`, where :math:`\nu \in \mathbb{R}^d` is a vector of intercept terms, the :math:`B_i \in \mathbb{R}^{d \times d}` are the lag coefficient matrices and :math:`u_t` is white noise with nonsingular covariance matrix :math:`\Sigma_u \in \mathbb{R}^{d \times d}`, so that :math:`\theta = (\nu, B_1, ..., B_p, \Sigma_u)`.


-9. TabDDPM
+10. TabDDPM
 -----------

 :class:`~qolmat.imputations.diffusions.ddpms.TabDDPM` is a deep learning imputer based on Denoising Diffusion Probabilistic Models (DDPMs) [8] for handling multivariate tabular data. Our implementation mainly follows the works of [8, 9]. Diffusion models focus on modeling the process of data transitions from noisy and incomplete observations to the underlying true data. They include two main processes:
@@ -115,8 +120,8 @@ In training phase, we use the self-supervised learning method of [9] to train

 In the case of time-series data, we also propose :class:`~qolmat.imputations.diffusions.ddpms.TsDDPM` (built on top of :class:`~qolmat.imputations.diffusions.ddpms.TabDDPM`) to capture time-based relationships between data points in a dataset. In fact, the dataset is pre-processed using a sliding window method to obtain a set of data partitions. The noise prediction of the model :math:`\epsilon_\theta` takes into account not only the observed data at the current time step but also data from previous time steps. These time-based relationships are encoded using a transformer-based architecture [9].

-References
-----------
+References (Imputers)
+---------------------

 [1] Candès, Emmanuel J., et al. `Robust principal component analysis? <https://arxiv.org/abs/2001.05484>`_ Journal of the ACM (JACM) 58.3 (2011): 1-37.
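
Since the RPCA classes were renamed in this release, here is a short sketch of the new decompose API. The constructor arguments and the ``decompose(D, Omega)`` call are taken from the examples/RPCA.md change below; the toy data matrix ``D`` and observation mask ``Omega`` are made up for illustration:

```python
import numpy as np
from qolmat.imputations.rpca.rpca_noisy import RpcaNoisy

rng = np.random.default_rng(0)
D = rng.normal(size=(100, 10))        # toy data matrix
Omega = rng.random((100, 10)) > 0.1   # True where an entry is observed

# Same call as in examples/RPCA.md: M is the low-rank part, A the sparse part.
rpca_noisy = RpcaNoisy(tau=1, lam=0.4, rank=2, norm="L2")
M, A = rpca_noisy.decompose(D, Omega)
# Missing entries can then be read off the smooth component M.
```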

docs/index.rst

Lines changed: 1 addition & 0 deletions
@@ -16,6 +16,7 @@

    imputers
    examples/tutorials/plot_tuto_benchmark_TS
+   examples/tutorials/plot_tuto_categorical
    examples/tutorials/plot_tuto_diffusion_models

 .. toctree::

environment.dev.yml

Lines changed: 1 addition & 1 deletion
@@ -16,7 +16,7 @@ dependencies:
   - python=3.8
   - pip=23.0.1
   - scipy=1.10.1
-  - scikit-learn=1.2.2
+  - scikit-learn=1.3.2
   - sphinx=4.3.2
   - sphinx-gallery=0.10.1
   - sphinx_rtd_theme=1.0.0

examples/RPCA.md

Lines changed: 1 addition & 1 deletion
@@ -199,7 +199,7 @@ plt.show()

 ```python
 %%time
-# rpca_noisy = RPCANoisy(period=10, tau=1, lam=0.4, rank=2, list_periods=[10], list_etas=[0.01], norm="L2")
+# rpca_noisy = RpcaNoisy(period=10, tau=1, lam=0.4, rank=2, list_periods=[10], list_etas=[0.01], norm="L2")
 rpca_noisy = RpcaNoisy(tau=1, lam=0.4, rank=2, norm="L2")
 M, A = rpca_noisy.decompose(D, Omega)
 # imputed = X
