scikit-learn-contrib
diff --git a/CONTRIBUTING.rst b/CONTRIBUTING.rst
Lines changed: 1 addition & 1 deletion
diff --git a/HISTORY.rst b/HISTORY.rst
Lines changed: 4 additions & 1 deletion
diff --git a/README.rst b/README.rst
Lines changed: 7 additions & 7 deletions
diff --git a/environment.ci.yml b/environment.ci.yml
Lines changed: 1 addition & 1 deletion
diff --git a/examples/benchmark.md b/examples/benchmark.md
Lines changed: 19 additions & 24 deletions
diff --git a/qolmat/imputations/imputers.py b/qolmat/imputations/imputers.py
Lines changed: 18 additions & 76 deletions
diff --git a/qolmat/imputations/imputers_keras.py b/qolmat/imputations/imputers_keras.py
Lines changed: 0 additions & 44 deletions
@@ -36,7 +36,7 @@ If you need to use tensorflow, enter the command:
 
 .. code:: sh
 
-    $ pip install -e .[tensorflow]
+    $ pip install -e .[pytorch]
 
 Once the environment is installed, pre-commit is installed too, but it needs to be activated using the following command:
 
 
@@ -9,7 +9,10 @@ History
 * The Imputer classes do not possess a dictionary attribute anymore, and all list attributes have
 been changed into tuple attributes so that all are now immutable
 * All the tests from scikit-learn's check_estimator now pass for the class Imputer
-* Fix MLP imputer
+* Fix MLP imputer and create a builder for it
+* Replace tensorflow with pytorch; update tests, environment, benchmark and imputers accordingly
+* Add new datasets
+* Add dcor metrics with a pattern-wise computation on data with missing values
 
 0.0.14 (2023-06-14)
 -------------------
 
@@ -62,7 +62,7 @@ To install directly from the github repository :
 
 Let us start with a basic imputation problem. Here, we generate one-dimensional noisy time series.
 
-.. code:: sh
+.. code-block:: python
 
     import matplotlib.pyplot as plt
     import numpy as np
@@ -75,7 +75,7 @@ Let us start with a basic imputation problem. Here, we generate one-dimensional
 
 For this demonstration, let us create artificial holes in our dataset.
 
-.. code:: sh
+.. code-block:: python
 
     from qolmat.utils.data import add_holes
     plt.rcParams.update({'font.size': 18})
@@ -101,7 +101,7 @@ For this demonstration, let us create artificial holes in our dataset.
 To impute missing data, there are several methods that can be imported with ``from qolmat.imputations import imputers``.
 The creation of an imputation dictionary will enable us to benchmark the various imputations.
 
-.. code:: sh
+.. code-block:: python
 
     from sklearn.linear_model import LinearRegression
     from qolmat.imputations import imputers
@@ -146,7 +146,7 @@ The creation of an imputation dictionary will enable us to benchmark the various
 
 It is possible to define a parameter dictionary for an imputer with three pieces of information: min, max and type. The aim of the dictionary is to determine the optimal parameters for data imputation. Here, we call this dictionary ``dict_config_opti``.
 
-.. code:: sh
+.. code-block:: python
 
     search_params = {
         "RPCA_opti": {
@@ -157,7 +157,7 @@ It is possible to define a parameter dictionary for an imputer with three pieces
 
 Then with the comparator function in ``from qolmat.benchmark import comparator``, we can compare the different imputation methods. This **does not use knowledge on missing values**, but it relies on data masking instead. For more details on how imputers and comparator work, please see the following `link <https://qolmat.readthedocs.io/en/latest/explanation.html>`_.
 
-.. code:: sh
+.. code-block:: python
 
     from qolmat.benchmark import comparator
 
@@ -175,7 +175,7 @@ Then with the comparator function in ``from qolmat.benchmark import comparator``
 
 We can observe the benchmark results.
 
-.. code:: sh
+.. code-block:: python
 
     dfs_imputed =  imputer_tsmle.fit_transform(df_with_nan)
 
@@ -196,7 +196,7 @@ We can observe the benchmark results.
 
 Finally, we keep the best imputer, ``TSMLE``, and plot its imputations.
 
-.. code:: sh
+.. code-block:: python
 
     dfs_imputed =  imputer_tsmle.fit_transform(df_with_nan)
 
 
@@ -14,5 +14,5 @@ dependencies:
           - pytest
           - pytest-cov
           - pytest-mock
-          - tensorflow
+          - torch==2.0.1
           - -e .
@@ -19,18 +19,6 @@ In Qolmat, a few data imputation methods are implemented as well as a way to eva
 
 First, import some useful libraries
 
-```python
-X= np.array([[0], [1], [2]])
-```
-
-```python
-np.cov(X)
-```
-
-```python
-
-```
-
 ```python
 import warnings
 # warnings.filterwarnings('error')
@@ -146,7 +134,7 @@ imputer_tsmle = imputers.ImputerEM(groups=("station",), model="VAR1", method="ml
 
 
 imputer_knn = imputers.ImputerKNN(groups=("station",), n_neighbors=10)
-imputer_mice = imputers.ImputerMICE(groups=("station",), estimator=LinearRegression(), sample_posterior=False, max_iter=100, missing_values=np.nan)
+imputer_mice = imputers.ImputerMICE(groups=("station",), estimator=LinearRegression(), sample_posterior=False, max_iter=100)
 imputer_regressor = imputers.ImputerRegressor(groups=("station",), estimator=LinearRegression())
 ```
 
@@ -352,8 +340,11 @@ plt.show()
 In this section, we present an MLP model of data imputation using Keras, which can be installed using a "pip install tensorflow".
 
 ```python
-from qolmat.imputations import imputers_keras
-import tensorflow as tf
+from qolmat.imputations import imputers_pytorch
+try:
+    import torch.nn as nn
+except ModuleNotFoundError:
+    raise PyTorchExtraNotInstalled
 ```
 
 For the MLP model, we work on a dataset that corresponds to weather data with missing values. We add missing MCAR values on the features "TEMP", "PRES" and other features with NaN values. The goal is to impute the missing values for the features "TEMP" and "PRES" by a Deep Learning method. We add features to take into account the seasonality of the data set and a feature for the station name.
@@ -371,17 +362,21 @@ For the example, we use a simple MLP model with 3 layers of neurons.
 Then we train the model without taking a group on the stations
 
 ```python
-estimator = tf.keras.models.Sequential([
-    tf.keras.layers.Dense(256, activation='relu'),
-    tf.keras.layers.Dense(128, activation='relu'),
-    tf.keras.layers.Dense(64, activation='relu'),
-    tf.keras.layers.Dense(1)])
-estimator.compile(optimizer='adam', loss='mae')
-dict_imputers["MLP"] = imputer_mlp = imputers_keras.ImputerRegressorKeras(estimator=estimator, groups=['station'], handler_nan = "column")
+estimator = nn.Sequential(
+        nn.Linear(np.sum(df_data.isna().sum()==0), 256),
+        nn.ReLU(),
+        nn.Linear(256, 128),
+        nn.ReLU(),
+        nn.Linear(128, 64),
+        nn.ReLU(),
+        nn.Linear(64, 1)
+    )
+# imputers_pytorch.build_mlp_example(input_dim=np.sum(df_data.isna().sum()==0), list_num_neurons=[256,128,64])
+dict_imputers["MLP"] = imputer_mlp = imputers_pytorch.ImputerRegressorPyTorch(estimator=estimator, groups=['station'], handler_nan = "column", epochs=500)
 ```
 
 We can re-run the imputation model benchmark as before.
-```python jupyter={"outputs_hidden": true} tags=[]
+```python tags=[]
 generator_holes = missing_patterns.EmpiricalHoleGenerator(n_splits=2, groups=["station"], subset=cols_to_impute, ratio_masked=ratio_masked)
 
 comparison = comparator.Comparator(
@@ -395,7 +390,7 @@ comparison = comparator.Comparator(
 results = comparison.compare(df_data)
 results
 ```
-```python jupyter={"outputs_hidden": true, "source_hidden": true} tags=[]
+```python jupyter={"source_hidden": true} tags=[]
 df_plot = df_data
 dfs_imputed = {name: imp.fit_transform(df_plot) for name, imp in dict_imputers.items()}
 station = df_plot.index.get_level_values("station")[0]
 
@@ -1407,76 +1407,26 @@ class ImputerRegressor(_Imputer):
 
     def __init__(
         self,
+        imputer_params: Tuple[str, ...] = ("handler_nan",),
         groups: Tuple[str, ...] = (),
         estimator: Optional[BaseEstimator] = None,
         handler_nan: str = "column",
         random_state: Union[None, int, np.random.RandomState] = None,
     ):
         super().__init__(
-            imputer_params=("handler_nan",),
+            imputer_params=imputer_params,
             groups=groups,
             random_state=random_state,
         )
         self.estimator = estimator
         self.handler_nan = handler_nan
 
-    def _get_params_fit(self) -> Dict:
-        """Get the parameters required for the fit, only used for neural networks.
+    def _fit_estimator(self, X, y) -> Self:
+        return self.estimator.fit(X, y)
 
-        Returns
-        -------
-        Dict
-            Dictionary of fit parameters.
-        """
-        return {}
-
-    def fit(self, X: pd.DataFrame, y: pd.DataFrame = None) -> _Imputer:
-        """Fit the imputer on X.
-
-        Parameters
-        ----------
-        X : pd.DataFrame
-            Data matrix on which the Imputer must be fitted.
-
-        Returns
-        -------
-        self : Self
-            Returns self.
-        """
-
-        super().fit(X)
-        df = self._check_input(X)
-
-        cols_with_nans = df.columns[df.isna().any()]
-        self.estimators_ = {}
-        for col in cols_with_nans:
-            # Define the Train and Test set
-            X_ = df.drop(columns=col, errors="ignore")
-            y_ = df[col]
-
-            # Selects only the valid values in the Train Set according to the chosen method
-            is_valid = pd.Series(True, index=df.index)
-            if self.handler_nan == "fit":
-                pass
-            elif self.handler_nan == "row":
-                is_valid = ~X_.isna().any(axis=1)
-            elif self.handler_nan == "column":
-                X_ = X_.dropna(how="any", axis=1)
-            else:
-                raise ValueError(
-                    f"Value '{self.handler_nan}' is not correct for argument `handler_nan'"
-                )
-
-            # Selects only non-NaN values for the Test Set
-            is_na = y_.isna()
-
-            # Train the model according to an ML or DL method and after predict the imputation
-            if not X_.empty:
-                hp = self._get_params_fit()
-                self.estimators_[col] = self.estimator
-                self.estimators_[col].fit(X_[(~is_na) & is_valid], y_[(~is_na) & is_valid], **hp)
-
-        return self
+    def _predict_estimator(self, X) -> pd.Series:
+        pred = self.estimator.predict(X)
+        return pd.Series(pred, index=X.index, dtype=float)
 
     def _transform_element(self, df: pd.DataFrame, col: str = "__all__") -> pd.DataFrame:
         """
@@ -1501,7 +1451,8 @@ def _transform_element(self, df: pd.DataFrame, col: str = "__all__") -> pd.DataF
             Input has to be a pandas.DataFrame.
         """
         self._check_dataframe(df)
-        df_imputed = df.apply(pd.DataFrame.median, result_type="broadcast", axis=0)
+        # df_imputed = df.apply(pd.DataFrame.median, result_type="broadcast", axis=0)
+        df_imputed = df.copy()
         cols_with_nans = df.columns[df.isna().any()]
 
         for col in cols_with_nans:
@@ -1526,24 +1477,15 @@ def _transform_element(self, df: pd.DataFrame, col: str = "__all__") -> pd.DataF
             is_na = y.isna()
 
             # Train the model according to an ML or DL method and after predict the imputation
-            if col not in self.estimators_:
-                y_imputed = pd.Series(y.mean(), index=y.index)
-            else:
-                X_select = X[is_na & is_valid]
-                y_imputed = self.estimators_[col].predict(X_select)
-                y_imputed = y_imputed.flatten().astype(float)
-
-                y_imputed = pd.Series(y_imputed, index=X_select.index)
-
-            # Adds the imputed values
-            # df_imputed.loc[~is_na, col] = y[~is_na]
-            # if isinstance(y_imputed, pd.Series):
-            #     y_reshaped = y_imputed
-            # else:
-            #     y_reshaped = y_imputed.flatten()
-            # df_imputed.loc[is_na & is_valid, col] = y_imputed.values[: sum(is_na & is_valid)]
-            df_imputed[col] = y_imputed.where(is_valid & is_na, y)
-
+            is_in_fit = (~is_na) & is_valid
+            is_in_pred = is_na & is_valid
+            if is_in_fit.any() and is_in_pred.any() and not X.empty:
+                self._fit_estimator(X[is_in_fit], y[is_in_fit])
+                X_pred = X[is_in_pred]
+                y_imputed = self._predict_estimator(X_pred)
+
+                df_imputed[col] = y_imputed.where(is_in_pred, y)
+        # df_imputed = df_imputed.fillna(df_imputed.median())
         return df_imputed
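
The refactored `_transform_element` fits and predicts inside the same per-column loop, driven by two boolean masks (`is_in_fit` for observed rows, `is_in_pred` for rows to impute). A minimal standalone sketch of that pattern follows, assuming the `handler_nan="column"` behaviour (predictors restricted to fully observed columns). `impute_by_regression` is a hypothetical helper written for illustration and is not part of qolmat; it also writes predictions back with `.loc` rather than `Series.where`, which is a choice of this sketch and not necessarily what the library does.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def impute_by_regression(df: pd.DataFrame, estimator=None) -> pd.DataFrame:
    """Hypothetical helper: per-column regression imputation with fit/predict masks."""
    estimator = LinearRegression() if estimator is None else estimator
    df_imputed = df.copy()

    for col in df.columns[df.isna().any()]:
        # Predictors: all other columns that contain no missing value
        # (this mirrors handler_nan="column").
        X = df.drop(columns=col).dropna(how="any", axis=1)
        y = df[col]

        is_na = y.isna()
        is_valid = pd.Series(True, index=df.index)  # every row is usable here
        is_in_fit = (~is_na) & is_valid             # rows used for training
        is_in_pred = is_na & is_valid               # rows to be imputed

        # Skip the column when there is nothing to fit on or nothing to predict.
        if is_in_fit.any() and is_in_pred.any() and not X.empty:
            estimator.fit(X[is_in_fit], y[is_in_fit])
            pred = estimator.predict(X[is_in_pred])
            # Only the missing entries are overwritten; observed values are kept.
            df_imputed.loc[is_in_pred, col] = np.asarray(pred, dtype=float)

    return df_imputed


df_demo = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [2.0, 4.0, np.nan, 8.0]})
df_demo_imputed = impute_by_regression(df_demo)
```

Because the same estimator object is refitted for each column, nothing is stored between columns in this sketch; columns with no observed rows or no usable predictors are left untouched, as in the diff where the median fallback is commented out.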