Commit cc6ef55

Merge pull request #45 from Quantmetry/switch_pytorch
Switch pytorch
2 parents 5d21fe7 + df18cf5 commit cc6ef55

16 files changed, 497 additions and 232 deletions

CONTRIBUTING.rst

Lines changed: 1 addition & 1 deletion
@@ -36,7 +36,7 @@ If you need to use tensorflow, enter the command:

 .. code:: sh

-    $ pip install -e .[tensorflow]
+    $ pip install -e .[pytorch]

 Once the environment is installed, pre-commit is installed, but need to be activated using the following command:

HISTORY.rst

Lines changed: 4 additions & 1 deletion
@@ -9,7 +9,10 @@ History
 * The Imputer classes do not possess a dictionary attribute anymore, and all list attributes have
   been changed into tuple attributes so that all are not immutable
 * All the tests from scikit-learn's check_estimator now pass for the class Imputer
-* Fix MLP imputer
+* Fix MLP imputer, created a builder for MLP imputer
+* Switch tensorflow by pytorch. Change Test, environment, benchmark and imputers for pytorch
+* Add new datasets
+* Added dcor metrics with a pattern-wise computation on data with missing values

 0.0.14 (2023-06-14)
 -------------------

README.rst

Lines changed: 7 additions & 7 deletions
@@ -62,7 +62,7 @@ To install directly from the github repository :

 Let us start with a basic imputation problem. Here, we generate one-dimensional noisy time series.

-.. code:: sh
+.. code-block:: python

     import matplotlib.pyplot as plt
     import numpy as np
@@ -75,7 +75,7 @@ Let us start with a basic imputation problem. Here, we generate one-dimensional

 For this demonstration, let us create artificial holes in our dataset.

-.. code:: sh
+.. code-block:: python

     from qolmat.utils.data import add_holes
     plt.rcParams.update({'font.size': 18})
@@ -101,7 +101,7 @@ For this demonstration, let us create artificial holes in our dataset.
 To impute missing data, there are several methods that can be imported with ``from qolmat.imputations import imputers``.
 The creation of an imputation dictionary will enable us to benchmark the various imputations.

-.. code:: sh
+.. code-block:: python

     from sklearn.linear_model import LinearRegression
     from qolmat.imputations import imputers
@@ -146,7 +146,7 @@ The creation of an imputation dictionary will enable us to benchmark the various

 It is possible to define a parameter dictionary for an imputer with three pieces of information: min, max and type. The aim of the dictionary is to determine the optimal parameters for data imputation. Here, we call this dictionary ``dict_config_opti``.

-.. code:: sh
+.. code-block:: python

     search_params = {
         "RPCA_opti": {
@@ -157,7 +157,7 @@ It is possible to define a parameter dictionary for an imputer with three pieces

 Then with the comparator function in ``from qolmat.benchmark import comparator``, we can compare the different imputation methods. This **does not use knowledge on missing values**, but it relies data masking instead. For more details on how imputors and comparator work, please see the following `link <https://qolmat.readthedocs.io/en/latest/explanation.html>`_.

-.. code:: sh
+.. code-block:: python

     from qolmat.benchmark import comparator

@@ -175,7 +175,7 @@ Then with the comparator function in ``from qolmat.benchmark import comparator``

 We can observe the benchmark results.

-.. code:: sh
+.. code-block:: python

     dfs_imputed = imputer_tsmle.fit_transform(df_with_nan)

@@ -196,7 +196,7 @@ We can observe the benchmark results.

 Finally, we keep the best ``TSMLE`` imputor we represent.

-.. code:: sh
+.. code-block:: python

     dfs_imputed = imputer_tsmle.fit_transform(df_with_nan)
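
Note on the README text in this diff: it states that the comparator evaluates imputers by masking data, not by using knowledge of the true missing values. A minimal standard-library sketch of that idea follows. It is not Qolmat's implementation; `mean_impute`, `masked_mae`, and the `ratio_masked` default are hypothetical names and values chosen only for illustration.

```python
import random
from statistics import mean


def mean_impute(values):
    """Replace every None with the mean of the observed values."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def masked_mae(values, impute, ratio_masked=0.2, seed=0):
    """Hide a share of the observed entries, impute, and score only those entries."""
    rng = random.Random(seed)
    observed_idx = [i for i, v in enumerate(values) if v is not None]
    # Mask at least one entry, but always leave at least one observed value.
    n_masked = min(len(observed_idx) - 1, max(1, int(ratio_masked * len(observed_idx))))
    masked_idx = rng.sample(observed_idx, n_masked)

    corrupted = list(values)
    for i in masked_idx:
        corrupted[i] = None  # artificial hole whose true value is known

    imputed = impute(corrupted)
    return mean(abs(imputed[i] - values[i]) for i in masked_idx)


series = [1.0, None, 3.0, 4.0, None, 6.0, 7.0, 8.0]
print(masked_mae(series, mean_impute))
```

The original `None` entries never enter the score: the error is measured only where a known value was hidden on purpose.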

environment.ci.yml

Lines changed: 1 addition & 1 deletion
@@ -14,5 +14,5 @@ dependencies:
 - pytest
 - pytest-cov
 - pytest-mock
-- tensorflow
+- torch==2.0.1
 - -e .
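
Note on the optional dependency: since PyTorch is now an extra (`pip install -e .[pytorch]`), code that needs it has to fail clearly when the extra is absent. The sketch below shows one way to do that with the standard library only. The exception class and the `require` helper are hypothetical; the commit references a `PyTorchExtraNotInstalled` exception whose definition is not shown on this page.

```python
import importlib.util


class OptionalExtraNotInstalled(ImportError):
    """Raised when a feature needs a package that ships only with an extra."""

    def __init__(self, package, extra):
        super().__init__(
            f"'{package}' is not installed; try: pip install -e .[{extra}]"
        )


def require(package, extra):
    """Check that a top-level package is importable without importing it."""
    if importlib.util.find_spec(package) is None:
        raise OptionalExtraNotInstalled(package, extra)


require("json", "stdlib")  # passes silently: json ships with Python
```

Deriving from `ImportError` keeps existing `except ImportError` handlers working while giving the user an actionable message.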

examples/benchmark.md

Lines changed: 19 additions & 24 deletions
@@ -19,18 +19,6 @@ In Qolmat, a few data imputation methods are implemented as well as a way to eva

 First, import some useful librairies

-```python
-X= np.array([[0], [1], [2]])
-```
-
-```python
-np.cov(X)
-```
-
-```python
-
-```
-
 ```python
 import warnings
 # warnings.filterwarnings('error')
@@ -146,7 +134,7 @@ imputer_tsmle = imputers.ImputerEM(groups=("station",), model="VAR1", method="ml


 imputer_knn = imputers.ImputerKNN(groups=("station",), n_neighbors=10)
-imputer_mice = imputers.ImputerMICE(groups=("station",), estimator=LinearRegression(), sample_posterior=False, max_iter=100, missing_values=np.nan)
+imputer_mice = imputers.ImputerMICE(groups=("station",), estimator=LinearRegression(), sample_posterior=False, max_iter=100)
 imputer_regressor = imputers.ImputerRegressor(groups=("station",), estimator=LinearRegression())
 ```

@@ -352,8 +340,11 @@ plt.show()
 In this section, we present an MLP model of data imputation using Keras, which can be installed using a "pip install tensorflow".

 ```python
-from qolmat.imputations import imputers_keras
-import tensorflow as tf
+from qolmat.imputations import imputers_pytorch
+try:
+    import torch.nn as nn
+except ModuleNotFoundError:
+    raise PyTorchExtraNotInstalled
 ```

 For the MLP model, we work on a dataset that corresponds to weather data with missing values. We add missing MCAR values on the features "TEMP", "PRES" and other features with NaN values. The goal is impute the missing values for the features "TEMP" and "PRES" by a Deep Learning method. We add features to take into account the seasonality of the data set and a feature for the station name
@@ -371,17 +362,21 @@ For the example, we use a simple MLP model with 3 layers of neurons.
 Then we train the model without taking a group on the stations

 ```python
-estimator = tf.keras.models.Sequential([
-    tf.keras.layers.Dense(256, activation='relu'),
-    tf.keras.layers.Dense(128, activation='relu'),
-    tf.keras.layers.Dense(64, activation='relu'),
-    tf.keras.layers.Dense(1)])
-estimator.compile(optimizer='adam', loss='mae')
-dict_imputers["MLP"] = imputer_mlp = imputers_keras.ImputerRegressorKeras(estimator=estimator, groups=['station'], handler_nan = "column")
+estimator = nn.Sequential(
+    nn.Linear(np.sum(df_data.isna().sum()==0), 256),
+    nn.ReLU(),
+    nn.Linear(256, 128),
+    nn.ReLU(),
+    nn.Linear(128, 64),
+    nn.ReLU(),
+    nn.Linear(64, 1)
+)
+# imputers_pytorch.build_mlp_example(input_dim=np.sum(df_data.isna().sum()==0), list_num_neurons=[256,128,64])
+dict_imputers["MLP"] = imputer_mlp = imputers_pytorch.ImputerRegressorPyTorch(estimator=estimator, groups=['station'], handler_nan = "column", epochs=500)
 ```

 We can re-run the imputation model benchmark as before.
-```python jupyter={"outputs_hidden": true} tags=[]
+```python tags=[]
 generator_holes = missing_patterns.EmpiricalHoleGenerator(n_splits=2, groups=["station"], subset=cols_to_impute, ratio_masked=ratio_masked)

 comparison = comparator.Comparator(
@@ -395,7 +390,7 @@ comparison = comparator.Comparator(
 results = comparison.compare(df_data)
 results
 ```
-```python jupyter={"outputs_hidden": true, "source_hidden": true} tags=[]
+```python jupyter={"source_hidden": true} tags=[]
 df_plot = df_data
 dfs_imputed = {name: imp.fit_transform(df_plot) for name, imp in dict_imputers.items()}
 station = df_plot.index.get_level_values("station")[0]
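
Note on the MLP definition in this diff: the hand-written `nn.Sequential` and the commented-out `build_mlp_example(input_dim=..., list_num_neurons=[256,128,64])` call describe the same shape, a chain of linear layers with ReLU between them and a single output. The body of `build_mlp_example` is not shown on this page, so the following is only a sketch of what such a builder could look like; `mlp_layer_shapes` and `build_mlp_sketch` are hypothetical names.

```python
from typing import List, Tuple


def mlp_layer_shapes(
    input_dim: int, list_num_neurons: List[int], output_dim: int = 1
) -> List[Tuple[int, int]]:
    """Return (in_features, out_features) for each linear layer of the MLP."""
    dims = [input_dim, *list_num_neurons, output_dim]
    return list(zip(dims[:-1], dims[1:]))


def build_mlp_sketch(input_dim: int, list_num_neurons: List[int], output_dim: int = 1):
    """Assemble Linear layers with a ReLU after every layer except the last."""
    import torch.nn as nn  # imported lazily: only needed when the network is built

    shapes = mlp_layer_shapes(input_dim, list_num_neurons, output_dim)
    layers = []
    for position, (n_in, n_out) in enumerate(shapes):
        layers.append(nn.Linear(n_in, n_out))
        if position < len(shapes) - 1:
            layers.append(nn.ReLU())
    return nn.Sequential(*layers)


print(mlp_layer_shapes(10, [256, 128, 64]))
```

With 10 input features this prints `[(10, 256), (256, 128), (128, 64), (64, 1)]`, matching the four `nn.Linear` layers in the diff once `input_dim` is set to the number of fully observed columns.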

qolmat/imputations/imputers.py

Lines changed: 18 additions & 76 deletions
@@ -1407,76 +1407,26 @@ class ImputerRegressor(_Imputer):

     def __init__(
         self,
+        imputer_params: Tuple[str, ...] = ("handler_nan",),
         groups: Tuple[str, ...] = (),
         estimator: Optional[BaseEstimator] = None,
         handler_nan: str = "column",
         random_state: Union[None, int, np.random.RandomState] = None,
     ):
         super().__init__(
-            imputer_params=("handler_nan",),
+            imputer_params=imputer_params,
             groups=groups,
             random_state=random_state,
         )
         self.estimator = estimator
         self.handler_nan = handler_nan

-    def _get_params_fit(self) -> Dict:
-        """Get the parameters required for the fit, only used for neural networks.
+    def _fit_estimator(self, X, y) -> Self:
+        return self.estimator.fit(X, y)

-        Returns
-        -------
-        Dict
-            Dictionary of fit parameters.
-        """
-        return {}
-
-    def fit(self, X: pd.DataFrame, y: pd.DataFrame = None) -> _Imputer:
-        """Fit the imputer on X.
-
-        Parameters
-        ----------
-        X : pd.DataFrame
-            Data matrix on which the Imputer must be fitted.
-
-        Returns
-        -------
-        self : Self
-            Returns self.
-        """
-
-        super().fit(X)
-        df = self._check_input(X)
-
-        cols_with_nans = df.columns[df.isna().any()]
-        self.estimators_ = {}
-        for col in cols_with_nans:
-            # Define the Train and Test set
-            X_ = df.drop(columns=col, errors="ignore")
-            y_ = df[col]
-
-            # Selects only the valid values in the Train Set according to the chosen method
-            is_valid = pd.Series(True, index=df.index)
-            if self.handler_nan == "fit":
-                pass
-            elif self.handler_nan == "row":
-                is_valid = ~X_.isna().any(axis=1)
-            elif self.handler_nan == "column":
-                X_ = X_.dropna(how="any", axis=1)
-            else:
-                raise ValueError(
-                    f"Value '{self.handler_nan}' is not correct for argument `handler_nan'"
-                )
-
-            # Selects only non-NaN values for the Test Set
-            is_na = y_.isna()
-
-            # Train the model according to an ML or DL method and after predict the imputation
-            if not X_.empty:
-                hp = self._get_params_fit()
-                self.estimators_[col] = self.estimator
-                self.estimators_[col].fit(X_[(~is_na) & is_valid], y_[(~is_na) & is_valid], **hp)
-
-        return self
+    def _predict_estimator(self, X) -> pd.Series:
+        pred = self.estimator.predict(X)
+        return pd.Series(pred, index=X.index, dtype=float)

     def _transform_element(self, df: pd.DataFrame, col: str = "__all__") -> pd.DataFrame:
         """
@@ -1501,7 +1451,8 @@ def _transform_element(self, df: pd.DataFrame, col: str = "__all__") -> pd.DataF
             Input has to be a pandas.DataFrame.
         """
         self._check_dataframe(df)
-        df_imputed = df.apply(pd.DataFrame.median, result_type="broadcast", axis=0)
+        # df_imputed = df.apply(pd.DataFrame.median, result_type="broadcast", axis=0)
+        df_imputed = df.copy()
         cols_with_nans = df.columns[df.isna().any()]

         for col in cols_with_nans:
@@ -1526,24 +1477,15 @@ def _transform_element(self, df: pd.DataFrame, col: str = "__all__") -> pd.DataF
             is_na = y.isna()

             # Train the model according to an ML or DL method and after predict the imputation
-            if col not in self.estimators_:
-                y_imputed = pd.Series(y.mean(), index=y.index)
-            else:
-                X_select = X[is_na & is_valid]
-                y_imputed = self.estimators_[col].predict(X_select)
-                y_imputed = y_imputed.flatten().astype(float)
-
-                y_imputed = pd.Series(y_imputed, index=X_select.index)
-
-            # Adds the imputed values
-            # df_imputed.loc[~is_na, col] = y[~is_na]
-            # if isinstance(y_imputed, pd.Series):
-            # y_reshaped = y_imputed
-            # else:
-            # y_reshaped = y_imputed.flatten()
-            # df_imputed.loc[is_na & is_valid, col] = y_imputed.values[: sum(is_na & is_valid)]
-            df_imputed[col] = y_imputed.where(is_valid & is_na, y)
-
+            is_in_fit = (~is_na) & is_valid
+            is_in_pred = is_na & is_valid
+            if is_in_fit.any() and is_in_pred.any() and not X.empty:
+                self._fit_estimator(X[is_in_fit], y[is_in_fit])
+                X_pred = X[is_in_pred]
+                y_imputed = self._predict_estimator(X_pred)
+
+                df_imputed[col] = y_imputed.where(is_in_pred, y)
+        # df_imputed = df_imputed.fillna(df_imputed.median())
         return df_imputed
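
Note on the refactoring in this file: the imputation loop now reaches the estimator only through `_fit_estimator` and `_predict_estimator`, so a subclass can change how training is invoked without touching the loop. The sketch below illustrates that hook pattern with the standard library only. All class names are hypothetical, and forwarding an `epochs` argument from a subclass is an assumption suggested by the `epochs=500` argument in the benchmark diff, not something confirmed by the code on this page.

```python
from statistics import mean


class ConstantByGradientDescent:
    """Toy estimator: learns one constant by gradient descent on squared error."""

    def __init__(self, learning_rate=0.1):
        self.learning_rate = learning_rate
        self.constant_ = 0.0

    def fit(self, X, y, epochs=1):
        target = mean(y)
        for _ in range(epochs):
            # Gradient of mean((c - y)^2) with respect to c is 2 * (c - mean(y)).
            self.constant_ -= self.learning_rate * 2 * (self.constant_ - target)
        return self

    def predict(self, X):
        return [self.constant_ for _ in X]


class RegressorImputerSketch:
    """Base class: the imputation loop only talks to the two hooks."""

    def __init__(self, estimator):
        self.estimator = estimator

    def _fit_estimator(self, X, y):
        return self.estimator.fit(X, y)

    def _predict_estimator(self, X):
        return list(self.estimator.predict(X))

    def impute(self, X, y):
        is_in_pred = [value is None for value in y]
        is_in_fit = [not missing for missing in is_in_pred]
        if not (any(is_in_fit) and any(is_in_pred)):
            return list(y)  # nothing to learn from, or nothing to fill
        self._fit_estimator(
            [row for row, keep in zip(X, is_in_fit) if keep],
            [value for value in y if value is not None],
        )
        rows_to_predict = [row for row, missing in zip(X, is_in_pred) if missing]
        predictions = iter(self._predict_estimator(rows_to_predict))
        return [next(predictions) if missing else value
                for value, missing in zip(y, is_in_pred)]


class EpochImputerSketch(RegressorImputerSketch):
    """Subclass overrides only the fit hook to forward a training argument."""

    def __init__(self, estimator, epochs=100):
        super().__init__(estimator)
        self.epochs = epochs

    def _fit_estimator(self, X, y):
        return self.estimator.fit(X, y, epochs=self.epochs)


X = [[0.0], [1.0], [2.0], [3.0]]
y = [2.0, None, 4.0, None]
imputed = EpochImputerSketch(ConstantByGradientDescent(), epochs=100).impute(X, y)
print(imputed)
```

Observed entries are returned unchanged and only the `None` positions receive predictions, which after 100 epochs are close to the mean of the observed targets.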

qolmat/imputations/imputers_keras.py

Lines changed: 0 additions & 44 deletions
This file was deleted.
