Commit dc34a1d

Merge pull request #80 from Quantmetry/gsa_bench
Gsa bench
2 parents: 876abb8 + 858df17

File tree: 9 files changed, +91 additions, -51 deletions

AUTHORS.rst

Lines changed: 7 additions & 9 deletions
@@ -2,22 +2,20 @@
 Credits
 =======
 
-Development Lead
+Development Team
 ----------------
 
 * Julien Roussel <[email protected]>
-
-Maintainers
-------------
-
-* Mikail Duran <[email protected]>
 * Anh Khoa Ngo Ho <[email protected]>
+* Charles-Henri Prat <[email protected]>
 * Guillaume Saës <[email protected]>
 
-Contributors
-------------
+Past Contributors
+-----------------
 
 * Hong-Lan Botterman
+* Nicolas Brunel
 * Firas Dakhli
+* Mikaïl Duran
 * Rima Hajou
-* Vianey Taquet
+* Thomas Morzadec

HISTORY.rst

Lines changed: 2 additions & 1 deletion
@@ -12,7 +12,8 @@ History
 * Implementation of TabDDPM and TsDDPM, which are diffusion-based models for tabular data and time-series data, based on Denoising Diffusion Probabilistic Models. Their implementations follow the work of Tashiro et al., (2021) and Kotelnikov et al., (2023).
 * ImputerDiffusion is an imputer-wrapper of these two models TabDDPM and TsDDPM.
 * Docstrings and tests improved for the EM sampler
-* Online documentation reworked, with new tutorials on hole generators and a benchmark for time series imputation
+* Fix ImputerPytorch
+* Update Benchmark Deep Learning
 
 0.0.15 (2023-08-03)
 -------------------
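
As context for the changelog entries above: ImputerDiffusion wraps TabDDPM/TsDDPM behind the usual imputer interface. A minimal sketch, assuming the scikit-learn-style fit_transform API qolmat imputers expose and reusing the constructor call this commit adds to examples/benchmark.md (the toy DataFrame is illustrative, not from the commit):

```python
import numpy as np
import pandas as pd
from qolmat.imputations import imputers_pytorch
from qolmat.imputations.diffusions.ddpms import TabDDPM

# Illustrative numeric data with holes to fill.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0, 4.0], "b": [2.0, 2.5, np.nan, 4.5]})

# Same constructor call as in examples/benchmark.md below.
imputer = imputers_pytorch.ImputerDiffusion(model=TabDDPM(num_sampling=5), epochs=100, batch_size=100)
df_imputed = imputer.fit_transform(df)  # assumed sklearn-style imputer API
```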

README.rst

Lines changed: 8 additions & 0 deletions
@@ -171,6 +171,14 @@ The following table contains the available imputation methods. We distinguish si
      - Imputes missing values via EM algorithm
      - both
      - both
+   * - MLP
+     - Imputer based Multi-Layers Perceptron Model
+     - both
+     - both
+   * - Autoencoder
+     - Imputer based Autoencoder Model with Variationel method
+     - both
+     - both
    * - TabDDPM
      - Imputer based on Denoising Diffusion Probabilistic Models
      - both
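
The two rows added to the README table map to ImputerRegressorPyTorch (MLP) and ImputerAutoencoder. A minimal sketch of constructing them, assuming the build_mlp/build_autoencoder helpers renamed in this commit; the feature count and layer widths here are illustrative:

```python
from qolmat.imputations import imputers_pytorch

n_features = 4  # illustrative; the benchmark notebook derives this from the data

# MLP imputer: build_mlp stacks Linear+ReLU layers down to a single output unit.
estimator = imputers_pytorch.build_mlp(input_dim=n_features, list_num_neurons=[256, 128, 64])
imputer_mlp = imputers_pytorch.ImputerRegressorPyTorch(
    estimator=estimator, handler_nan="column", epochs=500
)

# Variational autoencoder imputer: mirrored encoder/decoder around a latent space.
encoder, decoder = imputers_pytorch.build_autoencoder(
    input_dim=n_features, latent_dim=4, output_dim=n_features, list_num_neurons=[16, 8]
)
imputer_autoencoder = imputers_pytorch.ImputerAutoencoder(encoder, decoder, max_iterations=100, epochs=100)
```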

examples/benchmark.md

Lines changed: 43 additions & 19 deletions
@@ -8,9 +8,9 @@ jupyter:
       format_version: '1.3'
     jupytext_version: 1.14.4
   kernelspec:
-    display_name: env_qolmat_dev
+    display_name: env_qolmat
     language: python
-    name: env_qolmat_dev
+    name: env_qolmat
 ---
 
 **This notebook aims to present the Qolmat repo through an example of a multivariate time series.
@@ -172,8 +172,8 @@ dict_imputers = {
     # "locf": imputer_locf,
     # "nocb": imputer_nocb,
     # "knn": imputer_knn,
-    # "ols": imputer_regressor,
-    # "mice_ols": imputer_mice,
+    "ols": imputer_regressor,
+    "mice_ols": imputer_mice,
 }
 n_imputers = len(dict_imputers)
 ```
@@ -295,13 +295,14 @@ plt.show()
 
 ```
 
-## (Optional) Neuronal Network Model
+## (Optional) Deep Learning Model
 
 
 In this section, we present an MLP model of data imputation using Keras, which can be installed using a "pip install tensorflow".
 
 ```python
 from qolmat.imputations import imputers_pytorch
+from qolmat.imputations.diffusions.ddpms import TabDDPM
 try:
     import torch.nn as nn
 except ModuleNotFoundError:
@@ -323,33 +324,56 @@ For the example, we use a simple MLP model with 3 layers of neurons.
 Then we train the model without taking a group on the stations
 
 ```python
-estimator = nn.Sequential(
-    nn.Linear(np.sum(df_data.isna().sum()==0), 256),
-    nn.ReLU(),
-    nn.Linear(256, 128),
-    nn.ReLU(),
-    nn.Linear(128, 64),
-    nn.ReLU(),
-    nn.Linear(64, 1)
-)
-# imputers_pytorch.build_mlp_example(input_dim=np.sum(df_data.isna().sum()==0), list_num_neurons=[256,128,64])
-dict_imputers["MLP"] = imputer_mlp = imputers_pytorch.ImputerRegressorPyTorch(estimator=estimator, groups=['station'], handler_nan = "column", epochs=500)
+fig = plt.figure(figsize=(10 * n_stations, 3 * n_cols))
+for i_station, (station, df) in enumerate(df_data.groupby("station")):
+    df_station = df_data.loc[station]
+    for i_col, col in enumerate(cols_to_impute):
+        fig.add_subplot(n_cols, n_stations, i_col * n_stations + i_station + 1)
+        plt.plot(df_station[col], '.', label=station)
+        # break
+        plt.ylabel(col)
+        plt.xticks(rotation=15)
+        if i_col == 0:
+            plt.title(station)
+        if i_col != n_cols - 1:
+            plt.xticks([], [])
+plt.show()
+```
+
+```python
+# estimator = nn.Sequential(
+#     nn.Linear(np.sum(df_data.isna().sum()==0), 256),
+#     nn.ReLU(),
+#     nn.Linear(256, 128),
+#     nn.ReLU(),
+#     nn.Linear(128, 64),
+#     nn.ReLU(),
+#     nn.Linear(64, 1)
+# )
+estimator = imputers_pytorch.build_mlp(input_dim=np.sum(df_data.isna().sum()==0), list_num_neurons=[256,128,64])
+encoder, decoder = imputers_pytorch.build_autoencoder(input_dim=df_data.values.shape[1],latent_dim=4, output_dim=df_data.values.shape[1], list_num_neurons=[4*4, 2*4])
+```
+
+```python
+dict_imputers["MLP"] = imputer_mlp = imputers_pytorch.ImputerRegressorPyTorch(estimator=estimator, groups=('station',), handler_nan = "column", epochs=500)
+dict_imputers["Autoencoder"] = imputer_autoencoder = imputers_pytorch.ImputerAutoencoder(encoder, decoder, max_iterations=100, epochs=100)
+dict_imputers["Diffusion"] = imputer_diffusion = imputers_pytorch.ImputerDiffusion(model=TabDDPM(num_sampling=5), epochs=100, batch_size=100)
 ```
 
 We can re-run the imputation model benchmark as before.
 ```python tags=[]
-generator_holes = missing_patterns.EmpiricalHoleGenerator(n_splits=2, groups=["station"], subset=cols_to_impute, ratio_masked=ratio_masked)
+generator_holes = missing_patterns.EmpiricalHoleGenerator(n_splits=3, groups=('station',), subset=cols_to_impute, ratio_masked=ratio_masked)
 
 comparison = comparator.Comparator(
     dict_imputers,
-    cols_to_impute,
+    selected_columns = df_data.columns,
     generator_holes = generator_holes,
     metrics=["mae", "wmape", "KL_columnwise", "ks_test"],
     max_evals=10,
     dict_config_opti=dict_config_opti,
 )
 results = comparison.compare(df_data)
-results
+results.style.highlight_min(color="green", axis=1)
 ```
 ```python tags=[]
 df_plot = df_data
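
The notebook now requests three hole-generation splits instead of two. A sketch of what the generator produces on its own, assuming the split() method qolmat hole generators expose; df_data, cols_to_impute and ratio_masked are the notebook's variables:

```python
from qolmat.benchmark import missing_patterns

# Same constructor as above: holes are drawn per station from the empirical
# pattern of the existing holes, masking ratio_masked of the observed values.
generator_holes = missing_patterns.EmpiricalHoleGenerator(
    n_splits=3, groups=("station",), subset=cols_to_impute, ratio_masked=ratio_masked
)

# Assumed API: split() yields one boolean mask per split, aligned with df_data.
for df_mask in generator_holes.split(df_data):
    print(df_mask.sum())  # extra holes drilled per column in this split
```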

qolmat/benchmark/metrics.py

Lines changed: 4 additions & 4 deletions
@@ -1022,8 +1022,8 @@ def distance_anticorr(df1: pd.DataFrame, df2: pd.DataFrame, df_mask: pd.DataFram
     float
         Distance correlation score
     """
-    df1 = df1[df_mask.any(axis=1)]
-    df2 = df2[df_mask.any(axis=1)]
+    df1 = df1.loc[df_mask.any(axis=1)]
+    df2 = df2.loc[df_mask.any(axis=1)]
     return (1 - dcor.distance_correlation(df1.values, df2.values)) / 2
 
 
@@ -1059,8 +1059,8 @@ def pattern_based_weighted_mean_metric(
     """
     scores = []
     weights = []
-    df1 = df1[df_mask.any(axis=1)]
-    df2 = df2[df_mask.any(axis=1)]
+    df1 = df1.loc[df_mask.any(axis=1)]
+    df2 = df2.loc[df_mask.any(axis=1)]
     df_nan = df1.notna()
     max_num_row = 0
     for tup_pattern, df_nan_pattern in df_nan.groupby(df_nan.columns.tolist()):
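
Both hunks swap df[mask] for df.loc[mask]. With a boolean Series the two select the same rows; .loc just states the row-selection intent explicitly. A self-contained illustration (not from the commit):

```python
import pandas as pd

df1 = pd.DataFrame({"x": [1.0, 2.0, 3.0]}, index=["a", "b", "c"])
df_mask = pd.DataFrame({"x": [True, False, True]}, index=df1.index)

keep = df_mask.any(axis=1)              # boolean Series over the row index
assert df1[keep].equals(df1.loc[keep])  # same rows either way; .loc is explicit
```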

qolmat/imputations/imputers.py

Lines changed: 1 addition & 1 deletion
@@ -1600,7 +1600,6 @@ def _transform_element(
         # df_imputed = df.apply(pd.DataFrame.median, result_type="broadcast", axis=0)
         df_imputed = df.copy()
         cols_with_nans = df.columns[df.isna().any()]
-
         for col in cols_with_nans:
             model = self._dict_fitting["__all__"][ngroup][col]
             if model is None:
@@ -1613,6 +1612,7 @@ def _transform_element(
                 X = X.loc[is_na]
 
             y_hat = self._predict_estimator(model, X)
+            y_hat.index = X.index
             df_imputed.loc[X.index, col] = y_hat
         return df_imputed
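
The added line y_hat.index = X.index matters because .loc assignment with a Series aligns on index labels: a prediction Series coming back with a fresh RangeIndex would silently write NaN. A self-contained illustration of that pandas behavior (the prediction values are stand-ins, not qolmat output):

```python
import numpy as np
import pandas as pd

df_imputed = pd.DataFrame({"col": [1.0, np.nan, np.nan]}, index=[10, 11, 12])
X_index = pd.Index([11, 12])    # rows to impute
y_hat = pd.Series([5.0, 6.0])   # stand-in prediction with RangeIndex(0, 2)

df_imputed.loc[X_index, "col"] = y_hat  # aligns on labels 11, 12 -> both stay NaN
print(df_imputed["col"].tolist())       # [1.0, nan, nan]

y_hat.index = X_index                   # the fix applied in this hunk
df_imputed.loc[X_index, "col"] = y_hat
print(df_imputed["col"].tolist())       # [1.0, 5.0, 6.0]
```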

qolmat/imputations/imputers_pytorch.py

Lines changed: 6 additions & 7 deletions
@@ -368,7 +368,6 @@ def _transform_element(
         )
         X = df_train_scaler.values
         mask = df.isna().values
-
         for _ in range(self.max_iterations):
             self.fit(X, X)
             Z = autoencoder.encode(X)
@@ -382,7 +381,7 @@ def _transform_element(
         return df_imputed
 
 
-def build_mlp_example(
+def build_mlp(
     input_dim: int,
     list_num_neurons: List[int],
     output_dim: int = 1,
@@ -414,7 +413,7 @@ def build_mlp_example(
 
     Examples
     --------
-    >>> model = build_mlp_example(input_dim=10, list_num_neurons=[32, 64, 128], output_dim=1)
+    >>> model = build_mlp(input_dim=10, list_num_neurons=[32, 64, 128], output_dim=1)
     >>> print(model)
     Sequential(
       (0): Linear(in_features=10, out_features=32, bias=True)
@@ -437,7 +436,7 @@ def build_mlp_example(
     return estimator
 
 
-def build_autoencoder_example(
+def build_autoencoder(
     input_dim: int,
     latent_dim: int,
     list_num_neurons: List[int],
@@ -472,7 +471,7 @@ def build_autoencoder_example(
 
     Examples
     --------
-    >>> encoder, decoder = build_autoencoder_example(
+    >>> encoder, decoder = build_autoencoder(
     ...     input_dim=10,
     ...     latent_dim=4,
     ...     list_num_neurons=[32, 64, 128],
@@ -500,13 +499,13 @@ def build_autoencoder_example(
     )
     """
 
-    encoder = build_mlp_example(
+    encoder = build_mlp(
         input_dim=input_dim,
         output_dim=latent_dim,
         list_num_neurons=np.sort(list_num_neurons)[::-1].tolist(),
         activation=activation,
     )
-    decoder = build_mlp_example(
+    decoder = build_mlp(
         input_dim=latent_dim,
         output_dim=output_dim,
         list_num_neurons=np.sort(list_num_neurons).tolist(),
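
Note how build_autoencoder derives both halves from a single width list: the encoder receives the hidden widths sorted descending, the decoder the same widths ascending, so the two MLPs mirror each other around the latent layer. A quick check of just that sorting logic:

```python
import numpy as np

list_num_neurons = [32, 64, 128]
print(np.sort(list_num_neurons)[::-1].tolist())  # encoder widths: [128, 64, 32]
print(np.sort(list_num_neurons).tolist())        # decoder widths: [32, 64, 128]
```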

setup.py

Lines changed: 15 additions & 6 deletions
@@ -13,22 +13,31 @@
 LICENSE = "new BSD"
 AUTHORS = """
 Hong-Lan Botterman,
-Julien Roussel,
-Thomas Morzadec,
-Rima Hajou,
 Firas Dakhli,
+Rima Hajou,
+Thomas Morzadec,
 Anh Khoa Ngo Ho,
 Charles-Henri Prat
+Julien Roussel,
+Guillaume Saës,
 """
 AUTHORS_EMAIL = """
 
-
-
-
 
+
+
 
 
+
+
 """
+MAINTAINER = "Julien ROUSSEL, Anh Khoa NGO HO, Charles-Henri PRAT, Guillaume SAËS"
+MAINTAINER_EMAIL = (
+
+
+
+
+)
 URL = "https://github.com/Quantmetry/qolmat"
 DOWNLOAD_URL = "https://pypi.org/project/qolmat/#files"
 PROJECT_URLS = {

tests/imputations/test_imputers_pytorch.py

Lines changed: 5 additions & 4 deletions
@@ -29,7 +29,7 @@ def test_ImputerRegressorPyTorch_fit_transform(df: pd.DataFrame) -> None:
     nn.manual_seed(42)
     if nn.cuda.is_available():
         nn.cuda.manual_seed(42)
-    estimator = imputers_pytorch.build_mlp_example(input_dim=2, list_num_neurons=[64, 32])
+    estimator = imputers_pytorch.build_mlp(input_dim=2, list_num_neurons=[64, 32])
     imputer = imputers_pytorch.ImputerRegressorPyTorch(
         estimator=estimator, handler_nan="column", epochs=10
     )
@@ -47,13 +47,14 @@ def test_ImputerRegressorPyTorch_fit_transform(df: pd.DataFrame) -> None:
 
     expected = pd.DataFrame(
         {
-            "col1": [2.031, 15, 19, 23, 33],
+            "col1": [2.031, 15, 2.132, 23, 33],
             "col2": [69, 76, 74, 80, 78],
-            "col3": [174, 166, 182, 177, 175.5],
+            "col3": [174, 166, 182, 177, 2.345],
             "col4": [9, 12, 11, 12, 8],
-            "col5": [93, 75, 75, 12, 75],
+            "col5": [93, 75, 2.132, 12, 2.345],
         }
     )
+    print(result["col5"])
     np.testing.assert_allclose(result, expected, atol=1e-3)
 