Commit 7403db6

feat: merge with dev and fix conflicts
2 parents 07d0538 + cf46338 commit 7403db6

35 files changed: +804 -75 lines changed

.github/workflows/test.yml

Lines changed: 3 additions & 2 deletions
Original file line number | Diff line number | Diff line change
@@ -20,6 +20,7 @@ jobs:
2020
uses: conda-incubator/setup-miniconda@v2
2121
with:
2222
python-version: ${{matrix.python-version}}
23+
environment-file: environment.ci.yml
2324
channels: default, conda-forge
2425
- name: Lint with flake8
2526
run: |
@@ -28,9 +29,9 @@ jobs:
2829
- name: Test with pytest
2930
run: |
3031
conda install pytest
31-
#pytest
32+
pytest
3233
echo you should uncomment pytest and delete this line
3334
- name: typing with mypy
3435
run: |
35-
#mypy qolmat
36+
mypy qolmat
3637
echo you should uncomment mypy qolmat and delete this line

.pre-commit-config.yaml

Lines changed: 0 additions & 1 deletion
@@ -12,7 +12,6 @@ repos:
1212
rev: 22.8.0
1313
hooks:
1414
- id: black
15-
# exclude: (tests/)
1615
args:
1716
- "-l 99"
1817
# Flake8

LICENSE

Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
1+
Copyright (c) 2023, Hong-Lan BOTTERMAN, Julien ROUSSEL, Thomas MORZADEC, Rima HAJOU, Firas DAKHLI and qolmat contributors.
2+
All rights reserved.
3+
4+
Redistribution and use in source and binary forms, with or without
5+
modification, are permitted provided that the following conditions are met:
6+
7+
1. Redistributions of source code must retain the above copyright notice, this
8+
list of conditions and the following disclaimer.
9+
10+
2. Redistributions in binary form must reproduce the above copyright notice,
11+
this list of conditions and the following disclaimer in the documentation
12+
and/or other materials provided with the distribution.
13+
14+
3. Neither the name of the copyright holder nor the names of its
15+
contributors may be used to endorse or promote products derived from
16+
this software without specific prior written permission.
17+
18+
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
19+
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
20+
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
21+
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
22+
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
23+
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
24+
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
25+
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
26+
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
27+
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

README.rst

Lines changed: 14 additions & 5 deletions
@@ -1,18 +1,27 @@
11
.. -*- mode: rst -*-
22
3-
|ReadTheDocs|_ |PythonVersion|_ |PyPi|_ |Conda|_
3+
|GitHubActions|_ |ReadTheDocs|_ |License|_ |PythonVersion|_ |PyPi|_ |Release|_ |Commits|_
4+
5+
.. |GitHubActions| image:: https://github.com/Quantmetry/qolmat/actions/workflows/test.yml/badge.svg
6+
.. _GitHubActions: https://github.com/Quantmetry/qolmat/actions
47

58
.. |ReadTheDocs| image:: https://readthedocs.org/projects/qolmat/badge
69
.. _ReadTheDocs: https://qolmat.readthedocs.io/en/latest
710

11+
.. |License| image:: https://img.shields.io/github/license/Quantmetry/qolmat
12+
.. _License: https://github.com/Quantmetry/qolmat/blob/dev_MLP/LICENSE
13+
814
.. |PythonVersion| image:: https://img.shields.io/pypi/pyversions/qolmat
9-
.. _PythonVersion: https://pypi.org/project/mapie/
15+
.. _PythonVersion: https://pypi.org/project/qolmat/
1016

1117
.. |PyPi| image:: https://img.shields.io/pypi/v/qolmat
1218
.. _PyPi: https://pypi.org/project/qolmat/
1319

14-
.. |Conda| image:: https://img.shields.io/conda/vn/conda-forge/qolmat
15-
.. _Conda: https://anaconda.org/conda-forge/qolmat
20+
.. |Release| image:: https://img.shields.io/github/v/release/Quantmetry/qolmat
21+
.. _Release: https://github.com/Quantmetry/qolmat
22+
23+
.. |Commits| image:: https://img.shields.io/github/commits-since/Quantmetry/qolmat/latest/main
24+
.. _Commits: https://github.com/Quantmetry/qolmat/commits/master
1625

1726

1827
Welcome to Qolmat’s documentation!
@@ -62,7 +71,7 @@ Missing values can be generated following the MCAR mechanism.
6271
On each sample, different imputation models are tested and reconstruction errors are computed on these artificially missing entries. The errors of each imputation model are then averaged, which yields a single error score per model. This procedure allows different models to be compared on the same dataset.
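As a rough illustration of this procedure, the sketch below uses only NumPy and pandas rather than Qolmat's own classes. The toy dataset, the two imputation functions and the helper ``mcar_mask`` are hypothetical names introduced for the example. It hides entries completely at random, imputes them, and averages the absolute error over several samples to obtain one score per model.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy complete dataset (hypothetical; stands in for a real dataset).
df_true = pd.DataFrame({
    "a": np.linspace(0.0, 10.0, 200),
    "b": np.sin(np.linspace(0.0, 6.0, 200)),
})

# Two simple strategies to compare (illustrative, not Qolmat imputers).
imputers = {
    "mean": lambda df: df.fillna(df.mean()),
    "interpolate": lambda df: df.interpolate(limit_direction="both"),
}


def mcar_mask(shape, ratio, rng):
    """Hide each entry independently with probability `ratio` (MCAR)."""
    return rng.random(shape) < ratio


n_samples = 5
ratio_masked = 0.1
errors = {name: [] for name in imputers}

for _ in range(n_samples):
    mask = mcar_mask(df_true.shape, ratio_masked, rng)
    df_holes = df_true.mask(mask)  # replace the selected entries with NaN
    for name, impute in imputers.items():
        df_imputed = impute(df_holes)
        # Mean absolute error on the artificially hidden entries only.
        abs_err = (df_imputed - df_true).abs().to_numpy()[mask]
        errors[name].append(abs_err.mean())

# One averaged score per imputation model.
scores = pd.Series({name: np.mean(errs) for name, errs in errors.items()})
print(scores)
```

Because every model is scored on the same hidden entries, the averaged scores are directly comparable.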
6372

6473

65-
.. image:: images/comparator.png
74+
.. image:: docs/images/comparator.png
6675
:align: center
6776

6877

docs/examples/imputer_keras.rst

Lines changed: 148 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,148 @@
1+
#########################
2+
Imputation with Keras
3+
#########################
4+
5+
In this section, we present several approaches to data imputation using deep learning methods:
6+
7+
- **Multilayer perceptron (MLP)**
8+
9+
- **Autoencoder**
10+
11+
To do this, we use Keras through the TensorFlow package, with the following imports:
12+
13+
.. code-block:: python
14+
15+
import warnings
16+
import pandas as pd
17+
import numpy as np
18+
import tensorflow as tf
19+
20+
from matplotlib import pyplot as plt
21+
import matplotlib.ticker as plticker
22+
23+
tab10 = plt.get_cmap("tab10")
24+
plt.rcParams.update({'font.size': 18})
25+
26+
from sklearn.linear_model import LinearRegression
27+
28+
from qolmat.benchmark import comparator, missing_patterns
29+
from qolmat.imputations import imputers
30+
from qolmat.imputations import imputers_keras
31+
from qolmat.utils import data, utils, plot
32+
33+
*********************
34+
MLP Model
35+
*********************
36+
37+
For the MLP model, we work on a weather dataset with missing values. We add MCAR missing values to the features "TEMP" and "PRES" and to the other features that already contain NaN values. The goal is to impute the missing values of "TEMP" and "PRES" with a deep learning method. We also add features that capture the seasonality of the dataset, as well as a feature for the station name.
38+
39+
.. code-block:: python
40+
41+
df = data.get_data("Beijing")
42+
cols_to_impute = ["TEMP", "PRES"]
43+
cols_with_nans = list(df.columns[df.isna().any()])
44+
df_data = data.add_datetime_features(df)
45+
df_data[cols_with_nans + cols_to_impute] = data.add_holes(pd.DataFrame(df_data[cols_with_nans + cols_to_impute]), ratio_masked=.1, mean_size=120)
46+
df_data.isna().sum()
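One common way to expose seasonality to a model is to encode time as cyclic features. The sketch below is an assumption about what such features can look like, not a description of what ``data.add_datetime_features`` does internally; the column names ``hour_cos`` and ``hour_sin`` and the toy index are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy hourly index covering two days (hypothetical data).
index = pd.date_range("2013-03-01", periods=48, freq=pd.Timedelta(hours=1))
df_toy = pd.DataFrame({"TEMP": np.arange(48, dtype=float)}, index=index)

# Map the hour of the day onto the unit circle so that 23:00 and 00:00
# end up close to each other, unlike with the raw integer hour.
hour_angle = 2 * np.pi * df_toy.index.hour.to_numpy() / 24
df_toy["hour_cos"] = np.cos(hour_angle)
df_toy["hour_sin"] = np.sin(hour_angle)

print(df_toy.head())
```

The same pattern applies to longer cycles, for example the day of the year divided by 365.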
47+
48+
There are two ways to train a deep learning model on data from which the missing values have been removed:
49+
50+
- **By line:** We impute the lines (rows) containing at least one missing value with a median method, and we train the deep learning model only on the rows without any missing values. In this case, make sure that enough complete rows remain;
51+
52+
- **By column:** We remove the columns containing at least one missing value and train the deep learning model only on the columns without missing values. In this case, at least one complete column must remain; otherwise, a median method is applied.
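The two strategies above can be sketched with plain pandas. This is only an illustration of how the training data is selected, under the assumption of a single target column. The toy values and variable names are hypothetical, and this is not how ``ImputerRegressorKeras`` is implemented.

```python
import numpy as np
import pandas as pd

# Toy data with holes in the target and in some features (hypothetical values).
df_toy = pd.DataFrame({
    "TEMP": [1.0, np.nan, 3.0, 4.0],
    "PRES": [10.0, 11.0, np.nan, 13.0],
    "hour": [0, 1, 2, 3],
    "WSPM": [0.5, 0.7, np.nan, 0.9],
})
target = "TEMP"
features = df_toy.drop(columns=[target])

# By line: keep only the rows where the target and every feature are observed.
rows_complete = df_toy.notna().all(axis=1)
X_train_by_row = features[rows_complete]
y_train_by_row = df_toy.loc[rows_complete, target]

# By column: keep only the feature columns without any missing value,
# and use every row where the target is observed.
cols_complete = features.columns[features.notna().all()]
target_observed = df_toy[target].notna()
X_train_by_col = features.loc[target_observed, cols_complete]
y_train_by_col = df_toy.loc[target_observed, target]

print(X_train_by_row.shape, X_train_by_col.shape)
```

The row strategy keeps every feature but fewer samples, while the column strategy keeps more samples but fewer features.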
53+
54+
55+
.. image:: ../images/line_or_column.png
56+
57+
The dataset has few rows, so we impute "PRES" and "TEMP" with the column method.
58+
The following code plots the missing data for temperature and pressure.
59+
60+
.. code-block:: python
61+
62+
n_stations = len(df_data.groupby("station").size())
63+
n_cols = len(cols_to_impute)
64+
fig = plt.figure(figsize=(10 * n_stations, 3 * n_cols))
65+
for i_station, (station, df) in enumerate(df_data.groupby("station")):
66+
    df_station = df_data.loc[station]
67+
    for i_col, col in enumerate(cols_to_impute):
68+
        fig.add_subplot(n_cols, n_stations, i_col * n_stations + i_station + 1)
69+
        plt.plot(df_station[col], '.', label=station)
70+
        # break
71+
        plt.ylabel(col)
72+
        plt.xticks(rotation=15)
73+
        if i_col == 0:
74+
            plt.title(station)
75+
        if i_col != n_cols - 1:
76+
            plt.xticks([], [])
77+
plt.show()
78+
79+
80+
.. image:: ../images/data_holes.png
81+
82+
Documentation on building a multilayer perceptron (MLP) with Keras is available at: https://www.tensorflow.org/guide/core/mlp_core
83+
84+
For this example, we use a simple MLP with two hidden layers.
85+
We then train the model without grouping by station.
86+
87+
.. code-block:: python
88+
89+
estimator_mlp = tf.keras.models.Sequential([
90+
tf.keras.layers.Dense(128, activation='sigmoid'),
91+
tf.keras.layers.Dense(32, activation='sigmoid'),
92+
tf.keras.layers.Dense(1)])
93+
estimator_mlp.compile(optimizer='adam', loss='mae')
94+
imputer_mlp = imputers_keras.ImputerRegressorKeras(estimator=estimator_mlp, handler_nan = "column")
95+
96+
Training and imputation are done using **.fit_transform**.
97+
98+
.. code-block:: python
99+
100+
df_plot = df_data
101+
df_imputed = imputer_mlp.fit_transform(df_plot)
102+
103+
The figure below compares the real data (in blue) with the missing data that have been imputed (in red).
104+
105+
.. image:: ../images/data_holes_impute.png
106+
107+
************************
108+
Autoencoder Imputation
109+
************************
110+
111+
***************
112+
Benchmark
113+
***************
114+
115+
We benchmark several imputation models for comparison.
116+
117+
.. code-block:: python
118+
119+
imputer_mice_ols = imputers.ImputerMICE(groups=["station"], estimator=LinearRegression(), sample_posterior=False, max_iter=100, missing_values=np.nan)
120+
imputer_ols = imputers.ImputerRegressor(groups=["station"], estimator=LinearRegression())
121+
122+
dict_imputers = {
123+
"OLS": imputer_ols,
124+
"MICE_ols": imputer_mice_ols,
125+
"MLP": imputer_mlp,
126+
}
127+
n_imputers = len(dict_imputers)
128+
ratio_masked = 0.1
129+
generator_holes = missing_patterns.EmpiricalHoleGenerator(n_splits=2, groups = ["station"], subset = cols_to_impute, ratio_masked=ratio_masked)
130+
131+
comparison = comparator.Comparator(
132+
dict_imputers,
133+
df_data.columns,
134+
generator_holes = generator_holes,
135+
n_calls_opt=5,
136+
)
137+
results = comparison.compare(df_data)
138+
results
139+
140+
You can change the value of **ratio_masked** to choose the proportion of values that are masked when comparing the imputation methods.
141+
In **results**, you can find the different metrics for each imputation method.
142+
143+
We can display the results of the different predictions:
144+
145+
.. image::
146+
../images/imputer_keras_graph1.png
147+
.. image::
148+
../images/imputer_keras_graph2.png

docs/images/data_holes.png

647 KB

docs/images/data_holes_impute.png

606 KB
46.2 KB
48.7 KB

docs/images/line_or_column.png

180 KB
