#########################
Imputation with Keras
#########################

In this section, we present several approaches to data imputation using deep learning methods:

- **Multilayer Perceptron (MLP)**

- **Autoencoder**

To do this, we use Keras with the TensorFlow package, starting with the following imports:

.. code-block:: python

    import warnings

    import numpy as np
    import pandas as pd
    import tensorflow as tf

    from matplotlib import pyplot as plt
    import matplotlib.ticker as plticker

    from sklearn.linear_model import LinearRegression

    from qolmat.benchmark import comparator, missing_patterns
    from qolmat.imputations import imputers, imputers_keras
    from qolmat.utils import data, utils, plot

    tab10 = plt.get_cmap("tab10")
    plt.rcParams.update({"font.size": 18})

*********************
MLP Model
*********************

For the MLP model, we work on a weather dataset containing missing values. We add MCAR missing values to the features "TEMP" and "PRES", as well as to the other features that already contain NaN values. The goal is to impute the missing values of "TEMP" and "PRES" with a deep learning method. We also add features accounting for the seasonality of the dataset, and a feature for the station name.

.. code-block:: python

    df = data.get_data("Beijing")
    cols_to_impute = ["TEMP", "PRES"]
    cols_with_nans = list(df.columns[df.isna().any()])
    df_data = data.add_datetime_features(df)
    df_data[cols_with_nans + cols_to_impute] = data.add_holes(
        pd.DataFrame(df_data[cols_with_nans + cols_to_impute]),
        ratio_masked=0.1,
        mean_size=120,
    )
    df_data.isna().sum()

There are two ways to handle missing data when training a deep learning model:

- **By row:** the rows containing at least one missing value are imputed with a median method, and the deep learning model is trained only on the rows without any missing value. In this case, one must be careful to keep enough complete rows;

- **By column:** the columns containing at least one missing value are dropped, and the deep learning model is trained only on the complete columns. In this case, at least one complete column must remain; otherwise a median method is applied.

.. image:: ../images/line_or_column.png

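The two strategies can be pictured with plain NumPy (an illustrative sketch only, independent of the imputer's internals): filtering complete rows versus complete columns of an array containing missing values.

.. code-block:: python

    import numpy as np

    X = np.array([[1.0, 2.0, np.nan],
                  [4.0, np.nan, 6.0],
                  [7.0, 8.0, 9.0]])

    # "By row": keep only the rows without any NaN for training
    rows_complete = X[~np.isnan(X).any(axis=1)]     # only [7., 8., 9.] remains

    # "By column": keep only the columns without any NaN for training
    cols_complete = X[:, ~np.isnan(X).any(axis=0)]  # only the first column remains
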
Our dataset has relatively few rows, so we impute "PRES" and "TEMP" with the column method.
We can visualize the missing data in the temperature and pressure series.

.. code-block:: python

    n_stations = len(df_data.groupby("station").size())
    n_cols = len(cols_to_impute)
    fig = plt.figure(figsize=(10 * n_stations, 3 * n_cols))
    for i_station, (station, df) in enumerate(df_data.groupby("station")):
        df_station = df_data.loc[station]
        for i_col, col in enumerate(cols_to_impute):
            fig.add_subplot(n_cols, n_stations, i_col * n_stations + i_station + 1)
            plt.plot(df_station[col], ".", label=station)
            plt.ylabel(col)
            plt.xticks(rotation=15)
            if i_col == 0:
                plt.title(station)
            if i_col != n_cols - 1:
                plt.xticks([], [])
    plt.show()

.. image:: ../images/data_holes.png

The documentation for building a Multilayer Perceptron (MLP) with Keras is detailed at https://www.tensorflow.org/guide/core/mlp_core

For this example, we use a simple MLP with two hidden layers.
We then train the model without grouping by station.

.. code-block:: python

    estimator_mlp = tf.keras.models.Sequential([
        tf.keras.layers.Dense(128, activation="sigmoid"),
        tf.keras.layers.Dense(32, activation="sigmoid"),
        tf.keras.layers.Dense(1),
    ])
    estimator_mlp.compile(optimizer="adam", loss="mae")
    imputer_mlp = imputers_keras.ImputerRegressorKeras(estimator=estimator_mlp, handler_nan="column")

Training and imputation are done using **.fit_transform**.

.. code-block:: python

    df_plot = df_data
    df_imputed = imputer_mlp.fit_transform(df_plot)

The figure below compares the original data (in blue) with the imputed values (in red).

.. image:: ../images/data_holes_impute.png

************************
Autoencoder Imputation
************************

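Autoencoder imputation follows the same scheme: a network is trained to reconstruct its input, and its reconstruction fills the missing entries, alternating reconstruction and re-imputation. As a minimal, self-contained sketch of this principle (not the Keras implementation), the example below uses a rank-1 SVD bottleneck, the linear analogue of an autoencoder with a single hidden unit; in practice, the SVD step would be replaced by fitting and applying a Keras autoencoder.

.. code-block:: python

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy data: two correlated features, with ~10% of entries masked.
    x0 = rng.normal(size=200)
    X_true = np.column_stack([x0, 0.8 * x0 + 0.2 * rng.normal(size=200)])
    mask = rng.random(X_true.shape) < 0.1
    X = X_true.copy()
    X[mask] = np.nan

    # Initialize the holes with the column means...
    X_imp = np.where(np.isnan(X), np.nanmean(X, axis=0), X)

    # ...then alternate: fit a low-rank reconstruction and overwrite
    # only the missing entries with the reconstructed values.
    for _ in range(10):
        mu = X_imp.mean(axis=0)
        U, s, Vt = np.linalg.svd(X_imp - mu, full_matrices=False)
        X_rec = mu + (U[:, :1] * s[:1]) @ Vt[:1]
        X_imp[mask] = X_rec[mask]

The observed entries are left untouched at every iteration; only the masked positions are overwritten.
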
***************
Benchmark
***************

We propose a benchmark to compare different imputation models.

.. code-block:: python

    imputer_mice_ols = imputers.ImputerMICE(
        groups=["station"],
        estimator=LinearRegression(),
        sample_posterior=False,
        max_iter=100,
        missing_values=np.nan,
    )
    imputer_ols = imputers.ImputerRegressor(groups=["station"], estimator=LinearRegression())

    dict_imputers = {
        "OLS": imputer_ols,
        "MICE_ols": imputer_mice_ols,
        "MLP": imputer_mlp,
    }
    n_imputers = len(dict_imputers)
    ratio_masked = 0.1
    generator_holes = missing_patterns.EmpiricalHoleGenerator(
        n_splits=2,
        groups=["station"],
        subset=cols_to_impute,
        ratio_masked=ratio_masked,
    )

    comparison = comparator.Comparator(
        dict_imputers,
        df_data.columns,
        generator_holes=generator_holes,
        n_calls_opt=5,
    )
    results = comparison.compare(df_data)
    results

It is possible to change the value of **ratio_masked**, which sets the proportion of values masked when comparing the imputation methods.
In **results**, you can find the different metrics for each imputation method.
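
The masking step can be pictured with plain NumPy (a sketch of the principle only; the actual hole generator draws holes per station and follows empirical hole-size patterns): a ratio of 0.1 hides about 10% of the observed values, which then serve as ground truth when scoring each imputer.

.. code-block:: python

    import numpy as np

    rng = np.random.default_rng(42)
    values = rng.normal(size=1000)

    ratio_masked = 0.1
    mask = rng.random(values.shape) < ratio_masked

    # The held-out entries become the ground truth used to score imputers.
    values_holed = values.copy()
    values_holed[mask] = np.nan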

We can display the results of the different predictions:

.. image:: ../images/imputer_keras_graph1.png

.. image:: ../images/imputer_keras_graph2.png