|
17 | 17 | # (as in "independent and identically distributed random variables"). |
18 | 18 | # ``` |
19 | 19 | # |
20 | | -# This assumption is usually violated when dealing with time series. A sample |
21 | | -# depends on past information. |
| 20 | +# This assumption is usually violated in time series, where each sample can be |
| 21 | +# influenced by previous samples (both their feature and target values) in an |
| 22 | +# inherently ordered sequence. |
22 | 23 | # |
23 | | -# We will take an example to highlight such issues with non-i.i.d. data in the |
24 | | -# previous cross-validation strategies presented. We are going to load financial |
| 24 | +# In this notebook we demonstrate the issues that arise when applying the
| 25 | +# cross-validation strategies presented so far to non-i.i.d. data. For this
| 26 | +# purpose we load financial
25 | 27 | # quotations from some energy companies. |
26 | 28 |
|
27 | 29 | # %% |
|
68 | 70 | data, target, shuffle=True, random_state=0 |
69 | 71 | ) |
70 | 72 |
|
| 73 | +# Shuffling breaks the index order, but we still want it to be time-ordered |
| 74 | +data_train.sort_index(ascending=True, inplace=True) |
| 75 | +data_test.sort_index(ascending=True, inplace=True) |
| 76 | +target_train.sort_index(ascending=True, inplace=True) |
| 77 | +target_test.sort_index(ascending=True, inplace=True) |
| 78 | + |
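The effect of restoring the index order after a shuffled split can be illustrated on a small toy series (the `DatetimeIndex` and the explicit scrambling below are illustrative stand-ins for the shuffled quotation dates):

```python
import pandas as pd

# A small series indexed by (hypothetical) trading dates
dates = pd.date_range("2020-01-01", periods=6, freq="D")
series = pd.Series(range(6), index=dates)

# A shuffled split scrambles the chronological order of the index;
# here we scramble with an explicit permutation for reproducibility
shuffled = series.iloc[[3, 0, 5, 1, 4, 2]]
print(shuffled.index.is_monotonic_increasing)  # False

# sort_index restores the time ordering without changing the values
shuffled = shuffled.sort_index(ascending=True)
print(shuffled.index.is_monotonic_increasing)  # True
```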
71 | 79 | # %% [markdown] |
72 | 80 | # We will use a decision tree regressor that we expect to overfit and thus not |
73 | | -# generalize to unseen data. We will use a `ShuffleSplit` cross-validation to |
| 81 | +# generalize to unseen data. We use a `ShuffleSplit` cross-validation to |
74 | 82 | # check the generalization performance of our model. |
75 | 83 | # |
76 | 84 | # Let's first define our model |
|
89 | 97 | cv = ShuffleSplit(random_state=0) |
90 | 98 |
|
91 | 99 | # %% [markdown] |
92 | | -# Finally, we perform the evaluation. |
| 100 | +# We then perform the evaluation using the `ShuffleSplit` strategy. |
93 | 101 |
|
94 | 102 | # %% |
95 | 103 | from sklearn.model_selection import cross_val_score |
|
102 | 110 | # %% [markdown] |
103 | 111 | # Surprisingly, we get outstanding generalization performance. We will |
104 | 112 | # investigate and find the reason for such good results with a model that is |
105 | | -# expected to fail. We previously mentioned that `ShuffleSplit` is an iterative |
106 | | -# cross-validation scheme that shuffles data and split. We will simplify this |
| 113 | +# expected to fail. We previously mentioned that `ShuffleSplit` is a |
| 114 | +# cross-validation method that iteratively shuffles and splits the data. |
| 115 | +# |
| 116 | +# We can simplify the |
107 | 117 | # procedure with a single split and plot the prediction. We can use |
108 | 118 | # `train_test_split` for this purpose. |
109 | 119 |
|
|
123 | 133 | print(f"The R2 on this single split is: {test_score:.2f}") |
124 | 134 |
|
125 | 135 | # %% [markdown] |
126 | | -# Similarly, we obtain good results in terms of $R^2$. We will plot the |
| 136 | +# Similarly, we obtain good results in terms of $R^2$. We now plot the |
127 | 137 | # training, testing and prediction samples. |
128 | 138 |
|
129 | 139 | # %% |
|
136 | 146 | _ = plt.title("Model predictions using a ShuffleSplit strategy") |
137 | 147 |
|
138 | 148 | # %% [markdown] |
139 | | -# So in this context, it seems that the model predictions are following the |
140 | | -# testing. But we can also see that the testing samples are next to some |
141 | | -# training sample. And with these time-series, we see a relationship between a |
142 | | -# sample at the time `t` and a sample at `t+1`. In this case, we are violating |
143 | | -# the i.i.d. assumption. The insight to get is the following: a model can output |
144 | | -# of its training set at the time `t` for a testing sample at the time `t+1`. |
145 | | -# This prediction would be close to the true value even if our model did not |
146 | | -# learn anything, but just memorized the training dataset. |
| 149 | +# From the plot above, we can see that the training and testing samples are |
| 150 | +# alternating. This structure effectively evaluates the model’s ability to |
| 151 | +# interpolate between neighboring data points, rather than its true |
| 152 | +# generalization ability. As a result, the model’s predictions are close to the |
| 153 | +# actual values, even if it has not learned anything meaningful from the data. |
| 154 | +# This is a form of **data leakage**, where the model gains access to future |
| 155 | +# information (testing data) while training, leading to an over-optimistic |
| 156 | +# estimate of the generalization performance. |
147 | 157 | # |
148 | | -# An easy way to verify this hypothesis is to not shuffle the data when doing |
| 158 | +# An easy way to verify this is to not shuffle the data during |
149 | 159 | # the split. In this case, we will use the first 75% of the data to train and |
150 | | -# the remaining data to test. |
| 160 | +# the remaining data to test. This way we preserve the time order of the
| 161 | +# data, and ensure that we train on past data and evaluate on future data.
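The memorization effect described above can be reproduced on a synthetic sketch. The sinusoidal signal and the nearest-neighbor-in-time "memorizer" below are illustrative assumptions, not the notebook's regressor: the memorizer learns nothing, yet a shuffled split rewards it because every test sample sits next to a training sample.

```python
import numpy as np

rng = np.random.default_rng(0)

# A smooth time series: neighboring samples carry almost the same target
time = np.arange(200)
target = np.sin(time / 20) + rng.normal(scale=0.05, size=time.size)


def r2(y_true, y_pred):
    residual = ((y_true - y_pred) ** 2).sum()
    total = ((y_true - y_true.mean()) ** 2).sum()
    return 1 - residual / total


def memorizer_score(train_idx, test_idx):
    # "Model" that only memorizes: predict the target of the closest
    # training sample in time, learning nothing about the signal
    preds = [
        target[train_idx[np.abs(train_idx - t).argmin()]] for t in test_idx
    ]
    return r2(target[test_idx], np.asarray(preds))


# Shuffled split: test samples sit right next to training samples
perm = rng.permutation(time.size)
shuffled_score = memorizer_score(np.sort(perm[:150]), np.sort(perm[150:]))

# Ordered split: train on the first 75%, test on the future
ordered_score = memorizer_score(time[:150], time[150:])

print(f"shuffled split R2: {shuffled_score:.2f}")  # high
print(f"ordered split R2:  {ordered_score:.2f}")  # much lower, can be negative
```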
151 | 162 |
|
152 | 163 | # %% |
153 | 164 | data_train, data_test, target_train, target_test = train_test_split( |
|
212 | 223 | from sklearn.model_selection import TimeSeriesSplit |
213 | 224 |
|
214 | 225 | cv = TimeSeriesSplit(n_splits=groups.nunique()) |
215 | | -test_score = cross_val_score( |
216 | | - regressor, data, target, cv=cv, groups=groups, n_jobs=2 |
217 | | -) |
| 226 | +test_score = cross_val_score(regressor, data, target, cv=cv, n_jobs=2) |
218 | 227 | print(f"The mean R2 is: {test_score.mean():.2f} ± {test_score.std():.2f}") |
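For intuition, `TimeSeriesSplit` yields growing training windows that always precede the test window. A minimal illustration on a toy array (the 8-sample array and 3 splits are arbitrary choices for this sketch):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(8).reshape(-1, 1)  # 8 time-ordered samples

cv = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in cv.split(X):
    # Training indices always come before testing indices
    print(f"train={train_idx} test={test_idx}")
```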
219 | 228 |
|
220 | 229 | # %% [markdown] |
|