Commit cca9ecb

Rework cross_validation_time.py (#791)
1 parent 39b6bb5 commit cca9ecb

1 file changed: +31 −22 lines changed


python_scripts/cross_validation_time.py

Lines changed: 31 additions & 22 deletions
@@ -17,11 +17,13 @@
 # (as in "independent and identically distributed random variables").
 # ```
 #
-# This assumption is usually violated when dealing with time series. A sample
-# depends on past information.
+# This assumption is usually violated in time series, where each sample can be
+# influenced by previous samples (both their feature and target values) in an
+# inherently ordered sequence.
 #
-# We will take an example to highlight such issues with non-i.i.d. data in the
-# previous cross-validation strategies presented. We are going to load financial
+# In this notebook we demonstrate the issues that arise when using the
+# cross-validation strategies we have presented so far, along with non-i.i.d.
+# data. For such purpose we load financial
 # quotations from some energy companies.

 # %%
@@ -68,9 +70,15 @@
     data, target, shuffle=True, random_state=0
 )

+# Shuffling breaks the index order, but we still want it to be time-ordered
+data_train.sort_index(ascending=True, inplace=True)
+data_test.sort_index(ascending=True, inplace=True)
+target_train.sort_index(ascending=True, inplace=True)
+target_test.sort_index(ascending=True, inplace=True)
+
 # %% [markdown]
 # We will use a decision tree regressor that we expect to overfit and thus not
-# generalize to unseen data. We will use a `ShuffleSplit` cross-validation to
+# generalize to unseen data. We use a `ShuffleSplit` cross-validation to
 # check the generalization performance of our model.
 #
 # Let's first define our model
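To see why the added `sort_index` calls matter, here is a minimal sketch of the pattern in this hunk, using a small synthetic time-indexed frame as a stand-in for the notebook's quotations data (an assumption for illustration only):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notebook's time-indexed quotations data
index = pd.date_range("2020-01-01", periods=10, freq="D")
data = pd.DataFrame({"feature": range(10)}, index=index)
target = pd.Series(range(10), index=index, name="target")

# shuffle=True draws rows at random, so each subset's DatetimeIndex
# comes out in shuffled (non-chronological) order
data_train, data_test, target_train, target_test = train_test_split(
    data, target, shuffle=True, random_state=0
)

# Sorting by index restores the time order within each subset
data_train.sort_index(ascending=True, inplace=True)
data_test.sort_index(ascending=True, inplace=True)
target_train.sort_index(ascending=True, inplace=True)
target_test.sort_index(ascending=True, inplace=True)

assert data_train.index.is_monotonic_increasing
assert data_test.index.is_monotonic_increasing
```

Note that sorting only restores the order for display and plotting; the train/test membership chosen by the shuffled split is unchanged.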
@@ -89,7 +97,7 @@
 cv = ShuffleSplit(random_state=0)

 # %% [markdown]
-# Finally, we perform the evaluation.
+# We then perform the evaluation using the `ShuffleSplit` strategy.

 # %%
 from sklearn.model_selection import cross_val_score
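The evaluation this hunk refers to can be sketched end to end; the `make_regression` data and the default decision tree below are stand-ins, not the notebook's actual dataset or tuned model:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the quotations
data, target = make_regression(n_samples=200, n_features=3, random_state=0)

regressor = DecisionTreeRegressor()
# ShuffleSplit defaults: 10 splits, each with a 10% shuffled test set
cv = ShuffleSplit(random_state=0)
test_score = cross_val_score(regressor, data, target, cv=cv, n_jobs=2)
print(f"The mean R2 is: {test_score.mean():.2f} ± {test_score.std():.2f}")
```

Because each split shuffles before splitting, test samples end up interleaved with training samples, which is exactly what the notebook goes on to diagnose.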
@@ -102,8 +110,10 @@
 # %% [markdown]
 # Surprisingly, we get outstanding generalization performance. We will
 # investigate and find the reason for such good results with a model that is
-# expected to fail. We previously mentioned that `ShuffleSplit` is an iterative
-# cross-validation scheme that shuffles data and split. We will simplify this
+# expected to fail. We previously mentioned that `ShuffleSplit` is a
+# cross-validation method that iteratively shuffles and splits the data.
+#
+# We can simplify the
 # procedure with a single split and plot the prediction. We can use
 # `train_test_split` for this purpose.

@@ -123,7 +133,7 @@
 print(f"The R2 on this single split is: {test_score:.2f}")

 # %% [markdown]
-# Similarly, we obtain good results in terms of $R^2$. We will plot the
+# Similarly, we obtain good results in terms of $R^2$. We now plot the
 # training, testing and prediction samples.

 # %%
@@ -136,18 +146,19 @@
 _ = plt.title("Model predictions using a ShuffleSplit strategy")

 # %% [markdown]
-# So in this context, it seems that the model predictions are following the
-# testing. But we can also see that the testing samples are next to some
-# training sample. And with these time-series, we see a relationship between a
-# sample at the time `t` and a sample at `t+1`. In this case, we are violating
-# the i.i.d. assumption. The insight to get is the following: a model can output
-# of its training set at the time `t` for a testing sample at the time `t+1`.
-# This prediction would be close to the true value even if our model did not
-# learn anything, but just memorized the training dataset.
+# From the plot above, we can see that the training and testing samples are
+# alternating. This structure effectively evaluates the model's ability to
+# interpolate between neighboring data points, rather than its true
+# generalization ability. As a result, the model's predictions are close to the
+# actual values, even if it has not learned anything meaningful from the data.
+# This is a form of **data leakage**, where the model gains access to future
+# information (testing data) while training, leading to an over-optimistic
+# estimate of the generalization performance.
 #
-# An easy way to verify this hypothesis is to not shuffle the data when doing
+# An easy way to verify this is to not shuffle the data during
 # the split. In this case, we will use the first 75% of the data to train and
-# the remaining data to test.
+# the remaining data to test. This way we preserve the time order of the data, and
+# ensure training on past data and evaluating on future data.

 # %%
 data_train, data_test, target_train, target_test = train_test_split(
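The unshuffled split described in the rewritten comment can be sketched as follows, again assuming synthetic time-indexed data in place of the quotations:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

index = pd.date_range("2020-01-01", periods=100, freq="D")
data = pd.DataFrame({"feature": range(100)}, index=index)
target = pd.Series(range(100), index=index, name="target")

# shuffle=False keeps the chronological order:
# the first 75% of samples train, the last 25% test
data_train, data_test, target_train, target_test = train_test_split(
    data, target, shuffle=False, train_size=0.75
)

# Every training timestamp precedes every testing timestamp,
# so the model is trained on the past and evaluated on the future
assert data_train.index.max() < data_test.index.min()
```

With this split there is no interleaving of train and test samples, so the evaluation no longer rewards mere interpolation between neighbors.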
@@ -212,9 +223,7 @@
 from sklearn.model_selection import TimeSeriesSplit

 cv = TimeSeriesSplit(n_splits=groups.nunique())
-test_score = cross_val_score(
-    regressor, data, target, cv=cv, groups=groups, n_jobs=2
-)
+test_score = cross_val_score(regressor, data, target, cv=cv, n_jobs=2)
 print(f"The mean R2 is: {test_score.mean():.2f} ± {test_score.std():.2f}")

 # %% [markdown]
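Dropping `groups=groups` here is consistent with scikit-learn's API: `TimeSeriesSplit.split` accepts a `groups` argument only for interface compatibility and ignores it. A small sketch of the splitter's ordering guarantee, on synthetic data assumed for illustration:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 20 time-ordered samples; TimeSeriesSplit produces growing training
# windows followed by the next contiguous block as the test set
X = np.arange(20).reshape(-1, 1)
cv = TimeSeriesSplit(n_splits=4)

for train_idx, test_idx in cv.split(X):
    # Training indices always precede testing indices: the model is
    # fit on the past and evaluated on the future, with no leakage
    assert train_idx.max() < test_idx.min()
    print(train_idx, test_idx)
```

Each successive fold extends the training window forward in time, which is the behavior the notebook relies on for a leakage-free evaluation.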
