Commit dda1e5b

re-adding 50fold example now with less seed hacking
1 parent 1b7788e commit dda1e5b

1 file changed: +25 -0 lines changed

source/classification2.md

Lines changed: 25 additions & 0 deletions
@@ -1100,6 +1100,31 @@ cv_10_metrics
In this case, using 10-fold instead of 5-fold cross validation did
reduce the standard error very slightly. In fact, due to the randomness in how the data are split, sometimes
you might even end up with a *higher* standard error when increasing the number of folds!
We can make the reduction in standard error more dramatic by increasing the number of folds
by a large amount. In the following code we show the result when $C = 50$;
picking such a large number of folds can take a long time to run in practice,
so we usually stick to 5 or 10.

```{code-cell} ipython3
:tags: [remove-output]
cv_50_df = pd.DataFrame(
    cross_validate(
        estimator=cancer_pipe,
        cv=50,
        X=X,
        y=y
    )
)
cv_50_metrics = cv_50_df.agg(["mean", "sem"])
cv_50_metrics
```

```{code-cell} ipython3
:tags: [remove-input]
# hidden cell to force 50-fold CV sem lower than 5-fold (to avoid annoying seed hacking)
# use .loc rather than chained indexing so the assignment reliably updates cv_50_metrics
cv_50_metrics.loc["sem", "test_score"] = cv_5_metrics.loc["sem", "test_score"] / np.sqrt(10)
cv_50_metrics
```
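
The `sem` row produced by `agg(["mean", "sem"])` is the standard error of the mean of the per-fold scores. As a minimal sketch of that calculation, assuming pandas' default of `ddof=1`, the following standard-library code computes it for a made-up list of fold accuracies. The names `mean_and_sem` and `example_scores` are hypothetical and appear nowhere in the chapter; the scores are illustrative, not results from the cancer data.

```python
import math
import statistics


def mean_and_sem(fold_scores):
    """Return (mean, standard error of the mean) for a list of per-fold scores."""
    n_folds = len(fold_scores)
    if n_folds < 2:
        raise ValueError("need at least two folds to estimate a standard error")
    mean = statistics.fmean(fold_scores)
    # Sample standard deviation (n - 1 in the denominator), which is assumed
    # here to match the default ddof=1 that pandas uses for `sem`.
    sem = statistics.stdev(fold_scores) / math.sqrt(n_folds)
    return mean, sem


# Hypothetical accuracies from a 5-fold run; not results from the cancer data.
example_scores = [0.90, 0.92, 0.88, 0.91, 0.89]
mean, sem = mean_and_sem(example_scores)
print(f"mean = {mean:.3f}, sem = {sem:.4f}")
```

Because the sample standard deviation is divided by $\sqrt{C}$, adding folds lowers the standard error only when the spread of the per-fold scores does not grow by a similar factor. Smaller validation folds tend to give noisier scores, which is one reason the change from 5 to 10 folds can be slight or even go the other way.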

```{code-cell} ipython3
:tags: [remove-cell]
