# Held-Out Evaluation and Cross-Validation

Held-out evaluation involves splitting a data set into two parts, a
training data set and a test data set. The training data is used to
estimate the model and the test data is used to evaluate the model's
predictive performance.

$$
\textrm{err} = \hat{\theta} - \theta.
$$

If the estimate is larger than the true value, the error is positive,
and if it's smaller, the error is negative. If an estimator is
unbiased, then its expected error is zero. So typically, absolute
error or squared error is used, both of which always have positive
expectation for an imperfect estimator. *Absolute error* is defined as
$$
\textrm{abs-err} = \left| \hat{\theta} - \theta \right|,
$$
and *squared error* as
$$
\textrm{sq-err} = \left( \hat{\theta} - \theta \right)^2.
$$
@GneitingRaftery:2007 provide a thorough overview of such scoring
rules and their properties.
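
As a toy numerical illustration (the numbers here are invented), the
three quantities for a single scalar estimate are computed as follows.

```python
# Hypothetical true value and point estimate, for illustration only.
theta = 0.50
theta_hat = 0.62

err = theta_hat - theta   # 0.12: positive, so the estimate is too high
abs_err = abs(err)        # 0.12
sq_err = err ** 2         # 0.0144
print(err, abs_err, sq_err)
```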

Bayesian posterior means minimize expected square error, whereas
posterior medians minimize expected absolute error. Estimates based

estimated coefficients,
& \approx &
\frac{1}{M} \sum_{m = 1}^M
  \left( \alpha^{(m)} + \beta^{(m)} \cdot \tilde{x}_n \right)
\\[4pt]
& = & \frac{1}{M} \sum_{m = 1}^M \alpha^{(m)}
  + \frac{1}{M} \sum_{m = 1}^M (\beta^{(m)} \cdot \tilde{x}_n)
\\[4pt]
& = & \frac{1}{M} \sum_{m = 1}^M \alpha^{(m)}
  + \left( \frac{1}{M} \sum_{m = 1}^M \beta^{(m)} \right) \cdot \tilde{x}_n
\\[4pt]

by partitioning the data and using each subset in turn as the test set
with the remaining subsets as training data. A partition into ten
subsets is common to reduce computational overhead. In the limit,
when the test set is just a single item, the result is known as
leave-one-out (LOO) cross-validation [@VehtariEtAl:2017].

Partitioning the data and reusing the partitions requires fiddly index
manipulation and may not lead to even divisions of the data. It's far
easier to use random partitions, which support arbitrarily sized
test/training splits and can be easily implemented in Stan. The
drawback is that the variance of the resulting estimate is higher than
it would be with a balanced block partition.

### Stan implementation with random folds

will be the error of the posterior mean estimate,
\\[4pt]
& = &
\mathbb{E}\left[
  \hat{y}^{\textrm{test}} - y^{\textrm{test}}
  \mid x^{\textrm{test}}, x^{\textrm{train}}, y^{\textrm{train}}
\right]
\\[4pt]
y^{\textrm{train}}).
$$
This just calculates error; taking the absolute value or squaring
will compute absolute error and mean square error. Note that the
absolute value and squaring operations should *not* be done within
the Stan program, because neither is a linear function and the
average of squares is not, in general, the same as the square of the
average.
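
For example, here is a minimal post-processing sketch in Python (the
array `err_draws` is a hypothetical stand-in for per-draw test errors
extracted from the fit) contrasting the two orders of operation.

```python
import numpy as np

# Hypothetical draws: M posterior draws of the error for N test items,
# standing in for a quantity computed in generated quantities.
rng = np.random.default_rng(0)
err_draws = rng.normal(loc=0.3, scale=1.0, size=(4000, 25))

# Correct: average the draws first to get the posterior mean error,
# then square (or take the absolute value) outside the Stan program.
err_hat = err_draws.mean(axis=0)
sq_err = err_hat ** 2

# Incorrect: squaring each draw before averaging estimates E[err^2],
# which also includes the posterior variance and is therefore larger.
avg_of_squares = (err_draws ** 2).mean(axis=0)

print(sq_err[:3])
print(avg_of_squares[:3])
```
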
Because the test set size is chosen for convenience in
cross-validation, results should be presented on a per-item scale,
such as average absolute error or root mean square error, not on the
scale of error in the fold being evaluated.
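
A short, self-contained sketch of those per-item summaries (the error
values here are invented):

```python
import numpy as np

# Hypothetical posterior-mean errors for the items in one fold.
err_hat = np.array([0.12, -0.40, 0.05, 0.31, -0.22])

mae = np.abs(err_hat).mean()            # average absolute error per item
rmse = np.sqrt((err_hat ** 2).mean())   # root mean square error per item
print(mae, rmse)
```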

### User-defined permutations

It is straightforward to declare the variable `permutation` in the
data block instead of the transformed data block and read it in as
data. This lets an external program control the blocking, so that
non-random partitions can be evaluated.
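
For instance, here is a minimal sketch of the external side, where
the data names `N`, `N_test`, and `permutation` are illustrative and
assume a Stan program that declares them in its data block.

```python
import numpy as np

N = 500       # total number of observations (illustrative)
N_test = 50   # size of the held-out fold (illustrative)

rng = np.random.default_rng(1234)
permutation = rng.permutation(N) + 1   # 1-based indices for Stan

# Any ordering can be supplied here in place of a random one, for
# example one that keeps items from the same group in the same fold.
data = {"N": N, "N_test": N_test, "permutation": permutation.tolist()}
# Pass `data` to whichever Stan interface is used to fit the model.
```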

### Cross-validation with structured data

Cross-validation must be done with care if the data is inherently
structured. For example, in a simple natural language application,
data might be structured by document. In that case, cross-validation
needs to be carried out at the document level, not at the individual
word level. This is related to [mixed replication in posterior
predictive checking](#mixed-replication), where there is a choice
between simulating new elements of existing groups and generating
entirely new groups.
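
As a sketch of what a document-level split might look like (the
arrays here are invented for illustration):

```python
import numpy as np

# Hypothetical document id for each word token in a corpus.
doc_id = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 3, 3])

# Assign each *document* to one of ten folds, then let every token
# inherit its document's fold, so no document straddles the split.
rng = np.random.default_rng(42)
docs = np.unique(doc_id)
fold_of_doc = dict(zip(docs, rng.integers(0, 10, size=docs.size)))
fold = np.array([fold_of_doc[d] for d in doc_id])

test_mask = fold == 0   # hold out fold 0; train on the remaining folds
print(test_mask)
```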

Education testing applications are typically grouped by school
district, by school, by classroom, and by demographic features of the
individual students or the school as a whole. Depending on the
variables of interest, different structured subsets should be
evaluated. For example, if the focus is on the performance of entire
classrooms, it makes sense to cross-validate at the classroom or
school level.

### Cross-validation with spatio-temporal data

Data measurements often have spatial or temporal structure. For
example, home energy consumption varies by time of day, day of week,
holiday, season, and ambient temperature (e.g., a hot spell or a cold
snap). Cross-validation must be tailored to the predictive goal. For
example, in predicting energy consumption, the quantity of interest
may be next week's consumption given historical data and current
weather covariates. This suggests an alternative to cross-validation,
wherein each week is tested in turn given the data from earlier
weeks. This also allows comparing how well prediction performs with
more or less historical data.
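
A minimal sketch of such a week-by-week evaluation follows; the week
index is fabricated, and in practice it would be derived from
timestamps.

```python
import numpy as np

# Hypothetical week index for each observation: ten weeks of hourly data.
week = np.repeat(np.arange(10), 24 * 7)

for test_week in range(1, 10):
    train_idx = np.where(week < test_week)[0]    # all earlier weeks
    test_idx = np.where(week == test_week)[0]    # the week to predict
    # Fit the model on train_idx, predict test_idx, and record per-item
    # error; later weeks are predicted from longer histories.
    print(test_week, train_idx.size, test_idx.size)
```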

### Approximate cross-validation

@VehtariEtAl:2017 introduce a method that approximates the evaluation
of leave-one-out cross-validation inexpensively, using only the data
point log likelihoods from a single model fit. This method is
documented and implemented in the R package loo [@GabryEtAl:2019].
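
A comparable PSIS-LOO estimate is available in Python through the
ArviZ package; as a rough sketch, assuming the draws, including
pointwise log likelihoods from a generated quantities block, have
been assembled into an `InferenceData` object:

```python
import arviz as az

# Example posterior shipped with ArviZ that already contains a
# log_likelihood group; a real analysis would load its own fit here.
idata = az.load_arviz_data("centered_eight")

# PSIS-LOO estimate of expected log pointwise predictive density.
loo_result = az.loo(idata, pointwise=True)
print(loo_result)
```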