
Commit 3a642df

Bob Carpenter committed: part 3 users guide revisions; fixes #133
1 parent: ce6308f

16 files changed: +412, -74 lines

src/bibtex/all.bib

Lines changed: 45 additions & 4 deletions
````diff
@@ -1,6 +1,48 @@
 % a do-nothing command that serves a purpose
 @preamble{ " \newcommand{\noop}[1]{} " }
 
+@article{GneitingRaftery:2007,
+  author={Gneiting, Tilmann and Raftery, Adrian E},
+  year={2007},
+  title={Strictly proper scoring rules, prediction, and estimation},
+  journal={Journal of the American Statistical Association},
+  volume={102},
+  number={477},
+  pages={359--378}
+}
+
+@article{BayarriBerger:2000,
+  Author = {Bayarri, MJ and Berger, James O},
+  Journal = {Journal of the American Statistical Association},
+  Number = {452},
+  Pages = {1127--1142},
+  Title = {P values for composite null models},
+  Volume = {95},
+  Year = {2000}}
+
+@article{GabryEtAl:2019,
+  author={Jonah Gabry and Aki Vehtari and M{\aa}ns Magnusson and
+          Yuling Yao and Andrew Gelman and Paul-Christian B\"urkner
+          and Ben Goodrich and Juho Piironen},
+  year={2019},
+  title={{l}oo: Efficient leave-one-out cross-validation and {WAIC}
+         for {B}ayesian models},
+  journal={The Comprehensive {R} Network},
+  volume={2},
+  number={2}
+}
+
+@article{VehtariEtAl:2017,
+  author = {Vehtari, Aki and Gelman, Andrew and Gabry, Jonah},
+  year = {2017},
+  title = {Practical {B}ayesian model evaluation using leave-one-out
+           cross-validation and {WAIC}},
+  journal = {Statistics and computing},
+  volume = {27},
+  number = {5},
+  pages = {1413--1432}
+}
+
 @article{Rao:1945,
   title = {Information and accuracy attainable in the estimation of statistical parameters},
   author = {Rao, C. Radhakrishna},
@@ -32,7 +74,8 @@ @article{Rubin:1984
 
 
 @article{GelmanEtAl:1996,
-  title = {Posterior predictive assessment of model fitness via realized discrepancies},
+  title = {Posterior predictive assessment of model fitness via
+           realized discrepancies},
   author = {Gelman, Andrew and Meng, Xiao-Li and Stern, Hal},
   journal = {Statistica Sinica},
   pages = {733--760},
@@ -775,9 +818,7 @@ @article{CookGelmanRubin:2006
   number = {3},
   pages = {675--692},
   year = {2006},
-  doi = {10.1198/106186006X136976},
-  URL = {http://amstat.tandfonline.com/doi/abs/10.1198/106186006X136976},
-  eprint = {http://amstat.tandfonline.com/doi/pdf/10.1198/106186006X136976}
+  doi = {10.1198/106186006X136976}
 }
 
 @article{Cormack:1964,
````

src/stan-users-guide/cross-validation.Rmd

Lines changed: 60 additions & 8 deletions
````diff
@@ -1,4 +1,4 @@
-# Held-Out and Cross-Validation
+# Held-Out Evaluation and Cross-Validation
 
 Held-out evaluation involves splitting a data set into two parts, a
 training data set and a test data set. The training data is used to
@@ -135,7 +135,7 @@ $$
 $$
 
 If the estimate is larger than the true value, the error is positive,
-and if oit's smaller, then error is negative. If an estimator's
+and if it's smaller, then error is negative. If an estimator's
 unbiased, then expected error is zero. So typically, absolute error
 or squared error are used, which will always have positive
 expectations for an imperfect estimator. *Absolute error* is defined as
@@ -146,6 +146,8 @@ and *squared error* as
 $$
 \textrm{sq-err} = \left( \hat{\theta} - \theta \right)^2.
 $$
+@GneitingRaftery:2007 provide a thorough overview of such scoring rules
+and their properties.
 
 Bayesian posterior means minimize expected square error, whereas
 posterior medians minimize expected absolute error. Estimates based
@@ -228,10 +230,10 @@ estimated coefficients,
 & \approx &
 \frac{1}{M} \sum_{m = 1}^M
 \left( \alpha^{(m)} + \beta^{(m)} \cdot \tilde{x}_n \right)
-\\[4pt]
+\\[4pt]
 & = & \frac{1}{M} \sum_{m = 1}^M \alpha^{(m)}
 + \frac{1}{M} \sum_{m = 1}^M (\beta^{(m)} \cdot \tilde{x}_n)
-\\[4pt]
+\\[4pt]
 & = & \frac{1}{M} \sum_{m = 1}^M \alpha^{(m)}
 + \left( \frac{1}{M} \sum_{m = 1}^M \beta^{(m)}\right) \cdot \tilde{x}_n
 \\[4pt]
@@ -262,12 +264,15 @@ by partitioning the data and using each subset in turn as the test set
 with the remaining subsets as training data. A partition into ten
 subsets is common to reduce computational overhead. In the limit,
 when the test set is just a single item, the result is known as
-leave-one-out (LOO) cross-validation.
+leave-one-out (LOO) cross-validation [@VehtariEtAl:2017].
 
 Partitioning the data and reusing the partitions is very fiddly in the
 indexes and may not lead to even divisions of the data. It's far
 easier to use random partitions, which support arbitrarily sized
-test/training splits and can be easily implemented in Stan.
+test/training splits and can be easily implemented in Stan. The
+drawback is that the variance of the resulting estimate is higher than
+with a balanced block partition.
+
 
 ### Stan implementation with random folds
 
@@ -365,7 +370,7 @@ will be the error of the posterior mean estimate,
 \\[4pt]
 & = &
 \mathbb{E}\left[
-\hat{y}^{\textrm{test}} - y^{\textrm{test}}
+\hat{y}^{\textrm{test}} - y^{\textrm{test}}
 \mid x^{\textrm{test}}, x^{\textrm{train}}, y^{\textrm{train}}
 \right]
 \\[4pt]
@@ -379,7 +384,7 @@ $$
 y^{\textrm{train}}).
 $$
 This just calculates error; taking absolute value or squaring will
-compute absolue error and mean square error. Note that the absolute
+compute absolute error and mean square error. Note that the absolute
 value and square operation should *not* be done within the Stan
 program because neither is a linear function and the result of
 averaging squares is not the same as squaring an average in general.
@@ -388,3 +393,50 @@ Because the test set size is chosen for convenience in
 cross-validation, results should be presented on a per-item scale,
 such as average absolute error or root mean square error, not on the
 scale of error in the fold being evaluated.
+
+### User-defined permutations
+
+It is straightforward to declare the variable `permutation` in the
+data block instead of the transformed data block and read it in as
+data. This allows an external program to control the blocking,
+allowing non-random partitions to be evaluated.
+
+
+### Cross-validation with structured data
+
+Cross-validation must be done with care if the data is inherently
+structured. For example, in a simple natural language application,
+data might be structured by document. For cross-validation, one needs
+to cross-validate at the document level, not at the individual word
+level. This is related to [mixed replication in posterior predictive
+checking](#mixed-replication), where there is a choice to simulate new
+elements of existing groups or generate entirely new groups.
+
+Education testing applications are typically grouped by school
+district, by school, by classroom, and by demographic features of the
+individual students or the school as a whole. Depending on the
+variables of interest, different structured subsets should be
+evaluated. For example, the focus of interest may be on the
+performance of entire classrooms, so it would make sense to
+cross-validate at the class or school level on classroom performance.
+
+
+### Cross-validation with spatio-temporal data
+
+Often data measurements have spatial or temporal properties. For
+example, home energy consumption varies by time of day, day of week,
+on holidays, by season, and by ambient temperature (e.g., a hot spell
+or a cold snap). Cross-validation must be tailored to the predictive
+goal. For example, in predicting energy consumption, the quantity of
+interest may be the prediction for next week's energy consumption
+given historical data and current weather covariates. This suggests
+an alternative to cross-validation, wherein individual weeks are each
+tested given previous data. This often allows comparing how well
+prediction performs with more or less historical data.
+
+### Approximate cross-validation
+
+@VehtariEtAl:2017 introduce a method that approximates the evaluation
+of leave-one-out cross validation inexpensively using only the data
+point log likelihoods from a single model fit. This method is
+documented and implemented in the R package loo [@GabryEtAl:2019].
````
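
As a concrete companion to the new "User-defined permutations" section, here is a minimal sketch of a single fold fit with the permutation moved into the data block. The surrounding regression model and the names `N`, `x`, `y`, and `N_test` are illustrative assumptions, not code from this commit.

```stan
data {
  int<lower=0> N;                        // total number of items
  vector[N] x;                           // predictors
  vector[N] y;                           // outcomes
  int<lower=0, upper=N> N_test;          // size of the held-out fold
  int<lower=1, upper=N> permutation[N];  // fold layout supplied as data
}
transformed data {
  int N_train = N - N_test;
  vector[N_train] x_train;
  vector[N_train] y_train;
  vector[N_test] x_test;
  for (n in 1:N_train) {
    x_train[n] = x[permutation[n]];      // first N_train entries train
    y_train[n] = y[permutation[n]];
  }
  for (n in 1:N_test) {
    x_test[n] = x[permutation[N_train + n]];  // remainder is the test fold
  }
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
model {
  y_train ~ normal(alpha + beta * x_train, sigma);
}
generated quantities {
  // posterior predictive draws for the held-out items
  vector[N_test] y_test
    = to_vector(normal_rng(alpha + beta * x_test, sigma));
}
```

An external program supplies `permutation` (any reordering of 1 to N), so folds may be random, balanced blocks, or structured by document, classroom, or week as discussed above. Per the warning in the chapter, absolute or squared error is computed outside of Stan from the draws of `y_test`.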

src/stan-users-guide/decision-analysis.Rmd

Lines changed: 1 addition & 1 deletion
````diff
@@ -10,7 +10,7 @@ based on decisions and compute the required expected utilities.
 
 ## Outline of decision analysis
 
-Following [@GelmanEtAl:2013], Bayesian decision analysis can be
+Following @GelmanEtAl:2013, Bayesian decision analysis can be
 factored into the following four steps.
 
 1. Define a set $X$ of possible outcomes and a set $D$ of possible
````

src/stan-users-guide/floating-point.Rmd

Lines changed: 47 additions & 17 deletions
````diff
@@ -134,7 +134,7 @@ decimal representation of a base and a signed integer decimal
 exponent. For example, `36.29e-3` represents the number $36.29 \times
 10^{-3}$, which is the same number as is represented by `0.03629`.
 
-## Arithmetic Precision
+## Arithmetic precision
 
 The choice of significand provides $\log_{10} 2^{53} \approx 15.95$
 decimal (base 10) digits of *arithmetic precision*. This is just the
@@ -273,65 +273,95 @@
 This is why all of Stan's probability functions operate on the log
 scale.
 
-### Adding on the log scale
+## Log sum of exponentials {#log-sum-of-exponentials}
 
-If we work on the log scale, multiplication is converted to addition,
+Working on the log scale, multiplication is converted to addition,
 $$
-\log (a \times b) = \log a + \log b.
+\log (a \cdot b) = \log a + \log b.
 $$
-Thus we can just start on the log scale and stay there through
-multiplication. But what about addition? If we have $\log a$ and
+Thus sequences of multiplication operations can remain on the log scale.
+But what about addition? Given $\log a$ and
 $\log b$, how do we get $\log (a + b)$? Working out the algebra,
 $$
 \log (a + b)
 =
-\log (\exp(\log a) + \exp(\log b))
+\log (\exp(\log a) + \exp(\log b)).
 $$
 
+### Log-sum-exp function
+
 The nested log of sum of exponentials is so common, it has its own
-name,
+name, "log-sum-exp",
 $$
-\texttt{log}\mathtt{\_}\texttt{sum}\mathtt{\_}\texttt{exp}(u, v)
+\textrm{log-sum-exp}(u, v)
 =
 \log (\exp(u) + \exp(v)).
 $$
 so that
 $$
 \log (a + b)
 =
-\texttt{log}\mathtt{\_}\texttt{sum}\mathtt{\_}\texttt{exp}(\log a, \log b).
+\textrm{log-sum-exp}(\log a, \log b).
 $$
 
 Although it appears this might overflow as soon as exponentiation is
 introduced, evaluation does not proceed by evaluating the terms as
 written. Instead, with a little algebra, the terms are rearranged
 into a stable form,
 $$
-\texttt{log}\mathtt{\_}\texttt{sum}\mathtt{\_}\texttt{exp}(u, v)
+\textrm{log-sum-exp}(u, v)
 =
 \max(u, v) + \log\big( \exp(u - \max(u, v)) + \exp(v - \max(u, v)) \big).
 $$
 
 Because the terms inside the exponentiations are $u - \max(u, v)$ and
-$v - \max(u, v)$, one will be zero, and the other will be negative.
+$v - \max(u, v)$, one will be zero and the other will be negative.
 Because the operation is symmetric, it may be assumed without loss of
 generality that $u \geq v$, so that
 $$
-\texttt{log}\mathtt{\_}\texttt{sum}\mathtt{\_}\texttt{exp}(u, v) = u + \log\big(1 + \exp(v - u)\big).
+\textrm{log-sum-exp}(u, v) = u + \log\big(1 + \exp(v - u)\big).
 $$
 
-The inner term may itself be evaluated using `log1p`, there is only
-limited gain because $\exp(v - u)$ is only near zero when $u$ is much
-larger than $v$, meaning the result is likely to round to $u$ anyway.
+Although the inner term may itself be evaluated using the built-in
+function `log1p`, there is only limited gain because $\exp(v - u)$ is
+only near zero when $u$ is much larger than $v$, meaning the final
+result is likely to round to $u$ anyway.
 
 To conclude, when evaluating $\log (a + b)$ given $\log a$ and $\log
 b$, and assuming $\log a > \log b$, return
 
 $$
 \log (a + b) =
-\log a + \texttt{log1p}\big(\exp(\log b - \log a)\big).
+\log a + \textrm{log1p}\big(\exp(\log b - \log a)\big).
 $$
 
+### Applying log-sum-exp to a sequence
+
+The log sum of exponentials function may be generalized to sequences
+in the obvious way, so that if $v = v_1, \ldots, v_N$, then
+\begin{eqnarray*}
+\textrm{log-sum-exp}(v)
+& = & \log \sum_{n = 1}^N \exp(v_n)
+\\[4pt]
+& = & \max(v) + \log \sum_{n = 1}^N \exp(v_n - \max(v)).
+\end{eqnarray*}
+The exponent cannot overflow because its argument is either zero or negative.
+This form makes it easy to calculate $\log (u_1 + \cdots + u_N)$ given
+only $\log u_n$.
+
+### Calculating means with log-sum-exp
+
+An immediate application is to computing the mean of a vector $u$ entirely
+on the log scale. That is, given $\log u$ and returning $\log \textrm{mean}(u)$.
+\begin{eqnarray*}
+\log \left( \frac{1}{N} \sum_{n = 1}^N u_n \right)
+& = & \log \frac{1}{N} + \log \sum_{n = 1}^N \exp(\log u_n)
+\\[4pt]
+& = & -\log N + \textrm{log-sum-exp}(\log u).
+\end{eqnarray*}
+where $\log u = (\log u_1, \ldots, \log u_N)$ is understood elementwise.
+
+
 ## Comparing floating-point numbers
 
 Because floating-point representations are inexact, it is rarely a
````
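
As a sketch of how the new identities look in code: Stan already provides a stable built-in `log_sum_exp`, so these user-defined versions (with assumed names `log_sum_exp_stable` and `log_mean_exp`) are purely illustrative.

```stan
functions {
  // stable log-sum-exp of a vector: shift by the max so that every
  // exponent is zero or negative and exp() cannot overflow
  real log_sum_exp_stable(vector v) {
    real m = max(v);
    return m + log(sum(exp(v - m)));
  }

  // log(mean(u)) computed entirely on the log scale from
  // log_u = (log u[1], ..., log u[N]):  -log(N) + log-sum-exp(log u)
  real log_mean_exp(vector log_u) {
    return -log(num_elements(log_u)) + log_sum_exp_stable(log_u);
  }
}
```

In practice, calling the built-in `log_sum_exp` is preferable; the sketch only makes the max-shifting rearrangement in the section explicit.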

src/stan-users-guide/gaussian-processes.Rmd

Lines changed: 1 addition & 1 deletion
````diff
@@ -889,7 +889,7 @@ decomposition. The data declaration is the same as for the latent variable
 example, but we've defined a function called `gp_pred_rng` which will
 generate a draw from the posterior predictive mean conditioned on observed data
 `y1`. The code uses a Cholesky decomposition in triangular solves in order
-to cut down on the the number of matrix-matrix multiplications when computing
+to cut down on the number of matrix-matrix multiplications when computing
 the conditional mean and the conditional covariance of $p(\tilde{y})$.
 
 ```
````
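
For context, this is roughly how such a predictive RNG uses triangular solves against the Cholesky factor instead of explicit inverses; the exponentiated-quadratic kernel and the argument names here are assumptions for illustration, not necessarily the guide's exact code.

```stan
functions {
  // sketch: draw from the GP posterior predictive at points x2 given
  // noisy observations y1 at points x1
  vector gp_pred_rng(real[] x2, vector y1, real[] x1,
                     real alpha, real rho, real sigma, real delta) {
    int N1 = rows(y1);
    int N2 = size(x2);
    vector[N2] f2;
    {
      matrix[N1, N1] K;
      matrix[N1, N1] L_K;
      vector[N1] K_div_y1;
      matrix[N1, N2] k_x1_x2;
      matrix[N1, N2] v_pred;
      vector[N2] f2_mu;
      matrix[N2, N2] cov_f2;
      K = cov_exp_quad(x1, alpha, rho);
      for (n in 1:N1)
        K[n, n] = K[n, n] + square(sigma);     // observation noise
      L_K = cholesky_decompose(K);
      // two triangular solves compute K^-1 * y1 without forming K^-1
      K_div_y1 = mdivide_left_tri_low(L_K, y1);
      K_div_y1 = mdivide_right_tri_low(K_div_y1', L_K)';
      k_x1_x2 = cov_exp_quad(x1, x2, alpha, rho);
      f2_mu = k_x1_x2' * K_div_y1;                  // conditional mean
      v_pred = mdivide_left_tri_low(L_K, k_x1_x2);
      cov_f2 = cov_exp_quad(x2, alpha, rho)
               - v_pred' * v_pred;                  // conditional covariance
      f2 = multi_normal_rng(f2_mu,
                            cov_f2 + diag_matrix(rep_vector(delta, N2)));
    }
    return f2;
  }
}
```

Once the one-time $O(N^3)$ Cholesky factorization is done, each triangular solve is quadratic, and the full predictive covariance is assembled without ever inverting `K`.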
Four binary files changed (contents not rendered): 2.37 KB, 3.09 KB, -4.53 KB, -2.91 KB.
Lines changed: 2 additions & 2 deletions
````diff
@@ -1,4 +1,4 @@
 # <i style="font-size: 110%; color:#990017;">Appendices</i> {- #appendices.part}
 
-These are the appendices for the book, gathered here as they
-are not part of the main exposition.
+These are the appendices for the book, including a program style guide
+and a guide to translating BUGS or JAGS programs.
````
