# Held-Out Evaluation and Cross-Validation

Held-out evaluation involves splitting a data set into two parts, a
training data set and a test data set. The training data is used to
estimate the model and the test data is used to evaluate the model's
predictive performance.

$$
\textrm{err} = \hat{\theta} - \theta.
$$

If the estimate is larger than the true value, the error is positive,
and if it's smaller, the error is negative. If an estimator is
unbiased, then its expected error is zero. So typically, absolute
error or squared error is used, both of which always have positive
expectation for an imperfect estimator. *Absolute error* is defined as
$$
\textrm{abs-err} = \left| \hat{\theta} - \theta \right|,
$$
and *squared error* as
$$
\textrm{sq-err} = \left( \hat{\theta} - \theta \right)^2.
$$
@GneitingRaftery:2007 provide a thorough overview of such scoring
rules and their properties.
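
As a toy numerical illustration (the numbers here are invented), the
three quantities for a single scalar estimate are computed as follows.

```python
# Hypothetical true value and point estimate, for illustration only.
theta = 0.50
theta_hat = 0.62

err = theta_hat - theta   # 0.12: positive, so the estimate is too high
abs_err = abs(err)        # 0.12
sq_err = err ** 2         # 0.0144
print(err, abs_err, sq_err)
```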

Bayesian posterior means minimize expected square error, whereas
posterior medians minimize expected absolute error. Estimates based

estimated coefficients,
& \approx &
\frac{1}{M} \sum_{m = 1}^M
  \left( \alpha^{(m)} + \beta^{(m)} \cdot \tilde{x}_n \right)
\\[4pt]
& = & \frac{1}{M} \sum_{m = 1}^M \alpha^{(m)}
  + \frac{1}{M} \sum_{m = 1}^M (\beta^{(m)} \cdot \tilde{x}_n)
\\[4pt]
& = & \frac{1}{M} \sum_{m = 1}^M \alpha^{(m)}
  + \left( \frac{1}{M} \sum_{m = 1}^M \beta^{(m)} \right) \cdot \tilde{x}_n
\\[4pt]

by partitioning the data and using each subset in turn as the test set
with the remaining subsets as training data. A partition into ten
subsets is common to reduce computational overhead. In the limit,
when the test set is just a single item, the result is known as
leave-one-out (LOO) cross-validation [@VehtariEtAl:2017].

Partitioning the data and reusing the partitions requires fiddly index
manipulation and may not lead to even divisions of the data. It's far
easier to use random partitions, which support arbitrarily sized
test/training splits and can be easily implemented in Stan. The
drawback is that the variance of the resulting estimate is higher than
it would be with a balanced block partition.

### Stan implementation with random folds

will be the error of the posterior mean estimate,
\\[4pt]
& = &
\mathbb{E}\left[
  \hat{y}^{\textrm{test}} - y^{\textrm{test}}
  \mid x^{\textrm{test}}, x^{\textrm{train}}, y^{\textrm{train}}
\right]
\\[4pt]
y^{\textrm{train}}).
$$
This just calculates error; taking the absolute value or squaring
will compute absolute error and mean square error. Note that the
absolute value and squaring operations should *not* be done within
the Stan program, because neither is a linear function and the
average of squares is not, in general, the same as the square of the
average.
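
For example, here is a minimal post-processing sketch in Python (the
array `err_draws` is a hypothetical stand-in for per-draw test errors
extracted from the fit) contrasting the two orders of operation.

```python
import numpy as np

# Hypothetical draws: M posterior draws of the error for N test items,
# standing in for a quantity computed in generated quantities.
rng = np.random.default_rng(0)
err_draws = rng.normal(loc=0.3, scale=1.0, size=(4000, 25))

# Correct: average the draws first to get the posterior mean error,
# then square (or take the absolute value) outside the Stan program.
err_hat = err_draws.mean(axis=0)
sq_err = err_hat ** 2

# Incorrect: squaring each draw before averaging estimates E[err^2],
# which also includes the posterior variance and is therefore larger.
avg_of_squares = (err_draws ** 2).mean(axis=0)

print(sq_err[:3])
print(avg_of_squares[:3])
```
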
Because the test set size is chosen for convenience in
cross-validation, results should be presented on a per-item scale,
such as average absolute error or root mean square error, not on the
scale of error in the fold being evaluated.
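
A short, self-contained sketch of those per-item summaries (the error
values here are invented):

```python
import numpy as np

# Hypothetical posterior-mean errors for the items in one fold.
err_hat = np.array([0.12, -0.40, 0.05, 0.31, -0.22])

mae = np.abs(err_hat).mean()            # average absolute error per item
rmse = np.sqrt((err_hat ** 2).mean())   # root mean square error per item
print(mae, rmse)
```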

### User-defined permutations

It is straightforward to declare the variable `permutation` in the
data block instead of the transformed data block and read it in as
data. This lets an external program control the blocking, so that
non-random partitions can be evaluated.
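
For instance, here is a minimal sketch of the external side, where
the data names `N`, `N_test`, and `permutation` are illustrative and
assume a Stan program that declares them in its data block.

```python
import numpy as np

N = 500       # total number of observations (illustrative)
N_test = 50   # size of the held-out fold (illustrative)

rng = np.random.default_rng(1234)
permutation = rng.permutation(N) + 1   # 1-based indices for Stan

# Any ordering can be supplied here in place of a random one, for
# example one that keeps items from the same group in the same fold.
data = {"N": N, "N_test": N_test, "permutation": permutation.tolist()}
# Pass `data` to whichever Stan interface is used to fit the model.
```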

### Cross-validation with structured data

Cross-validation must be done with care if the data is inherently
structured. For example, in a simple natural language application,
data might be structured by document. In that case, cross-validation
needs to be carried out at the document level, not at the individual
word level. This is related to [mixed replication in posterior
predictive checking](#mixed-replication), where there is a choice
between simulating new elements of existing groups and generating
entirely new groups.
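
As a sketch of what a document-level split might look like (the
arrays here are invented for illustration):

```python
import numpy as np

# Hypothetical document id for each word token in a corpus.
doc_id = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 3, 3])

# Assign each *document* to one of ten folds, then let every token
# inherit its document's fold, so no document straddles the split.
rng = np.random.default_rng(42)
docs = np.unique(doc_id)
fold_of_doc = dict(zip(docs, rng.integers(0, 10, size=docs.size)))
fold = np.array([fold_of_doc[d] for d in doc_id])

test_mask = fold == 0   # hold out fold 0; train on the remaining folds
print(test_mask)
```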

Education testing applications are typically grouped by school
district, by school, by classroom, and by demographic features of the
individual students or the school as a whole. Depending on the
variables of interest, different structured subsets should be
evaluated. For example, if the focus is on the performance of entire
classrooms, it makes sense to cross-validate at the classroom or
school level.

### Cross-validation with spatio-temporal data

Data measurements often have spatial or temporal structure. For
example, home energy consumption varies by time of day, day of week,
holiday, season, and ambient temperature (e.g., a hot spell or a cold
snap). Cross-validation must be tailored to the predictive goal. For
example, in predicting energy consumption, the quantity of interest
may be next week's consumption given historical data and current
weather covariates. This suggests an alternative to cross-validation,
wherein each week is tested in turn given the data from earlier
weeks. This also allows comparing how well prediction performs with
more or less historical data.
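
A minimal sketch of such a week-by-week evaluation follows; the week
index is fabricated, and in practice it would be derived from
timestamps.

```python
import numpy as np

# Hypothetical week index for each observation: ten weeks of hourly data.
week = np.repeat(np.arange(10), 24 * 7)

for test_week in range(1, 10):
    train_idx = np.where(week < test_week)[0]    # all earlier weeks
    test_idx = np.where(week == test_week)[0]    # the week to predict
    # Fit the model on train_idx, predict test_idx, and record per-item
    # error; later weeks are predicted from longer histories.
    print(test_week, train_idx.size, test_idx.size)
```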

### Approximate cross-validation

@VehtariEtAl:2017 introduce a method that approximates the evaluation
of leave-one-out cross-validation inexpensively, using only the data
point log likelihoods from a single model fit. This method is
documented and implemented in the R package loo [@GabryEtAl:2019].
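
A comparable PSIS-LOO estimate is available in Python through the
ArviZ package; as a rough sketch, assuming the draws, including
pointwise log likelihoods from a generated quantities block, have
been assembled into an `InferenceData` object:

```python
import arviz as az

# Example posterior shipped with ArviZ that already contains a
# log_likelihood group; a real analysis would load its own fit here.
idata = az.load_arviz_data("centered_eight")

# PSIS-LOO estimate of expected log pointwise predictive density.
loo_result = az.loo(idata, pointwise=True)
print(loo_result)
```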