stan-dev
diff --git a/src/bibtex/all.bib b/src/bibtex/all.bib
Lines changed: 1 addition & 10 deletions
diff --git a/src/reference-manual/execution.Rmd b/src/reference-manual/execution.Rmd
Lines changed: 1 addition & 1 deletion
diff --git a/src/reference-manual/mcmc.Rmd b/src/reference-manual/mcmc.Rmd
Lines changed: 2 additions & 2 deletions
diff --git a/src/stan-users-guide/algebraic-equations.Rmd b/src/stan-users-guide/algebraic-equations.Rmd
Lines changed: 2 additions & 2 deletions
diff --git a/src/stan-users-guide/clustering.Rmd b/src/stan-users-guide/clustering.Rmd
Lines changed: 1 addition & 1 deletion
diff --git a/src/stan-users-guide/finite-mixtures.Rmd b/src/stan-users-guide/finite-mixtures.Rmd
Lines changed: 2 additions & 2 deletions
diff --git a/src/stan-users-guide/gaussian-processes.Rmd b/src/stan-users-guide/gaussian-processes.Rmd
Lines changed: 8 additions & 8 deletions
diff --git a/src/stan-users-guide/hyperspherical-models.Rmd b/src/stan-users-guide/hyperspherical-models.Rmd
Lines changed: 4 additions & 15 deletions
diff --git a/src/stan-users-guide/latent-discrete.Rmd b/src/stan-users-guide/latent-discrete.Rmd
Lines changed: 16 additions & 18 deletions
diff --git a/src/stan-users-guide/measurement-error.Rmd b/src/stan-users-guide/measurement-error.Rmd
Lines changed: 6 additions & 6 deletions
@@ -1082,15 +1082,6 @@ @article{HoerlKennard:1970
   pages = {55--67}
 }
 
-@article{Hoffman-Gelman:2011,
-  Author = {Hoffman, Matthew D. and Gelman, Andrew},
-  Title = {The No-{U}-Turn Sampler: Adaptively Setting Path Lengths in {H}amiltonian {M}onte {C}arlo},
-  Journal = {arXiv},
-  Volume = {1111.4246},
-  url = {http://arxiv.org/abs/1111.4246},
-  Year = {2011}
-}
-
 @article{Hoffman-Gelman:2014,
   Title = {{T}he {N}o-{U}-{T}urn {S}ampler: {A}daptively {S}etting {P}ath {L}engths in {H}amiltonian {M}onte {C}arlo},
   Author = {Hoffman, Matthew D. and Gelman, Andrew},
@@ -1413,7 +1404,7 @@ @phdthesis{Schofield:2007
   author = {Schofield, Matthew R.},
   year = {2007},
   title = {Hierarchical Capture-Recapture Models},
-  school = {Department of of Statistics, University of Otago, Dunedin}
+  school = {Department of Statistics, University of Otago, Dunedin}
 }
 
 @article{SmithSpiegelhalterThomas:1995,
 
@@ -171,7 +171,7 @@ If the user specifies the number of leapfrog steps (i.e., chooses to
 use standard HMC), that number of leapfrog steps are simulated.  If
 the user has not specified the number of leapfrog steps, the No-U-Turn
 sampler (NUTS) will determine the number of leapfrog steps adaptively
-[@Hoffman-Gelman:2011], [@Hoffman-Gelman:2014].
+[@Hoffman-Gelman:2014].
 
 
 ### Log Probability and Gradient Calculation {-}
 
@@ -326,7 +326,7 @@ using the notation of @Hoffman-Gelman:2014. In practice, the efficacy
 of the optimization is sensitive to the value of these parameters, but
 we do not recommend changing the defaults without experience with the
 dual-averaging algorithm. For more information, see the discussion of
-dual averaging in @Hoffman-Gelman:2011, Hoffman-Gelman:2014.
+dual averaging in @Hoffman-Gelman:2014.
 
 The full set of dual-averaging parameters are
 
@@ -474,7 +474,7 @@ e.g., @RobertsEtAl:1997) at each step and avoid the random-walk
 behavior that arises in random-walk Metropolis or Gibbs samplers when
 there is correlation in the posterior. For a precise definition of the
 NUTS algorithm and a proof of detailed balance, see
-@Hoffman-Gelman:2011, @Hoffman-Gelman:2014.
+@Hoffman-Gelman:2014.
 
 NUTS generates a proposal by starting at an initial position
 determined by the parameters drawn in the last iteration. It then
 
@@ -31,7 +31,7 @@ A system of algebraic equations is coded directly in Stan as a
 function with a strictly specified signature. For example, the
 nonlinear system given above can be coded using the
 following function in Stan (see the [user-defined functions
-section](#functions-programming) for more information on coding
+section](#functions-programming.chapter) for more information on coding
 user-defined functions).
 
 ```
@@ -136,7 +136,7 @@ do so, the current metropolis proposal gets rejected.
 
 ## Control Parameters for the Algebraic Solver {#algebra-control.section}
 
-The call to the algebraic solver shown above uses the default control settings. The solver
+The call to the algebraic solver shown previously uses the default control settings. The solver
 allows three additional parameters, all of which must be supplied if any of them is
 supplied.
 
 
@@ -422,7 +422,7 @@ parameters.
 ## Latent Dirichlet Allocation
 
 Latent Dirichlet allocation (LDA) is a mixed-membership multinomial
-clustering model @BleiNgJordan:2003 that generalized naive
+clustering model [@BleiNgJordan:2003] that generalizes naive
 Bayes.  Using the topic and document terminology common in discussions of
 LDA, each document is modeled as having a mixture of topics, with each
 word drawn from a topic based on the mixing proportions.
 
@@ -496,7 +496,7 @@ on the linear scale; it is defined to be equal to `log(exp(lp1) + exp(lp2))`, bu
 
 The code given above to compute the zero-inflated Poisson
 redundantly calculates all of the Bernoulli terms and also
-`poisson_lpmf(0 \mid lambda)` every time the first condition
+`poisson_lpmf(0 | lambda)` every time the first condition
 body executes.  The use of the redundant terms is conditioned on
 `y`, which is known when the data are read in.  This allows
 the transformed data block to be used to compute some more convenient
@@ -650,7 +650,7 @@ transformed data {
 }
 ```
 
-The model block can then be reduced to three statements.
+The model block is then reduced to three statements.
 
 ```
 model {
 
@@ -346,7 +346,7 @@ model {
 }
 ```
 
-The data block now declares a vector `y` of observed values `y[n]`
+The data block declares a vector `y` of observed values `y[n]`
 for inputs `x[n]`.  The transformed data block now only defines the mean
 vector to be zero.  The three hyperparameters are defined as parameters
 constrained to be non-negative.  The computation of the covariance matrix
@@ -366,7 +366,7 @@ noticeable, but for larger matrices ($N \gtrsim 100$) the Cholesky
 decomposition version will be faster.
 
 Hamiltonian Monte Carlo sampling is fast and effective for hyperparameter
-inference in this model @Neal:1997. If the posterior is
+inference in this model [@Neal:1997]. If the posterior is
 well-concentrated for the hyperparameters the Stan implementation will fit
 hyperparameters in models with a few hundred data points in seconds.
 
@@ -419,7 +419,7 @@ model {
 
 Two differences between the latent variable GP and the marginal likelihood GP
 are worth noting. The first is that we have augmented our parameter block with
-a new parameter vector of length $N$ called $`eta`$. This is used in the model
+a new parameter vector of length $N$ called `eta`. This is used in the model
 block to generate a multivariate normal vector called $f$, corresponding to the
 latent GP. We put a $\textsf{normal}(0,1)$ prior on `eta` like we did in the
 Cholesky-parameterized GP in the simulation section.  The second difference is
@@ -482,7 +482,7 @@ $$
 $$
 
 We can extend our latent variable GP Stan program to deal with classification
-problems. Below $a$ is the bias term, which can help account for imbalanced
+problems. Below `a` is the bias term, which can help account for imbalanced
 classes in the training data:
 
 
@@ -513,12 +513,12 @@ $$
 \right)
 + \delta_{i, j}\sigma^2.
 $$
-The estimation of $\rho$ was termed "automatic relevance determination" in
-@Neal:1996, but this is misleading, because the magnitude the scale of
+The estimation of $\rho$ was termed "automatic relevance determination" by
+@Neal:1996, but this is misleading, because the magnitude of the scale of
 the posterior for each $\rho_d$ is dependent on the scaling of the input data
 along dimension $d$. Moreover, the scale of the parameters $\rho_d$ measures
 non-linearity along the $d$-th dimension, rather than "relevance"
-@PiironenVehtari:2016.
+[@PiironenVehtari:2016].
 
 A priori, the closer $\rho_d$ is to zero, the more nonlinear the
 conditional mean in dimension $d$ is.  A posteriori, the actual dependencies
@@ -595,7 +595,7 @@ inherent statistical properties of a GP, the GP's purpose in the model, and the
 numerical issues that may arise in Stan when estimating a GP.
 
 Perhaps most importantly, the parameters $\rho$ and $\alpha$ are weakly
-identified @zhang-gp:2004. The ratio of the two
+identified [@zhang-gp:2004]. The ratio of the two
 parameters is well-identified, but in practice we put independent priors on the
 two hyperparameters because these two quantities are more interpretable than
 their ratio.
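
The gaussian-processes.Rmd hunk above argues that the scale of each $\rho_d$ depends on how the input is scaled along dimension $d$. A minimal numerical check of that point is sketched below. It assumes the usual squared-exponential form with one length scale per dimension, of which the diff shows only the trailing `+ \delta_{i, j}\sigma^2` term; the function name `ard_kernel` and all input values are hypothetical.

```python
import math

def ard_kernel(x_i, x_j, alpha, rho, sigma, same_point):
    """Squared-exponential kernel with one length scale rho[d] per dimension.

    Assumed form: alpha^2 * exp(-0.5 * sum_d ((x_i[d] - x_j[d]) / rho[d])^2),
    plus sigma^2 when i == j (the delta_{i,j} term).
    """
    sq = sum(((a - b) / r) ** 2 for a, b, r in zip(x_i, x_j, rho))
    noise = sigma ** 2 if same_point else 0.0
    return alpha ** 2 * math.exp(-0.5 * sq) + noise

# Two hypothetical two-dimensional inputs and length scales.
x1, x2 = [0.0, 1.0], [2.0, 4.0]
rho = [1.0, 3.0]
k_original = ard_kernel(x1, x2, alpha=1.5, rho=rho, sigma=0.1, same_point=False)

# Change the units of dimension 0 by a factor c and scale rho[0] the same way.
c = 1000.0
x1_scaled = [x1[0] * c, x1[1]]
x2_scaled = [x2[0] * c, x2[1]]
rho_scaled = [rho[0] * c, rho[1]]
k_rescaled = ard_kernel(x1_scaled, x2_scaled, alpha=1.5, rho=rho_scaled,
                        sigma=0.1, same_point=False)
```

Under these assumptions `k_original` and `k_rescaled` agree, so a length scale of 1 or 1000 along dimension 0 describes the same function. The magnitude of $\rho_d$ therefore cannot be read as "relevance" without reference to the units of the input.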
 
@@ -73,7 +73,7 @@ set of points in $\mathbb{R}^3$, but each such point may be described
 uniquely by a latitude and longitude.  Geometrically, the surface
 defined by $S^2$ in $\mathbb{R}^3$ behaves locally like a plane, i.e.,
 $\mathbb{R}^2$.  However, the overall shape of $S^2$ is not like a plane
-in that is compact (i.e., there is a maximum distance between points).
+in that it is compact (i.e., there is a maximum distance between points).
 If you set off around the globe in a "straight line" (i.e., a
 geodesic), you wind up back where you started eventually; that is why
 the geodesics on the sphere ($S^2$) are called "great circles," and
@@ -123,7 +123,7 @@ option built into all of the Stan interfaces.
 
 Unit vectors correspond directly to angles and thus to rotations.
 This is easy to see in two dimensions, where a point on a circle
-determines a compass direction, or equivalently, an angle $\theta$).
+determines a compass direction, or equivalently, an angle $\theta$.
 Given an angle $\theta$, a matrix can be defined, the
 pre-multiplication by which rotates a point by an angle of $\theta$.
 For angle $\theta$ (in two dimensions), the $2 \times 2$ rotation
@@ -139,17 +139,6 @@ $$
 Given a two-dimensional vector $x$, $R_{\theta} \, x$ is the rotation
 of $x$ (around the origin) by $\theta$ degrees.
 
-### Unit vector type {-}
-
-In Stan, unit vectors in $K$ dimensions are declared as
-
-```
-unit_vector[K] alpha;
-```
-
-A unit vector has length one (meaning the sum of squared values is
-one, not that its number of elements is one).
-
 ### Angles from unit vectors {-}
 
 Angles can be calculated from unit vectors.  For example, a random
@@ -167,9 +156,9 @@ transformed parameters {
 ```
 
 If the distribution of $(x, y)$ is uniform over a circle, then the
-distribution of $\arctan \frac{y}{x}$ is uniform over $(-\pi, \pi]$.
+distribution of $\arctan \frac{y}{x}$ is uniform over $(-\pi, \pi)$.
 
-It might be tempting to try to just declare theta directly as a
+It might be tempting to try to just declare `theta` directly as a
 parameter with the lower and upper bound constraint as given above.
 The drawback to this approach is that the values $-\pi$ and $\pi$ are
 at $-\infty$ and $\infty$ on the unconstrained scale, which can
 
@@ -4,7 +4,7 @@ Stan does not support sampling discrete parameters.  So it is not
 possible to directly translate BUGS or JAGS models with discrete
 parameters (i.e., discrete stochastic nodes).  Nevertheless, it is
 possible to code many models that involve bounded discrete
-parameters by marginalizing out the discrete parameters.^[The computations are similar to those involved in expectation maximization (EM) algorithms @dempster-et-al:1977.]
+parameters by marginalizing out the discrete parameters.^[The computations are similar to those involved in expectation maximization (EM) algorithms [@dempster-et-al:1977].]
 
 This chapter shows how to code several widely-used models involving
 latent discrete parameters.  The next chapter, the [clustering
@@ -29,12 +29,12 @@ exactly the marginalization needed for coding the model in Stan.
 
 ## Change Point Models {#change-point.section}
 
-The first example is a model of coal mining disasters in the U.K. for the years 1851--1962.^[The source of the data is @Jarret:1979, which itself is a note correcting an earlier data collection.]
+The first example is a model of coal mining disasters in the U.K. for the years 1851--1962.^[The source of the data is [@Jarret:1979], which itself is a note correcting an earlier data collection.]
 
 
 ### Model with Latent Discrete Parameter {-}
 
-[@PyMC:2014 Section 3.1] provides a Poisson model of disaster
+@PyMC:2014 [Section 3.1] provides a Poisson model of disaster
 $D_t$ in year $t$ with two rate parameters, an early rate ($e$)
 and late rate ($l$), that change at a given point in time $s$.  The
 full model expressed using a latent discrete parameter $s$ is
@@ -86,7 +86,7 @@ where the likelihood is defined by marginalizing $s$ as
 p(D \mid e,l) &= \sum_{s=1}^T p(s, D \mid e,l) \\
  &= \sum_{s=1}^T p(s) \, p(D \mid s,e,l) \\
  &= \sum_{s=1}^T \textsf{uniform}(s \mid 1,T) \,
-    \prod_{t=1}^T \textsf{Poisson}(D_t \mid t < s \; ? \; e \: : \: l)
+    \prod_{t=1}^T \textsf{Poisson}(D_t \mid t < s \; ? \; e \: : \: l).
 \end{align*}
 
 Stan operates on the log scale and thus requires the log likelihood,
@@ -248,7 +248,7 @@ knitr::include_graphics("./img/s-discrete-posterior.png", auto_pdf = TRUE)
 
 In order for their range of estimates to be visible, the first plot is on the log
 scale and the second plot on the linear scale; note the narrower range
-of years in the right-hand plot resulting from sampling. The posterior
+of years in the second plot resulting from sampling. The posterior
 mean of $s$ is roughly 1891.
 
 
@@ -343,7 +343,7 @@ parameter; just because the population must be finite doesn't mean the
 parameter representing it must be.  The parameter will be used to
 produce a real-valued estimate of the population size.
 
-The Lincoln-Petersen [@Lincoln:1930,@Petersen:1896] method for
+The Lincoln-Petersen [@Lincoln:1930;@Petersen:1896] method for
 estimating population size is
 $$
 \hat{N} = \frac{M C}{R}.
@@ -385,7 +385,7 @@ for this model.
 
 To ensure the MLE is the Lincoln-Petersen estimate, an improper
 uniform prior for $N$ is used; this could (and should) be replaced
-with a more informative prior if possible based on knowledge of the
+with a more informative prior if possible, based on knowledge of the
 population under study.
 
 The one tricky part of the model is the lower bound $C - R + M$ placed
@@ -402,10 +402,9 @@ details of all constrained parameter transforms.
 
 ### Cormack-Jolly-Seber with Discrete Parameter {-}
 
-The Cormack-Jolly-Seber (CJS) model
-[@Cormack:1964; Jolly:1965; Seber:1965] is an open-population model
-in which the population may change over time due to death; the
-presentation here draws heavily on @Schofield:2007.
+The Cormack-Jolly-Seber (CJS) model [@Cormack:1964; @Jolly:1965; @Seber:1965] 
+is an open-population model in which the population may change over time 
+due to death; the presentation here draws heavily on @Schofield:2007.
 
 The basic data are
 
@@ -514,7 +513,7 @@ By defining these probabilities in terms of $\chi$ directly, there is
 no need for a latent binary parameter indicating whether an animal is
 alive at time $t$ or not.  The definition of $\chi$ is typically used
 to define the likelihood (i.e., marginalize out the latent discrete
-parameter) for the CJS model [@Schofield:2007, page 9].
+parameter) for the CJS model [@Schofield:2007].
 
 The Stan model defines $\chi$ as a transformed parameter based on
 parameters $\phi$ and $p$.  In the model block, the log probability is
@@ -796,8 +795,8 @@ predictors.
 
 Although seemingly disparate tasks, the rating/coding/annotation of
 items with categories and diagnostic testing for disease or other
-conditions share several characteristics which allow their statistical
-properties to modeled similarly.
+conditions share several characteristics which allow their statistical
+properties to be modeled similarly.
 
 ### Diagnostic Accuracy {-}
 
@@ -877,8 +876,7 @@ z_i \sim \textsf{categorical}(\pi).
 $$
 
 The rating $y_{i, j}$ provided for item $i$ by rater $j$ is modeled as
-a categorical response of rater $i$ to an item of category $z_i$,^[In the subscript, $z[i]$ is written as $z_i$ to
-  improve legibility.]
+a categorical response of rater $j$ to an item of category $z_i$,^[In the subscript, $z_i$ is written as $z[i]$ to improve legibility.]
 $$
 y_{i, j} \sim \textsf{categorical}(\theta_{j,\pi_{z[i]}}).
 $$
@@ -958,7 +956,7 @@ function.
 
 ### Stan Implementation {-}
 
-The Stan program for the Dawid and Skene model is provided below @DawidSkene:1979.
+The Stan program for the Dawid and Skene model is provided below [@DawidSkene:1979].
 
 ```
 data {
@@ -998,7 +996,7 @@ model {
 <a name="id:dawid-skene-model.figure"></a>
 
 The model marginalizes out the discrete parameter $z$, storing the
-unnormalized conditional probability $\log q(z_i=k|\theta,\pi)$ in\
+unnormalized conditional probability $\log q(z_i=k|\theta,\pi)$ in 
 `log_q_z[i, k]`.
 
 The Stan model converges quickly and mixes well using NUTS starting at
 
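The change-point marginalization touched in the latent-discrete.Rmd hunks above, $p(D \mid e, l) = \sum_{s=1}^T \textsf{uniform}(s \mid 1, T) \prod_{t=1}^T \textsf{Poisson}(D_t \mid t < s \; ? \; e : l)$, can be sketched on the log scale as follows. This is an illustrative Python sketch, not the Stan program from the guide; the helper names and the toy counts in `D` are hypothetical.

```python
import math

def poisson_lpmf(k, rate):
    """Log Poisson probability mass for count k at the given rate (> 0)."""
    return k * math.log(rate) - rate - math.lgamma(k + 1)

def log_sum_exp(values):
    """Stable log(sum(exp(v))) obtained by factoring out the maximum."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def change_point_log_lik(D, e, l):
    """Return log p(D | e, l) and the per-s terms log p(s, D | e, l)."""
    T = len(D)
    log_unif = -math.log(T)  # log uniform(s | 1, T)
    lp = []
    for s in range(1, T + 1):
        total = log_unif
        for t in range(1, T + 1):
            # Early rate e applies before the change point s, late rate l after.
            total += poisson_lpmf(D[t - 1], e if t < s else l)
        lp.append(total)
    return log_sum_exp(lp), lp

D = [4, 5, 4, 1, 0, 1]  # hypothetical yearly disaster counts
log_lik, lp = change_point_log_lik(D, e=4.0, l=1.0)

# Normalizing the per-s terms gives p(s | D, e, l) for these fixed rates.
posterior_s = [math.exp(v - log_lik) for v in lp]
```

Working with `lp` on the log scale and combining with `log_sum_exp` avoids underflow in the product over years, which is the same reason the guide defines `log_sum_exp` in terms of `log(exp(lp1) + exp(lp2))` rather than evaluating that expression directly.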
@@ -4,12 +4,12 @@ Most quantities used in statistical models arise from measurements.
 Most of these measurements are taken with some error.  When the
 measurement error is small relative to the quantity being measured,
 its effect on a model is usually small.  When measurement error is
-large relative to the quantity being measured, or when  precise
+large relative to the quantity being measured, or when precise
 relations can be estimated between measured quantities, it is useful to
 introduce an explicit model of measurement error.  One kind of
 measurement error is rounding.
 
-Meta-analysis plays out statistically  much like measurement error
+Meta-analysis plays out statistically much like measurement error
 models, where the inferences drawn from multiple data sets are
 combined to do inference over all of them.  Inferences for each data
 set are treated as providing a kind of measurement error with respect
@@ -102,7 +102,7 @@ Rounding may be done in many ways, such as rounding weights to the
 nearest milligram, or to the nearest pound; rounding may even be done
 by rounding down to the nearest integer.
 
-Exercise 3.5(b) from @GelmanEtAl:2013 provides an example.
+Exercise 3.5(b) by @GelmanEtAl:2013 provides an example.
 
 \begin{quote}
   3.5. \ Suppose we weigh an object five times and measure
@@ -227,7 +227,7 @@ the studies being analyzed.
 Suppose the data in question arise from a total of $M$ studies
 providing paired binomial data for a treatment and control group.  For
 instance, the data might be post-surgical pain reduction under a treatment
-of ibuprofen @WarnThompsonSpiegelhalter:2002 or mortality after
+of ibuprofen [@WarnThompsonSpiegelhalter:2002] or mortality after
 myocardial infarction under a treatment of beta blockers
 [@GelmanEtAl:2013, Section 5.6].
 
@@ -352,8 +352,8 @@ in each school.
 
 #### Extensions and Alternatives {-}
 
-@SmithSpiegelhalterThomas:1995 and [@GelmanEtAl:2013, Section 19.4]
-provides meta-analyses based directly on binomial data.
+@SmithSpiegelhalterThomas:1995 and @GelmanEtAl:2013 [Section 19.4]
+provide meta-analyses based directly on binomial data.
 @WarnThompsonSpiegelhalter:2002 consider the modeling
 implications of using alternatives to the log-odds ratio in
 transforming the binomial data.