# Parallelization {#parallelization.chapter}

Stan has two mechanisms for parallelizing calculations used in a model: `reduce_sum` and `map_rect`.

The advantages of `reduce_sum` are:

1. `reduce_sum` has a more flexible argument interface, avoiding the packing and unpacking that is necessary with `map_rect`.
2. `reduce_sum` partitions data for parallelization automatically (this is done manually in `map_rect`).
3. `reduce_sum` is easier to use.

The advantages of `map_rect` are:

1. `map_rect` returns a list of vectors, while `reduce_sum` returns only a real.
2. `map_rect` can be parallelized across multiple cores and multiple computers, while `reduce_sum` can only be parallelized across multiple cores on a single machine.

The actual speedup gained from using these functions will depend on many details. It is strongly recommended to only parallelize the most computationally expensive operations in a Stan program. Oftentimes this is the evaluation of the log likelihood for the observed data. Since only portions of a Stan program will run in parallel, the maximal speedup one can achieve is capped, a phenomenon described by [Amdahl's law](https://en.wikipedia.org/wiki/Amdahl's_law). For example, if the parallelized likelihood accounts for 90% of the runtime, the overall speedup can never exceed 10x, no matter how many cores are available.

## Reduce-Sum {#reduce-sum}

```reduce_sum``` maps the evaluation of a function `g: U -> real` over a list of type `U[]`, `{ x1, x2, ... }`, and performs a sum over the results as the reduction operation. For instance, for a sequence of ```x``` values of type ```U```, ```{ x1, x2, ... }```, we might compute the sum:

```g(x1) + g(x2) + ...```

In probabilistic modeling this comes up when there are N conditionally independent terms in a likelihood. Because of the conditional independence, these terms can be computed in parallel. If dependencies exist between the terms, then this is not possible. For instance, in evaluating the log density of a Gaussian process, ```reduce_sum``` would not be very useful.

```reduce_sum``` requires the partial sum function ```f: U[] -> real```, where ```f``` computes the partial sum corresponding to the slice of the sequence ```x``` passed in. ```reduce_sum``` exploits the associativity of the sum operation, since it holds that:

```
g(x1) + g(x2) + g(x3) = f({ x1, x2, x3 })
                      = f({ x1, x2 }) + f({ x3 })
                      = f({ x1 }) + f({ x2, x3 })
                      = f({ x1 }) + f({ x2 }) + f({ x3 })
```

If the user can write a function ```f: U[] -> real``` to compute the necessary partial sums of the calculation, then ```reduce_sum``` can automatically parallelize the calculations. The exact partitioning into partial sums is arbitrary, as the partitionings are all mathematically equivalent to one another. Because the partitioning is flexible, it can be adapted to the available resources (the number of concurrent threads) given to Stan.

For efficiency and convenience, ```reduce_sum``` allows for additional shared arguments to be passed to every term in the sum. So for the array ```{ x1, x2, ... }``` and the shared arguments ```s1, s2, ...``` the effective sum (with individual terms) looks like:
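
```
g(x1, s1, s2, ...) + g(x2, s1, s2, ...) + g(x3, s1, s2, ...) + ...
```

which can be computed with any partitioning into partial sums, for instance (using the partial sum function ```f``` defined below, which takes the slice bounds as its first two arguments):

```
f(1, 2, { x1, x2 }, s1, s2, ...) + f(3, 3, { x3 }, s1, s2, ...)
```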
where the particular slicing of the ```x``` array can change.

Given this, the signature for ```reduce_sum``` is:

```
real reduce_sum(F f, T[] x, int grainsize, T1 s1, T2 s2, ...)
```

1. ```f``` - User-defined function that computes partial sums
2. ```x``` - Array to slice; each element corresponds to a term in the summation
3. ```grainsize``` - Target for the size of the slices
4. ```s1, s2, ...``` - Arguments shared in every term

The user-defined partial sum functions have the signature:

```
real f(int start, int end, T[] x_slice, T1 s1, T2 s2, ...)
```

and take the arguments:

1. ```start``` - An integer specifying the first term in the partial sum
2. ```end``` - An integer specifying the last term in the partial sum (inclusive)
3. ```x_slice``` - The subset of ```x``` (from ```reduce_sum```) for which this partial sum is responsible (```x_slice = x[start:end]```)
4. ```s1, s2, ...``` - Arguments shared in every term (passed on without modification from the ```reduce_sum``` call)

The user-provided function ```f``` is expected to compute the partial sum over the terms ```start``` through ```end``` of the overall sum. The user function is passed the subset ```x[start:end]``` as ```x_slice```. ```start``` and ```end``` are passed so that ```f``` can index any of the trailing ```sM``` arguments as necessary. The trailing ```sM``` arguments are passed without modification to every call of ```f```.

The ```reduce_sum``` call:

```
real sum = reduce_sum(f, x, grainsize, s1, s2, ...);
```

can be replaced by either:

```
real sum = f(1, size(x), x, s1, s2, ...);
```

or the code:

```
real sum = 0.0;
for(i in 1:size(x)) {
  sum += f(i, i, { x[i] }, s1, s2, ...);
}
```

### Example: Logistic Regression

Logistic regression is a useful example to clarify both the syntax and semantics of ```reduce_sum``` and how it can be used to speed up a typical model.

A basic logistic regression can be coded in Stan as:
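
```
data {
  int N;
  int y[N];      // binary outcomes
  vector[N] x;   // covariate
}
parameters {
  vector[2] beta;
}
model {
  // the exact data declarations and prior are a sketch; the names
  // y, x, and beta match their use in the rest of this section
  beta ~ std_normal();
  y ~ bernoulli_logit(beta[1] + beta[2] * x);
}
```

The likelihood statement here can be written out equivalently as a sum of its `N` individual log probability terms:

```
for(n in 1:N) {
  target += bernoulli_logit_lpmf(y[n] | beta[1] + beta[2] * x[n]);
}
```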

Now it is clear that the calculation is the sum of a number of conditionally independent Bernoulli log probability statements, which is the condition where ```reduce_sum``` is useful.

To use ```reduce_sum```, a function must be written that can be used to compute arbitrary partial sums of the total sum.

Using the interface defined in [Reduce-Sum](#reduce-sum), such a function can be written like:
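
```
functions {
  // computes the partial sum of the log likelihood for terms
  // start through end; y is the sliced argument, while x and beta
  // are shared arguments (the body is a sketch consistent with the
  // call shown below)
  real partial_sum(int start, int end,
                   int[] y_slice,
                   vector x,
                   vector beta) {
    return bernoulli_logit_lpmf(y_slice | beta[1] + beta[2] * x[start:end]);
  }
}
```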

And the likelihood statement in the model can now be written:

```
target += partial_sum(1, N, y, x, beta); // Sum terms 1 to N of the likelihood
```

In this example, `y` was chosen to be sliced over because there is one term in the summation per value of `y`. Technically `x` would have worked as well. Use whatever conceptually makes the most sense.

Because `x` is a shared argument, it is subset accordingly with `start:end`.

With this function, `reduce_sum` can be used to automatically parallelize the likelihood:
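
```
model {
  beta ~ std_normal();
  // grainsize is assumed to be declared elsewhere (for instance as
  // data) so that it can be changed without recompiling the model
  target += reduce_sum(partial_sum, y, grainsize, x, beta);
}
```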

### Picking the Grainsize

The `grainsize` is a recommendation on how large each piece of parallel work is (how many terms it contains). It is recommended to choose a `grainsize` of one as a starting point, which lets Stan select an appropriate value automatically.

From empirical experience, the automatic grainsize determination works well and no further tuning is required in most cases. To figure out an optimal grainsize manually, think about how many terms are in the summation and on how many cores the model should run. If there are `N` terms and `M` cores, run a quick test model with `grainsize` set roughly to `N / M`. Record the time, cut the grainsize in half, and run the test again. Repeat this iteratively until the model runtime begins to increase. This is a suitable grainsize for the model, because it ensures the calculations can be carried out with the most parallelism without losing too much efficiency.

For instance, in a model with `N = 10000` and `M = 4`, start with `grainsize = 2500`, and sequentially try `grainsize = 1250`, `grainsize = 625`, etc.

It is important to repeat this process until performance gets worse! It is possible that after many halvings nothing happens, but there might still be a smaller grainsize that performs better. Even if a sum has many tens of thousands of terms, depending on the internal calculations, a `grainsize` of thirty or forty or smaller might be the best, and it is difficult to predict this behavior. Without doing these halvings until performance actually gets worse, it is easy to miss this.