# Parallelization {#parallelization.chapter}

Stan has two mechanisms for parallelizing calculations used in a
model: reduce with summation and rectangular map.

The advantages of reduce with summation are:

1. More flexible argument interface, avoiding the packing and
   unpacking that is necessary with rectangular map.
2. Partitions data for parallelization automatically (this is done
   manually in rectangular map).
3. Easier to use.

The advantages of rectangular map are:

1. Returns a list of vectors, while reduce with summation returns
   only a scalar.
2. Can be parallelized across multiple cores and multiple computers,
   while reduce with summation can only be parallelized across
   multiple cores on a single machine.

The actual speedup gained from using these functions will depend on
many details of the model and the hardware; because only part of the
program runs in parallel, the attainable speedup is capped by
[Amdahl's law](https://en.wikipedia.org/wiki/Amdahl's_law).

## Reduce-Sum {#reduce-sum}

The higher-order reduce with summation facility maps the evaluation
of a function `g: U -> real`, which returns a scalar, over a list of
type `U[]`, `{ x1, x2, ... }`, and performs, as its reduction
operation, a sum over the results. For instance, for a sequence of
`x` values of type `U`, `{ x1, x2, ... }`, we might compute the sum:

```
g(x1) + g(x2) + ...
```

In probabilistic modeling this comes up when there are $N$
conditionally independent terms in a likelihood. Because of the
conditional independence, these terms can be computed in parallel. If
dependencies exist between the terms, then this isn't possible; for
instance, when evaluating the log density of a Gaussian process, the
terms are not independent and this summation does not apply.

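For example, with conditionally independent observations $y_1,
\ldots, y_N$ given parameters $\theta$, the log likelihood decomposes
into a sum of per-observation terms,

$$
\log p(y \mid \theta) = \sum_{n=1}^{N} \log p(y_n \mid \theta),
$$

and each term of the sum can be evaluated separately and in any order.
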
For efficiency reasons the reduce function doesn't work with the
element-wise evaluated function `g`, but instead requires the partial
sum function `f: U[] -> real`, where `f` computes the partial sum
corresponding to a slice of the sequence `x` passed in. Due to the
associativity of the sum reduction it holds that:

```
g(x1) + g(x2) + g(x3) = f({ x1, x2, x3 })
                      = f({ x1, x2 }) + f({ x3 })
                      = f({ x1 }) + f({ x2, x3 })
                      = f({ x1 }) + f({ x2 }) + f({ x3 })
```

With the partial sum function `f: U[] -> real` the reduction of a
large number of terms can be evaluated in parallel automatically,
since the overall sum can be partitioned into smaller partial sums in
an arbitrary way. The exact partitioning into partial sums is not
under the control of the user. However, since the exact numerical
result will depend on the order of summation, Stan provides two
versions of the reduce summation facility:

* `reduce_sum`: Automatically forms partial sums, usually resulting
  in good performance without further tuning.
* `reduce_sum_static`: Always creates the same call graph for the
  same input, resulting in numerically stable evaluation. This
  version requires a sensible tuning parameter for good performance.

The tuning parameter is the so-called `grainsize`. For the
`reduce_sum` version the `grainsize` is merely a suggested partial
sum size, while for the `reduce_sum_static` version the `grainsize`
specifies the maximal partial sum size. While for `reduce_sum` a
`grainsize` of one commonly already leads to good performance (since
automatic aggregation is performed), the `reduce_sum_static` variant
requires a sensible `grainsize` for good performance, as explained in
[more detail below](#reduce-sum-grainsize).

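For illustration only, here is a sketch of how the two variants are
called inside the model block; the partial sum function `f`, the
sliced data `x`, and the shared argument `s` are hypothetical
stand-ins defined elsewhere:

```
// Non-static variant: a grainsize of 1 lets the scheduler pick the
// partition sizes automatically.
target += reduce_sum(f, x, 1, s);

// Static variant (an alternative to the call above, not in addition
// to it): the same input always yields the same partitioning, with
// each partial sum containing at most 250 terms.
// target += reduce_sum_static(f, x, 250, s);
```
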
For efficiency and convenience, additional shared arguments can be
passed to every term in the sum. So for the array `{ x1, x2, ... }`
and the shared arguments `s1, s2, ...` the effective sum (with
individual terms) looks like:

```
g(x1, s1, s2, ...) + g(x2, s1, s2, ...) + g(x3, s1, s2, ...)
```

which can be written equivalently with partial sums as:

```
f({ x1, x2 }, s1, s2, ...) + f({ x3 }, s1, s2, ...)
```

where the particular slicing of the `x` array can change.

Given this, the signatures are:

```
real reduce_sum(F f, T[] x, int grainsize, T1 s1, T2 s2, ...)
real reduce_sum_static(F f, T[] x, int grainsize, T1 s1, T2 s2, ...)
```

1. `f` - User-defined function that computes partial sums
2. `x` - Array to slice, each element corresponds to a term in the summation
3. `grainsize` - Target for size of slices
4. `s1, s2, ...` - Arguments shared in every term

The user-defined partial sum functions have the signature:

```
real f(T[] x_slice, int start, int end, T1 s1, T2 s2, ...)
```

and take the arguments:

1. `start` - An integer specifying the first term in the partial sum
2. `end` - An integer specifying the last term in the partial sum (inclusive)
3. `x_slice` - The subset of `x` (from `reduce_sum` / `reduce_sum_static`)
   for which this partial sum is responsible (`x_slice = x[start:end]`)
4. `s1, s2, ...` - Arguments shared in every term (passed on without
   modification from the `reduce_sum` / `reduce_sum_static` call)

The user-provided function `f` is expected to compute the partial sum
with the terms `start` through `end` of the overall sum. The user
function is passed the subset `x[start:end]` as `x_slice`. The
`start` and `end` arguments are passed so that `f` can index any of
the trailing `sM` arguments as necessary. The trailing `sM` arguments
are passed without modification to every call of `f`.

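As a minimal sketch (the names here are hypothetical, not part of the
interface), a partial sum function whose terms are independent normal
log densities could use `start` and `end` to index a per-term shared
argument `mu`:

```
functions {
  // Each term n contributes normal_lpdf(x[n] | mu[n], sigma). The
  // x_slice argument already holds terms start through end of x,
  // while the shared argument mu must be indexed with start:end.
  real partial_normal(real[] x_slice, int start, int end,
                      vector mu, real sigma) {
    return normal_lpdf(to_vector(x_slice) | mu[start:end], sigma);
  }
}
```
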
A `reduce_sum` (or `reduce_sum_static`) call:

```
real sum = reduce_sum(f, x, grainsize, s1, s2, ...);
```

can be replaced by either:

```
real sum = f(x, 1, size(x), s1, s2, ...);
```

or the code:

```
real sum = 0;
for(i in 1:size(x)) {
  sum += f({ x[i] }, i, i, s1, s2, ...);
}
```

### Example: Logistic Regression

Logistic regression is a useful example to clarify both the syntax
and semantics of reduce summation and how it can be used to speed up
a typical model.

A basic logistic regression can be coded in Stan as:

```
data {
  int N;
  int y[N];
  vector[N] x;
}
parameters {
  vector[2] beta;
}
model {
  beta ~ std_normal();
  y ~ bernoulli_logit(beta[1] + beta[2] * x);
}
```

In this model predictions are made about the `N` outputs `y` using
the covariate `x`. The intercept and slope of the linear equation are
to be estimated.

The key to making this calculation use reduce summation is
recognizing that the statement:

```
y ~ bernoulli_logit(beta[1] + beta[2] * x);
```

can be rewritten (up to a proportionality constant) as:

```
for(n in 1:N) {
  target += bernoulli_logit_lpmf(y[n] | beta[1] + beta[2] * x[n]);
}
```

Now it is clear that the calculation is the sum of a number of
conditionally independent Bernoulli log probability statements, which
is the condition where reduce summation is useful.

To use reduce summation, a function must be written that can compute
arbitrary partial sums of the total sum.

Using the interface defined in [Reduce-Sum](#reduce-sum), such a
function can be written as:

```
functions {
  real partial_sum(int[] y_slice,
                   int start, int end,
                   vector x,
                   vector beta) {
    return bernoulli_logit_lpmf(y_slice | beta[1] + beta[2] * x[start:end]);
  }
}
```

The likelihood statement in the model can now be written:

```
target += partial_sum(y, 1, N, x, beta); // Sum terms 1 to N of the likelihood
```

In this example, `y` was chosen to be sliced over because there is
one term in the summation per value of `y`. Technically `x` would
have worked as well. Use whatever conceptually makes the most sense.

Because `x` is a shared argument, it is subset accordingly with
`start:end`.

With this function, reduce summation can be used to automatically
parallelize the likelihood:

```
target += reduce_sum(partial_sum, y,
                     grainsize,
                     x, beta);
```

The reduce summation facility automatically breaks the sum into
roughly `grainsize` sized pieces and computes them in parallel.
`grainsize = 1` specifies that the grainsize should be estimated
automatically. The final model looks like:

```
functions {
  real partial_sum(int[] y_slice,
                   int start, int end,
                   vector x,
                   vector beta) {
    return bernoulli_logit_lpmf(y_slice | beta[1] + beta[2] * x[start:end]);
  }
}
data {
  int N;
  int y[N];
  vector[N] x;
}
transformed data {
  int grainsize = 1;
}
parameters {
  vector[2] beta;
}
model {
  beta ~ std_normal();
  target += reduce_sum(partial_sum, y,
                       grainsize,
                       x, beta);
}
```

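As a practical note (CmdStan assumed here; other interfaces expose
equivalent options): reduce summation only actually runs in parallel
when the model is compiled with threading support, e.g. with
`STAN_THREADS=true`, and the number of threads is set at run time,
e.g. through the `STAN_NUM_THREADS` environment variable.
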
### Picking the Grainsize {#reduce-sum-grainsize}

The `grainsize` is a recommendation on how large each piece of
parallel work is (how many terms it contains). When using the
non-static version, it is recommended to choose a grainsize of one as
a starting point, since automatic aggregation of partial sums is
performed. However, for the static version the `grainsize` defines
the maximal size of the partial sums, i.e. the static variant will
split the input sequence until all partial sums are just smaller than
`grainsize`. Therefore, for the static version it is more important
to select a sensible value. The rationale for choosing a sensible
`grainsize` is to balance the overhead implied by creating many small
tasks against creating fewer large tasks, which limits the potential
parallelism.

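As a rough illustration with hypothetical numbers: for a sum with
$N = 10000$ terms, a `grainsize` of 5000 yields only two partial sums
and thus at most two-way parallelism, while a `grainsize` of 10
yields a thousand small tasks whose scheduling overhead may dominate
the actual work; a sensible value lies between these extremes.
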
From empirical experience, the automatic grainsize determination
works well and no further tuning is required in most cases. In order
to tune the `grainsize` by hand, a simple search can be used: if
there are `N` terms in the sum and `M` cores available, run a short
test with `grainsize` set roughly to `N / M` and record the runtime,
then halve the `grainsize` and run the test again. Repeating this
finds the smallest grainsize that still maintains the potential
parallelism without losing too much efficiency.

For instance, in a model with `N=10000` and `M = 4`, start with
`grainsize = 2500`, and sequentially try `grainsize = 1250`,
`grainsize = 625`, etc.

It is important to repeat this process until performance gets worse!
It is possible that after many halvings nothing changes, but there
might still be a smaller grainsize that performs better. Even if a
sum has many tens of thousands of terms, depending on the internal
calculations, a `grainsize` of thirty or forty or smaller might be
best, and it is difficult to predict this behavior. Without doing
these halvings until performance actually gets worse, it is easy to
miss this.

## Map-Rect
