Reworded user's guide introduction to reduce_sum and lowered expectations about what grainsize = 1 can do. (design-doc #17)

bbbales2 · bbbales2 · commit 021d3640fc0c · 2020-04-07T10:27:13.000-04:00
diff --git a/src/functions-reference/higher-order_functions.Rmd b/src/functions-reference/higher-order_functions.Rmd
@@ -410,12 +410,12 @@ exactly. This implies that the order of summation determines the exact
 numerical result. For this reason, the higher-order reduce function is
 available in two variants:
 
-* `reduce_sum`: Compute partial sums automatically. This usually
- results in good performance without further tuning.
-* `reduce_sum_static`: For the same input, always create the same call
-graph. This results in stable numerical evaluation. This version
-requires setting a tuning parameter which controls the maximal size of
-partial sums formed.
+* `reduce_sum`: Automatically choose partial sums partitioning based on a dynamic
+ scheduling algorithm.
+* `reduce_sum_static`: Compute the same sum as `reduce_sum`, but partition
+ the input in the same way for a given data set (in `reduce_sum` this partitioning
+ might change depending on computer load). This should result in stable
+ numerical evaluations.
 
 ### Specifying the Reduce-sum Function
 
@@ -437,16 +437,14 @@ partial sums. `s1, s2, ...` are shared between all terms in the sum.
 * *`f`*: function literal referring to a function specifying the
 partial sum operation. Refer to the [partial sum function](#functions-partial-sum).
 * *`x`*: array of `T`, one for each term of the reduction, `T` can be any type,
-* *`grainsize`*: recommended number of terms in each reduce call, set
-to 1 to estimate automatically for `reduce_sum` while for
-`reduce_sum_static` this determines the maximal size of the partial sums, type `int`,
+* *`grainsize`*: For `reduce_sum`, `grainsize` is the recommended size of the partial sum. For `reduce_sum_static`, `grainsize` determines the maximum size of the partial sums, type `int`,
 * *`s1`*: first (optional) shared argument, type `T1`, where `T1` can be any type
 * *`s2`*: second (optional) shared argument, type `T2`, where `T2` can be any type,
 * *`...`*: remainder of shared arguments, each of which can be any type.
 
-### The Partial-sum Function {#functions-partial-sum}
+### The Partial sum Function {#functions-partial-sum}
 
-The partial sum function must have the following signature where the types `T`, and the
+The partial sum function must have the following signature where the type `T`, and the
 types of all the shared arguments (`T1`, `T2`, ...) match those of the original
 `reduce_sum` (`reduce_sum_static`) call.
 
diff --git a/src/stan-users-guide/parallelization.Rmd b/src/stan-users-guide/parallelization.Rmd
@@ -28,25 +28,21 @@ law](https://en.wikipedia.org/wiki/Amdahl's_law).
 
 ## Reduce-Sum { #reduce-sum }
 
-The higher-order reduce with summation facility maps evaluation of a
-function `g: U -> real`, which returns a scalar, to a list of type
-`U[]`, `{ x1, x2, ... }`, and performs as reduction operation a sum
-over the results. For instance, for a sequence of ```x``` values of
-type ```U```, ```{ x1, x2, ... }```, we might compute the sum:
+It is often necessary in probabilistic modeling to compute the sum of
+a number of independent function evaluations. This occurs, for instance, when
+evaluating a number of conditionally independent terms in a log-likelihood.
+If `g: U -> real` is the function and `{ x1, x2, ... }` is an array of
+inputs, then that sum looks like:
 
 `g(x1) + g(x2) + ...`
 
-In probabilistic modeling this comes up when there are $N$
-conditionally independent terms in a likelihood. Because of the
-conditional independence, these terms can be computed in parallel. If
-dependencies exist between the terms, then this isn't possible. For
-instance, in evaluating the log density of a Gaussian process then
-summation of independent terms isn't applicable.
+The `reduce_sum` function is a tool for automatically parallelizing these
+calculations.
 
 For efficiency reasons the reduce function doesn’t work with the
-element-wise evaluated function `g`, but instead requires the partial
-sum function ```f: U[] -> real```, where ```f``` computes the partial
-sum corresponding to a slice of the sequence ```x``` passed in. Due to the
+element-wise evaluated function `g`, but instead works with the partial
+sum function `f: U[] -> real`, where `f` computes the partial
+sum corresponding to a slice of the sequence `x` passed in. Due to
 the associativity of the sum reduction it holds that:
 
 ```
@@ -64,20 +60,20 @@ control of the user. However, since the exact numerical result will
 depend on the order of summation, Stan provides two versions of the
 reduce summation facility:
 
-* `reduce_sum`: Automatically forms partial sums resulting usually in good
- performance without further tuning.
-* `reduce_sum_static`: Creates for the same input always the same
-call graph resulting in stable numerical evaluation. This version
-requires setting a sensible tuning parameter for good performance.
-
-The tuning parameter is the so-called `grainsize`. For the
-`reduce_sum` version, the `grainsize` is merely a suggested partial sum
-size, while for the `reduce_sum_static` version the `grainsize`
-specifies the maximal partial sum size. While for `reduce_sum` a
-`grainsize` of 1 commonly leads to good performance already (since
-automatic aggregation is performed), the `reduce_sum_static` variant
-requires setting a sensible `grainsize` for good performance as
-explained in [more detail below](#reduce-sum-grainsize).
+* `reduce_sum`: Automatically choose partial sums partitioning based on a dynamic
+ scheduling algorithm.
+* `reduce_sum_static`: Compute the same sum as `reduce_sum`, but partition
+ the input in the same way for a given data set (in `reduce_sum` this partitioning
+ might change depending on computer load).
+
+`grainsize` is the one tuning parameter. For `reduce_sum`, `grainsize` is
+a suggested partial sum size. A `grainsize` of 1 leaves the partitioning
+entirely up to the scheduler.
+
+For `reduce_sum_static`, `grainsize` specifies the maximal partial sum size.
+With `reduce_sum_static` it is more important to choose `grainsize`
+carefully since it entirely determines the partitioning of work.
+See [more detail below](#reduce-sum-grainsize).
 
 For efficiency and convenience additional
 shared arguments can be passed to every term in the sum. So for the
@@ -230,7 +226,7 @@ target += reduce_sum(partial_sum, y,
 ```
 
 The reduce summation facility automatically breaks the sum into roughly `grainsize` sized pieces
-and computes them in parallel. `grainsize = 1` specifies that the grainsize should
+and computes them in parallel. `grainsize = 1` specifies that the `grainsize` should
 be estimated automatically. The final model looks as:
 
 ```
@@ -261,35 +257,41 @@ model {
 
 ### Picking the Grainsize {#reduce-sum-grainsize}
 
-The `grainsize` is a recommendation on how large each piece of
+The `grainsize` is a recommendation on how large each piece of
 parallel work is (how many terms it contains). When using the
 non-static version, it is recommended to choose 1 as a starting
 point as automatic aggregation of partial sums are performed. However,
 for the static version the `grainsize` defines the maximal size of the
-partial sums, e.g. the static variant will split the input sequence
-until all partial sums are just smaller than `grainsize`. Therefore,
-for the static version it is more important to select a sensible
-value. The rational for choosing a sensible `grainsize` is based on
+partial sums.
+
+The rationale for choosing a sensible `grainsize` is based on
 balancing the overhead implied by creating many small tasks versus
 creating fewer large tasks which limits the potential parallelism.
 
-From empirical experience, the automatic grainsize determination works
-well and no further tuning is required in most cases. In order to
-figure out an optimal grainsize, think about how many terms are in the
-summation and on how many cores the model should run. If there are `N`
+In `reduce_sum`, `grainsize` is a recommendation on how to partition
+the work in the partial sum into smaller pieces. A `grainsize` of 1
+leaves this entirely up to the internal scheduler. Ideally this will be
+efficient, but there are no guarantees.
+
+In `reduce_sum_static`, `grainsize` is an upper limit on the worksize.
+Work will be split until all partial sums are just smaller than `grainsize`
+(and the split will happen the same way every time for the same data).
+For the static version it is more important to select a sensible `grainsize`.
+
+In order to figure out an optimal `grainsize`, if there are `N`
 terms and `M` cores, run a quick test model with `grainsize` set
-roughly to `N / M`. Record the time, cut the grainsize in half, and
+roughly to `N / M`. Record the time, cut the `grainsize` in half, and
 run the test again. Repeat this iteratively until the model runtime
-begins to increase. This is a suitable grainsize for the model,
+begins to increase. This is a suitable `grainsize` for the model,
 because this ensures the calculations can be carried out with the most
 parallelism without losing too much efficiency.
 
 For instance, in a model with `N=10000` and `M = 4`, start with `grainsize = 2500`, and
 sequentially try `grainsize = 1250`, `grainsize = 625`, etc.
 
-It is important to repeat this process until performance gets worse!
+It is important to repeat this process until performance gets worse.
 It is possible after many halvings nothing happens, but there might
-still be a smaller grainsize that performs better.  Even if a sum has
+still be a smaller `grainsize` that performs better.  Even if a sum has
 many tens of thousands of terms, depending on the internal
 calculations, a `grainsize` of thirty or forty or smaller might be the
 best, and it is difficult to predict this behavior.  Without doing