The gradients of the integral are computed in accordance with the Leibniz integral rule. Gradients of the integrand are computed internally with Stan's automatic differentiation.
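
For reference, the Leibniz integral rule states that, for an integrand and limits that depend on a parameter,

$$
\frac{d}{d\theta} \int_{a(\theta)}^{b(\theta)} f(x, \theta) \, dx
= f(b(\theta), \theta) \, \frac{db}{d\theta}
- f(a(\theta), \theta) \, \frac{da}{d\theta}
+ \int_{a(\theta)}^{b(\theta)} \frac{\partial}{\partial \theta} f(x, \theta) \, dx,
$$

where the boundary terms vanish when the limits do not depend on the parameters.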
## Reduce-Sum Function {#functions-reduce}
Stan provides a higher-order `reduce_sum` function for parallelizing operations that can be represented as a sum of a function, `g: U -> real`, evaluated at each element of a list of type `U[]`, `{ x1, x2, ... }`. That is:

`g(x1) + g(x2) + ...`
`reduce_sum` doesn't work on `g` itself, but takes a partial sum function, `f: U[] -> real`, where:

`f({ x1, x2, ... }) = g(x1) + g(x2) + ...`

The reduce function returns the sum of the `start` to `end` terms (inclusive) of the overall calculation. The arguments to the reduce function are:

* `start`, the index of the first term of the partial sum, type `int`
* `end`, the index of the last term of the partial sum (inclusive), type `int`
* `x_subset`, the subset of `x` this partial sum is responsible for computing, type `T[]`, where `T` matches the type of `x` in `reduce_sum`
* `s1`, first shared argument, type `T1`, matching type of `s1` in `reduce_sum`
* `s2`, second shared argument, type `T2`, matching type of `s2` in `reduce_sum`
* `...`, remainder of shared arguments, with types matching those in `reduce_sum`
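
For illustration, a partial sum function matching this interface might look like the following sketch, where the sliced data `y_subset` and the shared arguments `mu` and `sigma` are hypothetical names chosen for this example:

```
functions {
  // Returns the sum of terms start through end of the overall log
  // density, where y_subset is the slice y[start:end] of the data.
  real partial_sum(int start, int end, real[] y_subset,
                   real mu, real sigma) {
    // normal_lpdf is vectorized: this is the sum of the log
    // densities of all elements of y_subset.
    return normal_lpdf(y_subset | mu, sigma);
  }
}
```

The full sum would then be computed as `reduce_sum(partial_sum, y, grainsize, mu, sigma)`.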

## Map-Rect Function {#functions-map}
Stan provides a higher-order map function. This allows map-reduce
functionality to be coded in Stan as described in the user's guide.

`src/stan-users-guide/parallel-computing.Rmd`
Stan has two mechanisms for parallelizing calculations used in a model: `reduce_sum` and `map_rect`.
The main advantages of `reduce_sum` are:

1. `reduce_sum` has a more flexible argument interface, avoiding the packing and unpacking that is necessary with `map_rect`.
2. `reduce_sum` partitions the data for parallelization automatically (this is done manually in `map_rect`).
3. `reduce_sum` is easier to use.

while the advantages of `map_rect` are:

1. `map_rect` returns a list of vectors, while `reduce_sum` returns only a real.
2. `map_rect` can be parallelized across multiple computers, while `reduce_sum` can only be parallelized across multiple cores.

## Reduce-Sum { #reduce-sum }
`reduce_sum` is a tool for parallelizing operations that can be represented as a sum of evaluations of a function, `g: U -> real`.
For instance, for a sequence of `x` values of type `U`, `{ x1, x2, ... }`, we might compute the sum:

`g(x1) + g(x2) + ...`
In probabilistic modeling this comes up when there are N conditionally independent terms in a likelihood. Because of the conditional independence, these terms can be computed in parallel. If dependencies exist between the terms, then this isn't possible. For instance, in evaluating the log density of a Gaussian process, `reduce_sum` would not be very useful.
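
Concretely, a likelihood with `N` conditionally independent terms factors as

$$
\log p(y \mid \theta) = \sum_{n=1}^{N} \log p(y_n \mid \theta),
$$

and each term `log p(yn | theta)` plays the role of one `g(xn)` in the sum above.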
`reduce_sum` doesn't actually take `g: U -> real` as an input argument. Instead it takes `f: U[] -> real`, where `f` computes the partial sum corresponding to the slice of the sequence `x` passed in. For instance:

`f({ x1, x2 }) = g(x1) + g(x2)`

If the user can write a function `f: U[] -> real` to compute the necessary partial sums in the calculation, then we can provide a function to automatically parallelize the calculations (and this is what `reduce_sum` is).

If the set of work is represented as an array `{ x1, x2, x3, ... }`, then mathematically it is possible to rewrite this sum with any combination of partial sums.
For instance, the sum can be written:

1. `f({ x1, x2, x3, ... })`, summing over all arguments, using one partial sum
2. `f({ x1, ..., xM }) + f({ x(M + 1), x(M + 2), ... })`, computing the first M terms separately from the rest
3. `f({ x1 }) + f({ x2 }) + f({ x3 }) + ...`, computing each term individually and summing them

The first form uses only one partial sum and no parallelization can be done; the second uses two partial sums and so can be parallelized over two workers; and the last can be parallelized over as many workers as there are elements in the array `x`. Depending on how the list is sliced up, it is possible to parallelize this calculation over many workers.
`reduce_sum` is the tool that will allow us to automatically parallelize this calculation.
For efficiency and convenience, `reduce_sum` allows for additional shared arguments to be passed to every term in the sum. So for the array `{ x1, x2, ... }` and the shared arguments `s1, s2, ...` the effective sum (with individual terms) looks like:

`g(x1, s1, s2, ...) + g(x2, s1, s2, ...) + ...`

which can be written equivalently with partial sums to look like:

`f({ x1, x2 }, s1, s2, ...) + f({ x3 }, s1, s2, ...) + ...`

where the particular slicing of the `x` array can change.
Given this, the signature for `reduce_sum` is:

```
real reduce_sum(F func, T[] x, int grainsize, T1 s1, T2 s2, ...)
```

1. `func` - User-defined function that computes partial sums
2. `x` - Array to slice, each element corresponds to a term in the summation
3. `grainsize` - Target for the size of the slices
4. `s1, s2, ...` - Arguments shared in every term

The user-defined partial sum functions have the signature:

```
real func(int start, int end, T[] x_subset, T1 s1, T2 s2, ...)
```

and take the arguments:

1. `start` - An integer specifying the first term in the partial sum
2. `end` - An integer specifying the last term in the partial sum (inclusive)
3. `x_subset` - The subset of `x` (from `reduce_sum`) for which this partial sum is responsible (`x[start:end]`)
4. `s1, s2, ...` - Arguments shared in every term (passed on without modification from the `reduce_sum` call)

The user-provided function `func` is expected to compute the `start` through `end` terms of the overall sum, accumulate them, and return that value. The user function is passed the subset `x[start:end]` as `x_subset`. `start` and `end` are passed so that `func` can index any of the trailing `sM` arguments as necessary. The trailing `sM` arguments are passed without modification to every call of `func`.

The `reduce_sum` call:

```
real sum = reduce_sum(func, x, grainsize, s1, s2, ...)
```

can be replaced by either:

```
real sum = func(1, size(x), x, s1, s2, ...)
```

or the code:

```
real sum = 0.0;
for(i in 1:size(x)) {
  sum = sum + func(i, i, { x[i] }, s1, s2, ...);
}
```
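
To make this concrete, the following is a hypothetical sketch of a complete model using `reduce_sum` with the argument order described above; the data names `y` and `x`, the parameters `alpha` and `beta`, and the grainsize of 100 are all invented for this illustration:

```
functions {
  // Partial sum of a Bernoulli-logit log likelihood: computes terms
  // start through end, where y_subset is the slice y[start:end].
  real partial_sum(int start, int end, int[] y_subset,
                   vector x, real alpha, real beta) {
    // start and end index the shared argument x so that it stays
    // aligned with the sliced subset of y.
    return bernoulli_logit_lpmf(y_subset | alpha + beta * x[start:end]);
  }
}
data {
  int N;
  int<lower=0, upper=1> y[N];
  vector[N] x;
}
parameters {
  real alpha;
  real beta;
}
model {
  int grainsize = 100;  // hint for the number of terms per partial sum
  alpha ~ normal(0, 2);
  beta ~ normal(0, 2);
  // Equivalent to target += bernoulli_logit_lpmf(y | alpha + beta * x),
  // but the partial sums may be evaluated in parallel.
  target += reduce_sum(partial_sum, y, grainsize, x, alpha, beta);
}
```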

independent Bernoulli log probability statements, which is the condition where `reduce_sum` is useful.

To use `reduce_sum`, a function must be written that can be used to compute arbitrary partial sums of the total sum.

Using the interface defined in [Reduce-Sum](#reduce-sum), such a function