@@ -28,25 +28,22 @@ law](https://en.wikipedia.org/wiki/Amdahl's_law).
 
 ## Reduce-Sum { #reduce-sum }
 
-The higher-order reduce with summation facility maps evaluation of a
-function `g: U -> real`, which returns a scalar, to a list of type
-`U[]`, `{ x1, x2, ... }`, and performs as reduction operation a sum
-over the results. For instance, for a sequence of `x` values of
-type `U`, `{ x1, x2, ... }`, we might compute the sum as
-`g(x1) + g(x2) + ...`.
-
-In probabilistic modeling this comes up when there are $N$
-conditionally independent terms in a likelihood. Because of the
-conditional independence, these terms can be computed in parallel. If
-dependencies exist between the terms, then this isn't possible. For
-instance, in evaluating the log density of a Gaussian process then
-summation of independent terms isn't applicable.
+It is often necessary in probabilistic modeling to compute the sum of
+a number of independent function evaluations. This occurs, for instance, when
+evaluating a number of conditionally independent terms in a log-likelihood.
+If `g: U -> real` is the function and `{ x1, x2, ... }` is an array of
+inputs, then that sum looks like:
+
+`g(x1) + g(x2) + ...`
+
+`reduce_sum` and `reduce_sum_static` are tools for parallelizing these
+calculations.
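The sum above, and the fact that it can be assembled from partial sums over slices of the inputs (which is what the facilities below exploit), can be made concrete with a small numeric sketch. This is Python rather than Stan, purely for illustration; the function `g` and the inputs are made up:

```python
# Illustrative only: summing independent evaluations of a scalar
# function g, element-wise and via partial sums over slices.

def g(x):
    # stand-in for any scalar-returning function g: U -> real
    return x * x

def f(xs):
    # partial sum over a slice of the inputs
    return sum(g(x) for x in xs)

x = [1.0, 2.0, 3.0, 4.0]

elementwise = g(x[0]) + g(x[1]) + g(x[2]) + g(x[3])
via_slices = f(x[:2]) + f(x[2:])  # any slicing gives the same total

print(elementwise)  # 30.0
print(via_slices)   # 30.0
```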
 
 For efficiency reasons the reduce function doesn’t work with the
-element-wise evaluated function `g`, but instead requires the partial
-sum function `f: U[] -> real`, where `f` computes the partial
-sum corresponding to a slice of the sequence `x` passed in. Due to the
-the associativity of the sum reduction it holds that:
+element-wise evaluated function `g`, but instead requires the partial
+sum function `f: U[] -> real`, where `f` computes the partial
+sum corresponding to a slice of the sequence `x` passed in. Due to the
+associativity of the sum reduction it holds that:
 
 ```
 g(x1) + g(x2) + g(x3) = f({ x1, x2, x3 })
@@ -63,20 +60,21 @@ control of the user. However, since the exact numerical result will
 depend on the order of summation, Stan provides two versions of the
 reduce summation facility:
 
-* `reduce_sum`: Automatically forms partial sums resulting usually in good
-  performance without further tuning.
-* `reduce_sum_static`: Creates for the same input always the same
-  call graph resulting in stable numerical evaluation. This version
-  requires setting a sensible tuning parameter for good performance.
-
-The tuning parameter is the so-called `grainsize`. For the
-`reduce_sum` version, the `grainsize` is merely a suggested partial sum
-size, while for the `reduce_sum_static` version the `grainsize`
-specifies the maximal partial sum size. While for `reduce_sum` a
-`grainsize` of 1 commonly leads to good performance already (since
-automatic aggregation is performed), the `reduce_sum_static` variant
-requires setting a sensible `grainsize` for good performance as
-explained in [more detail below](#reduce-sum-grainsize).
+* `reduce_sum`: Automatically chooses the partial sums partitioning based on a
+  dynamic scheduling algorithm.
+* `reduce_sum_static`: Computes the same sum as `reduce_sum`, but partitions
+  the input in the same way for a given data set (in `reduce_sum` this
+  partitioning might change depending on computer load).
+
+`grainsize` is the one tuning parameter. For `reduce_sum`, `grainsize` is
+a suggested partial sum size. A `grainsize` of 1 leaves the partitioning
+entirely up to the scheduler. This should be the default way of using
+`reduce_sum` unless time is spent carefully picking a `grainsize`. For
+picking a `grainsize`, see details [below](#reduce-sum-grainsize).
+
+For `reduce_sum_static`, `grainsize` specifies the maximal partial sum size.
+With `reduce_sum_static` it is more important to choose `grainsize`
+carefully since it entirely determines the partitioning of work.
+See details [below](#reduce-sum-grainsize).
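One way to picture the static partitioning is recursive halving of the input range until every piece holds at most `grainsize` terms. This is an assumed scheme for intuition only (the actual splitting strategy is an internal implementation detail), sketched in Python:

```python
# Illustrative sketch in the spirit of reduce_sum_static: split index
# ranges in half until each piece is no larger than grainsize. The
# result is deterministic for a given input size and grainsize.

def partition(n, grainsize):
    def split(start, end):
        if end - start <= grainsize:
            return [(start, end)]
        mid = (start + end) // 2
        return split(start, mid) + split(mid, end)
    return split(0, n)

print(partition(10, 3))  # [(0, 2), (2, 5), (5, 7), (7, 10)]
```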
 
 For efficiency and convenience additional
 shared arguments can be passed to every term in the sum. So for the
@@ -222,15 +220,15 @@ accordingly with `start:end`. With this function, reduce summation can
 be used to automatically parallelize the likelihood:
 
 ```
-int grainsize = 100;
+int grainsize = 1;
 target += reduce_sum(partial_sum, y,
                      grainsize,
                      x, beta);
 ```
 
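As a mental model of what this call computes, here is a hypothetical serial Python sketch, not the actual implementation: the real facility evaluates the slices in parallel, and Stan passes 1-based `start` and `end` indices to the partial sum function, while this sketch uses Python's 0-based conventions. The partial sum function shown is made up:

```python
# Hypothetical serial equivalent of a reduce-sum call (illustration
# only): slice the first argument, pass each slice with its index
# range and the shared arguments, and add up the partial sums.

def serial_reduce_sum(partial_sum, y, grainsize, *shared):
    total = 0.0
    for start in range(0, len(y), grainsize):
        end = min(start + grainsize, len(y))
        # each slice contributes an independent partial sum
        total += partial_sum(y[start:end], start, end, *shared)
    return total

# made-up partial sum: a weighted sum over the slice
def weighted_slice_sum(y_slice, start, end, weight):
    return weight * sum(y_slice)

print(serial_reduce_sum(weighted_slice_sum, [1, 2, 3, 4, 5], 2, 10))  # 150
```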
-The reduce summation facility automatically breaks the sum into roughly `grainsize` sized pieces
-and computes them in parallel. `grainsize = 1` specifies that the grainsize should
-be estimated automatically. The final model looks as:
+The reduce summation facility automatically breaks the sum into pieces
+and computes them in parallel. `grainsize = 1` specifies that the
+`grainsize` should be estimated automatically. The final model is:
 
 ```
 functions {
@@ -250,7 +248,7 @@ parameters {
   vector[2] beta;
 }
 model {
-  int grainsize = 100;
+  int grainsize = 1;
   beta ~ std_normal();
   target += reduce_sum(partial_sum, y,
                        grainsize,
@@ -260,35 +258,35 @@ model {
 
 ### Picking the Grainsize {#reduce-sum-grainsize}
 
-The `grainsize` is a recommendation on how large each piece of
-parallel work is (how many terms it contains). When using the
-non-static version, it is recommended to choose 1 as a starting
-point as automatic aggregation of partial sums are performed. However,
-for the static version the `grainsize` defines the maximal size of the
-partial sums, e.g. the static variant will split the input sequence
-until all partial sums are just smaller than `grainsize`. Therefore,
-for the static version it is more important to select a sensible
-value. The rational for choosing a sensible `grainsize` is based on
+The rationale for choosing a sensible `grainsize` is based on
 balancing the overhead implied by creating many small tasks versus
 creating fewer large tasks which limits the potential parallelism.
 
-From empirical experience, the automatic grainsize determination works
-well and no further tuning is required in most cases. In order to
-figure out an optimal grainsize, think about how many terms are in the
-summation and on how many cores the model should run. If there are `N`
+In `reduce_sum`, `grainsize` is a recommendation on how to partition
+the work in the partial sum into smaller pieces. A `grainsize` of 1
+leaves this entirely up to the internal scheduler and should be chosen
+if no benchmarking of other grainsizes is done. Ideally this will be
+efficient, but there are no guarantees.
+
+In `reduce_sum_static`, `grainsize` is an upper limit on the worksize.
+Work will be split until all partial sums are just smaller than `grainsize`
+(and the split will happen the same way every time for the same inputs).
+For the static version it is more important to select a sensible `grainsize`.
+
+In order to figure out an optimal `grainsize`, if there are `N`
 terms and `M` cores, run a quick test model with `grainsize` set
-roughly to `N / M`. Record the time, cut the grainsize in half, and
+roughly to `N / M`. Record the time, cut the `grainsize` in half, and
 run the test again. Repeat this iteratively until the model runtime
-begins to increase. This is a suitable grainsize for the model,
+begins to increase. This is a suitable `grainsize` for the model,
 because this ensures the calculations can be carried out with the most
 parallelism without losing too much efficiency.
 
 For instance, in a model with `N = 10000` and `M = 4`, start with `grainsize = 2500`, and
 sequentially try `grainsize = 1250`, `grainsize = 625`, etc.
 
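The halving schedule from this search procedure can be written down mechanically; a small illustrative Python sketch (the timing step is left as a comment since it depends on the model):

```python
# Sketch of the suggested grainsize search: start near N / M, then
# halve until the model's runtime starts to increase.

def grainsize_schedule(N, M, steps=3):
    g = N // M
    schedule = []
    for _ in range(steps):
        schedule.append(g)
        g //= 2  # run the test model at each value and record the time
    return schedule

print(grainsize_schedule(10000, 4))  # [2500, 1250, 625]
```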
-It is important to repeat this process until performance gets worse!
+It is important to repeat this process until performance gets worse.
 It is possible after many halvings nothing happens, but there might
-still be a smaller grainsize that performs better. Even if a sum has
+still be a smaller `grainsize` that performs better. Even if a sum has
 many tens of thousands of terms, depending on the internal
 calculations, a `grainsize` of thirty or forty or smaller might be the
 best, and it is difficult to predict this behavior. Without doing