# Parallelization {#parallelization.chapter}

Stan has two mechanisms for parallelizing calculations used in a
model: reduce with summation and rectangular map.

The advantages of reduce with summation are:

1. More flexible argument interface, avoiding the packing and
   unpacking that is necessary with rectangular map.
2. Partitions data for parallelization automatically (this is done
   manually in rectangular map).
3. Easier to use.

The advantages of rectangular map are:

1. Returns a list of vectors, while reduce with summation returns
   only a scalar.
2. Can be parallelized across multiple cores and multiple computers,
   while reduce with summation can only be parallelized across
   multiple cores on a single machine.

The actual speedup gained from using these functions will depend on
many details of the model and the hardware; because only part of the
program runs in parallel, the attainable speedup is capped by
[Amdahl's law](https://en.wikipedia.org/wiki/Amdahl's_law).

## Reduce-Sum {#reduce-sum}

The higher-order reduce with summation facility maps the evaluation
of a function `g: U -> real`, which returns a scalar, over a list of
type `U[]`, `{ x1, x2, ... }`, and performs, as its reduction
operation, a sum over the results. For instance, for a sequence of
`x` values of type `U`, `{ x1, x2, ... }`, we might compute the sum:

```
g(x1) + g(x2) + ...
```

In probabilistic modeling this comes up when there are $N$
conditionally independent terms in a likelihood. Because of the
conditional independence, these terms can be computed in parallel. If
dependencies exist between the terms, then this isn't possible; for
instance, when evaluating the log density of a Gaussian process, the
terms are not independent and this summation does not apply.

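For example, with conditionally independent observations $y_1,
\ldots, y_N$ given parameters $\theta$, the log likelihood decomposes
into a sum of per-observation terms,

$$
\log p(y \mid \theta) = \sum_{n=1}^{N} \log p(y_n \mid \theta),
$$

and each term of the sum can be evaluated separately and in any order.
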
For efficiency reasons the reduce function doesn't work with the
element-wise evaluated function `g`, but instead requires the partial
sum function `f: U[] -> real`, where `f` computes the partial sum
corresponding to a slice of the sequence `x` passed in. Due to the
associativity of the sum reduction it holds that:

```
g(x1) + g(x2) + g(x3) = f({ x1, x2, x3 })
                      = f({ x1, x2 }) + f({ x3 })
                      = f({ x1 }) + f({ x2, x3 })
                      = f({ x1 }) + f({ x2 }) + f({ x3 })
```

With the partial sum function `f: U[] -> real` the reduction of a
large number of terms can be evaluated in parallel automatically,
since the overall sum can be partitioned into smaller partial sums in
an arbitrary way. The exact partitioning into partial sums is not
under the control of the user. However, since the exact numerical
result will depend on the order of summation, Stan provides two
versions of the reduce summation facility:

* `reduce_sum`: Automatically forms partial sums, usually resulting
  in good performance without further tuning.
* `reduce_sum_static`: Always creates the same call graph for the
  same input, resulting in numerically stable evaluation. This
  version requires a sensible tuning parameter for good performance.

The tuning parameter is the so-called `grainsize`. For the
`reduce_sum` version the `grainsize` is merely a suggested partial
sum size, while for the `reduce_sum_static` version the `grainsize`
specifies the maximal partial sum size. While for `reduce_sum` a
`grainsize` of one commonly already leads to good performance (since
automatic aggregation is performed), the `reduce_sum_static` variant
requires a sensible `grainsize` for good performance, as explained in
[more detail below](#reduce-sum-grainsize).

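For illustration only, here is a sketch of how the two variants are
called inside the model block; the partial sum function `f`, the
sliced data `x`, and the shared argument `s` are hypothetical
stand-ins defined elsewhere:

```
// Non-static variant: a grainsize of 1 lets the scheduler pick the
// partition sizes automatically.
target += reduce_sum(f, x, 1, s);

// Static variant (an alternative to the call above, not in addition
// to it): the same input always yields the same partitioning, with
// each partial sum containing at most 250 terms.
// target += reduce_sum_static(f, x, 250, s);
```
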
For efficiency and convenience, additional shared arguments can be
passed to every term in the sum. So for the array `{ x1, x2, ... }`
and the shared arguments `s1, s2, ...` the effective sum (with
individual terms) looks like:

```
g(x1, s1, s2, ...) + g(x2, s1, s2, ...) + g(x3, s1, s2, ...)
```

which can be written equivalently with partial sums as:

```
f({ x1, x2 }, s1, s2, ...) + f({ x3 }, s1, s2, ...)
```

where the particular slicing of the `x` array can change.

Given this, the signatures are:

```
real reduce_sum(F f, T[] x, int grainsize, T1 s1, T2 s2, ...)
real reduce_sum_static(F f, T[] x, int grainsize, T1 s1, T2 s2, ...)
```

1. `f` - User-defined function that computes partial sums
2. `x` - Array to slice, each element corresponds to a term in the summation
3. `grainsize` - Target for size of slices
4. `s1, s2, ...` - Arguments shared in every term

The user-defined partial sum functions have the signature:

```
real f(T[] x_slice, int start, int end, T1 s1, T2 s2, ...)
```

and take the arguments:

1. `start` - An integer specifying the first term in the partial sum
2. `end` - An integer specifying the last term in the partial sum (inclusive)
3. `x_slice` - The subset of `x` (from `reduce_sum` / `reduce_sum_static`)
   for which this partial sum is responsible (`x_slice = x[start:end]`)
4. `s1, s2, ...` - Arguments shared in every term (passed on without
   modification from the `reduce_sum` / `reduce_sum_static` call)

The user-provided function `f` is expected to compute the partial sum
with the terms `start` through `end` of the overall sum. The user
function is passed the subset `x[start:end]` as `x_slice`. The
`start` and `end` arguments are passed so that `f` can index any of
the trailing `sM` arguments as necessary. The trailing `sM` arguments
are passed without modification to every call of `f`.

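As a minimal sketch (the names here are hypothetical, not part of the
interface), a partial sum function whose terms are independent normal
log densities could use `start` and `end` to index a per-term shared
argument `mu`:

```
functions {
  // Each term n contributes normal_lpdf(x[n] | mu[n], sigma). The
  // x_slice argument already holds terms start through end of x,
  // while the shared argument mu must be indexed with start:end.
  real partial_normal(real[] x_slice, int start, int end,
                      vector mu, real sigma) {
    return normal_lpdf(to_vector(x_slice) | mu[start:end], sigma);
  }
}
```
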
A `reduce_sum` (or `reduce_sum_static`) call:

```
real sum = reduce_sum(f, x, grainsize, s1, s2, ...);
```

can be replaced by either:

```
real sum = f(x, 1, size(x), s1, s2, ...);
```

or the code:

```
real sum = 0;
for(i in 1:size(x)) {
  sum += f({ x[i] }, i, i, s1, s2, ...);
}
```

### Example: Logistic Regression

Logistic regression is a useful example to clarify both the syntax
and semantics of reduce summation and how it can be used to speed up
a typical model.

A basic logistic regression can be coded in Stan as:

```
data {
  int N;
  int y[N];
  vector[N] x;
}
parameters {
  vector[2] beta;
}
model {
  beta ~ std_normal();
  y ~ bernoulli_logit(beta[1] + beta[2] * x);
}
```

In this model predictions are made about the `N` outputs `y` using
the covariate `x`. The intercept and slope of the linear equation are
to be estimated.

The key to making this calculation use reduce summation is
recognizing that the statement:

```
y ~ bernoulli_logit(beta[1] + beta[2] * x);
```

can be rewritten (up to a proportionality constant) as:

```
for(n in 1:N) {
  target += bernoulli_logit_lpmf(y[n] | beta[1] + beta[2] * x[n]);
}
```

Now it is clear that the calculation is the sum of a number of
conditionally independent Bernoulli log probability statements, which
is the condition where reduce summation is useful.

To use reduce summation, a function must be written that can compute
arbitrary partial sums of the total sum.

Using the interface defined in [Reduce-Sum](#reduce-sum), such a
function can be written as:

```
functions {
  real partial_sum(int[] y_slice,
                   int start, int end,
                   vector x,
                   vector beta) {
    return bernoulli_logit_lpmf(y_slice | beta[1] + beta[2] * x[start:end]);
  }
}
```

The likelihood statement in the model can now be written:

```
target += partial_sum(y, 1, N, x, beta); // Sum terms 1 to N of the likelihood
```

In this example, `y` was chosen to be sliced over because there is
one term in the summation per value of `y`. Technically `x` would
have worked as well. Use whatever conceptually makes the most sense.

Because `x` is a shared argument, it is subset accordingly with
`start:end`.

With this function, reduce summation can be used to automatically
parallelize the likelihood:

```
target += reduce_sum(partial_sum, y,
                     grainsize,
                     x, beta);
```

The reduce summation facility automatically breaks the sum into
roughly `grainsize` sized pieces and computes them in parallel.
`grainsize = 1` specifies that the grainsize should be estimated
automatically. The final model looks like:

```
functions {
  real partial_sum(int[] y_slice,
                   int start, int end,
                   vector x,
                   vector beta) {
    return bernoulli_logit_lpmf(y_slice | beta[1] + beta[2] * x[start:end]);
  }
}
data {
  int N;
  int y[N];
  vector[N] x;
}
transformed data {
  int grainsize = 1;
}
parameters {
  vector[2] beta;
}
model {
  beta ~ std_normal();
  target += reduce_sum(partial_sum, y,
                       grainsize,
                       x, beta);
}
```

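As a practical note (CmdStan assumed here; other interfaces expose
equivalent options): reduce summation only actually runs in parallel
when the model is compiled with threading support, e.g. with
`STAN_THREADS=true`, and the number of threads is set at run time,
e.g. through the `STAN_NUM_THREADS` environment variable.
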
### Picking the Grainsize {#reduce-sum-grainsize}

The `grainsize` is a recommendation on how large each piece of
parallel work is (how many terms it contains). When using the
non-static version, it is recommended to choose a grainsize of one as
a starting point, since automatic aggregation of partial sums is
performed. However, for the static version the `grainsize` defines
the maximal size of the partial sums, i.e. the static variant will
split the input sequence until all partial sums are just smaller than
`grainsize`. Therefore, for the static version it is more important
to select a sensible value. The rationale for choosing a sensible
`grainsize` is to balance the overhead implied by creating many small
tasks against creating fewer large tasks, which limits the potential
parallelism.

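As a rough illustration with hypothetical numbers: for a sum with
$N = 10000$ terms, a `grainsize` of 5000 yields only two partial sums
and thus at most two-way parallelism, while a `grainsize` of 10
yields a thousand small tasks whose scheduling overhead may dominate
the actual work; a sensible value lies between these extremes.
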
From empirical experience, the automatic grainsize determination
works well and no further tuning is required in most cases. In order
to tune the `grainsize` by hand, a simple search can be used: if
there are `N` terms in the sum and `M` cores available, run a short
test with `grainsize` set roughly to `N / M` and record the runtime,
then halve the `grainsize` and run the test again. Repeating this
finds the smallest grainsize that still maintains the potential
parallelism without losing too much efficiency.

For instance, in a model with `N=10000` and `M = 4`, start with
`grainsize = 2500`, and sequentially try `grainsize = 1250`,
`grainsize = 625`, etc.

It is important to repeat this process until performance gets worse!
It is possible that after many halvings nothing changes, but there
might still be a smaller grainsize that performs better. Even if a
sum has many tens of thousands of terms, depending on the internal
calculations, a `grainsize` of thirty or forty or smaller might be
best, and it is difficult to predict this behavior. Without doing
these halvings until performance actually gets worse, it is easy to
miss this.

## Map-Rect
