# Parallelization {#parallelization.chapter}

Stan has two mechanisms for parallelizing calculations used in a model: `reduce_sum` and `map_rect`.

The advantages of `reduce_sum` are:

1. `reduce_sum` has a more flexible argument interface, avoiding the packing and unpacking that is necessary with `map_rect`.
2. `reduce_sum` partitions data for parallelization automatically (this is done manually in `map_rect`).
3. `reduce_sum` is easier to use.

The advantages of `map_rect` are:

1. `map_rect` returns a list of vectors, while `reduce_sum` returns only a real.
2. `map_rect` can be parallelized across multiple cores and multiple computers, while `reduce_sum` can only be parallelized across multiple cores on a single machine.

The actual speedup gained from using these functions will depend on many details. It is strongly recommended to only parallelize the most computationally expensive operations in a Stan program. Oftentimes this is the evaluation of the log likelihood for the observed data. Since only portions of a Stan program will run in parallel, the maximal speedup one can achieve is capped, a phenomenon described by [Amdahl's law](https://en.wikipedia.org/wiki/Amdahl's_law). For example, if the parallelized likelihood accounts for 90% of the runtime, the overall speedup can never exceed 10x, no matter how many cores are available.

## Reduce-Sum {#reduce-sum}

```reduce_sum``` maps the evaluation of a function `g: U -> real` over a list of type `U[]`, `{ x1, x2, ... }`, and performs a sum over the results as the reduction operation. For instance, for a sequence of ```x``` values of type ```U```, ```{ x1, x2, ... }```, we might compute the sum:

```g(x1) + g(x2) + ...```

In probabilistic modeling this comes up when there are N conditionally independent terms in a likelihood. Because of the conditional independence, these terms can be computed in parallel. If dependencies exist between the terms, then this is not possible. For instance, in evaluating the log density of a Gaussian process, ```reduce_sum``` would not be very useful.

```reduce_sum``` requires the partial sum function ```f: U[] -> real```, where ```f``` computes the partial sum corresponding to the slice of the sequence ```x``` passed in. ```reduce_sum``` exploits the associativity of the sum operation, since it holds that:

```
g(x1) + g(x2) + g(x3) = f({ x1, x2, x3 })
                      = f({ x1, x2 }) + f({ x3 })
                      = f({ x1 }) + f({ x2, x3 })
                      = f({ x1 }) + f({ x2 }) + f({ x3 })
```

If the user can write a function ```f: U[] -> real``` to compute the necessary partial sums of the calculation, then ```reduce_sum``` can automatically parallelize the calculations. The exact partitioning into partial sums is arbitrary, as the partitionings are all mathematically equivalent to one another. Because the partitioning is flexible, it can be adapted to the available resources (the number of concurrent threads) given to Stan.

For efficiency and convenience, ```reduce_sum``` allows for additional shared arguments to be passed to every term in the sum. So for the array ```{ x1, x2, ... }``` and the shared arguments ```s1, s2, ...``` the effective sum (with individual terms) looks like:
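
```
g(x1, s1, s2, ...) + g(x2, s1, s2, ...) + g(x3, s1, s2, ...) + ...
```

which can be computed with any partitioning into partial sums, for instance (using the partial sum function ```f``` defined below, which takes the slice bounds as its first two arguments):

```
f(1, 2, { x1, x2 }, s1, s2, ...) + f(3, 3, { x3 }, s1, s2, ...)
```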
where the particular slicing of the ```x``` array can change.

Given this, the signature for ```reduce_sum``` is:

```
real reduce_sum(F f, T[] x, int grainsize, T1 s1, T2 s2, ...)
```

1. ```f``` - User-defined function that computes partial sums
2. ```x``` - Array to slice; each element corresponds to a term in the summation
3. ```grainsize``` - Target for the size of the slices
4. ```s1, s2, ...``` - Arguments shared in every term

The user-defined partial sum functions have the signature:

```
real f(int start, int end, T[] x_slice, T1 s1, T2 s2, ...)
```

and take the arguments:

1. ```start``` - An integer specifying the first term in the partial sum
2. ```end``` - An integer specifying the last term in the partial sum (inclusive)
3. ```x_slice``` - The subset of ```x``` (from ```reduce_sum```) for which this partial sum is responsible (```x_slice = x[start:end]```)
4. ```s1, s2, ...``` - Arguments shared in every term (passed on without modification from the ```reduce_sum``` call)

The user-provided function ```f``` is expected to compute the partial sum over the terms ```start``` through ```end``` of the overall sum. The user function is passed the subset ```x[start:end]``` as ```x_slice```. ```start``` and ```end``` are passed so that ```f``` can index any of the trailing ```sM``` arguments as necessary. The trailing ```sM``` arguments are passed without modification to every call of ```f```.

The ```reduce_sum``` call:

```
real sum = reduce_sum(f, x, grainsize, s1, s2, ...);
```

can be replaced by either:

```
real sum = f(1, size(x), x, s1, s2, ...);
```

or the code:

```
real sum = 0.0;
for(i in 1:size(x)) {
  sum += f(i, i, { x[i] }, s1, s2, ...);
}
```

### Example: Logistic Regression

Logistic regression is a useful example to clarify both the syntax and semantics of ```reduce_sum``` and how it can be used to speed up a typical model.

A basic logistic regression can be coded in Stan as:
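
```
data {
  int N;
  int y[N];      // binary outcomes
  vector[N] x;   // covariate
}
parameters {
  vector[2] beta;
}
model {
  // the exact data declarations and prior are a sketch; the names
  // y, x, and beta match their use in the rest of this section
  beta ~ std_normal();
  y ~ bernoulli_logit(beta[1] + beta[2] * x);
}
```

The likelihood statement here can be written out equivalently as a sum of its `N` individual log probability terms:

```
for(n in 1:N) {
  target += bernoulli_logit_lpmf(y[n] | beta[1] + beta[2] * x[n]);
}
```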

Now it is clear that the calculation is the sum of a number of conditionally independent Bernoulli log probability statements, which is the condition where ```reduce_sum``` is useful.

To use ```reduce_sum```, a function must be written that can be used to compute arbitrary partial sums of the total sum.

Using the interface defined in [Reduce-Sum](#reduce-sum), such a function can be written like:
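
```
functions {
  // computes the partial sum of the log likelihood for terms
  // start through end; y is the sliced argument, while x and beta
  // are shared arguments (the body is a sketch consistent with the
  // call shown below)
  real partial_sum(int start, int end,
                   int[] y_slice,
                   vector x,
                   vector beta) {
    return bernoulli_logit_lpmf(y_slice | beta[1] + beta[2] * x[start:end]);
  }
}
```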

And the likelihood statement in the model can now be written:

```
target += partial_sum(1, N, y, x, beta); // Sum terms 1 to N of the likelihood
```

In this example, `y` was chosen to be sliced over because there is one term in the summation per value of `y`. Technically `x` would have worked as well. Use whatever conceptually makes the most sense.

Because `x` is a shared argument, it is subset accordingly with `start:end`.

With this function, `reduce_sum` can be used to automatically parallelize the likelihood:
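
```
model {
  beta ~ std_normal();
  // grainsize is assumed to be declared elsewhere (for instance as
  // data) so that it can be changed without recompiling the model
  target += reduce_sum(partial_sum, y, grainsize, x, beta);
}
```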

### Picking the Grainsize

The `grainsize` is a recommendation on how large each piece of parallel work is (how many terms it contains). It is recommended to choose a `grainsize` of one as a starting point, which lets Stan select an appropriate value automatically.

From empirical experience, the automatic grainsize determination works well and no further tuning is required in most cases. To figure out an optimal grainsize manually, think about how many terms are in the summation and on how many cores the model should run. If there are `N` terms and `M` cores, run a quick test model with `grainsize` set roughly to `N / M`. Record the time, cut the grainsize in half, and run the test again. Repeat this iteratively until the model runtime begins to increase. This is a suitable grainsize for the model, because it ensures the calculations can be carried out with the most parallelism without losing too much efficiency.

For instance, in a model with `N = 10000` and `M = 4`, start with `grainsize = 2500`, and sequentially try `grainsize = 1250`, `grainsize = 625`, etc.

It is important to repeat this process until performance gets worse! It is possible that after many halvings nothing happens, but there might still be a smaller grainsize that performs better. Even if a sum has many tens of thousands of terms, depending on the internal calculations, a `grainsize` of thirty or forty or smaller might be the best, and it is difficult to predict this behavior. Without doing these halvings until performance actually gets worse, it is easy to miss this.