
Commit 8fca929

refine stan user manual
1 parent 450ec3f commit 8fca929

File tree

1 file changed: +80, -55 lines

src/stan-users-guide/parallelization.Rmd

Lines changed: 80 additions & 55 deletions
@@ -1,51 +1,62 @@
# Parallelization {#parallelization.chapter}

- Stan has two mechanisms for parallelizing calculations used in a model: `reduce_sum` and and `map_rect`.
+ Stan has two mechanisms for parallelizing calculations used in a model: `reduce_sum` and `map_rect`.

- The main advantages to `reduce_sum` are:
+ The advantages of `reduce_sum` are:

1. `reduce_sum` has a more flexible argument interface, avoiding the packing and unpacking that is necessary with `map_rect`.
- 2. `reduce_sum` partitions the data for parallelization automatically (this is done manually in `map_rect`).
+ 2. `reduce_sum` partitions data for parallelization automatically (this is done manually in `map_rect`).
3. `reduce_sum` is easier to use.

The advantages of `map_rect` are:

1. `map_rect` returns a list of vectors, while `reduce_sum` returns only a real.
- 2. `map_rect` can be parallelized across multiple computers, while `reduce_sum` can only parallelized across multiple cores.
+ 2. `map_rect` can be parallelized across multiple cores and multiple
+ computers, while `reduce_sum` can only be parallelized across multiple
+ cores on a single machine.
+
+ The actual speedup gained from using these functions will depend on
+ many details. It is strongly recommended to only parallelize the
+ computationally most expensive operations in a Stan
+ program. Oftentimes this is the evaluation of the log likelihood for
+ the observed data. Since only portions of a Stan program will run in
+ parallel, the maximal speedup one can achieve is capped, a phenomenon
+ described by [Amdahl's
+ law](https://en.wikipedia.org/wiki/Amdahl's_law).
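
As a reminder of the bound involved: if a fraction $p$ of a program's runtime can be parallelized and is spread across $M$ workers, Amdahl's law gives the overall speedup as

$$
S(M) = \frac{1}{(1 - p) + p/M},
$$

which can never exceed $1/(1 - p)$, no matter how many cores are used.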

## Reduce-Sum { #reduce-sum }

- ```reduce_sum``` parallelizes operations that can be represented as a sum of functions, `g: U -> real`.
-
- For instance, for a sequence of ```x``` values of type ```U```, ```{ x1, x2, ... }```, we might compute the sum:
+ ```reduce_sum``` maps evaluation of a function `g: U -> real` over a list of type `U[]`, `{
+ x1, x2, ... }`, and performs a sum over the results as the reduction
+ operation. For instance, for a sequence of ```x``` values of type ```U```, ```{ x1, x2, ... }```, we might compute the sum:

```g(x1) + g(x2) + ...```

In probabilistic modeling this comes up when there are N conditionally independent terms in a likelihood. Because of the conditional independence, these terms can be computed in parallel. If dependencies exist between the terms, then this isn't possible. For instance, in evaluating the log density of a Gaussian process ```reduce_sum``` would not be very useful.

- ```reduce_sum``` takes a function ```f: U[] -> real```, where ```f``` computes the partial sum corresponding to the slice of the sequence ```x``` passed in. For instance:
+ ```reduce_sum``` requires the partial sum function ```f: U[] ->
+ real```, where ```f``` computes the partial sum corresponding to the
+ slice of the sequence ```x``` passed in. ```reduce_sum```
+ exploits the associativity of the sum operation, since it holds that:

```
- f({ x1, x2, x3 }) = g(x1) + g(x2) + g(x3)
- f({ x1 }) = g(x1)
- f({ x1, x2, x3 }) = f({ x1, x2 }) + f({ x3 })
+ g(x1) + g(x2) + g(x3) = f({ x1, x2, x3 })
+                       = f({ x1, x2 }) + f({ x3 })
+                       = f({ x1 }) + f({ x2, x3 })
+                       = f({ x1 }) + f({ x2 }) + f({ x3 })
```

- If the user can write a function ```f: U[] -> real``` to compute the necessary partial sums in the calculation, then ```reduce_sum``` can automatically parallelize the calculations.
-
- If the set of work is represented as an array ```{ x1, x2, x3, ... }```, then mathematically it is possible to rewrite this sum with any combination of partial sums.
-
- For instance, the sum can be written:
-
- 1. ```f({ x1, x2, x3, ... })```, summing over all arguments, using one partial sum
- 2. ```f({ x1, ..., xM }) + reduce({ x(M + 1), x(M + 2), ...})```, computing the first M terms separately from the rest
- 3. ```f({ x1 }) + f({ x2 }) + f({ x3 }) + ...```, computing each term individually and summing them
-
- The first form uses only one partial sum and no parallelization can be done, the second uses two partial sums and so can be parallelized over two workers, and the last can be parallelized over as many workers as there are elements in the array ```x```. Depending on how the list is sliced up, it is possible to parallelize this calculation over many workers.
+ If the user can write a function ```f: U[] -> real``` to compute the
+ necessary partial sums of the calculation, then ```reduce_sum``` can
+ automatically parallelize the calculations. The exact partitioning
+ into partial sums is arbitrary as these are mathematically equivalent
+ to one another. As the partitioning is flexible, it can be adapted to
+ the available resources (number of concurrent threads) given to Stan.

- ```reduce_sum``` is the tool that will allow us to automatically parallelize this calculation.
-
- For efficiency and convenience, ```reduce_sum``` allows for additional shared arguments to be passed to every term in the sum. So for the array ```{ x1, x2, ... }``` and the shared arguments ```s1, s2, ...``` the effective sum (with individual terms) looks like:
+ For efficiency and convenience, ```reduce_sum``` allows for additional
+ shared arguments to be passed to every term in the sum. So for the
+ array ```{ x1, x2, ... }``` and the shared arguments ```s1, s2, ...```
+ the effective sum (with individual terms) looks like:

```
g(x1, s1, s2, ...) + g(x2, s1, s2, ...) + g(x3, s1, s2, ...) + ...
@@ -59,56 +70,64 @@ f({ x1, x2 }, s1, s2, ...) + f({ x3 }, s1, s2, ...)

where the particular slicing of the ```x``` array can change.

- Given this, the signature for reduce_sum is:
+ Given this, the signature for ```reduce_sum``` is:

```
- real reduce_sum(F func, T[] x, int grainsize, T1 s1, T2 s2, ...)
+ real reduce_sum(F f, T[] x, int grainsize, T1 s1, T2 s2, ...)
```

- 1. ```func``` - User defined function that computes partial sums
+ 1. ```f``` - User defined function that computes partial sums
2. ```x``` - Array to slice, each element corresponds to a term in the summation
3. ```grainsize``` - Target for size of slices
- 4-. ```s1, s2, ...``` - Arguments shared in every term
+ 4. ```s1, s2, ...``` - Arguments shared in every term

The user-defined partial sum functions have the signature:

```
- real func(int start, int end, T[] x_slice, T1 arg1, T2 arg2, ...)
+ real f(int start, int end, T[] x_slice, T1 s1, T2 s2, ...)
```

and take the arguments:
+
1. ```start``` - An integer specifying the first term in the partial sum
2. ```end``` - An integer specifying the last term in the partial sum (inclusive)
- 3. ```x_slice``` - The subset of ```x``` (from ```reduce_sum```) for which this partial sum is responsible (```x[start:end]```)
- 4-. ```arg1, arg2, ...``` Arguments shared in every term (passed on without modification from the reduce_sum call)
-
- The user-provided function ```func``` is expect to compute the ```start``` through ```end``` terms of the overall sum, accumulate them, and return that value. The user function is passed the subset ```x[start:end]``` as ```x_slice```. ```start``` and ```end``` are passed so that ```func``` can index any of the tailing ```sM``` arguments as necessary. The trailing ```sM``` arguments are passed without modification to every call of ```func```.
+ 3. ```x_slice``` - The subset of ```x``` (from ```reduce_sum```) for
+ which this partial sum is responsible (```x_slice = x[start:end]```)
+ 4. ```s1, s2, ...``` - Arguments shared in every term (passed on without modification from the ```reduce_sum``` call)
+
+ The user-provided function ```f``` is expected to compute the partial
+ sum with the terms ```start``` through ```end``` of the overall
+ sum. The user function is passed the subset ```x[start:end]``` as
+ ```x_slice```. ```start``` and ```end``` are passed so that ```f```
+ can index any of the trailing ```sM``` arguments as necessary. The
+ trailing ```sM``` arguments are passed without modification to every
+ call of ```f```.
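
For illustration, here is a minimal sketch of a partial sum function with this interface. The per-term function `g` and the single shared argument `s1` are hypothetical, not part of the manual's examples:

```
functions {
  real g(real x_i, real s1) {
    return s1 * x_i;  // hypothetical per-term computation
  }
  real f(int start, int end, real[] x_slice, real s1) {
    // accumulate the terms start through end of the overall sum;
    // x_slice holds exactly those end - start + 1 elements
    real total = 0;
    for (i in 1:size(x_slice)) {
      total += g(x_slice[i], s1);
    }
    return total;
  }
}
```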

The ```reduce_sum``` call:

```
- real sum = reduce_sum(func, x, grainsize, s1, s2, ...)
+ real sum = reduce_sum(f, x, grainsize, s1, s2, ...);
```

can be replaced by either:

```
- real sum = func(1, size(x), x, s1, s2, ...)
+ real sum = f(1, size(x), x, s1, s2, ...);
```

or the code:

```
real sum = 0.0;
for(i in 1:size(x)) {
-   sum = sum + func(i, i, { x[i] }, s1, s2, ...);
+   sum += f(i, i, { x[i] }, s1, s2, ...);
}
```

### Example: Logistic Regression

- Logistic Regression is a useful example to clarify both the syntax
- and semantics of `reduce_sum` and how it can be used to speed up a typical
+ Logistic regression is a useful example to clarify both the syntax
+ and semantics of ```reduce_sum``` and how it can be used to speed up a typical
model.

A basic logistic regression can be coded in Stan as:
@@ -147,9 +166,9 @@ for(n in 1:N) {

Now it is clear that the calculation is the sum of a number of conditionally
independent Bernoulli log probability statements, which is the condition where
- `reduce_sum` is useful.
+ ```reduce_sum``` is useful.

- To use `reduce_sum`, a function must be written that can be used to compute
+ To use ```reduce_sum```, a function must be written that can be used to compute
arbitrary partial sums of the total sum.

Using the interface defined in [Reduce-Sum](#reduce-sum), such a function
@@ -169,14 +188,14 @@ functions {
And the likelihood statement in the model can now be written:

```
- target += partial_sum(1, N, y, x, beta); // Sum terms 1 to N in the likelihood
+ target += partial_sum(1, N, y, x, beta); // Sum terms 1 to N of the likelihood
```

In this example, `y` was chosen to be sliced over because there
is one term in the summation per value of `y`. Technically `x` would have
worked as well. Use whatever conceptually makes the most sense.

- Because `x` is a shared argument, it is subset manually with `start:end`.
+ Because `x` is a shared argument, it is subset accordingly with `start:end`.
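
The definition of `partial_sum` itself lies outside the hunks shown in this diff. Following the interface above, it would look roughly like the sketch below; the use of `bernoulli_logit_lpmf` is an assumption based on the surrounding text:

```
functions {
  real partial_sum(int start, int end, int[] y_slice,
                   matrix x, vector beta) {
    // log likelihood of the terms start through end; the shared
    // covariate matrix x is subset with start:end
    return bernoulli_logit_lpmf(y_slice | x[start:end] * beta);
  }
}
```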

With this function, `reduce_sum` can be used to automatically parallelize the
likelihood:
@@ -220,25 +239,31 @@ model {

### Picking the Grainsize

- The `grainsize` is a recommendation on how large each piece of parallel work is
- (how many terms it contains). If one, it will be chosen automatically, but it
- is probably best to choose this manually for each model.
-
- To figure out an appropriate grainsize, think about how many terms are in the summation
- and on how many cores the model should run. If there are `N` terms and `M` cores,
- run a quick test model with `grainsize` set roughly to `N / M`. Record the time, cut the
- grainsize in half, and run the test again. Repeat this iteratively until the model runtime
- begins to increase. This is a suitable grainsize for the model, because this ensures the
- caculations can be carried out with the most parallelism without losing too much efficiency.
+ The `grainsize` is a recommendation on how large each piece of
+ parallel work is (how many terms it contains). It is recommended to
+ choose a `grainsize` of one as a starting point, which will select an
+ appropriate value automatically.
+
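
For instance, a starting call with `grainsize = 1` might read as follows, a sketch reusing `partial_sum` from the logistic regression example:

```
target += reduce_sum(partial_sum, y, 1, x, beta);
```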
+ From empirical experience, the automatic grainsize determination works
+ well and no further tuning is required in most cases. In order to
+ figure out an optimal grainsize, think about how many terms are in the
+ summation and on how many cores the model should run. If there are `N`
+ terms and `M` cores, run a quick test model with `grainsize` set
+ roughly to `N / M`. Record the time, cut the grainsize in half, and
+ run the test again. Repeat this iteratively until the model runtime
+ begins to increase. This is a suitable grainsize for the model,
+ because this ensures the calculations can be carried out with the most
+ parallelism without losing too much efficiency.

For instance, in a model with `N=10000` and `M = 4`, start with `grainsize = 2500`, and
sequentially try `grainsize = 1250`, `grainsize = 625`, etc.

It is important to repeat this process until performance gets worse! It is possible
- after many halvings nothing happens, but there might still be a small grainsize that performs better.
+ after many halvings nothing happens, but there might still be a smaller grainsize that performs better.
Even if a sum has many tens of thousands of terms, depending on the internal calculations, a `grainsize`
of thirty or forty or smaller might be the best, and it is difficult to predict this behavior.
- Without doing these halvings until performance actually gets worse, it is easy to miss this.
+ Without doing these halvings until performance actually gets worse, it
+ is easy to miss this.

## Map-Rect
