Commit df152e4

add static version to user guide

1 parent f4e1bce commit df152e4
1 file changed: +83 −46 lines

src/stan-users-guide/parallelization.Rmd
# Parallelization {#parallelization.chapter}

Stan has two mechanisms for parallelizing calculations used in a
model: reduce with summation and rectangular map.

The advantages of reduce with summation are:

1. More flexible argument interface, avoiding the packing and
   unpacking that is necessary with rectangular map.
2. Partitions data for parallelization automatically (this is done
   manually in rectangular map).
3. Is easier to use.

The advantages of rectangular map are:

1. Returns a list of vectors, while reduce with summation returns
   only a scalar.
2. Can be parallelized across multiple cores and multiple computers,
   while reduce with summation can only be parallelized across
   multiple cores on a single machine.

The actual speedup gained from using these functions will depend on
the model and the hardware and is ultimately limited by [Amdahl's
law](https://en.wikipedia.org/wiki/Amdahl's_law).

## Reduce-Sum { #reduce-sum }

The higher-order reduce with summation facility maps the evaluation of a
function `g: U -> real`, which returns a scalar, to a list of type
`U[]`, `{ x1, x2, ... }`, and performs a sum over the results as the
reduction operation. For instance, for a sequence of `x` values of
type `U`, `{ x1, x2, ... }`, we might compute the sum:

`g(x1) + g(x2) + ...`

In probabilistic modeling this comes up when there are $N$
conditionally independent terms in a likelihood. Because of the
conditional independence, these terms can be computed in parallel. If
dependencies exist between the terms, then this isn't possible. For
instance, when evaluating the log density of a Gaussian process, a
summation of independent terms is not applicable.

For efficiency reasons the reduce function doesn't work with the
element-wise evaluated function `g`, but instead requires the partial
sum function `f: U[] -> real`, where `f` computes the partial sum
corresponding to a slice of the sequence `x` passed in. Due to the
associativity of the sum reduction it holds that:

```
g(x1) + g(x2) + g(x3) = f({ x1, x2, x3 })
                      = f({ x1 }) + f({ x2 }) + f({ x3 })
```

With the partial sum function `f: U[] -> real`, the reduction of a
large number of terms can be evaluated in parallel automatically,
since the overall sum can be partitioned into arbitrarily sized
smaller partial sums. The exact partitioning into the partial sums is
not under the control of the user. However, since the exact numerical
result will depend on the order of summation, Stan provides two
versions of the reduce summation facility:

* `reduce_sum`: Automatically forms partial sums, which usually
  results in good performance without further tuning.
* `reduce_sum_static`: Always creates the same call graph for the
  same input, resulting in stable numerical evaluation. This version
  requires setting a sensible tuning parameter for good performance.

The tuning parameter is the so-called `grainsize`. For the
`reduce_sum` version the `grainsize` is merely a suggested partial
sum size, while for the `reduce_sum_static` version the `grainsize`
specifies the maximal partial sum size. While for `reduce_sum` a
`grainsize` of one commonly leads to good performance already (since
automatic aggregation is performed), the `reduce_sum_static` variant
requires setting a sensible `grainsize` for good performance, as
explained in [more detail below](#reduce-sum-grainsize).

For efficiency and convenience, additional shared arguments can be
passed to every term in the sum. So for the array `{ x1, x2, ... }`
and the shared arguments `s1, s2, ...` the effective sum (with
individual terms) looks like:
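
```
g(x1, s1, s2, ...) + g(x2, s1, s2, ...) + g(x3, s1, s2, ...)
```

which, for instance, can be computed as the partial sum:

```
f({ x1, x2, x3 }, s1, s2, ...)
```

or as:

```
f({ x1, x2 }, s1, s2, ...) + f({ x3 }, s1, s2, ...)
```
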
where the particular slicing of the `x` array can change.

Given this, the signatures are:

```
real reduce_sum(F f, T[] x, int grainsize, T1 s1, T2 s2, ...)
real reduce_sum_static(F f, T[] x, int grainsize, T1 s1, T2 s2, ...)
```

1. `f` - User defined function that computes partial sums
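2. `x` - Array to slice, each element of which corresponds to one
   term of the summation
3. `grainsize` - Target size of the partial sums (see
   [below](#reduce-sum-grainsize))
4. `s1, s2, ...` - Arguments shared by every term

The user-defined partial sum function `f` must take the arguments:
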
1. `start` - An integer specifying the first term in the partial sum
2. `end` - An integer specifying the last term in the partial sum (inclusive)
3. `x_slice` - The subset of `x` (from `reduce_sum` / `reduce_sum_static`) for
   which this partial sum is responsible (`x_slice = x[start:end]`)
4. `s1, s2, ...` - Arguments shared in every term (passed on
   without modification from the `reduce_sum` / `reduce_sum_static` call)

The user-provided function `f` is expected to compute the partial
sum with the terms `start` through `end` of the overall sum. The
`start` and `end` arguments are passed so that `f` can index any of
the trailing `sM` arguments as necessary. The trailing `sM` arguments
are passed without modification to every call of `f`.

A `reduce_sum` (or `reduce_sum_static`) call:

```
real sum = reduce_sum(f, x, grainsize, s1, s2, ...);
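```

can be replaced by either:

```
real sum = f(1, size(x), x, s1, s2, ...);
```

or, equivalently, by a loop over single-element partial sums (a
sketch; the argument order follows the interface described above):

```
real sum = 0.0;
for(i in 1:size(x)) {
  sum += f(i, i, { x[i] }, s1, s2, ...);
}
```
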
### Example: Logistic Regression

Logistic regression is a useful example to clarify both the syntax
and semantics of reduce summation and how it can be used to speed up
a typical model.

A basic logistic regression can be coded in Stan as:
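
```
data {
  int N;
  int y[N];       // binary outcomes
  vector[N] x;    // covariate
}
parameters {
  vector[2] beta; // intercept and slope
}
model {
  // prior choice here is an assumption of this sketch
  beta ~ std_normal();
  y ~ bernoulli_logit(beta[1] + beta[2] * x);
}
```
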
In this model predictions are made about the `N` outputs `y` using
the covariate `x`. The intercept and slope of the linear equation are
to be estimated.

The key point to getting this calculation to use reduce summation is
recognizing that the statement:
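
```
y ~ bernoulli_logit(beta[1] + beta[2] * x);
```

can also be written as a sum of independent terms:

```
for(n in 1:N) {
  y[n] ~ bernoulli_logit(beta[1] + beta[2] * x[n]);
}
```
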
Now it is clear that the calculation is the sum of a number of
conditionally independent Bernoulli log probability statements, which
is the condition where reduce summation is useful.

To use reduce summation, a function must be written that can be used
to compute arbitrary partial sums of the total sum.

@@ -197,7 +225,7 @@ worked as well. Use whatever conceptually makes the most sense.
197225

198226
Because `x` is a shared argument, it is subset accordingly with `start:end`.
199227

200-
With this function, `reduce_sum` can be used to automatically parallelize the
228+
With this function, reduce summation can be used to automatically parallelize the
201229
likelihood:
202230

203231
```
@@ -207,7 +235,7 @@ target += reduce_sum(partial_sum, y,
207235
x, beta);
208236
```
209237

210-
`reduce_sum` automatically breaks the sum into roughly `grainsize` sized pieces
238+
The reduce summation facility automatically breaks the sum into roughly `grainsize` sized pieces
211239
and computes them in parallel. `grainsize = 1` specifies that the grainsize should
212240
be estimated automatically. The final model looks like:
213241

@@ -237,12 +265,19 @@ model {
237265
}
238266
```
239267

240-
### Picking the Grainsize
268+
### Picking the Grainsize {#reduce-sum-grainsize}
241269

242270
The `grainsize` is a recommendation on how large each piece of
243-
parallel work is (how many terms it contains). It is recommended to
244-
choose one as a starting point which will select an appropiate value
245-
automatically.
271+
parallel work is (how many terms it contains). When using the
272+
non-static version, it is recommended to choose one as a starting
273+
point as automatic aggregation of partial sums are performed. However,
274+
for the static version the `grainsize` defines the maximal size of the
275+
partial sums, e.g. the static variant will split the input sequence
276+
until all partial sums are just smaller than `grainsize`. Therefore,
277+
for the static version it is more important to select a sensible
278+
value. The rational for choosing a sensible `grainsize` is based on
279+
balancing the overhead implied by creating many small tasks versus
280+
creating fewer large tasks which limits the potential parallelism.
246281

247282
From empirical experience, the automatic grainsize determination works
248283
well and no further tuning is required in most cases. In order to
@@ -258,12 +293,14 @@ parallelism without losing too much efficiency.
258293
For instance, in a model with `N=10000` and `M = 4`, start with `grainsize = 25000`, and
259294
sequentially try `grainsize = 12500`, `grainsize = 6250`, etc.
260295

261-
It is important to repeat this process until performance gets worse! It is possible
262-
after many halvings nothing happens, but there might still be a smaller grainsize that performs better.
263-
Even if a sum has many tens of thousands of terms, depending on the internal calculations, a `grainsize`
264-
of thirty or forty or smaller might be the best, and it is difficult to predict this behavior.
265-
Without doing these halvings until performance actually gets worse, it
266-
is easy to miss this.
296+
It is important to repeat this process until performance gets worse!
297+
It is possible after many halvings nothing happens, but there might
298+
still be a smaller grainsize that performs better. Even if a sum has
299+
many tens of thousands of terms, depending on the internal
300+
calculations, a `grainsize` of thirty or forty or smaller might be the
301+
best, and it is difficult to predict this behavior. Without doing
302+
these halvings until performance actually gets worse, it is easy to
303+
miss this.
267304

268305
## Map-Rect
269306

0 commit comments

Comments
 (0)