
Commit 6a26ec9

Merge pull request #163 from stan-dev/feature/reduce-sum-edits
Reworded user's guide introduction to reduce_sum and lowered grainsize expectations
2 parents 31d12d7 + bd93bf1 commit 6a26ec9

File tree

2 files changed: +62 −66 lines


src/functions-reference/higher-order_functions.Rmd

Lines changed: 11 additions & 13 deletions
@@ -135,7 +135,7 @@ package MINPACK-1 [@minpack:1980].

 The Jacobian of the solution with respect to auxiliary parameters is
 computed using the implicit function theorem. Intermediate Jacobians
-(of the the algebraic function's output with respect to the unknowns y
+(of the algebraic function's output with respect to the unknowns y
 and with respect to the auxiliary parameters theta) are computed using
 Stan's automatic differentiation.

@@ -410,19 +410,19 @@ exactly. This implies that the order of summation determines the exact
 numerical result. For this reason, the higher-order reduce function is
 available in two variants:

-* `reduce_sum`: Compute partial sums automatically. This usually
-results in good performance without further tuning.
-* `reduce_sum_static`: For the same input, always create the same
-call graph. This results in stable numerical evaluation. This version
-requires setting a tuning parameter which controls the maximal size of
-partial sums formed.
+* `reduce_sum`: Automatically choose the partial-sum partitioning based on a
+dynamic scheduling algorithm.
+* `reduce_sum_static`: Compute the same sum as `reduce_sum`, but partition
+the input in the same way for a given data set (in `reduce_sum` this
+partitioning might change depending on computer load). This should result in
+stable numerical evaluations.

 ### Specifying the Reduce-sum Function

 The higher-order reduce function takes a partial sum function `f`, an array argument `x`
 (with one array element for each term in the sum), a recommended
 `grainsize`, and a set of shared arguments. This representation allows
-to parallelize the resultant sum.
+parallelization of the resultant sum.

 <!-- real; reduce_sum; (F f, T[] x, int grainsize, T1 s1, T2 s2, ...); -->
 \index{{\tt \bfseries reduce\_sum }!{\tt (F f, T[] x, int grainsize, T1 s1, T2 s2, ...): real}|hyperpage}
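The chunk-then-sum semantics described in this hunk can be illustrated outside Stan. The following is a Python sketch (the `reduce_sum` and `partial_sum` names here are illustrative stand-ins, not the Stan implementation) of how the array is split into roughly `grainsize`-sized slices whose partial sums are added together:

```python
# Hypothetical Python sketch of the reduce_sum semantics: split x into
# pieces of roughly `grainsize` terms, apply the partial sum function to
# each slice, and add the slice sums together.
def reduce_sum(f, x, grainsize, *shared):
    total = 0.0
    for start in range(0, len(x), grainsize):
        end = min(start + grainsize, len(x))
        # f receives its slice plus 1-based start/end indices, as in Stan
        total += f(x[start:end], start + 1, end, *shared)
    return total

def partial_sum(x_slice, start, end, scale):
    # toy partial-sum function: g(x) = scale * x, summed over the slice
    return sum(scale * v for v in x_slice)

print(reduce_sum(partial_sum, [1, 2, 3, 4, 5], 2, 10.0))  # 150.0
```

Because addition is associative (up to rounding), the result matches summing `g` over every element directly, regardless of the chosen `grainsize`.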
@@ -437,16 +437,14 @@ partial sums. `s1, s2, ...` are shared between all terms in the sum.
 * *`f`*: function literal referring to a function specifying the
 partial sum operation. Refer to the [partial sum function](#functions-partial-sum).
 * *`x`*: array of `T`, one for each term of the reduction, `T` can be any type,
-* *`grainsize`*: recommended number of terms in each reduce call, set
-to 1 to estimate automatically for `reduce_sum` while for
-`reduce_sum_static` this determines the maximal size of the partial sums, type `int`,
+* *`grainsize`*: For `reduce_sum`, `grainsize` is the recommended size of the partial sums (`grainsize = 1` means the partitioning is chosen entirely automatically). For `reduce_sum_static`, `grainsize` determines the maximum size of the partial sums, type `int`,
 * *`s1`*: first (optional) shared argument, type `T1`, where `T1` can be any type
 * *`s2`*: second (optional) shared argument, type `T2`, where `T2` can be any type,
 * *`...`*: remainder of shared arguments, each of which can be any type.

-### The Partial-sum Function {#functions-partial-sum}
+### The Partial sum Function {#functions-partial-sum}

-The partial sum function must have the following signature where the types `T`, and the
+The partial sum function must have the following signature where the type `T`, and the
 types of all the shared arguments (`T1`, `T2`, ...) match those of the original
 `reduce_sum` (`reduce_sum_static`) call.
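The argument list above (a slice, its 1-based start/end indices, then the shared arguments) can be mirrored in a short Python sketch, here with a normal log-density as the per-term function; this is an illustration of the calling convention, not Stan code, and the names are hypothetical:

```python
import math

# Python sketch of a partial sum function with the same shape as the
# signature described above: a slice of the data, the 1-based start/end
# indices of that slice, and any shared arguments (here mu and sigma).
def partial_sum(y_slice, start, end, mu, sigma):
    # log density of independent normal terms for this slice only
    return sum(-0.5 * ((y - mu) / sigma) ** 2
               - math.log(sigma) - 0.5 * math.log(2 * math.pi)
               for y in y_slice)

y = [0.1, -0.3, 0.7, 1.2]
# adding the slice results reproduces the full sum
full = partial_sum(y, 1, 4, 0.0, 1.0)
split = partial_sum(y[:2], 1, 2, 0.0, 1.0) + partial_sum(y[2:], 3, 4, 0.0, 1.0)
print(abs(full - split) < 1e-12)  # True
```

The start/end indices matter when the partial sum needs to index into shared data (for example per-observation covariates) rather than only into its slice.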

src/stan-users-guide/parallelization.Rmd

Lines changed: 51 additions & 53 deletions
@@ -28,25 +28,22 @@ law](https://en.wikipedia.org/wiki/Amdahl's_law).

 ## Reduce-Sum { #reduce-sum }

-The higher-order reduce with summation facility maps evaluation of a
-function `g: U -> real`, which returns a scalar, to a list of type
-`U[]`, `{ x1, x2, ... }`, and performs as reduction operation a sum
-over the results. For instance, for a sequence of ```x``` values of
-type ```U```, ```{ x1, x2, ... }```, we might compute the sum as
-`g(x1) + g(x2) + ...`.
-
-In probabilistic modeling this comes up when there are $N$
-conditionally independent terms in a likelihood. Because of the
-conditional independence, these terms can be computed in parallel. If
-dependencies exist between the terms, then this isn't possible. For
-instance, in evaluating the log density of a Gaussian process then
-summation of independent terms isn't applicable.
+It is often necessary in probabilistic modeling to compute the sum of
+a number of independent function evaluations. This occurs, for instance, when
+evaluating a number of conditionally independent terms in a log-likelihood.
+If `g: U -> real` is the function and `{ x1, x2, ... }` is an array of
+inputs, then that sum looks like:
+
+`g(x1) + g(x2) + ...`
+
+`reduce_sum` and `reduce_sum_static` are tools for parallelizing these
+calculations.

 For efficiency reasons the reduce function doesn’t work with the
-element-wise evaluated function `g`, but instead requires the partial
-sum function ```f: U[] -> real```, where ```f``` computes the partial
-sum corresponding to a slice of the sequence ```x``` passed in. Due to the
-the associativity of the sum reduction it holds that:
+element-wise evaluated function `g`, but instead with the partial
+sum function `f: U[] -> real`, where `f` computes the partial
+sum corresponding to a slice of the sequence `x` passed in. Due to the
+associativity of the sum reduction it holds that:

 ```
 g(x1) + g(x2) + g(x3) = f({ x1, x2, x3 })
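The identity above holds exactly in real arithmetic, but floating-point addition is not associative, which is why the partitioning into partial sums determines the exact numerical result. A quick Python illustration (not Stan):

```python
# Floating-point addition is not exactly associative: summing the same
# terms left to right versus as two partial sums can give different
# results, because small terms are absorbed by a large one.
vals = [1e16, 1.0, 1.0, 1.0]

serial = ((vals[0] + vals[1]) + vals[2]) + vals[3]   # one pass, left to right
chunked = (vals[0] + vals[1]) + (vals[2] + vals[3])  # two partial sums

print(serial == chunked)   # False
print(chunked - serial)    # 2.0
```

This is exactly the effect `reduce_sum_static` guards against: by fixing the partitioning for a given data set, the grouping (and hence the rounded result) stays reproducible.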
@@ -63,20 +60,21 @@ control of the user. However, since the exact numerical result will
 depend on the order of summation, Stan provides two versions of the
 reduce summation facility:

-* `reduce_sum`: Automatically forms partial sums resulting usually in good
-performance without further tuning.
-* `reduce_sum_static`: Creates for the same input always the same
-call graph resulting in stable numerical evaluation. This version
-requires setting a sensible tuning parameter for good performance.
-
-The tuning parameter is the so-called `grainsize`. For the
-`reduce_sum` version, the `grainsize` is merely a suggested partial sum
-size, while for the `reduce_sum_static` version the `grainsize`
-specifies the maximal partial sum size. While for `reduce_sum` a
-`grainsize` of 1 commonly leads to good performance already (since
-automatic aggregation is performed), the `reduce_sum_static` variant
-requires setting a sensible `grainsize` for good performance as
-explained in [more detail below](#reduce-sum-grainsize).
+* `reduce_sum`: Automatically choose the partial-sum partitioning based on a
+dynamic scheduling algorithm.
+* `reduce_sum_static`: Compute the same sum as `reduce_sum`, but partition
+the input in the same way for a given data set (in `reduce_sum` this
+partitioning might change depending on computer load).
+
+`grainsize` is the one tuning parameter. For `reduce_sum`, `grainsize` is
+a suggested partial sum size. A `grainsize` of 1 leaves the partitioning
+entirely up to the scheduler. This should be the default way of using
+`reduce_sum` unless time is spent carefully picking `grainsize`. For
+picking a `grainsize`, see details [below](#reduce-sum-grainsize).
+
+For `reduce_sum_static`, `grainsize` specifies the maximal partial sum size.
+With `reduce_sum_static` it is more important to choose `grainsize`
+carefully since it entirely determines the partitioning of work.
+See details [below](#reduce-sum-grainsize).

 For efficiency and convenience additional
 shared arguments can be passed to every term in the sum. So for the
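The deterministic `reduce_sum_static` rule described above (split the input until every partial sum has at most `grainsize` terms) can be sketched in Python. The recursive-halving strategy here is an assumption for illustration; the actual splitting strategy is an implementation detail:

```python
def static_partition(n, grainsize):
    """Recursively halve the index range [0, n) until every piece has at
    most `grainsize` terms; deterministic for fixed n and grainsize."""
    def split(lo, hi):
        if hi - lo <= grainsize:
            return [(lo, hi)]
        mid = (lo + hi) // 2
        return split(lo, mid) + split(mid, hi)
    return split(0, n)

print(static_partition(10, 3))  # [(0, 2), (2, 5), (5, 7), (7, 10)]
```

Because the partition depends only on the input size and `grainsize`, the same call graph (and hence the same rounded sum) is produced on every evaluation, unlike the load-dependent partitioning of `reduce_sum`.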
@@ -222,15 +220,15 @@ accordingly with `start:end`. With this function, reduce summation can
 be used to automatically parallelize the likelihood:

 ```
-int grainsize = 100;
+int grainsize = 1;
 target += reduce_sum(partial_sum, y,
                      grainsize,
                      x, beta);
 ```

-The reduce summation facility automatically breaks the sum into roughly `grainsize` sized pieces
-and computes them in parallel. `grainsize = 1` specifies that the grainsize should
-be estimated automatically. The final model looks as:
+The reduce summation facility automatically breaks the sum into pieces
+and computes them in parallel. `grainsize = 1` specifies that the
+`grainsize` should be estimated automatically. The final model is:

 ```
 functions {
@@ -250,7 +248,7 @@ parameters {
   vector[2] beta;
 }
 model {
-  int grainsize = 100;
+  int grainsize = 1;
   beta ~ std_normal();
   target += reduce_sum(partial_sum, y,
                        grainsize,
@@ -260,35 +258,35 @@ model {

 ### Picking the Grainsize {#reduce-sum-grainsize}

-The `grainsize` is a recommendation on how large each piece of
-parallel work is (how many terms it contains). When using the
-non-static version, it is recommended to choose 1 as a starting
-point as automatic aggregation of partial sums are performed. However,
-for the static version the `grainsize` defines the maximal size of the
-partial sums, e.g. the static variant will split the input sequence
-until all partial sums are just smaller than `grainsize`. Therefore,
-for the static version it is more important to select a sensible
-value. The rational for choosing a sensible `grainsize` is based on
+The rationale for choosing a sensible `grainsize` is based on
 balancing the overhead implied by creating many small tasks versus
 creating fewer large tasks which limits the potential parallelism.

-From empirical experience, the automatic grainsize determination works
-well and no further tuning is required in most cases. In order to
-figure out an optimal grainsize, think about how many terms are in the
-summation and on how many cores the model should run. If there are `N`
+In `reduce_sum`, `grainsize` is a recommendation on how to partition
+the work in the partial sum into smaller pieces. A `grainsize` of 1
+leaves this entirely up to the internal scheduler and should be chosen
+if no benchmarking of other grainsizes is done. Ideally this will be
+efficient, but there are no guarantees.
+
+In `reduce_sum_static`, `grainsize` is an upper limit on the worksize.
+Work will be split until all partial sums are just smaller than `grainsize`
+(and the split will happen the same way every time for the same inputs).
+For the static version it is more important to select a sensible `grainsize`.
+
+In order to figure out an optimal `grainsize`, if there are `N`
 terms and `M` cores, run a quick test model with `grainsize` set
-roughly to `N / M`. Record the time, cut the grainsize in half, and
+roughly to `N / M`. Record the time, cut the `grainsize` in half, and
 run the test again. Repeat this iteratively until the model runtime
-begins to increase. This is a suitable grainsize for the model,
+begins to increase. This is a suitable `grainsize` for the model,
 because this ensures the calculations can be carried out with the most
 parallelism without losing too much efficiency.

 For instance, in a model with `N=10000` and `M = 4`, start with `grainsize = 2500`, and
 sequentially try `grainsize = 1250`, `grainsize = 625`, etc.
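The halving procedure above can be sketched as a loop in Python; `benchmark` here is a hypothetical stand-in for actually timing a short test run of the model at a given `grainsize`:

```python
def pick_grainsize(n_terms, n_cores, benchmark):
    """Start at roughly N / M and halve the grainsize until the
    benchmarked runtime begins to increase; `benchmark(grainsize)` is a
    stand-in for timing a quick test run of the model."""
    grainsize = max(1, n_terms // n_cores)
    best, best_time = grainsize, benchmark(grainsize)
    while grainsize > 1:
        grainsize //= 2
        t = benchmark(grainsize)
        if t > best_time:      # runtime began to increase: stop halving
            break
        best, best_time = grainsize, t
    return best

# toy cost model whose runtime is minimized at grainsize 625
cost = lambda g: abs(g - 625) + 100
print(pick_grainsize(10000, 4, cost))  # 625
```

With real timings the curve is noisy, so in practice each candidate `grainsize` should be timed over a few repetitions before deciding that the runtime has started to increase.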

-It is important to repeat this process until performance gets worse!
+It is important to repeat this process until performance gets worse.
 It is possible after many halvings nothing happens, but there might
-still be a smaller grainsize that performs better. Even if a sum has
+still be a smaller `grainsize` that performs better. Even if a sum has
 many tens of thousands of terms, depending on the internal
 calculations, a `grainsize` of thirty or forty or smaller might be the
 best, and it is difficult to predict this behavior. Without doing
