Commit 021d364

Reworded user's guide introduction to reduce_sum and lowered expectations about what grainsize = 1 can do. (design-doc #17)
1 parent 8831163 commit 021d364

2 files changed

+53
-53
lines changed

src/functions-reference/higher-order_functions.Rmd

Lines changed: 9 additions & 11 deletions
@@ -410,12 +410,12 @@ exactly. This implies that the order of summation determines the exact
410410
numerical result. For this reason, the higher-order reduce function is
411411
available in two variants:
412412

413-
* `reduce_sum`: Compute partial sums automatically. This usually
414-
results in good performance without further tuning.
415-
* `reduce_sum_static`: For the same input, always create the same call
416-
graph. This results in stable numerical evaluation. This version
417-
requires setting a tuning parameter which controls the maximal size of
418-
partial sums formed.
413+
* `reduce_sum`: Automatically choose partial sums partitioning based on a dynamic
414+
scheduling algorithm.
415+
* `reduce_sum_static`: Compute the same sum as `reduce_sum`, but partition
416+
the input in the same way for a given data set (in `reduce_sum` this partitioning
417+
might change depending on computer load). This should result in stable
418+
numerical evaluations.
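
The order dependence mentioned above is a general property of floating-point arithmetic, not something specific to Stan. The following is a minimal Python sketch (Python is used here only for illustration; the variable names are hypothetical) showing that grouping the same three terms differently changes the last bits of the result:

```python
# Floating-point addition is not associative, so grouping the same
# terms differently can change the last bits of the result.
terms = [0.1, 0.2, 0.3]

# Group the first two terms first.
left_grouping = (terms[0] + terms[1]) + terms[2]

# Group the last two terms first.
right_grouping = terms[0] + (terms[1] + terms[2])

sums_differ = left_grouping != right_grouping
difference = abs(left_grouping - right_grouping)
```

The difference is tiny, but it is enough to make results not bitwise reproducible when the partitioning changes from run to run, which is the motivation for a variant with fixed partitioning.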
419419

420420
### Specifying the Reduce-sum Function
421421

@@ -437,16 +437,14 @@ partial sums. `s1, s2, ...` are shared between all terms in the sum.
437437
* *`f`*: function literal referring to a function specifying the
438438
partial sum operation. Refer to the [partial sum function](#functions-partial-sum).
439439
* *`x`*: array of `T`, one for each term of the reduction, `T` can be any type,
440-
* *`grainsize`*: recommended number of terms in each reduce call, set
441-
to 1 to estimate automatically for `reduce_sum` while for
442-
`reduce_sum_static` this determines the maximal size of the partial sums, type `int`,
440+
* *`grainsize`*: For `reduce_sum`, `grainsize` is the recommended size of the partial sum. For `reduce_sum_static`, `grainsize` determines the maximum size of the partial sums, type `int`,
443441
* *`s1`*: first (optional) shared argument, type `T1`, where `T1` can be any type
444442
* *`s2`*: second (optional) shared argument, type `T2`, where `T2` can be any type,
445443
* *`...`*: remainder of shared arguments, each of which can be any type.
446444

447-
### The Partial-sum Function {#functions-partial-sum}
445+
### The Partial sum Function {#functions-partial-sum}
448446

449-
The partial sum function must have the following signature where the types `T`, and the
447+
The partial sum function must have the following signature where the type `T`, and the
450448
types of all the shared arguments (`T1`, `T2`, ...) match those of the original
451449
`reduce_sum` (`reduce_sum_static`) call.
452450

src/stan-users-guide/parallelization.Rmd

Lines changed: 44 additions & 42 deletions
@@ -28,25 +28,21 @@ law](https://en.wikipedia.org/wiki/Amdahl's_law).
2828

2929
## Reduce-Sum { #reduce-sum }
3030

31-
The higher-order reduce with summation facility maps evaluation of a
32-
function `g: U -> real`, which returns a scalar, to a list of type
33-
`U[]`, `{ x1, x2, ... }`, and performs as reduction operation a sum
34-
over the results. For instance, for a sequence of ```x``` values of
35-
type ```U```, ```{ x1, x2, ... }```, we might compute the sum:
31+
It is often necessary in probabilistic modeling to compute the sum of
32+
a number of independent function evaluations. This occurs, for instance, when
33+
evaluating a number of conditionally independent terms in a log-likelihood.
34+
If `g: U -> real` is the function and `{ x1, x2, ... }` is an array of
35+
inputs, then that sum looks like:
3636

3737
`g(x1) + g(x2) + ...`
3838

39-
In probabilistic modeling this comes up when there are $N$
40-
conditionally independent terms in a likelihood. Because of the
41-
conditional independence, these terms can be computed in parallel. If
42-
dependencies exist between the terms, then this isn't possible. For
43-
instance, in evaluating the log density of a Gaussian process then
44-
summation of independent terms isn't applicable.
39+
The `reduce_sum` function is a tool for automatically parallelizing these
40+
calculations.
4541

4642
For efficiency reasons the reduce function doesn’t work with the
47-
element-wise evaluated function `g`, but instead requires the partial
48-
sum function ```f: U[] -> real```, where ```f``` computes the partial
49-
sum corresponding to a slice of the sequence ```x``` passed in. Due to the
43+
element-wise evaluated function `g`, but instead with the partial
44+
sum function `f: U[] -> real`, where `f` computes the partial
45+
sum corresponding to a slice of the sequence `x` passed in. Due to the
5046
associativity of the sum reduction it holds that:
5147

5248
```
@@ -64,20 +60,20 @@ control of the user. However, since the exact numerical result will
6460
depend on the order of summation, Stan provides two versions of the
6561
reduce summation facility:
6662

67-
* `reduce_sum`: Automatically forms partial sums resulting usually in good
68-
performance without further tuning.
69-
* `reduce_sum_static`: Creates for the same input always the same
70-
call graph resulting in stable numerical evaluation. This version
71-
requires setting a sensible tuning parameter for good performance.
72-
73-
The tuning parameter is the so-called `grainsize`. For the
74-
`reduce_sum` version, the `grainsize` is merely a suggested partial sum
75-
size, while for the `reduce_sum_static` version the `grainsize`
76-
specifies the maximal partial sum size. While for `reduce_sum` a
77-
`grainsize` of 1 commonly leads to good performance already (since
78-
automatic aggregation is performed), the `reduce_sum_static` variant
79-
requires setting a sensible `grainsize` for good performance as
80-
explained in [more detail below](#reduce-sum-grainsize).
63+
* `reduce_sum`: Automatically choose partial sums partitioning based on a dynamic
64+
scheduling algorithm.
65+
* `reduce_sum_static`: Compute the same sum as `reduce_sum`, but partition
66+
the input in the same way for a given data set (in `reduce_sum` this partitioning
67+
might change depending on computer load).
68+
69+
`grainsize` is the one tuning parameter. For `reduce_sum`, `grainsize` is
70+
a suggested partial sum size. A `grainsize` of 1 leaves the partitioning
71+
entirely up to the scheduler.
72+
73+
For `reduce_sum_static`, `grainsize` specifies the maximal partial sum size.
74+
With `reduce_sum_static` it is more important to choose `grainsize`
75+
carefully since it entirely determines the partitioning of work.
76+
See the [details below](#reduce-sum-grainsize).
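
To make the role of `grainsize` as a maximal partial sum size concrete, here is a minimal Python sketch of a deterministic partitioning. It is an illustration only: the recursive halving rule and the names `static_slices`, `g`, and `partial_sum` are assumptions made for this example and are not taken from Stan's implementation. The sketch splits the index range until no slice has more than `grainsize` terms, then checks that adding up the partial sums gives the same total as the plain sum.

```python
def static_slices(n_terms, grainsize):
    """Split [0, n_terms) by repeated halving until every slice
    holds at most `grainsize` terms (assumed rule, for illustration)."""
    if grainsize < 1:
        raise ValueError("grainsize must be at least 1")

    def split(start, end):
        if end - start <= grainsize:
            return [(start, end)]
        mid = start + (end - start) // 2
        return split(start, mid) + split(mid, end)

    return split(0, n_terms)


def g(x):
    # Hypothetical element-wise term of the sum.
    return x * x


def partial_sum(x_slice):
    # Sum of g over one slice, playing the role of f: U[] -> real.
    return sum(g(x) for x in x_slice)


x = list(range(1, 11))
slices = static_slices(len(x), grainsize=3)

# The same input and grainsize always give the same slices.
total_from_slices = sum(partial_sum(x[start:end]) for start, end in slices)
total_direct = sum(g(value) for value in x)
```

Because the slices depend only on the number of terms and on `grainsize`, repeating the call reproduces the same partitioning, which is the property that distinguishes the static variant.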
8177

8278
For efficiency and convenience additional
8379
shared arguments can be passed to every term in the sum. So for the
@@ -230,7 +226,7 @@ target += reduce_sum(partial_sum, y,
230226
```
231227

232228
The reduce summation facility automatically breaks the sum into roughly `grainsize` sized pieces
233-
and computes them in parallel. `grainsize = 1` specifies that the grainsize should
229+
and computes them in parallel. `grainsize = 1` specifies that the `grainsize` should
234230
be estimated automatically. The final model looks as:
235231

236232
```
@@ -261,35 +257,41 @@ model {
261257

262258
### Picking the Grainsize {#reduce-sum-grainsize}
263259

264-
The `grainsize` is a recommendation on how large each piece of
260+
The `grainsize` is a recommendation on how large each piece of
265261
parallel work is (how many terms it contains). When using the
266262
non-static version, it is recommended to choose 1 as a starting
267263
point as automatic aggregation of partial sums is performed. However,
268264
for the static version the `grainsize` defines the maximal size of the
269-
partial sums, e.g. the static variant will split the input sequence
270-
until all partial sums are just smaller than `grainsize`. Therefore,
271-
for the static version it is more important to select a sensible
272-
value. The rational for choosing a sensible `grainsize` is based on
265+
partial sums.
266+
267+
The rationale for choosing a sensible `grainsize` is based on
273268
balancing the overhead implied by creating many small tasks versus
274269
creating fewer large tasks which limits the potential parallelism.
275270

276-
From empirical experience, the automatic grainsize determination works
277-
well and no further tuning is required in most cases. In order to
278-
figure out an optimal grainsize, think about how many terms are in the
279-
summation and on how many cores the model should run. If there are `N`
271+
In `reduce_sum`, `grainsize` is a recommendation on how to partition
272+
the work in the partial sum into smaller pieces. A `grainsize` of 1
273+
leaves this entirely up to the internal scheduler. Ideally this will be
274+
efficient, but there are no guarantees.
275+
276+
In `reduce_sum_static`, `grainsize` is an upper limit on the worksize.
277+
Work will be split until all partial sums are just smaller than `grainsize`
278+
(and the split will happen the same way every time for the same data).
279+
For the static version it is more important to select a sensible `grainsize`.
280+
281+
In order to figure out an optimal `grainsize`, if there are `N`
280282
terms and `M` cores, run a quick test model with `grainsize` set
281-
roughly to `N / M`. Record the time, cut the grainsize in half, and
283+
roughly to `N / M`. Record the time, cut the `grainsize` in half, and
282284
run the test again. Repeat this iteratively until the model runtime
283-
begins to increase. This is a suitable grainsize for the model,
285+
begins to increase. This is a suitable `grainsize` for the model,
284286
because this ensures the calculations can be carried out with the most
285287
parallelism without losing too much efficiency.
286288

287289
For instance, in a model with `N=10000` and `M = 4`, start with `grainsize = 2500`, and
288290
sequentially try `grainsize = 1250`, `grainsize = 625`, etc.
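
The sequence of values to try follows directly from the rule of starting near `N / M` and halving. A small Python sketch (the helper name `grainsize_candidates` is hypothetical, and integer division is an assumption about how to round) lists the candidates for a given number of terms and cores:

```python
def grainsize_candidates(n_terms, n_cores):
    """List grainsize values to time: start near N / M, then halve
    repeatedly down to 1 (integer division is an assumed rounding)."""
    candidates = []
    grainsize = max(1, n_terms // n_cores)
    while True:
        candidates.append(grainsize)
        if grainsize == 1:
            break
        grainsize = max(1, grainsize // 2)
    return candidates


# Example with N = 10000 terms and M = 4 cores, where N / M = 2500.
candidates = grainsize_candidates(10000, 4)
```

Each candidate would then be timed with a quick test run of the model, stopping once the runtime begins to increase.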
289291

290-
It is important to repeat this process until performance gets worse!
292+
It is important to repeat this process until performance gets worse.
291293
It is possible after many halvings nothing happens, but there might
292-
still be a smaller grainsize that performs better. Even if a sum has
294+
still be a smaller `grainsize` that performs better. Even if a sum has
293295
many tens of thousands of terms, depending on the internal
294296
calculations, a `grainsize` of thirty or forty or smaller might be the
295297
best, and it is difficult to predict this behavior. Without doing
