
Commit 1b2c33f

triton/tut/parallel-shared: Update ngrams exercise text some
1 parent 2385328 commit 1b2c33f

File tree

1 file changed (+63, −33 lines)

triton/tut/parallel-shared.rst

Lines changed: 63 additions & 33 deletions
@@ -257,25 +257,32 @@ Exercises
 .. exercise:: Shared memory parallelism 1: Test scaling
    :class: exercise-ngrams

-   Test scaling of the ngrams code. How many processors should be used?
+   Test scaling of the ngrams code. How many processors should be
+   used? (This is involved enough that you might want to follow the
+   steps in the solution.)
+
+   To get good stats, compute with ``-n 2 --words`` (word 2-grams),
+   and constrain to a single processor architecture (like the Slurm
+   option ``--constraint skl``).

    .. solution::

-      Running the code. Note we don't save the output anywhere. This
-      will still generate the output but not save it to a disk file.::
+      Running the code for character ngrams. Note we don't save the
+      output anywhere, so that we don't measure the time of disk
+      writing, and we constrain to the same type of node (processor
+      architecture) to make sure our results are comparable::

          # 100 books
-         srun --constraint=skl python3 ngrams/count.py -n 2 /scratch/shareddata/teaching/Gutenberg-Fiction-first100.zip -o /dev/null
-         srun --constraint=skl -c 1 python3 ngrams/count-multi.py -n 2 /scratch/shareddata/teaching/Gutenberg-Fiction-first100.zip -o /dev/null -t auto
-         srun --constraint=skl -c 2 python3 ngrams/count-multi.py -n 2 /scratch/shareddata/teaching/Gutenberg-Fiction-first100.zip -o /dev/null -t auto
-         srun --constraint=skl -c 4 python3 ngrams/count-multi.py -n 2 /scratch/shareddata/teaching/Gutenberg-Fiction-first100.zip -o /dev/null -t auto
-         srun --constraint=skl -c 8 python3 ngrams/count-multi.py -n 2 /scratch/shareddata/teaching/Gutenberg-Fiction-first100.zip -o /dev/null -t auto
-
-         ## 1000 books:
-         srun --constraint=skl --time=0-1 --mem=20G -c 4 python3 ngrams/count-multi.py -n 2 --words /scratch/shareddata/teaching/Gutenberg-Fiction-first1000.zip -o /dev/null -t auto
-
+         srun --constraint=skl -c 1 python3 ngrams/count.py -n 2 /scratch/shareddata/teaching/Gutenberg-Fiction-first100.zip -o /dev/null
+         srun --constraint=skl -c 1 python3 ngrams/count-multi.py -n 2 -t auto /scratch/shareddata/teaching/Gutenberg-Fiction-first100.zip -o /dev/null
+         srun --constraint=skl -c 2 python3 ngrams/count-multi.py -n 2 -t auto /scratch/shareddata/teaching/Gutenberg-Fiction-first100.zip -o /dev/null
+         srun --constraint=skl -c 4 python3 ngrams/count-multi.py -n 2 -t auto /scratch/shareddata/teaching/Gutenberg-Fiction-first100.zip -o /dev/null
+         srun --constraint=skl -c 8 python3 ngrams/count-multi.py -n 2 -t auto /scratch/shareddata/teaching/Gutenberg-Fiction-first100.zip -o /dev/null

-      Durations (character ngrams):
+      Summary table for character ngrams: Speedup is (single-core
+      time)/(multi-core time). "Total core time" is (time × number of
+      processors) and is a measure of how much computing resources you
+      actually used:

       .. csv-table::
          :header-rows: 1
@@ -284,35 +291,58 @@ Exercises
          N processors | time | speedup | total core time
          single-core version | 24.7 s | | 24.7 s
          1 | 23.8 s | 1.04 | 23.8 s
-         2 | 12.5 s | 1.9 | 25.0 s
-         4 | 7.2 s | 1.76 | 28.8 s
-         8 | 5.0 s | 1.44 | 40.0 s
+         2 | 12.5 s | 1.98 | 25.0 s
+         4 | 7.2 s | 3.4 | 28.8 s
+         8 | 5.0 s | 4.9 | 40.0 s

-      For these, it makes since to go up to 4 cores, since that's how
-      far you can go with a speedup of 1.5 or greater.
+      For these, it seems reasonable to go up to 4 cores, since that's
+      how far you can go with a reasonable speedup and you aren't
+      wasting too many resources.
+
+      Second, let's compute the same thing but with words
+      (``--words``). We see that the speedup is much worse, and it
+      almost doesn't make sense to use multi-core at all. This is
+      because word ngrams have much more output data, and the programs
+      spend more time moving that data around than computing anything.
+      This is computed for both 100 and 1000 books::
+
+         ## 1000 books:
+         srun --constraint=skl --time=0-1 --mem=20G -c 1 python3 ngrams/count.py -n 2 --words /scratch/shareddata/teaching/Gutenberg-Fiction-first1000.zip -o /dev/null
+         srun --constraint=skl --time=0-1 --mem=20G -c 1 python3 ngrams/count-multi.py -n 2 --words -t auto /scratch/shareddata/teaching/Gutenberg-Fiction-first1000.zip -o /dev/null
+         srun --constraint=skl --time=0-1 --mem=20G -c 2 python3 ngrams/count-multi.py -n 2 --words -t auto /scratch/shareddata/teaching/Gutenberg-Fiction-first1000.zip -o /dev/null
+         srun --constraint=skl --time=0-1 --mem=20G -c 4 python3 ngrams/count-multi.py -n 2 --words -t auto /scratch/shareddata/teaching/Gutenberg-Fiction-first1000.zip -o /dev/null
+         srun --constraint=skl --time=0-1 --mem=20G -c 8 python3 ngrams/count-multi.py -n 2 --words -t auto /scratch/shareddata/teaching/Gutenberg-Fiction-first1000.zip -o /dev/null

-      Durations (word ngrams, for 100 books and 1000 books):

       .. csv-table::
          :header-rows: 1
          :delim: |

          N processors | time (100 books) | speedup | core time used | | time (1000 books) | speedup | core time used
          single-core code | 22.5 s | | 22.5 s | | 170 s | | 170 s
-         1 | 29.6 s | .75 | 29.6 | | 201 s | .84 | 201 s
-         2 | 17.7 s | 1.71 | 35.4 | | 116 s | 1.73 | 232 s
-         4 | 14.4 s | 1.22 | 57.6 | | 82 s | 1.41 | 328 s
-         8 | 12.1 s | 1.19 | 96.8 | | 79 s | 1.04 | 632 s
-
-      For word ngrams, it's justifiable to use two processes because
-      the speed up there is still more than 1.71. Still, this is
-      relatively bad compared to what we expect for something that
-      *should* be perfectly parallel. In this case, the problem is
-      that the code isn't very efficient and is spending too much time
-      passing the data around. Using an array job allows every array
-      task to write separately, and then one single-core job is used
-      to accumulate the counts. Better yet would be to re-do the code
-      so that this inefficiency is improved.
+         1 | 29.6 s | 0.76 | 29.6 s | | 201 s | 0.84 | 201 s
+         2 | 17.7 s | 1.27 | 35.4 s | | 116 s | 1.5 | 232 s
+         4 | 14.4 s | 1.56 | 57.6 s | | 82 s | 2.1 | 328 s
+         8 | 12.1 s | 1.86 | 96.8 s | | 79 s | 2.2 | 632 s
+
+      For word ngrams, it doesn't even seem justifiable to use two
+      processes, because the speedup there is not even 1.5. In this
+      case, the problem is that the code isn't very efficient and is
+      spending too much time passing the data around. Using an array
+      job allows every array task to write separately, and then one
+      single-core job is used to accumulate the counts. Better yet
+      would be to re-do the code so that this inefficiency is
+      improved.
+
+      From our experience, some of the main slow points of the code are:
+
+      - Reading all the individual files (even from the zip file) is slow.
+      - Writing the results in plain text + JSON is slow: using a
+        binary format of some form would be better.
+      - The Python ``multiprocessing`` module has to inefficiently
+        move data between processes. Since this is a heavily
+        data-based computation (and there are quite a few ngrams it
+        makes), it is slow.


 .. exercise:: Shared memory parallelism 1: Test the example's scaling
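The speedup and "total core time" definitions used in the tables above can be checked with a small Python sketch. The helper names here (``speedup``, ``total_core_time``) are illustrative, not part of the ngrams code; the numbers are the measured character-ngram timings from the solution.

```python
def speedup(single_core_time, multi_core_time):
    """Speedup = (single-core time) / (multi-core time)."""
    return single_core_time / multi_core_time

def total_core_time(multi_core_time, n_processors):
    """Total core time = time x number of processors."""
    return multi_core_time * n_processors

single = 24.7  # seconds, single-core version, character 2-grams
for n, t in [(1, 23.8), (2, 12.5), (4, 7.2), (8, 5.0)]:
    print(f"{n} cores: speedup {speedup(single, t):.2f}, "
          f"core time {total_core_time(t, n):.1f} s")
```

This reproduces the character-ngram table: speedup grows with core count, but total core time (the resources you are billed for) grows too, which is why 4 cores was judged the reasonable stopping point.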

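The data-movement bottleneck named in the last bullet of the solution can be seen in a minimal ``multiprocessing`` sketch. This is an illustrative toy, not the tutorial's actual ``ngrams/count-multi.py``: each worker builds a ``Counter``, which must be pickled and shipped back to the parent process before merging, and for output-heavy workloads (like word ngrams) that transfer can dominate the compute time.

```python
# Toy parallel n-gram count: workers return Counters that the parent
# must receive (via pickling) and merge serially.
from collections import Counter
from multiprocessing import Pool

def count_ngrams(text, n=2):
    """Count character n-grams in one document."""
    return Counter(text[i:i+n] for i in range(len(text) - n + 1))

def main():
    docs = ["to be or not to be", "the quick brown fox"]
    with Pool(processes=2) as pool:
        # Each Counter result is serialized and sent back to the parent.
        partial_counts = pool.map(count_ngrams, docs)
    total = Counter()
    for c in partial_counts:  # merging happens serially in the parent
        total.update(c)
    print(total.most_common(3))

if __name__ == "__main__":
    main()
```

The bigger each partial ``Counter`` is relative to the per-document compute, the smaller the benefit of adding processes, which matches the poor word-ngram scaling measured above.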