.. exercise:: Shared memory parallelism 1: Test scaling
   :class: exercise-ngrams

   Test scaling of the ngrams code. How many processors should be
   used? (This exercise is involved enough that you may want to follow
   the steps in the solution.)

   To get good statistics, compute with ``-n 2 --words`` (word
   2-grams), and constrain the jobs to a single processor architecture
   (for example with the Slurm option ``--constraint=skl``).

   .. solution::

      Running the code for character ngrams. Note that we don't save
      the output anywhere, so that we don't measure the time of disk
      writing, and we constrain the jobs to the same type of node
      (processor architecture) to make sure our results are
      comparable::

         # 100 books
         srun --constraint=skl -c 1 python3 ngrams/count.py -n 2 /scratch/shareddata/teaching/Gutenberg-Fiction-first100.zip -o /dev/null
         srun --constraint=skl -c 1 python3 ngrams/count-multi.py -n 2 -t auto /scratch/shareddata/teaching/Gutenberg-Fiction-first100.zip -o /dev/null
         srun --constraint=skl -c 2 python3 ngrams/count-multi.py -n 2 -t auto /scratch/shareddata/teaching/Gutenberg-Fiction-first100.zip -o /dev/null
         srun --constraint=skl -c 4 python3 ngrams/count-multi.py -n 2 -t auto /scratch/shareddata/teaching/Gutenberg-Fiction-first100.zip -o /dev/null
         srun --constraint=skl -c 8 python3 ngrams/count-multi.py -n 2 -t auto /scratch/shareddata/teaching/Gutenberg-Fiction-first100.zip -o /dev/null

      Summary table for character ngrams. Speedup is (single-core
      time) / (multi-core time); "total core time" is (time × number
      of processors) and is a measure of how much computing resource
      you actually used:

      .. csv-table::
         :header-rows: 1
         :delim: |

         N processors | time | speedup | total core time
         single-core version | 24.7 s | | 24.7 s
         1 | 23.8 s | 1.04 | 23.8 s
         2 | 12.5 s | 1.98 | 25.0 s
         4 | 7.2 s | 3.4 | 28.8 s
         8 | 5.0 s | 4.9 | 40.0 s

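As a quick sanity check, the speedup and total-core-time columns can be recomputed from the wall times measured above:

```python
# Recompute the speedup and total-core-time columns for the
# character-ngram runs, using the measured wall times.
single_core = 24.7  # seconds, single-core version

wall_times = {1: 23.8, 2: 12.5, 4: 7.2, 8: 5.0}  # N processors -> seconds

for n, t in wall_times.items():
    speedup = single_core / t
    core_time = n * t  # total CPU resource actually consumed
    print(f"{n} cores: speedup {speedup:.2f}, total core time {core_time:.1f} s")
```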
      For these runs it makes sense to use up to 4 cores: that is as
      far as you can go while still getting a reasonable speedup
      without wasting too many resources.

      Second, let's compute the same thing but with word ngrams
      (``--words``). We see that the speedup is much worse, and it
      almost doesn't make sense to use multiple cores at all. This is
      because word ngrams produce much more output data, and the
      programs spend more time moving that data around than computing
      anything. This is computed for both 100 and 1000 books::

         # 1000 books (for 100 books, use Gutenberg-Fiction-first100.zip):
         srun --constraint=skl --time=0-1 --mem=20G -c 1 python3 ngrams/count.py -n 2 --words /scratch/shareddata/teaching/Gutenberg-Fiction-first1000.zip -o /dev/null
         srun --constraint=skl --time=0-1 --mem=20G -c 1 python3 ngrams/count-multi.py -n 2 --words -t auto /scratch/shareddata/teaching/Gutenberg-Fiction-first1000.zip -o /dev/null
         srun --constraint=skl --time=0-1 --mem=20G -c 2 python3 ngrams/count-multi.py -n 2 --words -t auto /scratch/shareddata/teaching/Gutenberg-Fiction-first1000.zip -o /dev/null
         srun --constraint=skl --time=0-1 --mem=20G -c 4 python3 ngrams/count-multi.py -n 2 --words -t auto /scratch/shareddata/teaching/Gutenberg-Fiction-first1000.zip -o /dev/null
         srun --constraint=skl --time=0-1 --mem=20G -c 8 python3 ngrams/count-multi.py -n 2 --words -t auto /scratch/shareddata/teaching/Gutenberg-Fiction-first1000.zip -o /dev/null

      .. csv-table::
         :header-rows: 1
         :delim: |

         N processors | time (100 books) | speedup | core time used | | time (1000 books) | speedup | core time used
         single-core code | 22.5 s | | 22.5 s | | 170 s | | 170 s
         1 | 29.6 s | 0.76 | 29.6 s | | 201 s | 0.84 | 201 s
         2 | 17.7 s | 1.27 | 35.4 s | | 116 s | 1.5 | 232 s
         4 | 14.4 s | 1.56 | 57.6 s | | 82 s | 2.1 | 328 s
         8 | 12.1 s | 1.86 | 96.8 s | | 79 s | 2.2 | 632 s

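A compact way to quantify "how much you are wasting" is the parallel efficiency: speedup divided by the number of cores, where 1.0 would be perfect scaling. Computed here for the 1000-book word-ngram wall times in the table above:

```python
# Parallel efficiency (speedup / N) for the 1000-book word-ngram runs.
single_core = 170.0  # seconds, single-core code
wall_times = {1: 201.0, 2: 116.0, 4: 82.0, 8: 79.0}  # N processors -> seconds

for n, t in wall_times.items():
    speedup = single_core / t
    efficiency = speedup / n  # 1.0 = perfect scaling
    print(f"{n} cores: speedup {speedup:.2f}, efficiency {efficiency:.2f}")
```

The efficiency drops below 0.3 at 8 cores, which is another way of seeing that most of those cores' time is wasted.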
      For word ngrams it is hard to justify even two processes, since
      the speedup there is not even 1.5. In this case the problem is
      that the code isn't very efficient and spends too much time
      passing data between processes. Using an array job instead lets
      every array task write its counts separately, with one
      single-core job accumulating the counts afterwards. Better yet
      would be to rework the code so that this inefficiency is
      removed.

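The accumulation step of that array-job pattern can be sketched in a few lines; this is a minimal illustration, and the file names and JSON layout here are hypothetical, not the actual ngrams code:

```python
# Sketch of the "accumulate" step after an array job: each array
# task has written its own counts file, and one single-core job
# merges them.  File names and format are hypothetical.
import json
import tempfile
from collections import Counter
from pathlib import Path

def merge_counts(count_files):
    """Sum per-task ngram counts (dicts of ngram -> count) into one Counter."""
    total = Counter()
    for path in count_files:
        with open(path) as f:
            total.update(json.load(f))  # Counter.update() adds the counts
    return total

# Demonstrate with two fake per-task result files:
tmp = Path(tempfile.mkdtemp())
(tmp / "task1.json").write_text(json.dumps({"of the": 10, "in a": 3}))
(tmp / "task2.json").write_text(json.dumps({"of the": 7, "to be": 5}))
merged = merge_counts([tmp / "task1.json", tmp / "task2.json"])
print(merged["of the"])  # 10 + 7 = 17
```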
      From our experience, some of the main slow points of the code
      are:

      - Reading all the individual files (even from the zip file) is
        slow.
      - Writing the results in plain text + JSON is slow: a binary
        format of some form would be better.
      - The Python ``multiprocessing`` module has to copy data
        between processes, which is inefficient. Since this is a
        heavily data-bound computation (and it produces quite a lot
        of ngrams), this is slow.

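The binary-format point can be illustrated with the standard library alone. This is a sketch of the idea, not a change to the actual ngrams code: pickle is one binary option for Python objects, and for large count dictionaries it is typically faster to read and write than text + JSON (the exact saving depends on the data).

```python
# Sketch: storing counts in a binary format (pickle) instead of
# plain-text JSON.  Illustration only, not the actual ngrams code.
import json
import pickle
from collections import Counter

counts = Counter({"of the": 12345, "in a": 678})

# Plain-text JSON: human-readable, but slower to parse at scale.
json_bytes = json.dumps(counts).encode()

# Binary pickle: not human-readable, but preserves the Counter
# type directly and avoids text parsing on read.
pickle_bytes = pickle.dumps(counts, protocol=pickle.HIGHEST_PROTOCOL)

restored = pickle.loads(pickle_bytes)
print(restored == counts)  # True: the round trip is lossless
```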
.. exercise:: Shared memory parallelism 1: Test the example's scaling