Perlmutter
----------

GPU jobs
^^^^^^^^

Perlmutter has 1536 GPU nodes, each with 4 NVIDIA A100 GPUs, so it is
best to use 4 MPI tasks per node.

.. important::

   You need to load the same modules in your submission script that were
   used to compile the executable; otherwise, the job will fail at runtime
   because it can't find the CUDA libraries.

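For example, if the executable was built with the GNU programming
environment and CUDA, the submission script might include lines like the
following (the module names are illustrative and depend on how you
compiled):

.. code-block:: sh

   # load the same environment used to build the executable
   module load PrgEnv-gnu
   module load cudatoolkit
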
Below is an example that runs on 16 nodes with 4 GPUs per node. It also
does the following (a condensed sketch of the first two pieces appears
after this list):

* Includes logic for automatically restarting from the last checkpoint
  file (useful for job chaining). This is done via the ``find_chk_file``
  function.

* Installs a signal handler to create a ``dump_and_stop`` file shortly
  before the queue window ends. This ensures that we get a checkpoint at
  the very end of the queue window.

* Can post to Slack using the
  :download:`slack_job_start.py <../../job_scripts/perlmutter/slack_job_start.py>`
  script; this requires a webhook to be installed (in the file
  ``~/.slack.webhook``).

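As a rough illustration, the restart and signal-handling pieces might
look like the following sketch (the checkpoint prefix, function bodies,
and variable names are simplified here; the full script below is the
authoritative version):

.. code-block:: sh

   # build an "amr.restart=..." argument from the most recent checkpoint
   # directory, if one exists (a "chk" prefix is assumed here)
   function find_chk_file {
       restartString=""
       chk_file=$(ls -d chk????? 2> /dev/null | sort | tail -1)
       if [ -n "$chk_file" ]; then
           restartString="amr.restart=${chk_file}"
       fi
   }

   # when SLURM delivers the warning signal near the end of the queue
   # window, create dump_and_stop so the code writes a final checkpoint;
   # for the batch shell to catch the signal, srun should be launched in
   # the background and followed by "wait"
   function sig_handler {
       touch dump_and_stop
   }
   trap 'sig_handler' URG
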
.. literalinclude:: ../../job_scripts/perlmutter/perlmutter.submit
   :language: sh

With large reaction networks, you may get GPU out-of-memory errors during
the first burner call. If this happens, you can add

::

   amrex.the_arena_init_size=0

after ``${restartString}`` in the srun call, so AMReX doesn't reserve 3/4
of the GPU memory for the device arena.
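
For example, the srun invocation might then look something like the
following sketch (the executable and inputs file names are placeholders,
not the actual names from the script):

.. code-block:: sh

   srun -n $((SLURM_JOB_NUM_NODES * 4)) ./Castro.gnu.MPI.CUDA.ex inputs \
       ${restartString} amrex.the_arena_init_size=0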

.. note::

   You can change the amount of time between the
   warning signal and the end of the allocation by adjusting the
   ``#SBATCH --signal=B:URG@<n>`` line at the top of the script.
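
For instance, setting a hypothetical value of 300 seconds would deliver
the URG signal to the batch shell 5 minutes before the allocation ends:

.. code-block:: sh

   #SBATCH --signal=B:URG@300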

By default, AMReX also writes a plotfile whenever it writes a checkpoint
file, so the checkpoint triggered by ``dump_and_stop`` will produce a
plotfile that may not line up with the intervals set by ``amr.plot_per``.
To suppress this, set:

::

   amr.write_plotfile_with_checkpoint = 0

CPU jobs
^^^^^^^^
