
Commit 075665a

Merge pull request #53 from AMReX-Astro/update_nersc
update the NERSC job script docs
2 parents: 58657cf + 8bda38e

1 file changed

sphinx_docs/source/nersc-workflow.rst

Lines changed: 30 additions & 6 deletions
@@ -10,17 +10,28 @@ Perlmutter
 GPU jobs
 ^^^^^^^^

-Perlmutter has 1536 GPU nodes, each with 4 NVIDIA A100 GPUs -- therefore it is best to use
-4 MPI tasks per node.
+Perlmutter has 1536 GPU nodes, each with 4 NVIDIA A100
+GPUs---therefore it is best to use 4 MPI tasks per node.

-.. note::
+.. important::

    you need to load the same modules used to compile the executable in
    your submission script, otherwise, it will fail at runtime because
    it can't find the CUDA libraries.

-Below is an example that runs on 16 nodes with 4 GPUs per node, and also
-includes the restart logic to allow for job chaining.
+Below is an example that runs on 16 nodes with 4 GPUs per node. It also
+does the following:
+
+* Includes logic for automatically restarting from the last checkpoint file
+  (useful for job-chaining). This is done via the ``find_chk_file`` function.
+
+* Installs a signal handler to create a ``dump_and_stop`` file shortly before
+  the queue window ends. This ensures that we get a checkpoint at the very
+  end of the queue window.
+
+* Can post to slack using the :download:`slack_job_start.py
+  <../../job_scripts/perlmutter/slack_job_start.py>` script---this
+  requires a webhook to be installed (in a file ``~/.slack.webhook``).

 .. literalinclude:: ../../job_scripts/perlmutter/perlmutter.submit
    :language: sh
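
The full job script is brought in by the ``literalinclude`` above and is not reproduced on this page. As a rough sketch of the restart and signal-handling pieces described in the new bullets (the sbatch options, executable name, and ``chk?????`` checkpoint-naming pattern below are illustrative assumptions, not necessarily what ``perlmutter.submit`` contains), the logic might look like:

   #!/bin/bash
   # illustrative sbatch settings (account and queue options omitted)
   #SBATCH -N 16
   #SBATCH --ntasks-per-node=4
   #SBATCH -C gpu
   #SBATCH --signal=B:URG@120   # warn the batch shell 120 s (example value) before the walltime ends

   # find the most recent checkpoint directory (assumed here to be named chk?????)
   # so a chained job restarts from where the previous one stopped
   function find_chk_file {
       local latest=$(ls -d chk????? 2> /dev/null | sort | tail -1)
       if [ -n "${latest}" ]; then
           restartString="amr.restart=${latest}"
       else
           restartString=""
       fi
   }

   # signal handler: ask the running application to write a final checkpoint
   # and stop, then wait for it to finish cleanly
   function sig_handler {
       touch dump_and_stop
       wait
   }
   trap 'sig_handler' URG

   find_chk_file

   # run in the background so the batch shell stays free to catch the signal
   srun ./main.exe inputs ${restartString} &
   wait

Running ``srun`` in the background and then ``wait``-ing is what lets the batch shell receive the ``B:``-directed signal while the application keeps running.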
@@ -29,7 +40,12 @@ includes the restart logic to allow for job chaining.
 With large reaction networks, you may get GPU out-of-memory errors during
 the first burner call. If this happens, you can add
-``amrex.the_arena_init_size=0`` after ``${restartString}`` in the srun call
+
+::
+
+   amrex.the_arena_init_size=0
+
+after ``${restartString}`` in the srun call
 so AMReX doesn't reserve 3/4 of the GPU memory for the device arena.

 .. note::
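
In practice this just means appending the parameter to the launch line. A hypothetical version of the ``srun`` call (the executable and inputs-file names are placeholders) would be:

   # append the runtime parameter after the restart argument so AMReX does not
   # pre-allocate most of the GPU memory for its device arena
   srun ./main.exe inputs ${restartString} amrex.the_arena_init_size=0 &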
@@ -39,6 +55,14 @@ includes the restart logic to allow for job chaining.
    warning signal and the end of the allocation by adjusting the
    ``#SBATCH --signal=B:URG@<n>`` line at the top of the script.

+Also, by default, AMReX will output a plotfile at the same time as a checkpoint file,
+which means you'll get one from the ``dump_and_stop``, which may not be at the same
+time intervals as your ``amr.plot_per``. To suppress this, set:
+
+::
+
+   amr.write_plotfile_with_checkpoint = 0
+
 CPU jobs
 ^^^^^^^^
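
The setting would normally live in the inputs file, but since AMReX runtime parameters given on the command line override the inputs file, it could also be appended to the ``srun`` line (again with placeholder names):

   # suppress the plotfile that would otherwise accompany every checkpoint,
   # including the final one triggered by dump_and_stop
   srun ./main.exe inputs ${restartString} amr.write_plotfile_with_checkpoint=0 &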
