
Commit 0b8ad36

deploy: d169575

1 parent ab2a860 commit 0b8ad36

File tree

3 files changed: +102 −9 lines changed


_sources/olcf-workflow.rst.txt

Lines changed: 38 additions & 1 deletion
@@ -38,7 +38,30 @@ Submitting jobs
 
 Frontier uses SLURM.
 
-Here's a script that runs on GPUs and has the I/O fixes described above.
+Here's a script that uses our best practices on Frontier. It uses 64 nodes (512 GPUs)
+and does the following:
+
+* Sets the filesystem striping (see https://docs.olcf.ornl.gov/data/index.html#lfs-setstripe-wrapper)
+
+* Includes logic for automatically restarting from the last checkpoint file
+  (useful for job-chaining). This is done via the ``find_chk_file`` function.
+
+* Installs a signal handler to create a ``dump_and_stop`` file shortly before
+  the queue window ends. This ensures that we get a checkpoint at the very
+  end of the queue window.
+
+* Can do a special check on restart to ensure that we don't hang on
+  reading the initial checkpoint file (uncomment the line):
+
+  ::
+
+     (sleep 300; check_restart ) &
+
+  This uses the ``check_restart`` function and will kill the job if it doesn't
+  detect a successful restart within 5 minutes.
+
+* Adds special I/O parameters to the job to work around filesystem issues
+  (these are defined in ``FILE_IO_PARAMS``).
 
 .. literalinclude:: ../../job_scripts/frontier/frontier.slurm
    :language: bash
@@ -51,6 +74,20 @@ The job is submitted as:
 
 where ``frontier.slurm`` is the name of the submission script.
 
+.. note::
+
+   If the job times out before writing out a checkpoint (leaving a
+   ``dump_and_stop`` file behind), you can give it more time between the
+   warning signal and the end of the allocation by adjusting the
+   ``#SBATCH --signal=B:URG@<n>`` line at the top of the script.
+
+   Also, by default, AMReX will output a plotfile at the same time as a checkpoint file,
+   which means you'll get one from the ``dump_and_stop``, which may not be at the same
+   time intervals as your ``amr.plot_per``. To suppress this, set:
+
+   ::
+
+      amr.write_plotfile_with_checkpoint = 0
 
 Also see the WarpX docs: https://warpx.readthedocs.io/en/latest/install/hpc/frontier.html
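
Job-chaining itself is not shown in this diff; a typical pattern (the exact
invocation below is an assumption, not part of the commit) is to queue the
next run with a SLURM dependency so it starts only after the current job
ends, at which point ``find_chk_file`` picks up the newest checkpoint::

   # hypothetical job-chaining usage -- not part of this commit
   jobid=$(sbatch --parsable frontier.slurm)
   sbatch --dependency=afterany:${jobid} frontier.slurm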

olcf-workflow.html

Lines changed: 63 additions & 7 deletions
@@ -127,7 +127,26 @@ <h3>Machine details</h3>
 <section id="submitting-jobs">
 <h3>Submitting jobs</h3>
 <p>Frontier uses SLURM.</p>
-<p>Here’s a script that runs on GPUs and has the I/O fixes described above.</p>
+<p>Here’s a script that uses our best practices on Frontier. It uses 64 nodes (512 GPUs)
+and does the following:</p>
+<ul>
+<li><p>Sets the filesystem striping (see <a href="https://docs.olcf.ornl.gov/data/index.html#lfs-setstripe-wrapper">https://docs.olcf.ornl.gov/data/index.html#lfs-setstripe-wrapper</a>)</p></li>
+<li><p>Includes logic for automatically restarting from the last checkpoint file
+(useful for job-chaining). This is done via the <code>find_chk_file</code> function.</p></li>
+<li><p>Installs a signal handler to create a <code>dump_and_stop</code> file shortly before
+the queue window ends. This ensures that we get a checkpoint at the very
+end of the queue window.</p></li>
+<li><p>Can do a special check on restart to ensure that we don’t hang on
+reading the initial checkpoint file (uncomment the line):</p>
+<pre>(sleep 300; check_restart ) &amp;</pre>
+<p>This uses the <code>check_restart</code> function and will kill the job if it doesn’t
+detect a successful restart within 5 minutes.</p>
+</li>
+<li><p>Adds special I/O parameters to the job to work around filesystem issues
+(these are defined in <code>FILE_IO_PARAMS</code>).</p></li>
+</ul>
 <pre>#!/bin/bash
 #SBATCH -A AST106
 #SBATCH -J subch
@@ -140,6 +159,7 @@ <h3>Submitting jobs</h3>
 #SBATCH --cpus-per-task=7
 #SBATCH --gpus-per-task=1
 #SBATCH --gpu-bind=closest
+#SBATCH --signal=B:URG@300
 
 EXEC=./Castro3d.hip.x86-trento.MPI.HIP.SMPLSDC.ex
 INPUTS=inputs_3d.N14.coarse
@@ -152,18 +172,13 @@ <h3>Submitting jobs</h3>
 
 export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH
 
-# libfabric workaround
-export FI_MR_CACHE_MONITOR=memhooks
-
 # set the file system striping
 
 echo $SLURM_SUBMIT_DIR
 
 module load lfs-wrapper
 lfs setstripe -c 32 -S 10M $SLURM_SUBMIT_DIR
 
-module list
-
 function find_chk_file {
     # find_chk_file takes a single argument -- the wildcard pattern
     # for checkpoint files to look through
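
The hunk above ends at the (unchanged) ``find_chk_file`` signature, so the
function body is not part of this diff. As a rough sketch of the
restart-discovery logic such a function implements -- the directory walk and
the ``Header`` completeness check are assumptions, not the script's actual
code::

   # sketch only -- the real implementation lives in
   # job_scripts/frontier/frontier.slurm
   function find_chk_file {
       # find_chk_file takes a single argument -- the wildcard pattern
       # for checkpoint files to look through
       local pattern=$1
       restartFile=""
       # visit matching checkpoint directories newest-first; a lexicographic
       # sort works because AMReX zero-pads the step number (chk00100, ...)
       for f in $(ls -d ${pattern} 2> /dev/null | sort -r); do
           # assume a checkpoint is complete once its Header file exists
           if [ -f "${f}/Header" ]; then
               restartFile=${f}
               break
           fi
       done
   }

The next hunk shows ``${restartFile}`` being turned into an ``amr.restart=``
argument for Castro.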
@@ -206,6 +221,21 @@ <h3>Submitting jobs</h3>
     restartString="amr.restart=${restartFile}"
 fi
 
+
+# clean up any run management files left over from previous runs
+rm -f dump_and_stop
+
+# The `--signal=B:URG@<n>` option tells slurm to send SIGURG to this batch
+# script n seconds before the runtime limit, so we can exit gracefully.
+function sig_handler {
+    touch dump_and_stop
+    # disable this signal handler
+    trap - URG
+    echo "BATCH: allocation ending soon; telling Castro to dump a checkpoint and stop"
+}
+trap sig_handler URG
+
+
 export OMP_NUM_THREADS=1
 export NMPI_PER_NODE=8
 export TOTAL_NMPI=$(( ${SLURM_JOB_NUM_NODES} * ${NMPI_PER_NODE} ))
@@ -237,7 +267,20 @@ <h3>Submitting jobs</h3>
 
 (sleep 300; check_restart ) &
 
-srun -n${TOTAL_NMPI} -N${SLURM_JOB_NUM_NODES} --ntasks-per-node=8 --gpus-per-task=1 ./$EXEC $INPUTS ${restartString} ${FILE_IO_PARAMS}
+# execute srun in the background then use the builtin wait so the shell can
+# handle the signal
+srun -n${TOTAL_NMPI} -N${SLURM_JOB_NUM_NODES} --ntasks-per-node=8 --gpus-per-task=1 ./$EXEC $INPUTS ${restartString} ${FILE_IO_PARAMS} &
+pid=$!
+wait $pid
+ret=$?
+
+if (( ret == 128 + 23 )); then
+    # received SIGURG, keep waiting
+    wait $pid
+    ret=$?
+fi
+
+exit $ret
 </pre>
 <p>The job is submitted as:</p>
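
The ``${FILE_IO_PARAMS}`` on the ``srun`` line above is defined earlier in
the script, outside the changed hunks, so its contents do not appear in this
diff. For illustration only: ``amr.plot_nfiles`` and ``amr.checkpoint_nfiles``
are real AMReX parameters that control how many binary files plotfile and
checkpoint data are spread across, but the values below and the exact set
used on Frontier are assumptions::

   # hypothetical example -- the actual definition in frontier.slurm may
   # differ; these runtime parameters are appended to Castro's command line
   FILE_IO_PARAMS="amr.plot_nfiles=64 amr.checkpoint_nfiles=64"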
@@ -247,6 +290,19 @@ <h3>Submitting jobs</h3>
 <pre>sbatch frontier.slurm</pre>
 <p>where <code>frontier.slurm</code> is the name of the submission script.</p>
+<div class="admonition note">
+<p class="admonition-title">Note</p>
+<p>If the job times out before writing out a checkpoint (leaving a
+<code>dump_and_stop</code> file behind), you can give it more time between the
+warning signal and the end of the allocation by adjusting the
+<code>#SBATCH --signal=B:URG@&lt;n&gt;</code> line at the top of the script.</p>
+<p>Also, by default, AMReX will output a plotfile at the same time as a checkpoint file,
+which means you’ll get one from the <code>dump_and_stop</code>, which may not be at the same
+time intervals as your <code>amr.plot_per</code>. To suppress this, set:</p>
+<pre>amr.write_plotfile_with_checkpoint = 0</pre>
+</div>
 <p>Also see the WarpX docs: <a href="https://warpx.readthedocs.io/en/latest/install/hpc/frontier.html">https://warpx.readthedocs.io/en/latest/install/hpc/frontier.html</a></p>
 </section>
 <section id="gpu-aware-mpi">
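
One helper referenced in both files but shown in neither is ``check_restart``,
invoked via the backgrounded ``(sleep 300; check_restart ) &``. A minimal
sketch of what it could look like -- the log-file name and the marker string
are assumptions; the real function is in ``frontier.slurm``::

   # sketch only -- assumes the job's stdout goes to a known log file and
   # that Castro prints a recognizable line once the initial checkpoint has
   # been read; both are illustrative assumptions
   function check_restart {
       # if the run is still stuck reading the initial checkpoint after the
       # 5-minute sleep, assume the read hung and give up on the job
       if ! grep -q "Restarting from checkpoint" castro.out; then
           echo "BATCH: restart appears hung; cancelling the job"
           scancel ${SLURM_JOB_ID}
       fi
   }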
