olcf-workflow.html
63 additions & 7 deletions
@@ -127,7 +127,26 @@ <h3>Machine details<a class="headerlink" href="#machine-details" title="Link to
127
127
<section id="submitting-jobs">
128
128
<h3>Submitting jobs<a class="headerlink" href="#submitting-jobs" title="Link to this heading"></a></h3>
129
129
<p>Frontier uses SLURM.</p>
130
-
<p>Here’s a script that runs on GPUs and has the I/O fixes described above.</p>
130
+
<p>Here’s a script that uses our best practices on Frontier. It uses 64 nodes (512 GPUs)
131
+
and does the following:</p>
132
+
<ul>
133
+
<li><p>Sets the filesystem striping (see <a class="reference external" href="https://docs.olcf.ornl.gov/data/index.html#lfs-setstripe-wrapper">https://docs.olcf.ornl.gov/data/index.html#lfs-setstripe-wrapper</a>)</p></li>
134
+
<li><p>Includes logic for automatically restarting from the last checkpoint file
135
+
(useful for job-chaining). This is done via the <code class="docutils literal notranslate"><span class="pre">find_chk_file</span></code> function.</p></li>
136
+
<li><p>Installs a signal handler to create a <code class="docutils literal notranslate"><span class="pre">dump_and_stop</span></code> file shortly before
137
+
the queue window ends. This ensures that we get a checkpoint at the very
138
+
end of the queue window.</p></li>
139
+
<li><p>Optionally performs a check on restart to ensure that we don’t hang while
140
+
reading the initial checkpoint file (uncomment the relevant line):</p>
<span class="w"></span><span class="nb">echo</span><span class="w"> </span><span class="s2">"BATCH: allocation ending soon; telling Castro to dump a checkpoint and stop"</span>
</pre></div></div><p>where <code class="docutils literal notranslate"><span class="pre">frontier.slurm</span></code> is the name of the submission script.</p>
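<p>The restart and signal-handling steps listed above can be sketched roughly as follows. This is a minimal illustration, not the script from this page: the function names <code class="docutils literal notranslate"><span class="pre">find_latest_chk</span></code> and <code class="docutils literal notranslate"><span class="pre">request_dump</span></code> are hypothetical, and the sketch assumes that checkpoint directories have zero-padded names and that a <code class="docutils literal notranslate"><span class="pre">Header</span></code> file inside a checkpoint directory indicates that the checkpoint was written completely.</p>

```shell
# Hypothetical helpers for illustration only; names and layout are assumptions.

# Print the newest checkpoint directory matching the wildcard pattern in $1.
# A directory counts only if it contains a Header file (assumed here to mean
# that the checkpoint finished writing). Prints an empty line if none qualify.
find_latest_chk() {
    latest=""
    # sort puts zero-padded step numbers in increasing order, so the last
    # complete directory seen is the newest one
    for candidate in $(find . -maxdepth 1 -type d -name "$1" | sort); do
        if [ -f "${candidate}/Header" ]; then
            latest="${candidate}"
        fi
    done
    echo "${latest}"
}

# Signal handler: create the dump_and_stop file in the run directory so the
# running code writes a checkpoint and exits.
request_dump() {
    echo "BATCH: allocation ending soon; telling Castro to dump a checkpoint and stop"
    touch dump_and_stop
}

# Run the handler when the batch shell receives SIGURG
trap request_dump URG
```

<p>In a batch script, the result might be captured with something like <code class="docutils literal notranslate"><span class="pre">restart_file=$(find_latest_chk</span> <span class="pre">"chk?????")</span></code> and passed to the executable only when it is non-empty. Note that bash defers a trap until the current foreground command returns, so a handler like this only fires in time if the application is launched in the background and the script then calls <code class="docutils literal notranslate"><span class="pre">wait</span></code>.</p>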
293
+
<div class="admonition note">
294
+
<p class="admonition-title">Note</p>
295
+
<p>If the job times out before writing out a checkpoint (leaving a
296
+
<code class="docutils literal notranslate"><span class="pre">dump_and_stop</span></code> file behind), you can give it more time between the
297
+
warning signal and the end of the allocation by adjusting the
298
+
<code class="docutils literal notranslate"><span class="pre">#SBATCH</span> <span class="pre">--signal=B:URG@&lt;n&gt;</span></code> line at the top of the script.</p>
299
+
<p>Also, by default, AMReX writes a plotfile whenever it writes a checkpoint file,
300
+
so the <code class="docutils literal notranslate"><span class="pre">dump_and_stop</span></code> checkpoint will come with a plotfile that may not fall on
301
+
the interval set by <code class="docutils literal notranslate"><span class="pre">amr.plot_per</span></code>. To suppress this, set:</p>
<p>Also see the WarpX docs: <a class="reference external" href="https://warpx.readthedocs.io/en/latest/install/hpc/frontier.html">https://warpx.readthedocs.io/en/latest/install/hpc/frontier.html</a></p>