
Commit 4c8fc3c

Merge pull request #790 from AaltoSciComp/rkdarst/localstorage
triton/usage/localstorage: Big update
2 parents ebd3a83 + 21d50c4

4 files changed: +163 −71 lines

triton/ref/slurm.rst

Lines changed: 1 addition & 2 deletions
@@ -34,8 +34,7 @@
  ! ``-e ERRORFILE`` ! print errors into file *error*
  ! ``--exclusive`` ! allocate exclusive access to nodes. For large parallel jobs.
  ! ``--constraint=FEATURE`` ! request *feature* (see ``slurm features`` for the current list of configured features, or Arch under the :ref:`hardware list <hardware-list>`). Multiple with ``--constraint="hsw|skl"``.
- ! ``--constraint=localdisk`` ! request nodes that have local disks
- ! ``--tmp=nnnG`` ! Request ``nnn`` GB of :doc:`local disk storage space </triton/usage/localstorage>`
+ ! ``--tmp=nnnG`` ! request a node with :doc:`local disk storage space </triton/usage/localstorage>` and ``nnn`` GB of space on it.
  ! ``--array=0-5,7,10-15`` ! Run job multiple times, use variable ``$SLURM_ARRAY_TASK_ID`` to adjust parameters.
  ! ``--mail-type=TYPE`` ! notify of events: ``BEGIN``, ``END``, ``FAIL``, ``ALL``, ``REQUEUE`` (not on triton) or ``ALL``. Must be used together with ``--mail-user=``.
  ! ``[email protected]`` ! Aalto email to send the notification about the job. External email addresses don't work.
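A minimal sketch of how the revised ``--tmp`` row is meant to be used (the time, size, and script name here are arbitrary placeholders):

.. code-block:: console

   $ sbatch --time=01:00:00 --tmp=10G my_job.sh   # ask for a node with about 10 GB of local disk under /tmp/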

triton/ref/storage.rst

Lines changed: 1 addition & 1 deletion
@@ -6,5 +6,5 @@
  Home | ``$HOME`` or ``/home/USERNAME/`` | hard quota 10GB | Nightly | all nodes | Small user specific files, no calculation data.
  Work | ``$WRKDIR`` or ``/scratch/work/USERNAME/`` | 200GB and 1 million files | x | all nodes | Personal working space for every user. Calculation data etc. Quota can be increased on request.
  Scratch | ``/scratch/DEPT/PROJECT/`` | on request | x | all nodes | Department/group specific project directories.
- :doc:`Local temp (disk) </triton/usage/localstorage>` | ``/tmp/`` (nodes with disks only) | local disk size | x | single-node | (Usually fastest) place for single-node calculation data. Removed once user's jobs are finished on the node. Request with ``--tmp=nnnG`` or ``--constraint=localdisk``.
+ :doc:`Local temp (disk) </triton/usage/localstorage>` | ``/tmp/`` (nodes with disks only) | local disk size | x | single-node | (Usually fastest) place for single-node calculation data. Removed once user's jobs are finished on the node. Request with ``--tmp=nnnG``.
  :doc:`Local temp (ramfs) </triton/usage/localstorage>` | ``/dev/shm/`` (and ``/tmp/`` on diskless nodes) | limited by memory | x | single-node | Very fast but small in-memory filesystem

triton/tut/storage.rst

Lines changed: 2 additions & 2 deletions
@@ -32,8 +32,8 @@ choose between them. The
  (recommended for most work)

  * ``/tmp``: temporary local disk space, per-user mounted in jobs and
-   automatically cleaned up. Only on nodes with disks
-   (``--constraint=localdisk``), otherwise it's ramfs
+   automatically cleaned up. Use ``--tmp=nnnG`` to request at
+   least ``nnn`` GB of space, otherwise it's ramfs
  * ``/dev/shm``: ramfs, in-memory file storage

  * See :doc:`remotedata` for how to transfer and access the data

triton/usage/localstorage.rst

Lines changed: 159 additions & 66 deletions
@@ -2,71 +2,132 @@
  Storage: local drives
  =====================

- .. seealso::
+ .. admonition:: Abstract
+
+    - Path is ``/tmp/``
+    - Local drives are useful for large temporary data or unpacking
+      many small files before analysis. They are most important for
+      GPU training data but are useful at other times, too.
+    - Local storage can be either SSD drives (big and reasonably fast),
+      spinning hard disks (HDDs; older nodes), or ramdisk (using your
+      job's memory; extremely fast).
+    - Request local storage with ``--tmp=NNg`` (the space you think you
+      need; but the space isn't reserved just for you).
+    - For ramdisk, the space comes out of your ``--mem=`` allocation.

-    :doc:`the storage tutorial <../tut/storage>`.
+ .. seealso::

- Local disks on computing nodes are the preferred place for doing your
- IO. The general idea is use network storage as a backend and local disk
- for actual data processing. **Some nodes have no disks** (local
- storage comes out of the job memory, **some older nodes have HDDs**
- (spinning disks), and some **SSDs**.
+    :doc:`The storage tutorial <../tut/storage>`.
+
+ Local disks on computing nodes are the preferred place for doing
+ extensive input/output (IO; reading/writing files). The general idea
+ is to use network storage as a backend and local disk for actual data
+ processing when it requires many reads or writes. **Different nodes
+ have different types of disks; Triton is very heterogeneous**:
+
+ .. list-table::
+    :header-rows: 1
+
+    - - Type
+      - Description
+      - Requesting
+      - Path
+    - - Solid-state drives (SSDs)
+      - Much faster than HDDs but much slower than ramdisk. Generally
+        GPU nodes have SSDs these days.
+      - ``--tmp=NNg``. The space is not guaranteed just for you.
+      - ``/tmp/``
+    - - Spinning hard disks (HDDs)
+      - Generally only older CPU nodes have HDDs.
+      - ``--tmp=NNg`` to specify the size you need. The space is not
+        guaranteed just for you.
+      - ``/tmp/``
+    - - Ramdisk
+      - Uses your job's memory allocation. Limited space but lightning
+        fast.
+      - ``--mem=NNg`` to request enough memory for your job and your
+        storage.
+      - ``/tmp/`` on diskless nodes and ``/dev/shm/`` on every node.
+
+ See :doc:`../overview` for details on each node's local storage.
+
+ The reason that local storage matters is that :doc:`lustre` (scratch)
+ is not good for many :doc:`smallfiles`. Read those articles for
+ background.
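As a rough sketch of the two request styles in that table (the times and sizes are arbitrary, ``--tmp`` space is shared rather than reserved, and this assumes ``sinteractive`` passes these Slurm options through, as in the interactive example further down):

.. code-block:: console

   $ sinteractive --time=2:00:00 --tmp=100G   # SSD/HDD node: want ~100 GB under /tmp/
   $ sinteractive --time=2:00:00 --mem=16G    # ramdisk: --mem must cover the program plus files kept in /dev/shm/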
+
+
+ Background
+ ----------

  A general use pattern:

- - In the beginning of the job, copy needed input from WRKDIR to ``/tmp``.
+ - In the beginning of the job, copy needed input from Scratch to ``/tmp``.
  - Run your calculation normally reading input from or writing output
    to ``/tmp``.
- - In the end copy relevant output to WRKDIR for analysis and further
+ - In the end copy relevant output to Scratch for analysis and further
    usage.

- Pros
+ Pros:

- - You get better and steadier IO performance. WRKDIR is shared over all
-   users making per-user performance actually rather poor.
- - You save performance for WRKDIR to those who cannot use local disks.
+ - You get better and steadier IO performance. Scratch is shared over all
+   users, so per-user performance can be poor at times, especially
+   for many small files.
+ - You save Scratch performance for those who cannot use local disks.
  - You get much better performance when using many small files (Lustre
    works poorly here) or random access.
- - Saves your quota if your code generate lots of data but finally you
-   need only part of it
+ - Saves your quota if your code generates lots of data but you only
+   need to save part of it.
  - In general, it is an excellent choice for single-node runs (that is,
    all the job's tasks run on the same node).

- Cons
+ Cons:

  - NOT for long-term data. Cleaned every time your job is finished.
  - Space is more limited (but still can be TBs on some nodes)
  - Need some awareness of what is on each node, since they are different
  - Small learning curve (must copy files before and after the job).
  - Not feasible for cross-node IO (MPI jobs where different tasks
-   write to the same files). Use WRKDIR instead.
+   write to the same files). Use Scratch instead.
+


+ Usage
+ -----

- How to use local drives on compute nodes
- ----------------------------------------
+ ``/tmp`` is the temporary directory. It is ramdisk on diskless nodes.

- ``/tmp`` is the temporary directory. It is per-user (not per-job), if
- you get two jobs running on the same node, you get the same ``/tmp``.
- It is automatically removed once the last job on a node finishes.
+ It is per-user (not per-job): if you get two jobs running on the same
+ node, you get the same ``/tmp``. Thus, it is wise to ``mkdir
+ /tmp/$SLURM_JOB_ID/`` and use that directory, and delete it once the
+ job is done.
+
+ Everything is automatically removed once the last job on a node
+ finishes.
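A minimal sketch of that per-job-directory habit (the ``trap``-based cleanup is just one convenient way to delete the directory when the script or shell exits):

.. code-block:: bash

   mkdir -p /tmp/$SLURM_JOB_ID             # your own directory inside the shared per-user /tmp
   trap "rm -rf /tmp/$SLURM_JOB_ID" EXIT   # remove it again on exit
   cd /tmp/$SLURM_JOB_ID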


  Nodes with local disks
  ~~~~~~~~~~~~~~~~~~~~~~

- You can see the nodes with local disks on :doc:`../overview`. (To
- double check from within the cluster, you can verify node info with
- ``sinfo show node NODENAME`` and see the ``localdisk`` tag in
- ``slurm features``). Disk sizes greatly vary from hundreds of GB to
- tens of TB.
+ You can see the nodes with local disks on :doc:`../overview`. Disk
+ sizes vary greatly from hundreds of GB (older nodes, when everything
+ had spinning disks) to tens of TB (new GPU nodes designed for ML
+ training).
+
+ .. admonition:: Verifying node details directly through Slurm
+
+    You don't usually need to do this. You can verify node info with
+    ``scontrol show node NODENAME`` and look for ``TmpDisk=`` or
+    ``AvailableFeatures=localdisk``. ``slurm features`` will list all
+    nodes (look for ``localdisk`` in features).
+
+ You can use ``--tmp=nnnG`` (for example ``--tmp=100G``) to request a
+ node with at least that much temporary space. You can use
+ ``--constraint=localdisk`` to ensure a disk of any type, but you may
+ as well just specify how much space you need.
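For instance, a hedged sketch of that check (``NODENAME`` is a placeholder, and ``slurm features`` is the cluster wrapper mentioned above):

.. code-block:: console

   $ scontrol show node NODENAME | grep -E 'TmpDisk|AvailableFeatures'
   $ slurm features | grep localdisk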

- You have to use ``--constraint=localdisk`` to ensure that you get a
- hard disk. You can use ``--tmp=nnnG`` (for example ``--tmp=100G``) to
- request a node with at least that much temporary space. But,
  ``--tmp`` doesn't allocate this space just for you: it's shared among
  all users, including those who didn't request storage space. So,
- you *might* not have as much as you think. Beware and handle out of
- memory gracefully.
+ you *might* not have as much as you think. Beware and handle "out of
+ space" errors gracefully.


  Nodes without local disks
@@ -75,7 +136,7 @@ Nodes without local disks
  You can still use ``/tmp``, but it is an in-memory ramdisk. This
  means it is *very* fast, but is using the actual main memory that is
  used by the programs. It comes out of your job's memory allocation,
- so use a ``--mem`` amount with enough space for your job and any
+ so use a ``--mem=nnG`` amount with enough space for your job and any
  temporary storage.
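A sketch of how that can look in a batch script on a diskless node; ``my_program`` and ``input.dat`` are hypothetical, and ``--mem=8G`` assumes roughly 4 GB of staged files on top of what the program itself needs:

.. code-block:: slurm

   #!/bin/bash
   #SBATCH --time=01:00:00
   #SBATCH --mem=8G                                  # program memory plus the files kept on the ramdisk

   mkdir -p /dev/shm/$SLURM_JOB_ID
   cp $WRKDIR/input.dat /dev/shm/$SLURM_JOB_ID/      # stage data into memory-backed storage
   srun my_program /dev/shm/$SLURM_JOB_ID/input.dat
   rm -rf /dev/shm/$SLURM_JOB_ID                     # free the memory again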
@@ -86,51 +147,70 @@ Examples
  Interactively
  ~~~~~~~~~~~~~

- How to use /tmp when you login interactively
+ How to use /tmp when you log in interactively, for example as space to
+ decompress a big file.

  .. code-block:: console

-    $ sinteractive --time=1:00:00                  # request a node for one hour
-    (node)$ mkdir /tmp/$SLURM_JOB_ID               # create a unique directory, here we use
+    $ sinteractive --time=1:00:00 --tmp=500G       # request a node for one hour
+    (node)$ mkdir /tmp/$SLURM_JOB_ID               # create a unique directory, here we use the job ID
     (node)$ cd /tmp/$SLURM_JOB_ID
     ... do what you wanted ...
-    (node)$ cp your_files $WRKDIR/my/valuable/data # copy what you need
-    (node)$ cd; rm -rf /tmp/$SLURM_JOB_ID          # clean up after yourself
+    (node)$ cp YOUR_FILES $WRKDIR/my/valuable/data # copy what you need
+    (node)$ cd; rm -rf /tmp/$SLURM_JOB_ID          # clean up after yourself
     (node)$ exit

- In batch script
- ~~~~~~~~~~~~~~~

- This batch job example that prevents data loss in case program gets
- terminated (either because of ``scancel`` or due to time limit).
+
+ In batch script - save data if job ends prematurely
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ This batch job example has a trigger (``trap``) that prevents
+ data loss in case the program gets terminated early (either because of
+ ``scancel``, the time limit, or some other error). In case of errors it
+ copies the data to a different location (``$WRKDIR/$SLURM_JOB_ID``)
+ than a normal exit does.

  .. code-block:: slurm
+    :emphasize-lines: 15-17,26-27

-    #!/bin/bash
+    #!/bin/bash
+    #SBATCH --time=12:00:00
+    #SBATCH --mem-per-cpu=2500M      # time and memory requirements
+    #SBATCH --output=test-local.out
+    #SBATCH --tmp=50G

-    #SBATCH --time=12:00:00
-    #SBATCH --mem-per-cpu=2500M      # time and memory requirements
+    # The below, if uncommented, will cause the script to abort (and trap
+    # to run) if there are any unhandled errors.
+    #set -euo pipefail

-    mkdir /tmp/$SLURM_JOB_ID         # get a directory where you will send all output from your program
-    cd /tmp/$SLURM_JOB_ID
+    # get a directory where you will send all output from your program
+    mkdir /tmp/$SLURM_JOB_ID
+    cd /tmp/$SLURM_JOB_ID
+
+    ## set the trap: when killed or exits abnormally you get the
+    ## output copied to $WRKDIR/$SLURM_JOB_ID anyway
+    trap "rsync -a /tmp/$SLURM_JOB_ID/ $WRKDIR/$SLURM_JOB_ID/ ; exit" TERM EXIT

-    ## set the trap: when killed or exits abnormally you get the
-    ## output copied to $WRKDIR/$SLURM_JOB_ID anyway
-    trap "mkdir $WRKDIR/$SLURM_JOB_ID; mv -f /tmp/$SLURM_JOB_ID $WRKDIR/$SLURM_JOB_ID; exit" TERM EXIT
+    ## run the program and redirect all IO to a local drive
+    ## assuming that you have your program and input at $WRKDIR
+    srun $WRKDIR/my_program $WRKDIR/input > output

-    ## run the program and redirect all IO to a local drive
-    ## assuming that you have your program and input at $WRKDIR
-    srun $WRKDIR/my_program $WRKDIR/input > output
+    # move your output fully or partially
+    mv /tmp/$SLURM_JOB_ID/output $WRKDIR/SOMEDIR

-    mv /tmp/$SLURM_JOB_ID/output $WRKDIR/SOMEDIR    # move your output fully or partially
+    # Un-set the trap since we ended successfully
+    trap - TERM EXIT


  Batch script for thousands of input/output files
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- If your job requires a large amount of files as input/output using tar
- utility can greatly reduce the load on the ``$WRKDIR``-filesystem.
+ If your job requires a large amount of files as input/output, you can
+ store the files in a single archive format (``.tar``, ``.zip``, etc.)
+ and unpack them to local storage when needed. This can greatly reduce
+ the load on the scratch filesystem.

  Using methods like this is recommended if you're working with thousands
  of files.
@@ -139,30 +219,43 @@ Working with tar balls is done in a following fashion:

  #. Determine if your input data can be collected into analysis-sized
     chunks that can be (if possible) re-used
- #. Make a tar ball out of the input data (``tar cf <tar filename>.tar
-    <input files>``)
+ #. Make a tar ball out of the input data (``tar cf ARCHIVE_FILENAME.tar
+    INPUT_FILES ...``)
  #. At the beginning of job copy the tar ball into ``/tmp`` and untar it
-    there (``tar xf <tar filename>.tar``)
+    there (``tar xf ARCHIVE_FILENAME.tar``)
  #. Do the analysis here, in the local disk
  #. If output is a large amount of files, tar them and copy them out.
     Otherwise write output to ``$WRKDIR``

  A sample code is below:

  .. code-block:: slurm
+    :emphasize-lines: 10-11,19-24

     #!/bin/bash
-
     #SBATCH --time=12:00:00
     #SBATCH --mem-per-cpu=2000M     # time and memory requirements
-    mkdir /tmp/$SLURM_JOB_ID        # get a directory where you will put your data
-    cp $WRKDIR/input.tar /tmp/$SLURM_JOB_ID   # copy tarred input files
+    #SBATCH --tmp=50G
+
+    # get a directory where you will put your data and change to it
+    mkdir /tmp/$SLURM_JOB_ID
     cd /tmp/$SLURM_JOB_ID

-    trap "rm -rf /tmp/$SLURM_JOB_ID; exit" TERM EXIT   # set the trap: when killed or exits abnormally you clean up your stuff
+    # set the trap: when killed or exits abnormally you clean up your stuff
+    trap "rm -rf /tmp/$SLURM_JOB_ID; exit" TERM EXIT
+
+    # untar the files. If we only unpack once, there is no point in
+    # making an initial copy to local disks.
+    tar xf $WRKDIR/input.tar
+
+    srun MY_PROGRAM input/*         # do the analysis, or whatever else, on the input files

-    tar xf input.tar                # untar the files
-    srun input/*                    # do the analysis, or what ever else
-    tar cf output.tar output/*      # tar output
+    # If you generate many output files, tar them before copying them
+    # back.
+    # If it's just a few files of output, you can copy back directly
+    # (or even output them straight to scratch)
+    tar cf output.tar output/       # tar output (if needed)
     mv output.tar $WRKDIR/SOMEDIR   # copy results back

+    # Un-set the trap since we ended successfully
+    trap - TERM EXIT
