Storage: local drives
=====================

.. admonition:: Abstract

   - Path is ``/tmp/``
   - Local drives are useful for large temporary data or unpacking
     many small files before analysis. They are most important for
     GPU training data but are useful at other times, too.
   - Local storage can be either SSD drives (big and reasonably fast),
     spinning hard disks (HDDs; older nodes), or ramdisk (using your
     job's memory; extremely fast).
   - Request local storage with ``--tmp=NNg`` (the space you think you
     need; but the space isn't reserved just for you).
   - For ramdisk, the space comes out of your ``--mem=`` allocation.

.. seealso::

   :doc:`The storage tutorial <../tut/storage>`.

Local disks on computing nodes are the preferred place for doing
extensive input/output (IO; reading/writing files). The general idea
is to use network storage as a backend and local disk for actual data
processing when it requires many reads or writes. **Different nodes
have different types of disks; Triton is very heterogeneous**:

.. list-table::
   :header-rows: 1

   - - Type
     - Description
     - Requesting
     - Path
   - - Solid-state drives (SSDs)
     - Much faster than HDDs but much slower than ramdisk. Generally,
       GPU nodes have SSDs these days.
     - ``--tmp=NNg``. The space is not guaranteed just for you.
     - ``/tmp/``
   - - Spinning hard disks (HDDs)
     - Generally only older CPU nodes have HDDs.
     - ``--tmp=NNg`` to specify the size you need. The space is not
       guaranteed just for you.
     - ``/tmp/``
   - - Ramdisk
     - Uses your job's memory allocation. Limited space but lightning
       fast.
     - ``--mem=NNg`` to request enough memory for your job and your
       storage.
     - ``/tmp/`` on diskless nodes and ``/dev/shm/`` on every node.

See :doc:`../overview` for details on each node's local storage.

The reason that local storage matters is that :doc:`lustre` (scratch)
is not good for many :doc:`smallfiles`. Read those articles for
background.


Background
----------

A general use pattern:

- At the beginning of the job, copy needed input from Scratch to ``/tmp``.
- Run your calculation normally, reading input from or writing output
  to ``/tmp``.
- At the end, copy the relevant output back to Scratch for analysis and
  further usage.
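
A minimal sketch of this pattern inside a batch script (``MY_PROGRAM``,
``MY_INPUT``, and ``MY_RESULTS`` are placeholders; complete, runnable
examples are in the Examples section below):

.. code-block:: slurm

   mkdir /tmp/$SLURM_JOB_ID && cd /tmp/$SLURM_JOB_ID   # local working directory
   cp $WRKDIR/MY_INPUT .                               # stage input from Scratch
   srun MY_PROGRAM MY_INPUT > output                   # do the heavy IO locally
   cp output $WRKDIR/MY_RESULTS/                       # copy results back to Scratch
   cd; rm -rf /tmp/$SLURM_JOB_ID                       # clean up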

Pros:

- You get better and steadier IO performance. Scratch is shared among
  all users, so per-user performance can be poor at times, especially
  with many small files.
- You leave Scratch's performance for those who cannot use local disks.
- You get much better performance when using many small files (Lustre
  works poorly here) or random access.
- It saves your quota if your code generates lots of data but you only
  need to save part of it.
- In general, it is an excellent choice for single-node runs (that is,
  all of the job's tasks run on the same node).

Cons:

- NOT for long-term data. It is cleaned every time your job finishes.
- Space is more limited (but can still be TBs on some nodes).
- You need some awareness of what is on each node, since they are all
  different.
- Small learning curve (you must copy files before and after the job).
- Not feasible for cross-node IO (MPI jobs where different tasks
  write to the same files). Use Scratch instead.


Usage
-----

``/tmp`` is the temporary directory. It is ramdisk on diskless nodes.

It is per-user (not per-job): if you get two jobs running on the same
node, they share the same ``/tmp``. Thus, it is wise to ``mkdir
/tmp/$SLURM_JOB_ID/``, use that directory, and delete it once the
job is done.

Everything is automatically removed once the last job on a node
finishes.
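
For example (the interactive example further below does essentially this):

.. code-block:: console

   $ mkdir /tmp/$SLURM_JOB_ID        # one directory per job avoids mixing up files
   $ cd /tmp/$SLURM_JOB_ID
   ... work here ...
   $ cd; rm -rf /tmp/$SLURM_JOB_ID   # clean up when the job is done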


Nodes with local disks
~~~~~~~~~~~~~~~~~~~~~~

You can see the nodes with local disks on :doc:`../overview`. Disk
sizes vary greatly, from hundreds of GB (older nodes, from when
everything had spinning disks) to tens of TB (new GPU nodes designed
for ML training).

.. admonition:: Verifying node details directly through Slurm

   You don't usually need to do this. You can verify node info with
   ``scontrol show node NODENAME`` and look for ``TmpDisk=`` or
   ``AvailableFeatures=localdisk``. ``slurm features`` will list all
   nodes (look for ``localdisk`` in the features column).
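
   For example (``NODENAME`` is a placeholder for an actual node name):

   .. code-block:: console

      $ scontrol show node NODENAME | grep -E 'TmpDisk|AvailableFeatures'
      $ sinfo --Node --format='%N %f %d'   # node, features, temporary disk size (MB)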

You can use ``--tmp=nnnG`` (for example ``--tmp=100G``) to request a
node with at least that much temporary space. You can also use
``--constraint=localdisk`` to ensure a disk of any type, but you may
as well just specify how much space you need.

Note that ``--tmp`` doesn't allocate this space just for you: it's
shared among all users, including those who didn't request storage
space. So you *might* not have as much as you think. Beware, and
handle "out of space" errors gracefully.
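
For example, to request an interactive session on a node with roughly
100 GB of local disk space (a sketch; adjust the time and size to your
needs):

.. code-block:: console

   $ sinteractive --time=2:00:00 --tmp=100G
   (node)$ df -h /tmp     # check how much space is actually free right now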


Nodes without local disks
~~~~~~~~~~~~~~~~~~~~~~~~~

You can still use ``/tmp``, but it is an in-memory ramdisk. This
means it is *very* fast, but it uses the actual main memory that is
used by the programs. It comes out of your job's memory allocation,
so use a ``--mem=nnG`` amount with enough space for your job and any
temporary storage.
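
For example, if your program itself needs about 4 GB of memory and you
expect about 10 GB of temporary files, you might request something like
the following (a rough sketch; the numbers are only illustrative):

.. code-block:: slurm

   #!/bin/bash
   #SBATCH --time=01:00:00
   #SBATCH --mem=15G        # ~4G for the program + ~10G of files in /tmp, plus margin

   mkdir /tmp/$SLURM_JOB_ID && cd /tmp/$SLURM_JOB_ID   # ramdisk on diskless nodes
   # ... run your program here, writing temporary files to this directory ...
   cd; rm -rf /tmp/$SLURM_JOB_ID                       # remove the files to free the memory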


Examples
--------

Interactively
~~~~~~~~~~~~~

How to use ``/tmp`` when you log in interactively, for example as space
to decompress a big file:

.. code-block:: console

   $ sinteractive --time=1:00:00 --tmp=500G    # request a node for one hour
   (node)$ mkdir /tmp/$SLURM_JOB_ID            # create a unique directory; here we use the job ID
   (node)$ cd /tmp/$SLURM_JOB_ID
   ... do what you wanted ...
   (node)$ cp YOUR_FILES $WRKDIR/my/valuable/data   # copy what you need
   (node)$ cd; rm -rf /tmp/$SLURM_JOB_ID       # clean up after yourself
   (node)$ exit


In batch script - save data if job ends prematurely
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This batch job example has a trigger (``trap``) that prevents data
loss in case the program gets terminated early (either because of
``scancel``, the time limit, or some other error). If the job ends
with an error, the data is copied to a different location
(``$WRKDIR/$SLURM_JOB_ID``) than after a normal exit.

.. code-block:: slurm
   :emphasize-lines: 15-17,26-27

   #!/bin/bash
   #SBATCH --time=12:00:00
   #SBATCH --mem-per-cpu=2500M       # time and memory requirements
   #SBATCH --output=test-local.out
   #SBATCH --tmp=50G

   # The below, if uncommented, will cause the script to abort (and the trap
   # to run) if there are any unhandled errors.
   #set -euo pipefail

   # get a directory where you will send all output from your program
   mkdir /tmp/$SLURM_JOB_ID
   cd /tmp/$SLURM_JOB_ID

   ## set the trap: when the job is killed or exits abnormally you get the
   ## output copied to $WRKDIR/$SLURM_JOB_ID anyway
   trap "rsync -a /tmp/$SLURM_JOB_ID/ $WRKDIR/$SLURM_JOB_ID/ ; exit" TERM EXIT

   ## run the program and redirect all IO to a local drive,
   ## assuming that you have your program and input at $WRKDIR
   srun $WRKDIR/my_program $WRKDIR/input > output

   # move your output fully or partially
   mv /tmp/$SLURM_JOB_ID/output $WRKDIR/SOMEDIR

   # Un-set the trap since we ended successfully
   trap - TERM EXIT


Batch script for thousands of input/output files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If your job requires a large number of files as input/output, you can
store the files in a single archive (``.tar``, ``.zip``, etc.) and
unpack them to local storage when needed. This can greatly reduce the
load on the scratch filesystem.

Using methods like this is recommended if you're working with thousands
of files.

Working with tar balls is done in the following fashion:

#. Determine if your input data can be collected into analysis-sized
   chunks that can be (if possible) re-used.
#. Make a tar ball out of the input data (``tar cf ARCHIVE_FILENAME.tar
   INPUT_FILES ...``).
#. At the beginning of the job, copy the tar ball into ``/tmp`` and
   untar it there (``tar xf ARCHIVE_FILENAME.tar``).
#. Do the analysis there, on the local disk.
#. If the output is a large number of files, tar them and copy them out.
   Otherwise write the output to ``$WRKDIR``.

A sample script is below:

.. code-block:: slurm
   :emphasize-lines: 10-11,19-24

   #!/bin/bash
   #SBATCH --time=12:00:00
   #SBATCH --mem-per-cpu=2000M       # time and memory requirements
   #SBATCH --tmp=50G

   # get a directory where you will put your data and change to it
   mkdir /tmp/$SLURM_JOB_ID
   cd /tmp/$SLURM_JOB_ID

   # set the trap: when the job is killed or exits abnormally, clean up your stuff
   trap "rm -rf /tmp/$SLURM_JOB_ID; exit" TERM EXIT

   # untar the files.  If we only unpack once, there is no point in
   # making an initial copy to the local disk.
   tar xf $WRKDIR/input.tar

   srun MY_PROGRAM input/*           # do the analysis, or whatever else, on the input files

   # If you generate many output files, tar them before copying them
   # back.
   # If it's just a few files of output, you can copy them back directly
   # (or even write them straight to scratch).
   tar cf output.tar output/         # tar the output (if needed)
   mv output.tar $WRKDIR/SOMEDIR     # copy results back

   # Un-set the trap since we ended successfully
   trap - TERM EXIT