
Commit 4c8fc3c

Merge pull request #790 from AaltoSciComp/rkdarst/localstorage
triton/usage/localstorage: Big update
2 parents ebd3a83 + 21d50c4

4 files changed: +163 −71 lines

triton/ref/slurm.rst

Lines changed: 1 addition & 2 deletions
@@ -34,8 +34,7 @@
  ! ``-e ERRORFILE`` ! print errors into file *error*
  ! ``--exclusive`` ! allocate exclusive access to nodes. For large parallel jobs.
  ! ``--constraint=FEATURE`` ! request *feature* (see ``slurm features`` for the current list of configured features, or Arch under the :ref:`hardware list <hardware-list>`). Multiple with ``--constraint="hsw|skl"``.
- ! ``--constraint=localdisk`` ! request nodes that have local disks
- ! ``--tmp=nnnG`` ! Request ``nnn`` GB of :doc:`local disk storage space </triton/usage/localstorage>`
+ ! ``--tmp=nnnG`` ! request a node with :doc:`local disk storage space </triton/usage/localstorage>` and ``nnn`` GB of space on it.
  ! ``--array=0-5,7,10-15`` ! Run job multiple times, use variable ``$SLURM_ARRAY_TASK_ID`` to adjust parameters.
  ! ``--mail-type=TYPE`` ! notify of events: ``BEGIN``, ``END``, ``FAIL``, ``ALL``, ``REQUEUE`` (not on triton) or ``ALL``. Must be used together with ``--mail-user=``.
  ! ``[email protected]`` ! Aalto email to send the notification about the job. External email addresses don't work.
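A minimal sketch of how the revised ``--tmp`` row is meant to be used (the time, size, and script name here are arbitrary placeholders):

.. code-block:: console

   $ sbatch --time=01:00:00 --tmp=10G my_job.sh   # ask for a node with about 10 GB of local disk under /tmp/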

triton/ref/storage.rst

Lines changed: 1 addition & 1 deletion
@@ -6,5 +6,5 @@
  Home | ``$HOME`` or ``/home/USERNAME/`` | hard quota 10GB | Nightly | all nodes | Small user specific files, no calculation data.
  Work | ``$WRKDIR`` or ``/scratch/work/USERNAME/`` | 200GB and 1 million files | x | all nodes | Personal working space for every user. Calculation data etc. Quota can be increased on request.
  Scratch | ``/scratch/DEPT/PROJECT/`` | on request | x | all nodes | Department/group specific project directories.
- :doc:`Local temp (disk) </triton/usage/localstorage>` | ``/tmp/`` (nodes with disks only) | local disk size | x | single-node | (Usually fastest) place for single-node calculation data. Removed once user's jobs are finished on the node. Request with ``--tmp=nnnG`` or ``--constraint=localdisk``.
+ :doc:`Local temp (disk) </triton/usage/localstorage>` | ``/tmp/`` (nodes with disks only) | local disk size | x | single-node | (Usually fastest) place for single-node calculation data. Removed once user's jobs are finished on the node. Request with ``--tmp=nnnG``.
  :doc:`Local temp (ramfs) </triton/usage/localstorage>` | ``/dev/shm/`` (and ``/tmp/`` on diskless nodes) | limited by memory | x | single-node | Very fast but small in-memory filesystem

triton/tut/storage.rst

Lines changed: 2 additions & 2 deletions
@@ -32,8 +32,8 @@ choose between them. The
  (recommended for most work)

  * ``/tmp``: temporary local disk space, per-user mounted in jobs and
-   automatically cleaned up. Only on nodes with disks
-   (``--constraint=localdisk``), otherwise it's ramfs
+   automatically cleaned up. Use ``--tmp=nnnG`` to request at
+   least ``nnn`` GB of space, otherwise it's ramfs
  * ``/dev/shm``: ramfs, in-memory file storage

  * See :doc:`remotedata` for how to transfer and access the data

triton/usage/localstorage.rst

Lines changed: 159 additions & 66 deletions
@@ -2,71 +2,132 @@
  Storage: local drives
  =====================

- .. seealso::
+ .. admonition:: Abstract
+
+    - Path is ``/tmp/``
+    - Local drives are useful for large temporary data or unpacking
+      many small files before analysis. They are most important for
+      GPU training data but are useful at other times, too.
+    - Local storage can be either SSD drives (big and reasonably fast),
+      spinning hard disks (HDDs; older nodes), or ramdisk (using your
+      job's memory; extremely fast).
+    - Request local storage with ``--tmp=NNg`` (the space you think you
+      need; but the space isn't reserved just for you).
+    - For ramdisk, the space comes out of your ``--mem=`` allocation.

-    :doc:`the storage tutorial <../tut/storage>`.
+ .. seealso::

- Local disks on computing nodes are the preferred place for doing your
- IO. The general idea is use network storage as a backend and local disk
- for actual data processing. **Some nodes have no disks** (local
- storage comes out of the job memory, **some older nodes have HDDs**
- (spinning disks), and some **SSDs**.
+    :doc:`The storage tutorial <../tut/storage>`.
+
+ Local disks on computing nodes are the preferred place for doing
+ extensive input/output (IO; reading/writing files). The general idea
+ is to use network storage as a backend and local disk for actual data
+ processing when it requires many reads or writes. **Different nodes
+ have different types of disks; Triton is very heterogeneous**:
+
+ .. list-table::
+    :header-rows: 1
+
+    - - Type
+      - Description
+      - Requesting
+      - Path
+    - - Solid-state drives (SSDs)
+      - Much faster than HDDs but much slower than ramdisk. Generally
+        GPU nodes have SSDs these days.
+      - ``--tmp=NNg``. The space is not guaranteed just for you.
+      - ``/tmp/``
+    - - Spinning hard disks (HDDs)
+      - Generally only older CPU nodes have HDDs.
+      - ``--tmp=NNg`` to specify the size you need. The space is not
+        guaranteed just for you.
+      - ``/tmp/``
+    - - Ramdisk
+      - Uses your job's memory allocation. Limited space but lightning
+        fast.
+      - ``--mem=NNg`` to request enough memory for your job and your
+        storage.
+      - ``/tmp/`` on diskless nodes and ``/dev/shm/`` on every node.
+
+ See :doc:`../overview` for details on each node's local storage.
+
+ The reason that local storage matters is that :doc:`lustre` (scratch)
+ is not good for many :doc:`smallfiles`. Read those articles for
+ background.
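As a rough sketch of the two request styles in that table (the times and sizes are arbitrary, ``--tmp`` space is shared rather than reserved, and this assumes ``sinteractive`` passes these Slurm options through, as in the interactive example further down):

.. code-block:: console

   $ sinteractive --time=2:00:00 --tmp=100G   # SSD/HDD node: want ~100 GB under /tmp/
   $ sinteractive --time=2:00:00 --mem=16G    # ramdisk: --mem must cover the program plus files kept in /dev/shm/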
+
+
+ Background
+ ----------

  A general use pattern:

- - In the beginning of the job, copy needed input from WRKDIR to ``/tmp``.
+ - In the beginning of the job, copy needed input from Scratch to ``/tmp``.
  - Run your calculation normally reading input from or writing output
    to ``/tmp``.
- - In the end copy relevant output to WRKDIR for analysis and further
+ - In the end copy relevant output to Scratch for analysis and further
    usage.

- Pros
+ Pros:

- - You get better and steadier IO performance. WRKDIR is shared over all
-   users making per-user performance actually rather poor.
- - You save performance for WRKDIR to those who cannot use local disks.
+ - You get better and steadier IO performance. Scratch is shared over all
+   users, so per-user performance can be poor at times, especially
+   for many small files.
+ - You save Scratch performance for those who cannot use local disks.
  - You get much better performance when using many small files (Lustre
    works poorly here) or random access.
- - Saves your quota if your code generate lots of data but finally you
-   need only part of it
+ - Saves your quota if your code generates lots of data but you only
+   need to save part of it.
  - In general, it is an excellent choice for single-node runs (that is,
    all the job's tasks run on the same node).

- Cons
+ Cons:

  - NOT for long-term data. Cleaned every time your job is finished.
  - Space is more limited (but still can be TBs on some nodes)
  - Need some awareness of what is on each node, since they are different
  - Small learning curve (must copy files before and after the job).
  - Not feasible for cross-node IO (MPI jobs where different tasks
-   write to the same files). Use WRKDIR instead.
+   write to the same files). Use Scratch instead.
+


+ Usage
+ -----

- How to use local drives on compute nodes
- ----------------------------------------
+ ``/tmp`` is the temporary directory. It is ramdisk on diskless nodes.

- ``/tmp`` is the temporary directory. It is per-user (not per-job), if
- you get two jobs running on the same node, you get the same ``/tmp``.
- It is automatically removed once the last job on a node finishes.
+ It is per-user (not per-job): if you get two jobs running on the same
+ node, you get the same ``/tmp``. Thus, it is wise to ``mkdir
+ /tmp/$SLURM_JOB_ID/`` and use that directory, and delete it once the
+ job is done.
+
+ Everything is automatically removed once the last job on a node
+ finishes.
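A minimal sketch of that per-job-directory habit (the ``trap``-based cleanup is just one convenient way to delete the directory when the script or shell exits):

.. code-block:: bash

   mkdir -p /tmp/$SLURM_JOB_ID             # your own directory inside the shared per-user /tmp
   trap "rm -rf /tmp/$SLURM_JOB_ID" EXIT   # remove it again on exit
   cd /tmp/$SLURM_JOB_ID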


  Nodes with local disks
  ~~~~~~~~~~~~~~~~~~~~~~

- You can see the nodes with local disks on :doc:`../overview`. (To
- double check from within the cluster, you can verify node info with
- ``sinfo show node NODENAME`` and see the ``localdisk`` tag in
- ``slurm features``). Disk sizes greatly vary from hundreds of GB to
- tens of TB.
+ You can see the nodes with local disks on :doc:`../overview`. Disk
+ sizes vary greatly from hundreds of GB (older nodes, when everything
+ had spinning disks) to tens of TB (new GPU nodes designed for ML
+ training).
+
+ .. admonition:: Verifying node details directly through Slurm
+
+    You don't usually need to do this. You can verify node info with
+    ``scontrol show node NODENAME`` and look for ``TmpDisk=`` or
+    ``AvailableFeatures=localdisk``. ``slurm features`` will list all
+    nodes (look for ``localdisk`` in features).
+
+ You can use ``--tmp=nnnG`` (for example ``--tmp=100G``) to request a
+ node with at least that much temporary space. You can use
+ ``--constraint=localdisk`` to ensure a disk of any type, but you may
+ as well just specify how much space you need.
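For instance, a hedged sketch of that check (``NODENAME`` is a placeholder, and ``slurm features`` is the cluster wrapper mentioned above):

.. code-block:: console

   $ scontrol show node NODENAME | grep -E 'TmpDisk|AvailableFeatures'
   $ slurm features | grep localdisk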

- You have to use ``--constraint=localdisk`` to ensure that you get a
- hard disk. You can use ``--tmp=nnnG`` (for example ``--tmp=100G``) to
- request a node with at least that much temporary space. But,
  ``--tmp`` doesn't allocate this space just for you: it's shared among
  all users, including those who didn't request storage space. So,
- you *might* not have as much as you think. Beware and handle out of
- memory gracefully.
+ you *might* not have as much as you think. Beware and handle "out of
+ space" errors gracefully.


  Nodes without local disks
@@ -75,7 +136,7 @@ Nodes without local disks
  You can still use ``/tmp``, but it is an in-memory ramdisk. This
  means it is *very* fast, but is using the actual main memory that is
  used by the programs. It comes out of your job's memory allocation,
- so use a ``--mem`` amount with enough space for your job and any
+ so use a ``--mem=nnG`` amount with enough space for your job and any
  temporary storage.
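A sketch of how that can look in a batch script on a diskless node; ``my_program`` and ``input.dat`` are hypothetical, and ``--mem=8G`` assumes roughly 4 GB of staged files on top of what the program itself needs:

.. code-block:: slurm

   #!/bin/bash
   #SBATCH --time=01:00:00
   #SBATCH --mem=8G                                  # program memory plus the files kept on the ramdisk

   mkdir -p /dev/shm/$SLURM_JOB_ID
   cp $WRKDIR/input.dat /dev/shm/$SLURM_JOB_ID/      # stage data into memory-backed storage
   srun my_program /dev/shm/$SLURM_JOB_ID/input.dat
   rm -rf /dev/shm/$SLURM_JOB_ID                     # free the memory again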
@@ -86,51 +147,70 @@ Examples
  Interactively
  ~~~~~~~~~~~~~

- How to use /tmp when you login interactively
+ How to use /tmp when you log in interactively, for example as space to
+ decompress a big file.

  .. code-block:: console

-    $ sinteractive --time=1:00:00                  # request a node for one hour
-    (node)$ mkdir /tmp/$SLURM_JOB_ID               # create a unique directory, here we use
+    $ sinteractive --time=1:00:00 --tmp=500G       # request a node for one hour
+    (node)$ mkdir /tmp/$SLURM_JOB_ID               # create a unique directory, here we use the job ID
     (node)$ cd /tmp/$SLURM_JOB_ID
     ... do what you wanted ...
-    (node)$ cp your_files $WRKDIR/my/valuable/data # copy what you need
-    (node)$ cd; rm -rf /tmp/$SLURM_JOB_ID          # clean up after yourself
+    (node)$ cp YOUR_FILES $WRKDIR/my/valuable/data # copy what you need
+    (node)$ cd; rm -rf /tmp/$SLURM_JOB_ID          # clean up after yourself
     (node)$ exit

- In batch script
- ~~~~~~~~~~~~~~~

- This batch job example that prevents data loss in case program gets
- terminated (either because of ``scancel`` or due to time limit).
+
+ In batch script - save data if job ends prematurely
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ This batch job example has a trigger (``trap``) that prevents
+ data loss in case the program gets terminated early (either because of
+ ``scancel``, the time limit, or some other error). In case of errors it
+ copies the data to a different location (``$WRKDIR/$SLURM_JOB_ID``)
+ than a normal exit does.

  .. code-block:: slurm
+    :emphasize-lines: 15-17,26-27

-    #!/bin/bash
+    #!/bin/bash
+    #SBATCH --time=12:00:00
+    #SBATCH --mem-per-cpu=2500M      # time and memory requirements
+    #SBATCH --output=test-local.out
+    #SBATCH --tmp=50G

-    #SBATCH --time=12:00:00
-    #SBATCH --mem-per-cpu=2500M      # time and memory requirements
+    # The below, if uncommented, will cause the script to abort (and trap
+    # to run) if there are any unhandled errors.
+    #set -euo pipefail

-    mkdir /tmp/$SLURM_JOB_ID         # get a directory where you will send all output from your program
-    cd /tmp/$SLURM_JOB_ID
+    # get a directory where you will send all output from your program
+    mkdir /tmp/$SLURM_JOB_ID
+    cd /tmp/$SLURM_JOB_ID
+
+    ## set the trap: when killed or exits abnormally you get the
+    ## output copied to $WRKDIR/$SLURM_JOB_ID anyway
+    trap "rsync -a /tmp/$SLURM_JOB_ID/ $WRKDIR/$SLURM_JOB_ID/ ; exit" TERM EXIT

-    ## set the trap: when killed or exits abnormally you get the
-    ## output copied to $WRKDIR/$SLURM_JOB_ID anyway
-    trap "mkdir $WRKDIR/$SLURM_JOB_ID; mv -f /tmp/$SLURM_JOB_ID $WRKDIR/$SLURM_JOB_ID; exit" TERM EXIT
+    ## run the program and redirect all IO to a local drive
+    ## assuming that you have your program and input at $WRKDIR
+    srun $WRKDIR/my_program $WRKDIR/input > output

-    ## run the program and redirect all IO to a local drive
-    ## assuming that you have your program and input at $WRKDIR
-    srun $WRKDIR/my_program $WRKDIR/input > output
+    # move your output fully or partially
+    mv /tmp/$SLURM_JOB_ID/output $WRKDIR/SOMEDIR

-    mv /tmp/$SLURM_JOB_ID/output $WRKDIR/SOMEDIR    # move your output fully or partially
+    # Un-set the trap since we ended successfully
+    trap - TERM EXIT


  Batch script for thousands of input/output files
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- If your job requires a large amount of files as input/output using tar
- utility can greatly reduce the load on the ``$WRKDIR``-filesystem.
+ If your job requires a large amount of files as input/output, you can
+ store the files in a single archive format (``.tar``, ``.zip``, etc.)
+ and unpack them to local storage when needed. This can greatly reduce
+ the load on the scratch filesystem.

  Using methods like this is recommended if you're working with thousands
  of files.
@@ -139,30 +219,43 @@ Working with tar balls is done in a following fashion:

  #. Determine if your input data can be collected into analysis-sized
     chunks that can be (if possible) re-used
- #. Make a tar ball out of the input data (``tar cf <tar filename>.tar
-    <input files>``)
+ #. Make a tar ball out of the input data (``tar cf ARCHIVE_FILENAME.tar
+    INPUT_FILES ...``)
  #. At the beginning of job copy the tar ball into ``/tmp`` and untar it
-    there (``tar xf <tar filename>.tar``)
+    there (``tar xf ARCHIVE_FILENAME.tar``)
  #. Do the analysis here, in the local disk
  #. If output is a large amount of files, tar them and copy them out.
     Otherwise write output to ``$WRKDIR``

  A sample code is below:

  .. code-block:: slurm
+    :emphasize-lines: 10-11,19-24

     #!/bin/bash
-
     #SBATCH --time=12:00:00
     #SBATCH --mem-per-cpu=2000M     # time and memory requirements
-    mkdir /tmp/$SLURM_JOB_ID        # get a directory where you will put your data
-    cp $WRKDIR/input.tar /tmp/$SLURM_JOB_ID   # copy tarred input files
+    #SBATCH --tmp=50G
+
+    # get a directory where you will put your data and change to it
+    mkdir /tmp/$SLURM_JOB_ID
     cd /tmp/$SLURM_JOB_ID

-    trap "rm -rf /tmp/$SLURM_JOB_ID; exit" TERM EXIT   # set the trap: when killed or exits abnormally you clean up your stuff
+    # set the trap: when killed or exits abnormally you clean up your stuff
+    trap "rm -rf /tmp/$SLURM_JOB_ID; exit" TERM EXIT
+
+    # untar the files. If we only unpack once, there is no point in
+    # making an initial copy to local disks.
+    tar xf $WRKDIR/input.tar
+
+    srun MY_PROGRAM input/*         # do the analysis, or whatever else, on the input files

-    tar xf input.tar                # untar the files
-    srun input/*                    # do the analysis, or what ever else
-    tar cf output.tar output/*      # tar output
+    # If you generate many output files, tar them before copying them
+    # back.
+    # If it's just a few files of output, you can copy back directly
+    # (or even output them straight to scratch)
+    tar cf output.tar output/       # tar output (if needed)
     mv output.tar $WRKDIR/SOMEDIR   # copy results back

+    # Un-set the trap since we ended successfully
+    trap - TERM EXIT
