Add lustre tuning guide (#166)

fawzi · web-flow · commit e796cc05b6d6 · 2025-06-25T18:12:07.000+02:00
diff --git a/docs/alps/storage.md b/docs/alps/storage.md
@@ -19,6 +19,8 @@ HPC storage is provided by independent clusters, composed of servers and physica
 
 Capstor and Iopsstor are on the same Slingshot network as Alps, while VAST is on the CSCS Ethernet network.
 
+See the [Lustre guide][ref-guides-storage-lustre] for some hints on how to get the best performance out of the filesystem.
+
 The mounts, and how they are used for Scratch, Store, and Home file systems that are mounted on clusters are documented in the [file system docs][ref-storage-fs].
 
 [](){#ref-alps-capstor}
diff --git a/docs/guides/storage.md b/docs/guides/storage.md
@@ -113,10 +113,52 @@ To set up a default so all newly created folders and dirs inside or your desired
 !!! info
     For more information read the setfacl man page: `man setfacl`.
 
+[](){#ref-guides-storage-lustre}
+## Lustre tuning
+[Capstor][ref-alps-capstor] and [Iopsstor][ref-alps-iopsstor] are both [Lustre](https://lustre.org) filesystems.
+
+![Lustre architecture](../images/storage/lustre.png)
+
+As shown in the diagram above, Lustre uses *metadata* servers to store and query metadata, which is essentially what `ls` shows: directory structure, file permissions, and modification dates.
+Metadata performance is roughly the same on [Capstor][ref-alps-capstor] and [Iopsstor][ref-alps-iopsstor].
+This data is globally synchronized, which means Lustre is not well suited to handling many small files; see the discussion on [how to handle many small files][ref-guides-storage-small-files].
+
+The data itself is subdivided into blocks of size `<blocksize>` and is stored by Object Storage Servers (OSS) in one or more Object Storage Targets (OST).
+The blocksize and the number of OSTs to use are defined by the striping settings, which are applied to a path, with new files and directories inheriting them from their parent directory.
+The `lfs getstripe <path>` command can be used to get information on the stripe settings of a path.
+For directories and empty files, `lfs setstripe --stripe-count <count> --stripe-size <size> <directory/file>` can be used to set the layout.
+The simplest way to give an existing file the correct layout is to copy it into a directory that already has that layout.
+
+!!! tip "A blocksize of 4MB gives good throughput, without being overly big..."
+    ... so it is a good choice when reading a file sequentially or in large chunks. If one reads shorter chunks in random order, it might be better to reduce the size: the raw throughput will be lower, but the performance of your application might actually increase.
+    See the [Lustre documentation](https://doc.lustre.org/lustre_manual.xhtml#managingstripingfreespace) for more information.
+
+
+!!! example "Settings for large files"
+    ```console
+    lfs setstripe --stripe-count -1 --stripe-size 4M <big_files_dir>
+    ```
+Lustre also supports composite layouts, which switch from one layout to another at a given file size, set with `--component-end` (`-E`).
+This makes it possible to create a progressive file layout that changes `--stripe-count` (`-c`) and `--stripe-size` (`-S`) as a file grows, so that fewer locks are required for smaller files, while the load is distributed for larger files.
+
+!!! example "Good default settings"
+    ```console
+    lfs setstripe -E 4M -c 1 -E 64M -c 4 -E -1 -c -1 -S 4M <base_dir>
+    ```
+
+### Iopsstor vs Capstor
+
+[Iopsstor][ref-alps-iopsstor] uses SSDs as OSTs, so random access is quick and the performance of a single OST is high.
+[Capstor][ref-alps-capstor], on the other hand, uses hard disks; it has a larger capacity and many more OSSs, so its total bandwidth is larger.
+See for example the [ML filesystem guide][ref-mlp-storage-suitability].
+
+[](){#ref-guides-storage-small-files}
 ## Many small files vs. HPC File Systems
 
 Workloads that read or create many small files are not well-suited to parallel file systems, which are designed for parallel and distributed I/O.
 
+In some cases, and if enough memory is available, it might be worth unpacking or repacking the small files into in-memory filesystems like `/dev/shm/$USER` or `/tmp`, which are *much* faster, or using a squashfs filesystem that is stored as a single large file on Lustre.
+
 Workloads that do not play nicely with Lustre include:
 
 * Configuration and compiling applications.
diff --git a/docs/images/storage/lustre.png b/docs/images/storage/lustre.png
diff --git a/docs/platforms/mlp/index.md b/docs/platforms/mlp/index.md
@@ -52,17 +52,19 @@ Use scratch to store datasets that will be accessed by jobs, and for job output.
 Scratch is per user - each user gets separate scratch path and quota.
 
 * The environment variable `SCRATCH=/iopsstor/scratch/cscs/$USER` is set automatically when you log into the system, and can be used as a shortcut to access scratch.
-* There is an additional scratch path mounted on [Capstor][ref-alps-capstor] at `/capstor/scratch/cscs/$USER`. 
+* There is an additional scratch path mounted on [Capstor][ref-alps-capstor] at `/capstor/scratch/cscs/$USER`.
 
 !!! warning "scratch cleanup policy"
     Files that have not been accessed in 30 days are automatically deleted.
 
     **Scratch is not intended for permanent storage**: transfer files back to the capstor project storage after job runs.
 
+[](){#ref-mlp-storage-suitability}
 !!! note "file system suitability"
     The Capstor scratch filesystem is based on HDDs and is optimized for large, sequential read and write operations.
     We recommend using Capstor for storing **checkpoint files** and other **large, contiguous outputs** generated by your training runs.
     In contrast, Iopstor uses high-performance NVMe drives, which excel at handling **IOPS-intensive workloads** involving frequent, random access. This makes it a better choice for storing **training datasets**, especially when accessed randomly during machine learning training.
+    See the [Lustre guide][ref-guides-storage-lustre] for some hints on how to get the best performance out of the filesystem.
 
 ### Scratch Usage Recommendations
 
diff --git a/docs/storage/filesystems.md b/docs/storage/filesystems.md
@@ -84,6 +84,7 @@ Daily [snapshots][ref-storage-snapshots] for the last seven days are provided in
 ## Scratch
 
 The Scratch file system is a fast workspace tuned for use by parallel jobs, with an emphasis on performance over reliability, hosted on the [Capstor][ref-alps-capstor] Lustre filesystem.
+See the [Lustre guide][ref-guides-storage-lustre] for some hints on how to get the best performance out of the filesystem.
 
 All users on Alps get their own Scratch path, `/capstor/scratch/cscs/$USER`, which is pointed to by the variable `$SCRATCH` on the [HPC Platform][ref-platform-hpcp] and [Climate and Weather Platform][ref-platform-cwp] clusters Eiger, Daint and Santis.
 
@@ -123,6 +124,7 @@ Please ensure that you move important data to a file system with backups, for ex
 ## Store
 
 Store is a large, medium-performance, storage on the [Capstor][ref-alps-capstor] Lustre file system for sharing data within a project, and for medium term data storage.
+See the [Lustre guide][ref-guides-storage-lustre] for some hints on how to get the best performance out of the filesystem.
 
 Space on Store is allocated per-project, with a path created for each project.
 To accomodate the different customers and projects on Alps, the project paths are organised as follows: