32 changes: 32 additions & 0 deletions docs/guides/lustre-tuning.md
@@ -0,0 +1,32 @@
# Lustre Tuning
`/capstor/` and `/iopsstor` are both [lustre](https://lustre.org) filesystem.
Collaborator

Suggested change
# Lustre Tuning
`/capstor/` and `/iopsstor` are both [lustre](https://lustre.org) filesystem.
# Lustre tuning
[Capstor][ref-alps-capstor] and [Iopsstor][ref-alps-iopsstor] are both [Lustre](https://lustre.org) filesystems.

In general, do have a quick read of https://docs.cscs.ch/contributing/ for some guidelines on formatting etc.

Contributor Author

Changed references, merged with storage guide, and added some cross references.

Lustre is an open-source, parallel file system used in HPC systems.
As shown in ![Lustre architecture](/images/storage/lustre.png), Lustre uses *metadata* servers to store and query metadata, which is essentially what is shown by `ls`: directory structure, file permissions, modification dates, etc.
Collaborator

This image doesn't show up in the rendered output. Could you make sure that it's correctly rendered?

Contributor Author

I think it is because it refers to a path that will only contain the image once this is merged; it works when serving locally. Or should I add the image explicitly somewhere?

This data is globally synchronized, which means that handling many small files is not especially suited for lustre, and the perfomrance of that part is similar on both capstor and iopsstor.
Collaborator

Suggested change
This data is globally synchronized, which means that handling many small files is not especially suited for lustre, and the perfomrance of that part is similar on both capstor and iopsstor.
This data is globally synchronized, which means that handling many small files is not especially suited for Lustre, and the perfomrance of that part is similar on both capstor and iopsstor.

here and elsewhere.

Collaborator

Suggested change
This data is globally synchronized, which means that handling many small files is not especially suited for lustre, and the perfomrance of that part is similar on both capstor and iopsstor.
This data is globally synchronized, which means that handling many small files is not well suited for lustre.

Leave out the second part? Iopsstor should handle small files slightly better than Capstor, no? I may very well also be mistaken, in which case it's obviously good to leave it in.

Contributor Author

The performance of the metadata part is roughly the same on iopsstor and capstor; it depends on the number of metadata servers and the users adding load to them. For a while, backups were putting extra load on capstor. Writing many small files might indeed be slightly better on iopsstor because SSDs are faster.
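
For readers who want to check this themselves, `lfs df` lists the metadata targets (MDTs) and object storage targets (OSTs) behind a mount point; a minimal sketch, assuming the usual `/capstor` and `/iopsstor` mount points:

```console
lfs df -h /capstor     # lists the MDTs and OSTs backing the filesystem and their usage
lfs df -h /iopsstor
```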

With many small files, a local filesystems like `/dev/shmem/$USER` or "/tmp", if enough memory can be spared for it, can be *much* faster, and offset the packing/unpacking work. Alternatively using a squashed filesystems can be a good option.
Collaborator

Suggested change
With many small files, a local filesystems like `/dev/shmem/$USER` or "/tmp", if enough memory can be spared for it, can be *much* faster, and offset the packing/unpacking work. Alternatively using a squashed filesystems can be a good option.
With many small files, an in-memory filesystem like `/dev/shm/$USER` or "/tmp", if enough memory can be spared for it, can be *much* faster, and offset the packing/unpacking work. Alternatively using a squashed filesystems can be a good option.

?

Collaborator

There's a lot of overlap with https://docs.cscs.ch/guides/storage/#many-small-files-vs-hpc-file-systems. Can we link or avoid some duplication of the motivation etc.?

Actually, do you think this guide could fit inside the existing storage guide?
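
To make the small-file advice above concrete, here is a minimal sketch of staging a dataset of many small files into node-local memory before a job reads it; the tar file location and the `/dev/shm/$USER` target are assumptions, not prescribed paths:

```console
# Unpack many small files into memory-backed storage on the compute node
mkdir -p /dev/shm/$USER
tar -xf /capstor/scratch/cscs/$USER/dataset.tar -C /dev/shm/$USER

# ... run the application against /dev/shm/$USER ...

# Free the memory when done
rm -rf /dev/shm/$USER
```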


The data itself is subdivided into blocks of size `<blocksize>` and is stored by Object Storage Servers (OSS) in one or more Object Storage Targets (OSTs).
The blocksize and number of OSTs to use is defined by the striping settings. A new file or directory ihnerits them from its parent directory. The `lfs getstripe <path>` command can be used to get information on the actual stripe settings. For directories and empty files `lfs setstripe --stripe-count <count> --stripe-size <size> <directory/file>` can be used to set the layout. The simplest way to have the correct layout is to copy to a directory with the correct layout
Collaborator

Suggested change
The blocksize and number of OSTs to use is defined by the striping settings. A new file or directory ihnerits them from its parent directory. The `lfs getstripe <path>` command can be used to get information on the actual stripe settings. For directories and empty files `lfs setstripe --stripe-count <count> --stripe-size <size> <directory/file>` can be used to set the layout. The simplest way to have the correct layout is to copy to a directory with the correct layout
The blocksize and number of OSTs to use is defined by the striping settings.
A new file or directory ihnerits them from its parent directory. The `lfs getstripe <path>` command can be used to get information on the actual stripe settings.
For directories and empty files `lfs setstripe --stripe-count <count> --stripe-size <size> <directory/file>` can be used to set the layout.
The simplest way to have the correct layout is to copy to a directory with the correct layout.
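
As a quick illustration of the commands discussed above (`<project_dir>` and `<old_file>` are placeholders):

```console
# Inspect the current striping of a directory
lfs getstripe <project_dir>

# Stripe new files created in the directory over 4 OSTs with a 4 MB stripe size
lfs setstripe --stripe-count 4 --stripe-size 4M <project_dir>

# Existing files keep their old layout; copying them into the directory
# creates new files that inherit the new settings
cp <old_file> <project_dir>/
```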


A blocksize of 4MB gives good throughput without being overly big, so it is a good choice when reading a file sequentially or in large chunks.
If shorter chunks are read in random order, it might be better to reduce the stripe size: the raw throughput will be lower, but the performance of your application might actually increase.
See the [Lustre manual](https://doc.lustre.org/lustre_manual.xhtml#managingstripingfreespace) for more details on managing striping.

!!! example "Good large files settings"
```console
lfs setstripe --stripe-count -1 --stripe-size 4M <big_files_dir>
```

Lustre also supports composite layouts, switching from one layout to another at a given size `--component-end` (`-E`).
With this it is possible to create a Progressive File Layout that switches `--stripe-count` (`-c`) and `--stripe-size` (`-S`), so that fewer locks are required for smaller files while the load is distributed for larger files.

!!! example "Good default settings"
```console
lfs setstripe -E 4M -c 1 -E 64M -c 4 -E -1 -c -1 -S 4M <base_dir>
```
Collaborator

General question: do we know the defaults on capstor/iopsstor? It might be useful for users to know when they should actually care about changing the settings. Or are these "Good default settings" actually our defaults?

Contributor

I had the same question. I saw some commands like this before, but I honestly don't really get what will happen to the file in detail, so if you/anyone tested it, it'd be great to explain it a bit, please? :)
E.g. does it create 64MB stripes? Is this the optimal size? (I recall getting better results reading files a bit larger, but no idea how that translates.)

And as @msimberg says, if this is a good definition, can we please make it the default?
Does writing require different values vs. reading?

QPM might be a good time to talk to storage and find someone to run these tests?

Thanks, and thank you for submitting this too!

Contributor Author

The definition means that the first 4M of a file use just one OST (fewer locks, less overhead), up to 64M it uses 4 OSTs, and beyond that all OSTs with a 4MB blocksize. I tested it, but not heavily. I already discussed this with @mpasserini, but he is not keen on changing the defaults, especially as capstor might be replaced for MLP soon. I am willing to push users that are willing to set things, and if we then have more test coverage, really change the default.
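
A commented restatement of that layout may help; the breakpoints below simply repeat the explanation above and are not verified system defaults:

```console
# -E 4M  -c 1        -> first 4 MB of a file go to a single OST (fewer locks, less overhead)
# -E 64M -c 4        -> from 4 MB up to 64 MB, data is striped over 4 OSTs
# -E -1  -c -1 -S 4M -> beyond 64 MB, all OSTs are used with a 4 MB stripe size
lfs setstripe -E 4M -c 1 -E 64M -c 4 -E -1 -c -1 -S 4M <base_dir>
lfs getstripe <base_dir>   # verify the resulting composite layout
```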


## Iopsstor vs Capstor

`iopsstor` uses SSDs as OSTs, so random access is quick and the performance of a single OST is high. `capstor`, on the other hand, uses hard disks; it has a larger capacity and many more OSSs, so the total bandwidth is larger.

!!! note
ML model training normally performs better when reading from iopsstor (random access, hard-to-predict access patterns). Checkpoints can be written to capstor (very good for contiguous access).
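
As a concrete sketch of that split (the paths are examples, not prescribed locations):

```console
# Dataset with random-access reads stays on iopsstor
DATASET=/iopsstor/scratch/cscs/$USER/dataset

# Checkpoints are written sequentially, so stripe a capstor directory widely
CKPT=/capstor/scratch/cscs/$USER/checkpoints
mkdir -p $CKPT
lfs setstripe --stripe-count -1 --stripe-size 4M $CKPT
```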
Binary file added docs/images/storage/lustre.png
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -112,6 +112,7 @@ nav:
- guides/index.md
- 'Internet Access on Alps': guides/internet-access.md
- 'Storage': guides/storage.md
- 'Lustre tuning': guides/lustre-tuning.md
- 'Using the terminal': guides/terminal.md
- 'MLP Tutorials':
- guides/mlp_tutorials/index.md