
Conversation

@fawzi (Contributor) commented Jun 24, 2025

Added basic info and commands to improve Lustre performance.
I did not describe features that we do not use (like pools), or that normally should not be used (like index/offset).

@github-actions

preview available: https://docs.tds.cscs.ch/166

@fawzi (Contributor, Author) commented Jun 24, 2025

Probably we should link to it from MLP, and maybe from other places too.


@msimberg (Collaborator) left a comment

Thanks @fawzi, this is very nice to have.

It might be a good idea to at least cross-link (possibly in both directions) with https://docs.cscs.ch/guides/storage/, https://docs.cscs.ch/storage/filesystems/, and https://docs.cscs.ch/alps/storage/ for better discoverability. The platform pages (e.g. https://docs.cscs.ch/platforms/hpcp/#file-systems-and-storage) also have some storage-related content.

The info you're adding is unique and new, but we might have to optimize a bit globally how we group all the storage-related info.

Comment on lines 1 to 2
# Lustre Tuning
`/capstor/` and `/iopsstor` are both [lustre](https://lustre.org) filesystem.

Suggested change
# Lustre Tuning
`/capstor/` and `/iopsstor` are both [lustre](https://lustre.org) filesystem.
# Lustre tuning
[Capstor][ref-alps-capstor] and [Iopsstor][ref-alps-iopsstor] are both [Lustre](https://lustre.org) filesystems.

In general, do have a quick read of https://docs.cscs.ch/contributing/ for some guidelines on formatting etc.

@fawzi (Contributor, Author) replied:

Changed references, merged with the storage guide, and added some cross-references.

Lustre is an open-source, parallel file system used in HPC systems.
As shown in ![Lustre architecture](/images/storage/lustre.png) uses *metadata* servers to store and query metadata which is basically what is shown by `ls`: directory structure, file permission, modification dates,..
Collaborator:

This image doesn't show up in the rendered output. Could you make sure that it's correctly rendered?

@fawzi (Contributor, Author):

I think it is because it refers to the image at a path that will only exist after this is merged; when serving locally it works. Or should I add the image explicitly somewhere?

This data is globally synchronized, which means that handling many small files is not especially suited for lustre, and the perfomrance of that part is similar on both capstor and iopsstor.
Collaborator:

Suggested change
This data is globally synchronized, which means that handling many small files is not especially suited for lustre, and the perfomrance of that part is similar on both capstor and iopsstor.
This data is globally synchronized, which means that handling many small files is not especially suited for Lustre, and the perfomrance of that part is similar on both capstor and iopsstor.

here and elsewhere.

Collaborator:

Suggested change
This data is globally synchronized, which means that handling many small files is not especially suited for lustre, and the perfomrance of that part is similar on both capstor and iopsstor.
This data is globally synchronized, which means that handling many small files is not well suited for lustre.

Leave out the second part? Iopsstor should handle small files slightly better than Capstor, no? I may very well be mistaken, in which case it's obviously good to leave it in.


@fawzi (Contributor, Author):

The performance of the metadata part is roughly the same on iopsstor and capstor; it depends on the number of metadata servers and on the users adding load to them. For a while, backup was putting extra load on capstor. Writing the data of many small files might indeed be slightly better on iopsstor, because SSDs are faster.

With many small files, a local filesystems like `/dev/shmem/$USER` or "/tmp", if enough memory can be spared for it, can be *much* faster, and offset the packing/unpacking work. Alternatively using a squashed filesystems can be a good option.
Collaborator:

Suggested change
With many small files, a local filesystems like `/dev/shmem/$USER` or "/tmp", if enough memory can be spared for it, can be *much* faster, and offset the packing/unpacking work. Alternatively using a squashed filesystems can be a good option.
With many small files, an in-memory filesystem like `/dev/shm/$USER` or "/tmp", if enough memory can be spared for it, can be *much* faster, and offset the packing/unpacking work. Alternatively using a squashed filesystems can be a good option.

?
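For illustration, a minimal sketch of this staging pattern, assuming a dataset packed as a tar archive (all paths and file names are hypothetical):

```console
# Pack the many small files into a single archive once, on Lustre
$ tar cf /capstor/scratch/$USER/dataset.tar dataset/

# At job start, unpack into node-local memory (this counts against the node's RAM)
$ mkdir -p /dev/shm/$USER
$ tar xf /capstor/scratch/$USER/dataset.tar -C /dev/shm/$USER

# Reads now hit local memory instead of the Lustre metadata servers
$ ls /dev/shm/$USER/dataset | head
```

The archive itself is read from Lustre as one large sequential stream, which is the access pattern Lustre handles best, so the unpacking work is quickly offset.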

Collaborator:

There's a lot of overlap with https://docs.cscs.ch/guides/storage/#many-small-files-vs-hpc-file-systems. Can we link or avoid some duplication of the motivation etc.?

Actually, do you think this guide could fit inside the existing storage guide?

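The squashed-filesystem alternative mentioned in the quoted paragraph could look roughly like this; a sketch assuming squashfuse is available for unprivileged mounting (image name and mount point are hypothetical):

```console
# Pack the directory tree into a single compressed, read-only image
$ mksquashfs dataset/ dataset.squashfs

# Mount it in user space and read the files as usual
$ mkdir -p /tmp/$USER/dataset-mnt
$ squashfuse dataset.squashfs /tmp/$USER/dataset-mnt
$ ls /tmp/$USER/dataset-mnt | head
```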

The data itself is subdivided in blocks of size `<blocksize>` and is stored by Object Storage Servers (OSS) in one or more Object Storage Targets (OST).
The blocksize and number of OSTs to use is defined by the striping settings. A new file or directory ihnerits them from its parent directory. The `lfs getstripe <path>` command can be used to get information on the actual stripe settings. For directories and empty files `lfs setstripe --stripe-count <count> --stripe-size <size> <directory/file>` can be used to set the layout. The simplest way to have the correct layout is to copy to a directory with the correct layout
Collaborator:

Suggested change
The blocksize and number of OSTs to use is defined by the striping settings. A new file or directory ihnerits them from its parent directory. The `lfs getstripe <path>` command can be used to get information on the actual stripe settings. For directories and empty files `lfs setstripe --stripe-count <count> --stripe-size <size> <directory/file>` can be used to set the layout. The simplest way to have the correct layout is to copy to a directory with the correct layout
The blocksize and number of OSTs to use is defined by the striping settings.
A new file or directory ihnerits them from its parent directory. The `lfs getstripe <path>` command can be used to get information on the actual stripe settings.
For directories and empty files `lfs setstripe --stripe-count <count> --stripe-size <size> <directory/file>` can be used to set the layout.
The simplest way to have the correct layout is to copy to a directory with the correct layout.
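As a concrete walk-through of the commands named here (the directory path is hypothetical):

```console
# Inspect the layout that a directory will give to new files
$ lfs getstripe /capstor/scratch/$USER/output

# New files in this directory: 4 MiB stripes spread over 4 OSTs
$ lfs setstripe --stripe-count 4 --stripe-size 4M /capstor/scratch/$USER/output

# Existing files keep their old layout; copying creates a new file
# that inherits the directory's settings
$ cp bigfile /capstor/scratch/$USER/output/
```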

Comment on lines 22 to 25
!!! example "Good default settings"
```console
lfs setstripe -E 4M -c 1 -E 64M -c 4 -E -1 -c -1 -S 4M <base_dir>
```
Collaborator:

General question: do we know the defaults on capstor/iopsstor? It might be useful for users to know when they should actually care about changing the settings. Or are these "Good default settings" actually our defaults?

Contributor:

I had the same question. I saw some commands like this before, but I honestly don't really get what happens to the file in detail, so if you/anyone tested it, it'd be great to explain it a bit, please? :)
E.g. does it create 64 MB stripes? Is this the optimal size? (I recall getting better results reading files a bit larger, but no idea how that translates.)

And as @msimberg says, if this is a good definition, can we please make it the default?
Does writing require different values vs. reading?

QPM might be a good time to talk to storage and find someone to run these tests?

Thanks, and thank you for submitting this too!

@fawzi (Contributor, Author):

The definition means: the first 4 MB use just one OST (fewer locks, less overhead); from 4 MB up to 64 MB, 4 OSTs; beyond that, all OSTs with a 4 MB stripe size. I tested it, but not heavily. I already discussed this with @mpasserini, but he is not keen on changing things, especially as capstor might be replaced for MLP soon. I am willing to push users that are willing to set things explicitly, and if we then get more test coverage, really change the default.
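Spelled out as a sketch, the composite (PFL) layout from the example above does the following (`<base_dir>` is the placeholder from the original command):

```console
# Composite (PFL) layout, component by component:
#   0-4 MiB    -> 1 OST      (-E 4M  -c 1): fewer locks, less overhead
#   4-64 MiB   -> 4 OSTs     (-E 64M -c 4)
#   64 MiB-EOF -> all OSTs   (-E -1  -c -1), 4 MiB stripe size (-S 4M)
$ lfs setstripe -E 4M -c 1 -E 64M -c 4 -E -1 -c -1 -S 4M <base_dir>

# Verify what new files under <base_dir> will get
$ lfs getstripe <base_dir>
```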

@fawzi requested a review from @mpasserini as a code owner, June 25, 2025.

@bcumming (Member) left a comment

Thanks Fawzi.

These are a good starting point for us to develop more detailed docs over time.

I made a few tweaks and pushed them, to speed up the process.

@bcumming merged commit e796cc0 into eth-cscs:main on Jun 25, 2025. 1 check passed.