| 1 | +[](){#ref-guides-storage} |
| 2 | +# Storage |
| 3 | + |
| 4 | +## Many small files vs. HPC File Systems |
| 5 | + |
| 6 | +Workloads that read or create many small files are not well-suited to parallel file systems, which are designed for parallel and distributed I/O. |
| 7 | + |
| 8 | +Workloads that do not play nicely with Lustre include: |
| 9 | + |
| 10 | +* Configuring and compiling applications.
| 11 | +* Using Python virtual environments.
| 12 | + |
| 13 | +At first it can seem strange that a "high-performance" file system is significantly slower than a laptop drive for a "simple" task like compilation or loading Python modules. However, Lustre is designed for high-bandwidth parallel file access from many nodes at the same time, with the trade-offs that this implies.
| 14 | + |
| 15 | +Metadata lookups on Lustre are expensive compared to your laptop, where the local file system is able to aggressively cache metadata.
| 16 | + |
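To get a feel for the cost of metadata lookups, the following is a minimal sketch of a micro-benchmark that creates a directory of small files and times a `stat` of each one. It is not a CSCS tool: the `BENCH_BASE` variable and the file count are illustrative assumptions, and it relies on GNU `date` and `stat`.

```bash
# Hypothetical micro-benchmark: time a metadata-only walk over small files.
# BENCH_BASE is an illustrative variable; point it at the file system to test.
bench_base="${BENCH_BASE:-${TMPDIR:-/tmp}}"
bench_dir=$(mktemp -d "$bench_base/meta-bench.XXXXXX")
n_files=1000

# create many small files, like the contents of a virtual environment
i=1
while [ "$i" -le "$n_files" ]; do
    echo "x" > "$bench_dir/file_$i.txt"
    i=$((i + 1))
done

# stat every file: this reads metadata only, never the file contents
start_ns=$(date +%s%N)
n_stat=$(find "$bench_dir" -type f -exec stat --format="%i" {} + | wc -l)
end_ns=$(date +%s%N)

echo "stat on $n_stat files took $(( (end_ns - start_ns) / 1000000 )) ms"

# clean up the throwaway directory
rm -rf "$bench_dir"
```

Running it once with `BENCH_BASE=$SCRATCH` and once against a node-local path gives a rough comparison. Because the files have just been created, their metadata may still be cached on the client, so treat the Lustre timing as a lower bound.
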
| 17 | +### Python virtual environments with uenv |
| 18 | + |
| 19 | +Python virtual environments can be very slow on Lustre. For example, a simple `import numpy` run on Lustre might take seconds, compared to milliseconds on your laptop.
| 20 | + |
| 21 | +The main reasons for this include: |
| 22 | + |
| 23 | +* Python virtual environments contain many small files, on which Python performs `stat()`, `open()` and `read()` calls when loading a module.
| 24 | +* Python creates and reads a compiled `.pyc` file for each `.py` file in a project.
| 25 | +* All of these operations generate a lot of metadata lookups.
| 26 | + |
| 27 | +As a result, using virtual environments can be slow, and these problems are only exacerbated when the virtual environment is loaded simultaneously by many ranks in an MPI job. |
| 28 | + |
| 29 | +One solution is to use the tool `mksquashfs` to compress the contents of a directory (files, inodes and sub-directories) into a single file.
| 30 | +This file can be mounted as a read-only [SquashFS](https://en.wikipedia.org/wiki/SquashFS) file system, which is much faster because a single file is accessed instead of the many small files that were in the original environment.
| 31 | + |
| 32 | + |
| 33 | +#### Step 1: create the virtual environment |
| 34 | + |
| 35 | +The first step is to create the virtual environment using the usual workflow. |
| 36 | +This might be slow, because we are not optimising this stage for file system performance. |
| 37 | + |
| 38 | +```bash |
| 39 | +# for the example create a working path on SCRATCH |
| 40 | +mkdir $SCRATCH/sqfs-demo |
| 41 | +cd $SCRATCH/sqfs-demo |
| 42 | + |
| 43 | +# start the uenv |
| 44 | +# in this case the "default" view of prgenv-gnu provides python, cray-mpich, and |
| 45 | +# other useful tools |
| 46 | +uenv start prgenv-gnu/24.11:v1 --view=default |
| 47 | + |
| 48 | +# create and activate the empty venv |
| 49 | +python -m venv ./.pyenv |
| 50 | +source ./.pyenv/bin/activate |
| 51 | + |
| 52 | +# install software in the virtual environment |
| 53 | +# in this case we install PyTorch
| 54 | +pip install torch torchvision torchaudio \ |
| 55 | + --index-url https://download.pytorch.org/whl/cu126 |
| 56 | +``` |
| 57 | + |
| 58 | +??? example "how many files did that create?" |
| 59 | + An inode is created for every file, directory and symlink on a file system. |
| 60 | + In order to optimise performance, we want to reduce the number of inodes (i.e. the number of files and directories). |
| 61 | + |
| 62 | + The following command can be used to count the number of inodes: |
| 63 | + ``` |
| 64 | + find $SCRATCH/sqfs-demo/.pyenv -exec stat --format="%i" {} + | sort -u | wc -l |
| 65 | + ``` |
| 66 | + `find` lists every file and directory, `stat` is called on each of these to get its inode number, and `sort` and `wc` then count the number of unique inodes.
| 67 | +
| 68 | + In our "simple" PyTorch example, we counted **22806 inodes**!
| 69 | + |
| 70 | +#### Step 2: make a squashfs image of the virtual environment |
| 71 | + |
| 72 | +The next step is to create a single squashfs file that contains the whole `$SCRATCH/sqfs-demo/.pyenv` path. |
| 73 | + |
| 74 | +This is performed using the `mksquashfs` command, which is installed on all Alps clusters.
| 75 | + |
| 76 | +``` |
| 77 | +mksquashfs $SCRATCH/sqfs-demo/.pyenv pyenv.squashfs \ |
| 78 | + -no-recovery -noappend -Xcompression-level 3 |
| 79 | +``` |
| 80 | + |
| 81 | +!!! hint |
| 82 | + The `-Xcompression-level` flag sets the compression level to a value between 1 and 9, with 9 being the most compressed. |
| 83 | + We find that level 3 provides a good trade-off between the size of the compressed image and performance: both [uenv][ref-uenv] and the [container-engine][ref-container-engine] use level 3.
| 84 | + |
| 85 | +??? warning "I am seeing errors of the form `Unrecognised xattr prefix...`" |
| 86 | + You can safely ignore the (possibly many) warning messages of the form: |
| 87 | + ``` |
| 88 | + Unrecognised xattr prefix lustre.lov |
| 89 | + Unrecognised xattr prefix system.posix_acl_access |
| 90 | + Unrecognised xattr prefix lustre.lov |
| 91 | + Unrecognised xattr prefix system.posix_acl_default |
| 92 | + ``` |
| 93 | + |
| 94 | +!!! tip |
| 95 | + The default installed version of `mksquashfs` on Alps does not support `zstd`, the best compression method.
| 96 | + Every uenv contains a better version of `mksquashfs`, which is used by the uenv to compress itself when it is built. |
| 97 | + |
| 98 | + The exact location inside the uenv depends on the target architecture and version, and will be of the form:
| 99 | + ``` |
| 100 | + /user-environment/linux-sles15-${arch}/gcc-7.5.0/squashfs-${version}-${hash}/bin/mksquashfs |
| 101 | + ``` |
| 102 | + Use this version for the best results, though it is also perfectly fine to use the system version. |
| 103 | + |
| 104 | +#### Step 3: use the squashfs |
| 105 | + |
| 106 | +To use the optimised virtual environment, mount the squashfs image at the location of the original virtual environment when starting the uenv. |
| 107 | + |
| 108 | +``` |
| 109 | +cd $SCRATCH/sqfs-demo |
| 110 | +uenv start --view=default \ |
| 111 | + prgenv-gnu/24.11:v1,$PWD/pyenv.squashfs:$SCRATCH/sqfs-demo/.pyenv |
| 112 | +cd $SCRATCH/sqfs-demo |
| 113 | +source .pyenv/bin/activate |
| 114 | +``` |
| 115 | + |
| 116 | +Note that the original virtual environment is still installed in `$SCRATCH/sqfs-demo/.pyenv`. However, the squashfs image has been mounted on top of it, so the single squashfs file is accessed instead of the many files in the original version.
| 117 | + |
| 118 | +A benefit of this approach is that the squashfs file can be copied to a location that is not subject to the Scratch cleaning policy, and mounted from there. |
| 119 | + |
| 120 | +#### Step 4: (optional) regenerate the virtual environment |
| 121 | + |
| 122 | +The squashfs file is immutable: it is not possible to modify the contents of `.pyenv` while it is mounted.
| 123 | +This means that it is not possible to `pip install` more packages in the virtual environment. |
| 124 | + |
| 125 | +If you need to modify the virtual environment, run the original uenv without the squashfs file mounted, make changes, and run step 2 again to generate a new image. |
| 126 | + |
| 127 | +!!! hint |
| 128 | + If you save the updated copy in a different file, you can "roll back" to the old version of the environment by mounting the old image.
| 129 | + |
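Keeping old images around is easier with a consistent naming scheme. The sketch below shows one possible approach, a date stamp plus a counter. The function name `next_image_name` and the `pyenv-<date>-<n>.squashfs` pattern are illustrative assumptions, not a uenv convention.

```bash
# next_image_name DIR [STAMP]   (hypothetical helper and naming scheme)
# Prints the first unused path of the form DIR/pyenv-STAMP-N.squashfs.
# STAMP defaults to today's date, e.g. 20250131.
next_image_name() {
    img_dir=$1
    img_stamp=${2:-$(date +%Y%m%d)}
    img_n=1
    # skip names already taken so that existing images are never overwritten
    while [ -e "$img_dir/pyenv-$img_stamp-$img_n.squashfs" ]; do
        img_n=$((img_n + 1))
    done
    echo "$img_dir/pyenv-$img_stamp-$img_n.squashfs"
}

# Example use when repeating step 2:
#   mksquashfs $SCRATCH/sqfs-demo/.pyenv "$(next_image_name $SCRATCH/sqfs-demo)" \
#       -no-recovery -noappend -Xcompression-level 3
```

To roll back, pass the path of an older image to `uenv start` in step 3 in place of `pyenv.squashfs`.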