Commit 941584b

Merge branch 'main' into gb25-round2

2 parents 60cb27f + 742319d commit 941584b

File tree

4 files changed: +282 −2 lines changed

.github/actions/spelling/allow.txt

Lines changed: 1 addition & 0 deletions

@@ -74,6 +74,7 @@ NAMD
 NICs
 NVMe
 Nordend
+OpenFabrics
 OSS
 OSSs
 OTP

docs/guides/storage.md

Lines changed: 6 additions & 2 deletions

@@ -111,7 +111,7 @@ To set up a default so all newly created folders and dirs inside or your desired
 ```

 !!! info
-    For more information read the `setfacl` man page: `man setfacl`.
+    For more information read the `setfacl` man page: [`man setfacl`](https://linux.die.net/man/1/setfacl).

 [](){#ref-guides-storage-lustre}
 ## Lustre tuning
@@ -127,14 +127,18 @@ The data itself is subdivided in blocks of size `<blocksize>` and is stored by O
 The block size and number of OSTs to use is defined by the striping settings, which are applied to a path, with new files and directories inheriting them from their parent directory.
 The `lfs getstripe <path>` command can be used to get information on the stripe settings of a path.
 For directories and empty files `lfs setstripe --stripe-count <count> --stripe-size <size> <directory/file>` can be used to set the layout.
-The simplest way to have the correct layout is to copy to a directory with the correct layout
+
+Striping settings on a directory are only applied to files added after the command is run.
+Existing files retain their original layout unless explicitly changed using `lfs migrate <striping settings>`, which takes the same arguments as `lfs setstripe`.
+The simplest way to give existing files the correct layout is to copy them into a directory with the correct layout.

 !!! tip "A block size of 4MB gives good throughput, without being overly big..."
     ... so it is a good choice when reading a file sequentially or in large chunks, but if one reads shorter chunks in random order it might be better to reduce the size, the performance will be smaller, but the performance of your application might actually increase.
     See the [Lustre documentation](https://doc.lustre.org/lustre_manual.xhtml#managingstripingfreespace) for more information.


 !!! example "Settings for large files"
+    *Remember:* Settings only apply to files added to the directory after this command.
     ```console
     lfs setstripe --stripe-count -1 --stripe-size 4M <big_files_dir>`
     ```
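
As a hedged illustration of the workflow the added lines describe (directory and file names are placeholders), new striping settings are applied to a directory, an existing file is migrated explicitly, and the result is checked with `lfs getstripe`:

```console
$ lfs setstripe --stripe-count 4 --stripe-size 4M <dir>              # applies only to files created in <dir> from now on
$ lfs migrate --stripe-count 4 --stripe-size 4M <dir>/existing_file  # rewrites an existing file with the new layout
$ lfs getstripe <dir>/existing_file                                  # verify the resulting layout
```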

docs/software/communication/mpich.md

Lines changed: 274 additions & 0 deletions

@@ -0,0 +1,274 @@
[](){#ref-communication-mpich}
# MPICH

MPICH is an open-source MPI implementation actively developed in this [GitHub repository](https://github.com/pmodels/mpich).
It can be installed inside containers directly from source, or with Spack or a similar package manager.

## MPICH inside containers
MPICH can be built inside containers; however, for native Slingshot performance special care has to be taken to ensure that communication is optimal in all cases:

* Intra-node communication (via shared memory, in particular `xpmem`)
* Inter-node communication (this should go through the OpenFabrics Interfaces (OFI))
* Host-to-Host memory communication
* Device-to-Device memory communication

To achieve native performance, MPICH must be built with both `libfabric` and `xpmem` support.
Additionally, when building for GH200 nodes, `libfabric` and MPICH must also be built with CUDA support.

At runtime, the container engine [CXI hook][ref-ce-cxi-hook] replaces the `xpmem` and `libfabric` libraries inside the container with the libraries of the host system.
This ensures native performance for MPI communication.

The following example Dockerfiles can be used on [Eiger][ref-cluster-eiger] and [Daint][ref-cluster-daint] to build a container image with MPICH and optimal communication performance.
They build the necessary packages explicitly by hand; for production, one can fall back to Spack to do the building.

=== "Dockerfile.cpu"
    ```Dockerfile
    FROM docker.io/ubuntu:24.04

    ARG libfabric_version=1.22.0
    ARG mpi_version=4.3.1
    ARG osu_version=7.5.1

    RUN apt-get update \
        && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends build-essential ca-certificates automake autoconf libtool make gdb strace wget python3 git gfortran \
        && rm -rf /var/lib/apt/lists/*

    # Build the xpmem user-space library and install its header
    RUN git clone https://github.com/hpc/xpmem \
        && cd xpmem/lib \
        && gcc -I../include -shared -o libxpmem.so.1 libxpmem.c \
        && ln -s libxpmem.so.1 libxpmem.so \
        && mv libxpmem.so* /usr/lib64 \
        && cp ../include/xpmem.h /usr/include/ \
        && ldconfig \
        && cd ../../ \
        && rm -Rf xpmem

    # Build libfabric (OFI)
    RUN wget -q https://github.com/ofiwg/libfabric/archive/v${libfabric_version}.tar.gz \
        && tar xf v${libfabric_version}.tar.gz \
        && cd libfabric-${libfabric_version} \
        && ./autogen.sh \
        && ./configure --prefix=/usr \
        && make -j$(nproc) \
        && make install \
        && ldconfig \
        && cd .. \
        && rm -rf v${libfabric_version}.tar.gz libfabric-${libfabric_version}

    # Build MPICH with the OFI device, libfabric and xpmem support
    RUN wget -q https://www.mpich.org/static/downloads/${mpi_version}/mpich-${mpi_version}.tar.gz \
        && tar xf mpich-${mpi_version}.tar.gz \
        && cd mpich-${mpi_version} \
        && ./autogen.sh \
        && ./configure --prefix=/usr --enable-fast=O3,ndebug --enable-fortran --enable-cxx --with-device=ch4:ofi --with-libfabric=/usr --with-xpmem=/usr \
        && make -j$(nproc) \
        && make install \
        && ldconfig \
        && cd .. \
        && rm -rf mpich-${mpi_version}.tar.gz mpich-${mpi_version}

    # Build the OSU micro-benchmarks against the MPICH installed above
    RUN wget -q http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-v${osu_version}.tar.gz \
        && tar xf osu-micro-benchmarks-v${osu_version}.tar.gz \
        && cd osu-micro-benchmarks-v${osu_version} \
        && ./configure --prefix=/usr/local CC=$(which mpicc) CFLAGS=-O3 \
        && make -j$(nproc) \
        && make install \
        && cd .. \
        && rm -rf osu-micro-benchmarks-v${osu_version} osu-micro-benchmarks-v${osu_version}.tar.gz
    ```

=== "Dockerfile.gpu"
    ```Dockerfile
    FROM docker.io/nvidia/cuda:12.8.1-devel-ubuntu24.04

    ARG libfabric_version=1.22.0
    ARG mpi_version=4.3.1
    ARG osu_version=7.5.1

    RUN apt-get update \
        && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends build-essential ca-certificates automake autoconf libtool make gdb strace wget python3 git gfortran \
        && rm -rf /var/lib/apt/lists/*

    # When building on a machine without a GPU (e.g. during the build process on Daint),
    # the GPU driver and libraries are not available, so link against the CUDA stub libraries
    RUN echo '/usr/local/cuda/lib64/stubs' > /etc/ld.so.conf.d/cuda_stubs.conf && ldconfig

    # Build the xpmem user-space library and install its header
    RUN git clone https://github.com/hpc/xpmem \
        && cd xpmem/lib \
        && gcc -I../include -shared -o libxpmem.so.1 libxpmem.c \
        && ln -s libxpmem.so.1 libxpmem.so \
        && mv libxpmem.so* /usr/lib \
        && cp ../include/xpmem.h /usr/include/ \
        && ldconfig \
        && cd ../../ \
        && rm -Rf xpmem

    # Build libfabric (OFI) with CUDA support
    RUN wget -q https://github.com/ofiwg/libfabric/archive/v${libfabric_version}.tar.gz \
        && tar xf v${libfabric_version}.tar.gz \
        && cd libfabric-${libfabric_version} \
        && ./autogen.sh \
        && ./configure --prefix=/usr --with-cuda=/usr/local/cuda \
        && make -j$(nproc) \
        && make install \
        && ldconfig \
        && cd .. \
        && rm -rf v${libfabric_version}.tar.gz libfabric-${libfabric_version}

    # Build MPICH with the OFI device, libfabric, xpmem and CUDA support
    RUN wget -q https://www.mpich.org/static/downloads/${mpi_version}/mpich-${mpi_version}.tar.gz \
        && tar xf mpich-${mpi_version}.tar.gz \
        && cd mpich-${mpi_version} \
        && ./autogen.sh \
        && ./configure --prefix=/usr --enable-fast=O3,ndebug --enable-fortran --enable-cxx --with-device=ch4:ofi --with-libfabric=/usr --with-xpmem=/usr --with-cuda=/usr/local/cuda \
        && make -j$(nproc) \
        && make install \
        && ldconfig \
        && cd .. \
        && rm -rf mpich-${mpi_version}.tar.gz mpich-${mpi_version}

    # Build the CUDA-aware OSU micro-benchmarks against the MPICH installed above
    RUN wget -q http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-v${osu_version}.tar.gz \
        && tar xf osu-micro-benchmarks-v${osu_version}.tar.gz \
        && cd osu-micro-benchmarks-v${osu_version} \
        && ./configure --prefix=/usr/local --with-cuda=/usr/local/cuda CC=$(which mpicc) CFLAGS=-O3 \
        && make -j$(nproc) \
        && make install \
        && cd .. \
        && rm -rf osu-micro-benchmarks-v${osu_version} osu-micro-benchmarks-v${osu_version}.tar.gz

    # Get rid of the stub libraries, because at runtime the CUDA driver and libraries will be available
    RUN rm /etc/ld.so.conf.d/cuda_stubs.conf && ldconfig
    ```
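
The images can be built with any OCI-compatible builder and pushed to a registry of your choice. As a sketch, assuming Podman is available and using placeholder image and registry names:

```console
$ podman build -f Dockerfile.gpu -t <registry>/<namespace>/mpich-osu:gpu .
$ podman push <registry>/<namespace>/mpich-osu:gpu
```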

!!! important "GPU-to-GPU inter-node communication"
    To make sure that GPU-to-GPU inter-node communication performs well, one must set the following variable:
    ```console
    $ export MPIR_CVAR_CH4_OFI_ENABLE_HMEM=1
    ```

Once the container is built and pushed to a registry, one can create a [container environment][ref-container-engine].
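
For reference, the `osu_gpu.toml` environment definition file used in the commands below might look roughly like the following sketch; the image reference and mount path are placeholders, and the exact options should be taken from the [container engine][ref-container-engine] documentation:

```toml
# osu_gpu.toml - hypothetical EDF sketch; adapt image and mounts to your setup
image = "<registry>/<namespace>/mpich-osu:gpu"   # image built from Dockerfile.gpu above
mounts = ["/capstor/scratch/cscs/<username>:/capstor/scratch/cscs/<username>"]
workdir = "/capstor/scratch/cscs/<username>"
```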

To verify performance, one can run the `osu_bw` benchmark, which measures the bandwidth between two ranks for a range of message sizes.
For reference, this is the expected performance for the different memory residencies, with intra-node and inter-node communication:

=== "CPU-to-CPU memory intra-node"
    ```console
    $ export MPIR_CVAR_CH4_OFI_ENABLE_HMEM=1
    $ srun --mpi=pmi2 -t00:05:00 --environment=$PWD/osu_gpu.toml -n2 -N1 /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw H H
    # OSU MPI Bandwidth Test v7.5
    # Datatype: MPI_CHAR.
    # Size      Bandwidth (MB/s)
    1           1.19
    2           2.37
    4           4.78
    8           9.61
    16          8.71
    32          38.38
    64          76.89
    128         152.89
    256         303.63
    512         586.09
    1024        1147.26
    2048        2218.82
    4096        4303.92
    8192        8165.95
    16384       7178.94
    32768       9574.09
    65536       43786.86
    131072      53202.36
    262144      64046.90
    524288      60504.75
    1048576     36400.29
    2097152     28694.38
    4194304     23906.16
    ```

=== "CPU-to-CPU memory inter-node"
    ```console
    $ export MPIR_CVAR_CH4_OFI_ENABLE_HMEM=1
    $ srun --mpi=pmi2 -t00:05:00 --environment=$PWD/osu_gpu.toml -n2 -N2 /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw H H
    # OSU MPI Bandwidth Test v7.5
    # Datatype: MPI_CHAR.
    # Size      Bandwidth (MB/s)
    1           0.97
    2           1.95
    4           3.91
    8           7.80
    16          15.67
    32          31.24
    64          62.58
    128         124.99
    256         249.13
    512         499.63
    1024        1009.57
    2048        1989.46
    4096        3996.43
    8192        7139.42
    16384       14178.70
    32768       18920.35
    65536       22169.18
    131072      23226.08
    262144      23627.48
    524288      23838.28
    1048576     23951.16
    2097152     24007.73
    4194304     24037.14
    ```

=== "GPU-to-GPU memory intra-node"
    ```console
    $ export MPIR_CVAR_CH4_OFI_ENABLE_HMEM=1
    $ srun --mpi=pmi2 -t00:05:00 --environment=$PWD/osu_gpu.toml -n2 -N1 /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw D D
    # OSU MPI-CUDA Bandwidth Test v7.5
    # Datatype: MPI_CHAR.
    # Size      Bandwidth (MB/s)
    1           0.14
    2           0.29
    4           0.58
    8           1.16
    16          2.37
    32          4.77
    64          9.87
    128         19.77
    256         39.52
    512         78.29
    1024        158.19
    2048        315.93
    4096        633.14
    8192        1264.69
    16384       2543.21
    32768       5051.02
    65536       10069.17
    131072      20178.56
    262144      38102.36
    524288      64397.91
    1048576     84937.73
    2097152     104723.15
    4194304     115214.94
    ```

=== "GPU-to-GPU memory inter-node"
    ```console
    $ export MPIR_CVAR_CH4_OFI_ENABLE_HMEM=1
    $ srun --mpi=pmi2 -t00:05:00 --environment=$PWD/osu_gpu.toml -n2 -N2 /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw D D
    # OSU MPI-CUDA Bandwidth Test v7.5
    # Datatype: MPI_CHAR.
    # Size      Bandwidth (MB/s)
    1           0.09
    2           0.18
    4           0.37
    8           0.74
    16          1.48
    32          2.96
    64          5.91
    128         11.80
    256         227.08
    512         463.72
    1024        923.58
    2048        1740.73
    4096        3505.87
    8192        6351.56
    16384       13377.55
    32768       17226.43
    65536       21416.23
    131072      22733.04
    262144      23335.00
    524288      23624.70
    1048576     23821.72
    2097152     23928.62
    4194304     23974.34
    ```

mkdocs.yml

Lines changed: 1 addition & 0 deletions

@@ -66,6 +66,7 @@ nav:
   - 'Communication Libraries':
     - software/communication/index.md
     - 'Cray MPICH': software/communication/cray-mpich.md
+    - 'MPICH': software/communication/mpich.md
     - 'OpenMPI': software/communication/openmpi.md
     - 'NCCL': software/communication/nccl.md
     - 'RCCL': software/communication/rccl.md
