
Commit 742319d

finkandreas, RMeli, and bcumming authored

add MPICH communication docs (#217)

This documentation adds a reference for how to build a container with MPICH that allows native communication speeds. Both GPU and CPU examples are included. No example for building outside of a container is provided (it would surely be possible to build directly on bare metal, but this is untested).

Co-authored-by: Rocco Meli <[email protected]>
Co-authored-by: Ben Cumming <[email protected]>
1 parent d1c6107 commit 742319d

File tree

3 files changed: +276 -0 lines changed

.github/actions/spelling/allow.txt

Lines changed: 1 addition & 0 deletions

@@ -74,6 +74,7 @@ NAMD
 NICs
 NVMe
 Nordend
+OpenFabrics
 OSS
 OSSs
 OTP
software/communication/mpich.md

Lines changed: 274 additions & 0 deletions

[](){#ref-communication-mpich}
# MPICH

MPICH is an open-source MPI implementation actively developed in [this GitHub repository](https://github.com/pmodels/mpich).
It can be installed inside containers directly from source manually, or using Spack or similar package managers.

## MPICH inside containers
MPICH can be built inside containers; however, for native Slingshot performance, special care has to be taken to ensure that communication is optimal in all of the following cases:

* Intra-node communication (via shared memory, in particular `xpmem`)
* Inter-node communication (this should go through the OpenFabrics Interface, OFI)
* Host-to-Host memory communication
* Device-to-Device memory communication

To achieve native performance, MPICH must be built with both `libfabric` and `xpmem` support.
Additionally, when building for GH200 nodes, one needs to build `libfabric` and `mpich` with CUDA support.

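For orientation, these are the MPICH `configure` flags in the recipes below that provide each of these capabilities. This is a condensed excerpt of the Dockerfiles, not a standalone build recipe; the install prefixes are the ones used there, and `--with-cuda` only applies to the GPU build:

```bash
# Condensed from the Dockerfiles below -- not a standalone recipe.
#   --with-device=ch4:ofi        route inter-node communication through OFI (libfabric)
#   --with-libfabric=/usr        prefix where libfabric was installed
#   --with-xpmem=/usr            prefix where xpmem was installed (intra-node shared memory)
#   --with-cuda=/usr/local/cuda  GPU-aware MPI; only needed for the GPU (GH200) build
./configure --prefix=/usr --enable-fast=O3,ndebug --enable-fortran --enable-cxx \
            --with-device=ch4:ofi --with-libfabric=/usr --with-xpmem=/usr \
            --with-cuda=/usr/local/cuda
```
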
At runtime, the container engine [CXI hook][ref-ce-cxi-hook] will replace the `xpmem` and `libfabric` libraries inside the container with the corresponding libraries on the host system.
This ensures native performance when doing MPI communication.

The following example Dockerfiles can be used on [Eiger][ref-cluster-eiger] and [Daint][ref-cluster-daint] to build a container image with MPICH and the best communication performance.

They build the necessary packages explicitly by hand; for production, one can fall back to Spack to do the building.
=== "Dockerfile.cpu"
25+
```Dockerfile
26+
FROM docker.io/ubuntu:24.04
27+
28+
ARG libfabric_version=1.22.0
29+
ARG mpi_version=4.3.1
30+
ARG osu_version=7.5.1
31+
32+
RUN apt-get update \
33+
&& DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends build-essential ca-certificates automake autoconf libtool make gdb strace wget python3 git gfortran \
34+
&& rm -rf /var/lib/apt/lists/*
35+
36+
RUN git clone https://github.com/hpc/xpmem \
37+
&& cd xpmem/lib \
38+
&& gcc -I../include -shared -o libxpmem.so.1 libxpmem.c \
39+
&& ln -s libxpmem.so.1 libxpmem.so \
40+
&& mv libxpmem.so* /usr/lib64 \
41+
&& cp ../include/xpmem.h /usr/include/ \
42+
&& ldconfig \
43+
&& cd ../../ \
44+
&& rm -Rf xpmem
45+
46+
RUN wget -q https://github.com/ofiwg/libfabric/archive/v${libfabric_version}.tar.gz \
47+
&& tar xf v${libfabric_version}.tar.gz \
48+
&& cd libfabric-${libfabric_version} \
49+
&& ./autogen.sh \
50+
&& ./configure --prefix=/usr \
51+
&& make -j$(nproc) \
52+
&& make install \
53+
&& ldconfig \
54+
&& cd .. \
55+
&& rm -rf v${libfabric_version}.tar.gz libfabric-${libfabric_version}
56+
57+
RUN wget -q https://www.mpich.org/static/downloads/${mpi_version}/mpich-${mpi_version}.tar.gz \
58+
&& tar xf mpich-${mpi_version}.tar.gz \
59+
&& cd mpich-${mpi_version} \
60+
&& ./autogen.sh \
61+
&& ./configure --prefix=/usr --enable-fast=O3,ndebug --enable-fortran --enable-cxx --with-device=ch4:ofi --with-libfabric=/usr --with-xpmem=/usr \
62+
&& make -j$(nproc) \
63+
&& make install \
64+
&& ldconfig \
65+
&& cd .. \
66+
&& rm -rf mpich-${mpi_version}.tar.gz mpich-${mpi_version}
67+
68+
RUN wget -q http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-v${osu_version}.tar.gz \
69+
&& tar xf osu-micro-benchmarks-v${osu_version}.tar.gz \
70+
&& cd osu-micro-benchmarks-v${osu_version} \
71+
&& ./configure --prefix=/usr/local CC=$(which mpicc) CFLAGS=-O3 \
72+
&& make -j$(nproc) \
73+
&& make install \
74+
&& cd .. \
75+
&& rm -rf osu-micro-benchmarks-v${osu_version} osu-micro-benchmarks-v${osu_version}.tar.gz
76+
```
77+
78+
=== "Dockerfile.gpu"
79+
```Dockerfile
80+
FROM docker.io/nvidia/cuda:12.8.1-devel-ubuntu24.04
81+
82+
ARG libfabric_version=1.22.0
83+
ARG mpi_version=4.3.1
84+
ARG osu_version=7.5.1
85+
86+
RUN apt-get update \
87+
&& DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends build-essential ca-certificates automake autoconf libtool make gdb strace wget python3 git gfortran \
88+
&& rm -rf /var/lib/apt/lists/*
89+
90+
# When building on a machine without a GPU,
91+
# during the build process on Daint the GPU driver and libraries are not imported into the build process
92+
RUN echo '/usr/local/cuda/lib64/stubs' > /etc/ld.so.conf.d/cuda_stubs.conf && ldconfig
93+
94+
RUN git clone https://github.com/hpc/xpmem \
95+
&& cd xpmem/lib \
96+
&& gcc -I../include -shared -o libxpmem.so.1 libxpmem.c \
97+
&& ln -s libxpmem.so.1 libxpmem.so \
98+
&& mv libxpmem.so* /usr/lib \
99+
&& cp ../include/xpmem.h /usr/include/ \
100+
&& ldconfig \
101+
&& cd ../../ \
102+
&& rm -Rf xpmem
103+
104+
RUN wget -q https://github.com/ofiwg/libfabric/archive/v${libfabric_version}.tar.gz \
105+
&& tar xf v${libfabric_version}.tar.gz \
106+
&& cd libfabric-${libfabric_version} \
107+
&& ./autogen.sh \
108+
&& ./configure --prefix=/usr --with-cuda=/usr/local/cuda \
109+
&& make -j$(nproc) \
110+
&& make install \
111+
&& ldconfig \
112+
&& cd .. \
113+
&& rm -rf v${libfabric_version}.tar.gz libfabric-${libfabric_version}
114+
115+
RUN wget -q https://www.mpich.org/static/downloads/${mpi_version}/mpich-${mpi_version}.tar.gz \
116+
&& tar xf mpich-${mpi_version}.tar.gz \
117+
&& cd mpich-${mpi_version} \
118+
&& ./autogen.sh \
119+
&& ./configure --prefix=/usr --enable-fast=O3,ndebug --enable-fortran --enable-cxx --with-device=ch4:ofi --with-libfabric=/usr --with-xpmem=/usr --with-cuda=/usr/local/cuda \
120+
&& make -j$(nproc) \
121+
&& make install \
122+
&& ldconfig \
123+
&& cd .. \
124+
&& rm -rf mpich-${mpi_version}.tar.gz mpich-${mpi_version}
125+
126+
RUN wget -q http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-v${osu_version}.tar.gz \
127+
&& tar xf osu-micro-benchmarks-v${osu_version}.tar.gz \
128+
&& cd osu-micro-benchmarks-v${osu_version} \
129+
&& ./configure --prefix=/usr/local --with-cuda=/usr/local/cuda CC=$(which mpicc) CFLAGS=-O3 \
130+
&& make -j$(nproc) \
131+
&& make install \
132+
&& cd .. \
133+
&& rm -rf osu-micro-benchmarks-v${osu_version} osu-micro-benchmarks-v${osu_version}.tar.gz
134+
135+
# Get rid of the stubs libraries, because at runtime the CUDA driver and libraries will be available
136+
RUN rm /etc/ld.so.conf.d/cuda_stubs.conf && ldconfig
137+
```
138+
139+
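One possible way to build one of these images and push it to a registry is with Podman. The commands below are a sketch, and the registry path `registry.example.org/myuser` is a placeholder that must be replaced with a registry you can actually push to:

```console
$ podman build -f Dockerfile.gpu -t registry.example.org/myuser/mpich-osu:latest .
$ podman login registry.example.org
$ podman push registry.example.org/myuser/mpich-osu:latest
```
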
!!! important "GPU-to-GPU inter-node communication"
    To ensure good GPU-to-GPU performance for inter-node communication, one must set the variable
    ```console
    $ export MPIR_CVAR_CH4_OFI_ENABLE_HMEM=1
    ```

Once the container is built and pushed to a registry, one can create a [container environment][ref-container-engine].
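The benchmarks below reference such an environment through `--environment=$PWD/osu_gpu.toml`. A minimal sketch of what this file could contain is shown here; the image reference is a placeholder, the annotation name is an assumption, and the exact set of supported options is described in the [container engine][ref-container-engine] documentation:

```toml
# osu_gpu.toml -- minimal sketch of an environment definition file.
# The image reference is a placeholder; point it at the image you pushed.
image = "registry.example.org/myuser/mpich-osu:latest"

[annotations]
# Request the CXI hook so the host's libfabric and xpmem are injected into
# the container (assumed annotation name; the hook may already be enabled by default).
com.hooks.cxi.enabled = "true"
```
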
To verify performance, one can run the `osu_bw` benchmark, which measures the point-to-point bandwidth between two ranks for a range of message sizes.
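Before running the benchmark, it can also be worth checking that the [CXI hook][ref-ce-cxi-hook] actually replaced the container's `libfabric` and `xpmem` with the host libraries. A quick sketch of such a check; the binary path is the OSU install location from the Dockerfiles, and the host library paths shown in the output depend on the system:

```console
$ srun --mpi=pmi2 -n1 -N1 --environment=$PWD/osu_gpu.toml \
    ldd /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw | grep -Ei 'fabric|xpmem'
```

If the hook is active, `libfabric` and `xpmem` should resolve to libraries injected from the host rather than the copies built into the image.
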
For reference, this is the expected performance for the different memory residencies, with intra-node and inter-node communication:
=== "CPU-to-CPU memory intra-node"
149+
```console
150+
$ export MPIR_CVAR_CH4_OFI_ENABLE_HMEM=1
151+
$ srun --mpi=pmi2 -t00:05:00 --environment=$PWD/osu_gpu.toml -n2 -N1 /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw H H
152+
# OSU MPI Bandwidth Test v7.5
153+
# Datatype: MPI_CHAR.
154+
# Size Bandwidth (MB/s)
155+
1 1.19
156+
2 2.37
157+
4 4.78
158+
8 9.61
159+
16 8.71
160+
32 38.38
161+
64 76.89
162+
128 152.89
163+
256 303.63
164+
512 586.09
165+
1024 1147.26
166+
2048 2218.82
167+
4096 4303.92
168+
8192 8165.95
169+
16384 7178.94
170+
32768 9574.09
171+
65536 43786.86
172+
131072 53202.36
173+
262144 64046.90
174+
524288 60504.75
175+
1048576 36400.29
176+
2097152 28694.38
177+
4194304 23906.16
178+
```
179+
180+
=== "CPU-to-CPU memory inter-node"
181+
```console
182+
$ export MPIR_CVAR_CH4_OFI_ENABLE_HMEM=1
183+
$ srun --mpi=pmi2 -t00:05:00 --environment=$PWD/osu_gpu.toml -n2 -N2 /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw H H
184+
# OSU MPI Bandwidth Test v7.5
185+
# Datatype: MPI_CHAR.
186+
# Size Bandwidth (MB/s)
187+
1 0.97
188+
2 1.95
189+
4 3.91
190+
8 7.80
191+
16 15.67
192+
32 31.24
193+
64 62.58
194+
128 124.99
195+
256 249.13
196+
512 499.63
197+
1024 1009.57
198+
2048 1989.46
199+
4096 3996.43
200+
8192 7139.42
201+
16384 14178.70
202+
32768 18920.35
203+
65536 22169.18
204+
131072 23226.08
205+
262144 23627.48
206+
524288 23838.28
207+
1048576 23951.16
208+
2097152 24007.73
209+
4194304 24037.14
210+
```
211+
212+
=== "GPU-to-GPU memory intra-node"
213+
```console
214+
$ export MPIR_CVAR_CH4_OFI_ENABLE_HMEM=1
215+
$ srun --mpi=pmi2 -t00:05:00 --environment=$PWD/osu_gpu.toml -n2 -N1 /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw D D
216+
# OSU MPI-CUDA Bandwidth Test v7.5
217+
# Datatype: MPI_CHAR.
218+
# Size Bandwidth (MB/s)
219+
1 0.14
220+
2 0.29
221+
4 0.58
222+
8 1.16
223+
16 2.37
224+
32 4.77
225+
64 9.87
226+
128 19.77
227+
256 39.52
228+
512 78.29
229+
1024 158.19
230+
2048 315.93
231+
4096 633.14
232+
8192 1264.69
233+
16384 2543.21
234+
32768 5051.02
235+
65536 10069.17
236+
131072 20178.56
237+
262144 38102.36
238+
524288 64397.91
239+
1048576 84937.73
240+
2097152 104723.15
241+
4194304 115214.94
242+
```
243+
244+
=== "GPU-to-GPU memory inter-node"
245+
```console
246+
$ export MPIR_CVAR_CH4_OFI_ENABLE_HMEM=1
247+
$ srun --mpi=pmi2 -t00:05:00 --environment=$PWD/osu_gpu.toml -n2 -N2 /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw D D
248+
# OSU MPI-CUDA Bandwidth Test v7.5
249+
# Datatype: MPI_CHAR.
250+
# Size Bandwidth (MB/s)
251+
1 0.09
252+
2 0.18
253+
4 0.37
254+
8 0.74
255+
16 1.48
256+
32 2.96
257+
64 5.91
258+
128 11.80
259+
256 227.08
260+
512 463.72
261+
1024 923.58
262+
2048 1740.73
263+
4096 3505.87
264+
8192 6351.56
265+
16384 13377.55
266+
32768 17226.43
267+
65536 21416.23
268+
131072 22733.04
269+
262144 23335.00
270+
524288 23624.70
271+
1048576 23821.72
272+
2097152 23928.62
273+
4194304 23974.34
274+
```

mkdocs.yml

Lines changed: 1 addition & 0 deletions

@@ -66,6 +66,7 @@ nav:
     - 'Communication Libraries':
       - software/communication/index.md
       - 'Cray MPICH': software/communication/cray-mpich.md
+      - 'MPICH': software/communication/mpich.md
       - 'OpenMPI': software/communication/openmpi.md
       - 'NCCL': software/communication/nccl.md
       - 'RCCL': software/communication/rccl.md
