Commit 49a1d36

add MPICH communication docs

1 parent 0f5c808 commit 49a1d36

2 files changed: +271 -0

software/communication/mpich.md (+270 -0)

[](){#ref-communication-mpich}

# MPICH

MPICH is an open-source MPI implementation actively developed in this [GitHub repository](https://github.com/pmodels/mpich).
It can be installed inside containers either manually from source or with Spack as a package manager.
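
A minimal sketch of such a Spack installation, assuming a recent Spack checkout in which the `mpich` package exposes the `device` and `netmod` variants (the exact variant names, including the `+cuda` variant used for GPU nodes, may differ between Spack versions):

```console
$ # CPU-only nodes: MPICH with the ch4:ofi device (libfabric)
$ spack install mpich@4.3.1 device=ch4 netmod=ofi +fortran
$ # GPU (GH200) nodes: additionally enable CUDA support (variant name assumed)
$ spack install mpich@4.3.1 device=ch4 netmod=ofi +fortran +cuda
```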

## MPICH inside containers

MPICH can be built inside containers; however, for native Slingshot performance special care has to be taken to ensure that communication is optimal in all of the following cases:

* Intra-node communication (via shared memory, in particular `xpmem`)
* Inter-node communication (through the OpenFabrics interface, OFI)
* Host-to-host memory communication
* Device-to-device memory communication

To achieve native performance, MPICH must be built with `libfabric` and `xpmem` support.
Additionally, when building for GH200 nodes, both `libfabric` and `mpich` must be built with `CUDA` support.

At container runtime the [CXI hook][ref-ce-cxi-hook] replaces the `xpmem` and `libfabric` libraries inside the container with the corresponding libraries of the host system.
This ensures native performance for MPI communication.

These are example Dockerfiles that can be used on `Eiger` and `Daint` to build a container image with MPICH and the best communication performance.
They are quite explicit and build the necessary packages manually; for real-life use cases one should fall back to Spack to do the building.
=== "Dockerfile.cpu"
24+
```Dockerfile
25+
FROM docker.io/ubuntu:24.04
26+
27+
ARG libfabric_version=1.22.0
28+
ARG mpi_version=4.3.1
29+
ARG osu_version=7.5.1
30+
31+
RUN apt-get update \
32+
&& DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends build-essential ca-certificates automake autoconf libtool make gdb strace wget python3 git gfortran \
33+
&& rm -rf /var/lib/apt/lists/*
34+
35+
RUN git clone https://github.com/hpc/xpmem \
36+
&& cd xpmem/lib \
37+
&& gcc -I../include -shared -o libxpmem.so.1 libxpmem.c \
38+
&& ln -s libxpmem.so.1 libxpmem.so \
39+
&& mv libxpmem.so* /usr/lib64 \
40+
&& cp ../include/xpmem.h /usr/include/ \
41+
&& ldconfig \
42+
&& cd ../../ \
43+
&& rm -Rf xpmem
44+
45+
RUN wget -q https://github.com/ofiwg/libfabric/archive/v${libfabric_version}.tar.gz \
46+
&& tar xf v${libfabric_version}.tar.gz \
47+
&& cd libfabric-${libfabric_version} \
48+
&& ./autogen.sh \
49+
&& ./configure --prefix=/usr \
50+
&& make -j$(nproc) \
51+
&& make install \
52+
&& ldconfig \
53+
&& cd .. \
54+
&& rm -rf v${libfabric_version}.tar.gz libfabric-${libfabric_version}
55+
56+
RUN wget -q https://www.mpich.org/static/downloads/${mpi_version}/mpich-${mpi_version}.tar.gz \
57+
&& tar xf mpich-${mpi_version}.tar.gz \
58+
&& cd mpich-${mpi_version} \
59+
&& ./autogen.sh \
60+
&& ./configure --prefix=/usr --enable-fast=O3,ndebug --enable-fortran --enable-cxx --with-device=ch4:ofi --with-libfabric=/usr --with-xpmem=/usr \
61+
&& make -j$(nproc) \
62+
&& make install \
63+
&& ldconfig \
64+
&& cd .. \
65+
&& rm -rf mpich-${mpi_version}.tar.gz mpich-${mpi_version}
66+
67+
RUN wget -q http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-v${osu_version}.tar.gz \
68+
&& tar xf osu-micro-benchmarks-v${osu_version}.tar.gz \
69+
&& cd osu-micro-benchmarks-v${osu_version} \
70+
&& ./configure --prefix=/usr/local CC=$(which mpicc) CFLAGS=-O3 \
71+
&& make -j$(nproc) \
72+
&& make install \
73+
&& cd .. \
74+
&& rm -rf osu-micro-benchmarks-v${osu_version} osu-micro-benchmarks-v${osu_version}.tar.gz
75+
```
76+
77+
=== "Dockerfile.gpu"
78+
```Dockerfile
79+
FROM docker.io/nvidia/cuda:12.8.1-devel-ubuntu24.04
80+
81+
ARG libfabric_version=1.22.0
82+
ARG mpi_version=4.3.1
83+
ARG osu_version=7.5.1
84+
85+
RUN apt-get update \
86+
&& DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends build-essential ca-certificates automake autoconf libtool make gdb strace wget python3 git gfortran \
87+
&& rm -rf /var/lib/apt/lists/*
88+
89+
RUN echo '/usr/local/cuda/lib64/stubs' > /etc/ld.so.conf.d/cuda_stubs.conf && ldconfig
90+
91+
RUN git clone https://github.com/hpc/xpmem \
92+
&& cd xpmem/lib \
93+
&& gcc -I../include -shared -o libxpmem.so.1 libxpmem.c \
94+
&& ln -s libxpmem.so.1 libxpmem.so \
95+
&& mv libxpmem.so* /usr/lib \
96+
&& cp ../include/xpmem.h /usr/include/ \
97+
&& ldconfig \
98+
&& cd ../../ \
99+
&& rm -Rf xpmem
100+
101+
RUN wget -q https://github.com/ofiwg/libfabric/archive/v${libfabric_version}.tar.gz \
102+
&& tar xf v${libfabric_version}.tar.gz \
103+
&& cd libfabric-${libfabric_version} \
104+
&& ./autogen.sh \
105+
&& ./configure --prefix=/usr --with-cuda=/usr/local/cuda \
106+
&& make -j$(nproc) \
107+
&& make install \
108+
&& ldconfig \
109+
&& cd .. \
110+
&& rm -rf v${libfabric_version}.tar.gz libfabric-${libfabric_version}
111+
112+
RUN wget -q https://www.mpich.org/static/downloads/${mpi_version}/mpich-${mpi_version}.tar.gz \
113+
&& tar xf mpich-${mpi_version}.tar.gz \
114+
&& cd mpich-${mpi_version} \
115+
&& ./autogen.sh \
116+
&& ./configure --prefix=/usr --enable-fast=O3,ndebug --enable-fortran --enable-cxx --with-device=ch4:ofi --with-libfabric=/usr --with-xpmem=/usr --with-cuda=/usr/local/cuda \
117+
&& make -j$(nproc) \
118+
&& make install \
119+
&& ldconfig \
120+
&& cd .. \
121+
&& rm -rf mpich-${mpi_version}.tar.gz mpich-${mpi_version}
122+
123+
RUN wget -q http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-v${osu_version}.tar.gz \
124+
&& tar xf osu-micro-benchmarks-v${osu_version}.tar.gz \
125+
&& cd osu-micro-benchmarks-v${osu_version} \
126+
&& ./configure --prefix=/usr/local --with-cuda=/usr/local/cuda CC=$(which mpicc) CFLAGS=-O3 \
127+
&& make -j$(nproc) \
128+
&& make install \
129+
&& cd .. \
130+
&& rm -rf osu-micro-benchmarks-v${osu_version} osu-micro-benchmarks-v${osu_version}.tar.gz
131+
132+
RUN rm /etc/ld.so.conf.d/cuda_stubs.conf && ldconfig
133+
```
134+
135+
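
The resulting image can be built and pushed with any OCI-compatible tool. A minimal sketch using Podman, where the registry path and image name are placeholders to be replaced with your own:

```console
$ podman build -f Dockerfile.gpu -t <registry>/<namespace>/mpich-osu:gpu .
$ podman push <registry>/<namespace>/mpich-osu:gpu
```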

!!! important "GPU-to-GPU inter-node communication"
    To make sure that GPU-to-GPU performance is good for inter-node communication, one must set the variable

    ```console
    $ export MPIR_CVAR_CH4_OFI_ENABLE_HMEM=1
    ```

Once the container is built and pushed to a registry, one can create a [container environment][ref-container-engine], for example as sketched below.
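
A minimal sketch of what such an environment definition file (here called `osu_gpu.toml`) could look like; the image reference is a placeholder, and the annotation name for enabling the CXI hook should be verified against the [container engine][ref-container-engine] documentation:

```toml
# Image reference is a placeholder; use the image pushed in the previous step
image = "<registry>/<namespace>/mpich-osu:gpu"

[annotations]
# Enable the CXI hook so that the host xpmem/libfabric are injected at runtime
# (annotation name assumed; verify against the container engine documentation)
com.hooks.cxi.enabled = "true"
```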

To verify performance, one can run the `osu_bw` benchmark, which measures the bandwidth between two ranks for different message sizes.
For reference, this is the expected performance for different memory residencies, for both intra-node and inter-node communication:
=== "CPU-to-CPU memory intra-node"
145+
```console
146+
$ export MPIR_CVAR_CH4_OFI_ENABLE_HMEM=1
147+
$ srun --mpi=pmi2 -t00:05:00 --environment=$PWD/osu_gpu.toml -n2 -N1 /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw H H
148+
# OSU MPI Bandwidth Test v7.5
149+
# Datatype: MPI_CHAR.
150+
# Size Bandwidth (MB/s)
151+
1 1.19
152+
2 2.37
153+
4 4.78
154+
8 9.61
155+
16 8.71
156+
32 38.38
157+
64 76.89
158+
128 152.89
159+
256 303.63
160+
512 586.09
161+
1024 1147.26
162+
2048 2218.82
163+
4096 4303.92
164+
8192 8165.95
165+
16384 7178.94
166+
32768 9574.09
167+
65536 43786.86
168+
131072 53202.36
169+
262144 64046.90
170+
524288 60504.75
171+
1048576 36400.29
172+
2097152 28694.38
173+
4194304 23906.16
174+
```
175+
176+
=== "CPU-to-CPU memory inter-node"
177+
```console
178+
$ export MPIR_CVAR_CH4_OFI_ENABLE_HMEM=1
179+
$ srun --mpi=pmi2 -t00:05:00 --environment=$PWD/osu_gpu.toml -n2 -N2 /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw H H
180+
# OSU MPI Bandwidth Test v7.5
181+
# Datatype: MPI_CHAR.
182+
# Size Bandwidth (MB/s)
183+
1 0.97
184+
2 1.95
185+
4 3.91
186+
8 7.80
187+
16 15.67
188+
32 31.24
189+
64 62.58
190+
128 124.99
191+
256 249.13
192+
512 499.63
193+
1024 1009.57
194+
2048 1989.46
195+
4096 3996.43
196+
8192 7139.42
197+
16384 14178.70
198+
32768 18920.35
199+
65536 22169.18
200+
131072 23226.08
201+
262144 23627.48
202+
524288 23838.28
203+
1048576 23951.16
204+
2097152 24007.73
205+
4194304 24037.14
206+
```
207+
208+
=== "GPU-to-GPU memory intra-node"
209+
```console
210+
$ export MPIR_CVAR_CH4_OFI_ENABLE_HMEM=1
211+
$ srun --mpi=pmi2 -t00:05:00 --environment=$PWD/osu_gpu.toml -n2 -N1 /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw D D
212+
# OSU MPI-CUDA Bandwidth Test v7.5
213+
# Datatype: MPI_CHAR.
214+
# Size Bandwidth (MB/s)
215+
1 0.14
216+
2 0.29
217+
4 0.58
218+
8 1.16
219+
16 2.37
220+
32 4.77
221+
64 9.87
222+
128 19.77
223+
256 39.52
224+
512 78.29
225+
1024 158.19
226+
2048 315.93
227+
4096 633.14
228+
8192 1264.69
229+
16384 2543.21
230+
32768 5051.02
231+
65536 10069.17
232+
131072 20178.56
233+
262144 38102.36
234+
524288 64397.91
235+
1048576 84937.73
236+
2097152 104723.15
237+
4194304 115214.94
238+
```
239+
240+
=== "GPU-to-GPU memory inter-node"
241+
```console
242+
$ export MPIR_CVAR_CH4_OFI_ENABLE_HMEM=1
243+
$ srun --mpi=pmi2 -t00:05:00 --environment=$PWD/osu_gpu.toml -n2 -N2 /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw D D
244+
# OSU MPI-CUDA Bandwidth Test v7.5
245+
# Datatype: MPI_CHAR.
246+
# Size Bandwidth (MB/s)
247+
1 0.09
248+
2 0.18
249+
4 0.37
250+
8 0.74
251+
16 1.48
252+
32 2.96
253+
64 5.91
254+
128 11.80
255+
256 227.08
256+
512 463.72
257+
1024 923.58
258+
2048 1740.73
259+
4096 3505.87
260+
8192 6351.56
261+
16384 13377.55
262+
32768 17226.43
263+
65536 21416.23
264+
131072 22733.04
265+
262144 23335.00
266+
524288 23624.70
267+
1048576 23821.72
268+
2097152 23928.62
269+
4194304 23974.34
270+
```
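
The same measurement can also be submitted as a batch job. A minimal `sbatch` sketch, reusing the flags from the interactive examples above for the GPU-to-GPU inter-node case:

```bash
#!/bin/bash
#SBATCH --job-name=osu-bw
#SBATCH --nodes=2
#SBATCH --ntasks=2
#SBATCH --time=00:05:00

# Required for good GPU-to-GPU inter-node bandwidth (see the note above)
export MPIR_CVAR_CH4_OFI_ENABLE_HMEM=1

# Device-to-device bandwidth between two ranks on different nodes
srun --mpi=pmi2 --environment=$PWD/osu_gpu.toml \
    /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw D D
```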

mkdocs.yml (+1 -0)

@@ -66,6 +66,7 @@ nav:
      - 'Communication Libraries':
          - software/communication/index.md
          - 'Cray MPICH': software/communication/cray-mpich.md
+         - 'MPICH': software/communication/mpich.md
          - 'OpenMPI': software/communication/openmpi.md
          - 'NCCL': software/communication/nccl.md
          - 'RCCL': software/communication/rccl.md
