[](){#ref-communication-mpich}
# MPICH

MPICH is an open-source MPI implementation that is actively developed in its [GitHub repository](https://github.com/pmodels/mpich).
It can be installed inside containers either manually from source, or with Spack or a similar package manager.

## MPICH inside containers

MPICH can be built inside containers; however, for native Slingshot performance, special care has to be taken to ensure that communication is optimal in all of the following cases:

* Intra-node communication (this goes via shared memory, in particular `xpmem`)
* Inter-node communication (this should go through the OpenFabrics Interfaces (OFI), provided by `libfabric`)
* Host-to-host memory communication
* Device-to-device memory communication

To achieve native performance, MPICH must be built with both `libfabric` and `xpmem` support.
Additionally, when building for GH200 nodes, one needs to ensure that both `libfabric` and MPICH are built with CUDA support.

At runtime, the container engine [CXI hook][ref-ce-cxi-hook] replaces the `xpmem` and `libfabric` libraries inside the container with the corresponding libraries of the host system.
This ensures native performance for MPI communication.
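
A simple runtime check is to list the libfabric providers visible from inside the container: with the CXI hook active, the Slingshot `cxi` provider should appear in the output. The command below is only a hedged sketch; it assumes that the `fi_info` utility (installed together with `libfabric` in the Dockerfiles below) is present in the image, and it uses the same container environment file as the benchmark examples further down this page.

```console
$ srun -n1 --environment=$PWD/osu_gpu.toml fi_info -l
```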

The following example Dockerfiles can be used on [Eiger][ref-cluster-eiger] and [Daint][ref-cluster-daint] to build a container image with MPICH and the best communication performance.

They build the necessary packages explicitly and by hand; for production use, one can instead rely on Spack to perform the builds.

=== "Dockerfile.cpu"
    ```Dockerfile
    FROM docker.io/ubuntu:24.04

    ARG libfabric_version=1.22.0
    ARG mpi_version=4.3.1
    ARG osu_version=7.5.1

    RUN apt-get update \
        && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends build-essential ca-certificates automake autoconf libtool make gdb strace wget python3 git gfortran \
        && rm -rf /var/lib/apt/lists/*

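    # Build the xpmem user-space library and install its header
    # (at runtime the CXI hook replaces it with the host's xpmem library)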
    RUN git clone https://github.com/hpc/xpmem \
        && cd xpmem/lib \
        && gcc -I../include -shared -o libxpmem.so.1 libxpmem.c \
        && ln -s libxpmem.so.1 libxpmem.so \
        && mv libxpmem.so* /usr/lib64 \
        && cp ../include/xpmem.h /usr/include/ \
        && ldconfig \
        && cd ../../ \
        && rm -Rf xpmem

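    # Build libfabric (OFI); at runtime the CXI hook replaces it with the host's
    # libfabric, which provides the Slingshot CXI provider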
    RUN wget -q https://github.com/ofiwg/libfabric/archive/v${libfabric_version}.tar.gz \
        && tar xf v${libfabric_version}.tar.gz \
        && cd libfabric-${libfabric_version} \
        && ./autogen.sh \
        && ./configure --prefix=/usr \
        && make -j$(nproc) \
        && make install \
        && ldconfig \
        && cd .. \
        && rm -rf v${libfabric_version}.tar.gz libfabric-${libfabric_version}

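    # Build MPICH with the ch4:ofi device, using the libfabric and xpmem installed above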
    RUN wget -q https://www.mpich.org/static/downloads/${mpi_version}/mpich-${mpi_version}.tar.gz \
        && tar xf mpich-${mpi_version}.tar.gz \
        && cd mpich-${mpi_version} \
        && ./autogen.sh \
        && ./configure --prefix=/usr --enable-fast=O3,ndebug --enable-fortran --enable-cxx --with-device=ch4:ofi --with-libfabric=/usr --with-xpmem=/usr \
        && make -j$(nproc) \
        && make install \
        && ldconfig \
        && cd .. \
        && rm -rf mpich-${mpi_version}.tar.gz mpich-${mpi_version}

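    # Build the OSU micro-benchmarks against the MPICH installed above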
    RUN wget -q http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-v${osu_version}.tar.gz \
        && tar xf osu-micro-benchmarks-v${osu_version}.tar.gz \
        && cd osu-micro-benchmarks-v${osu_version} \
        && ./configure --prefix=/usr/local CC=$(which mpicc) CFLAGS=-O3 \
        && make -j$(nproc) \
        && make install \
        && cd .. \
        && rm -rf osu-micro-benchmarks-v${osu_version} osu-micro-benchmarks-v${osu_version}.tar.gz
    ```

=== "Dockerfile.gpu"
    ```Dockerfile
    FROM docker.io/nvidia/cuda:12.8.1-devel-ubuntu24.04

    ARG libfabric_version=1.22.0
    ARG mpi_version=4.3.1
    ARG osu_version=7.5.1

    RUN apt-get update \
        && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends build-essential ca-certificates automake autoconf libtool make gdb strace wget python3 git gfortran \
        && rm -rf /var/lib/apt/lists/*

    # Use the CUDA stub libraries during the build: the GPU driver and its libraries
    # are not available in the build environment (e.g. when building on Daint or on a machine without a GPU)
    RUN echo '/usr/local/cuda/lib64/stubs' > /etc/ld.so.conf.d/cuda_stubs.conf && ldconfig

    RUN git clone https://github.com/hpc/xpmem \
        && cd xpmem/lib \
        && gcc -I../include -shared -o libxpmem.so.1 libxpmem.c \
        && ln -s libxpmem.so.1 libxpmem.so \
        && mv libxpmem.so* /usr/lib \
        && cp ../include/xpmem.h /usr/include/ \
        && ldconfig \
        && cd ../../ \
        && rm -Rf xpmem

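    # Build libfabric with CUDA support, so that GPU (device) memory can be used directly for communication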
    RUN wget -q https://github.com/ofiwg/libfabric/archive/v${libfabric_version}.tar.gz \
        && tar xf v${libfabric_version}.tar.gz \
        && cd libfabric-${libfabric_version} \
        && ./autogen.sh \
        && ./configure --prefix=/usr --with-cuda=/usr/local/cuda \
        && make -j$(nproc) \
        && make install \
        && ldconfig \
        && cd .. \
        && rm -rf v${libfabric_version}.tar.gz libfabric-${libfabric_version}

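    # Build MPICH with CUDA support in addition to the ch4:ofi device, libfabric and xpmem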
    RUN wget -q https://www.mpich.org/static/downloads/${mpi_version}/mpich-${mpi_version}.tar.gz \
        && tar xf mpich-${mpi_version}.tar.gz \
        && cd mpich-${mpi_version} \
        && ./autogen.sh \
        && ./configure --prefix=/usr --enable-fast=O3,ndebug --enable-fortran --enable-cxx --with-device=ch4:ofi --with-libfabric=/usr --with-xpmem=/usr --with-cuda=/usr/local/cuda \
        && make -j$(nproc) \
        && make install \
        && ldconfig \
        && cd .. \
        && rm -rf mpich-${mpi_version}.tar.gz mpich-${mpi_version}

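    # Build the OSU micro-benchmarks with CUDA support (needed for the device-to-device runs below)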
    RUN wget -q http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-v${osu_version}.tar.gz \
        && tar xf osu-micro-benchmarks-v${osu_version}.tar.gz \
        && cd osu-micro-benchmarks-v${osu_version} \
        && ./configure --prefix=/usr/local --with-cuda=/usr/local/cuda CC=$(which mpicc) CFLAGS=-O3 \
        && make -j$(nproc) \
        && make install \
        && cd .. \
        && rm -rf osu-micro-benchmarks-v${osu_version} osu-micro-benchmarks-v${osu_version}.tar.gz

    # Get rid of the stub libraries again: at runtime the real CUDA driver and libraries are available
    RUN rm /etc/ld.so.conf.d/cuda_stubs.conf && ldconfig
    ```

!!! important "GPU-to-GPU inter-node communication"
    To make sure that GPU-to-GPU performance is good for inter-node communication, one must set the following environment variable:
    ```console
    $ export MPIR_CVAR_CH4_OFI_ENABLE_HMEM=1
    ```

Once the container image is built and pushed to a registry, one can create a [container environment][ref-container-engine], for example with an environment definition file (EDF) like the one sketched below.
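
The following EDF is only a minimal, hypothetical sketch: the image reference is a placeholder that has to be replaced with the image you actually pushed, and further options (mounts, working directory, annotations, ...) are documented on the [container engine][ref-container-engine] page.

```toml
# osu_gpu.toml -- hypothetical image reference, replace with your own registry/image
image = "registry.example.org/<username>/mpich-osu:latest"
```
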
To verify performance, one can run the `osu_bw` benchmark, which measures the point-to-point bandwidth between two ranks for a range of message sizes.
For reference, this is the expected performance for different memory residencies, for intra-node and inter-node communication:

=== "CPU-to-CPU memory intra-node"
    ```console
    $ export MPIR_CVAR_CH4_OFI_ENABLE_HMEM=1
    $ srun --mpi=pmi2 -t00:05:00 --environment=$PWD/osu_gpu.toml -n2 -N1 /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw H H
    # OSU MPI Bandwidth Test v7.5
    # Datatype: MPI_CHAR.
    # Size      Bandwidth (MB/s)
    1           1.19
    2           2.37
    4           4.78
    8           9.61
    16          8.71
    32          38.38
    64          76.89
    128         152.89
    256         303.63
    512         586.09
    1024        1147.26
    2048        2218.82
    4096        4303.92
    8192        8165.95
    16384       7178.94
    32768       9574.09
    65536       43786.86
    131072      53202.36
    262144      64046.90
    524288      60504.75
    1048576     36400.29
    2097152     28694.38
    4194304     23906.16
    ```

=== "CPU-to-CPU memory inter-node"
    ```console
    $ export MPIR_CVAR_CH4_OFI_ENABLE_HMEM=1
    $ srun --mpi=pmi2 -t00:05:00 --environment=$PWD/osu_gpu.toml -n2 -N2 /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw H H
    # OSU MPI Bandwidth Test v7.5
    # Datatype: MPI_CHAR.
    # Size      Bandwidth (MB/s)
    1           0.97
    2           1.95
    4           3.91
    8           7.80
    16          15.67
    32          31.24
    64          62.58
    128         124.99
    256         249.13
    512         499.63
    1024        1009.57
    2048        1989.46
    4096        3996.43
    8192        7139.42
    16384       14178.70
    32768       18920.35
    65536       22169.18
    131072      23226.08
    262144      23627.48
    524288      23838.28
    1048576     23951.16
    2097152     24007.73
    4194304     24037.14
    ```

=== "GPU-to-GPU memory intra-node"
    ```console
    $ export MPIR_CVAR_CH4_OFI_ENABLE_HMEM=1
    $ srun --mpi=pmi2 -t00:05:00 --environment=$PWD/osu_gpu.toml -n2 -N1 /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw D D
    # OSU MPI-CUDA Bandwidth Test v7.5
    # Datatype: MPI_CHAR.
    # Size      Bandwidth (MB/s)
    1           0.14
    2           0.29
    4           0.58
    8           1.16
    16          2.37
    32          4.77
    64          9.87
    128         19.77
    256         39.52
    512         78.29
    1024        158.19
    2048        315.93
    4096        633.14
    8192        1264.69
    16384       2543.21
    32768       5051.02
    65536       10069.17
    131072      20178.56
    262144      38102.36
    524288      64397.91
    1048576     84937.73
    2097152     104723.15
    4194304     115214.94
    ```

=== "GPU-to-GPU memory inter-node"
    ```console
    $ export MPIR_CVAR_CH4_OFI_ENABLE_HMEM=1
    $ srun --mpi=pmi2 -t00:05:00 --environment=$PWD/osu_gpu.toml -n2 -N2 /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw D D
    # OSU MPI-CUDA Bandwidth Test v7.5
    # Datatype: MPI_CHAR.
    # Size      Bandwidth (MB/s)
    1           0.09
    2           0.18
    4           0.37
    8           0.74
    16          1.48
    32          2.96
    64          5.91
    128         11.80
    256         227.08
    512         463.72
    1024        923.58
    2048        1740.73
    4096        3505.87
    8192        6351.56
    16384       13377.55
    32768       17226.43
    65536       21416.23
    131072      22733.04
    262144      23335.00
    524288      23624.70
    1048576     23821.72
    2097152     23928.62
    4194304     23974.34
    ```