Commit 3c232e7

Use libfabric fork with CUDA GDR hotfix
1 parent 6ba4a64 commit 3c232e7

2 files changed: +4 -2 lines changed

examples/container/comm-fwk/Containerfile

Lines changed: 2 additions & 1 deletion
@@ -83,7 +83,8 @@ RUN git clone --branch ${libcxi_version} --depth 1 https://github.com/HewlettPac

 # Install libfabric
 ARG libfabric_version=2.1.0
-RUN git clone --branch v${libfabric_version} --depth 1 https://github.com/ofiwg/libfabric.git \
+#RUN git clone --branch v${libfabric_version} --depth 1 https://github.com/ofiwg/libfabric.git \
+RUN git clone --branch cuda_gdrcopy_unregister_fix --depth 1 https://github.com/Madeeks/libfabric.git \
 && cd libfabric \
 && ./autogen.sh \
 && ./configure --prefix=/usr --with-cuda=/usr/local/cuda --enable-cuda-dlopen --enable-gdrcopy-dlopen --enable-xpmem=/usr --enable-cxi --enable-lnx --enable-efa \
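The hunk above clones a branch of a fork rather than a release tag, so the code that ends up in the image depends on where that branch points at build time. A minimal sketch of recording the resolved commit follows; the helper name `resolved_commit` and the output path in the usage comment are hypothetical and not part of this commit.

```shell
# Print the full commit hash that a Git checkout currently points at.
# Usage: resolved_commit <path-to-git-checkout>
resolved_commit() {
    git -C "$1" rev-parse HEAD
}

# Example (hypothetical output path; "libfabric" is the clone directory
# used in the Containerfile):
#   resolved_commit libfabric > /libfabric.commit
```

Storing the hash next to the build would make it possible to tell later which state of the `cuda_gdrcopy_unregister_fix` branch a given image was built from.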

examples/container/comm-fwk/README.md

Lines changed: 2 additions & 1 deletion
@@ -13,7 +13,7 @@ Builds of the image are currently hosted on the Quay.io registry at https://quay
 - CUDA 12.8.1 (including other dependencies like NCCL from the NVIDIA Docker Hub image)
 - GDRCopy 2.5
 - XPMEM (commit 0d0bad4e1d07b38d53ecc8f20786bb1328c446da - corresponds to version 2.6.5-36 in Spack)
-- Libfabric 2.1.0 with the following providers explicitly enabled:
+- A patched Libfabric 2.1.0-dev (see notes) with the following providers explicitly enabled:
 - CXI
 - AWS EFA
 - LINKx
@@ -25,6 +25,7 @@ Builds of the image are currently hosted on the Quay.io registry at https://quay
 ## Notes

 - This image and its derivatives are self-sufficient with respect to Slingshot connectivity, and do not require hooks to inject a custom CXI stack from the host.
+- The libfabric in this image contains an experimental fix for CUDA GDR support (more details [here](https://github.com/ofiwg/libfabric/issues/10865#issuecomment-2735866065)); the fix is applied on top of libfabric's `main` commit at the moment of writing (commit faf13301a4a9628b6c9a28a06d936258c6d368af), and the code is available from [this forked branch](https://github.com/Madeeks/libfabric/tree/cuda_gdrcopy_unregister_fix).
 - The libfabric EFA provider is included to leave open the possibility to experiment with derived images on AWS infrastructure as well.
 - The libfabric LINKx provider is included to allow for experimentation.
 - Although only the libfabric framework and its CXI provider are required to support the Slingshot network, this image also packages the UCX communication framework to allow building a broader set of software (e.g. some OpenSHMEM implementations) and supporting optimized Infiniband communication as well.
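The README hunks above state that the patched libfabric is built with the CXI, EFA and LINKx providers enabled. One way to check this inside a built image is sketched below. It assumes that `fi_info -l` prints each available provider at the start of a line as `<name>:`, and that the three providers are reported as `cxi`, `efa` and `lnx` (matching the `--enable-*` configure flags). The helper name `missing_providers` is hypothetical.

```shell
# Print each required provider that is absent from a provider listing on stdin.
# Assumes every available provider appears at the start of a line as "<name>:".
# Usage: fi_info -l | missing_providers cxi efa lnx
missing_providers() {
    listing=$(cat)
    for name in "$@"; do
        printf '%s\n' "$listing" | grep -q "^${name}:" || printf '%s\n' "$name"
    done
}
```

Empty output would mean all requested providers were found. Any name printed is a provider that the listing did not contain.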
