From 937582ee76c83aa1ccbcb41ae25f45b648532be5 Mon Sep 17 00:00:00 2001
From: Mikael Simberg
Date: Fri, 5 Sep 2025 14:38:17 +0200
Subject: [PATCH 1/3] Add alternative workaround for Cray MPICH hangs

---
 docs/software/communication/cray-mpich.md | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/docs/software/communication/cray-mpich.md b/docs/software/communication/cray-mpich.md
index af6691cd..39c623f5 100644
--- a/docs/software/communication/cray-mpich.md
+++ b/docs/software/communication/cray-mpich.md
@@ -77,6 +77,12 @@ Cray MPICH may sometimes hang on larger runs.
     export FI_MR_CACHE_MONITOR=disabled
     ```
 
+    The option
+    ```bash
+    export FI_MR_CACHE_MONITOR=userfaultfd
+    ```
+    may also avoid hangs, and typically performs better than completely disabling the cache monitor.
+
     Performance may be negatively affected by this option.
 
 #### `"cxil_map: write error"` when doing inter-node GPU-aware MPI communication

From f2c12c56ce3c50b62112b87162100fb8f7ceb2c0 Mon Sep 17 00:00:00 2001
From: Mikael Simberg
Date: Fri, 5 Sep 2025 14:46:06 +0200
Subject: [PATCH 2/3] Add workaround for slow intra-node communication with Cray MPICH

---
 docs/software/communication/cray-mpich.md | 25 +++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/docs/software/communication/cray-mpich.md b/docs/software/communication/cray-mpich.md
index 39c623f5..3132edaa 100644
--- a/docs/software/communication/cray-mpich.md
+++ b/docs/software/communication/cray-mpich.md
@@ -94,6 +94,31 @@ Note that this has a performance impact for small message sizes, so it should on
 export FI_CXI_SAFE_DEVMEM_COPY_THRESHOLD=0
 ```
 
+[](){#ref-communication-cray-mpich-slow-intranode}
+#### Slow intra-node host communication with Cray MPICH
+
+Cray MPICH can perform badly when doing intra-node CPU-CPU memory communication.
+
+!!! info "Workaround"
+    In some situations Cray MPICH can perform better when communication is done over the NICs, even within a node.
+    To force Cray MPICH to use NICs for all communication, set:
+
+    ```bash
+    export MPIR_CVAR_NO_LOCAL=1
+    ```
+
+    Whenever possible, prefer using GPU-GPU communication instead of CPU-CPU communication.
+    It can even be beneficial to transfer data to the GPU only for the communication even if the buffer originally is in CPU memory.
+
+#### `"cxil_map: write error"` when doing inter-node GPU-aware MPI communication
+
+This error message is sometimes triggered by applications that use GPU Direct MPI calls when they trigger a bug in gdrcopy (a low-level library used to copy buffers between GPUs).
+Setting the following option will completely disable gdrcopy.
+Note that this has a performance impact for small message sizes, so it should only be enabled on a case-by-case basis.
+```bash
+export FI_CXI_SAFE_DEVMEM_COPY_THRESHOLD=0
+```
+
 ### Resolved issues
 
 #### `"cxil_map: write error"` when doing inter-node GPU-aware MPI communication

From 8875a4936c58d4c456737dc3a3edb9ce8be08cb4 Mon Sep 17 00:00:00 2001
From: Mikael Simberg
Date: Fri, 5 Sep 2025 15:10:04 +0200
Subject: [PATCH 3/3] Apply suggestion from @msimberg

---
 docs/software/communication/cray-mpich.md | 9 ---------
 1 file changed, 9 deletions(-)

diff --git a/docs/software/communication/cray-mpich.md b/docs/software/communication/cray-mpich.md
index 3132edaa..8bac8559 100644
--- a/docs/software/communication/cray-mpich.md
+++ b/docs/software/communication/cray-mpich.md
@@ -110,15 +110,6 @@ Cray MPICH can perform badly when doing intra-node CPU-CPU memory communication.
     Whenever possible, prefer using GPU-GPU communication instead of CPU-CPU communication.
     It can even be beneficial to transfer data to the GPU only for the communication even if the buffer originally is in CPU memory.
 
-#### `"cxil_map: write error"` when doing inter-node GPU-aware MPI communication
-
-This error message is sometimes triggered by applications that use GPU Direct MPI calls when they trigger a bug in gdrcopy (a low-level library used to copy buffers between GPUs).
-Setting the following option will completely disable gdrcopy.
-Note that this has a performance impact for small message sizes, so it should only be enabled on a case-by-case basis.
-```bash
-export FI_CXI_SAFE_DEVMEM_COPY_THRESHOLD=0
-```
-
 ### Resolved issues
 
 #### `"cxil_map: write error"` when doing inter-node GPU-aware MPI communication