Merged
Changes from 2 commits
31 changes: 31 additions & 0 deletions docs/software/communication/cray-mpich.md
@@ -77,6 +77,12 @@ Cray MPICH may sometimes hang on larger runs.
export FI_MR_CACHE_MONITOR=disabled
```

The option
```bash
export FI_MR_CACHE_MONITOR=userfaultfd
```
may also avoid hangs, and typically performs better than completely disabling the cache monitor.

Performance may be negatively affected by either setting.
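
As a minimal sketch, the two settings above could be wrapped in a small shell helper so that a job script selects one explicitly. The function name `set_cache_monitor` is hypothetical and is not part of Cray MPICH or libfabric; it only exports the `FI_MR_CACHE_MONITOR` values already shown above.

```bash
# Hypothetical helper (not part of Cray MPICH or libfabric).
# Exports FI_MR_CACHE_MONITOR, defaulting to userfaultfd and accepting
# only the two values discussed above.
set_cache_monitor() {
    local mode="${1:-userfaultfd}"
    case "$mode" in
        userfaultfd|disabled)
            export FI_MR_CACHE_MONITOR="$mode"
            ;;
        *)
            echo "set_cache_monitor: unsupported mode '$mode'" >&2
            return 1
            ;;
    esac
}

# Try userfaultfd first; switch to "disabled" only if hangs persist.
set_cache_monitor userfaultfd
```

Rejecting unknown values keeps a typo in a job script from silently leaving the default cache monitor in place.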

#### `"cxil_map: write error"` when doing inter-node GPU-aware MPI communication
@@ -88,6 +94,31 @@ Note that this has a performance impact for small message sizes, so it should on
export FI_CXI_SAFE_DEVMEM_COPY_THRESHOLD=0
```

[](){#ref-communication-cray-mpich-slow-intranode}
#### Slow intra-node host communication with Cray MPICH

Cray MPICH can perform poorly for intra-node CPU-to-CPU communication.

!!! info "Workaround"
    In some situations Cray MPICH can perform better when communication is done over the NICs, even within a node.
    To force Cray MPICH to use NICs for all communication, set:

    ```bash
    export MPIR_CVAR_NO_LOCAL=1
    ```

Whenever possible, prefer using GPU-GPU communication instead of CPU-CPU communication.
It can even be beneficial to copy data to the GPU solely for the communication step, even if the buffer originally resides in CPU memory.
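
Because routing intra-node traffic over the NICs only helps in some situations, it is worth running an application both with and without the setting before adopting it. The sketch below assumes a hypothetical wrapper named `run_with_no_local`. The `sh -c` commands are placeholders for a real launch command, which is not shown here.

```bash
# Hypothetical wrapper (not part of Cray MPICH): run a command with
# MPIR_CVAR_NO_LOCAL=1 set, or with the variable removed from its environment.
run_with_no_local() {
    local enabled="$1"
    shift
    if [ "$enabled" = "1" ]; then
        MPIR_CVAR_NO_LOCAL=1 "$@"
    else
        env -u MPIR_CVAR_NO_LOCAL "$@"
    fi
}

# Placeholder commands that only print what the child process sees.
run_with_no_local 1 sh -c 'echo "MPIR_CVAR_NO_LOCAL=${MPIR_CVAR_NO_LOCAL:-unset}"'
run_with_no_local 0 sh -c 'echo "MPIR_CVAR_NO_LOCAL=${MPIR_CVAR_NO_LOCAL:-unset}"'
```

Setting the variable per command, instead of exporting it for the whole job script, keeps the two timed runs independent of each other.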

#### `"cxil_map: write error"` when doing inter-node GPU-aware MPI communication

This error message is sometimes triggered by applications that use GPU Direct MPI calls and hit a bug in gdrcopy (a low-level library for low-latency copies to and from GPU memory).
Setting the following option will completely disable gdrcopy.
Note that this has a performance impact for small message sizes, so it should only be enabled on a case-by-case basis.
```bash
export FI_CXI_SAFE_DEVMEM_COPY_THRESHOLD=0
```

### Resolved issues

#### `"cxil_map: write error"` when doing inter-node GPU-aware MPI communication