Commit f823d98

Add known issues
1 parent 8e921e3 commit f823d98

1 file changed

+17
-0
lines changed

docs/software/container-engine.md

Lines changed: 17 additions & 0 deletions
@@ -871,3 +871,20 @@ abc
```

Notice the section `[annotations]` disabling Slurm and CXI hooks.

## Using NCCL from remote SSH terminals

We are aware of an issue that occurs when both [the AWS OFI NCCL hook][ref-ce-aws-ofi-hook] and [the SSH hook][ref-ce-ssh-hook] are enabled and programs using NCCL are launched from Bash sessions connected via SSH.
The issue manifests as messages reporting `Error: network 'AWS Libfabric' not found`.

In addition to setting up a server for remote connections, the SSH hook also performs actions intended to improve the user experience. One of these is creating a script that Bash loads in order to propagate the container job environment variables when connecting through SSH.
The script renders the value of the `NCCL_NET` variable as `"'AWS Libfabric'"`, that is, with an additional pair of quotes compared to the original value set by the AWS OFI NCCL hook. The quoted string causes NCCL to look for a network that is not defined, resulting in the unrecoverable error mentioned earlier.

As a workaround, reset the `NCCL_NET` variable to the correct value, e.g. `export NCCL_NET="AWS Libfabric"`. This allows NCCL to use the AWS OFI plugin and access the Slingshot network.
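
If hard-coding the value is undesirable, the workaround can also be sketched as a small shell snippet that strips one pair of literal single quotes from `NCCL_NET` when they are present. This is only an illustration: `strip_single_quotes` is a hypothetical helper name, not something provided by the SSH hook, and the snippet assumes the only defect in the value is the extra pair of quotes described above.

```bash
# Hypothetical helper (not part of the SSH hook): remove one pair of
# surrounding single quotes from a value, leaving other values untouched.
strip_single_quotes() {
    value=$1
    case $value in
        \'*\')
            value=${value#\'}   # drop the leading quote
            value=${value%\'}   # drop the trailing quote
            ;;
    esac
    printf '%s\n' "$value"
}

# Repair NCCL_NET only when it is set in the current session.
if [ -n "${NCCL_NET:-}" ]; then
    NCCL_NET=$(strip_single_quotes "$NCCL_NET")
    export NCCL_NET
fi
```

The snippet would be run in the SSH session before launching the NCCL program. A value without surrounding single quotes passes through unchanged, so running it when the variable is already correct has no effect.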
## Mounting home directories when using the SSH hook

Mounting individual home directories (usually located on the `/users` filesystem) overrides the files created by the SSH hook in `${HOME}/.ssh`, including the one containing the authorized key entered in the EDF through the corresponding annotation. In other words, when using the SSH hook and bind mounting the user's own home folder or the whole of `/users`, the desired key must be authorized manually.

It is generally NOT recommended to mount home folders inside containers, due to the risk of exposing personal data to programs inside the container.
Defining a mount related to `/users` in the EDF should only be done when there is a specific reason to do so, or when the container image being deployed is trusted.
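
When a home directory mount is nevertheless required, manually authorizing a key could look like the sketch below. It assumes the SSH server reads the conventional OpenSSH file `${HOME}/.ssh/authorized_keys`; this section does not name the file, so verify it for your setup. `authorize_key` and the key path in the usage comment are hypothetical.

```bash
# Hypothetical helper: append a public key to authorized_keys unless an
# identical line is already present.
# Assumption: the server reads <ssh_dir>/authorized_keys (OpenSSH convention).
authorize_key() {
    pubkey_file=$1
    ssh_dir=${2:-"$HOME/.ssh"}

    mkdir -p "$ssh_dir"
    chmod 700 "$ssh_dir"
    touch "$ssh_dir/authorized_keys"
    chmod 600 "$ssh_dir/authorized_keys"

    # -x: match whole lines, -F: fixed strings, -f: read patterns from file
    if ! grep -qxFf "$pubkey_file" "$ssh_dir/authorized_keys"; then
        cat "$pubkey_file" >> "$ssh_dir/authorized_keys"
    fi
}

# Usage (the key path is a placeholder):
#   authorize_key "$HOME/.ssh/id_ed25519.pub"
```

Because the helper skips keys that are already listed, it can be called repeatedly without duplicating entries.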