**`docs/software/container-engine/known-issue.md`** (29 additions, 0 deletions)
The use of `--environment` as an `#SBATCH` option is known to cause **unexpected behaviors**, for example:
- **Nested use of `--environment`**: running `srun --environment` inside a job submitted with `#SBATCH --environment` results in entering the EDF container twice, causing unexpected errors in the underlying container runtime.
To avoid unexpected behavior, users are advised **not** to use `--environment` as an `#SBATCH` option. If a problem occurs when doing so, it is recommended to move `--environment` from `#SBATCH` to each `srun` invocation and check whether the problem disappears.
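As an illustrative sketch of the recommended pattern (the job name and the EDF name `my-env` are hypothetical), keep `--environment` on each `srun` rather than in the `#SBATCH` preamble:

```shell
#!/bin/bash
#SBATCH --job-name=ce-example   # hypothetical job name
#SBATCH --nodes=1
# Deliberately no '#SBATCH --environment=...' line here.

# Pass the EDF to each srun invocation instead:
srun --environment=my-env hostname
srun --environment=my-env id
```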

[](){#ref-ce-no-user-id}
## Container start fails with `id: cannot find name for user ID`
If your Slurm job using a container fails to start with an error message similar to:
```console
slurmstepd: error: pyxis: container start failed with error code: 1
slurmstepd: error: pyxis: container exited too soon
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
srun: error: nid001234: task 0: Exited with exit code 1
srun: Terminating StepId=12345.0
```
it does not indicate an issue with your container; instead, it means that the user database on one or more of the compute nodes is not fully synchronized.
If the problematic node is not automatically drained, please [let us know][ref-get-in-touch] so that we can ensure the node is in a good state.
You can check the state of a node using `sinfo --nodes=<node>`, e.g.:

```console
$ sinfo --nodes=nid006886
PARTITION AVAIL  TIMELIMIT  NODES  STATE  NODELIST
debug        up    1:30:00      0  n/a
normal*      up   12:00:00      1  drain$ nid006886
xfer         up 1-00:00:00      0  n/a
```
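To confirm that a specific node fails to resolve your user, a quick check (the node name here is a placeholder) is to run `id` on that node:

```shell
# Hypothetical diagnostic: run `id` on the suspect node (replace the node name).
# A healthy node prints your user and group names; an unsynchronized one
# reproduces the "id: cannot find name for user ID" error.
srun --nodelist=nid006886 --ntasks=1 id
```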

**`docs/software/container-engine/run.md`** (3 additions, 26 deletions)
There are three ways to do so:

!!! note "Shared container at the node-level"
    For memory efficiency reasons, all Slurm tasks on an individual compute node share the same container, including its filesystem. As a consequence, any write operation to the container filesystem by one task will eventually become visible to all other tasks on the same node.
!!! warning "Container start failure with `id: cannot find name for user ID`"
    Containers may fail to start due to user database issues on compute nodes.
    See [this section][ref-ce-no-user-id] for more details.