
Commit adbf2bc: Move container user id issue to known issues

1 parent: 3608c39

File tree

2 files changed: +32 / -26 lines


docs/software/container-engine/known-issue.md

Lines changed: 29 additions & 0 deletions

````diff
@@ -79,3 +79,32 @@ The use of `--environment` as `#SBATCH` is known to cause **unexpected behaviors
 - **Nested use of `--environment`**: running `srun --environment` in `#SBATCH --environment` results in double-entering EDF containers, causing unexpected errors in the underlying container runtime.
 
 To avoid any unexpected confusion, users are advised **not** to use `--environment` as `#SBATCH`. If users encounter a problem while using this, it's recommended to move `--environment` from `#SBATCH` to each `srun` and see if the problem disappears.
+
+[](){#ref-ce-no-user-id}
+## Container start fails with `id: cannot find name for user ID`
+
+If your Slurm job using a container fails to start with an error message similar to:
+
+```console
+slurmstepd: error: pyxis: container start failed with error code: 1
+slurmstepd: error: pyxis: container exited too soon
+slurmstepd: error: pyxis: printing engine log file:
+slurmstepd: error: pyxis: id: cannot find name for user ID 42
+slurmstepd: error: pyxis: id: cannot find name for user ID 42
+slurmstepd: error: pyxis: id: cannot find name for user ID 42
+slurmstepd: error: pyxis: mkdir: cannot create directory ‘/iopsstor/scratch/cscs/42’: Permission denied
+slurmstepd: error: pyxis: couldn't start container
+slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
+slurmstepd: error: Failed to invoke spank plugin stack
+srun: error: nid001234: task 0: Exited with exit code 1
+srun: Terminating StepId=12345.0
+```
+
+this does not indicate an issue with your container; it means that the user databases on one or more of the compute nodes are not fully synchronized.
+If the problematic node is not automatically drained, please [let us know][ref-get-in-touch] so that we can ensure the node is in a good state.
+You can check the state of a node using `sinfo --nodes=<node>`, e.g.:
+
+```console
+$ sinfo --nodes=nid006886
+PARTITION AVAIL  TIMELIMIT  NODES  STATE   NODELIST
+debug        up    1:30:00      0  n/a
+normal*      up   12:00:00      1  drain$  nid006886
+xfer         up 1-00:00:00      0  n/a
+```
````
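The `sinfo` check described in the added section lends itself to scripting. A minimal sketch (the `sinfo_check` helper and the `awk` filter are illustrative additions, not part of the commit) that flags nodes whose reported Slurm state begins with `drain`:

```shell
# Hypothetical helper: pipe `sinfo --nodes=<node>` output through this awk
# filter to print only nodes whose STATE column starts with "drain".
sinfo_check() {
    awk 'NR > 1 && $5 ~ /^drain/ { print $6 ": state " $5 }'
}

# Demonstrated on the sample output shown above (no live cluster needed);
# in practice you would run: sinfo --nodes=nid006886 | sinfo_check
printf '%s\n' \
  'PARTITION AVAIL TIMELIMIT NODES STATE NODELIST' \
  'debug up 1:30:00 0 n/a' \
  'normal* up 12:00:00 1 drain$ nid006886' \
  'xfer up 1-00:00:00 0 n/a' | sinfo_check
# prints: nid006886: state drain$
```

This relies only on the column layout of the default `sinfo` output; a custom `--format` string on the real command would require adjusting the field numbers.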

docs/software/container-engine/run.md

Lines changed: 3 additions & 26 deletions

````diff
@@ -24,32 +24,9 @@ There are three ways to do so:
 !!! note "Shared container at the node-level"
     For memory efficiency reasons, all Slurm tasks on an individual compute node share the same container, including its filesystem. As a consequence, any write operation to the container filesystem by one task will eventually become visible to all other tasks on the same node.
 
-??? warning "Container start failure with `id: cannot find name for user ID`"
-    If your slurm job using a container fails to start with an error message similar to:
-    ```console
-    slurmstepd: error: pyxis: container start failed with error code: 1
-    slurmstepd: error: pyxis: container exited too soon
-    slurmstepd: error: pyxis: printing engine log file:
-    slurmstepd: error: pyxis: id: cannot find name for user ID 42
-    slurmstepd: error: pyxis: id: cannot find name for user ID 42
-    slurmstepd: error: pyxis: id: cannot find name for user ID 42
-    slurmstepd: error: pyxis: mkdir: cannot create directory ‘/iopsstor/scratch/cscs/42’: Permission denied
-    slurmstepd: error: pyxis: couldn't start container
-    slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
-    slurmstepd: error: Failed to invoke spank plugin stack
-    srun: error: nid001234: task 0: Exited with exit code 1
-    srun: Terminating StepId=12345.0
-    ```
-    it does not indicate an issue with your container, but instead means that one or more of the compute nodes have user databases that are not fully synchronized.
-    If the problematic node is not automatically drained, please [let us know][ref-get-in-touch] so that we can ensure the node is in a good state.
-    You can check the state of a node using `sinfo --nodes=<node>`, e.g.:
-    ```console
-    $ sinfo --nodes=nid006886
-    PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
-    debug        up    1:30:00      0  n/a
-    normal*      up   12:00:00      1  drain$  nid006886
-    xfer         up 1-00:00:00      0  n/a
-    ```
+!!! warning "Container start failure with `id: cannot find name for user ID`"
+    Containers may fail to start due to user database issues on compute nodes.
+    See [this section][ref-ce-no-user-id] for more details.
 
 ### Use from batch scripts
````
