Commit ec3e289

Add note about missing user ids when starting containers
1 parent 4aecf6d commit ec3e289

File tree

1 file changed: +27 -0 lines changed

  • docs/software/container-engine

docs/software/container-engine/run.md

Lines changed: 27 additions & 0 deletions
@@ -24,6 +24,33 @@ There are three ways to do so:
 !!! note "Shared container at the node-level"
     For memory efficiency reasons, all Slurm tasks on an individual compute node share the same container, including its filesystem. As a consequence, any write operation to the container filesystem by one task will eventually become visible to all other tasks on the same node.

+!!! warning "Container start failure with `id: cannot find name for user ID`"
+    If your Slurm job using a container fails to start with an error message similar to:
+    ```console
+    slurmstepd: error: pyxis: container start failed with error code: 1
+    slurmstepd: error: pyxis: container exited too soon
+    slurmstepd: error: pyxis: printing engine log file:
+    slurmstepd: error: pyxis: id: cannot find name for user ID 42
+    slurmstepd: error: pyxis: id: cannot find name for user ID 42
+    slurmstepd: error: pyxis: id: cannot find name for user ID 42
+    slurmstepd: error: pyxis: mkdir: cannot create directory ‘/iopsstor/scratch/cscs/42’: Permission denied
+    slurmstepd: error: pyxis: couldn't start container
+    slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
+    slurmstepd: error: Failed to invoke spank plugin stack
+    srun: error: nid001234: task 0: Exited with exit code 1
+    srun: Terminating StepId=12345.0
+    ```
+    it does not indicate an issue with your container, but rather means that the user database on one or more of the compute nodes is not fully synchronized.
+    If the problematic node is not automatically drained, please [let us know][ref-get-in-touch] so that we can ensure the node is in a good state.
+    You can check the state of a node using `sinfo --nodes=<node>`, e.g.:
+    ```console
+    $ sinfo --nodes=nid006886
+    PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
+    debug        up    1:30:00      0    n/a
+    normal*      up   12:00:00      1 drain$ nid006886
+    xfer         up 1-00:00:00      0    n/a
+    ```
+
 ### Use from batch scripts

 Use `--environment` with the Slurm command (e.g., `srun` or `salloc`):
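
For illustration, a minimal sketch of such an invocation, assuming an environment definition named `my-env` already exists (both the environment name and the command are placeholders, not taken from this commit):

```console
$ srun --environment=my-env cat /etc/os-release
```

The same `--environment` option can also be passed to `salloc`.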
