Description
I'm trying soperator 1.16.1. I had to build the populate_jail and worker_slurmd images myself because the image pulls always fail due to image size. I use NFS as the shared storage for this test.
After applying the slurm-cluster Helm chart, the slurm1-populate-jail pod finishes running and exits after some time. I guess it prepares the jail root for the login and worker nodes.
But the login and worker nodes all fail with a CrashLoopBackOff error. Looking into the pod logs gives the same log messages as below:
Starting slurmd entrypoint script
cgroup v2 detected, creating cgroup for /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod9fa86a0e_4917_4b4a_a37d_afb81545c892.slice/cri-containerd-d66f3229577110ba455bf50d4cc12422dbef9dd7fb829dec0a6d70ca559b8934.scope
Link users from jail
Link home from jail because slurmd uses it
Bind-mount slurm configs from K8S config map
Make ulimits as big as possible
Apply sysctl limits from /etc/sysctl.conf
vm.max_map_count = 655300
Update linker cache
Complement jail rootfs
+ set -e
+ getopts j:u:wh flag
+ case "${flag}" in
+ jaildir=/mnt/jail
+ getopts j:u:wh flag
+ case "${flag}" in
+ upperdir=/mnt/jail.upper
+ getopts j:u:wh flag
+ case "${flag}" in
+ worker=1
+ getopts j:u:wh flag
+ '[' -z /mnt/jail ']'
+ '[' -z /mnt/jail.upper ']'
+ pushd /mnt/jail
+ echo 'Bind-mount virtual filesystems'
/mnt/jail /
Bind-mount virtual filesystems
+ mount -t proc /proc proc/

In case it helps, here is the file list of the shared jail volume directory:

$ ls /srv/nfs/kubedata/jail/
assoc_mgr_state fed_mgr_state jwt_hs256.key node_state.old qos_usage trigger_state.old
assoc_mgr_state.old fed_mgr_state.old last_config_lite oci-layout qos_usage.old
assoc_usage heartbeat last_tres part_state repositories
assoc_usage.old index.json last_tres.old part_state.old resv_state
blobs job_state manifest.json priority_last_decay_ran resv_state.old
clustername job_state.old node_state priority_last_decay_ran.old trigger_state

I think the jail volume is mounted to /mnt/jail, and then /opt/bin/slurm/complement_jail.sh -j /mnt/jail -u /mnt/jail.upper is triggered by the container entrypoint script. The script changes its working directory to /mnt/jail and then tries to mount the virtual filesystems, but the mountpoints are obviously not present:
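To confirm this, here is a small diagnostic sketch I would run against the jail root (JAIL_DIR is a placeholder variable, not part of soperator; point it at the real jail path):

```shell
#!/bin/sh
# Sketch: report which of the mountpoint directories that
# complement_jail.sh bind-mounts onto are missing in the jail root.
# JAIL_DIR is an assumed variable; replace with your actual jail path.
JAIL_DIR="${JAIL_DIR:-/mnt/jail}"

missing=""
for d in proc sys dev run; do
  [ -d "$JAIL_DIR/$d" ] || missing="$missing $d"
done
echo "missing mountpoints:$missing"
```

If any of proc, sys, dev, or run are reported missing, the mount commands in the entrypoint will fail exactly as in the log above.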
mount -t proc /proc proc/
mount -t sysfs /sys sys/
mount --rbind /dev dev/
mount --rbind /run run/

How can this be worked around, or is something wrong with my setup?
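One workaround I am considering (a sketch only, and possibly papering over whatever left the jail incomplete) is to pre-create the mountpoint directories on the shared jail volume so the bind mounts have targets. JAIL_ROOT below defaults to a demo path; on the real setup it would be the shared jail directory (e.g. /srv/nfs/kubedata/jail):

```shell
#!/bin/sh
# Sketch of a possible workaround: pre-create the directories that
# the bind mounts in complement_jail.sh need as targets.
# JAIL_ROOT is an assumption; /tmp/jail-demo is only a demo default.
JAIL_ROOT="${JAIL_ROOT:-/tmp/jail-demo}"

mkdir -p "$JAIL_ROOT/proc" "$JAIL_ROOT/sys" "$JAIL_ROOT/dev" "$JAIL_ROOT/run"
ls "$JAIL_ROOT"
```

I am not sure whether this is safe, since I would expect the populate_jail step to have created these directories as part of a full rootfs.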