I am trying to create a Slurm cluster on Kubernetes (Azure Kubernetes Service), but the slurmd pod keeps crashing with the error "Couldn't find the specified plugin name for cgroup/v2 looking at all files". I have included the errors from the slurmctld and slurmd pods below.
I have tried to debug this quite a bit but with no luck. Any idea how to fix this on a Kubernetes cluster?
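For context, these are the cgroup-related settings that appear to be involved (a sketch based on the Slurm documentation, not my actual files; I am assuming the usual /etc/slurm location):

# /etc/slurm/cgroup.conf -- force a specific cgroup plugin instead of autodetect
CgroupPlugin=cgroup/v1

# /etc/slurm/slurm.conf -- or avoid the cgroup plugins entirely
ProctrackType=proctrack/linuxproc
TaskPlugin=task/none
JobAcctGatherType=jobacct_gather/linux

I am not sure whether either of these is appropriate on AKS nodes, which I believe run cgroup v2, so any guidance here would be appreciated.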
I can also see that slurmdbd is unable to connect to slurmctld, as shown below.
> k logs -f slurmdbd-6f59cc7887-4mwwq
slurmdbd: debug2: _slurm_connect: failed to connect to 10.244.3.117:6817: Connection refused
slurmdbd: debug2: Error connecting slurm stream socket at 10.244.3.117:6817: Connection refused
slurmdbd: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:10.244.3.117:6817: Connection refused
slurmdbd: error: slurmdb_send_accounting_update_persist: Unable to open connection to registered cluster linux.
slurmdbd: error: slurm_receive_msg: No response to persist_init
slurmdbd: error: update cluster: Connection refused to linux at 10.244.3.117(6817)
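Since 10.244.3.117 is a pod IP, I also want to rule out slurmdbd dialing a stale address after a slurmctld restart. I was going to compare it against the current pod IP and the service in front of it (the exact service name depends on the manifests, so this is just a sketch):

> k get pod slurmctld-0 -o wide
> k get svc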
> k logs -f slurmctld-0
slurmctld: debug: sched: Running job scheduler for full queue.
slurmctld: debug: create_mmap_buf: Failed to open file `/var/spool/slurmctld/job_state`, No such file or directory
slurmctld: error: Could not open job state file /var/spool/slurmctld/job_state: No such file or directory
slurmctld: error: NOTE: Trying backup state save file. Jobs may be lost!
slurmctld: debug: create_mmap_buf: Failed to open file `/var/spool/slurmctld/job_state.old`, No such file or directory
slurmctld: No job state file (/var/spool/slurmctld/job_state.old) found
slurmctld: debug2: accounting_storage/slurmdbd: _send_cluster_tres: Sending tres '1=40,2=10,3=0,4=10,5=40,6=0,7=0,8=0' for cluster
slurmctld: debug2: Testing job time limits and checkpoints
slurmctld: debug: slurm_recv_timeout at 0 of 4, recv zero bytes
slurmctld: error: slurm_receive_msg [10.224.0.5:7132]: Zero Bytes were transmitted or received
slurmctld: debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from UID=0
slurmctld: debug2: found existing node slurmd-1 for dynamic future node registration
slurmctld: debug2: dynamic future node slurmd-1/slurmd-1/slurmd-1 assigned to node slurmd-1
slurmctld: debug2: _slurm_rpc_node_registration complete for slurmd-1
slurmctld: debug: slurm_recv_timeout at 0 of 4, recv zero bytes
slurmctld: error: slurm_receive_msg [10.224.0.6:38712]: Zero Bytes were transmitted or received
slurmctld: debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from UID=0
slurmctld: debug2: found existing node slurmd-0 for dynamic future node registration
slurmctld: debug2: dynamic future node slurmd-0/slurmd-0/slurmd-0 assigned to node slurmd-0
slurmctld: debug2: _slurm_rpc_node_registration complete for slurmd-0
slurmctld: debug2: Testing job time limits and checkpoints
slurmctld: debug2: Performing purge of old job records
slurmctld: debug: sched: Running job scheduler for full queue.
slurmctld: debug2: Testing job time limits and checkpoints
slurmctld: debug2: Testing job time limits and checkpoints
slurmctld: debug2: Performing purge of old job records
slurmctld: debug: sched: Running job scheduler for full queue.
slurmctld: debug2: Testing job time limits and checkpoints
> k logs -f pod/slurmd-0
---> Set shell resource limits ...
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 3547560
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 131072
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) unlimited
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
---> Copying MUNGE key ...
---> Starting the MUNGE Authentication service (munged) ...
---> Waiting for slurmctld to become active before starting slurmd...
-- slurmctld is now active ...
---> Starting the Slurm Node Daemon (slurmd) ...
slurmd: CPUs=96 Boards=1 Sockets=2 Cores=48 Threads=1 Memory=886898 TmpDisk=0 Uptime=37960 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
slurmd: debug: Log file re-opened
slurmd: debug2: hwloc_topology_init
slurmd: debug2: hwloc_topology_load
slurmd: debug2: hwloc_topology_export_xml
slurmd: debug: CPUs:96 Boards:1 Sockets:2 CoresPerSocket:48 ThreadsPerCore:1
slurmd: error: Couldn't find the specified plugin name for cgroup/v2 looking at all files
slurmd: error: cannot find cgroup plugin for cgroup/v2
slurmd: error: cannot create cgroup context for cgroup/v2
slurmd: error: Unable to initialize cgroup plugin
slurmd: error: slurmd initialization failed
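To confirm whether the pod is really on cgroup v2 and whether the image even ships the cgroup/v2 plugin, I intend to run something like the following (the plugin directory is an assumption, it depends on how Slurm was built, and the pod has to stay up long enough to exec into):

> k exec -it pod/slurmd-0 -- cat /sys/fs/cgroup/cgroup.controllers
> k exec -it pod/slurmd-0 -- ls /usr/lib64/slurm/ | grep cgroup

On cgroup v2 the first command should list the available controllers; if the second shows cgroup_v1.so but no cgroup_v2.so, then I suppose the image was built without cgroup v2 support (as far as I know the cgroup/v2 plugin needs dbus development headers at build time).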