
Error: cannot find cgroup plugin for cgroup/v2, slurmd initialization failed #39

@yashhirulkar701

Description

I am trying to create a Slurm cluster on Kubernetes (Azure Kubernetes Service), but the slurmd pod keeps crashing with the error "Couldn't find the specified plugin name for cgroup/v2 looking at all files". I have included the errors from the slurmctld and slurmd pods below.

I have tried to debug this extensively, but with no luck. Any ideas on how to fix this on a Kubernetes cluster?
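
For context, whether the underlying AKS node exposes cgroup v1 or v2 can be checked from inside the slurmd pod (assuming the container stays up long enough to exec into; a node debug pod works too); the filesystem type mounted at /sys/fs/cgroup is the giveaway:

> kubectl exec pod/slurmd-0 -- stat -fc %T /sys/fs/cgroup

cgroup2fs here means the node runs the unified cgroup v2 hierarchy; tmpfs would indicate the legacy v1 layout.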

I can also see that slurmdbd is unable to connect to slurmctld, as shown below.

> k logs -f slurmdbd-6f59cc7887-4mwwq 

slurmdbd: debug2: _slurm_connect: failed to connect to 10.244.3.117:6817: Connection refused
slurmdbd: debug2: Error connecting slurm stream socket at 10.244.3.117:6817: Connection refused
slurmdbd: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:10.244.3.117:6817: Connection refused
slurmdbd: error: slurmdb_send_accounting_update_persist: Unable to open connection to registered cluster linux.
slurmdbd: error: slurm_receive_msg: No response to persist_init
slurmdbd: error: update cluster: Connection refused to linux at 10.244.3.117(6817)
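
These refusals may be secondary: slurmdbd is trying to reach slurmctld on 10.244.3.117:6817, and if the slurmctld pod was rescheduled that IP may be stale. A quick sanity check (a sketch; assumes nc is available in the slurmdbd image):

> kubectl get pod slurmctld-0 -o wide
> kubectl exec slurmdbd-6f59cc7887-4mwwq -- nc -zv 10.244.3.117 6817

The first command shows whether the slurmctld pod IP still matches the one in the log; the second tests plain TCP reachability of port 6817.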



> k logs -f slurmctld-0

slurmctld: debug:  sched: Running job scheduler for full queue.
slurmctld: debug:  create_mmap_buf: Failed to open file `/var/spool/slurmctld/job_state`, No such file or directory
slurmctld: error: Could not open job state file /var/spool/slurmctld/job_state: No such file or directory
slurmctld: error: NOTE: Trying backup state save file. Jobs may be lost!
slurmctld: debug:  create_mmap_buf: Failed to open file `/var/spool/slurmctld/job_state.old`, No such file or directory
slurmctld: No job state file (/var/spool/slurmctld/job_state.old) found
slurmctld: debug2: accounting_storage/slurmdbd: _send_cluster_tres: Sending tres '1=40,2=10,3=0,4=10,5=40,6=0,7=0,8=0' for cluster
slurmctld: debug2: Testing job time limits and checkpoints
slurmctld: debug:  slurm_recv_timeout at 0 of 4, recv zero bytes
slurmctld: error: slurm_receive_msg [10.224.0.5:7132]: Zero Bytes were transmitted or received
slurmctld: debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from UID=0
slurmctld: debug2: found existing node slurmd-1 for dynamic future node registration
slurmctld: debug2: dynamic future node slurmd-1/slurmd-1/slurmd-1 assigned to node slurmd-1
slurmctld: debug2: _slurm_rpc_node_registration complete for slurmd-1 
slurmctld: debug:  slurm_recv_timeout at 0 of 4, recv zero bytes
slurmctld: error: slurm_receive_msg [10.224.0.6:38712]: Zero Bytes were transmitted or received
slurmctld: debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from UID=0
slurmctld: debug2: found existing node slurmd-0 for dynamic future node registration
slurmctld: debug2: dynamic future node slurmd-0/slurmd-0/slurmd-0 assigned to node slurmd-0
slurmctld: debug2: _slurm_rpc_node_registration complete for slurmd-0 
slurmctld: debug2: Testing job time limits and checkpoints
slurmctld: debug2: Performing purge of old job records
slurmctld: debug:  sched: Running job scheduler for full queue.
slurmctld: debug2: Testing job time limits and checkpoints
slurmctld: debug2: Testing job time limits and checkpoints
slurmctld: debug2: Performing purge of old job records
slurmctld: debug:  sched: Running job scheduler for full queue.
slurmctld: debug2: Testing job time limits and checkpoints
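
The registration messages above suggest slurmctld itself is healthy and does see both dynamic nodes, at least briefly. Their state can presumably be confirmed with something like:

> kubectl exec slurmctld-0 -- sinfo -Nl

If slurmd-0 and slurmd-1 appear and then flip to down*, that would be consistent with the slurmd pods crash-looping on the cgroup error below.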


> k logs -f pod/slurmd-0           
---> Set shell resource limits ...
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 3547560
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 131072
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) unlimited
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
---> Copying MUNGE key ...
---> Starting the MUNGE Authentication service (munged) ...
---> Waiting for slurmctld to become active before starting slurmd...
-- slurmctld is now active ...
---> Starting the Slurm Node Daemon (slurmd) ...
slurmd: CPUs=96 Boards=1 Sockets=2 Cores=48 Threads=1 Memory=886898 TmpDisk=0 Uptime=37960 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
slurmd: debug:  Log file re-opened
slurmd: debug2: hwloc_topology_init
slurmd: debug2: hwloc_topology_load
slurmd: debug2: hwloc_topology_export_xml
slurmd: debug:  CPUs:96 Boards:1 Sockets:2 CoresPerSocket:48 ThreadsPerCore:1
slurmd: error: Couldn't find the specified plugin name for cgroup/v2 looking at all files
slurmd: error: cannot find cgroup plugin for cgroup/v2
slurmd: error: cannot create cgroup context for cgroup/v2
slurmd: error: Unable to initialize cgroup plugin
slurmd: error: slurmd initialization failed
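
This error means slurmd autodetected cgroup v2 on the node but could not find the matching plugin (cgroup_v2.so) in its plugin directory. As I understand it, that plugin is only built when Slurm's configure step finds the dbus-1 development headers, so an image built without them ships only cgroup_v1.so. A way to confirm (the plugin directory varies by build; /usr/lib64/slurm and /usr/lib/x86_64-linux-gnu/slurm-wlm are common locations):

> kubectl exec pod/slurmd-0 -- ls /usr/lib64/slurm | grep cgroup

If cgroup_v2.so is missing, the likely fixes are either rebuilding the image with the dbus development package installed so the cgroup/v2 plugin gets compiled, or, if the node still mounts a v1 hierarchy, pinning the plugin in cgroup.conf, sketched as:

# cgroup.conf (sketch; only works if cgroup_v1.so is present
# and the node actually exposes the v1 hierarchy)
CgroupPlugin=cgroup/v1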
