I am trying to create a Slurm cluster on Kubernetes (Azure Kubernetes Service), but the slurmd pod keeps crashing with the error "Couldn't find the specified plugin name for cgroup/v2 looking at all files". I have included the errors from the slurmctld and slurmd pods below.
I have tried to debug this quite a bit but with no luck. Any idea how to fix this on a Kubernetes cluster?
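For context, these are the cgroup-related settings that appear to be involved (a sketch based on the Slurm documentation, not my actual files; I am assuming the usual /etc/slurm location):

# /etc/slurm/cgroup.conf -- force a specific cgroup plugin instead of autodetect
CgroupPlugin=cgroup/v1

# /etc/slurm/slurm.conf -- or avoid the cgroup plugins entirely
ProctrackType=proctrack/linuxproc
TaskPlugin=task/none
JobAcctGatherType=jobacct_gather/linux

I am not sure whether either of these is appropriate on AKS nodes, which I believe run cgroup v2, so any guidance here would be appreciated.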
I can also see that slurmdbd is unable to connect to slurmctld, as shown below.
> k logs -f slurmdbd-6f59cc7887-4mwwq
slurmdbd: debug2: _slurm_connect: failed to connect to 10.244.3.117:6817: Connection refused
slurmdbd: debug2: Error connecting slurm stream socket at 10.244.3.117:6817: Connection refused
slurmdbd: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:10.244.3.117:6817: Connection refused
slurmdbd: error: slurmdb_send_accounting_update_persist: Unable to open connection to registered cluster linux.
slurmdbd: error: slurm_receive_msg: No response to persist_init
slurmdbd: error: update cluster: Connection refused to linux at 10.244.3.117(6817)
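Since 10.244.3.117 is a pod IP, I also want to rule out slurmdbd dialing a stale address after a slurmctld restart. I was going to compare it against the current pod IP and the service in front of it (the exact service name depends on the manifests, so this is just a sketch):

> k get pod slurmctld-0 -o wide
> k get svc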
> k logs -f slurmctld-0
slurmctld: debug: sched: Running job scheduler for full queue.
slurmctld: debug: create_mmap_buf: Failed to open file `/var/spool/slurmctld/job_state`, No such file or directory
slurmctld: error: Could not open job state file /var/spool/slurmctld/job_state: No such file or directory
slurmctld: error: NOTE: Trying backup state save file. Jobs may be lost!
slurmctld: debug: create_mmap_buf: Failed to open file `/var/spool/slurmctld/job_state.old`, No such file or directory
slurmctld: No job state file (/var/spool/slurmctld/job_state.old) found
slurmctld: debug2: accounting_storage/slurmdbd: _send_cluster_tres: Sending tres '1=40,2=10,3=0,4=10,5=40,6=0,7=0,8=0' for cluster
slurmctld: debug2: Testing job time limits and checkpoints
slurmctld: debug: slurm_recv_timeout at 0 of 4, recv zero bytes
slurmctld: error: slurm_receive_msg [10.224.0.5:7132]: Zero Bytes were transmitted or received
slurmctld: debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from UID=0
slurmctld: debug2: found existing node slurmd-1 for dynamic future node registration
slurmctld: debug2: dynamic future node slurmd-1/slurmd-1/slurmd-1 assigned to node slurmd-1
slurmctld: debug2: _slurm_rpc_node_registration complete for slurmd-1
slurmctld: debug: slurm_recv_timeout at 0 of 4, recv zero bytes
slurmctld: error: slurm_receive_msg [10.224.0.6:38712]: Zero Bytes were transmitted or received
slurmctld: debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from UID=0
slurmctld: debug2: found existing node slurmd-0 for dynamic future node registration
slurmctld: debug2: dynamic future node slurmd-0/slurmd-0/slurmd-0 assigned to node slurmd-0
slurmctld: debug2: _slurm_rpc_node_registration complete for slurmd-0
slurmctld: debug2: Testing job time limits and checkpoints
slurmctld: debug2: Performing purge of old job records
slurmctld: debug: sched: Running job scheduler for full queue.
slurmctld: debug2: Testing job time limits and checkpoints
slurmctld: debug2: Testing job time limits and checkpoints
slurmctld: debug2: Performing purge of old job records
slurmctld: debug: sched: Running job scheduler for full queue.
slurmctld: debug2: Testing job time limits and checkpoints
> k logs -f pod/slurmd-0
---> Set shell resource limits ...
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 3547560
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 131072
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) unlimited
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
---> Copying MUNGE key ...
---> Starting the MUNGE Authentication service (munged) ...
---> Waiting for slurmctld to become active before starting slurmd...
-- slurmctld is now active ...
---> Starting the Slurm Node Daemon (slurmd) ...
slurmd: CPUs=96 Boards=1 Sockets=2 Cores=48 Threads=1 Memory=886898 TmpDisk=0 Uptime=37960 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
slurmd: debug: Log file re-opened
slurmd: debug2: hwloc_topology_init
slurmd: debug2: hwloc_topology_load
slurmd: debug2: hwloc_topology_export_xml
slurmd: debug: CPUs:96 Boards:1 Sockets:2 CoresPerSocket:48 ThreadsPerCore:1
slurmd: error: Couldn't find the specified plugin name for cgroup/v2 looking at all files
slurmd: error: cannot find cgroup plugin for cgroup/v2
slurmd: error: cannot create cgroup context for cgroup/v2
slurmd: error: Unable to initialize cgroup plugin
slurmd: error: slurmd initialization failed
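To confirm whether the pod is really on cgroup v2 and whether the image even ships the cgroup/v2 plugin, I intend to run something like the following (the plugin directory is an assumption, it depends on how Slurm was built, and the pod has to stay up long enough to exec into):

> k exec -it pod/slurmd-0 -- cat /sys/fs/cgroup/cgroup.controllers
> k exec -it pod/slurmd-0 -- ls /usr/lib64/slurm/ | grep cgroup

On cgroup v2 the first command should list the available controllers; if the second shows cgroup_v1.so but no cgroup_v2.so, then I suppose the image was built without cgroup v2 support (as far as I know the cgroup/v2 plugin needs dbus development headers at build time).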