-
Notifications
You must be signed in to change notification settings - Fork 152
Description
All pods trying to write to any volume provisioned by nfs-ganesha hang, claims are mounted on md0 nodes (there's also very high load, ~300 on those nodes, but they're operational) - attempt to list files in the mount directory or perfoming any operation on the pod results in hang application (awaiting I/O) - I've checked and there is traffic visible on port 2049 , there are not much logs in the pod itself, not sure where else to check or how to debug further
- df -h hangs, checking the PVC for nfs-ganesha itself:
xxxx-md0-gn6sq-fvw98:~ # df -h
^C
xxxx-wdh-md0-gn6sq-fvw98:~ # df -h /dev/vdd
Filesystem Size Used Avail Use% Mounted on
/dev/vdd 17T 1.5T 15T 10% /var/lib/kubelet/pods/622ba3dd-e465-49d5-a80b-cd527871dd4b/volumes/kubernetes.io~csi/pvc-ef83ae77-a81a-4750-b502-068183f33509/mount
logs from ganesha.log:
26/05/2025 14:02:06 : epoch 6834745e : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[main] nfs_start :NFS STARTUP :EVENT :-------------------------------------------------
26/05/2025 14:02:06 : epoch 6834745e : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[main] nfs_start :NFS STARTUP :EVENT : NFS SERVER INITIALIZED
26/05/2025 14:02:06 : epoch 6834745e : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[main] nfs_start :NFS STARTUP :EVENT :-------------------------------------------------
26/05/2025 14:02:16 : epoch 6834745e : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
26/05/2025 14:02:26 : epoch 6834745e : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
26/05/2025 14:02:36 : epoch 6834745e : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
26/05/2025 14:02:46 : epoch 6834745e : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
26/05/2025 14:02:56 : epoch 6834745e : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
26/05/2025 14:03:06 : epoch 6834745e : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
26/05/2025 14:03:16 : epoch 6834745e : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
26/05/2025 14:03:26 : epoch 6834745e : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
26/05/2025 14:03:36 : epoch 6834745e : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
26/05/2025 14:03:36 : epoch 6834745e : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[reaper] nfs_lift_grace_locked :STATE :EVENT :NFS Server Now NOT IN GRACE
02/06/2025 11:05:19 : epoch 683d856f : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[main] main :MAIN :EVENT :nfs-ganesha Starting: Ganesha Version 4.0.8
02/06/2025 11:05:19 : epoch 683d856f : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[main] nfs_set_param_from_conf :NFS STARTUP :EVENT :Configuration file successfully parsed
02/06/2025 11:05:19 : epoch 683d856f : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[main] init_server_pkgs :NFS STARTUP :EVENT :Initializing ID Mapper.
02/06/2025 11:05:19 : epoch 683d856f : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[main] init_server_pkgs :NFS STARTUP :EVENT :ID Mapper successfully initialized.
02/06/2025 11:05:19 : epoch 683d856f : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[main] nfs_start_grace :STATE :EVENT :NFS Server Now IN GRACE, duration 90
02/06/2025 11:05:19 : epoch 683d856f : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[main] nfs_start_grace :STATE :EVENT :grace reload client info completed from backend
02/06/2025 11:05:19 : epoch 683d856f : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[main] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
02/06/2025 11:05:19 : epoch 683d856f : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[main] nfs_Init_svc :DISP :CRIT :Cannot acquire credentials for principal nfs
02/06/2025 11:05:19 : epoch 683d856f : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[main] nfs_Init_admin_thread :NFS CB :EVENT :Admin thread initialized
02/06/2025 11:05:19 : epoch 683d856f : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[main] nfs_rpc_cb_init_ccache :NFS STARTUP :WARN :Could not create credential cache directory: /usr/local/var/run/ganesha (No such file or directory)
02/06/2025 11:05:19 : epoch 683d856f : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[main] find_keytab_entry :NFS CB :WARN :Configuration file does not specify default realm while getting default realm name
02/06/2025 11:05:19 : epoch 683d856f : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[main] gssd_refresh_krb5_machine_credential :NFS CB :CRIT :ERROR: gssd_refresh_krb5_machine_credential: no usable keytab entry found in keytab /etc/krb5.keytab for connection with host localhost
02/06/2025 11:05:19 : epoch 683d856f : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[main] nfs_rpc_cb_init_ccache :NFS STARTUP :WARN :gssd_refresh_krb5_machine_credential failed (-1765328160:2)
02/06/2025 11:05:19 : epoch 683d856f : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[main] nfs_Start_threads :THREAD :EVENT :Starting delayed executor.
02/06/2025 11:05:19 : epoch 683d856f : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[main] nfs_Start_threads :THREAD :EVENT :gsh_dbusthread was started successfully
02/06/2025 11:05:19 : epoch 683d856f : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[main] nfs_Start_threads :THREAD :EVENT :admin thread was started successfully
02/06/2025 11:05:19 : epoch 683d856f : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[main] nfs_Start_threads :THREAD :EVENT :reaper thread was started successfully
02/06/2025 11:05:19 : epoch 683d856f : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[main] nfs_Start_threads :THREAD :EVENT :General fridge was started successfully
02/06/2025 11:05:19 : epoch 683d856f : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
02/06/2025 11:05:19 : epoch 683d856f : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[main] nfs_start :NFS STARTUP :EVENT :-------------------------------------------------
02/06/2025 11:05:19 : epoch 683d856f : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[main] nfs_start :NFS STARTUP :EVENT : NFS SERVER INITIALIZED
02/06/2025 11:05:19 : epoch 683d856f : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[main] nfs_start :NFS STARTUP :EVENT :-------------------------------------------------
02/06/2025 11:05:29 : epoch 683d856f : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
02/06/2025 11:05:39 : epoch 683d856f : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
02/06/2025 11:05:49 : epoch 683d856f : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
02/06/2025 11:05:59 : epoch 683d856f : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
02/06/2025 11:06:09 : epoch 683d856f : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
02/06/2025 11:06:19 : epoch 683d856f : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
02/06/2025 11:06:29 : epoch 683d856f : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
02/06/2025 11:06:39 : epoch 683d856f : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
02/06/2025 11:06:49 : epoch 683d856f : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
02/06/2025 11:06:49 : epoch 683d856f : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[reaper] nfs_lift_grace_locked :STATE :EVENT :NFS Server Now NOT IN GRACE
02/06/2025 13:42:24 : epoch 683d856f : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[svc_1] rpc :TIRPC :EVENT :svc_dg_rendezvous: Bad message rlen: 1
current mounts on node are like this:
xx.xx.xx.xx:/export/pvc-xxxx on /var/lib/kubelet/pods/7xxxx-9157-8f0932d30823/volumes/kubernetes.io~nfs/pvc-xxxx type nfs (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=30,retrans=2,sec=sys,mountaddr=xx.xx.xx.xx,mountvers=3,mountport=20048,mountproto=udp,local_lock=none,addr=xx.xx.159.36)
NFS-provisioner pods is healthy and running:
/opt/caas/my-deployment/sylva-core$ kubectl get pod -n nfs-ganesha
NAME READY STATUS RESTARTS AGE
nfs-ganesha-nfs-server-provisioner-0 1/1 Running 0 24h
journalctl on the node shows many message as below:
Jun 01 08:02:00 xxxx-md0-gn6sq-xxx8 kernel: nfs: server xx.xx.xxx.xxx <cluster-IP service> not responding, still trying
rpcinfo on the node:
xxxx-dj2g6:~ # rpcinfo -s <cluster-ip>
program version(s) netid(s) service owner
100000 2,3,4 local,udp,tcp,udp6,tcp6 portmapper superuser
100024 1 tcp6,udp6,tcp,udp status 29
100003 4,3 tcp6,tcp,udp6,udp nfs superuser
100005 3,1 tcp6,tcp,udp6,udp mountd superuser
100021 4 tcp6,tcp,udp6,udp nlockmgr superuser
100011 2,1 tcp6,tcp,udp6,udp rquotad superuser
but when trying to check directly if the nfs v3 is working - it doesnt:
occf-wdh-cp-68966cae37-dj2g6:~ # rpcinfo -T udp cluster-ip 100003 3
rpcinfo: RPC: Timed out
program 100003 version 3 is not available