Skip to content

NFS-Ganesha services are not responding #155

@mohan7427

Description

@mohan7427

All pods trying to write to any volume provisioned by nfs-ganesha hang, claims are mounted on md0 nodes (there's also very high load, ~300 on those nodes, but they're operational) - attempt to list files in the mount directory or perfoming any operation on the pod results in hang application (awaiting I/O) - I've checked and there is traffic visible on port 2049 , there are not much logs in the pod itself, not sure where else to check or how to debug further

  • df -h hangs, checking the PVC for nfs-ganesha itself:
xxxx-md0-gn6sq-fvw98:~ # df -h
^C
xxxx-wdh-md0-gn6sq-fvw98:~ # df -h /dev/vdd

Filesystem      Size  Used Avail Use% Mounted on
/dev/vdd         17T  1.5T   15T  10% /var/lib/kubelet/pods/622ba3dd-e465-49d5-a80b-cd527871dd4b/volumes/kubernetes.io~csi/pvc-ef83ae77-a81a-4750-b502-068183f33509/mount

logs from ganesha.log:

26/05/2025 14:02:06 : epoch 6834745e : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[main] nfs_start :NFS STARTUP :EVENT :-------------------------------------------------
26/05/2025 14:02:06 : epoch 6834745e : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[main] nfs_start :NFS STARTUP :EVENT :             NFS SERVER INITIALIZED
26/05/2025 14:02:06 : epoch 6834745e : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[main] nfs_start :NFS STARTUP :EVENT :-------------------------------------------------
26/05/2025 14:02:16 : epoch 6834745e : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
26/05/2025 14:02:26 : epoch 6834745e : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
26/05/2025 14:02:36 : epoch 6834745e : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
26/05/2025 14:02:46 : epoch 6834745e : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
26/05/2025 14:02:56 : epoch 6834745e : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
26/05/2025 14:03:06 : epoch 6834745e : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
26/05/2025 14:03:16 : epoch 6834745e : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
26/05/2025 14:03:26 : epoch 6834745e : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
26/05/2025 14:03:36 : epoch 6834745e : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
26/05/2025 14:03:36 : epoch 6834745e : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[reaper] nfs_lift_grace_locked :STATE :EVENT :NFS Server Now NOT IN GRACE
02/06/2025 11:05:19 : epoch 683d856f : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[main] main :MAIN :EVENT :nfs-ganesha Starting: Ganesha Version 4.0.8
02/06/2025 11:05:19 : epoch 683d856f : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[main] nfs_set_param_from_conf :NFS STARTUP :EVENT :Configuration file successfully parsed
02/06/2025 11:05:19 : epoch 683d856f : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[main] init_server_pkgs :NFS STARTUP :EVENT :Initializing ID Mapper.
02/06/2025 11:05:19 : epoch 683d856f : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[main] init_server_pkgs :NFS STARTUP :EVENT :ID Mapper successfully initialized.
02/06/2025 11:05:19 : epoch 683d856f : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[main] nfs_start_grace :STATE :EVENT :NFS Server Now IN GRACE, duration 90
02/06/2025 11:05:19 : epoch 683d856f : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[main] nfs_start_grace :STATE :EVENT :grace reload client info completed from backend
02/06/2025 11:05:19 : epoch 683d856f : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[main] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
02/06/2025 11:05:19 : epoch 683d856f : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[main] nfs_Init_svc :DISP :CRIT :Cannot acquire credentials for principal nfs
02/06/2025 11:05:19 : epoch 683d856f : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[main] nfs_Init_admin_thread :NFS CB :EVENT :Admin thread initialized
02/06/2025 11:05:19 : epoch 683d856f : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[main] nfs_rpc_cb_init_ccache :NFS STARTUP :WARN :Could not create credential cache directory: /usr/local/var/run/ganesha (No such file or directory)
02/06/2025 11:05:19 : epoch 683d856f : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[main] find_keytab_entry :NFS CB :WARN :Configuration file does not specify default realm while getting default realm name
02/06/2025 11:05:19 : epoch 683d856f : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[main] gssd_refresh_krb5_machine_credential :NFS CB :CRIT :ERROR: gssd_refresh_krb5_machine_credential: no usable keytab entry found in keytab /etc/krb5.keytab for connection with host localhost
02/06/2025 11:05:19 : epoch 683d856f : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[main] nfs_rpc_cb_init_ccache :NFS STARTUP :WARN :gssd_refresh_krb5_machine_credential failed (-1765328160:2)
02/06/2025 11:05:19 : epoch 683d856f : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[main] nfs_Start_threads :THREAD :EVENT :Starting delayed executor.
02/06/2025 11:05:19 : epoch 683d856f : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[main] nfs_Start_threads :THREAD :EVENT :gsh_dbusthread was started successfully
02/06/2025 11:05:19 : epoch 683d856f : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[main] nfs_Start_threads :THREAD :EVENT :admin thread was started successfully
02/06/2025 11:05:19 : epoch 683d856f : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[main] nfs_Start_threads :THREAD :EVENT :reaper thread was started successfully
02/06/2025 11:05:19 : epoch 683d856f : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[main] nfs_Start_threads :THREAD :EVENT :General fridge was started successfully
02/06/2025 11:05:19 : epoch 683d856f : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
02/06/2025 11:05:19 : epoch 683d856f : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[main] nfs_start :NFS STARTUP :EVENT :-------------------------------------------------
02/06/2025 11:05:19 : epoch 683d856f : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[main] nfs_start :NFS STARTUP :EVENT :             NFS SERVER INITIALIZED
02/06/2025 11:05:19 : epoch 683d856f : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[main] nfs_start :NFS STARTUP :EVENT :-------------------------------------------------
02/06/2025 11:05:29 : epoch 683d856f : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
02/06/2025 11:05:39 : epoch 683d856f : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
02/06/2025 11:05:49 : epoch 683d856f : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
02/06/2025 11:05:59 : epoch 683d856f : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
02/06/2025 11:06:09 : epoch 683d856f : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
02/06/2025 11:06:19 : epoch 683d856f : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
02/06/2025 11:06:29 : epoch 683d856f : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
02/06/2025 11:06:39 : epoch 683d856f : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
02/06/2025 11:06:49 : epoch 683d856f : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
02/06/2025 11:06:49 : epoch 683d856f : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[reaper] nfs_lift_grace_locked :STATE :EVENT :NFS Server Now NOT IN GRACE
02/06/2025 13:42:24 : epoch 683d856f : nfs-ganesha-nfs-server-provisioner-0 : nfs-ganesha-20[svc_1] rpc :TIRPC :EVENT :svc_dg_rendezvous: Bad message rlen: 1

current mounts on node are like this:

xx.xx.xx.xx:/export/pvc-xxxx on /var/lib/kubelet/pods/7xxxx-9157-8f0932d30823/volumes/kubernetes.io~nfs/pvc-xxxx type nfs (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=30,retrans=2,sec=sys,mountaddr=xx.xx.xx.xx,mountvers=3,mountport=20048,mountproto=udp,local_lock=none,addr=xx.xx.159.36)

NFS-provisioner pods is healthy and running:

/opt/caas/my-deployment/sylva-core$ kubectl get pod -n nfs-ganesha
NAME                                   READY   STATUS    RESTARTS   AGE
nfs-ganesha-nfs-server-provisioner-0   1/1     Running   0          24h

journalctl on the node shows many message as below:

Jun 01 08:02:00 xxxx-md0-gn6sq-xxx8 kernel: nfs: server xx.xx.xxx.xxx <cluster-IP service> not responding, still trying

rpcinfo on the node:

xxxx-dj2g6:~ # rpcinfo -s <cluster-ip>
   program version(s) netid(s)                         service     owner
    100000  2,3,4     local,udp,tcp,udp6,tcp6          portmapper  superuser
    100024  1         tcp6,udp6,tcp,udp                status      29
    100003  4,3       tcp6,tcp,udp6,udp                nfs         superuser
    100005  3,1       tcp6,tcp,udp6,udp                mountd      superuser
    100021  4         tcp6,tcp,udp6,udp                nlockmgr    superuser
    100011  2,1       tcp6,tcp,udp6,udp                rquotad     superuser

but when trying to check directly if the nfs v3 is working - it doesnt:

occf-wdh-cp-68966cae37-dj2g6:~ # rpcinfo -T udp cluster-ip 100003 3
rpcinfo: RPC: Timed out
program 100003 version 3 is not available

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions