Description
I restarted the NFS server pod while it was actively used by a few users. Afterwards, my user server pod could not refresh the file browser (code-server) for quite some time, which confused me. I think this is caused by the 90-second GRACE period, which claude.ai describes as a period that blocks new operations while letting pre-restart clients re-establish the state they held.
We currently use the 90-second default; perhaps we should lower it and document the choice?
Following the discussion with claude.ai below, I conclude that "persistent state storage" can be configured, but that it does not avoid the grace period: with NFSv4, clients are responsible for state as well and may need to perform reclaim-specific operations, so the server kept providing the 90-second grace period.
The nfs-ganesha default is chosen without knowledge of the context the NFS server operates in, but we can assume a k8s context here, where the pod-to-pod network is more reliable than the public internet. With that in mind, I think we should go for the following change:
NFSv4 {
# During the startup grace period, the NFS server only accepts recovery
# related operations, and will block normal operations. It is a
# mechanism to avoid corruption issues by allowing clients to finalize
# things they had a lease for before they lost connectivity, allowing
# other clients to start doing new things.
#
# Lease_Lifetime defaults to 60, and Grace_Period to 90. We retain the
# 1.5 ratio, as their relative proportion appears to be of importance.
#
Lease_Lifetime = 20;
Grace_Period = 30;
}
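With this change in place, the startup log should report the shorter duration, e.g. (a hedged expectation, based on the log format captured below):
nfs_start_grace :STATE :EVENT :NFS Server Now IN GRACE, duration 30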
nfs-server logs after restart with low-activity usage
Starting rpcbind
Starting dbus
Starting Ganesha NFS
17/12/2025 08:36:07 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[main] init_logging :LOG :NULL :LOG: Setting log level for all components to NIV_EVENT
17/12/2025 08:36:07 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[main] main :MAIN :EVENT :ganesha.nfsd Starting: Ganesha Version 5.9
17/12/2025 08:36:07 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[main] load_rados_config :CONFIG :WARN :Missing RADOS URLs backend library
17/12/2025 08:36:07 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[main] nfs_set_param_from_conf :NFS STARTUP :EVENT :Configuration file successfully parsed
17/12/2025 08:36:07 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[main] fsal_init_fds_limit :MDCACHE LRU :EVENT :Setting the system-imposed limit on FDs to 1048576.
17/12/2025 08:36:07 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[main] init_server_pkgs :NFS STARTUP :EVENT :Initializing ID Mapper.
17/12/2025 08:36:07 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[main] init_server_pkgs :NFS STARTUP :EVENT :ID Mapper successfully initialized.
17/12/2025 08:36:07 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[main] nfs_start_grace :STATE :EVENT :NFS Server Now IN GRACE, duration 90
17/12/2025 08:36:07 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[main] nfs_start_grace :STATE :EVENT :grace reload client info completed from backend
17/12/2025 08:36:07 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[main] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
17/12/2025 08:36:07 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[main] config_errs_to_log :CONFIG :WARN :Config File (/etc/ganesha/ganesha.conf:34): Unknown parameter (NFS_CORE_PARAM)
17/12/2025 08:36:07 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[main] lower_my_caps :NFS STARTUP :EVENT :CAP_SYS_RESOURCE was successfully removed for proper quota management in FSAL
17/12/2025 08:36:07 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[main] lower_my_caps :NFS STARTUP :EVENT :currently set capabilities are: =ep cap_sys_resource-ep
17/12/2025 08:36:07 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[main] nfs_Init_svc :DISP :CRIT :Cannot acquire credentials for principal nfs
17/12/2025 08:36:07 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[main] nfs_Init_admin_thread :NFS CB :EVENT :Admin thread initialized
17/12/2025 08:36:07 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[main] nfs_rpc_cb_init_ccache :NFS STARTUP :EVENT :Callback creds directory (/var/run/ganesha) already exists
17/12/2025 08:36:07 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[main] find_keytab_entry :NFS CB :WARN :Configuration file does not specify default realm while getting default realm name
17/12/2025 08:36:07 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[main] gssd_refresh_krb5_machine_credential :NFS CB :CRIT :ERROR: gssd_refresh_krb5_machine_credential: no usable keytab entry found in keytab /etc/krb5.keytab for connection with host localhost
17/12/2025 08:36:07 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[main] nfs_rpc_cb_init_ccache :NFS STARTUP :WARN :gssd_refresh_krb5_machine_credential failed (-1765328160:99)
17/12/2025 08:36:07 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[main] nfs_Start_threads :THREAD :EVENT :Starting delayed executor.
17/12/2025 08:36:07 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[main] nfs_Start_threads :THREAD :EVENT :gsh_dbusthread was started successfully
17/12/2025 08:36:07 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[main] nfs_Start_threads :THREAD :EVENT :admin thread was started successfully
17/12/2025 08:36:07 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[main] nfs_Start_threads :THREAD :EVENT :reaper thread was started successfully
17/12/2025 08:36:07 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[main] nfs_Start_threads :THREAD :EVENT :General fridge was started successfully
17/12/2025 08:36:07 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
17/12/2025 08:36:07 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[main] nfs_start :NFS STARTUP :EVENT :-------------------------------------------------
17/12/2025 08:36:07 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[main] nfs_start :NFS STARTUP :EVENT : NFS SERVER INITIALIZED
17/12/2025 08:36:07 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[main] nfs_start :NFS STARTUP :EVENT :-------------------------------------------------
17/12/2025 08:36:17 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
17/12/2025 08:36:27 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
17/12/2025 08:36:37 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
17/12/2025 08:36:47 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
17/12/2025 08:36:57 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
17/12/2025 08:37:07 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
17/12/2025 08:37:17 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
17/12/2025 08:37:27 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
17/12/2025 08:37:37 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
17/12/2025 08:37:37 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[reaper] nfs_lift_grace_locked :STATE :EVENT :NFS Server Now NOT IN GRACE
claude.ai discussion
Help me understand what GRACE is about in the context of NFS ganesha.
It seems to be something that happens during startup. Can clients work with the NFS server while it reports that it is "in grace"? I provide my nfs-ganesha startup logs.
claude.ai response
Based on your logs, I can explain what the GRACE period is in NFS-Ganesha:
What is the Grace Period?
The grace period is a special state that an NFSv4 server enters during startup to allow clients to reclaim their previously held state (like file locks, opens, and delegations) after a server restart or failover.
Purpose
When an NFS server crashes or restarts, clients may have had:
- Open file handles
- File locks
- Delegations (cached permissions to read/write)
The grace period gives these clients time to reconnect and reclaim their state before new operations are allowed, preventing conflicts.
Can Clients Work During Grace?
Partially - with restrictions:
- Existing clients that had state before the restart can reclaim their previous locks and opens
- New operations and new clients are typically blocked or delayed during the grace period
- Read-only operations may be allowed depending on configuration
Your Logs Analysis
In your case:
- Grace starts at 08:36:07 with duration 90 seconds
- The reaper thread checks every ~10 seconds: check grace:reclaim complete(0) clid count(0)
- clid count(0) means no clients had state to reclaim
- Grace ends at 08:37:37 (exactly 90 seconds later): NFS Server Now NOT IN GRACE
Since you had no clients with prior state (clid count(0)), the server simply waited the full 90-second duration before exiting grace and allowing normal operations. This is expected behavior on a fresh start with no previous client connections.
When will clients have state to reclaim? For example, if a read and/or write operation is aborted?
claude.ai response
Great question! Clients have state to reclaim when the NFS server restarts while clients still have active sessions with open files, locks, or delegations. It's not about individual operations being aborted, but about the server going down while clients are in the middle of longer-term operations.
When Clients Have State to Reclaim
You would see clid count > 0 if:
- Client has files open → Server restarts → Client reconnects during grace period to reclaim its open files
- Client holds file locks → Server crashes → Client reconnects to reclaim its locks
- Client has delegations (cached write permissions) → Server restarts → Client reclaims delegation
Example Scenario
Time 10:00 - Client opens file.txt for writing (state created)
Time 10:05 - NFS server pod crashes/restarts
Time 10:06 - Server enters 90-second grace period
Time 10:06 - Client detects server is back and sends RECLAIM request
→ Now you'd see: clid count(1)
Time 10:07 - Client successfully reclaims its open file handle
Time 10:08 - Grace period ends, normal operations resume
Your Current Situation
In your logs, clid count(0) means:
- Either no clients were connected when the server started
- Or all previous client sessions had cleanly closed before restart
- This is typical for a fresh deployment or planned restart
Single Operations (Read/Write)
Individual read/write operations are typically stateless in NFSv3 or quick enough in NFSv4 that they don't create long-term state needing reclaim. The grace period is specifically for persistent state like file handles that remain open between operations.
If the nfs-server terminated itself via SIGTERM, is a grace period needed?
claude.ai response
Excellent question! The answer is yes, a grace period is still needed, even after a clean SIGTERM shutdown. Here's why:
Why Grace Period is Still Required
The restarted server doesn't know how it was shut down:
- When the server starts up, it has no memory of whether the previous shutdown was clean (SIGTERM) or a crash
- It must assume the worst case (crash scenario) to maintain protocol correctness
- The NFSv4 specification requires a grace period after any restart
What Happens with SIGTERM
During clean shutdown (SIGTERM):
- Server receives signal and begins graceful shutdown
- Server can notify clients it's going down
- Clients have opportunity to cleanly close files and release locks
- State is properly cleaned up
But on restart:
- The new server process has no idea this happened
- It must offer a grace period "just in case" some client still thinks it has state
The Practical Result
This is exactly what you're seeing in your logs:
clid count(0) ← No clients need to reclaim
After SIGTERM, the grace period becomes essentially a safety waiting period with no actual reclaims, but it's still required by the protocol.
Could It Be Avoided?
Theoretically, yes, if:
- Server used persistent state storage across restarts
- It could definitively prove no clients had active state
- Implementation supported skipping grace
But standard NFS-Ganesha always does the 90-second grace period for safety and protocol compliance, regardless of how the previous instance terminated.
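As an aside, ganesha's NFSV4 block appears to also expose a Graceless toggle that skips the grace period entirely. Treat this as a hedged sketch (the parameter is taken from ganesha's core config documentation; verify it against the deployed version), and note that it trades protocol safety for availability, which is why lowering Grace_Period is the more conservative change:
NFSv4 {
# Unsafe shortcut: start without a grace period at all. Clients lose
# the chance to reclaim their locks/opens after a restart.
Graceless = true;
}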
Can nfs-ganesha be configured to use persistent state storage?
claude.ai response
Yes, NFS-Ganesha can be configured to use persistent state storage! In fact, your logs hint at this capability:
load_rados_config :CONFIG :WARN :Missing RADOS URLs backend library
State Storage Backends
NFS-Ganesha supports several backends for persistent client state:
1. RADOS (Ceph)
- Stores state in Ceph object storage
- Most common for clustered/HA deployments
- Requires the librados library
- Configuration block: RADOS_KV or RADOS_URLS
2. Database Backends
- Can use various databases for state persistence
- Allows state to survive single-server restarts
3. Recovery Backend (Filesystem)
- Stores recovery information on shared filesystem
- Simpler than RADOS but less robust
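A minimal sketch of how a backend is selected (the RecoveryBackend parameter lives in ganesha's NFSV4 block; the value names are from ganesha's documentation and should be verified against the deployed version):
NFSv4 {
# Where client recovery records are kept across restarts. "fs" is the
# default (local filesystem); the rados_* backends store records in
# Ceph so recovery state can survive restarts and failover.
RecoveryBackend = rados_kv;
}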
Benefits of Persistent State
With persistent state storage:
- Faster grace periods: Server can verify which clients actually need to reclaim
- Clustered NFS: Multiple Ganesha servers can share state for high availability
- Grace period shortening: If no clients have state, grace can end early
- Smoother failover: Clients seamlessly reconnect to another server
Your Current Setup
Your warning indicates:
- Ganesha tried to load RADOS config but the library isn't available
- You're running with no persistent state backend
- Each restart treats state as lost → full 90-second grace period
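For a single-pod k8s setup without Ceph, a simpler variant along the same lines could be the default fs backend with its recovery directory on a PersistentVolume, so recovery records survive pod restarts (a hedged sketch; RecoveryRoot and its default path are taken from ganesha's docs, and backing it with a PV is an assumption about the deployment):
NFSv4 {
# Default backend: recovery records as files on the local filesystem.
RecoveryBackend = fs;
# Place this path on a PersistentVolume so records survive restarts;
# /var/lib/nfs/ganesha is the documented default.
RecoveryRoot = "/var/lib/nfs/ganesha";
}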
Example Configuration
RADOS_KV {
pool = "nfs-ganesha";
namespace = "ganesha";
UserId = "admin";
watch_url = "rados://pool/namespace/obj";
}
For a single-server setup with planned restarts (SIGTERM), the overhead of persistent state may not be worth it. But for HA/clustered deployments, it's essential.
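Note that a RADOS_KV block like the one above only configures the backend; pairing it with RecoveryBackend = rados_kv in the NFSV4 block (as sketched earlier) is what actually selects it.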