Description
I restarted the NFS server pod while it was actively used by a few users. Afterwards, my user server pod could not refresh the file browser (code-server) for quite some time, which confused me. I think this is caused by the 90-second GRACE period, which claude.ai describes as a period that blocks new operations while letting pre-restart clients re-establish the state they held.
We currently use the 90-second default; perhaps we should lower it and document the choice?
Following the discussion with claude.ai below, I conclude that "persistent state storage" can be configured, but that it does not avoid the grace period: with NFSv4, clients are responsible for state as well and may need to perform reclaim-specific operations, so the server kept providing the 90-second grace period.
The nfs-ganesha default is chosen without knowledge of the context the NFS server operates in, but we can assume a k8s context here, where the pod-to-pod network is more reliable than the public internet. With that in mind, I think we should go for the following change:
NFSv4 {
# During the startup grace period, the NFS server only accepts recovery
# related operations, and will block normal operations. It is a
# mechanism to avoid corruption issues by allowing clients to finalize
# things they had a lease for before they lost connectivity, allowing
# other clients to start doing new things.
#
# Lease_Lifetime defaults to 60, and Grace_Period to 90. We retain the
# 1.5 ratio, as their relative proportion appears to be of importance.
#
Lease_Lifetime = 20;
Grace_Period = 30;
}
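With this change in place, the startup log should report the shorter duration, e.g. (a hedged expectation, based on the log format captured below):
nfs_start_grace :STATE :EVENT :NFS Server Now IN GRACE, duration 30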
nfs-server logs after restart with low-activity usage
Starting rpcbind
Starting dbus
Starting Ganesha NFS
17/12/2025 08:36:07 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[main] init_logging :LOG :NULL :LOG: Setting log level for all components to NIV_EVENT
17/12/2025 08:36:07 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[main] main :MAIN :EVENT :ganesha.nfsd Starting: Ganesha Version 5.9
17/12/2025 08:36:07 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[main] load_rados_config :CONFIG :WARN :Missing RADOS URLs backend library
17/12/2025 08:36:07 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[main] nfs_set_param_from_conf :NFS STARTUP :EVENT :Configuration file successfully parsed
17/12/2025 08:36:07 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[main] fsal_init_fds_limit :MDCACHE LRU :EVENT :Setting the system-imposed limit on FDs to 1048576.
17/12/2025 08:36:07 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[main] init_server_pkgs :NFS STARTUP :EVENT :Initializing ID Mapper.
17/12/2025 08:36:07 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[main] init_server_pkgs :NFS STARTUP :EVENT :ID Mapper successfully initialized.
17/12/2025 08:36:07 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[main] nfs_start_grace :STATE :EVENT :NFS Server Now IN GRACE, duration 90
17/12/2025 08:36:07 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[main] nfs_start_grace :STATE :EVENT :grace reload client info completed from backend
17/12/2025 08:36:07 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[main] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
17/12/2025 08:36:07 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[main] config_errs_to_log :CONFIG :WARN :Config File (/etc/ganesha/ganesha.conf:34): Unknown parameter (NFS_CORE_PARAM)
17/12/2025 08:36:07 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[main] lower_my_caps :NFS STARTUP :EVENT :CAP_SYS_RESOURCE was successfully removed for proper quota management in FSAL
17/12/2025 08:36:07 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[main] lower_my_caps :NFS STARTUP :EVENT :currently set capabilities are: =ep cap_sys_resource-ep
17/12/2025 08:36:07 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[main] nfs_Init_svc :DISP :CRIT :Cannot acquire credentials for principal nfs
17/12/2025 08:36:07 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[main] nfs_Init_admin_thread :NFS CB :EVENT :Admin thread initialized
17/12/2025 08:36:07 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[main] nfs_rpc_cb_init_ccache :NFS STARTUP :EVENT :Callback creds directory (/var/run/ganesha) already exists
17/12/2025 08:36:07 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[main] find_keytab_entry :NFS CB :WARN :Configuration file does not specify default realm while getting default realm name
17/12/2025 08:36:07 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[main] gssd_refresh_krb5_machine_credential :NFS CB :CRIT :ERROR: gssd_refresh_krb5_machine_credential: no usable keytab entry found in keytab /etc/krb5.keytab for connection with host localhost
17/12/2025 08:36:07 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[main] nfs_rpc_cb_init_ccache :NFS STARTUP :WARN :gssd_refresh_krb5_machine_credential failed (-1765328160:99)
17/12/2025 08:36:07 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[main] nfs_Start_threads :THREAD :EVENT :Starting delayed executor.
17/12/2025 08:36:07 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[main] nfs_Start_threads :THREAD :EVENT :gsh_dbusthread was started successfully
17/12/2025 08:36:07 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[main] nfs_Start_threads :THREAD :EVENT :admin thread was started successfully
17/12/2025 08:36:07 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[main] nfs_Start_threads :THREAD :EVENT :reaper thread was started successfully
17/12/2025 08:36:07 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[main] nfs_Start_threads :THREAD :EVENT :General fridge was started successfully
17/12/2025 08:36:07 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
17/12/2025 08:36:07 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[main] nfs_start :NFS STARTUP :EVENT :-------------------------------------------------
17/12/2025 08:36:07 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[main] nfs_start :NFS STARTUP :EVENT : NFS SERVER INITIALIZED
17/12/2025 08:36:07 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[main] nfs_start :NFS STARTUP :EVENT :-------------------------------------------------
17/12/2025 08:36:17 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
17/12/2025 08:36:27 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
17/12/2025 08:36:37 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
17/12/2025 08:36:47 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
17/12/2025 08:36:57 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
17/12/2025 08:37:07 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
17/12/2025 08:37:17 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
17/12/2025 08:37:27 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
17/12/2025 08:37:37 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
17/12/2025 08:37:37 : epoch 69426b77 : server-home-nfs-57f8f5f4bd-h9dxq : ganesha.nfsd-1[reaper] nfs_lift_grace_locked :STATE :EVENT :NFS Server Now NOT IN GRACE
claude.ai discussion
Help me understand what GRACE is about in the context of NFS ganesha.
It seems to be something that happens during startup. Can clients work with the NFS server while it reports that it is "in grace"? I provide my nfs-ganesha startup logs.
claude.ai response
Based on your logs, I can explain what the GRACE period is in NFS-Ganesha:
What is the Grace Period?
The grace period is a special state that an NFSv4 server enters during startup to allow clients to reclaim their previously held state (like file locks, opens, and delegations) after a server restart or failover.
Purpose
When an NFS server crashes or restarts, clients may have had:
- Open file handles
- File locks
- Delegations (cached permissions to read/write)
The grace period gives these clients time to reconnect and reclaim their state before new operations are allowed, preventing conflicts.
Can Clients Work During Grace?
Partially - with restrictions:
- Existing clients that had state before the restart can reclaim their previous locks and opens
- New operations and new clients are typically blocked or delayed during the grace period
- Read-only operations may be allowed depending on configuration
Your Logs Analysis
In your case:
- Grace starts at 08:36:07 with duration 90 seconds
- The reaper thread checks every ~10 seconds: check grace:reclaim complete(0) clid count(0)
- clid count(0) means no clients had state to reclaim
- Grace ends at 08:37:37 (exactly 90 seconds later): NFS Server Now NOT IN GRACE
Since you had no clients with prior state (clid count(0)), the server simply waited the full 90-second duration before exiting grace and allowing normal operations. This is expected behavior on a fresh start with no previous client connections.
When will clients have state to reclaim? For example, if a read and/or write operation is aborted?
claude.ai response
Great question! Clients have state to reclaim when the NFS server restarts while clients still have active sessions with open files, locks, or delegations. It's not about individual operations being aborted, but about the server going down while clients are in the middle of longer-term operations.
When Clients Have State to Reclaim
You would see clid count > 0 if:
- Client has files open → Server restarts → Client reconnects during grace period to reclaim its open files
- Client holds file locks → Server crashes → Client reconnects to reclaim its locks
- Client has delegations (cached write permissions) → Server restarts → Client reclaims delegation
Example Scenario
Time 10:00 - Client opens file.txt for writing (state created)
Time 10:05 - NFS server pod crashes/restarts
Time 10:06 - Server enters 90-second grace period
Time 10:06 - Client detects server is back and sends RECLAIM request
→ Now you'd see: clid count(1)
Time 10:07 - Client successfully reclaims its open file handle
Time 10:08 - Grace period ends, normal operations resume
Your Current Situation
In your logs, clid count(0) means:
- Either no clients were connected when the server started
- Or all previous client sessions had cleanly closed before restart
- This is typical for a fresh deployment or planned restart
Single Operations (Read/Write)
Individual read/write operations are typically stateless in NFSv3 or quick enough in NFSv4 that they don't create long-term state needing reclaim. The grace period is specifically for persistent state like file handles that remain open between operations.
If the nfs-server terminated itself via SIGTERM, is a grace period needed?
claude.ai response
Excellent question! The answer is yes, a grace period is still needed, even after a clean SIGTERM shutdown. Here's why:
Why Grace Period is Still Required
The restarted server doesn't know how it was shut down:
- When the server starts up, it has no memory of whether the previous shutdown was clean (SIGTERM) or a crash
- It must assume the worst case (crash scenario) to maintain protocol correctness
- The NFSv4 specification requires a grace period after any restart
What Happens with SIGTERM
During clean shutdown (SIGTERM):
- Server receives signal and begins graceful shutdown
- Server can notify clients it's going down
- Clients have opportunity to cleanly close files and release locks
- State is properly cleaned up
But on restart:
- The new server process has no idea this happened
- It must offer a grace period "just in case" some client still thinks it has state
The Practical Result
This is exactly what you're seeing in your logs:
clid count(0) ← No clients need to reclaim
After SIGTERM, the grace period becomes essentially a safety waiting period with no actual reclaims, but it's still required by the protocol.
Could It Be Avoided?
Theoretically, yes, if:
- Server used persistent state storage across restarts
- It could definitively prove no clients had active state
- Implementation supported skipping grace
But standard NFS-Ganesha always does the 90-second grace period for safety and protocol compliance, regardless of how the previous instance terminated.
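As an aside, ganesha's NFSV4 block appears to also expose a Graceless toggle that skips the grace period entirely. Treat this as a hedged sketch (the parameter is taken from ganesha's core config documentation; verify it against the deployed version), and note that it trades protocol safety for availability, which is why lowering Grace_Period is the more conservative change:
NFSv4 {
# Unsafe shortcut: start without a grace period at all. Clients lose
# the chance to reclaim their locks/opens after a restart.
Graceless = true;
}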
Can nfs-ganesha be configured to use persistent state storage?
claude.ai response
Yes, NFS-Ganesha can be configured to use persistent state storage! In fact, your logs hint at this capability:
load_rados_config :CONFIG :WARN :Missing RADOS URLs backend library
State Storage Backends
NFS-Ganesha supports several backends for persistent client state:
1. RADOS (Ceph)
- Stores state in Ceph object storage
- Most common for clustered/HA deployments
- Requires the librados library
- Configuration block: RADOS_KV or RADOS_URLS
2. Database Backends
- Can use various databases for state persistence
- Allows state to survive single-server restarts
3. Recovery Backend (Filesystem)
- Stores recovery information on shared filesystem
- Simpler than RADOS but less robust
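A minimal sketch of how a backend is selected (the RecoveryBackend parameter lives in ganesha's NFSV4 block; the value names are from ganesha's documentation and should be verified against the deployed version):
NFSv4 {
# Where client recovery records are kept across restarts. "fs" is the
# default (local filesystem); the rados_* backends store records in
# Ceph so recovery state can survive restarts and failover.
RecoveryBackend = rados_kv;
}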
Benefits of Persistent State
With persistent state storage:
- Faster grace periods: Server can verify which clients actually need to reclaim
- Clustered NFS: Multiple Ganesha servers can share state for high availability
- Grace period shortening: If no clients have state, grace can end early
- Smoother failover: Clients seamlessly reconnect to another server
Your Current Setup
Your warning indicates:
- Ganesha tried to load RADOS config but the library isn't available
- You're running with no persistent state backend
- Each restart treats state as lost → full 90-second grace period
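For a single-pod k8s setup without Ceph, a simpler variant along the same lines could be the default fs backend with its recovery directory on a PersistentVolume, so recovery records survive pod restarts (a hedged sketch; RecoveryRoot and its default path are taken from ganesha's docs, and backing it with a PV is an assumption about the deployment):
NFSv4 {
# Default backend: recovery records as files on the local filesystem.
RecoveryBackend = fs;
# Place this path on a PersistentVolume so records survive restarts;
# /var/lib/nfs/ganesha is the documented default.
RecoveryRoot = "/var/lib/nfs/ganesha";
}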
Example Configuration
RADOS_KV {
pool = "nfs-ganesha";
namespace = "ganesha";
UserId = "admin";
watch_url = "rados://pool/namespace/obj";
}
For a single-server setup with planned restarts (SIGTERM), the overhead of persistent state may not be worth it. But for HA/clustered deployments, it's essential.
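Note that a RADOS_KV block like the one above only configures the backend; pairing it with RecoveryBackend = rados_kv in the NFSV4 block (as sketched earlier) is what actually selects it.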