
[Filestore] Don't allow node attr caching in guest if node attr might have changed after the request #5293

@qkrorlqr

Description


Node attr caching in the guest can lead to a stale node size being cached if size-modifying requests (append via WriteData, size extension via SetNodeAttr, AllocateData) overlap with attr-fetching requests (GetNodeAttr, ListNodes, CreateHandle). If a size-modifying request completes while an attr-fetching request is in flight, we should not let the guest cache the results of that attr-fetching request.
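The idea can be sketched as a per-node modification counter (a hypothetical illustration, not the actual filestore code; `NodeAttrCacheGuard` and its method names are invented): an attr-fetching request snapshots the counter when it starts, and if the counter has advanced by the time the response is ready, the response should go out with zero attr/entry timeouts so the guest does not cache it.

```python
from collections import defaultdict

class NodeAttrCacheGuard:
    """Hypothetical sketch: decide whether the guest may cache node attrs."""

    def __init__(self):
        # node_id -> number of completed size-modifying requests
        self._mod_count = defaultdict(int)

    def on_modification_completed(self, node_id):
        # Called when WriteData / SetNodeAttr / AllocateData completes.
        self._mod_count[node_id] += 1

    def snapshot(self, node_id):
        # Called when GetNodeAttr / ListNodes / CreateHandle starts.
        return self._mod_count[node_id]

    def may_cache(self, node_id, snap):
        # True only if no size-modifying request completed in between;
        # otherwise the reply should carry attr_timeout = entry_timeout = 0.
        return self._mod_count[node_id] == snap

guard = NodeAttrCacheGuard()
snap = guard.snapshot(450)               # attr-fetching request starts
guard.on_modification_completed(450)     # append completes mid-flight
assert not guard.may_cache(450, snap)    # guest must not cache this reply
```

A fresh snapshot taken after the modification would be cacheable again, so steady-state workloads keep the benefit of attr caching.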

We seem to hit this problem, for example, when running compilation+linking with -j > 1 on top of filestore. A detailed description of one such case follows:

The problem appears to be that, after appending to a file, the guest sometimes ends up with the old (pre-append) inode size cached. An example scenario:

  1. guest sends an append request for file X in dir Y
  2. guest sends a listing request for dir Y
  3. listing partially completes in the backend - prev size (N1) for file X is retrieved
  4. append op completes in the backend - file size increased to N2 - guest receives the response
  5. [not sure whether this step is necessary] the inode for file X is evicted from the guest cache
  6. listing request completes, returns a response which contains size N1 for file X and contains entry_timeout=15s, attr_timeout=15s for this file
  7. guest caches size=N1 for file X for 15 seconds
  8. fstat call is performed by an application within those 15 seconds - size=N1 is returned, the call doesn’t reach the backend
  9. read call with offset>=N1 is performed by an application within those 15 seconds - zeroes are returned, the call doesn’t reach the backend
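The effect of steps 6-9 can be modeled with a toy TTL cache (a hypothetical sketch; the real attr cache lives in the guest kernel's FUSE/virtiofs code): once the stale size arrives with attr_timeout=15s, lookups within that window are served from the cache and never reach the backend.

```python
class GuestAttrCache:
    """Toy model of the guest kernel's attr cache (hypothetical sketch)."""

    def __init__(self):
        self._cache = {}  # node_id -> (size, expires_at)

    def store(self, node_id, size, attr_timeout, now):
        self._cache[node_id] = (size, now + attr_timeout)

    def lookup(self, node_id, now):
        entry = self._cache.get(node_id)
        if entry and now < entry[1]:
            return entry[0]   # served from cache, backend not contacted
        return None           # cache miss: guest must ask the backend

N1, N2 = 0, 30936  # pre-append and post-append sizes from the logs below
cache = GuestAttrCache()

# Step 7: the late ListNodes response carries size=N1, attr_timeout=15s.
cache.store(node_id=72057594038579030, size=N1, attr_timeout=15.0, now=0.0)

# Steps 8-9: fstat/read within those 15 seconds see the stale size.
assert cache.lookup(72057594038579030, now=14.0) == N1
# Only after the timeout expires does the guest go back to the backend.
assert cache.lookup(72057594038579030, now=16.0) is None
```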

Example logs:

file created:

2026-02-02T23:14:33.080367Z→proller_test_nfs_postgresql-2-4-0→→→CreateHandle→→→→0.005894s→→→S_OK→→→→{parent_node_id=450, node_name=varbit.o, flags=38, mode=420, node_id=72057594038579030, handle=72325632859339675, size=0}→→→loop_thread_id=3758812→→client_id=local-qemu-32202

listing request:

2026-02-02T23:14:33.086454Z→proller_test_nfs_postgresql-2-4-0→→→ListNodes→→→0.007108s→→→S_OK→→→→{node_id=450, size=641}→loop_thread_id=3758812→→client_id=local-qemu-32202

append and flush - both start AFTER the listing starts and complete BEFORE the listing completes:

2026-02-02T23:14:33.087093Z→proller_test_nfs_postgresql-2-4-0→→→WriteData→→→0.006021s→→→S_OK→→→→[{node_id=72057594038579030, handle=72325632859339675, offset=0, bytes=30936, checksums=2961554979 3477393387 3515750452 3560497666 86651290 4212264968 3125409310 4087214380}]→loop_thread_id=3758812→→client_id=local-qemu-32202                                                                                                                                                                              
2026-02-02T23:14:33.093456Z→proller_test_nfs_postgresql-2-4-0→→→Flush→→→0.000005s→→→S_OK→→→→{data_only=1, node_id=72057594038579030, handle=72325632859339675} 

2026-02-02T23:14:33.086454 - listing starts

2026-02-02T23:14:33.087093 - append starts

2026-02-02T23:14:33.087093 + 0.006021s = 2026-02-02T23:14:33.093114 - append completes

2026-02-02T23:14:33.093456 - flush starts

2026-02-02T23:14:33.093456 + 0.000005s = 2026-02-02T23:14:33.093461 - flush completes

2026-02-02T23:14:33.086454 + 0.007108s = 2026-02-02T23:14:33.093562 - listing completes
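These completion times follow directly from the logged start timestamps and durations; a quick check using the exact numbers from the log above confirms the overlap:

```python
from datetime import datetime, timedelta

# Start timestamps and durations copied from the backend log lines above.
FMT = "%H:%M:%S.%f"

listing_start = datetime.strptime("23:14:33.086454", FMT)
listing_end = listing_start + timedelta(seconds=0.007108)

append_start = datetime.strptime("23:14:33.087093", FMT)
append_end = append_start + timedelta(seconds=0.006021)

flush_start = datetime.strptime("23:14:33.093456", FMT)
flush_end = flush_start + timedelta(seconds=0.000005)

# Both the append and the flush start after the listing starts
# and complete before the listing completes:
assert listing_start < append_start and append_end < listing_end
assert listing_start < flush_start and flush_end < listing_end
```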

The same file was later opened multiple times, but no calls other than open/flush/close reached the backend:

2026-02-02T23:14:47.838319Z→proller_test_nfs_postgresql-2-4-0→→→CreateHandle→→→→0.000960s→→→S_OK→→→→{parent_node_id=72057594038579030, node_name=, flags=1, mode=0, node_id=72057594038579030, handle=72299628603254894, size=30936}→→→→loop_thread_id=3758812→→client_id=local-qemu-32202                                                                                                                                                                                                                      
2026-02-02T23:14:47.839662Z→proller_test_nfs_postgresql-2-4-0→→→CreateHandle→→→→0.001538s→→→S_OK→→→→{parent_node_id=72057594038579030, node_name=, flags=1, mode=0, node_id=72057594038579030, handle=72162422990483306, size=30936}→→→→loop_thread_id=3758812→→client_id=local-qemu-32202                                                                                                                                                                                                                      
2026-02-02T23:14:47.841470Z→proller_test_nfs_postgresql-2-4-0→→→Flush→→→0.000003s→→→S_OK→→→→{data_only=1, node_id=72057594038579030, handle=72162422990483306}                                                                                          
2026-02-02T23:14:47.841519Z→proller_test_nfs_postgresql-2-4-0→→→DestroyHandle→→→0.000576s→→→S_OK→→→→{node_id=72057594038579030, handle=72162422990483306}→→→loop_thread_id=3758812→→client_id=local-qemu-32202                                          
2026-02-02T23:14:47.842287Z→proller_test_nfs_postgresql-2-4-0→→→DestroyHandle→→→0.001088s→→→S_OK→→→→{node_id=72057594038579030, handle=72299628603254894}→→→loop_thread_id=3758812→→client_id=local-qemu-32202                                          
2026-02-02T23:14:47.842289Z→proller_test_nfs_postgresql-2-4-0→→→Flush→→→0.000001s→→→S_OK→→→→{data_only=1, node_id=72057594038579030, handle=72299628603254894}                                                                                          
2026-02-02T23:14:47.843502Z→proller_test_nfs_postgresql-2-4-0→→→CreateHandle→→→→0.002369s→→→S_OK→→→→{parent_node_id=72057594038579030, node_name=, flags=1, mode=0, node_id=72057594038579030, handle=72187929483325026, size=30936}→→→→loop_thread_id=3758812→→client_id=local-qemu-32202                                                                                                                                                                                                                         
2026-02-02T23:14:47.846701Z→proller_test_nfs_postgresql-2-4-0→→→Flush→→→0.000001s→→→S_OK→→→→{data_only=1, node_id=72057594038579030, handle=72187929483325026}                                                                                          
2026-02-02T23:14:47.846703Z→proller_test_nfs_postgresql-2-4-0→→→DestroyHandle→→→0.002005s→→→S_OK→→→→{node_id=72057594038579030, handle=72187929483325026}→→→loop_thread_id=3758812→→client_id=local-qemu-32202  

Here's what we see in the strace log from the guest (its clock is not perfectly in sync with the backend):

23:14:46.310287 newfstatat(AT_FDCWD, "utils/adt/varbit.o", {st_mode=S_IFREG|0644, st_size=0, ...}, 0) = 0
23:14:46.310324 openat(AT_FDCWD, "utils/adt/varbit.o", O_RDONLY) = 712
23:14:46.311495 fstat(712, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
23:14:46.311534 read(712, "", 8192)     = 0
23:14:46.311598 fstat(712, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
23:14:46.311656 openat(AT_FDCWD, "utils/adt/varbit.o", O_RDONLY) = 713
23:14:46.313302 fstat(713, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
23:14:46.313353 lseek(713, 0, SEEK_SET) = 0                                                                               
23:14:46.313398 read(713, "", 16)       = 0                                                                             
23:14:46.313437 close(713)              = 0                                                                                
23:14:46.314251 close(712)              = 0
etc

The stat calls return size 0 and the read calls return no data.

None of those calls reached the backend (and, as the backend logs show, the backend actually knows the correct size - see the logged CreateHandle responses).

This happens when compiling (linux, postgres, …) with ccache and -j (jobs) greater than 1.

With a single job there is no overlap between append/list/getattr calls, which is why the race cannot occur.

After we set entry_timeout and attr_timeout to 1ms for the VMs running these tests (we can't set a lower value, because it would be treated as unset and the 15s default would apply), the problem has not reproduced anymore, which supports the hypothesis.

Labels: bug, filestore