Description
Node attr caching in the guest can lead to a stale node size being cached if size-modifying requests (append via WriteData, size extension via SetNodeAttr, AllocateData) overlap with attr-fetching requests (GetNodeAttr, ListNodes, CreateHandle). If a size-modifying request completes while an attr-fetching request is in flight, the guest must not cache the results of that attr-fetching request.
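One way to implement "don't cache attrs that raced with a size change" is a per-node modification generation counter. The sketch below is illustrative only - the class and method names are made up and are not the actual filestore API:

```python
import threading

class NodeAttrGuard:
    """Per-node modification generation. An attr-fetching request snapshots
    the generation when it starts; if any size-modifying request completes
    before the attr response is handed back to the guest, the generation no
    longer matches and the response must not be cached (attr_timeout=0)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._generation = {}  # node_id -> int

    def snapshot(self, node_id):
        # Called when GetNodeAttr/ListNodes/CreateHandle starts.
        with self._lock:
            return self._generation.get(node_id, 0)

    def on_size_modified(self, node_id):
        # Called when WriteData/SetNodeAttr/AllocateData completes.
        with self._lock:
            self._generation[node_id] = self._generation.get(node_id, 0) + 1

    def may_cache(self, node_id, snapshot):
        # Called just before the attr-fetching response is returned.
        with self._lock:
            return self._generation.get(node_id, 0) == snapshot
```

If `may_cache` returns False, the response should be delivered with zero attr/entry timeouts so the guest re-fetches the attrs on the next access.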
We seem to hit this problem, for example, when running compilation+linking with -j > 1 on top of filestore. A detailed description of one such case follows:
The problem appears to be caused by the fact that sometimes, after appending to a file, the guest ends up with the old (pre-append) inode size cached. An example scenario:
- the guest sends an append request for file X in dir Y
- the guest sends a listing request for dir Y
- the listing partially completes in the backend - the previous size (N1) for file X is retrieved
- the append op completes in the backend - the file size is increased to N2 - the guest receives the response
- [not sure whether this step is necessary] the inode for file X is evicted from the guest cache
- the listing request completes and returns a response which contains size N1 for file X along with entry_timeout=15s and attr_timeout=15s for this file
- the guest caches size=N1 for file X for 15 seconds
- an fstat call is performed by an application within those 15 seconds - size=N1 is returned; the call doesn't reach the backend
- a read call with offset>=N1 is performed by an application within those 15 seconds - zeroes are returned; the call doesn't reach the backend
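The interleaving above can be replayed deterministically. A minimal sketch (names are made up; the sizes 0 and 30936 are taken from the logs below):

```python
import time

ATTR_TIMEOUT = 15.0                # seconds, as in the ListNodes response

backend = {"varbit.o": 0}          # authoritative size in the backend
attr_cache = {}                    # guest cache: name -> (size, expires_at)

# 1. listing starts: the backend snapshots the pre-append size (N1)
pending_listing = {"varbit.o": backend["varbit.o"]}

# 2. append completes in the backend: the size grows to N2
backend["varbit.o"] = 30936

# 3. listing completes: the guest caches the stale size for 15 s
now = time.monotonic()
for name, size in pending_listing.items():
    attr_cache[name] = (size, now + ATTR_TIMEOUT)

# 4. fstat within the window is served from the cache and never
#    reaches the backend
def fstat(name):
    size, expires = attr_cache.get(name, (None, 0.0))
    if time.monotonic() < expires:
        return size                # stale N1 = 0
    return backend[name]           # would return N2 = 30936

print(fstat("varbit.o"))  # prints 0, while the backend already has 30936
```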
Example logs:
file created:
2026-02-02T23:14:33.080367Z→proller_test_nfs_postgresql-2-4-0→→→CreateHandle→→→→0.005894s→→→S_OK→→→→{parent_node_id=450, node_name=varbit.o, flags=38, mode=420, node_id=72057594038579030, handle=72325632859339675, size=0}→→→loop_thread_id=3758812→→client_id=local-qemu-32202
listing request:
2026-02-02T23:14:33.086454Z→proller_test_nfs_postgresql-2-4-0→→→ListNodes→→→0.007108s→→→S_OK→→→→{node_id=450, size=641}→loop_thread_id=3758812→→client_id=local-qemu-32202
append and flush - both start AFTER the listing starts and complete BEFORE the listing completes:
2026-02-02T23:14:33.087093Z→proller_test_nfs_postgresql-2-4-0→→→WriteData→→→0.006021s→→→S_OK→→→→[{node_id=72057594038579030, handle=72325632859339675, offset=0, bytes=30936, checksums=2961554979 3477393387 3515750452 3560497666 86651290 4212264968 3125409310 4087214380}]→loop_thread_id=3758812→→client_id=local-qemu-32202
2026-02-02T23:14:33.093456Z→proller_test_nfs_postgresql-2-4-0→→→Flush→→→0.000005s→→→S_OK→→→→{data_only=1, node_id=72057594038579030, handle=72325632859339675}
2026-02-02T23:14:33.086454 - listing starts
2026-02-02T23:14:33.087093 - append starts
2026-02-02T23:14:33.087093 + 0.006021s = 2026-02-02T23:14:33.093114 - append completes
2026-02-02T23:14:33.093456 - flush starts
2026-02-02T23:14:33.093456 + 0.000005s = 2026-02-02T23:14:33.093461 - flush completes
2026-02-02T23:14:33.086454 + 0.007108s = 2026-02-02T23:14:33.093562 - listing completes
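The overlap can be verified mechanically from the logged start timestamps and durations:

```python
from datetime import datetime, timedelta

def t(s):
    return datetime.strptime(s, "%Y-%m-%dT%H:%M:%S.%f")

listing_start = t("2026-02-02T23:14:33.086454")
listing_end   = listing_start + timedelta(seconds=0.007108)
append_start  = t("2026-02-02T23:14:33.087093")
append_end    = append_start + timedelta(seconds=0.006021)
flush_start   = t("2026-02-02T23:14:33.093456")
flush_end     = flush_start + timedelta(seconds=0.000005)

# both the append and the flush fall entirely inside the listing window
assert listing_start < append_start and append_end < listing_end
assert listing_start < flush_start and flush_end < listing_end
print(listing_end.time())  # prints 23:14:33.093562
```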
the same file was later opened multiple times, but no calls other than open/flush/close reached the backend:
2026-02-02T23:14:47.838319Z→proller_test_nfs_postgresql-2-4-0→→→CreateHandle→→→→0.000960s→→→S_OK→→→→{parent_node_id=72057594038579030, node_name=, flags=1, mode=0, node_id=72057594038579030, handle=72299628603254894, size=30936}→→→→loop_thread_id=3758812→→client_id=local-qemu-32202
2026-02-02T23:14:47.839662Z→proller_test_nfs_postgresql-2-4-0→→→CreateHandle→→→→0.001538s→→→S_OK→→→→{parent_node_id=72057594038579030, node_name=, flags=1, mode=0, node_id=72057594038579030, handle=72162422990483306, size=30936}→→→→loop_thread_id=3758812→→client_id=local-qemu-32202
2026-02-02T23:14:47.841470Z→proller_test_nfs_postgresql-2-4-0→→→Flush→→→0.000003s→→→S_OK→→→→{data_only=1, node_id=72057594038579030, handle=72162422990483306}
2026-02-02T23:14:47.841519Z→proller_test_nfs_postgresql-2-4-0→→→DestroyHandle→→→0.000576s→→→S_OK→→→→{node_id=72057594038579030, handle=72162422990483306}→→→loop_thread_id=3758812→→client_id=local-qemu-32202
2026-02-02T23:14:47.842287Z→proller_test_nfs_postgresql-2-4-0→→→DestroyHandle→→→0.001088s→→→S_OK→→→→{node_id=72057594038579030, handle=72299628603254894}→→→loop_thread_id=3758812→→client_id=local-qemu-32202
2026-02-02T23:14:47.842289Z→proller_test_nfs_postgresql-2-4-0→→→Flush→→→0.000001s→→→S_OK→→→→{data_only=1, node_id=72057594038579030, handle=72299628603254894}
2026-02-02T23:14:47.843502Z→proller_test_nfs_postgresql-2-4-0→→→CreateHandle→→→→0.002369s→→→S_OK→→→→{parent_node_id=72057594038579030, node_name=, flags=1, mode=0, node_id=72057594038579030, handle=72187929483325026, size=30936}→→→→loop_thread_id=3758812→→client_id=local-qemu-32202
2026-02-02T23:14:47.846701Z→proller_test_nfs_postgresql-2-4-0→→→Flush→→→0.000001s→→→S_OK→→→→{data_only=1, node_id=72057594038579030, handle=72187929483325026}
2026-02-02T23:14:47.846703Z→proller_test_nfs_postgresql-2-4-0→→→DestroyHandle→→→0.002005s→→→S_OK→→→→{node_id=72057594038579030, handle=72187929483325026}→→→loop_thread_id=3758812→→client_id=local-qemu-32202
here’s what we see in the strace log from the guest (the guest’s clock is not perfectly in sync with the backend’s):
23:14:46.310287 newfstatat(AT_FDCWD, "utils/adt/varbit.o", {st_mode=S_IFREG|0644, st_size=0, ...}, 0) = 0
23:14:46.310324 openat(AT_FDCWD, "utils/adt/varbit.o", O_RDONLY) = 712
23:14:46.311495 fstat(712, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
23:14:46.311534 read(712, "", 8192) = 0
23:14:46.311598 fstat(712, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
23:14:46.311656 openat(AT_FDCWD, "utils/adt/varbit.o", O_RDONLY) = 713
23:14:46.313302 fstat(713, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
23:14:46.313353 lseek(713, 0, SEEK_SET) = 0
23:14:46.313398 read(713, "", 16) = 0
23:14:46.313437 close(713) = 0
23:14:46.314251 close(712) = 0
and so on - stat calls return size 0 and read calls return nothing.
None of these calls reached the backend, even though the backend actually knows the correct size (see the size field in the logged CreateHandle responses above).
This happens when compiling (linux, postgres,…) with ccache and -j (jobs) greater than 1.
With a single job there is no overlap between the append/list/getattr calls, which is why the race cannot occur.
After we set entry_timeout and attr_timeout for the VMs running these tests to 1ms (we can’t set a lower value because it would be treated as unset and the default - 15s - would be used), the problem hasn’t reproduced anymore, which supports the hypothesis.
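The reason a zero timeout cannot be used is presumably a "zero means unset" check in the config handling. A hypothetical sketch of that pitfall (the function and constant names are made up):

```python
DEFAULT_ATTR_TIMEOUT = 15.0  # seconds

def effective_attr_timeout(configured):
    # A falsy value (0 or None) is indistinguishable from "not set",
    # so the default of 15 s kicks in; 1 ms is the smallest configured
    # value that survives this check.
    return configured if configured else DEFAULT_ATTR_TIMEOUT

print(effective_attr_timeout(0))      # prints 15.0 - zero falls back to the default
print(effective_attr_timeout(0.001))  # prints 0.001 - 1 ms is honored
```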