
Conversation

ixhamza (Member) commented Nov 17, 2025

Motivation and Context

This fixes a race condition that causes kernel panics when multiple processes simultaneously access a snapshot that has not yet been automounted. The bug triggers a VERIFY() assertion failure in the AVL tree code when concurrent threads attempt to add identical entries during snapshot automount.

Description

The race condition occurs in zfsctl_snapshot_mount() due to a time-of-check to time-of-use (TOCTOU) window between checking whether a snapshot is mounted and adding it to the AVL tree. The sequence is as follows (a toy model of the race appears after the list):

  1. Thread A checks whether the snapshot is mounted via zfsctl_snapshot_ismounted() - returns FALSE
  2. Thread B performs the same check - also returns FALSE (no AVL entry exists yet)
  3. Both threads release the reader lock and proceed to mount
  4. Both threads invoke the userspace mount helper via call_usermodehelper()
  5. Thread A acquires the write lock and adds its entry to the AVL tree - succeeds
  6. Thread B acquires the write lock and attempts to add a duplicate entry
  7. The AVL tree's VERIFY() assertion fails and the kernel panics
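
The same interleaving can be modeled outside the kernel. The toy C program below is not OpenZFS code: a pthread rwlock stands in for the snapshot rwlock, a flat array for the AVL tree, assert() for VERIFY(), and the function names are illustrative. Two threads racing on the same name both pass the check, and the second insertion trips the assertion.

/*
 * Toy userspace model of the racy check-then-add pattern (illustrative
 * names only).  Build with: cc -pthread toctou.c
 */
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static pthread_rwlock_t tree_lock = PTHREAD_RWLOCK_INITIALIZER;
static char mounted[8][64];	/* stand-in for the snapshot AVL tree */
static int nmounted;

static bool
snapshot_ismounted(const char *name)
{
	bool found = false;

	pthread_rwlock_rdlock(&tree_lock);
	for (int i = 0; i < nmounted; i++)
		if (strcmp(mounted[i], name) == 0)
			found = true;
	pthread_rwlock_unlock(&tree_lock);
	return (found);
}

static void
snapshot_add(const char *name)
{
	pthread_rwlock_wrlock(&tree_lock);
	for (int i = 0; i < nmounted; i++)
		assert(strcmp(mounted[i], name) != 0);	/* ~ VERIFY() in avl_add() */
	strcpy(mounted[nmounted++], name);
	pthread_rwlock_unlock(&tree_lock);
}

static void *
racy_automount(void *arg)
{
	const char *name = arg;

	if (snapshot_ismounted(name))	/* time of check: both threads see FALSE */
		return (NULL);
	usleep(1000);			/* both threads "run the mount helper" here */
	snapshot_add(name);		/* time of use: the second add asserts */
	return (NULL);
}

int
main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, racy_automount, (void *)"snap1");
	pthread_create(&b, NULL, racy_automount, (void *)"snap1");
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	printf("mounted %d time(s)\n", nmounted);
	return (0);
}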

The fix adds a pending-entry mechanism with per-entry mutex synchronization. The first mount thread creates a pending AVL entry and holds se_mtx for the duration of the helper execution. Concurrent mounts find the pending entry and return success without spawning duplicate helpers, preventing the AVL panic. A simplified sketch of the mount path follows.
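
The sketch below shows the shape of that synchronization as a userspace pthread model. It is an illustration under assumed names (snapentry_lookup(), run_mount_helper(), a linked list in place of the real AVL tree), not the actual patch.

/* Userspace sketch of the pending-entry scheme (illustrative names only). */
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

typedef struct snapentry {
	char			se_name[256];
	bool			se_pending;	/* mount helper not finished yet */
	pthread_mutex_t		se_mtx;		/* serializes mount/unmount */
	struct snapentry	*se_next;
} snapentry_t;

static pthread_rwlock_t tree_lock = PTHREAD_RWLOCK_INITIALIZER;
static snapentry_t *tree_head;			/* stand-in for the AVL tree */

extern int run_mount_helper(const char *name);	/* ~ call_usermodehelper() */

static snapentry_t *
snapentry_lookup(const char *name)		/* caller holds tree_lock */
{
	for (snapentry_t *se = tree_head; se != NULL; se = se->se_next)
		if (strcmp(se->se_name, name) == 0)
			return (se);
	return (NULL);
}

static int
snapshot_mount(const char *name)
{
	snapentry_t *se;
	int error;

	pthread_rwlock_wrlock(&tree_lock);
	if (snapentry_lookup(name) != NULL) {
		/* Already mounted, or another thread's mount is pending. */
		pthread_rwlock_unlock(&tree_lock);
		return (0);
	}
	/* First thread: publish a pending entry and lock it before the helper. */
	if ((se = calloc(1, sizeof (*se))) == NULL) {
		pthread_rwlock_unlock(&tree_lock);
		return (ENOMEM);
	}
	strncpy(se->se_name, name, sizeof (se->se_name) - 1);
	se->se_pending = true;
	pthread_mutex_init(&se->se_mtx, NULL);
	pthread_mutex_lock(&se->se_mtx);
	se->se_next = tree_head;
	tree_head = se;
	pthread_rwlock_unlock(&tree_lock);

	error = run_mount_helper(name);		/* no tree lock held here */

	pthread_rwlock_wrlock(&tree_lock);
	if (error == 0)
		se->se_pending = false;		/* entry now describes a real mount */
	/* on error the pending entry would be unlinked and freed (elided) */
	pthread_rwlock_unlock(&tree_lock);
	pthread_mutex_unlock(&se->se_mtx);
	return (error);
}

In this model the first thread takes se_mtx on an entry it has just created while still holding the tree write lock, so no other thread can yet hold that mutex, which keeps the lock ordering deadlock-free.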

Kernel Stack Trace:

[   24.765626] VERIFY(avl_find(tree, new_node, &where) == NULL) failed
[   24.768232] PANIC at avl.c:625:avl_add()
[   24.769719] Showing stack for process 858
[   24.771962] CPU: 13 UID: 0 PID: 858 Comm: ls Not tainted 6.12.43-production+ #656
[   24.771974] Hardware name: QEMU Ubuntu 24.04 PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[   24.771978] Call Trace:
[   24.771982]  <TASK>
[   24.771987]  dump_stack_lvl+0x6c/0x90
[   24.772002]  dump_stack+0x10/0x20
[   24.772010]  spl_dumpstack+0x28/0x30
[   24.772019]  spl_panic+0xd3/0xf0
[   24.772027]  ? schedule+0x34/0x100
[   24.772037]  ? __kmalloc_node_noprof+0x13d/0x3f0
[   24.772045]  ? __kvmalloc_node_noprof+0x24/0xe0
[   24.772054]  ? __kvmalloc_node_noprof+0x24/0xe0
[   24.772061]  ? __kvmalloc_node_noprof+0x24/0xe0
[   24.772068]  ? __kmalloc_noprof+0x157/0x3c0
[   24.772074]  ? snapentry_compare_by_name+0x14/0x30
[   24.772082]  spl_assert.constprop.0+0x1a/0x30
[   24.772091]  avl_add+0x7e/0x90
[   24.772099]  zfsctl_snapshot_add+0x34/0x70
[   24.772105]  zfsctl_snapshot_mount+0x51d/0x700
[   24.772117]  zpl_snapdir_automount+0x10/0x40
[   24.772124]  __traverse_mounts+0x8f/0x210
[   24.772134]  step_into+0x33a/0x760
[   24.772140]  walk_component+0x51/0x180
[   24.772145]  path_lookupat+0x6a/0x1a0
[   24.772150]  filename_lookup+0xc8/0x1c0
[   24.772161]  vfs_statx+0x7b/0xe0
[   24.772171]  do_statx+0x45/0x80
[   24.772180]  ? __check_object_size+0x25b/0x2c0
[   24.772191]  ? strncpy_from_user+0x25/0x100
[   24.772199]  ? getname_flags.part.0+0x4b/0x1d0
[   24.772207]  ? getname_flags+0x37/0x60
[   24.772212]  __x64_sys_statx+0x9b/0xe0
[   24.772225]  x64_sys_call+0x16b5/0x2060
[   24.772235]  do_syscall_64+0x69/0x110
[   24.772247]  ? debug_smp_processor_id+0x17/0x20
[   24.772259]  ? __alloc_pages_noprof+0x164/0x310
[   24.772270]  ? debug_smp_processor_id+0x17/0x20
[   24.772277]  ? __mod_memcg_lruvec_state+0xf9/0x1b0
[   24.772288]  ? debug_smp_processor_id+0x17/0x20
[   24.772294]  ? __folio_batch_add_and_move+0xf3/0x110
[   24.772304]  ? set_ptes.isra.0+0x3b/0x80
[   24.772311]  ? _raw_spin_unlock+0x19/0x40
[   24.772322]  ? do_anonymous_page+0x111/0x800
[   24.772331]  ? __pte_offset_map+0x1c/0x180
[   24.772342]  ? __handle_mm_fault+0xbc1/0x1040
[   24.772356]  ? debug_smp_processor_id+0x17/0x20
[   24.772363]  ? __count_memcg_events+0x76/0x110
[   24.772374]  ? count_memcg_events.constprop.0+0x1e/0x40
[   24.772386]  ? debug_smp_processor_id+0x17/0x20
[   24.772558]  ? fpregs_assert_state_consistent+0x29/0x60
[   24.772573]  ? arch_exit_to_user_mode_prepare.isra.0+0x24/0xe0
[   24.772585]  ? irqentry_exit_to_user_mode+0x2d/0x120
[   24.772591]  ? irqentry_exit+0x3b/0x50
[   24.772595]  ? clear_bhb_loop+0x50/0xa0
[   24.772605]  ? clear_bhb_loop+0x50/0xa0
[   24.772612]  ? clear_bhb_loop+0x50/0xa0
[   24.772618]  ? clear_bhb_loop+0x50/0xa0
[   24.772625]  ? clear_bhb_loop+0x50/0xa0
[   24.772632]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   24.772641] RIP: 0033:0x7f2abed8aa2f
[   24.772648] Code: c0 78 16 48 89 e6 48 89 df e8 f2 05 00 00 b8 00 00 00 00 48 83 ec 80 5b c3 b8 ff ff ff ff eb f3 41 89 ca b8 4c 01 00 00 0f 05 <48> 89 c2 48 3d 00 f0 ff ff 77 03 89 d0 c3 f7 d8 48 8b 15 aa 33 0d
[   24.772653] RSP: 002b:00007ffcfd19ee08 EFLAGS: 00000202 ORIG_RAX: 000000000000014c
[   24.772661] RAX: ffffffffffffffda RBX: 00007ffcfd19f4f8 RCX: 00007f2abed8aa2f
[   24.772665] RDX: 0000000000000800 RSI: 00007ffcfd19feee RDI: 00000000ffffff9c
[   24.772668] RBP: 00007ffcfd19ef60 R08: 00007ffcfd19ee40 R09: 0000000000000000
[   24.772671] R10: 0000000000000002 R11: 0000000000000202 R12: 0000000000000000
[   24.772674] R13: 000055b31f123ed8 R14: 00007ffcfd19f510 R15: 000055b31f123ed8
[   24.772682]  </TASK>

How Has This Been Tested?

Reproduction script:

zpool create -f testpool /dev/sdc -O mountpoint=none
zfs create -o mountpoint=/mnt/test -o snapdir=visible testpool/testfs
zfs snapshot testpool/testfs@snap1
# Spawn concurrent lookups that all trigger automount of the same snapshot
for i in {1..10}; do
    ls /mnt/test/.zfs/snapshot/snap1/ &
done
wait

Results:

  • Without fix: Panic occurs immediately on concurrent access to fresh snapshot
  • With fix: No panic with parallel access to snapshots

Types of changes

  • Bug fix (non-breaking change which fixes an issue)


@behlendorf behlendorf added the Status: Code Review Needed (Ready for review and testing) label Nov 17, 2025
amotin (Member) previously approved these changes Nov 18, 2025

@amotin amotin left a comment


It makes me wonder why parallel mounts can somehow succeed when identical sequential ones fail. But if that is the state of things, then I have no objections.

@amotin amotin dismissed their stale review November 18, 2025 15:46

It seems multiple successful parallel mount calls result in multiple mounts, which is not right. It needs deeper investigation and possibly a different solution.

Multiple threads racing to automount the same snapshot can both spawn
mount helper processes that successfully complete, causing both parent
threads to attempt AVL tree registration and triggering a VERIFY()
panic in avl_add(). This occurs because the fsconfig/fsmount API lacks
the serialization provided by traditional mount() via lock_mount().

The fix adds a per-entry mutex (se_mtx) to zfs_snapentry_t that
serializes mount and unmount operations on the same snapshot. The first
mount thread creates a pending entry with se_spa=NULL and holds se_mtx
during the helper execution. Concurrent mounts find the pending entry
and return success without spawning duplicate helpers. Unmount waits on
se_mtx if a mount is pending, ensuring proper serialization. This allows
different snapshots to mount in parallel while preventing the AVL panic.

Signed-off-by: Ameer Hamza <[email protected]>
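
For illustration, the unmount serialization described in the commit message could look like the sketch below, continuing the same userspace model: taking the per-entry mutex before running the umount helper makes unmount wait for any in-flight mount of the same snapshot. The names (run_umount_helper(), snapentry_remove()) are hypothetical, and reference counting and most error handling are elided.

/* Sketch of the unmount side (illustrative names, same assumed scheme). */
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

typedef struct snapentry {
	char			se_name[256];
	bool			se_pending;
	pthread_mutex_t		se_mtx;		/* serializes mount/unmount */
	struct snapentry	*se_next;
} snapentry_t;

extern pthread_rwlock_t tree_lock;
extern snapentry_t *snapentry_lookup(const char *name);
extern void snapentry_remove(snapentry_t *se);	/* unlink from the tree only */
extern int run_umount_helper(const char *name);

static int
snapshot_unmount(const char *name)
{
	snapentry_t *se;
	int error;

	pthread_rwlock_rdlock(&tree_lock);
	se = snapentry_lookup(name);
	pthread_rwlock_unlock(&tree_lock);
	if (se == NULL)
		return (ENOENT);

	/* Blocks here until a pending mount of this snapshot has finished. */
	pthread_mutex_lock(&se->se_mtx);
	error = run_umount_helper(name);
	if (error == 0) {
		pthread_rwlock_wrlock(&tree_lock);
		snapentry_remove(se);
		pthread_rwlock_unlock(&tree_lock);
	}
	pthread_mutex_unlock(&se->se_mtx);
	/* freeing the unlinked entry would happen after the unlock (elided) */
	return (error);
}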