Bug: kernel null pointer dereference on virtio-mem hotplug #1391

@sharnoff

Environment

Production

Context

We recently upgraded from kernel version 6.6.64 to 6.12.26 in production (ref #1376).

Since then, we've seen a very small fraction (~1 in a million) of our VMs hit a bug that looks roughly like:

  1. We try to hotplug more memory with virtio-mem
  2. There's an allocation failure while setting up the physical pagetables
  3. That allocation failure isn't handled, and we end up dereferencing the null pointer
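
For anyone wanting to look for this signature in captured dmesg output, here is a rough sketch. It is based only on the log lines quoted in this issue; the function name and the 1-second window are assumptions made for illustration, not part of any existing tooling.

```python
import re

# Patterns taken from the log lines quoted in this issue.
ALLOC_FAIL = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\] (?P<comm>\S+): page allocation failure: order:(?P<order>\d+)"
)
NULL_DEREF = re.compile(r"^\[\s*(?P<ts>\d+\.\d+)\] BUG: kernel NULL pointer dereference")
WORKQUEUE = re.compile(r"Workqueue: \S+ virtio_mem_run_wq")


def find_virtio_mem_null_deref(lines, max_gap_seconds=1.0):
    """Return (alloc_failure_ts, null_deref_ts) pairs where a page allocation
    failure reported from the virtio-mem workqueue is followed shortly by a
    NULL pointer dereference. The time window is an arbitrary assumption."""
    hits = []
    alloc_ts = None
    in_virtio_mem = False
    for line in lines:
        m = ALLOC_FAIL.match(line)
        if m:
            # Start tracking a new allocation failure.
            alloc_ts = float(m.group("ts"))
            in_virtio_mem = False
            continue
        if alloc_ts is not None and WORKQUEUE.search(line):
            # The failing task was running the virtio-mem work item.
            in_virtio_mem = True
            continue
        m = NULL_DEREF.match(line)
        if m and alloc_ts is not None and in_virtio_mem:
            ts = float(m.group("ts"))
            if ts - alloc_ts <= max_gap_seconds:
                hits.append((alloc_ts, ts))
            alloc_ts = None
            in_virtio_mem = False
    return hits
```

Feeding it the lines of a saved console log (for example `find_virtio_mem_null_deref(open(path))`, with `path` being wherever the log was saved) returns one pair per occurrence.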

See this slack thread for more: https://neondb.slack.com/archives/C0807C9SSJ2/p1748363834798349

Impact

  1. The rate of occurrence is very small
  2. When VMs hit this, the kworker responsible for handling virtio-mem operations exits and does not restart.

So the overall impact is pretty small: the VMs continue operating normally, but with memory scaling broken.
(Notably, this is in contrast to the kcompactd issue, where the VM will eventually fall over.)
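
Since an affected VM otherwise keeps running, one way to spot it after the fact is to look for the kernel's note that a kworker exited. The sketch below is a minimal example that assumes the message format seen in the trace in this issue; the function name is made up for illustration.

```python
import re

# Matches lines such as "note: kworker/0:2[140] exited with irqs disabled".
EXIT_NOTE = re.compile(
    r"note: (?P<comm>\S+)\[(?P<pid>\d+)\] exited with (?P<state>.+)$"
)


def find_dead_kworkers(lines):
    """Return (comm, pid, state) for every kworker the kernel reports as exited."""
    dead = []
    for line in lines:
        m = EXIT_NOTE.search(line)
        if m and m.group("comm").startswith("kworker/"):
            dead.append((m.group("comm"), int(m.group("pid")), m.group("state")))
    return dead
```

This only reports that some kworker died; tying it to virtio-mem still needs the `Workqueue: events_freezable virtio_mem_run_wq` line from the preceding oops.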

Example stack trace(s)

Here's an example of what we saw for a particular VM:

Allocation failure
[ 1259.521867] virtio_mem virtio1: plugged size: 0x0
[ 1259.521939] virtio_mem virtio1: requested size: 0x40000000
[ 1259.534610] kworker/0:2: page allocation failure: order:0, mode:0x920(GFP_ATOMIC|__GFP_ZERO), nodemask=(null),cpuset=/,mems_allowed=0
[ 1259.534738] CPU: 0 UID: 0 PID: 140 Comm: kworker/0:2 Not tainted 6.12.26 #1
[ 1259.534740] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
[ 1259.534742] Workqueue: events_freezable virtio_mem_run_wq
[ 1259.534748] Call Trace:
[ 1259.534750]  <TASK>
[ 1259.534751]  dump_stack_lvl+0x5b/0x70
[ 1259.534755]  dump_stack+0x10/0x20
[ 1259.534756]  warn_alloc+0x103/0x180
[ 1259.534760]  __alloc_pages_slowpath.constprop.0+0x738/0xf30
[ 1259.534763]  __alloc_pages_noprof+0x1e9/0x340
[ 1259.534765]  alloc_pages_mpol_noprof+0x47/0x100
[ 1259.534767]  alloc_pages_noprof+0x4b/0x80
[ 1259.534768]  get_free_pages_noprof+0xc/0x40
[ 1259.534770]  alloc_low_pages+0xc2/0x150
[ 1259.534772]  phys_pud_init+0x82/0x390
[ 1259.534775]  phys_p4d_init+0x93/0x330
[ 1259.534777]  __kernel_physical_mapping_init+0xa1/0x370
[ 1259.534778]  kernel_physical_mapping_init+0xf/0x20
[ 1259.534780]  init_memory_mapping+0x1fa/0x430
[ 1259.534781]  arch_add_memory+0x2b/0x50
[ 1259.534783]  add_memory_resource+0xe6/0x260
[ 1259.534785]  add_memory_driver_managed+0x78/0xc0
[ 1259.534787]  virtio_mem_add_memory+0x46/0xc0
[ 1259.534789]  virtio_mem_sbm_plug_and_add_mb+0xa3/0x160
[ 1259.534791]  virtio_mem_run_wq+0x1035/0x16c0
[ 1259.534792]  process_one_work+0x17a/0x3c0
[ 1259.534795]  worker_thread+0x2c5/0x3f0
[ 1259.534797]  ? _raw_spin_unlock_irqrestore+0x9/0x30
[ 1259.534799]  ? __pfx_worker_thread+0x10/0x10
[ 1259.534801]  kthread+0xdc/0x110
[ 1259.534804]  ? __pfx_kthread+0x10/0x10
[ 1259.534805]  ret_from_fork+0x35/0x60
[ 1259.534810]  ? __pfx_kthread+0x10/0x10
[ 1259.534811]  ret_from_fork_asm+0x1a/0x30
[ 1259.534814]  </TASK>
[ 1259.534814] Mem-Info:
[ 1259.536035] active_anon:23991 inactive_anon:85286 isolated_anon:0
[ 1259.536035]  active_file:23055 inactive_file:79009 isolated_file:0
[ 1259.536035]  unevictable:0 dirty:5821 writeback:0
[ 1259.536035]  slab_reclaimable:4649 slab_unreclaimable:4807
[ 1259.536035]  mapped:87717 shmem:74808 pagetables:3067
[ 1259.536035]  sec_pagetables:0 bounce:0
[ 1259.536035]  kernel_misc_reclaimable:0
[ 1259.536035]  free:226 free_pcp:2199 free_cma:0
[ 1259.536323] Node 0 active_anon:95964kB inactive_anon:341144kB active_file:92220kB inactive_file:316036kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:350868kB dirty:23284kB writeback:0kB shmem:299232kB writeback_tmp:0kB kernel_stack:2828kB pagetables:12268kB sec_pagetables:0kB all_unreclaimable? no
[ 1259.536526] Node 0 DMA32 free:904kB boost:8676kB min:12532kB low:13496kB high:14460kB reserved_highatomic:2048KB active_anon:95964kB inactive_anon:341144kB active_file:92220kB inactive_file:316036kB unevictable:0kB writepending:23284kB present:1047984kB managed:936648kB mlocked:0kB bounce:0kB free_pcp:8796kB local_pcp:8796kB free_cma:0kB
[ 1259.536748] lowmem_reserve[]: 0 0 0
[ 1259.536788] Node 0 DMA32: 138*4kB (M) 13*8kB (H) 0*16kB 1*32kB (H) 1*64kB (H) 1*128kB (H) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 880kB
[ 1259.536902] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[ 1259.536981] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 1259.537060] 176887 total pagecache pages
[ 1259.537101] 0 pages in swap cache
[ 1259.537141] Free swap  = 16775916kB
[ 1259.537183] Total swap = 16777212kB
[ 1259.537224] 261996 pages RAM
[ 1259.537264] 0 pages HighMem/MovableOnly
[ 1259.537305] 27834 pages reserved
Null pointer dereference (immediately afterwards)
[ 1259.537348] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 1259.537404] #PF: supervisor read access in kernel mode
[ 1259.537449] #PF: error_code(0x0000) - not-present page
[ 1259.537496] PGD 423b067 P4D 0 
[ 1259.537538] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 1259.537587] CPU: 0 UID: 0 PID: 140 Comm: kworker/0:2 Not tainted 6.12.26 #1
[ 1259.537647] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
[ 1259.537734] Workqueue: events_freezable virtio_mem_run_wq
[ 1259.537784] RIP: 0010:phys_pmd_init+0xf0/0x3a0
[ 1259.537834] Code: 49 c1 e9 12 48 81 e7 00 00 e0 ff 48 8b 4d d0 4c 8d af 00 00 20 00 41 81 e1 f8 0f 00 00 4d 39 fe 4a 8d 1c 08 0f 83 76 01 00 00 <48> 8b 03 48 a9 9f ff ff ff 0f 85 48 ff ff ff f6 45 b8 04 0f 84 e0
[ 1259.537969] RSP: 0018:ff689bbc001bfa10 EFLAGS: 00010287
[ 1259.538007] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 8000000000000163
[ 1259.538064] RDX: 0000000108000000 RSI: 0000000000000000 RDI: 0000000100000000
[ 1259.538122] RBP: ff689bbc001bfa70 R08: 8000000000000163 R09: 0000000000000000
[ 1259.538179] R10: 000000000000000a R11: ffffffff870ce008 R12: 0000000000000000
[ 1259.538237] R13: 0000000100200000 R14: 0000000100000000 R15: 0000000108000000
[ 1259.538294] FS:  0000000000000000(0000) GS:ff22bd333e800000(0000) knlGS:0000000000000000
[ 1259.538366] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1259.538423] CR2: 0000000000000000 CR3: 0000000002594001 CR4: 0000000000371eb0
[ 1259.538483] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1259.538551] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1259.538609] Call Trace:
[ 1259.538628]  <TASK>
[ 1259.538648]  phys_pud_init+0xa0/0x390
[ 1259.538688]  phys_p4d_init+0x93/0x330
[ 1259.538717]  __kernel_physical_mapping_init+0xa1/0x370
[ 1259.538768]  kernel_physical_mapping_init+0xf/0x20
[ 1259.538818]  init_memory_mapping+0x1fa/0x430
[ 1259.538868]  arch_add_memory+0x2b/0x50
[ 1259.538908]  add_memory_resource+0xe6/0x260
[ 1259.538949]  add_memory_driver_managed+0x78/0xc0
[ 1259.538999]  virtio_mem_add_memory+0x46/0xc0
[ 1259.539038]  virtio_mem_sbm_plug_and_add_mb+0xa3/0x160
[ 1259.539088]  virtio_mem_run_wq+0x1035/0x16c0
[ 1259.539138]  process_one_work+0x17a/0x3c0
[ 1259.539166]  worker_thread+0x2c5/0x3f0
[ 1259.539196]  ? _raw_spin_unlock_irqrestore+0x9/0x30
[ 1259.539245]  ? __pfx_worker_thread+0x10/0x10
[ 1259.539295]  kthread+0xdc/0x110
[ 1259.539336]  ? __pfx_kthread+0x10/0x10
[ 1259.539377]  ret_from_fork+0x35/0x60
[ 1259.539418]  ? __pfx_kthread+0x10/0x10
[ 1259.539459]  ret_from_fork_asm+0x1a/0x30
[ 1259.539500]  </TASK>
[ 1259.539519] Modules linked in:
[ 1259.539549] CR2: 0000000000000000
[ 1259.539578] ---[ end trace 0000000000000000 ]---
[ 1259.539627] RIP: 0010:phys_pmd_init+0xf0/0x3a0
[ 1259.539678] Code: 49 c1 e9 12 48 81 e7 00 00 e0 ff 48 8b 4d d0 4c 8d af 00 00 20 00 41 81 e1 f8 0f 00 00 4d 39 fe 4a 8d 1c 08 0f 83 76 01 00 00 <48> 8b 03 48 a9 9f ff ff ff 0f 85 48 ff ff ff f6 45 b8 04 0f 84 e0
[ 1259.539813] RSP: 0018:ff689bbc001bfa10 EFLAGS: 00010287
[ 1259.539851] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 8000000000000163
[ 1259.539909] RDX: 0000000108000000 RSI: 0000000000000000 RDI: 0000000100000000
[ 1259.539967] RBP: ff689bbc001bfa70 R08: 8000000000000163 R09: 0000000000000000
[ 1259.540025] R10: 000000000000000a R11: ffffffff870ce008 R12: 0000000000000000
[ 1259.540082] R13: 0000000100200000 R14: 0000000100000000 R15: 0000000108000000
[ 1259.540140] FS:  0000000000000000(0000) GS:ff22bd333e800000(0000) knlGS:0000000000000000
[ 1259.540198] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1259.540257] CR2: 0000000000000000 CR3: 0000000002594001 CR4: 0000000000371eb0
[ 1259.540316] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1259.540384] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1259.540442] note: kworker/0:2[140] exited with irqs disabled

This was on a kernel from autoscaling release v0.47.0.
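
As a quick sanity check on the Mem-Info dump in the allocation-failure trace above, the numbers can be cross-checked with a bit of arithmetic. This assumes a 4 kB base page size (which the figures in the log agree with); all values are copied from the trace. It only shows how low free memory was at the time, not why the allocation failed.

```python
PAGE_KB = 4  # assumed base page size in kB

# Values copied from the log in this issue.
FREE_PAGES = 226               # "free:226"
ZONE_FREE_KB = 904             # "Node 0 DMA32 free:904kB"
ZONE_MIN_KB = 12532            # "min:12532kB"
RAM_PAGES = 261996             # "261996 pages RAM"
ZONE_PRESENT_KB = 1047984      # "present:1047984kB"
REQUESTED_BYTES = 0x40000000   # "requested size: 0x40000000"


def pages_to_kb(pages, page_kb=PAGE_KB):
    """Convert a page count to kB."""
    return pages * page_kb


if __name__ == "__main__":
    # The page counts and kB figures in the dump agree with each other.
    print(pages_to_kb(FREE_PAGES) == ZONE_FREE_KB)
    print(pages_to_kb(RAM_PAGES) == ZONE_PRESENT_KB)
    # Free memory in the zone as a fraction of its min watermark.
    print(f"free/min = {ZONE_FREE_KB / ZONE_MIN_KB:.3f}")
    # Size of the hotplug request in MiB.
    print(f"requested = {REQUESTED_BYTES // 2**20} MiB")
```

In other words, the VM had roughly 1 GiB of RAM with under 1 MB free in the zone, well below the zone's min watermark, when the 1 GiB hotplug request came in.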
