@pvts-mat pvts-mat commented Feb 12, 2025

CVE-2023-4921
VULN-6730

Problem

https://www.cve.org/CVERecord?id=CVE-2023-4921

A use-after-free vulnerability in the Linux kernel's net/sched: sch_qfq component can be exploited to achieve local privilege escalation. When the plug qdisc is used as a class of the qfq qdisc, sending network packets triggers a use-after-free in qfq_dequeue() due to the incorrect .peek handler of sch_plug and the lack of error checking in agg_dequeue().
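
The failure mode can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not kernel code: the names `PlugModel`, `dequeue_unchecked` and `dequeue_checked` are hypothetical, and strings in a deque stand in for skbs. It assumes the child is plugged, so its dequeue releases nothing while its head-peek still exposes the queued packet.

```python
from collections import deque


class PlugModel:
    """Toy stand-in for a plugged leaf qdisc (hypothetical, not kernel code)."""

    def __init__(self):
        self.queue = deque()
        self.throttled = True  # assumption: the leaf starts plugged

    def enqueue(self, pkt):
        self.queue.append(pkt)

    def peek_head(self):
        # Exposes the head packet even while plugged: the problematic peek.
        return self.queue[0] if self.queue else None

    def dequeue(self):
        # A plugged leaf releases nothing, even if packets are queued.
        if self.throttled or not self.queue:
            return None
        return self.queue.popleft()


def dequeue_unchecked(child):
    """Parent trusts peek and ignores the result of the child's dequeue."""
    pkt = child.peek_head()
    if pkt is None:
        return None
    child.dequeue()  # return value dropped: pkt may still be queued in child
    return pkt


def dequeue_checked(child):
    """Parent only hands out what the child actually released."""
    return child.dequeue()


child = PlugModel()
child.enqueue("skb0")

sent = dequeue_unchecked(child)
# The packet was handed to the caller, yet the child still references it.
still_queued = sent in child.queue

safe = dequeue_checked(child)
print(sent, still_queued, safe)  # skb0 True None
```

In this model the unchecked path leaves the same packet referenced from two places. In the kernel the corresponding object is an skb, so once the caller frees it, the reference left in the child's queue dangles.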

Solution

A single commit was identified as a fix for this issue: 8fc134fee27f2263988ae38920bc03da416b03d8

kABI check: passed

python3 /mnt/code/kernel-dist-git/SOURCES/check-kabi \
        -k /mnt/code/kernel-dist-git/SOURCES/Module.kabi_$(uname -m) \
        -s /mnt/build_files/kernel-src-tree-ciqlts8_8-CVE-2023-4921/Module.symvers; echo $?

0

kernel-dist-git state:

Switched to branch 'el-8.8'
Your branch is up to date with 'origin/el-8.8'.

Boot test: passed

A successful boot is implied by the Specific tests section below, which could only be run on a booted kernel.

Kselftests: passed relative

Methodology

A mix of kernel-selftests-internal and source-compiled tests were used:

  • kernel-selftests-internal: bpf tests, except:
    • bpf:test_kmod.sh: takes a very long time to finish and always fails anyway,
    • bpf:test_progs: unstable, can crash the machine,
    • bpf:test_progs-no_alu32: unstable, can crash the machine.
  • source-compiled: all the rest.

Coverage (including tests skipped during execution)

android, bpf, breakpoints, capabilities, cgroup, core, cpu-hotplug, cpufreq, drivers/net/bonding, drivers/net/team, efivarfs, exec, filesystems, firmware, fpu, ftrace, futex, gpio, intel_pstate, ipc, kcmp, kvm, lib, livepatch, membarrier, memfd, memory-hotplug, mount, mqueue, net, net/forwarding, net/mptcp, netfilter, nsfs, proc, pstore, ptrace, rseq, rtc, sgx, sigaltstack, size, splice, static_keys, sync, sysctl, tc-testing, tdx, timens, timers, tpm2, user, vm, x86, zram

Reference ciqlts8_8 (683666ad1a6d7754125126d580f2994b4e35b3cd)

Four test runs were conducted on the reference kernel.
kselftests--mixed--ciqlts8_8--run1.log
kselftests--mixed--ciqlts8_8--run2.log
kselftests--mixed--ciqlts8_8--run3.log
kselftests--mixed--ciqlts8_8--run4.log

Patch

A single test run was conducted on the patched kernel.
kselftests--mixed--ciqlts8_8-CVE-2023-4921.log

Comparison

Overview of the results, reduced to the differences:

ktests.xsh  table --where "Summary = 'diff'" kselftests*.log

Column    File
--------  ----------------------------------------------
Status0   kselftests--mixed--ciqlts8_8--run1.log
Status1   kselftests--mixed--ciqlts8_8--run2.log
Status2   kselftests--mixed--ciqlts8_8--run3.log
Status3   kselftests--mixed--ciqlts8_8--run4.log
Status4   kselftests--mixed--ciqlts8_8-CVE-2023-4921.log

TestCase                   Status0  Status1  Status2  Status3  Status4  Summary
bpf:test_xdp_veth.sh       skip     skip     pass     skip     skip     diff
net/mptcp:simult_flows.sh  pass     fail     pass     pass     pass     diff
net:gro.sh                 pass     pass     fail     pass     pass     diff
net:xfrm_policy.sh         fail     fail     pass     fail     fail     diff

No differences occurred in the patched run that were not already present in the reference test set.
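
The rule applied here can be written down explicitly: a test counts as a regression only if its status on the patched kernel was never observed in any reference run. Below is a minimal sketch of that rule using the statuses from the table above. The function name `regressions` is hypothetical, and this is not meant to reflect how `ktests.xsh` is implemented.

```python
# Statuses per test: four reference runs (Status0..Status3), then the patched run.
RESULTS = {
    "bpf:test_xdp_veth.sh":      (["skip", "skip", "pass", "skip"], "skip"),
    "net/mptcp:simult_flows.sh": (["pass", "fail", "pass", "pass"], "pass"),
    "net:gro.sh":                (["pass", "pass", "fail", "pass"], "pass"),
    "net:xfrm_policy.sh":        (["fail", "fail", "pass", "fail"], "fail"),
}


def regressions(results):
    """Return tests whose patched status never occurred in the reference runs."""
    return sorted(
        name
        for name, (reference, patched) in results.items()
        if patched not in set(reference)
    )


print(regressions(RESULTS))  # []
```

A test such as `net:xfrm_policy.sh` is therefore not flagged: its `fail` on the patched kernel also appears in three of the four reference runs.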

New unreliable tests were identified:

  • bpf:test_xdp_veth.sh (rpm selftests package): the test may either run and pass

    # selftests: bpf: test_xdp_veth.sh
    # PING 10.1.1.33 (10.1.1.33) 56(84) bytes of data.
    # 64 bytes from 10.1.1.33: icmp_seq=1 ttl=64 time=0.559 ms
    # 
    # --- 10.1.1.33 ping statistics ---
    # 1 packets transmitted, 1 received, 0% packet loss, time 0ms
    # rtt min/avg/max/mdev = 0.559/0.559/0.559/0.000 ms
    # selftests: xdp_veth [PASS]
    ok 19 selftests: bpf: test_xdp_veth.sh
    

    or it may be skipped

    # selftests: bpf: test_xdp_veth.sh
    # Cannot create namespace file "/var/run/netns/ns3": File exists
    # selftests: xdp_veth [SKIP]
    ok 19 selftests: bpf: test_xdp_veth.sh # SKIP
    

    This seems to be related to the net/mptcp:mptcp_connect.sh test, which sometimes creates an ns3 namespace

    # selftests: net/mptcp: mptcp_connect.sh
    # INFO: set ns3-67a772bd-2iZfQl dev ns3eth2: ethtool -K  gso off gro off
    # INFO: set ns4-67a772bd-2iZfQl dev ns4eth3: ethtool -K  gso off gro off
    # Created /tmp/tmp.8ngPQFHWq5 (size 7183388	/tmp/tmp.8ngPQFHWq5) containing data sent by client
    # Created /tmp/tmp.aNLEp7GmM5 (size 2554908	/tmp/tmp.aNLEp7GmM5) containing data sent by server
    ...
    

    and sometimes doesn't

    # selftests: net/mptcp: mptcp_connect.sh
    # INFO: set ns4-67a77e0c-d8qLil dev ns4eth3: ethtool -K tso off gso off
    # Created /tmp/tmp.T3DnUSNyRd (size 4120604	/tmp/tmp.T3DnUSNyRd) containing data sent by client
    # Created /tmp/tmp.mveJmQppTp (size 435228	/tmp/tmp.mveJmQppTp) containing data sent by server
    ...
    
  • net/mptcp:mptcp_connect.sh: For the source-compiled tests set. See above.
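
If the skip is indeed caused by a leftover namespace, a pre-flight check before running bpf:test_xdp_veth.sh would make it visible. A minimal sketch, assuming named namespaces appear as entries under `/var/run/netns` (as the error message above suggests). The helper name `stale_namespaces` and the default name list are assumptions for illustration.

```python
import os


def stale_namespaces(names=("ns3",), netns_dir="/var/run/netns"):
    """Return the requested namespace names that already exist in netns_dir."""
    if not os.path.isdir(netns_dir):
        return []
    present = set(os.listdir(netns_dir))
    return [name for name in names if name in present]


leftovers = stale_namespaces()
if leftovers:
    print("leftover namespaces:", ", ".join(leftovers))
else:
    print("no leftover namespaces found")
```

Running this between test collections would show whether an earlier test left `ns3` behind, without changing any state on the machine.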

Specific tests: passed

Bug replication

The bug can be replicated with the following commands, as mentioned in the commit message:

tc qdisc add dev lo root handle 1: qfq
tc class add dev lo parent 1: classid 1:1 qfq weight 1 maxpkt 512
tc qdisc add dev lo parent 1:1 handle 2: plug
tc filter add dev lo parent 1: basic classid 1:1
ping -c1 127.0.0.1

The tests were performed on both the reference and the patched kernel.

Prerequisites

The tc commands above require the following kernel options to be enabled: CONFIG_NET_SCHED, CONFIG_NET_SCH_INGRESS, CONFIG_NET_SCH_QFQ, CONFIG_NET_SCH_PLUG, CONFIG_NET_CLS_BASIC.

All of them are enabled by default in the configs/kernel-4.18.0-x86_64.config configuration file for the tested x86_64 platform.

CONFIG_NET_SCHED=y
CONFIG_NET_SCH_INGRESS=m
CONFIG_NET_SCH_QFQ=m
CONFIG_NET_SCH_PLUG=m
CONFIG_NET_CLS_BASIC=m
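
To confirm these options for another config file, it is enough to scan it for the required symbols. The sketch below assumes the usual `CONFIG_X=y` / `CONFIG_X=m` line format. The helper name `missing_options` is hypothetical, and the sample text reuses the values listed above.

```python
REQUIRED = [
    "CONFIG_NET_SCHED",
    "CONFIG_NET_SCH_INGRESS",
    "CONFIG_NET_SCH_QFQ",
    "CONFIG_NET_SCH_PLUG",
    "CONFIG_NET_CLS_BASIC",
]


def missing_options(config_text, required=REQUIRED):
    """Return required options that are not set to 'y' or 'm' in config_text."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue  # skips comments such as '# CONFIG_FOO is not set'
        name, value = line.split("=", 1)
        if value in ("y", "m"):
            enabled.add(name)
    return [opt for opt in required if opt not in enabled]


SAMPLE = """\
CONFIG_NET_SCHED=y
CONFIG_NET_SCH_INGRESS=m
CONFIG_NET_SCH_QFQ=m
CONFIG_NET_SCH_PLUG=m
CONFIG_NET_CLS_BASIC=m
"""

print(missing_options(SAMPLE))  # []
```

For a real check, the contents of a file such as configs/kernel-4.18.0-x86_64.config would be passed in place of `SAMPLE`.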

Reference ciqlts8_8 (683666ad1a6d7754125126d580f2994b4e35b3cd)

Bug replicated successfully. Kernel crashed and machine automatically rebooted.

[root@ciqlts8_8 pvts]# tc qdisc add dev lo root handle 1: qfq
[root@ciqlts8_8 pvts]# tc class add dev lo parent 1: classid 1:1 qfq weight 1 maxpkt 512
[root@ciqlts8_8 pvts]# tc qdisc add dev lo parent 1:1 handle 2: plug
[root@ciqlts8_8 pvts]# tc filter add dev lo parent 1: basic classid 1:1
[root@ciqlts8_8 pvts]# ping -c1 127.0.0.1
PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
[   30.846825] ------------[ cut here ]------------
[   30.849050] kernel BUG at mm/slub.c:376!
[   30.850468] invalid opcode: 0000 [#1] SMP PTI
[   30.851962] CPU: 2 PID: 1553 Comm: ping Kdump: loaded Not tainted 4.18.0-ciqlts8_8 #1
[   30.854641] Hardware name: Red Hat KVM/RHEL, BIOS 1.16.3-2.el9_5.1 04/01/2014
[   30.857114] RIP: 0010:set_freepointer.part.57+0x0/0x10
[   30.858983] Code: 83 ef 70 e9 22 6a fa ff 66 90 0f 1f 44 00 00 41 54 55 53 48 8b 06 48 85 c0 0f 85 56 86 00 00 5b 5d 41 5c e9 e2 ce ad 00 66 90 <0f> 0b 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 41 57 41 56 41 55
[   30.863746] RSP: 0018:ffffbbd94326cdb0 EFLAGS: 00010246
[   30.865158] RAX: ffffa08780f31200 RBX: ffffa087a87d8800 RCX: ffffa08780f31300
[   30.866906] RDX: 000000000000063e RSI: 0000000000000000 RDI: ffffa08740005180
[   30.868644] RBP: ffffe8164503cc00 R08: 0000000080000000 R09: ffffdbd93fd1b880
[   30.870065] R10: 0000000000000228 R11: 0000000000000228 R12: ffffa08740005180
[   30.871528] R13: ffffa08780f31200 R14: ffffffffb8e11fd5 R15: 0000000000000001
[   30.872931] FS:  00007f3bdf377480(0000) GS:ffffa08e9f900000(0000) knlGS:0000000000000000
[   30.874482] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   30.875579] CR2: 00007ffdc419d000 CR3: 000000011ac66003 CR4: 0000000000370ee0
[   30.876865] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   30.878090] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   30.879282] Call Trace:
[   30.879718]  <IRQ>
[   30.880056]  kfree+0x238/0x250
[   30.880583]  ? __netif_receive_skb_core+0x145/0xd10
[   30.881393]  kfree_skb_reason+0x45/0x110
[   30.882049]  __netif_receive_skb_core+0x145/0xd10
[   30.882843]  ? loopback_xmit+0xd8/0x130
[   30.883586]  process_backlog+0xaa/0x170
[   30.884236]  __napi_poll+0x2d/0x130
[   30.884829]  net_rx_action+0x252/0x320
[   30.885459]  __do_softirq+0xdc/0x2cf
[   30.886069]  do_softirq_own_stack+0x2a/0x40
[   30.886754]  </IRQ>
[   30.887102]  do_softirq.part.16+0x45/0x50
[   30.887769]  __local_bh_enable_ip+0x4f/0x60
[   30.888459]  ip_finish_output2+0x1a6/0x430
[   30.889140]  ip_output+0x70/0xf0
[   30.889676]  ? __ip_finish_output+0x1d0/0x1d0
[   30.890389]  ip_send_skb+0x15/0x40
[   30.890977]  ping_v4_sendmsg+0x5bf/0x780
[   30.891710]  ? sock_has_perm+0x80/0xa0
[   30.892404]  ? release_sock+0x43/0x90
[   30.893023]  ? sock_sendmsg+0x42/0x60
[   30.893641]  sock_sendmsg+0x42/0x60
[   30.894208]  __sys_sendto+0xee/0x160
[   30.894795]  ? syscall_trace_enter+0x1ff/0x2d0
[   30.895521]  ? ksys_ioctl+0x64/0xa0
[   30.896101]  __x64_sys_sendto+0x24/0x30
[   30.896740]  do_syscall_64+0x5b/0x1b0
[   30.897340]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[   30.898145] RIP: 0033:0x7f3bde014b4b
[   30.898724] Code: 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 f3 0f 1e fa 48 8d 05 f5 4b 29 00 41 89 ca 8b 00 85 c0 75 14 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 75 c3 0f 1f 40 00 41 57 4d 89 c7 41 56 41 89
[   30.901652] RSP: 002b:00007ffdc4096a68 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[   30.902842] RAX: ffffffffffffffda RBX: 0000559edafb6700 RCX: 00007f3bde014b4b
[   30.903967] RDX: 0000000000000040 RSI: 0000559edafb6700 RDI: 0000000000000003
[   30.905093] RBP: 0000000000000040 R08: 0000559edafb3500 R09: 0000000000000010
[   30.906224] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffdc4098190
[   30.907339] R13: 00007ffdc4096a70 R14: 00007ffdc4096b60 R15: 0000559edafb20a0
[   30.908450] Modules linked in: cls_basic sch_plug sch_qfq vfat fat intel_rapl_msr intel_rapl_common intel_uncore_frequency_common isst_if_common nfit libnvdimm kvm_intel iTCO_wdt iTCO_vendor_support kvm virtio_gpu irqbypass drm_shmem_helper drm_kms_helper rapl syscopyarea sysfillrect sysimgblt fb_sys_fops joydev pcspkr drm virtio_balloon i2c_i801 lpc_ich xfs libcrc32c sr_mod cdrom sg crct10dif_pclmul crc32_pclmul crc32c_intel virtio_net ahci ghash_clmulni_intel libahci serio_raw libata net_failover virtio_blk virtio_console failover virtiofs sunrpc dm_mirror dm_region_hash dm_log dm_mod fuse
[    0.000000] Linux version 4.18.0-ciqlts8_8 (pvts@ciqlts8_8) (gcc version 8.5.0 20210514 (Red Hat 8.5.0-18) (GCC)) #1 SMP Sun Feb 2 16:43:15 UTC 2025
[    0.000000] Command line: elfcorehdr=0x6f000000 BOOT_IMAGE=(hd0,gpt2)/vmlinuz-4.18.0-ciqlts8_8 ro console=ttyS0,115200n8 no_timer_check net.ifnames=0 irqpoll nr_cpus=1 reset_devices cgroup_disable=memory mce=off numa=off udev.children-max=2 panic=10 rootflags=nofail acpi_no_memhotplug transparent_hugepage=never nokaslr novmcoredd hest_disable disable_cpu_apicid=0 iTCO_wdt.pretimeout=0
...

Full log:
bug-replication--ciqlts8_8.log

Patch

Patch efficacy verified successfully. Repeating the steps doesn't result in a crash and the machine doesn't reboot.

Full log:
bug-replication--ciqlts8_8-CVE-2023-4921.log

A warning can be observed

qfq_dequeue: non-workconserving leaf

issued at sch_qfq.c. A work-conserving queue discipline is a qdisc that never leaves the outbound interface idle unless its queue is empty, while non-work-conserving qdiscs may delay packets, for example to shape traffic (see footnote 1). The plug qdisc, which is the leaf of the qdisc hierarchy used here, is non-work-conserving because it can suspend the packet flow. Several qdisc types are apparently not designed to work with non-work-conserving child qdiscs and issue similar warnings on the skb == NULL condition: ets, hfsc, htb, drr. See also the message of b00355db3f88d96810a60011a30cfb2c3469409d:

Patrick McHardy <[email protected]> suggested:
> How about making this flag and the warning message (in a out-of-line
> function) globally available? Other qdiscs (f.i. HFSC) can't deal with
> inner non-work-conserving qdiscs as well.

and the message of 6d25d1dc76bf5943a5c1f4bb74d66d5eac58eb77, which adds qfq to the group above:

A helper function for printing non-work-conserving alarms is added in
commit b00355db3f88 ("pkt_sched: sch_hfsc: sch_htb: Add non-work-conserving
 warning handler."). In this commit, use qdisc_warn_nonwc() instead of
WARN_ONCE() to handle the non-work-conserving warning in qfq Qdisc.

The point is: the warning is related to the way the qdisc hierarchy has been defined in the bug replication script and not to the introduced changes.
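
The behaviour behind this warning can be pictured with a toy model: a parent asks a non-empty leaf for a packet, receives nothing, and reports this once per leaf. This is a user-space sketch only. `Leaf`, `warn_nonwc`, `parent_dequeue` and the `warned` flag are hypothetical names loosely modelled on the idea behind qdisc_warn_nonwc(), not its actual implementation.

```python
class Leaf:
    """Toy leaf qdisc that can hold packets back (non-work-conserving)."""

    def __init__(self, name, plugged):
        self.name = name
        self.plugged = plugged
        self.packets = []
        self.warned = False  # hypothetical per-leaf "already warned" flag

    def dequeue(self):
        if self.plugged or not self.packets:
            return None
        return self.packets.pop(0)


def warn_nonwc(where, leaf, log):
    """Record the warning at most once per leaf."""
    if not leaf.warned:
        log.append(f"{where}: non-workconserving leaf ({leaf.name})")
        leaf.warned = True


def parent_dequeue(leaf, log):
    pkt = leaf.dequeue()
    if pkt is None and leaf.packets:
        # The leaf is non-empty but released nothing: not work-conserving.
        warn_nonwc("parent_dequeue", leaf, log)
    return pkt


log = []
plug = Leaf("plug", plugged=True)
plug.packets.append("skb0")
parent_dequeue(plug, log)
parent_dequeue(plug, log)
print(log)  # ['parent_dequeue: non-workconserving leaf (plug)']
```

The warning depends only on the leaf holding back a queued packet, which is why it appears with the plug qdisc regardless of the patch.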

Footnotes

1 https://tldp.org/HOWTO/Adv-Routing-HOWTO/lartc.qdisc.terminology.html

jira VULN-6730
cve CVE-2023-4921
commit-author valis <[email protected]>
commit 8fc134f

When the plug qdisc is used as a class of the qfq qdisc it could trigger a
UAF. This issue can be reproduced with following commands:

  tc qdisc add dev lo root handle 1: qfq
  tc class add dev lo parent 1: classid 1:1 qfq weight 1 maxpkt 512
  tc qdisc add dev lo parent 1:1 handle 2: plug
  tc filter add dev lo parent 1: basic classid 1:1
  ping -c1 127.0.0.1

and boom:

[  285.353793] BUG: KASAN: slab-use-after-free in qfq_dequeue+0xa7/0x7f0
[  285.354910] Read of size 4 at addr ffff8880bad312a8 by task ping/144
[  285.355903]
[  285.356165] CPU: 1 PID: 144 Comm: ping Not tainted 6.5.0-rc3+ #4
[  285.357112] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
[  285.358376] Call Trace:
[  285.358773]  <IRQ>
[  285.359109]  dump_stack_lvl+0x44/0x60
[  285.359708]  print_address_description.constprop.0+0x2c/0x3c0
[  285.360611]  kasan_report+0x10c/0x120
[  285.361195]  ? qfq_dequeue+0xa7/0x7f0
[  285.361780]  qfq_dequeue+0xa7/0x7f0
[  285.362342]  __qdisc_run+0xf1/0x970
[  285.362903]  net_tx_action+0x28e/0x460
[  285.363502]  __do_softirq+0x11b/0x3de
[  285.364097]  do_softirq.part.0+0x72/0x90
[  285.364721]  </IRQ>
[  285.365072]  <TASK>
[  285.365422]  __local_bh_enable_ip+0x77/0x90
[  285.366079]  __dev_queue_xmit+0x95f/0x1550
[  285.366732]  ? __pfx_csum_and_copy_from_iter+0x10/0x10
[  285.367526]  ? __pfx___dev_queue_xmit+0x10/0x10
[  285.368259]  ? __build_skb_around+0x129/0x190
[  285.368960]  ? ip_generic_getfrag+0x12c/0x170
[  285.369653]  ? __pfx_ip_generic_getfrag+0x10/0x10
[  285.370390]  ? csum_partial+0x8/0x20
[  285.370961]  ? raw_getfrag+0xe5/0x140
[  285.371559]  ip_finish_output2+0x539/0xa40
[  285.372222]  ? __pfx_ip_finish_output2+0x10/0x10
[  285.372954]  ip_output+0x113/0x1e0
[  285.373512]  ? __pfx_ip_output+0x10/0x10
[  285.374130]  ? icmp_out_count+0x49/0x60
[  285.374739]  ? __pfx_ip_finish_output+0x10/0x10
[  285.375457]  ip_push_pending_frames+0xf3/0x100
[  285.376173]  raw_sendmsg+0xef5/0x12d0
[  285.376760]  ? do_syscall_64+0x40/0x90
[  285.377359]  ? __static_call_text_end+0x136578/0x136578
[  285.378173]  ? do_syscall_64+0x40/0x90
[  285.378772]  ? kasan_enable_current+0x11/0x20
[  285.379469]  ? __pfx_raw_sendmsg+0x10/0x10
[  285.380137]  ? __sock_create+0x13e/0x270
[  285.380673]  ? __sys_socket+0xf3/0x180
[  285.381174]  ? __x64_sys_socket+0x3d/0x50
[  285.381725]  ? entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[  285.382425]  ? __rcu_read_unlock+0x48/0x70
[  285.382975]  ? ip4_datagram_release_cb+0xd8/0x380
[  285.383608]  ? __pfx_ip4_datagram_release_cb+0x10/0x10
[  285.384295]  ? preempt_count_sub+0x14/0xc0
[  285.384844]  ? __list_del_entry_valid+0x76/0x140
[  285.385467]  ? _raw_spin_lock_bh+0x87/0xe0
[  285.386014]  ? __pfx__raw_spin_lock_bh+0x10/0x10
[  285.386645]  ? release_sock+0xa0/0xd0
[  285.387148]  ? preempt_count_sub+0x14/0xc0
[  285.387712]  ? freeze_secondary_cpus+0x348/0x3c0
[  285.388341]  ? aa_sk_perm+0x177/0x390
[  285.388856]  ? __pfx_aa_sk_perm+0x10/0x10
[  285.389441]  ? check_stack_object+0x22/0x70
[  285.390032]  ? inet_send_prepare+0x2f/0x120
[  285.390603]  ? __pfx_inet_sendmsg+0x10/0x10
[  285.391172]  sock_sendmsg+0xcc/0xe0
[  285.391667]  __sys_sendto+0x190/0x230
[  285.392168]  ? __pfx___sys_sendto+0x10/0x10
[  285.392727]  ? kvm_clock_get_cycles+0x14/0x30
[  285.393328]  ? set_normalized_timespec64+0x57/0x70
[  285.393980]  ? _raw_spin_unlock_irq+0x1b/0x40
[  285.394578]  ? __x64_sys_clock_gettime+0x11c/0x160
[  285.395225]  ? __pfx___x64_sys_clock_gettime+0x10/0x10
[  285.395908]  ? _copy_to_user+0x3e/0x60
[  285.396432]  ? exit_to_user_mode_prepare+0x1a/0x120
[  285.397086]  ? syscall_exit_to_user_mode+0x22/0x50
[  285.397734]  ? do_syscall_64+0x71/0x90
[  285.398258]  __x64_sys_sendto+0x74/0x90
[  285.398786]  do_syscall_64+0x64/0x90
[  285.399273]  ? exit_to_user_mode_prepare+0x1a/0x120
[  285.399949]  ? syscall_exit_to_user_mode+0x22/0x50
[  285.400605]  ? do_syscall_64+0x71/0x90
[  285.401124]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[  285.401807] RIP: 0033:0x495726
[  285.402233] Code: ff ff ff f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 11 b8 2c 00 00 00 0f 09
[  285.404683] RSP: 002b:00007ffcc25fb618 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[  285.405677] RAX: ffffffffffffffda RBX: 0000000000000040 RCX: 0000000000495726
[  285.406628] RDX: 0000000000000040 RSI: 0000000002518750 RDI: 0000000000000000
[  285.407565] RBP: 00000000005205ef R08: 00000000005f8838 R09: 000000000000001c
[  285.408523] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000002517634
[  285.409460] R13: 00007ffcc25fb6f0 R14: 0000000000000003 R15: 0000000000000000
[  285.410403]  </TASK>
[  285.410704]
[  285.410929] Allocated by task 144:
[  285.411402]  kasan_save_stack+0x1e/0x40
[  285.411926]  kasan_set_track+0x21/0x30
[  285.412442]  __kasan_slab_alloc+0x55/0x70
[  285.412973]  kmem_cache_alloc_node+0x187/0x3d0
[  285.413567]  __alloc_skb+0x1b4/0x230
[  285.414060]  __ip_append_data+0x17f7/0x1b60
[  285.414633]  ip_append_data+0x97/0xf0
[  285.415144]  raw_sendmsg+0x5a8/0x12d0
[  285.415640]  sock_sendmsg+0xcc/0xe0
[  285.416117]  __sys_sendto+0x190/0x230
[  285.416626]  __x64_sys_sendto+0x74/0x90
[  285.417145]  do_syscall_64+0x64/0x90
[  285.417624]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[  285.418306]
[  285.418531] Freed by task 144:
[  285.418960]  kasan_save_stack+0x1e/0x40
[  285.419469]  kasan_set_track+0x21/0x30
[  285.419988]  kasan_save_free_info+0x27/0x40
[  285.420556]  ____kasan_slab_free+0x109/0x1a0
[  285.421146]  kmem_cache_free+0x1c2/0x450
[  285.421680]  __netif_receive_skb_core+0x2ce/0x1870
[  285.422333]  __netif_receive_skb_one_core+0x97/0x140
[  285.423003]  process_backlog+0x100/0x2f0
[  285.423537]  __napi_poll+0x5c/0x2d0
[  285.424023]  net_rx_action+0x2be/0x560
[  285.424510]  __do_softirq+0x11b/0x3de
[  285.425034]
[  285.425254] The buggy address belongs to the object at ffff8880bad31280
[  285.425254]  which belongs to the cache skbuff_head_cache of size 224
[  285.426993] The buggy address is located 40 bytes inside of
[  285.426993]  freed 224-byte region [ffff8880bad31280, ffff8880bad31360)
[  285.428572]
[  285.428798] The buggy address belongs to the physical page:
[  285.429540] page:00000000f4b77674 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0xbad31
[  285.430758] flags: 0x100000000000200(slab|node=0|zone=1)
[  285.431447] page_type: 0xffffffff()
[  285.431934] raw: 0100000000000200 ffff88810094a8c0 dead000000000122 0000000000000000
[  285.432757] raw: 0000000000000000 00000000800c000c 00000001ffffffff 0000000000000000
[  285.433562] page dumped because: kasan: bad access detected
[  285.434144]
[  285.434320] Memory state around the buggy address:
[  285.434828]  ffff8880bad31180: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[  285.435580]  ffff8880bad31200: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[  285.436264] >ffff8880bad31280: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  285.436777]                                   ^
[  285.437106]  ffff8880bad31300: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
[  285.437616]  ffff8880bad31380: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[  285.438126] ==================================================================
[  285.438662] Disabling lock debugging due to kernel taint

Fix this by:
1. Changing sch_plug's .peek handler to qdisc_peek_dequeued(), a
function compatible with non-work-conserving qdiscs
2. Checking the return value of qdisc_dequeue_peeked() in sch_qfq.

Fixes: 462dbc9 ("pkt_sched: QFQ Plus: fair-queueing service at DRR cost")
	Reported-by: valis <[email protected]>
	Signed-off-by: valis <[email protected]>
	Signed-off-by: Jamal Hadi Salim <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
	Signed-off-by: Paolo Abeni <[email protected]>
(cherry picked from commit 8fc134f)
	Signed-off-by: Marcin Wcisło <[email protected]>
@pvts-mat pvts-mat marked this pull request as draft February 12, 2025 03:01
@pvts-mat pvts-mat marked this pull request as ready for review February 12, 2025 18:18

@PlaidCat PlaidCat left a comment

:shipit:
Starting github runners too


@gvrose8192 gvrose8192 left a comment

LGTM - Thanks!


@bmastbergen bmastbergen left a comment

🥌

@PlaidCat PlaidCat merged commit 9b4e8bb into ctrliq:ciqlts8_8 Feb 13, 2025
2 checks passed
github-actions bot pushed a commit that referenced this pull request Jun 24, 2025
Following softlockup can be easily reproduced on my test machine with:

echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
swapon /dev/zram0 # zram0 is a 48G swap device
mkdir -p /sys/fs/cgroup/memory/test
echo 1G > /sys/fs/cgroup/test/memory.max
echo $BASHPID > /sys/fs/cgroup/test/cgroup.procs
while true; do
    dd if=/dev/zero of=/tmp/test.img bs=1M count=5120
    cat /tmp/test.img > /dev/null
    rm /tmp/test.img
done

Then after a while:
watchdog: BUG: soft lockup - CPU#0 stuck for 763s! [cat:5787]
Modules linked in: zram virtiofs
CPU: 0 UID: 0 PID: 5787 Comm: cat Kdump: loaded Tainted: G             L      6.15.0.orig-gf3021d9246bc-dirty #118 PREEMPT(voluntary)
Tainted: [L]=SOFTLOCKUP
Hardware name: Red Hat KVM/RHEL-AV, BIOS 0.0.0 02/06/2015
RIP: 0010:mpol_shared_policy_lookup+0xd/0x70
Code: e9 b8 b4 ff ff 31 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 0f 1f 00 0f 1f 44 00 00 41 54 55 53 <48> 8b 1f 48 85 db 74 41 4c 8d 67 08 48 89 fb 48 89 f5 4c 89 e7 e8
RSP: 0018:ffffc90002b1fc28 EFLAGS: 00000202
RAX: 00000000001c20ca RBX: 0000000000724e1e RCX: 0000000000000001
RDX: ffff888118e214c8 RSI: 0000000000057d42 RDI: ffff888118e21518
RBP: 000000000002bec8 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000bf4 R11: 0000000000000000 R12: 0000000000000001
R13: 00000000001c20ca R14: 00000000001c20ca R15: 0000000000000000
FS:  00007f03f995c740(0000) GS:ffff88a07ad9a000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f03f98f1000 CR3: 0000000144626004 CR4: 0000000000770eb0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
 <TASK>
 shmem_alloc_folio+0x31/0xc0
 shmem_swapin_folio+0x309/0xcf0
 ? filemap_get_entry+0x117/0x1e0
 ? xas_load+0xd/0xb0
 ? filemap_get_entry+0x101/0x1e0
 shmem_get_folio_gfp+0x2ed/0x5b0
 shmem_file_read_iter+0x7f/0x2e0
 vfs_read+0x252/0x330
 ksys_read+0x68/0xf0
 do_syscall_64+0x4c/0x1c0
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f03f9a46991
Code: 00 48 8b 15 81 14 10 00 f7 d8 64 89 02 b8 ff ff ff ff eb bd e8 20 ad 01 00 f3 0f 1e fa 80 3d 35 97 10 00 00 74 13 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 4f c3 66 0f 1f 44 00 00 55 48 89 e5 48 83 ec
RSP: 002b:00007fff3c52bd28 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
RAX: ffffffffffffffda RBX: 0000000000040000 RCX: 00007f03f9a46991
RDX: 0000000000040000 RSI: 00007f03f98ba000 RDI: 0000000000000003
RBP: 00007fff3c52bd50 R08: 0000000000000000 R09: 00007f03f9b9a380
R10: 0000000000000022 R11: 0000000000000246 R12: 0000000000040000
R13: 00007f03f98ba000 R14: 0000000000000003 R15: 0000000000000000
 </TASK>

The reason is simple, readahead brought some order 0 folio in swap cache,
and the swapin mTHP folio being allocated is in conflict with it, so
swapcache_prepare fails and causes shmem_swap_alloc_folio to return
-EEXIST, and shmem simply retries again and again causing this loop.

Fix it by applying a similar fix for anon mTHP swapin.

The performance change is very slight, time of swapin 10g zero folios
with shmem (test for 12 times):
Before:  2.47s
After:   2.48s

[[email protected]: add comment]
  Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 1dd44c0 ("mm: shmem: skip swapcache for swapin of synchronous swap device")
Signed-off-by: Kairui Song <[email protected]>
Reviewed-by: Barry Song <[email protected]>
Acked-by: Nhat Pham <[email protected]>
Reviewed-by: Baolin Wang <[email protected]>
Cc: Baoquan He <[email protected]>
Cc: Chris Li <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Kemeng Shi <[email protected]>
Cc: Usama Arif <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>