[LTS 8.8] net: sched: sch_qfq: Fix UAF in qfq_dequeue() #118

pvts-mat · 2025-02-12T02:56:31Z

CVE-2023-4921
VULN-6730

Problem

https://www.cve.org/CVERecord?id=CVE-2023-4921

A use-after-free vulnerability in the Linux kernel's net/sched: sch_qfq component can be exploited to achieve local privilege escalation. When the plug qdisc is used as a class of the qfq qdisc, sending network packets triggers a use-after-free in qfq_dequeue() due to the incorrect .peek handler of sch_plug and the lack of error checking in agg_dequeue().

Solution

A single commit was identified as a fix for this issue: 8fc134fee27f2263988ae38920bc03da416b03d8
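
The mechanism described under Problem can be illustrated with a toy model. The sketch below is Python, not kernel code: `ToyPlug`, `parent_dequeue` and the string results are hypothetical names made up for illustration. Only the general idea is taken from the CVE text and the fix commit's message: a plugged leaf may report a packet on peek yet hand back nothing on dequeue, and the fix both makes peek go through a real dequeue and checks the dequeue result in the parent.

```python
from collections import deque


class ToyPlug:
    """Toy stand-in for a plugged (non-work-conserving) leaf queue."""

    def __init__(self):
        self.queue = deque()
        self.stash = None     # packet held back by peek_dequeued()
        self.plugged = True   # while plugged, dequeue() releases nothing

    def enqueue(self, pkt):
        self.queue.append(pkt)

    def dequeue(self):
        if self.plugged or not self.queue:
            return None
        return self.queue.popleft()

    def peek_head(self):
        # Old-style peek: reports the head even though dequeue() would refuse.
        return self.queue[0] if self.queue else None

    def peek_dequeued(self):
        # Fixed-style peek: really dequeue and stash the result, so that
        # peek and the following dequeue can never disagree.
        if self.stash is None:
            self.stash = self.dequeue()
        return self.stash

    def dequeue_peeked(self):
        # Hand out the stashed packet if there is one, else dequeue normally.
        pkt, self.stash = self.stash, None
        return pkt if pkt is not None else self.dequeue()


def parent_dequeue(leaf, peek, check_result):
    """Toy parent: peek first, then dequeue, as a classful qdisc would."""
    if peek(leaf) is None:
        return "idle"            # leaf has nothing to send right now
    pkt = leaf.dequeue_peeked()
    if pkt is None:
        if check_result:
            return "idle"        # checked parent backs out cleanly
        # An unchecked parent would carry on as if it had received a packet.
        return "inconsistent"
    return pkt
```

With one packet queued in a plugged `ToyPlug`, `parent_dequeue(leaf, ToyPlug.peek_head, check_result=False)` returns `"inconsistent"`, while either remedy on its own yields `"idle"`. In the real kernel the inconsistent state is what ends in the use-after-free; the toy model only shows where peek and dequeue diverge.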

kABI check: passed

python3 /mnt/code/kernel-dist-git/SOURCES/check-kabi \
        -k /mnt/code/kernel-dist-git/SOURCES/Module.kabi_$(uname -m) \
        -s /mnt/build_files/kernel-src-tree-ciqlts8_8-CVE-2023-4921/Module.symvers; echo $?

0

kernel-dist-git state:

Switched to branch 'el-8.8'
Your branch is up to date with 'origin/el-8.8'.

Boot test: passed

See "Specific tests" below: running those tests on the patched kernel implies that it boots.

Kselftests: passed relative

Methodology

A mix of kernel-selftests-internal and source-compiled tests were used:

kernel-selftests-internal: bpf tests, except:
- bpf:test_kmod.sh: takes a very long time to finish and always fails anyway,
- bpf:test_progs: unstable, can crash the machine,
- bpf:test_progs-no_alu32: unstable, can crash the machine.
source-compiled: all the rest.

Coverage (including tests skipped during execution)

android, bpf, breakpoints, capabilities, cgroup, core, cpu-hotplug, cpufreq, drivers/net/bonding, drivers/net/team, efivarfs, exec, filesystems, firmware, fpu, ftrace, futex, gpio, intel_pstate, ipc, kcmp, kvm, lib, livepatch, membarrier, memfd, memory-hotplug, mount, mqueue, net, net/forwarding, net/mptcp, netfilter, nsfs, proc, pstore, ptrace, rseq, rtc, sgx, sigaltstack, size, splice, static_keys, sync, sysctl, tc-testing, tdx, timens, timers, tpm2, user, vm, x86, zram

Reference `ciqlts8_8` (`683666ad1a6d7754125126d580f2994b4e35b3cd`)

Four test runs were conducted on the reference kernel.
kselftests--mixed--ciqlts8_8--run1.log
kselftests--mixed--ciqlts8_8--run2.log
kselftests--mixed--ciqlts8_8--run3.log
kselftests--mixed--ciqlts8_8--run4.log

Patch

A single test run was conducted on the patched kernel.
kselftests--mixed--ciqlts8_8-CVE-2023-4921.log

Comparison

Overview of the results, reduced to the differences:

ktests.xsh  table --where "Summary = 'diff'" kselftests*.log

Column    File
--------  ----------------------------------------------
Status0   kselftests--mixed--ciqlts8_8--run1.log
Status1   kselftests--mixed--ciqlts8_8--run2.log
Status2   kselftests--mixed--ciqlts8_8--run3.log
Status3   kselftests--mixed--ciqlts8_8--run4.log
Status4   kselftests--mixed--ciqlts8_8-CVE-2023-4921.log

TestCase                   Status0  Status1  Status2  Status3  Status4  Summary
bpf:test_xdp_veth.sh       skip     skip     pass     skip     skip     diff
net/mptcp:simult_flows.sh  pass     fail     pass     pass     pass     diff
net:gro.sh                 pass     pass     fail     pass     pass     diff
net:xfrm_policy.sh         fail     fail     pass     fail     fail     diff

No differences in results occurred that weren't already present in the reference test set.
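
One way to read the "passed relative" criterion is that a patched-kernel status only counts as a difference if no reference run ever produced it. The sketch below encodes that reading in Python. It is an assumption about the criterion, not the logic of `ktests.xsh`; the name `unseen_outcomes` is hypothetical, and the statuses are copied from the table above.

```python
# Statuses copied from the comparison table:
# four reference runs (Status0..Status3), then the patched run (Status4).
RESULTS = {
    "bpf:test_xdp_veth.sh":      (["skip", "skip", "pass", "skip"], "skip"),
    "net/mptcp:simult_flows.sh": (["pass", "fail", "pass", "pass"], "pass"),
    "net:gro.sh":                (["pass", "pass", "fail", "pass"], "pass"),
    "net:xfrm_policy.sh":        (["fail", "fail", "pass", "fail"], "fail"),
}


def unseen_outcomes(results):
    """Return tests whose patched status never occurred in any reference run."""
    return sorted(
        name
        for name, (reference, patched) in results.items()
        if patched not in reference
    )
```

For the table above `unseen_outcomes(RESULTS)` is an empty list. A test that passed in every reference run but failed on the patched kernel would be reported.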

New unreliable tests were identified:

bpf:test_xdp_veth.sh: For the rpm selftests package. The test may be run and pass

# selftests: bpf: test_xdp_veth.sh
# PING 10.1.1.33 (10.1.1.33) 56(84) bytes of data.
# 64 bytes from 10.1.1.33: icmp_seq=1 ttl=64 time=0.559 ms
# 
# --- 10.1.1.33 ping statistics ---
# 1 packets transmitted, 1 received, 0% packet loss, time 0ms
# rtt min/avg/max/mdev = 0.559/0.559/0.559/0.000 ms
# selftests: xdp_veth [PASS]
ok 19 selftests: bpf: test_xdp_veth.sh

or it may be skipped

# selftests: bpf: test_xdp_veth.sh
# Cannot create namespace file "/var/run/netns/ns3": File exists
# selftests: xdp_veth [SKIP]
ok 19 selftests: bpf: test_xdp_veth.sh # SKIP

This seems to be related to the net/mptcp:mptcp_connect.sh test, which sometimes creates the ns3 namespace

# selftests: net/mptcp: mptcp_connect.sh
# INFO: set ns3-67a772bd-2iZfQl dev ns3eth2: ethtool -K  gso off gro off
# INFO: set ns4-67a772bd-2iZfQl dev ns4eth3: ethtool -K  gso off gro off
# Created /tmp/tmp.8ngPQFHWq5 (size 7183388	/tmp/tmp.8ngPQFHWq5) containing data sent by client
# Created /tmp/tmp.aNLEp7GmM5 (size 2554908	/tmp/tmp.aNLEp7GmM5) containing data sent by server
...

and sometimes doesn't

# selftests: net/mptcp: mptcp_connect.sh
# INFO: set ns4-67a77e0c-d8qLil dev ns4eth3: ethtool -K tso off gso off
# Created /tmp/tmp.T3DnUSNyRd (size 4120604	/tmp/tmp.T3DnUSNyRd) containing data sent by client
# Created /tmp/tmp.mveJmQppTp (size 435228	/tmp/tmp.mveJmQppTp) containing data sent by server
...

net/mptcp:mptcp_connect.sh: For the source-compiled test set. See above.
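
The skip shown above is triggered by an already existing `/var/run/netns/ns3` entry, so a run could be preceded by a check for leftover entries. The following is a minimal Python sketch under that assumption. The helper name `stale_namespaces` is hypothetical, and the only namespace name taken from the logs is `ns3`.

```python
import os


def stale_namespaces(names, netns_dir="/var/run/netns"):
    """Return the given names that already exist as entries under netns_dir."""
    return [
        name
        for name in names
        if os.path.lexists(os.path.join(netns_dir, name))
    ]
```

Calling `stale_namespaces(["ns3"])` before the bpf tests would show whether bpf:test_xdp_veth.sh is likely to be skipped for this reason.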

Specific tests: passed

Bug replication

The bug can be replicated with the following commands, as mentioned in the commit message:

tc qdisc add dev lo root handle 1: qfq
tc class add dev lo parent 1: classid 1:1 qfq weight 1 maxpkt 512
tc qdisc add dev lo parent 1:1 handle 2: plug
tc filter add dev lo parent 1: basic classid 1:1
ping -c1 127.0.0.1

The tests were performed on the reference and patched kernels.

Prerequisites

The tc commands above require the following kernel options to be enabled: CONFIG_NET_SCHED, CONFIG_NET_SCH_INGRESS, CONFIG_NET_SCH_QFQ, CONFIG_NET_SCH_PLUG, CONFIG_NET_CLS_BASIC.

All of them are enabled by default in the configs/kernel-4.18.0-x86_64.config configuration file for the tested x86_64 platform.

CONFIG_NET_SCHED=y
CONFIG_NET_SCH_INGRESS=m
CONFIG_NET_SCH_QFQ=m
CONFIG_NET_SCH_PLUG=m
CONFIG_NET_CLS_BASIC=m
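
The same check can be scripted for other configuration files. The sketch below is illustrative Python with a hypothetical helper name, `missing_options`. It treats an option as enabled when it is set to `y` or `m`, and it takes the contents of a config file as a string.

```python
import re

# Options listed under Prerequisites.
REQUIRED = [
    "CONFIG_NET_SCHED",
    "CONFIG_NET_SCH_INGRESS",
    "CONFIG_NET_SCH_QFQ",
    "CONFIG_NET_SCH_PLUG",
    "CONFIG_NET_CLS_BASIC",
]


def missing_options(config_text, required=REQUIRED):
    """Return the required options that are not set to 'y' or 'm'."""
    enabled = set(
        re.findall(r"^(CONFIG_\w+)=[ym]$", config_text, flags=re.MULTILINE)
    )
    return [opt for opt in required if opt not in enabled]
```

For the five lines quoted above the result is an empty list. A line such as `# CONFIG_NET_SCH_PLUG is not set` would cause that option to be reported as missing.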

Reference `ciqlts8_8` (`683666ad1a6d7754125126d580f2994b4e35b3cd`)

Bug replicated successfully. Kernel crashed and machine automatically rebooted.

[root@ciqlts8_8 pvts]# tc qdisc add dev lo root handle 1: qfq
tc qdisc add dev lo root handle 1: qfq
[root@ciqlts8_8 pvts]# tc class add dev lo parent 1: classid 1:1 qfq weight 1 maxpkt 512
xpkt 512
[root@ciqlts8_8 pvts]# tc qdisc add dev lo parent 1:1 handle 2: plug
tc qdisc add dev lo parent 1:1 handle 2: plug
[root@ciqlts8_8 pvts]# tc filter add dev lo parent 1: basic classid 1:1
tc filter add dev lo parent 1: basic classid 1:1
[root@ciqlts8_8 pvts]# ping -c1 127.0.0.1
ping -c1 127.0.0.1
PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
[   30.846825] ------------[ cut here ]------------
[   30.849050] kernel BUG at mm/slub.c:376!
[   30.850468] invalid opcode: 0000 [#1] SMP PTI
[   30.851962] CPU: 2 PID: 1553 Comm: ping Kdump: loaded Not tainted 4.18.0-ciqlts8_8 #1
[   30.854641] Hardware name: Red Hat KVM/RHEL, BIOS 1.16.3-2.el9_5.1 04/01/2014
[   30.857114] RIP: 0010:set_freepointer.part.57+0x0/0x10
[   30.858983] Code: 83 ef 70 e9 22 6a fa ff 66 90 0f 1f 44 00 00 41 54 55 53 48 8b 06 48 85 c0 0f 85 56 86 00 00 5b 5d 41 5c e9 e2 ce ad 00 66 90 <0f> 0b 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 41 57 41 56 41 55
[   30.863746] RSP: 0018:ffffbbd94326cdb0 EFLAGS: 00010246
[   30.865158] RAX: ffffa08780f31200 RBX: ffffa087a87d8800 RCX: ffffa08780f31300
[   30.866906] RDX: 000000000000063e RSI: 0000000000000000 RDI: ffffa08740005180
[   30.868644] RBP: ffffe8164503cc00 R08: 0000000080000000 R09: ffffdbd93fd1b880
[   30.870065] R10: 0000000000000228 R11: 0000000000000228 R12: ffffa08740005180
[   30.871528] R13: ffffa08780f31200 R14: ffffffffb8e11fd5 R15: 0000000000000001
[   30.872931] FS:  00007f3bdf377480(0000) GS:ffffa08e9f900000(0000) knlGS:0000000000000000
[   30.874482] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   30.875579] CR2: 00007ffdc419d000 CR3: 000000011ac66003 CR4: 0000000000370ee0
[   30.876865] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   30.878090] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   30.879282] Call Trace:
[   30.879718]  <IRQ>
[   30.880056]  kfree+0x238/0x250
[   30.880583]  ? __netif_receive_skb_core+0x145/0xd10
[   30.881393]  kfree_skb_reason+0x45/0x110
[   30.882049]  __netif_receive_skb_core+0x145/0xd10
[   30.882843]  ? loopback_xmit+0xd8/0x130
[   30.883586]  process_backlog+0xaa/0x170
[   30.884236]  __napi_poll+0x2d/0x130
[   30.884829]  net_rx_action+0x252/0x320
[   30.885459]  __do_softirq+0xdc/0x2cf
[   30.886069]  do_softirq_own_stack+0x2a/0x40
[   30.886754]  </IRQ>
[   30.887102]  do_softirq.part.16+0x45/0x50
[   30.887769]  __local_bh_enable_ip+0x4f/0x60
[   30.888459]  ip_finish_output2+0x1a6/0x430
[   30.889140]  ip_output+0x70/0xf0
[   30.889676]  ? __ip_finish_output+0x1d0/0x1d0
[   30.890389]  ip_send_skb+0x15/0x40
[   30.890977]  ping_v4_sendmsg+0x5bf/0x780
[   30.891710]  ? sock_has_perm+0x80/0xa0
[   30.892404]  ? release_sock+0x43/0x90
[   30.893023]  ? sock_sendmsg+0x42/0x60
[   30.893641]  sock_sendmsg+0x42/0x60
[   30.894208]  __sys_sendto+0xee/0x160
[   30.894795]  ? syscall_trace_enter+0x1ff/0x2d0
[   30.895521]  ? ksys_ioctl+0x64/0xa0
[   30.896101]  __x64_sys_sendto+0x24/0x30
[   30.896740]  do_syscall_64+0x5b/0x1b0
[   30.897340]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[   30.898145] RIP: 0033:0x7f3bde014b4b
[   30.898724] Code: 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 f3 0f 1e fa 48 8d 05 f5 4b 29 00 41 89 ca 8b 00 85 c0 75 14 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 75 c3 0f 1f 40 00 41 57 4d 89 c7 41 56 41 89
[   30.901652] RSP: 002b:00007ffdc4096a68 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[   30.902842] RAX: ffffffffffffffda RBX: 0000559edafb6700 RCX: 00007f3bde014b4b
[   30.903967] RDX: 0000000000000040 RSI: 0000559edafb6700 RDI: 0000000000000003
[   30.905093] RBP: 0000000000000040 R08: 0000559edafb3500 R09: 0000000000000010
[   30.906224] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffdc4098190
[   30.907339] R13: 00007ffdc4096a70 R14: 00007ffdc4096b60 R15: 0000559edafb20a0
[   30.908450] Modules linked in: cls_basic sch_plug sch_qfq vfat fat intel_rapl_msr intel_rapl_common intel_uncore_frequency_common isst_if_common nfit libnvdimm kvm_intel iTCO_wdt iTCO_vendor_support kvm virtio_gpu irqbypass drm_shmem_helper drm_kms_helper rapl syscopyarea sysfillrect sysimgblt fb_sys_fops joydev pcspkr drm virtio_balloon i2c_i801 lpc_ich xfs libcrc32c sr_mod cdrom sg crct10dif_pclmul crc32_pclmul crc32c_intel virtio_net ahci ghash_clmulni_intel libahci serio_raw libata net_failover virtio_blk virtio_console failover virtiofs sunrpc dm_mirror dm_region_hash dm_log dm_mod fuse
[    0.000000] Linux version 4.18.0-ciqlts8_8 (pvts@ciqlts8_8) (gcc version 8.5.0 20210514 (Red Hat 8.5.0-18) (GCC)) #1 SMP Sun Feb 2 16:43:15 UTC 2025
[    0.000000] Command line: elfcorehdr=0x6f000000 BOOT_IMAGE=(hd0,gpt2)/vmlinuz-4.18.0-ciqlts8_8 ro console=ttyS0,115200n8 no_timer_check net.ifnames=0 irqpoll nr_cpus=1 reset_devices cgroup_disable=memory mce=off numa=off udev.children-max=2 panic=10 rootflags=nofail acpi_no_memhotplug transparent_hugepage=never nokaslr novmcoredd hest_disable disable_cpu_apicid=0 iTCO_wdt.pretimeout=0
...

Full log:
bug-replication--ciqlts8_8.log

Patch

Patch efficacy verified successfully. Repeating the steps doesn't result in a crash and the machine doesn't reboot.

Full log:
bug-replication--ciqlts8_8-CVE-2023-4921.log

A warning can be observed

qfq_dequeue: non-workconserving leaf

issued in sch_qfq.c. A work-conserving queue discipline is a qdisc that never leaves the outbound interface idle unless the queue is empty, while non-work-conserving qdiscs may delay packets, for example to shape the traffic¹. The plug qdisc, being the leaf of the qdisc hierarchy used here, is non-work-conserving, as it can suspend the packet flow. Several types of qdiscs are apparently not designed to work with non-work-conserving child qdiscs and issue similar warnings on the skb == NULL condition: ets, hfsc, htb, drr. See also the message in b00355db3f88d96810a60011a30cfb2c3469409d:

Patrick McHardy <[email protected]> suggested:
> How about making this flag and the warning message (in a out-of-line
> function) globally available? Other qdiscs (f.i. HFSC) can't deal with
> inner non-work-conserving qdiscs as well.

and the message in 6d25d1dc76bf5943a5c1f4bb74d66d5eac58eb77, which adds qfq to the group above:

A helper function for printing non-work-conserving alarms is added in
commit b00355db3f88 ("pkt_sched: sch_hfsc: sch_htb: Add non-work-conserving
 warning handler."). In this commit, use qdisc_warn_nonwc() instead of
WARN_ONCE() to handle the non-work-conserving warning in qfq Qdisc.

In short, the warning stems from the qdisc hierarchy defined in the bug replication script, not from the changes introduced by the patch.

Footnotes

¹ https://tldp.org/HOWTO/Adv-Routing-HOWTO/lartc.qdisc.terminology.html

jira VULN-6730
cve CVE-2023-4921
commit-author valis <[email protected]>
commit 8fc134f

When the plug qdisc is used as a class of the qfq qdisc it could trigger a UAF. This issue can be reproduced with following commands:

tc qdisc add dev lo root handle 1: qfq
tc class add dev lo parent 1: classid 1:1 qfq weight 1 maxpkt 512
tc qdisc add dev lo parent 1:1 handle 2: plug
tc filter add dev lo parent 1: basic classid 1:1
ping -c1 127.0.0.1

and boom:

[ 285.353793] BUG: KASAN: slab-use-after-free in qfq_dequeue+0xa7/0x7f0
[ 285.354910] Read of size 4 at addr ffff8880bad312a8 by task ping/144
[ 285.355903]
[ 285.356165] CPU: 1 PID: 144 Comm: ping Not tainted 6.5.0-rc3+ #4
[ 285.357112] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
[ 285.358376] Call Trace:
[ 285.358773] <IRQ>
[ 285.359109] dump_stack_lvl+0x44/0x60
[ 285.359708] print_address_description.constprop.0+0x2c/0x3c0
[ 285.360611] kasan_report+0x10c/0x120
[ 285.361195] ? qfq_dequeue+0xa7/0x7f0
[ 285.361780] qfq_dequeue+0xa7/0x7f0
[ 285.362342] __qdisc_run+0xf1/0x970
[ 285.362903] net_tx_action+0x28e/0x460
[ 285.363502] __do_softirq+0x11b/0x3de
[ 285.364097] do_softirq.part.0+0x72/0x90
[ 285.364721] </IRQ>
[ 285.365072] <TASK>
[ 285.365422] __local_bh_enable_ip+0x77/0x90
[ 285.366079] __dev_queue_xmit+0x95f/0x1550
[ 285.366732] ? __pfx_csum_and_copy_from_iter+0x10/0x10
[ 285.367526] ? __pfx___dev_queue_xmit+0x10/0x10
[ 285.368259] ? __build_skb_around+0x129/0x190
[ 285.368960] ? ip_generic_getfrag+0x12c/0x170
[ 285.369653] ? __pfx_ip_generic_getfrag+0x10/0x10
[ 285.370390] ? csum_partial+0x8/0x20
[ 285.370961] ? raw_getfrag+0xe5/0x140
[ 285.371559] ip_finish_output2+0x539/0xa40
[ 285.372222] ? __pfx_ip_finish_output2+0x10/0x10
[ 285.372954] ip_output+0x113/0x1e0
[ 285.373512] ? __pfx_ip_output+0x10/0x10
[ 285.374130] ? icmp_out_count+0x49/0x60
[ 285.374739] ? __pfx_ip_finish_output+0x10/0x10
[ 285.375457] ip_push_pending_frames+0xf3/0x100
[ 285.376173] raw_sendmsg+0xef5/0x12d0
[ 285.376760] ? do_syscall_64+0x40/0x90
[ 285.377359] ? __static_call_text_end+0x136578/0x136578
[ 285.378173] ? do_syscall_64+0x40/0x90
[ 285.378772] ? kasan_enable_current+0x11/0x20
[ 285.379469] ? __pfx_raw_sendmsg+0x10/0x10
[ 285.380137] ? __sock_create+0x13e/0x270
[ 285.380673] ? __sys_socket+0xf3/0x180
[ 285.381174] ? __x64_sys_socket+0x3d/0x50
[ 285.381725] ? entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[ 285.382425] ? __rcu_read_unlock+0x48/0x70
[ 285.382975] ? ip4_datagram_release_cb+0xd8/0x380
[ 285.383608] ? __pfx_ip4_datagram_release_cb+0x10/0x10
[ 285.384295] ? preempt_count_sub+0x14/0xc0
[ 285.384844] ? __list_del_entry_valid+0x76/0x140
[ 285.385467] ? _raw_spin_lock_bh+0x87/0xe0
[ 285.386014] ? __pfx__raw_spin_lock_bh+0x10/0x10
[ 285.386645] ? release_sock+0xa0/0xd0
[ 285.387148] ? preempt_count_sub+0x14/0xc0
[ 285.387712] ? freeze_secondary_cpus+0x348/0x3c0
[ 285.388341] ? aa_sk_perm+0x177/0x390
[ 285.388856] ? __pfx_aa_sk_perm+0x10/0x10
[ 285.389441] ? check_stack_object+0x22/0x70
[ 285.390032] ? inet_send_prepare+0x2f/0x120
[ 285.390603] ? __pfx_inet_sendmsg+0x10/0x10
[ 285.391172] sock_sendmsg+0xcc/0xe0
[ 285.391667] __sys_sendto+0x190/0x230
[ 285.392168] ? __pfx___sys_sendto+0x10/0x10
[ 285.392727] ? kvm_clock_get_cycles+0x14/0x30
[ 285.393328] ? set_normalized_timespec64+0x57/0x70
[ 285.393980] ? _raw_spin_unlock_irq+0x1b/0x40
[ 285.394578] ? __x64_sys_clock_gettime+0x11c/0x160
[ 285.395225] ? __pfx___x64_sys_clock_gettime+0x10/0x10
[ 285.395908] ? _copy_to_user+0x3e/0x60
[ 285.396432] ? exit_to_user_mode_prepare+0x1a/0x120
[ 285.397086] ? syscall_exit_to_user_mode+0x22/0x50
[ 285.397734] ? do_syscall_64+0x71/0x90
[ 285.398258] __x64_sys_sendto+0x74/0x90
[ 285.398786] do_syscall_64+0x64/0x90
[ 285.399273] ? exit_to_user_mode_prepare+0x1a/0x120
[ 285.399949] ? syscall_exit_to_user_mode+0x22/0x50
[ 285.400605] ? do_syscall_64+0x71/0x90
[ 285.401124] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[ 285.401807] RIP: 0033:0x495726
[ 285.402233] Code: ff ff ff f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 11 b8 2c 00 00 00 0f 09
[ 285.404683] RSP: 002b:00007ffcc25fb618 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 285.405677] RAX: ffffffffffffffda RBX: 0000000000000040 RCX: 0000000000495726
[ 285.406628] RDX: 0000000000000040 RSI: 0000000002518750 RDI: 0000000000000000
[ 285.407565] RBP: 00000000005205ef R08: 00000000005f8838 R09: 000000000000001c
[ 285.408523] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000002517634
[ 285.409460] R13: 00007ffcc25fb6f0 R14: 0000000000000003 R15: 0000000000000000
[ 285.410403] </TASK>
[ 285.410704]
[ 285.410929] Allocated by task 144:
[ 285.411402] kasan_save_stack+0x1e/0x40
[ 285.411926] kasan_set_track+0x21/0x30
[ 285.412442] __kasan_slab_alloc+0x55/0x70
[ 285.412973] kmem_cache_alloc_node+0x187/0x3d0
[ 285.413567] __alloc_skb+0x1b4/0x230
[ 285.414060] __ip_append_data+0x17f7/0x1b60
[ 285.414633] ip_append_data+0x97/0xf0
[ 285.415144] raw_sendmsg+0x5a8/0x12d0
[ 285.415640] sock_sendmsg+0xcc/0xe0
[ 285.416117] __sys_sendto+0x190/0x230
[ 285.416626] __x64_sys_sendto+0x74/0x90
[ 285.417145] do_syscall_64+0x64/0x90
[ 285.417624] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[ 285.418306]
[ 285.418531] Freed by task 144:
[ 285.418960] kasan_save_stack+0x1e/0x40
[ 285.419469] kasan_set_track+0x21/0x30
[ 285.419988] kasan_save_free_info+0x27/0x40
[ 285.420556] ____kasan_slab_free+0x109/0x1a0
[ 285.421146] kmem_cache_free+0x1c2/0x450
[ 285.421680] __netif_receive_skb_core+0x2ce/0x1870
[ 285.422333] __netif_receive_skb_one_core+0x97/0x140
[ 285.423003] process_backlog+0x100/0x2f0
[ 285.423537] __napi_poll+0x5c/0x2d0
[ 285.424023] net_rx_action+0x2be/0x560
[ 285.424510] __do_softirq+0x11b/0x3de
[ 285.425034]
[ 285.425254] The buggy address belongs to the object at ffff8880bad31280
[ 285.425254] which belongs to the cache skbuff_head_cache of size 224
[ 285.426993] The buggy address is located 40 bytes inside of
[ 285.426993] freed 224-byte region [ffff8880bad31280, ffff8880bad31360)
[ 285.428572]
[ 285.428798] The buggy address belongs to the physical page:
[ 285.429540] page:00000000f4b77674 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0xbad31
[ 285.430758] flags: 0x100000000000200(slab|node=0|zone=1)
[ 285.431447] page_type: 0xffffffff()
[ 285.431934] raw: 0100000000000200 ffff88810094a8c0 dead000000000122 0000000000000000
[ 285.432757] raw: 0000000000000000 00000000800c000c 00000001ffffffff 0000000000000000
[ 285.433562] page dumped because: kasan: bad access detected
[ 285.434144]
[ 285.434320] Memory state around the buggy address:
[ 285.434828] ffff8880bad31180: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 285.435580] ffff8880bad31200: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 285.436264] >ffff8880bad31280: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 285.436777] ^
[ 285.437106] ffff8880bad31300: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
[ 285.437616] ffff8880bad31380: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 285.438126] ==================================================================
[ 285.438662] Disabling lock debugging due to kernel taint

Fix this by:
1. Changing sch_plug's .peek handler to qdisc_peek_dequeued(), a function compatible with non-work-conserving qdiscs
2. Checking the return value of qdisc_dequeue_peeked() in sch_qfq.

Fixes: 462dbc9 ("pkt_sched: QFQ Plus: fair-queueing service at DRR cost")
Reported-by: valis <[email protected]>
Signed-off-by: valis <[email protected]>
Signed-off-by: Jamal Hadi Salim <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
(cherry picked from commit 8fc134f)
Signed-off-by: Marcin Wcisło <[email protected]>

PlaidCat

Starting github runners too

gvrose8192

LGTM - Thanks!

bmastbergen

🥌

pvts-mat marked this pull request as draft February 12, 2025 03:01

pvts-mat marked this pull request as ready for review February 12, 2025 18:18

PlaidCat requested review from PlaidCat, bmastbergen and gvrose8192 February 12, 2025 20:41

PlaidCat approved these changes Feb 12, 2025

gvrose8192 approved these changes Feb 12, 2025

bmastbergen approved these changes Feb 13, 2025

PlaidCat merged commit 9b4e8bb into ctrliq:ciqlts8_8 Feb 13, 2025
2 checks passed
