-
Notifications
You must be signed in to change notification settings - Fork 0
Improve write performance for zoned UFS devices #20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
* io_uring-6.16: io_uring/zcrx: fix pp destruction warnings
* block-6.16: block: reject bs > ps block devices when THP is disabled nbd: fix uaf in nbd_genl_connect() error path md/md-bitmap: fix GPF in bitmap_get_stats() md/raid1,raid10: strip REQ_NOWAIT from member bios raid10: cleanup memleak at raid10_make_request md/raid1: Fix stack memory use after return in raid1_reshape
* for-6.17/io_uring: io_uring/rw: cast rw->flags assignment to rwf_t io_uring: don't use int for ABI io_uring/rsrc: skip atomic refcount for uncloned buffers io_uring/mock: add trivial poll handler io_uring/mock: support for async read/write io_uring/mock: allow to choose FMODE_NOWAIT io_uring/mock: add sync read/write io_uring/mock: add cmd using vectored regbufs io_uring/mock: add basic infra for test mock files io_uring: remove errant ';' from IORING_CQE_F_TSTAMP_HW definition io_uring/netcmd: add tx timestamping cmd support io_uring: add mshot helper for posting CQE32 io_uring/cmd: allow multishot polled commands io_uring/poll: introduce io_arm_apoll() io_uring/nop: add IORING_NOP_TW completion flag io_uring/uring_cmd: implement ->sqe_copy() to avoid unnecessary copies io_uring/uring_cmd: get rid of io_uring_cmd_prep_setup() io_uring: add struct io_cold_def->sqe_copy() method io_uring: add IO_URING_F_INLINE issue flag net: timestamp: add helper returning skb's tx tstamp
* for-6.17/block: (38 commits)
block: remove pktcdvd driver
ublk: introduce and use ublk_set_canceling helper
ublk: speed up ublk server exit handling
zram: pass buffer offset to zcomp_available_show()
block: zram: replace scnprintf() with sysfs_emit() in *_show() functions
bcache: switch from pages to folios in read_super()
virtio: blk/scsi: use block layer helpers to calculate num of queues
scsi: use block layer helpers to calculate num of queues
nvme-pci: use block layer helpers to calculate num of queues
blk-mq: add number of queue calc helper
lib/group_cpus: Let group_cpu_evenly() return the number of initialized masks
ublk: cache-align struct ublk_io
ublk: remove ubq checks from ublk_{get,put}_req_ref()
ublk: optimize UBLK_IO_UNREGISTER_IO_BUF on daemon task
ublk: optimize UBLK_IO_REGISTER_IO_BUF on daemon task
ublk: return early if blk_should_fake_timeout()
ublk: allow UBLK_IO_(UN)REGISTER_IO_BUF on any task
ublk: don't take ublk_queue in ublk_unregister_io_buf()
ublk: consolidate UBLK_IO_FLAG_{ACTIVE,OWNED_BY_SRV} checks
ublk: remove task variable from __ublk_ch_uring_cmd()
...
* for-6.17/block: nvme-pci: fix dma unmapping when using PRPs and not using the IOVA mapping
* for-6.17/block: Documentation: remove reference to pktcdvd in cdrom documentation
* io_uring-6.16: Revert "io_uring: gate REQ_F_ISREG on !S_ANON_INODE as well" io_uring/msg_ring: ensure io_kiocb freeing is deferred for RCU
* for-6.17/block: block: mtip32xx: Fix usage of dma_map_sg()
* for-6.17/block: drbd: add missing kref_get in handle_write_conflicts
* for-6.17/io_uring: io_uring/zcrx: prepare fallback for larger pages io_uring/zcrx: assert area type in io_zcrx_iov_page io_uring/zcrx: allocate sgtable for umem areas io_uring/zcrx: introduce io_populate_area_dma io_uring/zcrx: return error from io_zcrx_map_area_* io_uring/zcrx: always pass page to io_zcrx_copy_chunk
* for-6.17/block: nbd: fix lockdep deadlock warning
|
Upstream branch: f4ca523 |
* for-6.17/io_uring: io_uring/net: allow multishot receive per-invocation cap io_uring/net: move io_sr_msg->retry_flags to io_sr_msg->flags io_uring/net: use passed in 'len' in io_recv_buf_select()
d74f66f to
c20febf
Compare
|
Upstream branch: 232ed3c |
bf2160b to
5a71413
Compare
c20febf to
8784bb4
Compare
|
Upstream branch: 232ed3c |
5a71413 to
8d22dad
Compare
8784bb4 to
059d7e3
Compare
|
Upstream branch: 232ed3c |
8d22dad to
d2f9e58
Compare
* for-6.17/block: nvme-pci: don't allocate dma_vec for IOVA mappings
059d7e3 to
4a2df30
Compare
|
Upstream branch: d7d3914 |
d2f9e58 to
8b808ad
Compare
4a2df30 to
c69d3d4
Compare
c69d3d4 to
9a7a50e
Compare
Some storage controllers preserve the request order per hardware queue. Some but not all device mapper drivers preserve the bio order. Introduce the request queue limit member variable 'driver_preserves_write_order' to allow block drivers and device mapper drivers to indicate that the order of write commands is preserved per hardware queue and hence that serialization of writes per zone is not required if all pending writes are submitted to the same hardware queue. Cc: Damien Le Moal <dlemoal@kernel.org> Cc: Hannes Reinecke <hare@suse.de> Cc: Nitesh Shetty <nj.shetty@samsung.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Zoned writes may be requeued. This happens if a block driver returns BLK_STS_RESOURCE, to handle SCSI unit attentions or by the SCSI error handler after error handling has finished. A later patch enables write pipelining and increases the number of pending writes per zone. If multiple writes are pending per zone, write requests may be requeued in another order than submitted. Restore the request order if requests are requeued. Add RQF_DONTPREP to RQF_NOMERGE_FLAGS because this patch may cause RQF_DONTPREP requests to be sent to the code that checks whether a request can be merged and RQF_DONTPREP requests must not be merged. Cc: Christoph Hellwig <hch@lst.de> Cc: Damien Le Moal <dlemoal@kernel.org> Cc: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Software that submits zoned writes, e.g. a filesystem, may submit zoned writes from multiple CPUs as long as the zoned writes are serialized per zone. Submitting bios from different CPUs may cause bio reordering if e.g. different bios reach the storage device through different queues. Prepare for preserving the order of pipelined zoned writes per zone by adding the 'rq_cpu` argument to blk_zone_plug_bio(). This argument tells blk_zone_plug_bio() from which CPU a cached request has been allocated. The cached request will only be used if it matches the CPU from which zoned writes are being submitted for the zone associated with the bio. Cc: Christoph Hellwig <hch@lst.de> Cc: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Prepare for moving code from disk_zone_wplug_add_bio() into its caller. Split an if-statement and also the comment above that if-statement. No functionality has been changed. Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Move the following code into the only caller of disk_zone_wplug_add_bio(): - bio->bi_opf &= ~REQ_NOWAIT - wplug->flags |= BLK_ZONE_WPLUG_PLUGGED - The disk_zone_wplug_schedule_bio_work() call. No functionality has been changed. This patch prepares for zoned write pipelining by removing the code from disk_zone_wplug_add_bio() that does not apply to all zoned write pipelining bio processing cases. Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Prepare for submitting multiple bios from inside a single blk_zone_wplug_bio_work() call. No functionality has been changed. Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Support pipelining of zoned writes if the block driver preserves the write order per hardware queue. Track per zone to which software queue writes have been queued. If zoned writes are pipelined, submit new writes to the same software queue as the writes that are already in progress. This prevents reordering by submitting requests for the same zone to different software or hardware queues. Cc: Christoph Hellwig <hch@lst.de> Cc: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Support configuring the preserves_write_order in the null_blk driver to make it easier to test support for this attribute. Cc: Christoph Hellwig <hch@lst.de> Cc: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
If zoned writes (REQ_OP_WRITE) for a sequential write required zone have a starting LBA that differs from the write pointer, e.g. because a prior write triggered a unit attention condition, then the storage device will respond with an UNALIGNED WRITE COMMAND error. Retry commands that failed with an unaligned write error. Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Cc: Martin K. Petersen <martin.petersen@oracle.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
If the write order is preserved, increase the number of retries for write commands sent to a sequential zone to the maximum number of outstanding commands because in the worst case the number of times reordered zoned writes have to be retried is (number of outstanding writes per sequential zone) - 1. Cc: Damien Le Moal <dlemoal@kernel.org> Cc: Martin K. Petersen <martin.petersen@oracle.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Zone write locking is not used for zoned devices if the block driver reports that it preserves the order of write commands. Make it easier to test not using zone write locking by adding support for setting the driver_preserves_write_order flag. Acked-by: Douglas Gilbert <dgilbert@interlog.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Cc: Martin K. Petersen <martin.petersen@oracle.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Allow user space software, e.g. a blktests test, to inject unaligned write errors. Acked-by: Douglas Gilbert <dgilbert@interlog.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Cc: Martin K. Petersen <martin.petersen@oracle.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
From the UFSHCI 4.0 specification, about the MCQ mode: "Command Submission 1. Host SW writes an Entry to SQ 2. Host SW updates SQ doorbell tail pointer Command Processing 3. After fetching the Entry, Host Controller updates SQ doorbell head pointer 4. Host controller sends COMMAND UPIU to UFS device" In other words, in MCQ mode, UFS controllers are required to forward commands to the UFS device in the order these commands have been received from the host. This patch improves performance as follows on a test setup with UFSHCI 4.0 controller: - When not using an I/O scheduler: 2.3x more IOPS for small writes. - With the mq-deadline scheduler: 2.0x more IOPS for small writes. Reviewed-by: Avri Altman <avri.altman@wdc.com> Cc: Bao D. Nguyen <quic_nguyenb@quicinc.com> Cc: Can Guo <quic_cang@quicinc.com> Cc: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
|
Upstream branch: 5d41d21 |
4480d90 to
39deece
Compare
Mitigate e.g. the following:
# echo 1e789080.lpc-snoop > /sys/bus/platform/drivers/aspeed-lpc-snoop/unbind
...
[ 120.363594] Unable to handle kernel NULL pointer dereference at virtual address 00000004 when write
[ 120.373866] [00000004] *pgd=00000000
[ 120.377910] Internal error: Oops: 805 [#1] SMP ARM
[ 120.383306] CPU: 1 UID: 0 PID: 315 Comm: sh Not tainted 6.15.0-rc1-00009-g926217bc7d7d-dirty #20 NONE
...
[ 120.679543] Call trace:
[ 120.679559] misc_deregister from aspeed_lpc_snoop_remove+0x84/0xac
[ 120.692462] aspeed_lpc_snoop_remove from platform_remove+0x28/0x38
[ 120.700996] platform_remove from device_release_driver_internal+0x188/0x200
...
Fixes: 9f4f9ae ("drivers/misc: add Aspeed LPC snoop driver")
Cc: stable@vger.kernel.org
Cc: Jean Delvare <jdelvare@suse.de>
Acked-by: Jean Delvare <jdelvare@suse.de>
Link: https://patch.msgid.link/20250616-aspeed-lpc-snoop-fixes-v2-2-3cdd59c934d3@codeconstruct.com.au
Signed-off-by: Andrew Jeffery <andrew@codeconstruct.com.au>
page_pool_put_full_page() should only be invoked when freeing Rx buffers or building a skb if the size is too short. At other times, the pages need to be reused. So remove the redundant page put. In the original code, double free pages cause kernel panic: [ 876.949834] __irq_exit_rcu+0xc7/0x130 [ 876.949836] common_interrupt+0xb8/0xd0 [ 876.949838] </IRQ> [ 876.949838] <TASK> [ 876.949840] asm_common_interrupt+0x22/0x40 [ 876.949841] RIP: 0010:cpuidle_enter_state+0xc2/0x420 [ 876.949843] Code: 00 00 e8 d1 1d 5e ff e8 ac f0 ff ff 49 89 c5 0f 1f 44 00 00 31 ff e8 cd fc 5c ff 45 84 ff 0f 85 40 02 00 00 fb 0f 1f 44 00 00 <45> 85 f6 0f 88 84 01 00 00 49 63 d6 48 8d 04 52 48 8d 04 82 49 8d [ 876.949844] RSP: 0018:ffffaa7340267e78 EFLAGS: 00000246 [ 876.949845] RAX: ffff9e3f135be000 RBX: 0000000000000002 RCX: 0000000000000000 [ 876.949846] RDX: 000000cc2dc4cb7c RSI: ffffffff89ee49ae RDI: ffffffff89ef9f9e [ 876.949847] RBP: ffff9e378f940800 R08: 0000000000000002 R09: 00000000000000ed [ 876.949848] R10: 000000000000afc8 R11: ffff9e3e9e5a9b6c R12: ffffffff8a6d8580 [ 876.949849] R13: 000000cc2dc4cb7c R14: 0000000000000002 R15: 0000000000000000 [ 876.949852] ? cpuidle_enter_state+0xb3/0x420 [ 876.949855] cpuidle_enter+0x29/0x40 [ 876.949857] cpuidle_idle_call+0xfd/0x170 [ 876.949859] do_idle+0x7a/0xc0 [ 876.949861] cpu_startup_entry+0x25/0x30 [ 876.949862] start_secondary+0x117/0x140 [ 876.949864] common_startup_64+0x13e/0x148 [ 876.949867] </TASK> [ 876.949868] ---[ end trace 0000000000000000 ]--- [ 876.949869] ------------[ cut here ]------------ [ 876.949870] list_del corruption, ffffead40445a348->next is NULL [ 876.949873] WARNING: CPU: 14 PID: 0 at lib/list_debug.c:52 __list_del_entry_valid_or_report+0x67/0x120 [ 876.949875] Modules linked in: snd_hrtimer(E) bnep(E) binfmt_misc(E) amdgpu(E) squashfs(E) vfat(E) loop(E) fat(E) amd_atl(E) snd_hda_codec_realtek(E) intel_rapl_msr(E) snd_hda_codec_generic(E) intel_rapl_common(E) snd_hda_scodec_component(E) snd_hda_codec_hdmi(E) snd_hda_intel(E) edac_mce_amd(E) snd_intel_dspcfg(E) snd_hda_codec(E) snd_hda_core(E) amdxcp(E) kvm_amd(E) snd_hwdep(E) gpu_sched(E) drm_panel_backlight_quirks(E) cec(E) snd_pcm(E) drm_buddy(E) snd_seq_dummy(E) drm_ttm_helper(E) btusb(E) kvm(E) snd_seq_oss(E) btrtl(E) ttm(E) btintel(E) snd_seq_midi(E) btbcm(E) drm_exec(E) snd_seq_midi_event(E) i2c_algo_bit(E) snd_rawmidi(E) bluetooth(E) drm_suballoc_helper(E) irqbypass(E) snd_seq(E) ghash_clmulni_intel(E) sha512_ssse3(E) drm_display_helper(E) aesni_intel(E) snd_seq_device(E) rfkill(E) snd_timer(E) gf128mul(E) drm_client_lib(E) drm_kms_helper(E) snd(E) i2c_piix4(E) joydev(E) soundcore(E) wmi_bmof(E) ccp(E) k10temp(E) i2c_smbus(E) gpio_amdpt(E) i2c_designware_platform(E) gpio_generic(E) sg(E) [ 876.949914] i2c_designware_core(E) sch_fq_codel(E) parport_pc(E) drm(E) ppdev(E) lp(E) parport(E) fuse(E) nfnetlink(E) ip_tables(E) ext4 crc16 mbcache jbd2 sd_mod sfp mdio_i2c i2c_core txgbe ahci ngbe pcs_xpcs libahci libwx r8169 phylink libata realtek ptp pps_core video wmi [ 876.949933] CPU: 14 UID: 0 PID: 0 Comm: swapper/14 Kdump: loaded Tainted: G W E 6.16.0-rc2+ #20 PREEMPT(voluntary) [ 876.949935] Tainted: [W]=WARN, [E]=UNSIGNED_MODULE [ 876.949936] Hardware name: Micro-Star International Co., Ltd. MS-7E16/X670E GAMING PLUS WIFI (MS-7E16), BIOS 1.90 12/31/2024 [ 876.949936] RIP: 0010:__list_del_entry_valid_or_report+0x67/0x120 [ 876.949938] Code: 00 00 00 48 39 7d 08 0f 85 a6 00 00 00 5b b8 01 00 00 00 5d 41 5c e9 73 0d 93 ff 48 89 fe 48 c7 c7 a0 31 e8 89 e8 59 7c b3 ff <0f> 0b 31 c0 5b 5d 41 5c e9 57 0d 93 ff 48 89 fe 48 c7 c7 c8 31 e8 [ 876.949940] RSP: 0018:ffffaa73405d0c60 EFLAGS: 00010282 [ 876.949941] RAX: 0000000000000000 RBX: ffffead40445a348 RCX: 0000000000000000 [ 876.949942] RDX: 0000000000000105 RSI: 0000000000000001 RDI: 00000000ffffffff [ 876.949943] RBP: 0000000000000000 R08: 000000010006dfde R09: ffffffff8a47d150 [ 876.949944] R10: ffffffff8a47d150 R11: 0000000000000003 R12: dead000000000122 [ 876.949945] R13: ffff9e3e9e5af700 R14: ffffead40445a348 R15: ffff9e3e9e5af720 [ 876.949946] FS: 0000000000000000(0000) GS:ffff9e3f135be000(0000) knlGS:0000000000000000 [ 876.949947] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 876.949948] CR2: 00007fa58b480048 CR3: 0000000156724000 CR4: 0000000000750ef0 [ 876.949949] PKRU: 55555554 [ 876.949950] Call Trace: [ 876.949951] <IRQ> [ 876.949952] __rmqueue_pcplist+0x53/0x2c0 [ 876.949955] alloc_pages_bulk_noprof+0x2e0/0x660 [ 876.949958] __page_pool_alloc_pages_slow+0xa9/0x400 [ 876.949961] page_pool_alloc_pages+0xa/0x20 [ 876.949963] wx_alloc_rx_buffers+0xd7/0x110 [libwx] [ 876.949967] wx_clean_rx_irq+0x262/0x430 [libwx] [ 876.949971] wx_poll+0x92/0x130 [libwx] [ 876.949975] __napi_poll+0x28/0x190 [ 876.949977] net_rx_action+0x301/0x3f0 [ 876.949980] ? srso_alias_return_thunk+0x5/0xfbef5 [ 876.949981] ? profile_tick+0x30/0x70 [ 876.949983] ? srso_alias_return_thunk+0x5/0xfbef5 [ 876.949984] ? srso_alias_return_thunk+0x5/0xfbef5 [ 876.949986] ? timerqueue_add+0xa3/0xc0 [ 876.949988] ? srso_alias_return_thunk+0x5/0xfbef5 [ 876.949989] ? __raise_softirq_irqoff+0x16/0x70 [ 876.949991] ? srso_alias_return_thunk+0x5/0xfbef5 [ 876.949993] ? srso_alias_return_thunk+0x5/0xfbef5 [ 876.949994] ? wx_msix_clean_rings+0x41/0x50 [libwx] [ 876.949998] handle_softirqs+0xf9/0x2c0 Fixes: 3c47e8a ("net: libwx: Support to receive packets in NAPI") Cc: stable@vger.kernel.org Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250714024755.17512-2-jiawenwu@trustnetic.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Syzkaller reported the following splat: WARNING: CPU: 1 PID: 7704 at net/mptcp/protocol.h:1223 __mptcp_do_fallback net/mptcp/protocol.h:1223 [inline] WARNING: CPU: 1 PID: 7704 at net/mptcp/protocol.h:1223 mptcp_do_fallback net/mptcp/protocol.h:1244 [inline] WARNING: CPU: 1 PID: 7704 at net/mptcp/protocol.h:1223 check_fully_established net/mptcp/options.c:982 [inline] WARNING: CPU: 1 PID: 7704 at net/mptcp/protocol.h:1223 mptcp_incoming_options+0x21a8/0x2510 net/mptcp/options.c:1153 Modules linked in: CPU: 1 UID: 0 PID: 7704 Comm: syz.3.1419 Not tainted 6.16.0-rc3-gbd5ce2324dba #20 PREEMPT(voluntary) Hardware name: QEMU Ubuntu 24.04 PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 RIP: 0010:__mptcp_do_fallback net/mptcp/protocol.h:1223 [inline] RIP: 0010:mptcp_do_fallback net/mptcp/protocol.h:1244 [inline] RIP: 0010:check_fully_established net/mptcp/options.c:982 [inline] RIP: 0010:mptcp_incoming_options+0x21a8/0x2510 net/mptcp/options.c:1153 Code: 24 18 e8 bb 2a 00 fd e9 1b df ff ff e8 b1 21 0f 00 e8 ec 5f c4 fc 44 0f b7 ac 24 b0 00 00 00 e9 54 f1 ff ff e8 d9 5f c4 fc 90 <0f> 0b 90 e9 b8 f4 ff ff e8 8b 2a 00 fd e9 8d e6 ff ff e8 81 2a 00 RSP: 0018:ffff8880a3f08448 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff8880180a8000 RCX: ffffffff84afcf45 RDX: ffff888090223700 RSI: ffffffff84afdaa7 RDI: 0000000000000001 RBP: ffff888017955780 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 R13: ffff8880180a8910 R14: ffff8880a3e9d058 R15: 0000000000000000 FS: 00005555791b8500(0000) GS:ffff88811c495000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000110c2800b7 CR3: 0000000058e44000 CR4: 0000000000350ef0 Call Trace: <IRQ> tcp_reset+0x26f/0x2b0 net/ipv4/tcp_input.c:4432 tcp_validate_incoming+0x1057/0x1b60 net/ipv4/tcp_input.c:5975 tcp_rcv_established+0x5b5/0x21f0 net/ipv4/tcp_input.c:6166 tcp_v4_do_rcv+0x5dc/0xa70 net/ipv4/tcp_ipv4.c:1925 tcp_v4_rcv+0x3473/0x44a0 net/ipv4/tcp_ipv4.c:2363 ip_protocol_deliver_rcu+0xba/0x480 net/ipv4/ip_input.c:205 ip_local_deliver_finish+0x2f1/0x500 net/ipv4/ip_input.c:233 NF_HOOK include/linux/netfilter.h:317 [inline] NF_HOOK include/linux/netfilter.h:311 [inline] ip_local_deliver+0x1be/0x560 net/ipv4/ip_input.c:254 dst_input include/net/dst.h:469 [inline] ip_rcv_finish net/ipv4/ip_input.c:447 [inline] NF_HOOK include/linux/netfilter.h:317 [inline] NF_HOOK include/linux/netfilter.h:311 [inline] ip_rcv+0x514/0x810 net/ipv4/ip_input.c:567 __netif_receive_skb_one_core+0x197/0x1e0 net/core/dev.c:5975 __netif_receive_skb+0x1f/0x120 net/core/dev.c:6088 process_backlog+0x301/0x1360 net/core/dev.c:6440 __napi_poll.constprop.0+0xba/0x550 net/core/dev.c:7453 napi_poll net/core/dev.c:7517 [inline] net_rx_action+0xb44/0x1010 net/core/dev.c:7644 handle_softirqs+0x1d0/0x770 kernel/softirq.c:579 do_softirq+0x3f/0x90 kernel/softirq.c:480 </IRQ> <TASK> __local_bh_enable_ip+0xed/0x110 kernel/softirq.c:407 local_bh_enable include/linux/bottom_half.h:33 [inline] inet_csk_listen_stop+0x2c5/0x1070 net/ipv4/inet_connection_sock.c:1524 mptcp_check_listen_stop.part.0+0x1cc/0x220 net/mptcp/protocol.c:2985 mptcp_check_listen_stop net/mptcp/mib.h:118 [inline] __mptcp_close+0x9b9/0xbd0 net/mptcp/protocol.c:3000 mptcp_close+0x2f/0x140 net/mptcp/protocol.c:3066 inet_release+0xed/0x200 net/ipv4/af_inet.c:435 inet6_release+0x4f/0x70 net/ipv6/af_inet6.c:487 __sock_release+0xb3/0x270 net/socket.c:649 sock_close+0x1c/0x30 net/socket.c:1439 __fput+0x402/0xb70 fs/file_table.c:465 task_work_run+0x150/0x240 kernel/task_work.c:227 resume_user_mode_work include/linux/resume_user_mode.h:50 [inline] exit_to_user_mode_loop+0xd4/0xe0 kernel/entry/common.c:114 exit_to_user_mode_prepare include/linux/entry-common.h:330 [inline] syscall_exit_to_user_mode_work include/linux/entry-common.h:414 [inline] syscall_exit_to_user_mode include/linux/entry-common.h:449 [inline] do_syscall_64+0x245/0x360 arch/x86/entry/syscall_64.c:100 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7fc92f8a36ad Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007ffcf52802d8 EFLAGS: 00000246 ORIG_RAX: 00000000000001b4 RAX: 0000000000000000 RBX: 00007ffcf52803a8 RCX: 00007fc92f8a36ad RDX: 0000000000000000 RSI: 000000000000001e RDI: 0000000000000003 RBP: 00007fc92fae7ba0 R08: 0000000000000001 R09: 0000002800000000 R10: 00007fc92f700000 R11: 0000000000000246 R12: 00007fc92fae5fac R13: 00007fc92fae5fa0 R14: 0000000000026d00 R15: 0000000000026c51 </TASK> irq event stamp: 4068 hardirqs last enabled at (4076): [<ffffffff81544816>] __up_console_sem+0x76/0x80 kernel/printk/printk.c:344 hardirqs last disabled at (4085): [<ffffffff815447fb>] __up_console_sem+0x5b/0x80 kernel/printk/printk.c:342 softirqs last enabled at (3096): [<ffffffff840e1be0>] local_bh_enable include/linux/bottom_half.h:33 [inline] softirqs last enabled at (3096): [<ffffffff840e1be0>] inet_csk_listen_stop+0x2c0/0x1070 net/ipv4/inet_connection_sock.c:1524 softirqs last disabled at (3097): [<ffffffff813b6b9f>] do_softirq+0x3f/0x90 kernel/softirq.c:480 Since we need to track the 'fallback is possible' condition and the fallback status separately, there are a few possible races open between the check and the actual fallback action. Add a spinlock to protect the fallback related information and use it close all the possible related races. While at it also remove the too-early clearing of allow_infinite_fallback in __mptcp_subflow_connect(): the field will be correctly cleared by subflow_finish_connect() if/when the connection will complete successfully. If fallback is not possible, as per RFC, reset the current subflow. Since the fallback operation can now fail and return value should be checked, rename the helper accordingly. Fixes: 0530020 ("mptcp: track and update contiguous data status") Cc: stable@vger.kernel.org Reported-by: Matthieu Baerts <matttbe@kernel.org> Closes: multipath-tcp/mptcp_net-next#570 Reported-by: syzbot+5cf807c20386d699b524@syzkaller.appspotmail.com Closes: multipath-tcp/mptcp_net-next#555 Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250714-net-mptcp-fallback-races-v1-1-391aff963322@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
9a7a50e to
8bf6490
Compare
|
Upstream branch: a8fa173 |
|
Upstream branch: a8fa173 |
|
Github failed to update this PR after force push. Close it. |
Without the change `perf `hangs up on charaster devices. On my system
it's enough to run system-wide sampler for a few seconds to get the
hangup:
$ perf record -a -g --call-graph=dwarf
$ perf report
# hung
`strace` shows that hangup happens on reading on a character device
`/dev/dri/renderD128`
$ strace -y -f -p 2780484
strace: Process 2780484 attached
pread64(101</dev/dri/renderD128>, strace: Process 2780484 detached
It's call trace descends into `elfutils`:
$ gdb -p 2780484
(gdb) bt
#0 0x00007f5e508f04b7 in __libc_pread64 (fd=101, buf=0x7fff9df7edb0, count=0, offset=0)
at ../sysdeps/unix/sysv/linux/pread64.c:25
#1 0x00007f5e52b79515 in read_file () from /<<NIX>>/elfutils-0.192/lib/libelf.so.1
#2 0x00007f5e52b25666 in libdw_open_elf () from /<<NIX>>/elfutils-0.192/lib/libdw.so.1
#3 0x00007f5e52b25907 in __libdw_open_file () from /<<NIX>>/elfutils-0.192/lib/libdw.so.1
#4 0x00007f5e52b120a9 in dwfl_report_elf@@ELFUTILS_0.156 ()
from /<<NIX>>/elfutils-0.192/lib/libdw.so.1
#5 0x000000000068bf20 in __report_module (al=al@entry=0x7fff9df80010, ip=ip@entry=139803237033216, ui=ui@entry=0x5369b5e0)
at util/dso.h:537
#6 0x000000000068c3d1 in report_module (ip=139803237033216, ui=0x5369b5e0) at util/unwind-libdw.c:114
#7 frame_callback (state=0x535aef10, arg=0x5369b5e0) at util/unwind-libdw.c:242
#8 0x00007f5e52b261d3 in dwfl_thread_getframes () from /<<NIX>>/elfutils-0.192/lib/libdw.so.1
#9 0x00007f5e52b25bdb in get_one_thread_cb () from /<<NIX>>/elfutils-0.192/lib/libdw.so.1
#10 0x00007f5e52b25faa in dwfl_getthreads () from /<<NIX>>/elfutils-0.192/lib/libdw.so.1
#11 0x00007f5e52b26514 in dwfl_getthread_frames () from /<<NIX>>/elfutils-0.192/lib/libdw.so.1
#12 0x000000000068c6ce in unwind__get_entries (cb=cb@entry=0x5d4620 <unwind_entry>, arg=arg@entry=0x10cd5fa0,
thread=thread@entry=0x1076a290, data=data@entry=0x7fff9df80540, max_stack=max_stack@entry=127,
best_effort=best_effort@entry=false) at util/thread.h:152
#13 0x00000000005dae95 in thread__resolve_callchain_unwind (evsel=0x106006d0, thread=0x1076a290, cursor=0x10cd5fa0,
sample=0x7fff9df80540, max_stack=127, symbols=true) at util/machine.c:2939
#14 thread__resolve_callchain_unwind (thread=0x1076a290, cursor=0x10cd5fa0, evsel=0x106006d0, sample=0x7fff9df80540,
max_stack=127, symbols=true) at util/machine.c:2920
#15 __thread__resolve_callchain (thread=0x1076a290, cursor=0x10cd5fa0, evsel=0x106006d0, evsel@entry=0x7fff9df80440,
sample=0x7fff9df80540, parent=parent@entry=0x7fff9df804a0, root_al=root_al@entry=0x7fff9df80440, max_stack=127, symbols=true)
at util/machine.c:2970
#16 0x00000000005d0cb2 in thread__resolve_callchain (thread=<optimized out>, cursor=<optimized out>, evsel=0x7fff9df80440,
sample=<optimized out>, parent=0x7fff9df804a0, root_al=0x7fff9df80440, max_stack=127) at util/machine.h:198
#17 sample__resolve_callchain (sample=<optimized out>, cursor=<optimized out>, parent=parent@entry=0x7fff9df804a0,
evsel=evsel@entry=0x106006d0, al=al@entry=0x7fff9df80440, max_stack=max_stack@entry=127) at util/callchain.c:1127
#18 0x0000000000617e08 in hist_entry_iter__add (iter=iter@entry=0x7fff9df80480, al=al@entry=0x7fff9df80440, max_stack_depth=127,
arg=arg@entry=0x7fff9df81ae0) at util/hist.c:1255
#19 0x000000000045d2d0 in process_sample_event (tool=0x7fff9df81ae0, event=<optimized out>, sample=0x7fff9df80540,
evsel=0x106006d0, machine=<optimized out>) at builtin-report.c:334
#20 0x00000000005e3bb1 in perf_session__deliver_event (session=0x105ff2c0, event=0x7f5c7d735ca0, tool=0x7fff9df81ae0,
file_offset=2914716832, file_path=0x105ffbf0 "perf.data") at util/session.c:1367
#21 0x00000000005e8d93 in do_flush (oe=0x105ffa50, show_progress=false) at util/ordered-events.c:245
#22 __ordered_events__flush (oe=0x105ffa50, how=OE_FLUSH__ROUND, timestamp=<optimized out>) at util/ordered-events.c:324
#23 0x00000000005e1f64 in perf_session__process_user_event (session=0x105ff2c0, event=0x7f5c7d752b18, file_offset=2914835224,
file_path=0x105ffbf0 "perf.data") at util/session.c:1419
#24 0x00000000005e47c7 in reader__read_event (rd=rd@entry=0x7fff9df81260, session=session@entry=0x105ff2c0,
--Type <RET> for more, q to quit, c to continue without paging--
quit
prog=prog@entry=0x7fff9df81220) at util/session.c:2132
#25 0x00000000005e4b37 in reader__process_events (rd=0x7fff9df81260, session=0x105ff2c0, prog=0x7fff9df81220)
at util/session.c:2181
#26 __perf_session__process_events (session=0x105ff2c0) at util/session.c:2226
#27 perf_session__process_events (session=session@entry=0x105ff2c0) at util/session.c:2390
#28 0x0000000000460add in __cmd_report (rep=0x7fff9df81ae0) at builtin-report.c:1076
#29 cmd_report (argc=<optimized out>, argv=<optimized out>) at builtin-report.c:1827
#30 0x00000000004c5a40 in run_builtin (p=p@entry=0xd8f7f8 <commands+312>, argc=argc@entry=1, argv=argv@entry=0x7fff9df844b0)
at perf.c:351
#31 0x00000000004c5d63 in handle_internal_command (argc=argc@entry=1, argv=argv@entry=0x7fff9df844b0) at perf.c:404
#32 0x0000000000442de3 in run_argv (argcp=<synthetic pointer>, argv=<synthetic pointer>) at perf.c:448
#33 main (argc=<optimized out>, argv=0x7fff9df844b0) at perf.c:556
The hangup happens because nothing in` perf` or `elfutils` checks if a
mapped file is easily readable.
The change conservatively skips all non-regular files.
Signed-off-by: Sergei Trofimovich <slyich@gmail.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20250505174419.2814857-1-slyich@gmail.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Symbolize stack traces by creating a live machine. Add this
functionality to dump_stack and switch dump_stack users to use
it. Switch TUI to use it. Add stack traces to the child test function
which can be useful to diagnose blocked code.
Example output:
```
$ perf test -vv PERF_RECORD_
...
7: PERF_RECORD_* events & perf_sample fields:
7: PERF_RECORD_* events & perf_sample fields : Running (1 active)
^C
Signal (2) while running tests.
Terminating tests with the same signal
Internal test harness failure. Completing any started tests:
: 7: PERF_RECORD_* events & perf_sample fields:
---- unexpected signal (2) ----
#0 0x55788c6210a3 in child_test_sig_handler builtin-test.c:0
#1 0x7fc12fe49df0 in __restore_rt libc_sigaction.c:0
#2 0x7fc12fe99687 in __internal_syscall_cancel cancellation.c:64
#3 0x7fc12fee5f7a in clock_nanosleep@GLIBC_2.2.5 clock_nanosleep.c:72
#4 0x7fc12fef1393 in __nanosleep nanosleep.c:26
#5 0x7fc12ff02d68 in __sleep sleep.c:55
#6 0x55788c63196b in test__PERF_RECORD perf-record.c:0
#7 0x55788c620fb0 in run_test_child builtin-test.c:0
#8 0x55788c5bd18d in start_command run-command.c:127
#9 0x55788c621ef3 in __cmd_test builtin-test.c:0
#10 0x55788c6225bf in cmd_test ??:0
#11 0x55788c5afbd0 in run_builtin perf.c:0
#12 0x55788c5afeeb in handle_internal_command perf.c:0
#13 0x55788c52b383 in main ??:0
#14 0x7fc12fe33ca8 in __libc_start_call_main libc_start_call_main.h:74
#15 0x7fc12fe33d65 in __libc_start_main@@GLIBC_2.34 libc-start.c:128
#16 0x55788c52b9d1 in _start ??:0
---- unexpected signal (2) ----
#0 0x55788c6210a3 in child_test_sig_handler builtin-test.c:0
#1 0x7fc12fe49df0 in __restore_rt libc_sigaction.c:0
#2 0x7fc12fea3a14 in pthread_sigmask@GLIBC_2.2.5 pthread_sigmask.c:45
#3 0x7fc12fe49fd9 in __GI___sigprocmask sigprocmask.c:26
#4 0x7fc12ff2601b in __longjmp_chk longjmp.c:36
#5 0x55788c6210c0 in print_test_result.isra.0 builtin-test.c:0
#6 0x7fc12fe49df0 in __restore_rt libc_sigaction.c:0
#7 0x7fc12fe99687 in __internal_syscall_cancel cancellation.c:64
#8 0x7fc12fee5f7a in clock_nanosleep@GLIBC_2.2.5 clock_nanosleep.c:72
#9 0x7fc12fef1393 in __nanosleep nanosleep.c:26
#10 0x7fc12ff02d68 in __sleep sleep.c:55
#11 0x55788c63196b in test__PERF_RECORD perf-record.c:0
#12 0x55788c620fb0 in run_test_child builtin-test.c:0
#13 0x55788c5bd18d in start_command run-command.c:127
#14 0x55788c621ef3 in __cmd_test builtin-test.c:0
#15 0x55788c6225bf in cmd_test ??:0
#16 0x55788c5afbd0 in run_builtin perf.c:0
#17 0x55788c5afeeb in handle_internal_command perf.c:0
#18 0x55788c52b383 in main ??:0
#19 0x7fc12fe33ca8 in __libc_start_call_main libc_start_call_main.h:74
#20 0x7fc12fe33d65 in __libc_start_main@@GLIBC_2.34 libc-start.c:128
#21 0x55788c52b9d1 in _start ??:0
7: PERF_RECORD_* events & perf_sample fields : Skip (permissions)
```
Signed-off-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20250624210500.2121303-1-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Pull request for series with
subject: Improve write performance for zoned UFS devices
version: 20
url: https://patchwork.kernel.org/project/linux-block/list/?series=980236