bpf/sockmap: add splice support for tcp_bpf #11277
kernel-patches-daemon-bpf[bot] wants to merge 7 commits into bpf-next_base
Conversation
Upstream branch: 05c9b2e

AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
Add a splice_read function pointer to struct proto, between recvmsg and splice_eof. Set it to tcp_splice_read in both tcp_prot and tcpv6_prot.

Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
…am_ops

Add inet_splice_read(), which dispatches to sk->sk_prot->splice_read via INDIRECT_CALL_1. Replace the direct tcp_splice_read reference in inet_stream_ops and inet6_stream_ops with inet_splice_read.

Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Refactor the read operation with no functional changes.

tcp_bpf has two read paths: strparser and non-strparser. Currently the differences between them are implemented directly in their respective recvmsg functions, which works fine. However, upcoming splice support would require duplicating the same logic for both paths. To avoid this, extract the strparser-specific differences into an independent abstraction that splice can reuse.

For ingress_msg data processing, introduce a function pointer callback approach. The current implementation passes sk_msg_recvmsg_actor(), which performs copy_page_to_iter() - the same copy logic previously embedded in sk_msg_recvmsg(). This provides the extension point for future splice support, where a different actor can be plugged in.

Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Implement splice_read for sockmap using an always-copy approach. Each page from the psock ingress scatterlist is copied to a newly allocated page before being added to the pipe, avoiding lifetime and slab-page issues.

Add sk_msg_splice_actor(), which allocates a fresh page via alloc_page(), copies the data with memcpy(), then passes it to add_to_pipe(). The newly allocated page already has a refcount of 1, so no additional get_page() is needed. On add_to_pipe() failure, no explicit cleanup is needed since add_to_pipe() internally calls pipe_buf_release().

Also fix sk_msg_read_core() to update msg_rx->sg.start when the actor returns 0 mid-way through processing. The loop processes msg_rx->sg entries sequentially — if the actor fails (e.g. pipe full for splice, or user buffer fault for recvmsg), prior entries may already be consumed with sge->length set to 0. Without advancing sg.start, subsequent calls would revisit these zero-length entries and return -EFAULT. This is especially common with the splice actor since the pipe has a small fixed capacity (16 slots), but theoretically affects recvmsg as well.

Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
The previous splice_read implementation copies all data through intermediate pages (alloc_page + memcpy). This is wasteful for skb fragment pages, which are allocated from the page allocator and can be safely referenced via get_page().

Optimize by checking PageSlab() to distinguish between linear skb data (slab-backed) and fragment pages (page allocator-backed):

- For slab pages (skb linear data): copy to a page fragment via sk_page_frag, matching what linear_to_page() does in the standard TCP splice path (skb_splice_bits). get_page() is invalid on slab pages, so a copy is unavoidable here.
- For non-slab pages (skb frags): use get_page() directly for true zero-copy, the same as skb_splice_bits does for fragments.

Both paths use nosteal_pipe_buf_ops. The sk_page_frag approach is more memory-efficient than alloc_page for small linear copies, as multiple copies can share a single page fragment.

Benchmark results with rx-verdict-ingress mode (loopback, 8 CPUs):

    splice(2) + always-copy: ~2770 MB/s (before this patch)
    splice(2) + zero-copy:   ~4270 MB/s (after this patch, +54%)
    read(2):                 ~4292 MB/s (baseline for reference)

Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Add splice_read coverage to the sockmap_basic and sockmap_strp selftests. Each test suite now runs twice: once with normal recv_timeout() and once with splice-based reads, verifying that data read via splice(2) through a pipe produces identical results.

A recv_timeout_with_splice() helper is added to sockmap_helpers.h that creates a temporary pipe, splices data from the socket into the pipe, then reads from the pipe into the user buffer. MSG_PEEK calls fall back to native recv since splice does not support peek. Non-TCP sockets also fall back to native recv.

The splice subtests are distinguished by appending " splice" to each subtest name via a test__start_subtest macro override.

    ./test_progs -a sockmap_*
    ...
    Summary: 5/830 PASSED, 0 SKIPPED, 0 FAILED

Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Add a --splice option to bench_sockmap that uses splice(2) instead of read(2) in the consumer path. A global pipe is created once during setup and reused across iterations to avoid per-call pipe creation overhead.

When --splice is enabled, the consumer splices data from the socket into the pipe, then reads from the pipe into the user buffer. The socket is set to O_NONBLOCK to prevent tcp_splice_read() from blocking indefinitely, as it only checks sock->file->f_flags for non-blocking mode, ignoring SPLICE_F_NONBLOCK.

Also increase SO_RCVBUF to 16MB to avoid sk_psock_backlog being throttled by the default sk_rcvbuf limit, and add a --verify option to optionally enable data correctness checking (disabled by default for benchmark accuracy).

Benchmark results with rx-verdict-ingress mode (loopback, 8 CPUs):

    read(2):                 ~4292 MB/s
    splice(2) + zero-copy:   ~4270 MB/s
    splice(2) + always-copy: ~2770 MB/s

Zero-copy splice achieves near-parity with read(2), while the always-copy fallback is ~35% slower.

Usage:

    # Steer softirqs to CPU 7 to avoid contending with the producer CPU
    echo 80 > /sys/class/net/lo/queues/rx-0/rps_cpus
    # Raise the receive buffer ceiling so the benchmark can set 16MB rcvbuf
    sysctl -w net.core.rmem_max=16777216
    # Run the benchmark
    ./bench sockmap --rx-verdict-ingress --splice -c 2 -p 1 -a -d 30

Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Pull request for series with
subject: bpf/sockmap: add splice support for tcp_bpf
version: 1
url: https://patchwork.kernel.org/project/netdevbpf/list/?series=1061046