
Commit ca0b04b

Merge tag 'for-6.15/io_uring-rx-zc-20250325' of git://git.kernel.dk/linux
Pull io_uring zero-copy receive support from Jens Axboe:

 "This adds support for zero-copy receive with io_uring, enabling fast
  bulk receive of data directly into application memory, rather than
  needing to copy the data out of kernel memory.

  While this version only supports host memory as that was the initial
  target, other memory types are planned as well, with notably GPU
  memory coming next.

  This work depends on some networking components which were queued up
  on the networking side, but have now landed in your tree.

  This is the work of Pavel Begunkov and David Wei. From the v14 posting:

   'We configure a page pool that a driver uses to fill a hw rx queue
    to hand out user pages instead of kernel pages. Any data that ends
    up hitting this hw rx queue will thus be dma'd into userspace
    memory directly, without needing to be bounced through kernel
    memory. 'Reading' data out of a socket instead becomes a
    _notification_ mechanism, where the kernel tells userspace where
    the data is. The overall approach is similar to the devmem TCP
    proposal.

    This relies on hw header/data split, flow steering and RSS to
    ensure packet headers remain in kernel memory and only desired
    flows hit a hw rx queue configured for zero copy. Configuring this
    is outside of the scope of this patchset.

    We share netdev core infra with devmem TCP. The main difference is
    that io_uring is used for the uAPI and the lifetime of all objects
    are bound to an io_uring instance. Data is 'read' using a new
    io_uring request type. When done, data is returned via a new shared
    refill queue. A zero copy page pool refills a hw rx queue from this
    refill queue directly. Of course, the lifetime of these data
    buffers are managed by io_uring rather than the networking stack,
    with different refcounting rules.

    This patchset is the first step adding basic zero copy support. We
    will extend this iteratively with new features e.g. dynamically
    allocated zero copy areas, THP support, dmabuf support, improved
    copy fallback, general optimisations and more'

  In a local setup, I was able to saturate a 200G link with a single CPU
  core, and at netdev conf 0x19 earlier this month, Jamal reported
  188Gbit of bandwidth using a single core (no HT, including soft-irq).
  Safe to say the efficiency is there, as bigger links would be needed
  to find the per-core limit, and it's considerably more efficient and
  faster than the existing devmem solution"

* tag 'for-6.15/io_uring-rx-zc-20250325' of git://git.kernel.dk/linux:
  io_uring/zcrx: add selftest case for recvzc with read limit
  io_uring/zcrx: add a read limit to recvzc requests
  io_uring: add missing IORING_MAP_OFF_ZCRX_REGION in io_uring_mmap
  io_uring: Rename KConfig to Kconfig
  io_uring/zcrx: fix leaks on failed registration
  io_uring/zcrx: recheck ifq on shutdown
  io_uring/zcrx: add selftest
  net: add documentation for io_uring zcrx
  io_uring/zcrx: add copy fallback
  io_uring/zcrx: throttle receive requests
  io_uring/zcrx: set pp memory provider for an rx queue
  io_uring/zcrx: add io_recvzc request
  io_uring/zcrx: dma-map area for the device
  io_uring/zcrx: implement zerocopy receive pp memory provider
  io_uring/zcrx: grab a net device
  io_uring/zcrx: add io_zcrx_area
  io_uring/zcrx: add interface queue and refill queue
2 parents 15cb9a2 + 89baa22 commit ca0b04b

22 files changed (+1988 −2 lines)

Documentation/networking/index.rst

Lines changed: 1 addition & 0 deletions
@@ -63,6 +63,7 @@ Contents:
    gtp
    ila
    ioam6-sysctl
+   iou-zcrx
    ip_dynaddr
    ipsec
    ip-sysctl

Documentation/networking/iou-zcrx.rst

Lines changed: 202 additions & 0 deletions
.. SPDX-License-Identifier: GPL-2.0

=====================
io_uring zero copy Rx
=====================

Introduction
============

io_uring zero copy Rx (ZC Rx) is a feature that removes kernel-to-user copy on
the network receive path, allowing packet data to be received directly into
userspace memory. This feature is different to TCP_ZEROCOPY_RECEIVE in that
there are no strict alignment requirements and no need to mmap()/munmap().
Compared to kernel bypass solutions such as DPDK, the packet headers are
processed by the kernel TCP stack as normal.

NIC HW Requirements
===================

Several NIC HW features are required for io_uring ZC Rx to work. For now the
kernel API does not configure the NIC and it must be done by the user.

Header/data split
-----------------

Required to split packets at the L4 boundary into a header and a payload.
Headers are received into kernel memory as normal and processed by the TCP
stack as normal. Payloads are received into userspace memory directly.

Flow steering
-------------

Specific HW Rx queues are configured for this feature, but modern NICs
typically distribute flows across all HW Rx queues. Flow steering is required
to ensure that only desired flows are directed towards HW queues that are
configured for io_uring ZC Rx.

RSS
---

In addition to flow steering above, RSS is required to steer all other non-zero
copy flows away from queues that are configured for io_uring ZC Rx.

Usage
=====

Setup NIC
---------

Must be done out of band for now.

Ensure there are at least two queues::

  ethtool -L eth0 combined 2

Enable header/data split::

  ethtool -G eth0 tcp-data-split on

Carve out half of the HW Rx queues for zero copy using RSS::

  ethtool -X eth0 equal 1

Set up flow steering, bearing in mind that queues are 0-indexed::

  ethtool -N eth0 flow-type tcp6 ... action 1

Setup io_uring
--------------

This section describes the low level io_uring kernel API. Please refer to
liburing documentation for how to use the higher level API.

Create an io_uring instance with the following required setup flags::

  IORING_SETUP_SINGLE_ISSUER
  IORING_SETUP_DEFER_TASKRUN
  IORING_SETUP_CQE32

Create memory area
------------------

Allocate a userspace memory area for receiving zero copy data::

  void *area_ptr = mmap(NULL, area_size,
                        PROT_READ | PROT_WRITE,
                        MAP_ANONYMOUS | MAP_PRIVATE,
                        0, 0);

Create refill ring
------------------

Allocate memory for a shared ringbuf used for returning consumed buffers::

  void *ring_ptr = mmap(NULL, ring_size,
                        PROT_READ | PROT_WRITE,
                        MAP_ANONYMOUS | MAP_PRIVATE,
                        0, 0);

This refill ring consists of some space for the header, followed by an array of
``struct io_uring_zcrx_rqe``::

  size_t rq_entries = 4096;
  size_t ring_size = rq_entries * sizeof(struct io_uring_zcrx_rqe) + PAGE_SIZE;
  /* align to page size */
  ring_size = (ring_size + (PAGE_SIZE - 1)) & ~(PAGE_SIZE - 1);

Register ZC Rx
--------------

Fill in registration structs::

  struct io_uring_zcrx_area_reg area_reg = {
          .addr = (__u64)(unsigned long)area_ptr,
          .len = area_size,
          .flags = 0,
  };

  struct io_uring_region_desc region_reg = {
          .user_addr = (__u64)(unsigned long)ring_ptr,
          .size = ring_size,
          .flags = IORING_MEM_REGION_TYPE_USER,
  };

  struct io_uring_zcrx_ifq_reg reg = {
          .if_idx = if_nametoindex("eth0"),
          /* this is the HW queue with desired flow steered into it */
          .if_rxq = 1,
          .rq_entries = rq_entries,
          .area_ptr = (__u64)(unsigned long)&area_reg,
          .region_ptr = (__u64)(unsigned long)&region_reg,
  };

Register with the kernel::

  io_uring_register_ifq(ring, &reg);

Map refill ring
---------------

The kernel fills in fields for the refill ring in the registration ``struct
io_uring_zcrx_ifq_reg``. Map it into userspace::

  struct io_uring_zcrx_rq refill_ring;

  refill_ring.khead = (unsigned *)((char *)ring_ptr + reg.offsets.head);
  refill_ring.ktail = (unsigned *)((char *)ring_ptr + reg.offsets.tail);
  refill_ring.rqes =
          (struct io_uring_zcrx_rqe *)((char *)ring_ptr + reg.offsets.rqes);
  refill_ring.rq_tail = 0;
  refill_ring.ring_ptr = ring_ptr;

Receiving data
--------------

Prepare a zero copy recv request::

  struct io_uring_sqe *sqe;

  sqe = io_uring_get_sqe(ring);
  io_uring_prep_rw(IORING_OP_RECV_ZC, sqe, fd, NULL, 0, 0);
  sqe->ioprio |= IORING_RECV_MULTISHOT;

Now, submit and wait::

  io_uring_submit_and_wait(ring, 1);

Finally, process completions::

  struct io_uring_cqe *cqe;
  unsigned int count = 0;
  unsigned int head;

  io_uring_for_each_cqe(ring, head, cqe) {
          struct io_uring_zcrx_cqe *rcqe = (struct io_uring_zcrx_cqe *)(cqe + 1);

          unsigned long mask = (1ULL << IORING_ZCRX_AREA_SHIFT) - 1;
          unsigned char *data = area_ptr + (rcqe->off & mask);
          /* do something with the data */

          count++;
  }
  io_uring_cq_advance(ring, count);

Recycling buffers
-----------------

Return buffers back to the kernel to be used again::

  struct io_uring_zcrx_rqe *rqe;
  unsigned mask = refill_ring.ring_entries - 1;
  rqe = &refill_ring.rqes[refill_ring.rq_tail & mask];

  unsigned long area_offset = rcqe->off & ~IORING_ZCRX_AREA_MASK;
  rqe->off = area_offset | area_reg.rq_area_token;
  rqe->len = cqe->res;
  IO_URING_WRITE_ONCE(*refill_ring.ktail, ++refill_ring.rq_tail);

Testing
=======

See ``tools/testing/selftests/drivers/net/hw/iou-zcrx.c``

Kconfig

Lines changed: 2 additions & 0 deletions
@@ -30,3 +30,5 @@ source "lib/Kconfig"
 source "lib/Kconfig.debug"
 
 source "Documentation/Kconfig"
+
+source "io_uring/Kconfig"

include/linux/io_uring_types.h

Lines changed: 6 additions & 0 deletions
@@ -40,6 +40,8 @@ enum io_uring_cmd_flags {
 	IO_URING_F_TASK_DEAD = (1 << 13),
 };
 
+struct io_zcrx_ifq;
+
 struct io_wq_work_node {
 	struct io_wq_work_node *next;
 };
@@ -384,6 +386,8 @@ struct io_ring_ctx {
 	struct wait_queue_head poll_wq;
 	struct io_restriction restrictions;
 
+	struct io_zcrx_ifq *ifq;
+
 	u32 pers_next;
 	struct xarray personalities;
 
@@ -436,6 +440,8 @@ struct io_ring_ctx {
 	struct io_mapped_region ring_region;
 	/* used for optimised request parameter and wait argument passing */
 	struct io_mapped_region param_region;
+	/* just one zcrx per ring for now, will move to io_zcrx_ifq eventually */
+	struct io_mapped_region zcrx_region;
 };
 
 /*

include/uapi/linux/io_uring.h

Lines changed: 53 additions & 1 deletion
@@ -87,6 +87,7 @@ struct io_uring_sqe {
 	union {
 		__s32	splice_fd_in;
 		__u32	file_index;
+		__u32	zcrx_ifq_idx;
 		__u32	optlen;
 		struct {
 			__u16	addr_len;
@@ -278,6 +279,7 @@ enum io_uring_op {
 	IORING_OP_FTRUNCATE,
 	IORING_OP_BIND,
 	IORING_OP_LISTEN,
+	IORING_OP_RECV_ZC,
 
 	/* this goes last, obviously */
 	IORING_OP_LAST,
@@ -641,7 +643,8 @@ enum io_uring_register_op {
 	/* send MSG_RING without having a ring */
 	IORING_REGISTER_SEND_MSG_RING = 31,
 
-	/* 32 reserved for zc rx */
+	/* register a netdev hw rx queue for zerocopy */
+	IORING_REGISTER_ZCRX_IFQ = 32,
 
 	/* resize CQ ring */
 	IORING_REGISTER_RESIZE_RINGS = 33,
@@ -958,6 +961,55 @@ enum io_uring_socket_op {
 	SOCKET_URING_OP_SETSOCKOPT,
 };
 
+/* Zero copy receive refill queue entry */
+struct io_uring_zcrx_rqe {
+	__u64	off;
+	__u32	len;
+	__u32	__pad;
+};
+
+struct io_uring_zcrx_cqe {
+	__u64	off;
+	__u64	__pad;
+};
+
+/* The bit from which area id is encoded into offsets */
+#define IORING_ZCRX_AREA_SHIFT	48
+#define IORING_ZCRX_AREA_MASK	(~(((__u64)1 << IORING_ZCRX_AREA_SHIFT) - 1))
+
+struct io_uring_zcrx_offsets {
+	__u32	head;
+	__u32	tail;
+	__u32	rqes;
+	__u32	__resv2;
+	__u64	__resv[2];
+};
+
+struct io_uring_zcrx_area_reg {
+	__u64	addr;
+	__u64	len;
+	__u64	rq_area_token;
+	__u32	flags;
+	__u32	__resv1;
+	__u64	__resv2[2];
+};
+
+/*
+ * Argument for IORING_REGISTER_ZCRX_IFQ
+ */
+struct io_uring_zcrx_ifq_reg {
+	__u32	if_idx;
+	__u32	if_rxq;
+	__u32	rq_entries;
+	__u32	flags;
+
+	__u64	area_ptr; /* pointer to struct io_uring_zcrx_area_reg */
+	__u64	region_ptr; /* struct io_uring_region_desc * */
+
+	struct io_uring_zcrx_offsets offsets;
+	__u64	__resv[4];
+};
+
 #ifdef __cplusplus
 }
 #endif

io_uring/Kconfig

Lines changed: 10 additions & 0 deletions
# SPDX-License-Identifier: GPL-2.0-only
#
# io_uring configuration
#

config IO_URING_ZCRX
	def_bool y
	depends on PAGE_POOL
	depends on INET
	depends on NET_RX_BUSY_POLL

io_uring/Makefile

Lines changed: 1 addition & 0 deletions
@@ -14,6 +14,7 @@ obj-$(CONFIG_IO_URING) += io_uring.o opdef.o kbuf.o rsrc.o notif.o \
 					epoll.o statx.o timeout.o fdinfo.o \
 					cancel.o waitid.o register.o \
 					truncate.o memmap.o alloc_cache.o
+obj-$(CONFIG_IO_URING_ZCRX) += zcrx.o
 obj-$(CONFIG_IO_WQ) += io-wq.o
 obj-$(CONFIG_FUTEX) += futex.o
 obj-$(CONFIG_NET_RX_BUSY_POLL) += napi.o

io_uring/io_uring.c

Lines changed: 7 additions & 0 deletions
@@ -97,6 +97,7 @@
 #include "uring_cmd.h"
 #include "msg_ring.h"
 #include "memmap.h"
+#include "zcrx.h"
 
 #include "timeout.h"
 #include "poll.h"
@@ -2732,6 +2733,7 @@ static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx)
 	mutex_lock(&ctx->uring_lock);
 	io_sqe_buffers_unregister(ctx);
 	io_sqe_files_unregister(ctx);
+	io_unregister_zcrx_ifqs(ctx);
 	io_cqring_overflow_kill(ctx);
 	io_eventfd_unregister(ctx);
 	io_free_alloc_caches(ctx);
@@ -2891,6 +2893,11 @@ static __cold void io_ring_exit_work(struct work_struct *work)
 		io_cqring_overflow_kill(ctx);
 		mutex_unlock(&ctx->uring_lock);
 	}
+	if (ctx->ifq) {
+		mutex_lock(&ctx->uring_lock);
+		io_shutdown_zcrx_ifqs(ctx);
+		mutex_unlock(&ctx->uring_lock);
+	}
 
 	if (ctx->flags & IORING_SETUP_DEFER_TASKRUN)
 		io_move_task_work_from_local(ctx);

io_uring/io_uring.h

Lines changed: 10 additions & 0 deletions
@@ -189,6 +189,16 @@ static inline bool io_get_cqe(struct io_ring_ctx *ctx, struct io_uring_cqe **ret
 	return io_get_cqe_overflow(ctx, ret, false);
 }
 
+static inline bool io_defer_get_uncommited_cqe(struct io_ring_ctx *ctx,
+					       struct io_uring_cqe **cqe_ret)
+{
+	io_lockdep_assert_cq_locked(ctx);
+
+	ctx->cq_extra++;
+	ctx->submit_state.cq_flush = true;
+	return io_get_cqe(ctx, cqe_ret);
+}
+
 static __always_inline bool io_fill_cqe_req(struct io_ring_ctx *ctx,
 					    struct io_kiocb *req)
 {

io_uring/memmap.c

Lines changed: 2 additions & 0 deletions
@@ -271,6 +271,8 @@ static struct io_mapped_region *io_mmap_get_region(struct io_ring_ctx *ctx,
 		return io_pbuf_get_region(ctx, bgid);
 	case IORING_MAP_OFF_PARAM_REGION:
 		return &ctx->param_region;
+	case IORING_MAP_OFF_ZCRX_REGION:
+		return &ctx->zcrx_region;
 	}
 	return NULL;
 }
