
Commit d9ac1d5

spikeh authored and axboe committed
net: add documentation for io_uring zcrx
Add documentation for io_uring zero copy Rx that explains requirements and the
user API.

Signed-off-by: David Wei <[email protected]>
Acked-by: Jakub Kicinski <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
1 parent bc57c7d commit d9ac1d5

2 files changed, 203 insertions(+), 0 deletions(-)


Documentation/networking/index.rst

Lines changed: 1 addition & 0 deletions
@@ -63,6 +63,7 @@ Contents:
    gtp
    ila
    ioam6-sysctl
+   iou-zcrx
    ip_dynaddr
    ipsec
    ip-sysctl
Documentation/networking/iou-zcrx.rst

Lines changed: 202 additions & 0 deletions

@@ -0,0 +1,202 @@
.. SPDX-License-Identifier: GPL-2.0

=====================
io_uring zero copy Rx
=====================

Introduction
============

io_uring zero copy Rx (ZC Rx) is a feature that removes kernel-to-user copy on
the network receive path, allowing packet data to be received directly into
userspace memory. This feature is different from TCP_ZEROCOPY_RECEIVE in that
there are no strict alignment requirements and no need to mmap()/munmap().
Compared to kernel bypass solutions such as DPDK, the packet headers are
processed by the kernel TCP stack as normal.
NIC HW Requirements
===================

Several NIC HW features are required for io_uring ZC Rx to work. For now the
kernel API does not configure the NIC; this must be done by the user.

Header/data split
-----------------

Required to split packets at the L4 boundary into a header and a payload.
Headers are received into kernel memory as normal and processed by the TCP
stack as normal. Payloads are received into userspace memory directly.

Flow steering
-------------

Specific HW Rx queues are configured for this feature, but modern NICs
typically distribute flows across all HW Rx queues. Flow steering is required
to ensure that only the desired flows are directed towards the HW queues that
are configured for io_uring ZC Rx.

RSS
---

In addition to flow steering above, RSS is required to steer all other non-zero
copy flows away from the queues that are configured for io_uring ZC Rx.
44+
Usage
45+
=====
46+
47+
Setup NIC
48+
---------
49+
50+
Must be done out of band for now.
51+
52+
Ensure there are at least two queues::
53+
54+
ethtool -L eth0 combined 2
55+
56+
Enable header/data split::
57+
58+
ethtool -G eth0 tcp-data-split on
59+
60+
Carve out half of the HW Rx queues for zero copy using RSS::
61+
62+
ethtool -X eth0 equal 1
63+
64+
Set up flow steering, bearing in mind that queues are 0-indexed::
65+
66+
ethtool -N eth0 flow-type tcp6 ... action 1
67+
68+
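As an illustration only (the destination port and queue number here are
hypothetical, not mandated by the API), a rule steering an IPv6 flow with
destination port 8000 into queue 1 could be::

  ethtool -N eth0 flow-type tcp6 dst-port 8000 action 1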
Setup io_uring
--------------

This section describes the low level io_uring kernel API. Please refer to
the liburing documentation for how to use the higher level API.

Create an io_uring instance with the following required setup flags::

  IORING_SETUP_SINGLE_ISSUER
  IORING_SETUP_DEFER_TASKRUN
  IORING_SETUP_CQE32
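A minimal sketch of this using liburing (the queue depth of 64 is an arbitrary
choice for this example, and error handling is elided)::

  struct io_uring ring;

  int ret = io_uring_queue_init(64, &ring,
                                IORING_SETUP_SINGLE_ISSUER |
                                IORING_SETUP_DEFER_TASKRUN |
                                IORING_SETUP_CQE32);
  if (ret < 0)
          return ret;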
Create memory area
------------------

Allocate a userspace memory area for receiving zero copy data::

  void *area_ptr = mmap(NULL, area_size,
                        PROT_READ | PROT_WRITE,
                        MAP_ANONYMOUS | MAP_PRIVATE,
                        0, 0);
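``area_size`` is not defined above; as an assumption for this sketch, any
page-aligned size large enough to hold the expected amount of inflight payload
data works, e.g.::

  size_t area_size = 16 * 1024 * 1024; /* hypothetical 16 MiB area */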
Create refill ring
------------------

Allocate memory for a shared ringbuf used for returning consumed buffers::

  void *ring_ptr = mmap(NULL, ring_size,
                        PROT_READ | PROT_WRITE,
                        MAP_ANONYMOUS | MAP_PRIVATE,
                        0, 0);

This refill ring consists of some space for the header, followed by an array of
``struct io_uring_zcrx_rqe``. Note that ``ring_size`` must be computed before
the mmap() call above::

  size_t rq_entries = 4096;
  size_t ring_size = rq_entries * sizeof(struct io_uring_zcrx_rqe) + PAGE_SIZE;
  /* align to page size */
  ring_size = (ring_size + (PAGE_SIZE - 1)) & ~(PAGE_SIZE - 1);
Register ZC Rx
--------------

Fill in registration structs::

  struct io_uring_zcrx_area_reg area_reg = {
          .addr = (__u64)(unsigned long)area_ptr,
          .len = area_size,
          .flags = 0,
  };

  struct io_uring_region_desc region_reg = {
          .user_addr = (__u64)(unsigned long)ring_ptr,
          .size = ring_size,
          .flags = IORING_MEM_REGION_TYPE_USER,
  };

  struct io_uring_zcrx_ifq_reg reg = {
          .if_idx = if_nametoindex("eth0"),
          /* this is the HW queue with the desired flow steered into it */
          .if_rxq = 1,
          .rq_entries = rq_entries,
          .area_ptr = (__u64)(unsigned long)&area_reg,
          .region_ptr = (__u64)(unsigned long)&region_reg,
  };

Register with the kernel::

  io_uring_register_ifq(ring, &reg);
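Real code should check the result; a hedged sketch, assuming the usual liburing
convention of 0 on success and a negative error code on failure::

  if (io_uring_register_ifq(ring, &reg))
          /* registration failed, e.g. bad queue or missing HW support */
          return -1;

On success the kernel fills in ``reg.offsets`` and ``area_reg.rq_area_token``,
both of which are used below.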
Map refill ring
---------------

The kernel fills in fields for the refill ring in the registration ``struct
io_uring_zcrx_ifq_reg``. Map it into userspace::

  struct io_uring_zcrx_rq refill_ring;

  refill_ring.khead = (unsigned *)((char *)ring_ptr + reg.offsets.head);
  refill_ring.ktail = (unsigned *)((char *)ring_ptr + reg.offsets.tail);
  refill_ring.rqes =
          (struct io_uring_zcrx_rqe *)((char *)ring_ptr + reg.offsets.rqes);
  refill_ring.rq_tail = 0;
  /* number of rqes; needed for the masking in the recycling code below */
  refill_ring.ring_entries = reg.rq_entries;
  refill_ring.ring_ptr = ring_ptr;
Receiving data
--------------

Prepare a zero copy recv request::

  struct io_uring_sqe *sqe;

  sqe = io_uring_get_sqe(ring);
  io_uring_prep_rw(IORING_OP_RECV_ZC, sqe, fd, NULL, 0, 0);
  sqe->ioprio |= IORING_RECV_MULTISHOT;

Now, submit and wait::

  io_uring_submit_and_wait(ring, 1);

Finally, process completions::

  struct io_uring_cqe *cqe;
  unsigned int count = 0;
  unsigned int head;

  io_uring_for_each_cqe(ring, head, cqe) {
          struct io_uring_zcrx_cqe *rcqe = (struct io_uring_zcrx_cqe *)(cqe + 1);

          unsigned long mask = (1ULL << IORING_ZCRX_AREA_SHIFT) - 1;
          unsigned char *data = area_ptr + (rcqe->off & mask);
          /* do something with the data */

          count++;
  }
  io_uring_cq_advance(ring, count);
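As a general io_uring multishot note (not specific to ZC Rx): a completion with
``IORING_CQE_F_MORE`` clear means the request has terminated and must be
re-armed, and a negative ``cqe->res`` carries an error code, so real code would
check both before touching the payload::

  if (cqe->res < 0) {
          /* handle the error in cqe->res */
  } else if (!(cqe->flags & IORING_CQE_F_MORE)) {
          /* multishot terminated, prepare a new IORING_OP_RECV_ZC */
  }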
Recycling buffers
-----------------

Return buffers back to the kernel to be used again::

  struct io_uring_zcrx_rqe *rqe;
  unsigned mask = refill_ring.ring_entries - 1;
  rqe = &refill_ring.rqes[refill_ring.rq_tail & mask];

  unsigned long area_offset = rcqe->off & ~IORING_ZCRX_AREA_MASK;
  rqe->off = area_offset | area_reg.rq_area_token;
  rqe->len = cqe->res;
  IO_URING_WRITE_ONCE(*refill_ring.ktail, ++refill_ring.rq_tail);
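The ``rq_tail & mask`` indexing assumes the ring size is a power of two, which
holds for the ``rq_entries = 4096`` used above. A hypothetical helper (not part
of any API) bundling these steps might look like::

  static void zcrx_recycle(struct io_uring_zcrx_rq *rq,
                           const struct io_uring_cqe *cqe,
                           const struct io_uring_zcrx_cqe *rcqe,
                           __u64 area_token)
  {
          unsigned mask = rq->ring_entries - 1;
          struct io_uring_zcrx_rqe *rqe = &rq->rqes[rq->rq_tail & mask];

          /* recombine the area offset with the token identifying the area */
          rqe->off = (rcqe->off & ~IORING_ZCRX_AREA_MASK) | area_token;
          rqe->len = cqe->res;
          /* publish the new tail so the kernel can see the returned buffer */
          IO_URING_WRITE_ONCE(*rq->ktail, ++rq->rq_tail);
  }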
Testing
=======

See ``tools/testing/selftests/drivers/net/hw/iou-zcrx.c``
