
Commit 09d1db2

mina authored and kuba-moo committed

net: add devmem TCP documentation

Add documentation outlining the usage and details of devmem TCP.

Signed-off-by: Mina Almasry <[email protected]>
Reviewed-by: Bagas Sanjaya <[email protected]>
Reviewed-by: Donald Hunter <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

1 parent 678f6e2 commit 09d1db2

File tree

2 files changed: +270 -0 lines changed

Documentation/networking/devmem.rst

Lines changed: 269 additions & 0 deletions
@@ -0,0 +1,269 @@
.. SPDX-License-Identifier: GPL-2.0

=================
Device Memory TCP
=================


Intro
=====

Device memory TCP (devmem TCP) enables receiving data directly into device
memory (dmabuf). The feature is currently implemented for TCP sockets.


Opportunity
-----------

A large number of data transfers have device memory as the source and/or
destination. Accelerators have drastically increased the prevalence of such
transfers. Some examples include:

- Distributed training, where ML accelerators, such as GPUs on different hosts,
  exchange data.

- Distributed raw block storage, where applications transfer large amounts of
  data to and from remote SSDs. Much of this data does not require host
  processing.

Typically, Device-to-Device data transfers in the network are implemented as
the following low-level operations: Device-to-Host copy, Host-to-Host network
transfer, and Host-to-Device copy.

The flow involving host copies is suboptimal, especially for bulk data
transfers, and can put significant strain on system resources such as host
memory bandwidth and PCIe bandwidth.

Devmem TCP optimizes this use case by implementing socket APIs that enable
the user to receive incoming network packets directly into device memory.

Packet payloads go directly from the NIC to device memory.

Packet headers go to host memory and are processed by the TCP/IP stack
normally. The NIC must support header split to achieve this.

Advantages:

- Alleviate host memory bandwidth pressure, compared to existing
  network-transfer + device-copy semantics.

- Alleviate PCIe bandwidth pressure, by limiting data transfer to the lowest
  level of the PCIe tree, compared to the traditional path which sends data
  through the root complex.


More Info
---------

  slides, video
    https://netdevconf.org/0x17/sessions/talk/device-memory-tcp.html

  patchset
    [PATCH net-next v24 00/13] Device Memory TCP
    https://lore.kernel.org/netdev/[email protected]/


Interface
=========


Example
-------

tools/testing/selftests/net/ncdevmem.c:do_server shows an example of setting up
the RX path of this API.


NIC Setup
---------

Header split, flow steering, & RSS are required features for devmem TCP.

Header split is used to split incoming packets into a header buffer in host
memory, and a payload buffer in device memory.

Flow steering & RSS are used to ensure that only flows targeting devmem land on
an RX queue bound to devmem.

Enable header split & flow steering::

    # enable header split
    ethtool -G eth1 tcp-data-split on

    # enable flow steering
    ethtool -K eth1 ntuple on

Configure RSS to steer all traffic away from the target RX queue (queue 15 in
this example)::

    ethtool --set-rxfh-indir eth1 equal 15

``equal 15`` spreads RSS traffic across queues 0-14, so the 0-indexed queue 15
receives none of it and can be reserved for devmem flows.

The user must bind a dmabuf to any number of RX queues on a given NIC using
the netlink API::

    /* Bind dmabuf to NIC RX queue 15 */
    struct netdev_queue *queues;
    queues = malloc(sizeof(*queues) * 1);

    queues[0]._present.type = 1;
    queues[0]._present.idx = 1;
    queues[0].type = NETDEV_QUEUE_TYPE_RX;
    queues[0].idx = 15;

    *ys = ynl_sock_create(&ynl_netdev_family, &yerr);

    req = netdev_bind_rx_req_alloc();
    netdev_bind_rx_req_set_ifindex(req, 1 /* ifindex */);
    netdev_bind_rx_req_set_dmabuf_fd(req, dmabuf_fd);
    __netdev_bind_rx_req_set_queues(req, queues, n_queue_index);

    rsp = netdev_bind_rx(*ys, req);

    dmabuf_id = rsp->dmabuf_id;

The netlink API returns a dmabuf_id: a unique ID that refers to this dmabuf
that has been bound.

The user can unbind the dmabuf from the netdevice by closing the netlink socket
that established the binding. We do this so that the binding is automatically
unbound even if the userspace process crashes.
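
Consequently, unbinding in a graceful teardown amounts to closing that socket.
A minimal sketch, assuming the ``ys`` YNL socket created in the binding example
above::

    /* Closing the netlink socket releases the dmabuf binding. */
    ynl_sock_destroy(*ys);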

Note that any reasonably well-behaved dmabuf from any exporter should work with
devmem TCP, even if the dmabuf is not actually backed by devmem. An example of
this is udmabuf, which wraps user memory (non-devmem) in a dmabuf.
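
For instance, a udmabuf suitable for experimenting with devmem TCP can be
created from a memfd via the udmabuf driver. The sketch below is one plausible
setup (error handling omitted; the buffer size is illustrative), not part of
the devmem API itself::

    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <linux/udmabuf.h>

    size_t size = getpagesize() * 1024;
    struct udmabuf_create create = { 0 };
    int devfd, memfd, dmabuf_fd;

    devfd = open("/dev/udmabuf", O_RDWR);

    /* udmabuf requires a sealable memfd with F_SEAL_SHRINK set */
    memfd = memfd_create("udmabuf-test", MFD_ALLOW_SEALING);
    ftruncate(memfd, size);
    fcntl(memfd, F_ADD_SEALS, F_SEAL_SHRINK);

    create.memfd = memfd;
    create.offset = 0;
    create.size = size;
    dmabuf_fd = ioctl(devfd, UDMABUF_CREATE, &create);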


Socket Setup
------------

The socket must be flow steered to the dmabuf bound RX queue::

    ethtool -N eth1 flow-type tcp4 ... queue 15


Receiving data
--------------

The user application must signal to the kernel that it is capable of receiving
devmem data by passing the MSG_SOCK_DEVMEM flag to recvmsg::

    ret = recvmsg(fd, &msg, MSG_SOCK_DEVMEM);

Applications that do not specify the MSG_SOCK_DEVMEM flag will receive an
EFAULT on devmem data.
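
For the recvmsg call above to return the SCM_DEVMEM_* cmsgs described below,
``msg`` must carry a sufficiently large control buffer. A minimal sketch of
that setup, with arbitrary illustrative buffer sizes::

    char iobuf[819200];
    char ctrl_data[sizeof(int) * 20000];

    struct iovec iov = { .iov_base = iobuf,
                         .iov_len = sizeof(iobuf) };
    struct msghdr msg = { 0 };

    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl_data;
    msg.msg_controllen = sizeof(ctrl_data);

    ret = recvmsg(fd, &msg, MSG_SOCK_DEVMEM);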

Devmem data is received directly into the dmabuf bound to the NIC in 'NIC
Setup', and the kernel signals such to the user via the SCM_DEVMEM_* cmsgs::

    for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
        if (cm->cmsg_level != SOL_SOCKET ||
            (cm->cmsg_type != SCM_DEVMEM_DMABUF &&
             cm->cmsg_type != SCM_DEVMEM_LINEAR))
            continue;

        dmabuf_cmsg = (struct dmabuf_cmsg *)CMSG_DATA(cm);

        if (cm->cmsg_type == SCM_DEVMEM_DMABUF) {
            /* Frag landed in dmabuf.
             *
             * dmabuf_cmsg->dmabuf_id is the dmabuf the
             * frag landed on.
             *
             * dmabuf_cmsg->frag_offset is the offset into
             * the dmabuf where the frag starts.
             *
             * dmabuf_cmsg->frag_size is the size of the
             * frag.
             *
             * dmabuf_cmsg->frag_token is a token used to
             * refer to this frag for later freeing.
             */

            struct dmabuf_token token;

            token.token_start = dmabuf_cmsg->frag_token;
            token.token_count = 1;
            continue;
        }

        if (cm->cmsg_type == SCM_DEVMEM_LINEAR)
            /* Frag landed in linear buffer.
             *
             * dmabuf_cmsg->frag_size is the size of the
             * frag.
             */
            continue;
    }

Applications may receive 2 cmsgs:

- SCM_DEVMEM_DMABUF: this indicates the fragment landed in the dmabuf indicated
  by dmabuf_id.

- SCM_DEVMEM_LINEAR: this indicates the fragment landed in the linear buffer.
  This typically happens when the NIC is unable to split the packet at the
  header boundary, such that part (or all) of the payload landed in host
  memory.
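
If the bound dmabuf happens to support CPU access, as udmabuf does, one way to
inspect an SCM_DEVMEM_DMABUF frag is to mmap() the dmabuf fd and read at
frag_offset. This is an illustrative sketch, not a requirement of the API;
``process()`` is a placeholder, and ``size`` and ``dmabuf_fd`` are those used
when the dmabuf was created::

    #include <linux/dma-buf.h>

    struct dma_buf_sync sync = { 0 };
    uint8_t *buf;

    /* map the whole dmabuf once, up front */
    buf = mmap(NULL, size, PROT_READ, MAP_SHARED, dmabuf_fd, 0);

    /* bracket CPU reads with the standard dmabuf sync ioctls */
    sync.flags = DMA_BUF_SYNC_START | DMA_BUF_SYNC_READ;
    ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync);

    process(buf + dmabuf_cmsg->frag_offset, dmabuf_cmsg->frag_size);

    sync.flags = DMA_BUF_SYNC_END | DMA_BUF_SYNC_READ;
    ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync);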

Applications may receive no SCM_DEVMEM_* cmsgs. That indicates non-devmem,
regular TCP data that landed on an RX queue not bound to a dmabuf.

Freeing frags
-------------

Frags received via SCM_DEVMEM_DMABUF are pinned by the kernel while the user
processes the frag. The user must return the frag to the kernel via
SO_DEVMEM_DONTNEED::

    ret = setsockopt(client_fd, SOL_SOCKET, SO_DEVMEM_DONTNEED, &token,
                     sizeof(token));

The user must ensure the tokens are returned to the kernel in a timely manner.
Failure to do so will exhaust the limited dmabuf that is bound to the RX queue
and will lead to packet drops.
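
Since each setsockopt call has a cost, one plausible pattern is to accumulate
tokens and return them in batches; a sketch, with an arbitrary batch size::

    #define TOKEN_BATCH 128

    struct dmabuf_token tokens[TOKEN_BATCH];
    int n_tokens = 0;

    /* in the cmsg loop, record each frag instead of freeing it alone */
    tokens[n_tokens].token_start = dmabuf_cmsg->frag_token;
    tokens[n_tokens].token_count = 1;
    n_tokens++;

    if (n_tokens == TOKEN_BATCH) {
        ret = setsockopt(client_fd, SOL_SOCKET, SO_DEVMEM_DONTNEED,
                         tokens, sizeof(tokens));
        n_tokens = 0;
    }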


Implementation & Caveats
========================

Unreadable skbs
---------------

Devmem payloads are inaccessible to the kernel processing the packets. This
results in a few quirks for payloads of devmem skbs:

- Loopback is not functional. Loopback relies on copying the payload, which is
  not possible with devmem skbs.

- Software checksum calculation fails.

- tcpdump and BPF can't access devmem packet payloads.

Testing
=======

More realistic example code can be found in the kernel source under
``tools/testing/selftests/net/ncdevmem.c``.

ncdevmem is a devmem TCP netcat. It works very similarly to netcat, but
receives data directly into a udmabuf.

To run ncdevmem, run it as a server on the machine under test, and run netcat
on a peer to provide the TX data.

ncdevmem also has a validation mode that expects a repeating pattern of
incoming data and validates it as such. For example, you can launch
ncdevmem on the server by::

    ncdevmem -s <server IP> -c <client IP> -f eth1 -d 3 -n 0000:06:00.0 -l \
             -p 5201 -v 7

On the client side, use regular netcat to send TX data to the ncdevmem process
on the server::

    yes $(echo -e \\x01\\x02\\x03\\x04\\x05\\x06) | \
            tr \\n \\0 | head -c 5G | nc <server IP> 5201 -p 5201

Documentation/networking/index.rst

Lines changed: 1 addition & 0 deletions
@@ -49,6 +49,7 @@ Contents:
    cdc_mbim
    dccp
    dctcp
+   devmem
    dns_resolver
    driver
    eql
