
Commit e331673

Merge branch 'device-memory-tcp'
Mina Almasry says:

====================
Device Memory TCP

Device memory TCP (devmem TCP) is a proposal for transferring data to and/or
from device memory efficiently, without bouncing the data to a host memory
buffer.

* Problem:

A large amount of data transfers have device memory as the source and/or
destination. Accelerators drastically increased the volume of such transfers.
Some examples include:

- ML accelerators transferring large amounts of training data from storage into
  GPU/TPU memory. In some cases ML training setup time can be as long as 50% of
  TPU compute time; improving data transfer throughput & efficiency can help
  improve GPU/TPU utilization.

- Distributed training, where ML accelerators, such as GPUs on different hosts,
  exchange data among them.

- Distributed raw block storage applications transfer large amounts of data with
  remote SSDs; much of this data does not require host processing.

Today, the majority of the Device-to-Device data transfers in the network are
implemented as the following low-level operations: Device-to-Host copy,
Host-to-Host network transfer, and Host-to-Device copy.

The implementation is suboptimal, especially for bulk data transfers, and can
put significant strains on system resources, such as host memory bandwidth,
PCIe bandwidth, etc. One important reason behind the current state is the
kernel's lack of semantics to express device-to-network transfers.

* Proposal:

In this patch series we attempt to optimize this use case by implementing
socket APIs that enable the user to:

1. send device memory across the network directly, and
2. receive incoming network packets directly into device memory.

Packet _payloads_ go directly from the NIC to device memory for receive and
from device memory to NIC for transmit. Packet _headers_ go to/from host memory
and are processed by the TCP/IP stack normally. The NIC _must_ support header
split to achieve this.

Advantages:

- Alleviate host memory bandwidth pressure, compared to existing
  network-transfer + device-copy semantics.

- Alleviate PCIe BW pressure, by limiting data transfer to the lowest level of
  the PCIe tree, compared to the traditional path which sends data through the
  root complex.

* Patch overview:

** Part 1: netlink API

Gives user ability to bind dma-buf to an RX queue.

** Part 2: scatterlist support

Currently the standard for device memory sharing is DMABUF, which doesn't
generate struct pages. On the other hand, the networking stack (skbs, drivers,
and page pool) operates on pages. We have 2 options:

1. Generate struct pages for dmabuf device memory, or,
2. Modify the networking stack to process scatterlist.

Approach #1 was attempted in RFC v1. RFC v2 implements approach #2.

** Part 3: page pool support

We piggyback on the page pool memory providers proposal:
https://github.com/kuba-moo/linux/tree/pp-providers

It allows the page pool to define a memory provider that provides the page
allocation and freeing. It helps abstract most of the device memory TCP changes
from the driver.

** Part 4: support for unreadable skb frags

Page pool iovs are not accessible by the host; we implement changes throughout
the networking stack to correctly handle skbs with unreadable frags.

** Part 5: recvmsg() APIs

We define user APIs for the user to send and receive device memory.

Not included with this series is the GVE devmem TCP support, just to simplify
the review. Code available here if desired:
https://github.com/mina/linux/tree/tcpdevmem

This series is built on top of net-next with Jakub's pp-providers changes
cherry-picked.

* NIC dependencies:

1. (strict) Devmem TCP requires the NIC to support header split, i.e. the
   capability to split incoming packets into a header + payload and to put each
   into a separate buffer. Devmem TCP works by using device memory for the
   packet payload, and host memory for the packet headers.

2. (optional) Devmem TCP works better with flow steering support & RSS support,
   i.e. the NIC's ability to steer flows into certain rx queues. This allows the
   sysadmin to enable devmem TCP on a subset of the rx queues, and steer devmem
   TCP traffic onto these queues and non-devmem TCP traffic elsewhere.

The NIC I have access to with these properties is the GVE with DQO support
running in Google Cloud, but any NIC that supports these features would
suffice. I may be able to help reviewers bring up devmem TCP on their NICs.

* Testing:

The series includes a udmabuf kselftest that shows a simple use case of devmem
TCP and validates the entire data path end to end without a dependency on a
specific dmabuf provider.

** Test Setup

Kernel: net-next with this series and memory provider API cherry-picked
locally.

Hardware: Google Cloud A3 VMs.

NIC: GVE with header split & RSS & flow steering support.
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
2 parents 24b8c19 + d0caf98 commit e331673


54 files changed: +2757, -124 lines changed


Documentation/netlink/specs/netdev.yaml

Lines changed: 61 additions & 0 deletions

@@ -167,6 +167,10 @@ attribute-sets:
             "re-attached", they are just waiting to disappear.
             Attribute is absent if Page Pool has not been detached, and
             can still be used to allocate new memory.
+      -
+        name: dmabuf
+        doc: ID of the dmabuf this page-pool is attached to.
+        type: u32
   -
     name: page-pool-info
     subset-of: page-pool
@@ -268,6 +272,10 @@ attribute-sets:
         name: napi-id
         doc: ID of the NAPI instance which services this queue.
         type: u32
+      -
+        name: dmabuf
+        doc: ID of the dmabuf attached to this queue, if any.
+        type: u32

   -
     name: qstats
@@ -457,6 +465,39 @@ attribute-sets:
           Number of times driver re-started accepting send
           requests to this queue from the stack.
         type: uint
+  -
+    name: queue-id
+    subset-of: queue
+    attributes:
+      -
+        name: id
+      -
+        name: type
+  -
+    name: dmabuf
+    attributes:
+      -
+        name: ifindex
+        doc: netdev ifindex to bind the dmabuf to.
+        type: u32
+        checks:
+          min: 1
+      -
+        name: queues
+        doc: receive queues to bind the dmabuf to.
+        type: nest
+        nested-attributes: queue-id
+        multi-attr: true
+      -
+        name: fd
+        doc: dmabuf file descriptor to bind.
+        type: u32
+      -
+        name: id
+        doc: id of the dmabuf binding
+        type: u32
+        checks:
+          min: 1

 operations:
   list:
@@ -510,6 +551,7 @@ operations:
             - inflight
             - inflight-mem
             - detach-time
+            - dmabuf
       dump:
         reply: *pp-reply
       config-cond: page-pool
@@ -574,6 +616,7 @@ operations:
             - type
             - napi-id
             - ifindex
+            - dmabuf
       dump:
         request:
           attributes:
@@ -619,6 +662,24 @@ operations:
             - rx-bytes
             - tx-packets
             - tx-bytes
+    -
+      name: bind-rx
+      doc: Bind dmabuf to netdev
+      attribute-set: dmabuf
+      flags: [ admin-perm ]
+      do:
+        request:
+          attributes:
+            - ifindex
+            - fd
+            - queues
+        reply:
+          attributes:
+            - id
+
+kernel-family:
+  headers: [ "linux/list.h"]
+  sock-priv: struct list_head

 mcast-groups:
   list:
Documentation/networking/devmem.rst

Lines changed: 269 additions & 0 deletions

@@ -0,0 +1,269 @@

.. SPDX-License-Identifier: GPL-2.0

=================
Device Memory TCP
=================


Intro
=====

Device memory TCP (devmem TCP) enables receiving data directly into device
memory (dmabuf). The feature is currently implemented for TCP sockets.


Opportunity
-----------

A large number of data transfers have device memory as the source and/or
destination. Accelerators drastically increased the prevalence of such
transfers. Some examples include:

- Distributed training, where ML accelerators, such as GPUs on different hosts,
  exchange data.

- Distributed raw block storage applications transfer large amounts of data with
  remote SSDs. Much of this data does not require host processing.

Typically the Device-to-Device data transfers in the network are implemented as
the following low-level operations: Device-to-Host copy, Host-to-Host network
transfer, and Host-to-Device copy.

The flow involving host copies is suboptimal, especially for bulk data transfers,
and can put significant strains on system resources such as host memory
bandwidth and PCIe bandwidth.

Devmem TCP optimizes this use case by implementing socket APIs that enable
the user to receive incoming network packets directly into device memory.

Packet payloads go directly from the NIC to device memory.

Packet headers go to host memory and are processed by the TCP/IP stack
normally. The NIC must support header split to achieve this.

Advantages:

- Alleviate host memory bandwidth pressure, compared to existing
  network-transfer + device-copy semantics.

- Alleviate PCIe bandwidth pressure, by limiting data transfer to the lowest
  level of the PCIe tree, compared to the traditional path which sends data
  through the root complex.


More Info
---------

  slides, video
    https://netdevconf.org/0x17/sessions/talk/device-memory-tcp.html

  patchset
    [PATCH net-next v24 00/13] Device Memory TCP
    https://lore.kernel.org/netdev/[email protected]/


Interface
=========


Example
-------

tools/testing/selftests/net/ncdevmem.c:do_server shows an example of setting up
the RX path of this API.


NIC Setup
---------

Header split, flow steering, & RSS are required features for devmem TCP.

Header split is used to split incoming packets into a header buffer in host
memory, and a payload buffer in device memory.

Flow steering & RSS are used to ensure that only flows targeting devmem land on
an RX queue bound to devmem.

Enable header split & flow steering::

	# enable header split
	ethtool -G eth1 tcp-data-split on


	# enable flow steering
	ethtool -K eth1 ntuple on

Configure RSS to steer all traffic away from the target RX queue (queue 15 in
this example)::

	ethtool --set-rxfh-indir eth1 equal 15


The user must bind a dmabuf to any number of RX queues on a given NIC using
the netlink API::

	/* Bind dmabuf to NIC RX queue 15 */
	struct netdev_queue *queues;
	queues = malloc(sizeof(*queues) * 1);

	queues[0]._present.type = 1;
	queues[0]._present.idx = 1;
	queues[0].type = NETDEV_RX_QUEUE_TYPE_RX;
	queues[0].idx = 15;

	*ys = ynl_sock_create(&ynl_netdev_family, &yerr);

	req = netdev_bind_rx_req_alloc();
	netdev_bind_rx_req_set_ifindex(req, 1 /* ifindex */);
	netdev_bind_rx_req_set_dmabuf_fd(req, dmabuf_fd);
	__netdev_bind_rx_req_set_queues(req, queues, n_queue_index);

	rsp = netdev_bind_rx(*ys, req);

	dmabuf_id = rsp->dmabuf_id;


The netlink API returns a dmabuf_id: a unique ID that refers to this dmabuf
that has been bound.

The user can unbind the dmabuf from the netdevice by closing the netlink socket
that established the binding. We do this so that the binding is automatically
unbound even if the userspace process crashes.

Note that any reasonably well-behaved dmabuf from any exporter should work with
devmem TCP, even if the dmabuf is not actually backed by devmem. An example of
this is udmabuf, which wraps user memory (non-devmem) in a dmabuf.
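For testing, the ``dmabuf_fd`` passed to the binding call above can come from
udmabuf. A minimal sketch of creating one, assuming a kernel built with
CONFIG_UDMABUF; the buffer size and the helper name are illustrative only::

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <sys/ioctl.h>
	#include <sys/mman.h>
	#include <unistd.h>
	#include <linux/udmabuf.h>

	/* illustrative size: 64MB of user memory backing the dmabuf */
	#define DMABUF_SIZE (64UL * 1024 * 1024)

	int create_udmabuf(void)
	{
		struct udmabuf_create create = { 0 };
		int devfd, memfd, dmabuf_fd;

		devfd = open("/dev/udmabuf", O_RDWR);
		if (devfd < 0)
			return -1;

		/* udmabuf requires a sealable memfd as backing memory */
		memfd = memfd_create("devmem-test", MFD_ALLOW_SEALING);
		if (memfd < 0)
			return -1;
		if (ftruncate(memfd, DMABUF_SIZE))
			return -1;
		if (fcntl(memfd, F_ADD_SEALS, F_SEAL_SHRINK))
			return -1;

		create.memfd = memfd;
		create.offset = 0;
		create.size = DMABUF_SIZE;

		/* the returned fd is the dmabuf_fd used in the binding above */
		dmabuf_fd = ioctl(devfd, UDMABUF_CREATE, &create);
		return dmabuf_fd;
	}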
Socket Setup
------------

The socket must be flow steered to the dmabuf bound RX queue::

	ethtool -N eth1 flow-type tcp4 ... queue 15

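As an illustration of how a connection comes to match such a rule, the sketch
below binds a listening socket to a fixed local port so that an ntuple rule
matching that destination port steers the flow to the dmabuf-bound queue. The
port and queue numbers here are assumptions for the example, not requirements::

	#include <arpa/inet.h>
	#include <netinet/in.h>
	#include <sys/socket.h>

	/* listen on a fixed port so that a rule such as
	 *   ethtool -N eth1 flow-type tcp4 dst-port 5201 action 15
	 * lands this flow on the dmabuf-bound queue 15 */
	int listen_on_steered_port(void)
	{
		struct sockaddr_in addr = { 0 };
		int fd, one = 1;

		fd = socket(AF_INET, SOCK_STREAM, 0);
		if (fd < 0)
			return -1;
		setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

		addr.sin_family = AF_INET;
		addr.sin_port = htons(5201);	/* must match the steering rule */
		addr.sin_addr.s_addr = htonl(INADDR_ANY);

		if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)))
			return -1;
		if (listen(fd, 1))
			return -1;
		return fd;	/* accept() the connection, then recvmsg() on it */
	}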
Receiving data
--------------

The user application must signal to the kernel that it is capable of receiving
devmem data by passing the MSG_SOCK_DEVMEM flag to recvmsg::

	ret = recvmsg(fd, &msg, MSG_SOCK_DEVMEM);

Applications that do not specify the MSG_SOCK_DEVMEM flag will receive an EFAULT
on devmem data.

Devmem data is received directly into the dmabuf bound to the NIC in 'NIC
Setup', and the kernel signals such to the user via the SCM_DEVMEM_* cmsgs::

	for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
		if (cm->cmsg_level != SOL_SOCKET ||
		    (cm->cmsg_type != SCM_DEVMEM_DMABUF &&
		     cm->cmsg_type != SCM_DEVMEM_LINEAR))
			continue;

		dmabuf_cmsg = (struct dmabuf_cmsg *)CMSG_DATA(cm);

		if (cm->cmsg_type == SCM_DEVMEM_DMABUF) {
			/* Frag landed in dmabuf.
			 *
			 * dmabuf_cmsg->dmabuf_id is the dmabuf the
			 * frag landed on.
			 *
			 * dmabuf_cmsg->frag_offset is the offset into
			 * the dmabuf where the frag starts.
			 *
			 * dmabuf_cmsg->frag_size is the size of the
			 * frag.
			 *
			 * dmabuf_cmsg->frag_token is a token used to
			 * refer to this frag for later freeing.
			 */

			struct dmabuf_token token;
			token.token_start = dmabuf_cmsg->frag_token;
			token.token_count = 1;
			continue;
		}

		if (cm->cmsg_type == SCM_DEVMEM_LINEAR)
			/* Frag landed in linear buffer.
			 *
			 * dmabuf_cmsg->frag_size is the size of the
			 * frag.
			 */
			continue;

	}

Applications may receive 2 cmsgs:

- SCM_DEVMEM_DMABUF: this indicates the fragment landed in the dmabuf indicated
  by dmabuf_id.

- SCM_DEVMEM_LINEAR: this indicates the fragment landed in the linear buffer.
  This typically happens when the NIC is unable to split the packet at the
  header boundary, such that part (or all) of the payload landed in host
  memory.

Applications may receive no SO_DEVMEM_* cmsgs. That indicates non-devmem,
regular TCP data that landed on an RX queue not bound to a dmabuf.

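The loop above assumes ``msg`` was prepared with a control buffer large enough
for the expected cmsgs. A minimal sketch of that setup; the buffer sizes and
cmsg count are illustrative, and the headers providing ``struct dmabuf_cmsg``
and ``MSG_SOCK_DEVMEM`` are assumed to come from a kernel carrying this series::

	#include <sys/socket.h>
	#include <sys/uio.h>
	#include <linux/uio.h>	/* struct dmabuf_cmsg / dmabuf_token (uapi from this series) */

	char iobuf[8192];	/* receives any SCM_DEVMEM_LINEAR payload */
	/* room for a batch of dmabuf cmsgs; 32 per call is an arbitrary choice */
	char ctrl_data[CMSG_SPACE(sizeof(struct dmabuf_cmsg)) * 32];
	struct iovec iov = {
		.iov_base = iobuf,
		.iov_len = sizeof(iobuf),
	};
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = ctrl_data,
		.msg_controllen = sizeof(ctrl_data),
	};
	ssize_t ret;

	ret = recvmsg(fd, &msg, MSG_SOCK_DEVMEM);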
Freeing frags
-------------

Frags received via SCM_DEVMEM_DMABUF are pinned by the kernel while the user
processes the frag. The user must return the frag to the kernel via
SO_DEVMEM_DONTNEED::

	ret = setsockopt(client_fd, SOL_SOCKET, SO_DEVMEM_DONTNEED, &token,
			 sizeof(token));

The user must ensure the tokens are returned to the kernel in a timely manner.
Failure to do so will exhaust the limited dmabuf that is bound to the RX queue
and will lead to packet drops.

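Returning tokens one at a time can be costly on a busy flow; the pattern used
by the selftest is to accumulate tokens from the cmsg loop and return them as
an array in a single ``setsockopt()`` call. A rough sketch, with the batch size
chosen arbitrarily::

	#define TOKEN_BATCH 128	/* arbitrary batch size for this sketch */

	struct dmabuf_token tokens[TOKEN_BATCH];
	size_t n_tokens = 0;

	/* ... inside the cmsg loop, for each SCM_DEVMEM_DMABUF frag ... */
	tokens[n_tokens].token_start = dmabuf_cmsg->frag_token;
	tokens[n_tokens].token_count = 1;
	n_tokens++;

	/* once the batch is full (or the payload has been consumed),
	 * hand all the frags back to the kernel in one call */
	if (n_tokens == TOKEN_BATCH) {
		if (setsockopt(client_fd, SOL_SOCKET, SO_DEVMEM_DONTNEED,
			       tokens, sizeof(tokens)) < 0)
			return -1;	/* unreturned tokens keep frags pinned */
		n_tokens = 0;
	}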
Implementation & Caveats
========================

Unreadable skbs
---------------

Devmem payloads are inaccessible to the kernel processing the packets. This
results in a few quirks for payloads of devmem skbs:

- Loopback is not functional. Loopback relies on copying the payload, which is
  not possible with devmem skbs.

- Software checksum calculation fails.

- TCP Dump and bpf can't access devmem packet payloads.


Testing
=======

More realistic example code can be found in the kernel source under
``tools/testing/selftests/net/ncdevmem.c``

ncdevmem is a devmem TCP netcat. It works very similarly to netcat, but
receives data directly into a udmabuf.

To run ncdevmem, you need to run it on a server on the machine under test, and
you need to run netcat on a peer to provide the TX data.

ncdevmem has a validation mode as well that expects a repeating pattern of
incoming data and validates it as such. For example, you can launch
ncdevmem on the server by::

	ncdevmem -s <server IP> -c <client IP> -f eth1 -d 3 -n 0000:06:00.0 -l \
		 -p 5201 -v 7

On client side, use regular netcat to send TX data to ncdevmem process
on the server::

	yes $(echo -e \\x01\\x02\\x03\\x04\\x05\\x06) | \
		tr \\n \\0 | head -c 5G | nc <server IP> 5201 -p 5201
Documentation/networking/index.rst

Lines changed: 1 addition & 0 deletions

@@ -49,6 +49,7 @@ Contents:
   cdc_mbim
   dccp
   dctcp
+  devmem
   dns_resolver
   driver
   eql

arch/alpha/include/uapi/asm/socket.h

Lines changed: 6 additions & 0 deletions

@@ -140,6 +140,12 @@
 #define SO_PASSPIDFD		76
 #define SO_PEERPIDFD		77

+#define SO_DEVMEM_LINEAR	78
+#define SCM_DEVMEM_LINEAR	SO_DEVMEM_LINEAR
+#define SO_DEVMEM_DMABUF	79
+#define SCM_DEVMEM_DMABUF	SO_DEVMEM_DMABUF
+#define SO_DEVMEM_DONTNEED	80
+
 #if !defined(__KERNEL__)

 #if __BITS_PER_LONG == 64
