 | 1 | +---  | 
 | 2 | +layout: post  | 
 | 3 | +title:  "Introducing TCP-in-UDP solution"  | 
 | 4 | +---  | 
 | 5 | + | 
 | 6 | +The MPTCP protocol is complex, mainly to be able to survive on the Internet  | 
 | 7 | +where  | 
 | 8 | +[middleboxes](https://datatracker.ietf.org/doc/html/rfc8684#name-interactions-with-middlebox)  | 
 | 9 | +such as NATs, firewalls, IDS or proxies can modify parts of the TCP packets.  | 
 | 10 | +In the worst case, an MPTCP connection should fall back to "plain" TCP. Today,  | 
 | 11 | +such fallbacks are rarer than before -- probably because MPTCP has been used  | 
 | 12 | +since 2013 on millions of Apple smartphones worldwide -- but they can still  | 
 | 13 | +exist, e.g. on some mobile networks using Performance Enhancing Proxies (PEPs)  | 
 | 14 | +where MPTCP connections are not bypassed. In such cases, a solution to continue  | 
 | 15 | +benefiting from MPTCP is to tunnel the MPTCP connections. Different solutions  | 
 | 16 | +exist, but they usually add extra layers, and require setting up a virtual private  | 
 | 17 | +network (VPN) with private IP addresses between the client and the server.  | 
 | 18 | + | 
 | 19 | +Here, a simpler solution is presented:  | 
 | 20 | +[TCP-in-UDP](https://github.com/multipath-tcp/tcp-in-udp). This solution relies  | 
 | 21 | +on [eBPF](https://ebpf.io/), doesn't add extra data per packet, and doesn't  | 
 | 22 | +require a virtual private network. Read on to find out more about that!  | 
 | 23 | + | 
 | 24 | +<!--more-->  | 
 | 25 | + | 
 | 26 | +--------------------------------------------------------------------------------  | 
 | 27 | + | 
 | 28 | +> First, if the network you use blocks TCP extensions like MPTCP or other  | 
 | 29 | +> protocols, the best thing to do is to contact your network operator: maybe  | 
 | 30 | +> they are simply not aware of this issue, and can easily fix it.  | 
 | 31 | + | 
 | 32 | +## TCP-in-UDP  | 
 | 33 | + | 
 | 34 | +Many tunnel solutions exist, but they have other use-cases: getting access to  | 
 | 35 | +private networks, possibly with encryption -- with solutions like OpenVPN,  | 
 | 36 | +IPSec, WireGuard®, etc. -- or to add extra info in each packet for routing  | 
 | 37 | +purposes -- like GRE, GENEVE, etc. The Linux kernel  | 
 | 38 | +[supports](https://developers.redhat.com/blog/2019/05/17/an-introduction-to-linux-virtual-interfaces-tunnels)  | 
 | 39 | +many of these tunnels. In our case, the goal is not to get access to private  | 
 | 40 | +networks and not to add an extra layer of encryption, but to make sure packets  | 
 | 41 | +are not being modified by the network.  | 
 | 42 | + | 
 | 43 | +For our use-case, it is then enough to "convert the TCP packets into UDP". This  | 
 | 44 | +is what [TCP-in-UDP](https://github.com/multipath-tcp/tcp-in-udp) is doing. This  | 
 | 45 | +idea is not new: it is inspired by an old [IETF  | 
 | 46 | +draft](https://datatracker.ietf.org/doc/html/draft-cheshire-tcp-over-udp-00.html).  | 
 | 47 | +In short, items from the TCP header are re-ordered to start with items from the  | 
 | 48 | +UDP header.  | 
 | 49 | + | 
 | 50 | +### TCP to UDP header  | 
 | 51 | + | 
 | 52 | +To better understand the translation, let's see what the different headers look  | 
 | 53 | +like:  | 
 | 54 | + | 
 | 55 | +- [UDP](https://www.ietf.org/rfc/rfc768.html):  | 
 | 56 | + | 
 | 57 | +```  | 
 | 58 | + 0                   1                   2                   3  | 
 | 59 | + 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1  | 
 | 60 | ++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+  | 
 | 61 | +|          Source Port          |       Destination Port        |  | 
 | 62 | ++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+  | 
 | 63 | +|            Length             |           Checksum            |  | 
 | 64 | ++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+  | 
 | 65 | +```  | 
 | 66 | + | 
 | 67 | +- [TCP](https://www.ietf.org/rfc/rfc9293.html):  | 
 | 68 | + | 
 | 69 | +```  | 
 | 70 | + 0                   1                   2                   3  | 
 | 71 | + 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1  | 
 | 72 | ++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+  | 
 | 73 | +|          Source Port          |       Destination Port        |  | 
 | 74 | ++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+  | 
 | 75 | +|                        Sequence Number                        |  | 
 | 76 | ++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+  | 
 | 77 | +|                    Acknowledgment Number                      |  | 
 | 78 | ++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+  | 
 | 79 | +|  Data |       |C|E|U|A|P|R|S|F|                               |  | 
 | 80 | +| Offset| Reser |R|C|R|C|S|S|Y|I|            Window             |  | 
 | 81 | +|       |       |W|E|G|K|H|T|N|N|                               |  | 
 | 82 | ++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+  | 
 | 83 | +|           Checksum            |         Urgent Pointer        |  | 
 | 84 | ++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+  | 
 | 85 | +|                      (Optional) Options                       |  | 
 | 86 | ++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+  | 
 | 87 | +```  | 
 | 88 | + | 
 | 89 | +- [TCP-in-UDP](https://datatracker.ietf.org/doc/html/draft-cheshire-tcp-over-udp-00.html):  | 
 | 90 | + | 
 | 91 | +```  | 
 | 92 | + 0                   1                   2                   3  | 
 | 93 | + 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1  | 
 | 94 | ++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+  | 
 | 95 | +|          Source Port          |       Destination Port        |  | 
 | 96 | ++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+  | 
 | 97 | +|            Length             |           Checksum            |  | 
 | 98 | ++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+  | 
 | 99 | +|  Data |       |C|E| |A|P|R|S|F|                               |  | 
 | 100 | +| Offset| Reser |R|C|0|C|S|S|Y|I|            Window             |  | 
 | 101 | +|       |       |W|E| |K|H|T|N|N|                               |  | 
 | 102 | ++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+  | 
 | 103 | +|                        Sequence Number                        |  | 
 | 104 | ++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+  | 
 | 105 | +|                    Acknowledgment Number                      |  | 
 | 106 | ++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+  | 
 | 107 | +|                      (Optional) Options                       |  | 
 | 108 | ++-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+  | 
 | 109 | +```  | 
 | 110 | + | 
 | 111 | +As described  | 
 | 112 | +[here](https://perso.uclouvain.be/olivier.bonaventure/blog/html/2013/07/04/tcp_over_udp.html),  | 
 | 113 | +the first eight bytes of the TCP-in-UDP header correspond to the classical UDP  | 
 | 114 | +header. Then, the Data Offset is placed with the flags and the window field.  | 
 | 115 | +Placing the Data Offset after the Checksum ensures that a value of at least  | 
 | 116 | +`0x5` will appear there, which is required for STUN traversal. Then, the  | 
 | 117 | +sequence numbers and acknowledgment numbers follow. With this translation, the  | 
 | 118 | +TCP header has been reordered, but starts with a UDP header without modifying  | 
 | 119 | +the packet length. The informed reader will have noticed that the `URG` flag and  | 
 | 120 | +the `Urgent Pointer` have disappeared. They are rarely used and some  | 
 | 121 | +middleboxes reset them. This is not a huge loss for most TCP applications.  | 
 | 122 | + | 
 | 123 | +In other words, apart from a different order, the only two modifications are:  | 
 | 124 | + | 
 | 125 | +- the layer 4 protocol indicated in layer 3 (IPv4/IPv6)  | 
 | 126 | +- the switch from `Urgent Pointer` to `Length` (and the opposite)  | 
 | 127 | + | 
 | 128 | +These two modifications will of course affect the Checksum field that will need  | 
 | 129 | +to be updated accordingly.  | 
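
The reordering described above can be sketched in plain userspace C. This is only an illustration working on a byte array, not the project's eBPF code: the function name `tcp_to_tcp_in_udp` and its parameters are hypothetical, TCP options are ignored, and the checksum is copied as-is even though it still needs to be adjusted afterwards.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TCP_HDR_LEN 20 /* fixed part of the TCP header, options excluded */

/*
 * Reorder the fixed 20-byte TCP header into the TCP-in-UDP layout.
 * All fields stay in network byte order; only their position changes.
 * udp_len_be is the UDP Length, already in network byte order.
 */
static void tcp_to_tcp_in_udp(const uint8_t tcp[TCP_HDR_LEN],
                              uint8_t out[TCP_HDR_LEN],
                              const uint8_t udp_len_be[2])
{
    assert(tcp != out);               /* source and destination must differ */

    memcpy(out + 0,  tcp + 0,  4);    /* source + destination ports */
    memcpy(out + 4,  udp_len_be, 2);  /* UDP Length replaces Urgent Pointer */
    memcpy(out + 6,  tcp + 16, 2);    /* checksum (still to be adjusted) */
    memcpy(out + 8,  tcp + 12, 4);    /* data offset, flags, window */
    memcpy(out + 12, tcp + 4,  4);    /* sequence number */
    memcpy(out + 16, tcp + 8,  4);    /* acknowledgment number */
    out[9] &= (uint8_t)~0x20;         /* the URG flag is always cleared */
}
```

The output has the same size as the input, which matches the point made above: the header is reordered, but the packet length does not change.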
 | 130 | + | 
 | 131 | +## Dealing with network stack optimisations  | 
 | 132 | + | 
 | 133 | +On paper, the required modifications -- the protocol, a 16-bit word, and the  | 
 | 134 | +checksum -- are small, and should be easy to do using eBPF with TC ingress and  | 
 | 135 | +egress hooks. But doing that in a highly optimised stack is more complex than  | 
 | 136 | +expected.  | 
 | 137 | + | 
 | 138 | +### Accessing all required data  | 
 | 139 | + | 
 | 140 | +On Linux, all per-packet data are stored in a socket buffer, or  | 
 | 141 | +"[SKB](http://oldvger.kernel.org/~davem/skb.html)". In our case here, the eBPF  | 
 | 142 | +code needs to access the packet header, which should be available between  | 
 | 143 | +`skb->data` and `skb->data_end`. Except that `skb->data_end` might not point to  | 
 | 144 | +the end of the packet: typically, it points to the end of the packet header.  | 
 | 145 | +This is an optimisation, because the kernel will often do operations depending  | 
 | 146 | +on the packet header, and it doesn't really care about the content of the data,  | 
 | 147 | +which is usually more for the userspace, or to be forwarded to another network  | 
 | 148 | +interface.  | 
 | 149 | + | 
 | 150 | +In our case, in egress -- translation from TCP to UDP -- it is fine: the whole  | 
 | 151 | +TCP header is available, and that's where the modifications will need to be  | 
 | 152 | +done. In ingress -- translation from UDP to TCP -- that's different: some  | 
 | 153 | +network drivers will only align data going up to the end of the layer 4  | 
 | 154 | +protocol, so the 8 bytes of the UDP header here. This is not enough to do the  | 
 | 155 | +translation, as 12 more bytes need to be accessed. This issue is easy  | 
 | 156 | +to fix: eBPF helpers were introduced a long time ago to pull in non-linear data,  | 
 | 157 | +e.g. via  | 
 | 158 | +[`bpf_skb_pull_data`](https://docs.ebpf.io/linux/helper-function/bpf_skb_pull_data/)  | 
 | 159 | +or  | 
 | 160 | +[`bpf_skb_load_bytes`](https://docs.ebpf.io/linux/helper-function/bpf_skb_load_bytes/).  | 
 | 161 | + | 
 | 162 | +### GRO & TSO/GSO  | 
 | 163 | + | 
 | 164 | +On the Internet, packets are usually limited to 1500 bytes or fewer. Each packet  | 
 | 165 | +still needs to carry some headers to indicate the source and destination, but  | 
 | 166 | +also per-packet information like the data sequence number. Having to deal with  | 
 | 167 | +"small" packets has a cost, which can become significant at very high  | 
 | 168 | +throughput. To counter that, the Linux networking stack will prefer to deal with  | 
 | 169 | +bigger chunks of data, with "internal" packets of tens of kilobytes, and split  | 
 | 170 | +the packet into smaller ones with very similar headers later on. Some network  | 
 | 171 | +devices can even do this segmentation or aggregation work in hardware. That's  | 
 | 172 | +what GRO (Generic Receive Offload), and TSO (TCP Segmentation Offload) / GSO  | 
 | 173 | +(Generic Segmentation Offload) are for.  | 
 | 174 | + | 
 | 175 | +With TCP-in-UDP, it is required to act on a per-packet basis: each TCP packet  | 
 | 176 | +will be translated to UDP, which will contain the UDP header (8 bytes), the rest  | 
 | 177 | +of the TCP one (12 bytes + the TCP options), then the TCP payload. In other  | 
 | 178 | +words, for each UDP packet, the UDP payload will contain a part of the TCP  | 
 | 179 | +header: data that is per-packet specific. It means that the traditional GRO and  | 
 | 180 | +TSO cannot be used because the data cannot "simply" be merged with the next one  | 
 | 181 | +like before.  | 
 | 182 | + | 
 | 183 | +Informed readers will then say that these network device features can be easily  | 
 | 184 | +disabled using `ethtool`, e.g.  | 
 | 185 | + | 
 | 186 | +```  | 
 | 187 | +ethtool -K "${IFACE}" gro off gso off tso off  | 
 | 188 | +```  | 
 | 189 | + | 
 | 190 | +Correct, but even if all hardware offload accelerations are disabled, in egress,  | 
 | 191 | +the Linux networking stack still has an interest in dealing with bigger packets  | 
 | 192 | +internally, and do the segmentation in software at the end. Because it is not  | 
 | 193 | +easily possible to modify how the segmentation will be done with eBPF, it is  | 
 | 194 | +required to tell the stack not to do this optimisation, e.g. with:  | 
 | 195 | + | 
 | 196 | +```  | 
 | 197 | +ip link set "${IFACE}" gso_max_segs 1  | 
 | 198 | +```  | 
 | 199 | + | 
 | 200 | +### Checksum  | 
 | 201 | + | 
 | 202 | +> The following was certainly the most frustrating issue to deal with!  | 
 | 203 | + | 
 | 204 | +Thanks to how the checksum is  | 
 | 205 | +[computed](https://datatracker.ietf.org/doc/html/rfc1071), moving some 16-bit  | 
 | 206 | +words or bigger around doesn't change the checksum. Still, some fields need to  | 
 | 207 | +be updated:  | 
 | 208 | + | 
 | 209 | +- The layer 4 protocol, set in layer 3 (IPv4/IPv6) here, also used to compute  | 
 | 210 | +  the next layer (UDP/TCP) checksum.  | 
 | 211 | +- The switch from the TCP `Urgent Pointer` (`0`) to the UDP `Length` (and the  | 
 | 212 | +  opposite).  | 
 | 213 | + | 
 | 214 | +It is not required to recompute the full checksum. Instead, this can be done  | 
 | 215 | +[incrementally](https://datatracker.ietf.org/doc/html/rfc1141), and some eBPF  | 
 | 216 | +helpers can do that for us, e.g.  | 
 | 217 | +[`bpf_l3_csum_replace`](https://docs.ebpf.io/linux/helper-function/bpf_l3_csum_replace/)  | 
 | 218 | +and  | 
 | 219 | +[`bpf_l4_csum_replace`](https://docs.ebpf.io/linux/helper-function/bpf_l4_csum_replace/).  | 
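
To make both properties concrete, here is a small userspace sketch, not the eBPF helpers themselves. `inet_csum` and `csum_replace16` are hypothetical names; the incremental update uses the form from RFC 1624 (which updates RFC 1141), and buffers are assumed to have an even length.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* RFC 1071 Internet checksum over an even number of bytes. */
static uint16_t inet_csum(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;

    assert(len % 2 == 0);
    for (size_t i = 0; i < len; i += 2)
        sum += (uint32_t)(buf[i] << 8 | buf[i + 1]);
    while (sum >> 16)                 /* fold the carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Incremental update when one 16-bit word changes from old_w to new_w:
 * HC' = ~(~HC + ~m + m'), as in RFC 1624. */
static uint16_t csum_replace16(uint16_t csum, uint16_t old_w, uint16_t new_w)
{
    uint32_t sum = (uint16_t)~csum;

    sum += (uint16_t)~old_w;
    sum += new_w;
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}
```

Because the sum is commutative, calling `inet_csum` on the same 16-bit words in a different order returns the same value. Replacing a zero `Urgent Pointer` with a `Length` word then only needs one `csum_replace16(csum, 0, length)` call rather than a pass over the whole packet.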
 | 220 | + | 
 | 221 | +When testing with network namespaces (`netns`) with one host dedicated to the  | 
 | 222 | +translation when forwarding packets, everything was fine: the correct checksum  | 
 | 223 | +was visible in each packet. But when testing with real hardware, with TCP-in-UDP  | 
 | 224 | +eBPF hooks directly on the client and server, that was different: the checksum  | 
 | 225 | +in egress was incorrect on most network interfaces, even when the transmission  | 
 | 226 | +checksum offload (`tx`) was disabled on the network interface.  | 
 | 227 | + | 
 | 228 | +After quite a bit of investigation, it appears that both the layer 3 and 4  | 
 | 229 | +checksums were correctly updated by the eBPF hook, but either the NIC or the  | 
 | 230 | +networking stack was modifying the layer 4 checksum at the wrong place. This  | 
 | 231 | +deserves some explanation.  | 
 | 232 | + | 
 | 233 | +In egress, the Linux TCP networking stack of the sender will typically set  | 
 | 234 | +`skb->ip_summed` to `CHECKSUM_PARTIAL`. In short, it means the TCP/IP stack will  | 
 | 235 | +compute a part of the checksum, only the one covering the  | 
 | 236 | +[pseudo-header](https://www.ietf.org/rfc/rfc9293.html#section-3.1-6.18.1): IP  | 
 | 237 | +addresses, protocol number and length. The rest will be computed later on,  | 
 | 238 | +ideally by the networking device. At that last stage, the device only needs to  | 
 | 239 | +know where the layer 4 starts in the packet, but also where the checksum field  | 
 | 240 | +is from the start of this layer 4. This info is internally registered in  | 
 | 241 | +`skb->csum_offset`, and it is different for TCP and UDP because the checksum  | 
 | 242 | +field is not at the same place in their headers.  | 
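
The difference in checksum position can be checked with a short sketch. The `demo_` structures below are simplified local definitions written for this illustration, not the kernel's own headers, and `demo_csum_offset` is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified header layouts for illustration (not the kernel's structs). */
struct demo_udphdr {
    uint16_t source, dest, len, check;
};

struct demo_tcphdr {
    uint16_t source, dest;
    uint32_t seq, ack_seq;
    uint16_t off_flags, window, check, urg_ptr;
};

/* Make sure the compiler did not add padding to the layouts above. */
static_assert(sizeof(struct demo_udphdr) == 8, "UDP header is 8 bytes");
static_assert(sizeof(struct demo_tcphdr) == 20, "TCP header is 20 bytes");

/* Distance from the start of layer 4 to the checksum field. */
static size_t demo_csum_offset(int is_udp)
{
    return is_udp ? offsetof(struct demo_udphdr, check)
                  : offsetof(struct demo_tcphdr, check);
}
```

With these layouts, the checksum sits 16 bytes into a TCP header but 6 bytes into a UDP header. A later stage still using the TCP offset on a translated packet would therefore write into bytes 16 and 17, which hold the start of the Acknowledgment Number in the TCP-in-UDP layout.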
 | 243 | + | 
 | 244 | +When switching from TCP to UDP, it is then not enough to change the protocol  | 
 | 245 | +number in the layer 3, this internal checksum offset value also needs to be  | 
 | 246 | +updated. If I'm not mistaken, today, it is not possible to update it directly  | 
 | 247 | +with eBPF. A proper solution is certainly to add a new eBPF helper, but that  | 
 | 248 | +would only work with newer kernels, or possibly with a custom module. Instead,  | 
 | 249 | +a workaround has been found: chain the eBPF TC egress hook with a TC `ACT_CSUM`  | 
 | 250 | +action when the packet is translated from TCP to UDP. This [`csum`  | 
 | 251 | +action](https://www.man7.org/linux/man-pages/man8/tc-csum.8.html) triggers a  | 
 | 252 | +software checksum recalculation of the specified packet headers. In other words  | 
 | 253 | +and in our case, it is used to compute the rest of the checksum for a given  | 
 | 254 | +protocol (UDP), and mark the checksum as computed (`CHECKSUM_NONE`). This last  | 
 | 255 | +step is important, because even if it is possible to compute the full checksum  | 
 | 256 | +with eBPF code like we did at some point, it is wrong to do so if we cannot  | 
 | 257 | +change the `CHECKSUM_PARTIAL` flag, which expects a later stage to update a  | 
 | 258 | +checksum at a (wrong) offset with the rest of the data.  | 
 | 259 | + | 
 | 260 | +So with a combination of both TC `ACT_CSUM` and eBPF, it is possible to get the  | 
 | 261 | +right checksum after having modified the layer 4 protocol.  | 
 | 262 | + | 
 | 263 | +### MTU/MSS  | 
 | 264 | + | 
 | 265 | +This is not linked to the highly optimised Linux network stack, but, on the  | 
 | 266 | +wire, the packets will be in UDP and not TCP. It means that some operations like  | 
 | 267 | +the dynamic adaptation of the MSS (TCP Maximum Segment Size) -- aka MSS clamping  | 
 | 268 | +-- will have no effect here. Many mobile networks use encapsulation without  | 
 | 269 | +jumbo frames, meaning that the maximum size is lower than 1500 bytes. For  | 
 | 270 | +performance reasons, and not to have to deal with this, it is important to avoid  | 
 | 271 | +IP fragmentation. In other words, it might be required to adapt the interface  | 
 | 272 | +Maximum Transmission Unit (MTU), or the  | 
 | 273 | +[MTU](https://man.archlinux.org/man/ip-route.8.en#mtu) /  | 
 | 274 | +[MSS](https://man.archlinux.org/man/ip-route.8.en#advmss) per destination.  | 
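
As a rough guide for picking these values, the arithmetic can be sketched as follows. It assumes fixed-size IP headers without options or extension headers, and relies on the fact that the translation does not add bytes: the 8-byte UDP header plus the 12 remaining TCP bytes take the same room as a 20-byte TCP header. `demo_mss` is a hypothetical helper, and any path MTU passed to it is an example value, not a measurement.

```c
#include <assert.h>

/* Largest TCP payload fitting in one packet for a given path MTU. */
static int demo_mss(int path_mtu, int is_ipv6)
{
    const int ip_hdr = is_ipv6 ? 40 : 20; /* fixed IPv6 / IPv4 header */
    const int l4_hdr = 20;                /* 8 bytes UDP + 12 remaining TCP bytes */

    assert(path_mtu > ip_hdr + l4_hdr);
    return path_mtu - ip_hdr - l4_hdr;
}
```

For instance, `demo_mss(1500, 0)` gives 1460, while a path limited to 1400 bytes over IPv6 gives `demo_mss(1400, 1)`, i.e. 1340. TCP options further reduce the room left for data in each packet.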
 | 275 | + | 
 | 276 | +## Conclusion  | 
 | 277 | + | 
 | 278 | +In conclusion, this new eBPF program can be easily deployed on both the client  | 
 | 279 | +and server sides to circumvent middleboxes that are still blocking MPTCP or  | 
 | 280 | +other protocols. All you might still need to do is to modify the destination  | 
 | 281 | +port which is [currently  | 
 | 282 | +hardcoded](https://github.com/multipath-tcp/tcp-in-udp/blob/cde92b1cf8588f7cd3932b204cd51e0596a07ade/tcp_in_udp_tc.c#L29).  | 
 | 283 | + | 
 | 284 | +## Acknowledgments  | 
 | 285 | + | 
 | 286 | +Thanks to [Xpedite Technologies](https://xpedite-tech.com) for having supported  | 
 | 287 | +this work, and in particular [Chester](https://github.com/arinc9) for his help  | 
 | 288 | +investigating the checksum issues with real hardware. Also thanks to Nickz from  | 
 | 289 | +the eBPF.io community for his support while working on these checksum issues.  | 