Skip to content

Commit a502ea6

Browse files
sbrivio-rhdavem330
authored andcommitted
udp: Deal with race between UDP socket address change and rehash
If a UDP socket changes its local address while it's receiving datagrams, as a result of connect(), there is a period during which a lookup operation might fail to find it, after the address is changed but before the secondary hash (port and address) and the four-tuple hash (local and remote ports and addresses) are updated. Secondary hash chains were introduced by commit 30fff92 ("udp: bind() optimisation") and, as a result, a rehash operation became needed to make a bound socket reachable again after a connect(). This operation was introduced by commit 719f835 ("udp: add rehash on connect()") which isn't however a complete fix: the socket will be found once the rehashing completes, but not while it's pending. This is noticeable with a socat(1) server in UDP4-LISTEN mode, and a client sending datagrams to it. After the server receives the first datagram (cf. _xioopen_ipdgram_listen()), it issues a connect() to the address of the sender, in order to set up a directed flow. Now, if the client, running on a different CPU thread, happens to send a (subsequent) datagram while the server's socket changes its address, but is not rehashed yet, this will result in a failed lookup and a port unreachable error delivered to the client, as apparent from the following reproducer: LEN=$(($(cat /proc/sys/net/core/wmem_default) / 4)) dd if=/dev/urandom bs=1 count=${LEN} of=tmp.in while :; do taskset -c 1 socat UDP4-LISTEN:1337,null-eof OPEN:tmp.out,create,trunc & sleep 0.1 || sleep 1 taskset -c 2 socat OPEN:tmp.in UDP4:localhost:1337,shut-null wait done where the client will eventually get ECONNREFUSED on a write() (typically the second or third one of a given iteration): 2024/11/13 21:28:23 socat[46901] E write(6, 0x556db2e3c000, 8192): Connection refused This issue was first observed as a seldom failure in Podman's tests checking UDP functionality while using pasta(1) to connect the container's network namespace, which leads us to a reproducer with the lookup error resulting in an ICMP packet on a tap device: LOCAL_ADDR="$(ip -j -4 addr show|jq -rM '.[] | .addr_info[0] | select(.scope == "global").local')" while :; do ./pasta --config-net -p pasta.pcap -u 1337 socat UDP4-LISTEN:1337,null-eof OPEN:tmp.out,create,trunc & sleep 0.2 || sleep 1 socat OPEN:tmp.in UDP4:${LOCAL_ADDR}:1337,shut-null wait cmp tmp.in tmp.out done Once this fails: tmp.in tmp.out differ: char 8193, line 29 we can finally have a look at what's going on: $ tshark -r pasta.pcap 1 0.000000 :: ? ff02::16 ICMPv6 110 Multicast Listener Report Message v2 2 0.168690 88.198.0.161 ? 88.198.0.164 UDP 8234 60260 ? 1337 Len=8192 3 0.168767 88.198.0.161 ? 88.198.0.164 UDP 8234 60260 ? 1337 Len=8192 4 0.168806 88.198.0.161 ? 88.198.0.164 UDP 8234 60260 ? 1337 Len=8192 5 0.168827 c6:47:05:8d:dc:04 ? Broadcast ARP 42 Who has 88.198.0.161? Tell 88.198.0.164 6 0.168851 9a:55:9a:55:9a:55 ? c6:47:05:8d:dc:04 ARP 42 88.198.0.161 is at 9a:55:9a:55:9a:55 7 0.168875 88.198.0.161 ? 88.198.0.164 UDP 8234 60260 ? 1337 Len=8192 8 0.168896 88.198.0.164 ? 88.198.0.161 ICMP 590 Destination unreachable (Port unreachable) 9 0.168926 88.198.0.161 ? 88.198.0.164 UDP 8234 60260 ? 1337 Len=8192 10 0.168959 88.198.0.161 ? 88.198.0.164 UDP 8234 60260 ? 1337 Len=8192 11 0.168989 88.198.0.161 ? 88.198.0.164 UDP 4138 60260 ? 1337 Len=4096 12 0.169010 88.198.0.161 ? 88.198.0.164 UDP 42 60260 ? 1337 Len=0 On the third datagram received, the network namespace of the container initiates an ARP lookup to deliver the ICMP message. In another variant of this reproducer, starting the client with: strace -f pasta --config-net -u 1337 socat UDP4-LISTEN:1337,null-eof OPEN:tmp.out,create,trunc 2>strace.log & and connecting to the socat server using a loopback address: socat OPEN:tmp.in UDP4:localhost:1337,shut-null we can more clearly observe a sendmmsg() call failing after the first datagram is delivered: [pid 278012] connect(173, 0x7fff96c95fc0, 16) = 0 [...] [pid 278012] recvmmsg(173, 0x7fff96c96020, 1024, MSG_DONTWAIT, NULL) = -1 EAGAIN (Resource temporarily unavailable) [pid 278012] sendmmsg(173, 0x561c5ad0a720, 1, MSG_NOSIGNAL) = 1 [...] [pid 278012] sendmmsg(173, 0x561c5ad0a720, 1, MSG_NOSIGNAL) = -1 ECONNREFUSED (Connection refused) and, somewhat confusingly, after a connect() on the same socket succeeded. Until commit 4cdeeee ("net: udp: prefer listeners bound to an address"), the race between receive address change and lookup didn't actually cause visible issues, because, once the lookup based on the secondary hash chain failed, we would still attempt a lookup based on the primary hash (destination port only), and find the socket with the outdated secondary hash. That change, however, dropped port-only lookups altogether, as side effect, making the race visible. To fix this, while avoiding the need to make address changes and rehash atomic against lookups, reintroduce primary hash lookups as fallback, if lookups based on four-tuple and secondary hashes fail. To this end, introduce a simplified lookup implementation, which doesn't take care of SO_REUSEPORT groups: if we have one, there are multiple sockets that would match the four-tuple or secondary hash, meaning that we can't run into this race at all. v2: - instead of synchronising lookup operations against address change plus rehash, reintroduce a simplified version of the original primary hash lookup as fallback v1: - fix build with CONFIG_IPV6=n: add ifdef around sk_v6_rcv_saddr usage (Kuniyuki Iwashima) - directly use sk_rcv_saddr for IPv4 receive addresses instead of fetching inet_rcv_saddr (Kuniyuki Iwashima) - move inet_update_saddr() to inet_hashtables.h and use that to set IPv4/IPv6 addresses as suitable (Kuniyuki Iwashima) - rebase onto net-next, update commit message accordingly Reported-by: Ed Santiago <[email protected]> Link: containers/podman#24147 Analysed-by: David Gibson <[email protected]> Fixes: 30fff92 ("udp: bind() optimisation") Signed-off-by: Stefano Brivio <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Reviewed-by: Willem de Bruijn <[email protected]> Signed-off-by: David S. Miller <[email protected]>
1 parent ae418e9 commit a502ea6

File tree

2 files changed

+106
-0
lines changed

2 files changed

+106
-0
lines changed

net/ipv4/udp.c

Lines changed: 56 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -420,6 +420,49 @@ u32 udp_ehashfn(const struct net *net, const __be32 laddr, const __u16 lport,
420420
}
421421
EXPORT_SYMBOL(udp_ehashfn);
422422

423+
/**
424+
* udp4_lib_lookup1() - Simplified lookup using primary hash (destination port)
425+
* @net: Network namespace
426+
* @saddr: Source address, network order
427+
* @sport: Source port, network order
428+
* @daddr: Destination address, network order
429+
* @hnum: Destination port, host order
430+
* @dif: Destination interface index
431+
* @sdif: Destination bridge port index, if relevant
432+
* @udptable: Set of UDP hash tables
433+
*
434+
* Simplified lookup to be used as fallback if no sockets are found due to a
435+
* potential race between (receive) address change, and lookup happening before
436+
* the rehash operation. This function ignores SO_REUSEPORT groups while scoring
437+
* result sockets, because if we have one, we don't need the fallback at all.
438+
*
439+
* Called under rcu_read_lock().
440+
*
441+
* Return: socket with highest matching score if any, NULL if none
442+
*/
443+
static struct sock *udp4_lib_lookup1(const struct net *net,
444+
__be32 saddr, __be16 sport,
445+
__be32 daddr, unsigned int hnum,
446+
int dif, int sdif,
447+
const struct udp_table *udptable)
448+
{
449+
unsigned int slot = udp_hashfn(net, hnum, udptable->mask);
450+
struct udp_hslot *hslot = &udptable->hash[slot];
451+
struct sock *sk, *result = NULL;
452+
int score, badness = 0;
453+
454+
sk_for_each_rcu(sk, &hslot->head) {
455+
score = compute_score(sk, net,
456+
saddr, sport, daddr, hnum, dif, sdif);
457+
if (score > badness) {
458+
result = sk;
459+
badness = score;
460+
}
461+
}
462+
463+
return result;
464+
}
465+
423466
/* called with rcu_read_lock() */
424467
static struct sock *udp4_lib_lookup2(const struct net *net,
425468
__be32 saddr, __be16 sport,
@@ -683,6 +726,19 @@ struct sock *__udp4_lib_lookup(const struct net *net, __be32 saddr,
683726
result = udp4_lib_lookup2(net, saddr, sport,
684727
htonl(INADDR_ANY), hnum, dif, sdif,
685728
hslot2, skb);
729+
if (!IS_ERR_OR_NULL(result))
730+
goto done;
731+
732+
/* Primary hash (destination port) lookup as fallback for this race:
733+
* 1. __ip4_datagram_connect() sets sk_rcv_saddr
734+
* 2. lookup (this function): new sk_rcv_saddr, hashes not updated yet
735+
* 3. rehash operation updating _secondary and four-tuple_ hashes
736+
* The primary hash doesn't need an update after 1., so, thanks to this
737+
* further step, 1. and 3. don't need to be atomic against the lookup.
738+
*/
739+
result = udp4_lib_lookup1(net, saddr, sport, daddr, hnum, dif, sdif,
740+
udptable);
741+
686742
done:
687743
if (IS_ERR(result))
688744
return NULL;

net/ipv6/udp.c

Lines changed: 50 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -170,6 +170,49 @@ static int compute_score(struct sock *sk, const struct net *net,
170170
return score;
171171
}
172172

173+
/**
174+
* udp6_lib_lookup1() - Simplified lookup using primary hash (destination port)
175+
* @net: Network namespace
176+
* @saddr: Source address, network order
177+
* @sport: Source port, network order
178+
* @daddr: Destination address, network order
179+
* @hnum: Destination port, host order
180+
* @dif: Destination interface index
181+
* @sdif: Destination bridge port index, if relevant
182+
* @udptable: Set of UDP hash tables
183+
*
184+
* Simplified lookup to be used as fallback if no sockets are found due to a
185+
* potential race between (receive) address change, and lookup happening before
186+
* the rehash operation. This function ignores SO_REUSEPORT groups while scoring
187+
* result sockets, because if we have one, we don't need the fallback at all.
188+
*
189+
* Called under rcu_read_lock().
190+
*
191+
* Return: socket with highest matching score if any, NULL if none
192+
*/
193+
static struct sock *udp6_lib_lookup1(const struct net *net,
194+
const struct in6_addr *saddr, __be16 sport,
195+
const struct in6_addr *daddr,
196+
unsigned int hnum, int dif, int sdif,
197+
const struct udp_table *udptable)
198+
{
199+
unsigned int slot = udp_hashfn(net, hnum, udptable->mask);
200+
struct udp_hslot *hslot = &udptable->hash[slot];
201+
struct sock *sk, *result = NULL;
202+
int score, badness = 0;
203+
204+
sk_for_each_rcu(sk, &hslot->head) {
205+
score = compute_score(sk, net,
206+
saddr, sport, daddr, hnum, dif, sdif);
207+
if (score > badness) {
208+
result = sk;
209+
badness = score;
210+
}
211+
}
212+
213+
return result;
214+
}
215+
173216
/* called with rcu_read_lock() */
174217
static struct sock *udp6_lib_lookup2(const struct net *net,
175218
const struct in6_addr *saddr, __be16 sport,
@@ -347,6 +390,13 @@ struct sock *__udp6_lib_lookup(const struct net *net,
347390
result = udp6_lib_lookup2(net, saddr, sport,
348391
&in6addr_any, hnum, dif, sdif,
349392
hslot2, skb);
393+
if (!IS_ERR_OR_NULL(result))
394+
goto done;
395+
396+
/* Cover address change/lookup/rehash race: see __udp4_lib_lookup() */
397+
result = udp6_lib_lookup1(net, saddr, sport, daddr, hnum, dif, sdif,
398+
udptable);
399+
350400
done:
351401
if (IS_ERR(result))
352402
return NULL;

0 commit comments

Comments
 (0)