
Commit 6e2f148

Author: Paolo Abeni (committed)
Merge branch 'tcp-update-bind-bucket-state-on-port-release'
Jakub Sitnicki says:

====================
tcp: Update bind bucket state on port release

TL;DR
-----
This is another take on addressing the issue we already raised
earlier [1]. This time around, instead of trying to relax the
bind-conflict checks in connect(), we make an attempt to fix the
tcp bind bucket state accounting.

The goal of this patch set is to make the bind buckets return to the
"port reusable by ephemeral connections" state when all sockets blocking
the port from reuse get unhashed.

Changelog
---------
Changes in v5:
- Fix initial port-addr bucket state on saddr update with ip_dynaddr=1
- Add Kuniyuki's tag for tests
- Link to v4: https://lore.kernel.org/r/20250913-update-bind-bucket-state-on-unhash-v4-0-33a567594df7@cloudflare.com

Changes in v4:
- Drop redundant sk_is_connect_bind helper doc comment
- Link to v3: https://lore.kernel.org/r/20250910-update-bind-bucket-state-on-unhash-v3-0-023caaf4ae3c@cloudflare.com

Changes in v3:
- Move the flag from inet_flags to sk_userlocks (Kuniyuki)
- Rename the flag from AUTOBIND to CONNECT_BIND to avoid a name clash (Kuniyuki)
- Drop unreachable code for sk_state == TCP_NEW_SYN_RECV (Kuniyuki)
- Move the helper to inet_hashtables where it's used
- Reword patch 1 description for conciseness
- Link to v2: https://lore.kernel.org/r/20250821-update-bind-bucket-state-on-unhash-v2-0-0c204543a522@cloudflare.com

Changes in v2:
- Rename the inet_sock flag from LAZY_BIND to AUTOBIND (Eric)
- Clear the AUTOBIND flag on disconnect path (Eric)
- Add a test to cover the disconnect case (Eric)
- Link to RFC v1: https://lore.kernel.org/r/20250808-update-bind-bucket-state-on-unhash-v1-0-faf85099d61b@cloudflare.com

Situation
---------
We observe the following scenario in production:

                                                  inet_bind_bucket state
                                                  for port 54321
                                                  ----------------------
                                                  (bucket doesn't exist)
// Process A opens a long-lived connection:
s1 = socket(AF_INET, SOCK_STREAM)
s1.setsockopt(IP_BIND_ADDRESS_NO_PORT)
s1.setsockopt(IP_LOCAL_PORT_RANGE, 54000..54500)
s1.bind(192.0.2.10, 0)
s1.connect(192.51.100.1, 443)                     tb->fastreuse = -1
                                                  tb->fastreuseport = -1
s1.getsockname() -> 192.0.2.10:54321
s1.send()
s1.recv()
// ... s1 stays open.

// Process B opens a short-lived connection:
s2 = socket(AF_INET, SOCK_STREAM)
s2.setsockopt(SO_REUSEADDR)
s2.bind(192.0.2.20, 0)                            tb->fastreuse = 0
                                                  tb->fastreuseport = 0
s2.connect(192.51.100.2, 53)
s2.getsockname() -> 192.0.2.20:54321
s2.send()
s2.recv()
s2.close()                                        // bucket remains in this
                                                  // state even though port
                                                  // was released by s2
                                                  tb->fastreuse = 0
                                                  tb->fastreuseport = 0

// Process A attempts to open another connection
// when there is connection pressure from
// 192.0.2.30:54000..54500 to 192.51.100.1:443.
// Assume only port 54321 is still available.

s3 = socket(AF_INET, SOCK_STREAM)
s3.setsockopt(IP_BIND_ADDRESS_NO_PORT)
s3.setsockopt(IP_LOCAL_PORT_RANGE, 54000..54500)
s3.bind(192.0.2.30, 0)
s3.connect(192.51.100.1, 443) -> EADDRNOTAVAIL (99)

Problem
-------
We end up in a state where Process A can't reuse ephemeral port 54321
for as long as there are sockets, like s1, that keep the bind bucket
alive. The bucket does not return to the "reusable" state even when all
sockets which blocked it from reuse, like s2, are gone. The ephemeral
port becomes available for use again only after all sockets bound to it
are gone and the bind bucket is destroyed.

Programs which behave like Process B in this scenario - that is, binding
to an IP address without setting IP_BIND_ADDRESS_NO_PORT - might be
considered poorly written. However, the reality is that such an
implementation is not actually uncommon. Trying to fix each and every
such program is like playing whack-a-mole.

For instance, it could be any software using Golang's net.Dialer with
LocalAddr provided:

	dialer := &net.Dialer{
		LocalAddr: &net.TCPAddr{IP: srcIP},
	}
	conn, err := dialer.Dial("tcp4", dialTarget)

Or even a ubiquitous tool like dig when using a specific local address:

	$ dig -b 127.1.1.1 +tcp +short example.com

Hence, we are proposing a systematic fix in the network stack itself.

Solution
--------
Please see the description in patch 1.

[1] https://lore.kernel.org/r/20250714-connect-port-search-harder-v3-0-b1a41f249865@cloudflare.com

Reported-by: Lee Valentine <[email protected]>
Signed-off-by: Jakub Sitnicki <[email protected]>
====================

Link: https://patch.msgid.link/20250917-update-bind-bucket-state-on-unhash-v5-0-57168b661b47@cloudflare.com
Signed-off-by: Paolo Abeni <[email protected]>
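The difference between the Process A and Process B bind patterns above can be demonstrated from user space. The sketch below is illustrative only and assumes Linux: `IP_BIND_ADDRESS_NO_PORT` is given its Linux value (24) when the Python build does not export it, and a loopback listener stands in for the remote peers in the scenario.

```python
import socket

# IP_BIND_ADDRESS_NO_PORT may be missing from older Python builds;
# 24 is its value on Linux (assumption: Linux-only sketch).
IP_BIND_ADDRESS_NO_PORT = getattr(socket, "IP_BIND_ADDRESS_NO_PORT", 24)

# Local listener standing in for the remote service.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))
srv.listen(1)

# Process A style: bind an address, but defer port selection to connect().
s1 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s1.setsockopt(socket.IPPROTO_IP, IP_BIND_ADDRESS_NO_PORT, 1)
s1.bind(("127.0.0.1", 0))
port_before_connect = s1.getsockname()[1]   # still 0: no bind bucket claimed yet
s1.connect(srv.getsockname())
port_after_connect = s1.getsockname()[1]    # ephemeral port chosen at connect() time

# Process B style: a plain bind() claims a port (and a bind bucket)
# immediately, before connect() ever runs.
s2 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s2.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
s2.bind(("127.0.0.1", 0))
port_plain_bind = s2.getsockname()[1]       # already nonzero

print(port_before_connect, port_after_connect, port_plain_bind)

s1.close(); s2.close(); srv.close()
```

Only the Process B style forces the bucket into the `fastreuse = 0` state that this series teaches the kernel to roll back once such sockets are unhashed.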
2 parents: 3afb106 + 8a8241c; commit 6e2f148

File tree

9 files changed: +322, -8 lines

include/net/inet_connection_sock.h

Lines changed: 3 additions & 2 deletions

@@ -316,8 +316,9 @@ int inet_csk_listen_start(struct sock *sk);
 void inet_csk_listen_stop(struct sock *sk);
 
 /* update the fast reuse flag when adding a socket */
-void inet_csk_update_fastreuse(struct inet_bind_bucket *tb,
-			       struct sock *sk);
+void inet_csk_update_fastreuse(const struct sock *sk,
+			       struct inet_bind_bucket *tb,
+			       struct inet_bind2_bucket *tb2);
 
 struct dst_entry *inet_csk_update_pmtu(struct sock *sk, u32 mtu);
 
include/net/inet_hashtables.h

Lines changed: 2 additions & 0 deletions

@@ -108,6 +108,8 @@ struct inet_bind2_bucket {
 	struct hlist_node	bhash_node;
 	/* List of sockets hashed to this bucket */
 	struct hlist_head	owners;
+	signed char		fastreuse;
+	signed char		fastreuseport;
 };
 
 static inline struct net *ib_net(const struct inet_bind_bucket *ib)

include/net/inet_timewait_sock.h

Lines changed: 2 additions & 1 deletion

@@ -70,7 +70,8 @@ struct inet_timewait_sock {
 	unsigned int		tw_transparent  : 1,
 				tw_flowlabel	: 20,
 				tw_usec_ts	: 1,
-				tw_pad		: 2,	/* 2 bits hole */
+				tw_connect_bind	: 1,
+				tw_pad		: 1,	/* 1 bit hole */
 				tw_tos		: 8;
 	u32			tw_txhash;
 	u32			tw_priority;

include/net/sock.h

Lines changed: 4 additions & 0 deletions

@@ -1494,6 +1494,10 @@ static inline int __sk_prot_rehash(struct sock *sk)
 
 #define SOCK_BINDADDR_LOCK	4
 #define SOCK_BINDPORT_LOCK	8
+/**
+ * define SOCK_CONNECT_BIND - &sock->sk_userlocks flag for auto-bind at connect() time
+ */
+#define SOCK_CONNECT_BIND	16
 
 struct socket_alloc {
 	struct socket socket;

net/ipv4/inet_connection_sock.c

Lines changed: 8 additions & 4 deletions

@@ -423,7 +423,7 @@ inet_csk_find_open_port(const struct sock *sk, struct inet_bind_bucket **tb_ret,
 }
 
 static inline int sk_reuseport_match(struct inet_bind_bucket *tb,
-				     struct sock *sk)
+				     const struct sock *sk)
 {
 	if (tb->fastreuseport <= 0)
 		return 0;
@@ -453,8 +453,9 @@ static inline int sk_reuseport_match(struct inet_bind_bucket *tb,
 			      ipv6_only_sock(sk), true, false);
 }
 
-void inet_csk_update_fastreuse(struct inet_bind_bucket *tb,
-			       struct sock *sk)
+void inet_csk_update_fastreuse(const struct sock *sk,
+			       struct inet_bind_bucket *tb,
+			       struct inet_bind2_bucket *tb2)
 {
 	bool reuse = sk->sk_reuse && sk->sk_state != TCP_LISTEN;
 
@@ -501,6 +502,9 @@ void inet_csk_update_fastreuse(struct inet_bind_bucket *tb,
 			tb->fastreuseport = 0;
 		}
 	}
+
+	tb2->fastreuse = tb->fastreuse;
+	tb2->fastreuseport = tb->fastreuseport;
 }
 
 /* Obtain a reference to a local port for the given sock,
@@ -582,7 +586,7 @@ int inet_csk_get_port(struct sock *sk, unsigned short snum)
 	}
 
 success:
-	inet_csk_update_fastreuse(tb, sk);
+	inet_csk_update_fastreuse(sk, tb, tb2);
 
 	if (!inet_csk(sk)->icsk_bind_hash)
 		inet_bind_hash(sk, tb, tb2, port);

net/ipv4/inet_hashtables.c

Lines changed: 43 additions & 1 deletion

@@ -58,6 +58,14 @@ static u32 sk_ehashfn(const struct sock *sk)
 			    sk->sk_daddr, sk->sk_dport);
 }
 
+static bool sk_is_connect_bind(const struct sock *sk)
+{
+	if (sk->sk_state == TCP_TIME_WAIT)
+		return inet_twsk(sk)->tw_connect_bind;
+	else
+		return sk->sk_userlocks & SOCK_CONNECT_BIND;
+}
+
 /*
  * Allocate and initialize a new local port bind bucket.
  * The bindhash mutex for snum's hash chain must be held here.
@@ -87,10 +95,22 @@ struct inet_bind_bucket *inet_bind_bucket_create(struct kmem_cache *cachep,
  */
 void inet_bind_bucket_destroy(struct inet_bind_bucket *tb)
 {
+	const struct inet_bind2_bucket *tb2;
+
 	if (hlist_empty(&tb->bhash2)) {
 		hlist_del_rcu(&tb->node);
 		kfree_rcu(tb, rcu);
+		return;
+	}
+
+	if (tb->fastreuse == -1 && tb->fastreuseport == -1)
+		return;
+	hlist_for_each_entry(tb2, &tb->bhash2, bhash_node) {
+		if (tb2->fastreuse != -1 || tb2->fastreuseport != -1)
+			return;
 	}
+	tb->fastreuse = -1;
+	tb->fastreuseport = -1;
 }
 
 bool inet_bind_bucket_match(const struct inet_bind_bucket *tb, const struct net *net,
@@ -121,6 +141,8 @@ static void inet_bind2_bucket_init(struct inet_bind2_bucket *tb2,
 #else
 	tb2->rcv_saddr = sk->sk_rcv_saddr;
 #endif
+	tb2->fastreuse = 0;
+	tb2->fastreuseport = 0;
 	INIT_HLIST_HEAD(&tb2->owners);
 	hlist_add_head(&tb2->node, &head->chain);
 	hlist_add_head(&tb2->bhash_node, &tb->bhash2);
@@ -143,11 +165,23 @@ struct inet_bind2_bucket *inet_bind2_bucket_create(struct kmem_cache *cachep,
 /* Caller must hold hashbucket lock for this tb with local BH disabled */
 void inet_bind2_bucket_destroy(struct kmem_cache *cachep, struct inet_bind2_bucket *tb)
 {
+	const struct sock *sk;
+
 	if (hlist_empty(&tb->owners)) {
 		__hlist_del(&tb->node);
 		__hlist_del(&tb->bhash_node);
 		kmem_cache_free(cachep, tb);
+		return;
 	}
+
+	if (tb->fastreuse == -1 && tb->fastreuseport == -1)
+		return;
+	sk_for_each_bound(sk, &tb->owners) {
+		if (!sk_is_connect_bind(sk))
+			return;
+	}
+	tb->fastreuse = -1;
+	tb->fastreuseport = -1;
 }
 
 static bool inet_bind2_bucket_addr_match(const struct inet_bind2_bucket *tb2,
@@ -191,6 +225,7 @@ static void __inet_put_port(struct sock *sk)
 	tb = inet_csk(sk)->icsk_bind_hash;
 	inet_csk(sk)->icsk_bind_hash = NULL;
 	inet_sk(sk)->inet_num = 0;
+	sk->sk_userlocks &= ~SOCK_CONNECT_BIND;
 
 	spin_lock(&head2->lock);
 	if (inet_csk(sk)->icsk_bind2_hash) {
@@ -277,7 +312,7 @@ int __inet_inherit_port(const struct sock *sk, struct sock *child)
 		}
 	}
 	if (update_fastreuse)
-		inet_csk_update_fastreuse(tb, child);
+		inet_csk_update_fastreuse(child, tb, tb2);
 	inet_bind_hash(child, tb, tb2, port);
 	spin_unlock(&head2->lock);
 	spin_unlock(&head->lock);
@@ -950,6 +985,10 @@ static int __inet_bhash2_update_saddr(struct sock *sk, void *saddr, int family,
 	if (!tb2) {
 		tb2 = new_tb2;
 		inet_bind2_bucket_init(tb2, net, head2, inet_csk(sk)->icsk_bind_hash, sk);
+		if (sk_is_connect_bind(sk)) {
+			tb2->fastreuse = -1;
+			tb2->fastreuseport = -1;
+		}
 	}
 	inet_csk(sk)->icsk_bind2_hash = tb2;
 	sk_add_bind_node(sk, &tb2->owners);
@@ -1120,6 +1159,8 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 						     head2, tb, sk);
 			if (!tb2)
 				goto error;
+			tb2->fastreuse = -1;
+			tb2->fastreuseport = -1;
 		}
 
 	/* Here we want to add a little bit of randomness to the next source
@@ -1132,6 +1173,7 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 
 	/* Head lock still held and bh's disabled */
 	inet_bind_hash(sk, tb, tb2, port);
+	sk->sk_userlocks |= SOCK_CONNECT_BIND;
 
 	if (sk_unhashed(sk)) {
 		inet_sk(sk)->inet_sport = htons(port);
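The two destroy-path sweeps added in this file can be modeled outside the kernel. The sketch below is a toy model, not kernel code; the class and function names are illustrative, chosen to mirror the patch (`-1` means "port still reusable by connect-time binds", `0`/`1` mean explicit bind() semantics apply).

```python
class Bind2Bucket:
    """Toy stand-in for inet_bind2_bucket: per-(addr, port) state."""
    def __init__(self):
        self.fastreuse = 0
        self.fastreuseport = 0
        self.owners = []          # sockets bound to this (addr, port)

class BindBucket:
    """Toy stand-in for inet_bind_bucket: per-port state."""
    def __init__(self):
        self.fastreuse = 0
        self.fastreuseport = 0
        self.bhash2 = []          # per-address Bind2Bucket list

def bucket_release_sweep(tb):
    """Mirrors inet_bind_bucket_destroy(): once every remaining tb2 is
    back in the connect-bind state (-1), the port bucket follows."""
    if not tb.bhash2:
        return                    # kernel would free the bucket here
    if tb.fastreuse == -1 and tb.fastreuseport == -1:
        return                    # nothing to do
    if any(tb2.fastreuse != -1 or tb2.fastreuseport != -1
           for tb2 in tb.bhash2):
        return                    # some address still blocks reuse
    tb.fastreuse = tb.fastreuseport = -1

def bucket2_release_sweep(tb2, sk_is_connect_bind):
    """Mirrors inet_bind2_bucket_destroy(): reset tb2 state when all
    remaining owners were auto-bound at connect() time."""
    if not tb2.owners:
        return                    # kernel would free the bucket here
    if tb2.fastreuse == -1 and tb2.fastreuseport == -1:
        return
    if any(not sk_is_connect_bind(sk) for sk in tb2.owners):
        return                    # an explicitly bound socket remains
    tb2.fastreuse = tb2.fastreuseport = -1

# Replay the commit-message scenario after s2 closes: s1 was auto-bound
# at connect() time, and s2's explicit bind() had forced the bucket out
# of the reusable state.
tb, tb2 = BindBucket(), Bind2Bucket()
tb.bhash2 = [tb2]
tb2.owners = ["s1"]               # s2 already closed and unhashed
bucket2_release_sweep(tb2, lambda sk: sk == "s1")
bucket_release_sweep(tb)
print(tb.fastreuse, tb.fastreuseport)   # back to -1 -1: port reusable
```

In this model, as in the patch, the per-port bucket only flips back to the reusable state after every per-address bucket has done so, which is why both destroy paths run a sweep.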

net/ipv4/inet_timewait_sock.c

Lines changed: 1 addition & 0 deletions

@@ -208,6 +208,7 @@ struct inet_timewait_sock *inet_twsk_alloc(const struct sock *sk,
 		tw->tw_hash	    = sk->sk_hash;
 		tw->tw_ipv6only	    = 0;
 		tw->tw_transparent  = inet_test_bit(TRANSPARENT, sk);
+		tw->tw_connect_bind = !!(sk->sk_userlocks & SOCK_CONNECT_BIND);
 		tw->tw_prot	    = sk->sk_prot_creator;
 		atomic64_set(&tw->tw_cookie, atomic64_read(&sk->sk_cookie));
 		twsk_net_set(tw, sock_net(sk));

tools/testing/selftests/net/Makefile

Lines changed: 1 addition & 0 deletions

@@ -122,6 +122,7 @@ TEST_PROGS += broadcast_pmtu.sh
 TEST_PROGS += ipv6_force_forwarding.sh
 TEST_GEN_PROGS += ipv6_fragmentation
 TEST_PROGS += route_hint.sh
+TEST_GEN_PROGS += tcp_port_share
 
 # YNL files, must be before "include ..lib.mk"
 YNL_GEN_FILES := busy_poller
