
Commit 1e50306

edumazet authored and aosemp committed
6.19 Revert "net/sched: Fix mirred deadlock on device recursion" & net: dev_queue_xmit() llist adoption & net: add a fast path in __netif_schedule()
This reverts commits 0f022d3 and 44180fe.

A prior patch in this series implemented loop detection in act_mirred,
so we can remove q->owner to save some cycles in the fast path.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Reviewed-by: Victor Nogueira <victor@mojatatu.com>
Tested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20251014171907.3554413-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dev_queue_xmit() llist adoption

Remove the busylock spinlock and use a lockless list (llist) to reduce
spinlock contention to the minimum. The idea is that only one cpu might
spin on the qdisc spinlock, while the others simply add their skb to
the llist.

After this patch, we get a 300% improvement on heavy TX workloads:
- Sending twice the number of packets per second.
- While consuming 50% fewer cycles.

Note that this also allows, in the future, submitting batches to the
various qdisc->enqueue() methods.

Tested:
- Dual Intel(R) Xeon(R) 6985P-C (480 hyper threads).
- 100Gbit NIC, 30 TX queues with FQ packet scheduler.
- echo 64 >/sys/kernel/slab/skbuff_small_head/cpu_partial
  (avoid contention in mm)
- 240 concurrent "netperf -t UDP_STREAM -- -m 120 -n"

Before: 16 Mpps (41 Mpps if each thread is pinned to a different cpu)

vmstat 2 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
  r  b swpd       free  buff   cache  si so  bi bo     in    cs us sy id wa st
243  0    0 2368988672 51036 1100852   0  0 146  1    242    60  0  9 91  0  0
244  0    0 2368988672 51036 1100852   0  0 536 10 487745 14718  0 52 48  0  0
244  0    0 2368988672 51036 1100852   0  0 512  0 503067 46033  0 52 48  0  0
244  0    0 2368988672 51036 1100852   0  0 512  0 494807 12107  0 52 48  0  0
244  0    0 2368988672 51036 1100852   0  0 702 26 492845 10110  0 52 48  0  0

Lock contention (1 second sample taken on 8 cores)

perf lock record -C0-7 sleep 1; perf lock contention
 contended   total wait     max wait     avg wait         type   caller
    442111       6.79 s    162.47 ms     15.35 us     spinlock   dev_hard_start_xmit+0xcd
      5961      9.57 ms      8.12 us      1.60 us     spinlock   __dev_queue_xmit+0x3a0
       244    560.63 us      7.63 us      2.30 us     spinlock   do_softirq+0x5b
        13     25.09 us      3.21 us      1.93 us     spinlock   net_tx_action+0xf8

If netperf threads are pinned, spinlock stress is very high.
perf lock record -C0-7 sleep 1; perf lock contention
 contended   total wait     max wait     avg wait         type   caller
    964508       7.10 s    147.25 ms      7.36 us     spinlock   dev_hard_start_xmit+0xcd
       201    268.05 us      4.65 us      1.33 us     spinlock   __dev_queue_xmit+0x3a0
        12     26.05 us      3.84 us      2.17 us     spinlock   do_softirq+0x5b

@__dev_queue_xmit_ns:
[256, 512)            21 |                                                    |
[512, 1K)            631 |                                                    |
[1K, 2K)           27328 |@                                                   |
[2K, 4K)          265392 |@@@@@@@@@@@@@@@@                                    |
[4K, 8K)          417543 |@@@@@@@@@@@@@@@@@@@@@@@@@@                          |
[8K, 16K)         826292 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16K, 32K)        733822 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@      |
[32K, 64K)         19055 |@                                                   |
[64K, 128K)        17240 |@                                                   |
[128K, 256K)       25633 |@                                                   |
[256K, 512K)           4 |                                                    |

After: 29 Mpps (57 Mpps if each thread is pinned to a different cpu)

vmstat 2 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
  r  b swpd       free  buff   cache  si so  bi bo     in     cs us sy id wa st
 78  0    0 2369573632 32896 1350988   0  0  22  0    331    254  0  8 92  0  0
 75  0    0 2369573632 32896 1350988   0  0  22 50 425713 280199  0 23 76  0  0
104  0    0 2369573632 32896 1350988   0  0 290  0 430238 298247  0 23 76  0  0
 86  0    0 2369573632 32896 1350988   0  0 132  0 428019 291865  0 24 76  0  0
 90  0    0 2369573632 32896 1350988   0  0 502  0 422498 278672  0 23 76  0  0

perf lock record -C0-7 sleep 1; perf lock contention
 contended   total wait     max wait     avg wait         type   caller
      2524    116.15 ms    486.61 us     46.02 us     spinlock   __dev_queue_xmit+0x55b
      5821    107.18 ms    371.67 us     18.41 us     spinlock   dev_hard_start_xmit+0xcd
      2377      9.73 ms     35.86 us      4.09 us     spinlock   ___slab_alloc+0x4e0
       923      5.74 ms     20.91 us      6.22 us     spinlock   ___slab_alloc+0x5c9
       121      3.42 ms    193.05 us     28.24 us     spinlock   net_tx_action+0xf8
         6    564.33 us    167.60 us     94.05 us     spinlock   do_softirq+0x5b

If netperf threads are pinned (~54 Mpps)

perf lock record -C0-7 sleep 1; perf lock contention
     32907    316.98 ms    195.98 us      9.63 us     spinlock   dev_hard_start_xmit+0xcd
      4507     61.83 ms    212.73 us     13.72 us     spinlock   __dev_queue_xmit+0x554
      2781     23.53 ms     40.03 us      8.46 us     spinlock   ___slab_alloc+0x5c9
      3554     18.94 ms     34.69 us      5.33 us     spinlock   ___slab_alloc+0x4e0
       233      9.09 ms    215.70 us     38.99 us     spinlock   do_softirq+0x5b
       153    930.66 us     48.67 us      6.08 us     spinlock   net_tx_action+0xfd
        84    331.10 us     14.22 us      3.94 us     spinlock   ___slab_alloc+0x5c9
       140    323.71 us      9.94 us      2.31 us     spinlock   ___slab_alloc+0x4e0

@__dev_queue_xmit_ns:
[128, 256)       1539830 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                  |
[256, 512)       2299558 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[512, 1K)         483936 |@@@@@@@@@@                                          |
[1K, 2K)          265345 |@@@@@@                                              |
[2K, 4K)          145463 |@@@                                                  |
[4K, 8K)           54571 |@                                                    |
[8K, 16K)          10270 |                                                     |
[16K, 32K)          9385 |                                                     |
[32K, 64K)          7749 |                                                     |
[64K, 128K)        26799 |                                                     |
[128K, 256K)        2665 |                                                     |
[256K, 512K)         665 |                                                     |

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Tested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20251014171907.3554413-7-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: add a fast path in __netif_schedule()

Cpus serving NIC interrupts, and specifically TX completions, are often
also trapped in restarting a busy qdisc (because the qdisc was stopped
by BQL or the driver's own flow control).
When they call netdev_tx_completed_queue() or netif_tx_wake_queue(),
they call __netif_schedule() so that the queue can be run later from
net_tx_action() (involving NET_TX_SOFTIRQ).

Quite often, by the time the cpu reaches net_tx_action(), another cpu
has grabbed the qdisc spinlock from __dev_xmit_skb(), and we spend too
much time spinning on this lock.

We can detect in __netif_schedule() if a cpu is already at a specific
point in __dev_xmit_skb() where we have the guarantee that the queue
will be run.

This patch gives a 13% throughput increase on an IDPF NIC (200Gbit),
32 TX queues, sending UDP packets of 120 bytes.

This also helps __qdisc_run() to not force a NET_TX_SOFTIRQ if another
thread is waiting in __dev_xmit_skb().

Before:
sar -n DEV 5 5|grep eth1|grep Average
Average:  eth1  1496.44  52191462.56  210.00  13369396.90  0.00  0.00  0.00  54.76

After:
sar -n DEV 5 5|grep eth1|grep Average
Average:  eth1  1457.88  59363099.96  205.08  15206384.35  0.00  0.00  0.00  62.29

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20251017145334.3016097-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: allow busy connected flows to switch tx queues

This is a followup of commit 726e9e8 ("tcp: refine skb->ooo_okay
setting") and of the prior commit in this series ("net: control
skb->ooo_okay from skb_set_owner_w()").

skb->ooo_okay might never be set for bulk flows that always have at
least one skb in a qdisc queue or NIC queue, especially if TX
completion is delayed because of a stressed cpu.

The so-called "strange attractors" have caused many performance issues
(see for instance 9b462d0 ("tcp: TCP Small Queues and strange
attractors")); we need to do better. We have tried very hard to avoid
reorders because TCP was not dealing with them nicely a decade ago.

Use the new net.core.txq_reselection_ms sysctl to let flows follow XPS
and select a more efficient queue.

After this patch, we no longer have to make sure threads are pinned to
cpus; they can now be migrated without adding too much
spinlock/qdisc/TX completion pressure.

The TX completion part was problematic, because it added false sharing
on various socket fields, but also false sharing and spinlock
contention in mm layers. Calling skb_orphan() from ndo_start_xmit() is
unfortunately not an option.

Note for later:
1) Move sk->sk_tx_queue_mapping closer to sk_tx_queue_mapping_jiffies
   for better cache locality.
2) Study if 9b462d0 ("tcp: TCP Small Queues and strange attractors")
   could be revised.

Tested:

Used a host with 32 TX queues, shared by groups of 8 cores.

XPS setup:
echo ff >/sys/class/net/eth1/queue/tx-0/xps_cpus
echo ff00 >/sys/class/net/eth1/queue/tx-1/xps_cpus
echo ff0000 >/sys/class/net/eth1/queue/tx-2/xps_cpus
echo ff000000 >/sys/class/net/eth1/queue/tx-3/xps_cpus
echo ff,00000000 >/sys/class/net/eth1/queue/tx-4/xps_cpus
echo ff00,00000000 >/sys/class/net/eth1/queue/tx-5/xps_cpus
echo ff0000,00000000 >/sys/class/net/eth1/queue/tx-6/xps_cpus
echo ff000000,00000000 >/sys/class/net/eth1/queue/tx-7/xps_cpus
...
Launched a tcp_stream with 15 threads and 1000 flows, initially affined
to cores 0-15:

taskset -c 0-15 tcp_stream -T15 -F1000 -l1000 -c -H target_host

Checked that only queues 0 and 1 are used, as instructed by XPS:

tc -s qdisc show dev eth1|grep backlog|grep -v "backlog 0b 0p"
 backlog 123489410b 1890p
 backlog 69809026b 1064p
 backlog 52401054b 805p

Then force each thread to run on cpus 1,9,17,25,33,41,49,57,65,73,81,89,97,105,113,121:

C=1;PID=`pidof tcp_stream`;for P in `ls /proc/$PID/task`; do taskset -pc $C $P; C=$(($C + 8));done

Set txq_reselection_ms to 1000:

echo 1000 > /proc/sys/net/core/txq_reselection_ms

Check that the flows have migrated nicely:

tc -s qdisc show dev eth1|grep backlog|grep -v "backlog 0b 0p"
 backlog 130508314b 1916p
 backlog 8584380b 126p
 backlog 8584380b 126p
 backlog 8379990b 123p
 backlog 8584380b 126p
 backlog 8487484b 125p
 backlog 8584380b 126p
 backlog 8448120b 124p
 backlog 8584380b 126p
 backlog 8720640b 128p
 backlog 8856900b 130p
 backlog 8584380b 126p
 backlog 8652510b 127p
 backlog 8448120b 124p
 backlog 8516250b 125p
 backlog 7834950b 115p

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20251013152234.842065-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
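For reference, the xps_cpus values used in the test above follow a simple pattern: queue N is given the 8 consecutive CPUs starting at 8*N, expressed in the comma-separated 32-bit-word bitmap format that sysfs expects. A minimal sketch that prints these masks, assuming the 32-queue / 8-CPUs-per-queue layout of this test (not part of the patches):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	const int nr_queues = 32, cpus_per_queue = 8;

	for (int q = 0; q < nr_queues; q++) {
		int first_cpu = q * cpus_per_queue;
		/* 8 set bits, placed inside the 32-bit word covering first_cpu */
		uint32_t mask = 0xffu << (first_cpu % 32);

		printf("tx-%d: %x", q, mask);
		/* lower 32-bit words are all zero, printed zero-padded */
		for (int w = first_cpu / 32; w > 0; w--)
			printf(",%08x", 0u);
		printf("\n");
	}
	return 0;
}

For queue 4 this prints "ff,00000000", matching the echo commands shown in the test setup.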
1 parent e1a731a commit 1e50306

8 files changed: +131 −63 lines changed

Documentation/admin-guide/sysctl/net.rst
Lines changed: 17 additions & 0 deletions

@@ -406,6 +406,23 @@ to SOCK_TXREHASH_DEFAULT (i. e. not overridden by setsockopt).
 If set to 1 (default), hash rethink is performed on listening socket.
 If set to 0, hash rethink is not performed.
 
+txq_reselection_ms
+------------------
+
+Controls how often (in ms) a busy connected flow can select another tx queue.
+
+A reselection is desirable when/if the user thread has migrated and XPS
+would select a different queue. The same can occur without XPS
+if the flow hash has changed.
+
+But switching txq can introduce reorders, especially if the
+old queue is under high pressure. Modern TCP stacks deal
+well with reorders if they happen not too often.
+
+To disable this feature, set the value to 0.
+
+Default : 1000
+
 gro_normal_batch
 ----------------
 
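The semantics above boil down to an age check on the cached queue mapping. A minimal userspace sketch of that check, with illustrative names and milliseconds standing in for jiffies (sk_tx_queue_get() in the net/core/dev.c diff below is the real implementation):

#include <stdio.h>
#include <stdint.h>

/* Illustrative stand-ins for sk->sk_tx_queue_mapping and
 * sk->sk_tx_queue_mapping_jiffies.
 */
struct flow_cache {
	int	 tx_queue;	/* cached queue index, -1 when unset     */
	uint64_t mapped_ms;	/* when the mapping was last (re)written */
};

/* Return the cached queue, or -1 to let XPS / flow hash pick again.
 * reselection_ms == 0 disables expiry, matching the sysctl.
 */
static int cached_tx_queue(const struct flow_cache *fc,
			   uint64_t now_ms, unsigned int reselection_ms)
{
	if (fc->tx_queue < 0)
		return -1;
	if (reselection_ms && now_ms - fc->mapped_ms > reselection_ms)
		return -1;		/* mapping is stale: reselect */
	return fc->tx_queue;
}

int main(void)
{
	struct flow_cache fc = { .tx_queue = 3, .mapped_ms = 0 };

	printf("%d\n", cached_tx_queue(&fc,  500, 1000));	/* 3: still fresh */
	printf("%d\n", cached_tx_queue(&fc, 2500, 1000));	/* -1: expired    */
	printf("%d\n", cached_tx_queue(&fc, 2500, 0));		/* 3: feature off */
	return 0;
}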
include/net/netns/core.h
Lines changed: 1 addition & 0 deletions

@@ -13,6 +13,7 @@ struct netns_core {
 	struct ctl_table_header	*sysctl_hdr;
 
 	int	sysctl_somaxconn;
+	int	sysctl_txq_reselection;
 	int	sysctl_optmem_max;
 	u8	sysctl_txrehash;
 	u8	sysctl_tstamp_allow_data;

include/net/sch_generic.h
Lines changed: 3 additions & 1 deletion

@@ -123,7 +123,9 @@ struct Qdisc {
 	struct Qdisc		*next_sched;
 	struct sk_buff_head	skb_bad_txq;
 
-	spinlock_t		busylock ____cacheline_aligned_in_smp;
+	atomic_long_t		defer_count ____cacheline_aligned_in_smp;
+	struct llist_head	defer_list;
+
 	spinlock_t		seqlock;
 
 	struct rcu_head		rcu;

include/net/sock.h
Lines changed: 12 additions & 14 deletions

@@ -313,6 +313,7 @@ struct sk_filter;
  *	@sk_bind_phc: SO_TIMESTAMPING bind PHC index of PTP virtual clock
  *		      for timestamping
  *	@sk_tskey: counter to disambiguate concurrent tstamp requests
+ *	@sk_tx_queue_mapping_jiffies: time in jiffies of last @sk_tx_queue_mapping refresh.
  *	@sk_zckey: counter to order MSG_ZEROCOPY notifications
  *	@sk_socket: Identd and reporting IO signals
  *	@sk_user_data: RPC layer private data. Write-protected by @sk_callback_lock.
@@ -485,6 +486,7 @@ struct sock {
 	unsigned long		sk_pacing_rate; /* bytes per second */
 	atomic_t		sk_zckey;
 	atomic_t		sk_tskey;
+	unsigned long		sk_tx_queue_mapping_jiffies;
 	__cacheline_group_end(sock_write_tx);
 
 	__cacheline_group_begin(sock_read_tx);
@@ -1992,7 +1994,15 @@ static inline void sk_tx_queue_set(struct sock *sk, int tx_queue)
 	/* Paired with READ_ONCE() in sk_tx_queue_get() and
 	 * other WRITE_ONCE() because socket lock might be not held.
 	 */
-	WRITE_ONCE(sk->sk_tx_queue_mapping, tx_queue);
+	if (READ_ONCE(sk->sk_tx_queue_mapping) != tx_queue) {
+		WRITE_ONCE(sk->sk_tx_queue_mapping, tx_queue);
+		WRITE_ONCE(sk->sk_tx_queue_mapping_jiffies, jiffies);
+		return;
+	}
+
+	/* Refresh sk_tx_queue_mapping_jiffies if too old. */
+	if (time_is_before_jiffies(READ_ONCE(sk->sk_tx_queue_mapping_jiffies) + HZ))
+		WRITE_ONCE(sk->sk_tx_queue_mapping_jiffies, jiffies);
 }
 
 #define NO_QUEUE_MAPPING	USHRT_MAX
@@ -2005,19 +2015,7 @@ static inline void sk_tx_queue_clear(struct sock *sk)
 	WRITE_ONCE(sk->sk_tx_queue_mapping, NO_QUEUE_MAPPING);
 }
 
-static inline int sk_tx_queue_get(const struct sock *sk)
-{
-	if (sk) {
-		/* Paired with WRITE_ONCE() in sk_tx_queue_clear()
-		 * and sk_tx_queue_set().
-		 */
-		int val = READ_ONCE(sk->sk_tx_queue_mapping);
-
-		if (val != NO_QUEUE_MAPPING)
-			return val;
-	}
-	return -1;
-}
+int sk_tx_queue_get(const struct sock *sk);
 
 static inline void __sk_rx_queue_set(struct sock *sk,
 				     const struct sk_buff *skb,
net/core/dev.c
Lines changed: 90 additions & 43 deletions

@@ -3373,6 +3373,13 @@ static void __netif_reschedule(struct Qdisc *q)
 
 void __netif_schedule(struct Qdisc *q)
 {
+	/* If q->defer_list is not empty, at least one thread is
+	 * in __dev_xmit_skb() before llist_del_all(&q->defer_list).
+	 * This thread will attempt to run the queue.
+	 */
+	if (!llist_empty(&q->defer_list))
+		return;
+
 	if (!test_and_set_bit(__QDISC_STATE_SCHED, &q->state))
 		__netif_reschedule(q);
 }
@@ -4125,9 +4132,10 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 				 struct net_device *dev,
 				 struct netdev_queue *txq)
 {
+	struct sk_buff *next, *to_free = NULL;
 	spinlock_t *root_lock = qdisc_lock(q);
-	struct sk_buff *to_free = NULL;
-	bool contended;
+	struct llist_node *ll_list, *first_n;
+	unsigned long defer_count = 0;
 	int rc;
 
 	qdisc_calculate_pkt_len(skb, q);
@@ -4167,67 +4175,81 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 		return rc;
 	}
 
-	if (unlikely(READ_ONCE(q->owner) == smp_processor_id())) {
-		kfree_skb_reason(skb, SKB_DROP_REASON_TC_RECLASSIFY_LOOP);
-		return NET_XMIT_DROP;
-	}
-	/*
-	 * Heuristic to force contended enqueues to serialize on a
-	 * separate lock before trying to get qdisc main lock.
-	 * This permits qdisc->running owner to get the lock more
-	 * often and dequeue packets faster.
-	 * On PREEMPT_RT it is possible to preempt the qdisc owner during xmit
-	 * and then other tasks will only enqueue packets. The packets will be
-	 * sent after the qdisc owner is scheduled again. To prevent this
-	 * scenario the task always serialize on the lock.
+	/* Open code llist_add(&skb->ll_node, &q->defer_list) + queue limit.
+	 * In the try_cmpxchg() loop, we want to increment q->defer_count
+	 * at most once to limit the number of skbs in defer_list.
+	 * We perform the defer_count increment only if the list is not empty,
+	 * because some arches have slow atomic_long_inc_return().
 	 */
-	contended = qdisc_is_running(q) || IS_ENABLED(CONFIG_PREEMPT_RT);
-	if (unlikely(contended))
-		spin_lock(&q->busylock);
+	first_n = READ_ONCE(q->defer_list.first);
+	do {
+		if (first_n && !defer_count) {
+			defer_count = atomic_long_inc_return(&q->defer_count);
+			if (unlikely(defer_count > q->limit)) {
+				kfree_skb_reason(skb, SKB_DROP_REASON_QDISC_DROP);
+				return NET_XMIT_DROP;
+			}
+		}
+		skb->ll_node.next = first_n;
+	} while (!try_cmpxchg(&q->defer_list.first, &first_n, &skb->ll_node));
+
+	/* If defer_list was not empty, we know the cpu which queued
+	 * the first skb will process the whole list for us.
+	 */
+	if (first_n)
+		return NET_XMIT_SUCCESS;
 
 	spin_lock(root_lock);
+
+	ll_list = llist_del_all(&q->defer_list);
+	/* There is a small race because we clear defer_count not atomically
+	 * with the prior llist_del_all(). This means defer_list could grow
+	 * over q->limit.
+	 */
+	atomic_long_set(&q->defer_count, 0);
+
+	ll_list = llist_reverse_order(ll_list);
+
 	if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
-		__qdisc_drop(skb, &to_free);
+		llist_for_each_entry_safe(skb, next, ll_list, ll_node)
+			__qdisc_drop(skb, &to_free);
 		rc = NET_XMIT_DROP;
-	} else if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
-		   qdisc_run_begin(q)) {
+		goto unlock;
+	}
+	if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
+	    !llist_next(ll_list) && qdisc_run_begin(q)) {
 		/*
 		 * This is a work-conserving queue; there are no old skbs
 		 * waiting to be sent out; and the qdisc is not running -
 		 * xmit the skb directly.
 		 */
 
+		DEBUG_NET_WARN_ON_ONCE(skb != llist_entry(ll_list,
							   struct sk_buff,
							   ll_node));
 		qdisc_bstats_update(q, skb);
-
-		if (sch_direct_xmit(skb, q, dev, txq, root_lock, true)) {
-			if (unlikely(contended)) {
-				spin_unlock(&q->busylock);
-				contended = false;
-			}
+		if (sch_direct_xmit(skb, q, dev, txq, root_lock, true))
 			__qdisc_run(q);
-		}
-
 		qdisc_run_end(q);
 		rc = NET_XMIT_SUCCESS;
 	} else {
-		WRITE_ONCE(q->owner, smp_processor_id());
-		rc = dev_qdisc_enqueue(skb, q, &to_free, txq);
-		WRITE_ONCE(q->owner, -1);
-		if (qdisc_run_begin(q)) {
-			if (unlikely(contended)) {
-				spin_unlock(&q->busylock);
-				contended = false;
-			}
-			__qdisc_run(q);
-			qdisc_run_end(q);
+		int count = 0;
+
+		llist_for_each_entry_safe(skb, next, ll_list, ll_node) {
+			prefetch(next);
+			skb_mark_not_on_list(skb);
+			rc = dev_qdisc_enqueue(skb, q, &to_free, txq);
+			count++;
 		}
+		qdisc_run(q);
+		if (count != 1)
+			rc = NET_XMIT_SUCCESS;
 	}
+unlock:
 	spin_unlock(root_lock);
 	if (unlikely(to_free))
 		kfree_skb_list_reason(to_free,
 				      tcf_get_drop_reason(to_free));
-	if (unlikely(contended))
-		spin_unlock(&q->busylock);
 	return rc;
 }
 
@@ -4591,6 +4613,32 @@ u16 dev_pick_tx_zero(struct net_device *dev, struct sk_buff *skb,
 }
 EXPORT_SYMBOL(dev_pick_tx_zero);
 
+int sk_tx_queue_get(const struct sock *sk)
+{
+	int resel, val;
+
+	if (!sk)
+		return -1;
+	/* Paired with WRITE_ONCE() in sk_tx_queue_clear()
+	 * and sk_tx_queue_set().
+	 */
+	val = READ_ONCE(sk->sk_tx_queue_mapping);
+
+	if (val == NO_QUEUE_MAPPING)
+		return -1;
+
+	if (!sk_fullsock(sk))
+		return val;
+
+	resel = READ_ONCE(sock_net(sk)->core.sysctl_txq_reselection);
+	if (resel && time_is_before_jiffies(
+			READ_ONCE(sk->sk_tx_queue_mapping_jiffies) + resel))
+		return -1;
+
+	return val;
+}
+EXPORT_SYMBOL(sk_tx_queue_get);
+
 u16 netdev_pick_tx(struct net_device *dev, struct sk_buff *skb,
 		   struct net_device *sb_dev)
 {
@@ -4606,8 +4654,7 @@ u16 netdev_pick_tx(struct net_device *dev, struct sk_buff *skb,
 	if (new_index < 0)
 		new_index = skb_tx_hash(dev, sb_dev, skb);
 
-	if (queue_index != new_index && sk &&
-	    sk_fullsock(sk) &&
+	if (sk && sk_fullsock(sk) &&
 	    rcu_access_pointer(sk->sk_dst_cache))
 		sk_tx_queue_set(sk, new_index);
 
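The __dev_xmit_skb() rewrite above follows a combiner shape: every sender pushes its skb onto a lock-free stack, and only the sender that found the stack empty takes the qdisc lock, steals the whole list, restores FIFO order and enqueues on behalf of everyone else. A minimal userspace sketch of that shape, using C11 atomics and a mutex instead of the kernel's llist helpers and qdisc lock (all names hypothetical, error handling omitted):

#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdio.h>

struct node {
	struct node *next;
	int val;
};

static _Atomic(struct node *) defer_list;	/* lock-free LIFO, like llist  */
static pthread_mutex_t root_lock = PTHREAD_MUTEX_INITIALIZER;

/* Push one node; return the previous head, i.e. whether the list was
 * empty before we joined it (mirrors the open-coded llist_add()).
 */
static struct node *push(struct node *n)
{
	struct node *first = atomic_load(&defer_list);

	do {
		n->next = first;
	} while (!atomic_compare_exchange_weak(&defer_list, &first, n));
	return first;
}

static void enqueue(struct node *n)
{
	struct node *list, *prev, *next;

	if (push(n))		/* list was busy: its first producer flushes for us */
		return;

	pthread_mutex_lock(&root_lock);
	/* We were first: steal the whole list and reverse it to FIFO order,
	 * mirroring llist_del_all() + llist_reverse_order().
	 */
	list = atomic_exchange(&defer_list, (struct node *)NULL);
	prev = NULL;
	while (list) {
		next = list->next;
		list->next = prev;
		prev = list;
		list = next;
	}
	for (struct node *cur = prev; cur; cur = cur->next)
		printf("enqueue %d\n", cur->val);  /* stands in for dev_qdisc_enqueue() */
	pthread_mutex_unlock(&root_lock);
}

int main(void)
{
	struct node a = { .val = 1 }, b = { .val = 2 };

	enqueue(&a);
	enqueue(&b);
	return 0;
}

In the kernel version the flushing cpu additionally resets q->defer_count and drops packets once the deferred list grows past q->limit, as shown in the diff above.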
net/core/net_namespace.c
Lines changed: 1 addition & 0 deletions

@@ -395,6 +395,7 @@ static __net_init void preinit_net_sysctl(struct net *net)
 	net->core.sysctl_optmem_max = 128 * 1024;
 	net->core.sysctl_txrehash = SOCK_TXREHASH_ENABLED;
 	net->core.sysctl_tstamp_allow_data = 1;
+	net->core.sysctl_txq_reselection = msecs_to_jiffies(1000);
 }
 
 /* init code that must occur even if setup_net() is not called. */

net/core/sysctl_net_core.c
Lines changed: 7 additions & 0 deletions

@@ -667,6 +667,13 @@ static struct ctl_table netns_core_table[] = {
 		.extra2		= SYSCTL_ONE,
 		.proc_handler	= proc_dou8vec_minmax,
 	},
+	{
+		.procname	= "txq_reselection_ms",
+		.data		= &init_net.core.sysctl_txq_reselection,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_ms_jiffies,
+	},
 	{
 		.procname	= "tstamp_allow_data",
 		.data		= &init_net.core.sysctl_tstamp_allow_data,

net/sched/sch_generic.c
Lines changed: 0 additions & 5 deletions

@@ -669,7 +669,6 @@ struct Qdisc noop_qdisc = {
 	.ops		=	&noop_qdisc_ops,
 	.q.lock		=	__SPIN_LOCK_UNLOCKED(noop_qdisc.q.lock),
 	.dev_queue	=	&noop_netdev_queue,
-	.busylock	=	__SPIN_LOCK_UNLOCKED(noop_qdisc.busylock),
 	.gso_skb = {
 		.next = (struct sk_buff *)&noop_qdisc.gso_skb,
 		.prev = (struct sk_buff *)&noop_qdisc.gso_skb,
@@ -974,10 +973,6 @@ struct Qdisc *qdisc_alloc(struct netdev_queue *dev_queue,
 		}
 	}
 
-	spin_lock_init(&sch->busylock);
-	lockdep_set_class(&sch->busylock,
-			  dev->qdisc_tx_busylock ?: &qdisc_tx_busylock);
-
 	/* seqlock has the same scope of busylock, for NOLOCK qdisc */
 	spin_lock_init(&sch->seqlock);
 	lockdep_set_class(&sch->seqlock,
