Commit 22f578f
6.19 Revert "net/sched: Fix mirred deadlock on device recursion" & net: dev_queue_xmit() llist adoption & net: add a fast path in __netif_schedule()
This reverts commits 0f022d3
and 44180fe.
A prior patch in this series implemented loop detection
in act_mirred, so we can remove q->owner to save some cycles
in the fast path.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Reviewed-by: Victor Nogueira <victor@mojatatu.com>
Tested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20251014171907.3554413-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: dev_queue_xmit() llist adoption
Remove the busylock spinlock and use a lockless list (llist)
to reduce spinlock contention to a minimum.
The idea is that only one cpu might spin on the qdisc spinlock,
while the others simply add their skb to the llist.
After this patch, we get a 300 % improvement on heavy TX workloads:
- sending twice the number of packets per second,
- while consuming 50 % fewer cycles.
Note that this also allows us, in the future, to submit batches
to the various qdisc->enqueue() methods.
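As an illustration of the pattern (a minimal sketch, not the upstream code: the defer_list member in struct Qdisc and the reuse of skb->ll_node are assumptions; the llist primitives themselves are the existing kernel API), the transmit path conceptually becomes:

#include <linux/llist.h>
#include <linux/skbuff.h>
#include <net/sch_generic.h>
#include <net/pkt_sched.h>

/* Sketch only: assumes a hypothetical 'struct llist_head defer_list'
 * member in struct Qdisc, replacing the old busylock.
 */
static int sketch_dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
                               spinlock_t *root_lock)
{
        struct sk_buff *to_free = NULL;
        struct llist_node *ll_list;
        struct sk_buff *next;

        /* Publish the skb on the lockless list.  llist_add() returns
         * true only for the cpu that found the list empty: that cpu
         * becomes responsible for taking the qdisc lock and draining
         * everything; the other cpus return immediately.
         */
        if (!llist_add(&skb->ll_node, &q->defer_list))
                return NET_XMIT_SUCCESS;

        spin_lock(root_lock);

        /* Grab all deferred skbs at once and restore submission order. */
        ll_list = llist_del_all(&q->defer_list);
        ll_list = llist_reverse_order(ll_list);

        llist_for_each_entry_safe(skb, next, ll_list, ll_node)
                q->enqueue(skb, q, &to_free);

        qdisc_run(q);
        spin_unlock(root_lock);

        if (unlikely(to_free))
                kfree_skb_list(to_free);
        return NET_XMIT_SUCCESS;
}

Only the cpu that turned the list from empty to non-empty takes the qdisc spinlock; every other producer performs a single lockless cmpxchg in llist_add(), which is where the contention reduction comes from.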
Tested:
- Dual Intel(R) Xeon(R) 6985P-C (480 hyper threads).
- 100Gbit NIC, 30 TX queues with FQ packet scheduler.
- echo 64 >/sys/kernel/slab/skbuff_small_head/cpu_partial (avoid contention in mm)
- 240 concurrent "netperf -t UDP_STREAM -- -m 120 -n"
Before:
16 Mpps (41 Mpps if each thread is pinned to a different cpu)
vmstat 2 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
243 0 0 2368988672 51036 1100852 0 0 146 1 242 60 0 9 91 0 0
244 0 0 2368988672 51036 1100852 0 0 536 10 487745 14718 0 52 48 0 0
244 0 0 2368988672 51036 1100852 0 0 512 0 503067 46033 0 52 48 0 0
244 0 0 2368988672 51036 1100852 0 0 512 0 494807 12107 0 52 48 0 0
244 0 0 2368988672 51036 1100852 0 0 702 26 492845 10110 0 52 48 0 0
Lock contention (1 second sample taken on 8 cores)
perf lock record -C0-7 sleep 1; perf lock contention
contended total wait max wait avg wait type caller
442111 6.79 s 162.47 ms 15.35 us spinlock dev_hard_start_xmit+0xcd
5961 9.57 ms 8.12 us 1.60 us spinlock __dev_queue_xmit+0x3a0
244 560.63 us 7.63 us 2.30 us spinlock do_softirq+0x5b
13 25.09 us 3.21 us 1.93 us spinlock net_tx_action+0xf8
If netperf threads are pinned, spinlock stress is very high.
perf lock record -C0-7 sleep 1; perf lock contention
contended total wait max wait avg wait type caller
964508 7.10 s 147.25 ms 7.36 us spinlock dev_hard_start_xmit+0xcd
201 268.05 us 4.65 us 1.33 us spinlock __dev_queue_xmit+0x3a0
12 26.05 us 3.84 us 2.17 us spinlock do_softirq+0x5b
@__dev_queue_xmit_ns:
[256, 512) 21 | |
[512, 1K) 631 | |
[1K, 2K) 27328 |@ |
[2K, 4K) 265392 |@@@@@@@@@@@@@@@@ |
[4K, 8K) 417543 |@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[8K, 16K) 826292 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16K, 32K) 733822 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[32K, 64K) 19055 |@ |
[64K, 128K) 17240 |@ |
[128K, 256K) 25633 |@ |
[256K, 512K) 4 | |
After:
29 Mpps (57 Mpps if each thread is pinned to a different cpu)
vmstat 2 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
78 0 0 2369573632 32896 1350988 0 0 22 0 331 254 0 8 92 0 0
75 0 0 2369573632 32896 1350988 0 0 22 50 425713 280199 0 23 76 0 0
104 0 0 2369573632 32896 1350988 0 0 290 0 430238 298247 0 23 76 0 0
86 0 0 2369573632 32896 1350988 0 0 132 0 428019 291865 0 24 76 0 0
90 0 0 2369573632 32896 1350988 0 0 502 0 422498 278672 0 23 76 0 0
perf lock record -C0-7 sleep 1; perf lock contention
contended total wait max wait avg wait type caller
2524 116.15 ms 486.61 us 46.02 us spinlock __dev_queue_xmit+0x55b
5821 107.18 ms 371.67 us 18.41 us spinlock dev_hard_start_xmit+0xcd
2377 9.73 ms 35.86 us 4.09 us spinlock ___slab_alloc+0x4e0
923 5.74 ms 20.91 us 6.22 us spinlock ___slab_alloc+0x5c9
121 3.42 ms 193.05 us 28.24 us spinlock net_tx_action+0xf8
6 564.33 us 167.60 us 94.05 us spinlock do_softirq+0x5b
If netperf threads are pinned (~54 Mpps)
perf lock record -C0-7 sleep 1; perf lock contention
32907 316.98 ms 195.98 us 9.63 us spinlock dev_hard_start_xmit+0xcd
4507 61.83 ms 212.73 us 13.72 us spinlock __dev_queue_xmit+0x554
2781 23.53 ms 40.03 us 8.46 us spinlock ___slab_alloc+0x5c9
3554 18.94 ms 34.69 us 5.33 us spinlock ___slab_alloc+0x4e0
233 9.09 ms 215.70 us 38.99 us spinlock do_softirq+0x5b
153 930.66 us 48.67 us 6.08 us spinlock net_tx_action+0xfd
84 331.10 us 14.22 us 3.94 us spinlock ___slab_alloc+0x5c9
140 323.71 us 9.94 us 2.31 us spinlock ___slab_alloc+0x4e0
@__dev_queue_xmit_ns:
[128, 256) 1539830 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[256, 512) 2299558 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[512, 1K) 483936 |@@@@@@@@@@ |
[1K, 2K) 265345 |@@@@@@ |
[2K, 4K) 145463 |@@@ |
[4K, 8K) 54571 |@ |
[8K, 16K) 10270 | |
[16K, 32K) 9385 | |
[32K, 64K) 7749 | |
[64K, 128K) 26799 | |
[128K, 256K) 2665 | |
[256K, 512K) 665 | |
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Tested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20251014171907.3554413-7-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: add a fast path in __netif_schedule()
Cpus serving NIC interrupts, and specifically TX completions, often
also end up restarting a busy qdisc (because the qdisc was stopped by BQL
or by the driver's own flow control).
When they call netdev_tx_completed_queue() or netif_tx_wake_queue(),
they call __netif_schedule() so that the queue can be run
later from net_tx_action() (raising NET_TX_SOFTIRQ).
Quite often, by the time the cpu reaches net_tx_action(), another cpu
has grabbed the qdisc spinlock from __dev_xmit_skb(), and we spend too much
time spinning on this lock.
We can detect in __netif_schedule() whether a cpu is already at a specific
point in __dev_xmit_skb() where we have the guarantee that the queue will
be run.
This patch gives a 13 % increase in throughput on an IDPF NIC (200Gbit),
32 TX queues, sending UDP packets of 120 bytes.
This also helps __qdisc_run() avoid forcing a NET_TX_SOFTIRQ
if another thread is waiting in __dev_xmit_skb().
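Conceptually (a minimal sketch: the qdisc_will_run condition is a hypothetical stand-in for the actual test the patch performs; the slow path mirrors the existing __netif_schedule()):

/* Sketch of the __netif_schedule() fast path.  'qdisc_will_run' is a
 * hypothetical field standing for "a cpu already passed the point in
 * __dev_xmit_skb() that guarantees the qdisc will be run"; the exact
 * upstream mechanism may differ.
 */
static void sketch_netif_schedule(struct Qdisc *q)
{
        /* Fast path: someone is already committed to running this
         * qdisc, no need to raise NET_TX_SOFTIRQ at all.
         */
        if (READ_ONCE(q->qdisc_will_run))
                return;

        /* Slow path: defer to net_tx_action() via NET_TX_SOFTIRQ. */
        if (!test_and_set_bit(__QDISC_STATE_SCHED, &q->state))
                __netif_reschedule(q);
}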
Before:
sar -n DEV 5 5|grep eth1|grep Average
Average: eth1 1496.44 52191462.56 210.00 13369396.90 0.00 0.00 0.00 54.76
After:
sar -n DEV 5 5|grep eth1|grep Average
Average: eth1 1457.88 59363099.96 205.08 15206384.35 0.00 0.00 0.00 62.29
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20251017145334.3016097-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: allow busy connected flows to switch tx queues
This is a followup of commit 726e9e8 ("tcp: refine
skb->ooo_okay setting") and of a prior commit in this series
("net: control skb->ooo_okay from skb_set_owner_w()").
skb->ooo_okay might never be set for bulk flows that always
have at least one skb in a qdisc queue or NIC queue,
especially if TX completion is delayed because of a stressed cpu.
The so-called "strange attractors" have caused many performance
issues (see for instance 9b462d0 ("tcp: TCP Small Queues
and strange attractors")); we need to do better.
We have tried very hard to avoid reorders because TCP was
not dealing with them nicely a decade ago.
Use the new net.core.txq_reselection_ms sysctl to let
flows follow XPS and select a more efficient queue.
After this patch, we no longer have to make sure threads
are pinned to cpus; they can now be migrated without
adding too much spinlock/qdisc/TX completion pressure.
The TX completion part was problematic, because it added false sharing
on various socket fields, and also false sharing and spinlock
contention in mm layers. Calling skb_orphan() from ndo_start_xmit()
is unfortunately not an option.
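To illustrate the idea (a minimal sketch under stated assumptions: sysctl_txq_reselection, holding the sysctl value already converted from ms to jiffies, is a hypothetical name, and the exact placement of this test in the real patch may differ; sk_tx_queue_mapping and sk_tx_queue_mapping_jiffies are the fields named in this changelog):

/* Sketch of the reselection check in the sk_tx_queue_get() path. */
static int sketch_sk_tx_queue_get(const struct sock *sk)
{
        unsigned long delay = READ_ONCE(sysctl_txq_reselection); /* hypothetical */
        int txq = READ_ONCE(sk->sk_tx_queue_mapping);

        if (txq == NO_QUEUE_MAPPING)
                return -1;

        /* If the cached mapping is older than txq_reselection_ms,
         * report "no mapping" so that the next transmit re-runs the
         * XPS queue selection.
         */
        if (delay &&
            time_after(jiffies,
                       READ_ONCE(sk->sk_tx_queue_mapping_jiffies) + delay))
                return -1;

        return txq;
}

Returning -1 here makes netdev_pick_tx() run XPS again, so the flow can follow its thread to the new cpu even if skb->ooo_okay was never set.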
Note for later:
1) Move sk->sk_tx_queue_mapping closer to sk_tx_queue_mapping_jiffies
for better cache locality.
2) Study whether 9b462d0 ("tcp: TCP Small Queues
and strange attractors") could be revised.
Tested:
Used a host with 32 TX queues, shared by groups of 8 cores.
XPS setup:
echo ff >/sys/class/net/eth1/queues/tx-0/xps_cpus
echo ff00 >/sys/class/net/eth1/queues/tx-1/xps_cpus
echo ff0000 >/sys/class/net/eth1/queues/tx-2/xps_cpus
echo ff000000 >/sys/class/net/eth1/queues/tx-3/xps_cpus
echo ff,00000000 >/sys/class/net/eth1/queues/tx-4/xps_cpus
echo ff00,00000000 >/sys/class/net/eth1/queues/tx-5/xps_cpus
echo ff0000,00000000 >/sys/class/net/eth1/queues/tx-6/xps_cpus
echo ff000000,00000000 >/sys/class/net/eth1/queues/tx-7/xps_cpus
...
Launched a tcp_stream with 15 threads and 1000 flows, initially affined to cores 0-15:
taskset -c 0-15 tcp_stream -T15 -F1000 -l1000 -c -H target_host
Checked that only queues 0 and 1 are used, as instructed by XPS:
tc -s qdisc show dev eth1|grep backlog|grep -v "backlog 0b 0p"
backlog 123489410b 1890p
backlog 69809026b 1064p
backlog 52401054b 805p
Then force each thread to run on cpus 1,9,17,25,33,41,49,57,65,73,81,89,97,105,113,121:
C=1;PID=`pidof tcp_stream`;for P in `ls /proc/$PID/task`; do taskset -pc $C $P; C=$(($C + 8));done
Set txq_reselection_ms to 1000
echo 1000 > /proc/sys/net/core/txq_reselection_ms
Check that the flows have migrated nicely:
tc -s qdisc show dev eth1|grep backlog|grep -v "backlog 0b 0p"
backlog 130508314b 1916p
backlog 8584380b 126p
backlog 8584380b 126p
backlog 8379990b 123p
backlog 8584380b 126p
backlog 8487484b 125p
backlog 8584380b 126p
backlog 8448120b 124p
backlog 8584380b 126p
backlog 8720640b 128p
backlog 8856900b 130p
backlog 8584380b 126p
backlog 8652510b 127p
backlog 8448120b 124p
backlog 8516250b 125p
backlog 7834950b 115p
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20251013152234.842065-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 files changed, 93 insertions(+), 49 deletions(-)