You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Remove busylock spinlock and use a lockless list (llist)
to reduce spinlock contention to the minimum.
Idea is that only one cpu might spin on the qdisc spinlock,
while others simply add their skb in the llist.
After this patch, we get a 300 % improvement on heavy TX workloads.
- Sending twice the number of packets per second.
- While consuming 50 % less cycles.
Note that this also allows in the future to submit batches
to various qdisc->enqueue() methods.
Tested:
- Dual Intel(R) Xeon(R) 6985P-C (480 hyper threads).
- 100Gbit NIC, 30 TX queues with FQ packet scheduler.
- echo 64 >/sys/kernel/slab/skbuff_small_head/cpu_partial (avoid contention in mm)
- 240 concurrent "netperf -t UDP_STREAM -- -m 120 -n"
Before:
16 Mpps (41 Mpps if each thread is pinned to a different cpu)
vmstat 2 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
243 0 0 2368988672 51036 1100852 0 0 146 1 242 60 0 9 91 0 0
244 0 0 2368988672 51036 1100852 0 0 536 10 487745 14718 0 52 48 0 0
244 0 0 2368988672 51036 1100852 0 0 512 0 503067 46033 0 52 48 0 0
244 0 0 2368988672 51036 1100852 0 0 512 0 494807 12107 0 52 48 0 0
244 0 0 2368988672 51036 1100852 0 0 702 26 492845 10110 0 52 48 0 0
Lock contention (1 second sample taken on 8 cores)
perf lock record -C0-7 sleep 1; perf lock contention
contended total wait max wait avg wait type caller
442111 6.79 s 162.47 ms 15.35 us spinlock dev_hard_start_xmit+0xcd
5961 9.57 ms 8.12 us 1.60 us spinlock __dev_queue_xmit+0x3a0
244 560.63 us 7.63 us 2.30 us spinlock do_softirq+0x5b
13 25.09 us 3.21 us 1.93 us spinlock net_tx_action+0xf8
If netperf threads are pinned, spinlock stress is very high.
perf lock record -C0-7 sleep 1; perf lock contention
contended total wait max wait avg wait type caller
964508 7.10 s 147.25 ms 7.36 us spinlock dev_hard_start_xmit+0xcd
201 268.05 us 4.65 us 1.33 us spinlock __dev_queue_xmit+0x3a0
12 26.05 us 3.84 us 2.17 us spinlock do_softirq+0x5b
@__dev_queue_xmit_ns:
[256, 512) 21 | |
[512, 1K) 631 | |
[1K, 2K) 27328 |@ |
[2K, 4K) 265392 |@@@@@@@@@@@@@@@@ |
[4K, 8K) 417543 |@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[8K, 16K) 826292 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16K, 32K) 733822 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[32K, 64K) 19055 |@ |
[64K, 128K) 17240 |@ |
[128K, 256K) 25633 |@ |
[256K, 512K) 4 | |
After:
29 Mpps (57 Mpps if each thread is pinned to a different cpu)
vmstat 2 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
78 0 0 2369573632 32896 1350988 0 0 22 0 331 254 0 8 92 0 0
75 0 0 2369573632 32896 1350988 0 0 22 50 425713 280199 0 23 76 0 0
104 0 0 2369573632 32896 1350988 0 0 290 0 430238 298247 0 23 76 0 0
86 0 0 2369573632 32896 1350988 0 0 132 0 428019 291865 0 24 76 0 0
90 0 0 2369573632 32896 1350988 0 0 502 0 422498 278672 0 23 76 0 0
perf lock record -C0-7 sleep 1; perf lock contention
contended total wait max wait avg wait type caller
2524 116.15 ms 486.61 us 46.02 us spinlock __dev_queue_xmit+0x55b
5821 107.18 ms 371.67 us 18.41 us spinlock dev_hard_start_xmit+0xcd
2377 9.73 ms 35.86 us 4.09 us spinlock ___slab_alloc+0x4e0
923 5.74 ms 20.91 us 6.22 us spinlock ___slab_alloc+0x5c9
121 3.42 ms 193.05 us 28.24 us spinlock net_tx_action+0xf8
6 564.33 us 167.60 us 94.05 us spinlock do_softirq+0x5b
If netperf threads are pinned (~54 Mpps)
perf lock record -C0-7 sleep 1; perf lock contention
32907 316.98 ms 195.98 us 9.63 us spinlock dev_hard_start_xmit+0xcd
4507 61.83 ms 212.73 us 13.72 us spinlock __dev_queue_xmit+0x554
2781 23.53 ms 40.03 us 8.46 us spinlock ___slab_alloc+0x5c9
3554 18.94 ms 34.69 us 5.33 us spinlock ___slab_alloc+0x4e0
233 9.09 ms 215.70 us 38.99 us spinlock do_softirq+0x5b
153 930.66 us 48.67 us 6.08 us spinlock net_tx_action+0xfd
84 331.10 us 14.22 us 3.94 us spinlock ___slab_alloc+0x5c9
140 323.71 us 9.94 us 2.31 us spinlock ___slab_alloc+0x4e0
@__dev_queue_xmit_ns:
[128, 256) 1539830 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[256, 512) 2299558 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[512, 1K) 483936 |@@@@@@@@@@ |
[1K, 2K) 265345 |@@@@@@ |
[2K, 4K) 145463 |@@@ |
[4K, 8K) 54571 |@ |
[8K, 16K) 10270 | |
[16K, 32K) 9385 | |
[32K, 64K) 7749 | |
[64K, 128K) 26799 | |
[128K, 256K) 2665 | |
[256K, 512K) 665 | |
Signed-off-by: Eric Dumazet <[email protected]>
Reviewed-by: Toke Høiland-Jørgensen <[email protected]>
Reviewed-by: Kuniyuki Iwashima <[email protected]>
Tested-by: Jamal Hadi Salim <[email protected]>
Acked-by: Jamal Hadi Salim <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
0 commit comments