
Commit 7dbf15c

committed
workqueue: Add "Affinity Scopes and Performance" section to documentation
With affinity scopes and their strictness setting added, unbound workqueues
should now be able to cover a wide variety of configurations and use cases.
Unfortunately, the performance picture is not entirely straightforward due to
a trade-off between efficiency and work-conservation in some situations,
necessitating manual configuration.

This patch adds an "Affinity Scopes and Performance" section to
Documentation/core-api/workqueue.rst which illustrates the trade-off with a
set of experiments and provides some guidelines.

Signed-off-by: Tejun Heo <[email protected]>
1 parent 8639ece commit 7dbf15c

File tree

1 file changed: +179 −5 lines changed


Documentation/core-api/workqueue.rst

Lines changed: 179 additions & 5 deletions
@@ -1,6 +1,6 @@
-====================================
-Concurrency Managed Workqueue (cmwq)
-====================================
+=========
+Workqueue
+=========
 
 :Date: September, 2010
 :Author: Tejun Heo <[email protected]>
@@ -25,8 +25,8 @@ there is no work item left on the workqueue the worker becomes idle.
 When a new work item gets queued, the worker begins executing again.
 
 
-Why cmwq?
-=========
+Why Concurrency Managed Workqueue?
+==================================
 
 In the original wq implementation, a multi threaded (MT) wq had one
 worker thread per CPU and a single threaded (ST) wq had one worker
@@ -408,6 +408,180 @@ directory.
 behavior of older kernels.
 
 
+Affinity Scopes and Performance
+===============================
+
+It'd be ideal if an unbound workqueue's behavior were optimal for the vast
+majority of use cases without further tuning. Unfortunately, in the current
+kernel, there is a pronounced trade-off between locality and utilization
+which necessitates explicit configuration when workqueues are heavily used.
+
+Higher locality leads to higher efficiency, where more work is performed
+for the same number of consumed CPU cycles. However, higher locality may
+also cause lower overall system utilization if the work items are not
+spread enough across the affinity scopes by the issuers. The following
+performance testing with dm-crypt clearly illustrates this trade-off.
+
+The tests are run on a CPU with 12 cores / 24 threads split across four L3
+caches (AMD Ryzen 9 3900x). CPU clock boost is turned off for consistency.
+``/dev/dm-0`` is a dm-crypt device created on an NVMe SSD (Samsung 990 PRO)
+and opened with ``cryptsetup`` with default settings.
+
+
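+A minimal sketch of that kind of setup, for reference (the device path and
+mapping name here are illustrative, not the exact ones used in the tests)::
+
+  $ cryptsetup luksFormat /dev/nvme0n1p2
+  $ cryptsetup open /dev/nvme0n1p2 bench-crypt  # mapping appears as a /dev/dm-* node
+
+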
+Scenario 1: Enough issuers and work spread across the machine
+-------------------------------------------------------------
+
+The command used: ::
+
+  $ fio --filename=/dev/dm-0 --direct=1 --rw=randrw --bs=32k --ioengine=libaio \
+    --iodepth=64 --runtime=60 --numjobs=24 --time_based --group_reporting \
+    --name=iops-test-job --verify=sha512
+
+There are 24 issuers, each issuing 64 IOs concurrently. ``--verify=sha512``
+makes ``fio`` generate and read back the content each time, which makes
+execution locality between the issuer and ``kcryptd`` matter. The following
+are the read bandwidths and CPU utilizations depending on different affinity
+scope settings on ``kcryptd``, measured over five runs. Bandwidths are in
+MiBps, and CPU util in percents.
+
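+One way to switch the affinity scope between runs is the workqueue sysfs
+interface, assuming the workqueue in question is created with ``WQ_SYSFS``
+(the workqueue name below is a placeholder)::
+
+  $ echo cache > /sys/devices/virtual/workqueue/<wq_name>/affinity_scope
+  $ echo 1 > /sys/devices/virtual/workqueue/<wq_name>/affinity_strict
+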
+.. list-table::
+   :widths: 16 20 20
+   :header-rows: 1
+
+   * - Affinity
+     - Bandwidth (MiBps)
+     - CPU util (%)
+
+   * - system
+     - 1159.40 ±1.34
+     - 99.31 ±0.02
+
+   * - cache
+     - 1166.40 ±0.89
+     - 99.34 ±0.01
+
+   * - cache (strict)
+     - 1166.00 ±0.71
+     - 99.35 ±0.01
+
+With enough issuers spread across the system, there is no downside to
+"cache", strict or otherwise. All three configurations saturate the whole
+machine but the cache-affine ones outperform by 0.6% thanks to improved
+locality.
+
+
+Scenario 2: Fewer issuers, enough work for saturation
+-----------------------------------------------------
+
+The command used: ::
+
+  $ fio --filename=/dev/dm-0 --direct=1 --rw=randrw --bs=32k \
+    --ioengine=libaio --iodepth=64 --runtime=60 --numjobs=8 \
+    --time_based --group_reporting --name=iops-test-job --verify=sha512
+
+The only difference from the previous scenario is ``--numjobs=8``. There
+are a third as many issuers but there is still enough total work to
+saturate the system.
+
+.. list-table::
+   :widths: 16 20 20
+   :header-rows: 1
+
+   * - Affinity
+     - Bandwidth (MiBps)
+     - CPU util (%)
+
+   * - system
+     - 1155.40 ±0.89
+     - 97.41 ±0.05
+
+   * - cache
+     - 1154.40 ±1.14
+     - 96.15 ±0.09
+
+   * - cache (strict)
+     - 1112.00 ±4.64
+     - 93.26 ±0.35
+
+This is more than enough work to saturate the system. Both "system" and
+"cache" nearly saturate the machine, but not fully. "cache" is using less
+CPU but its better efficiency puts it at the same bandwidth as "system".
+
+Eight issuers moving around over four L3 cache scopes still allow "cache
+(strict)" to mostly saturate the machine, but the loss of work conservation
+is now starting to hurt, with a 3.7% bandwidth loss.
+
+
+Scenario 3: Even fewer issuers, not enough work to saturate
+-----------------------------------------------------------
+
+The command used: ::
+
+  $ fio --filename=/dev/dm-0 --direct=1 --rw=randrw --bs=32k \
+    --ioengine=libaio --iodepth=64 --runtime=60 --numjobs=4 \
+    --time_based --group_reporting --name=iops-test-job --verify=sha512
+
+Again, the only difference is ``--numjobs=4``. With the number of issuers
+reduced to four, there now isn't enough work to saturate the whole system
+and the bandwidth becomes dependent on completion latencies.
+
+.. list-table::
+   :widths: 16 20 20
+   :header-rows: 1
+
+   * - Affinity
+     - Bandwidth (MiBps)
+     - CPU util (%)
+
+   * - system
+     - 993.60 ±1.82
+     - 75.49 ±0.06
+
+   * - cache
+     - 973.40 ±1.52
+     - 74.90 ±0.07
+
+   * - cache (strict)
+     - 828.20 ±4.49
+     - 66.84 ±0.29
+
+Now, the trade-off between locality and utilization is clearer. "cache"
+shows a 2% bandwidth loss compared to "system", and "cache (strict)" a
+whopping 20%.
+
+
+Conclusion and Recommendations
+------------------------------
+
+In the above experiments, the efficiency advantage of the "cache" affinity
+scope over "system" is, while consistent and noticeable, small. However, the
+impact is dependent on the distances between the scopes and may be more
+pronounced in processors with more complex topologies.
+
+While "cache"'s loss of work-conservation in certain scenarios hurts, it is
+still a lot better than "cache (strict)", and maximizing workqueue
+utilization is unlikely to be the common case anyway. As such, "cache" is
+the default affinity scope for unbound pools.
+
+* As there is no one option which is great for most cases, workqueue usages
+  that may consume a significant amount of CPU are recommended to configure
+  the workqueues using ``apply_workqueue_attrs()`` and/or enable
+  ``WQ_SYSFS``.
+
+* An unbound workqueue with strict "cpu" affinity scope behaves the same as
+  a ``WQ_CPU_INTENSIVE`` per-cpu workqueue. There is no real advantage to
+  the latter and an unbound workqueue provides a lot more flexibility.
+
+* Affinity scopes are introduced in Linux v6.5. To emulate the previous
+  behavior, use strict "numa" affinity scope, as shown in the sketch after
+  this list.
+
+* The loss of work-conservation in non-strict affinity scopes is likely
+  originating from the scheduler. There is no theoretical reason why the
+  kernel wouldn't be able to do the right thing and maintain
+  work-conservation in most cases. As such, it is possible that future
+  scheduler improvements may make most of these tunables unnecessary.
+
+
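+A minimal sketch of the emulation mentioned above, assuming the workqueue
+in question is exposed through ``WQ_SYSFS`` (the workqueue name is a
+placeholder)::
+
+  $ echo numa > /sys/devices/virtual/workqueue/<wq_name>/affinity_scope
+  $ echo 1 > /sys/devices/virtual/workqueue/<wq_name>/affinity_strict
+
+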
 Examining Configuration
 =======================
