- ====================================
- Concurrency Managed Workqueue (cmwq)
- ====================================
+ =========
+ Workqueue
+ =========

:Date: September, 2010
:Author: Tejun Heo <[email protected]>
@@ -25,8 +25,8 @@ there is no work item left on the workqueue the worker becomes idle.
When a new work item gets queued, the worker begins executing again.


- Why cmwq?
- =========
+ Why Concurrency Managed Workqueue?
+ ==================================

In the original wq implementation, a multi threaded (MT) wq had one
worker thread per CPU and a single threaded (ST) wq had one worker
@@ -408,6 +408,180 @@ directory.
behavior of older kernels.


+ Affinity Scopes and Performance
+ ===============================
+
+ It'd be ideal if an unbound workqueue's behavior were optimal for the vast
+ majority of use cases without further tuning. Unfortunately, in the current
+ kernel, there exists a pronounced trade-off between locality and utilization
+ which necessitates explicit configuration when workqueues are heavily used.
+
+ Higher locality leads to higher efficiency, where more work is performed for
+ the same number of consumed CPU cycles. However, higher locality may also
+ cause lower overall system utilization if the work items are not spread
+ widely enough across the affinity scopes by the issuers. The following
+ performance testing with dm-crypt clearly illustrates this trade-off.
+
+ The tests are run on a CPU with 12 cores / 24 threads split across four L3
+ caches (AMD Ryzen 9 3900x). CPU clock boost is turned off for consistency.
+ ``/dev/dm-0`` is a dm-crypt device created on an NVME SSD (Samsung 990 PRO)
+ and opened with ``cryptsetup`` with default settings.
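The device setup described above can be sketched as follows. The partition and mapping names are hypothetical, and ``cryptsetup``'s own defaults (LUKS2 with aes-xts-plain64 on current versions) stand in for "default settings":

```shell
# Hypothetical device node; adjust /dev/nvme0n1p2 for the machine under test.
# WARNING: luksFormat destroys any existing data on the partition.
cryptsetup luksFormat /dev/nvme0n1p2    # format with cryptsetup's defaults
cryptsetup open /dev/nvme0n1p2 bench    # creates /dev/mapper/bench
cryptsetup status bench                 # reports cipher and backing device
```

``fio`` then targets the resulting device-mapper node, which appears as ``/dev/dm-0`` in the tests below when it is the only dm device on the system.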
+
+
+ Scenario 1: Enough issuers and work spread across the machine
+ -------------------------------------------------------------
+
+ The command used: ::
+
+   $ fio --filename=/dev/dm-0 --direct=1 --rw=randrw --bs=32k --ioengine=libaio \
+     --iodepth=64 --runtime=60 --numjobs=24 --time_based --group_reporting \
+     --name=iops-test-job --verify=sha512
+
+ There are 24 issuers, each issuing 64 IOs concurrently. ``--verify=sha512``
+ makes ``fio`` generate and read back the content each time, which makes
+ execution locality matter between the issuer and ``kcryptd``. The following
+ are the read bandwidths and CPU utilizations depending on different affinity
+ scope settings on ``kcryptd``, measured over five runs. Bandwidths are in
+ MiBps, and CPU util in percents.
+
+ .. list-table::
+    :widths: 16 20 20
+    :header-rows: 1
+
+    * - Affinity
+      - Bandwidth (MiBps)
+      - CPU util (%)
+
+    * - system
+      - 1159.40 ±1.34
+      - 99.31 ±0.02
+
+    * - cache
+      - 1166.40 ±0.89
+      - 99.34 ±0.01
+
+    * - cache (strict)
+      - 1166.00 ±0.71
+      - 99.35 ±0.01
+
+ With enough issuers spread across the system, there is no downside to
+ "cache", strict or otherwise. All three configurations saturate the whole
+ machine but the cache-affine ones outperform by 0.6% thanks to improved
+ locality.
+
+
+ Scenario 2: Fewer issuers, enough work for saturation
+ -----------------------------------------------------
+
+ The command used: ::
+
+   $ fio --filename=/dev/dm-0 --direct=1 --rw=randrw --bs=32k \
+     --ioengine=libaio --iodepth=64 --runtime=60 --numjobs=8 \
+     --time_based --group_reporting --name=iops-test-job --verify=sha512
+
+ The only difference from the previous scenario is ``--numjobs=8``. There are
+ only a third as many issuers, but there is still enough total work to
+ saturate the system.
+
+ .. list-table::
+    :widths: 16 20 20
+    :header-rows: 1
+
+    * - Affinity
+      - Bandwidth (MiBps)
+      - CPU util (%)
+
+    * - system
+      - 1155.40 ±0.89
+      - 97.41 ±0.05
+
+    * - cache
+      - 1154.40 ±1.14
+      - 96.15 ±0.09
+
+    * - cache (strict)
+      - 1112.00 ±4.64
+      - 93.26 ±0.35
+
+ This is more than enough work to saturate the system. Both "system" and
+ "cache" nearly saturate the machine, but not fully. "cache" uses less CPU
+ but its better efficiency puts it at the same bandwidth as "system".
+
+ Eight issuers moving around over four L3 cache scopes still allow "cache
+ (strict)" to mostly saturate the machine, but the loss of work conservation
+ is now starting to hurt with a 3.7% bandwidth loss.
+
+
+ Scenario 3: Even fewer issuers, not enough work to saturate
+ -----------------------------------------------------------
+
+ The command used: ::
+
+   $ fio --filename=/dev/dm-0 --direct=1 --rw=randrw --bs=32k \
+     --ioengine=libaio --iodepth=64 --runtime=60 --numjobs=4 \
+     --time_based --group_reporting --name=iops-test-job --verify=sha512
+
+ Again, the only difference is ``--numjobs=4``. With the number of issuers
+ reduced to four, there now isn't enough work to saturate the whole system
+ and the bandwidth becomes dependent on completion latencies.
+
+ .. list-table::
+    :widths: 16 20 20
+    :header-rows: 1
+
+    * - Affinity
+      - Bandwidth (MiBps)
+      - CPU util (%)
+
+    * - system
+      - 993.60 ±1.82
+      - 75.49 ±0.06
+
+    * - cache
+      - 973.40 ±1.52
+      - 74.90 ±0.07
+
+    * - cache (strict)
+      - 828.20 ±4.49
+      - 66.84 ±0.29
+
+ Now, the trade-off between locality and utilization is clearer. "cache"
+ shows a 2% bandwidth loss compared to "system" and "cache (strict)" a
+ whopping 20%.
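The headline percentages quoted in the three scenarios can be rechecked from the mean bandwidths in the tables. Which configuration serves as the baseline in each comparison is an assumption here, chosen so the arithmetic reproduces the quoted figures (the "whopping 20%" only falls out with "cache (strict)" as the denominator):

```shell
# Recompute the quoted bandwidth deltas from the per-scenario mean
# bandwidths (MiBps) in the tables above.
awk 'BEGIN {
    printf "scenario 1, cache over system:  +%.1f%%\n", (1166.40/1159.40 - 1) * 100
    printf "scenario 2, strict vs cache:    -%.1f%%\n", (1 - 1112.00/1154.40) * 100
    printf "scenario 3, cache vs system:    -%.1f%%\n", (1 - 973.40/993.60)  * 100
    printf "scenario 3, system over strict: +%.1f%%\n", (993.60/828.20 - 1)  * 100
}'
```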
+
+
+ Conclusion and Recommendations
+ ------------------------------
+
+ In the above experiments, the efficiency advantage of the "cache" affinity
+ scope over "system" is, while consistent and noticeable, small. However, the
+ impact is dependent on the distances between the scopes and may be more
+ pronounced in processors with more complex topologies.
+
+ While the loss of work-conservation in certain scenarios hurts, it is a lot
+ better than "cache (strict)", and maximizing workqueue utilization is
+ unlikely to be the common case anyway. As such, "cache" is the default
+ affinity scope for unbound pools.
+
+ * As there is no one option which is great for most cases, workqueue usages
+   that may consume a significant amount of CPU are recommended to configure
+   the workqueues using ``apply_workqueue_attrs()`` and/or enable
+   ``WQ_SYSFS``.
+
+ * An unbound workqueue with strict "cpu" affinity scope behaves the same as
+   a ``WQ_CPU_INTENSIVE`` per-cpu workqueue. There is no real advantage to
+   the latter and an unbound workqueue provides a lot more flexibility.
+
+ * Affinity scopes were introduced in Linux v6.5. To emulate the previous
+   behavior, use strict "numa" affinity scope.
+
+ * The loss of work-conservation in non-strict affinity scopes is likely
+   originating from the scheduler. There is no theoretical reason why the
+   kernel wouldn't be able to do the right thing and maintain
+   work-conservation in most cases. As such, it is possible that future
+   scheduler improvements may make most of these tunables unnecessary.
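Per the recommendations above, a workqueue created with ``WQ_SYSFS`` can be reconfigured at runtime through sysfs. A minimal sketch, assuming a v6.5+ kernel; "writeback" is used only as an example of an unbound workqueue that is commonly exposed there:

```shell
# Assumes kernel >= v6.5 and an unbound workqueue created with WQ_SYSFS.
cd /sys/devices/virtual/workqueue/writeback
cat affinity_scope           # current scope; "cache" is the default
echo numa > affinity_scope   # emulate pre-v6.5 per-NUMA-node placement
echo 1 > affinity_strict     # confine workers strictly to their scope
```

The default for all unbound workqueues can likewise be changed at boot with the ``workqueue.default_affinity_scope`` kernel parameter.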
+
+
Examining Configuration
=======================