
Commit 4f2bfd9

Neeraj Upadhyay authored and Paul E. McKenney (paulmckrcu) committed
srcu: Make expedited RCU grace periods block even less frequently
The purpose of commit 282d899 ("srcu: Prevent expedited GPs and blocking
readers from consuming CPU") was to prevent a long series of never-blocking
expedited SRCU grace periods from blocking kernel-live-patching (KLP)
progress. Although it was successful, it also resulted in excessive boot
times on certain embedded workloads running under qemu with the
"-bios QEMU_EFI.fd" command line. Here "excessive" means increasing the
boot time up into the three-to-four minute range. This increase in boot
time was due to the more than 6000 back-to-back invocations of
synchronize_rcu_expedited() within the KVM host OS, which in turn resulted
from qemu's emulation of a long series of MMIO accesses. Commit 640a7d3
("srcu: Block less aggressively for expedited grace periods") did not
significantly help this particular use case.

Zhangfei Gao and Shameerali Kolothum Thodi did experiments varying the
value of SRCU_MAX_NODELAY_PHASE with HZ=250 and with various values of
non-sleeping per-phase counts on a system with preemption enabled, and
observed the following boot times:

+──────────────────────────+────────────────+
| SRCU_MAX_NODELAY_PHASE   | Boot time (s)  |
+──────────────────────────+────────────────+
| 100                      | 30.053         |
| 150                      | 25.151         |
| 200                      | 20.704         |
| 250                      | 15.748         |
| 500                      | 11.401         |
| 1000                     | 11.443         |
| 10000                    | 11.258         |
| 1000000                  | 11.154         |
+──────────────────────────+────────────────+

Analysis of the experiment results shows additional improvement with
CPU-bound delays approaching one jiffy in duration. This improvement was
also seen when the number of per-phase iterations was scaled to one jiffy.

This commit therefore scales the per-grace-period-phase number of
non-sleeping polls so that non-sleeping polls extend for about one jiffy.
In addition, the delay-calculation call to srcu_get_delay() in
srcu_gp_end() is replaced with a simple check for an expedited grace
period.
This change schedules callback invocation immediately after expedited
grace periods complete, which results in greatly improved boot times.
Testing done by Marc and Zhangfei confirms that this change recovers most
of the performance degradation in boot time; for the CONFIG_HZ_250
configuration, specifically, boot times improve from 3m50s to 41s on
Marc's setup and from 2m40s to ~9.7s on Zhangfei's setup.

In addition to the changes to the default per-phase delays, this change
adds three new kernel parameters - srcutree.srcu_max_nodelay,
srcutree.srcu_max_nodelay_phase, and srcutree.srcu_retry_check_delay.
These allow users to configure the SRCU grace-period scanning delays in
order to more quickly react to additional use cases.

Fixes: 640a7d3 ("srcu: Block less aggressively for expedited grace periods")
Fixes: 282d899 ("srcu: Prevent expedited GPs and blocking readers from consuming CPU")
Reported-by: Zhangfei Gao <[email protected]>
Reported-by: yueluck <[email protected]>
Signed-off-by: Neeraj Upadhyay <[email protected]>
Tested-by: Marc Zyngier <[email protected]>
Tested-by: Zhangfei Gao <[email protected]>
Link: https://lore.kernel.org/all/[email protected]/
Signed-off-by: Paul E. McKenney <[email protected]>
1 parent 8f870e6 commit 4f2bfd9

File tree: 2 files changed, +81 −19 lines


Documentation/admin-guide/kernel-parameters.txt

Lines changed: 18 additions & 0 deletions

@@ -5771,6 +5771,24 @@
 			expediting.  Set to zero to disable automatic
 			expediting.
 
+	srcutree.srcu_max_nodelay [KNL]
+			Specifies the number of no-delay instances
+			per jiffy for which the SRCU grace period
+			worker thread will be rescheduled with zero
+			delay. Beyond this limit, worker thread will
+			be rescheduled with a sleep delay of one jiffy.
+
+	srcutree.srcu_max_nodelay_phase [KNL]
+			Specifies the per-grace-period phase, number of
+			non-sleeping polls of readers. Beyond this limit,
+			grace period worker thread will be rescheduled
+			with a sleep delay of one jiffy, between each
+			rescan of the readers, for a grace period phase.
+
+	srcutree.srcu_retry_check_delay [KNL]
+			Specifies number of microseconds of non-sleeping
+			delay between each non-sleeping poll of readers.
+
 	srcutree.small_contention_lim [KNL]
 			Specifies the number of update-side contention
 			events per jiffy will be tolerated before
kernel/rcu/srcutree.c

Lines changed: 63 additions & 19 deletions

@@ -511,10 +511,52 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
 	return sum;
 }
 
-#define SRCU_INTERVAL		1	// Base delay if no expedited GPs pending.
-#define SRCU_MAX_INTERVAL	10	// Maximum incremental delay from slow readers.
-#define SRCU_MAX_NODELAY_PHASE	3	// Maximum per-GP-phase consecutive no-delay instances.
-#define SRCU_MAX_NODELAY	100	// Maximum consecutive no-delay instances.
+/*
+ * We use an adaptive strategy for synchronize_srcu() and especially for
+ * synchronize_srcu_expedited().  We spin for a fixed time period
+ * (defined below, boot time configurable) to allow SRCU readers to exit
+ * their read-side critical sections.  If there are still some readers
+ * after one jiffy, we repeatedly block for one jiffy time periods.
+ * The blocking time is increased as the grace-period age increases,
+ * with max blocking time capped at 10 jiffies.
+ */
+#define SRCU_DEFAULT_RETRY_CHECK_DELAY	5
+
+static ulong srcu_retry_check_delay = SRCU_DEFAULT_RETRY_CHECK_DELAY;
+module_param(srcu_retry_check_delay, ulong, 0444);
+
+#define SRCU_INTERVAL		1	// Base delay if no expedited GPs pending.
+#define SRCU_MAX_INTERVAL	10	// Maximum incremental delay from slow readers.
+
+#define SRCU_DEFAULT_MAX_NODELAY_PHASE_LO	3UL	// Lowmark on default per-GP-phase
+							// no-delay instances.
+#define SRCU_DEFAULT_MAX_NODELAY_PHASE_HI	1000UL	// Highmark on default per-GP-phase
+							// no-delay instances.
+
+#define SRCU_UL_CLAMP_LO(val, low)	((val) > (low) ? (val) : (low))
+#define SRCU_UL_CLAMP_HI(val, high)	((val) < (high) ? (val) : (high))
+#define SRCU_UL_CLAMP(val, low, high)	SRCU_UL_CLAMP_HI(SRCU_UL_CLAMP_LO((val), (low)), (high))
+// per-GP-phase no-delay instances adjusted to allow non-sleeping poll upto
+// one jiffies time duration. Mult by 2 is done to factor in the srcu_get_delay()
+// called from process_srcu().
+#define SRCU_DEFAULT_MAX_NODELAY_PHASE_ADJUSTED	\
+	(2UL * USEC_PER_SEC / HZ / SRCU_DEFAULT_RETRY_CHECK_DELAY)
+
+// Maximum per-GP-phase consecutive no-delay instances.
+#define SRCU_DEFAULT_MAX_NODELAY_PHASE	\
+	SRCU_UL_CLAMP(SRCU_DEFAULT_MAX_NODELAY_PHASE_ADJUSTED,	\
+		      SRCU_DEFAULT_MAX_NODELAY_PHASE_LO,	\
+		      SRCU_DEFAULT_MAX_NODELAY_PHASE_HI)
+
+static ulong srcu_max_nodelay_phase = SRCU_DEFAULT_MAX_NODELAY_PHASE;
+module_param(srcu_max_nodelay_phase, ulong, 0444);
+
+// Maximum consecutive no-delay instances.
+#define SRCU_DEFAULT_MAX_NODELAY	(SRCU_DEFAULT_MAX_NODELAY_PHASE > 100 ?	\
+					 SRCU_DEFAULT_MAX_NODELAY_PHASE : 100)
+
+static ulong srcu_max_nodelay = SRCU_DEFAULT_MAX_NODELAY;
+module_param(srcu_max_nodelay, ulong, 0444);
 
 /*
  * Return grace-period delay, zero if there are expedited grace
@@ -535,7 +577,7 @@ static unsigned long srcu_get_delay(struct srcu_struct *ssp)
 			jbase += j - gpstart;
 		if (!jbase) {
 			WRITE_ONCE(ssp->srcu_n_exp_nodelay, READ_ONCE(ssp->srcu_n_exp_nodelay) + 1);
-			if (READ_ONCE(ssp->srcu_n_exp_nodelay) > SRCU_MAX_NODELAY_PHASE)
+			if (READ_ONCE(ssp->srcu_n_exp_nodelay) > srcu_max_nodelay_phase)
 				jbase = 1;
 		}
 	}
@@ -612,15 +654,6 @@ void __srcu_read_unlock(struct srcu_struct *ssp, int idx)
 }
 EXPORT_SYMBOL_GPL(__srcu_read_unlock);
 
-/*
- * We use an adaptive strategy for synchronize_srcu() and especially for
- * synchronize_srcu_expedited().  We spin for a fixed time period
- * (defined below) to allow SRCU readers to exit their read-side critical
- * sections.  If there are still some readers after a few microseconds,
- * we repeatedly block for 1-millisecond time periods.
- */
-#define SRCU_RETRY_CHECK_DELAY	5
-
 /*
  * Start an SRCU grace period.
  */
@@ -706,7 +739,7 @@ static void srcu_schedule_cbs_snp(struct srcu_struct *ssp, struct srcu_node *snp
  */
 static void srcu_gp_end(struct srcu_struct *ssp)
 {
-	unsigned long cbdelay;
+	unsigned long cbdelay = 1;
 	bool cbs;
 	bool last_lvl;
 	int cpu;
@@ -726,7 +759,9 @@ static void srcu_gp_end(struct srcu_struct *ssp)
 	spin_lock_irq_rcu_node(ssp);
 	idx = rcu_seq_state(ssp->srcu_gp_seq);
 	WARN_ON_ONCE(idx != SRCU_STATE_SCAN2);
-	cbdelay = !!srcu_get_delay(ssp);
+	if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
+		cbdelay = 0;
+
 	WRITE_ONCE(ssp->srcu_last_gp_end, ktime_get_mono_fast_ns());
 	rcu_seq_end(&ssp->srcu_gp_seq);
 	gpseq = rcu_seq_current(&ssp->srcu_gp_seq);
@@ -927,12 +962,16 @@ static void srcu_funnel_gp_start(struct srcu_struct *ssp, struct srcu_data *sdp,
  */
 static bool try_check_zero(struct srcu_struct *ssp, int idx, int trycount)
 {
+	unsigned long curdelay;
+
+	curdelay = !srcu_get_delay(ssp);
+
 	for (;;) {
 		if (srcu_readers_active_idx_check(ssp, idx))
 			return true;
-		if (--trycount + !srcu_get_delay(ssp) <= 0)
+		if ((--trycount + curdelay) <= 0)
 			return false;
-		udelay(SRCU_RETRY_CHECK_DELAY);
+		udelay(srcu_retry_check_delay);
 	}
 }
 
@@ -1588,7 +1627,7 @@ static void process_srcu(struct work_struct *work)
 	j = jiffies;
 	if (READ_ONCE(ssp->reschedule_jiffies) == j) {
 		WRITE_ONCE(ssp->reschedule_count, READ_ONCE(ssp->reschedule_count) + 1);
-		if (READ_ONCE(ssp->reschedule_count) > SRCU_MAX_NODELAY)
+		if (READ_ONCE(ssp->reschedule_count) > srcu_max_nodelay)
 			curdelay = 1;
 	} else {
 		WRITE_ONCE(ssp->reschedule_count, 1);
@@ -1680,6 +1719,11 @@ static int __init srcu_bootup_announce(void)
 	pr_info("Hierarchical SRCU implementation.\n");
 	if (exp_holdoff != DEFAULT_SRCU_EXP_HOLDOFF)
 		pr_info("\tNon-default auto-expedite holdoff of %lu ns.\n", exp_holdoff);
+	if (srcu_retry_check_delay != SRCU_DEFAULT_RETRY_CHECK_DELAY)
+		pr_info("\tNon-default retry check delay of %lu us.\n", srcu_retry_check_delay);
+	if (srcu_max_nodelay != SRCU_DEFAULT_MAX_NODELAY)
+		pr_info("\tNon-default max no-delay of %lu.\n", srcu_max_nodelay);
+	pr_info("\tMax phase no-delay instances is %lu.\n", srcu_max_nodelay_phase);
 	return 0;
 }
 early_initcall(srcu_bootup_announce);
