
Commit 63ce50f

Merge tag 'sched-core-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:

 "Fair scheduler (SCHED_OTHER) improvements:
   - Remove the old and now unused SIS_PROP code & option
   - Scan cluster before LLC in the wake-up path
   - Use candidate prev/recent_used CPU if scanning failed for cluster wakeup

  NUMA scheduling improvements:
   - Improve the VMA access-PID code to better skip/scan VMAs
   - Extend tracing to cover VMA-skipping decisions
   - Improve/fix the recently introduced sched_numa_find_nth_cpu() code
   - Generalize numa_map_to_online_node()

  Energy scheduling improvements:
   - Remove the EM_MAX_COMPLEXITY limit
   - Add tracepoints to track energy computation
   - Make the behavior of the 'sched_energy_aware' sysctl more consistent
   - Consolidate and clean up access to a CPU's max compute capacity
   - Fix uclamp code corner cases

  RT scheduling improvements:
   - Drive dl_rq->overloaded with dl_rq->pushable_dl_tasks updates
   - Drive the ->rto_mask with rt_rq->pushable_tasks updates

  Scheduler scalability improvements:
   - Rate-limit updates to tg->load_avg
   - On x86 disable IBRS when CPU is offline to improve single-threaded performance
   - Micro-optimize in_task() and in_interrupt()
   - Micro-optimize the PSI code
   - Avoid updating PSI triggers and ->rtpoll_total when there are no state changes

  Core scheduler infrastructure improvements:
   - Use saved_state to reduce some spurious freezer wakeups
   - Bring in a handful of fast-headers improvements to scheduler headers
   - Make the scheduler UAPI headers more widely usable by user-space
   - Simplify the control flow of scheduler syscalls by using lock guards
   - Fix sched_setaffinity() vs. CPU hotplug race

  Scheduler debuggability improvements:
   - Disallow writing invalid values to sched_rt_period_us
   - Fix a race in the rq-clock debugging code triggering warnings
   - Fix a warning in the bandwidth distribution code
   - Micro-optimize in_atomic_preempt_off() checks
   - Enforce that the tasklist_lock is held in for_each_thread()
   - Print the TGID in sched_show_task()
   - Remove the /proc/sys/kernel/sched_child_runs_first sysctl

  ... and misc cleanups & fixes"

* tag 'sched-core-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (82 commits)
  sched/fair: Remove SIS_PROP
  sched/fair: Use candidate prev/recent_used CPU if scanning failed for cluster wakeup
  sched/fair: Scan cluster before scanning LLC in wake-up path
  sched: Add cpus_share_resources API
  sched/core: Fix RQCF_ACT_SKIP leak
  sched/fair: Remove unused 'curr' argument from pick_next_entity()
  sched/nohz: Update comments about NEWILB_KICK
  sched/fair: Remove duplicate #include
  sched/psi: Update poll => rtpoll in relevant comments
  sched: Make PELT acronym definition searchable
  sched: Fix stop_one_cpu_nowait() vs hotplug
  sched/psi: Bail out early from irq time accounting
  sched/topology: Rename 'DIE' domain to 'PKG'
  sched/psi: Delete the 'update_total' function parameter from update_triggers()
  sched/psi: Avoid updating PSI triggers and ->rtpoll_total when there are no state changes
  sched/headers: Remove comment referring to rq::cpu_load, since this has been removed
  sched/numa: Complete scanning of inactive VMAs when there is no alternative
  sched/numa: Complete scanning of partial VMAs regardless of PID activity
  sched/numa: Move up the access pid reset logic
  sched/numa: Trace decisions related to skipping VMAs
  ...
2 parents 3cf3fab + 984ffb6 commit 63ce50f


45 files changed (+1013, -965 lines)

Documentation/admin-guide/pm/intel_idle.rst

Lines changed: 16 additions & 1 deletion

@@ -170,7 +170,7 @@ and ``idle=nomwait``. If any of them is present in the kernel command line, the
 ``MWAIT`` instruction is not allowed to be used, so the initialization of
 ``intel_idle`` will fail.
 
-Apart from that there are four module parameters recognized by ``intel_idle``
+Apart from that there are five module parameters recognized by ``intel_idle``
 itself that can be set via the kernel command line (they cannot be updated via
 sysfs, so that is the only way to change their values).
 
@@ -216,6 +216,21 @@ are ignored).
 The idle states disabled this way can be enabled (on a per-CPU basis) from user
 space via ``sysfs``.
 
+The ``ibrs_off`` module parameter is a boolean flag (defaults to
+false). If set, it is used to control if IBRS (Indirect Branch Restricted
+Speculation) should be turned off when the CPU enters an idle state.
+This flag does not affect CPUs that use Enhanced IBRS which can remain
+on with little performance impact.
+
+For some CPUs, IBRS will be selected as mitigation for Spectre v2 and Retbleed
+security vulnerabilities by default. Leaving the IBRS mode on while idling may
+have a performance impact on its sibling CPU. The IBRS mode will be turned off
+by default when the CPU enters into a deep idle state, but not in some
+shallower ones. Setting the ``ibrs_off`` module parameter will force the IBRS
+mode to off when the CPU is in any one of the available idle states. This may
+help performance of a sibling CPU at the expense of a slightly higher wakeup
+latency for the idle CPU.
+
 
 .. _intel-idle-core-and-package-idle-states:

Documentation/admin-guide/sysctl/kernel.rst

Lines changed: 2 additions & 1 deletion

@@ -1182,7 +1182,8 @@ automatically on platforms where it can run (that is,
 platforms with asymmetric CPU topologies and having an Energy
 Model available). If your platform happens to meet the
 requirements for EAS but you do not want to use it, change
-this value to 0.
+this value to 0. On Non-EAS platforms, write operation fails and
+read doesn't return anything.
 
 task_delayacct
 ===============

Documentation/scheduler/sched-capacity.rst

Lines changed: 7 additions & 6 deletions

@@ -39,14 +39,15 @@ per Hz, leading to::
 -------------------
 
 Two different capacity values are used within the scheduler. A CPU's
-``capacity_orig`` is its maximum attainable capacity, i.e. its maximum
-attainable performance level. A CPU's ``capacity`` is its ``capacity_orig`` to
-which some loss of available performance (e.g. time spent handling IRQs) is
-subtracted.
+``original capacity`` is its maximum attainable capacity, i.e. its maximum
+attainable performance level. This original capacity is returned by
+the function arch_scale_cpu_capacity(). A CPU's ``capacity`` is its ``original
+capacity`` to which some loss of available performance (e.g. time spent
+handling IRQs) is subtracted.
 
 Note that a CPU's ``capacity`` is solely intended to be used by the CFS class,
-while ``capacity_orig`` is class-agnostic. The rest of this document will use
-the term ``capacity`` interchangeably with ``capacity_orig`` for the sake of
+while ``original capacity`` is class-agnostic. The rest of this document will use
+the term ``capacity`` interchangeably with ``original capacity`` for the sake of
 brevity.
 
 1.3 Platform examples

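As an editorial aside (not part of this commit), the relationship the reworded paragraph describes can be sketched in a few lines of Python; `original_capacity` stands in for the value returned by arch_scale_cpu_capacity() and `lost` for performance lost to side activities such as IRQ handling (both names are hypothetical, chosen for this illustration):

```python
# Sketch only: the scheduler's 'capacity' is the CPU's original
# (maximum attainable) capacity minus performance lost to side
# activities such as IRQ handling.
def effective_capacity(original_capacity: int, lost: int) -> int:
    """Return the CFS-visible capacity given the original capacity."""
    return max(original_capacity - lost, 0)

# A big core with the reference capacity of 1024 that spends some of
# its time servicing interrupts ends up with less usable capacity.
print(effective_capacity(1024, 100))  # 924
```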
Documentation/scheduler/sched-energy.rst

Lines changed: 3 additions & 26 deletions

@@ -359,32 +359,9 @@ in milli-Watts or in an 'abstract scale'.
 6.3 - Energy Model complexity
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-The task wake-up path is very latency-sensitive. When the EM of a platform is
-too complex (too many CPUs, too many performance domains, too many performance
-states, ...), the cost of using it in the wake-up path can become prohibitive.
-The energy-aware wake-up algorithm has a complexity of:
-
-	C = Nd * (Nc + Ns)
-
-with: Nd the number of performance domains; Nc the number of CPUs; and Ns the
-total number of OPPs (ex: for two perf. domains with 4 OPPs each, Ns = 8).
-
-A complexity check is performed at the root domain level, when scheduling
-domains are built. EAS will not start on a root domain if its C happens to be
-higher than the completely arbitrary EM_MAX_COMPLEXITY threshold (2048 at the
-time of writing).
-
-If you really want to use EAS but the complexity of your platform's Energy
-Model is too high to be used with a single root domain, you're left with only
-two possible options:
-
-  1. split your system into separate, smaller, root domains using exclusive
-     cpusets and enable EAS locally on each of them. This option has the
-     benefit to work out of the box but the drawback of preventing load
-     balance between root domains, which can result in an unbalanced system
-     overall;
-  2. submit patches to reduce the complexity of the EAS wake-up algorithm,
-     hence enabling it to cope with larger EMs in reasonable time.
+EAS does not impose any complexity limit on the number of PDs/OPPs/CPUs but
+restricts the number of CPUs to EM_MAX_NUM_CPUS to prevent overflows during
+the energy estimation.
 
 
 6.4 - Schedutil governor

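For reference, the complexity check that this hunk removes is easy to evaluate directly. Using the removed documentation's own example (two performance domains with 4 OPPs each, so Ns = 8) and a hypothetical 8-CPU system (the CPU count is an assumption for illustration), the value sits far below the old EM_MAX_COMPLEXITY threshold of 2048:

```python
# Sketch of the *removed* EAS complexity check: C = Nd * (Nc + Ns),
# where Nd = performance domains, Nc = CPUs, Ns = total OPPs.
EM_MAX_COMPLEXITY = 2048  # the "completely arbitrary" old threshold

def eas_complexity(nd: int, nc: int, ns: int) -> int:
    return nd * (nc + ns)

# Two perf domains with 4 OPPs each (Ns = 8); Nc = 8 is hypothetical.
c = eas_complexity(nd=2, nc=8, ns=8)
print(c, c <= EM_MAX_COMPLEXITY)  # 32 True
```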
Documentation/scheduler/sched-rt-group.rst

Lines changed: 21 additions & 19 deletions

@@ -39,25 +39,25 @@ Most notable:
 1.1 The problem
 ---------------
 
-Realtime scheduling is all about determinism, a group has to be able to rely on
+Real-time scheduling is all about determinism, a group has to be able to rely on
 the amount of bandwidth (eg. CPU time) being constant. In order to schedule
-multiple groups of realtime tasks, each group must be assigned a fixed portion
-of the CPU time available. Without a minimum guarantee a realtime group can
+multiple groups of real-time tasks, each group must be assigned a fixed portion
+of the CPU time available. Without a minimum guarantee a real-time group can
 obviously fall short. A fuzzy upper limit is of no use since it cannot be
 relied upon. Which leaves us with just the single fixed portion.
 
 1.2 The solution
 ----------------
 
 CPU time is divided by means of specifying how much time can be spent running
-in a given period. We allocate this "run time" for each realtime group which
-the other realtime groups will not be permitted to use.
+in a given period. We allocate this "run time" for each real-time group which
+the other real-time groups will not be permitted to use.
 
-Any time not allocated to a realtime group will be used to run normal priority
+Any time not allocated to a real-time group will be used to run normal priority
 tasks (SCHED_OTHER). Any allocated run time not used will also be picked up by
 SCHED_OTHER.
 
-Let's consider an example: a frame fixed realtime renderer must deliver 25
+Let's consider an example: a frame fixed real-time renderer must deliver 25
 frames a second, which yields a period of 0.04s per frame. Now say it will also
 have to play some music and respond to input, leaving it with around 80% CPU
 time dedicated for the graphics. We can then give this group a run time of 0.8
@@ -70,7 +70,7 @@ needs only about 3% CPU time to do so, it can do with a 0.03 * 0.005s =
 of 0.00015s.
 
 The remaining CPU time will be used for user input and other tasks. Because
-realtime tasks have explicitly allocated the CPU time they need to perform
+real-time tasks have explicitly allocated the CPU time they need to perform
 their tasks, buffer underruns in the graphics or audio can be eliminated.
 
 NOTE: the above example is not fully implemented yet. We still
@@ -87,18 +87,20 @@ lack an EDF scheduler to make non-uniform periods usable.
 The system wide settings are configured under the /proc virtual file system:
 
 /proc/sys/kernel/sched_rt_period_us:
-  The scheduling period that is equivalent to 100% CPU bandwidth
+  The scheduling period that is equivalent to 100% CPU bandwidth.
 
 /proc/sys/kernel/sched_rt_runtime_us:
-  A global limit on how much time realtime scheduling may use. Even without
-  CONFIG_RT_GROUP_SCHED enabled, this will limit time reserved to realtime
-  processes. With CONFIG_RT_GROUP_SCHED it signifies the total bandwidth
-  available to all realtime groups.
+  A global limit on how much time real-time scheduling may use. This is always
+  less or equal to the period_us, as it denotes the time allocated from the
+  period_us for the real-time tasks. Even without CONFIG_RT_GROUP_SCHED enabled,
+  this will limit time reserved to real-time processes. With
+  CONFIG_RT_GROUP_SCHED=y it signifies the total bandwidth available to all
+  real-time groups.
 
 * Time is specified in us because the interface is s32. This gives an
   operating range from 1us to about 35 minutes.
 * sched_rt_period_us takes values from 1 to INT_MAX.
-* sched_rt_runtime_us takes values from -1 to (INT_MAX - 1).
+* sched_rt_runtime_us takes values from -1 to sched_rt_period_us.
 * A run time of -1 specifies runtime == period, ie. no limit.
 
 
@@ -108,18 +110,18 @@ The system wide settings are configured under the /proc virtual file system:
 The default values for sched_rt_period_us (1000000 or 1s) and
 sched_rt_runtime_us (950000 or 0.95s). This gives 0.05s to be used by
 SCHED_OTHER (non-RT tasks). These defaults were chosen so that a run-away
-realtime tasks will not lock up the machine but leave a little time to recover
+real-time tasks will not lock up the machine but leave a little time to recover
 it. By setting runtime to -1 you'd get the old behaviour back.
 
 By default all bandwidth is assigned to the root group and new groups get the
 period from /proc/sys/kernel/sched_rt_period_us and a run time of 0. If you
 want to assign bandwidth to another group, reduce the root group's bandwidth
 and assign some or all of the difference to another group.
 
-Realtime group scheduling means you have to assign a portion of total CPU
-bandwidth to the group before it will accept realtime tasks. Therefore you will
-not be able to run realtime tasks as any user other than root until you have
-done that, even if the user has the rights to run processes with realtime
+Real-time group scheduling means you have to assign a portion of total CPU
+bandwidth to the group before it will accept real-time tasks. Therefore you will
+not be able to run real-time tasks as any user other than root until you have
+done that, even if the user has the rights to run processes with real-time
 priority!
 
 

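As a quick sanity check (not part of the patch), the arithmetic in the document's renderer and DVD examples, plus the default sysctl values, works out as follows; `runtime_us` is a helper invented for this sketch:

```python
# Worked numbers from sched-rt-group.rst's own examples (sketch only).
def runtime_us(period_us: int, share: float) -> int:
    """Run time a group gets when allocated `share` of each period."""
    return round(period_us * share)

# 25 fps renderer: period 0.04 s, 80% share -> 0.032 s of run time.
assert runtime_us(40_000, 0.8) == 32_000

# DVD subsystem: period 0.005 s, 3% share -> 0.00015 s of run time.
assert runtime_us(5_000, 0.03) == 150

# Defaults: sched_rt_period_us = 1_000_000, sched_rt_runtime_us = 950_000,
# leaving 0.05 s per second for SCHED_OTHER.
print(1_000_000 - 950_000)  # 50000
```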
arch/powerpc/kernel/smp.c

Lines changed: 2 additions & 2 deletions

@@ -1051,7 +1051,7 @@ static struct sched_domain_topology_level powerpc_topology[] = {
 #endif
 	{ shared_cache_mask, powerpc_shared_cache_flags, SD_INIT_NAME(CACHE) },
 	{ cpu_mc_mask, SD_INIT_NAME(MC) },
-	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
+	{ cpu_cpu_mask, SD_INIT_NAME(PKG) },
 	{ NULL, },
 };
 
@@ -1595,7 +1595,7 @@ static void add_cpu_to_masks(int cpu)
 	/* Skip all CPUs already part of current CPU core mask */
 	cpumask_andnot(mask, cpu_online_mask, cpu_core_mask(cpu));
 
-	/* If chip_id is -1; limit the cpu_core_mask to within DIE*/
+	/* If chip_id is -1; limit the cpu_core_mask to within PKG */
 	if (chip_id == -1)
 		cpumask_and(mask, mask, cpu_cpu_mask(cpu));
 

arch/s390/kernel/topology.c

Lines changed: 1 addition & 1 deletion

@@ -522,7 +522,7 @@ static struct sched_domain_topology_level s390_topology[] = {
 	{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
 	{ cpu_book_mask, SD_INIT_NAME(BOOK) },
 	{ cpu_drawer_mask, SD_INIT_NAME(DRAWER) },
-	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
+	{ cpu_cpu_mask, SD_INIT_NAME(PKG) },
 	{ NULL, },
 };
 

arch/x86/include/asm/spec-ctrl.h

Lines changed: 11 additions & 0 deletions

@@ -4,6 +4,7 @@
 
 #include <linux/thread_info.h>
 #include <asm/nospec-branch.h>
+#include <asm/msr.h>
 
 /*
  * On VMENTER we must preserve whatever view of the SPEC_CTRL MSR
@@ -76,6 +77,16 @@ static inline u64 ssbd_tif_to_amd_ls_cfg(u64 tifn)
 	return (tifn & _TIF_SSBD) ? x86_amd_ls_cfg_ssbd_mask : 0ULL;
 }
 
+/*
+ * This can be used in noinstr functions & should only be called in bare
+ * metal context.
+ */
+static __always_inline void __update_spec_ctrl(u64 val)
+{
+	__this_cpu_write(x86_spec_ctrl_current, val);
+	native_wrmsrl(MSR_IA32_SPEC_CTRL, val);
+}
+
 #ifdef CONFIG_SMP
 extern void speculative_store_bypass_ht_init(void);
 #else

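The point of the new helper is the invariant it maintains: the per-CPU software cache (x86_spec_ctrl_current) and the hardware MSR are always written together. A behavioral model (illustration only, not kernel code; the class and attribute names are invented for this sketch):

```python
# Model of the invariant __update_spec_ctrl() maintains: the per-CPU
# cached value and the (simulated) MSR are always updated in lockstep.
class Cpu:
    def __init__(self) -> None:
        self.x86_spec_ctrl_current = 0  # per-CPU software cache
        self.msr_spec_ctrl = 0          # simulated MSR_IA32_SPEC_CTRL

    def update_spec_ctrl(self, val: int) -> None:
        self.x86_spec_ctrl_current = val  # __this_cpu_write(...)
        self.msr_spec_ctrl = val          # native_wrmsrl(...)

cpu = Cpu()
cpu.update_spec_ctrl(0)  # e.g. native_play_dead() turning IBRS off
assert cpu.x86_spec_ctrl_current == cpu.msr_spec_ctrl == 0
```

Keeping the cache coherent with the MSR is what lets later code reason about the current SPEC_CTRL value without an expensive rdmsr.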
arch/x86/kernel/smpboot.c

Lines changed: 10 additions & 2 deletions

@@ -87,6 +87,7 @@
 #include <asm/hw_irq.h>
 #include <asm/stackprotector.h>
 #include <asm/sev.h>
+#include <asm/spec-ctrl.h>
 
 /* representing HT siblings of each logical CPU */
 DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_sibling_map);
@@ -640,13 +641,13 @@ static void __init build_sched_topology(void)
 	};
 #endif
 	/*
-	 * When there is NUMA topology inside the package skip the DIE domain
+	 * When there is NUMA topology inside the package skip the PKG domain
 	 * since the NUMA domains will auto-magically create the right spanning
 	 * domains based on the SLIT.
 	 */
 	if (!x86_has_numa_in_package) {
 		x86_topology[i++] = (struct sched_domain_topology_level){
-			cpu_cpu_mask, x86_die_flags, SD_INIT_NAME(DIE)
+			cpu_cpu_mask, x86_die_flags, SD_INIT_NAME(PKG)
 		};
 	}
 
@@ -1596,8 +1597,15 @@ void __noreturn hlt_play_dead(void)
 		native_halt();
 }
 
+/*
+ * native_play_dead() is essentially a __noreturn function, but it can't
+ * be marked as such as the compiler may complain about it.
+ */
 void native_play_dead(void)
 {
+	if (cpu_feature_enabled(X86_FEATURE_KERNEL_IBRS))
+		__update_spec_ctrl(0);
+
 	play_dead_common();
 	tboot_shutdown(TB_SHUTDOWN_WFS);
 

drivers/idle/intel_idle.c

Lines changed: 13 additions & 5 deletions

@@ -53,9 +53,8 @@
 #include <linux/moduleparam.h>
 #include <asm/cpu_device_id.h>
 #include <asm/intel-family.h>
-#include <asm/nospec-branch.h>
 #include <asm/mwait.h>
-#include <asm/msr.h>
+#include <asm/spec-ctrl.h>
 #include <asm/fpu/api.h>
 
 #define INTEL_IDLE_VERSION "0.5.1"
@@ -69,6 +68,7 @@ static int max_cstate = CPUIDLE_STATE_MAX - 1;
 static unsigned int disabled_states_mask __read_mostly;
 static unsigned int preferred_states_mask __read_mostly;
 static bool force_irq_on __read_mostly;
+static bool ibrs_off __read_mostly;
 
 static struct cpuidle_device __percpu *intel_idle_cpuidle_devices;
 
@@ -182,12 +182,12 @@ static __cpuidle int intel_idle_ibrs(struct cpuidle_device *dev,
 	int ret;
 
 	if (smt_active)
-		native_wrmsrl(MSR_IA32_SPEC_CTRL, 0);
+		__update_spec_ctrl(0);
 
 	ret = __intel_idle(dev, drv, index);
 
 	if (smt_active)
-		native_wrmsrl(MSR_IA32_SPEC_CTRL, spec_ctrl);
+		__update_spec_ctrl(spec_ctrl);
 
 	return ret;
 }
@@ -1853,11 +1853,13 @@ static void state_update_enter_method(struct cpuidle_state *state, int cstate)
 	}
 
 	if (cpu_feature_enabled(X86_FEATURE_KERNEL_IBRS) &&
-	    state->flags & CPUIDLE_FLAG_IBRS) {
+	    ((state->flags & CPUIDLE_FLAG_IBRS) || ibrs_off)) {
 		/*
 		 * IBRS mitigation requires that C-states are entered
 		 * with interrupts disabled.
 		 */
+		if (ibrs_off && (state->flags & CPUIDLE_FLAG_IRQ_ENABLE))
+			state->flags &= ~CPUIDLE_FLAG_IRQ_ENABLE;
 		WARN_ON_ONCE(state->flags & CPUIDLE_FLAG_IRQ_ENABLE);
 		state->enter = intel_idle_ibrs;
 		return;
@@ -2176,3 +2178,9 @@ MODULE_PARM_DESC(preferred_cstates, "Mask of preferred idle states");
  * 'CPUIDLE_FLAG_INIT_XSTATE' and 'CPUIDLE_FLAG_IBRS' flags.
  */
 module_param(force_irq_on, bool, 0444);
+/*
+ * Force the disabling of IBRS when X86_FEATURE_KERNEL_IBRS is on and
+ * CPUIDLE_FLAG_IRQ_ENABLE isn't set.
+ */
+module_param(ibrs_off, bool, 0444);
+MODULE_PARM_DESC(ibrs_off, "Disable IBRS when idle");

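The new gating in state_update_enter_method() can be summarized as: with ibrs_off set, every idle state (not just those flagged CPUIDLE_FLAG_IBRS) takes the intel_idle_ibrs enter path, and any IRQ-on entry flag is cleared first so the WARN_ON_ONCE stays quiet. A behavioral sketch of just that decision, modeled in Python for illustration (the flag bit positions here are hypothetical, not the kernel's real values):

```python
# Behavioral model (illustration only) of the ibrs_off gating added to
# intel_idle's state_update_enter_method().
CPUIDLE_FLAG_IBRS = 1 << 0        # hypothetical bit positions
CPUIDLE_FLAG_IRQ_ENABLE = 1 << 1

def pick_enter_method(flags: int, kernel_ibrs: bool, ibrs_off: bool):
    if kernel_ibrs and ((flags & CPUIDLE_FLAG_IBRS) or ibrs_off):
        # IBRS states must be entered with interrupts disabled, so
        # ibrs_off also strips the IRQ-on entry flag if present.
        if ibrs_off and (flags & CPUIDLE_FLAG_IRQ_ENABLE):
            flags &= ~CPUIDLE_FLAG_IRQ_ENABLE
        return "intel_idle_ibrs", flags
    return "default", flags

# A shallow state without CPUIDLE_FLAG_IBRS previously kept the default
# path; with ibrs_off=True it now goes through intel_idle_ibrs.
print(pick_enter_method(CPUIDLE_FLAG_IRQ_ENABLE, True, True))  # ('intel_idle_ibrs', 0)
```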