
Commit 3fe2f74

Merge tag 'sched-core-2022-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:

 - Cleanups for SCHED_DEADLINE

 - Tracing updates/fixes

 - CPU Accounting fixes

 - First wave of changes to optimize the overhead of the scheduler
   build, from the fast-headers tree - including placeholder *_api.h
   headers for later header split-ups.

 - Preempt-dynamic using static_branch() for ARM64

 - Isolation housekeeping mask rework; preparatory for further changes

 - NUMA-balancing: deal with CPU-less nodes

 - NUMA-balancing: tune systems that have multiple LLC cache domains
   per node (eg. AMD)

 - Updates to RSEQ UAPI in preparation for glibc usage

 - Lots of RSEQ/selftests, for same

 - Add Suren as PSI co-maintainer

* tag 'sched-core-2022-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (81 commits)
  sched/headers: ARM needs asm/paravirt_api_clock.h too
  sched/numa: Fix boot crash on arm64 systems
  headers/prep: Fix header to build standalone: <linux/psi.h>
  sched/headers: Only include <linux/entry-common.h> when CONFIG_GENERIC_ENTRY=y
  cgroup: Fix suspicious rcu_dereference_check() usage warning
  sched/preempt: Tell about PREEMPT_DYNAMIC on kernel headers
  sched/topology: Remove redundant variable and fix incorrect type in build_sched_domains
  sched/deadline,rt: Remove unused parameter from pick_next_[rt|dl]_entity()
  sched/deadline,rt: Remove unused functions for !CONFIG_SMP
  sched/deadline: Use __node_2_[pdl|dle]() and rb_first_cached() consistently
  sched/deadline: Merge dl_task_can_attach() and dl_cpu_busy()
  sched/deadline: Move bandwidth mgmt and reclaim functions into sched class source file
  sched/deadline: Remove unused def_dl_bandwidth
  sched/tracing: Report TASK_RTLOCK_WAIT tasks as TASK_UNINTERRUPTIBLE
  sched/tracing: Don't re-read p->state when emitting sched_switch event
  sched/rt: Plug rt_mutex_setprio() vs push_rt_task() race
  sched/cpuacct: Remove redundant RCU read lock
  sched/cpuacct: Optimize away RCU read lock
  sched/cpuacct: Fix charge percpu cpuusage
  sched/headers: Reorganize, clean up and optimize kernel/sched/sched.h dependencies
  ...
2 parents ebd326c + ffea9fb commit 3fe2f74

135 files changed, +2345 -1307 lines changed


Documentation/admin-guide/sysctl/kernel.rst

Lines changed: 1 addition & 45 deletions
@@ -609,51 +609,7 @@ be migrated to a local memory node.
 The unmapping of pages and trapping faults incur additional overhead that
 ideally is offset by improved memory locality but there is no universal
 guarantee. If the target workload is already bound to NUMA nodes then this
-feature should be disabled. Otherwise, if the system overhead from the
-feature is too high then the rate the kernel samples for NUMA hinting
-faults may be controlled by the `numa_balancing_scan_period_min_ms,
-numa_balancing_scan_delay_ms, numa_balancing_scan_period_max_ms,
-numa_balancing_scan_size_mb`_, and numa_balancing_settle_count sysctls.
-
-
-numa_balancing_scan_period_min_ms, numa_balancing_scan_delay_ms, numa_balancing_scan_period_max_ms, numa_balancing_scan_size_mb
-===============================================================================================================================
-
-
-Automatic NUMA balancing scans tasks address space and unmaps pages to
-detect if pages are properly placed or if the data should be migrated to a
-memory node local to where the task is running. Every "scan delay" the task
-scans the next "scan size" number of pages in its address space. When the
-end of the address space is reached the scanner restarts from the beginning.
-
-In combination, the "scan delay" and "scan size" determine the scan rate.
-When "scan delay" decreases, the scan rate increases. The scan delay and
-hence the scan rate of every task is adaptive and depends on historical
-behaviour. If pages are properly placed then the scan delay increases,
-otherwise the scan delay decreases. The "scan size" is not adaptive but
-the higher the "scan size", the higher the scan rate.
-
-Higher scan rates incur higher system overhead as page faults must be
-trapped and potentially data must be migrated. However, the higher the scan
-rate, the more quickly a tasks memory is migrated to a local node if the
-workload pattern changes and minimises performance impact due to remote
-memory accesses. These sysctls control the thresholds for scan delays and
-the number of pages scanned.
-
-``numa_balancing_scan_period_min_ms`` is the minimum time in milliseconds to
-scan a tasks virtual memory. It effectively controls the maximum scanning
-rate for each task.
-
-``numa_balancing_scan_delay_ms`` is the starting "scan delay" used for a task
-when it initially forks.
-
-``numa_balancing_scan_period_max_ms`` is the maximum time in milliseconds to
-scan a tasks virtual memory. It effectively controls the minimum scanning
-rate for each task.
-
-``numa_balancing_scan_size_mb`` is how many megabytes worth of pages are
-scanned for a given scan.
+feature should be disabled.
 
 oops_all_cpu_backtrace
 ======================

Documentation/scheduler/index.rst

Lines changed: 1 addition & 0 deletions
@@ -18,6 +18,7 @@ Linux Scheduler
     sched-nice-design
     sched-rt-group
     sched-stats
+    sched-debug
 
     text_files

Documentation/scheduler/sched-debug.rst

Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
+=================
+Scheduler debugfs
+=================
+
+Booting a kernel with CONFIG_SCHED_DEBUG=y will give access to
+scheduler specific debug files under /sys/kernel/debug/sched. Some of
+those files are described below.
+
+numa_balancing
+==============
+
+`numa_balancing` directory is used to hold files to control NUMA
+balancing feature. If the system overhead from the feature is too
+high then the rate the kernel samples for NUMA hinting faults may be
+controlled by the `scan_period_min_ms, scan_delay_ms,
+scan_period_max_ms, scan_size_mb` files.
+
+
+scan_period_min_ms, scan_delay_ms, scan_period_max_ms, scan_size_mb
+-------------------------------------------------------------------
+
+Automatic NUMA balancing scans tasks address space and unmaps pages to
+detect if pages are properly placed or if the data should be migrated to a
+memory node local to where the task is running. Every "scan delay" the task
+scans the next "scan size" number of pages in its address space. When the
+end of the address space is reached the scanner restarts from the beginning.
+
+In combination, the "scan delay" and "scan size" determine the scan rate.
+When "scan delay" decreases, the scan rate increases. The scan delay and
+hence the scan rate of every task is adaptive and depends on historical
+behaviour. If pages are properly placed then the scan delay increases,
+otherwise the scan delay decreases. The "scan size" is not adaptive but
+the higher the "scan size", the higher the scan rate.
+
+Higher scan rates incur higher system overhead as page faults must be
+trapped and potentially data must be migrated. However, the higher the scan
+rate, the more quickly a tasks memory is migrated to a local node if the
+workload pattern changes and minimises performance impact due to remote
+memory accesses. These files control the thresholds for scan delays and
+the number of pages scanned.
+
+``scan_period_min_ms`` is the minimum time in milliseconds to scan a
+tasks virtual memory. It effectively controls the maximum scanning
+rate for each task.
+
+``scan_delay_ms`` is the starting "scan delay" used for a task when it
+initially forks.
+
+``scan_period_max_ms`` is the maximum time in milliseconds to scan a
+tasks virtual memory. It effectively controls the minimum scanning
+rate for each task.
+
+``scan_size_mb`` is how many megabytes worth of pages are scanned for
+a given scan.
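Since "scan delay" and "scan size" jointly bound the scan rate, the knobs above can be sanity-checked numerically. A minimal userspace sketch, assuming debugfs is mounted at /sys/kernel/debug and that the files hold plain decimal values; read_knob() is a local helper for this illustration, not a kernel interface:

/* Hypothetical illustration: derive the per-task NUMA scan-rate bounds
 * implied by the debugfs knobs documented above.
 */
#include <stdio.h>

static unsigned long read_knob(const char *name)
{
        char path[256];
        unsigned long val = 0;
        FILE *f;

        snprintf(path, sizeof(path),
                 "/sys/kernel/debug/sched/numa_balancing/%s", name);
        f = fopen(path, "r");
        if (f) {
                fscanf(f, "%lu", &val);
                fclose(f);
        }
        return val;
}

int main(void)
{
        unsigned long period_min = read_knob("scan_period_min_ms");
        unsigned long period_max = read_knob("scan_period_max_ms");
        unsigned long size_mb    = read_knob("scan_size_mb");

        if (!period_min || !period_max)
                return 1;

        /* "scan size" MB every "scan delay" ms: rate = size / period. */
        printf("max scan rate: %lu MB/s\n", size_mb * 1000 / period_min);
        printf("min scan rate: %lu MB/s\n", size_mb * 1000 / period_max);
        return 0;
}

For example, scan_size_mb=256 with scan_period_min_ms=1000 caps scanning at 256 MB of a task's address space per second; raising scan_period_min_ms lowers that ceiling.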

MAINTAINERS

Lines changed: 1 addition & 0 deletions
@@ -15566,6 +15566,7 @@ F: drivers/net/ppp/pptp.c
 
 PRESSURE STALL INFORMATION (PSI)
 M: Johannes Weiner <[email protected]>
+M: Suren Baghdasaryan <[email protected]>
 S: Maintained
 F: include/linux/psi*
 F: kernel/sched/psi.c

arch/Kconfig

Lines changed: 33 additions & 4 deletions
@@ -1293,12 +1293,41 @@ config HAVE_STATIC_CALL_INLINE
 
 config HAVE_PREEMPT_DYNAMIC
         bool
+
+config HAVE_PREEMPT_DYNAMIC_CALL
+        bool
         depends on HAVE_STATIC_CALL
-        depends on GENERIC_ENTRY
+        select HAVE_PREEMPT_DYNAMIC
+        help
+          An architecture should select this if it can handle the preemption
+          model being selected at boot time using static calls.
+
+          Where an architecture selects HAVE_STATIC_CALL_INLINE, any call to a
+          preemption function will be patched directly.
+
+          Where an architecture does not select HAVE_STATIC_CALL_INLINE, any
+          call to a preemption function will go through a trampoline, and the
+          trampoline will be patched.
+
+          It is strongly advised to support inline static call to avoid any
+          overhead.
+
+config HAVE_PREEMPT_DYNAMIC_KEY
+        bool
+        depends on HAVE_ARCH_JUMP_LABEL && CC_HAS_ASM_GOTO
+        select HAVE_PREEMPT_DYNAMIC
         help
-          Select this if the architecture support boot time preempt setting
-          on top of static calls. It is strongly advised to support inline
-          static call to avoid any overhead.
+          An architecture should select this if it can handle the preemption
+          model being selected at boot time using static keys.
+
+          Each preemption function will be given an early return based on a
+          static key. This should have slightly lower overhead than non-inline
+          static calls, as this effectively inlines each trampoline into the
+          start of its callee. This may avoid redundant work, and may
+          integrate better with CFI schemes.
+
+          This will have greater overhead than using inline static calls as
+          the call to the preemption function cannot be entirely elided.
 
 config ARCH_WANT_LD_ORPHAN_WARN
         bool
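To make the HAVE_PREEMPT_DYNAMIC_KEY help text concrete: each preemption function is wrapped by a variant that takes an early return off a static key. A minimal sketch of that pattern, using an illustrative key name rather than the exact kernel/sched/core.c code:

/* Sketch of the static-key flavour of PREEMPT_DYNAMIC. The key name is
 * illustrative; the real implementation lives in kernel/sched/core.c.
 */
#include <linux/jump_label.h>
#include <linux/preempt.h>

DEFINE_STATIC_KEY_TRUE(sk_dynamic_preempt_schedule);

void dynamic_preempt_schedule(void)
{
        /* Key disabled (e.g. booted with preempt=none): the check is a
         * patched-out branch, so the early return costs roughly a NOP.
         */
        if (!static_branch_unlikely(&sk_dynamic_preempt_schedule))
                return;
        preempt_schedule();
}

Selecting the model at boot then amounts to enabling or disabling such keys, whereas the HAVE_PREEMPT_DYNAMIC_CALL flavour instead repoints a static call at either the real function or a no-op.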
arch/arm/include/asm/paravirt_api_clock.h

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+#include <asm/paravirt.h>

arch/arm64/Kconfig

Lines changed: 1 addition & 0 deletions
@@ -194,6 +194,7 @@ config ARM64
         select HAVE_PERF_EVENTS
         select HAVE_PERF_REGS
         select HAVE_PERF_USER_STACK_DUMP
+        select HAVE_PREEMPT_DYNAMIC_KEY
         select HAVE_REGS_AND_STACK_ACCESS_API
         select HAVE_POSIX_CPU_TIMERS_TASK_WORK
         select HAVE_FUNCTION_ARG_ACCESS_API
arch/arm64/include/asm/paravirt_api_clock.h

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+#include <asm/paravirt.h>

arch/arm64/include/asm/preempt.h

Lines changed: 17 additions & 2 deletions
@@ -2,6 +2,7 @@
 #ifndef __ASM_PREEMPT_H
 #define __ASM_PREEMPT_H
 
+#include <linux/jump_label.h>
 #include <linux/thread_info.h>
 
 #define PREEMPT_NEED_RESCHED BIT(32)
@@ -80,10 +81,24 @@ static inline bool should_resched(int preempt_offset)
 }
 
 #ifdef CONFIG_PREEMPTION
+
 void preempt_schedule(void);
-#define __preempt_schedule() preempt_schedule()
 void preempt_schedule_notrace(void);
-#define __preempt_schedule_notrace() preempt_schedule_notrace()
+
+#ifdef CONFIG_PREEMPT_DYNAMIC
+
+DECLARE_STATIC_KEY_TRUE(sk_dynamic_irqentry_exit_cond_resched);
+void dynamic_preempt_schedule(void);
+#define __preempt_schedule() dynamic_preempt_schedule()
+void dynamic_preempt_schedule_notrace(void);
+#define __preempt_schedule_notrace() dynamic_preempt_schedule_notrace()
+
+#else /* CONFIG_PREEMPT_DYNAMIC */
+
+#define __preempt_schedule() preempt_schedule()
+#define __preempt_schedule_notrace() preempt_schedule_notrace()
+
+#endif /* CONFIG_PREEMPT_DYNAMIC */
 #endif /* CONFIG_PREEMPTION */
 
 #endif /* __ASM_PREEMPT_H */
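The reason __preempt_schedule() is the indirection point: the generic preempt_enable() invokes it when the preempt count drops to zero, so redefining the macro switches every preempt_enable() site at once. A simplified sketch modelled on include/linux/preempt.h (CONFIG_PREEMPTION case, not a verbatim copy):

/* The last preempt_enable() reschedules through whatever
 * __preempt_schedule() currently expands to.
 */
#define preempt_enable()                                        \
do {                                                            \
        barrier();                                              \
        if (unlikely(preempt_count_dec_and_test()))             \
                __preempt_schedule();                           \
} while (0)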

arch/arm64/kernel/entry-common.c

Lines changed: 19 additions & 9 deletions
@@ -223,9 +223,26 @@ static void noinstr arm64_exit_el1_dbg(struct pt_regs *regs)
         lockdep_hardirqs_on(CALLER_ADDR0);
 }
 
+#ifdef CONFIG_PREEMPT_DYNAMIC
+DEFINE_STATIC_KEY_TRUE(sk_dynamic_irqentry_exit_cond_resched);
+#define need_irq_preemption() \
+        (static_branch_unlikely(&sk_dynamic_irqentry_exit_cond_resched))
+#else
+#define need_irq_preemption() (IS_ENABLED(CONFIG_PREEMPTION))
+#endif
+
 static void __sched arm64_preempt_schedule_irq(void)
 {
-        lockdep_assert_irqs_disabled();
+        if (!need_irq_preemption())
+                return;
+
+        /*
+         * Note: thread_info::preempt_count includes both thread_info::count
+         * and thread_info::need_resched, and is not equivalent to
+         * preempt_count().
+         */
+        if (READ_ONCE(current_thread_info()->preempt_count) != 0)
+                return;
 
         /*
          * DAIF.DA are cleared at the start of IRQ/FIQ handling, and when GIC
@@ -441,14 +458,7 @@ static __always_inline void __el1_irq(struct pt_regs *regs,
         do_interrupt_handler(regs, handler);
         irq_exit_rcu();
 
-        /*
-         * Note: thread_info::preempt_count includes both thread_info::count
-         * and thread_info::need_resched, and is not equivalent to
-         * preempt_count().
-         */
-        if (IS_ENABLED(CONFIG_PREEMPTION) &&
-            READ_ONCE(current_thread_info()->preempt_count) == 0)
-                arm64_preempt_schedule_irq();
+        arm64_preempt_schedule_irq();
 
         exit_to_kernel_mode(regs);
 }
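With the check folded into arm64_preempt_schedule_irq(), the IRQ-exit path calls it unconditionally and the static key decides whether it does anything. A hedged sketch of how a preempt= boot choice would toggle that key (the real selection logic is sched_dynamic_update() in kernel/sched/core.c; the function below is illustrative only):

#include <linux/jump_label.h>

DECLARE_STATIC_KEY_TRUE(sk_dynamic_irqentry_exit_cond_resched);

/* Illustrative only: roughly what selecting "preempt=none" does to this
 * key. The branch in arm64_preempt_schedule_irq() is patched so the
 * function falls through to its early return.
 */
static void example_disable_irq_preemption(void)
{
        static_branch_disable(&sk_dynamic_irqentry_exit_cond_resched);
}

The key defaults to true, so a kernel booted with full preemption leaves the branch live and IRQ exit can reschedule as before.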
