Commit 9a7e0a9

Merge tag 'sched-core-2021-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Thomas Gleixner:

 - Revert the printk format based wchan() symbol resolution as it can
   leak the raw value in case that the symbol is not resolvable.

 - Make wchan() more robust and work with all kind of unwinders by
   enforcing that the task stays blocked while unwinding is in progress.

 - Prevent sched_fork() from accessing an invalid sched_task_group

 - Improve asymmetric packing logic

 - Extend scheduler statistics to RT and DL scheduling classes and add
   statistics for bandwith burst to the SCHED_FAIR class.

 - Properly account SCHED_IDLE entities

 - Prevent a potential deadlock when initial priority is assigned to a
   newly created kthread. A recent change to plug a race between cpuset
   and __sched_setscheduler() introduced a new lock dependency which is
   now triggered. Break the lock dependency chain by moving the priority
   assignment to the thread function.

 - Fix the idle time reporting in /proc/uptime for NOHZ enabled systems.

 - Improve idle balancing in general and especially for NOHZ enabled
   systems.

 - Provide proper interfaces for live patching so it does not have to
   fiddle with scheduler internals.

 - Add cluster aware scheduling support.

 - A small set of tweaks for RT (irqwork, wait_task_inactive(), various
   scheduler options and delaying mmdrop)

 - The usual small tweaks and improvements all over the place

* tag 'sched-core-2021-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (69 commits)
  sched/fair: Cleanup newidle_balance
  sched/fair: Remove sysctl_sched_migration_cost condition
  sched/fair: Wait before decaying max_newidle_lb_cost
  sched/fair: Skip update_blocked_averages if we are defering load balance
  sched/fair: Account update_blocked_averages in newidle_balance cost
  x86: Fix __get_wchan() for !STACKTRACE
  sched,x86: Fix L2 cache mask
  sched/core: Remove rq_relock()
  sched: Improve wake_up_all_idle_cpus() take #2
  irq_work: Also rcuwait for !IRQ_WORK_HARD_IRQ on PREEMPT_RT
  irq_work: Handle some irq_work in a per-CPU thread on PREEMPT_RT
  irq_work: Allow irq_work_sync() to sleep if irq_work() no IRQ support.
  sched/rt: Annotate the RT balancing logic irqwork as IRQ_WORK_HARD_IRQ
  sched: Add cluster scheduler level for x86
  sched: Add cluster scheduler level in core and related Kconfig for ARM64
  topology: Represent clusters of CPUs within a die
  sched: Disable -Wunused-but-set-variable
  sched: Add wrapper for get_wchan() to keep task blocked
  x86: Fix get_wchan() to support the ORC unwinder
  proc: Use task_is_running() for wchan in /proc/$pid/stat
  ...
2 parents 57a315c + 8ea9183 commit 9a7e0a9

105 files changed: 1682 additions, 789 deletions


Documentation/ABI/stable/sysfs-devices-system-cpu

Lines changed: 15 additions & 0 deletions
@@ -42,6 +42,12 @@ Description: the CPU core ID of cpuX. Typically it is the hardware platform's
 		architecture and platform dependent.
 Values:		integer
 
+What:		/sys/devices/system/cpu/cpuX/topology/cluster_id
+Description:	the cluster ID of cpuX. Typically it is the hardware platform's
+		identifier (rather than the kernel's). The actual value is
+		architecture and platform dependent.
+Values:		integer
+
 What:		/sys/devices/system/cpu/cpuX/topology/book_id
 Description:	the book ID of cpuX. Typically it is the hardware platform's
 		identifier (rather than the kernel's). The actual value is
@@ -85,6 +91,15 @@ Description: human-readable list of CPUs within the same die.
 		The format is like 0-3, 8-11, 14,17.
 Values:		decimal list.
 
+What:		/sys/devices/system/cpu/cpuX/topology/cluster_cpus
+Description:	internal kernel map of CPUs within the same cluster.
+Values:		hexadecimal bitmask.
+
+What:		/sys/devices/system/cpu/cpuX/topology/cluster_cpus_list
+Description:	human-readable list of CPUs within the same cluster.
+		The format is like 0-3, 8-11, 14,17.
+Values:		decimal list.
+
 What:		/sys/devices/system/cpu/cpuX/topology/book_siblings
 Description:	internal kernel map of cpuX's hardware threads within the same
 		book_id. it's only used on s390.
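
The new cluster entries are read like any other topology attribute under sysfs. A minimal userspace sketch (cpu0 is chosen arbitrarily; the files only exist on kernels and platforms that expose cluster information):

#include <stdio.h>

/* Print the cluster id and cluster CPU list of cpu0, using the sysfs
 * paths documented in the ABI entries above. Error handling is minimal.
 */
static void print_attr(const char *path)
{
	char buf[256];
	FILE *f = fopen(path, "r");

	if (!f) {
		perror(path);
		return;
	}
	if (fgets(buf, sizeof(buf), f))
		printf("%s: %s", path, buf);
	fclose(f);
}

int main(void)
{
	print_attr("/sys/devices/system/cpu/cpu0/topology/cluster_id");
	print_attr("/sys/devices/system/cpu/cpu0/topology/cluster_cpus_list");
	return 0;
}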

Documentation/admin-guide/cgroup-v2.rst

Lines changed: 8 additions & 0 deletions
@@ -1016,6 +1016,8 @@ All time durations are in microseconds.
 	  - nr_periods
 	  - nr_throttled
 	  - throttled_usec
+	  - nr_bursts
+	  - burst_usec
 
   cpu.weight
 	A read-write single value file which exists on non-root
@@ -1047,6 +1049,12 @@ All time durations are in microseconds.
 	$PERIOD duration. "max" for $MAX indicates no limit. If only
 	one number is written, $MAX is updated.
 
+  cpu.max.burst
+	A read-write single value file which exists on non-root
+	cgroups. The default is "0".
+
+	The burst in the range [0, $MAX].
+
   cpu.pressure
 	A read-write nested-keyed file.
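
On cgroup v2 the burst cap sits next to cpu.max in the group's directory. A minimal sketch of configuring it from userspace, assuming a hypothetical group at /sys/fs/cgroup/mygroup with the cpu controller enabled (20ms quota and 10ms burst per 50ms period, all in microseconds as documented above):

#include <stdio.h>
#include <stdlib.h>

/* Write one value into a cgroup control file; returns non-zero on error. */
static int write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	fprintf(f, "%s\n", val);
	return fclose(f);
}

int main(void)
{
	/* "$MAX $PERIOD" for cpu.max, burst in microseconds for cpu.max.burst. */
	if (write_str("/sys/fs/cgroup/mygroup/cpu.max", "20000 50000") ||
	    write_str("/sys/fs/cgroup/mygroup/cpu.max.burst", "10000")) {
		perror("cgroup write");
		return EXIT_FAILURE;
	}
	return 0;
}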

Documentation/admin-guide/cputopology.rst

Lines changed: 8 additions & 4 deletions
@@ -19,11 +19,13 @@ these macros in include/asm-XXX/topology.h::
 
 	#define topology_physical_package_id(cpu)
 	#define topology_die_id(cpu)
+	#define topology_cluster_id(cpu)
 	#define topology_core_id(cpu)
 	#define topology_book_id(cpu)
 	#define topology_drawer_id(cpu)
 	#define topology_sibling_cpumask(cpu)
 	#define topology_core_cpumask(cpu)
+	#define topology_cluster_cpumask(cpu)
 	#define topology_die_cpumask(cpu)
 	#define topology_book_cpumask(cpu)
 	#define topology_drawer_cpumask(cpu)
@@ -39,10 +41,12 @@ not defined by include/asm-XXX/topology.h:
 
 1) topology_physical_package_id: -1
 2) topology_die_id: -1
-3) topology_core_id: 0
-4) topology_sibling_cpumask: just the given CPU
-5) topology_core_cpumask: just the given CPU
-6) topology_die_cpumask: just the given CPU
+3) topology_cluster_id: -1
+4) topology_core_id: 0
+5) topology_sibling_cpumask: just the given CPU
+6) topology_core_cpumask: just the given CPU
+7) topology_cluster_cpumask: just the given CPU
+8) topology_die_cpumask: just the given CPU
 
 For architectures that don't support books (CONFIG_SCHED_BOOK) there are no
 default definitions for topology_book_id() and topology_book_cpumask().
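
The fallback values listed above are provided centrally when an architecture does not define the macros itself. A rough sketch of what such defaults can look like (the real definitions live in include/linux/topology.h and may differ in exact form):

/* Sketch of generic fallbacks mirroring the documented defaults; the
 * upstream definitions may differ in detail.
 */
#ifndef topology_cluster_id
#define topology_cluster_id(cpu)	((void)(cpu), -1)
#endif
#ifndef topology_cluster_cpumask
#define topology_cluster_cpumask(cpu)	cpumask_of(cpu)
#endif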

Documentation/scheduler/sched-bwc.rst

Lines changed: 75 additions & 9 deletions
@@ -22,39 +22,90 @@ cfs_quota units at each period boundary. As threads consume this bandwidth it
 is transferred to cpu-local "silos" on a demand basis. The amount transferred
 within each of these updates is tunable and described as the "slice".
 
+Burst feature
+-------------
+This feature borrows time now against our future underrun, at the cost of
+increased interference against the other system users. All nicely bounded.
+
+Traditional (UP-EDF) bandwidth control is something like:
+
+  (U = \Sum u_i) <= 1
+
+This guaranteeds both that every deadline is met and that the system is
+stable. After all, if U were > 1, then for every second of walltime,
+we'd have to run more than a second of program time, and obviously miss
+our deadline, but the next deadline will be further out still, there is
+never time to catch up, unbounded fail.
+
+The burst feature observes that a workload doesn't always executes the full
+quota; this enables one to describe u_i as a statistical distribution.
+
+For example, have u_i = {x,e}_i, where x is the p(95) and x+e p(100)
+(the traditional WCET). This effectively allows u to be smaller,
+increasing the efficiency (we can pack more tasks in the system), but at
+the cost of missing deadlines when all the odds line up. However, it
+does maintain stability, since every overrun must be paired with an
+underrun as long as our x is above the average.
+
+That is, suppose we have 2 tasks, both specify a p(95) value, then we
+have a p(95)*p(95) = 90.25% chance both tasks are within their quota and
+everything is good. At the same time we have a p(5)p(5) = 0.25% chance
+both tasks will exceed their quota at the same time (guaranteed deadline
+fail). Somewhere in between there's a threshold where one exceeds and
+the other doesn't underrun enough to compensate; this depends on the
+specific CDFs.
+
+At the same time, we can say that the worst case deadline miss, will be
+\Sum e_i; that is, there is a bounded tardiness (under the assumption
+that x+e is indeed WCET).
+
+The interferenece when using burst is valued by the possibilities for
+missing the deadline and the average WCET. Test results showed that when
+there many cgroups or CPU is under utilized, the interference is
+limited. More details are shown in:
+https://lore.kernel.org/lkml/[email protected]/
+
 Management
 ----------
-Quota and period are managed within the cpu subsystem via cgroupfs.
+Quota, period and burst are managed within the cpu subsystem via cgroupfs.
 
 .. note::
    The cgroupfs files described in this section are only applicable
    to cgroup v1. For cgroup v2, see
    :ref:`Documentation/admin-guide/cgroup-v2.rst <cgroup-v2-cpu>`.
 
 - cpu.cfs_quota_us: the total available run-time within a period (in
-  microseconds)
+- cpu.cfs_quota_us: run-time replenished within a period (in microseconds)
 - cpu.cfs_period_us: the length of a period (in microseconds)
 - cpu.stat: exports throttling statistics [explained further below]
+- cpu.cfs_burst_us: the maximum accumulated run-time (in microseconds)
 
 The default values are::
 
 	cpu.cfs_period_us=100ms
-	cpu.cfs_quota=-1
+	cpu.cfs_quota_us=-1
+	cpu.cfs_burst_us=0
 
 A value of -1 for cpu.cfs_quota_us indicates that the group does not have any
 bandwidth restriction in place, such a group is described as an unconstrained
 bandwidth group. This represents the traditional work-conserving behavior for
 CFS.
 
-Writing any (valid) positive value(s) will enact the specified bandwidth limit.
-The minimum quota allowed for the quota or period is 1ms. There is also an
-upper bound on the period length of 1s. Additional restrictions exist when
-bandwidth limits are used in a hierarchical fashion, these are explained in
-more detail below.
+Writing any (valid) positive value(s) no smaller than cpu.cfs_burst_us will
+enact the specified bandwidth limit. The minimum quota allowed for the quota or
+period is 1ms. There is also an upper bound on the period length of 1s.
+Additional restrictions exist when bandwidth limits are used in a hierarchical
+fashion, these are explained in more detail below.
 
 Writing any negative value to cpu.cfs_quota_us will remove the bandwidth limit
 and return the group to an unconstrained state once more.
 
+A value of 0 for cpu.cfs_burst_us indicates that the group can not accumulate
+any unused bandwidth. It makes the traditional bandwidth control behavior for
+CFS unchanged. Writing any (valid) positive value(s) no larger than
+cpu.cfs_quota_us into cpu.cfs_burst_us will enact the cap on unused bandwidth
+accumulation.
+
 Any updates to a group's bandwidth specification will result in it becoming
 unthrottled if it is in a constrained state.
@@ -74,14 +125,17 @@ for more fine-grained consumption.
 
 Statistics
 ----------
-A group's bandwidth statistics are exported via 3 fields in cpu.stat.
+A group's bandwidth statistics are exported via 5 fields in cpu.stat.
 
 cpu.stat:
 
 - nr_periods: Number of enforcement intervals that have elapsed.
 - nr_throttled: Number of times the group has been throttled/limited.
 - throttled_time: The total time duration (in nanoseconds) for which entities
   of the group have been throttled.
+- nr_bursts: Number of periods burst occurs.
+- burst_time: Cumulative wall-time (in nanoseconds) that any CPUs has used
+  above quota in respective periods
 
 This interface is read-only.
@@ -179,3 +233,15 @@ Examples
 
    By using a small period here we are ensuring a consistent latency
    response at the expense of burst capacity.
+
+4. Limit a group to 40% of 1 CPU, and allow accumulate up to 20% of 1 CPU
+   additionally, in case accumulation has been done.
+
+   With 50ms period, 20ms quota will be equivalent to 40% of 1 CPU.
+   And 10ms burst will be equivalent to 20% of 1 CPU.
+
+	# echo 20000 > cpu.cfs_quota_us /* quota = 20ms */
+	# echo 50000 > cpu.cfs_period_us /* period = 50ms */
+	# echo 10000 > cpu.cfs_burst_us /* burst = 10ms */
+
+   Larger buffer setting (no larger than quota) allows greater burst capacity.
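
As a worked instance of the probability argument in the burst text above (numbers taken directly from its p(95) example, written out in LaTeX):

\[
  P(\text{both tasks within quota}) = 0.95 \times 0.95 = 0.9025,
  \qquad
  P(\text{both overrun at once}) = 0.05 \times 0.05 = 0.0025
\]
\[
  \text{worst-case tardiness} \;\le\; \sum_i e_i
  \quad\text{(assuming each } x_i + e_i \text{ really is the WCET)}
\]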

arch/alpha/include/asm/processor.h

Lines changed: 1 addition & 1 deletion
@@ -42,7 +42,7 @@ extern void start_thread(struct pt_regs *, unsigned long, unsigned long);
 struct task_struct;
 extern void release_thread(struct task_struct *);
 
-unsigned long get_wchan(struct task_struct *p);
+unsigned long __get_wchan(struct task_struct *p);
 
 #define KSTK_EIP(tsk) (task_pt_regs(tsk)->pc)

arch/alpha/kernel/process.c

Lines changed: 2 additions & 3 deletions
@@ -376,12 +376,11 @@ thread_saved_pc(struct task_struct *t)
 }
 
 unsigned long
-get_wchan(struct task_struct *p)
+__get_wchan(struct task_struct *p)
 {
 	unsigned long schedule_frame;
 	unsigned long pc;
-	if (!p || p == current || task_is_running(p))
-		return 0;
+
 	/*
 	 * This one depends on the frame size of schedule(). Do a
 	 * "disass schedule" in gdb to find the frame size. Also, the

arch/arc/include/asm/processor.h

Lines changed: 1 addition & 1 deletion
@@ -70,7 +70,7 @@ struct task_struct;
 extern void start_thread(struct pt_regs * regs, unsigned long pc,
 			 unsigned long usp);
 
-extern unsigned int get_wchan(struct task_struct *p);
+extern unsigned int __get_wchan(struct task_struct *p);
 
 #endif /* !__ASSEMBLY__ */

arch/arc/kernel/stacktrace.c

Lines changed: 2 additions & 2 deletions
@@ -15,7 +15,7 @@
  *  = specifics of data structs where trace is saved(CONFIG_STACKTRACE etc)
  *
  * vineetg: March 2009
- *  -Implemented correct versions of thread_saved_pc() and get_wchan()
+ *  -Implemented correct versions of thread_saved_pc() and __get_wchan()
  *
  * rajeshwarr: 2008
  *  -Initial implementation
@@ -248,7 +248,7 @@ void show_stack(struct task_struct *tsk, unsigned long *sp, const char *loglvl)
  * Of course just returning schedule( ) would be pointless so unwind until
  * the function is not in schedular code
  */
-unsigned int get_wchan(struct task_struct *tsk)
+unsigned int __get_wchan(struct task_struct *tsk)
 {
 	return arc_unwind_core(tsk, NULL, __get_first_nonsched, NULL);
 }

arch/arm/include/asm/processor.h

Lines changed: 1 addition & 1 deletion
@@ -84,7 +84,7 @@ struct task_struct;
 /* Free all resources held by a thread. */
 extern void release_thread(struct task_struct *);
 
-unsigned long get_wchan(struct task_struct *p);
+unsigned long __get_wchan(struct task_struct *p);
 
 #define task_pt_regs(p) \
 	((struct pt_regs *)(THREAD_START_SP + task_stack_page(p)) - 1)

arch/arm/kernel/process.c

Lines changed: 1 addition & 3 deletions
@@ -276,13 +276,11 @@ int copy_thread(unsigned long clone_flags, unsigned long stack_start,
 	return 0;
 }
 
-unsigned long get_wchan(struct task_struct *p)
+unsigned long __get_wchan(struct task_struct *p)
 {
 	struct stackframe frame;
 	unsigned long stack_page;
 	int count = 0;
-	if (!p || p == current || task_is_running(p))
-		return 0;
 
 	frame.fp = thread_saved_fp(p);
 	frame.sp = thread_saved_sp(p);
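
These per-architecture hunks rename get_wchan() to __get_wchan() and drop the local task_is_running() checks because the series ("sched: Add wrapper for get_wchan() to keep task blocked") moves that check into a generic wrapper which keeps the task blocked while the unwinder walks its stack. A rough sketch of such a wrapper, paraphrased from the commit subject rather than copied from the tree:

/* Generic wrapper (kernel/sched/core.c in the merged series): take the
 * task's pi_lock so it cannot be woken and start running while
 * __get_wchan() unwinds its stack. Details here are a paraphrase and may
 * not match the upstream code line for line.
 */
unsigned long get_wchan(struct task_struct *p)
{
	unsigned long ip = 0;
	unsigned int state;

	if (!p || p == current)
		return 0;

	raw_spin_lock_irq(&p->pi_lock);
	state = READ_ONCE(p->__state);
	smp_rmb(); /* see try_to_wake_up() */
	if (state != TASK_RUNNING && state != TASK_WAKING && !p->on_rq)
		ip = __get_wchan(p);
	raw_spin_unlock_irq(&p->pi_lock);

	return ip;
}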
