
Commit 8f7c8b8

Merge tag 'sched_ext-for-6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext updates from Tejun Heo:

 - Improve the default select_cpu() implementation making it topology
   aware and handle WAKE_SYNC better.

 - set_arg_maybe_null() was used to inform the verifier which ops args
   could be NULL in a rather hackish way. Use the new __nullable CFI
   stub tags instead.

 - On Sapphire Rapids multi-socket systems, a BPF scheduler, by
   hammering on the same queue across sockets, could live-lock the
   system to the point where the system couldn't make reasonable forward
   progress.

   This could lead to soft-lockup triggered resets or stalling out
   bypass mode switch and thus BPF scheduler ejection for tens of
   minutes if not hours. After trying a number of mitigations, the
   following set worked reliably:

     - Injecting artificial cpu_relax() loops in two places while
       sched_ext is trying to turn on the bypass mode.

     - Triggering scheduler ejection when soft-lockup detection is
       imminent (a quarter of threshold left).

   While not the prettiest, the impact both in terms of code complexity
   and overhead is minimal.

 - A common complaint on the API is the overuse of the word "dispatch"
   and the confusion around "consume". This is due to how the dispatch
   queues became more generic over time. Rename the affected kfuncs for
   clarity. Thanks to BPF's compatibility features, this change can be
   made in a way that's both forward and backward compatible. The
   compatibility code will be dropped in a few releases.
 - Other misc changes

* tag 'sched_ext-for-6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (21 commits)
  sched_ext: Replace scx_next_task_picked() with switch_class() in comment
  sched_ext: Rename scx_bpf_dispatch[_vtime]_from_dsq*() -> scx_bpf_dsq_move[_vtime]*()
  sched_ext: Rename scx_bpf_consume() to scx_bpf_dsq_move_to_local()
  sched_ext: Rename scx_bpf_dispatch[_vtime]() to scx_bpf_dsq_insert[_vtime]()
  sched_ext: scx_bpf_dispatch_from_dsq_set_*() are allowed from unlocked context
  sched_ext: add a missing rcu_read_lock/unlock pair at scx_select_cpu_dfl()
  sched_ext: Clarify sched_ext_ops table for userland scheduler
  sched_ext: Enable the ops breather and eject BPF scheduler on softlockup
  sched_ext: Avoid live-locking bypass mode switching
  sched_ext: Fix incorrect use of bitwise AND
  sched_ext: Do not enable LLC/NUMA optimizations when domains overlap
  sched_ext: Introduce NUMA awareness to the default idle selection policy
  sched_ext: Replace set_arg_maybe_null() with __nullable CFI stub tags
  sched_ext: Rename CFI stubs to names that are recognized by BPF
  sched_ext: Introduce LLC awareness to the default idle selection policy
  sched_ext: Clarify ops.select_cpu() for single-CPU tasks
  sched_ext: improve WAKE_SYNC behavior for default idle CPU selection
  sched_ext: Use btf_ids to resolve task_struct
  sched/ext: Use tg_cgroup() to elieminate duplicate code
  sched/ext: Fix unmatch trailing comment of CONFIG_EXT_GROUP_SCHED
  ...
2 parents 7586d52 + 6b8950e commit 8f7c8b8

File tree

11 files changed: +878 additions, -394 deletions

Documentation/scheduler/sched-ext.rst

Lines changed: 35 additions & 36 deletions
@@ -130,7 +130,7 @@ optional. The following modified excerpt is from
  * Decide which CPU a task should be migrated to before being
  * enqueued (either at wakeup, fork time, or exec time). If an
  * idle core is found by the default ops.select_cpu() implementation,
- * then dispatch the task directly to SCX_DSQ_LOCAL and skip the
+ * then insert the task directly into SCX_DSQ_LOCAL and skip the
  * ops.enqueue() callback.
  *
  * Note that this implementation has exactly the same behavior as the
@@ -148,15 +148,15 @@ optional. The following modified excerpt is from
 	cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &direct);

 	if (direct)
-		scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
+		scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);

 	return cpu;
 }

 /*
- * Do a direct dispatch of a task to the global DSQ. This ops.enqueue()
- * callback will only be invoked if we failed to find a core to dispatch
- * to in ops.select_cpu() above.
+ * Do a direct insertion of a task to the global DSQ. This ops.enqueue()
+ * callback will only be invoked if we failed to find a core to insert
+ * into in ops.select_cpu() above.
  *
  * Note that this implementation has exactly the same behavior as the
  * default ops.enqueue implementation, which just dispatches the task
@@ -166,7 +166,7 @@ optional. The following modified excerpt is from
  */
 void BPF_STRUCT_OPS(simple_enqueue, struct task_struct *p, u64 enq_flags)
 {
-	scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
+	scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
 }

 s32 BPF_STRUCT_OPS_SLEEPABLE(simple_init)
@@ -202,14 +202,13 @@ and one local dsq per CPU (``SCX_DSQ_LOCAL``). The BPF scheduler can manage
 an arbitrary number of dsq's using ``scx_bpf_create_dsq()`` and
 ``scx_bpf_destroy_dsq()``.

-A CPU always executes a task from its local DSQ. A task is "dispatched" to a
-DSQ. A non-local DSQ is "consumed" to transfer a task to the consuming CPU's
-local DSQ.
+A CPU always executes a task from its local DSQ. A task is "inserted" into a
+DSQ. A task in a non-local DSQ is "move"d into the target CPU's local DSQ.

 When a CPU is looking for the next task to run, if the local DSQ is not
-empty, the first task is picked. Otherwise, the CPU tries to consume the
-global DSQ. If that doesn't yield a runnable task either, ``ops.dispatch()``
-is invoked.
+empty, the first task is picked. Otherwise, the CPU tries to move a task
+from the global DSQ. If that doesn't yield a runnable task either,
+``ops.dispatch()`` is invoked.

 Scheduling Cycle
 ----------------
@@ -229,26 +228,26 @@ The following briefly shows how a waking task is scheduled and executed.
    scheduler can wake up any cpu using the ``scx_bpf_kick_cpu()`` helper,
    using ``ops.select_cpu()`` judiciously can be simpler and more efficient.

-   A task can be immediately dispatched to a DSQ from ``ops.select_cpu()`` by
-   calling ``scx_bpf_dispatch()``. If the task is dispatched to
-   ``SCX_DSQ_LOCAL`` from ``ops.select_cpu()``, it will be dispatched to the
+   A task can be immediately inserted into a DSQ from ``ops.select_cpu()``
+   by calling ``scx_bpf_dsq_insert()``. If the task is inserted into
+   ``SCX_DSQ_LOCAL`` from ``ops.select_cpu()``, it will be inserted into the
    local DSQ of whichever CPU is returned from ``ops.select_cpu()``.
-   Additionally, dispatching directly from ``ops.select_cpu()`` will cause the
+   Additionally, inserting directly from ``ops.select_cpu()`` will cause the
    ``ops.enqueue()`` callback to be skipped.

    Note that the scheduler core will ignore an invalid CPU selection, for
    example, if it's outside the allowed cpumask of the task.

 2. Once the target CPU is selected, ``ops.enqueue()`` is invoked (unless the
-   task was dispatched directly from ``ops.select_cpu()``). ``ops.enqueue()``
+   task was inserted directly from ``ops.select_cpu()``). ``ops.enqueue()``
    can make one of the following decisions:

-   * Immediately dispatch the task to either the global or local DSQ by
-     calling ``scx_bpf_dispatch()`` with ``SCX_DSQ_GLOBAL`` or
+   * Immediately insert the task into either the global or local DSQ by
+     calling ``scx_bpf_dsq_insert()`` with ``SCX_DSQ_GLOBAL`` or
     ``SCX_DSQ_LOCAL``, respectively.

-   * Immediately dispatch the task to a custom DSQ by calling
-     ``scx_bpf_dispatch()`` with a DSQ ID which is smaller than 2^63.
+   * Immediately insert the task into a custom DSQ by calling
+     ``scx_bpf_dsq_insert()`` with a DSQ ID which is smaller than 2^63.

    * Queue the task on the BPF side.

@@ -257,23 +256,23 @@ The following briefly shows how a waking task is scheduled and executed.
    run, ``ops.dispatch()`` is invoked which can use the following two
    functions to populate the local DSQ.

-   * ``scx_bpf_dispatch()`` dispatches a task to a DSQ. Any target DSQ can
-     be used - ``SCX_DSQ_LOCAL``, ``SCX_DSQ_LOCAL_ON | cpu``,
-     ``SCX_DSQ_GLOBAL`` or a custom DSQ. While ``scx_bpf_dispatch()``
+   * ``scx_bpf_dsq_insert()`` inserts a task to a DSQ. Any target DSQ can be
+     used - ``SCX_DSQ_LOCAL``, ``SCX_DSQ_LOCAL_ON | cpu``,
+     ``SCX_DSQ_GLOBAL`` or a custom DSQ. While ``scx_bpf_dsq_insert()``
     currently can't be called with BPF locks held, this is being worked on
-     and will be supported. ``scx_bpf_dispatch()`` schedules dispatching
+     and will be supported. ``scx_bpf_dsq_insert()`` schedules insertion
     rather than performing them immediately. There can be up to
     ``ops.dispatch_max_batch`` pending tasks.

-   * ``scx_bpf_consume()`` tranfers a task from the specified non-local DSQ
-     to the dispatching DSQ. This function cannot be called with any BPF
-     locks held. ``scx_bpf_consume()`` flushes the pending dispatched tasks
-     before trying to consume the specified DSQ.
+   * ``scx_bpf_move_to_local()`` moves a task from the specified non-local
+     DSQ to the dispatching DSQ. This function cannot be called with any BPF
+     locks held. ``scx_bpf_move_to_local()`` flushes the pending insertions
+     tasks before trying to move from the specified DSQ.

 4. After ``ops.dispatch()`` returns, if there are tasks in the local DSQ,
    the CPU runs the first one. If empty, the following steps are taken:

-   * Try to consume the global DSQ. If successful, run the task.
+   * Try to move from the global DSQ. If successful, run the task.

    * If ``ops.dispatch()`` has dispatched any tasks, retry #3.

@@ -286,14 +285,14 @@ Note that the BPF scheduler can always choose to dispatch tasks immediately
 in ``ops.enqueue()`` as illustrated in the above simple example. If only the
 built-in DSQs are used, there is no need to implement ``ops.dispatch()`` as
 a task is never queued on the BPF scheduler and both the local and global
-DSQs are consumed automatically.
+DSQs are executed automatically.

-``scx_bpf_dispatch()`` queues the task on the FIFO of the target DSQ. Use
-``scx_bpf_dispatch_vtime()`` for the priority queue. Internal DSQs such as
+``scx_bpf_dsq_insert()`` inserts the task on the FIFO of the target DSQ. Use
+``scx_bpf_dsq_insert_vtime()`` for the priority queue. Internal DSQs such as
 ``SCX_DSQ_LOCAL`` and ``SCX_DSQ_GLOBAL`` do not support priority-queue
-dispatching, and must be dispatched to with ``scx_bpf_dispatch()``. See the
-function documentation and usage in ``tools/sched_ext/scx_simple.bpf.c`` for
-more information.
+dispatching, and must be dispatched to with ``scx_bpf_dsq_insert()``. See
+the function documentation and usage in ``tools/sched_ext/scx_simple.bpf.c``
+for more information.

 Where to Look
 =============

include/linux/sched/ext.h

Lines changed: 2 additions & 0 deletions
@@ -204,11 +204,13 @@ struct sched_ext_entity {

 void sched_ext_free(struct task_struct *p);
 void print_scx_info(const char *log_lvl, struct task_struct *p);
+void scx_softlockup(u32 dur_s);

 #else	/* !CONFIG_SCHED_CLASS_EXT */

 static inline void sched_ext_free(struct task_struct *p) {}
 static inline void print_scx_info(const char *log_lvl, struct task_struct *p) {}
+static inline void scx_softlockup(u32 dur_s) {}

 #endif	/* CONFIG_SCHED_CLASS_EXT */
 #endif	/* _LINUX_SCHED_EXT_H */
