Commit 7c65ae8
sched_ext: Don't call put_prev_task_scx() before picking the next task
fd03c5b ("sched: Rework pick_next_task()") changed the definition of
pick_next_task() from:

  pick_next_task() := pick_task() + set_next_task(.first = true)

to:

  pick_next_task(prev) := pick_task() + put_prev_task() + set_next_task(.first = true)

making invoking put_prev_task() pick_next_task()'s responsibility. This
reordering allows pick_task() to be shared between regular and core-sched
paths and put_prev_task() to know the next task.

sched_ext depended on put_prev_task_scx() enqueueing the current task before
pick_next_task_scx() is called. While pulling sched/core changes, 70cc76aa0d80
("Merge branch 'tip/sched/core' into for-6.12") added an explicit
put_prev_task_scx() call for SCX tasks in pick_next_task_scx() before picking
the first task as a workaround. Clean it up and adopt the conventions that
other sched classes are following.

The operation of keeping running the current task was spread and required the
task to be put on the local DSQ before picking:

- balance_one() used SCX_TASK_BAL_KEEP to indicate that the task is still
  runnable, hasn't exhausted its slice, and thus should keep running.

- put_prev_task_scx() enqueued the task to local DSQ if SCX_TASK_BAL_KEEP is
  set. It also called do_enqueue_task() with SCX_ENQ_LAST if it is the only
  runnable task. do_enqueue_task() in turn decided whether to use the local
  DSQ depending on SCX_OPS_ENQ_LAST.

Consolidate the logic in balance_one() as it always knows whether it is going
to keep the current task. balance_one() now considers all conditions where the
current task should be kept and uses SCX_TASK_BAL_KEEP to tell
pick_next_task_scx() to keep the current task instead of picking one from the
local DSQ. Accordingly, SCX_ENQ_LAST handling is removed from
put_prev_task_scx() and do_enqueue_task() and pick_next_task_scx() is updated
to pick the current task if SCX_TASK_BAL_KEEP is set. The workaround
put_prev_task[_scx]() calls are replaced with put_prev_set_next_task().
This causes two behavior changes observable from the BPF scheduler:

- When a task keeps running, it no longer goes through an enqueue/dequeue
  cycle and thus ops.stopping/running() transitions. The new behavior is
  better and all the existing schedulers should be able to handle it.

- The BPF scheduler cannot keep executing the current task by enqueueing an
  SCX_ENQ_LAST task to the local DSQ. If SCX_OPS_ENQ_LAST is specified, the
  BPF scheduler is responsible for resuming execution after each SCX_ENQ_LAST.
  SCX_OPS_ENQ_LAST is mostly useful for cases where scheduling decisions are
  not made on the local CPU - e.g. central or userspace-driven scheduling -
  and the new behavior is more logical and shouldn't pose any problems.

The SCX_OPS_ENQ_LAST demonstration is dropped from scx_qmap as it doesn't fit
that well anymore, and the last task handling is moved to the end of
qmap_dispatch().

Signed-off-by: Tejun Heo <[email protected]>
Cc: David Vernet <[email protected]>
Cc: Andrea Righi <[email protected]>
Cc: Changwoo Min <[email protected]>
Cc: Daniel Hodges <[email protected]>
Cc: Dan Schatzberg <[email protected]>
1 parent d7b01ae commit 7c65ae8

File tree: 2 files changed (+80, -72 lines)

kernel/sched/ext.c (62 additions, 68 deletions)
@@ -630,11 +630,8 @@ enum scx_enq_flags {
 	 * %SCX_OPS_ENQ_LAST is specified, they're ops.enqueue()'d with the
 	 * %SCX_ENQ_LAST flag set.
 	 *
-	 * If the BPF scheduler wants to continue executing the task,
-	 * ops.enqueue() should dispatch the task to %SCX_DSQ_LOCAL immediately.
-	 * If the task gets queued on a different dsq or the BPF side, the BPF
-	 * scheduler is responsible for triggering a follow-up scheduling event.
-	 * Otherwise, Execution may stall.
+	 * The BPF scheduler is responsible for triggering a follow-up
+	 * scheduling event. Otherwise, Execution may stall.
 	 */
 	SCX_ENQ_LAST		= 1LLU << 41,
 
@@ -1852,12 +1849,8 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
 	if (!scx_rq_online(rq))
 		goto local;
 
-	if (scx_ops_bypassing()) {
-		if (enq_flags & SCX_ENQ_LAST)
-			goto local;
-		else
-			goto global;
-	}
+	if (scx_ops_bypassing())
+		goto global;
 
 	if (p->scx.ddsp_dsq_id != SCX_DSQ_INVALID)
 		goto direct;
@@ -1867,11 +1860,6 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
 	    unlikely(p->flags & PF_EXITING))
 		goto local;
 
-	/* see %SCX_OPS_ENQ_LAST */
-	if (!static_branch_unlikely(&scx_ops_enq_last) &&
-	    (enq_flags & SCX_ENQ_LAST))
-		goto local;
-
 	if (!SCX_HAS_OP(enqueue))
 		goto global;
 
@@ -2517,7 +2505,6 @@ static int balance_one(struct rq *rq, struct task_struct *prev, bool local)
 	struct scx_dsp_ctx *dspc = this_cpu_ptr(scx_dsp_ctx);
 	bool prev_on_scx = prev->sched_class == &ext_sched_class;
 	int nr_loops = SCX_DSP_MAX_LOOPS;
-	bool has_tasks = false;
 
 	lockdep_assert_rq_held(rq);
 	rq->scx.flags |= SCX_RQ_IN_BALANCE;
@@ -2542,9 +2529,9 @@ static int balance_one(struct rq *rq, struct task_struct *prev, bool local)
 		/*
 		 * If @prev is runnable & has slice left, it has priority and
 		 * fetching more just increases latency for the fetched tasks.
-		 * Tell put_prev_task_scx() to put @prev on local_dsq. If the
-		 * BPF scheduler wants to handle this explicitly, it should
-		 * implement ->cpu_released().
+		 * Tell pick_next_task_scx() to keep running @prev. If the BPF
+		 * scheduler wants to handle this explicitly, it should
+		 * implement ->cpu_release().
 		 *
 		 * See scx_ops_disable_workfn() for the explanation on the
 		 * bypassing test.
@@ -2570,7 +2557,7 @@ static int balance_one(struct rq *rq, struct task_struct *prev, bool local)
 		goto has_tasks;
 
 	if (!SCX_HAS_OP(dispatch) || scx_ops_bypassing() || !scx_rq_online(rq))
-		goto out;
+		goto no_tasks;
 
 	dspc->rq = rq;
 
@@ -2609,13 +2596,23 @@ static int balance_one(struct rq *rq, struct task_struct *prev, bool local)
 		}
 	} while (dspc->nr_tasks);
 
-	goto out;
+no_tasks:
+	/*
+	 * Didn't find another task to run. Keep running @prev unless
+	 * %SCX_OPS_ENQ_LAST is in effect.
+	 */
+	if ((prev->scx.flags & SCX_TASK_QUEUED) &&
+	    (!static_branch_unlikely(&scx_ops_enq_last) || scx_ops_bypassing())) {
+		if (local)
+			prev->scx.flags |= SCX_TASK_BAL_KEEP;
+		goto has_tasks;
+	}
+	rq->scx.flags &= ~SCX_RQ_IN_BALANCE;
+	return false;
 
 has_tasks:
-	has_tasks = true;
-out:
 	rq->scx.flags &= ~SCX_RQ_IN_BALANCE;
-	return has_tasks;
+	return true;
 }
 
 static int balance_scx(struct rq *rq, struct task_struct *prev,
@@ -2728,44 +2725,34 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p,
 	if (SCX_HAS_OP(stopping) && (p->scx.flags & SCX_TASK_QUEUED))
 		SCX_CALL_OP_TASK(SCX_KF_REST, stopping, p, true);
 
-	/*
-	 * If we're being called from put_prev_task_balance(), balance_scx() may
-	 * have decided that @p should keep running.
-	 */
-	if (p->scx.flags & SCX_TASK_BAL_KEEP) {
+	if (p->scx.flags & SCX_TASK_QUEUED) {
 		p->scx.flags &= ~SCX_TASK_BAL_KEEP;
-		set_task_runnable(rq, p);
-		dispatch_enqueue(&rq->scx.local_dsq, p, SCX_ENQ_HEAD);
-		return;
-	}
 
-	if (p->scx.flags & SCX_TASK_QUEUED) {
 		set_task_runnable(rq, p);
 
 		/*
-		 * If @p has slice left and balance_scx() didn't tag it for
-		 * keeping, @p is getting preempted by a higher priority
-		 * scheduler class or core-sched forcing a different task. Leave
-		 * it at the head of the local DSQ.
+		 * If @p has slice left and is being put, @p is getting
+		 * preempted by a higher priority scheduler class or core-sched
+		 * forcing a different task. Leave it at the head of the local
+		 * DSQ.
 		 */
 		if (p->scx.slice && !scx_ops_bypassing()) {
 			dispatch_enqueue(&rq->scx.local_dsq, p, SCX_ENQ_HEAD);
 			return;
 		}
 
 		/*
-		 * If we're in the pick_next_task path, balance_scx() should
-		 * have already populated the local DSQ if there are any other
-		 * available tasks. If empty, tell ops.enqueue() that @p is the
-		 * only one available for this cpu. ops.enqueue() should put it
-		 * on the local DSQ so that the subsequent pick_next_task_scx()
-		 * can find the task unless it wants to trigger a separate
-		 * follow-up scheduling event.
+		 * If @p is runnable but we're about to enter a lower
+		 * sched_class, %SCX_OPS_ENQ_LAST must be set. Tell
+		 * ops.enqueue() that @p is the only one available for this cpu,
+		 * which should trigger an explicit follow-up scheduling event.
 		 */
-		if (list_empty(&rq->scx.local_dsq.list))
+		if (sched_class_above(&ext_sched_class, next->sched_class)) {
+			WARN_ON_ONCE(!static_branch_unlikely(&scx_ops_enq_last));
 			do_enqueue_task(rq, p, SCX_ENQ_LAST, -1);
-		else
+		} else {
 			do_enqueue_task(rq, p, 0, -1);
+		}
 	}
 }

@@ -2780,27 +2767,33 @@ static struct task_struct *pick_next_task_scx(struct rq *rq,
 {
 	struct task_struct *p;
 
-	if (prev->sched_class == &ext_sched_class)
-		put_prev_task_scx(rq, prev, NULL);
-
-	p = first_local_task(rq);
-	if (!p)
-		return NULL;
-
-	if (prev->sched_class != &ext_sched_class)
-		prev->sched_class->put_prev_task(rq, prev, p);
-
-	set_next_task_scx(rq, p, true);
+	/*
+	 * If balance_scx() is telling us to keep running @prev, replenish slice
+	 * if necessary and keep running @prev. Otherwise, pop the first one
+	 * from the local DSQ.
+	 */
+	if (prev->scx.flags & SCX_TASK_BAL_KEEP) {
+		prev->scx.flags &= ~SCX_TASK_BAL_KEEP;
+		p = prev;
+		if (!p->scx.slice)
+			p->scx.slice = SCX_SLICE_DFL;
+	} else {
+		p = first_local_task(rq);
+		if (!p)
+			return NULL;
 
-	if (unlikely(!p->scx.slice)) {
-		if (!scx_ops_bypassing() && !scx_warned_zero_slice) {
-			printk_deferred(KERN_WARNING "sched_ext: %s[%d] has zero slice in pick_next_task_scx()\n",
-					p->comm, p->pid);
-			scx_warned_zero_slice = true;
+		if (unlikely(!p->scx.slice)) {
+			if (!scx_ops_bypassing() && !scx_warned_zero_slice) {
+				printk_deferred(KERN_WARNING "sched_ext: %s[%d] has zero slice in pick_next_task_scx()\n",
+						p->comm, p->pid);
+				scx_warned_zero_slice = true;
+			}
+			p->scx.slice = SCX_SLICE_DFL;
 		}
-		p->scx.slice = SCX_SLICE_DFL;
 	}
 
+	put_prev_set_next_task(rq, prev, p);
+
 	return p;
 }
 
@@ -3875,12 +3868,13 @@ bool task_should_scx(struct task_struct *p)
  * to force global FIFO scheduling.
  *
  * a. ops.enqueue() is ignored and tasks are queued in simple global FIFO order.
+ *    %SCX_OPS_ENQ_LAST is also ignored.
  *
  * b. ops.dispatch() is ignored.
  *
- * c. balance_scx() never sets %SCX_TASK_BAL_KEEP as the slice value can't be
- *    trusted. Whenever a tick triggers, the running task is rotated to the tail
- *    of the queue with core_sched_at touched.
+ * c. balance_scx() does not set %SCX_TASK_BAL_KEEP on non-zero slice as slice
+ *    can't be trusted. Whenever a tick triggers, the running task is rotated to
+ *    the tail of the queue with core_sched_at touched.
  *
  * d. pick_next_task() suppresses zero slice warning.
  *

tools/sched_ext/scx_qmap.bpf.c (18 additions, 4 deletions)
@@ -205,16 +205,15 @@ void BPF_STRUCT_OPS(qmap_enqueue, struct task_struct *p, u64 enq_flags)
 
 	/*
 	 * All enqueued tasks must have their core_sched_seq updated for correct
-	 * core-sched ordering, which is why %SCX_OPS_ENQ_LAST is specified in
-	 * qmap_ops.flags.
+	 * core-sched ordering. Also, take a look at the end of qmap_dispatch().
 	 */
 	tctx->core_sched_seq = core_sched_tail_seqs[idx]++;
 
 	/*
 	 * If qmap_select_cpu() is telling us to or this is the last runnable
 	 * task on the CPU, enqueue locally.
 	 */
-	if (tctx->force_local || (enq_flags & SCX_ENQ_LAST)) {
+	if (tctx->force_local) {
 		tctx->force_local = false;
 		scx_bpf_dispatch(p, SCX_DSQ_LOCAL, slice_ns, enq_flags);
 		return;
@@ -285,6 +284,7 @@ void BPF_STRUCT_OPS(qmap_dispatch, s32 cpu, struct task_struct *prev)
 {
 	struct task_struct *p;
 	struct cpu_ctx *cpuc;
+	struct task_ctx *tctx;
 	u32 zero = 0, batch = dsp_batch ?: 1;
 	void *fifo;
 	s32 i, pid;
@@ -349,6 +349,21 @@ void BPF_STRUCT_OPS(qmap_dispatch, s32 cpu, struct task_struct *prev)
 
 		cpuc->dsp_cnt = 0;
 	}
+
+	/*
+	 * No other tasks. @prev will keep running. Update its core_sched_seq as
+	 * if the task were enqueued and dispatched immediately.
+	 */
+	if (prev) {
+		tctx = bpf_task_storage_get(&task_ctx_stor, prev, 0, 0);
+		if (!tctx) {
+			scx_bpf_error("task_ctx lookup failed");
+			return;
+		}
+
+		tctx->core_sched_seq =
+			core_sched_tail_seqs[weight_to_idx(prev->scx.weight)]++;
+	}
 }
 
 void BPF_STRUCT_OPS(qmap_tick, struct task_struct *p)
@@ -701,6 +716,5 @@ SCX_OPS_DEFINE(qmap_ops,
 	       .cpu_offline		= (void *)qmap_cpu_offline,
 	       .init			= (void *)qmap_init,
 	       .exit			= (void *)qmap_exit,
-	       .flags			= SCX_OPS_ENQ_LAST,
 	       .timeout_ms		= 5000U,
 	       .name			= "qmap");
