Commit cff2907

Replace CFS/EEVDF with compact O(1) tiny scheduler

Linux 7.0's kernel/sched/fair.c carries no #ifdef CONFIG_SMP guard. On a UP NOMMU image the SMP load-balancer (select_task_rq_fair 1,484, sched_balance_rq 1,460, update_sd_lb_stats 912, sched_balance_find_*_group 1,262, _nohz_idle_balance 424, can_migrate_task 324, active_load_balance_cpu_stop 316, ~7.8KB total) gets pinned by the sched_class callback table; --gc-sections cannot reach it through the table.

Add the same kind of out-of-tree gate 0012/0013 used for debug.c and deadline.c, but for the whole class. CONFIG_SCHED_FAIR_TINY (default n) wraps fair.c body in #ifndef and provides a three-priority O(1) class in the #else branch:

- per-CPU bitmap + per-priority FIFO (HIGH/NORMAL/LOW)
- O(1) pick: find_first_bit(active) + list_first_entry
- O(1) enqueue: list_add_tail + __set_bit
- O(1) dequeue: list_del_init + __clear_bit when queue empties
- cross-priority preemption at wakeup; round-robin within a priority via a fixed jiffies time-slice reset on set_next_task

Priority is a pure function of nice value: nice<0 -> HIGH, nice==0 -> NORMAL, nice>0 -> LOW; SCHED_IDLE collapses to LOW; SCHED_BATCH uses nice normally.

Tasks chain through the existing &p->se.group_node (dead under !FAIR_GROUP_SCHED) so task_struct stays unchanged. The bucket index is recomputed from p->static_prio on every callback; core.c's dequeue-modify-enqueue protocol (verified across all four static_prio mutation sites: sched_fork at 4650/4653, syscalls.c set_user_nice at 84 for RT/DL and at 89 for fair via scoped_guard(sched_change, ...)) keeps the value stable across the removal/insertion bracket.

RT preemption is unchanged: rt_sched_class still preempts fair via the existing class chain walk in pick_next_task_balance.

Not the historical 2.6 O(1) scheduler -- no active/expired arrays, no interactivity estimator (the gameable heuristic that motivated CFS), no priority recalculation. Just the priority bitmap + FIFO data structure that O(1) got right, without the policy machinery that O(1) got wrong.

The #else branch re-exports every symbol other TUs depend on: update_curr_common (rt.c / deadline-class stub / stop_task.c / ext.c runtime accounting), init_cfs_rq, fair_server_init, init_sched_fair_class, sched_init_granularity, update_max_interval, init_entity_runnable_average, post_init_entity_util_avg, sched_balance_trigger, nohz_balance_{enter,exit}_idle, nohz_run_idle_balance, update_group_capacity, __setparam_fair, arch_asym_cpu_priority, plus sysctl_sched_base_slice and sysctl_sched_migration_cost storage.

switched_to_fair and prio_changed_fair filter on rq->donor->sched_class != fair_sched_class to avoid spurious resched_curr when the runner is RT (mirrors mainline behavior; the kernel dispatches both hooks unconditionally).

pelt.c is left untouched. With fair.c gated, its CFS-side entry points lose their callers; rt.c keeps update_rt_rq_load_avg live. The remaining PELT symbols are non-static so LTO largely cannot strip them, but the cost is small (1.7KB).

Build wiring: build.sh adds the 0014 patch to the apply glob, sets CONFIG_SCHED_FAIR_TINY=y in the inline kernel .config block, and extends the post-olddefconfig verifier with a positive presence check.

Result:

- linux.axf 1,204,768 -> 1,188,352 bytes (-16,416 / -1.36%)
- vmlinux .text 729,380 -> 713,412 (-15,968), .rodata -96, .init.text -116, .bss -36, .data -32
- kernel/sched/fair.c collapses 16,782 / 97 syms -> ~1,160 bytes / 22 syms
- pick_task_fair compiles to 24 bytes of pure O(1) machine code (find_first_bit + list head deref + container_of, no loops or rb-tree walks)

QEMU MPS2-AN386 boots clean to the BusyBox shell across three back-to-back validate-qemu.sh runs against the full PGO workload (17 fork/exec/wait sequences exercising hush spawn, cp /bin/busybox, mv, ln, mkdir, rm, test pipelines).
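
For readers who want to see the bitmap + per-priority FIFO shape described in the message, the following is a minimal userspace sketch, not the code from patch 0014. Every name in it (tiny_rq, tiny_enqueue, tiny_dequeue, tiny_pick, nice_to_bucket, struct node) is hypothetical. It assumes a hand-rolled circular list in place of the kernel's list.h, a loop bounded by the three buckets in place of find_first_bit, and a plain nice field in place of p->static_prio; SCHED_IDLE handling, time slices and preemption are not modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Three buckets as named in the commit message; lower index wins. */
enum { PRIO_HIGH, PRIO_NORMAL, PRIO_LOW, NR_PRIO };

/* Hypothetical stand-in for the kernel's struct list_head. */
struct node { struct node *prev, *next; };

struct task {
    int nice;
    struct node link;   /* plays the role of p->se.group_node */
};

struct tiny_rq {
    unsigned int active;         /* bit b set <=> queue[b] is non-empty */
    struct node queue[NR_PRIO];  /* one circular FIFO head per bucket */
};

static void tiny_rq_init(struct tiny_rq *rq)
{
    rq->active = 0;
    for (int b = 0; b < NR_PRIO; b++)
        rq->queue[b].prev = rq->queue[b].next = &rq->queue[b];
}

/* nice<0 -> HIGH, nice==0 -> NORMAL, nice>0 -> LOW */
static int nice_to_bucket(int nice)
{
    if (nice < 0)
        return PRIO_HIGH;
    return nice > 0 ? PRIO_LOW : PRIO_NORMAL;
}

/* Append at the tail of the bucket's FIFO and mark the bucket active. */
static void tiny_enqueue(struct tiny_rq *rq, struct task *p)
{
    int b = nice_to_bucket(p->nice);
    struct node *head = &rq->queue[b];

    assert(b >= 0 && b < NR_PRIO);
    p->link.prev = head->prev;
    p->link.next = head;
    head->prev->next = &p->link;
    head->prev = &p->link;
    rq->active |= 1u << b;
}

/* Unlink the task; clear the bucket's bit only when its FIFO empties. */
static void tiny_dequeue(struct tiny_rq *rq, struct task *p)
{
    int b = nice_to_bucket(p->nice);

    p->link.prev->next = p->link.next;
    p->link.next->prev = p->link.prev;
    p->link.prev = p->link.next = &p->link;
    if (rq->queue[b].next == &rq->queue[b])
        rq->active &= ~(1u << b);
}

/* Return the head of the first non-empty bucket, or NULL if all are empty. */
static struct task *tiny_pick(struct tiny_rq *rq)
{
    for (int b = 0; b < NR_PRIO; b++) {   /* at most NR_PRIO iterations */
        if (rq->active & (1u << b)) {
            struct node *n = rq->queue[b].next;
            return (struct task *)((char *)n - offsetof(struct task, link));
        }
    }
    return NULL;
}
```

In this sketch, enqueueing a nice-0 task, then a nice-5 task, then a nice -3 task makes tiny_pick return the nice -3 task first; once it is dequeued, the nice-0 task is picked, and the nice-5 task runs only after the NORMAL bucket is empty. Because the bucket is recomputed from nice on both enqueue and dequeue, the sketch relies on the same invariant the message describes: nice must not change while the task is queued.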
1 parent 9000871 commit cff2907

2 files changed

Lines changed: 530 additions & 2 deletions

build.sh

Lines changed: 17 additions & 2 deletions
@@ -666,7 +666,7 @@ build_linux() {
     cd linux-${LINUX_VERSION}
 
     # Apply linux-tiny patches for reduced memory footprint and LTO support
-    for p in ../patches/0002-*.patch ../patches/0003-*.patch ../patches/0004-*.patch ../patches/0005-*.patch ../patches/0006-*.patch ../patches/0010-*.patch ../patches/0011-*.patch ../patches/0012-*.patch ../patches/0013-*.patch; do
+    for p in ../patches/0002-*.patch ../patches/0003-*.patch ../patches/0004-*.patch ../patches/0005-*.patch ../patches/0006-*.patch ../patches/0010-*.patch ../patches/0011-*.patch ../patches/0012-*.patch ../patches/0013-*.patch ../patches/0014-*.patch; do
         [ -f "${p}" ] || continue
         apply_patch_once "${p}"
     done
@@ -938,6 +938,20 @@ build_linux() {
     echo "# CONFIG_CGROUPS is not set" >>.config
     echo "# CONFIG_SCHED_AUTOGROUP is not set" >>.config
 
+    # Patch 0014 replaces fair.c (CFS/EEVDF) with a compact O(1) priority
+    # round-robin SCHED_NORMAL class under CONFIG_SCHED_FAIR_TINY. fair.c
+    # body is wrapped in #ifndef; the #else branch provides a per-CPU
+    # bitmap + per-priority FIFO (HIGH/NORMAL/LOW) plus stubs for every
+    # symbol other TUs (rt.c, deadline.c stub, syscalls.c, topology.c,
+    # idle.c, build_utility.c) reference. Linux 7.0 has no CONFIG_SMP
+    # guard inside fair.c so balance code (select_task_rq, sched_balance_*,
+    # _nohz_idle_balance, ~7.8KB total) stays linked under UP via the
+    # sched_class table; this knob is the only way to remove it. nice
+    # values quantise to the three buckets (nice<0 -> HIGH, nice==0 ->
+    # NORMAL, nice>0 -> LOW); SCHED_IDLE collapses to LOW; rt_sched_class
+    # still pre-empts via the existing class chain walk.
+    echo "CONFIG_SCHED_FAIR_TINY=y" >>.config
+
     run_logged "olddefconfig" kernel_make olddefconfig
 
     # Verify critical config options survived olddefconfig resolution.
@@ -1003,7 +1017,8 @@ build_linux() {
         "# CONFIG_SCHED_DEADLINE_CLASS is not set" \
         "# CONFIG_PSI is not set" \
         "# CONFIG_CGROUPS is not set" \
-        "# CONFIG_SCHED_AUTOGROUP is not set"; do
+        "# CONFIG_SCHED_AUTOGROUP is not set" \
+        "CONFIG_SCHED_FAIR_TINY=y"; do
         if ! grep -q "^${opt}\$" .config; then
             echo "ERROR: expected '${opt}' in .config after olddefconfig"
             exit 1
