vicLin8712 commented on Nov 17, 2025

New O(1) time complexity scheduler

This PR introduces a new O(1) priority-based scheduler that replaces the original O(n) round-robin scheduler. The previous design scanned the global task list linearly to select the next runnable task (TASK_READY), which became a bottleneck as the number of tasks increased and did not support priority-based task selection.

Changes

The following diagrams illustrate the differences between the original and new schedulers in this PR.

Original scheduler

As shown in the figure, the original scheduler selects the next task by scanning task states. Once a runnable task is found, it is assigned to task_current and a context switch is performed.

(Figure: the original scheduler's linear scan over the global task list)

This linear search introduces a significant performance issue in the scheduler, especially when the number of runnable tasks increases. The original scheduler iterates over the task list circularly, but because it cannot guarantee that all tasks are visited safely, the iteration count is capped with an artificial limit (IMAX = 500).

New scheduler design in this PR

The new scheduler introduces a sched_t structure that provides constant-time (O(1)) tracking and selection of runnable tasks. The main components are:

  • Bitmap (bitmap)
    A compact bitmask where each bit (0–7) represents one priority level (from bit 0 critical to bit 7 idle). A bit is set when at least one task of the corresponding priority is runnable. This enables O(1) identification of the highest runnable priority via a De Bruijn–based least-significant-bit (LSB) helper.

  • Ready queues (ready_queue[])
    An array of per-priority linked lists. Each ready queue contains only the runnable tasks of its priority level. Blocked, suspended, or delayed tasks are removed from this list; waking up or resuming a task re-enqueues it.

  • Round-robin cursor (rr_cursor[])
    For each priority level, an RR cursor tracks the next task in the corresponding ready queue for round-robin scheduling among tasks of the same priority.

(Figure: sched_t with priority bitmap, per-priority ready queues, and RR cursors)
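
A minimal sketch of the sched_t layout implied by the description above and the commit notes later in this page (field names follow the commit messages; exact types and queue internals are assumptions):

    typedef struct {
        uint32_t     ready_bitmap;     /* bit i set => priority level i has runnable tasks */
        list_t      *ready_queues[8];  /* per-priority lists of TASK_READY/TASK_RUNNING tasks */
        list_node_t *rr_cursors[8];    /* next node to run within each priority level */
        uint32_t     hart_id;          /* scheduler instance id (0 on single-hart builds) */
        list_node_t *task_idle;        /* idle task sentinel; never enqueued in a ready queue */
    } sched_t;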

New task selection logic in this PR

The new scheduler selects the next task in three main steps:

  1. Find the highest runnable priority from the bitmap
    The scheduler uses a De Bruijn–based helper on the bitmap to obtain the index of the highest-priority runnable level in O(1) time.

  2. Pick the next task via the round-robin cursor
    For that priority level, the scheduler reads the corresponding rr_cursor, which points to the next runnable task in the ready queue, and assigns it to task_current.

  3. Advance the round-robin cursor
    After selecting the task, the rr_cursor is advanced to the next node in the ready queue (wrapping around when reaching the end), preserving round-robin scheduling among tasks of the same priority.

With this design, the scheduler no longer scans the entire task list. Instead, it uses the bitmap plus per-priority cursors to achieve deterministic O(1) task selection while still providing fairness within each priority level.
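
As a rough illustration, the selection path could look like the following sketch. This is not the PR's actual code: find_lsb() and the list field names (head, next) are assumed identifiers.

    static list_node_t *sched_select_next(sched_t *s)
    {
        if (!s->ready_bitmap)                     /* no runnable task at any level */
            return s->task_idle;                  /* fall back to the idle sentinel */

        uint8_t prio = find_lsb(s->ready_bitmap); /* step 1: highest runnable priority */
        list_node_t *node = s->rr_cursors[prio];  /* step 2: task picked by the cursor */

        /* step 3: advance the cursor circularly for same-priority round-robin */
        s->rr_cursors[prio] = node->next ? node->next : s->ready_queues[prio]->head;
        return node;                              /* becomes kcb->task_current */
    }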

Features

The new scheduler includes the following features:

  • O(1) priority-based scheduler

    • Per-priority (8 total) ready queues for runnable tasks.
    • Bitmap to track which priority levels have ready tasks.
    • O(1) per-priority task count tracking.
    • De Bruijn–based helper to find the highest-priority ready task in O(1).
  • Strict priority scheduling policy

    • Higher-priority tasks always preempt lower-priority ones.
    • Round-robin within the same priority level (RR cursor preserved).
  • Default idle task (sysmem)

    • Automatically initialized during system startup, ensuring the kernel always has a runnable task.
    • Managed directly by the sched_t structure and does not appear in any ready queue or global task list.
    • Serves as the initial running task and yields immediately once user tasks become available after system initialization.

Implementation detail

This PR introduces an O(1) scheduler by refactoring the internal scheduling logic and reorganizing task management around a new data structure sched_t. The sched_t instance contains three key components: a bitmap that tracks which priority levels contain runnable tasks, an array of per-priority ready queues, and round-robin cursors used to determine the next task within each priority level.

All enqueue and dequeue operations now funnel through unified helpers (sched_enqueue_task() and sched_dequeue_task()), ensuring consistent updates to both the bitmap and ready queues. Task state transitions were updated accordingly: when a task becomes blocked, suspended, delayed, or cancelled, it is removed from its ready queue; when it becomes runnable again, it is reinserted into the appropriate queue.

The scheduler's main selection function now uses the bitmap and a De Bruijn–based LSB helper to perform constant-time priority lookup, then reads and advances the per-priority round-robin cursor to select the next task. The idle task is handled specially: it is not placed in any ready queue and is selected only when all bitmap bits are clear.

No changes were made to the global task list structure or the task state model; only the scheduling backend has been redesigned to provide deterministic behavior and strict priority semantics.

Task state transition

The task state machine is unchanged from the original scheduler; the new scheduler only adds the ready-queue dequeue/enqueue path.

All tasks in the TASK_READY and TASK_RUNNING states reside in the ready queue of their priority level. Entering either of these states enqueues the task into the ready queue; leaving them dequeues it.

(Figure: task state transition diagram)

Validation

1. Backward compatible

All applications under the app/ directory have been executed and verified to run correctly without modification. No functional regressions were observed when switching from the original scheduler to the new O(1) scheduler.

2. Unit test

This unit test focuses on verifying the consistency of the bitmap and the O(1) task count tracking maintained in sched_t during task state transitions and priority changes.

Approach

  • A dedicated controller task is created with priority TASK_PRIO_CRIT to orchestrate the entire test process and ensure deterministic sequencing.
  • After each state change of the test tasks, the unit test checks both bitmap correctness and per-priority task count consistency to ensure alignment with the ready-queue state.

Task types

  • Controller task: Responsible for coordinating the test flow and triggering all state transitions.
  • Delay task: A runnable task that enters TASK_BLOCKED through mo_task_delay(), allowing verification of dequeue behavior and ready-queue updates.
  • Normal task: A simple infinite-loop task that remains runnable unless externally suspended or cancelled, serving as the primary subject for state transition tests.
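
In sketch form, the per-transition check amounts to the following (hypothetical helper; queue_counts is the counter mentioned in the commit notes below): a priority bit must be set exactly when its ready queue is non-empty, and the tracked count must agree.

    static bool sched_state_consistent(const sched_t *s)
    {
        for (int prio = 0; prio < 8; prio++) {
            bool bit_set   = (s->ready_bitmap >> prio) & 1u;
            bool has_tasks = s->queue_counts[prio] > 0;
            if (bit_set != has_tasks)
                return false;  /* bitmap does not mirror queue population */
        }
        return true;
    }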

Verified state points
The bitmap and task count are verified after each of the following actions.

  • Normal task state transitions

    • Task creation (TASK_READY).
    • Priority changes.
    • Suspension initiated by the controller (TASK_READY → TASK_SUSPEND).
    • Resumption by the controller (TASK_SUSPEND → TASK_READY).
    • Cancellation by the controller (TASK_READY → TASK_CANCELLED), ensuring it is removed from ready queues and no bitmap bits remain set.
  • Blocked task behavior (TASK_RUNNING → TASK_BLOCKED)

    • Delay task is created and its priority changed to match the controller's priority (TASK_READY).
    • When the controller yields, the delay task becomes active, invokes mo_task_delay(), transitions into TASK_BLOCKED, and the controller resumes execution. The test verifies that the blocked task is fully removed from the ready queue and its priority bit is cleared in the bitmap.

Expected results
All state transitions maintain consistent bitmap states, correct ready-queue membership, and accurate per-priority task count tracking. No unexpected runnable tasks appear, and no ready-queue entries persist after a task transitions to BLOCKED, SUSPENDED, or CANCELLED.

Test result

Linmo kernel is starting...
Heap initialized, 130005992 bytes available
idle id 1: entry=80001900 stack=80004488 size=4096
task 2: entry=80000788 stack=80005508 size=4096 prio_level=4 time_slice=5
Scheduler mode: Preemptive
Starting RR-cursor based scheduler test suits...

=== Testing Bitmap and Task Count Consistency ===
task 3: entry=80000168 stack=80006634 size=4096 prio_level=4 time_slice=5
PASS: Bitmap is consistent when TASK_READY
PASS: Task count is consistent when TASK_READY
PASS: Bitmap is consistent when priority migration
PASS: Task count is consistent when priority migration
PASS: Bitmap is consistent when TASK_SUSPENDED
PASS: Task count is consistent when TASK_SUSPENDED
PASS: Bitmap is consistent when TASK_READY from TASK_SUSPENDED
PASS: Task count is consistent when TASK_READY from TASK_SUSPENDED
PASS: Bitmap is consistent when task canceled
PASS: Task count is consistent when task canceled
task 4: entry=80000178 stack=80006634 size=4096 prio_level=4 time_slice=5
PASS: Task count is consistent when task canceled
PASS: Task count is consistent when task blocked

=== Test Results ===
Tests passed: 12
Tests failed: 0
Total tests: 12
All tests PASSED!
RR-cursor based scheduler tests completed successfully.

Note

  1. The term TASK_CANCELLED in this document is used only for explanation. It is not an actual state in the task state machine, but represents the condition where a task has been removed from all scheduling structures and no longer exists in the system.
  2. The task states shown in parentheses (e.g., (TASK_READY)) refer to the state of the test tasks being created or manipulated, not the state of the controller task.


3. Benchmark

The benchmark compares the original O(n) scheduler with the new O(1) scheduler under multiple task-load scenarios. Each scenario measures the average scheduling latency observed in QEMU using the existing benchmarking framework.

Test suites

Benchmark methodology

  • Same build configuration for both schedulers.
  • Each scenario repeatedly triggers scheduling events and measures the average scheduling latency.
  • The results reflect pure scheduler overhead (not influenced by application logic).
  • Reported latency is the average of multiple runs, with outliers filtered by the benchmark script.

Scenarios
The benchmark covers the following scenarios:

  • Minimal active – Few tasks, low diversity.
  • Moderate active – Medium task count with mixed priorities.
  • Heavy active – Many runnable tasks across all priority levels.
  • Stress test – Real-time–biased workload with uneven priority distribution.
  • Full load test.

Test results

Scenario 'Minimal Active':
  mean improvement        = 2.68x faster
  std dev of improvement  = 0.34x
  min / max improvement   = 1.75x  /  3.35x
  95% CI of improvement   = [2.54x, 2.83x]
  mean old sched time     = 5616.25 us
  mean new sched time     = 2119.0 us
  max  old sched time     = 47.0 us
  max  new sched time     = 37.0 us

Scenario 'Moderate Active':
  mean improvement        = 1.80x faster
  std dev of improvement  = 0.27x
  min / max improvement   = 1.27x  /  2.51x
  95% CI of improvement   = [1.68x, 1.92x]
  mean old sched time     = 3887.6 us 
  mean new sched time     = 2179.45 us 
  max  old sched time     = 40.0 us 
  max  new sched time     = 23.0 us 

Scenario 'Heavy Active':
  mean improvement        = 1.02x faster
  std dev of improvement  = 0.08x
  min / max improvement   = 0.84x  /  1.17x
  95% CI of improvement   = [0.98x, 1.06x]
  mean old sched time     = 2150.15 us 
  mean new sched time     = 2119.1 us 
  max  old sched time     = 73.0 us 
  max  new sched time     = 33.0 us 

Scenario 'Stress Test':
  mean improvement        = 0.93x (slower than OLD)
  std dev of improvement  = 0.11x
  min / max improvement   = 0.65x  /  1.20x
  95% CI of improvement   = [0.88x, 0.98x]
  mean old sched time     = 1874.35 us 
  mean new sched time     = 2032.55 us 
  max  old sched time     = 23.0 us 
  max  new sched time     = 20.0 us 

Scenario 'Full Load Test':
  mean improvement        = 0.89x (slower than OLD)
  std dev of improvement  = 0.11x
  min / max improvement   = 0.63x  /  1.07x
  95% CI of improvement   = [0.84x, 0.94x]
  mean old sched time     = 1798.8 us 
  mean new sched time     = 2048.55 us 
  max  old sched time     = 33.0 us 
  max  new sched time     = 52.0 us

(Figures: benchmark comparison charts)

Future work

Notes

The draft PR #23 has been closed. The notes below reproduce the individual commit messages in this PR.

Previously, the scheduler performed a linear search through the global
task list (kcb->tasks) to find the next TASK_READY task. This approach
limited scalability as the search iterations increased with the number
of tasks, resulting in higher scheduling latency.

To support an O(1) scheduler and improve extensibility, a sched_t
structure is introduced and integrated into kcb. The new structure
contains:

- ready_queues: Holds all runnable tasks, including TASK_RUNNING and
  TASK_READY. The scheduler selects tasks directly from these queues.
- ready_bitmap: Records the state of each ready queue. Using the bitmap,
  the scheduler can locate the highest-priority runnable task in O(1)
  time complexity.
- rr_cursors: Round-robin cursors that track the next task node in each
  ready queue. Each priority level maintains its own RR cursor. The top
  priority cursor is assigned to kcb->task_current, which is advanced
  circularly after each scheduling cycle.
- hart_id: Identifies the scheduler instance per hart (0 for single-hart
  configurations).
- task_idle: The system idle task, executed when no runnable tasks exist.

In the current design, kcb binds only one sched_t instance (hart0) for
single-hart systems, but this structure can be extended for multi-hart
scheduling in the future.

Previously, the list operation for removal was limited to
list_remove(), which immediately freed the list node during the
function call. When removing a running task (TASK_RUNNING), the list
node in the ready queue must not be freed because kcb->task_current
shares the same node.

This change introduces list_unlink(), which detaches the node from
the list without freeing it. The unlinked node is returned to the
caller, allowing safe reuse and improving flexibility in dequeue
operations.

This API will be applied in sched_dequeue_task() for safely removing
tasks from ready queues.
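
A sketch of the list_unlink() contract described here, assuming a doubly linked list with a length field (the actual node layout may differ):

    list_node_t *list_unlink(list_t *list, list_node_t *node)
    {
        node->prev->next = node->next;   /* bypass the node in the chain */
        if (node->next)
            node->next->prev = node->prev;
        node->next = node->prev = NULL;  /* detach, but do NOT free */
        list->length--;
        return node;                     /* ownership passes to the caller */
    }
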
When a task is enqueued into or dequeued from the ready queue, the
bitmap that indicates the ready queue state must be updated.

These three macros can be used in the mo_task_dequeue() and
mo_task_enqueue() APIs to improve readability and maintain
consistency.
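
The macros themselves are not shown in this text; a plausible form, assuming an 8-bit ready bitmap indexed by priority level:

    #define BITMAP_SET(bm, prio)   ((bm) |=  (1u << (prio)))  /* queue gained a task */
    #define BITMAP_CLEAR(bm, prio) ((bm) &= ~(1u << (prio)))  /* queue became empty */
    #define BITMAP_TEST(bm, prio)  (((bm) >> (prio)) & 1u)    /* is the level runnable? */
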
Previously, sched_enqueue_task() only changed the task state without
inserting the task into a ready queue. As a result, the scheduler could
not select the enqueued task for execution.

This change pushes the task into the appropriate ready queue using
list_pushback(), and initializes related attributes such as the
ready bitmap and RR cursor. The ready queue for the corresponding task
priority is initialized on this enqueue path and never released
afterward.

With this updated API, tasks can be enqueued into the ready queue and
selected by the cursor-based O(1) scheduler.

Previously, mo_task_dequeue() was only a stub and returned immediately
without performing any operation. As a result, tasks remained in the
ready queue after being dequeued, leading to potential scheduler
inconsistencies.

This change implements the full dequeue process:
- Searches for the task node in the ready queue by task ID.
- Maintains RR cursor consistency: the RR cursor should always point
  to a valid task node in the ready queue. When removing a task node,
  the cursor is advanced circularly to the next node.
- Unlinks the task node using list_unlink(), which removes the node
  from the ready queue without freeing it. list_unlink() is used
  instead of list_remove() to avoid accidentally freeing
  kcb->task_current when the current running task is dequeued.
- Updates and checks queue_counts: if the ready queue becomes empty,
  the RR cursor is set to NULL and the bitmap is cleared until a new
  task is enqueued.
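
Putting those steps together, the dequeue path might look like this sketch (find_node_by_id(), next_circular(), and BITMAP_CLEAR() are illustrative names, not necessarily the PR's identifiers):

    static void sched_dequeue_task(sched_t *s, tcb_t *task)
    {
        uint8_t prio = task->prio_level;
        list_node_t *node = find_node_by_id(s->ready_queues[prio], task->id);

        /* keep the RR cursor valid: never let it dangle on a removed node */
        if (s->rr_cursors[prio] == node)
            s->rr_cursors[prio] = next_circular(s->ready_queues[prio], node);

        list_unlink(s->ready_queues[prio], node); /* detach without freeing */

        if (--s->queue_counts[prio] == 0) {       /* queue drained */
            s->rr_cursors[prio] = NULL;
            BITMAP_CLEAR(s->ready_bitmap, prio);
        }
    }
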
Previously, mo_task_spawn() only created a task and appended it to the
global task list (kcb->tasks), assigning the first task directly from
the global list node.

This change adds a call to sched_enqueue_task() within the critical
section to enqueue the task into the ready queue and safely initialize
its scheduling attributes. The first task assignment is now aligned
with the RR cursor mechanism to ensure consistency with the O(1)
scheduler.

Previously, the scheduler iterated through the global task list
(kcb->tasks) to find the next TASK_READY task, resulting in O(N)
selection time. This approach limited scalability and caused
inconsistent task rotation under heavy load.

The new scheduling process:
1. Check the ready bitmap and find the highest priority level.
2. Select the RR cursor node from the corresponding ready queue.
3. Advance the selected cursor node circularly.

Why RR cursor instead of pop/enqueue rotation:
- Fewer operations on the ready queue: compared to the pop/enqueue
  approach, which requires two function calls per switch, the RR
  cursor method only advances one pointer per scheduling cycle.
- Cache friendly: always accesses the same cursor node, improving
  cache locality on hot paths.
- Cycle deterministic: RR cursor design allows deterministic task
  rotation and enables potential future extensions such as cycle
  accounting or fairness-based algorithms.

This change introduces a fully O(1) scheduler design based on
per-priority ready queues and round-robin (RR) cursors. Each ready
queue maintains its own cursor, allowing the scheduler to select
the next runnable task in constant time.

Previously, mo_task_suspend() only changed the task state to
TASK_SUSPENDED without removing the task from the ready queue.
As a result, suspended tasks could still be selected by the
scheduler, leading to incorrect task switching and inconsistent
queue states.

This change adds a dequeue operation to remove the corresponding
task node from its ready queue before marking it as suspended.
Additionally, the condition to detect the currently running task
has been updated: the scheduler now compares the TCB pointer
(kcb->task_current->data == task) instead of the list node
(kcb->task_current == node), since kcb->task_current now stores
a ready queue node rather than a global task list node.

If the suspended task is currently running, the CPU will yield
after the task is suspended to allow the scheduler to select
the next runnable task.

This ensures that suspended tasks are no longer visible to the
scheduler until they are resumed.

Previously, mo_task_cancel() only removed the task node from the global
task list (kcb->tasks) but did not remove it from the ready queue.
As a result, the scheduler could still select a canceled task that
remained in the ready queue.

Additionally, a double free could occur because the same node had
already been freed by list_remove().

This change adds a call to sched_dequeue_task() to remove the task from
the ready queue, ensuring that once a task is canceled, it will no longer
appear in the scheduler’s selection path. This also prevents memory
corruption caused by double-freeing list nodes.

Previously, mo_task_resume() only changed the resumed task's state to
TASK_READY but did not enqueue it into the ready queue. As a result,
the scheduler could not select the resumed task for execution.

This change adds sched_enqueue_task() to insert the resumed task into the
appropriate ready queue and update the ready bitmap, ensuring the resumed
task becomes schedulable again.

Previously, mo_task_wakeup() only changed the task state to TASK_READY
without enqueuing the task back into the ready queue. As a result, a
woken-up task could remain invisible to the scheduler and never be
selected for execution.

This change adds a call to sched_enqueue_task() to insert the task into
the appropriate ready queue based on its priority level. The ready
bitmap, task counts of each ready queue, and RR cursor are updated
accordingly to maintain scheduler consistency.

With this update, tasks transitioned from a blocked or suspended state
can be properly scheduled for execution once they are woken up.

This commit introduces a new API, sched_migrate_task(), which enables
migration of a task between ready queues of different priority levels.

The function safely removes the task from its current ready queue and
enqueues it into the target queue, updating the corresponding RR cursor
and ready bitmap to maintain scheduler consistency. This helper will be
used in mo_task_priority() and other task management routines that
adjust task priority dynamically.

Future improvement:
The current enqueue path allocates a new list node for each task
insertion based on its TCB pointer. In the future, this can be optimized
by directly transferring or reusing the existing list node between
ready queues, eliminating the need for additional malloc() and free()
operations during priority migrations.
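
In sketch form, migration composes the two existing helpers (signatures assumed):

    void sched_migrate_task(sched_t *s, tcb_t *task, uint8_t new_prio)
    {
        sched_dequeue_task(s, task);   /* leave the old queue; cursor/bitmap fixed up */
        task->prio_level = new_prio;
        sched_enqueue_task(s, task);   /* join the new queue; bitmap bit set */
    }
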
This change refactors the priority update process in mo_task_priority()
to include early-return checks and proper task migration handling.

- Early-return conditions:
  * Prevent modification of the idle task.
  * Disallow assigning TASK_PRIO_IDLE to non-idle tasks.
  The idle task is created by idle_task_init() during system startup and
  must retain its fixed priority.

- Task migration:
  If the priority-changed task resides in a ready queue (TASK_READY or
  TASK_RUNNING), sched_migrate_task() is called to move it to the queue
  corresponding to the new priority.

- Running task behavior:
  When the current running task changes its own priority, it yields the
CPU so the scheduler can dispatch the next highest-priority task.

This commit introduces the system idle task and its initialization API
(idle_task_init()). The idle task serves as the default execution
context when no other runnable tasks exist in the system.

The sched_idle() function supports both preemptive and cooperative
modes. In sched_t, a list node named task_idle is added to record the
idle task sentinel. The idle task never enters any ready queue and its
priority level cannot be changed.

When idle_task_init() is called, the idle task is initialized as the
first execution context. This eliminates the need for additional APIs
in main() to set up the initial high-priority task during system launch.
This design allows task priorities to be adjusted safely during
app_main(), while keeping the scheduler’s entry point consistent.

When all ready queues are empty, the scheduler should switch
to idle mode and wait for incoming interrupts. This commit
introduces a dedicated helper to handle that transition,
centralizing the logic and improving readability of the
scheduler path to idle.

Previously, when all ready queues were empty, the scheduler
would trigger a kernel panic. This condition should instead
transition into the idle task rather than panic.

The new sched_switch_to_idle() helper centralizes this logic,
making the path to idle clearer and more readable.
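
A minimal sketch of that transition, assuming the field names used elsewhere in these notes:

    static void sched_switch_to_idle(sched_t *s)
    {
        /* all ready queues are empty: no bitmap bit is set */
        kcb->task_current = s->task_idle; /* idle sentinel never sits in a ready queue */
    }
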
The idle task is now initialized in main() during system startup.
This ensures that the scheduler always has a valid execution context
before any user or application tasks are created. Initializing the
idle task early guarantees a safe fallback path when no runnable
tasks exist and keeps the scheduler entry point consistent.

This change sets up the scheduler state during system startup by
assigning kcb->task_current to kcb->harts->task_idle and dispatching
to the idle task as the first execution context.

This commit also keeps the scheduling entry path consistent between
startup and runtime.

Previously, both mo_task_spawn() and idle_task_init() implicitly
bound their created tasks to kcb->task_current as the first execution
context. This behavior caused ambiguity with the scheduler, which is
now responsible for determining the active task during system startup.

This change removes the initial binding logic from both functions,
allowing the startup process (main()) to explicitly assign
kcb->task_current (typically to the idle task) during launch.
This ensures a single, centralized initialization flow and improves
the separation between task creation and scheduling control.

Prepare for O(1) bitmap index lookup by adding a 32-entry De Bruijn
sequence table. The table will be used in later commits to replace
iterative bit scanning. No functional change in this patch.

Implement the helper function that uses a De Bruijn multiply-and-LUT
approach to compute the index of the least-significant set bit in O(1)
time complexity.

This helper is not yet wired into the scheduler logic; integration
will follow in a later commit. No functional change in this patch.

Replace the iterative bitmap scanning with the De Bruijn multiply+LUT
method via the new helper. This change makes top-priority selection
constant-time and deterministic.
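
The classic 32-bit multiply+LUT form is shown below for reference; the exact table and constant in the PR may differ, but the technique is the same. With bit 0 meaning critical priority, the least-significant set bit is the highest runnable level.

    static const uint8_t debruijn_lut[32] = {
        0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
        31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9,
    };

    /* Index of the least-significant set bit; v must be non-zero. */
    static inline uint8_t bitmap_lsb(uint32_t v)
    {
        /* v & -v isolates the lowest set bit; the De Bruijn multiply
         * maps each power of two to a unique 5-bit table index. */
        return debruijn_lut[((v & (0u - v)) * 0x077CB531u) >> 27];
    }
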
Previously, _sched_block() only enqueued the task into the wait
queue and set its state to TASK_BLOCKED. In the new scheduler
design (ready-queue–based), a blocked task must also be removed
from its priority's ready queue to prevent it from being
selected by the scheduler.

This change adds the missing dequeue path for the corresponding
ready queue, ensuring behavior consistency.

Previously, sched_wakeup_task() was limited to internal use within
the scheduler module.
This change makes it globally visible so that it can be reused
in semaphore.c for task wake-up operations.

Previously, mo_sem_signal() only changed the awakened task state
to TASK_READY when a semaphore signal was triggered. In the new
scheduler design, which selects runnable tasks from ready queues,
the awakened task must also be enqueued for scheduling.

This change invokes sched_wakeup_task() to perform the enqueue
operation, ensuring the awakened task is properly inserted into
the ready queue.

Previously, mo_task_delay() only set TASK_BLOCKED and updated
delayed ticks. In the new ready-queue-based scheduler, delayed
tasks must also be removed from the ready queue.

This change calls sched_dequeue_task() in mo_task_delay() so
that the task is properly dequeued from its priority ready
queue when it is delayed.
vicLin8712 closed this on Nov 18, 2025
vicLin8712 deleted the o1-sched branch on November 18, 2025 08:05