vicLin8712 commented on Nov 17, 2025

New O(1) time complexity scheduler

This PR introduces a new O(1) priority-based scheduler that replaces the original O(n) round-robin scheduler. The previous design scanned the global task list linearly to select the next runnable task (TASK_READY), which became a bottleneck as the number of tasks increased and did not support priority-based task selection.

Changes

The following diagrams illustrate the differences between the original and new schedulers in this PR.

Original scheduler

As shown in the figure, the original scheduler selects the next task by scanning task states. Once a runnable task is found, it is assigned to task_current and a context switch is performed.

(Figure: the original scheduler's linear scan over the global task list)

This linear search introduces a significant performance issue in the scheduler, especially when the number of runnable tasks increases. The original scheduler iterates over the task list circularly, but because it cannot guarantee that all tasks are visited safely, the iteration count is capped with an artificial limit (IMAX = 500).

New scheduler design in this PR

The new scheduler introduces a sched_t structure that provides constant-time (O(1)) tracking and selection of runnable tasks. The main components are:

  • Bitmap (bitmap)
    A compact bitmask where each bit (0–7) represents one priority level (from bit 0 critical to bit 7 idle). A bit is set when at least one task of the corresponding priority is runnable. This enables O(1) identification of the highest runnable priority via a De Bruijn–based least-significant-bit (LSB) helper.

  • Ready queues (ready_queue[])
    An array of per-priority linked lists. Each ready queue contains only the runnable tasks of its priority level. Blocked, suspended, or delayed tasks are removed from this list; waking up or resuming a task re-enqueues it.

  • Round-robin cursor (rr_cursor[])
    For each priority level, an RR cursor tracks the next task in the corresponding ready queue for round-robin scheduling among tasks of the same priority.

(Figure: sched_t with priority bitmap, per-priority ready queues, and RR cursors)
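
A minimal sketch of the sched_t layout implied by the description above and the commit notes later in this page (field names follow the commit messages; exact types and queue internals are assumptions):

    typedef struct {
        uint32_t     ready_bitmap;     /* bit i set => priority level i has runnable tasks */
        list_t      *ready_queues[8];  /* per-priority lists of TASK_READY/TASK_RUNNING tasks */
        list_node_t *rr_cursors[8];    /* next node to run within each priority level */
        uint32_t     hart_id;          /* scheduler instance id (0 on single-hart builds) */
        list_node_t *task_idle;        /* idle task sentinel; never enqueued in a ready queue */
    } sched_t;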

New task selection logic in this PR

The new scheduler selects the next task in three main steps:

  1. Find the highest runnable priority from the bitmap
    The scheduler uses a De Bruijn–based helper on the bitmap to obtain the index of the highest-priority runnable level in O(1) time.

  2. Pick the next task via the round-robin cursor
    For that priority level, the scheduler reads the corresponding rr_cursor, which points to the next runnable task in the ready queue, and assigns it to task_current.

  3. Advance the round-robin cursor
    After selecting the task, the rr_cursor is advanced to the next node in the ready queue (wrapping around when reaching the end), preserving round-robin scheduling among tasks of the same priority.

With this design, the scheduler no longer scans the entire task list. Instead, it uses the bitmap plus per-priority cursors to achieve deterministic O(1) task selection while still providing fairness within each priority level.
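
As a rough illustration, the selection path could look like the following sketch. This is not the PR's actual code: find_lsb() and the list field names (head, next) are assumed identifiers.

    static list_node_t *sched_select_next(sched_t *s)
    {
        if (!s->ready_bitmap)                     /* no runnable task at any level */
            return s->task_idle;                  /* fall back to the idle sentinel */

        uint8_t prio = find_lsb(s->ready_bitmap); /* step 1: highest runnable priority */
        list_node_t *node = s->rr_cursors[prio];  /* step 2: task picked by the cursor */

        /* step 3: advance the cursor circularly for same-priority round-robin */
        s->rr_cursors[prio] = node->next ? node->next : s->ready_queues[prio]->head;
        return node;                              /* becomes kcb->task_current */
    }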

Features

The new scheduler includes the following features:

  • O(1) priority-based scheduler

    • Per-priority (8 total) ready queues for runnable tasks.
    • Bitmap to track which priority levels have ready tasks.
    • O(1) per-priority task count tracking.
    • De Bruijn–based helper to find the highest-priority ready task in O(1).
  • Strict priority scheduling policy

    • Higher-priority tasks always preempt lower-priority ones.
    • Round-robin within the same priority level (RR cursor preserved).
  • Default idle task (sysmem)

    • Automatically initialized during system startup, ensuring the kernel always has a runnable task.
    • Managed directly by the sched_t structure and does not appear in any ready queue or global task list.
    • Serves as the initial running task and yields immediately once user tasks become available after system initialization.

Implementation detail

This PR introduces an O(1) scheduler by refactoring the internal scheduling logic and reorganizing task management around a new data structure sched_t. The sched_t instance contains three key components: a bitmap that tracks which priority levels contain runnable tasks, an array of per-priority ready queues, and round-robin cursors used to determine the next task within each priority level.

All enqueue and dequeue operations now funnel through unified helpers (sched_enqueue_task() and sched_dequeue_task()), ensuring consistent updates to both the bitmap and ready queues. Task state transitions were updated accordingly: when a task becomes blocked, suspended, delayed, or cancelled, it is removed from its ready queue; when it becomes runnable again, it is reinserted into the appropriate queue.

The scheduler's main selection function now uses the bitmap and a De Bruijn–based LSB helper to perform constant-time priority lookup, then reads and advances the per-priority round-robin cursor to select the next task. The idle task is handled specially: it is not placed in any ready queue and is selected only when all bitmap bits are clear.

No changes were made to the global task list structure or the task state model; only the scheduling backend has been redesigned to provide deterministic behavior and strict priority semantics.

Task state transition

The task state machine is unchanged from the original scheduler; the new scheduler only adds the ready-queue dequeue/enqueue path.

All tasks in the TASK_READY and TASK_RUNNING states reside in the ready queue of their priority level. Entering either of these states enqueues the task into the ready queue; leaving them dequeues it.

(Figure: task state transition diagram)

Validation

1. Backward compatible

All applications under the app/ directory have been executed and verified to run correctly without modification. No functional regressions were observed when switching from the original scheduler to the new O(1) scheduler.

2. Unit test

This unit test focuses on verifying the consistency of the bitmap and the O(1) task count tracking maintained in sched_t during task state transitions and priority changes.

Approach

  • A dedicated controller task is created with priority TASK_PRIO_CRIT to orchestrate the entire test process and ensure deterministic sequencing.
  • After each state change of the test tasks, the unit test checks both bitmap correctness and per-priority task count consistency to ensure alignment with the ready-queue state.

Task types

  • Controller task: Responsible for coordinating the test flow and triggering all state transitions.
  • Delay task: A runnable task that enters TASK_BLOCKED through mo_task_delay(), allowing verification of dequeue behavior and ready-queue updates.
  • Normal task: A simple infinite-loop task that remains runnable unless externally suspended or cancelled, serving as the primary subject for state transition tests.
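
In sketch form, the per-transition check amounts to the following (hypothetical helper; queue_counts is the counter mentioned in the commit notes below): a priority bit must be set exactly when its ready queue is non-empty, and the tracked count must agree.

    static bool sched_state_consistent(const sched_t *s)
    {
        for (int prio = 0; prio < 8; prio++) {
            bool bit_set   = (s->ready_bitmap >> prio) & 1u;
            bool has_tasks = s->queue_counts[prio] > 0;
            if (bit_set != has_tasks)
                return false;  /* bitmap does not mirror queue population */
        }
        return true;
    }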

Verified state points
The bitmap and task count are verified after each of the following actions.

  • Normal task state transitions

    • Task creation (TASK_READY).
    • Priority changes.
    • Suspension initiated by the controller (TASK_READY → TASK_SUSPEND).
    • Resumption by the controller (TASK_SUSPEND → TASK_READY).
    • Cancellation by the controller (TASK_READY → TASK_CANCELLED), ensuring it is removed from ready queues and no bitmap bits remain set.
  • Blocked task behavior (TASK_RUNNING → TASK_BLOCKED)

    • Delay task is created and its priority changed to match the controller's priority (TASK_READY).
    • When the controller yields, the delay task becomes active, invokes mo_task_delay(), transitions into TASK_BLOCKED, and the controller resumes execution. The test verifies that the blocked task is fully removed from the ready queue and its priority bit is cleared in the bitmap.

Expected results
All state transitions maintain consistent bitmap states, correct ready-queue membership, and accurate per-priority task count tracking. No unexpected runnable tasks appear, and no ready-queue entries persist after a task transitions to BLOCKED, SUSPENDED, or CANCELLED.

Test result

Linmo kernel is starting...
Heap initialized, 130005992 bytes available
idle id 1: entry=80001900 stack=80004488 size=4096
task 2: entry=80000788 stack=80005508 size=4096 prio_level=4 time_slice=5
Scheduler mode: Preemptive
Starting RR-cursor based scheduler test suits...

=== Testing Bitmap and Task Count Consistency ===
task 3: entry=80000168 stack=80006634 size=4096 prio_level=4 time_slice=5
PASS: Bitmap is consistent when TASK_READY
PASS: Task count is consistent when TASK_READY
PASS: Bitmap is consistent when priority migration
PASS: Task count is consistent when priority migration
PASS: Bitmap is consistent when TASK_SUSPENDED
PASS: Task count is consistent when TASK_SUSPENDED
PASS: Bitmap is consistent when TASK_READY from TASK_SUSPENDED
PASS: Task count is consistent when TASK_READY from TASK_SUSPENDED
PASS: Bitmap is consistent when task canceled
PASS: Task count is consistent when task canceled
task 4: entry=80000178 stack=80006634 size=4096 prio_level=4 time_slice=5
PASS: Task count is consistent when task canceled
PASS: Task count is consistent when task blocked

=== Test Results ===
Tests passed: 12
Tests failed: 0
Total tests: 12
All tests PASSED!
RR-cursor based scheduler tests completed successfully.

Note

  1. The term TASK_CANCELLED in this document is used only for explanation. It is not an actual state in the task state machine, but represents the condition where a task has been removed from all scheduling structures and no longer exists in the system.
  2. The task states shown in parentheses (e.g., (TASK_READY)) refer to the state of the test tasks being created or manipulated, not the state of the controller task.


3. Benchmark

The benchmark compares the original O(n) scheduler with the new O(1) scheduler under multiple task-load scenarios. Each scenario measures the average scheduling latency observed in QEMU using the existing benchmarking framework.

Test suites

Benchmark methodology

  • Same build configuration for both schedulers.
  • Each scenario repeatedly triggers scheduling events and measures the average scheduling latency.
  • The results reflect pure scheduler overhead (not influenced by application logic).
  • Reported latency is the average of multiple runs, with outliers filtered by the benchmark script.

Scenarios
The benchmark covers the following scenarios:

  • Minimal active – Few tasks, low diversity.
  • Moderate active – Medium task count with mixed priorities.
  • Heavy active – Many runnable tasks across all priority levels.
  • Stress test – Real-time–biased workload with uneven priority distribution.
  • Full load test.

Test results

Scenario 'Minimal Active':
  mean improvement        = 2.68x faster
  std dev of improvement  = 0.34x
  min / max improvement   = 1.75x  /  3.35x
  95% CI of improvement   = [2.54x, 2.83x]
  mean old sched time     = 5616.25 us
  mean new sched time     = 2119.0 us
  max  old sched time     = 47.0 us
  max  new sched time     = 37.0 us

Scenario 'Moderate Active':
  mean improvement        = 1.80x faster
  std dev of improvement  = 0.27x
  min / max improvement   = 1.27x  /  2.51x
  95% CI of improvement   = [1.68x, 1.92x]
  mean old sched time     = 3887.6 us 
  mean new sched time     = 2179.45 us 
  max  old sched time     = 40.0 us 
  max  new sched time     = 23.0 us 

Scenario 'Heavy Active':
  mean improvement        = 1.02x faster
  std dev of improvement  = 0.08x
  min / max improvement   = 0.84x  /  1.17x
  95% CI of improvement   = [0.98x, 1.06x]
  mean old sched time     = 2150.15 us 
  mean new sched time     = 2119.1 us 
  max  old sched time     = 73.0 us 
  max  new sched time     = 33.0 us 

Scenario 'Stress Test':
  mean improvement        = 0.93x (slower than OLD)
  std dev of improvement  = 0.11x
  min / max improvement   = 0.65x  /  1.20x
  95% CI of improvement   = [0.88x, 0.98x]
  mean old sched time     = 1874.35 us 
  mean new sched time     = 2032.55 us 
  max  old sched time     = 23.0 us 
  max  new sched time     = 20.0 us 

Scenario 'Full Load Test':
  mean improvement        = 0.89x (slower than OLD)
  std dev of improvement  = 0.11x
  min / max improvement   = 0.63x  /  1.07x
  95% CI of improvement   = [0.84x, 0.94x]
  mean old sched time     = 1798.8 us 
  mean new sched time     = 2048.55 us 
  max  old sched time     = 33.0 us 
  max  new sched time     = 52.0 us

(Figures: benchmark comparison charts)

Future work

Notes

The draft PR #23 has been closed. The notes below reproduce the individual commit messages in this PR.

Previously, the scheduler performed a linear search through the global
task list (kcb->tasks) to find the next TASK_READY task. This approach
limited scalability as the search iterations increased with the number
of tasks, resulting in higher scheduling latency.

To support an O(1) scheduler and improve extensibility, a sched_t
structure is introduced and integrated into kcb. The new structure
contains:

- ready_queues: Holds all runnable tasks, including TASK_RUNNING and
  TASK_READY. The scheduler selects tasks directly from these queues.
- ready_bitmap: Records the state of each ready queue. Using the bitmap,
  the scheduler can locate the highest-priority runnable task in O(1)
  time complexity.
- rr_cursors: Round-robin cursors that track the next task node in each
  ready queue. Each priority level maintains its own RR cursor. The top
  priority cursor is assigned to kcb->task_current, which is advanced
  circularly after each scheduling cycle.
- hart_id: Identifies the scheduler instance per hart (0 for single-hart
  configurations).
- task_idle: The system idle task, executed when no runnable tasks exist.

In the current design, kcb binds only one sched_t instance (hart0) for
single-hart systems, but this structure can be extended for multi-hart
scheduling in the future.

Previously, the list operation for removal was limited to
list_remove(), which immediately freed the list node during the
function call. When removing a running task (TASK_RUNNING), the list
node in the ready queue must not be freed because kcb->task_current
shares the same node.

This change introduces list_unlink(), which detaches the node from
the list without freeing it. The unlinked node is returned to the
caller, allowing safe reuse and improving flexibility in dequeue
operations.

This API will be applied in sched_dequeue_task() for safely removing
tasks from ready queues.
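
A sketch of the list_unlink() contract described here, assuming a doubly linked list with a length field (the actual node layout may differ):

    list_node_t *list_unlink(list_t *list, list_node_t *node)
    {
        node->prev->next = node->next;   /* bypass the node in the chain */
        if (node->next)
            node->next->prev = node->prev;
        node->next = node->prev = NULL;  /* detach, but do NOT free */
        list->length--;
        return node;                     /* ownership passes to the caller */
    }
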
When a task is enqueued into or dequeued from the ready queue, the
bitmap that indicates the ready queue state must be updated.

These three macros can be used in the mo_task_dequeue() and
mo_task_enqueue() APIs to improve readability and maintain
consistency.
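
The macros themselves are not shown in this text; a plausible form, assuming an 8-bit ready bitmap indexed by priority level:

    #define BITMAP_SET(bm, prio)   ((bm) |=  (1u << (prio)))  /* queue gained a task */
    #define BITMAP_CLEAR(bm, prio) ((bm) &= ~(1u << (prio)))  /* queue became empty */
    #define BITMAP_TEST(bm, prio)  (((bm) >> (prio)) & 1u)    /* is the level runnable? */
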
Previously, sched_enqueue_task() only changed the task state without
inserting the task into a ready queue. As a result, the scheduler could
not select the enqueued task for execution.

This change pushes the task into the appropriate ready queue using
list_pushback(), and initializes related attributes such as the
ready bitmap and RR cursor. The ready queue for the corresponding task
priority is initialized on this enqueue path and never released
afterward.

With this updated API, tasks can be enqueued into the ready queue and
selected by the cursor-based O(1) scheduler.

Previously, mo_task_dequeue() was only a stub and returned immediately
without performing any operation. As a result, tasks remained in the
ready queue after being dequeued, leading to potential scheduler
inconsistencies.

This change implements the full dequeue process:
- Searches for the task node in the ready queue by task ID.
- Maintains RR cursor consistency: the RR cursor should always point
  to a valid task node in the ready queue. When removing a task node,
  the cursor is advanced circularly to the next node.
- Unlinks the task node using list_unlink(), which removes the node
  from the ready queue without freeing it. list_unlink() is used
  instead of list_remove() to avoid accidentally freeing
  kcb->task_current when the current running task is dequeued.
- Updates and checks queue_counts: if the ready queue becomes empty,
  the RR cursor is set to NULL and the bitmap is cleared until a new
  task is enqueued.
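
Putting those steps together, the dequeue path might look like this sketch (find_node_by_id(), next_circular(), and BITMAP_CLEAR() are illustrative names, not necessarily the PR's identifiers):

    static void sched_dequeue_task(sched_t *s, tcb_t *task)
    {
        uint8_t prio = task->prio_level;
        list_node_t *node = find_node_by_id(s->ready_queues[prio], task->id);

        /* keep the RR cursor valid: never let it dangle on a removed node */
        if (s->rr_cursors[prio] == node)
            s->rr_cursors[prio] = next_circular(s->ready_queues[prio], node);

        list_unlink(s->ready_queues[prio], node); /* detach without freeing */

        if (--s->queue_counts[prio] == 0) {       /* queue drained */
            s->rr_cursors[prio] = NULL;
            BITMAP_CLEAR(s->ready_bitmap, prio);
        }
    }
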
Previously, mo_task_spawn() only created a task and appended it to the
global task list (kcb->tasks), assigning the first task directly from
the global list node.

This change adds a call to sched_enqueue_task() within the critical
section to enqueue the task into the ready queue and safely initialize
its scheduling attributes. The first task assignment is now aligned
with the RR cursor mechanism to ensure consistency with the O(1)
scheduler.

Previously, the scheduler iterated through the global task list
(kcb->tasks) to find the next TASK_READY task, resulting in O(N)
selection time. This approach limited scalability and caused
inconsistent task rotation under heavy load.

The new scheduling process:
1. Check the ready bitmap and find the highest priority level.
2. Select the RR cursor node from the corresponding ready queue.
3. Advance the selected cursor node circularly.

Why RR cursor instead of pop/enqueue rotation:
- Fewer operations on the ready queue: compared to the pop/enqueue
  approach, which requires two function calls per switch, the RR
  cursor method only advances one pointer per scheduling cycle.
- Cache friendly: always accesses the same cursor node, improving
  cache locality on hot paths.
- Cycle deterministic: RR cursor design allows deterministic task
  rotation and enables potential future extensions such as cycle
  accounting or fairness-based algorithms.

This change introduces a fully O(1) scheduler design based on
per-priority ready queues and round-robin (RR) cursors. Each ready
queue maintains its own cursor, allowing the scheduler to select
the next runnable task in constant time.

Previously, mo_task_suspend() only changed the task state to
TASK_SUSPENDED without removing the task from the ready queue.
As a result, suspended tasks could still be selected by the
scheduler, leading to incorrect task switching and inconsistent
queue states.

This change adds a dequeue operation to remove the corresponding
task node from its ready queue before marking it as suspended.
Additionally, the condition to detect the currently running task
has been updated: the scheduler now compares the TCB pointer
(kcb->task_current->data == task) instead of the list node
(kcb->task_current == node), since kcb->task_current now stores
a ready queue node rather than a global task list node.

If the suspended task is currently running, the CPU will yield
after the task is suspended to allow the scheduler to select
the next runnable task.

This ensures that suspended tasks are no longer visible to the
scheduler until they are resumed.

Previously, mo_task_cancel() only removed the task node from the global
task list (kcb->tasks) but did not remove it from the ready queue.
As a result, the scheduler could still select a canceled task that
remained in the ready queue.

Additionally, a double free could occur because the same node had
already been freed by list_remove().

This change adds a call to sched_dequeue_task() to remove the task from
the ready queue, ensuring that once a task is canceled, it will no longer
appear in the scheduler’s selection path. This also prevents memory
corruption caused by double-freeing list nodes.

Previously, mo_task_resume() only changed the resumed task's state to
TASK_READY but did not enqueue it into the ready queue. As a result,
the scheduler could not select the resumed task for execution.

This change adds sched_enqueue_task() to insert the resumed task into the
appropriate ready queue and update the ready bitmap, ensuring the resumed
task becomes schedulable again.

Previously, mo_task_wakeup() only changed the task state to TASK_READY
without enqueuing the task back into the ready queue. As a result, a
woken-up task could remain invisible to the scheduler and never be
selected for execution.

This change adds a call to sched_enqueue_task() to insert the task into
the appropriate ready queue based on its priority level. The ready
bitmap, task counts of each ready queue, and RR cursor are updated
accordingly to maintain scheduler consistency.

With this update, tasks transitioned from a blocked or suspended state
can be properly scheduled for execution once they are woken up.

This commit introduces a new API, sched_migrate_task(), which enables
migration of a task between ready queues of different priority levels.

The function safely removes the task from its current ready queue and
enqueues it into the target queue, updating the corresponding RR cursor
and ready bitmap to maintain scheduler consistency. This helper will be
used in mo_task_priority() and other task management routines that
adjust task priority dynamically.

Future improvement:
The current enqueue path allocates a new list node for each task
insertion based on its TCB pointer. In the future, this can be optimized
by directly transferring or reusing the existing list node between
ready queues, eliminating the need for additional malloc() and free()
operations during priority migrations.
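
In sketch form, migration composes the two existing helpers (signatures assumed):

    void sched_migrate_task(sched_t *s, tcb_t *task, uint8_t new_prio)
    {
        sched_dequeue_task(s, task);   /* leave the old queue; cursor/bitmap fixed up */
        task->prio_level = new_prio;
        sched_enqueue_task(s, task);   /* join the new queue; bitmap bit set */
    }
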
This change refactors the priority update process in mo_task_priority()
to include early-return checks and proper task migration handling.

- Early-return conditions:
  * Prevent modification of the idle task.
  * Disallow assigning TASK_PRIO_IDLE to non-idle tasks.
  The idle task is created by idle_task_init() during system startup and
  must retain its fixed priority.

- Task migration:
  If the priority-changed task resides in a ready queue (TASK_READY or
  TASK_RUNNING), sched_migrate_task() is called to move it to the queue
  corresponding to the new priority.

- Running task behavior:
  When the current running task changes its own priority, it yields the
CPU so the scheduler can dispatch the next highest-priority task.

This commit introduces the system idle task and its initialization API
(idle_task_init()). The idle task serves as the default execution
context when no other runnable tasks exist in the system.

The sched_idle() function supports both preemptive and cooperative
modes. In sched_t, a list node named task_idle is added to record the
idle task sentinel. The idle task never enters any ready queue and its
priority level cannot be changed.

When idle_task_init() is called, the idle task is initialized as the
first execution context. This eliminates the need for additional APIs
in main() to set up the initial high-priority task during system launch.
This design allows task priorities to be adjusted safely during
app_main(), while keeping the scheduler’s entry point consistent.

When all ready queues are empty, the scheduler should switch
to idle mode and wait for incoming interrupts. This commit
introduces a dedicated helper to handle that transition,
centralizing the logic and improving readability of the
scheduler path to idle.

Previously, when all ready queues were empty, the scheduler
would trigger a kernel panic. This condition should instead
transition into the idle task rather than panic.

The new sched_switch_to_idle() helper centralizes this logic,
making the path to idle clearer and more readable.
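
A minimal sketch of that transition, assuming the field names used elsewhere in these notes:

    static void sched_switch_to_idle(sched_t *s)
    {
        /* all ready queues are empty: no bitmap bit is set */
        kcb->task_current = s->task_idle; /* idle sentinel never sits in a ready queue */
    }
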
The idle task is now initialized in main() during system startup.
This ensures that the scheduler always has a valid execution context
before any user or application tasks are created. Initializing the
idle task early guarantees a safe fallback path when no runnable
tasks exist and keeps the scheduler entry point consistent.

This change sets up the scheduler state during system startup by
assigning kcb->task_current to kcb->harts->task_idle and dispatching
to the idle task as the first execution context.

This commit also keeps the scheduling entry path consistent between
startup and runtime.

Previously, both mo_task_spawn() and idle_task_init() implicitly
bound their created tasks to kcb->task_current as the first execution
context. This behavior caused ambiguity with the scheduler, which is
now responsible for determining the active task during system startup.

This change removes the initial binding logic from both functions,
allowing the startup process (main()) to explicitly assign
kcb->task_current (typically to the idle task) during launch.
This ensures a single, centralized initialization flow and improves
the separation between task creation and scheduling control.

Prepare for O(1) bitmap index lookup by adding a 32-entry De Bruijn
sequence table. The table will be used in later commits to replace
iterative bit scanning. No functional change in this patch.

Implement the helper function that uses a De Bruijn multiply-and-LUT
approach to compute the index of the least-significant set bit in O(1)
time complexity.

This helper is not yet wired into the scheduler logic; integration
will follow in a later commit. No functional change in this patch.

Replace the iterative bitmap scanning with the De Bruijn multiply+LUT
method via the new helper. This change makes top-priority selection
constant-time and deterministic.
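
The classic 32-bit multiply+LUT form is shown below for reference; the exact table and constant in the PR may differ, but the technique is the same. With bit 0 meaning critical priority, the least-significant set bit is the highest runnable level.

    static const uint8_t debruijn_lut[32] = {
        0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
        31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9,
    };

    /* Index of the least-significant set bit; v must be non-zero. */
    static inline uint8_t bitmap_lsb(uint32_t v)
    {
        /* v & -v isolates the lowest set bit; the De Bruijn multiply
         * maps each power of two to a unique 5-bit table index. */
        return debruijn_lut[((v & (0u - v)) * 0x077CB531u) >> 27];
    }
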
Previously, _sched_block() only enqueued the task into the wait
queue and set its state to TASK_BLOCKED. In the new scheduler
design (ready-queue–based), a blocked task must also be removed
from its priority's ready queue to prevent it from being
selected by the scheduler.

This change adds the missing dequeue path for the corresponding
ready queue, ensuring behavior consistency.

Previously, sched_wakeup_task() was limited to internal use within
the scheduler module.
This change makes it globally visible so that it can be reused
in semaphore.c for task wake-up operations.

Previously, mo_sem_signal() only changed the awakened task state
to TASK_READY when a semaphore signal was triggered. In the new
scheduler design, which selects runnable tasks from ready queues,
the awakened task must also be enqueued for scheduling.

This change invokes sched_wakeup_task() to perform the enqueue
operation, ensuring the awakened task is properly inserted into
the ready queue.

Previously, mo_task_delay() only set TASK_BLOCKED and updated
delayed ticks. In the new ready-queue-based scheduler, delayed
tasks must also be removed from the ready queue.

This change calls sched_dequeue_task() in mo_task_delay() so
that the task is properly dequeued from its priority ready
queue when it is delayed.
vicLin8712 closed this on Nov 18, 2025
vicLin8712 deleted the o1-sched branch on November 18, 2025 08:05