@jserv
Collaborator

@jserv jserv commented Oct 23, 2025

This implements cooperative multitasking for multi-hart systems using coroutines, enabling efficient SMP emulation with significant CPU usage reduction.

  • WFI instruction callback mechanism for power management
  • CPU usage optimization: ~90% reduction in idle systems
  • Maximum latency: 1ms (acceptable for typical 10ms timer interrupts)

Summary by cubic

Adds coroutine-based cooperative SMP for multi-hart emulation. Harts yield on WFI to cut idle CPU usage by ~90%, with up to 1ms latency.

  • New Features

    • Lightweight coroutine runtime (x86_64/ARM64 assembly, ucontext fallback).
    • Per-hart coroutines and round-robin scheduling in semu_run.
    • WFI callback to suspend a hart; sleep 1ms when all started harts are waiting.
    • Hart execution loop with batched steps and proper trap/ECALL handling.
    • Makefile adds coro.o; hart0 PC explicitly set to 0x0.
  • Bug Fixes

    • Initialize ppn in mmu_translate to avoid undefined behavior.
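
As a rough illustration of the per-hart coroutines and round-robin scheduling summarized above, here is a minimal sketch built on the `ucontext` fallback. Everything in it (`hart_t`, `wfi_yield`, `hart_entry`, `sched_run`, the stack size, the two-hart count) is a hypothetical name or value chosen for this example, not semu's actual code, and the 1ms idle sleep is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <ucontext.h>

#define N_HARTS 2
#define STACK_SIZE (64 * 1024)

/* Hypothetical per-hart state; the real emulator state is far richer. */
typedef struct {
    ucontext_t ctx;
    char stack[STACK_SIZE];
    int steps_left; /* stand-in workload: batches to run before halting */
    int wfi_count;  /* how many times this hart yielded through WFI */
    bool done;
} hart_t;

static ucontext_t sched_ctx;
static hart_t harts[N_HARTS];
static hart_t *current;

/* WFI callback: hand control back to the scheduler. */
static void wfi_yield(void)
{
    assert(current != NULL);
    current->wfi_count++;
    swapcontext(&current->ctx, &sched_ctx);
}

/* Coroutine body: run one "batch", then wait for an interrupt. */
static void hart_entry(void)
{
    hart_t *h = current;
    while (h->steps_left > 0) {
        h->steps_left--; /* stand-in for a batch of emulated instructions */
        wfi_yield();
    }
    h->done = true; /* returning resumes uc_link, i.e. the scheduler */
}

/* Round-robin over all harts until each has finished.
 * Returns the number of context switches into a hart. */
int sched_run(int steps_per_hart)
{
    for (int i = 0; i < N_HARTS; i++) {
        hart_t *h = &harts[i];
        h->steps_left = steps_per_hart;
        h->wfi_count = 0;
        h->done = false;
        getcontext(&h->ctx);
        h->ctx.uc_stack.ss_sp = h->stack;
        h->ctx.uc_stack.ss_size = STACK_SIZE;
        h->ctx.uc_link = &sched_ctx;
        makecontext(&h->ctx, hart_entry, 0);
    }

    int switches = 0;
    bool any_alive = true;
    while (any_alive) {
        any_alive = false;
        for (int i = 0; i < N_HARTS; i++) {
            if (harts[i].done)
                continue;
            any_alive = true;
            current = &harts[i];
            swapcontext(&sched_ctx, &harts[i].ctx);
            switches++;
        }
    }
    current = NULL;
    return switches;
}
```

Each hart only gives up the CPU at an explicit yield point, which is what keeps the whole emulator single-threaded.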


@cubic-dev-ai cubic-dev-ai bot left a comment

2 issues found across 6 files

Prompt for AI agents (all 2 issues)

Understand the root cause of the following 2 issues and fix them.


<file name="main.c">

<violation number="1" location="main.c:962">
Fatal VM errors triggered inside hart_exec_loop now return success to the caller because the SMP path always returns 0 after the loop even when emu->stopped was set by vm_error_report. Please propagate a non-zero error code like the single-hart path does so crashes aren’t silently ignored.</violation>
</file>

<file name="coro.c">

<violation number="1" location="coro.c:290">
Reset coro_state.current_hart to CORO_HART_ID_IDLE when the coroutine leaves so callers don’t see a stale hart ID.</violation>
</file>
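
One possible shape for the fix to the first issue, sketched under stated assumptions: the emulator records an exit code when a fatal error stops it, and the SMP path returns that code instead of an unconditional 0. The struct `emu_sketch_t`, its `exit_code` field, and the helpers `report_fatal`, `hart_loop_sketch`, and `run_smp_sketch` are all hypothetical; only the `stopped` flag and the role of `vm_error_report` come from the review text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical emulator state for this sketch. */
typedef struct {
    bool stopped;  /* loop exit request, as in the review comment */
    int exit_code; /* assumed field: 0 for a clean stop, non-zero if fatal */
    int steps;
} emu_sketch_t;

/* Stand-in for vm_error_report(): record the failure, then stop. */
static void report_fatal(emu_sketch_t *emu, int code)
{
    emu->exit_code = code;
    emu->stopped = true;
}

/* Stand-in hart loop: fails at step `fail_at` (use -1 for "never"),
 * otherwise stops cleanly once `max_steps` have run. */
static void hart_loop_sketch(emu_sketch_t *emu, int fail_at, int max_steps)
{
    assert(emu != NULL);
    while (!emu->stopped) {
        if (emu->steps == fail_at) {
            report_fatal(emu, 2);
            break;
        }
        if (++emu->steps >= max_steps)
            emu->stopped = true;
    }
}

int run_smp_sketch(int fail_at, int max_steps)
{
    emu_sketch_t emu = {0};
    hart_loop_sketch(&emu, fail_at, max_steps);
    /* Propagate the recorded status rather than always returning 0. */
    return emu.exit_code;
}
```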

@shengwen-tw
Collaborator

This looks interesting, but wouldn’t using a real thread be faster than a coroutine for HART emulation?
Is this approach chosen to keep the simplicity of semu?

@jserv
Collaborator Author

jserv commented Oct 24, 2025

> This looks interesting, but wouldn’t using a real thread be faster than a coroutine for HART emulation? Is this approach chosen to keep the simplicity of semu?

Before refining the internal structure toward multiple harts, let’s stick to single-threaded emulation as early versions of QEMU did.

@jserv jserv force-pushed the wfi branch 5 times, most recently from 5b2a16d to 67fa2c5 October 24, 2025 05:38
@sysprog21 sysprog21 deleted a comment from cubic-dev-ai bot Oct 24, 2025
@sysprog21 sysprog21 deleted a comment from cubic-dev-ai bot Oct 24, 2025

@shengwen-tw
Collaborator

shengwen-tw commented Oct 25, 2025

> > This looks interesting, but wouldn’t using a real thread be faster than a coroutine for HART emulation? Is this approach chosen to keep the simplicity of semu?
>
> Before refining the internal structure toward multiple harts, let’s stick to single-threaded emulation as early versions of QEMU did.

I see, I think this is an interesting one to follow.

jserv added 2 commits October 28, 2025 02:02
This implements cooperative multitasking for multi-hart systems using
coroutines, enabling efficient SMP emulation with significant CPU usage
reduction.
- WFI instruction callback mechanism for power management
- CPU usage optimization: ~90% reduction in idle systems
- Maximum latency: 1ms (acceptable for typical 10ms timer interrupts)

The previous implementation used a usleep(1000) busy-wait loop in SMP mode,
causing high CPU usage (~100%) even when all harts were idle in WFI.

This commit implements platform-specific event-driven wait mechanisms:

Linux implementation:
- Use timerfd_create() for 1ms periodic timer
- poll() on timerfd + UART fd for blocking wait
- Consume timerfd events to prevent accumulation
- Reduces CPU usage from ~100% to < 2%

macOS implementation:
- Use kqueue() for event multiplexing
- EVFILT_TIMER for 1ms periodic wakeup
- Blocks on kevent() when all harts in WFI
- Reduces CPU usage from ~100% to < 2%
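
The macOS steps can be sketched in the same way. The helper names `tick_kqueue_create` and `tick_kqueue_wait` are hypothetical, and the timer ident of 1 is arbitrary. On platforms without kqueue the sketch compiles to stubs that fail with ENOSYS, purely so the example builds everywhere.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#if defined(__APPLE__) || defined(__FreeBSD__) || defined(__NetBSD__) || \
    defined(__OpenBSD__)
#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>
#include <unistd.h>

/* Create a kqueue with a 1ms periodic EVFILT_TIMER registered.
 * Returns the kqueue fd, or -1 on error. */
int tick_kqueue_create(void)
{
    int kq = kqueue();
    if (kq < 0)
        return -1;

    struct kevent kev;
    /* With no fflags set, the data field is the period in milliseconds. */
    EV_SET(&kev, 1, EVFILT_TIMER, EV_ADD | EV_ENABLE, 0, 1, NULL);
    if (kevent(kq, &kev, 1, NULL, 0, NULL) < 0) {
        close(kq);
        return -1;
    }
    return kq;
}

/* Block until the timer fires. Returns 1 on a tick, 0 on any other
 * event, -1 on error. */
int tick_kqueue_wait(int kq)
{
    assert(kq >= 0);
    struct kevent ev;
    int n = kevent(kq, NULL, 0, &ev, 1, NULL);
    if (n < 0)
        return -1;
    return (n == 1 && ev.filter == EVFILT_TIMER) ? 1 : 0;
}
#else
/* Stubs for platforms without kqueue (for example Linux). */
int tick_kqueue_create(void)
{
    errno = ENOSYS;
    return -1;
}

int tick_kqueue_wait(int kq)
{
    (void) kq;
    errno = ENOSYS;
    return -1;
}
#endif
```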

Benefits:
- Dramatic CPU usage reduction (> 98%) on both platforms
- Zero latency for UART input (event-driven vs. polling)
- Maintains 1ms responsiveness for timer interrupts
- Event-based architecture easier to extend

Tested on Linux with timerfd - 4-core boot succeeds, CPU < 2%
Tested on macOS with kqueue - 4-core boot succeeds, CPU < 2%

Note: UART input relies on u8250_check_ready() polling in the periodic
update loop. Direct fd monitoring was removed from the macOS
implementation, as kqueue does not support TTY file descriptors.

This moves peripheral polling into the coroutine loop, so SMP runs keep
the same cadence as the single-core path, preventing delayed device IRQs.

It also clears the published coroutine hart id when yielding to avoid
exposing stale scheduler state to callers.
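
A sketch of the stale-ID fix described above, using `ucontext` for brevity. `CORO_HART_ID_IDLE` and `coro_state.current_hart` are names taken from the review comment; the value of the constant, the struct layout, and the functions `coro_init_sketch`, `coro_resume_sketch`, `coro_yield_sketch`, and `coro_current_hart` are assumptions made for this example. The published ID is valid only while the hart is actually running.

```c
#include <assert.h>
#include <stdint.h>
#include <ucontext.h>

#define CORO_HART_ID_IDLE UINT32_MAX /* name from the review; value assumed */
#define STACK_SIZE (64 * 1024)

static struct {
    ucontext_t sched_ctx, hart_ctx;
    uint32_t current_hart; /* published ID of the running hart */
    char stack[STACK_SIZE];
} coro_state;

static uint32_t seen_inside[2]; /* IDs observed from inside the coroutine */
static int n_seen;

uint32_t coro_current_hart(void)
{
    return coro_state.current_hart;
}

void coro_yield_sketch(void)
{
    assert(coro_state.current_hart != CORO_HART_ID_IDLE);
    /* Clear the published ID before leaving the coroutine. */
    coro_state.current_hart = CORO_HART_ID_IDLE;
    swapcontext(&coro_state.hart_ctx, &coro_state.sched_ctx);
}

static void hart_body(void)
{
    seen_inside[n_seen++] = coro_current_hart();
    coro_yield_sketch();
    seen_inside[n_seen++] = coro_current_hart();
}

void coro_init_sketch(void)
{
    n_seen = 0;
    coro_state.current_hart = CORO_HART_ID_IDLE;
    getcontext(&coro_state.hart_ctx);
    coro_state.hart_ctx.uc_stack.ss_sp = coro_state.stack;
    coro_state.hart_ctx.uc_stack.ss_size = STACK_SIZE;
    coro_state.hart_ctx.uc_link = &coro_state.sched_ctx;
    makecontext(&coro_state.hart_ctx, hart_body, 0);
}

void coro_resume_sketch(uint32_t hart_id)
{
    coro_state.current_hart = hart_id;
    swapcontext(&coro_state.sched_ctx, &coro_state.hart_ctx);
    /* Also covers the case where the coroutine finished via uc_link. */
    coro_state.current_hart = CORO_HART_ID_IDLE;
}
```

Resetting the ID on both the yield path and the return path means a caller running in scheduler context never observes a hart ID.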
@jserv jserv merged commit 8f0c958 into master Oct 27, 2025
10 checks passed
@jserv jserv deleted the wfi branch October 27, 2025 19:20