@jserv jserv commented Jan 8, 2026

This introduces three optimizations:

  1. Block-level cycle counting
    • Remove per-instruction cycle++ from RVOP macro
    • Pre-compute block->cycle_cost at translation time
    • Add cycle cost at block entry (interpreter) or exit (JIT)
    • Maintains accurate cycle counts with less overhead
  2. Timer derivation from cycle counter
    • Remove per-instruction rv->timer++ in SYSTEM mode
    • Derive timer on-demand: timer = csr_cycle + timer_offset
    • Compute timer only at interrupt check points (rv_check_interrupt)
    • Extend CSR sync to TIME/TIMEH registers for correct derivation
  3. Page-boundary block termination with fallthrough chaining
    • Terminate blocks at 4KB page boundaries
    • Add page_terminated flag to block_t structure
    • Implement fallthrough chaining for non-branch block endings
    • Use branch_taken pointer for fallthrough to next block
    • Enables future O(1) cache invalidation via page index
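
A minimal sketch of the block-level cycle counting idea in item 1. The struct layouts and the helper names `block_finalize` and `block_account` are hypothetical, and the one-cycle-per-instruction cost model is an assumption; only `cycle_cost` and `csr_cycle` are names taken from this PR.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the emulator's structures. */
typedef struct {
    uint32_t n_insn;     /* number of instructions in the block */
    uint64_t cycle_cost; /* computed once, at translation time */
} block_t;

typedef struct {
    uint64_t csr_cycle; /* guest-visible cycle counter */
} riscv_t;

/* Translation time: record the block's total cost.
 * Assumption: every instruction costs exactly one cycle. */
static void block_finalize(block_t *blk, uint32_t n_insn)
{
    blk->n_insn = n_insn;
    blk->cycle_cost = n_insn;
}

/* Execution time: one addition per block replaces one increment
 * per instruction. */
static void block_account(riscv_t *rv, const block_t *blk)
{
    assert(blk->cycle_cost == blk->n_insn); /* holds under the cost model above */
    rv->csr_cycle += blk->cycle_cost;
}
```

Executing a 5-instruction block twice advances `csr_cycle` by 10 with two additions rather than ten increments.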

The combination maintains correctness while reducing per-instruction overhead. Block chaining still works across page-bounded blocks through the fallthrough mechanism.
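
The page-bounded termination and fallthrough chaining described above could look roughly like the sketch below. It assumes fixed 4-byte instructions and a straight-line block with no branches. `block_translate_linear` and `chain_fallthrough` are hypothetical helpers, and the struct is a simplified stand-in; `page_terminated` and `branch_taken` are the field names mentioned in this PR.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12 /* 4 KiB pages */

typedef struct block {
    uint32_t pc_start;
    uint32_t pc_end;            /* address just past the last instruction */
    bool page_terminated;       /* block ended because the page ended */
    struct block *branch_taken; /* reused as the fallthrough link */
} block_t;

/* Extend a block one instruction at a time and stop as soon as the
 * next instruction would start on a different 4 KiB page.
 * Assumption: fixed 4-byte instructions, no branches. */
static void block_translate_linear(block_t *blk, uint32_t pc, uint32_t max_insn)
{
    blk->pc_start = pc;
    blk->page_terminated = false;
    blk->branch_taken = NULL;
    for (uint32_t i = 0; i < max_insn; i++) {
        uint32_t next = pc + 4;
        bool crosses = (pc >> PAGE_SHIFT) != (next >> PAGE_SHIFT);
        pc = next;
        if (crosses) {
            blk->page_terminated = true;
            break;
        }
    }
    blk->pc_end = pc;
}

/* Link a page-terminated block to the block that starts exactly where
 * it ended, so execution can continue without a fresh lookup. */
static void chain_fallthrough(block_t *blk, block_t *next)
{
    assert(blk != NULL && next != NULL);
    if (blk->page_terminated && next->pc_start == blk->pc_end)
        blk->branch_taken = next;
}
```

A block translated from `0xFF8` ends at `0x1000` with `page_terminated` set, and can then be chained to the block starting at `0x1000`.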


Summary by cubic

Reduces hot-path overhead by deriving TIME from CYCLE, page-bounding blocks with fallthrough chaining, and using block-level cycle counting in JIT. Improves performance and enables O(1) cache invalidation per virtual page.

  • New Features

    • Block-level cycle counting (JIT): precompute cycle_cost; add at JIT exit; interpreter keeps per-instruction cycle++ to support chaining.
    • TIME derived from CYCLE: timer = csr_cycle + timer_offset; computed at interrupt checks; CSR sync extended to TIME/TIMEH.
    • Page-bounded blocks with fallthrough chaining and a page index for O(1) SFENCE.VMA invalidate.
  • Bug Fixes

    • Trap-safe MMU translation in JIT: skip RAM/MMIO ops when a fault occurs; treat faults as MMIO via flags.
    • Validate full access range before treating as RAM to prevent boundary overflows.
    • Fix JIT register allocation in GEN_LOAD/GEN_STORE to avoid stale regs after reset_reg and ensure correct mem_base/paddr handling.
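
To illustrate the range-validation fix: checking only the start address lets an access that begins inside RAM run past its end. The sketch below checks the whole access instead. `range_in_ram` is a hypothetical helper, and it assumes RAM occupies physical addresses `[0, ram_size)`, which may not match the emulator's actual memory map.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True only if every byte of [paddr, paddr + size) lies inside RAM.
 * Written as a subtraction so that paddr + size cannot overflow.
 * Assumption: RAM spans physical addresses [0, ram_size). */
static bool range_in_ram(uint64_t paddr, uint32_t size, uint64_t ram_size)
{
    assert(size > 0); /* a zero-length access is not meaningful here */
    return size <= ram_size && paddr <= ram_size - size;
}
```

With a 4 KiB RAM, a 4-byte access at `0xFFC` is accepted, while the same access at `0xFFE` is rejected even though its first byte is in range.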

Written for commit 3283953. Summary will update on new commits.

@jserv jserv added this to the release-2026.1 milestone Jan 8, 2026

@cubic-dev-ai cubic-dev-ai bot left a comment

2 issues found across 7 files

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="src/jit.c">

<violation number="1" location="src/jit.c:2531">
P1: `vm_reg[0]` is used without initialization after `reset_reg()` clears the register mappings. Unlike `do_fuse9` which properly allocates with `map_vm_reg()`, this code uses an uninitialized register index that defaults to 0 (RAX/R0). This could generate incorrect JIT code that conflicts with calling conventions.</violation>
</file>

<file name="src/cache.c">

<violation number="1" location="src/cache.c:367">
P0: Missing call to `page_index_insert`. The page index is never populated because `page_index_insert` is defined but never called. This makes `cache_invalidate_va`'s O(1) lookup iterate over an empty list, failing to invalidate any blocks. Add a call to insert newly cached blocks into the page index.</violation>
</file>
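
The second finding is easier to see with a small model of a per-page index. In the sketch below, the function names `page_index_insert` and `cache_invalidate_va` come from the review, but their signatures, the bucket count, and the struct layouts are assumptions made for illustration. Because blocks are page-bounded, each block belongs to exactly one page, so `pc_start` is enough to pick its bucket.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define INDEX_BUCKETS 64 /* hypothetical size; must be a power of two */

typedef struct block {
    uint32_t pc_start;
    bool valid;
    struct block *page_next; /* next block in the same bucket */
} block_t;

typedef struct {
    block_t *bucket[INDEX_BUCKETS];
} page_index_t;

static size_t bucket_of(uint32_t va)
{
    return (va >> PAGE_SHIFT) & (INDEX_BUCKETS - 1);
}

/* Must be called for every newly cached block; otherwise the index
 * stays empty and invalidation silently does nothing. */
static void page_index_insert(page_index_t *idx, block_t *blk)
{
    size_t b = bucket_of(blk->pc_start);
    blk->valid = true;
    blk->page_next = idx->bucket[b];
    idx->bucket[b] = blk;
}

/* Invalidate and unlink every block on the page containing va.
 * Returns how many blocks were invalidated. */
static unsigned cache_invalidate_va(page_index_t *idx, uint32_t va)
{
    uint32_t vpn = va >> PAGE_SHIFT;
    unsigned n = 0;
    block_t **link = &idx->bucket[bucket_of(va)];
    while (*link) {
        block_t *blk = *link;
        if ((blk->pc_start >> PAGE_SHIFT) == vpn) {
            blk->valid = false;
            *link = blk->page_next; /* unlink from the bucket */
            blk->page_next = NULL;
            n++;
        } else {
            link = &blk->page_next;
        }
    }
    return n;
}
```

If `page_index_insert` is never called, `cache_invalidate_va` walks empty buckets and returns 0 for every address, which is the failure mode the review describes.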


@jserv jserv left a comment

Benchmarks

| Benchmark suite | Current: 3283953 | Previous: ff7e565 | Ratio |
|---|---|---|---|
| Dhrystone | 1642.333 DMIPS | 1647.667 DMIPS | 1.00 |
| CoreMark | 1011.426 iterations/sec | 935.937 iterations/sec | 0.93 |

This comment was automatically generated by workflow using github-action-benchmark.

@jserv jserv force-pushed the system-jit branch 3 times, most recently from c1ec84e to ff7e565 on January 9, 2026 14:31
This introduces three optimizations:
1. Block-level cycle counting
   - Remove per-instruction cycle++ from RVOP macro
   - Pre-compute block->cycle_cost at translation time
   - Add cycle cost at block entry (interpreter) or exit (JIT)
2. Timer derivation from cycle counter (SYSTEM mode)
   - Remove per-instruction rv->timer++
   - Derive timer on-demand: timer = csr_cycle + timer_offset
   - Extend CSR sync to TIME/TIMEH registers
3. Page-boundary block termination with fallthrough chaining
   - Terminate blocks at 4KB page boundaries
   - Implement fallthrough chaining via branch_taken pointer
   - Add page_index_insert() for O(1) cache invalidation
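
The timer derivation in item 2 can be sketched as follows. `csr_cycle` and `timer_offset` are the names used in this commit message; the struct layout and the helpers `timer_get`, `timer_set`, `csr_time_lo`, and `csr_time_hi` are hypothetical, and mapping a timer write onto an offset update is an assumption about how the offset is maintained.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified hart state. */
typedef struct {
    uint64_t csr_cycle;    /* advanced by block-level cycle accounting */
    uint64_t timer_offset; /* timer minus csr_cycle, modulo 2^64 */
} riscv_t;

/* Derive the timer on demand instead of incrementing it per instruction. */
static uint64_t timer_get(const riscv_t *rv)
{
    return rv->csr_cycle + rv->timer_offset;
}

/* Assumption: setting the timer only adjusts the offset. Unsigned
 * wraparound keeps this correct even when t < csr_cycle. */
static void timer_set(riscv_t *rv, uint64_t t)
{
    rv->timer_offset = t - rv->csr_cycle;
    assert(timer_get(rv) == t);
}

/* TIME and TIMEH expose the low and high 32 bits on RV32. */
static uint32_t csr_time_lo(const riscv_t *rv)
{
    return (uint32_t) timer_get(rv);
}

static uint32_t csr_time_hi(const riscv_t *rv)
{
    return (uint32_t) (timer_get(rv) >> 32);
}
```

After `timer_set(&rv, 0x100000005)` and ten more cycles, the derived timer reads `0x10000000F` with no per-instruction timer update in between.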

Fix JIT register allocation in GEN_LOAD/GEN_STORE macros:
   - After reset_reg(), vm_reg[0] was stale (not reallocated)
   - Use temp_reg for paddr, properly allocate registers for mem_base
   - Aligns with patterns used in fused instruction handlers