
Conversation

@alees24 alees24 commented Jul 25, 2025

  • Migrate the TL-UL logic into a submodule to support multiple ports.
  • Separate the I and D ports for increased performance.
  • Implement read buffering on each port.
  • Coalesce writes on the D port to form burst writes.
  • Route write notifications from the D port to the I port.
  • Update the read buffers in response to write notifications.
  • Introduce a single-entry, zero-latency FIFO on each TL-UL connection to avoid a combinatorial loop that would otherwise exist between the LSU and instruction fetch ports of the Ibex when the HyperRAM is presented to both ports (see the sketch after this list).
  • Use a single SRAM model of the HyperRAM for simulation purposes (when requested) and for the Sonata XL synthesis target.
  • Make the SRAM model dual-ported on the TL-UL bus, increasing performance on the Sonata XL target too.
  • Support simulation of Sonata XL (with TARGET_XL_BOARD defined).
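
For illustration, here is a minimal sketch of the single-entry, zero-latency FIFO idea, written against a generic valid/ready stream rather than the PR's actual TL-UL channels; the module and signal names are illustrative only, not the RTL in this PR. The key property is that the upstream ready depends only on local state, so the downstream ready can no longer feed back combinatorially to the producer, whilst an empty buffer still passes data through in the same cycle (hence 'zero-latency').

module passthrough_fifo #(
  parameter int unsigned Width = 32
) (
  input  logic             clk_i,
  input  logic             rst_ni,
  // Upstream side (e.g. towards an Ibex port).
  input  logic             valid_i,
  output logic             ready_o,
  input  logic [Width-1:0] data_i,
  // Downstream side (e.g. towards the HyperRAM controller).
  output logic             valid_o,
  input  logic             ready_i,
  output logic [Width-1:0] data_o
);
  logic             full_q;
  logic [Width-1:0] data_q;

  // Upstream ready is a function of local state alone, never of
  // ready_i; this registered path is what breaks the combinatorial loop.
  assign ready_o = ~full_q;

  // When empty, the input passes straight through (zero latency);
  // when full, the stored beat is presented instead.
  assign valid_o = full_q | valid_i;
  assign data_o  = full_q ? data_q : data_i;

  always_ff @(posedge clk_i or negedge rst_ni) begin
    if (!rst_ni) begin
      full_q <= 1'b0;
    end else if (full_q) begin
      // Drain the stored beat once the consumer accepts it.
      if (ready_i) full_q <= 1'b0;
    end else if (valid_i && !ready_i) begin
      // Consumer stalled: capture the in-flight beat.
      full_q <= 1'b1;
      data_q <= data_i;  // No reset needed: only consulted whilst full_q is set.
    end
  end
endmodule

Beats flow straight through whenever the buffer is empty and the consumer is ready; only on a downstream stall is a beat captured, and the buffer then drains before accepting more.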

Background:

The HyperRAM interface was previously an FPGA-only design, making it difficult to develop a more sophisticated implementation that could offer higher performance. The first implementation was simple and reliable, but its performance was comparatively poor because (i) each individual TL-UL read/write transaction was issued in isolation to the HBMC/HyperRAM device (HyperRAM is intended to offer high throughput using burst transfers, at the cost of comparatively high latency), and (ii) the HyperRAM was presented as a single slave port shared by both the I and D ports.

This body of work on improving the performance of the HyperRAM interface therefore started by bringing up the HBMC in Verilator simulation, making it possible to develop and debug a more sophisticated interface with buffering and write-coalescing logic.

Performance on FPGA

These figures are from executing the program sw/cheri/checks/hyperram_test, which has been extended with some additional directed tests exercising the modified logic and functionality:

Previous implementation (the numbers are cycle counts; lower numbers represent faster execution):

Running RND cap test...PASS!
Running RND data test...PASS!
Running RND data & address test...PASS!
Running 0101 stripe test... (7340148 cycles)...PASS!
Running 1001 stripe test... (7340122 cycles)...PASS!
Running 0000_1111 stripe test... (7340132 cycles)...PASS!
Running Execution test... (15666097 cycles)...PASS!
Running performance test with icache enabled...
    copy:   17054 - cmp:  24631 - total:  41685...    copy:   18539 - cmp:  24634 - total:  43173...PASS!
Running performance test with icache disabled...
    copy:   17069 - cmp:  24887 - total:  41956...    copy:   42925 - cmp:  24883 - total:  67808...PASS!
Running alignment tests with cleaning...
without cleaning...
PASS!
Running buffering test...PASS!
Running write tests...
  Test type 0: 1024 iteration(s)
  Test type 1: 1024 iteration(s)
  Test type 2: 1024 iteration(s)
  Test type 3: 1024 iteration(s)
  Test type 4: 1024 iteration(s)
  Test type 5: 1024 iteration(s)
  Test type 6: 1024 iteration(s)
  Test type 7: 1024 iteration(s)
  result...PASS!
Single iteration took 128378015 cycles

Modified implementation, using the RTL from this PR:

Running RND cap test...PASS!
Running RND data test...PASS!
Running RND data & address test...PASS!
Running 0101 stripe test... (4947985 cycles)...PASS!
Running 1001 stripe test... (4947982 cycles)...PASS!
Running 0000_1111 stripe test... (4947986 cycles)...PASS!
Running Execution test... (9792540 cycles)...PASS!
Running performance test with icache enabled...
    copy:    3754 - cmp:   6327 - total:  10081...    copy:    3825 - cmp:   6325 - total:  10150...PASS!
Running performance test with icache disabled...
    copy:    3754 - cmp:   6194 - total:   9948...    copy:    3819 - cmp:   6194 - total:  10013...PASS!
Running alignment tests with cleaning...
without cleaning...
PASS!
Running buffering test...PASS!
Running write tests...
  Test type 0: 1024 iteration(s)
  Test type 1: 1024 iteration(s)
  Test type 2: 1024 iteration(s)
  Test type 3: 1024 iteration(s)
  Test type 4: 1024 iteration(s)
  Test type 5: 1024 iteration(s)
  Test type 6: 1024 iteration(s)
  Test type 7: 1024 iteration(s)
  result...PASS!
Single iteration took 87542355 cycles

Linear code execution from HyperRAM

The test code (to be raised separately) is just an 8KiB sequence of cincoffset ca0, ca0, 0x1 instructions, each acting as a single-cycle 'no-op' whilst making it possible to check that every instruction has been executed as intended.

Previous implementation:

Running linear execution test with icache enabled...
100 iteration(s) took 1367932 cycles
PASS!
Running linear execution test with icache disabled...
100 iteration(s) took 1809637 cycles
PASS!

Modified implementation, using the RTL from this PR:

Running linear execution test with icache enabled...
100 iteration(s) took 440148 cycles
PASS!
Running linear execution test with icache disabled...
100 iteration(s) took 439208 cycles
PASS!

@alees24 alees24 requested a review from marnovandermaas July 25, 2025 14:56
@alees24 alees24 force-pushed the hyperram-rtl branch 3 times, most recently from 17e585f to 8933b68 on July 30, 2025 06:03

alees24 commented Jul 30, 2025

Updated with lint fixes, comment correction and connection of an omitted parameter; no functional change.

@marnovandermaas marnovandermaas left a comment

Thanks so much for putting this together. I've done a code review, and think this is working as expected. I've also checked CI and the HyperRAM tests are passing. Feel free to merge this once you're happy with it as you are in a better position to judge this than I am.

end

// Updating of validity bits.
always_ff @(posedge clk_i) begin

I assume this is fine not to have a reset because configured will be forced to zero in the always_ff above?

Yes, I've generally tried to follow the existing design principle of not introducing reset logic where it is not logically required.
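
For illustration, here is a minimal sketch of the pattern being discussed, with purely hypothetical names (configured, cfg_done_i, set_valid_i and set_mask_i are illustrative, not the PR's actual signals): only the qualifying flag carries a reset, and the validity bits need none because they are force-cleared, and in any case ignored, whilst that flag is low.

module validity_bits #(
  parameter int unsigned NumEntries = 4
) (
  input  logic                  clk_i,
  input  logic                  rst_ni,
  input  logic                  cfg_done_i,   // hypothetical: configuration complete
  input  logic                  set_valid_i,  // hypothetical: mark entries valid
  input  logic [NumEntries-1:0] set_mask_i    // hypothetical: which entries to mark
);
  logic                  configured;
  logic [NumEntries-1:0] valid_q;

  // The qualifying flag is the only state that carries a reset.
  always_ff @(posedge clk_i or negedge rst_ni) begin
    if (!rst_ni)         configured <= 1'b0;
    else if (cfg_done_i) configured <= 1'b1;
  end

  // Updating of validity bits: safe without a reset, since valid_q is
  // held at zero whilst configured is low.
  always_ff @(posedge clk_i) begin
    if (!configured)      valid_q <= '0;
    else if (set_valid_i) valid_q <= valid_q | set_mask_i;
  end
endmodule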

Comment on lines +197 to +200
// Reload the timer whilst a write is starting or being stored;
// count it down whilst a stored write is held.
if (!stalled) begin
  if (wr_start | wr_storing) wr_timer <= {TimerW{1'b1}};
  else if (wr_stored) wr_timer <= wr_timer - 'b1;
end

Nice, good to have this timer.

@alees24 alees24 merged commit 21ad093 into lowRISC:main Jul 30, 2025
3 checks passed