Replies: 5 comments 3 replies
I kinda agree. The caches are not useful for the internal IMEM/DMEM, i.e. when those are enabled. Once they are disabled, the caches make sense though. In my SMP experiment I actively use such an arrangement: multiple harts, each with its own I$ and D$, all accessing a single shared SDRAM through the external memory interface, with the internal IMEM and DMEM disabled. My two cents: why not do both? Keep the existing caches as a sort of level-1 cache (maybe rename them?), but also introduce caches at the places you mentioned as a sort of level-2 cache. We could also choose to implement write-back instead of write-through. This could fix the latency issue with internal IMEM/DMEM?
I have been working on a more generic cache module that (hopefully) provides better efficiency by implementing "write-back" and "write-allocate" strategies instead of the current d-cache's "write-through" strategy. Right now, the cache is still direct-mapped. A 2-way set-associative cache might be much better, especially when using external memory for both data and instructions. I tried to address this by implementing a "virtual splitting" of the new cache (VSPLIT). With this feature enabled, the lower half of the cache blocks is reserved for data only and the upper half for instructions only. For some workloads (like code that uses the …). The cache is still at an early stage and needs some more optimization, but synthesis and simulation already look fine: https://github.com/stnolting/neorv32/blob/generic_cache/rtl/core/neorv32_cache.vhd I made some simple benchmarks / tests:
Interestingly, the setup with the "external cache" (having the same size as the old I-cache + D-cache combined) is about 10% faster (for a very synthetic workload). It seems like the bus congestion caused by the CPU's instruction and data interfaces is not a big performance issue. Anyway, more tests are required. But I wonder if we should simply remove the CPU caches and just stick with the external bus cache? 🤔 Btw, the new cache module is truly generic, so it could also be used for something like a multi-core setup ... just saying 😉
Sorry to necropost, but related to this: I have a closed-source fork of your design tailored to Avalon/Qsys. It supports a compile-time-configured SPI boot copier from flash to RAM at address 0, or XIP from DSPI or QSPI flash. The Qsys hw.tcl file detects the target device and looks up the maximum size of the FPGA hardware image, then sets a generic on the CPU to the next 64kB block boundary beyond that image. My build tools then extract this generic from the .sopcinfo to write the linker script and supply it as a #define for user software.

To implement XIP, I basically made a new instruction and data cache connected to the CPU fetch unit and LSU, respectively. In XIP mode, when OCD is off, the CPU connects directly to the XIP icache and no other slaves are available. If and only if OCD is enabled are IO bus switches synthesised to allow the CPU access to either the XIP icache or OCD, so with no OCD you get minimum latency between XIP and CPU at the cost of no on-chip executable code. The XIP dcache attaches to the IO bus switch (my internal switch only has 4 entries because I ripped out every peripheral except the boot copier/XIP, CFS, CLINT and OCD; everything else is Avalon/Qsys). I allocated 16MB of the IO space to the XIP.

The icache and dcache send requests to either a DSPI or QSPI master which reads entire cache lines from flash (so the cache line length of both caches must be the same). The dcache is of course read-only. For optional write access to the SPI flash, I offer single-byte writes and control of the CS line from a separate IO register, though I've not really used that yet.
The CPU LSU can access other devices, but when OCD is disabled the fetch unit connects directly to XIP. Of course this restricts you to only running code from flash, but I dread how complex the linker-script generation would get with options for random Avalon slaves attached. In XIP mode the LSU connects first to an Avalon data master which looks for "local access" to addresses at 0xF0000000 and above and routes those to the IO bridge. I removed support for the atomic operations because, while Avalon could support them, I don't think they are worth implementing for my use case. The DSPI and QSPI masters weren't particularly hard to do; they only send one command to flash (0xBB or 0xEB respectively), then read back an entire cache line.
An illustration: Boot RAM is a 1kB 32-bit RAM. Its content is a C boot-copier program that is built after the main executable is finished. The size of the main executable is supplied to the boot copier during compilation, as well as the location where the build scripts will place the user executable in flash memory (which in turn is derived from the FPGA part number), as above. The boot copier contains its own simple 32-bit-wide SPI register which can be read and written as part of the bootloader memory map. Build tooling auto-injects the fresh bootloader into the .sof and builds .pof and .jic files for the flash image.

XIP mode, OCD = false: the fetch unit connects directly to the SPI cache, which can only fetch from SPI. The data master is not cached; local accesses to 0xF0000000 and above are redirected to the IO switch.

XIP mode, OCD = true: the fetch unit connects to an IO switch with only two memory regions enabled: SPI in 0xF0000000 to 0xF3FFFFFF and OCD from 0xFC000000 to 0xFFFFFFFF. The data master routes as before.

The XIP caches are at minimum a 1kB ring-buffer prefetch holding two blocks, or a direct-mapped cache derived from neorv32_cache but with a reduced tag size due to the 24-bit address range supported by the DSPI/QSPI interface.
In general, a cache can improve overall performance by hiding the latency of slow memories, exploiting temporal and spatial data locality. Right now we have two optional caches right next to the CPU: a read-only cache for instructions and a read/write cache for data.
The default processor-internal / FPGA-internal memories have an access latency of 1 cycle, so the CPU can access them as fast as possible. If the caches are enabled, the overall performance drops significantly, as these maximum-speed memories are cached, which requires additional time. Even worse, the data cache uses a write-through architecture, so any store operation bypasses the cache with an extra delay of one cycle.
A setup that only uses internal IMEM/DMEM in combination with the caches enabled has about 40% less CoreMark performance than a system without caches.
Right now there are just two modules that have more than 1 cycle of access latency: the execute-in-place module (XIP) and the external memory interface (WISHBONE). So accesses to these modules really can benefit from having the caches implemented.
Now the question: do we really need the caches? And do we need them where they are right now?
I've been thinking about relocating the caches. The read-only i-cache could be moved inside the XIP module while the read/write d-cache could be moved to the external bus interface. Thus, we would only cache accesses that for sure have a very large latency compared to the other internal modules.
Of course the caches are quite handy where they are right now. They reduce bus traffic, as the CPU's instruction and data ports have to share a single processor-wide bus. But actually, this is not a big deal, as the CPU is equipped with an instruction prefetch buffer that speculatively fetches instructions whenever the CPU does not perform a load/store operation. So I think it won't hurt (too much) not having the caches right in front of the CPU.
Anyway, I'm curious what you think. 😉